Missing Data

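In Julia, missing data are represented by the value missing, which is of type Missing. The value, its type, and the core functions for working with it (ismissing, skipmissing, coalesce) are built into Base as of Julia 1.0, so they are available in DataFrames without any extra setup. The Missings.jl package adds some further convenience functions on top of these.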

using Missings
typeof(missing)
Missing

Missing values propagate in calculations, and so including missing values in your data can cause you to end up with missing answers.

using Statistics  # mean lives in the Statistics standard library
mean([1, 2, missing])
missing

Missing as a Data Type

In Julia, as in many other languages, there is a dedicated data type to represent missing values: Missing. It is a singleton type whose only value is missing, and, as with any other type, functions can define methods that control how values of type Missing behave. As noted above, missing values propagate, so it is important to understand how Missing works as a type in order to understand how we can work with these data in conjunction with DataFrames.
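To make this concrete, here is a minimal sketch of how Missing shows up in the element type of an array. The variable names xs, ys, and zs are illustrative only and are not used elsewhere in this section.

```julia
# `missing` is the only instance of the concrete type `Missing`.
@show missing isa Missing          # true
@show isconcretetype(Missing)      # true

# A vector that mixes integers and `missing` gets a Union element type.
xs = [1, 2, missing]
@show eltype(xs) == Union{Missing, Int}   # true

# A vector created without missing values has a plain element type...
ys = [1, 2, 3]
@show eltype(ys) == Int                   # true

# ...so it has to be widened before `missing` can be stored in it.
zs = convert(Vector{Union{Missing, Int}}, ys)
zs[1] = missing
@show ismissing(zs[1])                    # true
```

This is the same idea behind DataFrame columns that allow missing values: the column's element type is a Union of Missing and the type of the observed values.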

Propagation of Missing Values

It is important to understand how missing values propagate so we know when we need to change missing values in order to get a desired output. Numerical functions, arithmetic operations, and comparison operators all propagate missing values:

sin(missing), 1 / missing, missing == missing, 1 != missing
(missing, missing, missing, missing)

We can test whether or not a value is missing using the ismissing function.

ismissing(missing), ismissing(1)
(true, false)
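Because == propagates missing, it cannot be used on its own to get a true or false answer. As a small sketch, the built-in functions isequal and isless and the === operator always return a Bool, even when one of their arguments is missing. The name sorted is illustrative.

```julia
# `==` propagates, so the result is neither true nor false.
@show missing == missing            # missing

# `isequal` and `===` always return a Bool.
@show isequal(missing, missing)     # true
@show missing === missing           # true
@show isequal(1, missing)           # false

# `isless` treats missing as greater than any other value,
# so sorting places missing values last.
sorted = sort([3, missing, 1])
@show isequal(sorted, [1, 3, missing])   # true
```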

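Logical operators return missing only when the result cannot be determined without the missing data. This means that an expression like true & missing returns missing, because the answer depends on the unknown value, but true | missing returns true, because the answer is true whatever the unknown value is.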

# should return (missing, true, missing, false, missing, missing)
true & missing, true | missing, true ⊻ missing, false & missing, false | missing, false ⊻ missing
(missing, true, missing, false, missing, missing)
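One practical consequence is that a condition which evaluates to missing cannot be used in an if statement or a ternary expression. The sketch below shows the error and one possible workaround using coalesce. The helper is_large and the readings vector are hypothetical names made up for this illustration, and treating an unknown comparison as false is an assumption that will not suit every analysis.

```julia
# A missing condition is not a Bool, so branching on it throws a TypeError.
err = try
    missing > 3 ? "large" : "small"
catch e
    e
end
@show err isa TypeError          # true

# Hypothetical helper: treat an unknown comparison as false.
is_large(x) = coalesce(x > 3, false)

readings = [1, 5, missing, 7]    # illustrative data
flags = is_large.(readings)
@show flags == [false, true, false, true]   # true

# The helper can also be used as a predicate for filtering.
@show filter(is_large, readings) == [5, 7]  # true
```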

Skipping Missing Values

To skip missing values, you can create an iterator with skipmissing, which iterates over an array and yields only the non-missing values.

vals = [1, 2, missing, 4, 5, missing, 7, 8, missing, 10]
not_missing = skipmissing(vals)
Base.SkipMissing{Array{Union{Missing, Int64},1}}(Union{Missing, Int64}[1, 2, missing, 4, 5, missing, 7, 8, missing, 10])
for x in not_missing
    print(x, " ")
end
1 2 4 5 7 8 10 

The iterator can be turned into an array with the collect function, which steps through the iterator and gathers every value it yields into a new array.
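As a short sketch, collecting the iterator from the example above gives a plain integer array, and reductions such as sum and mean can also consume the iterator directly without collecting it first. The name present is illustrative.

```julia
using Statistics

vals = [1, 2, missing, 4, 5, missing, 7, 8, missing, 10]

# Gather the non-missing values into a new array.
present = collect(skipmissing(vals))
@show present == [1, 2, 4, 5, 7, 8, 10]   # true
@show eltype(present) == Int              # true: Missing is no longer in the element type

# Reductions accept the iterator directly, so no intermediate array is needed.
@show sum(skipmissing(vals))              # 37
@show mean(skipmissing(vals))
```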

Replacing Missing Values

To replace missing values, the coalesce function is useful: it returns the first of its arguments that is not missing. It is important to note that it applies to elements of arrays, not to arrays themselves, and so it must be used with dot notation if you want to broadcast it over an entire array.

coalesce.(vals, 0)
10-element Array{Int64,1}:
  1
  2
  0
  4
  5
  0
  7
  8
  0
 10
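The replacement value does not have to be a constant. The sketch below fills the gaps with the mean of the observed values, which is only one of many possible imputation choices and is shown here purely as an illustration. It also shows the built-in replace function as an alternative to broadcasting coalesce. The name filled is illustrative.

```julia
using Statistics

vals = [1, 2, missing, 4, 5, missing, 7, 8, missing, 10]

# coalesce returns its first non-missing argument.
@show coalesce(missing, missing, 3)    # 3

# Fill gaps with the mean of the observed values (a simple, illustrative choice).
# float.(vals) converts the integers to Float64 and leaves missing untouched.
filled = coalesce.(float.(vals), mean(skipmissing(vals)))
@show any(ismissing, filled)           # false

# replace with a pair gives the same result as broadcasting coalesce.
@show replace(vals, missing => 0) == coalesce.(vals, 0)   # true
```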

Missing Values in DataFrames

DataFrames.jl includes some support for dealing with missing values that is more difficult to implement using the standard functions alone. For example, it provides the dropmissing and dropmissing! functions, which drop rows with missing values in a copy and in place, respectively.

Recall our iris data set:

iris = CSV.read("data/iris.csv")
first(iris, 5)

5 rows × 5 columns

Row  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
     Float64⍰      Float64⍰     Float64⍰      Float64⍰     String⍰
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    4.6           3.1          1.5           0.2          setosa
5    5.0           3.6          1.4           0.2          setosa

If we were to set the first value of sepal length to missing, we could drop that row using dropmissing.

iris_missing = copy(iris)
iris_missing[1,Symbol("Sepal.Length")] = missing
first(iris_missing, 5)

5 rows × 5 columns

Row  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
     Float64⍰      Float64⍰     Float64⍰      Float64⍰     String⍰
1    missing       3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    4.6           3.1          1.5           0.2          setosa
5    5.0           3.6          1.4           0.2          setosa

first(dropmissing(iris_missing), 5)

5 rows × 5 columns

Row  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
     Float64⍰      Float64⍰     Float64⍰      Float64⍰     String⍰
1    4.9           3.0          1.4           0.2          setosa
2    4.7           3.2          1.3           0.2          setosa
3    4.6           3.1          1.5           0.2          setosa
4    5.0           3.6          1.4           0.2          setosa
5    5.4           3.9          1.7           0.4          setosa

As you can see, we lost the first row of the DataFrame because its sepal length was set to missing.

If you’re only concerned about specific columns, you can specify which columns to check for missing values; rows are then dropped only when one of those columns is missing.
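A minimal sketch of restricting the check to particular columns follows. It uses a small made-up DataFrame named df with columns a and b rather than the iris data, and it assumes a reasonably recent version of DataFrames.jl in which the columns are passed as the second argument.

```julia
using DataFrames

# Illustrative data: each column has one missing value in a different row.
df = DataFrame(a = [1, missing, 3], b = [missing, "x", "y"])

# With no column argument, a row is dropped if any column is missing.
@show nrow(dropmissing(df))             # 1

# Passing a column restricts the check to that column.
@show nrow(dropmissing(df, :a))         # 2

# Several columns can be given as a vector.
@show nrow(dropmissing(df, [:a, :b]))   # 1

# The in-place version modifies df itself.
dropmissing!(df, :b)
@show nrow(df)                          # 2
```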

Exercises

Exercise 1.4.1: