Chapter 3 Availability and uniqueness

In this Chapter it is demonstrated how to check whether records are available and/or complete with respect to a set of keys, and whether they are unique. The checks described here are typically useful for data in ‘long’ format, where one column holds a value and all the other columns identify the value.

To test for missing values in individual variables, see also 2.2.
To check whether records or parts thereof are completed, see 4.1.

Data

In this Chapter the samplonomy dataset is used that comes with the validate package.

library(validate)
data(samplonomy)
head(samplonomy, 3)

##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 222000

3.1 Long data

The samplonomy data set is structured in ‘long form’. This means that each record has a single value column, and one or more columns containing character values that together describe what the value means.

head(samplonomy,3)

##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 222000

The data set contains several time series for multiple measures of the fictional country ‘Samplonia’. There are time series for several subregions of Samplonia.

Long format data is typically used as a transport format: it may be used to bulk-load data into SQL-based data base systems, or to transfer data between organisations in a unambiguous way.

Data in long form is in general much harder to check and process for statistical purpose than data in wide format, where each variable is stored in a separate column. The reason is that in long format relations between different variables are spread out across records, and those records are not necessarily ordered in any particular way prior to processing. This makes interpretation of validation fails intrinsically harder for long-form data than for wide-form data.

The samplonomy data set has a particularly nasty structure. It contains both annual and quarterly time series for GDP, Import, Export and the Balance of Trade (export less import). The period column therefore contains both quarterly and annual labels. Furthermore, there are time series for the whole of Samplonia (region Samplonia), for each of its two provinces (regions Agria and Induston) and for each of its districts within Agria (Wheaton and Greenham) and Induston (Smokely, Mudwater, Newbay and Oakdale).

Naturally, we expect that the key combinations are unique, that all time series are gapless and complete, that the Balance of trade equals Export less Import everywhere, that district values add up to the provinces’, and that province values add up to the total of Samplonia. Finally, the quarterly time series must add up to the annual values.

3.2 Uniqueness

The function is_unique() checks whether combinations of variables (usually key variables) uniquely identify a record. It accepts any positive number of variable names and returns FALSE for each record that is duplicated with respect to the designated variables.

Here, we test whether region, period, and measure uniquely identify a value in the samplonomy data set.

rule <- validator(is_unique(region, period, measure))
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]

##   name items passes fails nNA error warning
## 1   V1  1199   1197     2   0 FALSE   FALSE

There are 2 fails. After extracting the individual values for each record we can find the duplicated ones using a convenience function from validate.

violating(samplonomy, out)

##       region freq period measure  value
## 870 Induston    Q 2018Q2  export 165900
## 871 Induston    Q 2018Q2  export 170000

There are a two subtleties to keep in mind when interpreting uniqueness. The first has to do with missing values, and the second has to do with grouping. To start with the missing value problem, take a look at the following two-record data frame.

df <- data.frame(x = c(1,1), y = c("A",NA))
df

##   x    y
## 1 1    A
## 2 1 <NA>

How should we judge whether these two records are unique? A tempting option is to say the first record is unique, and to return NA for the second record since it contains a missing value: R has the habit of returning NA from calculations when an input value is NA. This choice is not invalid, but it would have consequences for determining whether the first record is unique as well. After all, it is possible to fill in a value in the missing field such that the two records are duplicates. Therefore, if one would return NA for the second record, the correct thing to do is to also return NA for the first record. In R, the choice is made to treat NA as an actual value when checking for duplicates or unique records (see ?duplicated from base R). To see this inspect the following code and output.

df <- data.frame(x=rep(1,3), y = c("A", NA, NA))
is_unique(df$x, df$y)

## [1]  TRUE FALSE FALSE

The second subtlety has to do with grouping. You may want to test whether a column is unique, given one or more other variables. It is tempting to think that this requires a split-apply-combine approach where the dataset is first split according to one or more grouping variables, check for uniqueness of the column in each group, and then combine the results. However, such an approach is not necessary as you can simply add the grouping variables to the list of variables that together must be unique.

As an example, consider the output of the following two approaches.

# y is unique, given x. But not by itself
df <- data.frame(x=rep(letters[1:2],each=3), y=rep(1:3,2))

# the split-apply-combine approach
unsplit(tapply(df$y, df$x, is_unique), df$x)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

# the combined approach
is_unique(df$x, df$y)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE

3.3 Availability of records

This section is on testing for availability of whole records. Testing for individual missing values (NA), is treated in 2.2.

We wish to ensure that for each region, and each variable, the periods 2014, 2015, \(\ldots\), 2019 are present. Using contains_at_least we can establish this.

rule <- validator(
  contains_at_least( 
      keys = data.frame(period = as.character(2014:2019))
    , by=list(region, measure) )
)
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]

##   name items passes fails nNA error warning
## 1   V1  1199   1170    29   0 FALSE   FALSE

The function contains_at_least splits the samplonomy dataset into blocks according to values of region and measure. Next, it checks that in each block the variable period contains at least the values 2014–2019.

The return value is a logical vector where the number of elements equals the number of rows in the dataset under scrutiny. It is TRUE for each block where all years are present, and FALSE for each block where one or more of the years is missing. In this case 29 records are labeled as FALSE. These can be found as follows.

head(violating(samplonomy, out))

##     region freq period measure  value
## 1    Agria    A   2014     gdp 600000
## 5    Agria    Q 2014Q1     gdp  60000
## 9    Agria    Q 2014Q2     gdp 120000
## 13   Agria    Q 2014Q3     gdp 300000
## 17   Agria    Q 2014Q4     gdp 120000
## 204  Agria    Q 2015Q1     gdp  58200

Inspection of these records shows that in this block, for Agria the GDP for "2015" is missing.

We can perform a stricter check, and test whether for each measure, all quarters "2014Q1" \(\ldots\) "2019Q4" are present for each province (Agria and Induston). First create a key set to test against.

years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)

keyset <- expand.grid(
  region = c("Agria", "Induston")
  , period = sapply(years, paste0, quarters))

head(keyset)

##     region period
## 1    Agria 2014Q1
## 2 Induston 2014Q1
## 3    Agria 2014Q2
## 4 Induston 2014Q2
## 5    Agria 2014Q3
## 6 Induston 2014Q3

This key set will be referenced in the rule, and passed to confront as reference data.

rule <- validator(
          contains_at_least(keys=minimal_keys, by=measure) 
        )
out <- confront(samplonomy, rule
              , ref=list(minimal_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]

##   name items passes fails nNA error warning
## 1   V1  1199    899   300   0 FALSE   FALSE

There are 300 fails. Inspecting the data set as above, we see that for Induston, the export is missing in "2018Q3".

Finally, we do a strict test, to check that for each measure all periods and all regions are reported. We also demand that there are no more and no less records than for each individual measure. For this, the function contains_exactly can be used.

First create a keyset.

years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)

keyset <- expand.grid(
  region  = c(
    "Agria" 
   ,"Crowdon"
   ,"Greenham"
   ,"Induston"
   ,"Mudwater"
   ,"Newbay"
   ,"Oakdale"
   ,"Samplonia"
   ,"Smokely"
   ,"Wheaton"
  )
 ,period = c(years, sapply(years, paste0, quarters))
)
head(keyset)

##     region period
## 1    Agria   2014
## 2  Crowdon   2014
## 3 Greenham   2014
## 4 Induston   2014
## 5 Mudwater   2014
## 6   Newbay   2014

The keyset is passed as reference data to the rule using confront.

rule <- validator(contains_exactly(all_keys, by=measure))
out  <- confront(samplonomy, rule
               , ref=list(all_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]

##   name items passes fails nNA error warning
## 1   V1  1199    600   599   0 FALSE   FALSE

To find where the errors reside, we first select the records with an error and then find the unique measures that occur in those records.

erroneous_records <- violating(samplonomy, out)
unique(erroneous_records$measure)

## [1] "gdp"    "export"

So here, blocks containing GDP and Export have entire records missing.

3.4 Gaps in (time) series

For time series, or possibly other series it is desirable that there is a constant distance between each two elements of the series. The mathematical term for such a series is called a linear sequence. Here are some examples of linear series.

The natural numbers: \(1,2,3,\ldots\)
The even natural numbers \(2, 4, 6, \ldots\)
Quarters periods: "2020Q1", "2020Q2", \(\ldots\)
Years (these are just natural numbers): \(2019, 2020, \ldots\)

The validate functions is_linear_sequence and in_linear_sequence check whether a variable represents a linear series, possibly in blocks defined by categorical variables. They can be used interactively or as a rule in a validator object. We first demonstrate how these functions work, and then give an example with the samplonomy dataset.

is_linear_sequence(c(1,2,3,4))

## [1] TRUE

is_linear_sequence(c(8,6,4,2))

## [1] TRUE

is_linear_sequence(c(2,4,8,16))

## [1] FALSE

For character data, the function is capable of recognizing certain formats for time periods.

is_linear_sequence(c("2020Q1","2020Q2","2020Q3","2020Q4"))

## [1] TRUE

See ?is_linear_sequence for a full specification of supported date-time formats.

It is not necessary for data to be sorted in order to be recognized as a linear sequence.

is_linear_sequence(c("2020Q4","2020Q2","2020Q3","2020Q1"))

## [1] TRUE

One can force a begin and/or end point for the sequence as well.

is_linear_sequence(c("2020Q4","2020Q2","2020Q3","2020Q1")
                 , begin = "2020Q2")

## [1] FALSE

Finally it is possible to split a variable by one or more other columns and check whether each block represents a linear sequence.

series <- c(1,2,3,4,1,2,3,3)
blocks <- rep(c("a","b"), each = 4)
is_linear_sequence(series, by = blocks)

## [1] FALSE

Now, this result is not very useful since now it is unknown which block is not a linear series. This is where the function in_linear_sequence comes in.

in_linear_sequence(series, by = blocks)

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

There are some subtleties. A single element is also a linear sequence (of length 1).

is_linear_sequence(5)

## [1] TRUE

This can yield surprises in cases of blocks of length 1.

blocks[8] <- "c"
data.frame(series = series, blocks = blocks)

##   series blocks
## 1      1      a
## 2      2      a
## 3      3      a
## 4      4      a
## 5      1      b
## 6      2      b
## 7      3      b
## 8      3      c

in_linear_sequence(series, blocks)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

We now have three linear series, namely

For "a": 1,2,3,4
For "b": 1,2,3
For "c": 3.

We can circumvent this by giving explicit bounds.

in_linear_sequence(series, blocks, begin = 1, end = 4)

## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE

We now return to the samplonomy dataset. We wish to check that for each measure and each area, the time series are linear series. Since there are time series of different frequencies, we need to split the data by frequency as well.

rule <- validator(
          in_linear_sequence(period
            , by = list(region, freq, measure))
        )
out  <- confront(samplonomy, rule)
summary(out)[1:7]

##   name items passes fails nNA error warning
## 1   V1  1199   1170    29   0 FALSE   FALSE

We can find the blocks where records are not in sequence as follows (output not printed here for brevity).

violating(samplonomy, out)

Inspection of the selected records shows that for Agria the GDP for 2015 is missing, and that for Induston the Export for 2018Q3 is missing while Export for 2018Q2 occurs twice (but with different values)