Chapter 3 Availability and uniqueness
In this Chapter it is demonstrated how to check whether records are available and/or complete with respect to a set of keys, and whether they are unique. The checks described here are typically useful for data in ‘long’ format, where one column holds a value and all the other columns identify the value.
- To test for missing values in individual variables, see also 2.2.
- To check whether records or parts thereof are completed, see 4.1.
Data
In this Chapter the samplonomy dataset is used that comes with the validate
package.
##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 2220003.1 Long data
The samplonomy data set is structured in ‘long form’. This means that each
record has a single value column, and one or more columns containing
character values that together describe what the value means.
##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 222000The data set contains several time series for multiple measures of the fictional country ‘Samplonia’. There are time series for several subregions of Samplonia.
Long format data is typically used as a transport format: it may be used to bulk-load data into SQL-based data base systems, or to transfer data between organisations in a unambiguous way.
Data in long form is in general much harder to check and process for statistical purpose than data in wide format, where each variable is stored in a separate column. The reason is that in long format relations between different variables are spread out across records, and those records are not necessarily ordered in any particular way prior to processing. This makes interpretation of validation fails intrinsically harder for long-form data than for wide-form data.
The samplonomy data set has a particularly nasty structure. It contains both
annual and quarterly time series for GDP, Import, Export and the Balance of
Trade (export less import). The period column therefore contains both quarterly
and annual labels. Furthermore, there are time series for the whole of
Samplonia (region Samplonia), for each of its two provinces (regions Agria and
Induston) and for each of its districts within Agria (Wheaton and Greenham) and
Induston (Smokely, Mudwater, Newbay and Oakdale).
Naturally, we expect that the key combinations are unique, that all time series are gapless and complete, that the Balance of trade equals Export less Import everywhere, that district values add up to the provinces’, and that province values add up to the total of Samplonia. Finally, the quarterly time series must add up to the annual values.
3.2 Uniqueness
The function is_unique() checks whether combinations of variables (usually
key variables) uniquely identify a record. It accepts any positive number of
variable names and returns FALSE for each record that is duplicated with
respect to the designated variables.
Here, we test whether region, period, and measure uniquely identify a value in
the samplonomy data set.
rule <- validator(is_unique(region, period, measure))
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]##   name items passes fails nNA error warning
## 1   V1  1199   1197     2   0 FALSE   FALSEThere are 2 fails. After extracting the individual
values for each record we can find the duplicated ones using a
convenience function from validate.
##       region freq period measure  value
## 870 Induston    Q 2018Q2  export 165900
## 871 Induston    Q 2018Q2  export 170000There are a two subtleties to keep in mind when interpreting uniqueness. The first has to do with missing values, and the second has to do with grouping. To start with the missing value problem, take a look at the following two-record data frame.
##   x    y
## 1 1    A
## 2 1 <NA>How should we judge whether these two records are unique? A tempting option is
to say the first record is unique, and to return NA for the second record
since it contains a missing value: R has the habit of returning NA from
calculations when an input value is NA. This choice is not invalid, but it
would have consequences for determining whether the first record is unique as
well. After all, it is possible to fill in a value in the missing field such
that the two records are duplicates. Therefore, if one would return NA for
the second record, the correct thing to do is to also return NA for the first
record. In R, the choice is made to treat NA as an actual value when checking
for duplicates or unique records (see ?duplicated from base R). To see this
inspect the following code and output.
## [1]  TRUE FALSE FALSEThe second subtlety has to do with grouping. You may want to test whether a column is unique, given one or more other variables. It is tempting to think that this requires a split-apply-combine approach where the dataset is first split according to one or more grouping variables, check for uniqueness of the column in each group, and then combine the results. However, such an approach is not necessary as you can simply add the grouping variables to the list of variables that together must be unique.
As an example, consider the output of the following two approaches.
# y is unique, given x. But not by itself
df <- data.frame(x=rep(letters[1:2],each=3), y=rep(1:3,2))
# the split-apply-combine approach
unsplit(tapply(df$y, df$x, is_unique), df$x)## [1] TRUE TRUE TRUE TRUE TRUE TRUE## [1] TRUE TRUE TRUE TRUE TRUE TRUE3.3 Availability of records
This section is on testing for availability of whole records. Testing for individual missing values (NA), is treated in 2.2.
We wish to ensure that for each region, and each variable, the periods 2014,
2015, \(\ldots\), 2019 are present. Using contains_at_least we can establish
this.
rule <- validator(
  contains_at_least( 
      keys = data.frame(period = as.character(2014:2019))
    , by=list(region, measure) )
)
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]##   name items passes fails nNA error warning
## 1   V1  1199   1170    29   0 FALSE   FALSEThe function contains_at_least splits the samplonomy dataset into blocks
according to values of region and measure. Next, it checks that in each
block the variable period contains at least the values 2014–2019.
The return value is a logical vector where the number of elements equals the
number of rows in the dataset under scrutiny. It is TRUE for each block
where all years are present, and FALSE for each block where one or more of the
years is missing. In this case 29 records are labeled as FALSE. These
can be found as follows.
##     region freq period measure  value
## 1    Agria    A   2014     gdp 600000
## 5    Agria    Q 2014Q1     gdp  60000
## 9    Agria    Q 2014Q2     gdp 120000
## 13   Agria    Q 2014Q3     gdp 300000
## 17   Agria    Q 2014Q4     gdp 120000
## 204  Agria    Q 2015Q1     gdp  58200Inspection of these records shows that in this block, for Agria the GDP
for "2015" is missing.
We can perform a stricter check, and test whether for each measure, all
quarters "2014Q1" \(\ldots\) "2019Q4" are present for each province (Agria
and Induston). First create a key set to test against.
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
  region = c("Agria", "Induston")
  , period = sapply(years, paste0, quarters))
head(keyset)##     region period
## 1    Agria 2014Q1
## 2 Induston 2014Q1
## 3    Agria 2014Q2
## 4 Induston 2014Q2
## 5    Agria 2014Q3
## 6 Induston 2014Q3This key set will be referenced in the rule, and passed to confront as reference
data.
rule <- validator(
          contains_at_least(keys=minimal_keys, by=measure) 
        )
out <- confront(samplonomy, rule
              , ref=list(minimal_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]##   name items passes fails nNA error warning
## 1   V1  1199    899   300   0 FALSE   FALSEThere are 300 fails. Inspecting the data set as above, we
see that for Induston, the export is missing in "2018Q3".
Finally, we do a strict test, to check that for each measure all periods and
all regions are reported. We also demand that there are no more and no less
records than for each individual measure. For this, the function
contains_exactly can be used.
First create a keyset.
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
  region  = c(
    "Agria" 
   ,"Crowdon"
   ,"Greenham"
   ,"Induston"
   ,"Mudwater"
   ,"Newbay"
   ,"Oakdale"
   ,"Samplonia"
   ,"Smokely"
   ,"Wheaton"
  )
 ,period = c(years, sapply(years, paste0, quarters))
)
head(keyset)##     region period
## 1    Agria   2014
## 2  Crowdon   2014
## 3 Greenham   2014
## 4 Induston   2014
## 5 Mudwater   2014
## 6   Newbay   2014The keyset is passed as reference data to the rule using confront.
rule <- validator(contains_exactly(all_keys, by=measure))
out  <- confront(samplonomy, rule
               , ref=list(all_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]##   name items passes fails nNA error warning
## 1   V1  1199    600   599   0 FALSE   FALSETo find where the errors reside, we first select the records with an error and then find the unique measures that occur in those records.
## [1] "gdp"    "export"So here, blocks containing GDP and Export have entire records missing.
3.4 Gaps in (time) series
For time series, or possibly other series it is desirable that there is a constant distance between each two elements of the series. The mathematical term for such a series is called a linear sequence. Here are some examples of linear series.
- The natural numbers: \(1,2,3,\ldots\)
- The even natural numbers \(2, 4, 6, \ldots\)
- Quarters periods: "2020Q1","2020Q2", \(\ldots\)
- Years (these are just natural numbers): \(2019, 2020, \ldots\)
The validate functions is_linear_sequence and in_linear_sequence check
whether a variable represents a linear series, possibly in blocks defined by
categorical variables. They can be used interactively or as a rule in a
validator object. We first demonstrate how these functions work, and then give
an example with the samplonomy dataset.
## [1] TRUE## [1] TRUE## [1] FALSEFor character data, the function is capable of recognizing certain formats for time periods.
## [1] TRUESee ?is_linear_sequence for a full specification of supported
date-time formats.
It is not necessary for data to be sorted in order to be recognized as a linear sequence.
## [1] TRUEOne can force a begin and/or end point for the sequence as well.
## [1] FALSEFinally it is possible to split a variable by one or more other columns and check whether each block represents a linear sequence.
series <- c(1,2,3,4,1,2,3,3)
blocks <- rep(c("a","b"), each = 4)
is_linear_sequence(series, by = blocks)## [1] FALSENow, this result is not very useful since now it is unknown which block
is not a linear series. This is where the function in_linear_sequence comes in.
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSEThere are some subtleties. A single element is also a linear sequence (of length 1).
## [1] TRUEThis can yield surprises in cases of blocks of length 1.
##   series blocks
## 1      1      a
## 2      2      a
## 3      3      a
## 4      4      a
## 5      1      b
## 6      2      b
## 7      3      b
## 8      3      c## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUEWe now have three linear series, namely
- For "a":1,2,3,4
- For "b":1,2,3
- For "c":3.
We can circumvent this by giving explicit bounds.
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSEWe now return to the samplonomy dataset. We wish to check that for
each measure and each area, the time series are linear series. Since there
are time series of different frequencies, we need to split the data by frequency
as well.
rule <- validator(
          in_linear_sequence(period
            , by = list(region, freq, measure))
        )
out  <- confront(samplonomy, rule)
summary(out)[1:7]##   name items passes fails nNA error warning
## 1   V1  1199   1170    29   0 FALSE   FALSEWe can find the blocks where records are not in sequence as follows (output not printed here for brevity).
Inspection of the selected records shows that for Agria the GDP for 2015 is missing, and that for Induston the Export for 2018Q3 is missing while Export for 2018Q2 occurs twice (but with different values)