Chapter 3 Availability and uniqueness
In this Chapter it is demonstrated how to check whether records are available and/or complete with respect to a set of keys, and whether they are unique. The checks described here are typically useful for data in ‘long’ format, where one column holds a value and all the other columns identify the value.
- To test for missing values in individual variables, see also 2.2.
- To check whether records or parts thereof are completed, see 4.1.
Data
In this Chapter the samplonomy
dataset is used that comes with the validate
package.
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
3.1 Long data
The samplonomy data set is structured in ‘long form’. This means that each
record has a single value
column, and one or more columns containing
character values that together describe what the value means.
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
The data set contains several time series for multiple measures of the fictional country ‘Samplonia’. There are time series for several subregions of Samplonia.
Long format data is typically used as a transport format: it may be used to bulk-load data into SQL-based data base systems, or to transfer data between organisations in a unambiguous way.
Data in long form is in general much harder to check and process for statistical purpose than data in wide format, where each variable is stored in a separate column. The reason is that in long format relations between different variables are spread out across records, and those records are not necessarily ordered in any particular way prior to processing. This makes interpretation of validation fails intrinsically harder for long-form data than for wide-form data.
The samplonomy
data set has a particularly nasty structure. It contains both
annual and quarterly time series for GDP, Import, Export and the Balance of
Trade (export less import). The period column therefore contains both quarterly
and annual labels. Furthermore, there are time series for the whole of
Samplonia (region Samplonia), for each of its two provinces (regions Agria and
Induston) and for each of its districts within Agria (Wheaton and Greenham) and
Induston (Smokely, Mudwater, Newbay and Oakdale).
Naturally, we expect that the key combinations are unique, that all time series are gapless and complete, that the Balance of trade equals Export less Import everywhere, that district values add up to the provinces’, and that province values add up to the total of Samplonia. Finally, the quarterly time series must add up to the annual values.
3.2 Uniqueness
The function is_unique()
checks whether combinations of variables (usually
key variables) uniquely identify a record. It accepts any positive number of
variable names and returns FALSE
for each record that is duplicated with
respect to the designated variariables.
Here, we test whether region, period, and measure uniquely identify a value in
the samplonomy
data set.
rule <- validator(is_unique(region, period, measure))
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1199 1197 2 0 FALSE FALSE
There are 2 fails. After extracting the individual
values for each record we can find the duplicated ones using a
convenience function from validate
.
## region freq period measure value
## 870 Induston Q 2018Q2 export 165900
## 871 Induston Q 2018Q2 export 170000
There are a two subtleties to keep in mind when interpreting uniqueness. The first has to do with missing values, and the second has to do with grouping. To start with the missing value problem, take a look at the following two-record data frame.
## x y
## 1 1 A
## 2 1 <NA>
How should we judge whether these two records are unique? A tempting option is
to say the the first record is unique, and to return NA
for the second record
since it contains a missing value: R has the habit of returning NA
from
calculations when an input value is NA
. This choice is not invalid, but it
would have consequences for determining whether the first record is unique as
well. After all, it is possible to fill in a value in the missing field such
that the two records are duplicates. Therefore, if one would return NA
for
the second record, the correct thing to do is to also return NA
for the first
record. In R, the choice is made to treat NA
as an actual value when checking
for duplicates or uniqe records (see ?duplicated
from base R). To see this
inspect the following code and output.
## [1] TRUE FALSE FALSE
The second subtlety has to do with grouping. You may want to test whether a column is unique, given one or more other variables. It is tempting to think that this requires a split-apply-combine approach where the dataset is first split according to one or more grouping variables, check for uniqueness of the column in each group, and then combine the results. However, such an approach is not necessary as you can simply add the grouping variables to the list of variables that together must be unique.
As an example, consider the output of the following two approaches.
# y is unique, given x. But not by itself
df <- data.frame(x=rep(letters[1:2],each=3), y=rep(1:3,2))
# the split-apply-combine approach
unsplit(tapply(df$y, df$x, is_unique), df$x)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
3.3 Availability of records
This section is on testing for availability of whole records. Testing for individual missing values (NA), is treated in 2.2.
We wish to ensure that for each region, and each variable, the periods 2014,
2015, \(\ldots\), 2019 are present. Using contains_at_least
we can establish
this.
rule <- validator(
contains_at_least(
keys = data.frame(period = as.character(2014:2019))
, by=list(region, measure) )
)
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1199 1170 29 0 FALSE FALSE
The function contains_at_least
splits the samplonomy
dataset into blocks
according to values of region
and measure
. Next, it checks that in each
block the variable period
contains at least the values 2014–2019.
The return value is a logical vector where the number of elements equals the
number of rows in the dataset under scrutiny. It is TRUE
for each block
where all years are present, and FALSE
for each block where one or more of the
years is missing. In this case 29 records are labeled as FALSE. These
can be found as follows.
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 5 Agria Q 2014Q1 gdp 60000
## 9 Agria Q 2014Q2 gdp 120000
## 13 Agria Q 2014Q3 gdp 300000
## 17 Agria Q 2014Q4 gdp 120000
## 204 Agria Q 2015Q1 gdp 58200
Inspection of these records shows that in this block, for Agria
the GDP
for "2015"
is missing.
We can perform a stricter check, and test whether for each measure
, all
quarters "2014Q1"
\(\ldots\) "2019Q4"
are present for each province (Agria
and Induston
). First create a key set to test against.
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
region = c("Agria", "Induston")
, period = sapply(years, paste0, quarters))
head(keyset)
## region period
## 1 Agria 2014Q1
## 2 Induston 2014Q1
## 3 Agria 2014Q2
## 4 Induston 2014Q2
## 5 Agria 2014Q3
## 6 Induston 2014Q3
This key set will be referenced in the rule, and passed to confront
as reference
data.
rule <- validator(
contains_at_least(keys=minimal_keys, by=measure)
)
out <- confront(samplonomy, rule
, ref=list(minimal_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1199 899 300 0 FALSE FALSE
There are 300 fails. Inspecting the data set as above, we
see that for Induston, the export
is missing in "2018Q3"
.
Finally, we do a strict test, to check that for each measure
all periods and
all regions are reported. We also demand that there are no more and no less
records than for each individual measure. For this, the function
contains_exactly
can be used.
First create a keyset.
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
region = c(
"Agria"
,"Crowdon"
,"Greenham"
,"Induston"
,"Mudwater"
,"Newbay"
,"Oakdale"
,"Samplonia"
,"Smokely"
,"Wheaton"
)
,period = c(years, sapply(years, paste0, quarters))
)
head(keyset)
## region period
## 1 Agria 2014
## 2 Crowdon 2014
## 3 Greenham 2014
## 4 Induston 2014
## 5 Mudwater 2014
## 6 Newbay 2014
The keyset is passed as reference data to the rule using confront
.
rule <- validator(contains_exactly(all_keys, by=measure))
out <- confront(samplonomy, rule
, ref=list(all_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1199 600 599 0 FALSE FALSE
To find where the errors reside, we first select the records with an error and then find the unique measures that occur in those records.
## [1] "gdp" "export"
So here, blocks containing GDP and Export have entire records missing.
3.4 Gaps in (time) series
For time series, or possibly other series it is desirable that there is a constant distance between each two elements of the series. The mathematical term for such a series is called a linear sequence. Here are some examples of linear series.
- The natural numbers: \(1,2,3,\ldots\)
- The even natural numbers \(2, 4, 6, \ldots\)
- Quarters periods:
"2020Q1"
,"2020Q2"
, \(\ldots\) - Years (these are just natural numbers): \(2019, 2020, \ldots\)
The validate
functions is_linear_sequence
and in_linear_sequence
check
whether a variable represents a linear series, possibly in blocks defined by
categorical variables. They can be used interactively or as a rule in a
validator object. We first demonstrate how these functions work, and then give
an example with the samplonomy
dataset.
## [1] TRUE
## [1] TRUE
## [1] FALSE
For character data, the function is capable of recognizing certain formats for time periods.
## [1] TRUE
See ?is_linear_sequence
for a full specification of supported
date-time formats.
It is not necessary for data to be sorted in order to be recognized as a linear sequence.
## [1] TRUE
One can force a begin and/or end point for the sequence as well.
## [1] FALSE
Finally it is possible to split a variable by one or more other columns and check whether each block represents a linear sequence.
series <- c(1,2,3,4,1,2,3,3)
blocks <- rep(c("a","b"), each = 4)
is_linear_sequence(series, by = blocks)
## [1] FALSE
Now, this result is not very useful since now it is unknown which block
is not a linear series. This is where the function in_linear_sequence
comes in.
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
There are some subtleties. A single element is also a linear sequence (of length 1).
## [1] TRUE
This can yield surprises in cases of blocks of length 1.
## series blocks
## 1 1 a
## 2 2 a
## 3 3 a
## 4 4 a
## 5 1 b
## 6 2 b
## 7 3 b
## 8 3 c
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
We now have three linear series, namely
- For
"a"
:1,2,3,4
- For
"b"
:1,2,3
- For
"c"
:3
.
We can circumvent this by giving explicit bounds.
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
We now return to the samplonomy
dataset. We wish to check that for
each measure and each area, the time series are linear series. Since there
are time series of different frequencies, we need to split the data by frequency
as well.
rule <- validator(
in_linear_sequence(period
, by = list(region, freq, measure))
)
out <- confront(samplonomy, rule)
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1199 1170 29 0 FALSE FALSE
We can find the blocks where records are not in sequence as follows (output not printed here for brevity).
Inspection of the selected records shows that for Agria the GDP for 2015 is missing, and that for Induston the Export for 2018Q3 is missing while Export for 2018Q2 occurs twice (but with different values)