# Chapter 3 Availability and uniqueness

In this Chapter it is demonstrated how to check whether records are available and/or complete with respect to a set of keys, and whether they are unique. The checks described here are typically useful for data in ‘long’ format, where one column holds a value and all the other columns identify the value.

- To test for missing values in individual variables, see also 2.2.
- To check whether records or parts thereof are completed, see 4.1.

**Data**

In this Chapter the `samplonomy`

dataset is used that comes with the `validate`

package.

```
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
```

## 3.1 Long data

The samplonomy data set is structured in ‘long form’. This means that each
record has a single `value`

column, and one or more columns containing
character values that together describe what the value means.

```
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
```

The data set contains several time series for multiple measures of the fictional country ‘Samplonia’. There are time series for several subregions of Samplonia.

Long format data is typically used as a transport format: it may be used to bulk-load data into SQL-based data base systems, or to transfer data between organisations in a unambiguous way.

Data in long form is in general much harder to check and process for statistical purpose than data in wide format, where each variable is stored in a separate column. The reason is that in long format relations between different variables are spread out across records, and those records are not necessarily ordered in any particular way prior to processing. This makes interpretation of validation fails intrinsically harder for long-form data then for wide-form data.

The `samplonomy`

data set has a particularly nasty structure. It contains both
annual and quarterly time series for GDP, Import, Export and the Balance of
Trade (export less import). The period column therefore contains both quarterly
and annual labels. Furthermore, there are time series for the whole of
Samplonia (region Samplonia), for each of its two provinces (regions Agria and
Induston) and for each of its districts within Agria (Wheaton and Greenham) and
Induston (Smokeley, Mudwater, Newbay and Oakdale).

Naturally, we expect that the key combinations are unique, that all time series are gapless and complete, that the Balance of trade equals Export less Import everywhere, that district values add up to the provinces’, and that province values add up to the total of Samplonia. Finally, the quarterly time series must add up to the annual values.

## 3.2 Uniqueness

The function `is_unique()`

checks whether combinations of variables (usually
key variables) uniquely identify a record. It accepts any positive number of
variable names and returns `FALSE`

for each record that is duplicated with
respect to the designated variariables.

Here, we test whether region, period, and measure uniquely identify a value in
the `samplonomy`

data set.

```
rule <- validator(is_unique(region, period, measure))
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]
```

```
## name items passes fails nNA error warning
## 1 V1 1199 1197 2 0 FALSE FALSE
```

There are 2 fails. After extracting the individual
values for each record we can find the duplicated ones using a
convenience function from `validate`

.

```
## region freq period measure value
## 870 Induston Q 2018Q2 export 165900
## 871 Induston Q 2018Q2 export 170000
```

There are a two subtleties to keep in mind when interpreting uniqueness. The first has to do with missing values, and the second has to do with grouping. To start with the missing value problem, take a look at the following two-record data frame.

```
## x y
## 1 1 A
## 2 1 <NA>
```

How should we judge whether these two records are unique? A tempting option is
to say the the first record is unique, and to return `NA`

for the second record
since it contains a missing value: R has the habit of returning `NA`

from
calculations when an input value is `NA`

. This choice is not invalid, but it
would have consequences for determining whether the first record is unique as
well. After all, it is possible to fill in a value in the missing field such
that the two records are duplicates. Therefore, if one would return `NA`

for
the second record, the correct thing to do is to also return `NA`

for the first
record. In R, the choice is made to treat `NA`

as an actual value when checking
for duplicates or uniqe records (see `?duplicated`

from base R). To see this
inspect the following code and output.

`## [1] TRUE FALSE FALSE`

The second subtlety has to do with grouping. You may want to test whether a
column is unique, given one or more other variables. It is tempting to think
that this requires a split-apply-combine approach where the dataset is first
split according to one or more grouping variables, check for uniqueness of the
column in each group, and then combine the results. However, such an approach
is not necessary as you can simply add the grouping variables to the list of
variables that *together* must be unique.

As an example, consider the output of the following two approaches.

```
# y is unique, given x. But not by itself
df <- data.frame(x=rep(letters[1:2],each=3), y=rep(1:3,2))
# the split-apply-combine approach
unsplit(tapply(df$y, df$x, is_unique), df$x)
```

`## [1] TRUE TRUE TRUE TRUE TRUE TRUE`

`## [1] TRUE TRUE TRUE TRUE TRUE TRUE`

## 3.3 Availability of records

This section is on testing for availability of whole records. Testing for individual missing values (NA), is treated in 2.2.

We wish to ensure that for each region, and each variable, the periods 2014,
2015, \(\ldots\), 2019 are present. Using `contains_at_least`

we can establish
this.

```
rule <- validator(
contains_at_least(
keys = data.frame(period = as.character(2014:2019))
, by=list(region, measure) )
)
out <- confront(samplonomy, rule)
# showing 7 columns of output for readability
summary(out)[1:7]
```

```
## name items passes fails nNA error warning
## 1 V1 1199 1170 29 0 FALSE FALSE
```

The function `contains_at_least`

splits the `samplonomy`

dataset into blocks
according to values of `region`

and `measure`

. Next, it checks that in each
block the variable `period`

contains at least the values 2014–2019.

The return value is a logical vector where the number of elements equals the
number of rows in the dataset under scrutiny. It is `TRUE`

for each block
where all years are present, and `FALSE`

for each block where one of the
years is missing. In this case 29 records are labeled as FALSE. These
can be found as follows.

```
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 5 Agria Q 2014Q1 gdp 60000
## 9 Agria Q 2014Q2 gdp 120000
## 13 Agria Q 2014Q3 gdp 300000
## 17 Agria Q 2014Q4 gdp 120000
## 204 Agria Q 2015Q1 gdp 58200
```

Inspection of these records shows that in this block, for `Agria`

the GDP
for `"2015"`

is missing.

We can perform a stricter check, and test whether for each `measure`

, all
quarters `"2014Q1"`

\(\ldots\) `"2019Q4"`

are present for each province (`Agria`

and `Induston`

). First create a key set to test against.

```
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
region = c("Agria", "Induston")
, period = sapply(years, paste0, quarters))
head(keyset)
```

```
## region period
## 1 Agria 2014Q1
## 2 Induston 2014Q1
## 3 Agria 2014Q2
## 4 Induston 2014Q2
## 5 Agria 2014Q3
## 6 Induston 2014Q3
```

This key set will be referenced in the rule, and passed to `confront`

as reference
data.

```
rule <- validator(
contains_at_least(keys=minimal_keys, by=measure)
)
out <- confront(samplonomy, rule
, ref=list(minimal_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]
```

```
## name items passes fails nNA error warning
## 1 V1 1199 899 300 0 FALSE FALSE
```

There are 300 fails. Inspecting the data set as above, we
see that for Induston, the `export`

is missing in `"2018Q3"`

.

Finally, we do a strict test, to check that for each `measure`

all periods and
all regions are reported. We also demand that there are no more and no less
records than for each individual measure. For this, the function
`contains_exactly`

can be used.

First create a keyset.

```
years <- as.character(2014:2019)
quarters <- paste0("Q",1:4)
keyset <- expand.grid(
region = c(
"Agria"
,"Crowdon"
,"Greenham"
,"Induston"
,"Mudwater"
,"Newbay"
,"Oakdale"
,"Samplonia"
,"Smokeley"
,"Wheaton"
)
,period = c(years, sapply(years, paste0, quarters))
)
head(keyset)
```

```
## region period
## 1 Agria 2014
## 2 Crowdon 2014
## 3 Greenham 2014
## 4 Induston 2014
## 5 Mudwater 2014
## 6 Newbay 2014
```

The keyset is passed as reference data to the rule using `confront`

.

```
rule <- validator(contains_exactly(all_keys, by=measure))
out <- confront(samplonomy, rule
, ref=list(all_keys=keyset))
# showing 7 columns of output for readability
summary(out)[1:7]
```

```
## name items passes fails nNA error warning
## 1 V1 1199 0 1199 0 FALSE FALSE
```

To find where the errors reside, we first select the records with an error and then find the unique measures that occur in those records.

`## [1] "gdp" "import" "export" "balance"`

So here, blocks containing GDP and Export have entire records missing.

## 3.4 Gaps in (time) series

For time series, or possibly other series it is desirable that
there is a constant distance between each two elements of the series.
The mathematical term for such a series is called a *linear sequence*.
Here are some examples of linear series.

- The natural numbers: \(1,2,3,\ldots\)
- The even natural numbers \(2, 4, 6, \ldots\)
- Quarters periods:
`"2020Q1"`

,`"2020Q2"`

, \(\ldots\) - Years (these are just natural numbers): \(2019, 2020, \ldots\)

The `validate`

functions `is_linear_sequence`

and `in_linear_sequence`

check
whether a variable represents a linear series, possibly in blocks defined by
categorical variables. They can be used interactively or as a rule in a
validator object. We first demonstrate how these functions work, and then give
an example with the `samplonomy`

dataset.

`## [1] TRUE`

`## [1] TRUE`

`## [1] FALSE`

For character data, the function is capable of recognizing certain formats for time periods.

`## [1] TRUE`

See `?is_linear_sequence`

for a full specification of supported
date-time formats.

It is not necessary for data to be sorted in order to be recognized as a linear sequence.

`## [1] TRUE`

One can force a begin and/or end point for the sequence as well.

`## [1] FALSE`

Finally it is possible to split a variable by one or more other columns and check whether each block represents a linear sequence.

```
series <- c(1,2,3,4,1,2,3,3)
blocks <- rep(c("a","b"), each = 4)
is_linear_sequence(series, by = blocks)
```

`## [1] FALSE`

Now, this result is not very useful since now it is unknown which block
is not a linear series. This is where the function `in_linear_sequence`

comes in.

`## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE`

There are some subtleties. A single element is also a linear sequence (of length 1).

`## [1] TRUE`

This can yield surprises in cases of blocks of length 1.

```
## series blocks
## 1 1 a
## 2 2 a
## 3 3 a
## 4 4 a
## 5 1 b
## 6 2 b
## 7 3 b
## 8 3 c
```

`## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE`

We now have three linear series, namely

- For
`"a"`

:`1,2,3,4`

- For
`"b"`

:`1,2,3`

- For
`"c"`

:`3`

.

We can circumvent this by giving explicit bounds.

`## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE`

We now return to the `samplonomy`

dataset. We wish to check that for
each measure and each area, the time series are linear series. Since there
are time series of different frequencies, we need to split the data by frequency
as well.

```
rule <- validator(
in_linear_sequence(period
, by = list(region, freq, measure))
)
out <- confront(samplonomy, rule)
summary(out)[1:7]
```

```
## name items passes fails nNA error warning
## 1 V1 1199 1170 29 0 FALSE FALSE
```

We can find the blocks where records are not in sequence as follows (output not printed here for brevity).

Inspection of the selected records shows that for Agria the GDP for 2015 is missing, and that for Induston the Export for 2018Q3 is missing while Export for 2018Q2 occurs twice (but with different values)