Chapter 5 Statistical checks

Statistical checks involve group properties such as the means of columns. These characteristics can be checked for whole columns or grouped by one or more categorical variables. It is also possible to use group-wise computed statistics in validation rules. For example if you want to compare individual values with a mean within a group.

For long-form data it is possible to compare aggregate values with underlying details. For example to test whether quarterly time series add up to annual totals. It is also possible to check properties of groups, for example whether in every household (a group of persons) there is exactly one head of household.

Data

In this Chapter we will use the SBS2000 dataset that comes with validate.

##      id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01  sc0      0.02    75       NA        NA      1130          NA
## 2 RET02  sc3      0.14     9     1607        NA      1607         131
## 3 RET03  sc3      0.14    NA     6886       -33      6919         324
##   total.costs profit vat
## 1       18915  20045  NA
## 2        1544     63  NA
## 3        6493    426  NA

We shall also use the samplonomy dataset that also comes with validate. See also 3.1.

##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 222000

5.1 Statistical and groupwise characteristics

Any R expression that ultimately is an equality or inequality check is interpreted as a validation rule by validate. This means that any statistical calculation can be input to a rule.

Here we check the mean profit and correlation coefficient between profit and turnover.

##   name items passes fails nNA error warning
## 1   V1     1      1     0   0 FALSE   FALSE
## 2   V2     1      1     0   0 FALSE   FALSE

There are a few helper functions to compute group-wise statistics, and to make comparing values with group aggregates possible.

For example, here we check whether each turnover is less than ten times the group-wise median.

##   name items passes fails nNA error warning
## 1   V1    60     53     3   4 FALSE   FALSE

Here, in the right-hand side of the rule the group-wise median of turnover is computed. The function do_by is very similar to functions such as tapply in base R. The difference is that do_by works on vectors only (not on data frames) and always repeats the values of fun so that the length of the output is equal to the length of the input.

##   size median
## 1  sc0    351
## 2  sc3   2891
## 3  sc3   2891
## 4  sc3   2891
## 5  sc3   2891
## 6  sc0    351

There are also some convenience functions, including sum_by, mean_by, min_by, and max_by.

5.2 Group properties

In this section, we group data by one or more categorical variables and check for each group whether a rule is satisfied. In particular we are going to check whether each household in a small dataset has a unique ‘head of household’.

We first create some data with household id (hhid) a person id (person) and that person’s role in the household (hhrole).

##   hhid person hhrole
## 1    1      1      h
## 2    1      2      h
## 3    2      3      m
## 4    1      4      m
## 5    2      5      h
## 6    2      6      m
## 7    3      7      m

With exists_one() we can check that there is exactly one person with the role "h" (head) in each household, by grouping on household id.

##   name items passes fails nNA error warning
## 1   V1     7      3     4   0 FALSE   FALSE
##                             expression
## 1 exists_one(hhrole == "h", by = hhid)

We can inspect the results by selecting the violating record groups.

##   hhid person hhrole
## 1    1      1      h
## 2    1      2      h
## 4    1      4      m
## 7    3      7      m

We see that household 1 has two heads of household, while household 3 has no head of household.

To test whether at least one head of household exists, one can use exists_any:

##   hhid person hhrole
## 7    3      7      m

In the following example we check whether there is exactly one region called Samplonia for each period and each measure in the samplonomy dataset.

The first argument of exists_one() is a rule that has to be checked in every group indicated by the by argument. The output is a logical vector with an element for each record in the dataset under scrutiny. If a group of data fails the test, each record in that group is indicated as wrong (FALSE).

##   name items passes fails nNA error warning
## 1   V1  1199   1199     0   0 FALSE   FALSE

Here, there are no groups that violate this assumption.

## [1] region  freq    period  measure value  
## <0 rows> (or 0-length row.names)

5.3 Code hierarchies and aggregation

Classifications and ontologies often have a hierarchical structure. A well-known example is the NACE classification of economic activities. In the NACE classification, the economy is divided into 10 basic types of activities such as ‘Agriculture’ or ‘Mining and Quarrying’, and each activity is again divided into subclasses, such as ‘Growing of rice’ and ‘Growing of Grapes’ under ‘Agriculture’. The subdividing can go on for several levels. For statistics that describe an economy according to the NACE classification, it is desirable that the statistics of subclasses add up to their parent classes. This is what the function ‘hierarchy’ does in ‘validate’.

The validate package comes with a version of the NACE classification (Revision 2, 2008) so we will use that as an example.

##    Order Level  Code Parent
## 1 398481     1     A       
## 2 398482     2    01      A
## 3 398483     3  01.1     01
## 4 398484     4 01.11   01.1
## 5 398485     4 01.12   01.1
## 6 398486     4 01.13   01.1

The second and third column contain the necessary information: they list the parent for each NACE code (where each parent is also a NACE code). To demonstrate how hierarchy() works, we first create some example data.

##    nace volume
## 1    01    100
## 2  01.1     70
## 3 01.11     30
## 4 01.12     40
## 5  01.2     25

We see that the volumes for subclasses "01.11" and "01.12" add up to "01.1" ( \(30+40=70\) ). However, the volumes for "01.1" and "01.2" do not add up to the volume for "01" (\(70+25\not=100\)). The hierarchy() function checks all these relations.

Before using hierarchy in the setting of a validator object, we can examine it directly.

##    nace volume check
## 1    01    100 FALSE
## 2  01.1     70 FALSE
## 3 01.11     30  TRUE
## 4 01.12     40  TRUE
## 5  01.2     25 FALSE

We see that hierarchy() returns a logical vector with one element for each record in the data. Each record that is involved in one or more aggregation checks that fail is labeled FALSE. Here, this concerns the records with labels "01", "01.1" and "01.2".

We will next look at a more complicated example, but first note the following. The hierarchy() function

  • can handle any statistical aggregate, sum() is just the default;
  • supports globbing and regular expressions in the child values;
  • has an adjustable tolerance value for comparing observed with computed aggregates;
  • has configurable behaviour for cases of missing data;
  • can be applied per-group, defined by one or more grouping variables (see next example).

See the help file ?hierarchy for specification and examples.

A more complicated example

Samplonia is divided in two districts, each of which is divided into several provinces. Let us define the hierarchical code list.

##     region    parent
## 1    Agria Samplonia
## 2 Induston Samplonia
## 3  Wheaton     Agria
## 4 Greenham     Agria
## 5  Smokely  Induston
## 6 Mudwater  Induston
## 7   Newbay  Induston
## 8  Crowdon  Induston

Recall the structure of the samplonomy dataset.

##   region freq period measure  value
## 1  Agria    A   2014     gdp 600000
## 2  Agria    A   2014  import 210000
## 3  Agria    A   2014  export 222000
## 4  Agria    A   2014 balance  12000
## 5  Agria    Q 2014Q1     gdp  60000
## 6  Agria    Q 2014Q1  import  21000

We will check whether regions sum to their parent regions, for each period and for each measure.

##   name items passes fails nNA error warning
## 1   V1  1199    237   954   8 FALSE    TRUE
##                                                                            expression
## 1 hierarchy(value, region, hierarchy = ref[["codelist"]], by = list(period, measure))

We see that some aggregates add up correctly, and some don’t. There is also a warning which we should investigate.

## $V1
## [1] "Parent 'Induston' occurs more than once (2 times) in group (2018Q2, export)"

If one of the groups contains a parent more than once it is not possible to check whether child values add up to the aggregate. For this reason the duplicated parent and all it’s children are marked FALSE. Indeed we find a duplicated record.

##       region freq period measure  value
## 870 Induston    Q 2018Q2  export 165900
## 871 Induston    Q 2018Q2  export 170000

Just to see if we can remove the warning, let us remove the duplicate and re-run the check.

##   name items passes fails nNA error warning
## 1   V1  1198    238   952   8 FALSE   FALSE

The hierarchy() function marks every record FALSE that is involved in any check. This may make it hard to figure out which check it failed. One can get more detailed information, by checking different parts of the hierarchy in separate rules.

##     name items passes fails nNA error warning
## 1 level0  1198   1194     4   0 FALSE   FALSE
## 2 level1  1198    240   950   8 FALSE   FALSE
##                                                              expression
## 1 hierarchy(value, region, ref[["level0"]], by = list(period, measure))
## 2 hierarchy(value, region, ref[["level1"]], by = list(period, measure))

We can now select records involved in violating the highest level rules separately.

##        region freq period measure   value
## 260  Induston    A   2015     gdp 1358000
## 340 Samplonia    A   2015     gdp 1940000
## 814     Agria    Q 2018Q3  export  118500
## 954 Samplonia    Q 2018Q3  export  284400

From this it appears that in 2015, the GDP for Agria is missing, and in 2018Q3 there is no value for the export of Induston.

5.4 General aggregates in long-form data

Checking aggregations in long-form format is more involved than for data in wide format (as in Section 4.2).

Here, we check in the samplonomy dataset that for each measure and each period, the subregional data adds up to the regional data.

The first argument of part_whole_relation() is the name of the variable containing the values. Here, the column value from the samplonomy dataset. The argument labels indicates the variable that labels parts and wholes. Next, we define the label value that indicates a total. Here, a record with region label "Samplonia" indicates a total. Under argument part we specify the labels that have to add up to Samplonia, here the provinces Agria and Induston. Note that there are more subregions in the dataset, for example the district of Wheaton (a subregion of Agria). Since we do not specify them, these are ignored. In the by argument we specify that the dataset must be split into measure and period prior to checking the regional aggregates.

The output is one boolean value per record. For each block, defined by values of measure and period either all values are TRUE, FALSE, or NA. The latter indicates that the aggregate could not be computed because one of the values is missing, or the computed aggregate could not be compared with the aggregate in the data because it is missing (either the whole record may be missing, or the value may be NA).

##   name items passes fails nNA error warning
## 1   V1  1199   1191     8   0 FALSE   FALSE

We can extract the truth values and then inspect the blocks with erroneous values using standard R functionality.

##        region freq period measure   value
## 260  Induston    A   2015     gdp 1358000
## 340 Samplonia    A   2015     gdp 1940000
## 810     Agria    Q 2018Q2  export   47400
## 814     Agria    Q 2018Q3  export  118500
## 870  Induston    Q 2018Q2  export  165900
## 871  Induston    Q 2018Q2  export  170000
## 950 Samplonia    Q 2018Q2  export  213300
## 954 Samplonia    Q 2018Q3  export  284400

Recall that the rule was executed per block defined by measure and period. Thus, the result indicates three errors: one in the block of records defined by measure=="gdp" and period=="2015", also in the blocks defined by measure=="export" and period==2018Q2 or period=="2018Q3".

First, it seems that the 2015 GDP of Agria is missing from the data set. This turns out indeed to be the case.

## [1] region  freq    period  measure value  
## <0 rows> (or 0-length row.names)

Second, it can be seen that for Induston, there are two export values for "2018Q2" while the export value for "2018Q3" is missing.

Notes

Specifying (group-wise) aggregates is a fairly detailed job in the case of long data. There are a few things to keep in mind when using this function.

  • The argument part is optional. If not specified, every record not matching with whole will be considered a detail that is to be used to compute the total. In the current example this was not possible because besides Agria and Induston, we have other subregions.
  • In the example we used literal values to specify the keys that define parts and wholes. It is possible to, recognize patterns, for example any years (4 digits) as a whole and a quarter as a part. See also the next example. Supported patterns include regular expressions (shown here) and globbing (see help file).
  • It is important that the variables listed in by (if any) uniquely specify a single aggregate. So here, for each measure and period, the label "Samplonia" should occur at most once (if it does not occur the result will be NA).
  • The default way to aggregate is to take the sum. You can specify other ways to aggregate by passing an aggregator argument. For example aggregator=mean.
  • By default, the aggregate in the data is compared with the computed aggregate up to a tolerance of \(10^{-8}\). This tolerance can be set using the tol argument. E.g. for integer data you may want to set tol=0.

5.5 Aggregates of time series in long format

We are going to check whether quarterly time series add up to the annual time series. This is more complicated because of two subtleties.

First there is not one fixed aggregate key, like "Samplonia". Rather, we have a key pattern. Each total is defined by a period label that consists of precisely four digits. So rather than recognizing a specific year we want to recognize that a key represents any year. This can be done using a regular expression of the form "^\\d{4}$", where the ^ indicates ‘start of string’, the \\d{4} indicates ‘four times a digit’ and $ indicates ‘end of string’.

Second, we wish to check annual totals against the sum over quarters for each region and each measure. However, a value-combination of measure and region does not single out a single value for year. For example, for the Induston export we have the following annual data.

##        region freq period measure  value
## 63   Induston    A   2014  export 518000
## 262  Induston    A   2015  export 525000
## 462  Induston    A   2016  export 532000
## 662  Induston    A   2017  export 560000
## 862  Induston    A   2018  export 553000
## 1062 Induston    A   2019  export 553000

So in fact, we need to do the check by year as well as by measure and region. Fortunately, in this case it is easy to derive a variable that indicates the year by selecting the first four characters from period.

We use rx("^\\d{4}") to tell part_whole_relation that this string must be interpreted as a regular expression. Here, we do not indicate part labels explicitly: by default any record not matching whole will be treated as a detail that must be used to compute the total.

## named list()
##   name items passes fails nNA error warning
## 1   V1  1199   1180     9  10 FALSE   FALSE

We now get 9 fails and 10 missing values. We can filter out records that have NA (lacking) results.

##       region freq period measure value
## 24   Crowdon    A   2014 balance  1600
## 28   Crowdon    Q 2014Q1 balance    NA
## 32   Crowdon    Q 2014Q2 balance   480
## 36   Crowdon    Q 2014Q3 balance   480
## 40   Crowdon    Q 2014Q4 balance   320
## 1181 Wheaton    A   2019  import 62000
## 1185 Wheaton    Q 2019Q1  import  6200
## 1189 Wheaton    Q 2019Q2  import    NA
## 1193 Wheaton    Q 2019Q3  import 31000
## 1197 Wheaton    Q 2019Q4  import 12400

There are two blocks where the annual total could not be compared with the sum over quarterly series. The balance value of Crowdon is missing for "2014Q1" as well as the import value of Wheaton for "2019Q2".

Similarly, we can inspect the failing blocks

##       region freq period measure  value
## 204    Agria    Q 2015Q1     gdp  58200
## 208    Agria    Q 2015Q2     gdp 116400
## 212    Agria    Q 2015Q3     gdp 291000
## 216    Agria    Q 2015Q4     gdp 116400
## 862 Induston    A   2018  export 553000
## 866 Induston    Q 2018Q1  export 110600
## 870 Induston    Q 2018Q2  export 165900
## 871 Induston    Q 2018Q2  export 170000
## 878 Induston    Q 2018Q4  export 110600