Chapter 5 Statistical checks
Statistical checks involve group properties such as the means of columns. These characteristics can be checked for whole columns or grouped by one or more categorical variables. It is also possible to use group-wise computed statistics in validation rules. For example if you want to compare individual values with a mean within a group.
For long-form data it is possible to compare aggregate values with underlying details. For example to test whether quarterly time series add up to annual totals. It is also possible to check properties of groups, for example whether in every household (a group of persons) there is exactly one head of household.
Data
In this Chapter we will use the SBS2000
dataset that comes with validate
.
## id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01 sc0 0.02 75 NA NA 1130 NA
## 2 RET02 sc3 0.14 9 1607 NA 1607 131
## 3 RET03 sc3 0.14 NA 6886 -33 6919 324
## total.costs profit vat
## 1 18915 20045 NA
## 2 1544 63 NA
## 3 6493 426 NA
We shall also use the samplonomy
dataset that also comes with validate
. See also
3.1.
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
5.1 Statistical and groupwise characteristics
Any R expression that ultimately is an equality or inequality check is interpreted as a validation rule by validate. This means that any statistical calculation can be input to a rule.
Here we check the mean profit and correlation coefficient between profit and turnover.
rule <- validator(
mean(profit, na.rm=TRUE) >= 1
, cor(turnover, staff, use="pairwise.complete.obs") > 0
)
out <- confront(SBS2000, rule)
# suppress some columns for brevity
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1 1 0 0 FALSE FALSE
## 2 V2 1 1 0 0 FALSE FALSE
There are a few helper functions to compute group-wise statistics, and to make comparing values with group aggregates possible.
For example, here we check whether each turnover is less than ten times the group-wise median.
rule <- validator(
turnover <= 10*do_by(turnover, by=size, fun=median, na.rm=TRUE)
)
out <- confront(SBS2000, rule)
# suppress some columns for brevity
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 60 53 3 4 FALSE FALSE
Here, in the right-hand side of the rule the group-wise median of turnover is
computed. The function do_by
is very similar to functions such as tapply
in base R. The difference is that do_by
works on vectors only (not on data
frames) and always repeats the values of fun
so that the length of the output is
equal to the length of the input.
medians <- with(SBS2000, do_by(turnover, by=size, fun=median, na.rm=TRUE))
head(data.frame(size = SBS2000$size, median=medians))
## size median
## 1 sc0 351
## 2 sc3 2891
## 3 sc3 2891
## 4 sc3 2891
## 5 sc3 2891
## 6 sc0 351
There are also some convenience functions, including sum_by
, mean_by
, min_by
, and max_by
.
5.2 Group properties
In this section, we group data by one or more categorical variables and check for each group whether a rule is satisfied. In particular we are going to check whether each household in a small dataset has a unique ‘head of household’.
We first create some data with household id (hhid
) a person id (person
) and
that person’s role in the household (hhrole
).
d <- data.frame(
hhid = c(1, 1, 2, 1, 2, 2, 3 )
, person = c(1, 2, 3, 4, 5, 6, 7 )
, hhrole = c("h","h","m","m","h","m","m")
)
d
## hhid person hhrole
## 1 1 1 h
## 2 1 2 h
## 3 2 3 m
## 4 1 4 m
## 5 2 5 h
## 6 2 6 m
## 7 3 7 m
With exists_one()
we can check that there is exactly one person
with the role "h"
(head) in each household, by grouping on household id.
rule <- validator(exists_one(hhrole == "h", by=hhid))
out <- confront(d, rule)
# suppress some columns for brevity
summary(out)
## name items passes fails nNA error warning
## 1 V1 7 3 4 0 FALSE FALSE
## expression
## 1 exists_one(hhrole == "h", by = hhid)
We can inspect the results by selecting the violating record groups.
## hhid person hhrole
## 1 1 1 h
## 2 1 2 h
## 4 1 4 m
## 7 3 7 m
We see that household 1 has two heads of household, while household 3 has no head of household.
To test whether at least one head of household exists, one can use
exists_any
:
## hhid person hhrole
## 7 3 7 m
In the following example we check whether there is exactly one region called Samplonia
for each period and each measure in the samplonomy
dataset.
The first argument of exists_one()
is a rule that has to be checked in every group
indicated by the by
argument. The output is a logical vector with an element for
each record in the dataset under scrutiny. If a group of data fails the test, each record
in that group is indicated as wrong (FALSE
).
## name items passes fails nNA error warning
## 1 V1 1199 1199 0 0 FALSE FALSE
Here, there are no groups that violate this assumption.
## [1] region freq period measure value
## <0 rows> (or 0-length row.names)
5.3 Code hierarchies and aggregation
Classifications and ontologies often have a hierarchical structure. A well-known example is the NACE classification of economic activities. In the NACE classification, the economy is divided into 10 basic types of activities such as ‘Agriculture’ or ‘Mining and Quarrying’, and each activity is again divided into subclasses, such as ‘Growing of rice’ and ‘Growing of Grapes’ under ‘Agriculture’. The subdividing can go on for several levels. For statistics that describe an economy according to the NACE classification, it is desirable that the statistics of subclasses add up to their parent classes. This is what the function ‘hierarchy’ does in ‘validate’.
The validate
package comes with a version of the NACE classification
(Revision 2, 2008) so we will use that as an example.
## Order Level Code Parent
## 1 398481 1 A
## 2 398482 2 01 A
## 3 398483 3 01.1 01
## 4 398484 4 01.11 01.1
## 5 398485 4 01.12 01.1
## 6 398486 4 01.13 01.1
The second and third column contain the necessary information: they list the
parent for each NACE code (where each parent is also a NACE code). To demonstrate
how hierarchy()
works, we first create some example data.
dat <- data.frame(
nace = c("01","01.1","01.11","01.12", "01.2")
, volume = c(100 ,70 , 30 ,40 , 25 )
)
dat
## nace volume
## 1 01 100
## 2 01.1 70
## 3 01.11 30
## 4 01.12 40
## 5 01.2 25
We see that the volumes for subclasses "01.11"
and "01.12"
add up to
"01.1"
( \(30+40=70\) ). However, the volumes for "01.1"
and "01.2"
do not
add up to the volume for "01"
(\(70+25\not=100\)). The hierarchy()
function
checks all these relations.
Before using hierarchy
in the setting of a validator
object, we can examine
it directly.
## nace volume check
## 1 01 100 FALSE
## 2 01.1 70 FALSE
## 3 01.11 30 TRUE
## 4 01.12 40 TRUE
## 5 01.2 25 FALSE
We see that hierarchy()
returns a logical
vector with one element for each
record in the data. Each record that is involved in one or more aggregation
checks that fail is labeled FALSE
. Here, this concerns the records with
labels "01"
, "01.1"
and "01.2"
.
We will next look at a more complicated example, but first note the following.
The hierarchy()
function
- can handle any statistical aggregate,
sum()
is just the default; - supports globbing and regular expressions in the child values;
- has an adjustable tolerance value for comparing observed with computed aggregates;
- has configurable behaviour for cases of missing data;
- can be applied per-group, defined by one or more grouping variables (see next example).
See the help file ?hierarchy
for specification and examples.
A more complicated example
Samplonia is divided in two districts, each of which is divided into several provinces. Let us define the hierarchical code list.
samplonia <- data.frame(
region = c("Agria", "Induston"
, "Wheaton", "Greenham"
, "Smokely", "Mudwater", "Newbay", "Crowdon")
, parent = c(rep("Samplonia",2), rep("Agria",2), rep("Induston",4))
)
samplonia
## region parent
## 1 Agria Samplonia
## 2 Induston Samplonia
## 3 Wheaton Agria
## 4 Greenham Agria
## 5 Smokely Induston
## 6 Mudwater Induston
## 7 Newbay Induston
## 8 Crowdon Induston
Recall the structure of the samplonomy
dataset.
## region freq period measure value
## 1 Agria A 2014 gdp 600000
## 2 Agria A 2014 import 210000
## 3 Agria A 2014 export 222000
## 4 Agria A 2014 balance 12000
## 5 Agria Q 2014Q1 gdp 60000
## 6 Agria Q 2014Q1 import 21000
We will check whether regions sum to their parent regions, for each period and for each measure.
rule <- validator(
hierarchy(value, region, hierarchy=ref$codelist, by=list(period, measure))
)
out <- confront(samplonomy, rule, ref=list(codelist=samplonia))
summary(out)
## name items passes fails nNA error warning
## 1 V1 1199 237 954 8 FALSE TRUE
## expression
## 1 hierarchy(value, region, hierarchy = ref[["codelist"]], by = list(period, measure))
We see that some aggregates add up correctly, and some don’t. There is also a warning which we should investigate.
## $V1
## [1] "Parent 'Induston' occurs more than once (2 times) in group (2018Q2, export)"
If one of the groups contains a parent more than once it is not possible
to check whether child values add up to the aggregate. For this reason
the duplicated parent and all it’s children are marked FALSE
. Indeed we
find a duplicated record.
## region freq period measure value
## 870 Induston Q 2018Q2 export 165900
## 871 Induston Q 2018Q2 export 170000
Just to see if we can remove the warning, let us remove the duplicate and re-run the check.
i <- !duplicated(samplonomy[c("region","period","measure")])
samplonomy2 <- samplonomy[i, ]
out <- confront(samplonomy2, rule, ref=list(codelist=samplonia))
# suppress some columns for brevity
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 1198 238 952 8 FALSE FALSE
The hierarchy()
function marks every record FALSE
that is involved
in any check. This may make it hard to figure out which check it failed.
One can get more detailed information, by checking different parts
of the hierarchy in separate rules.
rules <- validator(
level0 = hierarchy(value, region, ref$level0, by=list(period, measure))
, level1 = hierarchy(value, region, ref$level1, by=list(period, measure))
)
out <- confront(samplonomy2, rules
, ref=list(level0=samplonia[1:2,], level1=samplonia[3:8,])
)
summary(out)
## name items passes fails nNA error warning
## 1 level0 1198 1194 4 0 FALSE FALSE
## 2 level1 1198 240 950 8 FALSE FALSE
## expression
## 1 hierarchy(value, region, ref[["level0"]], by = list(period, measure))
## 2 hierarchy(value, region, ref[["level1"]], by = list(period, measure))
We can now select records involved in violating the highest level rules separately.
## region freq period measure value
## 260 Induston A 2015 gdp 1358000
## 340 Samplonia A 2015 gdp 1940000
## 814 Agria Q 2018Q3 export 118500
## 954 Samplonia Q 2018Q3 export 284400
From this it appears that in 2015, the GDP for Agria is missing, and in 2018Q3 there is no value for the export of Induston.
5.4 General aggregates in long-form data
Checking aggregations in long-form format is more involved than for data in wide format (as in Section 4.2).
Here, we check in the samplonomy
dataset that for each measure and each
period, the subregional data adds up to the regional data.
rules <- validator(
part_whole_relation(value
, labels=region
, whole="Samplonia"
, part =c("Agria","Induston")
, by=list(measure, period)
)
)
The first argument of part_whole_relation()
is the name of the variable
containing the values. Here, the column value
from the samplonomy dataset.
The argument labels
indicates the variable that labels parts and wholes.
Next, we define the label value that indicates a total. Here, a record with
region label "Samplonia"
indicates a total. Under argument part
we specify
the labels that have to add up to Samplonia, here the provinces Agria and
Induston. Note that there are more subregions in the dataset, for example the
district of Wheaton (a subregion of Agria). Since we do not specify them, these
are ignored. In the by
argument we specify that the dataset must be split
into measure and period prior to checking the regional aggregates.
The output is one boolean value per record. For each block, defined by values
of measure
and period
either all values are TRUE
, FALSE
, or NA
. The
latter indicates that the aggregate could not be computed because one of the
values is missing, or the computed aggregate could not be compared with the
aggregate in the data because it is missing (either the whole record may be
missing, or the value may be NA
).
## name items passes fails nNA error warning
## 1 V1 1199 1191 8 0 FALSE FALSE
We can extract the truth values and then inspect the blocks with erroneous values using standard R functionality.
## region freq period measure value
## 260 Induston A 2015 gdp 1358000
## 340 Samplonia A 2015 gdp 1940000
## 810 Agria Q 2018Q2 export 47400
## 814 Agria Q 2018Q3 export 118500
## 870 Induston Q 2018Q2 export 165900
## 871 Induston Q 2018Q2 export 170000
## 950 Samplonia Q 2018Q2 export 213300
## 954 Samplonia Q 2018Q3 export 284400
Recall that the rule was executed per block defined by measure
and period
.
Thus, the result indicates three errors: one in the block of records defined
by measure=="gdp"
and period=="2015"
, also in the blocks defined by
measure=="export"
and period==2018Q2
or period=="2018Q3"
.
First, it seems that the 2015 GDP of Agria is missing from the data set. This turns out indeed to be the case.
## [1] region freq period measure value
## <0 rows> (or 0-length row.names)
Second, it can be seen that for Induston, there are two export values for
"2018Q2"
while the export value for "2018Q3"
is missing.
Notes
Specifying (group-wise) aggregates is a fairly detailed job in the case of long data. There are a few things to keep in mind when using this function.
- The argument
part
is optional. If not specified, every record not matching withwhole
will be considered a detail that is to be used to compute the total. In the current example this was not possible because besides Agria and Induston, we have other subregions. - In the example we used literal values to specify the keys that define parts and wholes. It is possible to, recognize patterns, for example any years (4 digits) as a whole and a quarter as a part. See also the next example. Supported patterns include regular expressions (shown here) and globbing (see help file).
- It is important that the variables listed in
by
(if any) uniquely specify a single aggregate. So here, for each measure and period, the label"Samplonia"
should occur at most once (if it does not occur the result will beNA
). - The default way to aggregate is to take the sum. You can specify other ways
to aggregate by passing an
aggregator
argument. For exampleaggregator=mean
. - By default, the aggregate in the data is compared with the computed aggregate
up to a tolerance of \(10^{-8}\). This tolerance can be set using the
tol
argument. E.g. for integer data you may want to settol=0
.
5.5 Aggregates of time series in long format
We are going to check whether quarterly time series add up to the annual time series. This is more complicated because of two subtleties.
First there is not one fixed aggregate key, like "Samplonia"
. Rather, we
have a key pattern. Each total is defined by a period label that consists of
precisely four digits. So rather than recognizing a specific year we want to
recognize that a key represents any year. This can be done using a regular
expression of the form "^\\d{4}$"
, where the ^
indicates ‘start of string’,
the \\d{4}
indicates ‘four times a digit’ and $
indicates ‘end of string’.
Second, we wish to check annual totals against the sum over quarters for each region and each
measure. However, a value-combination of measure and region does not single out
a single value for year
. For example, for the Induston export we have the following
annual data.
## region freq period measure value
## 63 Induston A 2014 export 518000
## 262 Induston A 2015 export 525000
## 462 Induston A 2016 export 532000
## 662 Induston A 2017 export 560000
## 862 Induston A 2018 export 553000
## 1062 Induston A 2019 export 553000
So in fact, we need to do the check by year as well as by measure and region.
Fortunately, in this case it is easy to derive a variable that indicates the year
by selecting the first four characters from period
.
rules <- validator(part_whole_relation(value
, labels = period
, whole = rx("^\\d{4}$")
, by = list(region, substr(period,1,4), measure)
))
out <- confront(samplonomy, rules)
We use rx("^\\d{4}")
to tell part_whole_relation
that this string must be
interpreted as a regular expression. Here, we do not indicate part
labels
explicitly: by default any record not matching whole
will be treated as a
detail that must be used to compute the total.
## named list()
## name items passes fails nNA error warning
## 1 V1 1199 1180 9 10 FALSE FALSE
We now get 9 fails and 10 missing values. We can filter out records that
have NA
(lacking) results.
## region freq period measure value
## 24 Crowdon A 2014 balance 1600
## 28 Crowdon Q 2014Q1 balance NA
## 32 Crowdon Q 2014Q2 balance 480
## 36 Crowdon Q 2014Q3 balance 480
## 40 Crowdon Q 2014Q4 balance 320
## 1181 Wheaton A 2019 import 62000
## 1185 Wheaton Q 2019Q1 import 6200
## 1189 Wheaton Q 2019Q2 import NA
## 1193 Wheaton Q 2019Q3 import 31000
## 1197 Wheaton Q 2019Q4 import 12400
There are two blocks where the annual total could not be compared with
the sum over quarterly series. The balance value of Crowdon is missing
for "2014Q1"
as well as the import value of Wheaton for "2019Q2"
.
Similarly, we can inspect the failing blocks
## region freq period measure value
## 204 Agria Q 2015Q1 gdp 58200
## 208 Agria Q 2015Q2 gdp 116400
## 212 Agria Q 2015Q3 gdp 291000
## 216 Agria Q 2015Q4 gdp 116400
## 862 Induston A 2018 export 553000
## 866 Induston Q 2018Q1 export 110600
## 870 Induston Q 2018Q2 export 165900
## 871 Induston Q 2018Q2 export 170000
## 878 Induston Q 2018Q4 export 110600