Chapter 7 Working with validate
In this section we dive deeper into the the central object types used in the
package: the validator
object type for storing lists of rules, and the
confrontation
object type for storing the results of a validation.
7.1 Manipulating rule sets
Validate stores rulesets into something called a validator
object. The
validator()
function creates such an object.
## Object of class 'validator' with 3 elements:
## V1: speed >= 0
## V2: dist >= 0
## V3: speed/dist <= 1.5
Validator objects behave a lot like lists. For example, you can select items
to get a new validator
. Here, we select the first and third element.
Here w
is a new validator object holding only the first and third rule from
v
. If not specified by the user, rules are given the default names "V1"
,
"V2"
, and so on. Those names can also be used for selecting rules.
Validator objects are reference objects. This means that if you do
w <- v
then w
is not a copy of v
. It is just another name for the same physical
object as v
. To make an actual copy, you can select everything.
w <- v[]
It is also possible to concatenate two validator objects. For example when you
read two rule sets from two files (See 8.1). This is done
by adding them together with +
.
An empty validator object is created with validator()
.
If you select a single element of a validator object, an object of class ‘rule’ is returned. This is the validating expression entered by the user, plus some (optional) metadata.
##
## Object of class rule.
## expr : speed/dist <= 1.5
## name : V3
## label :
## description:
## origin : command-line
## created : 2023-05-01 11:10:39
## meta : language<chr>, severity<chr>
Users never need to manipulate rule objects, but it can be convenient to inspect them. As you see, the rules have some automatically created metadata. In the next section we demonstrate how to retrieve and set the metadata.
7.2 Rule metadata
Validator objects behave a lot like lists. The only metadata in an R
list are the names
of its elements. You can get and set names of a list
using the names<-
function. Similarly, there are getter/setter functions
for rule metadata.
origin()
: Where was a rule defined?names()
: The name per rulecreated()
: when were the rules created?label()
: Short description of the ruledescription()
: Long description of the rulemeta()
: Set or get generic metadata
Names can be set on the command line, just like how you would do it for an R list.
## Object of class 'validator' with 2 elements:
## positive_speed: speed >= 0
## ratio : speed/dist <= 1.5
Getting and setting names works the same as for lists.
## [1] "positive_speed" "ratio"
The functions origin()
, created()
, label()
, and description()
work in
the same way. It is also possible to add generic key-value pairs as metadata.
Getting and setting follows the usual recycling rules of R.
# add 'foo' to the first rule:
meta(rules[1],"foo") <- 1
# Add 'bar' to all rules
meta(rules,"bar") <- "baz"
Metadata can be made visible by selecting a single rule:
##
## Object of class rule.
## expr : speed >= 0
## name : V1
## label :
## description:
## origin : command-line
## created : 2023-05-01 11:10:39
## meta : language<chr>, severity<chr>
Or by extracting it to a data.frame
## name label description origin created language
## 1 V1 command-line 2023-05-01 11:10:39 validate 1.1.3
## 2 V2 command-line 2023-05-01 11:10:39 validate 1.1.3
## 3 V3 command-line 2023-05-01 11:10:39 validate 1.1.3
## severity
## 1 error
## 2 error
## 3 error
Some general information is obtained with summary
,
## block nvar rules linear
## 1 1 2 3 2
Here, some properties per block of rules is given. Two rules occur in the same block if when they share a variable. In this case, all rules occur in the same block.
The number of rules can be requested with length
## [1] 3
With variables
, the variables occurring per rule, or over all the rules can be requested.
## [1] "speed" "dist"
## variable
## rule speed dist
## V1 TRUE FALSE
## V2 FALSE TRUE
## V3 TRUE TRUE
7.3 Rules in data frames
You can read and write rules and their metadata from and to data frames. This is convenient, for example in cases where rules are retrieved from a central rule repository in a data base.
Exporting rules and their metadata can be done with as.data.frame
.
Reading from a data frame is done through the .data
argument.
It is not necessary to define all possible metadata in the data frame. It is
sufficient to have three character columns, named rule
, name
and
description
in any order.
7.4 Validation rule syntax
Conceptually, any R statement that will evaluate to a logical
is considered a
validating statement. The validate package checks this when the user defines a
rule set, so for example calling validator( mean(height) )
will result in a
warning since just computing mean(x)
does not validate anything.
You will find a concise description of the syntax in the syntax
help file.
In short, you can use
- Type checks: any function starting with
is.
- Binary comparisons:
<, <=, ==, !=, >=, >
and%in%
- Unary logical operators:
!, all(), any()
- Binary logical operators:
&, &&, |, ||
and logical implication, e.g.if (staff > 0) staff.costs > 0
- Pattern matching
grepl
- Functional dependency: \(X\to Y + Z\) is represented by
X ~ Y + Z
.
There are some extra syntax elements that help in defining complex rules.
- Inspect the whole data set using
.
, e.g.validator( nrow(.) > 10)
. - Reuse a variable using
:=
, e.g.validator(m := mean(x), x < 2*m )
. - Apply the same rule to multiple groups with
var_group
. For examplevalidator(G:=var_group(x,y), G > 0)
is equivalent tovalidator(x>0, y>0)
.
A few helper functions are available to compute groupwise values on
variables (vectors). They differ from functions like aggregate
or tapply
in that their result is always of the same length as the input.
## [1] 15 15 15 15 15 40 40 40 40 40
This is useful for rules where you want to compare individual values with group aggregates.
function | computes |
---|---|
do_by |
generic groupwise calculation |
sum_by |
groupwise sum |
min_by , max_by |
groupwise min, max |
mean_by |
groupwise mean |
median_by |
groupwise median |
See also Section 5.1.
There are a number of functions that perform a particular validation task that would be hard to express with basic syntax. These are treated extensively in Chapters 2 to 5, but here is a quick overview.
function | checks |
---|---|
in_range |
Numeric variable range |
is_unique |
Uniqueness of variable combinations |
all_unique |
Equivalent to all(is_unique()) |
is_complete |
Completeness of records |
all_complete |
Equivalent to all(is_complete()) |
exists_any |
For each group, check if any record satisfies a rule |
exists_one |
For each group, check if exactly one record satisfies a rule |
is_linear_sequence |
Linearity of numeric or date/time/period series |
in_linear_sequence |
Linearity of numeric of date/time/period series |
hierarchy |
Hierarchical aggregations |
part_whole_relation |
Generic part-whole relations |
field_length |
Field length |
number_format |
Numeric format in text fields |
field_format |
Field format |
contains_exactly |
Availability of records |
contains_at_least |
Availability of records |
contains_at_most |
Availability of records |
does_not_contain |
Correctness of key combinations |
7.5 Confrontation objects
The outcome of confronting a validator object with a data set is an object of
class confrontation
. There are several ways to extract information from a
confrontation
object.
summary
: summarize output; returns adata.frame
aggregate
: aggregate validation in several wayssort
: aggregate and sort in several waysvalues
: Get the values in an array, or a list of arrays if rules have different output dimension structureerrors
: Retrieve error messages caught during the confrontationwarnings
: Retrieve warning messages caught during the confrontation.
By default aggregates are produced by rule.
## NULL
To aggregate by record, use by='record'
## list()
Aggregated results can be automatically sorted, so records with the most violations or rules that are violated most sort higher.
## NULL
Confrontation objects can be subsetted with single bracket operators (like vectors), to obtain a sub-object pertaining only to the selected rules.
summary(cf[c(1,3)])
7.6 Confrontation options
By default, all errors and warnings are caught when validation rules are confronted with data. This can be switched off by setting the raise
option to "errors"
or "all"
. The following
example contains a specification error: hite
should be height
and therefore the rule errors
on the women
data.frame because it does not contain a column hite
. The error is caught
(not resulting in a R error) and shown in the summary,
## name items passes fails nNA error warning expression
## 1 V1 0 0 0 0 TRUE FALSE hite > 0
## 2 V2 15 15 0 0 FALSE FALSE weight > 0
Setting raise
to all
results in a R error:
## Error in fun(...): object 'hite' not found
Linear equalities form an important class of validation rules. To prevent
equalities to be strictly tested, there is an option called lin.eq.eps
(with
default value \(10^{-8}\)) that allows one to add some slack to these tests. The
amount of slack is intended to prevent false negatives (unnecessary failures)
caused by machine rounding. If you want to check whether a sum-rule is
satisfied to within one or two units of measurement, it is cleaner to define
two inequalities for that.
7.7 Using reference data
For some checks it is convenient to compare the data under scrutiny with other data artifacts. Two common examples include:
- Data is checked against an earlier version of the same dataset.
- We wish to check the contents of a column against a code list, and we do not want to put the code list hard-coded into the rule set.
For this, we can use the ref
option in confront. Here is how
to compare columns from two data frames row-by-row. The user
has to make sure that the rows of the data set under scrutiny
(women
) matches row-wise with the reference data set (women1
).
women1 <- women
rules <- validator(height == women_reference$height)
cf <- confront(women, rules, ref = list(women_reference = women1))
summary(cf)
## name items passes fails nNA error warning
## 1 V1 15 15 0 0 FALSE FALSE
## expression
## 1 height == women_reference[["height"]]
Here is how to make a code list available.
rules <- validator( fruit %in% codelist )
fruits <- c("apple", "banana", "orange")
dat <- data.frame(fruit = c("apple","broccoli","orange","banana"))
cf <- confront(dat, rules, ref = list(codelist = fruits))
summary(cf)
## name items passes fails nNA error warning expression
## 1 V1 4 3 1 0 FALSE FALSE fruit %vin% codelist