Chapter 7 Working with validate

In this section we dive deeper into the central object types used in the package: the validator object type for storing lists of rules, and the confrontation object type for storing the results of a validation.

7.1 Manipulating rule sets

The validate package stores rule sets in a validator object. The validator() function creates such an object.
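
A call that would produce a rule set like the one printed below might look as follows. This is a sketch: it assumes the rules refer to the speed and dist columns of R's built-in cars data set, and the object name v is chosen to match the text that follows.

```r
library(validate)

# Three rules on the speed and dist columns of the built-in cars data set.
v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Printing the object lists the rules with their names.
v
```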

## Object of class 'validator' with 3 elements:
##  V1: speed >= 0
##  V2: dist >= 0
##  V3: speed/dist <= 1.5

Validator objects behave a lot like lists. For example, you can select items to get a new validator. Here, we select the first and third element.
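
A minimal sketch of selecting by position, assuming v holds the three rules shown earlier:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Select the first and third rule by position, as for a list.
w <- v[c(1, 3)]
w
```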

Here w is a new validator object holding only the first and third rule from v. If not specified by the user, rules are given the default names "V1", "V2", and so on. Those names can also be used for selecting rules.
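
Selection by name might look like this, again assuming the default names V1 to V3:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Select the same two rules, this time by their default names.
w <- v[c("V1", "V3")]
names(w)
```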

Validator objects are reference objects. This means that if you do

w <- v

then w is not a copy of v. It is just another name for the same physical object as v. To make an actual copy, you can select everything.

w <- v[]

It is also possible to concatenate two validator objects, for example when you read two rule sets from two files (see 8.1). This is done by adding them together with +.
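
As a sketch, two small rule sets (the names rules1, rules2 and all_rules are illustrative) can be combined like this:

```r
library(validate)

rules1 <- validator(speed >= 0)
rules2 <- validator(dist >= 0)

# Adding two validator objects yields one validator holding the rules of both.
all_rules <- rules1 + rules2
all_rules
```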

An empty validator object is created with validator().

If you select a single element of a validator object, an object of class ‘rule’ is returned. This is the validating expression entered by the user, plus some (optional) metadata.
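
Selecting a single element uses double brackets, as for a list. A sketch, assuming the three-rule validator v from the start of this section:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Double brackets return a single rule object rather than a validator.
r <- v[[3]]
r
```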

## 
## Object of class rule.
##  expr       : speed/dist <= 1.5 
##  name       : V3 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2023-05-01 11:10:39
##  meta       : language<chr>, severity<chr>

Users never need to manipulate rule objects, but it can be convenient to inspect them. As you see, the rules have some automatically created metadata. In the next section we demonstrate how to retrieve and set the metadata.

7.2 Rule metadata

Validator objects behave a lot like lists. The only metadata in an R list are the names of its elements, which you get with names() and set with names<-. Similarly, there are getter/setter functions for rule metadata.

  • origin(): where the rule was defined
  • names(): the name of each rule
  • created(): when the rule was created
  • label(): short description of the rule
  • description(): long description of the rule
  • meta(): get or set generic metadata

Names can be set on the command line, just like how you would do it for an R list.
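
A sketch of naming rules at creation time, using the rule names that appear in the output below (the object name rules is an assumption):

```r
library(validate)

# Named arguments become rule names, just like named elements of list().
rules <- validator(positive_speed = speed >= 0, ratio = speed/dist <= 1.5)
rules
```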

## Object of class 'validator' with 2 elements:
##  positive_speed: speed >= 0
##  ratio         : speed/dist <= 1.5

Getting and setting names works the same as for lists.
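
For illustration, names can be read and then partly replaced. The new name nonnegative_speed is just an example value:

```r
library(validate)

rules <- validator(positive_speed = speed >= 0, ratio = speed/dist <= 1.5)

# Read the current names.
old_names <- names(rules)
old_names

# Replace the first name, as you would for a list.
names(rules)[1] <- "nonnegative_speed"
names(rules)
```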

## [1] "positive_speed" "ratio"

The functions origin(), created(), label(), and description() work in the same way. It is also possible to add generic key-value pairs as metadata. Getting and setting follows the usual recycling rules of R.
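
A sketch of setting a label per rule and a generic key-value pair on all rules. The label texts and the key bar with value "baz" are made-up illustrations:

```r
library(validate)

rules <- validator(positive_speed = speed >= 0, ratio = speed/dist <= 1.5)

# One label per rule.
label(rules) <- c("Speed is not negative", "Speed to distance ratio is bounded")
label(rules)

# A single value is recycled over all rules.
meta(rules, "bar") <- "baz"
meta(rules)
```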

Metadata can be made visible by selecting a single rule:

## 
## Object of class rule.
##  expr       : speed >= 0 
##  name       : V1 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2023-05-01 11:10:39
##  meta       : language<chr>, severity<chr>

Or by extracting it to a data.frame:
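
Calling meta() without a key returns the metadata of all rules. A sketch, assuming the three-rule validator v:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# One row of metadata per rule.
md <- meta(v)
md
```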

##   name label description       origin             created       language
## 1   V1                   command-line 2023-05-01 11:10:39 validate 1.1.3
## 2   V2                   command-line 2023-05-01 11:10:39 validate 1.1.3
## 3   V3                   command-line 2023-05-01 11:10:39 validate 1.1.3
##   severity
## 1    error
## 2    error
## 3    error

Some general information is obtained with summary:
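
A sketch of the call, assuming the three-rule validator v:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Summarize the rule set per block of connected rules.
s <- summary(v)
s
```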

##   block nvar rules linear
## 1     1    2     3      2

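Here, some properties are given per block of rules. Two rules belong to the same block when they share a variable, directly or through a chain of other rules. In this case V1 and V2 have no variable in common, but each shares a variable with V3, so all rules occur in the same block.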

The number of rules can be requested with length:
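
For example, assuming the three-rule validator v:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Number of rules in the set.
n <- length(v)
n
```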

## [1] 3

With variables(), you can request the variables occurring over all the rules, or per rule.
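
A sketch of both forms, assuming the three-rule validator v. The as = "matrix" argument requests the rule-by-variable table:

```r
library(validate)

v <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# All variables occurring in the rule set.
vars <- variables(v)
vars

# Logical matrix: which variable occurs in which rule.
m <- variables(v, as = "matrix")
m
```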

## [1] "speed" "dist"
##     variable
## rule speed  dist
##   V1  TRUE FALSE
##   V2 FALSE  TRUE
##   V3  TRUE  TRUE

7.3 Rules in data frames

You can read and write rules and their metadata from and to data frames. This is convenient, for example, when rules are retrieved from a central rule repository in a database.

Exporting rules and their metadata can be done with as.data.frame.

Reading from a data frame is done through the .data argument.
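
The export and import steps can be sketched as a round trip. The object names df and rules2 are illustrative:

```r
library(validate)

rules <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)

# Export the rules and their metadata to a data frame.
df <- as.data.frame(rules)
names(df)

# Read the rules back in from the data frame.
rules2 <- validator(.data = df)
rules2
```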

It is not necessary to define all possible metadata in the data frame. It is sufficient to have three character columns, named rule, name and description in any order.
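
As a sketch, a hand-built data frame with only these three columns might look like this. The rule names and descriptions are invented for the example:

```r
library(validate)

# Minimal rule table: three character columns, in any order.
df <- data.frame(
  rule        = c("speed >= 0", "dist >= 0"),
  name        = c("speed_nonneg", "dist_nonneg"),
  description = c("Speed cannot be negative", "Distance cannot be negative"),
  stringsAsFactors = FALSE
)

rules <- validator(.data = df)
rules
```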

7.4 Validation rule syntax

Conceptually, any R statement that evaluates to a logical is considered a validating statement. The validate package checks this when the user defines a rule set, so for example calling validator( mean(height) ) results in a warning, since just computing mean(height) does not validate anything.
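
One way to observe this is to catch the warning. This is a sketch; the exact warning text is not reproduced here, and the string returned by the handler is just a marker chosen for the example:

```r
library(validate)

# mean(height) does not evaluate to a logical, so it is not a validating rule.
res <- tryCatch(
  validator(mean(height)),
  warning = function(w) "warning caught"
)
res
```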

You will find a concise description of the syntax in the syntax help file (?syntax).

In short, you can use

  • Type checks: any function starting with is.
  • Binary comparisons: <, <=, ==, !=, >=, > and %in%
  • Unary logical operators: !, all(), any()
  • Binary logical operators: &, &&, |, || and logical implication, e.g. if (staff > 0) staff.costs > 0
  • Pattern matching: grepl
  • Functional dependency: \(X\to Y + Z\) is represented by X ~ Y + Z.
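
To illustrate a few of these elements together, here is a small sketch. The data frame dat, its columns and its values are invented for the example:

```r
library(validate)

rules <- validator(
  is.numeric(staff),               # type check
  staff >= 0,                      # binary comparison
  if (staff > 0) staff.costs > 0,  # logical implication
  sector %in% c("A", "B")          # membership test
)

# Hypothetical data: the third record has staff but no staff costs,
# and a sector code that is not in the allowed set.
dat <- data.frame(
  staff       = c(0, 5, 3),
  staff.costs = c(0, 100, 0),
  sector      = c("A", "B", "C")
)

s <- summary(confront(dat, rules))
s
```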

There are some extra syntax elements that help in defining complex rules.

  • Inspect the whole data set using ., e.g. validator( nrow(.) > 10).
  • Reuse a variable using :=, e.g. validator(m := mean(x), x < 2*m ).
  • Apply the same rule to multiple groups with var_group. For example validator(G:=var_group(x,y), G > 0) is equivalent to validator(x>0, y>0).
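
The three elements can be combined in one rule set. In this sketch the data frame dat and its columns x and y are made up:

```r
library(validate)

rules <- validator(
  nrow(.) >= 3,          # '.' refers to the whole data set
  m := mean(x),          # define m once ...
  x < 2 * m,             # ... and reuse it in a rule
  G := var_group(x, y),  # group of variables
  G > 0                  # applied to x and to y
)

dat <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

s <- summary(confront(dat, rules))
s
```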

A few helper functions are available to compute groupwise values on variables (vectors). They differ from functions like aggregate or tapply in that their result is always of the same length as the input.
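
For instance, a groupwise sum over two groups of five values each, consistent with the output below:

```r
library(validate)

# Sum 1:5 within group "a" and 6:10 within group "b".
# Each element is replaced by the sum of its group.
out <- sum_by(1:10, by = rep(c("a", "b"), each = 5))
out
```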

##  [1] 15 15 15 15 15 40 40 40 40 40

This is useful for rules where you want to compare individual values with group aggregates.

function          computes
do_by             generic groupwise calculation
sum_by            groupwise sum
min_by, max_by    groupwise min, max
mean_by           groupwise mean
median_by         groupwise median

See also Section 5.1.
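
As a sketch of comparing individual values with a group aggregate, the rule below requires that no single amount exceeds 80% of its group total. The data, the column names and the 80% threshold are invented for the example:

```r
library(validate)

dat <- data.frame(
  group  = rep(c("a", "b"), each = 3),
  amount = c(1, 2, 3, 10, 20, 300)
)

# Compare each amount with 80% of the total of its own group.
rules <- validator(amount <= 0.8 * sum_by(amount, by = group))

s <- summary(confront(dat, rules))
s
```

In group b the total is 330, so the value 300 exceeds the bound of 264 and is the only record that fails.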

There are a number of functions that perform a particular validation task that would be hard to express with basic syntax. These are treated extensively in Chapters 2 to 5, but here is a quick overview.

function               checks
in_range               Numeric variable range
is_unique              Uniqueness of variable combinations
all_unique             Equivalent to all(is_unique())
is_complete            Completeness of records
all_complete           Equivalent to all(is_complete())
exists_any             For each group, check if any record satisfies a rule
exists_one             For each group, check if exactly one record satisfies a rule
is_linear_sequence     Linearity of numeric or date/time/period series
in_linear_sequence     Linearity of numeric or date/time/period series
hierarchy              Hierarchical aggregations
part_whole_relation    Generic part-whole relations
field_length           Field length
number_format          Numeric format in text fields
field_format           Field format
contains_exactly       Availability of records
contains_at_least      Availability of records
contains_at_most       Availability of records
does_not_contain       Correctness of key combinations
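
A brief sketch using three of these functions. The data frame and the age bounds 0 and 120 are assumptions made for the example:

```r
library(validate)

dat <- data.frame(
  id  = c(1, 2, 2),
  age = c(25, 140, NA)
)

rules <- validator(
  in_range(age, min = 0, max = 120),  # range check
  is_unique(id),                      # each id should occur once
  is_complete(id, age)                # no missing values in id and age
)

s <- summary(confront(dat, rules))
s
```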

7.5 Confrontation objects

The outcome of confronting a validator object with a data set is an object of class confrontation. There are several ways to extract information from a confrontation object.

  • summary: summarize output; returns a data.frame
  • aggregate: aggregate validation in several ways
  • sort: aggregate and sort in several ways
  • values: get the values in an array, or a list of arrays if rules have different output dimension structure
  • errors: retrieve error messages caught during the confrontation
  • warnings: retrieve warning messages caught during the confrontation
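
A sketch of creating a confrontation object and extracting information from it. The three rules on the built-in women data set are an assumption, chosen so that the object cf has three rules:

```r
library(validate)

rules <- validator(height > 0, weight > 0, height/weight < 0.5)
cf <- confront(women, rules)

# One row per rule with counts of passes, fails and missing results.
summary(cf)

# Raw results: one row per record, one column per rule.
vals <- values(cf)
head(vals)

# Messages caught during the confrontation (none are expected here).
errors(cf)
warnings(cf)
```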

By default aggregates are produced by rule.

## NULL

To aggregate by record, use by='record'

## list()

Aggregated results can be automatically sorted, so records with the most violations or rules that are violated most sort higher.

## NULL
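
The aggregation and sorting calls discussed above can be sketched as follows. The confrontation object cf is the same assumed example on the women data set; no output is shown here because it depends on the data and the package version:

```r
library(validate)

cf <- confront(women, validator(height > 0, weight > 0, height/weight < 0.5))

# Aggregate by rule (the default).
agg_rule <- aggregate(cf)

# Aggregate by record.
agg_record <- aggregate(cf, by = "record")

# Aggregate and sort, so that the worst cases come first.
sorted <- sort(cf)
```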

Confrontation objects can be subsetted with single bracket operators (like vectors), to obtain a sub-object pertaining only to the selected rules.

summary(cf[c(1,3)])

7.6 Confrontation options

By default, all errors and warnings are caught when validation rules are confronted with data. This can be switched off by setting the raise option to "errors" or "all". The following example contains a specification error: hite should be height. The rule therefore errors on the women data.frame, which does not contain a column hite. The error is caught (it does not result in an R error) and shown in the summary:
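
A sketch of the rule set and the call behind this summary, using the misspelled variable hite described above:

```r
library(validate)

# 'hite' is a deliberate misspelling of 'height'.
v <- validator(hite > 0, weight > 0)

# The error is caught and reported in the 'error' column.
s <- summary(confront(women, v))
s
```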

##   name items passes fails nNA error warning expression
## 1   V1     0      0     0   0  TRUE   FALSE   hite > 0
## 2   V2    15     15     0   0 FALSE   FALSE weight > 0

Setting raise to "all" results in an R error:

## Error in fun(...): object 'hite' not found
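
To see this without halting a script, the call can be wrapped in tryCatch. This is a sketch; the wrapper is an addition for illustration and is not needed in interactive use:

```r
library(validate)

v <- validator(hite > 0, weight > 0)

# With raise = "all" the error is no longer caught by confront(),
# so it is caught here instead and returned as a condition object.
res <- tryCatch(
  confront(women, v, raise = "all"),
  error = function(e) e
)
conditionMessage(res)
```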

Linear equalities form an important class of validation rules. To prevent equalities from being tested strictly, there is an option called lin.eq.eps (with default value \(10^{-8}\)) that adds some slack to these tests. The slack is intended to prevent false negatives (unnecessary failures) caused by machine rounding. If you want to check whether a sum rule is satisfied to within one or two units of measurement, it is cleaner to define two inequalities for that.
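
The effect of the slack can be sketched with a sum that is exact on paper but not in floating point arithmetic. The data are invented, passing lin.eq.eps directly to confront() is an assumption about how options are forwarded, and the two-unit tolerance in the last rule set is an arbitrary example:

```r
library(validate)

# In floating point arithmetic, 0.1 + 0.2 is not exactly equal to 0.3.
dat <- data.frame(a = 0.1, b = 0.2, total = 0.3)
rules <- validator(a + b == total)

# Default settings: the equality is tested with a small tolerance.
s_default <- summary(confront(dat, rules))
s_default

# Tolerance set to zero: the equality is tested strictly.
s_strict <- summary(confront(dat, rules, lin.eq.eps = 0))
s_strict

# A tolerance in units of measurement, written as two inequalities.
tolerant <- validator(a + b - total <= 2, a + b - total >= -2)
s_tolerant <- summary(confront(dat, tolerant))
s_tolerant
```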

7.7 Using reference data

For some checks it is convenient to compare the data under scrutiny with other data artifacts. Two common examples include:

  • Data is checked against an earlier version of the same dataset.
  • We wish to check the contents of a column against a code list, and we do not want to put the code list hard-coded into the rule set.

For this, we can use the ref option in confront. Here is how to compare columns from two data frames row-by-row. The user has to make sure that the rows of the data set under scrutiny (women) match row-wise with the reference data set (women1).
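
A sketch of such a comparison. Here women1 is simply a copy of women, and women_reference is the name under which the rule refers to the reference data:

```r
library(validate)

# Reference data: here just a copy of the data under scrutiny.
women1 <- women

# The rule refers to the reference data by the name given in 'ref'.
rules <- validator(height == women_reference$height)
cf <- confront(women, rules, ref = list(women_reference = women1))

s <- summary(cf)
s
```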

##   name items passes fails nNA error warning
## 1   V1    15     15     0   0 FALSE   FALSE
##                              expression
## 1 height == women_reference[["height"]]

Here is how to make a code list available.
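
A sketch with a small code list. The fruit names and the data frame dat are example values:

```r
library(validate)

# The rule refers to a code list that is supplied at confrontation time.
rules <- validator(fruit %in% codelist)

fruits <- c("apple", "banana", "orange")
dat <- data.frame(fruit = c("apple", "broccoli", "orange", "banana"))

cf <- confront(dat, rules, ref = list(codelist = fruits))

s <- summary(cf)
s
```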

##   name items passes fails nNA error warning           expression
## 1   V1     4      3     1   0 FALSE   FALSE fruit %vin% codelist