Chapter 7 Working with validate

In this chapter we dive deeper into the central object types used in the package: the validator object type for storing lists of rules, and the confrontation object type for storing the results of a validation.

7.1 Manipulating rule sets

The validate package stores rule sets in a validator object. The validator() function creates such an object.

v <- validator(speed >= 0, dist>=0, speed/dist <= 1.5)
v

## Object of class 'validator' with 3 elements:
##  V1: speed >= 0
##  V2: dist >= 0
##  V3: speed/dist <= 1.5

Validator objects behave a lot like lists. For example, you can select items to get a new validator. Here, we select the first and third element.

w <- v[c(1,3)]

Here w is a new validator object holding only the first and third rule from v. If not specified by the user, rules are given the default names "V1", "V2", and so on. Those names can also be used for selecting rules.

w <- v[c("V1","V3")]

Validator objects are reference objects. This means that if you do

w <- v

then w is not a copy of v. It is just another name for the same physical object as v. To make an actual copy, you can select everything.

w <- v[]
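
The difference between a second name and a copy can be checked with a small experiment. The sketch below assumes the validate package is installed; the rule name "nonnegative_speed" is chosen only for illustration.

```r
library(validate)

v <- validator(speed >= 0, dist >= 0)
w <- v      # a second name for the same object
u <- v[]    # an independent copy, as described above

# Rename the first rule through w only.
names(w)[1] <- "nonnegative_speed"

# If v and w refer to one object, the new name is visible through v as well.
names(v)
# The copy u was made before the change and keeps its own names.
names(u)
```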

It is also possible to concatenate two validator objects, for example when you read two rule sets from two files (see 8.1). This is done by adding them together with +.

rules1 <- validator(speed>=0)
rules2 <- validator(dist >= 0)
all_rules <- rules1 + rules2

An empty validator object is created with validator().
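
As a quick check of concatenation and of the empty validator, the following sketch counts the rules with length() and runs the combined set against R's built-in cars data set. Using cars here is an assumption for illustration; it happens to contain the columns speed and dist.

```r
library(validate)

rules1 <- validator(speed >= 0)
rules2 <- validator(dist >= 0)
empty  <- validator()

# Concatenate the two rule sets.
all_rules <- rules1 + rules2

length(empty)      # number of rules in the empty validator
length(all_rules)  # number of rules after concatenation

# Confront the built-in cars data with the combined rule set.
cf <- confront(cars, all_rules)
summary(cf)
```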

If you select a single element of a validator object, an object of class ‘rule’ is returned. This is the validating expression entered by the user, plus some (optional) metadata.

v[[3]]

## 
## Object of class rule.
##  expr       : speed/dist <= 1.5 
##  name       : V3 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2025-04-25 14:31:10
##  meta       : language<chr>, severity<chr>

Users never need to manipulate rule objects, but it can be convenient to inspect them. As you see, the rules have some automatically created metadata. In the next section we demonstrate how to retrieve and set the metadata.

7.2 Rule metadata

Validator objects behave a lot like lists. The only metadata in an R list are the names of its elements. You can get and set the names of a list using the names() and names<- functions. Similarly, there are getter/setter functions for rule metadata.

origin(): where was a rule defined?
names(): the name of each rule
created(): when was a rule created?
label(): short description of the rule
description(): long description of the rule
meta(): get or set generic metadata

Names can be set on the command line, just like how you would do it for an R list.

rules <- validator(positive_speed = speed >= 0, ratio = speed/dist <= 1.5)
rules

## Object of class 'validator' with 2 elements:
##  positive_speed: speed >= 0
##  ratio         : speed/dist <= 1.5

Getting and setting names works the same as for lists.

names(rules)

## [1] "positive_speed" "ratio"

names(rules)[1] <- "nonnegative_speed"

The functions origin(), created(), label(), and description() work in the same way. It is also possible to add generic key-value pairs as metadata. Getting and setting follows the usual recycling rules of R.

# add 'foo' to the first rule:
meta(rules[1],"foo") <- 1
# Add 'bar' to all rules
meta(rules,"bar") <- "baz"
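
Labels and descriptions can be set in the same style as names. The sketch below assumes the validate package is installed; the label and description texts are made up for illustration.

```r
library(validate)

rules <- validator(positive_speed = speed >= 0, ratio = speed/dist <= 1.5)

# Set a short label for every rule at once.
label(rules) <- c("Nonnegative speed", "Speed to distance ratio")

# Set a long description for the second rule only.
description(rules)[2] <- "The ratio of speed to distance should not exceed 1.5."

label(rules)
description(rules)

# All metadata, one row per rule.
meta(rules)
```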

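Metadata can be made visible by selecting a single rule. The example below shows the first rule of the validator v from Section 7.1, which carries only the default metadata: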

v[[1]]

## 
## Object of class rule.
##  expr       : speed >= 0 
##  name       : V1 
##  label      :  
##  description:  
##  origin     : command-line 
##  created    : 2025-04-25 14:31:10
##  meta       : language<chr>, severity<chr>

Or by extracting it to a data frame with meta():

meta(v)

##   name label description       origin             created       language
## 1   V1                   command-line 2025-04-25 14:31:10 validate 1.1.5
## 2   V2                   command-line 2025-04-25 14:31:10 validate 1.1.5
## 3   V3                   command-line 2025-04-25 14:31:10 validate 1.1.5
##   severity
## 1    error
## 2    error
## 3    error

Some general information is obtained with summary().

summary(v)

##   block nvar rules linear
## 1     1    2     3      2

Here, some properties per block of rules are given. Two rules occur in the same block if they share a variable. In this case, all rules occur in the same block.

The number of rules can be requested with length().

length(v)

## [1] 3

With variables(), you can request the variables occurring over all rules, or per rule.

variables(v)

## [1] "speed" "dist"

variables(v,as="matrix")

##     variable
## rule speed  dist
##   V1  TRUE FALSE
##   V2 FALSE  TRUE
##   V3  TRUE  TRUE

7.3 Rules in data frames

You can read and write rules and their metadata from and to data frames. This is convenient, for example, when rules are retrieved from a central rule repository in a database.

Exporting rules and their metadata can be done with as.data.frame.

rules <- validator(speed >= 0, dist >= 0, speed/dist <= 1.5)
df <- as.data.frame(rules)

Reading from a data frame is done through the .data argument.

rules <- validator(.data=df)

It is not necessary to define all possible metadata in the data frame. It is sufficient to have three character columns, named rule, name and description in any order.
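
A minimal sketch of such a data frame is given below. The rule names and descriptions are invented for illustration; only the three column names rule, name and description are taken from the text above.

```r
library(validate)

# Three character columns are enough to define a rule set.
df <- data.frame(
  rule        = c("speed >= 0", "dist >= 0"),
  name        = c("speed_nonneg", "dist_nonneg"),
  description = c("Speed cannot be negative", "Distance cannot be negative"),
  stringsAsFactors = FALSE
)

rules <- validator(.data = df)
rules
```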

7.4 Validation rule syntax

Conceptually, any R statement that evaluates to a logical is considered a validating statement. The validate package checks this when the user defines a rule set. For example, calling validator( mean(height) ) results in a warning, since just computing mean(height) does not validate anything.

You will find a concise description of the syntax in the syntax help file.

?syntax

In short, you can use

Type checks: any function whose name starts with is. (for example is.numeric)
Binary comparisons: <, <=, ==, !=, >=, > and %in%
Unary logical operators: !, all(), any()
Binary logical operators: &, &&, |, || and logical implication, e.g. if (staff > 0) staff.costs > 0
Pattern matching: grepl
Functional dependency: \(X\to Y + Z\) is represented by X ~ Y + Z.

There are some extra syntax elements that help in defining complex rules.

Inspect the whole data set using ., e.g. validator( nrow(.) > 10).
Reuse a variable using :=, e.g. validator(m := mean(x), x < 2*m ).
Apply the same rule to multiple groups with var_group. For example validator(G:=var_group(x,y), G > 0) is equivalent to validator(x>0, y>0).
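
The first two of these elements can be combined in one rule set. The sketch below uses R's built-in women data set, which is an assumption made for illustration, and the variable name m is arbitrary.

```r
library(validate)

rules <- validator(
  m := mean(height),   # reusable value, not a rule in itself
  height < 2 * m,      # one result per record
  nrow(.) > 10         # one result for the whole data set
)

cf <- confront(women, rules)
summary(cf)
```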

A few helper functions are available to compute groupwise values on variables (vectors). They differ from functions like aggregate or tapply in that their result is always of the same length as the input.

sum_by(1:10, by = rep(c("a","b"), each=5) )

##  [1] 15 15 15 15 15 40 40 40 40 40

This is useful for rules where you want to compare individual values with group aggregates.

function	computes
`do_by`	generic groupwise calculation
`sum_by`	groupwise sum
`min_by`, `max_by`	groupwise min, max
`mean_by`	groupwise mean
`median_by`	groupwise median
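
As an illustration of comparing individual values with a group aggregate, the sketch below checks that no single record accounts for more than 80% of its group total. The data frame, its column names and the 80% threshold are all hypothetical.

```r
library(validate)

# Hypothetical data: turnover per unit within two regions.
dat <- data.frame(
  region   = c("a", "a", "b", "b"),
  turnover = c(10, 90, 50, 50)
)

# Each turnover value is compared with 80% of its regional total.
rules <- validator(turnover <= 0.8 * sum_by(turnover, by = region))

cf <- confront(dat, rules)
summary(cf)
values(cf)
```

In this data the second record holds 90 of the 100 units in region "a", so it is the only one that violates the rule.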

function	checks
`in_range`	Numeric variable range
`is_unique`	Uniqueness of variable combinations
`all_unique`	Equivalent to `all(is_unique())`
`is_complete`	Completeness of records
`all_complete`	Equivalent to `all(is_complete())`
`exists_any`	For each group, check if any record satisfies a rule
`exists_one`	For each group, check if exactly one record satisfies a rule
`is_linear_sequence`	Linearity of numeric or date/time/period series
`in_linear_sequence`	Linearity of numeric or date/time/period series
`hierarchy`	Hierarchical aggregations
`part_whole_relation`	Generic part-whole relations
`field_length`	Field length
`number_format`	Numeric format in text fields
`field_format`	Field format
`contains_exactly`	Availability of records
`contains_at_least`	Availability of records
`contains_at_most`	Availability of records
`does_not_contain`	Correctness of key combinations
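
A few of these functions are combined in the sketch below, applied to R's built-in women data set. The data set and the range limits 50 and 80 are assumptions chosen for illustration.

```r
library(validate)

rules <- validator(
  in_range(height, min = 50, max = 80),  # numeric range check
  is_unique(height),                     # no duplicated heights
  is_complete(height, weight)            # no missing values in either column
)

cf <- confront(women, rules)
summary(cf)
```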

7.5 Confrontation objects

The outcome of confronting a validator object with a data set is an object of class confrontation. There are several ways to extract information from a confrontation object.

summary: summarize output; returns a data.frame
aggregate: aggregate validation in several ways
sort: aggregate and sort in several ways
values: get the values in an array, or a list of arrays if rules have different output dimension structure
errors: retrieve error messages caught during the confrontation
warnings: retrieve warning messages caught during the confrontation

By default aggregates are produced by rule.

v  <- validator(height > 0, weight > 0, height/weight < 0.5)
cf <- confront(women, v)
aggregate(cf)

The result has one row per rule, with the number and fraction of passes, fails and missing values. To aggregate by record, use by='record'.

head(aggregate(cf, by='record'))

Aggregated results can be automatically sorted, so records with the most violations or rules that are violated most sort higher.

# rules with most violations sorting first:
sort(cf)

Confrontation objects can be subsetted with single bracket operators (like vectors), to obtain a sub-object pertaining only to the selected rules.

summary(cf[c(1,3)])
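
The raw results behind these summaries can be inspected with values(). The sketch below rebuilds the confrontation so that it runs on its own; counting failures with colSums() is one possible approach, not a feature of the package.

```r
library(validate)

v  <- validator(height > 0, weight > 0, height/weight < 0.5)
cf <- confront(women, v)

# One row per record, one column per rule.
vals <- values(cf)
dim(vals)

# Number of failing records per rule.
colSums(!vals)

# Messages caught during the confrontation, if any.
errors(cf)
warnings(cf)
```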

7.6 Confrontation options

By default, all errors and warnings are caught when validation rules are confronted with data. This can be switched off by setting the raise option to "errors" or "all". The following example contains a specification error: hite should be height. The rule therefore errors on the women data frame, which has no column hite. The error is caught (it does not result in an R error) and is shown in the summary:

v <- validator(hite > 0, weight>0)
summary(confront(women, v))

##   name items passes fails nNA error warning expression
## 1   V1     0      0     0   0  TRUE   FALSE   hite > 0
## 2   V2    15     15     0   0 FALSE   FALSE weight > 0

Setting raise to "all" results in an R error:

# this gives an error
confront(women, v, raise='all')

## Error in fun(...): object 'hite' not found
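
Both behaviours can be compared side by side. In the sketch below, wrapping the call in tryCatch() is an illustrative choice that keeps the script running after the error is raised.

```r
library(validate)

v <- validator(hite > 0, weight > 0)

# Default: the error is caught and stored in the confrontation object.
cf <- confront(women, v)
errors(cf)

# With raise = "all" the error is thrown; tryCatch() captures its message.
res <- tryCatch(
  confront(women, v, raise = "all"),
  error = function(e) conditionMessage(e)
)
res
```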

Linear equalities form an important class of validation rules. To prevent equalities from being tested strictly, there is an option called lin.eq.eps (with default value \(10^{-8}\)) that adds some slack to these tests. The slack is intended to prevent false negatives (unnecessary failures) caused by machine rounding. If you want to check whether a sum rule is satisfied to within one or two units of measurement, it is cleaner to define two inequalities for that.
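
A sum rule with a tolerance of two units could be written as a pair of inequalities, as sketched below. The data frame, its column names and the rule names upper and lower are hypothetical.

```r
library(validate)

# Hypothetical data: a total and its two components.
dat <- data.frame(
  total = c(10, 20),
  a     = c(4, 10),
  b     = c(5, 15)
)

# total should equal a + b to within two units.
rules <- validator(
  upper = total - (a + b) <=  2,
  lower = total - (a + b) >= -2
)

cf <- confront(dat, rules)
summary(cf)
```

The first record is off by one unit and passes both rules. The second is off by five units and fails the lower bound.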

7.7 Using reference data

For some checks it is convenient to compare the data under scrutiny with other data artifacts. Two common examples include:

Data is checked against an earlier version of the same dataset.
We wish to check the contents of a column against a code list, and we do not want to put the code list hard-coded into the rule set.

For this, we can use the ref option in confront. Here is how to compare columns from two data frames row by row. The user has to make sure that the rows of the data set under scrutiny (women) match row-wise with those of the reference data set (women1).

women1 <- women
rules <- validator(height == women_reference$height)
cf <- confront(women, rules, ref = list(women_reference = women1))
summary(cf)

##   name items passes fails nNA error warning
## 1   V1    15     15     0   0 FALSE   FALSE
##                              expression
## 1 height == women_reference[["height"]]

Here is how to make a code list available.

rules <- validator( fruit %in% codelist )
fruits <-  c("apple", "banana", "orange")
dat <- data.frame(fruit = c("apple","broccoli","orange","banana"))
cf <- confront(dat, rules, ref = list(codelist = fruits))
summary(cf)

##   name items passes fails nNA error warning           expression
## 1   V1     4      3     1   0 FALSE   FALSE fruit %vin% codelist