Chapter 1 Introduction to validate

Data Validation is an activity verifying whether or not a combination of values is a member of a set of acceptable combinations (Di Zio et al , 2015).

The validate package is intended to make checking your data easy, maintainable, and reproducible. It does this by allowing you to

  • test data against a reusable set of data validation rules:
  • investigate, summarise, and visualise data validation results;
  • import and export rule sets from and to various formats;
  • filter, select and otherwise manipulate data validation rules’;
  • investigate, summarise, and visualise rule sets.

For advanced rule manipulation there is the validatetools package.

1.1 A quick example

Here’s an example demonstrating the typical workflow. We’ll use the built-in cars data set, which contains 50 cases of speed and stopping distances of cars.

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4

Validating data is all about checking whether a data set meets presumptions or expectations you have about it, and the validate package makes it easy for you to define those expectations. Let’s do a quick check on variables in the cars data set. We first load the package, and create a list of data quality demands with the validator() function.

Here, the first three rules are record-wise checks: each record will yield one answer. In the last rule we check whether speed and distance are positively correlated this will yield a single TRUE or FALSE for the whole data set. We now confront the data with those rules and save the output into a variable called out.

The easiest way to check the results is with summary().

##   name items passes fails nNA error warning              expression
## 1   V1    50     50     0   0 FALSE   FALSE     speed - 0 >= -1e-08
## 2   V2    50     50     0   0 FALSE   FALSE      dist - 0 >= -1e-08
## 3   V3    50     48     2   0 FALSE   FALSE       speed/dist <= 1.5
## 4   V4     1      1     0   0 FALSE   FALSE cor(speed, dist) >= 0.2

This returns a data frame with one line of information for each rule V1, V2, V3 and V4. To be precise:

  • How many data items were checked against each rule.
  • How many items passed, failed or resulted in NA.
  • Whether the check resulted in an error (could not be performed) or gave an warning.
  • The expression that was actually evaluated to perform the check.

The same information can be summarized graphically as follows .

In this plot each horizontal bar indicates the percentage of Failing, Passing, and Missing cases. The table in the legend lists the total number of Fails, Passes and Missings, summed over all checks. Here, we have 4 rules. The first three rules yield 50 results each, while the last rule yields a single result. Hence there are 151 validation results in total.

Using the function violating we can select the records that violate one or more rules. We select only the first three results because the last rule can not be interpreted record by record.

##   speed dist
## 1     4    2
## 3     7    4

We can extract all individual resuls results using for example as.data.frame.

##   name value          expression
## 1   V1  TRUE speed - 0 >= -1e-08
## 2   V1  TRUE speed - 0 >= -1e-08
## 3   V1  TRUE speed - 0 >= -1e-08

We see that in record 1, rule V1, was satisfied (the result is TRUE), and that validate left a bit of slack when executing the rule, to avoid false negatives caused by machine rounding issues.

Summarizing, the basic workflow in validate is to create a rule set, confront a data set with the rules in the rule set, and then analyze or use the results further. To understand which checks you can perform with validate you only need to remember the following.

Any R expression that results in a logical is accepted by validate as a validation rule.

You are now ready to start validating your data, and navigate Chapters 3-5 to learn how to define specific types of checks. Chapter~7, discusses more details about working with validate.