Chapter 2 Variable checks

Variable checks are checks that can be performed on a field-by-field basis. An example is checking that a variable called Age is nonnegative, or of integer type. Variable checks are among the simplest checks.

Data

In this section we will use the SBS2000 dataset, that is included with validate.

##      id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01  sc0      0.02    75       NA        NA      1130          NA
## 2 RET02  sc3      0.14     9     1607        NA      1607         131
## 3 RET03  sc3      0.14    NA     6886       -33      6919         324
##   total.costs profit vat
## 1       18915  20045  NA
## 2        1544     63  NA
## 3        6493    426  NA

See ?SBS2000 for a description.

2.1 Variable type

In R, one can test the type of a variable using built-in functions such as is.numeric or is.character.

## [1] TRUE
## [1] FALSE

In validate, any function starting with is. (‘is’ followed by a dot) is considered a validation function.

##   name items passes fails nNA error warning           expression
## 1   V1     1      0     1   0 FALSE   FALSE   is.character(size)
## 2   V2     1      1     0   0 FALSE   FALSE is.numeric(turnover)

We see that each rule checks a single item, namely one column of data. The first rule is violated (it is in fact a factor variable). The second rule is satisfied.

2.2 Missingness

Use R’s standard is.na() to check missing items in individual variables. Negate it to check that values are available.

##   name items passes fails nNA error warning        expression
## 1   V1    60     56     4   0 FALSE   FALSE  !is.na(turnover)
## 2   V2    60     24    36   0 FALSE   FALSE !is.na(other.rev)
## 3   V3    60     55     5   0 FALSE   FALSE    !is.na(profit)

We see that in 4 cases the variable turnover is missing, while other.rev and profit are missing respectively in 36 and 5 occasions.

To demand that all items must be present or absent for a certain variable, use R’s quantifiers: any() or all(), possibly negated.

##   name items passes fails nNA error warning             expression
## 1   V1     1      1     0   0 FALSE   FALSE !any(is.na(incl.prob))
## 2   V2     1      0     1   0 FALSE   FALSE        all(is.na(vat))
  • To check whether records or parts thereof are completed, see 4.1.
  • To check whether records are available at all, see 3.3.

2.3 Field length

The number of characters in text fields can be tested using either R’s standard nchar() function, or with the convenience function field_length.

##   name items passes fails nNA error warning
## 1   V1    60     60     0   0 FALSE   FALSE
## 2   V2    60     60     0   0 FALSE   FALSE
## 3   V3    60     60     0   0 FALSE   FALSE
##                             expression
## 1       nchar(as.character(size)) >= 2
## 2              field_length(id, n = 5)
## 3 field_length(size, min = 2, max = 3)

One advantage of check_field_length is that its argument is converted to character (recall that size is a factor variable). The function field_length can be used to either test for exact field lengths or to check whether the number of characters is within a certain range.

The field length is measured as the number of code points. Use type="width" to measure the printed width (nr of columns) or type="bytes" to count the number of bytes.

2.4 Format of numeric fields

For numbers that are stored in character type, there is a convenience function called number_format() that accepts a variable name and a format specification.

To check that the numbers are formatted with one figure before, and two figures after the decimal point, we perform the following check.

##         V1
## [1,]  TRUE
## [2,]  TRUE
## [3,] FALSE
## [4,] FALSE

Here, the specification format="d.dd" describes the allowed numeric formats. In this specification the "d" stands for a digit, any other character except the asterisk (*) stands for itself. The asterisk is interpreted as ‘zero or more digits’. Here are some examples of how to define number formats.

format match non-match
0.dddd "0.4321" "0.123","1.4563"
d.ddEdd "3.14E00" "31.14E00"
d.*Edd "0.314E01","3.1415297E00" "3.1415230"
d.dd* "1.23", "1.234",\(\ldots\) "1.2"

The last example shows how to check for a minimal number of digits behind the decimal point.

There are special arguments to check the number of decimal figures after the decimal separator.

## [1] FALSE  TRUE
## [1]  TRUE FALSE
## [1]  TRUE FALSE
## [1] TRUE TRUE
## [1] TRUE

The arguments min_dig, max_dig and dec are ignored when format is specified.

This function is convenient only for fairly simple number formats. Generic pattern matching in strings is discussed in the next section.

2.5 General field format

A simple way to check for more general format is to use globbing patterns. In such patterns, the asterisk wildcard character (*) is interpreted as ‘zero or more characters’ and the question mark (?) is interpreted as ‘any character’.

For example, to check that the id variable in SBS2000 starts with "RET", and that the size variable has consists of "sc" followed by precisely one character, we can do the following.

##   name items passes fails nNA error warning                expression
## 1   V1    60     60     0   0 FALSE   FALSE  field_format(id, "RET*")
## 2   V2    60     60     0   0 FALSE   FALSE field_format(size, "sc?")

Here, the globbing pattern "RET*" is understood as ’a string starting with "RET", followed by zero or more characters. The pattern "sc?" means ’a string starting with "sc", followed by a single character.

The most general way to check whether a field conforms to a pattern is to use a regular expression. The treatment of regular expressions is out of scope for this book, but we will give a few examples. A good introduction to regular expressions is given by

J. Friedl (2006) Mastering regular expressions. O’Reilley Media.

In validate one can use grepl or field_format, with the argument type="regex"

##   name items passes fails nNA error warning
## 1   V1    60     60     0   0 FALSE   FALSE
## 2   V2    60     60     0   0 FALSE   FALSE
##                                          expression
## 1                          grepl("^sc[0-9]$", size)
## 2 field_format(id, "^RET\\\\d{2}$", type = "regex")

Here, the expression "^sc[0-9]$" is a regular expression that should be read as: the string starts ("^") with "sc", is followed by a number between 0 and 9 ("[0-9]") and then ends ("$"). The regular expression "^RET\\{d}2" indicates that a string must start ("^") with "RET", followed by two digits ("\\d{2}"), after which the string must end ("$").

Globbing patterns are easier to develop and easier to understand than regular expressions, while regular expressions offer far more flexibility but are harder to read. Complex and long regular expressions may have subtle matching behaviour that is not immediately obvious to inexperienced users. It is therefore advisable to test regular expressions with a a small dataset representing realistic cases that contains both matches and non-matches. As a rule of thumb we would advise to use globbing patterns unless those offer insufficient flexibility.

2.6 Numeric ranges

Numerical variables may have natural limits from below and/or above. For one-sided ranges, you can use the standard comparison operators.

If a variable is bounded both from above and below one can use two rules, or use the convenience function in_range.

By default, in_range includes the boundaries of the range, so the above rule is equivalent to incl.prob >= 0 and incl.prob <= 1.

Here we set lin.ineq.eps=0 to keep validate from building in a margin for machine rounding errors.

##   name items passes fails nNA error warning
## 1   TO    60     56     0   4 FALSE   FALSE
## 2   TC    60     55     0   5 FALSE   FALSE
## 3   PR    60     60     0   0 FALSE   FALSE
##                              expression
## 1                         turnover >= 0
## 2                      total.costs >= 0
## 3 in_range(incl.prob, min = 0, max = 1)

For numeric ranges it is often a better idea to work with inclusive inequalities (\(\leq\), \(\geq\)) than with strict inequalities (\(<\), \(>\)). Take as an example the strict inequality demand income > 0. This means that any income larger than zero is acceptable, including numbers such as \(0.01\), \(0.000001\) and \(10^{-\textrm{Googol}}\). In practice there is almost always a natural minimal acceptable value that is usually dictated by the unit of measurement. For example, if we measure income in whole Euros, a better demand would be income >= 1.

2.7 Ranges for times and periods

For objects of class Date and objects of class POSIXct one can use comparison operators and in_range in the same way as for numerical data. The in_range function has a convenience feature for period data that is coded in character data, as in "2018Q1" for quarterly data.

We first generate some example data.

## [1] "2018Q1" "2018Q2" "2018Q3" "2018Q4"

The in_range function is capable of recognizing certain date or period formats.

## [1]  TRUE  TRUE FALSE FALSE

It is possible to specify your own date-time format using strftime notation. See ?in_range and ?strptime for specifications.

2.8 Code lists

A code list is a set of values that a variable is allowed to assume. For small code lists, one can use the %in% operator.

##   name items passes fails nNA error warning
## 1   V1    60     60     0   0 FALSE   FALSE
##                                 expression
## 1 size %vin% c("sc0", "sc1", "sc2", "sc3")

Notice that validate replaces %in% with %vin%. The reason is that %vin% has more consistent behavior in the case of missing data. In particular,

## [1]  TRUE FALSE FALSE
## [1]  TRUE FALSE    NA

For longer code lists it is convenient to refer to an externally provided list. There are two ways of doing this: reading the list in the right-hand-size of %in%, or passing a code list to confront as reference data.

Suppose we have a file called codelist.csv with a column code. We can define a rule as follows.

The disadvantage is that the rule now depends on a path that may or may not be available at runtime.

The second option is to assume that a variable, say valid_codes exists at runtime, and pass this with confront.

##   name items passes fails nNA error warning             expression
## 1   V1    60     60     0   0 FALSE   FALSE size %vin% valid_codes

This way, (very) large code lists can be used, but note that it does require a ‘contract’ between variable names used in the rule set and variables passed as reference data.