Chapter 2 Variable checks
Variable checks are checks that can be performed on a field-by-field basis. An
example is checking that a variable called Age is nonnegative, or of integer
type. Variable checks are among the simplest checks.
Data
In this section we will use the SBS2000 dataset, which is included with validate.
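The listing that follows is shown without the code that produced it. A minimal sketch, assuming the output came from loading the dataset and printing its first three rows:

```r
library(validate)

# Load the example dataset shipped with the validate package
data(SBS2000)

# Show the first three records
head(SBS2000, 3)
```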
## id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01 sc0 0.02 75 NA NA 1130 NA
## 2 RET02 sc3 0.14 9 1607 NA 1607 131
## 3 RET03 sc3 0.14 NA 6886 -33 6919 324
## total.costs profit vat
## 1 18915 20045 NA
## 2 1544 63 NA
## 3 6493 426 NA
See ?SBS2000 for a description.
2.1 Variable type
In R
, one can test the type of a variable using built-in functions such as
is.numeric
or is.character
.
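The two results below are printed without the calls that produced them. The following calls are consistent with that output; the exact arguments are an assumption.

```r
# A character string is of character type
is.character("hihi")

# A number is not of character type
is.character(3)
```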
## [1] TRUE
## [1] FALSE
In validate, any function starting with is. (‘is’ followed by a dot) is
considered a validation function.
rules <- validator(
is.character(size)
, is.numeric(turnover)
)
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 1 0 1 0 FALSE FALSE is.character(size)
## 2 V2 1 1 0 0 FALSE FALSE is.numeric(turnover)
We see that each rule checks a single item, namely one column of data. The
first rule is violated because size is in fact a factor variable. The second
rule is satisfied.
2.2 Missingness
Use R’s standard is.na() to check missing items in individual variables. Negate
it to check that values are available.
rule <- validator(
!is.na(turnover)
, !is.na(other.rev)
, !is.na(profit)
)
out <- confront(SBS2000, rule)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 60 56 4 0 FALSE FALSE !is.na(turnover)
## 2 V2 60 24 36 0 FALSE FALSE !is.na(other.rev)
## 3 V3 60 55 5 0 FALSE FALSE !is.na(profit)
We see that the variable turnover is missing in 4 cases, while other.rev and
profit are missing in 36 and 5 cases, respectively.
To demand that all items must be present or absent for a certain variable,
use R’s quantifiers: any() or all(), possibly negated.
rules <- validator(
!any(is.na(incl.prob))
, all(is.na(vat)) )
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 1 1 0 0 FALSE FALSE !any(is.na(incl.prob))
## 2 V2 1 0 1 0 FALSE FALSE all(is.na(vat))
2.3 Field length
The number of characters in text fields can be tested using either R’s standard
nchar() function, or with the convenience function field_length.
rules <- validator(
nchar(as.character(size)) >= 2
, field_length(id, n=5)
, field_length(size, min=2, max=3)
)
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning
## 1 V1 60 60 0 0 FALSE FALSE
## 2 V2 60 60 0 0 FALSE FALSE
## 3 V3 60 60 0 0 FALSE FALSE
## expression
## 1 nchar(as.character(size)) >= 2
## 2 field_length(id, n = 5)
## 3 field_length(size, min = 2, max = 3)
One advantage of field_length is that its argument is converted to character
(recall that size is a factor variable). The function field_length can be used
either to test for exact field lengths or to check whether the number of
characters is within a certain range.
The field length is measured as the number of code points. Use type="width" to
measure the printed width (number of columns) or type="bytes" to count the
number of bytes.
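The difference between the three measures only shows up for non-ASCII text. A small sketch, assuming a UTF-8 encoded string (the example value is hypothetical) and assuming that the type argument is handed on to nchar():

```r
library(validate)

# 'café': four code points, but 'é' takes two bytes in UTF-8
x <- "caf\u00e9"

field_length(x, n = 4)                  # counts code points (default)
field_length(x, n = 4, type = "width")  # counts printed columns
field_length(x, n = 5, type = "bytes")  # counts bytes
```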
2.4 Format of numeric fields
For numbers that are stored in character type, there is a convenience function
called number_format() that accepts a variable name and a format specification.
To check that the numbers are formatted with one figure before, and two figures after the decimal point, we perform the following check.
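The code for this check is not shown with the output below. A minimal sketch that is consistent with that output, assuming a small example data frame (the values in dat are an assumption):

```r
library(validate)

# Hypothetical example data: numbers stored as character strings
dat <- data.frame(x = c("2.54", "2.66", "8.142", "23.53"))

# One digit before and exactly two digits after the decimal point
rule <- validator(number_format(x, format = "d.dd"))
out <- confront(dat, rule)

# Inspect the result for each record
values(out)
```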
## V1
## [1,] TRUE
## [2,] TRUE
## [3,] FALSE
## [4,] FALSE
Here, the specification format="d.dd" describes the allowed numeric formats.
In this specification the "d" stands for a digit, any other character except
the asterisk (*) stands for itself. The asterisk is interpreted as ‘zero or
more digits’. Here are some examples of how to define number formats.
format | match | non-match |
---|---|---|
0.dddd | "0.4321" | "0.123", "1.4563" |
d.ddEdd | "3.14E00" | "31.14E00" |
d.*Edd | "0.314E01", "3.1415297E00" | "3.1415230" |
d.dd* | "1.23", "1.234", \(\ldots\) | "1.2" |
The last example shows how to check for a minimal number of digits behind the decimal point.
There are special arguments to check the number of decimal figures after the decimal separator.
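The calls behind the results below are not shown. The following sketch uses the min_dig, max_dig and dec arguments in a way that is consistent with those results; the example strings are an assumption.

```r
library(validate)

# Hypothetical example values with 3 and 5 decimal digits
x <- c("12.123", "123.12345")

number_format(x, min_dig = 4)                # at least 4 decimal digits
number_format(x, max_dig = 3)                # at most 3 decimal digits
number_format(x, min_dig = 2, max_dig = 4)   # between 2 and 4 decimal digits
number_format(x, min_dig = 2, max_dig = 10)  # between 2 and 10 decimal digits

# Specify the decimal separator
number_format("12,123", min_dig = 2, dec = ",")
```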
## [1] FALSE TRUE
## [1] TRUE FALSE
## [1] TRUE FALSE
## [1] TRUE TRUE
## [1] TRUE
The arguments min_dig, max_dig and dec are ignored when format is specified.
This function is convenient only for fairly simple number formats. Generic pattern matching in strings is discussed in the next section.
2.5 General field format
A simple way to check for a more general format is to use globbing patterns.
In such patterns, the asterisk wildcard character (*) is interpreted as ‘zero
or more characters’ and the question mark (?) is interpreted as ‘any single
character’. For example, to check that the id variable in SBS2000 starts with
"RET", and that the size variable consists of "sc" followed by precisely one
character, we can do the following.
rule <- validator(field_format(id, "RET*")
, field_format(size, "sc?" ))
out <- confront(SBS2000, rule)
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 60 60 0 0 FALSE FALSE field_format(id, "RET*")
## 2 V2 60 60 0 0 FALSE FALSE field_format(size, "sc?")
Here, the globbing pattern "RET*" is understood as ‘a string starting with
"RET", followed by zero or more characters’. The pattern "sc?" means ‘a
string starting with "sc", followed by a single character’.
The most general way to check whether a field conforms to a pattern is to use a regular expression. The treatment of regular expressions is out of scope for this book, but we will give a few examples. A good introduction to regular expressions is given by
J. Friedl (2006) Mastering regular expressions. O’Reilly Media.
In validate one can use grepl or field_format, with the argument type="regex".
rule <- validator(
grepl("^sc[0-9]$", size)
, field_format(id, "^RET\\d{2}$" , type="regex") )
summary(confront(SBS2000, rule))
## name items passes fails nNA error warning
## 1 V1 60 60 0 0 FALSE FALSE
## 2 V2 60 60 0 0 FALSE FALSE
## expression
## 1 grepl("^sc[0-9]$", size)
## 2 field_format(id, "^RET\\\\d{2}$", type = "regex")
Here, the expression "^sc[0-9]$" is a regular expression that should be read
as: the string starts ("^") with "sc", is followed by a single digit between 0
and 9 ("[0-9]") and then ends ("$"). The regular expression "^RET\\d{2}$"
indicates that a string must start ("^") with "RET", followed by two
digits ("\\d{2}"), after which the string must end ("$").
Globbing patterns are easier to develop and easier to understand than regular expressions, while regular expressions offer far more flexibility but are harder to read. Complex and long regular expressions may have subtle matching behaviour that is not immediately obvious to inexperienced users. It is therefore advisable to test regular expressions with a small dataset of realistic cases that contains both matches and non-matches. As a rule of thumb, we advise using globbing patterns unless they offer insufficient flexibility.
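Such a test can be as simple as a vector of cases with the expected outcome next to it. A sketch using base R only; the test strings are hypothetical and chosen to include near misses:

```r
# Hypothetical test cases: some should match, some should not
cases    <- c("RET01", "RET99", "RET1", "RET001", "ret01", "XRET01")
expected <- c(TRUE,    TRUE,    FALSE,  FALSE,    FALSE,   FALSE)

# The pattern under test: "RET" followed by exactly two digits
result <- grepl("^RET\\d{2}$", cases)

# Compare the outcome with the expectation, case by case
data.frame(cases, expected, result)
stopifnot(identical(result, expected))
```

Near misses such as a missing digit, an extra digit or a different case are the ones most likely to reveal a pattern that is too permissive.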
2.6 Numeric ranges
Numerical variables may have natural limits from below and/or above. For one-sided ranges, you can use the standard comparison operators.
If a variable is bounded both from above and below one can use two rules,
or use the convenience function in_range.
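The rule definitions for this section are not shown. A sketch of a rule set that matches the rule names and expressions in the summary further down (TO, TC and PR), assuming it was defined in one call:

```r
library(validate)

rules <- validator(
    TO = turnover >= 0                       # one-sided range
  , TC = total.costs >= 0                    # one-sided range
  , PR = in_range(incl.prob, min=0, max=1)   # two-sided range
)
rules
```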
By default, in_range includes the boundaries of the range, so the above rule
is equivalent to incl.prob >= 0 and incl.prob <= 1.
Here we set lin.ineq.eps=0 to keep validate from building in a margin for
machine rounding errors.
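A sketch of the confrontation that would produce the summary below, assuming that lin.ineq.eps is passed as an option to confront and that the rules are named TO, TC and PR as in that summary:

```r
library(validate)
data(SBS2000)

rules <- validator(
    TO = turnover >= 0
  , TC = total.costs >= 0
  , PR = in_range(incl.prob, min=0, max=1)
)

# lin.ineq.eps=0: no tolerance for rounding errors in the inequalities
out <- confront(SBS2000, rules, lin.ineq.eps=0)
summary(out)
```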
## name items passes fails nNA error warning
## 1 TO 60 56 0 4 FALSE FALSE
## 2 TC 60 55 0 5 FALSE FALSE
## 3 PR 60 60 0 0 FALSE FALSE
## expression
## 1 turnover >= 0
## 2 total.costs >= 0
## 3 in_range(incl.prob, min = 0, max = 1)
For numeric ranges it is often a better idea to work with inclusive
inequalities (\(\leq\), \(\geq\)) than with strict inequalities (\(<\), \(>\)). Take
as an example the strict inequality demand income > 0. This means that any
income larger than zero is acceptable, including numbers such as \(0.01\),
\(0.000001\) and \(10^{-\textrm{Googol}}\). In practice there is almost always a
natural minimal acceptable value that is usually dictated by the unit of
measurement. For example, if we measure income in whole Euros, a better demand
would be income >= 1.
2.7 Ranges for times and periods
For objects of class Date and objects of class POSIXct one can use comparison
operators and in_range in the same way as for numerical data. The in_range
function has a convenience feature for period data that is coded in character
data, as in "2018Q1" for quarterly data.
We first generate some example data.
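The generating code is not shown; one way to produce the four quarters printed below is the following sketch (the variable name period is an assumption):

```r
# Four quarters of 2018, coded as character strings
period <- paste0("2018Q", 1:4)
period
```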
## [1] "2018Q1" "2018Q2" "2018Q3" "2018Q4"
The in_range function is capable of recognizing certain date or period formats.
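A sketch of such a check, assuming quarterly codes as example data; the range limits are an assumption, chosen to be consistent with the result below:

```r
library(validate)

# Quarterly periods coded as character strings
period <- paste0("2018Q", 1:4)

# TRUE for periods from 2017Q2 up to and including 2018Q2
in_range(period, min="2017Q2", max="2018Q2")
```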
## [1] TRUE TRUE FALSE FALSE
It is possible to specify your own date-time format using strftime notation.
See ?in_range and ?strptime for specifications.
2.8 Code lists
A code list is a set of values that a variable is allowed to assume. For small
code lists, one can use the %in% operator.
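The rule behind the summary below is not shown. A sketch that is consistent with the expression printed there, assuming the rule was written with %in%:

```r
library(validate)
data(SBS2000)

# size must be one of the four size classes
rule <- validator(size %in% c("sc0","sc1","sc2","sc3"))
out <- confront(SBS2000, rule)
summary(out)
```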
## name items passes fails nNA error warning
## 1 V1 60 60 0 0 FALSE FALSE
## expression
## 1 size %vin% c("sc0", "sc1", "sc2", "sc3")
Notice that validate replaces %in% with %vin%. The reason is that %vin% has
more consistent behavior in the case of missing data. In particular,
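the two operators differ in how they treat NA on the left-hand side. A sketch with hypothetical example values, consistent with the two results below (first %in%, then %vin%):

```r
library(validate)

# Base R: a missing value is reported as 'not in the set'
c(1, 3, NA) %in% c(1, 2)

# validate: a missing value yields a missing result
c(1, 3, NA) %vin% c(1, 2)
```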
## [1] TRUE FALSE FALSE
## [1] TRUE FALSE NA
For longer code lists it is convenient to refer to an externally provided list.
There are two ways of doing this: reading the list in the right-hand side of
%in%, or passing a code list to confront as reference data.
Suppose we have a file called codelist.csv with a column code. We can define
a rule as follows.
rule <- validator(
x %in% read.csv("codelist.csv")$code
)
## Or, equivalently
rule <- validator(
valid_codes := read.csv("codelist.csv")$code
, x %in% valid_codes
)
The disadvantage is that the rule now depends on a path that may or may not be available at runtime.
The second option is to assume that a variable, say valid_codes, exists at
runtime, and pass this with confront.
codelist <- c("sc0","sc1","sc2","sc3")
rule <- validator(size %in% valid_codes)
# pass the codelist
out <- confront(SBS2000, rule
, ref=list(valid_codes=codelist))
summary(out)
## name items passes fails nNA error warning expression
## 1 V1 60 60 0 0 FALSE FALSE size %vin% valid_codes
This way, (very) large code lists can be used, but note that it does require a ‘contract’ between variable names used in the rule set and variables passed as reference data.