Chapter 4 Multivariate checks

This chapter covers checks that involve relationships between two or more variables.

Data

In this chapter we use the SBS2000 dataset that comes with validate. Its first three records look as follows.
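A listing like the one shown here could be produced along the following lines. This is a minimal sketch that assumes the validate package is installed; printing the first three records with head() is an assumption based on the output.

```r
library(validate)

# Load the example dataset that ships with the package
data(SBS2000)

# Show the first three records
head(SBS2000, 3)
```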

##      id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01  sc0      0.02    75       NA        NA      1130          NA
## 2 RET02  sc3      0.14     9     1607        NA      1607         131
## 3 RET03  sc3      0.14    NA     6886       -33      6919         324
##   total.costs profit vat
## 1       18915  20045  NA
## 2        1544     63  NA
## 3        6493    426  NA

4.1 Completeness of records

The functions is_complete() and all_complete() are convenience functions that test for missing values or combinations thereof in records.
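The four rules discussed in this section might be defined and evaluated as sketched below. The object names rules and out are illustrative, and selecting columns 1 to 7 of the summary (to leave out the expression column) is an assumption based on the output shown.

```r
library(validate)
data(SBS2000)

rules <- validator(
  is_complete(id),                    # id present in each record?
  is_complete(id, turnover),          # id and turnover both present?
  is_complete(id, turnover, profit),  # id, turnover and profit all present?
  all_complete(id)                    # single result for the whole id column
)

out <- confront(SBS2000, rules)

# Leave out the last (expression) column for brevity
summary(out)[1:7]
```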

##   name items passes fails nNA error warning
## 1   V1    60     60     0   0 FALSE   FALSE
## 2   V2    60     56     4   0 FALSE   FALSE
## 3   V3    60     52     8   0 FALSE   FALSE
## 4   V4     1      1     0   0 FALSE   FALSE

Here, the first rule checks for missing data in the id variable, the second rule checks whether subrecords with id and turnover are complete, and the third rule checks whether subrecords with id, turnover and profit are complete. Each of these three rules yields one logical value (TRUE or FALSE) per record.

The fourth rule tests whether all values are present in the id column, and it results in a single TRUE or FALSE.

  • To test for missing values in individual variables, see also 2.2.
  • To check whether records are available at all, see 3.3.

4.2 Balance equalities and inequalities

Balance restrictions occur for example in economic microdata, where financial balances must be met.
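The three rules discussed in this section could be written as follows. This is a sketch reconstructed from the expressions in the summary output; the exact way the rules were typed (for example, which terms sit on which side of ==) is an assumption.

```r
library(validate)
data(SBS2000)

rules <- validator(
  total.rev - profit == total.costs,   # balance of income, costs and profit
  turnover + other.rev == total.rev,   # sub-balance of revenue components
  profit <= 0.6 * total.rev            # plausibility check on profit
)

out <- confront(SBS2000, rules)
summary(out)
```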

##   name items passes fails nNA error warning
## 1   V1    60     39    14   7 FALSE   FALSE
## 2   V2    60     19     4  37 FALSE   FALSE
## 3   V3    60     49     6   5 FALSE   FALSE
##                                       expression
## 1 abs(total.rev - profit - total.costs) <= 1e-08
## 2 abs(turnover + other.rev - total.rev) <= 1e-08
## 3              profit - 0.6 * total.rev <= 1e-08

Here, the first rule checks a balance between income, costs, and profit; the second rule checks a sub-balance, and the third rule is a plausibility check where we do not expect profit to exceed 60 per cent of the total revenue.

Observe that the expressions have been altered by validate to account for possible machine rounding differences. Rather than testing whether variable \(x\) equals variable \(y\), validate will check \(|x-y|\leq \epsilon\), where the default value of \(\epsilon\) is \(10^{-8}\). Linear inequalities are treated similarly: \(x \leq y\) is checked as \(x - y \leq \epsilon\). The tolerance can be controlled with the options lin.eq.eps for linear equalities and lin.ineq.eps for linear inequalities. The output below shows the same rules evaluated with a tolerance of 0.01 for equalities and 0 for inequalities.
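To illustrate why a tolerance is useful, and how the two options might be passed, consider the sketch below. The first part uses base R only. The second part assumes that the options are given as extra arguments to confront(); the values 0.01 and 0 are read off the expressions in the summary output.

```r
library(validate)
data(SBS2000)

# Exact comparison of floating point numbers can fail unexpectedly
x <- 0.1 + 0.2
x == 0.3               # FALSE because of machine rounding
abs(x - 0.3) <= 1e-8   # TRUE when a small tolerance is allowed

rules <- validator(
  total.rev - profit == total.costs,
  turnover + other.rev == total.rev,
  profit <= 0.6 * total.rev
)

# Tolerance 0.01 for linear equalities, none for linear inequalities
out <- confront(SBS2000, rules, lin.eq.eps = 0.01, lin.ineq.eps = 0)
summary(out)
```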

##   name items passes fails nNA error warning
## 1   V1    60     39    14   7 FALSE   FALSE
## 2   V2    60     19     4  37 FALSE   FALSE
## 3   V3    60     49     6   5 FALSE   FALSE
##                                      expression
## 1 abs(total.rev - profit - total.costs) <= 0.01
## 2 abs(turnover + other.rev - total.rev) <= 0.01
## 3                     profit <= 0.6 * total.rev

See 7.6 for more information on setting and resetting options.

4.3 Conditional restrictions

Conditional restrictions demand that certain values occur together. In the following example we check that a business with staff also has staff costs.
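Such a rule could be expressed with an if statement inside validator(), as in the sketch below. The thresholds follow the expression in the summary output; the object names are illustrative.

```r
library(validate)
data(SBS2000)

# If a business reports staff, it must also report staff costs
rule <- validator(if (staff >= 1) staff.costs >= 1)

out <- confront(SBS2000, rule)
summary(out)
```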

##   name items passes fails nNA error warning
## 1   V1    60     50     0  10 FALSE   FALSE
##                                         expression
## 1 staff - 1 < -1e-08 | (staff.costs - 1 >= -1e-08)

Here, every record that reports a positive number of staff (staff >= 1) must also report positive staff costs (staff.costs >= 1).

Internally, validate translates a rule of the form if (P) Q into an expression of the form !P | Q, because the latter can be evaluated faster (vectorised).

The results are to be interpreted as follows. For each record, validate will check that cases where staff >= 1 are accompanied by staff.costs >= 1. A result of FALSE means that either the staff number is too high, or the staff costs are too low. To be precise, the results of a conditional restriction match those of a logical implication, as shown in the truth table below.

\[ \begin{array}{ll|c} P & Q & P\Rightarrow Q\\ \hline T & T & T\\ T & F & F\\ F & T & T\\ F & F & T\\ \end{array} \]
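The equivalence between if (P) Q and !P | Q can be checked in base R by enumerating all four combinations. This is a small illustration, not part of validate; the names P, Q and tt are arbitrary.

```r
# All combinations of the condition P and the consequence Q
P <- c(TRUE, TRUE, FALSE, FALSE)
Q <- c(TRUE, FALSE, TRUE, FALSE)

# Vectorised form of the implication P => Q
tt <- data.frame(P = P, Q = Q, implies = !P | Q)
tt
```

The implication is FALSE only in the second row, where the condition holds but the consequence does not.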

4.4 Forbidden value combinations

In some cases it is more convenient to have a list of forbidden (key) value combinations than specifying such combinations individually. The function does_not_contain() supports such situations.

As an example, let’s first create some transaction data.
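A small transactions data frame might look like the sketch below. Only the record with sender S21, receiver FG0 and value 89 appears in the output of this section; the other three records are illustrative assumptions, chosen so that there are four records in total.

```r
# Four transactions; all rows except the first are made up for illustration
transactions <- data.frame(
  sender   = c("S21", "X34", "S45", "Z22"),
  receiver = c("FG0", "FG2", "DF1", "KK2"),
  value    = c(89, 75, 92, 81),
  stringsAsFactors = FALSE
)
transactions
```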

We assume that it is not possible for senders with codes starting with "S" to send something to receivers with codes starting with "FG". A convenient way to encode such demands is to use globbing patterns. We create a data frame that lists forbidden combinations (here: one combination of two key patterns).
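The forbidden combination could be stored as a one-row data frame of globbing patterns, as sketched here. The object name forbidden is an assumption.

```r
# One forbidden combination: sender codes "S*" with receiver codes "FG*"
forbidden <- data.frame(
  sender   = "S*",
  receiver = "FG*",
  stringsAsFactors = FALSE
)
forbidden
```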

Note that the column names of this data frame correspond to the columns in the transactions data frame. We are now ready to check our transactions data frame.
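The check itself could be set up as in the following sketch. Both data frames are re-created here so that the block runs on its own; apart from the S21/FG0/89 record, the transaction values are illustrative. Passing the forbidden combinations through the ref argument of confront() under the name forbidden_keys is an assumption that follows the name used in the rule.

```r
library(validate)

# Illustrative data (only the first record is taken from the output shown)
transactions <- data.frame(
  sender   = c("S21", "X34", "S45", "Z22"),
  receiver = c("FG0", "FG2", "DF1", "KK2"),
  value    = c(89, 75, 92, 81),
  stringsAsFactors = FALSE
)
forbidden <- data.frame(sender = "S*", receiver = "FG*",
                        stringsAsFactors = FALSE)

# Interpret the forbidden keys as globbing patterns
rule <- validator(does_not_contain(glob(forbidden_keys)))

# Supply the forbidden combinations as reference data
out <- confront(transactions, rule, ref = list(forbidden_keys = forbidden))

# Leave out the expression column for brevity
summary(out)[1:7]
```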

##   name items passes fails nNA error warning
## 1   V1     4      3     1   0 FALSE   FALSE

Observe that we use glob(forbidden_keys) to tell does_not_contain() that the key combinations in forbidden_keys must be interpreted as globbing patterns.

The records containing forbidden keys can be selected as follows.
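One way to select these records is the violating() function from validate, as sketched below. The setup is repeated in compact form so that the block runs on its own; as before, all transactions except S21/FG0/89 are illustrative.

```r
library(validate)

transactions <- data.frame(
  sender   = c("S21", "X34", "S45", "Z22"),
  receiver = c("FG0", "FG2", "DF1", "KK2"),
  value    = c(89, 75, 92, 81),
  stringsAsFactors = FALSE
)
forbidden <- data.frame(sender = "S*", receiver = "FG*",
                        stringsAsFactors = FALSE)

rule <- validator(does_not_contain(glob(forbidden_keys)))
out  <- confront(transactions, rule, ref = list(forbidden_keys = forbidden))

# Records that fail at least one rule
bad <- violating(transactions, out)
bad
```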

##   sender receiver value
## 1    S21      FG0    89

It is also possible to use regular expression patterns, by labeling the forbidden key set with rx(). If no labeling is used, the key sets are interpreted as string literals.