Chapter 4 Multivariate checks
In this Chapter we treat tests that involve relationships between variables.
Data
In this chapter we will use the SBS2000 dataset that comes with validate.
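The first three records are listed below. A minimal sketch of how to load and print them, assuming the validate package is installed:

```r
library(validate)

# SBS2000 ships with the validate package
data(SBS2000)

# Show the first three records
head(SBS2000, 3)
```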
## id size incl.prob staff turnover other.rev total.rev staff.costs
## 1 RET01 sc0 0.02 75 NA NA 1130 NA
## 2 RET02 sc3 0.14 9 1607 NA 1607 131
## 3 RET03 sc3 0.14 NA 6886 -33 6919 324
## total.costs profit vat
## 1 18915 20045 NA
## 2 1544 63 NA
## 3 6493 426 NA
4.1 Completeness of records
The functions is_complete() and all_complete() are convenience functions that test for missing values or combinations thereof in records.
rules <- validator(
is_complete(id)
, is_complete(id, turnover)
, is_complete(id, turnover, profit )
, all_complete(id)
)
out <- confront(SBS2000, rules)
# suppress last column for brevity
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 60 60 0 0 FALSE FALSE
## 2 V2 60 56 4 0 FALSE FALSE
## 3 V3 60 52 8 0 FALSE FALSE
## 4 V4 1 1 0 0 FALSE FALSE
Here, the first rule checks for missing data in the id variable, the second rule checks whether subrecords with id and turnover are complete, and the third rule checks whether subrecords with id, turnover and profit are complete. The output is one logical value (TRUE or FALSE) for each record.
The fourth rule tests whether all values are present in the id column, and it results in a single TRUE or FALSE.
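The difference between the record-wise and the column-wise check can be mimicked in base R. The sketch below uses a small made-up data frame (the object names d, rec_ok and all_ok are chosen here for illustration). It shows what the rules test, not how validate implements them.

```r
# Toy data, hypothetical and not part of SBS2000
d <- data.frame(
    id       = c("A", "B", "C")
  , turnover = c(10, NA, 30)
  , profit   = c(NA, 5, 3)
)

# Record-wise: is the (id, turnover) subrecord complete? One value per record.
rec_ok <- complete.cases(d[c("id", "turnover")])
rec_ok

# Column-wise: are all id values present? A single value.
all_ok <- !anyNA(d$id)
all_ok
```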
4.2 Balance equalities and inequalities
Balance restrictions occur for example in economic microdata, where financial balances must be met.
rules <- validator(
total.rev - profit == total.costs
, turnover + other.rev == total.rev
, profit <= 0.6*total.rev
)
out <- confront(SBS2000, rules)
summary(out)
## name items passes fails nNA error warning
## 1 V1 60 39 14 7 FALSE FALSE
## 2 V2 60 19 4 37 FALSE FALSE
## 3 V3 60 49 6 5 FALSE FALSE
## expression
## 1 abs(total.rev - profit - total.costs) <= 1e-08
## 2 abs(turnover + other.rev - total.rev) <= 1e-08
## 3 profit - 0.6 * total.rev <= 1e-08
Here, the first rule checks a balance between income, costs, and profit; the second rule checks a sub-balance, and the third rule is a plausibility check where we do not expect profit to exceed 60 per cent of the total revenue.
Observe that the expressions have been altered by validate to account for possible machine rounding differences. Rather than testing whether variable \(x\) equals variable \(y\), validate will check \(|x-y|\leq \epsilon\), where the default value of \(\epsilon\) is \(10^{-8}\). The value of this tolerance can be controlled for linear equalities and inequalities using the options lin.eq.eps and lin.ineq.eps, respectively.
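The summary that follows shows the effect of changing these tolerances. The call that produced it is not shown, so the sketch below is a reconstruction: it assumes the options are passed directly to confront(), and the values 0.01 and 0 are inferred from the printed expressions.

```r
library(validate)
data(SBS2000)

rules <- validator(
    total.rev - profit == total.costs
  , turnover + other.rev == total.rev
  , profit <= 0.6*total.rev
)

# Assumed settings: tolerance 0.01 for equalities, none for inequalities
out <- confront(SBS2000, rules, lin.eq.eps = 0.01, lin.ineq.eps = 0)
summary(out)
```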
## name items passes fails nNA error warning
## 1 V1 60 39 14 7 FALSE FALSE
## 2 V2 60 19 4 37 FALSE FALSE
## 3 V3 60 49 6 5 FALSE FALSE
## expression
## 1 abs(total.rev - profit - total.costs) <= 0.01
## 2 abs(turnover + other.rev - total.rev) <= 0.01
## 3 profit <= 0.6 * total.rev
See 7.6 for more information on setting and resetting options.
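To see why a tolerance is useful at all, consider a plain base R comparison of two numbers that are equal on paper but not in floating point arithmetic. This is a generic illustration that does not use validate:

```r
x <- 0.1 + 0.2
y <- 0.3

# Exact comparison fails because of machine rounding
x == y

# Comparison within the default tolerance succeeds
abs(x - y) <= 1e-8
```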
4.3 Conditional restrictions
Conditional restrictions are all about demanding certain value combinations. In the following example we check that a business with staff also has staff costs.
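The rule behind the summary below is not shown in the text. Judging from the printed expression and the explanation that follows, it was presumably of this form:

```r
library(validate)
data(SBS2000)

# If a business has staff, it should also report staff costs
rule <- validator(if (staff >= 1) staff.costs >= 1)
out  <- confront(SBS2000, rule)
summary(out)
```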
## name items passes fails nNA error warning
## 1 V1 60 50 0 10 FALSE FALSE
## expression
## 1 staff - 1 < -1e-08 | (staff.costs - 1 >= -1e-08)
Here, records with a positive number of staff must also have positive staff costs.
Validate translates the rule if ( P ) Q to an expression of the form !P | Q. The reason for this is that the latter can be evaluated faster (vectorised).
The results are to be interpreted as follows. For each record, validate will check that cases where staff >= 1 are accompanied by staff.costs >= 1. In cases where this test results in FALSE, this means that either the staff number is too high, or the staff costs are too low. To be precise, the results of a conditional restriction match those of an implication in first-order logic as shown in the truth table below.
\[ \begin{array}{ll|c} P & Q & P\Rightarrow Q\\ \hline T & T & T\\ T & F & F\\ F & T & T\\ F & F & T\\ \end{array} \]
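The truth values of an implication can be generated in R with the rewritten form !P | Q. This is a base R sketch; the names tt and implies are chosen here for illustration. The last two lines show how missing values behave under R's three-valued logic, which is one way a conditional rule can yield NA.

```r
# All combinations of P and Q
tt <- expand.grid(P = c(TRUE, FALSE), Q = c(TRUE, FALSE))

# An implication P => Q is equivalent to !P | Q
tt$implies <- !tt$P | tt$Q
tt

# With a missing condition the result is only known when Q holds
!NA | TRUE    # TRUE
!NA | FALSE   # NA
```

The implication is FALSE only in the row where P is TRUE and Q is FALSE.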
4.4 Forbidden value combinations
In some cases it is more convenient to have a list of forbidden (key) value combinations than specifying such combinations individually. The function does_not_contain() supports such situations.
As an example, let’s first create some transaction data.
transactions <- data.frame(
sender = c("S21", "X34", "S45","Z22")
, receiver = c("FG0", "FG2", "DF1","KK2")
, value = sample(70:100,4)
)
We assume that it is not possible for senders with codes starting with "S" to send something to receivers with codes starting with "FG". A convenient way to encode such demands is to use globbing patterns.
We create a data frame that lists forbidden combinations (here: one combination
of two key patterns).
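The definition of this data frame is not shown in the text. A minimal sketch, assuming the object name forbidden that the later confront() call passes as reference data, and assuming the patterns "S*" and "FG*" described above:

```r
# One forbidden combination: any sender S... with any receiver FG...
forbidden <- data.frame(sender = "S*", receiver = "FG*")
forbidden
```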
Note that the column names of this data frame correspond to the columns in the transactions data frame. We are now ready to check our transactions data frame.
rule <- validator(does_not_contain(glob(forbidden_keys)))
out <- confront(transactions, rule, ref=list(forbidden_keys=forbidden))
## Suppress columns for brevity
summary(out)[1:7]
## name items passes fails nNA error warning
## 1 V1 4 3 1 0 FALSE FALSE
Observe that we use glob(forbidden_keys) to tell does_not_contain() that the key combinations in forbidden_keys must be interpreted as globbing patterns.
The records containing forbidden keys can be selected as follows.
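The selection step is not shown in the text. The sketch below uses violating() from validate and rebuilds the example so that it runs on its own. The fixed numbers in the value column are an assumption made for reproducibility, replacing the random sample() call used earlier, and the forbidden data frame is assumed to hold the two glob patterns.

```r
library(validate)

transactions <- data.frame(
    sender   = c("S21", "X34", "S45", "Z22")
  , receiver = c("FG0", "FG2", "DF1", "KK2")
  , value    = c(89, 74, 96, 81)   # illustrative fixed values
)
forbidden <- data.frame(sender = "S*", receiver = "FG*")

rule <- validator(does_not_contain(glob(forbidden_keys)))
out  <- confront(transactions, rule, ref = list(forbidden_keys = forbidden))

# Select the records that fail the rule
violating(transactions, out)
```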
## sender receiver value
## 1 S21 FG0 89
It is also possible to use regular expression patterns, by labeling the forbidden key set with rx(). If no labeling is used, the key sets are interpreted as string literals.
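As a sketch of the regular expression variant, the same forbidden combination could be written with anchored patterns. The object name forbidden_rx, the patterns "^S" and "^FG", and the fixed numbers in the value column are assumptions made for this illustration.

```r
library(validate)

transactions <- data.frame(
    sender   = c("S21", "X34", "S45", "Z22")
  , receiver = c("FG0", "FG2", "DF1", "KK2")
  , value    = c(89, 74, 96, 81)   # illustrative fixed values
)

# Regular expressions instead of globbing patterns
forbidden_rx <- data.frame(sender = "^S", receiver = "^FG")

rule <- validator(does_not_contain(rx(forbidden_keys)))
out  <- confront(transactions, rule, ref = list(forbidden_keys = forbidden_rx))
summary(out)[1:7]
```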