Chapter 6 Indicators
Until now we have discussed various types of data validation rules: decisions that assign True or False values to a data frame. In some cases it is convenient to have a continuous value that can then be used in further assessing the data.
A practical example is the so-called selective editing approach to data cleaning. Here, each record in a data set is assigned a number that expresses the risk that the record leads to a faulty conclusion. Records are then ordered from high risk (records that have both suspicious values and a large influence on the final result) to low risk (records with unsuspicious values and little influence on the final result). Records with the highest risk are then scrutinized by domain experts.
In validate, an indicator is a rule that returns a numerical value. Just as validator objects are lists of validation rules, indicator objects are lists of indicator rules. Indices can be computed by confronting data with an indicator, and using add_indices, the computed indices can be added to the dataset. You can import, export, select, and combine indicator objects in the same way as validator objects.
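As a minimal sketch of selecting and combining, assuming that indicator objects support the same + and [ operations as validator objects (the object names ii, jj, and kk are illustrative only):

```r
library(validate)

# Two small indicator objects; the rule names are illustrative
ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height))
jj <- indicator(mw = mean(weight))

# Combine: the new object holds the rules of both
kk <- ii + jj
length(kk)

# Select: keep only the first rule, by position
first <- kk[1]
length(first)
```

Selecting by position keeps the example independent of how rule names are handled after combining.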
6.1 A first example
Here is a simple example of the workflow.
library(validate)
ii <- indicator(
BMI = (weight/2.2046)/(height*0.0254)^2
, mh = mean(height)
, mw = mean(weight))
out <- confront(women, ii)
In the first statement we define an indicator object storing indicator expressions. Next, we confront a dataset with these indicators. The result is an object of class indication. It prints as follows.
## Object of class 'indication'
## Call:
## confront(dat = women, x = ii)
##
## Rules confronted: 3
## With missings: 0
## Threw warning: 0
## Threw errors : 0
To study the results, the object can be summarized.
##   name items      min      mean       max nNA error warning
## 1  BMI    15  22.0967  22.72691  24.03503   0 FALSE   FALSE
## 2   mh     1  65.0000  65.00000  65.00000   0 FALSE   FALSE
## 3   mw     1 136.7333 136.73333 136.73333   0 FALSE   FALSE
##                            expression
## 1 (weight/2.2046)/(height * 0.0254)^2
## 2                        mean(height)
## 3                        mean(weight)
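A summary like the one above can presumably be produced by calling summary on the indication object. The sketch below repeats the definitions from the first example so that it runs on its own; the name s is illustrative.

```r
library(validate)

# Same indicators as in the first example
ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height)
  , mw  = mean(weight))
out <- confront(women, ii)

# One row per indicator rule, with item counts, range, and error flags
s <- summary(out)
s
```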
Observe that the first indicator results in one value per record while the second and third indicators (mh, mw) each return a single value. The single values are repeated when indicator values are added to the data.
##   height weight      BMI mh       mw
## 1     58    115 24.03503 65 136.7333
## 2     59    117 23.63114 65 136.7333
## 3     60    120 23.43589 65 136.7333
The result is a data frame with indicators attached.
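The output above is consistent with a call to add_indices on the original data and the indication object. A minimal sketch, with the definitions repeated so that it is self-contained (the name aug and the use of head to show three rows are assumptions):

```r
library(validate)

ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height)
  , mw  = mean(weight))
out <- confront(women, ii)

# Attach the computed indicator values as extra columns;
# single-valued indicators (mh, mw) are repeated for every record
aug <- add_indices(women, out)
head(aug, 3)
```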
The columns error and warning in the summary indicate whether calculation of the indicators was problematic, for example because the output of an indicator rule is not numeric, or because it uses variables that do not occur in the data. Use warnings(out) or errors(out) to obtain the warning and error messages per rule.
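To illustrate, the sketch below defines an indicator that refers to a variable that does not occur in the data. The names bad, out_bad, mx, and foo are hypothetical, and the sketch assumes the default behaviour in which errors are caught and stored rather than raised.

```r
library(validate)

# 'foo' is not a column of the women dataset, so this rule cannot be computed
bad <- indicator(mx = mean(foo))
out_bad <- confront(women, bad)

# The summary flags the rule in the error column
summary(out_bad)[c("name", "error", "warning")]

# Retrieve the stored messages per rule
errs <- errors(out_bad)
errs
warnings(out_bad)
```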
6.2 Getting indicator values
Values can be obtained with the values function, or by converting to a data.frame. In this example we add a unique identifier (this is optional) to make it easier to match the results to the data afterwards.
Compute indicators and convert to data.frame.
##      id name     value                          expression
## 12    l  BMI  22.15113 (weight/2.2046)/(height * 0.0254)^2
## 13    m  BMI  22.09670 (weight/2.2046)/(height * 0.0254)^2
## 14    n  BMI  22.17600 (weight/2.2046)/(height * 0.0254)^2
## 15    o  BMI  22.24240 (weight/2.2046)/(height * 0.0254)^2
## 16 <NA>   mh  65.00000                        mean(height)
## 17 <NA>   mw 136.73333                        mean(weight)
Observe that there is no key for indicators mh and mw since these are constructed from multiple records.
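The steps of this section can be sketched as follows. The identifier column built from letters is an assumption, chosen because it is consistent with the ids l to o shown in the output above; the names dat and df are illustrative.

```r
library(validate)

ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height)
  , mw  = mean(weight))

# Add an optional unique identifier to a copy of the data
dat <- women
dat$id <- letters[1:15]

# Pass the identifier column as key when confronting
out <- confront(dat, ii, key = "id")

# Raw indicator values
values(out)

# Long format: one row per computed value, with the key where available
df <- as.data.frame(out)
tail(df)
```

In the long format, the record-level indicator (BMI) carries the key of its record, while the aggregates (mh, mw) have a missing key.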