Chapter 6 Indicators
Until now we have discussed various types of data validation rules: decisions that assign TRUE or FALSE values to the records or variables of a data frame. In some cases it is more convenient to have a continuous value that can then be used in further assessing the data.
A practical example is the so-called selective editing approach to data cleaning. Here, each record in a data set is assigned a number that expresses the risk that the record leads to a faulty conclusion. Records are then ordered from high risk (records that have both suspicious values and a large influence on the final result) to low risk (records with unsuspicious values and little influence on the final result). Records with the highest risk are then scrutinized by domain experts.
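To make the idea of a risk score concrete, here is a minimal sketch in base R. The score used here (a robust distance from the median as the measure of suspicion, multiplied by the record's share of the total as the measure of influence) is an illustrative assumption and not a method prescribed by the text. The data frame and its column names are hypothetical.

```r
# Hypothetical data: reported turnover for six businesses.
dat <- data.frame(
  id       = letters[1:6],
  turnover = c(100, 120, 95, 110, 2500, 105)
)

# Suspicion: distance from the median, measured in robust (MAD) units.
suspicion <- abs(dat$turnover - median(dat$turnover)) / mad(dat$turnover)

# Influence: the record's share of the estimated total.
influence <- dat$turnover / sum(dat$turnover)

# Risk combines both, so a record scores high only if it is
# suspicious and also matters for the final result.
dat$risk <- suspicion * influence

# Order records from high risk to low risk for expert review.
ranked <- dat[order(dat$risk, decreasing = TRUE), ]
ranked
```

In this sketch the record with the extreme turnover ends up first in the ordering, which is the record an expert would look at first.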
In validate, an indicator is a rule that returns a numerical value. Just as validator objects are lists of validation rules, indicator objects are lists of indicator rules. Indices can be computed by confronting data with an indicator, and using add_indicators, the computed indices can be added to the dataset. You can import, export, select, and combine indicator objects in the same way as validator objects.
6.1 A first example
Here is a simple example of the workflow.
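The following is a minimal sketch of that workflow, consistent with the call and the expressions that appear in the printed output further on. It assumes the validate package is installed and uses R's built-in women dataset (height in inches, weight in pounds).

```r
library(validate)

# Define three indicators: a per-record BMI and two dataset-level means.
# The factors 2.2046 and 0.0254 convert pounds to kg and inches to metres.
ii <- indicator(
  BMI = (weight / 2.2046) / (height * 0.0254)^2,
  mh  = mean(height),
  mw  = mean(weight)
)

# Confront the data with the indicators.
out <- confront(women, ii)
```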
In the first statement we define an indicator object storing indicator expressions. Next, we confront a dataset with these indicators. The result is an object of class indication. It prints as follows.
## Object of class 'indication'
## Call:
##     confront(dat = women, x = ii)
##
## Confrontations: 3
## Warnings      : 0
## Errors        : 0
To study the results, the object can be summarized with summary(out).
##   name items      min      mean       max nNA error warning
## 1  BMI    15  22.0967  22.72691  24.03503   0 FALSE   FALSE
## 2   mh     1  65.0000  65.00000  65.00000   0 FALSE   FALSE
## 3   mw     1 136.7333 136.73333 136.73333   0 FALSE   FALSE
##                            expression
## 1 (weight/2.2046)/(height * 0.0254)^2
## 2                        mean(height)
## 3                        mean(weight)
Observe that the first indicator results in one value per record, while the second and third indicators (mh and mw) each return a single value. The single values are repeated when indicator values are added to the data.
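As a sketch of this step, the block below attaches the indicator values to the data using add_indicators from validate. The indicator object is redefined here only so that the block runs on its own; the name aug is chosen for illustration.

```r
library(validate)

# Same indicators as in the example above.
ii <- indicator(
  BMI = (weight / 2.2046) / (height * 0.0254)^2,
  mh  = mean(height),
  mw  = mean(weight)
)
out <- confront(women, ii)

# Append one column per indicator; single values (mh, mw)
# are repeated for every record.
aug <- add_indicators(women, out)
head(aug, 3)
```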
##   height weight      BMI      mh       mw
## 1     58    115 24.03503 65.0000 136.7333
## 2     59    117 23.63114 65.0000 136.7333
## 3     60    120 23.43589 65.0000 136.7333
The result is a data frame with indicators attached.
In the summary, the columns error and warning indicate whether calculation of the indicators was problematic, for example because the output of an indicator rule is not numeric, or because it uses variables that do not occur in the data. Use warnings(out) or errors(out) to obtain the warning and error messages per rule.
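A small sketch of a problematic rule may help here. The misspelled variable hight and the object names jj and res are hypothetical, introduced only to trigger an error; the exact wording of the resulting message is not shown because it may differ between versions.

```r
library(validate)

# 'hight' is a deliberate misspelling: it does not occur in 'women',
# so the rule named 'bad' cannot be evaluated.
jj <- indicator(
  mh  = mean(height),
  bad = mean(hight)
)
res <- confront(women, jj)

# The summary flags which rule ran into trouble.
s <- summary(res)
s[, c("name", "error", "warning")]

# Retrieve the messages themselves.
errors(res)
warnings(res)
```

Note that confront does not stop at the failing rule: the remaining indicators are still computed, and the problem is recorded in the result.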
6.2 Getting indicator values
Values can be obtained with the values function, or by converting to a data.frame. In this example we add a unique identifier (this is optional) to make it easier to identify the results with data afterwards.
Compute indicators and convert to a data.frame.
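A sketch of both steps follows, assuming the key argument of confront is used to name the identifier column. The indicator object is redefined so that the block runs on its own, and the names df and vals are chosen for illustration.

```r
library(validate)

ii <- indicator(
  BMI = (weight / 2.2046) / (height * 0.0254)^2,
  mh  = mean(height),
  mw  = mean(weight)
)

# Add a unique identifier to each of the 15 records (optional).
women$id <- letters[1:15]

# Pass the name of the identifier column as the key.
out <- confront(women, ii, key = "id")

# Long format: one row per computed indicator value.
df <- as.data.frame(out)
tail(df)

# The raw indicator values, without conversion to a data frame.
vals <- values(out)
```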
##      id name     value                          expression
## 12    l  BMI  22.15113 (weight/2.2046)/(height * 0.0254)^2
## 13    m  BMI  22.09670 (weight/2.2046)/(height * 0.0254)^2
## 14    n  BMI  22.17600 (weight/2.2046)/(height * 0.0254)^2
## 15    o  BMI  22.24240 (weight/2.2046)/(height * 0.0254)^2
## 16 <NA>   mh  65.00000                        mean(height)
## 17 <NA>   mw 136.73333                        mean(weight)
Observe that there is no key for indicators mh and mw, since these are constructed from multiple records.