Chapter 6 Indicators

Until now we have discussed various types of data validation rules: decisions that assign TRUE or FALSE values to (parts of) a data set. In some cases it is convenient to have a continuous value instead, which can then be used to further assess the data.

A practical example is the so-called selective editing approach to data cleaning. Here, each record in a data set is assigned a number that expresses the risk that the record leads to a faulty conclusion. Records are then ordered from high risk (records that have both suspicious values and a large influence on the final result) to low risk (records with unsuspicious values and little influence on the final result). Records with the highest risk are then scrutinized by domain experts.

In validate, an indicator is a rule that returns a numerical value. Just like validator objects are lists of validation rules, indicator objects are lists of indicator rules. Indicator values can be computed by confronting data with an indicator object, and the computed values can be added to the dataset using add_indicators. You can import, export, select, and combine indicator objects in the same way as validator objects.
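As a rough sketch of selecting, combining, and exporting, the following assumes that indicator objects support the same `+`, `[`, `as.data.frame`, and `.data` mechanisms as validator objects. The object names (i1, i2, both, rules_df, again) are illustrative only.

```r
library(validate)

# Two small indicator objects; the rule names are illustrative.
i1 <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height))
i2 <- indicator(mw = mean(weight))

# Combine two indicator objects into one.
both <- i1 + i2
names(both)

# Select a subset of rules by position or by name.
both[1]
both["mw"]

# Export the rules to a data frame, then read them back in.
rules_df <- as.data.frame(both)
again    <- indicator(.data = rules_df)
length(again)
```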

6.1 A first example

Here is a simple example of the workflow.

library(validate)
ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2 
  , mh  = mean(height)
  , mw  = mean(weight))
out <- confront(women, ii)

In the first statement we define an indicator object storing indicator expressions. Next, we confront a dataset with these indicators. The result is an object of class indication. It prints as follows.

out

## Object of class 'indication'
## Call:
##     confront(dat = women, x = ii)
## 
## Rules confronted: 3
##    With missings: 0
##    Threw warning: 0
##    Threw errors : 0

To study the results, the object can be summarized.

summary(out)

##   name items      min      mean       max nNA error warning
## 1  BMI    15  22.0967  22.72691  24.03503   0 FALSE   FALSE
## 2   mh     1  65.0000  65.00000  65.00000   0 FALSE   FALSE
## 3   mw     1 136.7333 136.73333 136.73333   0 FALSE   FALSE
##                            expression
## 1 (weight/2.2046)/(height * 0.0254)^2
## 2                        mean(height)
## 3                        mean(weight)

Observe that the first indicator results in one value per record while the second and third indicators (mh, mw) each return a single value. The single values are repeated when indicator values are added to the data.

head(add_indicators(women, out), 3)

##   height weight      BMI mh       mw
## 1     58    115 24.03503 65 136.7333
## 2     59    117 23.63114 65 136.7333
## 3     60    120 23.43589 65 136.7333

The result is a data frame with indicators attached.
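Because the result is an ordinary data frame, it can be sorted with base R, which is the ordering step in selective editing. The sketch below uses BMI as a stand-in for a risk score; that choice is an assumption made for illustration, not a recommended risk measure.

```r
library(validate)

ii  <- indicator(BMI = (weight/2.2046)/(height*0.0254)^2)
out <- confront(women, ii)
dat <- add_indicators(women, out)

# Order records from the highest to the lowest indicator value,
# as one would order records by a risk score.
ranked <- dat[order(dat$BMI, decreasing = TRUE), ]
head(ranked, 3)
```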

The columns error and warning indicate whether calculation of the indicators was problematic. This happens, for example, when the output of an indicator rule is not numeric, or when the rule uses variables that do not occur in the data. Use warnings(out) or errors(out) to obtain the warning and error messages per rule.
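To see what a failing rule looks like, here is a minimal sketch with one rule that refers to a variable that is not in the data. The names jj, res, bad, and length_cm are hypothetical and chosen only for this illustration.

```r
library(validate)

# 'length_cm' does not occur in the women data set,
# so the second rule cannot be evaluated.
jj <- indicator(
    mh  = mean(height)
  , bad = mean(length_cm))
res <- confront(women, jj)

# The error column flags the failing rule ...
summary(res)[c("name", "error", "warning")]

# ... and errors() gives the message for each rule that failed.
errors(res)
```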

6.2 Getting indicator values

Values can be obtained with the values function, or by converting to a data.frame. In this example we add a unique identifier (this is optional) to make it easier to match the results to the data afterwards.

women$id <- letters[1:15]

Compute indicators and convert to data.frame.

out <- confront(women, ii, key = "id")
tail( as.data.frame(out) )

##      id name     value                          expression
## 12    l  BMI  22.15113 (weight/2.2046)/(height * 0.0254)^2
## 13    m  BMI  22.09670 (weight/2.2046)/(height * 0.0254)^2
## 14    n  BMI  22.17600 (weight/2.2046)/(height * 0.0254)^2
## 15    o  BMI  22.24240 (weight/2.2046)/(height * 0.0254)^2
## 16 <NA>   mh  65.00000                        mean(height)
## 17 <NA>   mw 136.73333                        mean(weight)

Observe that there is no key for indicators mh and mw since these are constructed from multiple records.
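The values function mentioned at the start of this section can be used as in the sketch below. The three indicators do not all return the same number of values, so the results are expected to come back grouped by size rather than as a single table. The output is not reproduced here; str is used to inspect the structure.

```r
library(validate)

ii <- indicator(
    BMI = (weight/2.2046)/(height*0.0254)^2
  , mh  = mean(height)
  , mw  = mean(weight))
out <- confront(women, ii)

# Extract the raw indicator values and inspect how they are organised.
vals <- values(out)
str(vals)
```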