Intro
errorlocate uses validation rules from the validate package
to locate faulty values in observations (or in database slang:
erroneous fields in records).
It follows this simple recipe (Fellegi-Holt):
- Check if a record is valid (using supplied validation rules)
- If not valid then adjust the minimum number of values to make it valid.
errorlocate does this by translating the task into a mixed
integer problem (see vignette("inspect_mip", package="errorlocate"))
and solving it using lpSolveAPI.
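The recipe above can be illustrated with a brute-force sketch in base R. This is not how errorlocate works internally (it solves a mixed integer problem); it only shows the idea of searching for the smallest set of variables whose adjustment makes a record valid. The helper names is_valid and can_fix, and the small grid of candidate values, are assumptions made for this illustration.

```r
# Toy brute-force version of the Fellegi-Holt principle (base R only).
# Assumption: each variable may only take values from a small candidate grid.
record <- c(age = 15, income = 2000)
candidates <- list(age = c(10, 20, 40), income = c(0, 1000, 2000))

# Hypothetical rule set: age > 0, and if income > 0 then age > 16
is_valid <- function(r) {
  r[["age"]] > 0 && (r[["income"]] <= 0 || r[["age"]] > 16)
}

# TRUE if the record can be made valid by changing only the variables in `vars`
can_fix <- function(record, vars) {
  if (length(vars) == 0) return(is_valid(record))
  grid <- expand.grid(candidates[vars])
  for (i in seq_len(nrow(grid))) {
    r <- record
    r[vars] <- unlist(grid[i, , drop = FALSE])
    if (is_valid(r)) return(TRUE)
  }
  FALSE
}

# Try subsets in order of increasing size; stop at the smallest size that works
vars <- names(record)
minimal_sets <- list()
for (k in 0:length(vars)) {
  sets <- if (k == 0) list(character(0)) else combn(vars, k, simplify = FALSE)
  ok <- Filter(function(s) can_fix(record, s), sets)
  if (length(ok) > 0) {
    minimal_sets <- ok
    break
  }
}
minimal_sets
```

For this record the search returns two single-variable sets, one containing age and one containing income. Such ties are why a tie-breaking mechanism (random or weight-based) is needed.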
Methods
errorlocate has two main functions:
- locate_errors for detecting errors
- replace_errors for replacing faulty values with NA
Let’s start with a simple example:
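We have a rule that age must be positive:
library(validate)    # provides validator() and confront()
library(errorlocate) # provides locate_errors() and replace_errors()
rules <- validator(age > 0)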
And we have the following data set:
"age, income
-10, 0
15, 2000
25, 3000
NA, 1000
" -> csv
d <- read.csv(textConnection(csv), strip.white = TRUE)
d
#> age income
#> 1 -10 0
#> 2 15 2000
#> 3 25 3000
#> 4 NA 1000
le <- locate_errors(d, rules)
summary(le)
#> Variable:
#> name errors missing
#> 1 age 1 1
#> 2 income 0 0
#> Errors per record:
#> errors records
#> 1 0 3
#> 2 1 1
summary(le) gives an overview of the errors found in
this data set. The complete error listing can be found with:
le$errors
#> age income
#> [1,] TRUE FALSE
#> [2,] FALSE FALSE
#> [3,] FALSE FALSE
#> [4,] NA FALSE
This shows that record 1 has a faulty value for age. The NA in
record 4 marks a missing value rather than a detected error.
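Because le$errors is a logical matrix, it can be summarised with ordinary base R tools. The following is a small sketch that rebuilds the example data with data.frame() instead of read.csv() so that it runs on its own.

```r
library(validate)
library(errorlocate)

# Same data and rule as in the example above
d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(age > 0)
le <- locate_errors(d, rules)

# Row and column positions of the cells flagged as faulty
which(le$errors, arr.ind = TRUE)

# Number of flagged cells per variable; missing (NA) cells are not counted
colSums(le$errors, na.rm = TRUE)
```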
Suppose we expand our rules:
rules <- validator( r1 = age > 0
, r2 = if (income > 0) age > 16
)
With validate::confront we can see that rule r2 is violated (record 2).
summary(confront(d, rules))
#> name items passes fails nNA error warning expression
#> 1 r1 4 2 1 1 FALSE FALSE age > 0
#> 2 r2 4 2 1 1 FALSE FALSE income <= 0 | (age > 16)
What errors will be found by locate_errors?
set.seed(1)
le <- locate_errors(d, rules)
le$errors
#> age income
#> [1,] TRUE FALSE
#> [2,] TRUE FALSE
#> [3,] FALSE FALSE
#> [4,] NA FALSE
It now detects that age in observation 2 is also faulty,
since it violates the second rule. Note that we use set.seed.
This is needed because in this example either age or income
can be considered faulty: changing either one makes record 2 valid.
set.seed ensures that the procedure is reproducible.
With replace_errors we can remove the errors (the resulting
missing values still need to be imputed).
d_fixed <- replace_errors(d, le)
summary(confront(d_fixed, rules))
#> name items passes fails nNA error warning expression
#> 1 r1 4 1 0 3 FALSE FALSE age > 0
#> 2 r2 4 2 0 2 FALSE FALSE income <= 0 | (age > 16)
Here replace_errors has set all faulty values to NA:
d_fixed
#> age income
#> 1 NA 0
#> 2 NA 2000
#> 3 25 3000
#> 4 NA 1000
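To see exactly which cells replace_errors blanked, the original and the fixed data can be compared cell by cell. The sketch below rebuilds the example so that it runs on its own; the variable name blanked is introduced here for illustration only.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(r1 = age > 0,
                   r2 = if (income > 0) age > 16)

set.seed(1)
le <- locate_errors(d, rules)
d_fixed <- replace_errors(d, le)

# Cells that held a value before and are NA afterwards
blanked <- !is.na(d) & is.na(d_fixed)
blanked

# No rule fails any more; the remaining NA values still await imputation
summary(confront(d_fixed, rules))$fails
```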
Weights
locate_errors allows for supplying weights for the
variables. It is common that the quality of the observed variables
differs. When we have more trust in age we can give it more
weight, so that locate_errors chooses income when it has to decide
between the two (record 2):
set.seed(1) # good practice, although not needed in this example
weight <- c(age = 2, income = 1)
le <- locate_errors(d, rules, weight)
le$errors
#> age income
#> [1,] TRUE FALSE
#> [2,] FALSE TRUE
#> [3,] FALSE FALSE
#> [4,] NA FALSE
Weights can be specified in different ways (see also
errorlocate::expand_weights):
- not specifying: all variables will have weight 1
- named vector: all records will have the same set of weights. Unspecified columns will have weight 1.
- named matrix or data.frame, same dimension as the data: specify weights per record.
- Use Inf weights to fixate a variable, so it won’t be changed.
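As a sketch of the per-record option, the following builds a weight matrix with the same dimensions and column names as the data and fixates age in record 2 only, using an Inf weight. The object name w is an assumption for this example; all other cells keep weight 1.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(r1 = age > 0,
                   r2 = if (income > 0) age > 16)

# One weight per cell, named like the columns of the data
w <- matrix(1, nrow = nrow(d), ncol = ncol(d),
            dimnames = list(NULL, names(d)))

# Fixate age in record 2 only, so income has to be flagged there
w[2, "age"] <- Inf

le <- locate_errors(d, rules, weight = w)
le$errors
```

Fixating age in record 1 as well would leave no way to repair that record, since its only violation is the negative age itself.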
Performance / Parallelisation
locate_errors solves a mixed integer problem. When the
number of interactions between validation rules is large, finding an
optimal solution can become computationally intensive. Both
locate_errors and replace_errors have a parallelization option,
Ncpus, which makes use of multiple processors. The $duration
property of the result gives the time (in seconds) spent finding a
solution for each record. This time can be capped with the timeout
argument (also in seconds).
# duration is in seconds.
le$duration
#> [1] 0.001949072 0.001564026 0.000000000 0.001436234
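A minimal sketch of both options together, using the Ncpus and timeout arguments named above. The values 2 and 5 are arbitrary choices for illustration; on a data set this small, parallel execution is unlikely to bring any speed-up.

```r
library(validate)
library(errorlocate)

d <- data.frame(age = c(-10, 15, 25, NA), income = c(0, 2000, 3000, 1000))
rules <- validator(r1 = age > 0,
                   r2 = if (income > 0) age > 16)

set.seed(1)
# Use two processes and allow at most 5 seconds per record
le <- locate_errors(d, rules, Ncpus = 2, timeout = 5)

# One duration (in seconds) per record
le$duration
```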