Find the fields in a data.frame that are most likely erroneous, given a set of validation rules. The method returns an errorlocation-class() object, computed with the rules or localizer passed as x.

Usage

locate_errors(
  data,
  x,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

# S4 method for class 'data.frame,validator'
locate_errors(
  data,
  x,
  weight = NULL,
  ref = NULL,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

# S4 method for class 'data.frame,ErrorLocalizer'
locate_errors(
  data,
  x,
  weight = NULL,
  ref = NULL,
  ...,
  cl = NULL,
  Ncpus = getOption("Ncpus", 1),
  timeout = 60
)

Arguments

data

data.frame to be checked.

x

validation rules (a validator object) or an ErrorLocalizer object used to find possible errors.

...

optional parameters that are passed to lpSolveAPI::lp.control() (see details)

cl

optional cluster object created with parallel::makeCluster(), or an integer number of cores (see details).

Ncpus

number of cores to use for parallel processing (see details).

timeout

maximum number of seconds that the localizer should use per record.

weight

optional numeric weight specification used in the error localization (see expand_weights()).

ref

optional data.frame with reference data used when checking the rules.

Value

errorlocation-class() object describing the errors found.

Details

Use replace_errors() to remove flagged fields, typically by setting them to NA. Use base::set.seed() beforehand to make calls reproducible.
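The workflow described above can be sketched as follows. This is a minimal illustration, assuming the validate and errorlocate packages are installed and reusing the rules and record from the Examples section; the object name fixed is only illustrative.

```r
library(validate)
library(errorlocate)

# Rules and data taken from the Examples section of this page
rules <- validator( profit + cost == turnover
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  )
data <- data.frame(profit = 755, cost = 125, turnover = 200)

# Seed the random number generator so repeated calls give the same result
set.seed(42)

# Locate the errors and replace the flagged fields with NA in one step
fixed <- replace_errors(data, rules)
print(fixed)
```

For this record, profit is the only single field whose change can satisfy all three rules, so it is the field that should be set to NA; cost and turnover are left untouched.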

Use an Inf weight specification to fix variables that should not be changed. See expand_weights() for more details.

locate_errors uses lpSolveAPI to formulate and solve a mixed-integer problem. See the vignettes for details. The solver has many options; see lpSolveAPI::lp.control.options(). Noteworthy options include:

  • timeout: maximum time, in seconds, that the solver spends on a record.

  • break.at.value: set this to the minimum weight + 1 to improve speed.

  • presolve: the default in errorlocate is "rows". Set it to "none" if you get solutions in which all variables are flagged as erroneous.
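As a rough sketch of how these options might be supplied, the call below passes presolve through ... alongside the timeout argument. It assumes, as the Arguments section states, that extra arguments are forwarded to lpSolveAPI::lp.control(); the option values are illustrative, not recommendations.

```r
library(validate)
library(errorlocate)

rules <- validator( profit + cost == turnover
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  )
data <- data.frame(profit = 755, cost = 125, turnover = 200)

set.seed(42)
le <- locate_errors( data, rules
                   , timeout  = 10      # at most 10 seconds per record
                   , presolve = "none"  # assumed to be forwarded to lp.control()
                   )

# Logical matrix: TRUE marks a field flagged as erroneous
le$errors
```

Switching off presolve can make the solver slower on large rule sets, so it is probably best reserved for cases where the default gives implausible solutions.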

locate_errors can run on multiple cores using the parallel package.

  • The easiest option is to set Ncpus to the desired number of cores.

  • Alternatively, create a cluster object with parallel::makeCluster() and pass it via cl.

  • Or set cl to an integer, in which case parallel::mclapply() is used; this only works on non-Windows systems.
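The first two options can be sketched as follows. This is a minimal illustration, assuming a machine with at least two cores; the two-record data set is hypothetical (the second record satisfies all rules), and the names le_ncpus and le_cl are only illustrative.

```r
library(validate)
library(errorlocate)

rules <- validator( profit + cost == turnover
                  , cost >= 0.6 * turnover
                  , turnover >= 0
                  )

# Hypothetical data: record 1 violates the balance rule, record 2 is valid
data <- data.frame( profit   = c(755,  50)
                  , cost     = c(125, 150)
                  , turnover = c(200, 200)
                  )

# Option 1: let locate_errors distribute the records over two cores
le_ncpus <- locate_errors(data, rules, Ncpus = 2)

# Option 2: create a cluster explicitly and pass it via cl
cl <- parallel::makeCluster(2)
le_cl <- locate_errors(data, rules, cl = cl)
parallel::stopCluster(cl)  # always release the worker processes

le_ncpus$errors
le_cl$errors
```

An explicit cluster is mainly useful when the same workers are reused across several calls; for a single call, Ncpus is the simpler choice.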

Examples

rules <- validator( profit + cost == turnover
                  , cost >= 0.6 * turnover # cost should be at least 60% of turnover
                  , turnover >= 0 # cannot be negative.
                  )
data <- data.frame( profit   = 755
                  , cost     = 125
                  , turnover = 200
                  )

# use set.seed to make results reproducible
set.seed(42)
le <- locate_errors(data, rules)

print(le)
#> call:  locate_errors(data = data, fh, ref = ref, weight = weight, ...,      cl = cl, Ncpus = Ncpus, timeout = timeout) 
#> located  1  error(s).
#> located  0  missing value(s).
#> Use 'summary', 'values', '$errors' or '$weight', to explore and retrieve the errors.
summary(le)
#> Variable:
#>       name errors missing
#> 1   profit      1       0
#> 2     cost      0       0
#> 3 turnover      0       0
#> Errors per record:
#>   errors records
#> 1      1       1

v_categorical <- validator( branch %in% c("government", "industry")
                          , tax %in% c("none", "VAT")
                          , if (tax == "VAT") branch == "industry"
)

data <- read.csv(text=
"   branch, tax
government, VAT
industry  , VAT
", strip.white = TRUE)
locate_errors(data, v_categorical)$errors
#>      branch   tax
#> [1,]  FALSE  TRUE
#> [2,]  FALSE FALSE

v_logical <- validator( citizen %in% c(TRUE, FALSE)
                      , voted %in% c(TRUE, FALSE)
                      , if (voted == TRUE) citizen == TRUE
                      )

data <- data.frame(voted = TRUE, citizen = FALSE)

set.seed(42)
locate_errors(data, v_logical, weight=c(2,1))$errors
#>      voted citizen
#> [1,] FALSE    TRUE

# try a conditional rule
v <- validator( married %in% c(TRUE, FALSE)
              , if (married==TRUE) age >= 17
              )
data <- data.frame( married = TRUE, age = 16)

set.seed(42)
locate_errors(data, v, weight=c(married=1, age=2))$errors
#>      married   age
#> [1,]    TRUE FALSE


# different weights per row
data <- read.csv(text=
"married, age
    TRUE,  16
    TRUE,  14
", strip.white = TRUE)

weight <- read.csv(text=
"married, age
       1,   2
       2,   1
", strip.white = TRUE)

set.seed(42)
locate_errors(data, v, weight = weight)$errors
#>      married   age
#> [1,]    TRUE FALSE
#> [2,]   FALSE  TRUE

# fix a variable, i.e. exclude it from error localization,
# by giving it an Inf weight
weight <- c(age = Inf)

set.seed(42)
locate_errors(data, v, weight = weight)$errors
#>      married   age
#> [1,]    TRUE FALSE
#> [2,]    TRUE FALSE