Correction scenarios
Common errors and their corrections
Source:vignettes/scenarios.Rmd
scenarios.Rmd
Introduction
The goal of dcmodifydb
is to apply modification rules specified with dcmodify
on a database table, allowing for documented, reproducable data cleaning adjustments in a database.
This document provides examples and common error corrections scenario’s for which dcmodify
can be used.
Document rules
A good practice is to store the intent and description of each rule: this enables future (co)workers on the dataset to evaluate the use of each rule. It is therefore recommended to store rules in a yaml
format and document each rule accordingly.
Error scenario’s
Replacing common mistakes
m <- modifier( if (year < 25) year = year + 2000)
m <- modifier( if (year < 100){
if (year > 25) {
year = year + 1900
} else {
year = year + 2000
}
})
Replacing missing values
A value is not measured (NA
), but can be deduced to be of a certain value, e.g. zero. Suppose a health questionair contains questions if a person smokes and if so how many cigarretes per day. Typically if the answer to the first question is “no”, the second question is not asked an thus “unmeasured”, but can be deduced to be 0.
smokes | cigarettes |
---|---|
TRUE | 10 |
FALSE | NA |
if (smokes == FALSE) cigarretes = 0
#> Warning in `[<-.data.frame`(`*tmp*`, I, , value = structure(list(smokes =
#> FALSE, : provided 3 variables to replace 2 variables
smokes | cigarettes |
---|---|
TRUE | 10 |
FALSE | NA |
Syntax
if
rule
m <- modifier( if (age < 12) income = 0)
The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.
if (age < 12) income = 0 # R noobs
if (age < 12) income <- 0 # a bit more R-y
income[age < 12] <- 0 # very R-y
else
Each if
rule may be followed with an else
or else if
m <- modifier(if (age > 67) {retired = TRUE} else {retired = FALSE})
The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.
if (age > 67) {retired = TRUE} else {retired = FALSE} # R noobs
if (age > 67) {retired <- TRUE} else {retired <- FALSE} # R-y
retired <- if (age > 67) TRUE else FALSE # very R-y
retired <- age > 67 # very R-y
multiple assignments
m <- modifier(
if (age > 67) {
retired = TRUE
salary = 0
}
)