Skip to contents

Introduction

The goal of dcmodifydb is to apply modification rules specified with dcmodify on a database table, allowing for documented, reproducable data cleaning adjustments in a database.

This document provides examples and common error corrections scenario’s for which dcmodify can be used.

Document rules

A good practice is to store the intent and description of each rule: this enables future (co)workers on the dataset to evaluate the use of each rule. It is therefore recommended to store rules in a yaml format and document each rule accordingly.

Error scenario’s

Replacing common mistakes

m <- modifier( if (year < 25) year = year + 2000)
m <- modifier( if (year < 100){
  if (year > 25) {
    year = year + 1900
  } else {
    year = year + 2000
  }
})

Replacing missing values

A value is not measured (NA), but can be deduced to be of a certain value, e.g. zero. Suppose a health questionair contains questions if a person smokes and if so how many cigarretes per day. Typically if the answer to the first question is “no”, the second question is not asked an thus “unmeasured”, but can be deduced to be 0.

smokes cigarettes
TRUE 10
FALSE NA
if (smokes ==  FALSE) cigarretes = 0
#> Warning in `[<-.data.frame`(`*tmp*`, I, , value = structure(list(smokes =
#> FALSE, : provided 3 variables to replace 2 variables
smokes cigarettes
TRUE 10
FALSE NA

Recoding

SQL translation

Syntax

if rule

m <- modifier( if (age < 12) income = 0)

The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.

if (age < 12) income = 0  # R noobs
if (age < 12) income <- 0 # a bit more R-y
income[age < 12] <- 0     # very R-y

else

Each if rule may be followed with an else or else if

m <- modifier(if (age > 67) {retired = TRUE} else {retired = FALSE})

The following statements are equivalent. It is wise to choose a syntax that is familiar to the persons specifying the correction rules.

if (age > 67) {retired = TRUE} else {retired = FALSE} # R noobs
if (age > 67) {retired <- TRUE} else {retired <- FALSE} # R-y
retired <- if (age > 67) TRUE else FALSE # very R-y
retired <- age > 67 # very R-y

multiple assignments

m <- modifier( 
  if (age > 67) {
    retired = TRUE
    salary = 0 
  }
)

else if