Preface

This book is about checking data with the validate package for R.

This version of the book was rendered with validate version 1.1.3. The latest release of validate can be installed from CRAN as follows.
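
A minimal sketch of the installation step, assuming a standard R session with internet access and a configured CRAN mirror. The version check at the end is an optional addition for comparison with the version named above.

```r
# Install the latest release of validate from CRAN
# (assumes internet access and a configured CRAN mirror).
install.packages("validate")

# Optional: report the installed version, to compare it with the
# version this book was rendered with (1.1.3).
packageVersion("validate")
```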

The purposes of this book include demonstrating the main tools and workflows of the validate package, giving examples of common data validation tasks, and showing how to analyze data validation results.

The book is organized as follows. Chapter 1 discusses the bare necessities for following the rest of the book. Chapters 2 to 5 form the ‘cookbook’ part of the book and show, by example, many different ways to check your data. Chapter 6 is devoted to deriving plausibility measures with the validate package. Chapters 7 and 8 treat working with validate in depth. Chapter 10 discusses how to compare two or more versions of a dataset, possibly automated through the lumberjack package. The section with Bibliographical Notes lists some references and points to literature for further reading.

Prerequisites

Readers of this book are expected to have some knowledge of R. In particular, you should know how to import data into R and know a little about working with data frames and vectors.

Citing this work

To cite the validate package, please use the following citation.

MPJ van der Loo and E de Jonge (2021). Data Validation Infrastructure for R. Journal of Statistical Software, 97(10).

To cite this cookbook, please use the following citation.

MPJ van der Loo (2023). The Data Validation Cookbook version 1.1.3. https://data-cleaning.github.io/validate

Acknowledgements

This work was partially funded by European Grant Agreement 88287–NL-VALIDATION of the European Statistical System.

Contributing

If you find a mistake or have suggestions, please file an issue or a pull request on the GitHub page of the package: https://github.com/data-cleaning/validate. If you do not have or want a GitHub account, you can contact the author via the e-mail address listed with the package.