Chapter 9 Rules from SDMX
Note This functionality is available for validate
versions 1.1.0
or higher.
In this Chapter we first demonstrate how to use SDMX with the validate
package. In 9.5 we provide a bit more general information on the
SDMX landscape, registries, and their APIs.
9.1 SDMX and validate
Statistical Data and Metadata eXchange, or SDMX is a standard for storing data
and the description of its structure, meaning, and content. The standard is
developed by the SDMX consortium (https://sdmx.org
). It is used, amongst
others, in the Official Statistics community to
exchange data in a standardized way.
A key aspect of SDMX is a standardized way to describe variables, data structure (how is it stored), and code lists. This metadata is defined in an SDMX registry where data producers can download or query the necessary metadata. Alternatively, metadata is distributed in a so-called Data Structure Definition (DSD) file, which is usually in XML format.
For data validation, some aspects of the metadata are of interest. In particular, code lists are interesting objects to test against. In validate there are two ways to use SDMX codelists. The first is by referring to a specific code list for a specific variable in an SDMX registry. The second way is to derive a rule set from a DSD file that can be retrieved from a registry.
Below we discuss the following functions.
function | what it does |
---|---|
sdmx_endpoint |
retrieve URL for SDMX endpoint |
sdmx_codelist |
retrieve sdmx codelist |
estat_codelist |
retrieve codelist from Eurostat SDMX registry |
global_codelist |
retrieve codelist from Global SDMX registry |
validator_from_dsd |
derive validation rules from DSD in SDMX registry |
9.2 SDMX and API locations
SDMX metadata is typically exposed through a standardized REST API. To query an SDMX registry, one needs to supply at least the following information:
- The registry’s API entry point. This is the base URL for the online registry. You can specify it literally, or use one of the helper functions that are aware of certan known SDMX registries.
- Agency ID: the ID of the agency that is responsible for the code list
- Resource ID: the name of the SDMX resource. This is usually the name of a type of statistic, like STS (short term statistics).
- version: the code list version.
Some API endpoints are stored with the package. The function sdmx_endpoint()
returns endpoint URLs for several SDMX registries. Use
to get a list of valid endpoints. As an example, to retrieve the endpoint for the global SDMX registry, use the following.
## GLOBAL
## "https://registry.sdmx.org/ws/public/sdmxapi/rest"
9.3 Code lists from SDMX registries
Code lists can be retrieved on-the-fly from one of the online SDMX registries. In the following rule we retrieve the codelist of economic activities from the global SDMX registry.
codelist <- sdmx_codelist(
endpoint = sdmx_endpoint("global")
, agency_id = "ESTAT"
, resource_id = "CL_ACTIVITY")
head(codelist)
[1] "_T" "_X" "_Z" "A" "A_B" "A01"
Equivalently, and as a convenience, you could use global_codelist()
to avoid
specifying the API endpoint explicitly. The output can be used in a rule.
Since downloading codelists can take some time, any function that accesses online SDMX registries will store the download in memory for the duration of the R session.
There is also a estat_codelist()
function for downloading codelists from
the Eurostat SDMX registry.
9.4 Derive rules from DSD
The functions described in the previous subsection allow you to check variables against a particular SDMX code list. It is also possible to download a complete Data Structure Definition and generate all checks implied by the DSD.
rules <- validator_from_dsd(endpoint = sdmx_endpoint("ESTAT")
, agency_id = "ESTAT", resource_id = "STSALL", version="latest")
length(rules)
[1] 13
rules[1]
Object of class 'validator' with 1 elements:
CL_FREQ: FREQ %in% sdmx_codelist(endpoint = "https://ec.europa.eu/tools/cspa_services_global/sdmxregistry/rest", agency_id = "SDMX", resource_id = "CL_FREQ", version = "2.0")
Rules are evaluated using locally defined options
There are 13 rules in total. For brevity, we only show the first rule here.
Observe that the first rule checks the variable CL_FREQ
against a code list
that is retrieved from the global SDMX registry. A demonstration of the fact
that a DSD does not have to be fully self-contained and can refer to
metadata in other standard registries. If a data set is checked against this
rule, validate
will download the codelist from the global registry and
compare each value in column CL_FREQ
against the codelist.
Note that the validator_from_dsd
function adds relevant metadata such as a
rule name, the origin of the rule and a short description. Try
to see all information.
9.5 More on SDMX
The Statistical Data and Metadata eXchange (SDMX) standard is an ISO standard designed to facilitate the exchange or dissemination of Official Statistics. At the core it has a logical information model describing the key characteristics of statistical data and metadata, which can be applied to any statistical domain. Various data formats have been defined based on this information model, such as SDMX-CSV, SDMX-JSON), and - by far the most widely known - SDMX-ML (data in XML). A key aspect of the SDMX standard is that one defines the metadata, including data structure, variables, and code lists beforehand in order to describe what data is shared or published. This metadata is defined in an SDMX registry where data producers can download or query the necessary metadata. Alternatively metadata is distributed in a so-called Data Structure Definition (DSD) file, which is usually an XML format. Both types of modes should result in exactly the same metadata agreements.
SDMX registries can be accessed through a REST API, using a standardized set of parameters. We can distinguish between registries that provide metadata and registries that provide the actual data. For the validate package, the metadata registries are of interest. Some of widely used metada registries include the following.
- Global SDMX Registry: for global metadata, hosted by the SDMX consortium. The central place for ESS-wide metadata. This registry hosts important statistical metadata such as for CPI/HICP, National Accounts (NA), Environmental accounting (SEEA), BOP, GFS, FDI and many more. Unfortunately not all ESS metadata is present in this registry.
- Eurostat SDMX Registry: for Eurostat-wide metadata, hosted by Eurostat. This registry contains statistical metadata for all other official statistics in the European Statistical System (ESS). Access is offered via SDMX 2.1 REST API.
- IMF SDMX Central: Registry by the IMF.
- UNICEF: Registry by UNICEF
Organisations that at the time of writing (spring 2023) actively offer
automated access to their data (not just metadata) via an SDMX API include (but
not limited to) the European Central Bank
(ECB),
the OECD (in
SDMX-JSON or
SDMX-ML format),
Eurostat,
the International Labour Organisation [ILO (https://www.ilo.org/sdmx/index.html
)],
the Worldbank,
the Bank for International Settlements
(BIS),
and the Italian Office of National Statistics (ISTAT).
The SDMX consortium does not maintain a list of active SDMX endpoints. The
rsdmx R package maintains such a
list based on an earlier inventory of Data Sources, but at the time of writing
not all those links appear to be active.
Ideally, all SDMX providers would have implemented SDMX in a coordinated way so that a client looking for SDMX metadata to validate its data before sending could query the respective sources using one and the same API. The latest version of the REST API is 2.1 which is described very well in the easy to use SDMX API cheat sheet Inspecting the endpoints shows that not all providers implement all same resource values. Depending on the provider an organization may decide which elements of the API are exposed. For example, the API standard defines methods to retrieve code lists from a DSD, but this functionality may or may not be offered by an API instance. If it is not offered, this means the client software needs to retrieve this metadata via other resource requests or alternatively extract them locally from a DSD file. Finally we signal that on a technical level the API of the various institutes may differ considerably and that not all SDMX services implement the same version of SDMX.
This means that users should typically familiarize themselves somewhat with the
specific API they try to access (e.g. from validate
).