Schema patterns¶
- Top-object
- Permissive schema
- Object identifiers
- Spreadsheet-first
- Deprecation
- Flexible vocabularies
- Packaging
- Immutability
- Merging
- Extensibility
Top-object¶
Problem¶
Data is often stored in source systems using a relational data model that stores many related entities. Relational data can be represented in many different ways.
Solution¶
A standard is opinionated about the 'top-object' that describes the key entity being exchanged, and all other data is nested within this object.
Method¶
The selection of the top-object will be based on the conceptual model for the standard. It will need to be informed by consultation with data producers and users.
Where a standard needs more than one top-object, consider treating the project as one of API design, rather than the design of a singular data standard.
Example¶
The Open Contracting Data Standard uses 'Contracting Process' as it's top-object, nesting information on each stage of contracting within this. This partially reflects the data found during research (though this was mostly structured around the idea of a 'notice', a 'contract' or an 'award') and substantially reflects user-demand for joined up data from across all stages of contracting. The choice of 'contracting process' plays a substantial normative role and seeks to change how existing data systems are understood.
The 360 Giving Data Standard uses Grant as it's top-concept, rather than grantmaking process. This reflects the design-principle of the standard to adopt a simple, static, representation of grants made.
Permissive schema¶
Problem¶
A schema can enforce validation rules. However, when data owners encounter lots of validation errors, it can act as a barrier to standard adoption.
When a data owner does not have data to fill in a required field, or to fill it in the desired format, they may be prevented from using the standard by strict validation.
Solution¶
Minimise the use of required
properties and validation rules, unless absolutely necessary to the technical functioning of the standard.
Indicate recommend fields through guidance, implementation tools and validation platforms.
This builds on the idea of designing to allow for 'the tussle'. A policy-related standard provides the framework within which different data producers and users can tussle over the exact data that should be provided in a particular context.
(The applicability of this pattern varies substantially based on the policy context of a standard.)
Method¶
Additional checks can be used to report data quality issues to users in a validator.
A mapping document that indicates which fields, or field-value pairs are required for particular use-cases can guide contextualised recommendations about what to publish.
Object identifiers¶
Problem¶
When transforming data between serialisations, updating data, or comparing datasets, it can be difficult to work out how to handle nested objects.
Solution¶
Provide every object with an identifier field.
Method¶
Instead of:
{
"objects":[
{
"title":"First object"
},
{
"title":"Second object"
}
]
}
always design a schema as:
{
"objects":[
{
"id":1,
"title":"First object"
},
{
"id":2,
"title":"Second object"
}
]
}
Flatten-tool and our merging tools recognise id
as a special property.
This pattern is not needed for objects that are not contained in an array.
Example¶
See above.
Related patterns¶
Spreadsheet-first¶
Problem¶
Many potential users of data are most comfortable with spreadsheet tools.
Data structures which make sense in a hierarchical data format may be tricky to work with when flattened out.
Solution¶
Design with flattened representations in mind.
Consider how a spreadsheet user would be able to analyse the data using simple spreadsheet functions such as pivot tables, or vLookup functions.
Deprecation¶
Problem¶
Fields may need to be removed from a standard. When these are removed, users may not know how to update their data.
Solution¶
Mark fields as deprecated for at least one version prior to their removal. Provide a deprecation message that explains to users how to change their data.
Method¶
Example¶
OCDS Version 1.1 deprecated a number of fields. The validator will report when deprecated fields are encountered in data.
Flexible vocabularies¶
Problem¶
Source systems may use many different classification schemes for their data. Getting data owners to harmonise the codelists and classifications they use, or to adopt common identifier schemes, can be very difficult - and may inhibit adoption of a standard.
Solution¶
Rather than just having a field for classification values, split this into at least:
vocabulary
orscheme
- the list/codelist/scheme from which identifiers or classifications are drawn;code
orid
- the actual value from the specified list
Provide a codelist of recognise vocabularies or schemes, and provide recommendations on the one to use where appropriate.
Where mappings are available between vocabularies and schemes, make users aware of this.
Example¶
org-id.guide provides a list of scheme
values for identifying organisations. For example, the following identifier block is recommended by org-id.guide to represent a UK company number.
{
"scheme": "GB-COH",
"id": "09506232
}
An alternative pattern, that org-id.guide recognises, is concatenation of scheme and identifier, such that the above company number could also be represented as 'GB-COH-09506232'.
Packaging¶
Problem¶
When data is exchanged users may need to know about the source, the version of schema being used and the license data is under.
Solution¶
Provide a packaging schema, in which an array of the schema's top objects can be nested.
Method¶
A separate packaging schema can use recognised meta-data keywords. The package provides meta-data about the data, rather than describing the entities that the schema represents.
A package schema can use the JSON schema $ref
element to point to the main schema of the standard.
In some cases, meta-data may need to be embedded within each top object, particularly in cases where data from multiple sources it to be merged together.
Immutability¶
Problem¶
Users may want to understand how data has changed over time. Source systems may or may not provide a full change-log.
Solution¶
The normative guidance of a standard may specify immutability. Any top-object with a given id
, once created, should not change. The id
value should be incremented whenever the object changes.
Merging¶
Problem¶
When data about the same entity is produced from different systems, and at different times, and the immutability pattern is used, it can be tricky to get a full picture of the current state of an entity.
Solution¶
Merging together data in sequential order (oldest first) can create an object that reflects the latest state of the entity represented.
Method¶
To be documented.
Example¶
The OCDS releases and records model makes use of merging.
Extensibility¶
Problem¶
Source systems may contain data not covered by the standard, leading to under-publication of valuable information.
A group of users may have a need for additional fields not specified by the standard.
Solution¶
An extension mechanism can allow data owners and data users to declare and document additional fields that they publish or would like to see published.
Method¶
Extensions can be represented using a JSON Merge Patch.
An extension registry can help data owners and users to discover relevant extensions.
When extensions are declared in packaging meta-data, validators and other tools can check data against them.
Example¶
The OCDS Extension Template and extensions registry document a technical approach to extensions.