Schema development

This section outlines a number of the patterns we commonly use to develop the schema for an open data standard.

Learning / reflection

We currently jump straight from the conceptual framework document to working up the data model and schema for a standard in a schema language.

This differs from the approach proposed here, in which the data model is maintained as a narrative document and only then given form by a schema that acts as a reference implementation.

Schema language

Our preferred schema language is JSON Schema draft-04.

This allows us to provide field structures and definitions. Although less expressive than other schema languages, the constraints of JSON Schema enable us to focus on keeping data simple enough for a wide range of users.

We generally use simple CSV files to represent codelists.

We use a number of extensions to JSON Schema draft-04 (documented below).

Serializations

We design with a range of serializations in mind and, where possible, aim to enable round-tripping of data between them.

In particular, we use flatten-tool to support:

  • Structured JSON serialization;
  • Excel serialization;
  • CSV serialization.

Flatten-tool can use the titles in a schema to provide 'friendly' column headings. With a metatab, it also supports packaging metadata and options that control how spreadsheets are parsed.
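
To illustrate what round-tripping between a structured and a flat serialization involves, here is a minimal sketch in plain Python. It is not flatten-tool's implementation or API: the flatten and unflatten helpers, the slash-separated column headings, and the sample record are all assumptions made for illustration, and the sketch only handles nested objects (not arrays).

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one row keyed by slash-separated paths."""
    row = {}
    for key, value in obj.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column path.
            row.update(flatten(value, path))
        else:
            row[path] = value
    return row


def unflatten(row):
    """Rebuild the nested structure from slash-separated column headings."""
    obj = {}
    for path, value in row.items():
        parts = path.split("/")
        target = obj
        for part in parts[:-1]:
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return obj


# Hypothetical record, used only to show the round trip.
record = {
    "id": "1",
    "buyer": {"name": "Example Agency", "address": {"locality": "London"}},
}
row = flatten(record)          # one spreadsheet row, keyed by column heading
restored = unflatten(row)      # back to the structured form
```

A design is round-trippable in this sense when restored equals record for every valid record; structures that cannot be flattened and rebuilt without loss are a signal to simplify the schema.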

Extended JSON schema

We use a number of custom properties in our JSON Schema implementation. A patch against JSON Schema draft-04 to include these is found here.

Codelist properties

  • codelist - the filename of a .csv file that contains at least a Code column. Used by the CoVE validator to check for acceptable values.
  • openCodelist - a boolean value to indicate whether values not on the codelist are permitted. When openCodelist is true, encountering a value not on the codelist should generate a warning. When openCodelist is false, encountering a value not on the codelist should generate an error.
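
The codelist rules above can be sketched as a small check. This is an illustrative sketch only, not the CoVE validator's code: the schema fragment, the codelist contents, the function names, and the choice to treat a missing openCodelist as false are all assumptions.

```python
import csv
import io

# Hypothetical codelist content; in practice this would be read from the
# .csv file named by the `codelist` property.
CODELIST_CSV = """Code,Title
planning,Planning
tender,Tender
award,Award
"""

# Hypothetical schema fragment using the two custom properties.
FIELD_SCHEMA = {
    "title": "Stage",
    "type": "string",
    "codelist": "stage.csv",
    "openCodelist": False,
}


def load_codes(csv_text):
    """Return the set of values found in the Code column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["Code"] for row in reader}


def check_value(value, field_schema, codes):
    """Return None if the value is on the codelist, else (level, message)."""
    if value in codes:
        return None
    # Assumption: a missing openCodelist is treated as a closed codelist.
    level = "warning" if field_schema.get("openCodelist", False) else "error"
    message = f"{value!r} is not in codelist {field_schema['codelist']}"
    return (level, message)


codes = load_codes(CODELIST_CSV)
closed_result = check_value("contract", FIELD_SCHEMA, codes)
open_result = check_value("contract", {**FIELD_SCHEMA, "openCodelist": True}, codes)
```

Here closed_result carries the "error" level and open_result the "warning" level, matching the two cases described for openCodelist.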

Deprecation properties

"deprecation is the discouragement of use of some terminology, feature, design, or practice; typically because it has been superseded or is no longer considered efficient or safe – but without completely removing it or prohibiting its use."

See: Deprecation (Wikipedia)

  • deprecated - an object to indicate that the field is deprecated, consisting of fields for:
    • description - a message that explains the deprecation, and that should be presented by validators to any publisher using this field.
    • deprecatedVersion - a string indicating the version in which the field was first deprecated.

We also use the column title Deprecated with a version number as the cell value in codelist CSV files when a code has been deprecated.
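
As a rough sketch of how a validator might surface these messages, the following walks a schema alongside some data and collects a warning for each deprecated field that is actually used. The schema fragment, field names, version number, and function name are hypothetical, and arrays are not handled.

```python
# Hypothetical schema fragment with one deprecated field.
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string"},
        "amendment": {
            "type": "object",
            "deprecated": {
                "description": "Use the amendments array instead.",
                "deprecatedVersion": "1.1",
            },
        },
    },
}


def deprecation_warnings(data, schema, path=""):
    """Collect messages for deprecated fields that appear in the data."""
    warnings = []
    for name, subschema in schema.get("properties", {}).items():
        if name not in data:
            continue  # Only warn about fields the publisher actually uses.
        field_path = f"{path}/{name}" if path else name
        deprecated = subschema.get("deprecated")
        if deprecated:
            warnings.append(
                f"{field_path} is deprecated since "
                f"{deprecated['deprecatedVersion']}: {deprecated['description']}"
            )
        if isinstance(data[name], dict):
            # Recurse into nested objects.
            warnings.extend(deprecation_warnings(data[name], subschema, field_path))
    return warnings


messages = deprecation_warnings({"id": "1", "amendment": {}}, SCHEMA)
```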

Merge strategies

The Open Contracting Data Standard describes an approach to merging releases of data from different points in time. We add a number of properties to indicate how merging should be approached.

  • omitWhenMerged
  • wholeListMerge
  • versionId

Behaviour for these is described in the OCDS documentation.
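
The sketch below does not implement merging itself; it only shows how these properties might sit in a schema and how a tool could list where they occur. The schema fragment, the placement of each property, the "[]" path notation for array items, and the function name are assumptions for illustration.

```python
MERGE_PROPERTIES = ("omitWhenMerged", "wholeListMerge", "versionId")

# Hypothetical schema fragment; the placement of each property is illustrative.
SCHEMA = {
    "type": "object",
    "properties": {
        "id": {"type": "string", "omitWhenMerged": True},
        "tag": {"type": "array", "items": {"type": "string"}, "omitWhenMerged": True},
        "additionalIdentifiers": {
            "type": "array",
            "items": {"type": "object"},
            "wholeListMerge": True,
        },
        "awards": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"id": {"type": "string", "versionId": True}},
            },
        },
        "title": {"type": "string"},
    },
}


def merge_annotations(schema, path=""):
    """Map each field path to the merge properties set on it."""
    found = {}
    flags = [prop for prop in MERGE_PROPERTIES if schema.get(prop)]
    if flags:
        found[path] = flags
    for name, subschema in schema.get("properties", {}).items():
        child_path = f"{path}/{name}" if path else name
        found.update(merge_annotations(subschema, child_path))
    items = schema.get("items")
    if isinstance(items, dict):
        # "[]" marks that the path passes through array items.
        found.update(merge_annotations(items, f"{path}[]"))
    return found


annotations = merge_annotations(SCHEMA)
```

A listing like this can be useful when reviewing a schema, since fields without any merge property (such as title here) fall back to whatever default the merge routine applies.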

Design patterns

Developing a good schema is an art as much as a science. It requires sensitivity to the needs of both data producers and data users, and an understanding of the incentive structures that will drive adoption of a standard.

The following section provides links to a non-exhaustive set of design patterns that can be drawn upon when developing a schema.

Schema patterns: