Data and Datasets overview
This note provides some background on the various notions of "data" and "dataset" related to Schema.org.
Schema.org as a project, and as a collection of terms, is entirely devoted to data. We define types such as Event, NewsArticle, Review, Person, as well as properties that characterize and interlink instances of these types. For example, the "alumni" property links people with educational organizations.
Schema.org itself also contains some dedicated vocabulary that can be used in applications which publish, discover or integrate different kinds of data. Just as schema.org defines vocabulary to help describe people, volcanos and public toilets, it can also be used to describe data. This capability is in addition to schema.org's general nature as a collection of structured data schemas, and complements numerous other data-related formats and standards.
- When describing collections of data, for example as published in scientific, scholarly or governmental "open data" repositories, the Dataset type can be used, alongside DataCatalog to indicate the larger collection, and DataDownload for specific representations of a dataset. These "datasets", unlike typical use of Schema.org, can be in arbitrary formats. For example, they may include data that is stored in collections of spreadsheet files, or as digital images, or in dedicated scientific, geospatial and engineering file formats. Such diversity reflects the complexity of real-world data, but the use of diverse and often incompatible formats also makes it hard to integrate the information that they encode, e.g. for use in unified "knowledge graphs" such as Wikidata and DataCommons.org. Schema.org's Dataset vocabulary was originally based on DCAT, which in turn used Dublin Core and FOAF terms.
- When aggregating and integrating statistical observations that describe collections ("populations") of individual entities, the StatisticalPopulation and Observation types can be used. See the proposal and overview document for details, and DataCommons.org for an application of this approach to large scale knowledge graphs. This approach emphasises integrating information from multiple independent statistical datasets, using schema.org and related vocabulary to explain the content of the statistical data.
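As an illustration of the first case above, the following minimal Python sketch builds a JSON-LD description that ties a Dataset to its DataCatalog and to DataDownload representations. The helper name `describe_dataset`, the example names and the example.org URLs are hypothetical; only the schema.org types and properties (`includedInDataCatalog`, `distribution`, `encodingFormat`, `contentUrl`) come from the vocabulary itself.

```python
import json


def describe_dataset(name, description, catalog_name, downloads):
    """Return a schema.org Dataset description as a JSON-LD-style dict.

    `downloads` is a list of (encoding_format, content_url) pairs; each
    pair becomes one DataDownload under the `distribution` property.
    This helper is a hypothetical illustration, not part of any library.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # The larger collection that this dataset belongs to.
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": catalog_name,
        },
        # Specific representations of the dataset, in arbitrary formats.
        "distribution": [
            {
                "@type": "DataDownload",
                "encodingFormat": encoding_format,
                "contentUrl": content_url,
            }
            for encoding_format, content_url in downloads
        ],
    }


# Hypothetical example values, for illustration only.
dataset = describe_dataset(
    name="Example rainfall measurements",
    description="Monthly rainfall totals for an imaginary region.",
    catalog_name="Example Open Data Catalog",
    downloads=[
        ("text/csv", "https://example.org/rainfall.csv"),
        ("image/tiff", "https://example.org/rainfall.tif"),
    ],
)

print(json.dumps(dataset, indent=2))
```

The same dataset is listed here with two distributions in unrelated formats, which reflects the point that the described data need not itself be expressed in Schema.org.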
Other related work includes W3C's CSVW and RDF Data Cube specifications, as well as the DSPL 2.0 specification. DSPL 2.0 combines Schema.org for per-dataset metadata with the use of CSV files to represent code lists, enumerations and statistical observations. These technologies all in turn depend on lower-level standards, such as JSON-LD, RDFa, Microdata, XML and Unicode, and share a broadly RDF-like approach to representing information.