The Open Provenance Model Vocabulary, OPMV, is a lightweight vocabulary that provides terms to enable practitioners of data publishing to publish their data responsibly. It is closely based on the community provenance data model, the Open Provenance Model (OPM). OPMV can be used together with other provenance-related RDF/OWL vocabularies/ontologies, such as Dublin Core, FOAF, the Changeset Vocabulary, and the Provenance Vocabulary.
This document, the OPMV guide, is one of the two core documents of OPMV; the other is the OPMV vocabulary document [OPMV vocabulary]. The OPMV guide is aimed at both data publishers (those wishing to publish their datasets on the Web responsibly), and data consumers (those wishing to be aware of the quality of the datasets that they query and use in their applications). We assume that readers of this document are familiar with the core concepts about the Web of Data, such as URIs and RDF, and with the Turtle syntax for RDF. Basic knowledge about certain widely-used vocabularies such as Dublin Core (DC) and Friend of a Friend (FOAF) is also assumed.
The Open Provenance Model Vocabulary (OPMV) is a vocabulary defined using OWL that implements the Open Provenance Model, a community provenance model driven by the need to facilitate interoperability between provenance systems [The OPM Specification].
OPMV aims to be as lightweight as possible. It tries to take full advantage of Semantic Web technologies by using a minimum of OWL constructs and reusing existing RDF vocabularies wherever possible. An alternative OPM OWL serialization, OPMO, is available at [OPMO Ontology]; it uses more complex OWL 2 constructs to define more constraints. Users should opt for OPMO if they need to perform complex reasoning over, or validity checking of, their OPM provenance information.
OPMV is currently implemented as an OWL-DL ontology and is available in its namespace http://purl.org/net/opmv/ns#. The vocabulary is partitioned into the core OPMV vocabulary and several supplementary modules that provide less frequently used terms and a broad range of specializations of the core terms. At the moment we have the following implemented modules:
The document is aimed at Linked Data practitioners who want to publish their data responsibly and it covers only how the OPMV Core should be used in practice. Information about individual supplementary modules can be found in corresponding short guides. We use concrete examples to explain how the OPMV Core can be used to publish basic as well as detailed and precise provenance information for linked data. Most of the examples are based on use cases from the data.gov.uk team.
All examples in this document are written in the Turtle RDF syntax. Throughout this document, the following namespaces are used:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix doap: <http://usefulinc.com/ns/doap#> .
@prefix org: <http://www.w3.org/ns/org#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix xslt: <http://purl.org/net/opmv/types/xslt#> .
@prefix sparql: <http://purl.org/net/opmv/types/sparql#> .
@prefix gate: <http://purl.org/net/opmv/types/gate#> .
@prefix eg: <http://example.org.uk/> .
OPMV does not explicitly implement all the structures from the OPM specification as OWL classes or properties. It aims to take full advantage of existing Semantic Web technologies, such as Named Graphs, and existing vocabularies, such as the W3C Time Ontology:
The following sections describe in detail how each part of the OPM specification is supported in OPMV in combination with existing technologies and vocabularies.
The three top OPM entities and five top properties are implemented in OPMV as classes and object properties:
These terms can be used to express some basic provenance information about data creation and transformation.
We define opmv:Process as disjoint with opmv:Agent and opmv:Artifact. We also define sub-properties for properties like opmv:wasControlledBy, to enable users to express provenance information in a more specific way.
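The two kinds of axiom mentioned above can be sketched in Turtle as follows. This is an illustrative rendering of the statements made in this guide, not a copy of the published ontology file; the OPMV vocabulary document remains the reference for the actual definitions.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .

# opmv:Process shares no instances with opmv:Agent or opmv:Artifact
opmv:Process
    owl:disjointWith opmv:Agent, opmv:Artifact .

# opmv:wasPerformedBy is a more specific form of opmv:wasControlledBy
opmv:wasPerformedBy
    rdfs:subPropertyOf opmv:wasControlledBy .
```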
According to OPM, roles are used to "designate an artifact's or agent's function in a process" [The OPM Specification]. This structure can be used to refine provenance information expressed using the basic terms and to express provenance information more specifically. For example, an agent could have controlled the execution process or simply played a "performer" role. Instead of defining a class of roles, OPMV defines sub-properties of the five top abstract object properties to reflect the different roles that an artifact or an agent plays in a process.
For example, we define the sub-property opmv:wasPerformedBy of opmv:wasControlledBy to distinguish the roles played by an agent. We differentiate the roles played by an artifact by refining the property opmv:used. This has been implemented in the common module, which defines common:usedData and common:usedScript as two sub-properties of opmv:used; in the first case the artifact played the role of "data" and in the latter case it played the role of a (configuration) script.
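As a minimal sketch of how these role-specific properties might appear in data, consider the description below. The resources eg:transform0, eg:input0, eg:config0 and eg:operator0 are hypothetical and serve only to illustrate the pattern.

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix opmv:   <http://purl.org/net/opmv/ns#> .
@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix eg:     <http://example.org.uk/> .

# hypothetical process that used one artifact as data and another as a script
eg:transform0
    rdf:type opmv:Process ;
    common:usedData eg:input0 ;          # artifact in the role of "data"
    common:usedScript eg:config0 ;       # artifact in the role of a (configuration) script
    opmv:wasPerformedBy eg:operator0 .   # agent in the "performer" role

eg:input0    rdf:type opmv:Artifact .
eg:config0   rdf:type opmv:Artifact .
eg:operator0 rdf:type opmv:Agent .
```

Because both properties are sub-properties of opmv:used, a consumer that applies RDFS inference can still conclude that eg:transform0 used each of the two artifacts.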
The provenance information about an artifact could be expressed at different levels of abstraction or from different viewpoints [The OPM Specification]. The OPM specification introduces the concept of "account" to "represent a description at some level of detail as provided by one or more observers".
OPMV does not provide specific terms to define accounts. We suggest using Named Graphs to represent such information. A separate named graph can be created for the provenance information provided by each observer. Provenance information at different levels of abstraction could either be extracted by queries (using, for example, SPARQL) or be defined in separate named graphs.
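One possible way to represent two accounts is sketched below, using the same named graph notation as the Named Graph examples in this guide. The graph names eg:account1 and eg:account2 and all resources inside them are hypothetical; the sketch assumes that each observer's description is simply placed in its own graph.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# account of a first observer: a coarse-grained view
eg:account1 {
    eg:report0 rdf:type opmv:Artifact ;
        opmv:wasGeneratedBy eg:publishing0 .
    eg:publishing0 rdf:type opmv:Process .
}

# account of a second observer: the same artifact, described in more detail
eg:account2 {
    eg:report0 rdf:type opmv:Artifact ;
        opmv:wasGeneratedBy eg:formatting0 ;
        opmv:wasDerivedFrom eg:draft0 .
    eg:formatting0 rdf:type opmv:Process ;
        opmv:used eg:draft0 .
    eg:draft0 rdf:type opmv:Artifact .
}
```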
OPM provides a very refined representation for time-related information. It differentiates instantaneous occurrences from those that are not. It recognizes four instantaneous occurrences: the creation and use of artifacts, and the starting and ending of processes.
In OPMV, we define object properties by reusing the W3C Time Ontology (http://www.w3.org/TR/owl-time/) to express this time-related information:
At a very fine-grained level, the time when an artifact was created might be different from the time when the process creating the artifact was finished; hence we define both opmv:wasGeneratedAt and opmv:wasEndedAt. Similarly, the time when an artifact was used (opmv:wasUsedAt) might be different from the time when the process using the artifact was started (opmv:wasStartedAt).
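The four time properties can be combined as in the following sketch. All resources and timestamps are hypothetical, and the sketch assumes that opmv:wasUsedAt and opmv:wasGeneratedAt are attached to artifacts while opmv:wasStartedAt and opmv:wasEndedAt are attached to processes; check the OPMV vocabulary document for the exact domains.

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix time: <http://www.w3.org/2006/time#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .
@prefix eg:   <http://example.org.uk/> .

# hypothetical process with distinct start and end instants
eg:proc1
    rdf:type opmv:Process ;
    opmv:used eg:in1 ;
    opmv:wasStartedAt eg:tStart ;
    opmv:wasEndedAt eg:tEnd .

# the input was used shortly after the process started
eg:in1
    rdf:type opmv:Artifact ;
    opmv:wasUsedAt eg:tUse .

# the output was generated shortly before the process ended
eg:out1
    rdf:type opmv:Artifact ;
    opmv:wasGeneratedBy eg:proc1 ;
    opmv:wasGeneratedAt eg:tGen .

eg:tStart rdf:type time:Instant ; time:inXSDDateTime "2010-10-07T12:00:00Z"^^xsd:dateTime .
eg:tUse   rdf:type time:Instant ; time:inXSDDateTime "2010-10-07T12:01:00Z"^^xsd:dateTime .
eg:tGen   rdf:type time:Instant ; time:inXSDDateTime "2010-10-07T12:08:00Z"^^xsd:dateTime .
eg:tEnd   rdf:type time:Instant ; time:inXSDDateTime "2010-10-07T12:09:00Z"^^xsd:dateTime .
```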
Provenance information about an entity, either an artifact or an agent, can be very broad and very fine-grained. Although very detailed provenance information provides a very precise recording of what happened and evidence for the existence of the entity, it can lead to unnecessary performance and scalability burdens. The minimum provenance information about an entity at a specific state should include at least information about the when and who, for example, when an artifact was created, and by whom. This section describes how OPMV and other vocabularies can be used to provide the basic provenance and the following section explains how more detailed provenance information can be expressed using OPMV and related vocabularies.
We start by describing how the basic provenance information about an artifact can be represented using OPMV or Dublin Core and what the implications of using either or both vocabularies are. We then describe how the basic provenance information about an agent can be represented using DOAP and/or Dublin Core. Finally, we provide further examples to show how OPMV can be used to describe the provenance of different types of artifacts, either those that are merely physical or those with only a digital representation, and how it can be used together with Named Graphs to describe the provenance of artifacts at different levels of granularity.
The following example shows how the OPMV core can be used to express provenance information about an artifact, i.e. when it was created, by whom.
#### when an artifact was created, by whom
eg:d0
rdf:type opmv:Artifact ;
opmv:wasGeneratedAt eg:t0 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p0
]
.
eg:t0
rdf:type time:Instant ;
time:inXSDDateTime "2010-10-07T12:09:00Z"^^xsd:dateTime ;
.
eg:p0
rdf:type opmv:Agent, foaf:Agent ;
.
Because OPMV is a process-oriented provenance vocabulary, the existence of an entity must be scoped in a process. For instance, in our example, the creator of an entity cannot be expressed without explicitly stating the process that the creator performed and that led to the creation of this entity. In contrast, Dublin Core is a resource-oriented metadata schema. A provenance statement can be directly associated with a resource, making it much less verbose than OPMV for expressing the above simple provenance information, as shown by the example below.
#### when an artifact was created, by whom
eg:d0
rdf:type opmv:Artifact ;
dcterms:created "2010-10-07T12:09:00Z" ;
dcterms:creator eg:p0 ;
.
eg:p0
rdf:type dcterms:Agent, foaf:Agent, opmv:Agent
.
The DC Terms can be used together with OPMV to describe the provenance of an entity. However, users should note that the range of dcterms:created is a literal, which is different from that of opmv:wasGeneratedAt. The dcterms:creator property can be used to replace the similar statement expressed using OPMV. However, we have no effective means to express the mapping between dcterms:creator and opmv:wasPerformedBy at the moment. When using DC Terms to express the creator information, users lose the interoperability of their provenance information with other provenance information expressed using OPMV or other OPM serializations. This is one drawback to be aware of.
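One way to keep the brevity of DC Terms without losing OPM interoperability is to publish both sets of statements side by side, as in the sketch below. It reuses the resources of the two preceding examples and assumes that the dcterms prefix is bound to the Dublin Core terms namespace http://purl.org/dc/terms/. Note that, since no mapping between the two vocabularies is defined, the publisher has to keep the two descriptions consistent by hand.

```turtle
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix time:    <http://www.w3.org/2006/time#> .
@prefix opmv:    <http://purl.org/net/opmv/ns#> .
@prefix eg:      <http://example.org.uk/> .

eg:d0
    rdf:type opmv:Artifact ;
    # resource-oriented view: the creation time is a literal
    dcterms:created "2010-10-07T12:09:00Z"^^xsd:dateTime ;
    dcterms:creator eg:p0 ;
    # process-oriented view: the creation time is a time:Instant resource
    opmv:wasGeneratedAt eg:t0 ;
    opmv:wasGeneratedBy [
        rdf:type opmv:Process ;
        opmv:wasPerformedBy eg:p0
    ] .

eg:t0
    rdf:type time:Instant ;
    time:inXSDDateTime "2010-10-07T12:09:00Z"^^xsd:dateTime .

eg:p0
    rdf:type opmv:Agent, foaf:Agent .
```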
An agent can be a person or an organization that controlled a process execution; it can also be a service or a software tool that performed the execution. The DOAP (Description of a Project) vocabulary, a vocabulary for describing software projects (http://usefulinc.com/ns/doap), can be used to express provenance information about a service or tool. For example, we can describe the provenance of the software tool that was used for creating the artifact eg:d0, including when it was created and who developed it.
### when a software release was created, by whom
eg:s0
rdf:type doap:Version ; ### a specific version of a software project release
doap:revision "0.0" ;
doap:created "2010-10-19" ;
.
eg:prj0
rdf:type doap:Project ;
doap:release eg:s0 ;
doap:developer eg:stuart ;
doap:maintainer eg:stuart ;
.
eg:stuart rdf:type foaf:Person .
Similarly, some of the above information can equally be expressed using the DC Terms, as shown below.
### when a software release was created, by whom
eg:s0
rdf:type doap:Version ; ### a specific version of a software project release
doap:revision "0.0" ;
dcterms:created "2010-10-19" ;
.
eg:prj0
rdf:type doap:Project ;
doap:release eg:s0 ;
dcterms:creator eg:stuart ;
doap:maintainer eg:stuart ;
.
eg:stuart rdf:type foaf:Person .
dcterms:created can be used to express the same information as doap:created, while dcterms:creator might have slightly different semantics from doap:developer.
OPMV can be used to describe the provenance both of artifacts that have a physical embodiment, such as an organization, and of those with only a digital representation in a computer system, such as an RDF graph.
The following example shows how OPMV can be used to describe the provenance of a non-digital object. The Organization Ontology (http://www.epimorphics.com/public/vocabulary/org.html) is a vocabulary for describing organizational structures. OPMV has been reused in the Organization Ontology to describe historical changes of organizational structure, as illustrated below. Because an org:Organization is an opmv:Artifact, we can use OPMV to express who created the organization and when.
#### when an organization was created, by whom
eg:org0
rdf:type org:Organization, opmv:Artifact ;
opmv:wasGeneratedAt eg:t1 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p1
]
.
eg:t1
rdf:type time:Instant ;
time:inXSDDateTime "2007-10-07T14:51:00Z"^^xsd:dateTime ;
.
eg:p1
rdf:type opmv:Agent, foaf:Agent ;
.
An org:Organization "represents a collection of people organized together into a community or other social, commercial or political structure". It is a sub-class of foaf:Agent as well as opmv:Artifact. This is consistent with the OPMV vocabulary because we do not define opmv:Agent as being disjoint with opmv:Artifact. Because foaf:Agent is defined as an owl:equivalentClass of opmv:Agent, the example above also shows how OPMV can be used to describe the provenance of a human type of agent.
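The class relationships described in this paragraph can be written down as the following sketch. It is an illustrative rendering of the statements in this guide rather than an extract from the Organization Ontology or the OPMV ontology files.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix org:  <http://www.w3.org/ns/org#> .
@prefix opmv: <http://purl.org/net/opmv/ns#> .

# an organization is both an agent and an artifact
org:Organization
    rdfs:subClassOf foaf:Agent, opmv:Artifact .

# FOAF agents and OPMV agents are the same class
foaf:Agent
    owl:equivalentClass opmv:Agent .

# Given these axioms, a reasoner can infer from
#   eg:org0 rdf:type org:Organization .
# the additional statements
#   eg:org0 rdf:type foaf:Agent, opmv:Agent, opmv:Artifact .
```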
OPMV can be used together with Named Graphs to describe provenance information for artifacts at different levels of granularity. An OPMV artifact can be of any level of granularity: a single RDF triple or a collection of RDF triples. A Named Graph can be used to refer to that one RDF triple or that collection of RDF triples. Such a graph can be an opmv:Artifact. The example below shows how a Named Graph is used to refer to one RDF statement so that we can describe who published that statement and when.
#### when an RDF statement was published, by whom
eg:g0 {
eg:d1 rdf:type org:Organization .
}
eg:g0 rdf:type , opmv:Artifact ;
opmv:wasGeneratedAt eg:t2 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p2
]
.
eg:t2
rdf:type time:Instant ;
time:inXSDDateTime "2009-10-10T15:14:00Z"^^xsd:dateTime ;
.
eg:p2
rdf:type opmv:Agent, foaf:Agent ;
.
Named Graphs are our recommended way to describe the provenance of a set of RDF statements. However, for performance reasons, users might have to use RDF reification, which can express the same information more tersely, although with different semantics; a discussion of those semantics is beyond the scope of this document. For example, the above example that describes the provenance of the RDF statement <eg:d1 rdf:type org:Organization> can be expressed using OPMV and RDF reification as follows:
eg:statementxxxx rdf:type rdf:Statement ;
rdf:subject eg:d1 ;
rdf:predicate rdf:type ;
rdf:object org:Organization ;
rdf:type opmv:Artifact ;
opmv:wasGeneratedAt eg:t2 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p2
]
.
The previous section shows how to express the basic information about when an artifact was created and by whom, which gives some basic credibility to the artifact, just enough to track the responsibility for that artifact. A higher level of credibility can be established by knowing what tools were used to create the artifact and what other source artifacts were used. In this section, we show how OPMV can be used to provide more detailed, finer-grained provenance information, including the source artifact used to create an artifact and more specific information about the process leading to the given state of this artifact.
In this way, users can spot quality issues by inspecting the tools used or by tracing who created or operated the tools, and can trace the propagation of poor-quality artifacts through the derivation paths of artifacts.
Our example describes the exact process and source data that led to an artifact, eg:school1, which represents some information about a school in RDF format. It was transformed from a legacy data format into RDF by a script and published in a Named Graph identified by eg:school1.
eg:school1
rdf:type opmv:Artifact, ;
opmv:wasDerivedFrom eg:queryResult ;
opmv:wasGeneratedBy eg:p0
.
eg:p0
rdf:type opmv:Process ;
opmv:used eg:queryResult ;
opmv:wasPerformedBy eg:netcode ;
opmv:wasControlledBy ;
.
eg:queryResult rdf:type opmv:Artifact ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:used ;
opmv:used eg:query ;
]
.
eg:netcode rdf:type opmv:Agent ;
rdfs:label ".NET code that formats the result of a SQL query on the database as RDF/XML" ;
.
Our example shows that the graph eg:school1 was derived from another artifact, eg:queryResult, and was generated in a process that used this query result and a piece of code identified by eg:netcode, and that was performed by <http://www.jenitennison.com/#me>.
The artifact eg:school1 can be defined as both an opmv:Artifact and an RDF graph. An opmv:Artifact can also be a foaf:Document or a prv:DataItem (from the Provenance Vocabulary), if appropriate. However, because an opmv:Artifact is immutable, a foaf:Document regarded as an opmv:Artifact refers to a document in a specific state, rather than to an abstract embodiment of some work that can be conceived of as a "document". The same applies to a http://www.w3.org/2004/03/trix/rdfg-1/Graph. A prv:DataItem from the Provenance Vocabulary shares the same semantics as an opmv:Artifact, i.e. it refers to an immutable representation of data.
Apart from the basic provenance (the when and who about artifacts and agents), and the detailed provenance describing the process and source artifacts, another type of provenance could be information about changes of an artifact.
An opmv:Artifact can represent both something that has a physical embodiment and something that exists only as a digital representation in a computer system. Describing changes of a physical object, such as an organization or a legislation document, can help data consumers to find out how things have changed over time and trace the relationship between source and resulting objects. Describing changes of digital objects, such as descriptions of an entity that are available in RDF format, is essential for data consumers to find out which descriptions of an entity are most up-to-date and trustworthy and how descriptions of the entity have changed over time.
The latter is particularly important in the context of the Web of Data. Due to the openness of the Web of Data, linked datasets are often replicated and hosted at different locations, under the same or different URI namespaces. Even though these datasets are updated over time, little care is usually taken to record when and how they were updated. Commonly, different copies of statements about the same set of entities exist at the same time on the Web, completely interconnected and intertwined. Without sufficient context information about these statements, data consumers are confronted with questions: Which statements provide the most up-to-date information? How has information about an entity changed over time?
In this section, we show how OPMV can be used to track changes of an artifact, either as a physical or as a digital object.
Take the change history of organizations as an example. Any aspect of organizational structure is subject to change over time. When organizations change substantially, for example through a merger, the change results in a new organization, which will typically be denoted by a new URI. To track changes over time and trace the relationship between the original and resulting organizations, we can provide the following provenance information:
The following example describes when each of the organizations eg:org0 and eg:org1 was created (eg:t3 and eg:t5 respectively) and by whom (eg:p1 in both cases). Additionally, it describes how the resulting organization eg:org1 was changed from the source organization eg:org0 during the change event eg:changeEvent0, which took eg:org0 as the input and produced eg:org1 as the output.
#### when the first organization was created, by whom, and that it was changed in the
#### change event eg:changeEvent0
eg:org0
rdf:type org:Organization, opmv:Artifact ;
opmv:wasGeneratedAt eg:t3 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p1
];
org:changedBy eg:changeEvent0 ;
.
#### when the second organization was created, by whom, and that it resulted from the
#### change event eg:changeEvent0 and was derived from the source organization eg:org0
eg:org1
rdf:type org:Organization, opmv:Artifact ;
opmv:wasGeneratedAt eg:t5 ;
opmv:wasGeneratedBy eg:changeEvent0 ;
org:resultedFrom eg:changeEvent0 ;
opmv:wasDerivedFrom eg:org0 ;
.
#### describe the change event eg:changeEvent0: what the source and the result were, when
#### it was performed, and by whom
eg:changeEvent0
rdf:type org:ChangeEvent, opmv:Process ;
org:originalOrganization eg:org0 ;
opmv:used eg:org0 ;
opmv:wasPerformedBy eg:p1 ;
org:resultingOrganization eg:org1 ;
opmv:wasPerformedAt eg:t4 ;
.
eg:t3
rdf:type time:Instant ;
time:inXSDDateTime "2007-10-07T14:51:00Z"^^xsd:dateTime ;
.
eg:t4
rdf:type time:Instant ;
time:inXSDDateTime "2010-10-20T14:51:00Z"^^xsd:dateTime ;
.
eg:t5
rdf:type time:Instant ;
time:inXSDDateTime "2010-10-20T15:11:00Z"^^xsd:dateTime ;
.
eg:p1
rdf:type opmv:Agent, foaf:Agent ;
.
Not only can the organization itself change, but so can the descriptions of an organization. In the context of Linked Data, such descriptions are a set of triples that are associated with an organization resource. A new URI is used to identify a new organization, but a new URI is not always created when descriptions of an organization are changed. It is up to the data publishers to decide when a new URI should be coined if information about an organization, such as its name or its location, is changed. Mechanisms such as Named Graphs can be used to handle such cases. If each Named Graph identifies the descriptions of an organization at a given state and remains immutable, we can describe the provenance of these Named Graphs just as we describe that of an opmv:Artifact.
In the following example, the first Named Graph eg:g1 contains information about the organization eg:org2 and was generated in 1960, while the second graph eg:g2 contains information about the same organization and was created in 2000 with a change to the organization title. For the purpose of the demonstration, we use the same URI to identify this organization. In practice, it is more appropriate to create a new URI to represent the organization whose title was updated. The example also describes the relationship between these two graphs: eg:g2 was derived from eg:g1.
#### two descriptions of the same organization, each held in its own Named Graph
eg:g1 {
eg:org2 rdf:type org:Organization ;
dc:title "Computing Laboratory" ;
org:hasPrimarySite eg:40002001 ;
.
eg:40002001
rdf:type org:Site ;
dc:title "OUCS" ;
geo:lat "51.76001"^^xsd:float ;
geo:long "-1.26035"^^xsd:float ;
.
}
eg:g2 {
eg:org2 rdf:type org:Organization ;
dc:title "Computing Services" ;
org:hasPrimarySite eg:40002001 ;
.
eg:40002001
rdf:type org:Site ;
dc:title "OUCS" ;
geo:lat "51.76001"^^xsd:float ;
geo:long "-1.26035"^^xsd:float ;
.
}
eg:g1 rdf:type , opmv:Artifact ;
opmv:wasGeneratedAt eg:t6 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p1
]
.
eg:g2 rdf:type , opmv:Artifact ;
opmv:wasGeneratedAt eg:t7 ;
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:wasPerformedBy eg:p1
];
opmv:wasDerivedFrom eg:g1
.
eg:t6
rdf:type time:Instant ;
time:inXSDDateTime "1960-10-20T14:51:00Z"^^xsd:dateTime ;
.
eg:t7
rdf:type time:Instant ;
time:inXSDDateTime "2000-10-20T15:11:00Z"^^xsd:dateTime ;
.
Additionally, we can provide more information about the change of information about organization eg:org2, such as who made it and when, as shown in the following example.
eg:g2
opmv:wasGeneratedBy [
rdf:type opmv:Process ;
opmv:used eg:g1 ;
opmv:wasPerformedBy eg:p2 ;
opmv:wasPerformedAt eg:t7 ;
]
.
Our examples show that the relationship between older and more recent artifacts is expressed in a derivation path, using properties like opmv:wasDerivedFrom or concepts like org:ChangeEvent. However, this information is not sufficient to support queries like "finding the latest information about an organization", or "finding information about an organization ". Additional metadata could be provided in different patterns to achieve different query efficiency [FlyWeb Provenance].
Data publishers can use an official URI, which is destined to always provide the latest information about the resource identified by that URI. This resource URI is linked to previous copies of information about that resource by properties like opmv:wasDerivedFrom. However, this does not work for cases shown in the second and third examples. A vocabulary that can express versioning of datasets is needed, and this is out of the scope of the OPMV vocabulary. Users can refer to related terms from Dublin Core or other vocabularies.
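The "official URI" pattern, complemented by Dublin Core versioning terms, might look like the sketch below. The URI eg:org2-latest is hypothetical, the graphs eg:g1 and eg:g2 are those of the Named Graph example above, and the choice of dcterms:replaces and dcterms:isReplacedBy is only one of several possibilities offered by Dublin Core.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix opmv:    <http://purl.org/net/opmv/ns#> .
@prefix eg:      <http://example.org.uk/> .

# hypothetical official URI that always serves the latest description
eg:org2-latest
    opmv:wasDerivedFrom eg:g2 .

# derivation path back to the older description
eg:g2
    opmv:wasDerivedFrom eg:g1 ;
    dcterms:replaces eg:g1 .      # versioning expressed with Dublin Core

eg:g1
    dcterms:isReplacedBy eg:g2 .
```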
OPMV is created as a very simple provenance vocabulary. Its generic terms might not be sufficient to express provenance information as precisely as expected. We encourage users to extend OPMV in a separate module in order to define the more specialized terms for their specific needs.
So far, the data.gov.uk team has created 3 OPMV typed modules, as mentioned at the beginning of this document. Examples showing how we can use each module to express provenance information more accurately and in more detail can be found in the following links:
Of the 3 type modules that extend OPMV, the common module is designed to keep terms that are commonly needed but not defined in the OPM specification.
Those who wish to propose new terms to the OPMV vocabularies should consider first whether such terms should fall into the common module. Those who wish to create a new typed module, like the XSLT and SPARQL modules, should build upon the common module as well as the OPMV core, in order to reuse as many terms as possible.
Users can host their own OPMV supplementary modules. But we encourage users to use namespace patterns like http://purl.org/net/opmv/types/examplemoduel# and to inform us of their extensions.
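A minimal sketch of such an extension module is given below. The prefix mymod and the terms mymod:Transformation and mymod:usedStylesheet are hypothetical; the namespace simply follows the pattern suggested above. The sketch builds on both the OPMV core and the common module, as recommended.

```turtle
@prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix opmv:   <http://purl.org/net/opmv/ns#> .
@prefix common: <http://purl.org/net/opmv/types/common#> .
@prefix mymod:  <http://purl.org/net/opmv/types/examplemoduel#> .

# hypothetical specialization of the core process class
mymod:Transformation
    rdf:type owl:Class ;
    rdfs:subClassOf opmv:Process ;
    rdfs:label "Transformation" .

# hypothetical specialization of a property from the common module
mymod:usedStylesheet
    rdf:type owl:ObjectProperty ;
    rdfs:subPropertyOf common:usedScript ;
    rdfs:label "used stylesheet" .
```

Defining new terms as sub-classes and sub-properties of existing ones means that data published with the module can still be queried using the generic OPMV terms.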