View on GitHub

Workflow Run RO-Crate

RO-Crate profiles to capture the provenance of workflow runs

Process Run Crate

This profile uses terminology from the RO-Crate 1.1 specification, and extends it with additional terms from the workflow-run ro-terms namespace.

Overview

This profile is used to describe the execution of an implicit workflow, indicating that one or more computational tools have been executed, typically generating some result files that are represented as data entities in the RO-Crate.

By “implicit workflow” we mean that the composition of these tools may have been done by hand (a user executes one tool following another) or by some script that has not yet been included as part of the crate (for instance because it is an embedded part of a larger application).

This profile requires the indication of Software used to create files, namely a SoftwareApplication (the tool) and a CreateAction (the execution of said tool).

The following diagram shows the relationships between provenance-related entities. Note the distinction between prospective provenance (plans for activities, e.g., an application) and retrospective provenance (what actually happened, e.g. the execution of an application).

Entity-relationship diagram

Example Metadata File (ro-crate-metadata.json)

{ "@context": "https://w3id.org/ro/crate/1.1/context", 
  "@graph": [
    {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"}
    },
    {
        "@id": "./",
        "@type": "Dataset",
        "conformsTo": {"@id": "https://w3id.org/ro/wfrun/process/0.1"},
        "hasPart": [
            {"@id": "pics/2017-06-11%2012.56.14.jpg"},
            {"@id": "pics/sepia_fence.jpg"}
        ],
        "mentions": {"@id": "#SepiaConversion_1"},
        "name": "My Pictures"
    },
    {   "@id": "https://w3id.org/ro/wfrun/process/0.1",
        "@type": "CreativeWork",
        "name": "Process Run Crate",
        "version": "0.1"
    },
    {
        "@id": "https://www.imagemagick.org/",
        "@type": "SoftwareApplication",
        "url": "https://www.imagemagick.org/",
        "name": "ImageMagick",
        "softwareVersion": "6.9.7-4"
    },
    {
        "@id": "#SepiaConversion_1",
        "@type": "CreateAction",
        "name": "Convert dog image to sepia",
        "description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11\\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg",
        "endTime": "2018-09-19T17:01:07+10:00",
        "instrument": {"@id": "https://www.imagemagick.org/"},
        "object": {"@id": "pics/2017-06-11%2012.56.14.jpg"},
        "result": {"@id": "pics/sepia_fence.jpg"},
        "agent": {"@id": "https://orcid.org/0000-0001-9842-9718"}
    },
    {
        "@id": "pics/2017-06-11%2012.56.14.jpg",
        "@type": "File",
        "description": "Original image",
        "encodingFormat": "image/jpeg",
        "name": "2017-06-11 12.56.14.jpg (input)"
    },
    {
        "@id": "pics/sepia_fence.jpg",
        "@type": "File",
        "description": "The converted picture, now sepia-colored",
        "encodingFormat": "image/jpeg",
        "name": "sepia_fence (output)"
    },
    {
        "@id": "https://orcid.org/0000-0001-9842-9718",
        "@type": "Person",
        "name": "Stian Soiland-Reyes"
    }
]
}

Note that the command line shown in the action’s description is not directly re-executable, as file paths are not required to match the RO-Crate locations. For a more structural and reproducible description of tool executions, see Workflow Run Crate.

Requirements

Property Required? Description
Dataset (the root data entity, e.g. "@id": "./")
conformsTo MUST MUST reference a CreativeWork entity with an @id URI that is consistent with the versioned Permalink of this document, e.g. {"@id": "https://w3id.org/ro/wfrun/process/0.1"}
SoftwareApplication
@type MUST SHOULD include SoftwareApplication, SoftwareSourceCode or ComputationalWorkflow
@id MUST SHOULD be an absolute URI, but MAY be a relative URI to a data entity in the crate (e.g. "bin/simulation4") or a local identifier for tools that are not otherwise described on the web (e.g. "#statistical-analysis")
name SHOULD A human readable name for the tool in general (not just how it was used here)
url SHOULD Homepage, documentation or source for the tool
version SHOULD The version string for the software application. In the case of a SoftwareApplication, this MAY be provided via the more specific softwareVersion. SoftwareApplication entities SHOULD NOT specify both version and softwareVersion: in this case, consumers SHOULD prioritize softwareVersion. In order to facilitate comparison attempts by consumers, it is RECOMMENDED to specify a machine-readable version string if available (see for instance Python's PEP 440).
CreateAction
@type MUST SHOULD be CreateAction to indicate that this tool created the result data entities. MAY be ActivateAction if the provenance does not include any result. MAY be UpdateAction if the tool modified an existing data entity or database in-place.
@id MUST A unique identifier for the execution, e.g. "urn:uuid:50ec5c76-1f7a-4130-8ef6-846756b228c1", "#f99a8e6c". MAY be an absolute URI, e.g. http://example.com/runs/846756b228c1. The use of randomly generated UUIDs (type 4) is RECOMMENDED. SHOULD be listed under mentions of the root data entity.
name SHOULD Short human-readable description of the execution.
description SHOULD Details of the execution, for instance command line arguments or settings. This field is for information only, no particular structure is to be assumed.
endTime SHOULD The time the process ended, i.e. when the last of the entities in result has been created. SHOULD be a DateTime in ISO 8601 format.
startTime MAY The time the process started, i.e. the earliest time the process may have accessed an entity in object. SHOULD be a DateTime in ISO 8601 format.
instrument MUST Identifier of the executed tool.
agent SHOULD Identifier of a Person or Organization contextual entity that started/executed this tool.
object MAY The identifier of one or more entities of the RO-Crate that were consumed by this action, e.g. input files or reference datasets.
result SHOULD The identifier of one or more entities that were created or modified by this action, e.g. output files.
actionStatus MAY SHOULD be CompletedActionStatus if the process completed successfully or FailedActionStatus if it failed to complete. In the latter case, consumers should be prepared for the absence of any dependent actions in the metadata. If this attribute is not specified, consumers should assume that the process completed successfully.
error MAY Additional information on the cause of the failure, such as an error message from the application, if available. SHOULD NOT be specified unless actionStatus is set to FailedActionStatus.

Entities referenced by an action’s object or result SHOULD be of type File (an RO-Crate alias for MediaObject) for files, Dataset for directories and Collection for multi-file datasets, but MAY be a CreativeWork for other types of data (e.g. an online database); they MAY be of type PropertyValue to capture numbers/strings that are not stored as files.

Data entities involved in an application’s input and output SHOULD have an @id that reflects the original file or directory name as processed by the application, but MAY be renamed to avoid clashes with other entities in the crate. In this case, they SHOULD refer to the original name via alternateName. This is particularly important to support reproducibility in cases where an application expects to find input in specific locations and with specific names (see the MIRAX example in Representing multi-file objects).

Multiple processes

A process crate can be used to indicate one single execution as a single CreateAction, or a series of processes that generate different data entities. These actions MAY form an implicit workflow by following the links between entities that appear as result in an action and as object in the following one, but a process crate is not required to ensure such consistency (e.g. there may be an intermediate action that has not been recorded).

Multiple processes diagram

Referencing configuration files

Some applications support the modification of their behavior via configuration files. Typically, these are not part of the input interface, but are searched for by the application among a set of possible predefined file system paths. In the case of applications that support a configuration file, the specific configuration file used during a run SHOULD be added to the object attribute of the corresponding CreateAction, especially if its settings are different from the default ones.

    {
        "@id": "#SepiaConversion_1",
        "@type": "CreateAction",
        "name": "Convert dog image to sepia",
        "description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11\\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg",
        "endTime": "2018-09-19T17:01:07+10:00",
        "instrument": {"@id": "https://www.imagemagick.org/"},
        "object": [
            {"@id": "pics/2017-06-11%2012.56.14.jpg"},
            {"@id": "SepiaConversion_1/colors.xml"}
        ]
        "result": {"@id": "pics/sepia_fence.jpg"},
        "agent": {"@id": "https://orcid.org/0000-0001-9842-9718"}
    },
    {
        "@id": "SepiaConversion_1/colors.xml",
        "@type": "File",
        "description": "Imagemagick color names configuration",
        "encodingFormat": "text/xml",
        "name": "colors"
    }

Representing multi-file objects

In some formats, the data belonging to a digital entity is stored in more than one file. For instance, the Mirax2-Fluorescence-2 image is stored as the following set of files:

Mirax2-Fluorescence-2.mrxs
Mirax2-Fluorescence-2/Index.dat
Mirax2-Fluorescence-2/Slidedat.ini
Mirax2-Fluorescence-2/Data0000.dat
Mirax2-Fluorescence-2/Data0001.dat
...
Mirax2-Fluorescence-2/Data0023.dat

An application that reads this format needs to be pointed to the .mrxs file, and expects to find a directory containing the other files in the same location as the .mrxs file, with the same name minus the extension. Thus, even though an application that processes MIRAX files would probably take only the .mrxs file as argument, the other ones must be present in the expected location and with the expected names (in CWL, this kind of relationship is expressed via secondaryFiles). In this case, the object SHOULD be represented by a contextual entity of type Collection listing all files under hasPart, with a mainEntity referencing the main file. The collection SHOULD be referenced from the root data entity via mentions.

{
    "@id": "./",
    "@type": "Dataset",
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"},
        {"@id": "Mirax2-Fluorescence-2.png"}
    ],
    "mentions": [
        {"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"},
        {"@id": "#conversion_1"}
    ]
},
{
    "@id": "https://openslide.org/",
    "@type": "SoftwareApplication",
    "url": "https://openslide.org/",
    "name": "OpenSlide",
    "version": "3.4.1"
},
{
    "@id": "#conversion_1",
    "@type": "CreateAction",
    "name": "Convert image to PNG",
    "endTime": "2018-09-19T17:01:07+10:00",
    "instrument": {"@id": "https://openslide.org/"},
    "object": {"@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip"},
    "result": {"@id": "Mirax2-Fluorescence-2.png"}
},
{
    "@id": "https://openslide.cs.cmu.edu/download/openslide-testdata/Mirax/Mirax2-Fluorescence-2.zip",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
},
{
    "@id": "Mirax2-Fluorescence-2.mrxs",
    "@type": "File"
},
{
    "@id": "Mirax2-Fluorescence-2/",
    "@type": "Dataset"
},
{
    "@id": "Mirax2-Fluorescence-2.png",
    "@type": "File"
}

If the collection does not have a web presence, its @id can be an arbitrary internal one, possibly randomly generated (as for any other contextual entity):

{
    "@id": "#af0253d688f3409a2c6d24bf6b35df7c4e271292",
    "@type": "Collection",
    "mainEntity": {"@id": "Mirax2-Fluorescence-2.mrxs"},
    "hasPart": [
        {"@id": "Mirax2-Fluorescence-2.mrxs"},
        {"@id": "Mirax2-Fluorescence-2/"}
    ]
}

The use case shown here is an example of a situation where it’s important to refer to the original names in case any renamings took place, as described in Requirements:

{
    "@id": "#af0253d688f3409a2c6d24bf6b35df7c4e271292",
    "@type": "Collection",
    "mainEntity": {"@id": "f62aa607a75508ac5fc6a22e9c0e39ef58a2c852"},
    "hasPart": [
        {"@id": "f62aa607a75508ac5fc6a22e9c0e39ef58a2c852"},
        {"@id": "c7398fbf741b851e80ae731d60cbee9258ff81f3/"}
    ]
},
{
    "@id": "f62aa607a75508ac5fc6a22e9c0e39ef58a2c852",
    "@type": "File",
    "alternateName": "Mirax2-Fluorescence-2.mrxs"
},
{
    "@id": "c7398fbf741b851e80ae731d60cbee9258ff81f3/",
    "@type": "Dataset",
    "alternateName": "Mirax2-Fluorescence-2/",
    "hasPart": [
        {"@id": "c7398fbf741b851e80ae731d60cbee9258ff81f3/46c443af080a36000c9298b49b675eb240eeb41c"},
        ...
    ]
},
{
    "@id": "c7398fbf741b851e80ae731d60cbee9258ff81f3/46c443af080a36000c9298b49b675eb240eeb41c",
    "@type": "File",
    "alternateName": "Mirax2-Fluorescence-2/Index.dat"
},
...

Representing environment variable settings

The behavior of some applications may be modified by setting appropriate environment variables. These are different from ordinary application inputs in that they are part of the environment in which the process runs, rather than parameters supplied through a command line or a graphical interface. To represent the fact that an environment variable was set to a certain value during the execution of an action, use the environment property from the workflow-run ro-terms namespace, making it point to a PropertyValue that describes the setting:

{
    "@context": [
        "https://w3id.org/ro/crate/1.1/context",
        "https://w3id.org/ro/terms/workflow-run"
    ],
    "@graph": [
        ...
        {
            "@id": "#SepiaConversion_1",
            "@type": "CreateAction",
            "instrument": {"@id": "https://www.imagemagick.org/"},
            "object": {"@id": "pics/2017-06-11%2012.56.14.jpg"},
            "result": {"@id": "pics/sepia_fence.jpg"},
            "environment": [
                {"@id": "#height-limit-pv"},
                {"@id": "#width-limit-pv"}
            ]
        },
        {
            "@id": "#width-limit-pv",
            "@type": "PropertyValue",
            "name": "MAGICK_WIDTH_LIMIT",
            "value": "4096"
        },
        {
            "@id": "#height-limit-pv",
            "@type": "PropertyValue",
            "name": "MAGICK_HEIGHT_LIMIT",
            "value": "3072"
        }
    ]
}

Note that we added the workflow-run context to the @context entry in order to bring in the definition of environment.

Environment variable settings SHOULD be listed if they are different from the default ones (usually unset) and affected the results of the action.

Representing container images

An application may use one or more container images (e.g. Docker container images) to perform its duty. An action MAY indicate that a container image was used during the execution via the containerImage property, defined in the workflow-run ro-terms namespace.

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": {"@id": "#samtools-image"}
},
{
    "@id": "#samtools-image",
    "@type": "ContainerImage",
    "additionalType": {"@id": "https://w3id.org/ro/terms/workflow-run#DockerImage"},
    "registry": "docker.io",
    "name": "biocontainers/samtools",
    "tag": "v1.9-4-deb_cv1",
    "sha256": "da61624fda230e94867c9429ca1112e1e77c24e500b52dfc84eaf2f5820b4a2a"
}

The ContainerImage type (note the leading lowercase “C”) and most of the properties shown above are also defined in the workflow-run namespace. The additionalType describes the specific image type (e.g., DockerImage, SIFImage); the registry is the service that hosts the image (e.g., “docker.io”, “quay.io”); the name is the identifier of the image within the registry; tag describes the image tag and sha256 its sha256 checksum. A ContainerImage entity SHOULD list at least the additionalType, registry and name properties.

Alternatively, the containerImage could point to a URL. For instance:

{
    "@id": "#cb04c897-eb92-4c53-8a38-bcc1a16fd650",
    "@type": "CreateAction",
    "instrument": {"@id": "bam2fastq.cwl"},
    ...
    "containerImage": "https://example.com/samtools.sif"
}

Specifying software dependencies

Software dependencies MAY be specified using softwareRequirements to a SoftwareApplication:

{
    "@id": "script.py",
    "@type": "SoftwareApplication",
    "name": "Analysis Script",
    "version": "0.1",
    "softwareRequirements": {"@id": "https://pypi.org/project/numpy/1.26.2/"}
},
{
    "@id": "https://pypi.org/project/numpy/1.26.2/",
    "@type": "SoftwareApplication",
    "name": "NumPy",
    "version": "1.26.2"
}