Creating a community annotation compliant ISA descriptor: Dataset Maturity Level 2¶

Abstract:¶
With this recipe, we continue using the ISA model to capture study metadata and the dataset description, but we move up the scale of dataset maturity by showing how to implement four important aspects:
community-agreed experimental design descriptors
community-agreed annotation levels for the description of biological materials
extensive semantic annotation and reduction of free-text annotation
a machine-readable and more interoperable format: serialization to ISA-JSON
Let’s get started! Loading the ISA-API¶
# ISA-API model classes used by the code in this recipe
from isatools.model import (
    Comment,
    FactorValue,
    Investigation,
    OntologyAnnotation,
    OntologySource,
    Protocol,
    ProtocolParameter,
    Publication,
    Study,
    StudyFactor,
)
Creating an ISA investigation¶
# The investigation object referenced by the code further down
investigation = Investigation()
Declaring the semantic resources used to annotate ISA objects¶
ncbitaxon = OntologySource(name="NCBITaxon",
                           description="NCBI Taxonomy")
efo = OntologySource(name="EFO",
                     description="Experimental Factor Ontology")
obi = OntologySource(name="OBI",
                     description="Ontology for Biomedical Investigations")
chebi = OntologySource(name="CHEBI",
                       description="Chemical Entities of Biological Interest")
pato = OntologySource(name="PATO",
                      description="Phenotype and Trait Ontology")
investigation.ontology_source_references = [chebi, efo, obi, pato, ncbitaxon]
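Every `term_source` used in a later annotation should point to one of the ontology sources declared here. The following is a minimal sketch of that cross-check in plain Python, independent of the ISA-API. The helper `undeclared_sources` is hypothetical, the dictionaries are stand-ins for `OntologySource` and `OntologyAnnotation` objects, and the `ENVO` entry is a made-up example of a source that was never declared.

```python
# Plain-Python stand-ins for illustration; these are not ISA-API objects.
declared = ["CHEBI", "EFO", "OBI", "PATO", "NCBITaxon"]

annotations = [
    {"term": "chemical substance", "term_source": "CHEBI"},
    {"term": "dose", "term_source": "EFO"},
    {"term": "time", "term_source": "PATO"},
    # Made-up example of an annotation whose source was never declared.
    {"term": "sea water", "term_source": "ENVO"},
]


def undeclared_sources(annotations, declared):
    """Return the sorted source names used by annotations but never declared."""
    used = {annotation["term_source"] for annotation in annotations}
    return sorted(used - set(declared))


print(undeclared_sources(annotations, declared))
```

An empty list means every annotation can be resolved against a declared source.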
Building and annotating ISA objects to a community-compliant level¶
The annotation requirements we are dealing with are:
those of the ENA Short Read Archive (SRA) repository
those of a Minimal Information checklist: MINSEQE (Minimum Information about a high-throughput nucleotide SEQuencing Experiment).
study = Study(filename="s_BII-S-3-synthesis.txt")
study.identifier = "BII-S-3-synth"
study.title = "Metagenomes and Metatranscriptomes of phytoplankton blooms from an ocean acidification mesocosm experiment"
# Adjacent string literals are concatenated, so each fragment ends with an explicit space.
study.description = (
    "Sequencing the metatranscriptome can provide information about the response of organisms to "
    "varying environmental conditions. We present a methodology for obtaining random whole-community mRNA from a complex "
    "microbial assemblage using Pyrosequencing. The metatranscriptome had, with minimum contamination by ribosomal RNA, "
    "significant coverage of abundant transcripts, and included significantly more potentially novel proteins than in the "
    "metagenome. This experiment is part of a much larger experiment. We have produced 4 454 metatranscriptomic datasets "
    "and 6 454 metagenomic datasets. These were derived from 4 samples."
)
study.submission_date = "2008-08-15"
study.public_release_date = "2008-08-15"
These NCBI SRA-related ISA Comment fields are required: they must be present for the ISA SRA converter to be invoked later.
src_comment_sra1 = Comment(name="SRA Broker Name", value="OXFORD")
src_comment_sra2 = Comment(name="SRA Center Name", value="OXFORD")
src_comment_sra3 = Comment(name="SRA Center Project Name", value="OXFORD")
src_comment_sra4 = Comment(name="SRA Lab Name", value="Oxford e-Research Centre")
src_comment_sra5 = Comment(name="SRA Submission Action", value="ADD")
study.comments.append(src_comment_sra1)
study.comments.append(src_comment_sra2)
study.comments.append(src_comment_sra3)
study.comments.append(src_comment_sra4)
study.comments.append(src_comment_sra5)
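Because the converter depends on these comments, it can help to check for them before attempting a conversion. Here is a minimal sketch in plain Python, not part of the ISA-API: `missing_sra_comments` is a hypothetical helper, and the list of required names is simply the five comments declared above rather than an authoritative list from the converter.

```python
# Names taken from the comments declared in this recipe (an assumption,
# not an official list from the SRA converter).
REQUIRED_SRA_COMMENTS = (
    "SRA Broker Name",
    "SRA Center Name",
    "SRA Center Project Name",
    "SRA Lab Name",
    "SRA Submission Action",
)


def missing_sra_comments(comments):
    """Return required comment names that are absent or have an empty value.

    `comments` is an iterable of (name, value) pairs.
    """
    present = {name for name, value in comments if value}
    return [name for name in REQUIRED_SRA_COMMENTS if name not in present]


study_comments = [
    ("SRA Broker Name", "OXFORD"),
    ("SRA Center Name", "OXFORD"),
    ("SRA Center Project Name", "OXFORD"),
    ("SRA Lab Name", "Oxford e-Research Centre"),
    ("SRA Submission Action", "ADD"),
]

print(missing_sra_comments(study_comments))      # complete set
print(missing_sra_comments(study_comments[:3]))  # two comments left out
```

With real ISA objects, the pairs could be built from the `name` and `value` attributes of each entry in `study.comments`.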
These ISA Comments are optional and may be used to report funding information.
Declaring all the protocols used in the ISA study.¶
Tip: note also the declaration of ISA Protocol Parameters where needed.
# Descriptions use adjacent string literals so that words are not fused at line breaks.
study.protocols = [
    Protocol(
        name="environmental material collection - standard procedure 1",
        description=(
            "Water samples were prefiltered through a 1.6 um GF/A glass fibre filter to reduce Eukaryotic "
            "contamination. Filtrate was then collected on a 0.2 um Sterivex (millipore) filter which was frozen in "
            "liquid nitrogen until nucleic acid extraction. CO2 bubbled through 11000 L mesocosm to simulate ocean "
            "acidification predicted conditions. Then phosphate and nitrate were added to induce a phytoplankton bloom."
        ),
        protocol_type=OntologyAnnotation(term="sample collection"),
        parameters=[
            ProtocolParameter(parameter_name=OntologyAnnotation(term="filter pore size")),
        ],
    ),
    Protocol(
        name="aliquoting-0",
        description="aliquoting",
        protocol_type=OntologyAnnotation(term="sample collection"),
    ),
    Protocol(
        name="nucleic acid extraction",
        description=(
            "Total nucleic acid extraction was done as quickly as possible using the method of "
            "Neufeld et al, 2007."
        ),
        protocol_type=OntologyAnnotation(term="nucleic acid extraction"),
    ),
    Protocol(
        name="mRNA extraction - standard procedure 3",
        description=(
            "RNA MinElute + subtractive Hybridization + MEGAclear. For transcriptomics, total RNA "
            "was separated from the columns using the RNA MinElute clean-up kit (Qiagen) and checked for integrity of "
            "rRNA using an Agilent bioanalyser (RNA nano6000 chip). High integrity rRNA is essential for subtractive "
            "hybridization. Samples were treated with Turbo DNA-free enzyme (Ambion) to remove contaminating DNA. "
            "The rRNA was removed from mRNA by subtractive hybridization (Microbe Express Kit, Ambion), and absence "
            "of rRNA and DNA contamination was confirmed using the Agilent bioanalyser. The mRNA was further "
            "purified with the MEGAclearTM kit (Ambion). Reverse transcription of mRNA was performed using "
            "the SuperScript III enzyme (Invitrogen) with random hexamer primers (Promega). "
            "The cDNA was treated with RiboShredderTM RNase Blend (Epicentre) to remove trace RNA contaminants. "
            "To improve the yield of cDNA, samples were subjected to random amplification using the "
            "GenomiPhi V2 method (GE Healthcare). GenomiPhi technology produces branched DNA molecules "
            "that are recalcitrant to the pyrosequencing methodology. Therefore amplified samples were treated with "
            "S1 nuclease using the method of Zhang et al. 2006."
        ),
        protocol_type=OntologyAnnotation(term="labeling"),  # nucleic acid extraction
    ),
    Protocol(
        name="genomic DNA extraction - standard procedure 4",
        description="superscript+random hexamer primer",
        protocol_type=OntologyAnnotation(term="nucleic acid extraction"),
    ),
    Protocol(
        name="reverse transcription - standard procedure 5",
        description="",
        protocol_type=OntologyAnnotation(term="reverse transcription"),
    ),
    Protocol(
        name="library construction",
        description="",
        protocol_type=OntologyAnnotation(term="library construction"),
        parameters=[
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library strategy")),
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library layout")),
            ProtocolParameter(parameter_name=OntologyAnnotation(term="library selection")),
        ],
    ),
    Protocol(
        name="nucleic acid sequencing",  # pyrosequencing - standard procedure 6
        description=(
            "1. Sample Input and Fragmentation: The Genome Sequencer FLX System supports the sequencing of "
            "samples from a wide variety of starting materials including genomic DNA, PCR products, BACs, and cDNA. "
            "Samples such as genomic DNA and BACs are fractionated into small, 300- to 800-base pair fragments. "
            "For smaller samples, such as small non-coding RNA or PCR amplicons, fragmentation is not required. "
            "Instead, short PCR products amplified using Genome Sequencer fusion primers can be used for immobilization "
            "onto DNA capture beads as shown below."
        ),
        protocol_type=OntologyAnnotation(term="nucleic acid sequencing"),
        parameters=[
            ProtocolParameter(parameter_name=OntologyAnnotation(term="sequencing instrument")),
        ],
    ),
    Protocol(
        name="sequence analysis - standard procedure 7",
        description="",
        protocol_type=OntologyAnnotation(term="data transformation"),
    ),
]
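Declaring a parameter on a protocol is what allows a value to be reported for it later, when the protocol is applied. A minimal sketch of the matching consistency check follows, in plain Python rather than with ISA-API objects. The helper `undeclared_parameters` is hypothetical, the parameter names mirror the declarations above, and the reported values are illustrative placeholders.

```python
# Parameter names per protocol, mirroring the declarations in this recipe.
declared_parameters = {
    "environmental material collection - standard procedure 1": {"filter pore size"},
    "library construction": {"library strategy", "library layout", "library selection"},
    "nucleic acid sequencing": {"sequencing instrument"},
}


def undeclared_parameters(protocol_name, parameter_values):
    """Return sorted parameter names reported for a protocol that never declared them.

    `parameter_values` maps a parameter name to the value reported for it.
    """
    declared = declared_parameters.get(protocol_name, set())
    return sorted(set(parameter_values) - declared)


# Illustrative placeholder values, not taken from a real submission.
collection_values = {"filter pore size": "0.2 um"}
sequencing_values = {
    "sequencing instrument": "Genome Sequencer FLX",
    "library layout": "single",  # declared on "library construction", not here
}

print(undeclared_parameters(
    "environmental material collection - standard procedure 1", collection_values))
print(undeclared_parameters("nucleic acid sequencing", sequencing_values))
```

A non-empty result points to a value that has no declared parameter to attach to on that protocol.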
Adding a Study Design descriptor to the ISA Study object
Declaring the ISA Study Factors, i.e. the independent variables of the experiment¶
study.factors = [
    StudyFactor(name="compound",
                factor_type=OntologyAnnotation(term="chemical substance",
                                               term_accession="http://purl.obolibrary.org/obo/CHEBI_59999",
                                               term_source=chebi)),
    StudyFactor(name="dose",
                factor_type=OntologyAnnotation(term="dose",
                                               term_accession="http://www.ebi.ac.uk/efo/EFO_0000428",
                                               term_source=efo)),
    StudyFactor(name="collection time",
                factor_type=OntologyAnnotation(term="time",
                                               term_accession="http://purl.obolibrary.org/obo/PATO_0000165",
                                               term_source=pato)),
]
Declaring the ISA FactorValues, i.e. the factor levels for each of the independent variables (ISA Study Factors)¶
fv1 = FactorValue(factor_name=study.factors[0], value=OntologyAnnotation(term="carbon dioxide"))
fv2 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="high"))
fv3 = FactorValue(factor_name=study.factors[1], value=OntologyAnnotation(term="normal"))
fv4 = FactorValue(factor_name=study.factors[2], value="may 13th, 2006")
fv5 = FactorValue(factor_name=study.factors[2], value="may 19th, 2006")
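The factor levels declared here span a small full-factorial design. As a minimal sketch in plain Python (no ISA-API objects), the combinations can be enumerated with `itertools.product`. The names `factor_levels` and `study_groups` are illustrative, and whether the four samples mentioned in the study description map one-to-one onto these combinations is an assumption rather than something the recipe states.

```python
from itertools import product

# Factor names and levels copied from the declarations in this recipe.
factor_levels = {
    "compound": ["carbon dioxide"],
    "dose": ["high", "normal"],
    "collection time": ["may 13th, 2006", "may 19th, 2006"],
}

# One dictionary per combination of levels (1 x 2 x 2 combinations).
study_groups = [
    dict(zip(factor_levels, combination))
    for combination in product(*factor_levels.values())
]

for group in study_groups:
    print(group)
```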
Adding the publications associated with the study¶
study.publications = [
    Publication(
        doi="10.1371/journal.pone.0003042",
        pubmed_id="18725995",
        title=(
            "Detection of large numbers of novel sequences in the metatranscriptomes "
            "of complex marine microbial communities."
        ),
        status=OntologyAnnotation(term="indexed in PubMed"),
        author_list="Gilbert JA, Field D, Huang Y, Edwards R, Li W, Gilna P, Joint I.",
    ),
    Publication(
        doi="10.1111/j.1462-2920.2008.01745.x",
        pubmed_id="18783384",
        title="Potential for phosphonoacetate utilization by marine bacteria in temperate coastal waters",
        status=OntologyAnnotation(term="indexed in PubMed"),
        author_list="Gilbert JA, Thomas S, Cooley NA, Kulakova A, Field D, Booth T, McGrath JW, Quinn JP, Joint I.",
    ),
]
Adding both assays to the ISA Study object¶
# assay_g = investigation.studies[0].assays[1]
# assay_t = investigation.studies[0].assays[0]
# assay_t.samples=investigation.studies[0].samples
# extract1 = Material(name="GSM255770.e1")
# extract1.type = "Extract Name"
# extract2 = Material(name="GSM255771.e1")
# extract2.type = "Extract Name"
# extract3 = Material(name="GSM255772.e1")
# extract3.type = "Extract Name"
# extract4 = Material(name="GSM255773.e1")
# extract4.type = "Extract Name"
# extract5 = Material(name="GSM255774.e1")
# extract5.type = "Extract Name"
# assay_t.other_material.append(extract1)
# assay_t.other_material.append(extract2)
# assay_t.other_material.append(extract3)
# assay_t.other_material.append(extract4)
# assay_t.other_material.append(extract5)
# for i in range(len(study.samples)):
# assay_t.process_sequence.append(Process(
# executes_protocol=study.protocols[1],
# inputs=study.samples[i],
# outputs=assay_t.other_material[i]
# ))
# data=DataFile(filename="sequenced-data-{}".format(i), label="Raw Data File")
# assay_t.process_sequence.append(Process(
# executes_protocol=study.protocols[3],
# inputs=assay_t.other_material[i]
# ))
# plink(assay_t.process_sequence[0], assay_t.process_sequence[1])
# assay_t.process_sequence[-1].outputs.append(data)
# assay_t.data_files.append(data)
Writing the ISA object representation to file with the ISA-API dump function¶
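The code for this step is not reproduced on this page. As documented for the ISA-API, serialization relies on a JSON encoder class (`ISAJSONEncoder` in `isatools.isajson`) passed to `json.dumps`. The sketch below shows the same pattern with the standard library only: `StudyStub` and `StubEncoder` are hypothetical stand-ins, not ISA-API classes, and the output is not valid ISA-JSON.

```python
import json


class StudyStub:
    """Hypothetical stand-in for an ISA Study object."""

    def __init__(self, identifier, submission_date):
        self.identifier = identifier
        self.submission_date = submission_date


class StubEncoder(json.JSONEncoder):
    """Teaches json.dumps how to turn a StudyStub into a plain dictionary."""

    def default(self, o):
        if isinstance(o, StudyStub):
            return {"identifier": o.identifier, "submissionDate": o.submission_date}
        return super().default(o)


study_stub = StudyStub("BII-S-3-synth", "2008-08-15")
serialized = json.dumps(study_stub, cls=StubEncoder, sort_keys=True, indent=4)
print(serialized)
```

Passing the encoder through `cls` keeps the objects free of serialization logic, which is why the same call shape works for a whole investigation.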
Reading the ISA document back from disk, loading it into memory, and writing it to disk again to check that the ISA-API load function works as expected¶
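The round trip described in this heading can be sketched with the standard library alone: write a document, read it back, write it again, and compare the two files. This is an illustration of the check, not the ISA-API load function itself. `round_trips` is a hypothetical helper and the sample document is a made-up fragment, not a complete ISA-JSON file.

```python
import json
import os
import tempfile


def round_trips(document):
    """Write, reload, and rewrite a JSON document; return True if both files match."""
    with tempfile.TemporaryDirectory() as tmp:
        first = os.path.join(tmp, "first.json")
        second = os.path.join(tmp, "second.json")

        with open(first, "w", encoding="utf-8") as handle:
            json.dump(document, handle, sort_keys=True, indent=4)

        with open(first, encoding="utf-8") as handle:
            reloaded = json.load(handle)

        with open(second, "w", encoding="utf-8") as handle:
            json.dump(reloaded, handle, sort_keys=True, indent=4)

        with open(first, encoding="utf-8") as a, open(second, encoding="utf-8") as b:
            return a.read() == b.read()


# Made-up fragment for illustration only.
document = {
    "identifier": "BII-S-3-synth",
    "submissionDate": "2008-08-15",
    "comments": [{"name": "SRA Submission Action", "value": "ADD"}],
}

print(round_trips(document))
```

Sorting keys on both writes makes the comparison insensitive to key order, so a mismatch indicates content lost or altered during loading.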
et Voilà!