Plant genomic and genetic variation data submission to EMBL-EBI databases

Recipe Overview
Reading Time
30 minutes
Executable Code
No
Difficulty
Improving dataset maturity - MIAPPE-compliant submission to EMBL-EBI databases
Recipe Type
Guidance
Audience
Everyone
Maturity Level & Indicator
DSM-2-C1, DSM-2-C6

Main Objective

Plant genotyping and phenotyping data are often used in combination to make evidence-based inferences about the expression of different traits. The challenge is not necessarily collecting the data, but keeping them available in public repositories over the long term, with enough metadata to draw conclusions from them, in line with the FAIR principles. Since these repositories are often not directly linked, it is all the more important to provide metadata that let users recognise the links between records through their identifiers. A key point is sample management: the identifiers assigned to samples help both humans and machines to understand which experimental data belong together.

The main objective of this recipe is to provide a means of submitting genotyping data to public repositories and tracking them there, with a particular focus on plants. This includes:

1) Submission of sample data and metadata to BioSamples.
2) Submission of sequencing data and metadata to ENA.
3) Retrieval of the correct genome assembly for the genotyping experiment.
4) Conversion of the resulting analysis file (in VCF format) to make it FAIR.
5) Submission of the genotyping results to EVA.

In terms of FAIRification goals, this means obtaining stable, resolvable identifiers for the datasets and meeting community annotation requirements as expressed in the MIAPPE requirements.

Summary

This recipe provides guidance for submitting plant genotyping data to public repositories. It explains in a step-wise fashion which work should be done and when. Special attention should be paid to the metadata maintenance of the data that will be deposited in different repositories as part of this recipe. A prerequisite for fully understanding this recipe is a basic knowledge of the MIAPPE standard.

The metadata fields required to FAIRify a genotyping dataset within a VCF file are also listed in this recipe, with examples and explanations (see Step 7).

Graphical Overview

Table of Data Standards

| Data Formats | Terminologies | Model |
| --- | --- | --- |
| FASTQ | | MIAPPE |
| FASTA | | BioSamples - Plant MIAPPE checklist |
| Variant Call Format (.vcf) | | |

Step-by-Step Process

Step 1: Take DNA sample

The experimentalist takes a sample of plant biological material. The sample metadata are collected according to the Biological Material section of the MIAPPE specification, which extends the minimal MCPD (Multi-Crop Passport Descriptors) fields with sample traceability information.

Step 2: Register sample material at BioSamples

This is generally done through the JSON API (Python and shell commands are also available). Refer to the official documentation for complete details; for training material and slides, refer to [3]. The proposed procedure is as follows:

  1. Create a user account

First, you need an account to submit samples to the EMBL-EBI BioSamples database. We recommend that new users, and those planning to make downstream submissions to ENA, use the Webin Authentication service. You can create a Webin account using the Webin web interface. Please refer to the ENA documentation for more details about creating an account.

  2. Log in to the system and get a JSON Web Token (JWT)

Webin uses JSON Web Token (JWT) for authentication. Use your login credentials to retrieve a JWT. You can either use the Webin Swagger interface (A) or a programmatic method (B):

A: Go to Webin Swagger and use the POST /token endpoint to retrieve a JWT.

B: Use curl or another programmatic method.

```
# Replace WEBIN_ID and PASSWORD with your Webin credentials.
TOKEN=$(curl -X POST "https://www.ebi.ac.uk/ena/submit/webin/auth/token" \
  -H "accept: */*" -H "Content-Type: application/json" \
  -d "{\"authRealms\":[\"ENA\"],\"password\":\"PASSWORD\",\"username\":\"WEBIN_ID\"}")
```

  3. Allocate accessions via the pre-accessioning endpoint.

Use the pre-accessioning endpoint to reserve BioSamples IDs. This will create private, empty samples with future release dates. The only mandatory field for pre-accessioning is the sample name.

The following curl command returns 3 accessions, as the body contains names for 3 samples.

```
curl 'https://www.ebi.ac.uk/biosamples/samples/bulk-accession?authProvider=WEBIN' \
  -i -X POST \
  -H "Content-Type: application/json;charset=UTF-8" \
  -H "Accept: application/hal+json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '[{"name": "FakeSample 1"}, {"name": "FakeSample 2"}, {"name": "FakeSample 3"}]'
```

Please refer to the BioSamples documentation.
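The two calls above can be supported by a pair of small helpers. This is a minimal sketch, not part of any BioSamples or Webin tooling: the function names `looks_like_jwt` and `names_to_payload` are hypothetical, and sample names are assumed to be one per line without control characters. The first helper checks that `$TOKEN` has the three-part shape of a JWT, so that an error message captured by curl is not sent as a bearer token. The second builds the JSON body for the pre-accessioning request from a list of sample names.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Succeeds if the argument consists of three non-empty, dot-separated
# base64url segments (the general shape of a JWT). It does not verify
# the signature or the expiry date.
looks_like_jwt() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+$'
}

# Reads sample names from standard input (one per line, blank lines
# skipped) and prints a JSON array of {"name": ...} objects.
# Backslashes and double quotes in names are escaped.
names_to_payload() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'BEGIN { printf "[" }
         NF    { printf "%s{\"name\":\"%s\"}", sep, $0; sep = "," }
         END   { print "]" }'
}
```

With these in place, the request body could be supplied as `-d "$(names_to_payload < sample_names.txt)"`, where `sample_names.txt` is a placeholder for your own list, and the request could be guarded with `looks_like_jwt "$TOKEN" || exit 1`.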

More general information is available on the RDMkit [1]. The checklist used for plant samples is the BioSamples - Plant MIAPPE checklist.

Step 3: Perform sequencing of DNA sample

The sequencing staff performs the sequencing of the DNA sample, which is followed by a quality control. The reads are then archived in the institutional Laboratory Information Management System (LIMS).

Step 4: Register and submit sequencing reads to ENA

Submit the sequencing reads to ENA, using BioSamples IDs to identify the material.

4.1 The Study

To begin, you should register a study. Recall that a study describes the purpose of the work you have done, groups other objects beneath it, and controls when the data becomes public. A study is required for all submissions to ENA.

  1. Log in to the Webin Submission Portal with your Webin credentials.

  2. Click the ‘New Submission’ tab and find the ‘Register study (project)’ radio button. Click ‘Next’ to see the study registration interface.

  3. The ‘Short Name’ field should be filled in with something brief and meaningful, e.g.: barley_study_2021

  4. You should take time to provide a descriptive title and informative abstract for your own studies, but these can be edited later if needed. For now, use as your title: GBS Study of Barley from <Your Town/Lab>

  5. When you have completed all required fields, click ‘Submit’ and then confirm.

  6. Now navigate to the ‘Studies’ tab to see the study you just registered. You might need to refresh the page!
    Make a note of its accession numbers, which will resemble ERP##### and PRJEB#####. For a submission, these are the numbers you would cite in any publications involving the data.

4.2 The Sample

The next step is to register the sample, which will give other users essential context for the sequence data you are submitting. The sample describes the source biological material of your sequencing work.

As discussed above, samples are best submitted through BioSamples.

In ENA, samples are required to conform to a checklist of values. Checklists define a set of mandatory and recommended descriptor fields for a given type of sample. It is recommended that you look at these early and make sure you collect all required metadata items for the type of sample you will be registering.

The available checklists can be browsed at: https://www.ebi.ac.uk/ena/browser/checklists. Currently, three plant-specific checklists are available: ERC000020, ERC000035 and ERC000037.

  1. Return to the ‘New Submission’ tab in the submission service. You may need to click the ‘Restart Submission’ button at the bottom of the page.

  2. Find the ‘Register samples’ radio button and click ‘Next’.

  3. You must choose an appropriate checklist of values to be provided for your sample: click ‘Select checklist’ and expand the ‘Other checklists’ group to browse checklists of this type.

  4. Select the appropriate checklist. For general purposes, ERC000037 (‘ENA Plant Sample Checklist’, https://www.ebi.ac.uk/ena/browser/view/ERC000037) should be the default when submitting plant sequencing data. Now click ‘Next’.

  5. Submitters now have the option of including additional fields in their checklist. It is not necessary to include any additional fields, but you can take this opportunity to see which fields are included by default, what requirements they have, and what else is available.

    Click ‘Next’.

  6. Now you have the opportunity to construct a template for a submission of many samples: notice the phrase ‘Template Basic Details’ at the top of the interface. This allows you to define values which are common to all samples in your submission.

    Since we are only submitting one sample, skip this screen by clicking ‘Next’.

  7. On the next screen, click the ‘+ Add’ button to create your sample.

  8. Give a ‘Unique name’ to your sample:

     barley_sample_1
    

    Hint: This value must always be unique among all samples submitted by your account!

  9. Give your sample a title, e.g.:

     Barley sample from <Your Town/Lab> greenhouse
    
  10. Write a brief description for your sample.

  11. Find the appropriate taxon for your sample: for example, enter ‘Hordeum vulgare’ into the box and select it when it appears.

    Further taxonomic details are automatically filled below this box. This box will only accept values which match with a species-rank entry in the NCBI Taxonomy database.

  12. Fill out the remaining fields to the best of your knowledge. Only fields marked with an asterisk (*) are required, but it is best to fill in as many fields as possible.

Note

You can always find out more about what a field means and what information it accepts by hovering over the blue ‘i’.

  13. By the time you have completed this, your sample will be well-annotated and understandable to people finding it in the database later.

    You are ready to submit when all the checks on the right of the page are green ticks, and none are red crosses.

    Click ‘Submit’.

  14. Now navigate to the ‘Samples’ tab to see the sample you just registered. You might need to refresh the page.

    Make a note of its accession numbers as you will need these later:

     ERS####### and SAMEA####### 
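When accession numbers from sections 4.1 and 4.2 are copied by hand into notes or spreadsheets, a quick format check can catch transcription slips. The sketch below classifies a string by the prefixes mentioned in this recipe (PRJEB, ERP, ERS, SAMEA). The function names `matches` and `accession_kind` are hypothetical, and the check only looks at the shape of the string; it does not confirm that the record exists.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Succeeds if string $1 matches extended regular expression $2.
matches() { printf '%s\n' "$1" | grep -Eq "$2"; }

# Prints the kind of record an accession refers to, judged only by its
# prefix followed by digits. Returns non-zero for unrecognised strings.
accession_kind() {
  if   matches "$1" '^PRJEB[0-9]+$'; then echo "study (project accession)"
  elif matches "$1" '^ERP[0-9]+$';   then echo "study (secondary accession)"
  elif matches "$1" '^ERS[0-9]+$';   then echo "sample (ENA accession)"
  elif matches "$1" '^SAMEA[0-9]+$'; then echo "sample (BioSamples accession)"
  else
    echo "unknown"
    return 1
  fi
}
```

For example, `accession_kind SAMEA1234567` prints `sample (BioSamples accession)`, while a placeholder that still contains `#` characters is reported as `unknown`.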
    

4.3 The Read Data

Now that study and sample metadata have been registered, it is time to submit the read data you have produced.

  1. Return to the ‘New Submission’ tab in the submission service. You may need to click the ‘Restart Submission’ button at the bottom of the page.

  2. Find the ‘Submit sequence reads and experiments’ radio button and click ‘Next’.

  3. Next, you are asked to choose which study these reads should be added to. Select the study you registered earlier and click ‘Next’.

  4. Now, you have the opportunity to register samples as part of this submission. However, you already registered a sample, so click ‘Skip’.

  5. Now, select ‘One Fastq file (Single)’ from the list of file options.

  6. Fill out the form to describe your experiments and register the files. Ordinarily, you would need to upload the files to a staging area before submitting them through this interface; refer to the ENA upload guidelines.

    Appropriate metadata for your read data submission should look similar to this:

    • Sample reference:

    • Instrument Model: Illumina HiSeq 2500

    • Library Name: barley_library_1

    • Library Source: GENOMIC

    • Library Selection: Restriction Digest

    • Library Strategy: GBS

    • Library Layout: SINGLE

    • First File Name: HOR_1361_BRG.fastq.gz

Note

If you did not note down your sample accession, you can safely visit the ‘Samples’ tab, copy the accession, then return to the New Submission tab: your progress will be saved.

  7. Click ‘Submit’ and see if your submission validates successfully. If you encounter errors, try using the ‘Download Template Spreadsheet’ button, open the file and check it in this plain text format; it can be easier to fix errors this way than in the interface.

Note

Template spreadsheets are the best way to submit multiple datasets through this interface.

  8. Once your submission is complete, visit the ‘Runs’ tab to review it.
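Before uploading read files to the staging area, it can help to confirm that each compressed FASTQ file is intact and to record a checksum that can be compared after transfer. The following is a minimal sketch under stated assumptions: `gzip` and `md5sum` are available on the system, and the function name `fastq_checksum` is hypothetical. The ENA upload guidelines referred to above remain the authority on what ENA actually expects.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# Tests the gzip stream of a compressed FASTQ file and, if it is intact,
# prints "<md5 checksum><TAB><file name>". Returns non-zero if the file
# is missing or the gzip test fails.
fastq_checksum() {
  file=$1
  [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
  gzip -t "$file" || { echo "gzip test failed: $file" >&2; return 1; }
  sum=$(md5sum "$file" | cut -d' ' -f1)
  printf '%s\t%s\n' "$sum" "$(basename "$file")"
}
```

Running `fastq_checksum HOR_1361_BRG.fastq.gz` on the example file named above would print one tab-separated line that can be kept alongside the submission records.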

Step 5: Check if used reference genome assembly is available

Check whether a GCF / GCA accession number is available for the assembly at https://www.ebi.ac.uk/ena/browser.

  • If yes, continue towards VCF submission (Steps 6 to 8).

  • If no, submit the reference genome assembly to INSDC (NCBI GenBank / EMBL-EBI ENA / DDBJ) and wait until an accession number is issued, then continue with Steps 6 to 8.
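The lookup can be sketched as a helper that first checks the accession has the expected shape (a GCA_ or GCF_ prefix, nine digits and a version suffix) and then prints the ENA Browser page to open. The function name `assembly_browser_url` is hypothetical. The `/view/<accession>` URL pattern is an assumption carried over from the checklist link used earlier in this recipe, and the accession in the usage note is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# Prints the ENA Browser URL for an assembly accession, or fails if the
# string is not of the form GCA_/GCF_ + nine digits + ".<version>".
# The URL pattern is assumed, not taken from ENA documentation.
assembly_browser_url() {
  if printf '%s\n' "$1" | grep -Eq '^GC[AF]_[0-9]{9}\.[0-9]+$'; then
    printf 'https://www.ebi.ac.uk/ena/browser/view/%s\n' "$1"
  else
    echo "not an assembly accession: $1" >&2
    return 1
  fi
}
```

For the placeholder accession `GCA_000000000.1`, the helper prints `https://www.ebi.ac.uk/ena/browser/view/GCA_000000000.1`. Whether that page resolves to a record still has to be checked in the browser.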

Step 6: Analyse sequencing results

The bioinformatician performs the computational analysis, then the genotyping results are archived into the LIMS.

Step 7: Prepare genotyping dataset for submission of VCF file to EVA

To ensure interoperability of VCF files, in accordance with the good practice outlined in [2], the following VCF meta-information lines should be used:

Obligatory meta-information line:

##fileformat: the VCF format version of the file.

Example: ##fileformat=VCFv4.3

When adding new fields:

Please check the official format specifications to avoid redundancy and possible incompatibilities.
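A quick check for the obligatory line can be run before full validation. The sketch below tests that the first line of a plain or gzip-compressed VCF file is a `##fileformat=VCFv<major>.<minor>` line. The function name `has_fileformat_line` is hypothetical, and the check covers only this one line, not the rest of the header.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.

# Succeeds if the first line of the VCF file given as $1 is a
# ##fileformat=VCFv<major>.<minor> meta-information line.
# Files ending in .gz are decompressed on the fly.
# A file with Windows (CRLF) line endings will fail this check.
has_fileformat_line() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac | head -n 1 | grep -Eq '^##fileformat=VCFv[0-9]+\.[0-9]+$'
}
```

It could be used as `has_fileformat_line /path/to/file.vcf || echo "missing ##fileformat line"`.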

Step 8: Submit VCF file to EVA

Once the metadata and data have been formatted according to the specifications above, make sure that the resulting VCF file complies with the VCF specification. For that purpose, we propose the VCF validator tool, available on GitHub:

```
vcf_validator -i /path/to/file.vcf
vcf_validator -i /path/to/compressed_file.vcf.gz
```

Once the file has been fully validated without any error messages, you can submit the VCF file to EVA, using BioSamples IDs to identify the material, GCF/GCA accession for the reference genome assembly, and ENA accession numbers for the sequencing reads of the material used. Refer to the official documentation.

Reference

Authors