Critical Assessment of Metagenome Interpretation

File formats
Metagenome assembly

Outline

The CAMI assembly format is a FASTA-formatted contig and scaffold file.

The header section

The header section of the FASTA file must follow the standard FASTA guidelines: each header line starts with the ">" character, followed by any combination of alphanumeric characters "[A-Za-z0-9]", whitespace (spaces and tabs), brackets "[]", underscores "_", colons and semicolons "[:;]", commas ",", periods ".", pipes "|", or hyphens "-", and ends with a newline character.

The sequence section

Sequences must contain only uppercase nucleotide characters, including the ambiguous character 'N': /[ATCGN]+/.

There is no requirement on the number of nucleotides per line, but community best practices advise limiting each line to 120 nucleotides.
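The header and sequence rules above can be checked mechanically. The following is a minimal sketch, not an official CAMI validator: the function name validate_assembly_fasta is hypothetical, digits are assumed to be allowed in headers (the example headers use them), and blank lines are skipped because the text does not rule on them.

```python
import re

# Assumption: "alphanumeric" includes digits, since the example headers use them.
HEADER_RE = re.compile(r">[A-Za-z0-9 \t\[\]_:;,.|-]+")
SEQUENCE_RE = re.compile(r"[ATCGN]+")


def validate_assembly_fasta(lines):
    """Return a list of (line_number, message) problems; empty if none are found."""
    problems = []
    seen_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # assumption: blank lines are ignored
        if line.startswith(">"):
            seen_header = True
            if not HEADER_RE.fullmatch(line):
                problems.append((number, "invalid character in header"))
        elif not seen_header:
            problems.append((number, "sequence data before first header"))
        elif not SEQUENCE_RE.fullmatch(line):
            problems.append((number, "sequence must match [ATCGN]+"))
    return problems
```

A file can be checked by passing the open file handle, for example `validate_assembly_fasta(open("contigs.fasta"))`, where the file name is a placeholder.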

Examples

>NODE_1331_length_1852_cov_20.7969_ID_2661
CTATTTTTAAATATCCGCGTACTTACTGAGTTATCGTGGCAGTTTTTGCTAATTTTGCGATTTTTGCCACGATAACTTGT
TAAAAAATATGACGTCTGCAATTCTTGTTGCTTTTACACTACAAAAAAGATAAAATATTAGTATGAAAGCAACAGAAGTG
AGGCTATTATGTCAAACAAAGAGATGCTCCTTGACTATTTAGATAACCATCATGGAATAATTACATACAAAGATTGTAAA
ACATTAAATATTCCAACTATATATTTAACTCGTTTGAAAAACGAAGGTGTTCTTAATCGAATTGAAAGGGGAATTTTTCT
CTCTTCCAATGGTGATTATGACGAATATTACTTCTTTCAATATCGTTATCCACAAGCTGTATTCTCTTATGTTTCAGCTC
TTTATCTTCAAGGTTTTACAGATGAAATTCCACAACATTTTGAAGTAAGCGTTCCTAGAGGATATCGTTTTAATAATCCA
CCATCAAATTTAACTATTCATTGGGTCTCAAAATTGTATAGCAAATTAGGCATTACTACAACGATTACCACAATGGGGAA
TAAAGTACGTATTTATGATTTTGAACGCATTATTTGTGATTTTATAATAAACAGAAACAGTATTGATTCTGAACTCTTTG
TTAAAACTCTTCAAGCTTACAGTAGGTATAACGGGAAAGATCTTATAAAGC
>54257
AAAATAGACGCACAAGCTGGCACATCACAACGAATAACCAAGCATAGAAATGAAGAAAAAGAAAATAATCATGGGAGTAG
GAACAGGAATCCTTTTGGCTGCTGTTGCTTTTTGGGGGTGGCATTCCACACAAGCAACTTCGACAGAAATAACAAATGAA
ATGGAAAGCGCCATGCACAACGAGCCTGTTGGTCCTGCTTTTGAAGCCGATTCTGCGTACCGTTATATTGAAGCACAATG
TAGTTTCGGCCCTCGCACCATGAACAGCGAAGCACACGAACAATGTGCGGAGTGGATTAT

Genome and taxon binning

Outline

The CAMI genome and taxon binning format contains a header section followed by lines of tab-separated sequence ID-to-bin ID assignments. The binning can be of the reads of a dataset sample or the contigs of an assembly.

The header section

#CAMI Submission for Binning

This line is an optional comment line describing the file content.

@SampleID:SAMPLEID

SAMPLEID is a unique identifier for a dataset sample.

@@SEQUENCEID TAXID BINID

The last line is the header for the binning section and begins with '@@'. It defines the content of the columns in a tab-separated format.

The binning section

The TAXID is the taxonomic assignment of your binned sequence, given as an NCBI taxon ID. It is used to evaluate your predictions with taxonomy-based measures.

The BINID entries can be arbitrary alphanumeric identifiers for the genome bins. No taxonomy-based evaluation is performed using the entries in this column.

There are three different scenarios for binning tools.

The first case (example A below): if you create taxonomic bins as output without further resolution, you do not need to include the BINID column, only the TAXID column, in your output.

The second case (example B below): if you create bins that do not include taxonomic assignments, you do not need to include the TAXID column, only the BINID column, in your output.

The third case (example C below): if you perform taxonomic binning and additionally resolve bins below existing taxonomic IDs, e.g. to define bins representing novel strains, you include both the TAXID and the BINID columns. Consistency is easiest for us to check if, in this case, the BINID is the TAXID entry with a numeric identifier appended (e.g. 562.2).
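The BINID convention suggested for the third case can be expressed as a small check. This is only an illustrative sketch: the function name binid_matches_taxid is hypothetical, and treating a BINID equal to the TAXID as consistent is an assumption based on rows such as "123 123" in example C.

```python
def binid_matches_taxid(taxid, binid):
    """True if binid equals taxid or is taxid plus '.<digits>' (e.g. 562.2)."""
    if binid == taxid:
        return True  # assumption: an unresolved bin may reuse the TAXID as is
    prefix, sep, suffix = binid.rpartition(".")
    return sep == "." and prefix == taxid and suffix.isdigit()
```

For instance, `binid_matches_taxid("562", "562.2")` holds, while `binid_matches_taxid("562", "563.1")` does not.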

If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.

Format for multiple samples

Binnings for datasets with multiple samples are supported. Simply concatenate the binnings of the different samples into a single file. The binning for each sample must start with the header tag @SampleID identifying the sample and the binning header columns starting with '@@'.
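A reader for such concatenated files can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser used by CAMI: the function name parse_binning is hypothetical, header tags other than @SampleID are ignored, and a data row that appears before both a @SampleID line and an '@@' line is treated as an error.

```python
def parse_binning(lines):
    """Parse a CAMI binning file into {sample_id: [row_dict, ...]}."""
    samples = {}
    sample_id = None
    columns = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and optional comment lines
        if line.startswith("@@"):
            # Column header of the binning section, tab-separated.
            columns = line[2:].split("\t")
        elif line.startswith("@"):
            key, _, value = line[1:].partition(":")
            if key == "SampleID":
                sample_id = value
                samples.setdefault(sample_id, [])
                columns = None  # each sample brings its own '@@' header
        else:
            if sample_id is None or columns is None:
                raise ValueError("data row before @SampleID/@@ header: " + line)
            samples[sample_id].append(dict(zip(columns, line.split("\t"))))
    return samples
```

Because the column names are taken from each sample's own '@@' line, the same function handles files with only TAXID, only BINID, or both columns, as well as appended program-specific columns.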

Examples

Example A

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	TAXID
read1201	123
read1202	123
read1203	131564
read1204	562
read1205	562

Example B

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	BINID
read1201	12346BIN
read1202	ANOTHERBIN
read1203	BIN6
read1204	BIN5
read1205	BIN5

Example C

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	TAXID	BINID
read1201	123	123
read1202	123	123
read1203	131564	131564
read1204	562	562.1
read1205	562	562.2

Taxonomic profiling

Outline

The CAMI taxonomic profiling format contains a header section followed by lines of tab-separated taxon fields (ID, rank, etc.) and respective relative abundance.

The header section

#CAMI Submission for Taxonomic Profiling

This line is an optional comment line describing the file content.

@SampleID:SAMPLEID

SAMPLEID is a unique identifier for a sequence sample.

@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE

The last line is the tab-separated header for the profiling section and begins with '@@'. The tags are described in the profiling section below.

The profiling section

The TAXID specifies a unique alphanumeric ID for a node in a reference tree such as the NCBI taxonomy. This will be used for evaluation of your predictions with taxonomy-based measures. If you resolve to bins below existing taxonomic IDs in your reference tree, e.g. to resolve novel strains, you can append a numeric identifier to your classification, e.g. 5232.1 (according to the provided taxonomy version from the download site).

RANK is an alphanumeric identifier for the rank in the reference taxonomy at which the respective taxon is located. The ranks used in the NCBI taxonomy and in the CAMI contest are: superkingdom, phylum, class, order, family, genus and species. Please refer to leaf taxa below species level with the rank strain.

The PERCENTAGE field specifies what percentage of the sample was assigned to the respective TAXID. PERCENTAGE is a real number between 0 and 100. The percentages given for all taxa of the same rank should sum to <= 100%; if part of the sample is unassigned, this is reflected in less than 100% being assigned. PERCENTAGE describes genome (or cell) abundance, not the amount of sequence data. For instance, assume there are two species A and B in a dataset with equal genome (or cell) abundance, and that the genome of A is three times as long as the genome of B. Then the PERCENTAGE field should be 50 for both species and not 75 for A and 25 for B, which would reflect the amount of sequence data (or number of reads) rather than the genome (or species) abundance. PERCENTAGE is cumulative for sub-paths; therefore, the PERCENTAGE value of a parent taxon must be greater than or equal to the sum of the values of all taxa it contains.
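The two-species example can be reproduced numerically. The sketch below is one possible way to turn sequence-data amounts into genome abundances, not a method prescribed by CAMI: the function name genome_abundance_percentages and the genome lengths are hypothetical, and it assumes the whole sample is assigned, so the results sum to 100.

```python
def genome_abundance_percentages(sequenced_bases, genome_lengths):
    """Convert per-taxon sequenced bases into genome-abundance percentages."""
    # Dividing by genome length removes the bias towards long genomes.
    coverage = {t: sequenced_bases[t] / genome_lengths[t] for t in sequenced_bases}
    total = sum(coverage.values())
    return {t: 100.0 * c / total for t, c in coverage.items()}


# Species A has a genome three times as long as B, so it yields 75% of the data.
bases = {"A": 75_000_000, "B": 25_000_000}
lengths = {"A": 3_000_000, "B": 1_000_000}  # hypothetical genome lengths
print(genome_abundance_percentages(bases, lengths))  # {'A': 50.0, 'B': 50.0}
```

Reporting the raw data shares (75 and 25) instead would describe the amount of sequence data, which is what the PERCENTAGE field is meant to avoid.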

The TAXPATH and TAXPATHSN tags give the path from the root of the reference taxonomy to the respective taxon, listing either the alphanumeric identifiers such as TAXIDs (TAXPATH) or the taxonomic names (TAXPATHSN) of all taxa that lie on this path. We are not parsing these columns for the competition but will use the distributed NCBI reference taxonomy version. Please ensure that your predictions comply with this version. Taxonomic identifiers in the TAXPATH and TAXPATHSN for different taxonomic ranks are separated by "|". If a TAXID is missing at a rank or there is no name, this can be marked with an empty entry in the TAXPATH and TAXPATHSN: "||". For partial paths from the root to an internal node, you only need to fill in the path until you reach the node.
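Before submitting, a profile can be sanity-checked against the PERCENTAGE constraints. The following is a rough sketch under stated assumptions, not the evaluation procedure: the function name check_profile and the tolerance parameter are hypothetical, rows are expected as (taxid, rank, taxpath, percentage) tuples, and the parent of a taxon is taken to be the last non-empty TAXPATH entry before the taxon itself. The TAXPATH column is used here only for illustration, since the competition relies on the distributed reference taxonomy instead.

```python
from collections import defaultdict


def check_profile(rows, tolerance=1e-6):
    """Check rank sums (<= 100) and parent >= sum of children; return problem strings."""
    problems = []
    per_rank = defaultdict(float)
    children_sum = defaultdict(float)
    percentage_of = {}
    for taxid, rank, taxpath, percentage in rows:
        per_rank[rank] += percentage
        percentage_of[taxid] = percentage
        # Drop the taxon itself and any empty entries ("||") from its path.
        ancestors = [t for t in taxpath.split("|")[:-1] if t]
        if ancestors:
            children_sum[ancestors[-1]] += percentage
    for rank, total in per_rank.items():
        if total > 100 + tolerance:
            problems.append(f"rank {rank} sums to {total:.5f} > 100")
    for parent, total in children_sum.items():
        if parent in percentage_of and total > percentage_of[parent] + tolerance:
            problems.append(f"taxon {parent} is below the sum of its children")
    return problems
```

The small tolerance absorbs floating-point rounding when percentages at one rank add up to exactly 100.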

If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.

Example

#CAMI Submission for Taxonomic Profiling
@SampleID:SAMPLEID
@@TAXID	RANK	TAXPATH	TAXPATHSN	PERCENTAGE
2	superkingdom	2	Bacteria	98.81211
2157	superkingdom	2157	Archaea	1.18789
1239	phylum	2|1239	Bacteria|Firmicutes	59.75801
1224	phylum	2|1224	Bacteria|Proteobacteria	18.94674
28890	phylum	2157|28890	Archaea|Euryarchaeotes	1.18789
91061	class	2|1239|91061	Bacteria|Firmicutes|Bacilli	59.75801
28211	class	2|1224|28211	Bacteria|Proteobacteria|Alphaproteobacteria	18.94674
183925	class	2157|28890|183925	Archaea|Euryarchaeotes|Methanobacteria	1.18789
1385	order	2|1239|91061|1385	Bacteria|Firmicutes|Bacilli|Bacillales	59.75801
356	order	2|1224|28211|356	Bacteria|Proteobacteria|Alphaproteobacteria|Rhizobacteria	10.52311
204455	order	2|1224|28211|204455	Bacteria|Proteobacteria|Alphaproteobacteria|Rhodobacterales	8.42263
2158	order	2157|28890|183925|2158	Archaea|Euryarchaeotes|Methanobacteria|Methanobacteriales	1.18789