File formats

Metagenome assembly

Outline

The CAMI assembly format is a FASTA-formatted contig and scaffold file.

The header section

The header line of each FASTA record must follow the standard FASTA guidelines: it must start with the ">" character, followed by any combination of alphanumeric characters "[A-Za-z0-9]", whitespace (spaces and tabs), brackets "[]", underscores "_", colons and semicolons "[:;]", commas ",", periods ".", pipes "|", or hyphens "-", and end with a newline character.

The sequence section

Sequences must contain only capitalized nucleotide characters, including the ambiguous character 'N': /[ATCGN]+/.

There is no requirement on the number of nucleotides per line, but community best practice is to limit each line to at most 120 characters.
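
The header and sequence rules above can be checked mechanically. The following is a minimal sketch, not an official CAMI validator: the function name validate_fasta and the error messages are made up for illustration, and the header pattern assumes that digits count as alphanumeric characters, since the example headers contain them.

```python
import re

# Header: ">" followed by one or more permitted characters.
# Assumption: digits are allowed, as in the example headers.
HEADER_RE = re.compile(r"^>[A-Za-z0-9 \t\[\]_:;,.|-]+$")
# Sequence line: uppercase A, T, C, G or N only.
SEQUENCE_RE = re.compile(r"^[ATCGN]+$")


def validate_fasta(lines):
    """Return a list of (line_number, message) problems found in FASTA lines."""
    problems = []
    seen_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if line.startswith(">"):
            seen_header = True
            if not HEADER_RE.match(line):
                problems.append((number, "invalid character in header"))
        elif not seen_header:
            problems.append((number, "sequence data before first header"))
        elif not SEQUENCE_RE.match(line):
            problems.append((number, "invalid sequence line"))
    return problems
```

Passing an open file handle, e.g. validate_fasta(open("contigs.fasta")), works because the function only iterates over lines; the file name here is hypothetical. An empty list means no problems were found under these assumptions.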

Examples

>NODE_1331_length_1852_cov_20.7969_ID_2661
CTATTTTTAAATATCCGCGTACTTACTGAGTTATCGTGGCAGTTTTTGCTAATTTTGCGATTTTTGCCACGATAACTTGT
TAAAAAATATGACGTCTGCAATTCTTGTTGCTTTTACACTACAAAAAAGATAAAATATTAGTATGAAAGCAACAGAAGTG
AGGCTATTATGTCAAACAAAGAGATGCTCCTTGACTATTTAGATAACCATCATGGAATAATTACATACAAAGATTGTAAA
ACATTAAATATTCCAACTATATATTTAACTCGTTTGAAAAACGAAGGTGTTCTTAATCGAATTGAAAGGGGAATTTTTCT
CTCTTCCAATGGTGATTATGACGAATATTACTTCTTTCAATATCGTTATCCACAAGCTGTATTCTCTTATGTTTCAGCTC
TTTATCTTCAAGGTTTTACAGATGAAATTCCACAACATTTTGAAGTAAGCGTTCCTAGAGGATATCGTTTTAATAATCCA
CCATCAAATTTAACTATTCATTGGGTCTCAAAATTGTATAGCAAATTAGGCATTACTACAACGATTACCACAATGGGGAA
TAAAGTACGTATTTATGATTTTGAACGCATTATTTGTGATTTTATAATAAACAGAAACAGTATTGATTCTGAACTCTTTG
TTAAAACTCTTCAAGCTTACAGTAGGTATAACGGGAAAGATCTTATAAAGC
>54257
AAAATAGACGCACAAGCTGGCACATCACAACGAATAACCAAGCATAGAAATGAAGAAAAAGAAAATAATCATGGGAGTAG
GAACAGGAATCCTTTTGGCTGCTGTTGCTTTTTGGGGGTGGCATTCCACACAAGCAACTTCGACAGAAATAACAAATGAA
ATGGAAAGCGCCATGCACAACGAGCCTGTTGGTCCTGCTTTTGAAGCCGATTCTGCGTACCGTTATATTGAAGCACAATG
TAGTTTCGGCCCTCGCACCATGAACAGCGAAGCACACGAACAATGTGCGGAGTGGATTAT

Genome and taxon binning

Outline

The CAMI genome and taxon binning format contains a header section followed by lines of tab-separated sequence ID-to-bin ID assignments. The binned sequences can be either the reads of a dataset sample or the contigs of an assembly.

The header section

#CAMI Submission for Binning

This line is an optional comment line describing the file content.

@SampleID:SAMPLEID

SAMPLEID is a unique identifier for a dataset sample.

@@SEQUENCEID TAXID BINID

The last line is the header for the binning section and begins with '@@'. It defines the content of the columns in a tab-separated format.

The binning section

The TAXID is the taxonomic assignment, given as an NCBI taxon ID, of your binned sequence. It is used to evaluate your predictions with taxonomy-based measures.

The BINID entries can be arbitrary alphanumeric identifiers for the genome bins. No taxonomy-based evaluation is performed using the entries in this column.

There are three different scenarios for binning tools.

The first case (example A below): if you create taxonomic bins without further resolution, you only need to include the TAXID column, not the BINID column, in your output.

The second case (example B below): if you create bins that do not include taxonomic assignments, you only need to include the BINID column, not the TAXID column, in your output.

The third case (example C below): if you perform taxonomic binning and additionally resolve bins below existing taxonomic IDs, e.g. to define bins representing novel strains, include both the TAXID and the BINID columns. Consistency is easiest for us to check if, in this case, the BINID is the TAXID entry with a numeric identifier appended (e.g. 562.2).

If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.

Format for multiple samples

Binnings for datasets with multiple samples are supported. Simply concatenate the binnings of the different samples into a single file. The binning for each sample must start with the header tag @SampleID identifying the sample, followed by the binning column header starting with '@@'.
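
To illustrate how the header lines and the concatenation of several samples fit together, here is a minimal reading sketch. The function name parse_binning and the returned dictionary layout are assumptions made for this illustration and are not part of the format.

```python
def parse_binning(lines):
    """Parse CAMI binning lines into {sample_id: [row_dict, ...]}."""
    samples = {}
    sample_id = None
    columns = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the optional comment line
        if line.startswith("@@"):
            # Column header; drop empty fields left by trailing tabs.
            columns = [name for name in line[2:].split("\t") if name]
            continue
        if line.startswith("@"):
            key, _, value = line[1:].partition(":")
            if key == "SampleID":
                sample_id = value
                samples.setdefault(sample_id, [])
                columns = None  # each sample restates its own @@ header
            continue
        if sample_id is None or columns is None:
            raise ValueError("assignment line before @SampleID and @@ header")
        # zip() ignores surplus empty fields caused by trailing tabs.
        samples[sample_id].append(dict(zip(columns, line.split("\t"))))
    return samples
```

Because the column names are taken from each sample's own '@@' line, the same code handles files with only TAXID, only BINID, or both columns.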

Examples

Example A

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	TAXID	
read1201	123	
read1202	123	
read1203	131564	
read1204	562	
read1205	562

Example B

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	BINID	
read1201	12346BIN
read1202	ANOTHERBIN	
read1203	BIN6	
read1204	BIN5	
read1205	BIN5

Example C

#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID	TAXID	BINID
read1201	123	123
read1202	123	123
read1203	131564	131564
read1204	562	562.1
read1205	562	562.2

Taxonomic profiling

Outline

The CAMI taxonomic profiling format contains a header section followed by lines of tab-separated taxon fields (ID, rank, etc.) and their relative abundances.

The header section

#CAMI Submission for Taxonomic Profiling

This line is an optional comment line describing the file content.

@Version:0.9.1
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@SampleID:SAMPLEID

SAMPLEID is a unique identifier for a sequence sample.

@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE

The last line is the tab-separated header for the profiling section and begins with '@@'. The tags are described in the profiling section below.

The profiling section

The TAXID specifies a unique alphanumeric ID for a node in a reference tree such as the NCBI taxonomy. It is used to evaluate your predictions with taxonomy-based measures. If you resolve bins below existing taxonomic IDs in your reference tree (according to the provided taxonomy version from the download site), e.g. to resolve novel strains, you can append a numeric identifier to your classification, e.g. 5232.1.

RANK contains an alphanumeric identifier for the rank in the reference taxonomy at which the respective taxon is located. The ranks used in the NCBI taxonomy and in the CAMI contest are: superkingdom, phylum, class, order, family, genus and species. Please refer to leaf taxa below the species level with the rank strain.

The PERCENTAGE field specifies what percentage of the sample was assigned to the respective TAXID. PERCENTAGE can be a real number between 0 and 100. The percentages given for all taxa of the same rank should sum to at most 100%; that is, if something is unassigned, this is reflected in less than 100% being assigned.

PERCENTAGE refers to genome (or cell) abundance, not to the amount of sequence data. For instance, assume there are two species A and B in a dataset with equal genome (or cell) abundance, and that the genome of A is three times as long as the genome of B. Then the PERCENTAGE field should be 50 for both species, and not 75 for A and 25 for B, which would reflect the amount of sequence data (or number of reads) rather than the genome (or species) abundance.

PERCENTAGE is cumulative along sub-paths: the PERCENTAGE value of a parent taxon must be greater than or equal to the sum of the values of all taxa it contains.
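
The two-species example can be written out as a small calculation. This is a sketch under a simplifying assumption that is not part of the format: reads are sampled uniformly, so the read count of a taxon is proportional to its genome length times its genome abundance. The function name and the numbers are hypothetical.

```python
def genome_abundance_percentages(read_counts, genome_lengths):
    """Convert per-taxon read counts into genome-abundance percentages.

    Assumes uniform sampling, i.e. read count is proportional to
    genome length times genome copy number.
    """
    # Dividing by genome length removes the bias towards long genomes.
    coverage = {taxon: read_counts[taxon] / genome_lengths[taxon]
                for taxon in read_counts}
    total = sum(coverage.values())
    return {taxon: 100.0 * value / total for taxon, value in coverage.items()}


# Species A has a genome three times as long as B and therefore
# receives three times as many reads at equal genome abundance.
reads = {"A": 750, "B": 250}
lengths = {"A": 3_000_000, "B": 1_000_000}
percentages = genome_abundance_percentages(reads, lengths)
```

With these inputs both species come out at about 50, whereas the raw read fractions would be 75 and 25.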

The TAXPATH and TAXPATHSN tags give the path from the root of the reference taxonomy to the respective taxon, listing either the alphanumeric identifiers such as the TAXIDs (TAXPATH) or the taxonomic names (TAXPATHSN) of all taxa that lie on this path. We do not parse these columns for the competition but use the distributed NCBI reference taxonomy version instead. Please ensure that your predictions comply with this version. Taxonomic identifiers for different taxonomic ranks in TAXPATH and TAXPATHSN are separated by "|". If a TAXID is missing at a rank or there is no name, this can be marked with an empty entry ("||") in TAXPATH and TAXPATHSN. For partial paths from the root to an internal node, you only need to fill in the path until you reach the node.
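
The rules for empty entries and partial paths can be sketched as follows. The helper build_taxpath and its lineage dictionary are hypothetical and only illustrate the string layout; the lineage with a missing phylum is a made-up case.

```python
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species", "strain"]


def build_taxpath(lineage, ranks=RANKS):
    """Join per-rank entries with '|', leaving missing ranks empty.

    lineage maps rank name to a TAXID (or name, for TAXPATHSN).
    The path stops at the deepest rank that has an entry.
    """
    entries = [str(lineage.get(rank, "")) for rank in ranks]
    while entries and entries[-1] == "":
        entries.pop()  # partial path: nothing is written past the node
    return "|".join(entries)


full = build_taxpath({"superkingdom": 2, "phylum": 1239, "class": 91061})
gap = build_taxpath({"superkingdom": 2, "class": 91061})
```

Here full is "2|1239|91061" and gap is "2||91061"; the same helper can be called with names instead of IDs to produce TAXPATHSN.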

If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.

Example

#CAMI Submission for Taxonomic Profiling
@Version:0.9.1
@Ranks:superkingdom|phylum|class|order|family|genus|species|strain
@SampleID:SAMPLEID
@@TAXID	RANK	TAXPATH	TAXPATHSN	PERCENTAGE
2	superkingdom	2	Bacteria	98.81211
2157	superkingdom	2157	Archaea	1.18789
1239	phylum	2|1239	Bacteria|Firmicutes	59.75801
1224	phylum	2|1224	Bacteria|Proteobacteria	18.94674
28890	phylum	2157|28890	Archaea|Euryarchaeotes	1.18789
91061	class	2|1239|91061	Bacteria|Firmicutes|Bacilli	59.75801
28211	class	2|1224|28211	Bacteria|Proteobacteria|Alphaproteobacteria	18.94674
183925	class	2157|28890|183925	Archaea|Euryarchaeotes|Methanobacteria	1.18789
1385	order	2|1239|91061|1385	Bacteria|Firmicutes|Bacilli|Bacillales	59.75801
356	order	2|1224|28211|356	Bacteria|Proteobacteria|Alphaproteobacteria|Rhizobacteria	10.52311
204455	order	2|1224|28211|204455	Bacteria|Proteobacteria|Alphaproteobacteria|Rhodobacterales	8.42263
2158	order	2157|28890|183925|2158	Archaea|Euryarchaeotes|Methanobacteria|Methanobacteriales	1.18789
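
The two constraints on PERCENTAGE (per-rank totals of at most 100, and parents at least as large as the sum of their children) can be checked on a profile like the one above. The sketch below is not the official evaluation: the function name check_profile, the tuple layout and the tolerance are assumptions, and it reads parent relationships from TAXPATH for convenience even though the competition itself does not parse that column.

```python
from collections import defaultdict


def check_profile(rows, tolerance=1e-6):
    """Check rank totals and parent/child sums for profile rows.

    rows: iterable of (taxid, rank, taxpath, percentage) tuples.
    Returns a list of human-readable problem descriptions.
    """
    problems = []
    rank_totals = defaultdict(float)
    child_totals = defaultdict(float)
    percentage_of = {}
    for taxid, rank, taxpath, percentage in rows:
        rank_totals[rank] += percentage
        percentage_of[taxid] = percentage
        # Nearest non-empty ancestor on the path is treated as the parent.
        ancestors = [entry for entry in taxpath.split("|")[:-1] if entry]
        if ancestors:
            child_totals[ancestors[-1]] += percentage
    for rank, total in rank_totals.items():
        if total > 100 + tolerance:
            problems.append(f"rank {rank} sums to {total:.5f} > 100")
    for parent, total in child_totals.items():
        # Parents that are not listed in the profile are skipped.
        if parent in percentage_of and total > percentage_of[parent] + tolerance:
            problems.append(
                f"children of {parent} sum to {total:.5f} "
                f"> {percentage_of[parent]:.5f}"
            )
    return problems
```

The small tolerance absorbs rounding in the printed percentages; an empty result means that neither constraint is violated under these assumptions.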