Outline
The CAMI assembly format is a FASTA-formatted contig and scaffold file.
The header section
The header section of the FASTA file must follow the standard FASTA guidelines: each header must start with a ">" character followed by any combination of alphanumeric characters "[A-Za-z0-9]", whitespace (spaces and tabs), brackets "[]", underscores "_", colons and semicolons "[:;]", commas ",", periods ".", pipes "|", or hyphens "-", and must end with a newline character.
The sequence section
Sequences must contain only capitalized nucleotide characters, including the ambiguous character 'N': /[ATCGN]+/.
There is no requirement on the number of nucleotides per line, but community best practice advises limiting each line to 120 nucleotides.
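The header and sequence rules above can be checked mechanically. The following Python sketch is only an illustration, not an official CAMI validator. The function name validate_fasta and the message strings are hypothetical, and it assumes that "alphanumeric" includes digits (as in the example headers) and that blank lines are not allowed.

```python
import re

# Assumption: digits are allowed in headers, as the example headers suggest.
HEADER_RE = re.compile(r"^>[A-Za-z0-9 \t\[\]_:;,.|-]+$")
SEQUENCE_RE = re.compile(r"^[ATCGN]+$")
MAX_LINE_LENGTH = 120  # community best practice, advisory only


def validate_fasta(lines):
    """Return a list of (line_number, message) problems found in FASTA lines."""
    problems = []
    seen_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            seen_header = True
            if not HEADER_RE.match(line):
                problems.append((number, "invalid header characters"))
        elif not seen_header:
            problems.append((number, "sequence before first header"))
        elif not SEQUENCE_RE.match(line):
            # Also triggers for blank lines (assumed to be disallowed).
            problems.append((number, "invalid sequence characters"))
        elif len(line) > MAX_LINE_LENGTH:
            problems.append((number, "line longer than 120 nucleotides (advisory)"))
    return problems
```

Calling validate_fasta(open(path)) on a hypothetical file path yields an empty list when every line passes these checks.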
Examples
>NODE_1331_length_1852_cov_20.7969_ID_2661
CTATTTTTAAATATCCGCGTACTTACTGAGTTATCGTGGCAGTTTTTGCTAATTTTGCGATTTTTGCCACGATAACTTGT
TAAAAAATATGACGTCTGCAATTCTTGTTGCTTTTACACTACAAAAAAGATAAAATATTAGTATGAAAGCAACAGAAGTG
AGGCTATTATGTCAAACAAAGAGATGCTCCTTGACTATTTAGATAACCATCATGGAATAATTACATACAAAGATTGTAAA
ACATTAAATATTCCAACTATATATTTAACTCGTTTGAAAAACGAAGGTGTTCTTAATCGAATTGAAAGGGGAATTTTTCT
CTCTTCCAATGGTGATTATGACGAATATTACTTCTTTCAATATCGTTATCCACAAGCTGTATTCTCTTATGTTTCAGCTC
TTTATCTTCAAGGTTTTACAGATGAAATTCCACAACATTTTGAAGTAAGCGTTCCTAGAGGATATCGTTTTAATAATCCA
CCATCAAATTTAACTATTCATTGGGTCTCAAAATTGTATAGCAAATTAGGCATTACTACAACGATTACCACAATGGGGAA
TAAAGTACGTATTTATGATTTTGAACGCATTATTTGTGATTTTATAATAAACAGAAACAGTATTGATTCTGAACTCTTTG
TTAAAACTCTTCAAGCTTACAGTAGGTATAACGGGAAAGATCTTATAAAGC
>54257
AAAATAGACGCACAAGCTGGCACATCACAACGAATAACCAAGCATAGAAATGAAGAAAAAGAAAATAATCATGGGAGTAG
GAACAGGAATCCTTTTGGCTGCTGTTGCTTTTTGGGGGTGGCATTCCACACAAGCAACTTCGACAGAAATAACAAATGAA
ATGGAAAGCGCCATGCACAACGAGCCTGTTGGTCCTGCTTTTGAAGCCGATTCTGCGTACCGTTATATTGAAGCACAATG
TAGTTTCGGCCCTCGCACCATGAACAGCGAAGCACACGAACAATGTGCGGAGTGGATTAT
Outline
The CAMI genome and taxon binning format contains a header section followed by lines of tab-separated sequence ID-to-bin ID assignments. The binning can be of the reads of a dataset sample or the contigs of an assembly.
The header section
#CAMI Submission for Binning
This line is an optional comment line describing the file content.
@SampleID:SAMPLEID
SAMPLEID is a unique identifier for a dataset sample.
@@SEQUENCEID TAXID BINID
The last line is the header for the binning section and begins with '@@'. It defines the content of the columns in a tab-separated format.
The binning section
The TAXID is the taxonomic assignment, given as an NCBI taxon ID, for your binned sequence. It is used for evaluation of your predictions with taxonomy-based measures.
The BINID entries can be arbitrary alphanumeric identifiers for the genome bins. No taxonomy-based evaluation is performed using the entries in this column.
There are three different scenarios for binning tools.
The first case, example A below: If you create taxonomic bins as output without further resolution, you only need to include the TAXID column in your output, not the BINID column.
The second case, example B below: If you create bins that do not include taxonomic assignments, you only need to include the BINID column in your output, not the TAXID column.
The third case, example C below: If you perform taxonomic binning and additionally resolve bins below existing taxonomic IDs, e.g. to define bins representing novel strains, include both the TAXID and the BINID columns. Consistency checks are easiest for us if, in this case, the BINID is the TAXID entry with a numeric identifier appended (e.g. 562.2).
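For the third case, the suggested BINID convention (the TAXID, optionally followed by a period and a number) can be checked with a short Python sketch. The function name binid_matches_taxid is hypothetical, and accepting a BINID identical to the TAXID is an assumption based on example C.

```python
import re


def binid_matches_taxid(taxid, binid):
    """True if binid is the TAXID itself or the TAXID plus '.<number>'."""
    # re.escape guards against special characters in the TAXID string.
    pattern = re.escape(taxid) + r"(\.\d+)?"
    return re.fullmatch(pattern, binid) is not None
```

With this check, the pair ("562", "562.2") is accepted, while ("562", "5621") is rejected because the suffix is not separated by a period.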
If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.
Format for multiple samples
Binnings for datasets with multiple samples are supported. Simply concatenate the binnings of the different samples into a single file. The binning for each sample must start with the header tag @SampleID identifying the sample and the binning header columns starting with '@@'.
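A reader for this layout, including concatenated samples, might look like the following Python sketch. It is an illustration under stated assumptions, not a reference parser: the function name parse_binning is hypothetical, columns are assumed to be separated by single tab characters, and header tags other than @SampleID are ignored.

```python
def parse_binning(lines):
    """Parse CAMI binning lines into {sample_id: [row_dict, ...]}."""
    samples = {}
    sample_id = None
    columns = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and optional comment lines
        if line.startswith("@@"):
            # Column header of the binning section, e.g. SEQUENCEID, TAXID.
            columns = line[2:].split("\t")
            continue
        if line.startswith("@"):
            key, _, value = line[1:].partition(":")
            if key == "SampleID":
                sample_id = value
                samples.setdefault(sample_id, [])
                columns = None  # each sample needs its own '@@' header
            continue
        if sample_id is None or columns is None:
            raise ValueError("data row before @SampleID and @@ header lines")
        samples[sample_id].append(dict(zip(columns, line.split("\t"))))
    return samples
```

Each row becomes a dictionary keyed by the column names of its own sample, so samples that use different column sets (TAXID only, BINID only, or both) can share one file.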
Examples
Example A
#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID TAXID
read1201 123
read1202 123
read1203 131564
read1204 562
read1205 562
Example B
#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID BINID
read1201 12346BIN
read1202 ANOTHERBIN
read1203 BIN6
read1204 BIN5
read1205 BIN5
Example C
#CAMI Format for Binning
@SampleID:SAMPLEID
@@SEQUENCEID TAXID BINID
read1201 123 123
read1202 123 123
read1203 131564 131564
read1204 562 562.1
read1205 562 562.2
Outline
The CAMI taxonomic profiling format contains a header section followed by lines of tab-separated taxon fields (ID, rank, etc.) and respective relative abundance.
The header section
#CAMI Submission for Taxonomic Profiling
This line is an optional comment line describing the file content.
@SampleID:SAMPLEID
SAMPLEID is a unique identifier for a sequence sample.
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
The last line is the tab-separated header for the profiling section and begins with '@@'. The tags are described in the profiling section below.
The profiling section
The TAXID specifies a unique alphanumeric ID for a node in a reference tree such as the NCBI taxonomy. This will be used for evaluation of your predictions with taxonomy-based measures. If you resolve to bins below existing taxonomic IDs in your reference tree, e.g. to resolve novel strains, you can append a numeric identifier to your classification, e.g. 5232.1 (according to the provided taxonomy version from the download site).
RANK is an alphanumeric identifier for the rank in the reference taxonomy at which the respective taxon is located. The ranks used in the NCBI taxonomy and in the CAMI contest are: superkingdom, phylum, class, order, family, genus and species. Please assign the rank strain to leaf taxa below species level.
The PERCENTAGE field specifies what percentage of the sample was assigned to the respective TAXID. PERCENTAGE can be a real number between 0 and 100. The percentages given for all taxa of the same rank should sum to at most 100%; if part of the sample is unassigned, this is reflected in a total of less than 100%. PERCENTAGE reflects genome (or cell) abundance, not the amount of sequence data. For instance, assume there are two species A and B in a dataset with equal genome (or cell) abundance, and that the genome of A is three times as long as the genome of B. Then the PERCENTAGE field should be 50 for both species and not 75 for A and 25 for B, which would reflect the amount of sequence data (or number of reads) rather than the genome (or species) abundance. PERCENTAGE is cumulative for sub-paths; therefore, the PERCENTAGE value of a parent taxon must be greater than or equal to the sum of the values of all taxa it contains.
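The two-species example can be worked through in Python. This is a minimal sketch of one possible conversion, assuming that the amount of sequence data per taxon and the genome lengths are known. The function name genome_abundance_percentages and its parameters are hypothetical and not part of the format.

```python
def genome_abundance_percentages(sequenced_bases, genome_lengths):
    """Convert sequence amounts per taxon into genome-abundance percentages."""
    # Dividing by genome length gives a per-genome coverage, which removes
    # the bias toward taxa with longer genomes.
    coverage = {
        taxon: sequenced_bases[taxon] / genome_lengths[taxon]
        for taxon in sequenced_bases
    }
    total = sum(coverage.values())
    return {taxon: 100.0 * value / total for taxon, value in coverage.items()}


# Species A has a genome three times as long as B and 75% of the sequence data.
shares = genome_abundance_percentages(
    {"A": 75.0, "B": 25.0},
    {"A": 3_000_000, "B": 1_000_000},
)
```

Here shares holds a value of about 50 for both A and B (up to floating-point rounding), matching the PERCENTAGE values the format expects rather than the 75 and 25 of the raw sequence shares.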
The TAXPATH and TAXPATHSN tags give the path from the root of the reference taxonomy to the respective taxon, listing either the alphanumeric identifiers such as the TAXIDs (TAXPATH) or the taxonomic names (TAXPATHSN) of all taxa that lie on this path. We are not parsing these columns for the competition but will use the distributed NCBI reference taxonomy version. Please ensure that your predictions comply with this version. Taxonomic identifiers in the TAXPATH and TAXPATHSN for different taxonomic ranks are separated by "|". If a TAXID is missing at a rank or there is no name, this can be marked with an empty entry in the TAXPATH and TAXPATHSN "||". For partial paths from the root to an internal node, you only need to fill in the path until you reach the node.
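The "|"-separated paths, including empty entries, can be split as in the following Python sketch. The function name pair_taxpath is hypothetical, and requiring both paths to have the same number of entries is an assumption rather than a stated rule of the format.

```python
def pair_taxpath(taxpath, taxpathsn):
    """Pair each TAXPATH entry with its TAXPATHSN name; empty entries become None."""
    ids = taxpath.split("|")
    names = taxpathsn.split("|")
    if len(ids) != len(names):
        # Assumption: both columns describe the same path, rank by rank.
        raise ValueError("TAXPATH and TAXPATHSN differ in length")
    return [(taxid or None, name or None) for taxid, name in zip(ids, names)]
```

An empty entry such as the middle one in "2||91061" is returned as None, so missing ranks stay visible instead of shifting the later entries.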
If you want to specify extra columns in the output, prefix each column name with _YourProgramName_ and only append columns to the right of the standard ones.
Example
#CAMI Submission for Taxonomic Profiling
@SampleID:SAMPLEID
@@TAXID RANK TAXPATH TAXPATHSN PERCENTAGE
2 superkingdom 2 Bacteria 98.81211
2157 superkingdom 2157 Archaea 1.18789
1239 phylum 2|1239 Bacteria|Firmicutes 59.75801
1224 phylum 2|1224 Bacteria|Proteobacteria 18.94674
28890 phylum 2157|28890 Archaea|Euryarchaeotes 1.18789
91061 class 2|1239|91061 Bacteria|Firmicutes|Bacilli 59.75801
28211 class 2|1224|28211 Bacteria|Proteobacteria|Alphaproteobacteria 18.94674
183925 class 2157|28890|183925 Archaea|Euryarchaeotes|Methanobacteria 1.18789
1385 order 2|1239|91061|1385 Bacteria|Firmicutes|Bacilli|Bacillales 59.75801
356 order 2|1224|28211|356 Bacteria|Proteobacteria|Alphaproteobacteria|Rhizobacteria 10.52311
204455 order 2|1224|28211|204455 Bacteria|Proteobacteria|Alphaproteobacteria|Rhodobacterales 8.42263
2158 order 2157|28890|183925|2158 Archaea|Euryarchaeotes|Methanobacteria|Methanobacteriales 1.18789
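The two PERCENTAGE constraints (per-rank totals of at most 100, and parents at least as large as the sum of their children) can be checked with a Python sketch like the one below. It is an illustration only: the function name check_profile, the tuple layout, and the tolerance value are hypothetical, and parents are found by dropping the last TAXPATH entry, which does not handle paths with empty entries.

```python
from collections import defaultdict


def check_profile(rows, tolerance=1e-6):
    """Check rows of (taxid, rank, taxpath, percentage); return problem messages."""
    problems = []
    by_rank = defaultdict(float)    # total percentage per rank
    percent = {}                    # percentage keyed by full TAXPATH
    children = defaultdict(float)   # summed child percentages per parent path
    for taxid, rank, taxpath, percentage in rows:
        by_rank[rank] += percentage
        percent[taxpath] = percentage
        parent = taxpath.rpartition("|")[0]  # empty string for root-level taxa
        if parent:
            children[parent] += percentage
    for rank, total in by_rank.items():
        if total > 100 + tolerance:
            problems.append(f"rank {rank} sums to more than 100")
    for parent, total in children.items():
        if parent in percent and total > percent[parent] + tolerance:
            problems.append(f"children of {parent} exceed its percentage")
    return problems
```

Applied to the rows of the example above, this check reports no problems. The small tolerance avoids false alarms from floating-point rounding when a rank sums to exactly 100.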