How unpublished genomes will be used by CAMI
Unpublished genomes are crucial for CAMI. We will create artificial metagenomes from these genomes and use them as test datasets in CAMI. Raw sequence reads will be generated from mixtures of the unpublished genome sequences to simulate metagenome sequence samples. These read data sets, as well as an assembled version thereof, will be provided in anonymized form exclusively to registered participants of the CAMI challenge.
What CAMI challenge participants have to agree to
Participants may use these sequences only for the CAMI challenge. Participants have to agree that the data provided by the data contributor, or the respective OTUs and strains that will be sequenced, will not be redistributed to any third person or institute. Furthermore, the data recipient is not allowed to publish the data, deposit the sequences in any database, or use them for purposes other than participating in the CAMI challenge. The data recipient has to agree to delete the data in all forms when the CAMI challenge has ended. After the data has been released by all data contributors, the data sets of the CAMI challenge will be made publicly available for future benchmarking.
Publication of the provided data sets (e.g. by upload to public web portals or into the International Nucleotide Sequence Database Collaboration - INSDC databases) remains up to the owners of these genomes.
What data contributors have to agree to
The data contributor specifies a list of data recipients who will have access to the data before or during the time the CAMI challenge is running. The data contributor and all listed data recipients agree that they will keep the provided data and metadata specified above confidential and that it will not be redistributed to any third person or institute until the end of the 2nd CAMI challenge, which will likely be by the end of October 2018.
The data contributor agrees that they will not reveal information about the identities or relative abundances of the taxa in the provided data set for the CAMI contest, e.g. in presentations. The data contributor agrees to a release of the provided sequences and taxonomic information for the purposes of reproducing the evaluation results of the CAMI challenge at a time point specified by him or her.
How to contribute genomic data to CAMI
To contribute a genome for use in the challenge data sets, please provide us with the genome sequence as a fasta file (consisting of one or multiple contigs). The file name should consist of a unique identifier ('genome_ID') and the file extension .fasta or .fna, for example "12345.fna". In addition, please provide a metadata file in which each line describes one of your genome fasta files.
Strain sequence data should be described using the following metadata fields in tab-separated format (please see here for an example file: metadata_table.template.tsv):
genome_ID | source | technology | NCBI_ID | NCBI_name | contact | other
If the value of a field is unknown, you may leave it empty. Mandatory fields are genome_ID, source, technology and contact. The fields have the following meanings:
- 'genome_ID' specifies an identifier for a sequence sample from a particular strain; a sample may include multiple sequences.
- 'source' specifies your sample (e.g. Arabidopsis thaliana root sample).
- 'technology' specifies the sequencing technology used (e.g. Illumina paired end).
- 'NCBI_ID' specifies an identifier for a strain in a reference taxonomy. This can also be an identifier at a higher taxonomic rank, if the strain is not represented in the taxonomy yet.
- 'NCBI_name' specifies the respective name of the taxon in the reference taxonomy.
- 'contact' specifies the email of the owner of a contributed assembled isolate strain sequence sample.
- 'other' is a field for comments.
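
Before submitting, it may help to check a metadata file against the rules above. The following is a minimal sketch, not an official CAMI tool. The function name validate_metadata is hypothetical, and the sketch assumes that a first line starting with genome_ID is a header to be skipped, that trailing empty fields may be omitted from a line, and that each genome_ID should appear only once.

```python
import io

# Column order as listed in the text above.
FIELDS = ["genome_ID", "source", "technology",
          "NCBI_ID", "NCBI_name", "contact", "other"]
MANDATORY = ["genome_ID", "source", "technology", "contact"]


def validate_metadata(handle):
    """Return a list of problems found in a tab-separated metadata file."""
    problems = []
    seen_ids = set()
    for line_no, line in enumerate(handle, start=1):
        cells = [cell.strip() for cell in line.rstrip("\r\n").split("\t")]
        if not any(cells):
            continue  # ignore blank lines
        if cells[0] == "genome_ID":
            continue  # assumption: an optional header line is allowed
        if len(cells) > len(FIELDS):
            problems.append(f"line {line_no}: more than {len(FIELDS)} fields")
            continue
        # Assumption: trailing empty fields may be left off the line.
        cells += [""] * (len(FIELDS) - len(cells))
        record = dict(zip(FIELDS, cells))
        for name in MANDATORY:
            if not record[name]:
                problems.append(
                    f"line {line_no}: mandatory field '{name}' is empty")
        genome_id = record["genome_ID"]
        if genome_id:
            if genome_id in seen_ids:
                problems.append(
                    f"line {line_no}: genome_ID '{genome_id}' is not unique")
            seen_ids.add(genome_id)
    return problems


# Usage with made-up example data: the second line lacks source and contact.
sample = (
    "12345\tArabidopsis thaliana root sample\tIllumina paired end"
    "\t\t\towner@example.org\t\n"
    "67890\t\tIllumina paired end\n"
)
for problem in validate_metadata(io.StringIO(sample)):
    print(problem)
```

The check only covers the mandatory fields and identifier uniqueness; it does not verify that a matching genome_ID.fasta or genome_ID.fna file exists, nor that NCBI_ID is a valid taxonomy identifier.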