Critical Assessment of Metagenome Interpretation

FAQ - Frequently Asked Questions

What is a binning method?

A binning method assigns an identifier to every sequence in a sequence sample, where the total number of identifiers is ideally less than the total number of sequences. Thus the act of binning places the sequences into broader categories. A bin includes all the sequences with the same identifier. If these identifiers identify taxa from a taxonomy, the method is a taxonomic binning method.
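As a minimal sketch of this definition, the following Python snippet groups sequences by the identifier a binning method assigned to them. The sequence IDs, bin identifiers and the function name group_into_bins are hypothetical and chosen only for illustration; they are not part of any CAMI tool or format.

```python
from collections import defaultdict


def group_into_bins(assignments):
    """Group sequence IDs by the identifier assigned to them."""
    bins = defaultdict(list)
    for sequence_id, bin_id in assignments.items():
        bins[bin_id].append(sequence_id)
    return dict(bins)


# Hypothetical output of a binning method: sequence ID -> bin identifier.
assignments = {
    "contig_1": "bin_A",
    "contig_2": "bin_B",
    "contig_3": "bin_A",
    "contig_4": "bin_B",
    "contig_5": "bin_A",
}

bins = group_into_bins(assignments)

# Each bin holds all sequences that share the same identifier.
for bin_id, members in sorted(bins.items()):
    print(bin_id, members)
```

Here five sequences fall into two bins, so the number of identifiers is smaller than the number of sequences, as the definition asks for.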

What is a profiling method?

A profiling method returns an estimate for the frequencies of different taxa in a sequenced microbial community based on analysis of the sequence sample. The main output is a vector with relative abundances for the different sample taxa. The relative abundances of taxa from the same 'rank' of the taxonomy (e.g. superkingdom, including archaea, bacteria and eukaryotes) cannot sum up to more than 1.
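The constraint on relative abundances can be checked with a few lines of Python. This is only a sketch: the profile is represented as hypothetical (rank, taxon, abundance) tuples with made-up values, not in the CAMI profiling file format, and the function names are invented for this example.

```python
from collections import defaultdict


def abundance_sum_per_rank(profile):
    """Sum the relative abundances of all taxa at each rank."""
    totals = defaultdict(float)
    for rank, _taxon, abundance in profile:
        totals[rank] += abundance
    return dict(totals)


def ranks_exceeding_one(profile, tolerance=1e-9):
    """Return the ranks whose abundances sum to more than 1."""
    totals = abundance_sum_per_rank(profile)
    return [rank for rank, total in totals.items() if total > 1.0 + tolerance]


# Hypothetical profile with made-up abundances (fractions, not percentages).
profile = [
    ("superkingdom", "Bacteria", 0.85),
    ("superkingdom", "Archaea", 0.10),
    ("phylum", "taxon_X", 0.50),
    ("phylum", "taxon_Y", 0.30),
]

print(abundance_sum_per_rank(profile))
print(ranks_exceeding_one(profile))
```

A sum below 1 at a rank is acceptable in this sketch, since part of the community may remain unassigned at that rank; only sums above 1 are reported.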

What is an assembly method?

An assembly method returns longer nucleotide sequences derived by puzzling together individual sequencing reads. These sequences are assumed to represent contiguous stretches from one genome included in the microbiome sample that was sequenced.

Is CAMI open and transparent?

Yes! Development takes place in an open community, and everybody is invited to participate. People who intend to participate in the competition will not be involved in any part of the development process that provides information or an advantage for the actual competition.

What about reproducibility?

CAMI encourages participants to submit reproducible results by providing their executable software in docker containers, along with submission of their predictions (see details below). CAMI has worked together with bioboxes to define formats for a standardized setup and execution of profiling, binning and assembly tools in docker containers. This will make the results of these tools reproducible, and also facilitate a continuous monitoring of their performance on future test data sets.

Where do I find the relevant information when I want to participate in the CAMI contest?

You can find all information on the CAMI page entitled participate.

Why should I participate in the contest?

There are various reasons why you might want to participate.

  • Receive feedback on your method's performance on a larger number of data sets.
  • Facilitated benchmarking. By submitting your software in a docker container, you will be able to continuously monitor its performance against other software that has been submitted this way. It will eventually be possible to automatically run all software on new datasets with the CAMI benchmarking portal and compare the performance using a range of common performance metrics. If you update your software, you simply have to update your docker container that you submitted.
  • Coauthorship. As in the first contest, participants who would like to disclose their identities and deliver reproducible results have the option to be authors on a joint CAMI evaluation publication.
  • Interact with the computational metagenomics community in workshops, hackathons and at meetings. CAMI will be organizing several events where you can help define the most relevant evaluation metrics, bioboxing software and questions to assess in detail for the second CAMI challenge.

How large are the data sets?

Please watch out for the details of individual data sets on the CAMI data portal. Some of the data sets will be very large, reaching up to 1 TB. You will also be able to download subsets of samples for testing.

What are the data access policies for the CAMI challenge?

CAMI toy datasets are generated from published genomes. These data sets do not have any restrictions.

CAMI challenge data sets have been generated from unpublished genome data provided by multiple laboratories who want to control the data release date. Therefore, the CAMI data sets will become fully available, together with the gold standard, after the competition has ended and once the data contributors have released their data. Until then, a data recipient is not allowed to publish the data, deposit the sequences in any database, or use them for purposes other than participating in the CAMI challenge. The data recipient agrees to delete the data in all forms when the CAMI challenge ends and not to keep copies during the period before the data are officially released.

Why do I have to register with the CAMI data portal for download of the real challenge datasets?

If there is a problem with the submitted files, we have the chance to contact the contestant. In a later stage of the competition we will show evaluation metrics with an anonymized name for the results (assembly, profiling, binning) provided by the contestant. Participants can choose to have their results displayed anonymously only, or reveal their identities for participation in a joint publication.

If I do not have access to sufficient compute time on my own, what are my options?

The Pittsburgh Supercomputing Center (PSC) and the de.NBI Cloud are making compute time and storage available on their systems to run analyses for the CAMI challenge. Check out the Compute Resources page for details.

What kind of data sets will be provided?

CAMI is providing simulated metagenome data sets created from hundreds of predominantly unpublished (contributed) isolate genome sequences, with as much realism built in as we could manage. For instance, they will include multiple strains from the same species.

Where can I find exemplary datasets?

We have made example data sets with gold standards (simulated from public genomes) available at our CAMI benchmarking portal and under DOI 10.5524/100344. Click on the up- and download button: there you can find all currently available benchmark data sets and, under the databases tab, the reference databases that should be used with the respective data sets for analyses. We encourage participants to run their tools on these data sets and upload their results to familiarize themselves with the up- and download site. Give us feedback if anything is unclear. The data sets are available at: /datasets.

Why are we providing new simulated data sets and not using existing public ones?

For real metagenome samples, we do not know which read comes from which genome, and our simulations should reflect this. For all public simulated samples, the correct solution is also available, which would make the contest less realistic. Furthermore, we want to simulate different relevant scenarios that are used in metagenomics for which no simulated data sets exist. The gold standards for these data sets will be provided after the contest has ended.

I have problems finding out which output format I should use. What should I do?

You can contact us to get advice.

What are the requirements to submit my results?

All submitted results must be reproducible by the software provided.
Tools can be submitted in one of the following ways:

  • Docker container containing the complete tool/workflow
  • Bioconda script
  • Software repository including detailed installation instructions

The output format must conform to the CAMI standards to allow automatic benchmarking of results. For software using large custom databases, please contact the CAMI team before providing it.

Can I submit the results without providing my tool?

If you want to submit your results without them being reproducible, you can do so and receive feedback for your own information. In this case, your tool will only be considered for inclusion in a joint result publication if it is a publicly accessible web tool.

How can I submit my results?

We provide instructions on our page Submitting results to CAMI.

What is the format of the CAMI assembly file?

The CAMI assembly format is a FASTA-formatted contig and scaffold file as specified here.

What is the format of the CAMI binning file?

The CAMI format for binning is specified here.

What is the format of the CAMI profiling file?

The CAMI format for profiling is specified here.

How do I submit binning and profiling results for multi-sample data sets?

You can submit results for a complete data set and for each sample of a data set. In the latter case, just concatenate the results of all samples into a single file, for example, using a command similar to cat profile_sample0 profile_sample1 > profile_all. Make sure that all concatenated files are in the CAMI format and that the SampleID in the header of each file is set to the number of the respective sample prefixed by the sample name, for example, SampleID: marmgCAMI2_short_read_sample_0 (sample 0 of the Marine dataset: Simulated Illumina HiSeq metagenome data). You can then submit the concatenated file as usual (see Submitting results to CAMI).
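The concatenation step can also be scripted so that the SampleID headers are checked before merging. The following Python sketch assumes that each per-sample file contains a header line starting with "SampleID:" (optionally preceded by "@"); please consult the format specification for the exact header syntax. The function names, the temporary files and their placeholder contents are made up for illustration.

```python
import re
import tempfile
from pathlib import Path

# Assumption: the header line starts with "SampleID:", optionally after "@".
SAMPLE_ID_PATTERN = re.compile(r"^@?SampleID:\s*(\S+)")


def read_sample_id(path):
    """Return the first SampleID found in the file, or None if absent."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            match = SAMPLE_ID_PATTERN.match(line)
            if match:
                return match.group(1)
    return None


def concatenate_profiles(sample_paths, output_path):
    """Concatenate per-sample files after checking their SampleID headers."""
    seen = []
    with open(output_path, "w", encoding="utf-8") as out:
        for path in sample_paths:
            sample_id = read_sample_id(path)
            if sample_id is None:
                raise ValueError(f"{path}: no SampleID header found")
            if sample_id in seen:
                raise ValueError(f"{path}: duplicate SampleID {sample_id}")
            seen.append(sample_id)
            text = Path(path).read_text(encoding="utf-8")
            # Make sure every file ends with a newline before the next starts.
            out.write(text if text.endswith("\n") else text + "\n")
    return seen


# Demonstration with two placeholder files (not real profiling results).
with tempfile.TemporaryDirectory() as workdir:
    paths = []
    for index in range(2):
        path = Path(workdir) / f"profile_sample{index}"
        path.write_text(
            f"@SampleID:marmgCAMI2_short_read_sample_{index}\n"
            "# placeholder body\n",
            encoding="utf-8",
        )
        paths.append(path)
    merged_ids = concatenate_profiles(paths, Path(workdir) / "profile_all")
    merged_text = (Path(workdir) / "profile_all").read_text(encoding="utf-8")

print(merged_ids)
```

Failing early on a missing or duplicated SampleID is a design choice of this sketch; it avoids uploading a concatenated file in which two samples cannot be told apart.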

When generating the fingerprint with the CAMI client for my profiling output, I get the errors Invalid TAXID and Invalid TAXPATH. Why?

Please ensure that the taxonomy in the output of your profiling method matches the NCBI taxonomy for the CAMI 2 challenge.
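One way to track down such mismatches before generating the fingerprint is to compare the tax IDs in your profile against the taxonomy you intend to use. The sketch below is not part of the CAMI client. It assumes an NCBI-style nodes.dmp layout, in which fields are separated by "|" and the first field is the tax ID, and it assumes that data lines of the profile are tab-separated with the tax ID in the first column, while lines starting with "@" or "#" are headers or comments. All example lines are toy data.

```python
def load_taxids_from_nodes(lines):
    """Collect tax IDs from lines in NCBI nodes.dmp layout."""
    taxids = set()
    for line in lines:
        first_field = line.split("|", 1)[0].strip()
        if first_field:
            taxids.add(first_field)
    return taxids


def unknown_taxids(profile_lines, known_taxids):
    """Return tax IDs from profile data lines that are not in known_taxids."""
    missing = []
    for line in profile_lines:
        stripped = line.strip()
        # Assumption: header and comment lines start with "@" or "#".
        if not stripped or stripped.startswith(("@", "#")):
            continue
        taxid = stripped.split("\t", 1)[0]
        if taxid not in known_taxids:
            missing.append(taxid)
    return missing


# Toy taxonomy lines in nodes.dmp layout (tax ID | parent ID | rank | ...).
nodes_lines = [
    "1\t|\t1\t|\tno rank\t|\n",
    "2\t|\t131567\t|\tsuperkingdom\t|\n",
]

# Toy profile lines in an assumed layout; 999999999 is a made-up tax ID.
profile_lines = [
    "@SampleID:toy_sample\n",
    "2\tsuperkingdom\t2\tBacteria\t90.0\n",
    "999999999\tsuperkingdom\t999999999\tMadeUp\t10.0\n",
]

known = load_taxids_from_nodes(nodes_lines)
print(unknown_taxids(profile_lines, known))
```

With real data, the lines would be read from the nodes.dmp file of the taxonomy provided for the challenge and from your own profiling output.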

I cannot get the docker installation to work, what should I do?

Please contact us, we are happy to help.

Can I use special hardware for a Docker container (such as GPUs)?

Yes, all hardware platforms are supported. If you don't have access to the required hardware, please contact us regarding compute resources the Pittsburgh Supercomputing Center could provide to you.

How does the speed of a method contribute to the result?

Within CAMI II, the compute time and memory requirements of submitted, executable software will be assessed.

Which metrics are being used to evaluate methods?

CAMI has worked with community members in defining multiple evaluation metrics for assessing performance of different method categories (see Sczyrba et al. Nature Methods 2017). Definition of the most relevant ones is an ongoing effort and will be continued in a public meeting of users and developers after the second CAMI challenge.

I don't want to upload my unpublished tool to Docker Hub. What should I do?

We can provide a private Docker Hub, so that only the CAMI team has access to the unpublished tool. Please contact support@cami-challenge.org and we will tell you the next steps.

Do I have to submit results for binning, assembly and profiling?

You are free to submit for one or multiple tasks; there is no need to submit results for all of them.

Do I have to submit my code in any particular language?

You can submit your code or software in any language you like. It should be installed and ready to execute in a Docker container. We will make installation instructions available soon. We will also offer help with this via Skype, if wanted.

How can I recompute the assembly, binning, and profiling metrics of the CAMI challenges?

You can reproduce assembly, binning, and profiling comparisons, as well as compute metrics for the results of other tools, using the assessment packages MetaQUAST (assembly), AMBER (genome binning) and OPAL (profiling). Gold standards and the results of participating binners and profilers will be made available on the CAMI data portal. Submissions of CAMI challenges participants are also available on GigaDB, in file program_results.tar.gz, and in the CAMI Community on Zenodo.

How can I fairly compare the performance of my new method with results of previous CAMI challenges, such as those in Meyer, Fritz, et al. Nature Methods 2022 and Sczyrba, Hofmann, Belmann, et al. Nature Methods 2017?

To compare your results fairly, your method should be applied under similar conditions as those faced by participating methods at the time of the challenges. In particular:

  1. Your method should only use data provided by CAMI at the time of the challenges. If it requires other training or reference data, your method should not profit from new sequences in those data, nor should you ensure that they contain the sequences or underlying genomes of the CAMI datasets. Training or reference data should be from a database released before the end of the challenges (see schedule and some database options at /datasets, 'Databases' tab). The taxonomy, if applicable, should be the one provided for the particular CAMI dataset in question (see also /reference-databases).
  2. The method should not be optimized using gold standards (aka standards of truth), which usually become available after the challenges.
  3. For performance assessment, you should use CAMI gold standards and not create your own ones.
  4. If you intend to publish your new results, we strongly recommend that you ensure their reproducibility and follow the FAIR principles. This includes providing the command lines or parameters of your method and of the methods it is compared with, their versions, reference databases, as well as metric definitions and any other relevant information.
  5. We also recommend that you provide a digital object identifier (DOI) for your 'raw' results, such as assembly, binning, or taxonomic profile, by storing them in the public repository of the CAMI Zenodo Community in the appropriate CAMI format.

How can I hear the latest news from CAMI?

You can: