We have initiated a project aiming to generate draft genomes for all the remaining unsequenced Caenorhabditis species.

The sequencing of the genome of the nematode Caenorhabditis elegans remains one of the milestones of modern biology, and this genome sequence is the essential backdrop to a vast body of work on this key model organism. As Dobzhansky said, “Nothing in biology makes sense except in the light of evolution”, and it is clear that complete understanding of C. elegans will only be achieved when it is placed in an evolutionary context.

Once the genomes we sequence are “completed”, they will be submitted to EMBL/GenBank and WormBase in the normal way: this site exists to allow you “early access” to the data, to foster collaboration and understanding. We release the data here under the usual Fort Lauderdale/Bermuda agreements – where we retain the rights to publish whole genome analyses of the new species, but do not prevent you from using the data for per-gene/per-system analyses. Obviously, the best way to promote whole genome analysis and publication is to collaborate with us.

Introduction 

C. elegans is but one nematode, an “anecdotal” instance of how a genomic system generates a complex organism. But how did this system come to be? Which parts are historical accident and which are the result of selection? What competing forces are at work in shaping the genome – its composition, size, synteny and linkage dynamics, repeat content, mobile element diversity, gene structure, gene birth and death, sequence diversity, …? To deliver answers to these questions (and many more) we contend that genome sequence information from as many related species as possible within the genus Caenorhabditis will form an essential backdrop to specific research programmes.

The time is now ripe for a programme to sequence the diversity of Caenorhabditis. In the last 10 years, there has been a remarkable global effort of discovery of new species, sparked by the Félix lab’s discovery of the likely “true” ecology of Caenorhabditis in rotting fruits and other plant material [2]. The number of species in culture now exceeds 40 (and is growing) and their relationships have been robustly inferred using multi-locus analyses by the Kiontke lab [3]. Several genomes are already available. After the success of the C. elegans genome project, the genome of C. briggsae was sequenced [4], the NHGRI sponsored the sequencing of C. nigoni, C. brenneri and C. tropicalis at WUGSC [5], the Sternberg lab has sequenced C. angaria [6], the Phillips lab has sequenced C. remanei, and the Blaxter lab has sequenced C. wallacei (aka sp. 16), sp. 5 and sp. 1.

A step change in sequencing technologies, and in assembly algorithms, now means that good-enough genomes can be generated quickly, efficiently and cheaply. We therefore have embarked on a project to “complete” the sequencing of all Caenorhabditis species currently available in culture, a Caenorhabditis Genomes Project (CGP). The project will be funded largely from generous application of intramural support from Edinburgh Genomics (http://genomics.ed.ac.uk), and led by the Blaxter laboratory in Edinburgh (http://www.nematodes.org), but we invite all interested researchers to join us in an open collaboration. Additional funding will be sought to improve the genome assemblies, and any support available in the community will significantly improve what can be done. We expect that additional species will be discovered, and would hope to add them to the project as they are defined.

The strategy
The current roster of genomes, and their status, is now available at caenorhabditis.org (previously http://caenorhabditis.bio.ed.ac.uk). We intend that the CGP will be an open collaboration and will be making data available for free download under the “usual” agreements – basically that anyone carrying out whole genome analyses contacts us before proceeding to publication (and preferably much earlier) so that we can all coordinate efforts. There is so much to be done that collaboration will be essential.

Data generation
Our strategy is to ask researchers with live cultures, preferably inbred strains, to make DNA and RNA and to ship these to Edinburgh for sequencing. We are not demanding that inbred lines be generated, as this process often takes many months, and can generate very sick nematodes that are unlikely to be good representatives for their species. Advances in assembly routines mean that we are much better able to deal with heterozygosity issues during assembly. We are currently generating a standard dataset for each species (125-base paired-end data from two short-insert genomic libraries at 350 and 550 bases [~80 M read pairs, or ~100x coverage], and stranded RNASeq data [~25 M read pairs]) using Illumina HiSeq2500v4 instruments. For selected species we may also produce Illumina mate-pair libraries and/or PacBio data (and would encourage colleagues with special interest in a species to “sponsor” the generation of these additional scaffolding data).
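
For readers who want to relate the read counts above to sequencing depth, the arithmetic can be sketched as follows. This is an illustration only, not part of the project's pipeline: the function names are hypothetical, and the 100 Mb genome size in the last step is an assumed round figure (roughly the size of the C. elegans genome), not a value stated for any of the new species.

```python
def sequenced_bases(read_pairs, read_length):
    """Total bases in a paired-end run: two reads per pair."""
    return read_pairs * 2 * read_length


def coverage(read_pairs, read_length, genome_size):
    """Mean depth of coverage for a genome of the given size (in bases)."""
    return sequenced_bases(read_pairs, read_length) / genome_size


def genome_size_for_coverage(read_pairs, read_length, target_coverage):
    """Genome size (in bases) at which the run gives the target coverage."""
    return sequenced_bases(read_pairs, read_length) / target_coverage


# Figures taken from the text: ~80 M read pairs of 125-base reads.
READ_PAIRS = 80_000_000
READ_LENGTH = 125

total = sequenced_bases(READ_PAIRS, READ_LENGTH)
implied_size = genome_size_for_coverage(READ_PAIRS, READ_LENGTH, 100)

# ASSUMPTION: a 100 Mb genome, used purely for illustration.
depth_at_100mb = coverage(READ_PAIRS, READ_LENGTH, 100_000_000)

print(f"total bases sequenced: {total:,}")
print(f"genome size giving 100x: {implied_size:,.0f} bases")
print(f"coverage of an assumed 100 Mb genome: {depth_at_100mb:.0f}x")
```

On these numbers the standard dataset is about 20 Gb of genomic sequence, so the quoted ~100x figure corresponds to a genome of about 200 Mb; a smaller genome would be covered proportionally more deeply.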

Primary analyses
Raw data will be posted on the project website as it is generated and passes QC (and also uploaded to SRA). Colleagues are free to download and analyse the raw data. We will be building best-effort assemblies for each genome, possibly by having collaboratively competitive mini-assemblathons for each set of species as they come off the sequencers. Assemblies will be posted along with explicit recipes describing how they were generated and core quality metrics.
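
The text does not list which quality metrics will be reported. As a hedged illustration, the sketch below computes a few summary statistics that are commonly quoted for assemblies (number of contigs, total span, longest contig and N50) from a list of contig lengths. The function names and the example lengths are hypothetical, and this is not the project's own reporting code.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def assembly_summary(lengths):
    """Collect a few simple summary metrics for an assembly."""
    return {
        "num_contigs": len(lengths),
        "total_span": sum(lengths),
        "longest": max(lengths),
        "n50": n50(lengths),
    }


# Made-up contig lengths, for illustration only.
example_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(assembly_summary(example_lengths))
```

In practice the lengths would be read from the assembly FASTA file, and metrics of this kind are usually reported alongside measures of gene-level completeness.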

Annotation
We will perform best-practice gene finding on each species using the stranded RNASeq and comparative data from other species, and decorate the genomes with annotation (sequence similarity, domains, expression values). The genome annotation files (and a description of the protocols used) will be posted for download. A combination of skills and approaches will give the best results and we will coordinate “annotatathons”, perhaps using collaborative platforms such as WebApollo. In particular, we propose to perform bulk reannotation of all species, following the same protocols for each, periodically (for example when we hit 15 or 20, or all species).

Genome databasing and publication
Genome sequences, genes and annotations will be made available through a local genome explorer (a BADGER [7] instance). The BADGER “versions” of the genomes will not act as “databases of record” – we are not intending to replicate WormBase – but rather interim homes for the data to spur research and cooperation. When a genome reaches a stable annotation status, we will deposit it in INSDC (ENA/GenBank/DDBJ) and WormBase [8]. We will aim to promote peer-reviewed publication of the genomes and analyses, and will also publish data papers so that the genomes can be sensibly used and cited as early as is possible.

Project timing, oversight, staffing
We have started the project. In addition to the three species sequenced by the Blaxter lab in collaboration with Asher Cutter and Marie-Anne Félix already, the Félix lab has provided genomic DNA and RNA from eight new species, and data has been generated for four of these (as of 01 Nov 2014). For many other species DNA and RNA are being generated, and the Rockman, Phillips, Fierst and Wang labs are sequencing additional taxa (and strains). We hope to complete the sequencing in Edinburgh by late Spring 2015, and have assemblies by late Summer 2015. Obviously as data is to be released as we generate it, there will be incremental updates as we approach completion.

We will maintain a project blog, announcing upcoming data, and also an annotation/interest roster where individuals and groups can express interests in species or analysis topics. An open Google group will be used to foster discussion and data sharing. Management and oversight will be light. We propose that an oversight group (composed of – minimally – Mark Blaxter, Marie-Anne Félix, Karin Kiontke, Erich Schwarz, and a WormBase representative) will coordinate data release announcements and assure quality through open conference calls.