1KP file-naming conventions
Every sample is assigned a unique four-letter code that never changes. If we have to resequence because of a failed experiment, the new sample is given a new code. To aid identification, some (but not all) of the file names and file contents are supplemented with a species name. However, species names are not universally recognized, and of course we make mistakes too. Hence the names can change, even up to the last minute before publication.
To avoid having to make repeated changes to the file names and file contents, we generally keep the names initially assigned. Our current species names are at http://www.onekp.com/samples/list.php, and the way to index them is by the unique four-letter codes, because these never change. For example, the directory
spiderwort-mature_leaf has the code
WHHY and the species is currently (2012-11-30) listed as Boerhavia coccinea. However, the species name assigned when the sample was first acquired (Boerhavia cf. spiderwort) is used throughout the directory, in the file names and in the file contents.
In most instances we sequenced only one tissue sample per species. But for those few species where we sequenced more than one tissue, a “combined sample” is created by pooling all of the read-pairs for that species. The same was done when, for whatever reason, we happened to sequence the same species and tissue more than once. The resultant data sets are given names like
2_samples_combined, indicating, in this instance, a pool of the two samples named
Altogether, we sequenced 1345 samples (from 1174 species), and if we include the 111 combined directories, there are 1456 assemblies. Given that we do not correct the species names until publication, we are maintaining another website to track all known naming problems, as well as possible contamination issues.
A typical sample directory is shown below. Additional files containing intermediate results might also be present in the sample directory.
Unassembled RNAseq read-pairs
All of the sequencing was done at BGI-Shenzhen. Read-pairs that fail a minimum quality threshold are discarded. The remaining read-pairs are divided across two FASTQ files and stored under the
solexa-reads folder. These files use the Phred+64 convention to represent quality scores. See http://en.wikipedia.org/wiki/Fastq for details. Note that the above-mentioned "combined samples" will not have a
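As a minimal sketch of what the Phred+64 convention means in practice, the Python snippet below converts a quality string into integer scores by subtracting 64 from each character's ASCII code. The function names and the example quality string are hypothetical and are not taken from the 1KP files.

```python
def phred64_to_scores(quality_string):
    """Convert a Phred+64 encoded quality string to integer quality scores."""
    # Under Phred+64, the score is the character's ASCII code minus 64.
    return [ord(ch) - 64 for ch in quality_string]


def error_probability(score):
    """Estimated probability that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)


# Hypothetical quality string, not taken from a real 1KP read.
example_scores = phred64_to_scores("hhgfB")
print(example_scores)
```

Many current tools expect Phred+33 by default, so the offset usually has to be stated explicitly when these files are read.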
In 2012, we ran an assembly using the newly developed SOAPdenovo-Trans and GapCloser software from BGI-Shenzhen (http://soap.genomics.org.cn/SOAPdenovo-Trans.html). The assembled sequences are located in the
assembly folder. The scaffold names (e.g.
scaffold-AALA-2079325-Meliosma_cuneifolia) embed the sample’s four-letter code, a scaffold number, and a description of the source material, in this case just the species name. Scaffold numbers are unique to an assembly. In particular, they begin at 2000000 for the current assemblies. Lower numbers are used for older assemblies and, should we do them, higher numbers would be issued for newer assemblies.
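The naming scheme described above can be unpacked with a few lines of Python. This is only a sketch: the function and class names are hypothetical, and it assumes that the first three hyphens always separate the literal word "scaffold", the four-letter code, and the scaffold number, so that any further hyphens belong to the description.

```python
from typing import NamedTuple


class ScaffoldName(NamedTuple):
    code: str         # four-letter sample code
    number: int       # scaffold number, unique within an assembly
    description: str  # source material, e.g. the species name


def parse_scaffold_name(name):
    """Split a scaffold name into its code, number, and description."""
    # Split into at most four parts so hyphens inside the description survive.
    parts = name.split("-", 3)
    if len(parts) != 4 or parts[0] != "scaffold":
        raise ValueError(f"unexpected scaffold name: {name!r}")
    _, code, number, description = parts
    return ScaffoldName(code, int(number), description)


parsed = parse_scaffold_name("scaffold-AALA-2079325-Meliosma_cuneifolia")
print(parsed.code, parsed.number, parsed.description)
```

Because the four-letter code never changes, it is the safer key for joining scaffolds to the sample list than the species name in the description.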
Major configuration parameters used for SOAPdenovo-Trans:
k-mer size used for de Bruijn graphs
minimum read coverage for contigs
minimum contig length for scaffolds
run the internal gap filling step for SOAPdenovo-Trans
maximum number of putative alternative splice forms
average insert size used in sequencing library
Default values were used for any parameter not listed above.
Assembly statistics per scaffold
Each line has a scaffold name, an approximate mean read depth, the number of reads in the scaffold, the number of bases in these reads, the total length of the scaffold, and the numbers of A, T, C, G, and N bases. Information about the numbers of reads and their bases is obtained from the readOnScaf file. Due to the specifics of how SOAPdenovo-Trans works, these values are only approximate and are probably a little on the low side.
For example, “scaffold-KEFD-2000876-Encalypta_streptocarpa 6.7 27 2235 335 108 114 51 43 19” is a scaffold assembled from 27 reads with a mean read depth of 6.7 (i.e. 2235/335), 19 undefined (gap) bases, and a GC content of 30% (i.e. (51+43)/(108+114+51+43)).
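The arithmetic in this example can be reproduced with a short Python sketch. It assumes the fields are whitespace-separated and appear in the order listed above; the function names and dictionary keys are hypothetical.

```python
def parse_stats_line(line):
    """Parse one per-scaffold statistics line into a dictionary."""
    fields = line.split()
    keys = ("reads", "read_bases", "length", "A", "T", "C", "G", "N")
    record = {"name": fields[0], "depth": float(fields[1])}
    # The remaining eight fields are integer counts, in the documented order.
    record.update(zip(keys, (int(x) for x in fields[2:10])))
    return record


def gc_content(record):
    """Fraction of G and C among the defined (non-N) bases."""
    defined = record["A"] + record["T"] + record["C"] + record["G"]
    return (record["C"] + record["G"]) / defined


line = "scaffold-KEFD-2000876-Encalypta_streptocarpa 6.7 27 2235 335 108 114 51 43 19"
rec = parse_stats_line(line)
print(round(rec["read_bases"] / rec["length"], 1))  # mean read depth
print(round(100 * gc_content(rec)))                 # GC content in percent
```

In this example the five base counts sum to the scaffold length (335), which gives a quick sanity check when reading the file.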
BLASTX to NCBI
All assembled scaffolds were searched against the NCBI’s nr peptide sequence database (non-redundant GenBank CDS translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF, Release 54, July 2012) using BLASTX. The output was filtered at a maximum E-value of 1E-10 and only the top 5 sequence matches were retained in the
SOAPdenovo-Trans-assembly.fa.bz2_blastx file under the
Translation to protein sequences
All assembled scaffolds longer than 300 bp were queried against all NCBI RefSeq plant sequences (Release 54, July 2012) using BLASTX. The best matching protein coding genes were used to generate GeneWise translations using a modified TransPipes pipeline (described here). Inferred amino acid sequences can be found in the
assembly directory for each taxon (
CODE-SOAPdenovo-Trans-translated.tar.bz2). Translated assembly names match the nucleotide assemblies from which they are derived.
This issue is not repeated for the orthogroup analysis, which labels sequences using the original seven digit numbers.
Gene clusters by OrthoMCL
The lab of Claude dePamphilis developed a gene family circumscription pipeline employing OrthoMCL software to produce a framework for operationally defined gene families. OrthoMCL clustering of gene models from 22 annotated land plant genomes (represented in the tree below) resulted in circumscription of 53,136 gene clusters which we are treating as hypothetical gene families. Sequences were aligned for each cluster and HMM profiles were estimated. The resultant profiles were used to sort the translated amino acid sequences into gene families.
HMM profiles were used to query the inferred protein sequences for each taxon using hmmsearch (part of the HMMER suite). Bit-scores for matches with E-values better than 1E-10 were retained. A cumulative probability distribution for these bit-scores was assessed to identify one or more HMMs accounting for 95% of the distribution. Most transcripts sorted into a single gene family for which the HMM match had a probability of 95% or greater, but some transcripts sorted to two or more families when bit-score probabilities from multiple HMMs were required to reach a 95% confidence level that the assembly was sorted to a correct gene family (i.e. OrthoMCL cluster).
A web API is available at http://iptol-api.iplantcollaborative.org/onekp/v1.
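The selection rule can be illustrated with a small Python sketch that keeps the top-scoring HMMs until their share of the total reaches 95%. This is not the pipeline's actual code. The text does not say how bit-scores were converted to probabilities, so the sketch assumes, purely for illustration, that each HMM's weight is proportional to 2 raised to its bit-score. The function name and family labels are hypothetical.

```python
def families_for_confidence(bit_scores, threshold=0.95):
    """Return the top-scoring HMMs whose normalized weights sum to >= threshold."""
    if not bit_scores:
        return []
    top = max(bit_scores.values())
    # ASSUMPTION: weight proportional to 2**bit_score (not stated in the source).
    # Subtracting the top score first avoids floating-point overflow.
    weights = {hmm: 2.0 ** (score - top) for hmm, score in bit_scores.items()}
    total = sum(weights.values())
    selected, cumulative = [], 0.0
    for hmm, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(hmm)
        cumulative += weight / total
        if cumulative >= threshold:
            break
    return selected


# Hypothetical bit-scores for one transcript against three family HMMs.
print(families_for_confidence({"fam_A": 150.0, "fam_B": 150.0, "fam_C": 90.0}))
```

Under this assumption a transcript with one dominant hit is assigned to a single family, while two near-equal hits are both needed to reach the threshold, which matches the behaviour described above.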
Obtain an authorization token to query the database
- username (required): Username
- password (required): Password
Important: the API uses digest access authentication.
Example (using cURL)
curl -X GET --digest -sku "username:password" "http://iptol-api.iplantcollaborative.org/onekp/v1/login"
GET | POST /orthogroups
Obtain the sequences for all the members of an orthogroup, given the ID of one of the genes in the group from these 22 sequenced species.
- accession (required, string): One or more valid gene identifiers from one of these 22 genomes, separated by whitespace. (e.g. PACid:18158545, AT2G43210.1)
- token (required, string): Authentication token.
- format (optional): The format to be returned: faa: amino acid sequence in FASTA format, fna: nucleotide sequence in FASTA format, zip: zipped amino acid and nucleotide sequences, json: JavaScript Object Notation object. Defaults to faa for a single query identifier and to json for multiple query identifiers.
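As a sketch of how a GET request to this endpoint might be assembled, the Python snippet below builds the query URL from the three parameters listed above. The helper name is hypothetical, and it assumes the parameters are passed as an ordinary URL-encoded query string; the snippet only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

BASE_URL = "http://iptol-api.iplantcollaborative.org/onekp/v1"
ALLOWED_FORMATS = ("faa", "fna", "zip", "json")


def build_orthogroups_url(accessions, token, fmt=None):
    """Build a GET URL for /orthogroups (assumes query-string parameters)."""
    # Multiple gene identifiers are joined with whitespace, as documented.
    params = {"accession": " ".join(accessions), "token": token}
    if fmt is not None:
        if fmt not in ALLOWED_FORMATS:
            raise ValueError(f"unknown format: {fmt!r}")
        params["format"] = fmt
    return f"{BASE_URL}/orthogroups?{urlencode(params)}"


# "TOKEN" is a placeholder for the value returned by the login call.
print(build_orthogroups_url(["PACid:18158545", "AT2G43210.1"], "TOKEN", "fna"))
```

Leaving out the format argument lets the service apply its documented defaults (faa for one identifier, json for several).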
Graphical user interfaces are also provided for single and multiple queries. Visit http://iptol-api.iplantcollaborative.org/onekp/v1/
Website for BLAST searches
The entire 1KP data set is available for BLAST searches, courtesy of the China National GeneBank, at a password-protected website. Go to http://www.onekp.com/ and click on "view available sequences". Users can search either all of the samples or a phylogenetically defined subset of samples, with the caveat that the categorizations are subject to change after the phylogenomics analysis for the capstone is complete. Note that by default the site shows only the public data. For access to the complete dataset, you must log in.
- For questions regarding assemblies and data access please contact Eric Carpenter
- For questions regarding the BLAST service please contact Zhixiang Yan
- For questions regarding translations and orthogroups please contact Naim Matasci
- For questions regarding data usage please contact Gane Ka-Shu Wong
- Scientific inquiries of broad interest can be sent to the 1KP Gene Families mailing list