Sample source and purity

Verification of Source Species for an Assembly

As a general principle, a report on a sample falls into one of three types: i) the sample contains the expected species, ii) the sample does not contain that species, or iii) the report provides no information on this issue.

In particular, the recent 18S rRNA analysis has many samples that did not pass the source validation test. However, that does not indicate that these samples are wrong, merely that the 18S analysis does not provide an answer. Accordingly, these are type iii results, and a different analysis can validate the sample without conflict.

As we receive more reports about each sample, some of them will inevitably conflict. From past experience, some of these reports will not be definite, but may be statements like, "sample <XXXX> is odd, maybe it is not a <SPECIES>." Therefore, each report will have to be classified as to whether it is a robust result or not.

Summary of Multiple Reports

Should there be multiple conflicting reports, analyses judged to be more definite/robust will be preferred. Reports that are specific to that sample are also preferred. The assumption is that if someone has specifically investigated one sample, that is likely a more accurate result than a general project-wide analysis.

Samples with significantly conflicting status reports or which remain unvalidated will be flagged for detailed follow-up.

Worrisome Contamination

Significant contamination (other plant material) will be assessed using the same principles. However, the degree to which an analysis is considered definitive may differ between the two analyses.

Combined Assemblies

A report for a source sample used in a combined assembly may or may not apply to the combined assembly. It will depend on whether the report is positive or negative. Similarly, a report on the combined assembly may or may not also apply to the source materials.

18S rRNA Analysis

To confirm sample source and purity, 1KP assemblies were compared by blastn to a reference set of 18S rRNA sequences derived from the SILVA SSU database (http://www.arb-silva.de/).  Only SILVA entries with a clear 18S rRNA annotation in the NCBI nt database were used.

Nuclear 18S sequences were preferred because more reference sequences are available, ensuring a dense sampling across the Viridiplantae.  Alignments to chloroplast and mitochondrial SSU sequences were detected by searching for the patterns chloroplast*, plastid, and mitochondri*, and were subsequently ignored.

Short and low-identity alignments are not reliable for determining taxonomic sources as they may be taxonomically ambiguous, aligning well with sequences from distantly related species.  Hence alignments shorter than 300 bp or with E-values above 10e-9 were also ignored.
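The filtering rules above can be sketched as a small shell function. This is only an illustration, not the pipeline actually used: it assumes the blastn hits were written as tab-separated text with the columns qseqid, sseqid, pident, length, evalue, bitscore, stitle, and the function name filter_18s_hits is hypothetical. The E-value cutoff is taken literally from the text as 10e-9.

```shell
# Hypothetical helper: filter tab-separated blastn hits read from stdin.
# Assumed column order: qseqid sseqid pident length evalue bitscore stitle
filter_18s_hits() {
  awk -F'\t' '
    # ignore organelle SSU hits (patterns chloroplast*, plastid, mitochondri*)
    tolower($7) ~ /chloroplast|plastid|mitochondri/ { next }
    # ignore alignments shorter than 300 bp
    ($4 + 0) < 300 { next }
    # ignore alignments with E-values above the stated 10e-9 cutoff
    ($5 + 0) > 10e-9 { next }
    { print }
  '
}
```

Under these assumptions it could be used as `filter_18s_hits < hits.tsv > hits.filtered.tsv`, where both file names are placeholders.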

Thank you to Shaungxiu Wu (CAS Key Laboratory of Genome Sciences and Information, Beijing) and her group who have done this work.

Categories of Validation

Many 1KP samples contain non-plant sequences especially from bacterial, fungal, or insect sources.  This kind of "contamination" is not a problem for most analyses and is described in our summaries as "harmless".  It is reported when scaffolds are present for which the best alignments were to non-plant sequences.

A sample source is validated if the best alignments for all of the ribosomal scaffolds are to sequences from species within the same taxonomic family as the sample source. If the best alignment for one of the scaffolds matches the expected source at either the genus or species level, then this more precise validation is also noted.

Lastly, "worrisome" contamination was reported when a scaffold's best alignment was to a plant species (Viridiplantae, Glaucocystophyceae, Rhodophyceae, Cryptophyceae, Haptophyta, or Stramenopiles) outside of the expected source family.  This status does not mean that a problem has occurred, only that more attention is warranted.  Final status will be assigned after a manual inspection of the assemblies by plant experts within the 1KP consortium, which is ongoing and not yet complete.
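As a rough illustration of the three categories, the sketch below labels each scaffold's best hit. It is not the consortium's code: the input layout (scaffold, major taxonomic group, family, tab-separated), the label names, and the function name label_best_hits are all assumptions made for the example. It works per scaffold only and does not attempt the sample-level decision or the genus/species refinement.

```shell
# Hypothetical helper: label the best hit of each ribosomal scaffold.
# stdin: scaffold <TAB> major_group <TAB> family   (assumed layout)
# $1:    expected source family of the sample
label_best_hits() {
  awk -F'\t' -v expected="$1" '
    BEGIN {
      # groups treated as "plant" in the text above
      n = split("Viridiplantae Glaucocystophyceae Rhodophyceae Cryptophyceae Haptophyta Stramenopiles", g, " ")
      for (i = 1; i <= n; i++) plant[g[i]] = 1
    }
    {
      if ($3 == expected)   label = "expected_family"  # supports validation
      else if ($2 in plant) label = "worrisome"        # plant, outside family
      else                  label = "harmless"         # non-plant source
      print $1 "\t" label
    }
  '
}
```

For example, `label_best_hits Brassicaceae < best_hits.tsv` would print one label per scaffold; any "worrisome" line would mark the sample for manual inspection.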

Limitations of Method

Our analysis relies on ribosomal small sub-unit material being assembled from each sample. Because a significant fraction of a cell's RNA is ribosomal, this is likely to be a sensitive detector of contamination.  However, if the contamination is from a closely related species, the sequences will co-assemble. Experimentally, we have found that this can happen when ribosome sequences differ by 2% or less. Such contamination will not be reported by our methodologies.

Comparison with Other Results - 1. Barkman

Todd Barkman has constructed trees with SABATH methyltransferase sequences and then manually decided whether samples are taxonomically misplaced. When results from his efforts are compared with the 18S rRNA taxonomic validation, they agree for 94% of samples.

Barkman's Classification   18S Validated   Not Validated   No 18S Result
Taxonomically Good         831             35              18
Problems/Questionable      18              12              1
No Data                    376             25              11

His detailed report, with an assessment for each assembly, is available in 1kp-Barkman.xlsx. The category codes are explained on the second sheet of the workbook. The above table groups categories 1-3 and 4-5. Also available is a spreadsheet listing samples which failed either source validation: Sample Source Issues.xlsx.
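One reading of the 94% figure, offered here as an assumption rather than the authors' stated calculation, is that it covers only samples with both a Barkman call and an 18S call (the Validated and Not Validated columns of the Taxonomically Good and Problems/Questionable rows), and counts Good/Validated plus Problems/Not Validated as agreement. The helper name agreement_pct is hypothetical.

```shell
# Hypothetical helper: percentage agreement, printed to one decimal place.
agreement_pct() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * a / n }'
}

# Counts copied from the table above; the choice of cells is an assumption.
agree=$(( 831 + 12 ))               # Good+Validated, Problems+Not Validated
compared=$(( 831 + 35 + 18 + 12 ))  # samples with both calls available
agreement_pct "$agree" "$compared"  # prints 94.1
```

With these cells, 843 of 896 samples agree, which is consistent with the 94% quoted above.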

Comparison with Other Results - 2. Mirarab

A number of samples have noticeably odd locations in the capstone test MAFFT tree produced by Siavash Mirarab. These are:

LVNW  Basal Eudicots              Cocculus laurifolius
WPYJ  Magnoliids                  Frankenia laevis
DYFF  Core Eudicots/Asterids      Pycnanthemum tenuifolium
XMQO  Basal Eudicots              Gunnera manicata
JLLY  Core Eudicots/Rosids        Melaleuca quinquenervia
CYVA  Basal Eudicots              Cimicifuga racemosa
QJXB  Core Eudicots/Rosids        Wikstroemia indica
FWBF  Core Eudicots               Alangium chinense
FONV  Core Eudicots/Rosids        Greyia sutherlandii
NPND  Basalmost angiosperms       Ceratophyllum demersum
ULGV  Core Eudicots/Asterids      Morinda citrifolia
JBGU  Core Eudicots               Amaranthus palmeri
YMES  Monocots/Commelinids        Typhonium blumei
JBLI  Eusporangiate Monilophytes  Bolbitis repanda
FITN  Liverworts                  Treubia lacunosa
NIJU  Core Eudicots/Rosids        Heteropyxis natalensis
UZNH  Core Eudicots/Asterids      Curtisia dentata
IQJU  Hornworts                   Anthoceros formosae
FANS  Hornworts                   Leiosporoceros dussii

Comparison with Other Results - 3. Human Genome

Each of the datasets was mapped to a human genome reference (available at https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.29 ) using Bowtie 2 (version 2.2.4).  Then the number of read-pairs that cleanly aligned was counted.

This provides a count of human-like reads in the library.  For most samples these reads are a small fraction of the total.  However, a few cases have much larger counts, suggesting that substantial contamination with human material may have occurred.  A spreadsheet with details is here.

This technique is not intended to be perfect, but provides a rapid estimate.  For RNA contamination the result will be an under-count, as introns will prevent the reads from aligning with the genome and being counted.  Similarly, read-ends that do not align in the expected paired-end fashion are not counted.

Example of the commands used:

# align reads to the genome reference - output temporary file (AALA.sam)
bowtie2 --phred64 --no-unal -x GCF_000001405.29_GRCh38.p3_genomic \
   -1 AALA-read_1.fq -2 AALA-read_2.fq -S AALA.sam

# print first read of properly-mapped (flag 64+2) read-pairs and count (lines)
samtools view -f 66 AALA.sam | wc -l

SUMMARY OF RESULTS

Here now are the latest results, BEFORE manual inspection by our plant experts. Detailed analysis reports are available in 1328_statistics_final.xls and 1328_blast_info_2.xls.

Unconfirmed Source (No Worrisome Contamination)

IRBN  Scapania nemorosa
YZGX  Cyrilla racemiflora
TFDQ  Monoclea gottschei
HTDC  Tamarix chinensis
AEXY  Blasia sp.
TMAJ  Neckera douglasii
QGLJ  Cocos nucifera
UHJR  Citrus x paradisi
QAIR  Opuntia sp.

No Family-level Reference Sequence in Database

AZBL  Petiveria alliacea
OQHZ  Quillaja saponaria
BNTL  Souroubea exauriculata
YQEC  Woodsia ilvensis
PNZO  Culcita macrocarpa
EWXK  Thyrsopteris elegans
CWLL  Schlegelia parasitica
YJJY  Woodsia scopulina
OQON  Entocladia endozoica
EGNB  Scourfieldia sp.
GAKQ  Schlegelia parasitica
PQED  Gloeochaete wittrockiana
OCZL  Homalosorus pycnocarpos
COBX  Polypremum procumbens
HTFH  Onoclea sensibilis
TAVP  Calliergon cordifolium
OFTV  Barbilophozia barbata
AJAU  Helicodictyon planctonicum
VHIJ  Blastophysa cf. rhizopus
VJDZ  Botryococcus sudeticus
POOW  Glaucocystis cf. nostochinearum
YGAT  Phyllanthus sp.
HYZL  Akania lucens

No SSU rRNA Sequence Found in Assembly

SHEZ  Dianthus caryophyllus
WWSS  Taxus baccata
JVBR  Aloe vera
ZYAX  Taxus cuspidata
PAWA  Aristolochia elegans
XSZI  Peperomia fraseri
HSXO  Ancistrocladus tectorius

Unconfirmed Source & Worrisome Contamination

XXHP  Cystopteris fragilis
TJES  Spergularia media
HEGQ  Gymnocarpium dryopteris
RXEN  Polycarpaea repens
XONJ  Camptotheca acuminata
DCCI  Calceolaria pinifolia
EZXQ  Cleome violacea
YOWV  Cystopteris protrusa
RICC  Cystopteris reevesiana
NIJU  Heteropyxis natalensis
QZZU  Pyrenacantha malvifolia
LVNW  Cocculus laurifolius
EDXZ  Schlegelia violacea
MBQU  Cleome gynandra
KUXM  Selaginella selaginoides
IQJU  Anthoceros formosae
QSKP  Polanisia trachysperma
ZYCD  Selaginella acanthonota
UZNH  Curtisia dentata
HNDZ  Cystopteris utahensis
JDQB  Neocallitropsis pancheri
RTTY  Salvadora sp.
RNBN  Mollugo cerviana
FITN  Treubia lacunosa
UPZX  Cleome viscosa
SKNL  Saponaria officinalis
PKMO  Cistus inflatus
OLES  Schiedea membranacea
GIWN  Sarcobatus vermiculatus
RUUB  Physena madagascariensis
CWZU  Betula pendula
ZFGK  Selaginella kraussiana
FAKD  Nelumbo sp.
CTYH  Basella alba
JBGU  Amaranthus palmeri
OTAN  Deutzia scabra
FANS  Leiosporoceros dussii
VWIP  Carya glabra
HUSX  Roridula gorgonias
PVGM  Oncotheca balansae
FIDQ  Undaria pinnatifida
KVAY  Tribulus eichlerianus
QJXB  Wikstroemia indica
JPDJ  Symplocus tinctoria
ZLOA  Cleome gynandra
NLOM  Pediastrum duplex
LWDA  Alnus serrulata
YXNR  Triodia aff. bynoei
VBMM  Claopodium rostratum
LKKX  Talinum sp.
EBWI  Ochromonas sp.
ULGV  Morinda citrifolia
TJLC  Nothofagus obliqua
HELY  Cleome violacea
QTJY  Euptelea pleiosperma
LHLE  Cystopteris fragilis
OKEF  Hibbertia grossulariifolia
OBTI  Peganum harmala

Worrisome Contamination (Source Confirmed)

MLPX  Papaver setigerum
IAYV  Rhodomonas sp.
LJPN  Gracilaria blodgettii
LJQF  Draba ossetica
PWKQ  Gracilaria sp.
VKVG  Synura sp.
GTSV  Draba hispida
AZZW  Chlorokybus atmophyticus
LSKK  Orchidantha maxillarioides
OJCW  Maesa lanceolata
UKUC  Dunaliella salina
JGGD  Sargassum muticum
BXBF  Draba sachalinensis
NBYP  Mesotaenium kramstae
VZWX  Ceramium kondoi
ZZEI  Phylloglossum drummondii
RFAD  Pavlova lutheri
RTMU  Calypogeia fissa
ULKT  Lycopodiella appressa
BZSH  Golenkinia longispicula
LXRN  Prymnesium parvum
FOYQ  Microspora cf. tumidula
WZFE  Ascarina rubricaulis
XAXW  Neosiphonia japonica
MWAN  Chlorella minutissima
IEHF  Dumontia simplex
XKWQ  Pediastrum duplex
QHVS  Ophioglossum vulgatum
WEJN  Mazzaella japonica
VNAL  Gracilaria vermiculophylla
BAKF  Cryptomonas curvata
RKGT  Eschscholzia californica
UGPM  Chondrus crispus
IIFB  Oenothera gaura
FZQN  Silene latifolia
IFCJ  Canella winterana
IOVS  Pseudotsuga wilsoniana
KRUQ  Porella navicularis
HVBQ  Tetraphis pellucida
LACT  Oenothera gaura
PTBJ  Plantago virginica
NMAK  Pavlova lutheri
BAJW  Isochrysis sp.
YBQN  Odontoschisma prostratum

Status Changed After Manual Review

LETF  Planophila laetevirens      culture collection no longer uses original species identification (P. terrestris)
GJIY  Pseudoneochloris marina     species change to match culture collection
EEJO  Ettlia oleoabundans         species change to match culture collection
WDCW  Mesotaenium endlicherianum  18S rRNA chimeric with human
MFYC  Oocystaceae species         18S rRNA indicates change (was Nannochloris atomus)
CYVA  Apiales species             multiple assembly confirmation (was Cimicifuga racemosa)
PZIF  Scenedesmus dimorphus       "contaminant" assembly found to be in genus by manual blastn search of nr

Manually Confirmed Problems

NLOM  Pediastrum duplex       fungal sequence present
GDUD  Chloromonas reticulata  brown algae 18S rRNA present
WGMD  Zygnema sp.             brown algae 18S rRNA present
OVHR  Chlamydomonas bilatus   brown algae 18S rRNA present