2013-11-14

Trouble with chimeras - getting all complete viral genomes from the NCBI

Back in 2009, I wrote some Python scripts to use the NCBI Entrez Utilities to search for and download all known complete virus genomes in GenBank format, which I then processed to make FASTA files and BLAST databases. Recently I updated them and ran into some problems... false positives like entire bacterial genomes! This turns out to be due to a few bacteria with integrated phage being annotated as chimeras - genomes combined from multiple organisms.


The core of this task uses esearch (NCBI Entrez Search) on the nucleotide database, originally with the following query: "txid10239[orgn] AND complete[prop]". This restricted the search to organisms (shortened to [orgn] or [ORGN]) under the NCBI taxonomy tree for Taxon ID 10239, the superkingdom for viruses, with the "complete" property (shortened to [prop] or [PROP]).

Fast-forward to 2013, as the databases grew there are now noticeable numbers of "complete" CDS sequences which I don't want. After some poking about exploring the current property set via the advanced search in the Entrez web interface, I settled on this query "txid35237[orgn] AND complete[prop] AND genome" instead (although there ought to be a more precise way to restrict this to genome sequences).

All seemed well, I had my fetch_viruses.py script talking to the NCBI and caching all the GenBank format files locally (using the Biopython Bio.Entrez module), and merge_virurses.py turning them into non-redundant BLAST databases of complete virus genomes, their genes, and proteins. However, a week later I spotted a false positive on a set of local BLAST results, the complete genome of Bacillus licheniformis DSM 13 = ATCC 14580 was in my (dsDNA) virus genome database!

Adding a length filter to the NCBI esearch query "txid35237[orgn] AND complete[prop] AND genome AND 1000000:100000000000[Sequence Length]" there are multiple complete bacteria showing up under viruses (and some cool enormous megaviruses, minivirus, and pandoraviruses). Here are the first three hits - all apparent false positives:
Looking at these in the GenBank format all becomes clear - they are annotated as chimeras, bacteria with integrated phage, e.g.

LOCUS       AE017333             4222645 bp    DNA     circular BCT 23-MAY-2013
DEFINITION  Bacillus licheniformis DSM 13 = ATCC 14580, complete genome.
ACCESSION   AE017333
VERSION     AE017333.1  GI:52346357
...
FEATURES             Location/Qualifiers
     source          order(1..1317753,1345263..1422555,1464175..4222645)
                     /organism="Bacillus licheniformis DSM 13 = ATCC 14580"
                     /mol_type="genomic DNA"
                     /strain="DSM 13"
                     /culture_collection="ATCC:14580"
                     /culture_collection="DSMZ:DSM 13"
                     /db_xref="taxon:279010"
                     /focus
     source          1317754..1345262
                     /organism="Bacillus phage BLi_Pp2"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230651"
                     /note="PBSX homolog phage"
     source          1422556..1464174
                     /organism="Bacillus phage BLi_Pp3"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:1230652"
                     /note="putative PBLD homologe phage"
...

This seems unusual, but understandable. Historically GenBank files for bacteria had a single source feature covering the whole of the genome. The fact these (mostly) bacterial genomes now show up in virus specific searches is an unfortunate side effect.

This is still a rare problem as shown by this search which looks for complete sequences which are under both virus (taxon id 35237) and cellular organisms (taxon id 131567), "txid35237[orgn] AND complete[prop] AND genome AND txid131567[orgn]" (currently 10 matches).

The solution to avoiding these chimeras is the search "txid35237[orgn] AND complete[prop] AND genome NOT txid131567[orgn]" (viruses but not cellular organisms). I've fixed my script to do this, and am regenerating my complete virus BLAST databases.

Update (18 April 2014)

With hindsight, I neglected to consider a more straight forward approach of using the NCBI FTP site, specifically the ftp://ftp.ncbi.nih.gov/genomes/Viruses/ folder where you can download many species individually - or all the GenBank or FASTA files as a tar-ball.

Rodney Brister from the Virus Genomes Group at the NCBI also emailed to draw my attention to ftp://ftp.ncbi.nih.gov/genomes/Viruses/Viruses_RefSeq_and_neighbors_genome_data.tab which lists all virus RefSeqs and their and manually curated "neighbours" and looks useful too. He also confirmed the "chimera" problems are just a side effect of more detailed "source" annotations in some recent submissions.

Update (2 Feb 2015)

See also Brister et al. (2015) NCBI Viral Genomes Resource.

4 comments:

  1. Peter, thanks a lot for this extremely useful hint to a better use of research!
    I have a few questions: in your script why do you loop through dsDnaViruses, dsRnaViruses, ssDnaViruses, ssRnaViruses, and then allViruses? Wouldn't it be enough to just cycle through allViruses only? I checked and, unless I did some mistake, allViruses is a superset of all others.

    That said, I still find some glitches in the annotation. For example, the search http://www.ncbi.nlm.nih.gov/nuccore/?term=txid10239%5Borgn%5D+AND+complete%5Bprop%5D+AND+genome+NOT+txid131567%5Borgn%5D does not return at least one strain of HEV C 104: http://www.ncbi.nlm.nih.gov/nuccore/AB686524.1, although I think it should. Any idea on why?

    I'm quite sure that you know this already, but just in case: you can download much faster from NCBI by fetching multiple ids at a time (I read somewhere on the NCBI website around 500 at a time). My script to download looks like this.

    ids_count = len(names) # names is the list of genbank ids
    step = 250 # download 250 sequences at a time
    handle = open('your_nt_db.fasta', 'w+')
    for i in range(0, ids_count, step):
    ids = ','.join(names[i:i + step])
    fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
    data = fetch_handle.read()
    fetch_handle.close()
    handle.write(data)

    if i % 1000 == 0:
    print time.ctime(), ' Fetched %d sequences' % i

    # download the sequences outside of the range
    ids = ','.join(names[i:ids_count])
    fetch_handle = Entrez.efetch("nuccore", rettype="fasta", id=ids)
    data = fetch_handle.read()
    fetch_handle.close()
    handle.write(data)

    handle.close()


    Thanks again.

    ReplyDelete
    Replies
    1. I have that loop over dsDNA, dsRNA, ssDNA, ssRNa, then all viruses as I wanted to make BLAST databases for each type of virus as well as all viruses. And yes, it is faster to download batches of records - but instead I wanted a local cache of one file per virus, which makes updates much more efficient.

      Regarding AB686524, for some reason it does not have the complete property set - this is likely a bug in the annotation metadata, please tell the NCBI. e.g. Try this search on the Nucleotide database: AB686524 AND complete[prop]

      Delete
  2. Thanks Peter.
    In the meanwhile, I also realised that a better search might be "complete genome[Title] AND txid10239[orgn] NOT txid131567[orgn]". It returns ~35 thousand hits, a bit more than before. The point is that if you look for "complete[prop]" you exclude those hits that are described as complete genome in the title, but where the authors forgot to mention the property "completeness: complete".

    ReplyDelete
    Replies
    1. This neatly shows the trouble with inconsistent annotation. Consider this search appears to show over 5000 complete virus genomes missing the completeness property: "complete genome"[Title] AND txid10239[orgn] NOT complete[prop]

      Ideally all of those entries' metadata could be fixed...

      Delete