
Chapter 7  Accessing NCBI’s Entrez databases

Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system that provides users access to NCBI’s databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython’s Bio.Entrez module for programmatic access to Entrez. The latter allows you, for example, to search PubMed or download GenBank records from within a Python script.

The Bio.Entrez module makes use of the Entrez Programming Utilities, consisting of eight tools that are described in detail on NCBI’s page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each of these tools corresponds to one Python function in the Bio.Entrez module, as described in the sections below. This module makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI.
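To illustrate the kind of rate limiting described above, here is a minimal sketch in plain Python. It is not the code that Bio.Entrez itself uses; the class name RequestThrottle and its parameters are assumptions made up for this example, and the three-second default simply mirrors the limit quoted above.

```python
import time


class RequestThrottle(object):
    """Hypothetical helper: enforce a minimum delay between requests."""

    def __init__(self, min_interval=3.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the helper can be tested
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        # Remember when this request was allowed through
        self._last_request = now
```

A script would create one RequestThrottle and call its wait() method immediately before each request it sends.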

The output returned by the Entrez Programming Utilities is typically in XML format. To parse such output, you have several options:

  1. Use Bio.Entrez’s parser to parse the XML output into a Python object;
  2. Use the DOM (Document Object Model) parser in Python’s standard library;
  3. Use the SAX (Simple API for XML) parser in Python’s standard library;
  4. Read the XML output as raw text, and parse it by string searching and manipulation.

For the DOM and SAX parsers, see the Python documentation. The parser in Bio.Entrez is discussed below.
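As a rough sketch of options 2 to 4, the following code extracts database names from a small XML document using only Python’s standard library. The XML string is an abbreviated stand-in for real EInfo output, and the function and class names (names_with_dom, names_with_sax, names_with_regex, DbNameHandler) are invented for this example.

```python
import re
import xml.sax
from xml.dom import minidom

# Abbreviated stand-in for the XML returned by EInfo
XML = """<?xml version="1.0"?>
<eInfoResult>
<DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nucleotide</DbName>
</DbList>
</eInfoResult>
"""


def names_with_dom(xml_text):
    """Option 2: build the whole document tree, then query it."""
    document = minidom.parseString(xml_text)
    return [node.firstChild.data
            for node in document.getElementsByTagName("DbName")]


class DbNameHandler(xml.sax.ContentHandler):
    """Option 3: collect the text inside each DbName element as it streams by."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.names = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "DbName":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "DbName":
            self.names.append("".join(self._buffer))
            self._buffer = None


def names_with_sax(xml_text):
    handler = DbNameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def names_with_regex(xml_text):
    """Option 4: plain string searching, adequate only for very simple XML."""
    return re.findall(r"<DbName>([^<]+)</DbName>", xml_text)
```

All three functions return the same list of names for this input. The DOM version is the shortest, while the SAX version avoids holding the whole document in memory.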

For sequence databases, the Entrez Programming Utilities can also generate output in other formats (such as the Fasta and GenBank file format). This can then be parsed into a SeqRecord using Bio.SeqIO (see Chapter 4, and the example below).

7.1  Entrez Guidelines

Before using Biopython to access the NCBI’s online resources (via Bio.Entrez or some of the other modules), please read the NCBI’s Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!

To paraphrase:

For large queries, the NCBI also recommend using their session history feature (the WebEnv session cookie string). This is only slightly more complicated.

In conclusion, be sensible with your usage levels. If you plan to download lots of data, consider other options. For example, if you want easy access to all the human genes, consider fetching each chromosome by FTP as a GenBank file, and importing these into your own BioSQL database (see Section 9.5).

7.2  EInfo: Obtaining information about the Entrez databases

EInfo provides field index term counts, last update, and available links for each of NCBI’s databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:

>>> from Bio import Entrez
>>> handle = Entrez.einfo(email="A.N.Other@example.com")
>>> result = handle.read()

The variable result now contains a list of databases in XML format:

>>> print result
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nucleotide</DbName>
        <DbName>nuccore</DbName>
        <DbName>nucgss</DbName>
        <DbName>nucest</DbName>
        <DbName>structure</DbName>
        <DbName>genome</DbName>
        <DbName>books</DbName>
        <DbName>cancerchromosomes</DbName>
        <DbName>cdd</DbName>
        <DbName>gap</DbName>
        <DbName>domains</DbName>
        <DbName>gene</DbName>
        <DbName>genomeprj</DbName>
        <DbName>gensat</DbName>
        <DbName>geo</DbName>
        <DbName>gds</DbName>
        <DbName>homologene</DbName>
        <DbName>journals</DbName>
        <DbName>mesh</DbName>
        <DbName>ncbisearch</DbName>
        <DbName>nlmcatalog</DbName>
        <DbName>omia</DbName>
        <DbName>omim</DbName>
        <DbName>pmc</DbName>
        <DbName>popset</DbName>
        <DbName>probe</DbName>
        <DbName>proteinclusters</DbName>
        <DbName>pcassay</DbName>
        <DbName>pccompound</DbName>
        <DbName>pcsubstance</DbName>
        <DbName>snp</DbName>
        <DbName>taxonomy</DbName>
        <DbName>toolkit</DbName>
        <DbName>unigene</DbName>
        <DbName>unists</DbName>
</DbList>
</eInfoResult>

Since this is a fairly simple XML file, we could extract the information it contains simply by string searching. Using Bio.Entrez’s parser instead, we can directly parse this XML file into a Python object:

>>> from Bio import Entrez
>>> handle = Entrez.einfo(email="A.N.Other@example.com")
>>> record = Entrez.read(handle)

Now record is a dictionary with exactly one key:

>>> record.keys()
[u'DbList']

The value stored under this key is the list of database names shown in the XML above:

>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']

For each of these databases, we can use EInfo again to obtain more information:

>>> handle = Entrez.einfo(db="pubmed", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
'17989604'
>>> record["DbInfo"]["LastUpdate"]
'2008/05/24 06:45'

Try record["DbInfo"].keys() for other information stored in this record.
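Note that the values shown above are strings, so you may want to convert them before doing arithmetic or date comparisons. The following sketch uses a plain dictionary as a stand-in for record["DbInfo"], and assumes that LastUpdate always has the year/month/day hour:minute layout seen in the example.

```python
from datetime import datetime

# Stand-in for record["DbInfo"], filled with the values shown above
db_info = {
    "Description": "PubMed bibliographic record",
    "Count": "17989604",
    "LastUpdate": "2008/05/24 06:45",
}

# Convert the string values into a number and a datetime object
record_count = int(db_info["Count"])
last_update = datetime.strptime(db_info["LastUpdate"], "%Y/%m/%d %H:%M")
```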

7.3  ESearch: Searching the Entrez databases

To search any of these databases, we use Bio.Entrez.esearch(). For example, let’s search in PubMed for publications related to Biopython:

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="pubmed", term="biopython", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['16403221', '16377612', '14871861', '14630660', '12230038']

In this output, you see five PubMed IDs (16403221, 16377612, 14871861, 14630660, 12230038), which can be retrieved by EFetch (see section 7.6).

You can also use ESearch to search GenBank. Here we’ll do a quick search for the rpl16 gene in Opuntia:

>>> handle = Entrez.esearch(db="nucleotide",term="Opuntia and rpl16",
                            email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Count"]
'9'
>>> record["IdList"]
['57240072', '57240071', '6273287', '6273291', '6273290',
 '6273289', '6273286', '6273285', '6273284']

Each of the IDs (57240072, 57240071, 6273287...) is a GenBank identifier. See section 7.6 for information on how to actually download these GenBank records.

As a final example, let’s get a list of computational journal titles:

>>> handle = Entrez.esearch(db="journals", term="computational",
                            email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Count"]
'16'
>>> record["IdList"]
['30367', '33843', '33823', '32989', '33190', '33009', '31986',
 '34502', '8799', '22857', '32675', '20258', '33859', '32534',
 '32357', '32249']

Again, we could use EFetch to obtain more information for each of these journal IDs.

ESearch has many useful options; see the ESearch help page for more information.
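Options are passed to ESearch as URL parameters. To give an idea of what such a query looks like, here is a sketch that builds an ESearch URL with the standard library alone, without contacting NCBI. It is not the code Bio.Entrez uses internally. The function name build_esearch_url is hypothetical, and the base address is an assumption that does not come from this chapter.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

# Assumed base address of the Entrez Programming Utilities
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(db, term, email, **options):
    """Hypothetical helper: return an ESearch URL for the given query."""
    params = [("db", db), ("term", term), ("email", email)]
    # Sort the extra options so that the resulting URL is reproducible
    params.extend(sorted(options.items()))
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url("pubmed", "biopython", "A.N.Other@example.com",
                        retmax=5)
```

Here retmax limits the number of IDs returned; it is used again in the examples later in this chapter.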

7.4  EPost

EPost posts a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through Bio.Entrez.epost().

7.5  ESummary: Retrieving summaries from primary IDs

ESummary retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as Bio.Entrez.esummary(). Using the search result above, we can for example find out more about the journal with ID 30367:

>>> from Bio import Entrez
>>> handle = Entrez.esummary(db="journals", id="30367", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record[0]["Id"]
'30367'
>>> record[0]["Title"]
'Computational biology and chemistry'
>>> record[0]["Publisher"]
'Pergamon,'

7.6  EFetch: Downloading full records from Entrez

EFetch is what you use when you want to retrieve a full record from Entrez. For the Opuntia example above, we can download GenBank record 57240072 using Bio.Entrez.efetch:

>>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="genbank", 
                           email="A.N.Other@example.com")
>>> print handle.read()
LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
ACCESSION   AY851612
VERSION     AY851612.1  GI:57240072
KEYWORDS    .
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            Caryophyllales; Cactaceae; Opuntioideae; Austrocylindropuntia.
REFERENCE   1  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
  TITLE     Molecular Phylogenetics of the Leafy Cactus Genus Pereskia
            (Cactaceae)
  JOURNAL   Syst. Bot. 30 (4), 800-808 (2005)
REFERENCE   2  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-DEC-2004) Desert Botanical Garden, 1201 North Galvin
            Parkway, Phoenix, AZ 85008, USA
FEATURES             Location/Qualifiers
     source          1..892
                     /organism="Austrocylindropuntia subulata"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:106982"
     gene            <1..>892
                     /gene="rpl16"
     intron          <1..>892
                     /gene="rpl16"
ORIGIN      
        1 cattaaagaa gggggatgcg gataaatgga aaggcgaaag aaagaaaaaa atgaatctaa
       61 atgatatacg attccactat gtaaggtctt tgaatcatat cataaaagac aatgtaataa
      121 agcatgaata cagattcaca cataattatc tgatatgaat ctattcatag aaaaaagaaa
      181 aaagtaagag cctccggcca ataaagacta agagggttgg ctcaagaaca aagttcatta
      241 agagctccat tgtagaattc agacctaatc attaatcaag aagcgatggg aacgatgtaa
      301 tccatgaata cagaagattc aattgaaaaa gatcctaatg atcattggga aggatggcgg
      361 aacgaaccag agaccaattc atctattctg aaaagtgata aactaatcct ataaaactaa
      421 aatagatatt gaaagagtaa atattcgccc gcgaaaattc cttttttatt aaattgctca
      481 tattttattt tagcaatgca atctaataaa atatatctat acaaaaaaat atagacaaac
      541 tatatatata taatatattt caaatttcct tatataccca aatataaaaa tatctaataa
      601 attagatgaa tatcaaagaa tctattgatt tagtgtatta ttaaatgtat atcttaattc
      661 aatattatta ttctattcat ttttattcat tttcaaattt ataatatatt aatctatata
      721 ttaatttata attctattct aattcgaatt caatttttaa atattcatat tcaattaaaa
      781 ttgaaatttt ttcattcgcg aggagccgga tgagaagaaa ctctcatgtc cggttctgta
      841 gtagagatgg aattaagaaa aaaccatcaa ctataacccc aagagaacca ga
//

The argument rettype="genbank" lets us download this record in the GenBank format. Alternatively, you could for example use rettype="fasta" to get the FASTA format; see the EFetch Help page for other options. The available formats depend on which database you are downloading from.

If you fetch the record in one of the formats accepted by Bio.SeqIO (see Chapter 4), you can directly parse it into a SeqRecord:

>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="57240072",rettype="genbank",
                           email="A.N.Other@example.com")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
ID: AY851612.1
Name: AY851612
Description: Opuntia subulata rpl16 gene, intron; chloroplast.
/sequence_version=1
/source=chloroplast Austrocylindropuntia subulata
....

By default you get the output in XML format, which you can parse using the Bio.Entrez.read() function:

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", id="57240072",
                           email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record[0]["GBSeq_definition"]
'Opuntia subulata rpl16 gene, intron; chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Austrocylindropuntia subulata'
....

7.7  ELink

For help on ELink, see the ELink help page. ELink is available from Biopython through Bio.Entrez.elink().

7.8  EGQuery: Obtaining counts for search terms

EGQuery provides counts for a search term in each of the Entrez databases. This is particularly useful to find out how many items a search will return before actually performing the search with ESearch (see the example in 7.10.1 below).

In this example, we use Bio.Entrez.egquery() to obtain the counts for “Biopython”:

>>> handle = Entrez.egquery(term="biopython",
                            email="A.N.Other@example.com") 
>>> record = Entrez.read(handle)
>>> record["eGQueryResult"][0]["DbName"]
'pubmed'
>>> record["eGQueryResult"][0]["Count"]
'5'

See the EGQuery help page for more information.
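If you want the counts for all databases at once, it can be convenient to turn the rows into a dictionary. The sketch below works on a plain list of dictionaries standing in for record["eGQueryResult"]. Only the pubmed row is taken from the output above; the other rows are made-up illustrative values. The helper name counts_by_database is hypothetical, and the code assumes that every Count is a numeric string.

```python
# Stand-in for record["eGQueryResult"]; only the pubmed row comes from the
# example above, the remaining rows are invented for illustration.
rows = [
    {"DbName": "pubmed", "Count": "5"},
    {"DbName": "nuccore", "Count": "0"},
    {"DbName": "protein", "Count": "0"},
]


def counts_by_database(rows):
    """Map each database name to its hit count as an integer."""
    return dict((row["DbName"], int(row["Count"])) for row in rows)


counts = counts_by_database(rows)
# Names of the databases in which the search term was found at least once
databases_with_hits = sorted(name for name, n in counts.items() if n > 0)
```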

7.9  ESpell: Obtaining spelling suggestions

ESpell retrieves spelling suggestions. In this example, we use Bio.Entrez.espell() to obtain the correct spelling of Biopython:

>>> from Bio import Entrez
>>> handle = Entrez.espell(term="biopythooon", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Query"]
'biopythooon'
>>> record["CorrectedQuery"]
'biopython'

See the ESpell help page for more information.

7.10  Examples

7.10.1  Searching and downloading Entrez Nucleotide records

Here we’ll show a simple example of performing a remote Entrez query. In section 2.3 of the parsing examples, we talked about using NCBI’s Entrez website to search the NCBI nucleotide databases for info on Cypripedioideae, our friends the lady slipper orchids. Now, we’ll look at how to automate that process using a Python script. In this example, we’ll just show how to connect, get the results, and parse them, with the Entrez module doing all of the work.

First, we use EGQuery to find out the number of results we will get before actually downloading them:

>>> from Bio import Entrez
>>> handle = Entrez.egquery(term='Cypripedioideae', email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> for row in record['eGQueryResult']:
...     if row['DbName']=='nuccore':
...         print row['Count']
814

So, we expect to find 814 Entrez Nucleotide records. If you find some ridiculously high number of hits, you may want to reconsider if you really want to download all of them, which is our next step:

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db='nucleotide', term='Cypripedioideae', retmax=814,
                            email="A.N.Other@example.com")
>>> record = Entrez.read(handle)

Here, record is a Python dictionary containing the search results and some auxiliary information. Just for information, let’s look at what is stored in this dictionary:

>>> print record.keys()
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart', u'QueryTranslation']

First, let’s check how many results were found:

>>> print record['Count']
814

which is the number we expected. The 814 results are stored in record['IdList']:

>>> print len(record['IdList'])
814

Let’s look at the first five results:

>>> print record['IdList'][:5]
['187237168', '187372713', '187372690', '187372688', '187372686']

We can download these records using efetch. While you could download these records one by one, it is better to fetch several records in a single request, as shown below, to reduce the load on NCBI’s servers. However, in this situation you should ideally be using the history feature described later in Section 7.10.3.

>>> idlist = ",".join(record['IdList'][:5])
>>> print idlist
187237168,187372713,187372690,187372688,187372686
>>> handle = Entrez.efetch(db='nucleotide', id=idlist, retmode='xml',
                           email="A.N.Other@example.com")
>>> records = Entrez.read(handle)
>>> print len(records)
5

Each of these records corresponds to one GenBank record.

>>> print records[0].keys()
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
 u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version',
 u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table',
 u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division',
 u'GBSeq_taxonomy', u'GBSeq_references', u'GBSeq_update-date',
 u'GBSeq_organism', u'GBSeq_locus', u'GBSeq_strandedness']

>>> print records[0]['GBSeq_primary-accession']
DQ110336

>>> print records[0]['GBSeq_other-seqids']
['gb|DQ110336.1|', 'gi|187237168']

>>> print records[0]['GBSeq_definition']
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, partial cds;
mitochondrial

>>> print records[0]['GBSeq_organism']
Cypripedium calceolus

You could use this to quickly set up searches – but for heavy usage, see Section 7.10.3.
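When fetching records by ID in groups, as done above for the first five results, you need one comma-separated string of IDs per request. A small sketch of how a longer ID list could be cut into such strings follows; the function name id_batches is invented for this example, and the batch size of two is chosen only to keep the output short.

```python
def id_batches(id_list, batch_size):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(id_list), batch_size):
        yield ",".join(id_list[start:start + batch_size])


# The first five IDs from the search above
ids = ['187237168', '187372713', '187372690', '187372688', '187372686']
batches = list(id_batches(ids, 2))
```

Each string in batches could then be passed as the id argument of a separate efetch call.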

7.10.2  Finding the lineage of an organism

Staying with the same organism, let’s now find its lineage. First, we search the Taxonomy database for Cypripedioideae. We find exactly one NCBI taxonomy identifier:

>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae",
                            email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'

Now, we use efetch to download this entry in the Taxonomy database and to parse it:

>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode='xml',
                           email="A.N.Other@example.com")
>>> records = Entrez.read(handle)

Again, this record stores lots of information:

>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
 u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
 u'MitoGeneticCode', u'UpdateDate']

We can get the lineage directly from this record:

>>> records[0]['Lineage']
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
 Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
 Liliopsida; Asparagales; Orchidaceae'
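The lineage is a single string in which the taxa are separated by semicolons. If you would rather work with a list, you can split it yourself, as in this short sketch (the string is copied from the output above):

```python
lineage = ("cellular organisms; Eukaryota; Viridiplantae; Streptophyta; "
           "Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; "
           "Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae")

# Split on the separator and remove the surrounding spaces
taxa = [name.strip() for name in lineage.split(";")]
```

The list runs from the most general group (taxa[0]) to the most specific one (taxa[-1]).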

7.10.3  Using the history and WebEnv

Often you will want to make a series of linked queries. Most typically, this means running a search, perhaps refining it, and then retrieving detailed search results. You can do this by making a series of separate calls to Entrez. However, the NCBI prefer you to take advantage of their history support.

For example, suppose we want to search and download all Opuntia rpl16 nucleotide sequences, and store them in a FASTA file. We could naively combine the example code for Bio.Entrez.esearch() (Section 7.3) to get a list of GI numbers, and then repeatedly call Bio.Entrez.efetch() (Section 7.6) to download them all. You could reduce the number of queries by asking for the records in batches (see Section 7.10.1). That would probably be better, but is still not what the NCBI encourage.

The approved approach is to run the search with the history feature. Then we can fetch the results by reference to the search results, which the NCBI can anticipate and cache.

from Bio import Entrez

# Run the search with the history feature switched on
search_handle = Entrez.esearch(db="nucleotide", term="Opuntia and rpl16",
                               usehistory="y", email="history.user@example.com")
search_results = Entrez.read(search_handle)
search_handle.close()

gi_list = search_results["IdList"]
count = int(search_results["Count"])
# This only holds while the search returns all of its IDs in one response
assert count == len(gi_list)

# Values that refer to these search results on the NCBI's servers
session_cookie = search_results["WebEnv"]
query_key = search_results["QueryKey"]

Because we asked to use the history feature, the XML search results include, in addition to the GI numbers of the sequences found, WebEnv and QueryKey values that refer to these search results. Having stored these values in the variables session_cookie and query_key, we can pass them as parameters to Bio.Entrez.efetch() instead of giving the GI numbers as identifiers.

While for small searches you might be OK downloading everything at once, it’s better to download in batches. Use the retstart and retmax parameters to specify which range of search results you want returned (the starting entry, using zero-based counting, and the maximum number of results to return). For example,

batch_size = 3
out_handle = open("opuntia_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                 retstart=start, retmax=batch_size,
                                 webenv=session_cookie, query_key=query_key,
                                 email="history.user@example.com")
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
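To see which records each pass of such a loop covers, the range arithmetic can be tried out on its own, without any network access. In this sketch the function name batch_ranges is invented for illustration; the start value is zero-based, as retstart is, and the end value is exclusive.

```python
def batch_ranges(count, batch_size):
    """Return (start, end) pairs covering count results in batches."""
    ranges = []
    for start in range(0, count, batch_size):
        # The last batch may be shorter than batch_size
        end = min(count, start + batch_size)
        ranges.append((start, end))
    return ranges


# Nine results, as found for the Opuntia rpl16 search in Section 7.3
nine_in_threes = batch_ranges(9, 3)
# A count that is not a multiple of the batch size
ten_in_fours = batch_ranges(10, 4)
```

Each start value would be passed as retstart, with batch_size as retmax.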

And finally, don’t forget to include your own email address in the Entrez calls.

