Previous Up Next

Chapter 4  Cookbook -- Cool things to do with it

4.1  SWISS-PROT

4.1.1  Retrieving a SWISS-PROT record

SwissProt (http://www.expasy.org/sprot/sprot-top.html) is a hand-curated database of protein sequences. Let's look at how we connect with it from a script and parse SwissProt formatted results.

First, we need to retrieve some results to parse. Let's say we are looking at chalcone synthases for Orchids (see section 2.3 for some justification for looking for interesting things about orchids). Chalcone synthase is involved in flavanoid biosynthesis in plants, and flavanoids make lots of cool things like pigment colors and UV protectants.

If you do a search on SwissProt, you can find three orchid proteins for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let's write a script which grabs these, and parses out some interesting information.

First, we grab the records, using the get_sprot_raw() function of Bio.WWW.ExPASy. This function is very nice since you can feed it an id and get back a raw text record (no html to mess with!). As we get the three records we are interested in, we'll just put them together into one big string, which we'll then use for parsing. The following code accomplishes what I just wrote:
from Bio.WWW import ExPASy

ids = ["O23729", "O23730", "O23731"]

all_results = ""
for eid in ids:
    results = ExPASy.get_sprot_raw(eid)
    all_results = all_results + results.read()
Now that we've got the results, we are ready to parse them into interesting information. As with most parsers, we set up an iterator and parser. The parser we use here parses the SwissProt files into a Record object where the interesting features are attributes:
from Bio.SwissProt import SProt
from Bio import File

s_parser = SProt.RecordParser()
s_iterator = SProt.Iterator(File.StringHandle(all_results), s_parser)
Note that we convert all_results, which is a string, into a handle before passing it. The iterator requires a handle to be passed so that it can read in everything one line at a time. The Bio.File module has a nice StringHandle, which conveniently will convert a string into a handle. Very nice! Now we are ready to start extracting information.

To get out the information, we'll go through everything record by record using the iterator. For each record, we'll just print out some summary information:
while 1:
    cur_record = s_iterator.next()

    if cur_record is None:
        break

    print "description:", cur_record.description
    for ref in cur_record.references:
        print "authors:", ref.authors
        print "title:", ref.title

    print "classification:", cur_record.organism_classification
    print
This prints out a summary like the following:
description: CHALCONE SYNTHASE 8 (EC 2.3.1.74) (NARINGENIN-CHALCONE SYNTHASE 8)
authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.;
title: "Molecular cloning and sequence analysis of chalcone synthase cDNAs of
Bromheadia finlaysoniana.";
classification: ['Eukaryota', 'Viridiplantae', 'Embryophyta', 'Tracheophyta', 
'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 
'Bromheadia']
Since Python 2.2, we can simplify the loop:
for cur_record in s_iterator:
    print "description:", cur_record.description
    for ref in cur_record.references:
        print "authors:", ref.authors
        print "title:", ref.title

    print "classification:", cur_record.organism_classification
    print
It is equally easy to extract any kind of information you'd like from SwissProt records.

Now, you may remark that I knew the records' accession numbers beforehand. Indeed, get_sprot_raw() needs either the entry name or an accession number. When you don't have them handy, you can use one of the sprot_search_de() or sprot_search_ful() functions.

sprot_search_de() searches in the ID, DE, GN, OS and OG lines; sprot_search_ful() searches in (nearly) all the fields. They are detailed on http://www.expasy.org/cgi-bin/sprot-search-de and http://www.expasy.org/cgi-bin/sprot-search-ful respectively. Note that thet don't search in TrEMBL by default (argument trembl). Note also that they return html pages; however, accession numbers are quite easily extractable:
from Bio.WWW import ExPASy
import re

handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase")
# or:
# handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}")
html_results = handle.read()
if "Number of sequences found" in html_results:
    ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
else:
    ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)

4.2  PubMed

4.2.1  Sending a query to PubMed

If you are in the Medical field or interested in human issues (and many times even if you are not!), PubMed (http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all kinds of goodies. So like other things, we'd like to be able to grab information from it and use it in python scripts.

Querying PubMed using Biopython is extremely painless. To get all of the article ids for articles having to do with orchids (see section 2.3 for our motivation), we only need the following three lines of code:
from Bio import PubMed

search_term = 'orchid'
orchid_ids = PubMed.search_for(search_term)
This returns a python list containing all of the orchid ids
['11070358', '11064040', '11028023', '10947239', '10938351', '10936520', 
'10905611', '10899814', '10856762', '10854740', '10758893', '10716342', 
...
With this list of ids we are ready to start retrieving the records, so follow on ahead to the next section.

4.2.2  Retrieving a PubMed record

The previous section described how to get a bunch of article ids. Now that we've got them, we obviously want to get the corresponding Medline records and extract the information from them.

The interface for retrieving records from PubMed should be very intuitive to python programmers -- it models a python dictionary. To set up this interface, we need to set up a parser that will parse the results that we retrieve. The following lines of code get everything set up:
from Bio import PubMed
from Bio import Medline

rec_parser = Medline.RecordParser()
medline_dict = PubMed.Dictionary(parser = rec_parser)
What we've done is create a dictionary like object medline_dict. To get an article we access it like medline_dict[id_to_get]. What this does is connect with PubMed, get the article you ask for, parse it into a record object, and return it. Very cool!

Now let's look at how to use this nice dictionary to print out some information about some ids. We just need to loop through our ids (orchid_ids from the previous section) and print out the information we are interested in:
for oid in orchid_ids[0:5]:
    cur_record = medline_dict[oid]
    print 'title:', string.rstrip(cur_record.title)
    print 'authors:', cur_record.authors
    print 'source:', string.strip(cur_record.source)
    print
The output for this looks like:
title: Sex pheromone mimicry in the early spider orchid (ophrys sphegodes):
patterns of hydrocarbons as the key mechanism for pollination by sexual
deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C', 'Hansson BS', 
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is returned as a standard python list. This makes it easy to manipulate and search using standard python tools. For instance, we could loop through a whole bunch of entries searching for a particular author with code like the following:
search_author = 'Waits T'

for our_id in our_id_list:
    cur_record = medline_dict[our_id]
    
    if search_author in cur_record.authors:
        print "Author %s found: %s" % (search_author,
                                       cur_record.source.strip())
The PubMed and Medline interfaces are very mature and nice to work with -- hopefully this section gave you an idea of the power of the interfaces and how they can be used.

4.3  GenBank

The GenBank record format is a very popular method of holding information about sequences, sequence features, and other associated sequence information. The format is a good way to get information from the NCBI databases at http://www.ncbi.nlm.nih.gov/.

4.3.1  Retrieving GenBank entries from NCBI

One very nice feature of the GenBank libraries is the ability to automate retrieval of entries from GenBank. This is very convenient for creating scripts that automate a lot of your daily work. In this example we'll show how to query the NCBI databases, and to retrieve the records from the query.

First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for Opuntia, my favorite organism (since I work on it). We can do quick search and get back the GIs (GenBank identifiers) for all of the corresponding records:
from Bio import GenBank

gi_list = GenBank.search_for("Opuntia AND rpl16")


gi_list will be a list of all of the GenBank identifiers that match our query:
['6273291', '6273290', '6273289', '6273287', '6273286', '6273285', '6273284']
Now that we've got the GIs, we can use these to access the NCBI database through a dictionary interface. For instance, to retrieve the information for the first GI, we'll first have to create a dictionary that accesses NCBI:
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
Now that we've got this, we do the retrieval:
gb_record = ncbi_dict[gi_list[0]]
In this case, gb_record will be GenBank formatted record:
LOCUS       AF191665      902 bp    DNA             PLN       07-NOV-1999
DEFINITION  Opuntia marenae rpl16 gene; chloroplast gene for chloroplast
            product, partial intron sequence.
ACCESSION   AF191665
VERSION     AF191665.1  GI:6273291
...
In this case, we are just getting the raw records. We can also pass these records directly into a parser and return the parsed record. For instance, if we wanted to get back SeqRecord objects with the GenBank file parsed into SeqFeature objects we would need to create the dictionary with the GenBank FeatureParser:
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank",
                                   parser = record_parser)
Now retrieving a record will give you a SeqRecord object instead of the raw record:
>>> gb_seqrecord = ncbi_dict[gi_list[0]]
>>> print gb_seqrecord
<Bio.SeqRecord.SeqRecord instance at 0x102f9404>
For more information of formats you can parse GenBank records into, please see section 4.3.2.

Using these automated query retrieval functionality is a big plus over doing things by hand. Additionally, the retrieval has nice built in features like a time-delay, which will prevent NCBI from getting mad at you and blocking your access.

4.3.2  Parsing GenBank records

While GenBank files are nice and have lots of information, at the same time you probably only want to extract a small amount of that information at a time. The key to doing this is parsing out the information. Biopython provides GenBank parsers which help you accomplish this task. Right now the GenBank module provides the following parsers:
  1. RecordParser -- This parses the raw record into a GenBank specific Record object. This object models the information in a raw record very closely, so this is good to use if you are just interested in GenBank records themselves.

  2. FeatureParser -- This parses the raw record in a SeqRecord object with all of the feature table information represented in SeqFeatures (see section 4.6 for more info on these objects). This is best to use if you are interested in getting things in a more standard format.
Depending on the type of GenBank files you are interested in, they will either contain a single record, or multiple records. Each record will start with a LOCUS line, various other header lines, a list of features, and finally the sequence data, ending with a // line.

Dealing with a GenBank file containing a single record is very easy. For example, let's use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can be downloaded from the NCBI here (only 1.15 MB):
from Bio import GenBank

feature_parser = GenBank.FeatureParser()

gb_file = "AE017199.gbk"

gb_record = feature_parser.parse(open(gb_file,"r"))

# now do something with the record
print "Name %s, %i features" % (gb_record.name, len(gb_record.features))
print gb_record.seq
This gives the following output:
Name AE017199, 1107 features
Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGAACCCAA ...', IUPACAmbiguousDNA())

4.3.3  Iterating over GenBank records

For multi-record GenBank files, the most common usage will be creating an iterator, and parsing through the file record by record. Doing this is very similar to how things are done in other formats, as the following code demonstrates, using an example file cor6_6.gb which is included in the BioPython source code under the Tests/GenBank/ directory:
from Bio import GenBank

gb_file = "cor6_6.gb"
gb_handle = open(gb_file, "r")

feature_parser = GenBank.FeatureParser()

gb_iterator = GenBank.Iterator(gb_handle, feature_parser)

while True:
   cur_record = gb_iterator.next()

   if cur_record is None:
       break

   # now do something with the record
   print "Name %s, %i features" % (cur_record.name, len(cur_record.features))
   print cur_record.seq
This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out the Seq objects representing the sequences in the record.

As with other formats, you have lots of tools for dealing with GenBank records. This should make it possible to do whatever you need to with GenBank.

4.3.4  Making your very own GenBank database

One very cool thing that you can do is set up your own personal GenBank database and access it like a dictionary (this can be extra cool because you can also allow access to these local databases over a network using BioCorba -- see the BioCorba documentation for more information).

Note - this is only worth doing if your GenBank file contains more than one record.

Making a local database first involves creating an index file, which will allow quick access to any record in the file. To do this, we use the index file function. Again, this example uses the file cor6_6.gb which is included in the BioPython source code under the Tests/GenBank/ directory:
>>> from Bio import GenBank
>>> dict_file = 'cor6_6.gb'
>>> index_file = 'cor6_6.idx'
>>> GenBank.index_file(dict_file, index_file)
This will create a directory called cor6_6.idx containing the index files. Now, we can use this index to create a dictionary object that allows individual access to every record. Like the Iterator and NCBIDictionary interfaces, we can either get back raw records, or we can pass the dictionary a parser that will parse the records before returning them. In this case, we pass a FeatureParser so that when we get a record, then we retrieve a SeqRecord object.

Setting up the dictionary is as easy as one line:
>>> gb_dict = GenBank.Dictionary(index_file, GenBank.FeatureParser())
Now we can deal with this like a dictionary. For instance:
>>> len(gb_dict)
6
>>> gb_dict.keys()
['L31939', 'AJ237582', 'X62281', 'AF297471', 'M81224', 'X55053']
Finally, we retrieve objects using subscripting:
>>> gb_dict['AJ237582']
<Bio.SeqRecord.SeqRecord instance at 0x102fdd8c>
>>> print len(gb_dict['X55053'].features)
3

4.4  Dealing with alignments

It is often very useful to be able to align particular sequences. I do this quite often to get a quick and dirty idea of relationships between sequences. Consequently, it is very nice to be able to quickly write up a python script that does an alignment and gives you back objects that are easy to work with. The alignment related code in Biopython is meant to allow python-level access to alignment programs so that you can run alignments quickly from within scripts.

4.4.1  Clustalw

Clustalx (http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html) is a very nice program for doing multiple alignments. Biopython offers access to alignments in clustal format (these normally have a *.aln extension) that are produced by Clustalx. It also offers access to clustalw, which the is command line version of clustalx.

We'll need some sequences to align, such as opuntia.fasta (also available online here) which is a small FASTA file containing seven orchid gene DNA sequences, which you can also from Doc/examples/ in the Biopython source distribution.

The first step in interacting with clustalw is to set up a command line you want to pass to the program. Clustalw has a ton of command line options, and if you set a lot of parameters, you can end up typing in a huge ol' command line quite a bit. This command line class models the command line by making all of the options be attributes of the class that can be set. A few convenience functions also exist to set certain parameters, so that some error checking on the parameters can be done.

To create a command line object to do a clustalw multiple alignment we do the following:
import os
from Bio.Clustalw import MultipleAlignCL

cline = MultipleAlignCL(os.path.join(os.curdir, "opuntia.fasta"))
cline.set_output("test.aln")
First we import the MultipleAlignCL object, which models running a multiple alignment from clustalw. We then initialize the command line, with a single argument of the fasta file that we are going to be using for the alignment. The initialization function also takes an optional second argument which specifies the location of the clustalw executable. By default, the commandline will just be invoked with 'clustalw,' assuming that you've got it somewhere on your PATH.

The second argument sets the output to go to the file test.aln. The MultipleAlignCL object also has numerous other parameters to specify things like output format, gap costs, etc.

We can look at the command line we have generated by invoking the __str__ member attribute of the MultipleAlignCL class. This is done by calling str(cline) or simple by printing out the command line with print cline. In this case, doing this would give the following output:
clustalw ./opuntia.fasta -OUTFILE=test.aln
Now that we've set up a simple command line, we now want to run the commandline and collect the results so we can deal with them. This can be done using the do_alignment function of Clustalw as follows:
from Bio import Clustalw

alignment = Clustalw.do_alignment(cline)
What happens when you run this if that Biopython executes your command line and runs clustalw with the given parameters. It then grabs the output, and if it is in a format that Biopython can parse (currently only clustal format), then it will parse the results and return them as an alignment object of the appropriate type. So in this case since we are getting results in the default clustal format, the returned alignment object will be a ClustalAlignment type.

Once we've got this alignment, we can do some interesting things with it such as get seq_record objects for all of the sequences involved in the alignment:
all_records = alignment.get_all_seqs()

print 'description:', all_records[0].description
print 'sequence:', all_records[0].seq
This prints out the description and sequence object for the first sequence in the alignment:
description: gi|6273285|gb|AF191659.1|AF191
sequence: Seq('TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT 
...', IUPACAmbiguousDNA())
You can also calculate the maximum length of the alignment with:
length = alignment.get_alignment_length()
Finally, to write out the alignment object in the original format, we just need to access the __str__ function. So doing a print alignment gives:
CLUSTAL X (1.81) multiple sequence alignment


gi|6273285|gb|AF191659.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
gi|6273284|gb|AF191658.1|AF191      TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA
...
This makes it easy to write your alignment back into a file with all of the original info intact.

If you want to do more interesting things with an alignment, the best thing to do is to pass the alignment to an alignment information generating object, such as the SummaryInfo object, described in section 4.4.2.

4.4.2  Calculating summary information

Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment.

Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called alignment. All we need to do to get an object that will calculate summary information is:
from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)
The summary_align object is very useful, and will do the following neat things for you:
  1. Calculate a quick consensus sequence -- see section 4.4.3
  2. Get a position specific score matrix for the alignment -- see section 4.4.4
  3. Calculate the information content for the alignment -- see section 4.4.5
  4. Generate information on substitutions in the alignment -- section 4.5 details using this to generate a substitution matrix.

4.4.3  Calculating a quick consensus sequence

The SummaryInfo object, described in section 4.4.2, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a SummaryInfo object called summary_align we can calculate a consensus by doing:
consensus = summary_align.dumb_consensus()
As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value (the default is .3) will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus would give:
consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT 
...', IUPACAmbiguousDNA())
You can adjust how dumb_consensus works by passing optional parameters:
the threshold
This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is .7.

the ambiguous character
This is the ambiguity character to use. The default is 'N'.

the consensus alphabet
This is the alphabet to use for the consensus sequence. If an alphabet is not specified than we will try to guess the alphabet based on the alphabets of the sequences in the alignment.

4.4.4  Position Specific Score Matrices

Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment:
GTATC
AT--C
CTGTC
the PSSM for this alignment is:
      G A T C
    G 1 1 0 1
    T 0 0 3 0
    A 1 1 0 0
    T 0 0 2 0
    C 0 0 0 3
Let's assume we've got an alignment object called c_align. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:
summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()
Now, we want to make the PSSM, but ignore any N ambiguity residues when calculating this:
my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                  chars_to_ignore = ['N'])
Two notes should be made about this:
  1. To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.

  2. The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do:
    second_seq = alignment.get_seq_by_num(1)
    my_pssm = summary_align.pos_specific_score_matrix(second_seq
                                                      chars_to_ignore = ['N'])
    
The command above returns a PSSM object. To print out the PSSM as we showed above, we simply need to do a print my_pssm, which gives:
    A   C   G   T
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
C  0.0 7.0 0.0 0.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
T  1.0 0.0 0.0 6.0
...
You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]. For instance, to get the counts for the 'A' residue in the second element of the above PSSM you would do:
>>> print my_pssm[1]['A']
7.0
The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.

4.4.5  Information Content

A potentially useful measure of evolutionary conservation is the information ceontent of a sequence.

A useful introduction to information theory targetted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:
ICj =
Na
å
i=1
Pij * log(
Pij
Qi
)

where: Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment.

First, we need to use our alignment to get a alignment summary object, which we'll assume is called summary_align (see section 4.4.2) for instructions on how to get this. Once we've got this object, calculating the information content for a region is as easy as:
info_content = summary_align.information_content(5, 30, 
                                                 chars_to_ignore = ['N'])
Wow, that was much easier then the formula above made it look! The variable info_content now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue 'N' when calculating the information content, since this value is not included in our alphabet (so we shouldn't be interested in looking at it!).

As mentioned above, we can also calculate relative information content by supplying the expected frequencies:
expect_freq = {
    'A' : .3,
    'G' : .2,
    'T' : .3,
    'C' : .2}
The expected should not be passed as a raw dictionary, but instead by passed as a SubsMat.FreqTable object (see section 5.4.2 for more information about FreqTables). The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.

To create a FreqTable object, from the frequency dictionary you just need to do:
from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable

e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)
Now that we've got that, calculating the relative information content for our region of the alignment is as simple as:
info_content = summary_align.information_content(5, 30,
                                                 e_freq_table = e_freq_table,
                                                 chars_to_ignore = ['N'])
Now, info_content will contain the relative information content over the region in relation to the expected frequencies.

The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base as the base you want:
info_content = summary_align.information_content(5, 30, log_base = 10
                                                 chars_to_ignore = ['N'])
Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function!

4.4.6  Translating between Alignment formats

One thing that you always end up having to do is convert between different formats. Biopython does this using a FormatConverter class for alignment objects. First, let's say we have just parsed an alignment from clustal format into a ClustalAlignment object:
import os
from Bio import Clustalw

alignment = Clustalw.parse_file(os.path.join(os.curdir, "test.aln"))
Now, let's convert this alignment into FASTA format. First, we create a converter object:
from Bio.Align.FormatConvert import FormatConverter

converter = FormatConverter(alignment)
We pass the converter the alignment that we want to convert. Now, to get this in FASTA alignment format, we simply do the following:
fasta_align = converter.to_fasta()
Looking at the newly created fasta_align object using print fasta_align gives:
>gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAATATATA----
------ATATATTTCAAATTTCCTTATATACCCAAATATAAAAATATCTAATAAATTAGA
...
The conversion process will, of course, lose information specific to a particular alignment format. Howerver, most of the basic information about the alignment will be retained.

As more formats are added the converter will be beefed up to read and write all of these different formats.

4.5  Substitution Matrices

Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book ``Biological Sequence Analysis'' by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.

Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.

4.5.1  Using common substitution matrices

4.5.2  Creating your own substitution matrix from an alignment

A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done with protein alignments. In this example, we'll first get a biopython alignment object and then get a summary object to calculate info about the alignment. The file containing protein.aln (also available online here) contains the Clustalw alignment output.
from Bio import Clustalw
from Bio.Alphabet import IUPAC
from Bio.Align import AlignInfo

# get an alignment object from a Clustalw alignment output
c_align = Clustalw.parse_file("protein.aln", IUPAC.protein)
summary_align = AlignInfo.SummaryInfo(c_align)
Sections 4.4.1 and 4.4.2 contain more information on doing this.

Now that we've got our summary_align object, we want to use it to find out the number of times different residues substitute for each other. To make the example more readable, we'll focus on only amino acids with polar charged side chains. Luckily, this can be done easily when generating a replacement dictionary, by passing in all of the characters that should be ignored. Thus we'll create a dictionary of replacements for only charged polar amino acids using:
replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
                                                     "M", "P", "F", "W", "S",
                                                     "T", "N", "Q", "Y", "C"])
This information about amino acid replacements is represented as a python dictionary which will look something like:
{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0, 
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0, 
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0, 
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0, 
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0, 
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0, 
('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement Matrix (ARM):
from Bio import SubsMat
my_arm = SubsMat.SeqMat(replace_info)
With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i. e. a standard type Substitution Matrix):
my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following optional arguments: Once you've got your log odds matrix, you can display it prettily using the function print_mat. Doing this on our created matrix gives:
>>> my_lom.print_mat()
D   6
E  -5   5
H -15 -13  10
K -31 -15 -13   6
R -13 -25 -14  -7   7
   D   E   H   K   R
Very nice. Now we've got our very own substitution matrix to play with!

4.6  More Advanced Sequence Classes -- Sequence IDs and Features

You read all about the basic Biopython sequence class in section 2.2, which described how to do many useful things with just the sequence. Howerver, many times sequences have important additional properties associated with them. This section described how Biopython handles these higher level descriptions of a sequence.

4.6.1  Sequence ids and Descriptions -- dealing with SeqRecords

Immediately above the Sequence class is the SeqRecord class, defined in the Bio.SeqRecord module. This class allows higher level features such as ids and features to be associated with the sequence. The class itself is very simple, and offers the following information as attributes:
seq
-- The sequence itself -- A Bio.Seq object

id
-- The primary id used to identify the sequence. In most cases this is something like an accession number.

name
-- A ``common'' name/id for the sequence. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analagous to the LOCUS id in a GenBank record.

description
-- A human readible description or expressive name for the sequence. This is similar to what follows the id information in a FASTA formatted entry.

annotations
-- A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more ``unstructed'' information to the sequence.

features
-- A list of SeqFeature objects with more structured information about the features on a sequence. The structure of SeqFeatures is described below in section 4.6.2.
Using a SeqRecord class is not very complicated, since all of the information is stored as attributes of the class. Initializing the class just involves passing a Seq object to the SeqRecord:
>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)
Additionally, you can also pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = 'AC12345'
>>> simple_seq_r.description = 'My little made up sequence I wish I could 
write a paper about and submit to GenBank'
>>> print simple_seq_r.description
My little made up sequence I wish I could write a paper about and submit 
to GenBank
>>> simple_seq_r.seq
Seq('GATC', Alphabet())
Adding annotations is almost as easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations['evidence'] = 'None. I just made it up.'
>>> print simple_seq_r.annotations
{'evidence': 'None. I just made it up.'}
That's just about all there is to it! Next, you may want to learn about SeqFeatures, which offer an additional structured way to represent information about a sequence.

4.6.2  Features and Annotations -- SeqFeatures

Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more ``abstract'' information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you'll probably have an easier time grasping the structure of the Biopython classes.

4.6.2.1  SeqFeatures themselves

The first level of dealing with Sequence features is the SeqFeature class itself. This class has a number of attributes, so first we'll list them and there general features, and then work through an example to show how this applies to a real life example, a GenBank feature table. The attributes of a SeqFeature are:
location
-- The location of the SeqFeature on the sequence that you are dealing with. The locations end-points may be fuzzy -- section 4.6.2.2 has a lot more description on how to deal with descriptions.

type
-- This is a textual description of the type of feature (for instance, this will be something like 'CDS' or 'gene').

ref
-- A reference to a different sequence. Some times features may be ``on'' a particular sequence, but may need to refer to a different sequence, and this provides the reference (normally an accession number). A good example of this is a genomic sequence that has most of a coding sequence, but one of the exons is on a different accession. In this case, the feature would need to refer to this different accession for this missing exon.

ref_db
-- This works along with ref to provide a cross sequence reference. If there is a reference, ref_db will be set as None if the reference is in the same database, and will be set to the name of the database otherwise.

strand
-- The strand on the sequence that the feature is located on. This may either be '1' for the top strand, '-1' for the bottom strand, or '0' for both strands (or if it doesn't matter). Keep in mind that this only really makes sense for double stranded DNA, and not for proteins or RNA.

qualifiers
-- This is a python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be ``evidence'' and the value might be ``computational (non-experimental).'' This is just a way to let the person who is looking at the feature know that it has not be experimentally (i. e. in a wet lab) confirmed.

sub_features
-- A very important feature of a feature is that it can have additional sub_features underneath it. This allows nesting of features, and helps us to deal with things such as the GenBank/EMBL feature lines in a (we hope) intuitive way.
To show an example of SeqFeatures in action, let's take a look at the following feature from a GenBank feature table:
     mRNA            complement(join(<49223..49300,49780..>50208))
                     /gene="F28B23.12"
To look at the easiest attributes of the SeqFeature first, if you got a SeqFeature object for this it would have it type of 'mRNA', a strand of -1 (due to the 'complement'), and would have None for the ref and ref_db since there are no references to external databases. The qualifiers for this SeqFeature would be a python dictionarary that looked like {'gene' : 'F28B23.12'}.

Now let's look at the more tricky part, how the 'join' in the location line is handled. First, the location for the top level SeqFeature (the one we are dealing with right now) is set as going from '<49223' to '>50208' (see section 4.6.2.2 for the nitty gritty on how fuzzy locations like this are handled). So the location of the top level object is the entire span of the feature. So, how do you get at the information in the 'join?' Well, that's where the sub_features go in.

The sub_features attribute will have a list with two SeqFeature objects in it, and these contain the information in the join. Let's look at top_level_feature.sub_features[0]; the first sub_feature). This object is a SeqFeature object with a type of 'mRNA_join,' a strand of -1 (inherited from the parent SeqFeature) and a location going from '<49223' to '49300'.

So, the sub_features allow you to get at the internal information if you want it (i. e. if you were trying to get only the exons out of a genomic sequence), or just to deal with the broad picture (i. e. you just want to know that the coding sequence for a gene lies in a region). Hopefully this structuring makes it easy and intuitive to get at the sometimes complex information that can be contained in a SeqFeature.

4.6.2.2  Locations

In the section on SeqFeatures above, we skipped over one of the more difficult parts of Features, dealing with the locations. The reason this can be difficult is because of fuzziness of the positions in locations. Before we get into all of this, let's just define the vocabulary we'll use to talk about this. Basically there are two terms we'll use:
position
-- This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and 3^5 are all positions.

location
-- A location is two positions that defines a region of a sequence. For instance 5..20 (i. e. 5 to 20) is a location.
I just mention this because sometimes I get confused between the two.

The complication in dealing with locations comes in the positions themselves. In biology many times things aren't entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are five types of fuzzy positions, so we have five classes do deal with them:
ExactPosition
-- As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a a number, and you can get the position by looking at the position attribute of the object.

BeforePosition
-- This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less then 13. To get the specified upper boundary, look at the position attribute of the object.

AfterPosition
-- Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.

WithinPosition
-- This class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as '(1.5)', to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.

BetweenPosition
-- This class deals with a position that occurs between two coordinates. For instance, you might have a protein binding site that occurs between two nucleotides on a sequence. This is represented as '2^3', which indicates that the real position happens between position 2 and 3. Getting this information from the object is very similar to WithinPosition, the position attribute specifies the lower boundary (2, in this case) and the extension indicates the range to the higher boundary (1 in this case).
Now that we've got all of the types of fuzzy positions we can have taken care of, we are ready to actually specify a location on a sequence. This is handled by the FeatureLocation class. An object of this type basically just holds the potentially fuzzy start and end positions of a feature. You can create a FeatureLocation object by creating the positions and passing them in:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(8, 1)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
If you print out a FeatureLocation object, you can get a nice representation of the information:
>>> print my_location
[>5:(8^9)]
We can access the fuzzy start and end positions using the start and end attributes of the location:
>>> my_location.start
<Bio.SeqFeature.AfterPosition instance at 0x101d7164>
>>> print my_location.start
>5
>>> print my_location.end
(8^9)
If you don't want to deal with fuzzy positions and just want numbers, you just need to ask for the nofuzzy_start and nofuzzy_end attributes of the location:
>>> my_location.nofuzzy_start 
5
>>> my_location.nofuzzy_end
8
Notice that this just gives you back the position attributes of the fuzzy locations.

Similary, to make it easy to create a position without worrying about fuzzy positions, you can just pass in numbers to the FeaturePosition constructors, and you'll get back out ExactPosition objects:
>>> exact_location = SeqFeature.FeatureLocation(5, 8)
>>> print exact_location
[5:8]
>>> exact_location.start
<Bio.SeqFeature.ExactPosition instance at 0x101dcab4>
That is all of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!

4.6.2.3  References

Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython -- we have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object.

The attributes include things that you would expect to see in a reference like journal, title and authors. Additionally, it also can hold the medline_id and pubmed_id and a comment about the reference. These are all accessed simply as attributes of the object.

A reference also has a location object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The location is a potentially fuzzy location, as described in section 4.6.2.2.

That's all there is too it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.

4.7  BioRegistry -- automatically finding sequence sources

A consistently annoying problem in bioinformatics is easily finding a sequence and making it available to your program. Sequences are available from a ton of standard locations like NCBI and EMBL. as well as from non-standard locations such as local databases or web servers. To make this problem easier, Biopython (as well as the other open-bio projects) is working towards a standard mechanism to allow specification of the locations of resources. Once locations are specified, your code using Biopython can readily retrieve sequences without having to worry about the specifics of where the sequence came from.

This transparency of retrieval has a number of advantages for your code. If a single web service is down (ie. NCBI is too busy and is refusing connections), backup locations can be tried without having any effect on the code that you wrote. Similary, you can have local repositories of sequences that you use often, and then if these repositories are off-line, switch to a web based service. Third, it keeps the details of retrieval out of your code, allowing you to focus on your biological problem, instead of focusing on boring retrieval details. Finally, it's just a very cool idea.

This section deals with the specifics of setting up and using this system of automatically retrieving sequences. The first section deals with the interoperable configuration file method, while the second talks about a similar Biopython-specific method. The configuration file method is definately the way to go, unless you have specific needs it won't give you.

4.7.1  Finding resources using a configuration file

4.7.1.1  Writing a configuration file

4.7.1.2  Sequence retrieval using the configuration file

4.7.2  Finding resources through a biopython specific interface

Biopython has also developed a proprietary mechanism for retrieval that is Biopython only. This is only a good choice to use if the standard configuration file system doesn't give you everything you want, since this method is not compatible with other open-bio projects.

4.7.2.1  Retrieving sequences

By default, Biopython is configured to allow retrieval of sequences from a number of standard locations. This makes it useable immediately without knowing much about the system itself. To retrieve a Registry of databases, all you need to do is:
>>> from Bio import db
You can readily view all of the different databases that retrieval is possible be either printing the object and examining them, or programmatically through the keys() function of object:
>>> print db
DBRegistry, exporting 'embl', 'embl-dbfetch-cgi', 'embl-ebi-cgi',
'embl-fast', 'embl-xembl-cgi', 'interpro-ebi-cgi',
'nucleotide-dbfetch-cgi', 'nucleotide-genbank-cgi', 'pdb',
'pdb-ebi-cgi', 'pdb-rcsb-cgi', 'prodoc-expasy-cgi',
'prosite-expasy-cgi', 'protein-genbank-cgi', 'swissprot',
'swissprot-expasy-cgi'
>>> db.keys()
['embl-dbfetch-cgi', 'embl-fast', 'embl', 'prosite-expasy-cgi',
'swissprot-expasy-cgi', 'nucleotide-genbank-cgi', 'pdb-ebi-cgi',
'interpro-ebi-cgi', 'embl-ebi-cgi', 'embl-xembl-cgi',
'protein-genbank-cgi', 'pdb', 'prodoc-expasy-cgi',
'nucleotide-dbfetch-cgi', 'swissprot', 'pdb-rcsb-cgi']
Now, let's say we want to retrieve a swissprot record for one of our orchid chalcone synthases. First, we get the swissprot connection, then we retrieve an record of interest:
>>> sp = db["swissprot"]
>>> sp
<Bio.DBRegistry.DBGroup instance at 0x82fdb2c>
record_handle = sp['O23729']
>>> print record_handle.read()[:200]
ID   CHS3_BROFI     STANDARD;      PRT;   394 AA.
AC   O23729;
DT   15-JUL-1999 (Rel. 38, Created)
DT   15-JUL-1999 (Rel. 38, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
This retrieval method is nice for a number of reasons. First, we didn't have to worry about where exactly swissprot records were being retrieved from -- we only ask for an object that will give us any swissprot record we can get. Secondly, once we get the swissprot object, we don't need to worry about how we are getting our sequence -- we just ask for it by id and don't worry about the implementation details.

The default biopython database registry object can be used similarly to retrieve sequences from EMBL, prosite, PDB, interpro, GenBank and XEMBL.

4.7.2.2  Registering and Grouping databases

The basic registry objects are nice in that they provide basic functionality, but if you have a more advanced system it is nice to be able to specify new databases. This is a more advanced topic, but is very possible with the current system.

This example describes adding a local CGI script serving out GenBank (ie. if you had something like a local mirror of GenBank), and then registering this and the normal NCBI GenBank as a single group to retrieve from. This would allow you to normally get things from a local mirror and then switch over to the main GenBank server if your server goes down, all without adjusting your retrieval code.

First, we need to describe the CGI script to retrieve from. This example uses a CGI script, but we eventually hope to handle other sources such as Applications, databases, or CORBA servers (XXX, should have an example once this is in place). We describe the CGI script as follows:
from Bio.sources import CGI
local_cgi = CGI(name = "local_cgi",
                delay = 0.0,
                cgi = "http://www.myserver.org/cgi-bin/my_local.cgi",
                url = "http://www.myserver.org/cgi_documentation.html",
                doc = "Query a local databases",
                failure_cases = [])
Now that we have specified the details for connecting to the CGI script, we are ready to register this CGI script. We just need one more detail -- specifying what the script returns upon failure to find a sequence. We do this using Martel regular expressions:
import Martel
my_failures = [
     (Martel.Str("Sequence not available"), "No sequence found")]
Now we've got everything we need, and can register the database:
from Bio import register_db
register_db(name = "nucleotide-genbank-local",
            key = "uid",
            source = local_cgi,
            failure = my_failures)
This makes the database available as before, so if we print the keys of the database, we'll see "nucleotide-genbank-local" available. Now that we've got it registered, we'd like to link all of the genbank databases together. We do this, using a group_db command. First, we need to create a group named "genbank" to retrieve things from the database:
register_db(name = "genbank", behavior = "concurrent")
The behavior argument specifies how the group will try to retrieve things from the various databases registered with it. concurrent tells it to try to retrieve from all databases at once, and then just take whatever sequence record comes back first. You can also specify serial behavior, in which the retriever will connect to one database at a time until something gets retrieved.

Now that we've got the group, we want to register our local GenBank and the NCBI GenBank with this command:
group_db("genbank", "nucleotide-genbank-local")
group_db("genbank", "nucleotide-genbank-cgi")
Now we've got our database access set up, and the database registry contains our genbank and nucleotide-genbank-local entries:
['embl-dbfetch-cgi', 'embl-fast', 'embl', 'prosite-expasy-cgi',
'swissprot-expasy-cgi', 'nucleotide-genbank-cgi', 'pdb-ebi-cgi',
'genbank', 'nucleotide-genbank-local', 'interpro-ebi-cgi',
'embl-ebi-cgi', 'embl-xembl-cgi', 'protein-genbank-cgi', 'pdb',
'prodoc-expasy-cgi', 'nucleotide-dbfetch-cgi', 'swissprot',
'pdb-rcsb-cgi']
Cool, now we can add our own databases to the registry and make use of the simplified retrieval scheme!

4.8  BioSQL -- storing sequences in a relational database

4.9  BioCorba

Biocorba does some cool stuff with CORBA. Basically, it allows you to easily interact with code written in other languages, including Perl and Java. This is all done through an interface which is very similar to the standard biopython interface. Much work has been done to make it easy to use knowing only very little about CORBA. You should check out the biocorba specific documentation, which describes in detail how to use it.

4.10  Going 3D: The PDB module

Biopython also allows you to explore the extensive realm of macromolecular structure. Biopython comes with a PDBParser class that produces a Structure object. The Structure object can be used to access the atomic data in the file in a convenient manner.

4.10.1  Structure representation

A macromolecular structure is represented using a structure, model chain, residue, atom (or SMCRA) hierarchy. Fig. 4.10.1 shows a UML class diagram of the SMCRA data structure. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used to detect likely problems. We will give several examples of this in section 4.10.5.1.



Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).

For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key).

Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects.

In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.
child_entity=parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).
child_list=parent_entity.get_list()
You can also get the parent from a child.
parent_entity=child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all id's starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is something like:
full_id=residue.get_full_id()

print full_id

("1abc", 0, "A", ("", 10, "A"))
This corresponds to: The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blanc hetero field, that its sequence identifier is 10 and that its insertion code is "A".

Some other useful methods:
# get the entity's id

entity.get_id()

# check if there is a child with a given id

entity.has_id(entity_id)

# get number of children

nr_children=len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.

4.10.1.1  Structure

The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.

4.10.1.1.1  Constructing a Structure object
A Structure object is produced by a PDBParser object:
from Bio.PDB.PDBParser import PDBParser

p=PDBParser(PERMISSIVE=1)

structure_id="1fat"

filename="pdb1fat.ent"

s=p.get_structure(structure_id, filename)
The PERMISSIVE flag indicates that a number of common problems (see 4.10.5.1) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated during the parse operation.

4.10.1.1.2  Header and trailer
You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods.

4.10.1.2  Model

The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). The Model object stores a list of Chain children.

4.10.1.2.1  Example
Get the first model from a Structure object.
first_model=structure[0]

4.10.1.3  Chain

The id of a Chain object is derived from the chain identifier in the structure file, and can be any string. Each Chain in a Model object has a unique id. The Chain object stores a list of Residue children.

4.10.1.3.1  Example
Get the Chain object with identifier ``A'' from a Model object.
chain_A=model["A"]

4.10.1.4  Residue

Unsurprisingly, a Residue object stores a set of Atom children. In addition, it also contains a string that specifies the residue name (e.g. ``ASN'') and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).

The id of a Residue object is composed of three parts: the hetero field (hetfield), the sequence identifier (resseq) and the insertion code (icode).

The hetero field is a string : it is ``W'' for waters, ``H_'' followed by the residue name (e.g. ``H_FUC'') for other hetero residues and blank for standard amino and nucleic acids. This scheme is adopted for reasons described in section 4.10.3.1.

The second field in the Residue id is the sequence identifier, an integer describing the position of the residue in the chain.

The third field is a string, consisting of the insertion code. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as followed: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.

Let's give some examples. Asn 10 with a blank insertion code would have residue id ('' '', 10, '' ''). Water 10 would have residue id (``W``, 10, `` ``). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id (''H_GLC'', 10, '' ''). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue id's are distinct.

In most cases, the hetflag and insertion code fields will be blank, e.g. ('' '', 10, '' ''). In these cases, the sequence identifier can be used as a shortcut for the full id:
# use full id

res10=chain[("", 10, "")]

# use shortcut 

res10=chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section 4.10.2.3.2.

A Residue object has a number of additional methods:
r.get_resname()  # return residue name, e.g. "ASN"
r.get_segid()  # return the SEGID, e.g. "CHN1"

4.10.1.5  Atom

The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. ``OG'' for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section 4.10.2.2.

In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid C a atom is labeled ``.CA.'' in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names ``.CA.'' and ``CA..'', although this is not very likely.

The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.

An Atom object has the following additional methods:
a.get_name()       # atom name (spaces stripped, e.g. "CA")

a.get_id()         # id (equals atom name)

a.get_coord()      # atomic coordinates

a.get_bfactor()    # B factor

a.get_occupancy()  # occupancy

a.get_altloc()     # alternative location specifie

a.get_sigatm()     # std. dev. of atomic parameters

a.get_siguij()     # std. dev. of anisotropic B factor

a.get_anisou()     # anisotropic B factor

a.get_fullname()   # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy arrays are used.

4.10.2  Disorder

4.10.2.1  General approach

Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all C a atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.

4.10.2.2  Disordered atoms

Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object. Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it.

Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier:
atom.disordered\_select("A")  # select altloc A atom

print atom.get_altloc()
"A"

atom.disordered_select("B")     # select altloc B atom
print atom.get_altloc()
"B"

4.10.2.3  Disordered residues

4.10.2.3.1  Common case
The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.

4.10.2.3.2  Point mutations
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.

Since these residues belong to a different residue type (e.g. let's say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a DisorderedResidue object.

The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id ``SER'' in the DisorderedResidue object, while residue Cys 60 would have id ``CYS''. The user can select the active Residue object in a DisorderedResidue object via this id.

4.10.3  Hetero residues

4.10.3.1  Associated problems

A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.

Remember that Residue object have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (`` ``) for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.

4.10.3.2  Water residues

The hetfield string of a water residue consists of the letter ``W''. So a typical residue id for a water is (``W'', 1, `` ``).

4.10.3.3  Other hetero residues

The hetfield string for other hetero residues starts with ``H_'' followed by the residue name. A glucose molecule e.g. with residue name ``GLC'' would have hetfield ``H_GLC''. It's residue id could e.g. be (``H_GLC'', 1, `` ``).

4.10.4  Some random usage examples

Parse a PDB file, and extract some Model, Chain, Residue and Atom objects.
from PDBParser import PDBParser 

parser=PDBParser()

structure=parser.get_structure("test", "1fat.pdb")
model=structure[0]
chain=model["A"]
residue=chain[1]
atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10).
residue_id=("H_GLC", 10, " ")
residue=chain[residue_id]
Print all hetero residues in chain.
for residue in chain.get_list():
 residue_id=residue.get_id()
 hetfield=residue_id[0]
 if hetfield[0]=="H":
  print residue_id
Print out the coordinates of all CA atoms in a structure with B factor greater than 50.
for model in structure.get_list():
  for chain in model.get_list():
    for residue in chain.get_list():
      if residue.has_id("CA"):
        ca=residue["CA"]
        if ca.get_bfactor()>50.0:
          print ca.get_coord()
Print out all the residues that contain disordered atoms.
for model in structure.get_list()
  for chain in model.get_list():
    for residue in chain.get_list():
      if residue.is_disordered():
        resseq=residue.get_id()[1]
        resname=residue.get_resname()
        model_id=model.get_id()
        chain_id=chain.get_id()
        print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if present). This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present.
for model in structure.get_list()
  for chain in model.get_list():
    for residue in chain.get_list():
      if residue.is_disordered():
        for atom in residue.get_list():
          if atom.is_disordered():
            if atom.disordered_has_id("A"):
              atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.
residue=chain[10]
residue.disordered_select("CYS")

4.10.5  Common problems in PDB files

4.10.5.1  Examples

The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FKK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.

Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.

4.10.5.1.1  Duplicate residues
One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, ..., Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).

4.10.5.1.2  Duplicate atoms
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22. In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have altloc A, except the N atom which has a blank altloc. This generates an exception, because all atoms belonging to two residues at a point mutation should have non-blank altloc. It turns out that this atom is probably shared by Ser and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the file: the N atom should be present in both the Ser and the Pro residue, in both cases associated with a suitable altloc identifier.

4.10.5.2  Automatic correction

Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.

4.10.5.2.1  A blank altloc for a disordered atom
Normally each disordered atom should have a non-blanc altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.

4.10.5.2.2  Broken chains
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are ``broken''. This is correctly interpreted.

4.10.5.3  Fatal errors

Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.

4.10.5.3.1  Duplicate residues
All residues in a chain should have a unique id. This id is generated based on: If this does not lead to a unique id something is quite likely wrong, and an exception is generated.

4.10.5.3.2  Duplicate atoms
All atoms in a residue should have a unique id. This id is generated based on: If this does not lead to a unique id something is quite likely wrong, and an exception is generated.

4.10.6  Other features

There are also some tools to analyze a crystal structure. Tools exist to superimpose two coordinate sets (SVDSuperimposer), to extract polypeptides from a structure (Polypeptide), to perform neighbor lookup (NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup is done using a KD tree module written in C++. It is very fast and also includes a fast method to find all point pairs within a certain distance of each other.

A Polypeptide object is simply a UserList of Residue objects. You can construct a list of Polypeptide objects from a Structure object as follows:
model_nr=1
polypeptide_list=build_peptides(structure, model_nr)

for polypeptide in polypeptide_list:
    print polypeptide
The Polypeptide objects are always created from a single Model (in this case model 1).

4.11  Miscellaneous

4.11.1  Translating a DNA sequence to Protein


Previous Up Next