The Bio.PubMed
module uses Bio.Entrez
internally to access the NCBI.
Please remember to read and observe the NCBI’s guidelines on using their Entrez online facilities responsibly. If you are found to be abusing their servers, they can and will block your access! See Section 7.1.
If you are in the Medical field or interested in human issues (and many times even if you are not!), PubMed (http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all kinds of goodies. So like other things, we’d like to be able to grab information from it and use it in python scripts.
Querying PubMed using Biopython is extremely painless. To get all of the article ids for articles having to do with orchids (see section 2.3 for our motivation), we only need the following three lines of code:
from Bio import PubMed search_term = 'orchid' orchid_ids = PubMed.search_for(search_term)
This returns a python list containing all of the orchid ids
['11070358', '11064040', '11028023', '10947239', '10938351', '10936520', '10905611', '10899814', '10856762', '10854740', '10758893', '10716342', ...
With this list of ids we are ready to start retrieving the records, so follow on ahead to the next section.
The previous section described how to get a bunch of article ids. Now that we’ve got them, we obviously want to get the corresponding Medline records and extract the information from them.
The interface for retrieving records from PubMed should be very intuitive to python programmers – it models a python dictionary. To set up this interface, we need to set up a parser that will parse the results that we retrieve. The following lines of code get everything set up:
from Bio import PubMed from Bio import Medline rec_parser = Medline.RecordParser() medline_dict = PubMed.Dictionary(parser = rec_parser)
What we’ve done is create a dictionary like object medline_dict
. To get an article we access it like medline_dict[id_to_get]
. What this does is connect with PubMed, get the article you ask for, parse it into a record object, and return it. Very cool!
Now let’s look at how to use this nice dictionary to print out some information about some ids. We just need to loop through our ids (orchid_ids
from the previous section) and print out the information we are interested in:
for oid in orchid_ids[0:5]: cur_record = medline_dict[oid] print 'title:', cur_record.title.rstrip() print 'authors:', cur_record.authors print 'source:', cur_record.source.strip() print
The output for this looks like:
title: Sex pheromone mimicry in the early spider orchid (ophrys sphegodes): patterns of hydrocarbons as the key mechanism for pollination by sexual deception [In Process Citation] authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C', 'Hansson BS', 'Ibarra F', 'Francke W'] source: J Comp Physiol [A] 2000 Jun;186(6):567-74
Especially interesting to note is the list of authors, which is returned as a standard python list. This makes it easy to manipulate and search using standard python tools. For instance, we could loop through a whole bunch of entries searching for a particular author with code like the following:
search_author = 'Waits T' for our_id in our_id_list: cur_record = medline_dict[our_id] if search_author in cur_record.authors: print "Author %s found: %s" % (search_author, cur_record.source.strip())
The PubMed and Medline interfaces are very mature and nice to work with – hopefully this section gave you an idea of the power of the interfaces and how they can be used.
The GenBank record format is a very popular method of holding information about sequences, sequence features, and other associated sequence information. The format is a good way to get information from the NCBI databases at http://www.ncbi.nlm.nih.gov/.
The Bio.GenBank
module uses the NCBI’s Entrez interface to run searches and fetch data.
Please remember to read and observe the NCBI’s guidelines on using their Entrez online facilities responsibly. If you are found to be abusing their servers, they can and will block your access! See Section 7.1.
One very nice feature of the GenBank libraries is the ability to automate retrieval of entries from GenBank. This is very convenient for creating scripts that automate a lot of your daily work. In this example we’ll show how to query the NCBI databases, and to retrieve the records from the query - something touched on in Section 4.2.1.
First, we want to make a query and find out the ids of the records to retrieve. Here we’ll do a quick search for our favorite organism, Opuntia. We can do quick search and get back the GIs (GenBank identifiers) for all of the corresponding records:
from Bio import GenBank gi_list = GenBank.search_for("Opuntia AND rpl16")
gi_list
will be a list of all of the GenBank identifiers that match our query:
["6273291", "6273290", "6273289", "6273287", "6273286", "6273285", "6273284"]
Now that we’ve got the GIs, we can use these to access the NCBI database through a dictionary interface. For instance, to retrieve the information for the first GI, we’ll first have to create a dictionary that accesses NCBI:
ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank")
Now that we’ve got this, we do the retrieval:
gb_record = ncbi_dict[gi_list[0]]
In this case, gb_record
will be GenBank formatted record:
LOCUS AF191665 902 bp DNA PLN 07-NOV-1999 DEFINITION Opuntia marenae rpl16 gene; chloroplast gene for chloroplast product, partial intron sequence. ACCESSION AF191665 VERSION AF191665.1 GI:6273291 ...
In this case, we are just getting the raw records. We can also pass these records directly into a parser and return the parsed record. For instance, if we wanted to get back SeqRecord objects with the GenBank file parsed into SeqFeature objects we would need to create the dictionary with the GenBank FeatureParser:
record_parser = GenBank.FeatureParser() ncbi_dict = GenBank.NCBIDictionary("nucleotide", "genbank", parser = record_parser)
Now retrieving a record will give you a SeqRecord object instead of the raw record:
>>> gb_seqrecord = ncbi_dict[gi_list[0]] >>> print gb_seqrecord <Bio.SeqRecord.SeqRecord instance at 0x102f9404>
Using these automated query retrieval functionality is a big plus over doing things by hand. Although the module should obey the NCBI’s three-second rule, the NCBI have other recommendations like avoiding peak hours. See Section 7.1.
For more information of formats you can parse GenBank records into, please see section 9.2.2.
While GenBank files are nice and have lots of information, at the same time you probably only want to extract a small amount of that information at a time. The key to doing this is parsing out the information. Biopython provides GenBank parsers which help you accomplish this task. Right now the GenBank module provides the following parsers:
Bio.SeqIO
(Chapter 4) to read a GenBank file, it will call this FeatureParser for you.
Depending on the type of GenBank files you are interested in, they will either contain a single record, or multiple records. Each record will start with a LOCUS line, various other header lines, a list of features, and finally the sequence data, ending with a // line.
Dealing with a GenBank file containing a single record is very easy. For example, let’s use a small bacterial genome, Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can be downloaded from the NCBI here (only 1.15 MB):
from Bio import GenBank feature_parser = GenBank.FeatureParser() gb_record = feature_parser.parse(open("AE017199.gbk")) # now do something with the record print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) print repr(gb_record.seq)
Or, using Bio.SeqIO
instead (see Chapter 4):
from Bio import SeqIO gb_record = SeqIO.read(open("AE017199.gbk"), "genbank") print "Name %s, %i features" % (gb_record.name, len(gb_record.features)) print repr(gb_record.seq)
Either should give the following output:
Name AE017199, 1107 features Seq('TCTCGCAGAGTTCTTTTTTGTATTAACAAACCCAAAACCCATAGAATTTAATGA...TTA', IUPACAmbiguousDNA())
For multi-record GenBank files, the most common usage will be creating an iterator, and parsing through the file record by record. Doing this is very similar to how things are done in other formats, as the following code demonstrates, using an example file cor6_6.gb which is included in the BioPython source code under the Tests/GenBank/ directory:
from Bio import GenBank feature_parser = GenBank.FeatureParser() gb_iterator = GenBank.Iterator(open("cor6_6.gb"), feature_parser) for cur_record in gb_iterator : print "Name %s, %i features" % (cur_record.name, len(cur_record.features)) print repr(cur_record.seq)
Or, using Bio.SeqIO
instead (see Chapter 4):
from Bio import SeqIO for cur_record in SeqIO.parse(open("cor6_6.gb"), "genbank") : print "Name %s, %i features" % (cur_record.name, len(cur_record.features)) print repr(cur_record.seq)
This just iterates over a GenBank file, parsing it into SeqRecord and SeqFeature objects, and prints out the Seq objects representing the sequences in the record.
As with other formats, you have lots of tools for dealing with GenBank records. This should make it possible to do whatever you need to with GenBank.
It is often very useful to be able to align particular sequences. I do this quite often to get a quick and dirty idea of relationships between sequences. Consequently, it is very nice to be able to quickly write up a python script that does an alignment and gives you back objects that are easy to work with. The alignment related code in Biopython is meant to allow python-level access to alignment programs so that you can run alignments quickly from within scripts.
Clustalx (http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html) is a very nice program for doing multiple alignments. Biopython offers access to alignments in clustal format (these normally have a *.aln
extension) that are produced by Clustalx. It also offers access to clustalw, which the is command line version of clustalx.
We’ll need some sequences to align, such as opuntia.fasta (also available online here) which is a small FASTA file containing seven orchid gene DNA sequences, which you can also from Doc/examples/
in the Biopython source distribution.
The first step in interacting with clustalw is to set up a command line you want to pass to the program. Clustalw has a ton of command line options, and if you set a lot of parameters, you can end up typing in a huge ol’ command line quite a bit. This command line class models the command line by making all of the options be attributes of the class that can be set. A few convenience functions also exist to set certain parameters, so that some error checking on the parameters can be done.
To create a command line object to do a clustalw multiple alignment we do the following:
import os from Bio.Clustalw import MultipleAlignCL cline = MultipleAlignCL(os.path.join(os.curdir, "opuntia.fasta")) cline.set_output("test.aln")
First we import the MultipleAlignCL
object, which models running a multiple alignment from clustalw. We then initialize the command line, with a single argument of the fasta file that we are going to be using for the alignment. The initialization function also takes an optional second argument which specifies the location of the clustalw
executable. By default, the commandline will just be invoked with ’clustalw,’ assuming that you’ve got it somewhere on your PATH
.
The second argument sets the output to go to the file test.aln
. The MultipleAlignCL
object also has numerous other parameters to specify things like output format, gap costs, etc.
We can look at the command line we have generated by invoking the __str__
member attribute of the MultipleAlignCL
class. This is done by calling str(cline)
or simple by printing out the command line with print cline
. In this case, doing this would give the following output:
clustalw ./opuntia.fasta -OUTFILE=test.aln
Now that we’ve set up a simple command line, we now want to run the commandline and collect the results so we can deal with them. This can be done using the do_alignment
function of Clustalw
as follows:
from Bio import Clustalw alignment = Clustalw.do_alignment(cline)
What happens when you run this if that Biopython executes your command line and runs clustalw with the given parameters. It then grabs the output, and if it is in a format that Biopython can parse (currently only clustal format), then it will parse the results and return them as an alignment object of the appropriate type. So in this case since we are getting results in the default clustal format, the returned alignment
object will be a ClustalAlignment
type.
Once we’ve got this alignment, we can do some interesting things with it such as get seq_record
objects for all of the sequences involved in the alignment:
all_records = alignment.get_all_seqs() print "description:", all_records[0].description print "sequence:", all_records[0].seq
This prints out the description and sequence object for the first sequence in the alignment:
description: gi|6273285|gb|AF191659.1|AF191 sequence: Seq('TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT ...', IUPACAmbiguousDNA())
You can also calculate the maximum length of the alignment with:
length = alignment.get_alignment_length()
Finally, to write out the alignment object in the original format, we just need to access the __str__
function. So doing a print alignment
gives:
CLUSTAL X (1.81) multiple sequence alignment gi|6273285|gb|AF191659.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA gi|6273284|gb|AF191658.1|AF191 TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAA ...
This makes it easy to write your alignment back into a file with all of the original info intact.
If you want to do more interesting things with an alignment, the best thing to do is to pass the alignment to an alignment information generating object, such as the SummaryInfo object, described in section 9.3.2.
Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we’ve tried to separate out the functionality into separate classes, which act on the alignment.
Getting ready to calculate summary information about an object is quick to do. Let’s say we’ve got an alignment object called alignment
. All we need to do to get an object that will calculate summary information is:
from Bio.Align import AlignInfo summary_align = AlignInfo.SummaryInfo(alignment)
The summary_align
object is very useful, and will do the following neat things for you:
The SummaryInfo
object, described in section 9.3.2, provides functionality to calculate a quick consensus of an alignment. Assuming we’ve got a SummaryInfo
object called summary_align
we can calculate a consensus by doing:
consensus = summary_align.dumb_consensus()
As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value (the default is .3) will add the common residue to the consensus. If it doesn’t reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus
would give:
consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT ...', IUPACAmbiguousDNA())
You can adjust how dumb_consensus
works by passing optional parameters:
Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of each alphabet letters is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consesus sequence, but can also be any sequence in the alignment. For instance for the alignment,
GTATC AT--C CTGTC
the PSSM is:
G A T C G 1 1 0 1 T 0 0 3 0 A 1 1 0 0 T 0 0 2 0 C 0 0 0 3
Let’s assume we’ve got an alignment object called c_align
. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:
summary_align = AlignInfo.SummaryInfo(c_align) consensus = summary_align.dumb_consensus()
Now, we want to make the PSSM, but ignore any N
ambiguity residues when calculating this:
my_pssm = summary_align.pos_specific_score_matrix(consensus, chars_to_ignore = ['N'])
Two notes should be made about this:
second_seq = alignment.get_seq_by_num(1) my_pssm = summary_align.pos_specific_score_matrix(second_seq chars_to_ignore = ['N'])
The command above returns a PSSM
object. To print out the PSSM as we showed above, we simply need to do a print my_pssm
, which gives:
A C G T T 0.0 0.0 0.0 7.0 A 7.0 0.0 0.0 0.0 T 0.0 0.0 0.0 7.0 A 7.0 0.0 0.0 0.0 C 0.0 7.0 0.0 0.0 A 7.0 0.0 0.0 0.0 T 0.0 0.0 0.0 7.0 T 1.0 0.0 0.0 6.0 ...
You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]
. For instance, to get the counts for the ’A’ residue in the second element of the above PSSM you would do:
>>> print my_pssm[1]["A"] 7.0
The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.
A potentially useful measure of evolutionary conservation is the information content of a sequence.
A useful introduction to information theory targetted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consesus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:
ICj = |
| Pij log | ⎛ ⎜ ⎜ ⎝ |
| ⎞ ⎟ ⎟ ⎠ |
where:
Well, now that we have an idea what information content is being calculated in Biopython, let’s look at how to get it for a particular region of the alignment.
First, we need to use our alignment to get a alignment summary object, which we’ll assume is called summary_align
(see section 9.3.2) for instructions on how to get this. Once we’ve got this object, calculating the information content for a region is as easy as:
info_content = summary_align.information_content(5, 30, chars_to_ignore = ['N'])
Wow, that was much easier then the formula above made it look! The variable info_content
now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue ’N’ when calculating the information content, since this value is not included in our alphabet (so we shouldn’t be interested in looking at it!).
As mentioned above, we can also calculate relative information content by supplying the expected frequencies:
expect_freq = { 'A' : .3, 'G' : .2, 'T' : .3, 'C' : .2}
The expected should not be passed as a raw dictionary, but instead by passed as a SubsMat.FreqTable
object (see section 10.4.2 for more information about FreqTables). The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.
To create a FreqTable object, from the frequency dictionary you just need to do:
from Bio.Alphabet import IUPAC from Bio.SubsMat import FreqTable e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ, IUPAC.unambiguous_dna)
Now that we’ve got that, calculating the relative information content for our region of the alignment is as simple as:
info_content = summary_align.information_content(5, 30, e_freq_table = e_freq_table, chars_to_ignore = ['N'])
Now, info_content
will contain the relative information content over the region in relation to the expected frequencies.
The value return is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base
as the base you want:
info_content = summary_align.information_content(5, 30, log_base = 10 chars_to_ignore = ['N'])
Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won’t reveal any mistakes made in coding this function!
One thing that you always end up having to do is convert between different formats.
You can do this using the Bio.AlignIO
module, see Section 5.2.1.
Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book “Biological Sequence Analysis” by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.
Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.
A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done with protein alignments. In this example, we’ll first get a biopython alignment object and then get a summary object to calculate info about the alignment. The file containing protein.aln (also available online here) contains the Clustalw alignment output.
from Bio import Clustalw from Bio.Alphabet import IUPAC from Bio.Align import AlignInfo # get an alignment object from a Clustalw alignment output c_align = Clustalw.parse_file("protein.aln", IUPAC.protein) summary_align = AlignInfo.SummaryInfo(c_align)
Sections 9.3.1 and 9.3.2 contain more information on doing this.
Now that we’ve got our summary_align
object, we want to use it
to find out the number of times different residues substitute for each
other. To make the example more readable, we’ll focus on only amino
acids with polar charged side chains. Luckily, this can be done easily
when generating a replacement dictionary, by passing in all of the
characters that should be ignored. Thus we’ll create a dictionary of
replacements for only charged polar amino acids using:
replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I", "M", "P", "F", "W", "S", "T", "N", "Q", "Y", "C"])
This information about amino acid replacements is represented as a python dictionary which will look something like:
{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0, ('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0, ('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0, ('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0, ('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0, ('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0, ('D', 'E'): 270.0, ('D', 'D'): 2360.0}
This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement Matrix (ARM):
from Bio import SubsMat my_arm = SubsMat.SeqMat(replace_info)
With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i. e. a standard type Substitution Matrix):
my_lom = SubsMat.make_log_odds_matrix(my_arm)
The log odds matrix you create is customizable with the following optional arguments:
exp_freq_table
– You can pass a table of expected
frequencies for each alphabet. If supplied, this will be used
instead of the passed accepted replacement matrix when calculate
expected replacments.logbase
- The base of the logarithm taken to create the
log odd matrix. Defaults to base 10.factor
- The factor to multiply each matrix entry
by. This defaults to 10, which normally makes the matrix numbers
easy to work with.round_digit
- The digit to round to in the matrix. This
defaults to 0 (i. e. no digits).Once you’ve got your log odds matrix, you can display it prettily
using the function print_mat
. Doing this on our created matrix
gives:
>>> my_lom.print_mat() D 6 E -5 5 H -15 -13 10 K -31 -15 -13 6 R -13 -25 -14 -7 7 D E H K R
Very nice. Now we’ve got our very own substitution matrix to play with!
BioSQL is a joint effort between the
OBF projects (BioPerl, BioJava etc) to support a
shared database schema for storing sequence data. In theory, you could load a
GenBank file into the database with BioPerl, then using Biopython extract this
from the database as a record object with featues - and get more or less the same
thing as if you had loaded the GenBank file directly as a SeqRecord using
Bio.SeqIO
(Chapter 4).
Biopython’s BioSQL module is currently documented at http://biopython.org/wiki/BioSQL which is part of our wiki pages.
Biocorba does some cool stuff with CORBA. Basically, it allows you to easily interact with code written in other languages, including Perl and Java. This is all done through an interface which is very similar to the standard biopython interface. Much work has been done to make it easy to use knowing only very little about CORBA. You should check out the biocorba specific documentation, which describes in detail how to use it.
Biopython also allows you to explore the extensive realm of macromolecular structure. Biopython comes with a PDBParser class that produces a Structure object. The Structure object can be used to access the atomic data in the file in a convenient manner.
A macromolecular structure is represented using a structure, model chain, residue, atom (or SMCRA) hierarchy. Fig. 9.7.1 shows a UML class diagram of the SMCRA data structure. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used to detect likely problems. We will give several examples of this in section 9.7.5.1.
Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).
For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key).
Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects.
In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.
child_entity=parent_entity[child_id]
You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).
child_list=parent_entity.get_list()
You can also get the parent from a child.
parent_entity=child_entity.get_parent()
At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all id’s starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is something like:
full_id=residue.get_full_id() print full_id ("1abc", 0, "A", ("", 10, "A"))
This corresponds to:
The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blanc hetero field, that its sequence identifier is 10 and that its insertion code is "A".
Some other useful methods:
# get the entity's id entity.get_id() # check if there is a child with a given id entity.has_id(entity_id) # get number of children nr_children=len(entity)
It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.
A Structure object is produced by a PDBParser object:
from Bio.PDB.PDBParser import PDBParser p=PDBParser(PERMISSIVE=1) structure_id="1fat" filename="pdb1fat.ent" s=p.get_structure(structure_id, filename)
The PERMISSIVE flag indicates that a number of common problems (see 9.7.5.1) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated during the parse operation.
You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods.
The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). The Model object stores a list of Chain children.
Get the first model from a Structure object.
first_model=structure[0]
The id of a Chain object is derived from the chain identifier in the structure file, and can be any string. Each Chain in a Model object has a unique id. The Chain object stores a list of Residue children.
Get the Chain object with identifier “A” from a Model object.
chain_A=model["A"]
Unsurprisingly, a Residue object stores a set of Atom children. In addition, it also contains a string that specifies the residue name (e.g. “ASN”) and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).
The id of a Residue object is composed of three parts: the hetero field (hetfield), the sequence identifier (resseq) and the insertion code (icode).
The hetero field is a string : it is “W” for waters, “H_” followed by the residue name (e.g. “H_FUC”) for other hetero residues and blank for standard amino and nucleic acids. This scheme is adopted for reasons described in section 9.7.3.1.
The second field in the Residue id is the sequence identifier, an integer describing the position of the residue in the chain.
The third field is a string, consisting of the insertion code. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as followed: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.
Let’s give some examples. Asn 10 with a blank insertion code would have residue id (’’ ’’, 10, ’’ ’’). Water 10 would have residue id (‘‘W‘‘, 10, ‘‘ ‘‘). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id (’’H_GLC’’, 10, ’’ ’’). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue id’s are distinct.
In most cases, the hetflag and insertion code fields will be blank, e.g. (’’ ’’, 10, ’’ ’’). In these cases, the sequence identifier can be used as a shortcut for the full id:
# use full id res10=chain[("", 10, "")] # use shortcut res10=chain[10]
Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section 9.7.2.3.2.
A Residue object has a number of additional methods:
r.get_resname() # return residue name, e.g. "ASN" r.get_segid() # return the SEGID, e.g. "CHN1"
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. “OG” for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section 9.7.2.2.
In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid C α atom is labeled “.CA.” in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names “.CA.” and “CA..”, although this is not very likely.
The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.
An Atom object has the following additional methods:
a.get_name() # atom name (spaces stripped, e.g. "CA") a.get_id() # id (equals atom name) a.get_coord() # atomic coordinates a.get_bfactor() # B factor a.get_occupancy() # occupancy a.get_altloc() # alternative location specifie a.get_sigatm() # std. dev. of atomic parameters a.get_siguij() # std. dev. of anisotropic B factor a.get_anisou() # anisotropic B factor a.get_fullname() # atom name (with spaces, e.g. ".CA.")
To represent the atom coordinates, siguij, anisotropic B factor and sigatm Numpy arrays are used.
Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all C α atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.
Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object. Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it.
Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier:
atom.disordered\_select("A") # select altloc A atom print atom.get_altloc() "A" atom.disordered_select("B") # select altloc B atom print atom.get_altloc() "B"
The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.
A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.
Since these residues belong to a different residue type (e.g. let’s say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a DisorderedResidue object.
The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id “SER” in the DisorderedResidue object, while residue Cys 60 would have id “CYS”. The user can select the active Residue object in a DisorderedResidue object via this id.
A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.
Remember that Residue object have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (“ “) for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.
The hetfield string of a water residue consists of the letter “W”. So a typical residue id for a water is (“W”, 1, “ “).
The hetfield string for other hetero residues starts with “H_” followed by the residue name. A glucose molecule e.g. with residue name “GLC” would have hetfield “H_GLC”. It’s residue id could e.g. be (“H_GLC”, 1, “ “).
Parse a PDB file, and extract some Model, Chain, Residue and Atom objects.
from PDBParser import PDBParser parser=PDBParser() structure=parser.get_structure("test", "1fat.pdb") model=structure[0] chain=model["A"] residue=chain[1] atom=residue["CA"]
Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10).
residue_id=("H_GLC", 10, " ") residue=chain[residue_id]
Print all hetero residues in chain.
for residue in chain.get_list(): residue_id=residue.get_id() hetfield=residue_id[0] if hetfield[0]=="H": print residue_id
Print out the coordinates of all CA atoms in a structure with B factor greater than 50.
for model in structure.get_list(): for chain in model.get_list(): for residue in chain.get_list(): if residue.has_id("CA"): ca=residue["CA"] if ca.get_bfactor()>50.0: print ca.get_coord()
Print out all the residues that contain disordered atoms.
for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): if residue.is_disordered(): resseq=residue.get_id()[1] resname=residue.get_resname() model_id=model.get_id() chain_id=chain.get_id() print model_id, chain_id, resname, resseq
Loop over all disordered atoms, and select all atoms with altloc A (if present). This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present.
for model in structure.get_list() for chain in model.get_list(): for residue in chain.get_list(): if residue.is_disordered(): for atom in residue.get_list(): if atom.is_disordered(): if atom.disordered_has_id("A"): atom.disordered_select("A")
Suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.
residue=chain[10] residue.disordered_select("CYS")
The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FKK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.
Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.
One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, …, Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).
Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22. In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have altloc A, except the N atom which has a blank altloc. This generates an exception, because all atoms belonging to two residues at a point mutation should have non-blank altloc. It turns out that this atom is probably shared by Ser and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the file: the N atom should be present in both the Ser and the Pro residue, in both cases associated with a suitable altloc identifier.
Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.
Normally each disordered atom should have a non-blanc altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.
Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are “broken”. This is correctly interpreted.
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.
All residues in a chain should have a unique id. This id is generated based on:
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
All atoms in a residue should have a unique id. This id is generated based on:
If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
There are also some tools to analyze a crystal structure. Tools exist to superimpose two coordinate sets (SVDSuperimposer), to extract polypeptides from a structure (Polypeptide), to perform neighbor lookup (NeighborSearch) and to write out PDB files (PDBIO). The neighbor lookup is done using a KD tree module written in C++. It is very fast and also includes a fast method to find all point pairs within a certain distance of each other.
A Polypeptide object is simply a UserList of Residue objects. You can construct a list of Polypeptide objects from a Structure object as follows:
model_nr=1 polypeptide_list=build_peptides(structure, model_nr) for polypeptide in polypeptide_list: print polypeptide
The Polypeptide objects are always created from a single Model (in this case model 1).
Bio.PopGen is a new Biopython module supporting population genetics, available in Biopython 1.44 onwards.
The medium term objective for the module is to support widely used data formats, applications and databases. This module is currently under intense development and support for new features should appear at a rather fast pace. Unfortunately this might also entail some instability on the API, especially if you are using a CVS version. APIs that are made available on public versions should be much stabler.
GenePop (http://genepop.curtin.edu.au/) is a popular population genetics software package supporting Hardy-Weinberg tests, linkage desiquilibrium, population diferentiation, basic statistics, Fst and migration estimates, among others. GenePop does not supply sequence based statistics as it doesn’t handle sequence data. The GenePop file format is supported by a wide range of other population genetic software applications, thus making it a relevant format in the population genetics field.
Bio.PopGen provides a parser and generator of GenePop file format. Utilities to manipulate the content of a record are also provided. Here is an example on how to read a GenePop file (you can find example GenePop data files in the Test/PopGen directory of Biopython):
from Bio.PopGen import GenePop handle = open("example.gen") rec = GenePop.parse(handle) handle.close()
This will read a file called example.gen and parse it. If you do print rec, the record will be output again, in GenePop format.
The most important information in rec will be the loci names and population information (but there is more – use help(GenePop.Record) to check the API documentation). Loci names can be found on rec.loci_list. Population information can be found on rec.populations. Populations is a list with one element per population. Each element is itself a list of individuals, each individual is a pair composed by individual name and a list of alleles (2 per marker), here is an example for rec.populations:
[ [ ('Ind1', [(1, 2), (3, 3), (200, 201)], ('Ind2', [(2, None), (3, 3), (None, None)], ], [ ('Other1', [(1, 1), (4, 3), (200, 200)], ] ]
So we have two populations, the first with two individuals, the second with only one. The first individual of the first population is called Ind1, allelic information for each of the 3 loci follows. Please note that for any locus, information might be missing (see as an example, Ind2 above).
A few utility functions to manipulate GenePop records are made available, here is an example:
from Bio.PopGen import GenePop #Imagine that you have loaded rec, as per the code snippet above... rec.remove_population(pos) #Removes a population from a record, pos is the population position in # rec.populations, remember that it starts on position 0. # rec is altered. rec.remove_locus_by_position(pos) #Removes a locus by its position, pos is the locus position in # rec.loci_list, remember that it starts on position 0. # rec is altered. rec.remove_locus_by_name(name) #Removes a locus by its name, name is the locus name as in # rec.loci_list. If the name doesn't exist the function fails # silently. # rec is altered. rec_loci = rec.split_in_loci() #Splits a record in loci, that is, for each loci, it creates a new # record, with a single loci and all populations. # The result is returned in a dictionary, being each key the locus name. # The value is the GenePop record. # rec is not altered. rec_pops = rec.split_in_pops(pop_names) #Splits a record in populations, that is, for each population, it creates # a new record, with a single population and all loci. # The result is returned in a dictionary, being each key # the population name. As population names are not available in GenePop, # they are passed in array (pop_names). # The value of each dictionary entry is the GenePop record. # rec is not altered.
GenePop does not support population names, a limitation which can be cumbersome at times. Functionality to enable population names is currently being planned for Biopython. These extensions won’t break compatibility in any way with the standard format. In the medium term, we would also like to support the GenePop web service.
A coalescent simulation is a backward model of population genetics with relation to time. A simulation of ancestry is done until the Most Recent Common Ancestor (MRCA) is found. This ancestry relationship starting on the MRCA and ending on the current generation sample is sometimes called a genealogy. Simple cases assume a population of constant size in time, haploidy, no population structure, and simulate the alleles of a single locus under no selection pressure.
Coalescent theory is used in many fields like selection detection, estimation of demographic parameters of real populations or disease gene mapping.
The strategy followed in the Biopython implementation of the coalescent was not to create a new, built-in, simulator from scratch but to use an existing one, SIMCOAL2 (http://cmpg.unibe.ch/software/simcoal2/). SIMCOAL2 allows for, among others, population structure, multiple demographic events, simulation of multiple types of loci (SNPs, sequences, STRs/microsatellites and RFLPs) with recombination, diploidy multiple chromosomes or ascertainment bias. Notably SIMCOAL2 doesn’t support any selection model. We recommend reading SIMCOAL2’s documentation, available in the link above.
The input for SIMCOAL2 is a file specifying the desired demography and genome, the output is a set of files (typically around 1000) with the simulated genomes of a sample of individuals per subpopulation. This set of files can be used in many ways, like to compute confidence intervals where which certain statistics (e.g., Fst or Tajima D) are expected to lie. Real population genetics datasets statistics can then be compared to those confidence intervals.
Biopython coalescent code allows to create demographic scenarios and genomes and to run SIMCOAL2.
Creating a scenario involves both creating a demography and a chromosome structure. In many cases (e.g. when doing Approximate Bayesian Computations – ABC) it is important to test many parameter variations (e.g. vary the effective population size, Ne, between 10, 50, 500 and 1000 individuals). The code provided allows for the simulation of scenarios with different demographic parameters very easily.
Below we see how we can create scenarios and then how simulate them.
A few predefined demographies are built-in, all have two shared parameters: sample size (called sample_size on the template, see below for its use) per deme and deme size, i.e. subpopulation size (pop_size). All demographies are available as templates where all parameters can be varied, each template has a system name. The prefedined demographies/templates are:
In our first example, we will generate a template for a single population, constant size model with a sample size of 30 and a deme size of 500. The code for this is:
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template generate_simcoal_from_template('simple', [(1, [('SNP', [24, 0.0005, 0.0])])], [('sample_size', [30]), ('pop_size', [100])])
Executing this code snippet will generate a file on the current directory called simple_100_300.par this file can be given as input to SIMCOAL2 to simulate the demography (below we will see how Biopython can take care of calling SIMCOAL2).
This code consists of a single function call, lets discuss it paramter by parameter.
The first parameter is the template id (from the list above). We are using the id ’simple’ which is the template for a single population of constant size along time.
The second parameter is the chromosome structure. Please ignore it for now, it will be explained in the next section.
The third parameter is a list of all required parameters (recall that the simple model only needs sample_size and pop_size) and possible values (in this case each parameter only has a possible value).
Now, lets consider an example where we want to generate several island models, and we are interested in varying the number of demes: 10, 50 and 100 with a migration rate of 1%. Sample size and deme size will be the same as before. Here is the code:
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template generate_simcoal_from_template('island', [(1, [('SNP', [24, 0.0005, 0.0])])], [('sample_size', [30]), ('pop_size', [100]), ('mig', [0.01]), ('total_demes', [10, 50, 100])])
In this case, 3 files will be generated: island_100_0.01_100_30.par, island_10_0.01_100_30.par and island_50_0.01_100_30.par. Notice the rule to make file names: template name, followed by parameter values in reverse order.
A few, arguably more esoteric template demographies exist (please check the Bio/PopGen/SimCoal/data directory on Biopython source tree). Furthermore it is possible for the user to create new templates. That functionality will be discussed in a future version of this document.
We strongly recommend reading SIMCOAL2 documentation to understand the full potential available in modeling chromosome structures. In this subsection we only discuss how to implement chromosome structures using the Biopython interface, not the underlying SIMCOAL2 capabilities.
We will start by implementing a single chromosome, with 24 SNPs with a recombination rate immediately on the right of each locus of 0.0005 and a minimum frequency of the minor allele of 0. This will be specified by the following list (to be passed as second parameter to the function generate_simcoal_from_template):
[(1, [('SNP', [24, 0.0005, 0.0])])]
This is actually the chromosome structure used in the above examples.
The chromosome structure is represented by a list of chromosomes, each chromosome (i.e., each element in the list) is composed by a tuple (a pair): the first element is the number of times the chromosome is to be repeated (as there might be interest in repeating the same chromosome many times). The second element is a list of the actual components of the chromosome. Each element is again a pair, the first member is the locus type and the second element the parameters for that locus type. Confused? Before showing more examples lets review the example above: We have a list with one element (thus one chromosome), the chromosome is a single instance (therefore not to be repeated), it is composed of 24 SNPs, with a recombination rate of 0.0005 between each consecutive SNP, the minimum frequency of the minor allele is 0.0 (i.e, it can be absent from a certain population).
Lets see a more complicated example:
[ (5, [ ('SNP', [24, 0.0005, 0.0]) ] ), (2, [ ('DNA', [10, 0.0, 0.00005, 0.33]), ('RFLP', [1, 0.0, 0.0001]), ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0]) ] ) ]
We start by having 5 chromosomes with the same structure as above (i.e., 24 SNPs). We then have 2 chromosomes which have a DNA sequence with 10 nucleotides, 0.0 recombination rate, 0.0005 mutation rate, and a transition rate of 0.33. Then we have an RFLP with 0.0 recombination rate to the next locus and a 0.0001 mutation rate. Finally we have a microsatellite (or STR), with 0.0 recombination rate to the next locus (note, that as this is a single microsatellite which has no loci following, this recombination rate here is irrelevant), with a mutation rate of 0.001, geometric paramater of 0.0 and a range constraint of 0.0 (for information about this parameters please consult the SIMCOAL2 documentation, you can use them to simulate various mutation models, including the typical – for microsatellites – stepwise mutation model among others).
We now discuss how to run SIMCOAL2 from inside Biopython. It is required that the binary for SIMCOAL2 is called simcoal2 (or simcoal2.exe on Windows based platforms), please note that the typical name when downloading the program is in the format simcoal2_x_y. As such renaming of the binary after download is needed.
It is possible to run SIMCOAL2 on files that were not generated using the method above (e.g., writing a parameter file by hand), but we will show an example by creating a model using the framework presented above.
from Bio.PopGen.SimCoal.Template import generate_simcoal_from_template from Bio.PopGen.SimCoal.Controller import SimCoalController generate_simcoal_from_template('simple', [ (5, [ ('SNP', [24, 0.0005, 0.0]) ] ), (2, [ ('DNA', [10, 0.0, 0.00005, 0.33]), ('RFLP', [1, 0.0, 0.0001]), ('MICROSAT', [1, 0.0, 0.001, 0.0, 0.0]) ] ) ], [('sample_size', [30]), ('pop_size', [100])]) ctrl = SimCoalController('.') ctrl.run_simcoal('simple_100_30.par', 50)
The lines of interest are the last two (plus the new import). Firstly a controller for the application is created. The directory where the binary is located has to be specified.
The simulator is then run on the last line: we know, from the rules explained above, that the input file name is simple_100_30.par for the simulation parameter file created. We then specify that we want to run 50 independent simulations, by default Biopython requests a simulation of diploid data, but a third parameter can be added to simulate haploid data (adding as a parameter the string ’0’). SIMCOAL2 will now run (please note that this can take quite a lot of time) and will create a directory with the simulation results. The results can now be analysed (typically studying the data with Arlequin3). In the future Biopython might support reading the Arlequin3 format and thus allowing for the analysis of SIMCOAL2 data inside Biopython.
Here we discuss interfaces and utilities to deal with population genetics’ applications which arguably have a smaller user base.
FDist is a selection detection application suite based on computing (i.e. simulating) a “neutral” confidence interval based on Fst and heterozygosity. Markers (which can be SNPs, microsatellites, AFLPs among others) which lie outside the “neutral” interval are to be considered as possible candidates for being under selection.
FDist is mainly used when the number of markers is considered enough to estimate an average Fst, but not enough to either have outliers calculated from the dataset directly or, with even more markers for which the relative positions in the genome are known, to use approaches based on, e.g., Extended Haplotype Heterozygosity (EHH).
The typical usage pattern for FDist is as follows:
We will now discuss each step with illustrating example code (for this example to work FDist binaries have to be on the executable PATH).
The FDist data format is application specific and is not used at all by other applications, as such you will probably have to convert your data for use with FDist. Biopython can help you do this. Here is an example converting from GenePop format to FDist format (along with imports that will be needed on examples further below):
from Bio.PopGen import GenePop from Bio.PopGen import FDist from Bio.PopGen.FDist import Controller from Bio.PopGen.FDist.Utils import convert_genepop_to_fdist gp_rec = GenePop.parse(open("example.gen")) fd_rec = convert_genepop_to_fdist(gp_rec) in_file = open("infile", "w") in_file.write(str(fd_rec)) in_file.close()
In this code we simply parse a GenePop file and convert it to a FDist record.
Printing an FDist record will generate a string that can be directly saved to a file and supplied to FDist. FDist requires the input file to be called infile, therefore we save the record on a file with that name.
The most important fields on a FDist record are: num_pops, the number of populations; num_loci, the number of loci and loci_data with the marker data itself. Most probably the details of the record are of no interest to the user, as the record only purpose is to be passed to FDist.
The next step is to calculate the average Fst of the dataset (along with the sample size):
ctrl = Controller.FDistController() fst, samp_size = ctrl.run_datacal()
On the first line we create an object to control the call of FDist suite, this object will be used further on in order to call other suite applications.
On the second line we call the datacal application which computes the average Fst and the sample size. It is worth noting that the Fst computed by datacal is a variation of Weir and Cockerham’s θ.
We can now call the main fdist application in order to simulate neutral markers.
sim_fst = ctrl.run_fdist(npops = 15, nsamples = fd_rec.num_pops, fst = fst, sample_size = samp_size, mut = 0, num_sims = 40000)
The confusion in wording between number of samples and sample size stems from the original application.
A file named out.dat will be created with the simulated heterozygosities and Fsts, it will have as many lines as the number of simulations requested.
Note that fdist returns the average Fst that it was capable of simulating, for more details about this issue please read below the paragraph on approximating the desired average Fst.
The next (optional) step is to calculate the confidence interval:
cpl_interval = ctrl.run_cplot(ci=0.99)
You can only call cplot after having run fdist.
This will calculate the confidence intervals (99% in this case) for a previous fdist run. A list of quadruples is returned. The first element represents the heterozygosity, the second the lower bound of Fst confidence interval for that heterozygosity, the third the average and the fourth the upper bound. This can be used to trace the confidence interval contour. This list is also written to a file, out.cpl.
The main purpose of this step is return a set of points which can be easily used to plot a confidence interval. It can be skipped if the objective is only to assess the status of each marker against the simulation, which is the next step...
pv_data = ctrl.run_pv()
You can only call cplot after having run datacal and fdist.
This will use the simulated markers to assess the status of each individual real marker. A list, in the same order than the loci_list that is on the FDist record (which is in the same order that the GenePop record) is returned. Each element in the list is a quadruple, the fundamental member of each quadruple is the last element (regarding the other elements, please refer to the pv documentation – for the sake of simplicity we will not discuss them here) which returns the probability of the simulated Fst being lower than the marker Fst. Higher values would indicate a stronger candidate for positive selection, lower values a candidate for balancing selection, and intermediate values a possible neutral marker. What is “higher”, “lower” or “intermediate” is really a subjective issue, but taking a “confidence interval” approach and considering a 95% confidence interval, “higher” would be between 0.95 and 1.0, “lower” between 0.0 and 0.05 and “intermediate” between 0.05 and 0.95.
Fdist tries to approximate the desired average Fst by doing a coalescent simulation using migration rates based on the formula
Nm = |
|
This formula assumes a few premises like an infinite number of populations.
In practice, when the number of populations is low, the mutation model is stepwise and the sample size increases, fdist will not be able to simulate an acceptable approximate average Fst.
To address that, a function is provided to iteratively approach the desired value by running several fdists in sequence. This approach is computationally more intensive than running a single fdist run, but yields good results. The following code runs fdist approximating the desired Fst:
sim_fst = ctrl.run_fdist_force_fst(npops = 15, nsamples = fd_rec.num_pops, fst = fst, sample_size = samp_size, mut = 0, num_sims = 40000, limit = 0.05)
The only new optional parameter, when comparing with run_fdist, is limit which is the desired maximum error. run_fdist can (and probably should) be safely replaced with run_fdist_force_fst.
The process to determine the average Fst can be more sophisticated than the one presented here. For more information we refer you to the FDist README file. Biopython’s code can be used to implement more sophisticated approaches.
The most desired future developments would be the ones you add yourself ;) .
That being said, already existing fully functional code is currently being incorporated in Bio.PopGen, that code covers the applications FDist and SimCoal2, the HapMap and UCSC Table Browser databases and some simple statistics like Fst, or allele counts.
The Bio.InterPro
module works with files from the
InterPro database, which can be obtained from the InterPro project:
http://www.ebi.ac.uk/interpro/.
The Bio.InterPro.Record
contains all the information stored in
an InterPro record. Its string representation also is a valid InterPro
record, but it is NOT guaranteed to be equivalent to the record
from which it was produced.
The Bio.InterPro.Record
contains:
Database
Accession
Name
Dates
Type
Parent
Process
Function
Component
Signatures
Abstract
Examples
References
Database links