In this chapter we’ll discuss the Bio.AlignIO
module, which is very similar to the Bio.SeqIO
module from the previous chapter, but deals with Alignment
objects rather than SeqRecord
objects.
This is new interface in Biopython 1.46, which aims to provide a simple interface for working with assorted sequence alignment file formats in a uniform way.
Note that both Bio.SeqIO
and Bio.AlignIO
can read and write sequence alignment files. The appropriate choice will depend largely on what you want to do with the data.
We have two functions for reading in sequence alignments, Bio.AlignIO.read()
and Bio.AlignIO.parse()
which following the convention introduced in Bio.SeqIO
are for files containing one or multiple alignments respectively.
Using Bio.AlignIO.parse()
will return an iterator which gives Alignment
objects. Iterators are typically used in a for loop. Examples of situations where you will have multiple different alignments include resampled alignments from the PHYLIP tool seqboot
, or multiple pairwise alignments from the EMBOSS tools water
or needle
, or Bill Pearson’s FASTA tools.
However, in many situations you will be dealing with files which contain only a single alignment. In this case, you should use the Bio.AlignIO.read()
function which return a single Alignment
object.
Both functions expect two mandatory arguments:
Bio.SeqIO
we don’t try and guess the file format for you! See http://biopython.org/wiki/AlignIO for a full listing of supported formats.
There is also an optional seq_count
argument which is discussed in Section 5.1.3 below for dealing with ambigous file formats which may contain more than one alignment.
As an example, consider the following annotation rich protein alignment in the PFAM or Stockholm file format:
# STOCKHOLM 1.0 #=GS COATB_BPIKE/30-81 AC P03620.1 #=GS COATB_BPIKE/30-81 DR PDB; 1ifl ; 1-52; #=GS Q9T0Q8_BPIKE/1-52 AC Q9T0Q8.1 #=GS COATB_BPI22/32-83 AC P15416.1 #=GS COATB_BPM13/24-72 AC P69541.1 #=GS COATB_BPM13/24-72 DR PDB; 2cpb ; 1-49; #=GS COATB_BPM13/24-72 DR PDB; 2cps ; 1-49; #=GS COATB_BPZJ2/1-49 AC P03618.1 #=GS Q9T0Q9_BPFD/1-49 AC Q9T0Q9.1 #=GS Q9T0Q9_BPFD/1-49 DR PDB; 1nh4 A; 1-49; #=GS COATB_BPIF1/22-73 AC P03619.2 #=GS COATB_BPIF1/22-73 DR PDB; 1ifk ; 1-50; COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA #=GR COATB_BPIKE/30-81 SS -HHHHHHHHHHHHHH--HHHHHHHH--HHHHHHHHHHHHHHHHHHHHH---- Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA #=GR COATB_BPM13/24-72 SS ---S-T...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT-- COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA #=GR Q9T0Q9_BPFD/1-49 SS ------...-HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH-- COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA #=GR COATB_BPIF1/22-73 SS XX-HHHH--HHHHHH--HHHHHHH--HHHHHHHHHHHHHHHHHHHHHHH--- #=GC SS_cons XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC-- #=GC seq_cons AEssss...AptAhDSLpspAT-hIu.sWshVsslVsAsluIKLFKKFsSKA //
This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM entry, downloaded as a compressed archive from http://pfam.sanger.ac.uk/family/alignment/download/gzipped?acc=PF05371&alnType=seed. We can load this file as follows (assuming it has been saved to disk as “PF05371_seed.sth” in the current working directory):
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm") print alignment
This code will print out a summary of the alignment:
SingleLetterAlphabet() alignment with 7 rows and 52 columns AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73
You’ll notice in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as SeqRecord
objects:
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm") print "Alignment length %i" % alignment.get_alignment_length() for record in alignment : print "%s - %s" % (record.seq, record.id)
This will give the following output:
Alignment length 52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73
Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary strucutre? Try this:
for record in alignment : if record.dbxrefs : print record.id, record.dbxrefs
giving:
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;'] COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;'] Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;'] COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']
To have a look at all the sequence annotation, try this:
for record in alignment : print record
Sanger provide a nice web interface at http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you download this alignment in several other formats. This is what the file looks like in the FASTA file format:
>COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA >Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA >COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA >COATB_BPM13/24-72 AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA >COATB_BPZJ2/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA >Q9T0Q9_BPFD/1-49 AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA >COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
Assuming you download and save this as file “PF05371_seed.faa” then you can load it with almost exactly the same code:
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.faa"), "fasta") print alignment
All that has changed in this code is the filename and the format string. You’ll get the same output as before, the sequences and record identifiers are the same.
However, as you should expect, if you check each SeqRecord
there is no annotation nor database cross-references because these are not included in the FASTA file format.
Note that rather than using the Sanger website, you could have used Bio.AlignIO
to convert the original Stockholm format file into a FASTA file yourself (see below).
With any supported file format, you can load an alignment in exactly the same way just by changing the format string. For example, use “phylip” for PHYLIP files, “nexus” for NEXUS files or “emboss” for the alignments output by the EMBOSS tools. There is a full listing on the wiki page (http://biopython.org/wiki/AlignIO).
The previous section focused on reading files containing a single alignment. In general however, files can contain more than one alignment, and to read these files we must use the Bio.AlignIO.parse()
function.
Suppose you have a small alignment in PHYLIP format:
5 6 Alpha AACAAC Beta AACCCC Gamma ACCAAC Delta CCACCA Epsilon CCAAAC
If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, one of the steps would be to create a set of many resampled alignments using the tool bootseq
. This would give output something like this, which has been abbreviated for conciseness:
5 6 Alpha AAACCA Beta AAACCC Gamma ACCCCA Delta CCCAAC Epsilon CCCAAA 5 6 Alpha AAACAA Beta AAACCC Gamma ACCCAA Delta CCCACC Epsilon CCCAAA 5 6 Alpha AAAAAC Beta AAACCC Gamma AACAAC Delta CCCCCA Epsilon CCCAAC ... 5 6 Alpha AAAACC Beta ACCCCC Gamma AAAACC Delta CCCCAA Epsilon CAAACC
If you wanted to read this in using Bio.AlignIO
you could use:
from Bio import AlignIO alignments = AlignIO.parse(open("resampled.phy"), "phylip") for alignment in alignments : print alignment print
This would give the following output, again abbreviated for display:
SingleLetterAlphabet() alignment with 5 rows and 6 columns AAACCA Alpha AAACCC Beta ACCCCA Gamma CCCAAC Delta CCCAAA Epsilon SingleLetterAlphabet() alignment with 5 rows and 6 columns AAACAA Alpha AAACCC Beta ACCCAA Gamma CCCACC Delta CCCAAA Epsilon SingleLetterAlphabet() alignment with 5 rows and 6 columns AAAAAC Alpha AAACCC Beta AACAAC Gamma CCCCCA Delta CCCAAC Epsilon ... SingleLetterAlphabet() alignment with 5 rows and 6 columns AAAACC Alpha ACCCCC Beta AAAACC Gamma CCCCAA Delta CAAACC Epsilon
As with the function Bio.SeqIO.parse()
, using Bio.AlignIO.parse()
returns an iterator.
If you want to keep all the alignments in memory at once, which will allow you to access them in any order, then turn the iterator into a list:
from Bio import AlignIO alignments = list(AlignIO.parse(open("resampled.phy"), "phylip")) last_align = alignments[-1] first_align = alignments[0]
Many alignment file formats can explicitly store more than one alignment, and the division between each alignment is clear. However, when a general sequence file format has been used there is no such block structure. The most common such situation is when alignments have been saved in the FASTA file format. For example consider the following:
>Alpha ACTACGACTAGCTCAG--G >Beta ACTACCGCTAGCTCAGAAG >Gamma ACTACGGCTAGCACAGAAG >Alpha ACTACGACTAGCTCAGG-- >Beta ACTACCGCTAGCTCAGAAG >Gamma ACTACGGCTAGCACAGAAG
This could be a single alignment containing six sequences (with repeated identifiers). Or, judging from the identifiers, this is probably two different alignments each with three sequences, which happen to all have the same length.
What about this next example?
>Alpha ACTACGACTAGCTCAG--G >Beta ACTACCGCTAGCTCAGAAG >Alpha ACTACGACTAGCTCAGG-- >Gamma ACTACGGCTAGCACAGAAG >Alpha ACTACGACTAGCTCAGG-- >Delta ACTACGGCTAGCACAGAAG
Again, this could be a single alignment with six sequences. However this time based on the identifiers we might guess this is three pairwise alignments which by chance have all got the same lengths.
This final example is similar:
>Alpha ACTACGACTAGCTCAG--G >XXX ACTACCGCTAGCTCAGAAG >Alpha ACTACGACTAGCTCAGG >YYY ACTACGGCAAGCACAGG >Alpha --ACTACGAC--TAGCTCAGG >ZZZ GGACTACGACAATAGCTCAGG
In this third example, because of the differing lengths, this cannot be treated as a single alignment containing all six records. However, it could be three pairwise alignments.
Clearly trying to store more than one alignment in a FASTA file is not ideal. However, if you are forced to deal with these as input files Bio.AlignIO
can cope with the most common situation where all the alignments have the same number of records.
One example of this is a collection of pairwise alignments, which can be produced by the EMBOSS tools needle
and water
– although in this situation, Bio.AlignIO
should be able to understand their native output using “emboss” as the format string.
To interpret these FASTA examples as several separate alignments, we can use Bio.AlignIO.parse()
with the optional seq_count
argument which specifies how many sequences are expected in each alignment (in these examples, 3, 2 and 2 respectively).
For example, using the third example as the input data:
for alignment in AlignIO.parse(handle, "fasta", seq_count=2) : print "Alignment length %i" % alignment.get_alignment_length() for record in alignment : print "%s - %s" % (record.seq, record.id) print
giving:
Alignment length 19 ACTACGACTAGCTCAG--G - Alpha ACTACCGCTAGCTCAGAAG - XXX Alignment length 17 ACTACGACTAGCTCAGG - Alpha ACTACGGCAAGCACAGG - YYY Alignment length 21 --ACTACGAC--TAGCTCAGG - Alpha GGACTACGACAATAGCTCAGG - ZZZ
Using Bio.AlignIO.read()
or Bio.AlignIO.parse()
without the seq_count
argument would give a single alignment containing all six records for the first two examples. For the third example, an exception would be raised because the lengths differ preventing them being turned into a single alignment.
If the file format itself has a block structure allowing Bio.AlignIO
to determine the number of sequences in each alignment directly, then the seq_count
argument is not needed. If it is supplied, and doesn’t agree with the file contents, an error is raised.
Note that this optional seq_count
argument assumes each alignment in the file has the same number of sequences. Hypothetically you may come across stranger situations, for example a FASTA file containing several alignments each with a different number of sequences – although I would love to hear of a real world example of this. Assuming you cannot get the data in a nicer file format, there is no straight forward way to deal with this using Bio.AlignIO
. In this case, you could consider reading in the sequences themselves using Bio.SeqIO
and batching them together to create the alignments as appropriate.
We’ve talked about using Bio.AlignIO.read()
and Bio.AlignIO.parse()
for alignment input (reading files), and now we’ll look at Bio.AlignIO.write()
which is for alignment output (writing files). This is a function taking three arguments: some Alignment
objects, a handle to write to, and a sequence format.
Here is an example, where we start by creating a few Alignment
objects the hard way (by hand, rather than by loading them from a file):
from Bio.Align.Generic import Alignment from Bio.Alphabet import IUPAC, Gapped alphabet = Gapped(IUPAC.unambiguous_dna) align1 = Alignment(alphabet) align1.add_sequence("Alpha", "ACTGCTAGCTAG") align1.add_sequence("Beta", "ACT-CTAGCTAG") align1.add_sequence("Gamma", "ACTGCTAGDTAG") align2 = Alignment(alphabet) align2.add_sequence("Delta", "GTCAGC-AG") align2.add_sequence("Epislon","GACAGCTAG") align2.add_sequence("Zeta", "GTCAGCTAG") align3 = Alignment(alphabet) align3.add_sequence("Eta", "ACTAGTACAGCTG") align3.add_sequence("Theta", "ACTAGTACAGCT-") align3.add_sequence("Iota", "-CTACTACAGGTG") my_alignments = [align1, align2, align3]
Now we have a list of Alignment
objects, we’ll write them to a PHYLIP format file:
from Bio import AlignIO handle = open("my_example.phy", "w") SeqIO.write(my_alignments, handle, "phylip") handle.close()
And if you open this file in your favourite text editor it should look like this:
3 12 Alpha ACTGCTAGCT AG Beta ACT-CTAGCT AG Gamma ACTGCTAGDT AG 3 9 Delta GTCAGC-AG Epislon GACAGCTAG Zeta GTCAGCTAG 3 13 Eta ACTAGTACAG CTG Theta ACTAGTACAG CT- Iota -CTACTACAG GTG
Its more common to want to load an existing alignment, and save that, perhaps after some simple manipulation like removing certain rows or columns.
Converting between sequence alignment file formats with Bio.AlignIO
works in the same way as converting between sequence file formats with Bio.SeqIO
– we load generally the alignment(s) using Bio.AlignIO.parse()
and then save them using the Bio.AlignIO.write()
.
For this example, we’ll load the PFAM/Stockholm format file used earlier and save it as a Clustal W format file:
from Bio import AlignIO alignments = AlignIO.parse(open("PF05371_seed.sth"), "stockholm") handle = open("PF05371_seed.aln","w") AlignIO.write(alignments, handle, "clustal") handle.close()
The Bio.AlignIO.write()
function expects to be given multiple alignment objects. In the example above we gave it the alignment iterator returned by Bio.AlignIO.parse()
.
In this case, we know there is only one alignment in the file so we could instead have used Bio.AlignIO.read()
but notice we have to pass this alignment to Bio.AlignIO.write()
as a single element list:
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm") handle = open("PF05371_seed.aln","w") AlignIO.write([alignment], handle, "clustal") handle.close()
Either way, you should end up with the same new Clustal W format file “PF05371_seed.aln"” with the following content:
CLUSTAL X (1.81) multiple sequence alignment COATB_BPIKE/30-81 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS Q9T0Q8_BPIKE/1-52 AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS COATB_BPI22/32-83 DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS COATB_BPM13/24-72 AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPZJ2/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS Q9T0Q9_BPFD/1-49 AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS COATB_BPIF1/22-73 FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS COATB_BPIKE/30-81 KA Q9T0Q8_BPIKE/1-52 RA COATB_BPI22/32-83 KA COATB_BPM13/24-72 KA COATB_BPZJ2/1-49 KA Q9T0Q9_BPFD/1-49 KA COATB_BPIF1/22-73 RA
Alternatively, you could make a PHYLIP format file which we’ll name “PF05371_seed.phy”:
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm") handle = open("PF05371_seed.phy","w") AlignIO.write([alignment], handle, "phylip") handle.close()
This time the output looks like this:
7 52 COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS KA RA KA KA KA KA RA
One of the big handicaps of the PHYLIP alignment file format is that the sequence identifiers are strictly truncated at ten characters. In this example, as you can see the resulting names are still unique - but they are not very readable. In this particular case, there is no clear way to compress the identifers, but for the sake of argument you may want to assign your own names or numbering system. This following bit of code manipulates the record identifiers before saving the output:
from Bio import AlignIO alignment = AlignIO.read(open("PF05371_seed.sth"), "stockholm") name_mapping = {} for i, record in enumerate(alignment) : name_mapping[i] = record.id record.id = "seq%i" % i print name_mapping handle = open("PF05371_seed.phy","w") AlignIO.write([alignment], handle, "phylip") handle.close()
This code using a python dictionary to record a simple mapping from the new sequence system to the original identifier:
{0: 'COATB_BPIKE/30-81', 1: 'Q9T0Q8_BPIKE/1-52', 2: 'COATB_BPI22/32-83', ...}
Here is the new PHYLIP format output:
7 52 seq0 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS seq1 AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS seq2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS seq3 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS seq4 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS seq5 AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS seq6 FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS KA RA KA KA KA KA RA
In general, because of the identifier limitation, working with PHYLIP file formats shouldn’t be your first choice. Using the PFAM/Stockholm format on the other hand allows you to record a lot of additional annotation too.