from IPython.core.display import HTML
def _set_css_style(css_file_path):
"""
Read the custom CSS file and load it into Jupyter.
Pass the file path to the CSS file.
"""
styles = open(css_file_path, "r").read()
s = '<style>%s</style>' % styles
return HTML(s)
_set_css_style('rise.css')
Biopython and sequence analysis continued¶
- List comprehensions
- Phylogenetic trees
- Sequence motifs
A bit more python... list comprehensions¶
A concise way to create lists
[ <expr of var> for <var> in <iterable> if <condition> ]
[x for x in range(10) if x % 2 == 0]
Two ways to do the same thing¶
squares = []
for x in range(10):
squares.append(x**2)
squares
squares = [x**2 for x in range(10)]
squares
Other comprehensions¶
list(enumerate('ABCD')) # enumerate returns tuples of index,value
{key: val for key, val in enumerate('ABCD') if val not in 'CB'}
{v for v in 'ABCDABCD' if v not in 'CB'}
Should you use comprehensions?¶
Sure, if the result is short and easy to understand
Compound expressions can appear complicated, however
result = [line.strip().split('\t') for line in open('file') if not line.startswith('#')]
Back to Biopython...¶
from Bio import AlignIO
a = AlignIO.read('../files/hydra179.aln','clustal')
len(a)
len(a[0]),a.get_alignment_length()
a
Phylogenetic Trees¶
A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. --Wikipedia
Biopython can read a variety of tree formats: Newick (clustal), NEXUS, phyloXML, NeXML, and CDAO
from Bio import Phylo
tree = Phylo.read('../files/hydra179.dnd','newick') #must specify format
tree
Displaying trees¶
Phylo.draw_ascii(tree)
Displaying trees¶
Phylo can draw trees using matplot lib (e.g., can use savefig etc)
%matplotlib inline
Phylo.draw(tree)
Phylo.draw(tree,label_func=lambda x: None)
Motifs¶
Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF).
Motif logos¶
from Bio import motifs # lower case for some reason
m = motifs.create(["TACAA","CATGC","TACTA","CCCAA"])
m.counts
m.consensus
m.weblogo('logo.png', alphabet='alphabet_dna', stack_width='large')
from IPython.display import Image
Image(filename='./logo.png')
Reading motifs¶
f = open('../files/MA0004.1.sites') # unlike other parts of Biopython, can't just provide filename to open
arnt = motifs.read(f,'sites') # JASPAR sites
arnt
arnt.consensus
print(arnt.counts)
arnt.alignment.sequences
Scoring matrices¶
The counts attribute can be normalized to represent probabilities at each position
print(arnt.counts.normalize())
A pseudocount is often added at each position to prevent probability from going to zero.
print(arnt.counts.normalize(pseudocounts=0.8))
PSSM¶
The position-specific scoring matrix is the position weight matrix (with pseudocounts) expressed as a log (base 2) odds ratios
pwm = arnt.counts.normalize(pseudocounts=0.8)
pssm = pwm.log_odds()
print(pssm)
A negative value means a nucleotide is less likely than the background at a specific position.
By default a uniform background is assumed, but this can be changed with the background
parameter of log_odds
.
Searching for motifs¶
from Bio import SeqIO
from Bio.Seq import Seq
largeseq = SeqIO.read('../files/bnip3.fasta','fasta') # load with same alphabet as motif
smallseq = Seq('AAACCCACGTGACTATATA')
We can search for matches using PSSM
pwm = arnt.counts.normalize(pseudocounts=0.8)
pssm = pwm.log_odds()
positions = [pos for pos, seq in pssm.search(largeseq.seq)]
len(positions)
Searching for motifs¶
results = [(pos, score) for pos, score in pssm.search(largeseq.seq, threshold=4)]
len(results)
The score is a $\log_2$ likelihood so a score of 4 is $2^4=16$ times more likely to occur as part of the motif than as part of the (uniform) background
results[0]
Searching for motifs¶
Positions may be negative if the motif was found on the reverse strand.
results[:2]
pos = results[1][0] # -13823
hit = largeseq.seq[pos:pos+len(arnt)] # negative indices can still be used to retrieve matched subsequence
print(pos,len(largeseq)+pos)
print(hit, hit.reverse_complement())
print(arnt.counts)
Some more marine biology¶

Your Herculean Task¶
Get input files of 179 hydra sequences
- Display the phylogenetic tree from the clustal alignment (hydra179.dnd)
- Identify the subsequence of length 20 that has the most variation, measured by the number of unique subsequences (like last time)
- Use clustal to compute the multiple alignment of these 179 length 20 subsequences and display the phylogenetic tree
For next time¶
Protein structure with Prof. Mert Gur