In [ ]:
from IPython.core.display import HTML

def _set_css_style(css_file_path):
   """
   Read the custom CSS file and load it into Jupyter.
   Pass the file path to the CSS file.
   """

   styles = open(css_file_path, "r").read()
   s = '<style>%s</style>' % styles     
   return HTML(s)

_set_css_style('rise.css')

Biopython and sequence analysis continued¶

print view
notebook

  • List comprehensions
  • Phylogenetic trees
  • Sequence motifs

A bit more python... list comprehensions¶

A concise way to create lists

[ <expr of var> for <var> in <iterable> if <condition> ]

In [ ]:
[x for x in range(10) if x % 2 == 0]

Two ways to do the same thing¶

In [ ]:
squares = []
for x in range(10):
    squares.append(x**2)
squares
In [ ]:
squares = [x**2 for x in range(10)]
squares

Other comprehensions¶

In [ ]:
list(enumerate('ABCD')) # enumerate returns tuples of index,value
In [ ]:
{key: val for key, val in enumerate('ABCD') if val not in 'CB'}
In [ ]:
{v for v in 'ABCDABCD' if v not in 'CB'}

Should you use comprehensions?¶

Sure, if the result is short and easy to understand

Compound expressions can appear complicated, however

result = [line.strip().split('\t') for line in open('file') if not line.startswith('#')]

Back to Biopython...¶

In [ ]:
from Bio import AlignIO
a = AlignIO.read('../files/hydra179.aln','clustal')
In [ ]:
len(a)
In [ ]:
len(a[0]),a.get_alignment_length()
In [ ]:
a

Phylogenetic Trees¶

A phylogenetic tree or evolutionary tree is a branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. --Wikipedia

Biopython can read a variety of tree formats: Newick (clustal), NEXUS, phyloXML, NeXML, and CDAO

In [ ]:
from Bio import Phylo
tree = Phylo.read('../files/hydra179.dnd','newick') #must specify format
tree

Displaying trees¶

In [ ]:
Phylo.draw_ascii(tree)

Displaying trees¶

Phylo can draw trees using matplot lib (e.g., can use savefig etc)

In [ ]:
%matplotlib inline
Phylo.draw(tree)
In [ ]:
Phylo.draw(tree,label_func=lambda x: None)

Motifs¶

Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. Often they indicate sequence-specific binding sites for proteins such as nucleases and transcription factors (TF).

Motif logos¶

In [ ]:
from Bio import motifs  # lower case for some reason
m = motifs.create(["TACAA","CATGC","TACTA","CCCAA"])
In [ ]:
m.counts
In [ ]:
m.consensus

Motif logos¶

Biopython uses weblogo

In [ ]:
m.weblogo('logo.png', alphabet='alphabet_dna', stack_width='large')
In [ ]:
from IPython.display import Image
Image(filename='./logo.png')

Reading motifs¶

Biopython supports a number of motif formats: JASPAR, MEME, TRANSFAC

These formats are associated with databases (JASPAR, TRANSFAC) and tools (MEME)

Of particular interest are sequence motifs for transcription factor binding

JASPAR

Reading motifs¶

In [ ]:
f = open('../files/MA0004.1.sites') # unlike other parts of Biopython, can't just provide filename to open
arnt = motifs.read(f,'sites') # JASPAR sites
In [ ]:
arnt
In [ ]:
arnt.consensus
In [ ]:
print(arnt.counts)
In [ ]:
arnt.alignment.sequences

Scoring matrices¶

The counts attribute can be normalized to represent probabilities at each position

In [ ]:
print(arnt.counts.normalize())

A pseudocount is often added at each position to prevent probability from going to zero.

In [ ]:
print(arnt.counts.normalize(pseudocounts=0.8))

PSSM¶

The position-specific scoring matrix is the position weight matrix (with pseudocounts) expressed as a log (base 2) odds ratios

In [ ]:
pwm = arnt.counts.normalize(pseudocounts=0.8)
pssm = pwm.log_odds()
print(pssm)

A negative value means a nucleotide is less likely than the background at a specific position.

By default a uniform background is assumed, but this can be changed with the background parameter of log_odds.

Searching for motifs¶

In [ ]:
from Bio import SeqIO
from Bio.Seq import Seq
largeseq = SeqIO.read('../files/bnip3.fasta','fasta') # load with same alphabet as motif
smallseq = Seq('AAACCCACGTGACTATATA')

We can search for matches using PSSM

In [ ]:
pwm = arnt.counts.normalize(pseudocounts=0.8)
pssm = pwm.log_odds()
positions = [pos for pos, seq in pssm.search(largeseq.seq)]
len(positions)

Searching for motifs¶

In [ ]:
results = [(pos, score) for pos, score in pssm.search(largeseq.seq, threshold=4)]
len(results)

The score is a $\log_2$ likelihood so a score of 4 is $2^4=16$ times more likely to occur as part of the motif than as part of the (uniform) background

In [ ]:
results[0]

Searching for motifs¶

Positions may be negative if the motif was found on the reverse strand.

In [ ]:
results[:2]
In [ ]:
pos = results[1][0] # -13823
hit = largeseq.seq[pos:pos+len(arnt)]  # negative indices can still be used to retrieve matched subsequence
In [ ]:
print(pos,len(largeseq)+pos)
print(hit, hit.reverse_complement())
In [ ]:
print(arnt.counts)

Some more marine biology¶

No description has been provided for this image

Your Herculean Task¶

Get input files of 179 hydra sequences

https://MSCBIO2025-2025.github.io/files/hydra179.aln
https://MSCBIO2025-2025.github.io/files/hydra179.dnd

  1. Display the phylogenetic tree from the clustal alignment (hydra179.dnd)
  2. Identify the subsequence of length 20 that has the most variation, measured by the number of unique subsequences (like last time)
  3. Use clustal to compute the multiple alignment of these 179 length 20 subsequences and display the phylogenetic tree

For next time¶

Protein structure with Prof. Mert Gur