Prediction of MicroRNA Recognition Sites
BIOL 265/COMP 113
Michael Weir and Danny Krizanc
Today we are going to carry out a genome-scale analysis to discover special properties of a subset of the screened genes. In particular, we will use sequence properties to predict genes that are targets of microRNAs.
Small RNAs are increasingly being found to play a significant role in regulating gene expression. One mechanism involves microRNAs that bind through base pairing to the 3-prime UTRs of target mRNAs. This can down-regulate translation of the targeted mRNA and can also cause the mRNA to be degraded.
An example of this process is the C. elegans let-7 microRNA, which targets the lin-41 mRNA. This is illustrated here. [See link for discussion.]
Let us screen the C. elegans genome for possible target genes of let-7. But first, let us work out how we will identify putative targets.
The figure illustrates an example of the let-7 microRNA binding to a target. We will use the following rules for possible base pairing:
A can pair with U
C can pair with G
G can pair with C or U
U can pair with A or G
These rules can be used to build a model of the possible target binding sites of let-7. Let us assume that the let-7 microRNA base pairs according to the above rules at exactly the 17 nucleotide positions illustrated in part 2 of the figure.
It is probably easiest to express this model as a regular expression (RE). [You will be able to use this RE in your Python code.]
To build your model, you may wish to draw the let-7 sequence, then on the next line the corresponding lin-41 sequence that is bound, and then on the next line alternative nucleotides, if appropriate.
let-7 sequence   3'- GAUAUGUUGG . . . GAUGGAG
lin-41           5'-
Alternative nt
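One way to turn the worksheet above into an RE is sketched below. It is only an illustration under stated assumptions: the names PAIRS, target_class and build_pattern are hypothetical, the RE is written for DNA (T instead of U) because the sequence files are DNA, and the unpaired gap is assumed to be exactly 3 nt in the target. That gap length is a guess that happens to fit the 20-nt match shown in the sample output in this handout; check it against part 2 of the figure.

```python
import re

# Base-pairing rules from the handout (microRNA base -> allowed target bases, RNA alphabet).
PAIRS = {"A": "U", "C": "G", "G": "CU", "U": "AG"}

def target_class(mirna_base):
    """Return the RE fragment (DNA alphabet) for the bases that can pair with mirna_base."""
    partners = PAIRS[mirna_base].replace("U", "T")  # sequence files are DNA
    return partners if len(partners) == 1 else "[" + partners + "]"

def build_pattern(left, right, gap):
    """Build an RE string from two paired let-7 segments separated by an unpaired gap.

    left and right are written 3' to 5', so they line up directly with the
    target written 5' to 3'.
    """
    left_re = "".join(target_class(b) for b in left)
    right_re = "".join(target_class(b) for b in right)
    return left_re + ".{%d}" % gap + right_re

# Paired let-7 segments as drawn in the handout; gap=3 is an assumption.
pattern_text = build_pattern("GAUAUGUUGG", "GAUGGAG", gap=3)
let7_site = re.compile(pattern_text, re.IGNORECASE)  # sequence files may be lower case

print(pattern_text)
```

Building the RE from the pairing table, rather than typing it by hand, makes it easy to change the gap or to try a different microRNA.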
How many different target binding sequences can let-7 bind to according to your model?
[8192]
This is sometimes described as the degeneracy of the binding event.
There are 13 positions in let-7 that theoretically can base pair with two different bases. Four positions can only bind one base. The number of possible target sequences is therefore 2^13 * 1^4 = 8192.
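The position counts above can be checked with a few lines of Python. This is a minimal sketch; the variable names are hypothetical, and it counts only the 17 paired let-7 positions shown in the worksheet (G and U each accept two partners, A and C accept one).

```python
# The 17 paired let-7 positions from the worksheet (the unpaired gap is left out).
LET7_PAIRED = "GAUAUGUUGG" + "GAUGGAG"

# G pairs with C or U, and U pairs with A or G; A and C each have a single partner.
two_way = sum(base in "GU" for base in LET7_PAIRED)
one_way = len(LET7_PAIRED) - two_way

# Each two-way position doubles the number of possible target sequences.
degeneracy = 2 ** two_way

print(two_way, one_way, degeneracy)  # 13 4 8192
```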
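Hence, we can also estimate the probability of finding a binding site according to our model. In random sequence with equal base frequencies, a position that accepts one base matches with probability 1/4, and a position that accepts two bases matches with probability 1/2:
p = (1/4)^4 * (1/2)^13
This is 1 / 2097152
So we might expect to see a let-7 site on average about every 2 million nucleotides (if we were screening random sequence).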
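The arithmetic can be reproduced exactly with Python's fractions module. This is a rough sketch only: p_site and expected_sites are hypothetical names, and the estimate assumes random sequence with equal base frequencies and ignores edge effects at the ends of each sequence.

```python
from fractions import Fraction

N_SINGLE = 4   # positions that accept one base (from the handout)
N_DOUBLE = 13  # positions that accept two bases (from the handout)

# Probability that a given start position in random sequence matches the model.
p_site = Fraction(1, 4) ** N_SINGLE * Fraction(1, 2) ** N_DOUBLE

def expected_sites(n_positions):
    """Expected number of matches among n_positions start positions."""
    return p_site * n_positions

print(p_site)                            # 1/2097152
print(float(expected_sites(2_097_152)))  # 1.0
```

Calling expected_sites with the total number of nucleotides you actually screen gives a baseline to compare against the number of sites you find.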
Now that we have an RE model, let us write Python code that will read sequences from a file and test for occurrences of possible let-7 binding sites according to this model.
Because let-7 regulates translation by binding to 3-prime UTRs, we will screen 1000 nt of genomic sequence downstream of each gene's ORF (why are we screening genomic sequence?). These are available here in FASTA format. [You may need to right-click to download the files.] [Note that the sequence lines have been concatenated so that each gene has a single line of sequence.]
Write code that reads each line of the FASTA file and outputs (prints) all matches to the RE, including the FASTA header line, nucleotide position and matching sequence.
Example output:
['>WBGene00011936|T22H2.6', [353, 'ttgtacggcttgcctacctc']]
Note that the FASTA sequences are DNA sequences (T instead of U). You may wish to use the short test file c_el_test1.txt for testing your code.
You may want to write a Python function to test your RE. After importing the Python re module and compiling your RE, you could write a for loop that processes each line from the FASTA file and tests whether the line is a FASTA header or a sequence line; if it is a sequence, you might use a for loop to screen the sequence for matches to the RE.
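The loop structure described above might look like the following sketch. It is one possible arrangement, not the expected solution: scan_fasta and LET7_SITE are hypothetical names, the RE assumes a 3-nt unpaired gap (adjust it to your own model), and positions are reported as 0-based Python indices, which may or may not match the convention used in the sample output.

```python
import re

# Assumed RE for the let-7 site in DNA; the 3-nt gap (.{3}) is an assumption.
LET7_SITE = re.compile(
    r"[CT]T[AG]T[AG][CT][AG][AG][CT][CT].{3}[CT]T[AG][CT][CT]T[CT]",
    re.IGNORECASE,
)

def scan_fasta(lines, pattern=LET7_SITE):
    """Return [header, [position, matched_sequence]] for every match in the file."""
    hits = []
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):  # FASTA header line
            header = line
            continue
        # Sequence line: test the RE at every start position so that
        # overlapping sites are not missed.
        for i in range(len(line)):
            m = pattern.match(line, i)
            if m:
                hits.append([header, [i, m.group()]])
    return hits

# Small in-memory demonstration with a made-up header and sequence.
demo = [">geneA|hypothetical", "aaaa" + "ttgtacggcttgcctacctc" + "gggg"]
for hit in scan_fasta(demo):
    print(hit)  # ['>geneA|hypothetical', [4, 'ttgtacggcttgcctacctc']]

# Usage with a file, e.g. the test file named in the handout:
# with open("c_el_test1.txt") as handle:
#     for hit in scan_fasta(handle):
#         print(hit)
```

Because the function accepts any iterable of lines, the same code works on an open file or on a short list of strings used for testing.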
Assignment
Screen all six chromosomes of C. elegans; this corresponds to approximately 22,000 genes. [You could write code to count the exact number of genes you screen.]
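Since each gene has one header line in the files, counting genes can be as simple as counting lines that start with ">". A minimal sketch, with count_genes as a hypothetical name and a made-up example input:

```python
def count_genes(lines):
    """Count FASTA records, assuming one header line (starting with '>') per gene."""
    return sum(1 for line in lines if line.startswith(">"))

# Made-up example with two records.
example = [">gene1|hypothetical", "acgtacgt", ">gene2|hypothetical", "ttgacc"]
print(count_genes(example))  # 2

# Usage with a file (file name is a placeholder):
# with open("chromosome_file.txt") as handle:
#     print(count_genes(handle))
```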
1. How many putative let-7 sites do you discover?
2. Is this number in the expected order of magnitude?
3. Is lin-41 among the predicted candidates? (This is a useful test of your Python code.)
4. Of the putative sites that you discover, are some more likely to be real than others? Why? (Think about how the mRNAs compare to the DNA sequences used for your analysis.)
5. What wet lab experimental approaches could you imagine to test whether given candidates are real targets of let-7? (Consider the expected function of let-7.)
Copyright 2019 Wesleyan University