Prediction of MicroRNA Recognition Sites

BIOL 265/COMP 113

Michael Weir and Danny Krizanc

Today we are going to carry out a genomic scale analysis to discover special properties of a subset of the screened genes. In particular, we will use sequence properties to predict genes that are targets of microRNAs.

Small RNAs are increasingly being found to play a significant role in regulating expression of genes. One of the mechanisms of regulating expression involves microRNAs that bind through base pairing to 3-prime UTRs of target genes. This can cause protein translation of the targeted mRNA to be down-regulated and can also cause the mRNA to be degraded.

An example of this process is the C. elegans let-7 microRNA, which targets the lin-41 mRNA, as illustrated here. [See link for discussion.]

Let us screen the C. elegans genome for possible target genes of let-7.

But first, let us work out how we will identify putative targets.

The figure illustrates an example of the let-7 microRNA binding to a target.

We will use the rules for possible base pairing:

A can pair with U

C can pair with G

G can pair with C or U

U can pair with A or G
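The pairing rules above can be written down as a small lookup table. This is only a sketch; the names PAIRS and can_pair are made up for illustration and are not part of the handout.

```python
# Pairing rules from the list above (RNA alphabet).
# Keys are one base; values are the bases it can pair with.
PAIRS = {
    "A": {"U"},
    "C": {"G"},
    "G": {"C", "U"},  # G-U is allowed in addition to G-C
    "U": {"A", "G"},  # U-G is allowed in addition to U-A
}


def can_pair(base1, base2):
    """Return True if the two RNA bases can pair under the rules above."""
    return base2.upper() in PAIRS[base1.upper()]


print(can_pair("G", "U"))  # True
print(can_pair("A", "G"))  # False
```

Note that the table is symmetric: whenever X can pair with Y, Y can pair with X.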

These rules can be used to make a model for the possible target binding sites of let-7. Let us assume that the let-7 microRNA base pairs according to the above rules at exactly the 17 nucleotide positions illustrated in part 2 of the figure.

It is probably easiest to express this as a regular expression.

[You will be able to use this RE in your Python code.]

To build your model, you may wish to draw the let-7 sequence, then on the next line, the corresponding lin-41 sequence that is bound, and then on the next line, alternative nucleotides, if appropriate.

let-7 sequence 3- GAUAUGUUGG . . . GAUGGAG

lin-41 5-

Alternative nt
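One possible way to turn the worksheet into an RE is to translate each let-7 base into the set of DNA bases allowed at the matching target position (T in place of U). The sketch below assumes that the two let-7 blocks are read 3-prime to 5-prime as written above, so that the target reads 5-prime to 3-prime in the same left-to-right order. The gap between the two blocks is an assumption: the placeholder of any 3 nucleotides happens to be consistent with the 20-nt example match shown later in this handout, but the actual spacing should be taken from the figure. The names TARGET_CLASS, GAP, block_to_re and build_let7_re are illustrative only.

```python
import re

# let-7 base -> DNA bases allowed at the paired target position (T replaces U).
TARGET_CLASS = {"A": "t", "C": "g", "G": "[ct]", "U": "[ag]"}

LET7_LEFT = "GAUAUGUUGG"   # first paired block, as written in the handout
LET7_RIGHT = "GAUGGAG"     # second paired block

# ASSUMPTION: placeholder for the unpaired stretch in the target.
# Replace with whatever the figure shows.
GAP = "[acgt]{3}"


def block_to_re(mirna_block):
    """Translate a block of let-7 bases into an RE for the target DNA."""
    return "".join(TARGET_CLASS[base] for base in mirna_block)


def build_let7_re(gap=GAP):
    """Assemble the full RE: left block, gap, right block."""
    return block_to_re(LET7_LEFT) + gap + block_to_re(LET7_RIGHT)


# IGNORECASE so that upper- or lower-case sequence files both work.
LET7_RE = re.compile(build_let7_re(), re.IGNORECASE)

print(build_let7_re())
```

Building the RE from the pairing table, rather than typing it by hand, makes it easy to try a different gap or a different microRNA.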

How many different target binding sequences can let-7 bind to according to your model?

[8192]

This is sometimes described as the degeneracy of the binding event.

There are 13 positions in let-7 that theoretically can base pair with two different bases.

Four positions can only bind one base.
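The counts above can be checked directly from the 17 paired let-7 positions. This is a small sketch; the variable names are illustrative.

```python
# The 17 paired let-7 positions from the handout (gap omitted).
LET7_PAIRED = "GAUAUGUUGG" + "GAUGGAG"

# Number of possible partners for each base under the pairing rules.
N_PARTNERS = {"A": 1, "C": 1, "G": 2, "U": 2}

two_way = sum(1 for base in LET7_PAIRED if N_PARTNERS[base] == 2)
one_way = len(LET7_PAIRED) - two_way

# Degeneracy = product of the number of choices at each position.
degeneracy = 1
for base in LET7_PAIRED:
    degeneracy *= N_PARTNERS[base]

print(two_way, one_way, degeneracy)  # 13 4 8192
```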

Hence, we can also estimate the probability of finding a binding site at any given position according to our model (assuming the four bases are equally likely):

p = (1/4)⁴ × (1/2)¹³

This is 1/2,097,152.

So we might expect to see a let-7 site on average about every 2 million nucleotides (if we were screening random sequence).
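The arithmetic can be reproduced exactly with Python's fractions module. This sketch assumes equally frequent bases and ignores the gap and sequence-end effects; expected_sites is a hypothetical helper name.

```python
from fractions import Fraction

# 4 positions with one allowed base (1/4 each),
# 13 positions with two allowed bases (1/2 each).
p = Fraction(1, 4) ** 4 * Fraction(1, 2) ** 13
print(p)  # 1/2097152


def expected_sites(n_nt):
    """Expected number of matches when screening n_nt random nucleotides."""
    return float(n_nt * p)


print(round(expected_sites(2_000_000), 2))  # 0.95
```

The same function can be applied to the total number of nucleotides you actually screen to get an order-of-magnitude expectation.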

Now that we have an RE model, let us write Python code that will read sequences from a file and test for occurrences of possible let-7 binding sites according to this model.

Because let-7 regulates translation by binding to 3-prime UTRs, we will screen 1000 nt of genomic sequence downstream of each gene's ORF. (Why are we screening genomic sequence?) The sequences are available here in fasta format. [You may need to right-click to download the files.] [Note that the sequence lines have been concatenated so that each gene has a single line of sequence.]

Write code that reads each line of the fasta file and outputs (prints) all matches to the RE including the fasta header line, nucleotide position and matching sequence.

Example output:

['>WBGene00011936|T22H2.6', [353, 'ttgtacggcttgcctacctc']]

Note that the fasta sequences are DNA sequences (T instead of U). You may wish to use the short test file c_el_test1.txt for testing your code.

You may want to write a Python function to test your RE. After importing the Python RE module and compiling your RE, you could write a for loop that processes each line from the fasta file, and tests whether the line is a fasta header or a sequence line; if a sequence, you might use a for loop to screen the sequence for matches to the RE.
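The loop described above might be sketched as follows. The function names scan_lines and scan_file are hypothetical, and the short pattern in the demo is a toy stand-in, not the let-7 RE. Two assumptions are worth checking against your own needs: positions are reported 0-based (the handout does not say which convention to use), and finditer reports only non-overlapping matches.

```python
import re


def scan_lines(lines, pattern):
    """Yield [header, [position, matching_sequence]] for every match.

    lines   -- any iterable of text lines (an open file or a list)
    pattern -- a compiled RE
    """
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line  # fasta header: remember it for the next sequence
        else:
            # Sequence line: report every non-overlapping match.
            for match in pattern.finditer(line):
                # ASSUMPTION: 0-based position; add 1 for 1-based.
                yield [header, [match.start(), match.group()]]


def scan_file(path, pattern):
    """Print every match found in the fasta file at path."""
    with open(path) as handle:
        for hit in scan_lines(handle, pattern):
            print(hit)


# Demo on made-up data with a toy pattern (NOT the let-7 RE).
toy_pattern = re.compile("[ct]t[ag]t", re.IGNORECASE)
toy_lines = [">gene1", "AAATTGTAAA", ">gene2", "CCCC"]
hits = list(scan_lines(toy_lines, toy_pattern))
print(hits)  # [['>gene1', [3, 'TTGT']]]
```

With your own compiled RE, a call such as scan_file("c_el_test1.txt", your_re) would print one line per match in the same shape as the example output above.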

Assignment

Screen all six chromosomes of C. elegans – this corresponds to approximately 22,000 genes.

[You could write code to count the exact number of genes you screen.]
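A minimal sketch for counting genes, assuming each gene contributes exactly one fasta header line (count_genes is a hypothetical name):

```python
def count_genes(lines):
    """Count fasta header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Demo on made-up data; pass an open file handle to count a real file.
print(count_genes([">gene1", "acgt", ">gene2", "ttga"]))  # 2
```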

1. How many putative let-7 sites do you discover?

2. Is this number in the expected order of magnitude?

3. Is lin-41 among the predicted candidates? – this is a useful test of your Python code.

4. Of the putative sites that you discover, are some more likely to be real than others? Why? (Think about how the mRNAs compare to the DNA sequences used for your analysis.)

5. What wet lab experimental approaches could you imagine to test whether given candidates are real targets of let-7? (Consider the expected function of let-7.)