MPCS 55001 — Lecture 11

RNA Secondary Structure

In the following, $\Sigma$ will denote a finite alphabet of symbols. For example, $\{0,1\}$ is the binary alphabet and $\{A,C,G,T\}$ is the DNA alphabet. We denote the set of strings over a particular alphabet $\Sigma$ by $\Sigma^*$. So $\{0,1\}^*$ denotes the set of all binary strings.

One of the key insights that led to the development of bioinformatics was the observation that we could view sequences of DNA and RNA as strings. If we take a formal language-theoretic view of this, we can treat DNA and RNA as strings over a four-letter alphabet: $\{A,C,G,T\}$ for DNA and $\{A,C,G,U\}$ for RNA.

This simplifies the behaviour of DNA and RNA a bit. After all, they aren't just sequences, they have various biochemical properties as well. For instance, DNA is double-stranded and any sequence we read is really just one side of a double-stranded sequence. If we consider a DNA sequence $s \in \{A,C,G,T\}^*$, we can define its Watson-Crick complement mathematically.

We say that a function $h : \Sigma^* \to \Sigma^*$ is a homomorphism if, for all $u, v \in \Sigma^*$, \[h(uv) = h(u)h(v).\]

Informally, homomorphisms are string functions that map each letter to a string while preserving the order of the string. Homomorphisms are a definition derived from algebraic structures (that is, there are ring homomorphisms, group homomorphisms, etc., and this is because strings themselves form an algebraic structure called a monoid). This definition leads us to the actual notion we care about: antimorphisms.
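To make this concrete, a homomorphism is determined entirely by where it sends each letter. Here is a small sketch in Python; the particular letter map is an arbitrary choice for illustration, not anything canonical:

```python
def make_homomorphism(letter_map):
    """Lift a per-letter map to a function on whole strings."""
    def h(w):
        return "".join(letter_map[a] for a in w)
    return h

# Hypothetical example map on the binary alphabet {0, 1}.
h = make_homomorphism({"0": "ab", "1": "c"})

# The defining property: h(uv) = h(u)h(v).
u, v = "01", "10"
assert h(u + v) == h(u) + h(v)
assert h("0110") == "abccab"
```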

We say that a function $\psi : \Sigma^* \to \Sigma^*$ is an antimorphism if, for all $u, v \in \Sigma^*$, \[\psi(uv) = \psi(v)\psi(u).\]

In other words, an antimorphism is like a homomorphism in that it maps each letter of a string to another string, but it preserves the order in reverse. This is exactly what we need for a definition of the complement of a DNA sequence.

The Watson-Crick antimorphism is the involutive antimorphism $\theta: \{A,C,G,T\}^* \to \{A,C,G,T\}^*$ defined by $\theta(A) = T$, $\theta(C) = G$, $\theta(G) = C$, $\theta(T) = A$.

So for any string $s \in \{A,C,G,T\}^*$, $\theta(s)$ is the Watson-Crick complement of $s$. This has the additional nice property of being an involution: $\theta(\theta(s)) = s$, or the fact that applying the function twice gets us the identity.
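Computationally, $\theta$ amounts to complementing each letter and reversing. A small sketch in Python (the function and table names are ours):

```python
# The Watson-Crick antimorphism: complement each letter, then reverse.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def theta(s):
    """Watson-Crick complement of a DNA string."""
    return "".join(DNA_COMPLEMENT[a] for a in reversed(s))

s = "AACGT"
# theta reverses the order while complementing each base...
assert theta(s) == "ACGTT"
# ...and applying it twice gives back the original string (involution).
assert theta(theta(s)) == s
```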

We can get a similar definition for the RNA alphabet $\{A,C,G,U\}$, where $U$ takes the same role as $T$ in our theoretical model. Unlike DNA, RNA is not double-stranded, but it can still bind to itself, which creates what are called secondary structures.

Strings are, of course, one-dimensional objects, but it is possible to predict the formation of secondary structures from the RNA sequence alone. There are some very nice combinatorial definitions for what this looks like.

Consider the RNA sequence $UGUACCGGUAGUACA$. This sequence can fold back on itself to form a secondary structure in which the first five bases pair with the last five.

This particular structure is called a hairpin. Notice that we can write it as a string $xy\theta(x)$, where $x = UGUAC$ and $y = CGGUA$. Generally, hairpins can be said to have this sequence structure $uv\theta(u)$, where $u,v \in \Sigma^*$. In this way, we can identify secondary structures combinatorially on one-dimensional sequences.
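We can check this decomposition directly; a quick sketch, using the RNA version of the Watson-Crick antimorphism (the function names are ours):

```python
# RNA version of the Watson-Crick antimorphism.
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(s):
    return "".join(RNA_COMPLEMENT[a] for a in reversed(s))

# The hairpin from the text: x y theta(x) with x = UGUAC and y = CGGUA.
x, y = "UGUAC", "CGGUA"
assert theta(x) == "GUACA"
assert x + y + theta(x) == "UGUACCGGUAGUACA"
```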

The specific problem we would like to work on is prediction: which secondary structures are most likely to form given an RNA sequence. We will use the following definition/restrictions.

A secondary structure on a string $w = b_1 \cdots b_n$ of length $n$ over the alphabet $\{A,C,G,U\}$ is a set of pairs $S = \{(i,j)\}$, where $i,j \in \{1,\dots,n\}$ such that

  1. Each pair is separated by at least four bases: if $(i,j) \in S$, then $i \lt j-4$. That is, there are no sharp turns.
  2. Each pair is either $\{A,U\}$ or $\{C,G\}$. That is, the pairs must satisfy Watson-Crick complementarity.
  3. No base appears in more than one pair. That is, $S$ is a matching.
  4. If $(i,j)$ and $(k,\ell)$ are in $S$, then we cannot have $i \lt k \lt j \lt \ell$. In other words, the pairs can't cross.
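The four conditions translate directly into a membership test. A sketch of such a checker (1-indexed pairs, to match the definition; the function name is ours):

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def is_secondary_structure(w, S):
    """Check the four conditions on a candidate set of 1-indexed pairs."""
    pairs = [(min(i, j), max(i, j)) for i, j in S]
    # Condition 1: no sharp turns (each pair separated by at least 4 bases).
    if any(not i < j - 4 for i, j in pairs):
        return False
    # Condition 2: Watson-Crick complementarity.
    if any(RNA_COMPLEMENT[w[i - 1]] != w[j - 1] for i, j in pairs):
        return False
    # Condition 3: S is a matching -- no base appears in two pairs.
    bases = [b for p in pairs for b in p]
    if len(bases) != len(set(bases)):
        return False
    # Condition 4: no crossing pairs.
    for (i, j) in pairs:
        for (k, l) in pairs:
            if i < k < j < l:
                return False
    return True

w = "UGUACCGGUAGUACA"
# The hairpin pairing satisfies all four conditions.
assert is_secondary_structure(w, {(1, 15), (2, 14), (3, 13), (4, 12), (5, 11)})
# These two pairs are complementary and far enough apart, but they cross.
assert not is_secondary_structure(w, {(4, 9), (5, 11)})
```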

Our algorithm, due to Ruth Nussinov in her PhD thesis in 1977 and published later with Jacobson in 1980, is based on the idea that the secondary structure most likely to occur is the one with the most base pairs. This means that our problem is a maximization problem.

The RNA secondary structure prediction problem is: given an RNA sequence $w = b_1 b_2 \cdots b_n$, determine a secondary structure $S$ with the maximum number of base pairs.

This suggests defining $OPT(j)$ to be the maximum number of base pairs in a secondary structure on $b_1 \cdots b_j$. As with many problems before, we consider whether $b_j$ participates in a base pair. If it doesn't, we have $OPT(j) = OPT(j-1)$.

If it does, then $b_j$ is in a pair with some base $b_t$ with $t \lt j-4$. But there's an issue with this: we now have two subproblems. We need to consider the number of base pairs on $b_1 \cdots b_{t-1}$, which would give us $OPT(t-1)$. But we also need to consider the same question on $b_{t+1} \cdots b_{j-1}$. However, this problem isn't captured by our definition of $OPT(i)$, since it asks the question for $b_1 \cdots b_i$.

This suggests that we actually need another variable to handle substrings of $b_1 \cdots b_n$. So instead, we consider $OPT(i,j)$ to be the maximum number of base pairs on the string $b_i \cdots b_j$. Taking our two cases from above, we have the following:

  1. If $b_j$ is not in a pair, then $OPT(i,j) = OPT(i,j-1)$.
  2. If $b_j$ is in a pair with some base $b_t$ with $i \leq t \lt j-4$ and $b_j = \theta(b_t)$, then $OPT(i,j) = 1 + OPT(i,t-1) + OPT(t+1,j-1)$.

To maximize the value, we take the larger of the above two cases. However, there is one more case we need to consider: what if $i \geq j-4$? In this case, there is no secondary structure, by our definition above, so we have $OPT(i,j) = 0$. This forms our base case. This then gives us the following recurrence.

The following recurrence correctly computes the maximum number of base pairs on a string $b_i \cdots b_j$. \[OPT(i,j) = \begin{cases} 0 & \text{if $i \geq j-4$,} \\ \max\left\{ \begin{aligned} &OPT(i,j-1), \\ &\max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \left\{1 + OPT(i,t-1) + OPT(t+1,j-1)\right\} \end{aligned} \right\} & \text{otherwise.} \end{cases}\] Note that the base case covers the empty string as well, so $OPT(i,i-1) = 0$.

Observe that we cannot simply start at $(1,1)$ and move across each row; we have to be a bit careful about the order in which we fill out the table for our recurrence.

    \begin{algorithmic}
    \PROCEDURE{rna-secondary}{$u = b_1 \cdots b_n$}
        \FOR{$i$ from $1$ to $n$}
            \FOR{$j$ from $0$ to $n$}
                \IF{$i \geq j-4$}
                    \STATE $S[i,j] = 0$
                \ENDIF
            \ENDFOR
        \ENDFOR
        \FOR{$k$ from $5$ to $n-1$}
            \FOR{$i$ from $1$ to $n-k$}
                \STATE $j = i+k$
                \STATE $S[i,j] = \max\left\{S[i,j-1], \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \left\{1 + S[i,t-1] + S[t+1,j-1]\right\}\right\} $ 
            \ENDFOR
        \ENDFOR
        \RETURN{$S[1,n]$}
    \ENDPROCEDURE
    \end{algorithmic}
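The pseudocode above translates directly into Python. A sketch, 1-indexed to match the recurrence, with the empty-string entries $S[i, i-1] = 0$ handled by initializing the whole table to zero:

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def nussinov(w):
    """Maximum number of base pairs in a secondary structure on w."""
    n = len(w)
    # S[i][j] = OPT(i, j); the extra row/column covers S[i][i-1] = 0.
    S = [[0] * (n + 1) for _ in range(n + 2)]
    for k in range(5, n):                # k = j - i, by increasing substring span
        for i in range(1, n - k + 1):
            j = i + k
            best = S[i][j - 1]           # case: b_j is unpaired
            for t in range(i, j - 4):    # case: b_j pairs with b_t, t < j - 4
                if RNA_COMPLEMENT[w[t - 1]] == w[j - 1]:
                    best = max(best, 1 + S[i][t - 1] + S[t + 1][j - 1])
            S[i][j] = best
    return S[1][n] if n > 0 else 0

# The hairpin sequence from the text attains five base pairs.
assert nussinov("UGUACCGGUAGUACA") == 5
# Strings too short to fold have no pairs at all.
assert nussinov("ACGUA") == 0
```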

We iterate through entries of the table according to the length of the substrings we consider. This makes sense, because our subproblems depend on smaller substrings, which don't necessarily start with index 1. Our proof will then do likewise.

RNA-SECONDARY correctly computes $OPT$.

We will show that $S[i,j] = OPT(i,j)$ by induction on $j-i$, the span of the substring under consideration.

For our base case, we consider all pairs $(i,j)$ such that $j-i \leq 4$. In this case, all such entries are $S[i,j] = 0 = OPT(i,j)$.

For our inductive case, we assume that for all $(i',j')$ with $j'-i' \lt j-i$, we have $S[i',j'] = OPT(i',j')$. By definition of $OPT(i,j)$, we have \[OPT(i,j) = \max\left\{OPT(i,j-1), \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \{1 + OPT(i,t-1) + OPT(t+1,j-1)\}\right\}.\] Every subproblem on the right-hand side concerns a strictly shorter substring, so by our inductive hypothesis we have $OPT(i,j-1) = S[i,j-1]$ and, for each valid $t$, $OPT(i,t-1) = S[i,t-1]$ and $OPT(t+1,j-1) = S[t+1,j-1]$.

Putting this together gives us \[OPT(i,j) = \max\left\{S[i,j-1], \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \{1 + S[i,t-1] + S[t+1,j-1]\}\right\},\] and thus, $OPT(i,j) = S[i,j]$.

RNA-SECONDARY has running time $O(n^3)$ and uses $O(n^2)$ space.

It is clear that the array that we fill out uses $O(n^2)$ space, so we need at least that much time just to touch every entry. We claim that filling out each entry requires $O(n)$ time. To see this, observe that we must determine the maximum value over candidates $t$ ranging from $i$ through $j-5$, which takes $O(n)$ time. With $O(n^2)$ entries, this gives us a running time of $O(n^3)$.

It's important to note that Nussinov's algorithm was the first of many dynamic programming algorithms for RNA secondary structure prediction. Since then, many more have been developed, with more complicated criteria and fewer assumptions about the secondary structure. For instance, the assumption that pairs don't cross can be an oversimplification: while arguably rare, there are secondary structures that do contain such crossings. One example is the pseudoknot, which can be represented as a string of the form $uxv\theta(u)y\theta(v)$.
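To see why strings of this form force crossings, note that the stem pairing $u$ with $\theta(u)$ interleaves with the stem pairing $v$ with $\theta(v)$. A quick sketch with arbitrarily chosen $u$, $x$, $v$, $y$ (all choices here are ours, for illustration):

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

def theta(s):
    return "".join(RNA_COMPLEMENT[a] for a in reversed(s))

# Hypothetical choices for u, x, v, y.
u, x, v, y = "ACG", "AA", "UUC", "GG"
w = u + x + v + theta(u) + y + theta(v)

# Pairing u with theta(u) and v with theta(v) produces crossing pairs:
# u starts before v, but theta(u) also ends before theta(v), so a u-pair
# (i, j) and a v-pair (k, l) satisfy i < k < j < l.
i, j = 1, len(u + x + v) + len(u)   # first base of u with last base of theta(u)
k, l = len(u + x) + 1, len(w)       # first base of v with last base of theta(v)
assert i < k < j < l
```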