MPCS 55001 — Lecture 7

RNA Secondary Structure

In the following, $\Sigma$ will denote a finite alphabet of symbols. For example, $\{0,1\}$ is the binary alphabet and $\{A,C,G,T\}$ is the DNA alphabet. We denote the set of strings over a particular alphabet $\Sigma$ by $\Sigma^*$. So $\{0,1\}^*$ denotes the set of all binary strings.

One of the key insights that led to the development of bioinformatics was the observation that we could view sequences of DNA and RNA as strings. If we take a formal language-theoretic view of this, we can treat DNA and RNA as strings over a four-letter alphabet: $\{A,C,G,T\}$ for DNA and $\{A,C,G,U\}$ for RNA.

This simplifies the behaviour of DNA and RNA a bit. After all, they aren't just sequences, they have various biochemical properties as well. For instance, DNA is double-stranded and any sequence we read is really just one side of a double-stranded sequence. If we consider a DNA sequence $s \in \{A,C,G,T\}^*$, we can define its Watson-Crick complement mathematically.

We say that a function $h : \Sigma^* \to \Sigma^*$ is a homomorphism if, for every $w = a_1 a_2 \cdots a_n \in \Sigma^*$, \[h(w) = h(a_1) h(a_2) \cdots h(a_n).\]

Informally, homomorphisms are string functions that map each letter to a string while preserving the order of the string. The notion of a homomorphism comes from algebra (there are ring homomorphisms, group homomorphisms, and so on; it applies here because strings themselves form an algebraic structure called a monoid). This definition leads us to the actual notion we care about: antimorphisms.

We say that a function $\psi : \Sigma^* \to \Sigma^*$ is an antimorphism if, for every $w = a_1 a_2 \cdots a_n \in \Sigma^*$, \[\psi(w) = \psi(a_n) \psi(a_{n-1}) \cdots \psi(a_1).\]

In other words, an antimorphism is like a homomorphism in that it maps each letter of a string to another string, but it reverses the order of the string. This is exactly what we need for a definition of the complement of a DNA sequence.

The Watson-Crick antimorphism is the involutive antimorphism $\theta: \{A,C,G,T\}^* \to \{A,C,G,T\}^*$ defined by $\theta(A) = T$, $\theta(C) = G$, $\theta(G) = C$, $\theta(T) = A$.

So for any string $s \in \{A,C,G,T\}^*$, $\theta(s)$ is the Watson-Crick complement of $s$. This has the additional nice property of being an involution: $\theta(\theta(s)) = s$; that is, applying the function twice gives us back the original string.
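
To make this concrete, here is a small Python sketch (not from the lecture; the name `wc` is just an illustrative choice) of the Watson-Crick antimorphism on DNA strings, together with a check that it is an involution.

    # A sketch of the Watson-Crick antimorphism theta on DNA strings.
    # Illustrative only: complement each base, then reverse the order.
    WC_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

    def wc(s: str) -> str:
        """Apply theta to a DNA string."""
        return "".join(WC_COMPLEMENT[b] for b in reversed(s))

    s = "ACGGT"
    print(wc(s))             # ACCGT, the Watson-Crick complement of s
    assert wc(wc(s)) == s    # theta is an involution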

We can get a similar definition for the RNA alphabet $\{A,C,G,U\}$, where $U$ takes the same role as $T$ in our theoretical model. Unlike DNA, RNA is not double-stranded, but it can still bind to itself, which creates what are called secondary structures.

Obviously, strings are one-dimensional objects, but it is possible to predict the formation of secondary structures using sequences of RNA. There are some very nice combinatorial definitions for what this looks like.

Consider the RNA sequence $UGUACCGGUAGUACA$. This sequence can have the following secondary structure.

This particular structure is called a hairpin. Notice that we can write it as a string $xy\theta(x)$, where $x = UGUAC$ and $y = CGGUA$. In general, a hairpin has the form $uv\theta(u)$, where $u,v \in \Sigma^*$. In this way, we can identify secondary structures combinatorially on one-dimensional sequences.
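
As a quick check (reusing the same idea for the RNA alphabet; the function name `theta` here is again just an illustrative choice), we can verify that the example sequence really does have the form $xy\theta(x)$:

    # Verify the hairpin decomposition of UGUACCGGUAGUACA (illustrative sketch).
    RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

    def theta(s: str) -> str:
        """Watson-Crick antimorphism over {A, C, G, U}: complement, then reverse."""
        return "".join(RNA_COMPLEMENT[b] for b in reversed(s))

    x, y = "UGUAC", "CGGUA"
    assert x + y + theta(x) == "UGUACCGGUAGUACA"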

The specific problem we would like to work on is prediction: which secondary structures are most likely to form given an RNA sequence. We will use the following definition/restrictions.

A secondary structure on a string $w = b_1 \cdots b_n$ of length $n$ over the alphabet $\{A,C,G,U\}$ is a set of pairs $S = \{(i,j)\}$, where $i,j \in \{1,\dots,n\}$ such that

  1. Each pair is separated by at least four bases: if $(i,j) \in S$, then $i \lt j-4$. That is, there are no sharp turns.
  2. Each pair is either $\{A,U\}$ or $\{C,G\}$. That is, the pairs must satisfy Watson-Crick complementarity.
  3. No base appears in more than one pair. That is, $S$ is a matching.
  4. If $(i,j)$ and $(k,\ell)$ are in $S$, then we cannot have $i \lt k \lt j \lt \ell$. In other words, the pairs can't cross.
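
As a sanity check on the definition, here is a small Python sketch (not part of the lecture; the function name `is_secondary_structure` and the 1-indexed convention are illustrative choices) that tests a candidate set of pairs against the four conditions.

    # Sketch: check a candidate set of 1-indexed pairs (i, j), i < j, against
    # the four conditions above. Illustrative only.
    RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

    def is_secondary_structure(w, S):
        used = set()
        for (i, j) in S:
            if i >= j - 4:                            # condition 1: no sharp turns
                return False
            if RNA_COMPLEMENT[w[i - 1]] != w[j - 1]:  # condition 2: Watson-Crick pairs
                return False
            if i in used or j in used:                # condition 3: S is a matching
                return False
            used.update({i, j})
        for (i, j) in S:                              # condition 4: no crossing pairs
            for (k, l) in S:
                if i < k < j < l:
                    return False
        return True

    # The hairpin UGUACCGGUAGUACA with nested pairs (1,15), (2,14), ..., (5,11):
    print(is_secondary_structure("UGUACCGGUAGUACA", {(t, 16 - t) for t in range(1, 6)}))  # True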

Our algorithm, due to Ruth Nussinov in her PhD thesis in 1977 and published later with Jacobson in 1980, is based on the idea that the secondary structure most likely to occur is the one with the most base pairs. This means that our problem is a maximization problem.

The RNA secondary structure prediction problem is: given an RNA sequence $w = b_1 b_2 \cdots b_n$, determine a secondary structure $S$ with the maximum number of base pairs.

This suggests that we want something like $OPT(j)$, the maximum number of base pairs in a secondary structure on $b_1 \cdots b_j$. In the same way as many problems before, we consider the question of whether $b_j$ participates in a base pair. If it doesn't, we have $OPT(j) = OPT(j-1)$.

If it does, then $b_j$ is in a pair with some base $b_t$ with $t \lt j-4$. But there's an issue with this: we now have two subproblems. We need to consider the number of base pairs on $b_1 \cdots b_{t-1}$, which would give us $OPT(t-1)$. But we also need to consider the same question on $b_{t+1} \cdots b_{j-1}$. However, this problem isn't captured by our definition of $OPT(i)$, since it asks the question for $b_1 \cdots b_i$.

This suggests that we actually need another variable to handle substrings of $b_1 \cdots b_n$. So instead, we define $OPT(i,j)$ to be the maximum number of base pairs in a secondary structure on the string $b_i \cdots b_j$. Taking our two cases from above, we have the following:

  1. If $b_j$ does not participate in a pair, then $OPT(i,j) = OPT(i,j-1)$.
  2. If $b_j$ is paired with some base $b_t$ with $i \leq t \lt j-4$ and $b_j = \theta(b_t)$, then $OPT(i,j) = 1 + OPT(i,t-1) + OPT(t+1,j-1)$.

To maximize the value, we take the larger of the above two cases. However, there is one more case we need to consider: what if $i \geq j-4$? In this case, there is no secondary structure, by our definition above, so we have $OPT(i,j) = 0$. This forms our base case. This then gives us the following recurrence.

The following recurrence correctly computes the maximum number of base pairs on a string $b_i \cdots b_j$. \[OPT(i,j) = \begin{cases} 0 & \text{if $i \geq j-4$,} \\ \max\left\{ \begin{aligned} &OPT(i,j-1), \\ &\max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \left\{1 + OPT(i,t-1) + OPT(t+1,j-1)\right\} \end{aligned} \right\} & \text{otherwise.} \end{cases}\]

Observe that the base case does not simply let us start at the first entry and move across each row. We have to be a bit careful with the order in which we fill out the table for our recurrence.

    \begin{algorithmic}
    \PROCEDURE{rna-secondary}{$w = b_1 \cdots b_n$}
        \FOR{$i$ from $1$ to $n$}
            \FOR{$j$ from $1$ to $n$}
                \IF{$i \geq j-4$}
                    \STATE $S[i,j] = 0$
                \ENDIF
            \ENDFOR
        \ENDFOR
        \FOR{$k$ from $5$ to $n-1$}
            \FOR{$i$ from $1$ to $n-k$}
                \STATE $j = i+k$
                \STATE $S[i,j] = \max\left\{S[i,j-1], \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \left\{1 + S[i,t-1] + S[t+1,j-1]\right\}\right\} $ 
            \ENDFOR
        \ENDFOR
        \RETURN{$S[1,n]$}
    \ENDPROCEDURE
    \end{algorithmic}

We iterate through entries of the table according to the length of the substrings we consider. This makes sense, because our subproblems depend on smaller substrings, which don't necessarily start with index 1. The consequence of this is that our algorithm does not quite fill the array out in the usual left-to-right row by row iteration.

Observe that by iterating by length, our array is filled out diagonally. The base case condition means that a vertical rectangle of width 4 is zeroed out at the beginning. After that, the lower triangular half of the array is also zeroed out for the same reason. Our array values all lie in the upper triangular half of the array. While it looks like there are a lot of zeroes, it's only because we're at the beginning of the array and seeing a small portion of it; for large sequences, this amount becomes negligible.
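
For concreteness, here is a short Python sketch of this table-filling scheme (an independent reimplementation for illustration, not the official course code; it uses 0-indexed positions and treats any interval with $i \geq j-4$, including empty intervals, as contributing zero pairs).

    # Sketch of the Nussinov-style DP: maximize the number of base pairs.
    # Illustrative reimplementation; 0-indexed internally.
    RNA_COMPLEMENT = {"A": "U", "U": "A", "C": "G", "G": "C"}

    def max_base_pairs(w):
        n = len(w)
        # S[i][j] = max pairs on w[i..j] (inclusive); base cases stay 0.
        S = [[0] * n for _ in range(n)]
        for k in range(5, n):                # k = j - i, filled in increasing order
            for i in range(0, n - k):
                j = i + k
                best = S[i][j - 1]           # case 1: b_j is unpaired
                for t in range(i, j - 4):    # case 2: b_j pairs with b_t
                    if RNA_COMPLEMENT[w[t]] == w[j]:
                        left = S[i][t - 1] if t > i else 0
                        best = max(best, 1 + left + S[t + 1][j - 1])
                S[i][j] = best
        return S[0][n - 1] if n > 0 else 0

    print(max_base_pairs("UGUACCGGUAGUACA"))   # 5 for the hairpin example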

Our proof will then proceed by induction on the length of intervals as well.

RNA-SECONDARY correctly computes $OPT$.

We will show that $S[i,j] = OPT(i,j)$ for all pairs $(i,j)$ by induction on the interval length $j-i$.

For our base case, we consider all pairs $(i,j)$ such that $j-i \leq 4$. In this case, all such entries are $S[i,j] = 0 = OPT(i,j)$.

For our inductive case, we assume that $S[i',j'] = OPT(i',j')$ for all pairs $(i',j')$ with $j'-i' \lt j-i$. By definition of $OPT(i,j)$, we have \[OPT(i,j) = \max\left\{OPT(i,j-1), \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \{1 + OPT(i,t-1) + OPT(t+1,j-1)\}\right\}.\] Observe that every subproblem appearing on the right hand side is over a strictly shorter interval, so by our inductive hypothesis, $OPT(i,j-1) = S[i,j-1]$, $OPT(i,t-1) = S[i,t-1]$, and $OPT(t+1,j-1) = S[t+1,j-1]$.

Putting this together gives us \[OPT(i,j) = \max\left\{S[i,j-1], \max\limits_{\substack{i \leq t \lt j-4 \\ b_j = \theta(b_t)}} \{1 + S[i,t-1] + S[t+1,j-1]\}\right\},\] and thus, $OPT(i,j) = S[i,j]$.

RNA-SECONDARY has running time $O(n^3)$ and uses $O(n^2)$ space.

It is clear that the array we fill out uses $O(n^2)$ space, and so we need at least that much time just to fill out the table. We claim that filling out each entry requires $O(n)$ time. To see this, observe that we must take a maximum over the choices of $t$ with $i \leq t \lt j-4$, of which there are $O(n)$. This gives us a running time of $O(n^3)$.

It's important to note that Nussinov's algorithm was the first of what are now many dynamic programming algorithms for RNA secondary structure prediction. Since then, many more have been developed, with more complicated criteria and fewer assumptions about the secondary structure. For instance, the assumption that pairs don't cross can be an oversimplification. While arguably rare, there are secondary structures that do contain such crossings. One example is the pseudoknot, which can be represented as a substring of the form $uxv\theta(u)y\theta(v)$.

Shortest paths, revisited

Recall from Lecture 3 that we were interested in solving the shortest path problem, where we are given a digraph $G$ with edge weights, a source vertex $s$ and a destination vertex $t$. We saw how to solve this using a greedy approach.

Now, we are ready to revisit a problem that we were introduced to earlier that we couldn't solve: how to find the shortest path in a graph that contains negative edge weights. One of the things we saw on the midterm was that even tinkering with the edge weight function wasn't enough to guarantee that we could get Dijkstra's algorithm to be correct.

One of the things some of you observed was that if there's a negative cycle (i.e. a cycle for which the sum of its edge weights is negative) on a path from $s$ to $t$, then there's no shortest path from $s$ to $t$. The proof is not complicated: if there's a negative cycle, then we can traverse it as many times as we like to decrease the cost of the path by an arbitrary amount.

Now, we will consider a non-greedy approach (read: dynamic programming) to solving the problem that will allow us to use negative edge weights. Our algorithm will even be able to determine whether the graph contains a negative cycle, so we can figure out whether there is a shortest path.

Rather than computing the shortest path from a source $s$ to every other node, we consider sort of the backwards version of the problem, where we compute the shortest path from every vertex to our destination $t$.

The single-destination shortest-paths problem is: Given a directed graph $G = (V,E)$ with an edge weight function $w: E \to \mathbb R$ and a destination vertex $t$, find a shortest path from $v$ to $t$ for every $v \in V$.

Because we plan to use dynamic programming, we need to determine what subproblem is appropriate to compute in the service of finding a shortest path. Here, we can take some inspiration from our greedy approach: recall that Dijkstra's algorithm kept an estimated weight of the path from the source $s$ to each vertex $v$.

There are a few similar subproblems that we could try, depending on how we restrict the paths under consideration.

To help clarify this decision, here is a useful result that is similar to something you should have seen in discrete math.

Suppose $G$ has no negative cycles. Then there is a shortest path from $u$ to $v$ that does not repeat any vertices. (If a shortest path visited some vertex twice, the cycle between the two visits would have nonnegative weight, so removing it would give a path from $u$ to $v$ that is no heavier.)

One consequence of this result is that any shortest path will have at most $n-1$ edges in it, since any path that has more edges will revisit some vertex.

We will define $OPT(i,v)$ to be the weight of a shortest path from $v$ to $t$ that uses at most $i$ edges. So if we want to compute the shortest path from $s$ to $t$, we would want to compute $OPT(n-1,s)$.

Let us consider a shortest path from a vertex $v$ to our destination $t$, which would have weight $OPT(i,v)$. How can we decompose this path? Since we want to express this in terms of some shorter path, it makes sense to consider the first edge in our path, say $(v,x)$, followed by the shortest path from that vertex $x$ to $t$. So we have the following possibilities.

  1. Our shortest path from $v$ to $t$ uses fewer than $i$ edges. In this case, our work is done: we just use the shortest path we already have, so we must have $OPT(i,v) = OPT(i-1,v)$.
  2. Our shortest path from $v$ to $t$ uses exactly $i$ edges. In this case, we want to express the weight of our path as a path with fewer edges, but we also know the cost of at least one edge: the one leaving $v$. So we must have $OPT(i,v) \geq w(v,x) + OPT(i-1,x)$ for some vertex $x$. Since we can always take the optimal path from $x$ to $t$, we have $OPT(i,v) \leq w(v,x) + OPT(i-1,x)$. This means for each edge leaving $v$, we have a choice of vertices $x$ such that the weight of the shortest path is $w(v,x) + OPT(i-1,x)$.

    Then the obvious choice is to choose the one with minimal weight, giving us \[OPT(i,v) = \min_{(v,x) \in E} \{w(v,x) + OPT(i-1,x)\}.\]

We then put this together to get the following.

$OPT(i,v)$ satisfies the recurrence \[OPT(i,v) = \begin{cases} 0 & \text{if $i = 0$ and $v = t$,} \\ \infty & \text{if $i = 0$ and $v \neq t$,} \\ \min\left\{ OPT(i-1,v), \min\limits_{(v,x) \in E} \{w(v,x) + OPT(i-1,x)\}\right\} & \text{if $i \gt 0$.} \end{cases}\]

We will prove this by induction on $i$. Our base case is $i = 0$. In this case if $v = t$, then we can reach $t$ from $v$ (itself) with 0 edges, so $OPT(i,v) = 0$. However, if $v \neq t$, then $t$ is not reachable from $v$, so $OPT(i,v) = \infty$.

Now consider $i \gt 0$. We assume that $OPT(i',v')$ is the weight of a shortest path from $v'$ to $t$ that uses at most $i'$ edges for all $i' \lt i$. Let $\mathcal P_{i,v}$ be a shortest path from $v$ to $t$ that uses at most $i$ edges.

First, we argue that \[OPT(i,v) \geq \min\left\{ OPT(i-1, v), \min_{(v,x) \in E} \{OPT(i-1,x) + w(v,x)\}\right\}.\] There are two cases.

  1. $\mathcal P_{i,v}$ uses at most $i-1$ edges. Then its weight is at least $OPT(i-1,v)$, so $OPT(i,v) \geq OPT(i-1,v)$, which is at least the minimum on the right.
  2. $\mathcal P_{i,v}$ uses exactly $i$ edges. Let $(v,x)$ be its first edge, so the remainder of $\mathcal P_{i,v}$ is a path from $x$ to $t$ with $i-1$ edges and therefore has weight at least $OPT(i-1,x)$. Then $OPT(i,v) \geq w(v,x) + OPT(i-1,x)$, which again is at least the minimum on the right.

Next, we show that \[OPT(i,v) \leq \min\left\{ OPT(i-1, v), \min_{(v,x) \in E} \{OPT(i-1,x) + w(v,x)\}\right\}.\] First, any path from $v$ to $t$ with at most $i-1$ edges is also a path with at most $i$ edges, so $OPT(i,v) \leq OPT(i-1,v)$. Second, for any edge $(v,x) \in E$, the edge $(v,x)$ followed by a shortest path from $x$ to $t$ with at most $i-1$ edges is a path from $v$ to $t$ with at most $i$ edges and weight $OPT(i-1,x) + w(v,x)$, so $OPT(i,v) \leq OPT(i-1,x) + w(v,x)$. Since $OPT(i,v)$ is at most each of these terms, it is at most their minimum.

Therefore by the two inequalities above, we have \[OPT(i,v) = \min\left\{ OPT(i-1, v), \min_{(v,x) \in E} \{OPT(i-1,x) + w(v,x)\}\right\}.\]

 

As usual, the recurrence leads us to a straightforward algorithm that fills out a table.

    \begin{algorithmic}
    \PROCEDURE{shortest-paths}{$G,t$}
        \FORALL{$v \in V$}
            \STATE $P[0,v] \gets \infty$
        \ENDFOR
        \STATE $P[0,t] \gets 0$
        \FOR{$i$ from $1$ to $n-1$}
            \FORALL{$v \in V$}
                \STATE $P[i,v] \gets P[i-1,v]$
                \FORALL{$(v,x) \in E$}
                    \STATE $P[i,v] \gets \min\{P[i,v], P[i-1,x]+w(v,x)\}$
                \ENDFOR
            \ENDFOR
        \ENDFOR
    \ENDPROCEDURE
    \end{algorithmic}
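
As a sketch (an independent reimplementation for illustration, not the course's reference code; the graph is given as a list of weighted edges), the same table-filling idea looks like this in Python.

    # Sketch of the O(n^2)-space DP: P[i][v] = weight of a shortest v-t path
    # using at most i edges. Illustrative reimplementation.
    import math

    def shortest_path_weights(n, edges, t):
        """Vertices 0..n-1, edges = list of (v, x, weight), destination t."""
        out_edges = [[] for _ in range(n)]        # out_edges[v] = [(x, w), ...]
        for v, x, w in edges:
            out_edges[v].append((x, w))

        P = [[math.inf] * n for _ in range(n)]
        P[0][t] = 0
        for i in range(1, n):
            for v in range(n):
                P[i][v] = P[i - 1][v]             # use at most i-1 edges
                for x, w in out_edges[v]:         # or start with the edge (v, x)
                    P[i][v] = min(P[i][v], P[i - 1][x] + w)
        return P[n - 1]

    # Hypothetical example graph with a negative edge (but no negative cycle).
    print(shortest_path_weights(4, [(0, 1, 4), (0, 2, 2), (2, 1, -3), (1, 3, 1)], t=3))
    # [0, 1, -2, 0]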

This algorithm is pretty straightforward. The tricky part is counting the number of iterations it makes. Just like with Dijkstra's algorithm, it's easy to conclude that there must be something like $O(mn^2)$ iterations just by looking at the definition of the loops, but if we observe how the edges are chosen, we see that for each vertex $v$ we only examine the edges that leave $v$, so within a single iteration of the outer loop every edge is examined at most once. So the inner two loops together comprise a total of $O(m)$ iterations. In total, this gives us $O(mn)$ time.

On the other hand, the algorithm clearly requires $O(n^2)$ space. This seems like a lot, especially compared to Dijkstra's algorithm. It turns out that we can do better by applying some of the same ideas.

Bellman-Ford

One observation we can make is that in the end, we really don't care how many edges a shortest path actually uses. In that sense, all we really need to do is keep track of the minimum weight of a path from each vertex to $t$. So instead of keeping track of a table of size $O(n^2)$, we can keep track of an array $d$ of estimated weights of shortest paths.

The problem with this is that this may not be enough information to reconstruct the actual path itself. We take a similar approach to Dijkstra's algorithm and keep track of the successors for each vertex as we update the estimated weights $d$. In doing so, we reduce our space requirements from a table of size $O(n^2)$ to two arrays of size $n$, which is $O(n)$.

This leads to the following algorithm attributed to Bellman, who we saw earlier as the inventor of dynamic programming, and Ford, who we will see again very shortly when we get into network flows. But even though the algorithm is called Bellman-Ford, they didn't publish together: Ford published the algorithm in 1956, while Bellman published it in 1958. Even this isn't the only instance: Moore also published the algorithm in 1957 and sometimes also gets attribution, while Shimbel published it in 1955 and does not.

    \begin{algorithmic}
    \PROCEDURE{bellman-ford}{$G,t$}
        \FORALL{$v \in V$}
            \STATE $d[v] \gets \infty$
            \STATE $s[v] \gets \varnothing$
        \ENDFOR
        \STATE $d[t] \gets 0$
        \FOR{$i$ from $1$ to $n-1$}
            \FORALL{$x \in V$}
                \IF{$d[x]$ was updated in the previous iteration}
                    \FORALL{$(v,x) \in E$}
                        \IF{$d[v] \gt d[x] + w(v,x)$}
                            \STATE $d[v] \gets d[x] + w(v,x)$
                            \STATE $s[v] \gets x$
                        \ENDIF
                    \ENDFOR
                \ENDIF
            \ENDFOR
        \ENDFOR
    \ENDPROCEDURE
    \end{algorithmic}
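
Here is a corresponding Python sketch (again an independent reimplementation, not the course's code), keeping only the arrays $d$ and $s$; the 'was updated in the previous iteration' test is implemented with a set of recently changed vertices.

    # Sketch of the O(n)-space Bellman-Ford toward destination t.
    # d[v] = current estimate of the shortest v-t path weight, s[v] = successor of v.
    import math

    def bellman_ford(n, edges, t):
        """Vertices 0..n-1, edges = list of (v, x, weight), destination t."""
        in_edges = [[] for _ in range(n)]          # in_edges[x] = [(v, w), ...]
        for v, x, w in edges:
            in_edges[x].append((v, w))

        d = [math.inf] * n
        s = [None] * n
        d[t] = 0
        changed = {t}                              # vertices whose d changed last iteration
        for _ in range(n - 1):
            newly_changed = set()
            for x in changed:                      # only edges into updated vertices
                for v, w in in_edges[x]:
                    if d[v] > d[x] + w:
                        d[v] = d[x] + w
                        s[v] = x
                        newly_changed.add(v)
            changed = newly_changed
        return d, s

    # Same hypothetical graph as in the previous sketch.
    d, s = bellman_ford(4, [(0, 1, 4), (0, 2, 2), (2, 1, -3), (1, 3, 1)], t=3)
    print(d)   # [0, 1, -2, 0]
    print(s)   # [2, 3, 1, None]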

The running time of this algorithm is still $O(mn)$ by the same argument as before and the space requirements have clearly changed to $O(n)$. What is not as obvious is that $d$ ends up holding the weights of the shortest paths for each vertex and that $s$ correctly holds the successor for each vertex. First, we will show that Bellman-Ford correctly computes the weight of the shortest path.

After the $i$th iteration of the outer loop of Bellman-Ford, the following properties hold for every vertex $v \in V$:

  1. If $d[v] \neq \infty$, then $d[v]$ is the weight of some path from $v$ to $t$,
  2. If there is a path from $v$ to $t$ with at most $i$ edges, then $d[v]$ is at most the weight of a shortest path from $v$ to $t$ with at most $i$ edges.

Note that (1) implies that $d[v]$ never drops below the weight of an actual path from $v$ to $t$, so in a graph with no negative cycles, $d[v] \geq OPT(n-1,v)$, the weight of a true shortest path. Meanwhile, (2) says that $d[v] \leq OPT(i,v)$ after the $i$th iteration. Taking $i = n-1$, the two bounds meet, so $d[v] = OPT(n-1,v)$ when the algorithm terminates, which is exactly what we wanted.

We will prove this by induction on $i$. First, we consider $i = 0$. We have $d[t] = 0$ and $d[v] = \infty$ for all $v \neq t$, since there are no paths from $v$ to $t$ with 0 edges.

Now consider $i \gt 0$, and consider an update $d[v] \gets d[u] + w(v,u)$ for some edge $(v,u) \in E$. By our inductive hypothesis, $d[u]$ is the weight of a path from $u$ to $t$, so $d[u] + w(v,u)$ is the weight of a path from $v$ to $t$. So (1) holds.

Now, consider a shortest path $P$ from $v$ to $t$ with at most $i$ edges. Let $x$ be the first vertex after $v$ in $P$. Then the subpath $P'$ from $x$ to $t$ is a path from $x$ to $t$ with at most $i-1$ edges.

By our inductive hypothesis, after $i-1$ iterations of the outer loop of Bellman-Ford, $d[x]$ is at most the weight of a shortest path from $x$ to $t$ with at most $i-1$ edges. So $d[x] \leq w(P')$.

Then on the next iteration, we consider the edge $(v,x)$. We have \[d[v] \leq d[x] + w(v,x) \leq w(P') + w(v,x) = w(P).\] So after $i$ iterations, $d[v]$ is at most the weight of the shortest path from $v$ to $t$ with at most $i$ edges. So (2) holds.

All that is left is to show that the path that we get from following successors from $s$ gets us the shortest $v,t$ path. Unlike with Dijkstra's algorithm, we haven't shown that it's not possible to create a cycle through our computation. This is something that we need to consider. First, let us consider the successor subgraph.

Let $G_s = (V,E_s)$ be the subgraph defined by \[E_s = \{(v,s[v]) \mid v \in V\}.\]

If the successor graph $G_s$ contains a cycle, it must be a negative cycle.

First, we observe that if $s[v] = x$, then we have that $d[v] \geq d[x] + w(v,x)$. Now, consider a cycle $C$ on $v_1, v_2, \dots, v_k, v_1$ in $G_s$ and assume that $(v_k,v_1)$ is the last edge that belongs to the cycle that is added to $G_s$.

At the point that $v_1$ is assigned to $s[v_k]$ and added to $G_s$, we have the following inequalities: \begin{align*} d[v_1] & \geq d[v_2] + w(v_1,v_2) \\ d[v_2] & \geq d[v_3] + w(v_2,v_3) \\ & \vdots \\ d[v_{k-1}] & \geq d[v_k] + w(v_{k-1},v_k) \\ d[v_k] & \gt d[v_1] + w(v_k,v_1) \\ \end{align*} Adding these inequalities together and grouping the $d[v_i]$'s on one side and edge weights on the other gives us \begin{align*} d[v_1] + \cdots + d[v_k] - d[v_1] - \cdots - d[v_k] &\gt w(v_1,v_2) + \cdots + w(v_k,v_1) \\ 0 &\gt w(v_1,v_2) + \cdots + w(v_k,v_1) \\ 0 &\gt w(C) \\ \end{align*}

So if our graph does not have any negative cycles, then we will not end up with a cycle in our successor graph. It remains to show that the paths that are obtained via the successor graph are actually the shortest paths from $v$ to $t$.

Assuming no negative cycles, Bellman-Ford finds shortest $v,t$ paths for every vertex $v$ in $O(mn)$ time and $O(n)$ space.

Since there are no negative cycles, the successor graph $G_s$ cannot contain any cycles. Therefore, the successor graph must consist of paths to $t$. Consider a path $P$ on vertices $v = v_1, v_2, \dots, v_k = t$.

Once the algorithm terminates, if $s[v] = x$, we must have $d[v] = d[x] + w(v,x)$ for all vertices $v \in V$. This gives us the equations \begin{align*} d[v_1] & = d[v_2] + w(v_1,v_2) \\ d[v_2] & = d[v_3] + w(v_2,v_3) \\ & \vdots \\ d[v_{k-1}] & = d[v_k] + w(v_{k-1},v_k) \\ \end{align*} Since $v_1 = v$ and $v_k = t$, adding these equations together gives us \begin{align*} d[v] &= d[t] + w(v_1,v_2) + \cdots + w(v_{k-1},v_k) \\ &= 0 + w(v_1,v_2) + \cdots + w(v_{k-1},v_k) \\ &= w(P) \\ \end{align*} By Proposition 7.16 and the discussion following it, $d[v]$ is the weight of a shortest path from $v$ to $t$, so $P$ is a shortest path from $v$ to $t$.
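
Concretely, recovering the path is just a matter of following successor pointers from $v$ until we reach $t$; here is a tiny sketch, continuing the hypothetical example from the earlier code.

    # Follow successor pointers from v to t; s is the successor array from Bellman-Ford.
    def reconstruct_path(s, v, t):
        path = [v]
        while path[-1] != t and s[path[-1]] is not None:
            path.append(s[path[-1]])
        return path

    # With the successor array s = [2, 3, 1, None] from the earlier example graph:
    print(reconstruct_path([2, 3, 1, None], 0, 3))   # [0, 2, 1, 3]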

So Bellman-Ford is able to compute the shortest path, assuming there are no negative cycles. And that's fine, because we know that if a graph has a negative cycle, then it doesn't have a shortest path.