MPCS 55001 — Lecture 1

Welcome to MPCS 55001! Basic course information can be found on the home page of the course website.

What is this course about?

This course is about the design and analysis of algorithms.

An algorithm is nothing but a finite series of steps that can be followed to solve a particular problem. Today, we think of algorithms as something a computer does, but the notion of an algorithm is quite old. The word itself derives from the name of the 9th-century Persian mathematician al-Khwarizmi, but the idea goes back even further. One such algorithm we've already seen (because you should've taken discrete math in order to get into this class) is Euclid's algorithm for finding the greatest common divisor of two numbers, which dates to about 300 BC.

Historically, algorithms typically came from proofs, since to show that some object existed, it was good enough to prove that one could construct it. Many of these constructions are algorithms. But while algorithms themselves are old, the idea of analyzing algorithms is relatively new. The oldest known analysis of an algorithm is Gabriel Lamé's analysis of Euclid's algorithm in the 1800s, roughly 2000 years after the algorithm itself.

Even then, there was very little reason to analyze the efficiency of algorithms. After all, what's the point of seeing how an algorithm performs on an input of size 1000000000 if humans wouldn't be able to carry out that many operations? And so, to the extent that algorithms were judged at all, it was the same way we judge proofs: by how clever or elegant they were. Obviously, things changed once we started making machines that could carry out millions of operations in a reasonable amount of time. The systematic study of the efficiency of algorithms began in the 1960s, a few decades after computer science emerged as a field.

Very broadly, theoretical computer science is concerned with problems and how to solve them mechanically. Here, I use the word mechanically rather than computationally to emphasize that we're interested in how to solve problems via a well-defined process that can be automated, rather than focusing on the machines called computers, since, as we discussed earlier, algorithms don't need to involve computers at all.

So what is a problem? Generally, we view a problem as a function which takes some input and produces some output. Our program or algorithm then computes the function by correctly mapping inputs to the right outputs. By now, you have some intuition for how to do this, and you'll have developed some tricks to solve these problems and check that your solutions are correct.

However, we want to go a bit further than that and make all of these notions a bit more formal. Instead of trial and error, we want to study the structure of these problems and come up with formal approaches to designing algorithms for them. Instead of acquiring evidence that our algorithms are probably correct via testing, we want to be able to prove definitively that the algorithms are correct. And we want to have a formal notion of what it means for an algorithm to be fast or efficient.

This is a theoretical course, and we'll be discussing algorithms and how to prove their correctness and efficiency. To succeed in this course, you should be comfortable with mathematical proof. Many of the results in this course will depend on ideas from discrete math, particularly mathematical induction and graph theory.

There and Back Again

Let's jump right in and see what kinds of things we'll be doing for the rest of the quarter.

A simple problem in many domains is whether it's possible to get from $A$ to $B$. This sounds like a question about physical location, like "Can I get from Chicago to Toronto?", but as we'll see, the problem turns out to be a lot more general than that:

  1. Can a packet be routed from one machine on a network to another?
  2. Starting from some state of a game, can we reach a winning state?
  3. Is there a sequence of connecting flights that gets a passenger from one city to another?

Define the problem

We begin by abstracting the problem. What we realize is that we don't really care that these are packets or game states or airplanes. These scenarios all have the same sort of constraints:

  1. We have a collection of "things": machines, game states, cities.
  2. Some pairs of things are directly connected: a network link, a legal move, a direct flight.
  3. We want to know whether one thing can be reached from another by following a sequence of these direct connections.

Once we have our abstraction, we need a convenient representation for it. Such a representation is nice to have mathematically, since we can exploit mathematical properties to solve our problem. But since we have computers, a representation also gives us a way to tell the computer about it.

We will represent our data with a mathematical structure called a graph.

A graph is a pair $G = (V,E)$ of vertices $V$ and edges $E$. A graph is undirected if edges are unordered pairs of vertices. A graph is directed if edges are ordered pairs of vertices.
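
For a concrete example, take $V = \{1,2,3,4\}$ and $E = \{(1,2),(2,3),(2,4)\}$: an undirected graph with four vertices and three edges, in which vertex $2$ is adjacent to every other vertex.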

Within this graph, we notice that our problem is about whether we can get from one "thing", represented by a vertex, to another, by following some predefined notion of relationships between two "things", represented by edges. Visually, this looks like starting at one circle and following lines until we reach another circle. We can define this formally.

A path is a sequence of vertices $v_0 v_1 \cdots v_k$ such that $(v_i,v_{i+1})$ is an edge for all $0 \leq i \lt k$.

Once we have this idea, it becomes pretty clear that what we want to do is determine whether there is a path between two vertices. So we have a problem. A problem takes some input and produces some output. We solve a problem by defining an algorithm that, when given an input, produces the correct output.

The $s,t$-connectivity problem is: Given a graph $G = (V,E)$ and two vertices $s,t \in V$, is there a path from $s$ to $t$?

Designing the algorithm

We've talked a lot about paths and connectivity in graphs in discrete math, but we never really considered the question of how to find a path in a graph. To understand why this is a problem, we have to remember that a computer doesn't understand or know what a graph is—we have to provide instructions for how to navigate this structure.

In other words, we need to figure out how to say that we should "start" at $s$ and follow edges until we can or can't find $t$. In essence, what we want to do is search the graph for $t$. However, we have to figure out how to break this idea down in a systematic way into a finite series of well-defined steps. Let's examine two possible strategies, which depend on how "adventurous" you are.

Breadth

A conservative strategy would be to systematically explore all of the vertices near the start and slowly branch out. More specifically, we explore the vertices that are directly connected to $s$, then explore the vertices directly connected to those vertices, and so on. Such a strategy is called breadth-first search.

We can more formally define the action of the algorithm recursively in the following way. The breadth-first search algorithm, when run on a graph $G$ and source vertex $s$, constructs sets of vertices $L_i$, defined by taking $L_0 = \{s\}$ and letting $L_{i+1}$ be the set of vertices $v$ that appear in none of $L_0, \dots, L_i$ and for which $(u,v) \in E$ for some $u \in L_i$.

First of all, we need to convince ourselves that this process will terminate. This isn't difficult to see: every vertex that appears in $L_{i+1}$ appears in no earlier set, so since there are finitely many vertices, only finitely many of the sets $L_i$ can be nonempty. Then once we've constructed these sets, all we need to do is check whether $t$ is in one of them. We claim that there's a path from $s$ to $t$ if so.

This may not look like what you think of as an algorithm, but it provides a precise definition of how to construct each set, and it is deterministic: you get the same answer every time. You may also be uncomfortable with the idea of recursion. However, we'll see that recursion is at the heart of computer science and secretly lurks in every so-called non-recursive iterative function you write.

Of course, we want to show that this algorithm is correct. What we can do is make some observations about the properties of the algorithm and use them to show that we computed what we intended to compute. First, we need the following definition.

Let $u$ and $v$ be vertices in a graph $G$. The distance between $u$ and $v$ is the least number of edges in a path between $u$ and $v$.

Distance is a natural name for this measure and it represents the length of a shortest path from $u$ to $v$. It turns out that this can be applied quite naturally to the paths we find via our breadth-first search. Since the definition of the sets $L_i$ is recursive, a natural proof technique to use is mathematical induction.

For each $i$, the set $L_i$ consists of all vertices at distance exactly $i$ from $s$.

We will prove this by strong induction on $k$, the index of the set $L_k$.

Our base case is $k = 0$. In this case, we have $L_0 = \{s\}$ and we consider a vertex to have a path to itself of length $0$, so our claim holds.

Now consider $k \gt 0$ and assume for all $j \lt k$ that $L_j$ consists of exactly the vertices at distance $j$ from $s$. Suppose $t \in L_k$. By definition of $L_k$, there exists a vertex $u \in L_{k-1}$ such that $(u,t) \in E$. By our inductive hypothesis, $u$ has distance $k-1$, so there is a path from $s$ to $u$ of length exactly $k-1$. Then we can find a path from $s$ to $t$ of length exactly $k$ by taking the path from $s$ to $u$ and adding the edge $(u,t)$. Furthermore, this is a shortest such path, since otherwise $t$ would be in $L_j$ for some $j \lt k$, and the sets are disjoint by construction. Conversely, suppose $t$ has distance exactly $k$ from $s$, and let $u$ be the vertex just before $t$ on a shortest path from $s$ to $t$. Then $u$ has distance $k-1$, so $u \in L_{k-1}$ by the inductive hypothesis. Since $t$ is adjacent to $u$ and appears in no earlier set (its distance from $s$ is $k$, not less), the definition of $L_k$ gives $t \in L_k$.

A consequence of this is that there is a path from $s$ to $t$ if and only if $t$ appears in $L_i$ for some $i$.
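
To make the level-by-level construction concrete, here is a minimal Python sketch of it, assuming the graph is given as a dictionary `adj` that maps each vertex to a list of its neighbours (we'll say more about graph representations later):

    def bfs_layers(adj, s):
        """Construct the sets L_0, L_1, ... from source s.

        adj: dict mapping each vertex to a list of its neighbours.
        Returns a list where layers[i] is the set of vertices at
        distance exactly i from s.
        """
        layers = [{s}]
        seen = {s}
        while layers[-1]:
            next_layer = set()
            for u in layers[-1]:
                for v in adj[u]:
                    if v not in seen:      # v is in no earlier set
                        seen.add(v)
                        next_layer.add(v)
            layers.append(next_layer)
        return layers[:-1]                 # drop the final empty set

    # There is a path from s to t exactly when t shows up in some set:
    # any(t in L for L in bfs_layers(adj, s))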

Depth

Another strategy would be to charge ahead, trying to get as far as you can until you get stuck, then retrace your steps and try another direction. More specifically, we explore vertices until we can travel no further. At this point, we backtrack until we have an alternate path and continue in the same way until we get stuck again, and so on. Such a strategy is called depth-first search.

Like many ideas in computer science, this algorithm is naturally expressed as a recursive algorithm. We will write this algorithm in pseudocode.

    \begin{algorithmic}
    \FUNCTION{depth-first-search}{$u$}
        \STATE Mark $u$ as explored 
        \FORALL{$(u,v) \in E$} 
            \IF{$v$ is not marked}
                \STATE \CALL{depth-first-search}{$v$}
            \ENDIF
        \ENDFOR
        \RETURN{explored vertices}
    \ENDFUNCTION
    \end{algorithmic}
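
Here is one possible rendering of this pseudocode in Python, again a sketch that assumes an adjacency dictionary `adj` and uses a set to hold the marks:

    def depth_first_search(adj, s):
        """Return the set of vertices reachable from s."""
        explored = set()

        def visit(u):
            explored.add(u)          # mark u as explored
            for v in adj[u]:         # for all edges (u, v)
                if v not in explored:
                    visit(v)

        visit(s)
        return explored

Note that Python's default recursion limit means this sketch will fail on very deep graphs; the stack-based version we'll see at the end of these notes avoids this.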

It can be a bit hard to think about the steps that a recursive algorithm takes. One helpful way to do this is to think of recursive calls as a tree.

This is a nice visualization, but it also gives us some structure to do yet another inductive proof.

Upon termination, depth-first-search(s) marks all vertices that can be reached from $s$.

We will prove this by induction on $n$, the depth of recursion of depth-first-search(s).

For our base case, we have $n = 0$. This is the initial call of depth-first-search(s) and the first vertex we mark is $s$ itself. We consider $s$ to have a path of length $0$ to itself, so our claim holds.

Now consider $n \gt 0$ and assume that all vertices marked at recursion depth $j \lt n$ are reachable from $s$. Suppose $t \in V$ is marked at depth $n$. Since $t$ was marked, depth-first-search($t$) was called from an earlier recursive call, in which a vertex $u$ was marked; by the inductive hypothesis, $u$ is reachable from $s$. Since a call on $t$ was made, there is an edge $(u,t) \in E$. Then there is a path from $s$ to $t$: follow the path from $s$ to $u$ and add the edge $(u,t)$.

Now suppose that $t$ is reachable. We want to show that $t$ must have been marked. Suppose that it wasn't. Since $t$ is reachable from $s$, there is a path $s v_1 v_2 \cdots v_k t$. Let $v_i$ be the first vertex on this path that is not marked by the algorithm. We argue that no such vertex can exist. If $v_i$ is the first vertex that is not marked by our algorithm, then $v_{i-1}$ was marked by the algorithm. But this means that the edge $(v_{i-1},v_i)$ must have been considered by the algorithm, so depth-first-search($v_i$) must have been called, which would mark $v_i$. This is a contradiction, so every vertex on the path is marked. Therefore, $t$ was marked.

So we have proved that our two algorithms at least give us a correct answer. But there are more questions we can ask about them, namely: How efficient are these algorithms?

Analyzing Algorithms

Generally speaking, we now have a basic outline of the approach we'll take to coming up with algorithmic solutions to problems:

  1. Define the problem. We are often given an informal problem from which we need to extract a mathematically precise one.
  2. Design an algorithm to solve the problem.
  3. Analyze the algorithm. We do this to show that our algorithm is correct and efficient. We may also show that our solution has some particular properties.

Up until now, if you've considered the efficiency of an algorithm, it's been by doing things like counting the number of iterations of a loop or counting the number of operations that you perform. These are generally what we will continue to do, but we'll make things a bit more formal.

Models of computation

One of the immediate problems with analyzing algorithms is figuring out what it is we're actually counting. To do that, we need to think about how our algorithms are being executed and what constitutes a "basic" operation. This is not necessarily obvious, and you can fall into some unexpected traps if you're not careful with how you define this. For example, we often think of arithmetic operations as being "basic", but does it really make sense to think of addition and multiplication as having the same cost? What about looking things up in an array? What we need to do is formalize and fix a model of computation, which tells us what our "computer" can do and what counts as a basic operation for such a computer.

The classical model of computation is the Turing machine, defined by Alan Turing in 1936 and considered the prototypical computational device. A Turing machine consists of a finite state control and an infinitely long tape. The machine reads the tape one cell at a time, and based on the state of the machine and what it reads, it will go to a different state, write to the tape, and move along the tape. The "basic operations" for a Turing machine are very well defined: one transition of the machine can be thought of as one unit of time, and one tape cell is one unit of space.

The problem with this is that although everything is very well-defined, it isn't really how we think about modern algorithms. One of the surprising things about the Turing machine is that, in principle, it is capable of doing anything a modern computer does. However, we want to abstract away some of the details. For example, adding two numbers is non-trivial on a Turing machine, because we have to describe how to do so in terms of the basic operations. Intuitively, adding two numbers feels like something "basic" that we shouldn't need to describe.

The model of computation that's generally used in algorithm analysis more closely resembles electronic computers and is called the word random access machine (word RAM) model. We assume that we have random access to an indexed memory made up of registers that each store a word, which is an $n$-bit integer. Basic operations include integer arithmetic and logic (comparisons), reading and writing a register, pointer arithmetic, branching, and so on. Here, time is defined as the number of primitive operations we use and space is the number of registers that are used.

Even here, we have to be careful about things like how big a word is or what constitutes a "basic" operation. For instance, it's easy to forget that registers store an $n$-bit integer for some fixed $n$. Since we often think of integer arithmetic as one operation, what if we want to add an $n^2$-bit integer? Or, if we wanted to think about algorithms for integer multiplication, it'd be silly to make multiplication a unit cost operation! These kinds of things do not come up often, but when they do, they can have a big effect on your analysis. For most purposes, we cheat and simply say that a word is as big as we need.

Worst-case analysis

This gives us a way to discuss the time or space that an algorithm uses, but it leads us to another problem: the number of operations that an algorithm uses depends on the input. For example, the running time of a sorting algorithm on a list that's already sorted can be quite different from its running time on an arbitrary list!

To deal with this, we typically analyze algorithms based on the worst-case running time. In other words, the complexity of an algorithm is the running time for the input that forces the algorithm to use the most time or space. This gives us a complexity guarantee, since the algorithm will never use more than its worst-case time or space.

But what do we mean by "worst case"? We clearly don't mean the worst case over all inputs, since we can almost always find a larger input that takes longer. What we really mean is the worst case for a particular input size $n$. This then lets us talk about the time or space complexity as a function of the input size $n$.

The worst-case time complexity $T(n)$ of an algorithm $A$ is the maximum over all inputs $I$ of size $n$ of the total number of operations used by $A$ when run on $I$. The worst-case space complexity $S(n)$ of $A$ is the maximum over all inputs $I$ of size $n$ of the total number of registers used by $A$ when run on $I$.

A common objection to this is that it's a pretty harsh metric to be judging algorithms by. For instance, what if the worst case is rare? And this is a perfectly fine objection! If you happen to know some extra information about your particular inputs, then you can and should use that in your design and analysis of your algorithm.

There are other types of analysis, like average-case analysis or probabilistic analysis. And there will be situations where such analyses are useful, but generally speaking, worst-case analysis gives us the most general understanding of an algorithm and the problem that the algorithm is trying to solve.

Asymptotics

So now we know that to think about the complexity of an algorithm, we simply count the number of steps it takes or how much memory it uses on a worst-case input, and this gives us a function of the input size. So if we want to compare the efficiency or complexity of algorithms, what we really end up doing is comparing functions.

Rather than figuring out whether a function is always less than some other function, we are more interested in how the function grows as input grows larger. To do this, we have to turn to calculus/analysis and perform asymptotic analysis. By analyzing the time and space complexity of an algorithm asymptotically, we abstract away many of the small, unimportant details of implementation and get a better idea of whether the broader strategy of the algorithm is efficient.

Let $f:\mathbb N \to \mathbb R$ be a function. We say that $f(n)$ is $O(g(n))$ for some function $g:\mathbb N \to \mathbb R$ if there exist constants $c \gt 0$ and $n_0 \geq 0$ such that for all $n \geq n_0$, $f(n) \leq c \cdot g(n)$.

Intuitively, what this definition says is that if $f$ is $O(g)$, then there is some point $n_0$ such that for every $n$ after $n_0$, $f(n)$ is at most $g(n)$ scaled by some constant factor $c$. This allows us to talk broadly about the growth of the functions $f$ and $g$ without having to worry about small values.
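
For a quick worked example, $f(n) = 3n^2 + 10n$ is $O(n^2)$: for all $n \geq 1$ we have $10n \leq 10n^2$, so $f(n) \leq 13n^2$, and the definition is satisfied by taking $c = 13$ and $n_0 = 1$.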

We call this "big O" notation. Roughly speaking, big O gives us an upper bound on the growth of a function. This system of notation is due to the German analytic number theorists Bachmann and Landau in the late 19th century and is sometimes called Landau notation. Its use in the analysis of algorithms was popularized by Donald Knuth in the 1960s.

We should note that we are primarily interested in functions with positive inputs and outputs, so we will carry that assumption along with us.

Let $f:\mathbb N \to \mathbb R$ be a function. We say that $f(n)$ is $\Omega(g(n))$ for some function $g:\mathbb N \to \mathbb R$ if there exist constants $c \gt 0$ and $n_0 \geq 0$ such that for all $n \geq n_0$, $f(n) \geq c \cdot g(n)$.

Let $f:\mathbb N \to \mathbb R$ be a function. We say that $f(n)$ is $\Theta(g(n))$ for some function $g:\mathbb N \to \mathbb R$ if there exist constants $c_1,c_2 \gt 0$ and $n_0 \geq 0$ such that for all $n \geq n_0$, $c_1 \cdot g(n) \leq f(n) \leq c_2 \cdot g(n)$.

Without any additional properties, we would have to go through a proof based on the definition in order to do any comparisons. However, there are a number of properties and identities that are quite easy to prove that will make our lives considerably easier.

Let $f:\mathbb N \to \mathbb R$ be a function.

  1. $f$ is $O(f)$. (Every function grows about as fast as itself)
  2. If $f$ is $O(g)$, then $cf$ is $O(g)$ for all $c \gt 0$. (Constants can be ignored)
  3. If $f_1$ is $O(g_1)$ and $f_2$ is $O(g_2)$, then $f_1 f_2$ is $O(g_1 g_2)$. (The product of two functions grows about as fast as the product of their bounds)
  4. If $f_1$ is $O(g_1)$ and $f_2$ is $O(g_2)$, then $f_1 + f_2$ is $O(\max\{g_1, g_2\})$. (The sum of functions grows about as fast as the fastest-growing one)
  5. If $f$ is $O(g)$ and $g$ is $O(h)$, then $f$ is $O(h)$. (The big $O$ relation is transitive)
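
As a sample, here is property 5 written out: if $f(n) \leq c_1 g(n)$ for all $n \geq n_1$ and $g(n) \leq c_2 h(n)$ for all $n \geq n_2$, then for all $n \geq \max\{n_1,n_2\}$ we have $f(n) \leq c_1 g(n) \leq c_1 c_2 h(n)$, so $f$ is $O(h)$, witnessed by the constants $c = c_1 c_2$ and $n_0 = \max\{n_1,n_2\}$.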

Proving the rest of these properties makes for a nice exercise. Here are some useful consequences, which give us a rough idea of how quickly common functions grow.

Let $f: \mathbb N \to \mathbb R$ be a function.

  1. If $f(n) = a_0 + a_1 n + \cdots + a_k n^k$ with $a_k \gt 0$, then $f(n) = \Theta(n^k)$. (The growth of a polynomial depends only on its degree)
  2. If $f(n) = \log_a n$, then $f(n) = \Theta(\log_b n)$ for all $a,b \gt 1$. (All logarithms grow roughly as fast as each other)
  3. If $f(n) = \log_a n$, then $f(n) = O(n^k)$ for all $a \gt 1$ and $k \gt 0$. (Logarithms grow slower than polynomials)
  4. If $f(n) = n^k$, then $f(n) = O(r^n)$ for all $r \gt 1$ and $k \gt 0$. (Polynomials grow slower than exponentials)
  5. If $f(n) = n!$, then $f(n) = 2^{\Theta(n \log n)}$. (Factorials grow much faster than exponentials)

One important point to keep in mind as we talk about growth orders of functions is that the functions that we deal with in this class are of a very specific kind: we map input sizes (nonnegative integers) to steps (also nonnegative integers). In other words, our functions are $\mathbb N \to \mathbb N$, which means that we're really dealing with sequences of integers. Our primary schooling trains us to think of functions as continuous over the real numbers, which is admittedly useful for visualizing, but hewing too closely to this view may confuse things if you're not careful.

Efficiency and Polynomial Time

So now, we can analyze an algorithm and determine its worst-case time and space complexity asymptotically. The final basic question that we have to account for is what is considered "good" or "reasonable" when it comes to time and space complexity. To do this, we have to think about what our possible algorithms can look like.

Most of the time, there is an "obvious" algorithm for solving a problem: try every solution. Depending on the exact problem, such an approach can result in checking something like $2^n$ or $n!$ different solutions. With $2^n$ candidates, growing our input by just one element doubles the number of solutions we're searching through, and doubling the input size squares it.

Of course, we hope to find a more clever way to solve a problem than just trying everything. What this means is that if we're clever enough, our algorithm's running time shouldn't explode like this when we grow the size of our problem instance.

More formally, we would like our algorithms to scale in such a way that when the input size doubles, the algorithm only slows down by some constant factor. This leads to a notion of time complexity that is bounded by a polynomial.

We say an algorithm runs in polynomial time if there exists $k$ such that its worst-case time complexity $T(n)$ is $O(n^k)$.

To see why this works, suppose that our running time $T(n)$ is $O(n^k)$; roughly speaking, $T(n)$ is proportional to $cn^k$. If we double the input size, we have $c(2n)^k = c2^kn^k = 2^k \cdot cn^k$. Since $k$ is constant, doubling the input size increases the running time by a constant factor of $2^k$.

So we say that an algorithm is efficient if it runs in polynomial time.

There are a few basic objections to tying polynomial time to efficiency or practicality, but by now we've come up with good responses to them. The most common objection is the example of an algorithm that runs in $O(n^{1000000})$ time versus an algorithm that runs in $O(1.000001^n)$ time.

Of course, it is a cop-out to say that in most practical cases, this won't come up, but that's really the truth. In some strange coincidence, many of the algorithms we come up with tend to have small and reasonable constants, exponents, or bases, and of those that don't, it's often not too long before further research leads us to better algorithms.

To step a bit outside of the realm of algorithms, you may have heard about the Church-Turing thesis, which says that all reasonable models of computation (like the Turing machine or the RAM) are equivalent in power. The extended Church-Turing thesis says that not only are all of these models equivalent, but that they can all simulate each other with only a polynomial increase in running time. This means that the idea of tying efficiency and polynomial time together is quite robust and doesn't really depend on a particular model of computation.

Here's a fun question to think about: what about algorithms that use polynomial space?

Analyzing our algorithms

One thing that we notice about our algorithms is that they're still quite heavy on math. That's okay, but if we want to really nail down their efficiency, we have to think about some degree of implementation, or how a computer would see these objects.

Representation

First of all, how do we represent a graph? There are two main ways to do this (which you may have already seen).

Let $G = (V,E)$ be a graph.

  1. An adjacency matrix for $G$ is a $|V| \times |V|$ matrix $A$ where the $(i,j)$th entry $A[i,j]$ is $1$ if $(i,j)$ is an edge and $0$ otherwise.
  2. An adjacency list for $G$ is an array $A$ of size $|V|$ where $A[i]$ is a list of the vertices $j$ such that $(i,j)$ is an edge.

Both of these representations are robust, in that we can easily modify them for particular representations of graphs. For instance, if we have an undirected graph, then for the adjacency matrix, both $A[i,j]$ and $A[j,i]$ are 1, while for directed graphs, this depends on the direction. Similarly, for an adjacency list, we would have that $j$ is in $A[i]$ but $i$ not necessarily in $A[j]$ in a directed graph, while both would be true in an undirected graph.
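
As a rough sketch of how these might be built in Python, assuming vertices are numbered $0$ through $n-1$ and the input is a list of edge pairs, the two representations are only a few lines each:

    def adjacency_matrix(n, edges, directed=False):
        """n x n matrix: A[i][j] == 1 exactly when (i, j) is an edge."""
        A = [[0] * n for _ in range(n)]
        for i, j in edges:
            A[i][j] = 1
            if not directed:        # undirected: record both directions
                A[j][i] = 1
        return A

    def adjacency_list(n, edges, directed=False):
        """Array of lists: A[i] holds the vertices j with (i, j) an edge."""
        A = [[] for _ in range(n)]
        for i, j in edges:
            A[i].append(j)
            if not directed:        # undirected: j sees i as well
                A[j].append(i)
        return A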

Both of these representations have efficiency tradeoffs that stem from their implementations. To think about these implementations, we need to remember some basic differences in linear data structures.

An array is a fixed-length collection of values.

Internally, arrays are a contiguous block of memory, which allows quick access to a particular location via pointer arithmetic. This means that reading or writing a given position can be done in $O(1)$ time, while modifying the array in any other way, such as inserting or deleting an element, requires reconstructing the entire array at a cost of $O(n)$, where $n$ is the length of the array.

A list is a linked collection of values. There are two equivalent definitions. The first is a mathematical, inductive definition:

A list is either

  1. the empty list, or
  2. an element, called the head, together with a list, called the tail.

This definition is common in functional programming and it's what's used implicitly when we perform inductive proofs over lists. A more common definition, from the world of imperative programming languages, involves the idea of pointers, but is still inductive in a sense.

A list is either

  1. the empty list, or
  2. a node containing a value and a pointer to a list.

The pointer here is why these are sometimes called linked lists. Internally, our list is not expected to be stored contiguously like an array, which allows for easy modification: we can add and remove elements simply by updating links, which takes $O(1)$ time. However, accessing a particular element in the list requires a search by following the links. This can be as bad as $O(n)$ time, since we may have to search all the elements.
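
A minimal Python sketch of the pointer-style definition might look like this, with `None` standing in for the empty list:

    class Node:
        """One cell of a linked list: a value and a pointer to the rest."""
        def __init__(self, value, rest=None):
            self.value = value
            self.rest = rest

    def prepend(lst, value):
        """O(1): attach one new node to the front of lst."""
        return Node(value, lst)

    def get(lst, index):
        """O(n) in the worst case: follow links to the given position."""
        node = lst
        for _ in range(index):
            node = node.rest
        return node.value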

We can then discuss the implementation of graphs using these two structures.

When is it better to use which representation? It depends on what you know about your particular application and what kinds of graphs you expect to see. For example, graphs that are dense and contain lots of edges would benefit from an adjacency matrix representation. Why? We know that the maximum number of edges in a graph of $n$ vertices is $\binom n 2$, which is $\frac 1 2 n (n-1) \leq n^2$. This means that the advantages in space requirements for an adjacency list are erased for dense graphs.

Implementation

The next thing we realize as we think about how to implement our algorithms is that the order in which we visit vertices matters. This is a natural place to introduce the idea of data structures. Data structures (or more specifically, abstract data types) give us a way to store and manage our data. Specialized data structures allow us to retrieve data under certain constraints or with certain properties efficiently.

The first data structure we'll be considering is the stack. A stack has two operations:

  1. push, which adds an element to the top of the stack, and
  2. pop, which removes and returns the element at the top of the stack.

As the name suggests, we can imagine the stack working like a physical stack: we put things on top of the stack and we remove things on top of the stack.

The other data structure we'll be considering is the queue. A queue has two operations:

  1. enqueue, which adds an element to the back of the queue, and
  2. dequeue, which removes and returns the element at the front of the queue.

Again, the name suggests something about the operation of the structure: when we line up (or as the British might say, join a queue), the person who was there first gets served first.

Both of these can be implemented using linked lists (doubly-linked to be exact) so that all operations require $O(1)$ time. We won't be discussing much about underlying implementations of data structures, only that this can be done and that we can rely on these facts.
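
In Python, for instance, `collections.deque` supports adding and removing at both ends in $O(1)$ time, so one structure can play either role; a quick sketch:

    from collections import deque

    stack = deque()
    stack.append("a")        # push: "a" goes on top
    stack.append("b")        # push: "b" goes on top
    top = stack.pop()        # pop: returns "b" (last in, first out)

    queue = deque()
    queue.append("a")        # enqueue: "a" joins the back
    queue.append("b")        # enqueue: "b" joins the back
    front = queue.popleft()  # dequeue: returns "a" (first in, first out)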

So finally, we can evaluate how efficient our algorithms are. We will assume that our graphs are given in adjacency-list representation.

First, we consider the breadth-first search. Since the idea behind BFS is to visit the vertices closest to the source vertex first, a queue is a natural data structure for holding vertices to be visited. In a queue, the first vertices inserted into the queue are the first ones to leave the queue. In other words, the oldest vertices, which are the ones closest to the source, get visited first.

    \begin{algorithmic}
    \FUNCTION{breadth-first-search}{$G,s$}
        \STATE Mark $s$ as Explored
        \STATE Enqueue $s$ onto a queue $Q$
        \WHILE{$Q$ is not empty} 
            \STATE $v \gets$ \CALL{dequeue}{$Q$}
            \FORALL{$(v,w) \in E$}
                \IF{$w$ is not marked as Explored}
                    \STATE Mark $w$ as Explored
                    \STATE \CALL{enqueue}{$Q,w$}
                \ENDIF
            \ENDFOR
        \ENDWHILE
        \RETURN{Explored}
    \ENDFUNCTION
    \end{algorithmic}
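
A fairly direct Python transcription of this pseudocode, again a sketch assuming the adjacency-list representation, might look like this:

    from collections import deque

    def breadth_first_search(adj, s):
        """Return the set of vertices reachable from s."""
        explored = {s}               # mark s as Explored
        Q = deque([s])               # enqueue s
        while Q:
            v = Q.popleft()          # dequeue the oldest vertex
            for w in adj[v]:         # for all edges (v, w)
                if w not in explored:
                    explored.add(w)  # mark w as Explored
                    Q.append(w)      # enqueue w
        return explored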

Breadth-first search can be performed in $O(m+n)$ time, where $m$ is the number of edges and $n$ is the number of vertices.

It is known that enqueueing and dequeueing items on a queue can be done in $O(1)$ time. We describe how to mark a vertex. Since we intend to mark all vertices, we can set up an array of length $n$, which holds a boolean indicating whether the corresponding vertex has been marked. Setting this array up requires $O(n)$ time. Then querying the array and marking a vertex can be done in $O(1)$ time thereafter.

Next, we argue that the algorithm performs at most $n$ iterations of the while loop. To see this, note that the loop runs until the queue $Q$ is empty. Items are only added to the queue immediately after they are marked, and this happens at most once per vertex, so at most $n$ elements are ever inserted into the queue. Since each iteration of the while loop removes one item from the queue, the queue is guaranteed to empty eventually, and there can be at most $n$ iterations of the while loop.

Now, we argue about the cost of the operations inside the while loop. An item is dequeued in $O(1)$ time and we check all of its outgoing edges for neighbours. If we encounter a neighbour that has not yet been marked, we mark it and enqueue it in $O(1)$ time.

The big question here is how much time checking incident edges takes. A naive analysis might bound the inner loop by the total number of edges, concluding that each iteration takes $O(m)$ time and so the overall running time is $O(mn)$. However, this per-iteration bound of $O(m)$ is much too loose. We can tighten it with a more careful analysis.

We argue that the cost of checking edges over all iterations of the while loop is $O(m)$ in total. This is because we dequeue each vertex at most once, as argued above, and when we do, we only consider the edges incident to it. This means that over the entire while loop, we examine each edge at most twice, once from each endpoint, which is still $O(m)$ in total.

As a result, the running time of our algorithm is $O(m+n)$ time.

In depth-first search, by contrast, the idea is to go as far as possible by considering the most recently seen vertices first. In other words, the last vertices to be seen should be the first ones to be visited. So a stack is a natural data structure to manage the vertices we've seen.

    \begin{algorithmic}
    \FUNCTION{depth-first-search}{$G,s$}
        \STATE Push $s$ onto a stack $S$
        \WHILE{$S$ is not empty} 
            \STATE $v \gets$ \CALL{pop}{$S$}
            \IF{$v$ is not marked as Explored}
                \STATE Mark $v$ as Explored
                \FORALL{$(v,w) \in E$}
                    \STATE \CALL{push}{$S,w$}
                \ENDFOR
            \ENDIF
        \ENDWHILE
        \RETURN{Explored}
    \ENDFUNCTION
    \end{algorithmic}
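
And a matching Python sketch, under the same assumptions as before:

    def depth_first_search(adj, s):
        """Return the set of vertices reachable from s."""
        explored = set()
        S = [s]                      # a Python list works as a stack
        while S:
            v = S.pop()
            if v not in explored:
                explored.add(v)      # mark v as Explored
                for w in adj[v]:     # push every neighbour of v
                    S.append(w)
        return explored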

Depth-first search can be performed in $O(m+n)$ time, where $m$ is the number of edges and $n$ is the number of vertices.

The proof of this is pretty similar to the proof for breadth-first search.

It turns out the complexity of both algorithms is about the same. But this is something that we wouldn't have discovered without setting up a rigorous analysis of both. And while these algorithms are of roughly the same complexity, their behaviour may still be something you want to take into account.