CMSC 28000 — Lecture 1

Theory of Computation

The title of this course refers to formal languages and you will definitely learn a lot about formal languages. However, for our purposes, formal language theory is really more of the vehicle through which we talk about computation. Many similar courses at other schools will call this course something like Theory of Computation.

Very broadly, the theory of computation is about problems and how to solve them mechanically. Here, I use the word mechanically rather than computationally to emphasize the idea that we're interested in how to solve problems via a well-defined process that can be automated, rather than focusing on the machines called computers. The key idea in the theory of computation is that all of the objects that we study, problems and the processes by which we can solve these problems, are themselves mathematical objects.

So what is a problem? In this course and for a lot of computability and complexity theory, we focus on decision problems, which can be answered by a YES or NO. In this sense, we can view a problem as a function: given some input, our function tells us YES or NO.

For instance, a well-known problem would be something like: Given a natural number $n$, is $n$ a prime number? This is a yes or no question and we can define a function, say $\operatorname{Prime}:\mathbb N \to \{\mathrm{Yes},\mathrm{No}\}$.
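
To make this concrete, here is a small sketch in Haskell (my own illustration, not part of the lecture) of the primality decision problem as a total function; the names `Answer` and `prime` and the deliberately naive trial-division test are assumptions for the example.

```haskell
-- A sketch of the decision problem "is n prime?" as a function from
-- naturals to Yes/No, using naive trial division for clarity.
data Answer = Yes | No deriving (Show, Eq)

prime :: Integer -> Answer
prime n
  | n < 2                                   = No
  | any (\d -> n `mod` d == 0) [2 .. n - 1] = No
  | otherwise                               = Yes
```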

We can get fancier, of course. What if we wanted to receive multiple inputs? Or more exotic inputs? For example, we could ask, given a graph $G$ and vertices $u$ and $v$ in that graph, whether $G$ contains a path between $u$ and $v$. Again, this is a yes or no question.

Viewing such problems as functions is a good start, but we want something more general to work with. In the above examples, there's no unification in terms of the number of inputs or the domains of the functions, which makes it a bit difficult to talk about the two questions in the same way.

One of the first insights, which you have probably already internalized, is that we can represent pretty much anything we would like with the proper encoding. Generally speaking, this means that we can treat any input as a string. We can even treat "multiple" inputs, as in the case of our graph problem, as a single string, by encoding the delimiters appropriately as well. In this way, we can now define decision problems as functions from strings to Yes or No.

Another observation is that such a function essentially defines a set of strings. If $f$ is our function for our decision problem, then we can define $A$ to be the set of strings for which $f$ answers Yes and the complement of $A$ is all of the other strings.
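
As a quick illustration (a sketch of my own, not from the lecture), here is a made-up decision function over binary strings together with a way to enumerate the language it defines up to a given length; the names `evenOnes` and `languageOf` are assumptions for the example.

```haskell
-- A made-up decision problem over strings: does w contain an even number of 1s?
evenOnes :: String -> Bool
evenOnes w = even (length (filter (== '1') w))

-- The language A = { w | f w answers Yes } is infinite in general, but for a
-- decision function f over {0,1} we can list its members up to length n.
languageOf :: (String -> Bool) -> Int -> [String]
languageOf f n = filter f binaryWords
  where
    binaryWords = concat [sequence (replicate i "01") | i <- [0 .. n]]
```

For instance, `languageOf evenOnes 2` gives `["", "0", "00", "11"]`: exactly the strings of length at most 2 on which the decision function answers Yes.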

And so we have already established a basic connection between the study of problems and how to compute them and the study of the mathematical properties of strings and sets of strings.

Strings

We typically first run into strings when we learn to program. They also happen to be an interesting mathematical object.

An alphabet is a finite set of symbols. We typically denote alphabets by $\Sigma$.

When we think of an alphabet, we usually think of something like the Latin alphabet $\{a,b,c,d,\dots,z\}$. However, we can take any non-empty finite set to be an alphabet. An alphabet that we're all probably very familiar with is the binary alphabet $\{0,1\}$. From this, we can make binary strings. We can also consider more exotic alphabets if we wish. For instance, we might consider the alphabet of DNA bases $\{A,C,G,T\}$. Or we can consider a very large, but still finite, alphabet like Unicode.

A string or word $w = a_1 a_2 \cdots a_n$ over an alphabet $\Sigma$ is a finite sequence of symbols where each $a_i \in \Sigma$. The set of all strings over $\Sigma$ is denoted $\Sigma^*$. The length of a word $w$ is denoted by $|w|$ and is the number of symbols in $w$. We denote by $\varepsilon$ the empty word, and $|\varepsilon| = 0$. Sometimes, we also want to consider the number of a particular alphabet symbol in a word. We denote by $|w|_a$ the number of occurrences of the letter $a$ in $w$.

Depending on the particular subfield of research, strings are sometimes called words. I can guarantee that even if I promise to try to use one consistently, I will slip and use the other, so I will make no such promises.

The definitions we have here are the typical definitions that we stick at the beginning of papers in formal language theory. However, a more insightful and slightly more formal way to define these is by defining everything inductively.

A string or word $w$ over an alphabet $\Sigma$ is either the empty word $\varepsilon$, or $w = au$, where $a \in \Sigma$ is a symbol and $u$ is a word over $\Sigma$.

This gives us a natural way to define other properties of strings and perform induction on them.

The length of a string $w$, denoted $|w|$, is defined inductively by $|\varepsilon| = 0$ and $|au| = 1 + |u|$ for $a \in \Sigma$ and $u \in \Sigma^*$.
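
For those who like to see this in code, here is a sketch (my own, in Haskell) of the inductive view: a string over an alphabet is either empty or a symbol followed by a shorter string, which is exactly the shape of a Haskell list.

```haskell
-- Length, following the inductive definition: |epsilon| = 0 and |a u| = 1 + |u|.
-- A string over an alphabet sym is modelled as the list type [sym].
len :: [sym] -> Int
len []      = 0
len (_ : u) = 1 + len u
```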

The most basic operation that we can perform on strings is concatenating them.

The concatenation of two strings $u = a_1 a_2 \cdots a_m$ and $v = b_1 b_2 \cdots b_n$, denoted $u \cdot v$ or simply $uv$, is $uv = a_1 \cdots a_m b_1 \cdots b_n$.

Observe that concatenation is associative: $(uv)w = u(vw)$. However, it is not commutative (i.e. $uv$ is not necessarily equal to $vu$). Usually, we learn about concatenation when we first learn to program. Depending on how you learned to program, you may think of it as sticking two strings together. Or, if you learned to program functionally, then you will have seen the following alternative definition before.

The concatenation of two strings $u$ and $v$ is defined inductively: if $u = \varepsilon$, then $u \cdot v = v$; and if $u = aw$ for some $a \in \Sigma$ and $w \in \Sigma^*$, then $u \cdot v = a(w \cdot v)$.
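
Here is the same definition as a Haskell sketch (again mine, not the lecture's); this is essentially how `(++)` is defined in the Prelude.

```haskell
-- Concatenation, following the inductive definition:
--   epsilon . v = v
--   (a u) . v   = a (u . v)
cat :: [sym] -> [sym] -> [sym]
cat []      v = v
cat (a : u) v = a : cat u v
```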

In this view of strings, it becomes clearer that alphabet symbols are not technically strings themselves. However, we are often lazy and don't bother making this distinction, since treating a symbol as a string of length 1 rarely gets us into trouble.

Also notice that this makes it clear how concatenation works with the empty string: if we have a string $w$, then $w \varepsilon = \varepsilon w = w$. This leads us to a very interesting way to consider strings.

A monoid is a structure $\langle M,\cdot,1 \rangle$, where $M$ is a set, $\cdot$ is an associative binary operation, and $1$ is an identity for $\cdot$.

From this, we can see that the set of strings, with the binary operation concatenation and identity element $\varepsilon$, is a monoid, and this is where $\Sigma^*$ gets its notation from—it is the free monoid on $\Sigma$. Of course, at this point, this is little more than a curiosity, but it does point to a few things. First of all, strings aren't just arbitrary objects; they have mathematical meaning and structure. Secondly, monoids are algebraic objects (just add inverses and you've got groups!), and this points to the possibility that studying formal language theory using algebraic tools may not be such a bad idea.
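
As a small illustration (a sketch under the assumption that we model strings as Haskell lists), the monoid laws for strings can be written down directly; Haskell's standard Monoid instance for lists is exactly concatenation with the empty list as the identity.

```haskell
-- The three monoid axioms for strings under concatenation.  Here (<>) is
-- list concatenation and mempty is the empty string "".
leftIdentity, rightIdentity :: String -> Bool
leftIdentity  w = mempty <> w == w
rightIdentity w = w <> mempty == w

associative :: String -> String -> String -> Bool
associative u v w = (u <> v) <> w == u <> (v <> w)
```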

Let's go back to concatenation and define a few more notions.

The $k$th power of a string $w$ is $$w^k = \overbrace{ww \cdots w}^{k \text{ times}}$$ and we define $w^0 = \varepsilon$. Formally, we define this inductively by $w^0 = \varepsilon$ and $w^k = w w^{k-1}$ for $k \geq 1$.
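
A quick sketch of this definition in Haskell (mine, for illustration only):

```haskell
-- The k-th power of a string: w^0 = epsilon and w^k = w . w^(k-1), for k >= 0.
power :: [sym] -> Int -> [sym]
power _ 0 = []
power w k = w ++ power w (k - 1)
```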

A related notion that won't come up very much in this course is the notion of primitivity. We say a word is primitive if it is non-empty and cannot be written as $u^k$ for any word $u$ and any integer $k \geq 2$.

If $w = uxv$ is a string, we say $u$ is a prefix of $w$, $v$ is a suffix of $w$, and $x$ is a subword or infix or substring or factor of $w$.

Much like the usage of string or word, there are a plethora of terms that can be used interchangeably with what you're probably most familiar with as a substring. The term subword is the same idea, if you were to use word instead of string. Similarly, the term infix comes naturally from the analogy with prefix and suffix. The one which may be less obvious is factor; this comes from viewing the string as an algebraic object.

The reversal of a string $w = a_1 a_2 \cdots a_n$ is the string with the symbols written in reverse order, $w^R = a_n \cdots a_2 a_1$. Formally, $\varepsilon^R = \varepsilon$ and $(au)^R = u^R a$ for $a \in \Sigma$ and $u \in \Sigma^*$.

If $w = w^R$, we say $w$ is a palindrome.
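
In code, reversal and the palindrome check look like this (a sketch; the list-based representation is my own choice, as above):

```haskell
-- Reversal, following the inductive definition: epsilon^R = epsilon and
-- (a u)^R = u^R a.
rev :: [sym] -> [sym]
rev []      = []
rev (a : u) = rev u ++ [a]

-- A word is a palindrome exactly when it equals its own reversal.
isPalindrome :: Eq sym => [sym] -> Bool
isPalindrome w = w == rev w
```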

Reversal interacts with concatenation in a way that reflects the fact that concatenation is not commutative: reversing a concatenation swaps the order of its factors.

For any strings $x$ and $y$, $(xy)^R = y^R x^R$.

We will show this by induction on $x$.

For our base case, let $x = \varepsilon$. We have \begin{align*} (xy)^R &= (\varepsilon y)^R &\text{substitution}\\ &= y^R &\text{definition of concatenation} \\ &= y^R \varepsilon &\text{$\varepsilon$ is the identity}\\ &= y^R \varepsilon^R &\text{definition of reversal}\\ &= y^R x^R & \text{substitution} \end{align*}

Now, for our induction step, let $x = az$ for some symbol $a \in \Sigma$ and word $z \in \Sigma^*$, and assume that $(zy)^R = y^R z^R$. Then we have \begin{align*} (xy)^R &= ((az)y)^R &\text{substitution}\\ &= (a(zy))^R &\text{definition of concatenation}\\ &= (zy)^R a &\text{definition of reversal}\\ &= (y^R z^R) a &\text{inductive hypothesis}\\ &= y^R (z^R a) &\text{associativity}\\ &= y^R (az)^R &\text{definition of reversal}\\ &= y^R x^R &\text{substitution} \end{align*}

Languages

Informally, languages are just sets of strings and we describe a language in the same way as for any other set.

Here are some examples of languages: the set of binary strings that represent prime numbers, the set of strings over $\{a,b\}$ that contain an equal number of $a$'s and $b$'s, and the set of all words that appear in an English dictionary.

We can define the notion of a language more formally, with respect to an alphabet $\Sigma$.

We denote by $\Sigma^*$ the set of all finite words over $\Sigma$. A language over $\Sigma$ is a subset of $\Sigma^*$.

You might notice that the more curious definition is for $\Sigma^*$. If you recall our brief diversion about monoids, you might recognize that $\Sigma^*$ is really just the free monoid on $\Sigma$.

Since languages are sets, we can perform any set theoretic operations we're familiar with, like union and intersection.

For a language $L \subseteq \Sigma^*$, the complement of $L$ is $\overline L = \Sigma^* \setminus L$.

This also means there are a few things we have to be careful with. For example, there is the notion of the empty language $\emptyset$, which is the language with no strings in it. This is different from the language $\{\varepsilon\}$, which is the language containing the empty string.

We can also extend many operations on strings to operations on languages.

For two languages $L_1$ and $L_2$, we define the concatenation of $L_1$ and $L_2$ by $$ L_1 L_2 = \{uv \mid u \in L_1, v \in L_2 \}.$$
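
For finite languages, this operation is easy to compute; here is a sketch (my own, using Data.Set from the containers package) for languages given as finite sets of strings.

```haskell
import qualified Data.Set as Set

-- A finite language, represented as a set of strings.
type Lang = Set.Set String

-- L1 L2 = { uv | u in L1, v in L2 }, for finite L1 and L2.
catLang :: Lang -> Lang -> Lang
catLang l1 l2 = Set.fromList [u ++ v | u <- Set.toList l1, v <- Set.toList l2]
```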

This allows us to define powers for languages.

For a language $L \subseteq \Sigma^*$, the $k$th power of $L$ is defined by $L^0 = \{\varepsilon\}$ and $L^k = L L^{k-1}$ for $k \geq 1$.

This definition leads to an interesting operation.

The Kleene star or Kleene closure of $L$ is defined by $$ L^* = \bigcup_{i \geq 0} L^i.$$ We can similarly define the positive Kleene closure of $L$ by $$ L^+ = \bigcup_{i \geq 1} L^i.$$
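
Continuing the finite-language sketch from above (still my own illustration, reusing `Lang` and `catLang`), powers of a language follow the same inductive pattern as powers of a string, and the Kleene star can only be approximated in code, since it is an infinite union in general.

```haskell
-- L^0 = {epsilon} and L^k = L L^(k-1) for k >= 1, reusing Lang and catLang
-- from the earlier sketch.
powerLang :: Lang -> Int -> Lang
powerLang _ 0 = Set.singleton ""
powerLang l k = catLang l (powerLang l (k - 1))

-- An approximation of L^*: the union of the powers L^0 through L^n.
starUpTo :: Lang -> Int -> Lang
starUpTo l n = Set.unions [powerLang l i | i <- [0 .. n]]
```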