Brzozowski derivatives are one of the shibboleths of functional programming: if you ask someone about implementing regular expressions, and you get back an answer involving derivatives of regular expressions, then you have almost surely identified a functional programmer.

The reason that functional programmers like derivatives so much is that they offer an elegantly algebraic and inductive approach to string matching. The derivative function $\delta_c$ takes a character $c$ and a regular expression $r$, and returns a new regular expression $r'$, which accepts a string $s$ if and only if $r$ accepts $c\cdot s$. This function can be defined as follows: $$ \begin{array}{lcl} \delta_c(c) & = & \epsilon \\ \delta_c(c') & = & \bot \\ \delta_c(\epsilon) & = & \bot \\ \delta_c(r_1 \cdot r_2) & = & \left\{ \begin{array}{ll} \delta_c(r_1) \cdot r_2 \vee \delta_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\ \delta_c(r_1) \cdot r_2 & \mbox{otherwise} \end{array} \right. \\ \delta_c(\bot) & = & \bot \\ \delta_c(r_1 \vee r_2) & = & \delta_c(r_1) \vee \delta_c(r_2) \\ \delta_c(r\ast) & = & \delta_c(r)\cdot r\ast \end{array} $$

Now we can turn the 1-character derivative into a word derivative as follows: $$ \begin{array}{lcl} \delta_\epsilon(r) & = & r \\ \delta_{c\cdot w}(r) & = & \delta_w(\delta_c(r)) \end{array} $$

Again, this is a simple inductive definition. So to match a string
we just call the 1-character derivative for each character, and then
check to see if the final regular expression is nullable. However,
there's a catch! As you can see from the definition of $\delta_c$,
there is no guarantee that the size of the derivative is bounded. So
matching a long string can potentially lead to the construction of
very large derivatives. This is especially unfortunate, since it is
possible (using DFAs) to match strings using regular expressions in
*constant* space.

In his 1962 paper, Brzozowski showed that the number of derivatives of a regular expression is finite, if you quotient by the associativity, commutativity, and idempotence axioms for for union $r_1 \vee r_2$. That is, suppose we consider regular expressions modulo an equivalence relation $\equiv$ such that: $$ \begin{array}{lcl} r \vee r & \equiv & r \\ r \vee \bot & \equiv & r \\ r \vee r' & \equiv & r' \vee r \\ r_1 \vee (r_2 \vee r_3) & \equiv & (r_1 \vee r_2) \vee r_3 \\ \end{array} $$

Then his theorem guarantees that the set of derivatives is finite. However, computing with derivatives up to equivalence is rather painful. Even computing equality and ordering is tricky, since union can occur anywhere in a regular expression, and without that it's difficult to implement higher-level data structures such as sets of regular expressions.

So for the most part, derivatives have remained a minor piece of functional programming folklore: they are cute, but a good implementation of derivatives is not really simple enough to displace the usual Dragon Book automata constructions.

However, in 1995 Valentin Antimirov introduced the notion of a
*partial derivative* of a regular expression. The idea is that
if you have a regular expression $r$ and a character $c$, then a
partial derivative is a regular expression $r'$ such that if $r'$
accepts a word $s$, then $r$ accepts $c \cdot s$. Unlike a
derivative, this is only a if-then relationship, and not an
if-and-only-if relationship.

Then, rather than taking regular expressions modulo the ACUI
equations, we can construct *sets* of partial derivatives,
which collectively accept the same strings as the Brzozowski
derivative. The idea is that we can use the algebraic properties of
sets to model the effect of the ACUI equations. Below, I give
the partial derivative function:
$$
\begin{array}{lcl}
\alpha_c(c) & = & \setof{\epsilon} \\
\alpha_c(c') & = & \emptyset \\
\alpha_c(\epsilon) & = & \emptyset \\
\alpha_c(r_1 \cdot r_2) & = &
\left\{
\begin{array}{ll}
\comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} \cup \alpha_c(r_2) & \mbox{if } r_1 \mbox{ nullable} \\
\comprehend{r \cdot r_2}{ r \in \alpha_c(r_1)} & \mbox{otherwise}
\end{array}
\right.
\\
\alpha_c(\bot) & = & \emptyset \\
\alpha_c(r_1 \vee r_2) & = & \alpha_c(r_1) \cup \alpha_c(r_2) \\
\alpha_c(r\ast) & = & \comprehend{ r' \cdot r\ast }{ r' \in \alpha_c(r) }
\end{array}
$$

Partial derivatives can be lifted to words just as ordinary derivatives can be, and it is relatively easy to prove that the set of partial word derivatives of a regular expression is finite. We can show this even without taking a quotient, and so partial derivatives lend themselves even more neatly to an efficient implementation than Brzozowski derivatives do.

I'll illustrate this point by using Antimirov derivatives to construct a DFA-based regular expression matcher in ~50 lines of code. You can find the Ocaml source code here.

First, let's define a datatype for regular expressions.

type re = C of char | Nil | Seq of re * re | Bot | Alt of re * re | Star of re

So `C`

is the constructor for single-character strings,
`Nil`

and `Seq`

correspond to $\epsilon$ and
$r\cdot r'$, and `Bot`

and `Alt`

correspond to
$\bot$ and $r \vee r'$.

Next, we'll define the nullability function, which returns `true`

if the regular expression accepts the empty string, and `false`

if
it doesn't. It's the obvious recursive definition.

let rec null = function | C _ | Bot -> false | Nil | Star _ -> true | Alt(r1, r2) -> null r1 || null r2 | Seq(r1, r2) -> null r1 && null r2Now come some scaffolding code. The

`R`

module is the type
of finite sets of regular expressions, the `M`

module are
finite maps keyed by sets of regexps, and `I`

is the type
of finite sets of `int`

s. The `rmap`

function
maps a function over a set of regexps, building a new regexp. (I don't
know why this is not in the standard `Set`

signature.)
module R = Set.Make(struct type t = re let compare = compare end) let rmap f rs = R.fold (fun r -> R.add (f r)) rs R.empty module M = Map.Make(R) module I = Set.Make(struct type t = int let compare = compare end)The

`aderiv`

function implements the Antimirov derivative
function $\alpha_c$ described above. It's basically just a direct
transcription of the mathematical definition into Ocaml. The `deriv`

function applies the derivative to a whole set of regular expressions and
takes the union.
let rec aderiv c = function | C c' when c = c' -> R.singleton Nil | C _ | Nil | Bot -> R.empty | Alt(r, r') -> R.union (aderiv c r) (aderiv c r') | Seq(r1, r2) -> R.union (rmap (fun r1' -> Seq(r1', r2)) (aderiv c r1)) (if null r1 then aderiv c r2 else R.empty) | Star r -> rmap (fun r' -> Seq(r', Star r)) (aderiv c r) let deriv c rs = R.fold (fun r acc -> R.union (aderiv c r) acc) rs R.empty

Since the set of partial derivatives is finite, this means that the powerset of this set is also finite, and so by treating sets of partial derivatives as states, we can construct a deterministic finite-state automaton for matching regular expressions. Let's define an Ocaml type for DFAs:

type dfa = {size : int; fail : int; trans : (int * char * int) list; final : int list}

Here, `size`

is the number of states, and we'll use
integers in the range `[0,size)`

to label the states. We'll
use `fail`

to label the sink state for non-matching strings,
and take `trans`

to be the list of transitions. The `final`

field is a list of accepting states for the DFA.

Now, we'll need a little more scaffolding. The `enum`

function
is a "functional for-loop" looping from `i`

to `max`

.
We use this to write `charfold`

, which lets us fold over all of the
ASCII characters.

let rec enum f v i max = if i < max then enum f (f i v) (i+1) max else v let charfold f init = enum (fun i -> f (Char.chr i)) init 0 256

We use this to define the `dfa`

function, which constructs
a DFA from a regular expression:

let find rs (n,m) = try M.find rs m, (n,m) with _ -> n, (n+1, M.add rs n m) let dfa r = let rec loop s v t f rs = let (x, s) = find rs s in if I.mem x v then (s, v, t, f) else charfold (fun c (s, v, t, f) -> let rs' = deriv c rs in let (y, s) = find rs' s in loop s v ((x,c,y) :: t) f rs') (s, I.add x v, t, if R.exists null rs then x :: f else f) in let (s, v, t, f) = loop (0, M.empty) I.empty [] [] (R.singleton r) in let (fail, (n, m)) = find R.empty s in { size = n; fail = fail; trans = t; final = f }

The `find`

function takes a set of regular expression and returns
a numeric index for it. To do this, it uses a state `(n, m)`

, where
`n`

is a counter for a gensym, and `m`

is the map we use
to map sets of regular expressions to their indices.

The main work happens in the `loop`

function. The `s`

parameter is the state parameter for `find`

, and `v`

is
the visited set storing the set of states we have previously visited. The `t`

parameter is the list of transitions built to date, and `f`

are the
final states generated so far.

The `rs`

parameter is the current set of regular
expressions. We first look up its index `x`

, and if we have
visited it, we return. Otherwise, we add the current state to the
visited set (and to the final set if any of its elements are
nullable), and iterate over the ASCII characters. For each character
`c`

, we can take the derivative of `rs`

, and
find its index `y`

. Then, we can add the transition
`(x, c, y)`

to `t`

, and loop on the derivative.
Essentially, this does a depth-first search of the spanning tree.

We then kick things off with empty states and the singleton set of
the argument regexp `r`

, and build a DFA from the return
value. Note that the failure state is simply the index corresponding to the
empty set of partial derivatives. The initial state will always be 0, since
the first state we `find`

will be the singleton set `r`

.

Since we have labelled states by integers from 0 to `size`

,
we can easily build a table-based matcher from our `dfa`

type.

type table = { m : int array array; accept : bool array; error : int } let table d = { error = d.fail; accept = Array.init d.size (fun i -> List.mem i d.final); m = (let a = Array.init d.size (fun _ -> Array.make 256 0) in List.iter (fun (x, c, y) -> a.(x).(Char.code c) <- y) d.trans; a) }

Here, the `table`

type has a field `m`

for
the transition table, an array of booleans indicating whether the
state accepts or not, and the error state. We build the table in the
obvious way in the `table`

function, by building an array
of array of integers and initializing it using the list of transitions.

Then, we can give the matching function. The `matches'`

function takes a table `t`

, a string `s`

, an
index `i`

, and a current state `x`

. It will just
trundle along updating its state, as long as we have not yet reached
the end of the string (or hit the error state), and then it will
report whether it ends in an accepting state or not. The
`re_match`

function just calls `matches'`

at
index 0, in state 0.

let rec matches' t s i x = if i < String.length s && x != t.error then matches' t s (i+1) t.m.(x).(Char.code s.[i]) else t.accept.(x) let re_match t s = matches' t s 0 0

And that's it! Obviously more work is needed to make this a real regexp implementation, but the heart of the story is here. It's also possible to improve this implementation a bit further by adding a DFA minimization step, but that's a tale for another day. (DFA minimization is a very pretty little coinductive algorithm, and so makes a nice companion to the inductive story here.)

Thanks for the article, I have read about and implemented Brzozowski derivatives but have not seen Antimirov's work.

ReplyDeleteDo you have any comments on "Regular expression derivatives re-examined"? They implement the reduction relation for Brzozowski as "smart constructors" which does not seem to take that much more code than your presentation. It just seems that their approach requires some care in defining the comparison function. Are there other drawbacks, besides proving that the set is finite being harder?

Also - do you have any pointers on how to implement a recognizer rather than just a matcher? With parser combinator-like interface. I tried an approach based on threading "evidence transforms" through Brzozowski normalizazer and derivative functions, which seems quite cumbersome but can work - was wondering you would have any literature pointers for this. Thanks!

My experience has been that programming the smart constructors ends up taking more code than it really seems like it should. It's possible I've done it in an inelegant way, though.

ReplyDeleteBy recognizer, do you mean things like group captures?

Yes, ideally to fit this interface - https://github.com/t0yv0/ocaml-regex-algebraic/blob/master/regex.mli

ReplyDeleteThis comment has been removed by the author.

ReplyDeleteThis comment has been removed by the author.

ReplyDeleteRather than working up to equivalence wrt the ACUI equations, I think it is quite simple to compute a canonical regular expression for each equivalence class. Instead of having a binary ∨, you use an n-ary ∨ and keep the invariant that ∨-nodes do not have ∨ children. Then to canonicalize a regular expression, you first canonicalize its subexpressions, then if it is a ∨-node, you sort it by some arbitrary metric and remove duplicates and ⊥'s.

ReplyDelete