In our last SETT article on Clojure we introduced the immutable style of programming favored by Clojure: Immutability in Clojure - Part 1, Programming Without Variables. We showed how recursion can be used to do things like looping over a collection that would ordinarily require setting mutable variables. Traditionally, the functional programming languages Scheme and LISP that Clojure inherits from relied heavily on recursion. However we also said that recursion is not used as quite as extensively in Clojure: instead Clojure provides 'lazy sequences' which can provide a more efficient alternative to recursion in many cases. This SETT will examine how to use lazy sequences in Clojure. We will also show how immutable collection data structures are efficiently shared via Clojure's persistent collections. First however we will introduce Clojure's normal (non-lazy) sequences.
In Clojure, every collection like data structure can be treated like a sequence:
In addition many other things can behave like sequences:
A sequence supports three core functions:
In the following code examples the result of an expression is shown with the comment character ';' followed by '->' and the result, like this:
(+ 2 2) ; -> 4
(+ 2 2)is the actual Clojure expression, and 4 is the result of the expression.
The following code creates a vector of the three integers 1, 2, 3, and then retrieves the first element ('1'):
(first [1 2 3]) ; -> 1
The first function returns nil if passed an empty sequence or nil. The rest
function returns another sequence every element following the first element:
(rest [1 2 3]) ; -> (2 3)
rest will return an empty sequence when passed a sequence
with less than two elements. The cons function
constructs a new sequence from a single element and another
sequence:
(cons 4 [1 2 3]) ; -> (4 1 2 3)
Building on these three core three sequence functions, first,
rest and cons, Clojure's standard library
provides functionality for filtering, querying and manipulating sequences
similar to what one finds in the Java collections API, Guava collections,
etc. The full functionality of Clojure's sequences is outside the scope
of this article, but let's take a peek at a few key functions to get the
flavor of things. Note that all of the examples that follow can be
applied to any sequence like data structure in Clojure, including I/O
streams, regular expression matches, XML documents, etc.
(filter even? [1 2 3 4]) ; -> (2 4)
The filter function takes a test function, in this case the
even? function, calls the function on each element in the
original sequence, and returns a new (lazy) sequence containing all the
elements in the original sequence for which the test function returned
true.
(every? even? [1 2 3 4]) ; -> false
(every? even? [2 4 6 8]) ; -> true
The every? function takes a test function, in this case
even?, and returns true if the test function returns true for
every element in the sequence.
(map #(* 2 %) [1 2 3]) ; -> (2 4 6)
The map function takes a transformation function, calls the
transformation function on each element on the sequence, and returns a new
sequence of transformed elements. In this case we are using Clojure's
abbreviated syntax for an anonymous function:
#(* 2 %0) ; -> creates an anonymous function that multiplies its argument by 2
The # character marks the beginning of an anonymous function
definition, surrounded by parenthesis. The implicit single argument is
referred to by the % symbol.
Clojure sequences such as lists, vectors, sets and maps are immutable. So how do we 'add' items to a collection, if we can't modify the collection? Instead of modifying the original collection, we create a new collection that contains all the elements of the original collection, plus the new element that we want to add. Here is an example using maps:
(def m1 {"a" 1 "b" 2}) ; -> create map m1 with two key/value pairs
(def m2 (assoc m1 "c" 3)) ; -> 'add' new key/value pair to m1, assign to m2
First we create map m1 using the Clojure curly brace syntax
for map literals. The m1 has two key value pairs, "a" -> 1
and "b" -> 2. Next, we call the assoc function which takes
three arguments: a key ("c"), a value (2), and an existing map
(m1). The assoc call returns a new map
containing all three key / value pairs.
A natural question arises here: isn't this horribly inefficient? Surely we
can't afford to make copies of entire collections every time we want to
'add' one element. The answer is that Clojure's collection use a
technique called 'structural sharing' to only compute the differences
between two different versions of a collection, and share the common bits
in memory. Version control systems like Subversion or GIT use similar
techniques. Logically, a version control system stores separate
copies of every version of a file. Physically, however, it only
stores the diffs, and computes the full version of a file on demand.
Clojure's 'persistent' collections work the same way. Here is another
example using lists. First let's create an initial list l1
with two elements:
(def l1 (list 1 2))
Next let's create new two lists, l2 and l3 from
contents of l1 and 'adding' one extra element:
(def l2 (cons 3 l1))
(def l3 (cons 4 l1))
Logically, we have three separate lists. In memory however, lists
l2 and l3 share the contents of the first list
l1. We can demonstrate this as follows. First let's verify
that the contents of the 'rest' of l2 and l3 are
equivalent to the contents of l1:
(= (rest l2) l1) ; -> returns true
(= (rest l3) l1) ; -> returns true
That just proves that the cons function worked correctly. But let's
see if the contents are identical and therefore shared in memory:
(identical? (rest l2) l1) ; -> returns true
(identical? (rest l3) l1) ; -> returns true
The identical? tests to see if two objects point to the same
region in memory. It is similar to the == operator in Java.
As we can see, the contents of list l1 are shared between all
three lists, l1, l2 and l3.
With the introduction to basic sequences out of the way let's look at lazy
sequences. A lazy sequence in Clojure is a virtual sequence that behaves
just like a 'real' sequence except that its elements are computed on
demand, lazily. The filter and map functions
demonstrated above actually return lazy sequences.
Since each element is realized on demand, we can even create infinite lazy
sequences (as long as we don't try to realize every element in the infinite
sequence). An example of an infinite sequence would be the set of all whole
numbers. Let's create a lazy sequence that produces whole
numbers. To create a lazy sequence we use the lazy-seq macro.
The lazy-seq macro takes an expression that returns a sequence,
and only invokes that expression as it is needed. Our first version
creates a lazy sequence that increments from a starting number:
(defn whole-numbers-from [n]
(cons n (lazy-seq (whole-numbers-from (inc n)))))
We use the cons function to create a sequence whose first
element is n, and the rest is the lazy sequence
returned by recursively calling whole-numbers-from again,
incrementing the value of n by one. The key is the call to
lazy-seq. Without this the function would recur forever,
until the end of time or we run out of whole numbers, whichever comes
first. (Actually we would get a stack overflow error first). With
the lazy-seq call, the information needed to generate
each subsequent element is remembered, or memoized, to be produced on
demand, as needed. We can retrieve a finite set of elements from our
logically infinite sequence using the take function:
(take 3 (whole-numbers-from 0)) ; -> (0 1 2)
As each element is produced, whole-numbers-from is called
again with the incremented value of n, and the information
for the previous call to whole-numbers-from is discarded and
eligible for garbage collection. Thus nothing is placed on the stack; the
stack does not nest more than one level and we do not get a stack
overflow, even for large values of n.
We can provide a simpler version that starts at 0 using the built in iterate
function:
(def whole-numbers (iterate inc 1))
The iterate function takes a function and starting value as arguments, and produces
a lazy sequence of n, (f n), (f (f n)) etc. Note that we do not need to create
whole-numbers as a function definition
using defn. Instead we assigned the lazy sequence returned
by the iterate call directly to the symbol definition
whole-numbers using def. We can call our whole-numbers
definition like this:
(take 3 whole-numbers) ; -> (1 2 3)
We can call standard sequence functions
including first, filter,
map, etc on our lazy sequence:
(first whole-numbers) ; -> 1
(take 3 (map #(* 2 %) whole-numbers)) ; -> (2 4 6)
The call to take is very important - we can't transform the
entire set of whole numbers without running out of memory first!
To review what we have covered so far, let's look at four implementations for calculating the numbers of the Fibonacci sequence. The Fibonacci numbers are defined as follows: the first two Fibonacci numbers are 0 and 1, and each of the following numbers is the sum of the previous two. So the third Fibonacci number is 1, because 0 + 1 = 1. The fourth Fibonacci number is 2, because 1 + 1 = 2. The first ten Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, 13, 21, 34.
We will create four implementations for calculating the Fibonacci sequence: the first naive implementation will use normal, non-tail recursion. It should be easy enough to follow, but will not be efficient. We will refactor this implementation to use tail recursion. This version will be more efficient but harder to follow. Next we will implement the sequence using lazy sequences. Finally we will look at an elegant implementation that uses some of the built in functions in Clojure's standard sequence library.
Our first, naive implementation matches the problem definition well:
; naive implementation:
(defn fib [n]
(condp = n
0 0
1 1
(+ (fib (- n 1)) (fib (- n 2)))))
; call like this:
(fib 9) ; -> 34
The condp macro is a kind of switch statement. It takes
the name of a two argument test function, = in this case,
a value to use as the first argument to the test function, and a set
of clauses where the first element of the clause is passed to the test
function and the second part of the clause is the code to execute if
the test evaluates to true.
So, our fib function returns 0 if n is 0, 1
if n is 1, and otherwise returns the sum of the previous
two Fibonacci numbers. This fits our English language description of
the Fibonacci sequence above well: the first Fibonacci number (0) is
0, the second (1) is 1, and each following number is the sum of the
previous two. However, this implementation will not work for large
values of n:
(fib 10000000) ; -> java.lang.StackOverflowError :(
Each call to fib recursively places two more calls on the
stack. You will quickly overflow the JVM stack with this approach.
In the previous
article in this series we saw how to use the recur
function to replace simple recursion with tail call recursion. Tail
call recursion allows the compiler to unroll the recursion into a
non-recursive loop that is more efficient. In Clojure this is
achieved by replacing the recursive function call with
recur. The call to recur must be last
expression evaluated in the function (thus the term 'tail call').
However, in our fib function the + function is the last
expression. The two recursive calls to fib happen before
the + call. If we attempt to replace the two calls to
fib with recur our code will throw a
java.lang.UnsupportedOperationException. We will have to
rearrange our function to allow for just one recursive call, and that
call will need to be the last expression evaluated. One approach is
to introduce two accumulator values to 'remember' the two previous
Fibonacci numbers on the stack:
(defn tail-fib
([n] (tail-fib n 0 1))
([n prev2 prev1]
(condp = n
0 prev2
1 prev1
2 (+ prev1 prev2)
(recur (dec n) prev1 (+ prev1 prev2)))))
Our new function is overloaded based on the number of arguments based. If
passed a single argument n, it calls itself with three
arguments: (tail-fib n 0 1). The the last two arguments,
prev1 and prev2 are the accumulator values that
'remember' the last two Fibonacci numbers calculated. Since the first two
numbers of the sequence are 0 and 1 we can hard code those two and then
start accumulating, with the call to recur. The
recur is the last expression of the function.. This
implementation works, even for large values of n, but is
harder to read. It does not match the problem definition as well.
The phrase 'Fibonacci sequence' suggests an alternative implementation: a lazy sequence that calculates each succeeding Fibonacci number on demand. This approach turns out to be fairly straightforward (once one is comfortable with lazy sequences), and efficient:
(def fib-seq
((fn fib [prev1 prev2]
(lazy-seq (cons prev1 (fib prev2 (+ prev1 prev2)))))
0 1))
(take 10 fib-seq) ; -> (0 1 1 2 3 5 8 13 21 34)
(nth fib-seq 9) ; -> 34
The definition of fib-seq is similar to
whole-numbers-from above. The fn call
creates a little helper function, fib, that takes the two
previously accumulated numbers in the sequence, prev1 and
prev2. It returns a lazy sequence whose first element is
prev1. The following elements are the result of the
fib helper function recursively calling itself, passing
prev2 and the sum of the previous two numbers. After
creating the helper function fib, our sequence definition
invokes fib with the first two values of the sequence,
0 and 1. This is the reason for the double
level of parenthesis around the fn call: first the
function fib is defined, and then it is called with the
arguments 0 and 1 to see our sequence.
Since this recursive call is placed inside the call to
lazy-seq, it is only invoked on demand, as needed, one
element at a time. The stack does not nest and we do not run out of
stack space, even for large values of n.
The Clojure standard library comes with many useful functions that create
lazy sequences. We have already seen two of them, iterate
and map. We can use these functions to create a simpler version
of the Fibonacci sequence:
(def fibs (map first (iterate (fn [[prev1 prev2]] [prev2 (+ prev1 prev2)]) [0 1])))
(take 10 fibs) ; -> (0 1 1 2 3 5 8 13 21 34)
(nth fibs 9) ; -> 34
The call to iterate loops over pairs of the two
previously accumulated Fibonacci numbers. Recall
that iterate returns a lazy sequence. The Fibonacci
sequence itself is simply the first element of each pair, so the call
to (map first... simply translates the nested sequence
of pairs into single elements, the elements of the Fibonacci sequence.
The map function also returns a lazy sequence. This
version was suggested by Christophe Grand and described in the book Programming
Clojure by Stuart Halloway. This version reuses existing
functions, is the simplest version, and is efficient.
In this SETT we demonstrated how Clojure's persistent collection library uses structural sharing to allow for efficient use of immutable collections. We also showed how Clojure's lazy sequences can create efficient, virtual collections over large (even infinite) sequences. We implemented different methods of calculating the Fibonacci sequence to show how to replace simple recursion with more efficient tail recursion, how to replace tail (or simple) recursion with lazy sequences, and how the lazy sequence functions in the Clojure standard library often remove the need to implement your own lazy sequences.