The Miracle of F# Continuation Passing

One thing I consistently include when preparing functional data structures for the FSharpx.DataStructures library is the IEnumerable interface, which facilitates function composition by seamlessly flattening structures. Using Brian McNamara’s indispensable series on catamorphisms as a guide, I set about flattening the BinomialHeap.

Ignoring isMaximalist : bool for now, we see a recursive structure of

list -> tuple -> list -> tuple -> etc.

and we want to flatten the elements 'a into an 'a list: a classic case for a continuation-passing style (CPS) fold.
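For reference, here is a simplified sketch of the shape involved (names are illustrative; the real FSharpx type also carries the heap’s isMaximalist flag alongside the node list):

type BinomialTree<'a> = Node of int * 'a * list<BinomialTree<'a>>
// a heap pairs isMaximalist : bool with a list<BinomialTree<'a>>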

This code shows a first attempt at hard-coding the flatten we want to accomplish. There are only four patterns to account for, so reasoning about the steps to take is simple. The fifth “can’t get here” pattern exists purely to prevent a warning message. (The compiler cannot see that there are only four paths.)
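A minimal sketch of what that first attempt looked like (a reconstruction, not the original gist; the output order is unimportant because a heap’s internal order is meaningless anyway):

let flatten (h : list<BinomialTree<'a>>) : 'a list =
    let rec loop h cont =
        match h with
        | Node(_, a, [])::[] -> cont [a]
        | Node(_, a, [])::tl -> loop tl (fun acc -> cont (a :: acc))
        | Node(_, a, h')::[] -> loop h' (fun acc -> cont (a :: acc))
        | Node(_, a, h')::tl ->
            loop h' (fun lacc -> loop tl (fun racc -> cont (a :: (lacc @ racc))))
        | _ -> failwith "can't get here"
    loop h (fun x -> x)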

This code does what we want, producing an 'a list of all the heap’s elements, but the continuation on the last pattern appends lists, an inefficient operation we want to avoid…but how?

Let’s see if we gain anything by abstracting out the flatten logic to make this a true generic fold.
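Here is a sketch of where that leads (again a reconstruction; I name it foldHeapProxy here to distinguish it from what comes later): one function parameter per match pattern, as if the patterns were cases of a discriminated union.

let foldHeapProxy singleF consF nodeF (h : list<BinomialTree<'a>>) =
    let rec loop h cont =
        match h with
        | Node(_, a, [])::[] -> cont (singleF a)
        | Node(_, a, [])::tl -> loop tl (fun acc -> cont (consF a acc))
        | Node(_, a, h')::[] -> loop h' (fun acc -> cont (consF a acc))
        | Node(_, a, h')::tl ->
            loop h' (fun lacc -> loop tl (fun racc -> cont (nodeF a lacc racc)))
        | _ -> failwith "can't get here"
    loop h (fun x -> x)

// flatten in terms of the abstracted fold; the list append survives in nodeF
let flattenHeap h =
    foldHeapProxy (fun a -> [a]) (fun a acc -> a :: acc) (fun a l r -> a :: (l @ r)) h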

So far I have been treating this problem as a classic example from part three of Brian’s series. In abstracting a CPS fold over a discriminated union, you end up with one continuation function for each case in the union. Our data type is not a four-case discriminated union, but I have been treating the number of match patterns as a proxy for one. (The second and third patterns take the same function. That should be a hint there is something wrong with this approach.)

Ultimately, abstracting over (a proxy for) a discriminated union has been a dead end. We do have an abstracted fold method using CPS, but there seems to be no way to get rid of that annoying list append.

The correct approach is through the tree CPS method Brian teaches us in part two. Except for the name “BinomialTree” it is not immediately obvious how a recursive structure like

list -> tuple -> list -> tuple -> etc.

is any kind of tree, let alone a binary tree. Only when we write out the pattern

Node(_, a, h')::tl

do we see the binary tree:

              ('a element to flatten)
             /                        \
           h'                          tl
(list within the element's tuple)    (tail of the list of which
                                      the element is head)

So the tree structure has been apparent from the time we wrote out the patterns for our match algorithm, and we know those patterns represent the complete set when iterating through our heap. But to accomplish a binary tree CPS fold you need a “null” pattern (a pattern which contributes no data to the fold operation) at the bottom of each continuation, in order to thread through the identity function. This is why you see so many tree examples in functional languages ending in “leaf”, a seemingly useless end-of-branch marker carrying no data.

There is no “leaf” in our data structure, but let’s go through the motions of implementing the classic tree CPS fold, and maybe that will help us figure something out.
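A sketch of those motions, shaped like Brian’s binary-tree fold (nodeF and leafV follow his naming):

let foldHeap nodeF leafV h =
    let rec loop h cont =
        match h with
        | Node(_, a, h')::tl ->
            loop h' (fun lacc -> loop tl (fun racc -> cont (nodeF a lacc racc)))
        | _ -> failwith "can't get here"
    loop h (fun x -> x)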

And then a miracle occurs

I don’t believe in software miracles, but all of a sudden the newest generation of our fold starts failing with “can’t get here”. We are now passing something new through our loop: the empty list. It turns out there was a “null” pattern all along. Change

| _ -> failwith "can't get here"

to

| _ -> cont leafV

and we have a working CPS fold on the BinomialHeap. (I’m using Brian’s naming standard for the fold parameters.)

Let me break apart the inOrder call to foldHeap to make the fold mechanism more explicit:
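Approximately this (the step four shape referred to below, reconstructed):

let inOrder h =
    let x = foldHeap (fun a lacc racc acc -> lacc (a :: (racc acc))) (fun acc -> acc) h
    x []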

A heap sidebar

The code is now perfectly suited for exploring CPS execution. The inOrder member we have been evolving does an in-order tree traversal of the heap’s internal data representation, an order which, in the case of heaps, is entirely meaningless. We only care that a complete traversal takes place. We will sort the resulting list to get it into proper head/tail order.

You don’t believe the internal order is meaningless? Here’s the internal structure we will walk through:


Node (0, "b", [])
Node (1, "a", [Node (0, "c", [])])

In this case the isMaximalist : bool I told you to ignore above is set to false, so the resulting list should be

["a"; "b"; "c"]

but not as a result of the traversal, only after we ourselves sort the output list of the fold. The in-order traversal results in

["b"; "c"; "a"]

Now back to our story

The step four code makes it easier to follow the logic and clearly shows how the empty lists satisfy the null pattern. Let’s walk through it.

current Node                executing code

                            cont = (fun x -> x)

(0, "b", [])                Node(_, a, [])::tl -> loop [] (fun lacc ->
(head of top list)
                            cont =
                              (fun lacc ->
                                loop [] (fun racc ->
                                  (nodeF a lacc racc) |> (fun x -> x)))

                            cont leafV

The remaining in-order tree traversal builds out increasingly nested continuations.

                            Node(_, a, [])::tl -> loop [] (fun lacc ->
                              loop [] (fun racc ->

(1, "a", …)                 Node(_, a, h')::[] -> loop h' (fun lacc ->
(head of top right list)

(0, "c", [])                Node(_, a, [])::[] -> loop [] (fun lacc ->

                            cont leafV

                            Node(_, a, [])::tl -> loop [] (fun lacc ->
                              loop [] (fun racc ->

                            cont leafV

(1, "a", …)                 Node(_, a, h')::[] -> loop h' (fun lacc ->
                              loop [] (fun racc ->

                            cont leafV

Swapping in the nodeF parameter, and substituting its parameter names and the element parameter name for “x”, we get the crucial code fragment. I am also using the result-passing operator (|>) for readability.

(fun a lacc racc acc -> lacc (a :: (racc acc))) |> (fun x -> x))

Another substitution: lacc is replaced by its value, the identity function, applied via a pipe. Coincidentally the symbol “a” from the code gets replaced by the value “a”.

(fun a lacc racc acc -> ("a" :: (racc acc))) |> (fun acc -> acc) |> (fun x -> x))

One more application of the result-passing operator, for clarity.

(fun a lacc racc acc -> ("a" :: (acc |> racc))) |> (fun acc -> acc) |> (fun x -> x))

At the bottom of all the right trees we eventually hit an empty list, so we make one more substitution.

(fun acc -> ("a" :: (acc |> (fun acc -> acc)))) |> (fun acc -> acc) |> (fun x -> x))

(fun x -> x) starts the whole continuation unwinding (executing). Notice the anonymous function from nodeF has only three of its four parameters filled in. This is where the code we broke out in inOrder comes into play. I teased apart inOrder to show explicitly how the CPS pattern wraps the call to foldHeap in a function taking the implicit parameter acc : 'a list.

x []

We have built a partially applied function awaiting one final parameter; [] provides the seed empty list passed down through the nested acc parameters.

Here is a reasonable facsimile of the generated function and execution:
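Something like this, for our three-element heap (hand-assembled, so hedge it as a facsimile rather than compiler output):

// the partially applied function x, unwound by hand;
// lacc for "b" is leafV, and racc is the fold of the right subtree rooted at "a"
let x : string list -> string list =
    fun acc -> (fun acc' -> acc') ("b" :: ((fun acc'' -> "c" :: ("a" :: acc'')) acc))

x []    // ["b"; "c"; "a"]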

And here is the finished product:
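A close sketch (the version that shipped may differ in naming details and in how the heap type is wrapped):

let foldHeap nodeF leafV h =
    let rec loop h cont =
        match h with
        | Node(_, a, h')::tl ->
            loop h' (fun lacc -> loop tl (fun racc -> cont (nodeF a lacc racc)))
        | _ -> cont leafV
    loop h (fun x -> x)

let inOrder h = foldHeap (fun a lacc racc acc -> lacc (a :: (racc acc))) (fun acc -> acc) h []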

Were it not for line breaks added for readability, this would be seven lines of code: six to create a generic fold structure, and one to perform an in-order traversal. However, this is a case where F#’s famous expressiveness suffers from its succinctness. If you did not know beforehand how this works, good luck figuring it out!

The complete source code for the BinomialHeap can be found here.

Silicon Valley Code Camp 2012

I’m pleased to announce I will be speaking at this year’s Silicon Valley Code Camp on Saturday, October 6 at 3:30 PM. My session, Working with Functional Data Structures, Practical F# Application, extends my recent articles on the topic. Here is the session description:


Beyond the bread-and-butter singly linked list are dozens of practical Functional Data Structures available to mask complexity, enable composition, and open possibilities in pattern matching. This session focuses on data structures available today in F#, but has practical application to most any functional language. The session covers what makes these structures functional, when to use them, why to use them, choosing a structure, time complexity awareness, garbage collection, and practical insights into creating and profiling your own Functional Data Structure. This session is intended for intermediate level functional programmers and above, but should be of interest to functional programming newbies as well.

SVCC 2012 is two full days of tech talk, including breakfasts and lunches, and it’s all FREE! But you have to register. Dozens of interesting and useful sessions, great speakers, thousands of coders / hackers / technophiles, and did I mention free food?

So click my session above (it helps promote my talk) then check out the rest of the SVCC site and register.

Double-ended Queues for F#

Thanks to the FSharpx team I had the opportunity to add double-ended queues to FSharpx.Core. Think of double-ended queues, or deques (Okasaki’s spelling, which I like), as lists that work in both directions. They have mirror-image values to the familiar head and tail, namely last and init, and a value named snoc (cons spelled backwards) doing the mirror-image operation of cons. Imagine the possibilities of list when you can also add and remove elements at the end. For instance, use pattern matching on both head and last:

match q with
| Cons(hd, Snoc(mid, lst)) -> hd, mid, lst  //first, middle, and last
| _ -> ...

Which deque is right for you?

The four deques available in FSharpx.Core implement the following signature:

inherit System.Collections.IEnumerable
inherit System.Collections.Generic.IEnumerable<'a>

Cons : 'a -> IDeque<'a>
Head : unit -> 'a
Init : unit -> IDeque<'a>
IsEmpty : unit -> bool
Last : unit -> 'a
Length : unit -> int
Snoc : 'a -> IDeque<'a>
Tail : unit -> IDeque<'a>
Uncons : unit -> 'a * IDeque<'a>
Unsnoc : unit -> IDeque<'a> * 'a

They all also have a module-level value, OfCatLists, which concatenates two lists to create a deque. Deques internally implement two lists (or lazy lists), with the second list reversed; that’s basically all there is to allowing operations at both ends. (For in-depth discussion of double-ended queues see either Okasaki’s thesis or his book.) This makes OfCatLists a very fast way to concatenate two lists. Across all four deque implementations the operation is competitive with the FSharpx data structure DList (difference list), which implements an append value that is truly O(1). OfCatLists is not truly O(1), but two of the deque structures, Deque and BatchedDeque, actually out-perform DList right through a scale of concatenating 10,000-element lists, with DList only barely out-performing them at 100,000. (To be sure, we are talking time scales of less than 0.3 milliseconds on a 64-bit system.)
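Here is the two-list idea in miniature (my illustration of the principle, not the FSharpx internals):

// front holds the head end; rear holds the last end, reversed
type MiniDeque<'a> = { Front : 'a list; Rear : 'a list }

let cons x d = { d with Front = x :: d.Front }             // O(1)
let snoc d x = { d with Rear = x :: d.Rear }               // O(1)
let ofCatLists xs ys = { Front = xs; Rear = List.rev ys }  // pays only for the rev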

Two deques, BankersDeque and RealTimeDeque, implement additional values:

Lookup : int -> 'a
Remove : int -> BankersDeque<'a>
TryRemove : int -> BankersDeque<'a> option
Rev : unit -> BankersDeque<'a>
Update : int -> 'a -> BankersDeque<'a>
TryUpdate : int -> 'a -> BankersDeque<'a> option

Rev is always O(1), a nice feature of deques.

Although nowhere near as adept at the update operation as the AltBinaryRandomAccessList data structure (specifically designed for update), deques exhibit good across-the-board performance in the index-enabled operations Lookup, Remove, and Update. They have a theoretical worst-case time complexity of ~O(n/2) on index operations because the elements are partitioned between two lists.

But watch out! The actual time complexity on index operations is O(d), where “d” is the displacement into the internal list containing the indexed element. The relative sizes of the internal lists can vary. BankersDeque and RealTimeDeque require a “front-back stream ratio constant” in their construction. I’ve implemented a default value of 2 where this is required, but you can also specify a higher value. Essentially it ensures neither list is ever larger than a constant times the size of the other. BatchedDeque does nothing with the relative sizes until you request a Tail or Init, at which point it brings the lists into a relation where neither is more than twice the size of the other. And finally, Deque is a naive implementation that does not regulate the relative sizes at all.

They all implement IEnumerable and have all the Seq module values available, but be sure always to use the native Lookup and not Seq.nth. This is one of those cases where a straight-up comparison of time complexity (~O(n/4) versus O(n) over many random iterations) does not tell the whole performance story. Actual Seq.nth performance is much worse than its time complexity would suggest.
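For instance, with member names as listed above (the casing of the module-level function is my assumption):

let d = BankersDeque.ofCatLists [1 .. 5000] [5001 .. 10000]
d.Lookup 7500      // walks at most about half of one internal list
Seq.nth 7500 d     // enumerates from the front on every call; far slower in practice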

So, which deque is right for your purposes? As is always the case with complexity that makes reasoning about performance hard, do your own profiling!

Purely Functional Stepchildren

Lately I’ve taken a closer look at Julien’s F# implementations of Chris Okasaki’s Purely Functional Data Structures [1]. AltBinaryRandomAccessList [2] caught my eye because, judging from its existing update function, it looked pretty simple to implement a remove function. As you may have noticed in all the major collections of F# data structures (FSharp.Core.Collections, FSharpx.Core.DataStructures, and PowerPack), remove is pretty rare. And both update and remove seem to be neglected stepchildren in the purely functional data structure world. (By update and remove I am referring to functions updating and removing any element by index.)

Let’s look at Okasaki’s “pseudo-canonical” [3] update for list, implemented in F#.

let rec loop i y (l:'a list) =
    match (i,y,l) with
    | _,_,[] -> raise (System.Exception("subscript"))
    | 0,y',_::xs -> y'::xs
    | i',y',x::xs -> x::(loop (i'-1) y' xs)

Do you see the problem? I’ll get back to it below.

How else might we implement update? We could just punt: convert to Array, update, and convert back to list.

let punt i y (l:'a list) =
    let a = List.toArray l
    a.[i] <- y
    List.ofArray a

And our last candidate is a hybrid, in which we progressively take the tail, collecting each head in an Array, until we pass the element to update, then cons the new element and the collected elements back on.

let hybrid i y (l:'a list) =
    if (i = 0) then List.Cons (y, (List.tail l))
    else
        // copy the first i elements into an array (in reverse index order)
        let rec loop i' (front:'a array) back =
            match i' with
            | x when x < 0 -> front, (List.tail back)   // drop the element being replaced
            | x ->
                Array.set front x (List.head back)
                loop (x-1) front (List.tail back)

        let front', back' = loop (i - 1) (Array.create i y) l

        // cons the array elements back onto the new element and the remaining tail
        let rec loop2 i' frontLen (front'':'a array) back'' =
            match i' with
            | x when x > frontLen -> back''
            | x -> loop2 (x + 1) frontLen front'' (front''.[x]::back'')

        loop2 0 ((Seq.length front') - 1) front' (y::back')
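A quick sanity check of both (expected results in comments):

punt   2 99 [0; 1; 2; 3]  // [0; 1; 99; 3]
hybrid 2 99 [0; 1; 2; 3]  // [0; 1; 99; 3]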

AltBinaryRandomAccessList is a structure with the same basic functions as List, but optimized for update. Here’s the type structure:

type AltBinRndAccList<'a> =
    | Nil
    | Zero of AltBinRndAccList<'a * 'a>
    | One of 'a * AltBinRndAccList<'a * 'a>

The element grouping represents the count of elements as a binary number, with the least significant digit to the left [4]. This allows for a member update function which runs in O(log n) time. (The best we have seen so far is O(i), where i is the index of the target element; punt runs in O(n).)

static member fupdate : ('a -> 'a) * int * AltBinRndAccList<'a> -> AltBinRndAccList<'a> = function
    | f, i, Nil -> raise Subscript
    | f, 0, One(x, ps) -> One(f x, ps)
    | f, i, One (x, ps) -> AltBinRndAccList.cons x (AltBinRndAccList.fupdate (f, i-1, Zero ps))
    | f, i, Zero ps ->
        let f' (x, y) = if i % 2 = 0 then f x, y else x, f y
        Zero(AltBinRndAccList.fupdate(f', i/2, ps))

Now let’s look at actual performance for our update alternatives [5].

          10k Random Updates                      One-time Worst Case
Size      Structure  Multiple  Time               Structure  Multiple  Time
(times in milliseconds unless noted; Multiple is relative to the fastest entry at that scale)

10^2      PC         -         2.9 ms             Punt       -         0.2 ms
          Hybrid     1.4       4.0                PC         1.1       0.2
          Punt       1.5       4.5                Hybrid     4.1       0.8
          RndAcc     1.8       5.3                RndAcc     7.7       1.5

10^3      RndAcc     -         8.2                Punt       -         0.2
          Hybrid     3.6       29.6               PC         1.1       0.2
          Punt       5.8       47.6               Hybrid     4.1       0.8
          PC         6.2       50.3               RndAcc     8.0       1.6

10^4      RndAcc     -         11.5               Punt       -         0.3
          Hybrid     27.8      320.3              PC         1.3       0.4
          Punt       46.4      534.9              Hybrid     3.2       0.9
          PC         79.8      920.2              RndAcc     6.5       1.8

10^5      RndAcc     -         14.8               Punt       -         1.0
          Hybrid     315.5     4.67 sec           Hybrid     1.5       1.5
          Punt       630.7     9.34               RndAcc     2.0       2.0
          PC         stack overflow

By way of comparison, Array, which is not a purely functional structure, performs 10,000 updates in 0.1 millisecond at all scales, and really has no “worst case” single update. Notice also that because its function is not tail recursive, Pseudo-Canonical eventually takes a stack overflow. Also of interest: starting at the scale of 1,000 elements, Punt outperforms Pseudo-Canonical in random updates, even though the former is O(n) and the latter O(i). Time complexity does not tell the whole performance story.

So where does this leave remove? As it turns out, my original idea to modify the update of AltBinaryRandomAccessList resulted in a function that was not tail recursive; applying the hybrid technique to it produced a satisfactory algorithm [6] and marginally improved performance. But like nearly every remove implemented in a purely functional list-like structure, the best time complexity you are going to get is O(i). (With a double-ended queue you can achieve ~O(i/2).)
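To give the flavor on a plain list, here is the hybrid technique applied to remove (my sketch, not the library function referenced in the footnote):

let removeAt i (l : 'a list) =
    if i = 0 then List.tail l
    else
        // skim the first i elements into an array, tail-recursively
        let front = Array.zeroCreate i
        let rec skim j back =
            if j = i then back
            else
                front.[j] <- List.head back
                skim (j + 1) (List.tail back)
        let back' = List.tail (skim 0 l)    // drop the element at index i
        // cons the skimmed elements back on; Array.foldBack is iterative
        Array.foldBack (fun x acc -> x :: acc) front back'

removeAt 1 ["a"; "b"; "c"]   // ["a"; "c"]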

Here’s the comparative summary of remove performance on list-like structures. This time I include Array, where I copy the retained elements into a new structure one element shorter.

          10k Random Deletes                      One-time Worst Case
Size      Structure  Multiple  Time               Structure  Multiple  Time
(times in milliseconds unless noted; Multiple is relative to the fastest entry at that scale)

10^2      Array      -         1.4 ms             Array      -         0.1 ms
          PC         1.9       2.6                PC         2.7       0.2
          Hybrid     2.7       3.7                Punt       4.4       0.3
          Punt       3.9       5.3                Hybrid     10.5      0.7
          RndAcc     36.5      49.7               RndAcc     40.4      2.7

10^3      Array      -         11.2               Array      -         0.1
          Hybrid     2.6       28.7               PC         2.7       0.2
          PC         4.3       48.3               Punt       4.3       0.3
          Punt       4.4       49.3               Hybrid     10.4      0.7
          RndAcc     40.1      448.5              RndAcc     41.1      2.9

10^4      Array      -         107.9              Array      -         0.1
          Hybrid     3.0       321.5              PC         3.2       0.3
          Punt       5.5       589.9              Punt       4.5       0.4
          PC         8.3       898.8              Hybrid     10.0      0.8
          RndAcc     48.8      5.27 sec           RndAcc     45.8      3.8

10^5      Array      -         1.43 sec           Array      -         0.2
          Hybrid     3.2       4.51               Punt       6.0       1.1
          Punt       7.7       11.04              Hybrid     8.1       1.5
          RndAcc     42.9      61.39              RndAcc     89.8      16.1
          PC         stack overflow

As you can see, remove is a viable function for moderate-sized data structures, but suffers from the limitation of its time complexity as you scale up. You could easily modify remove functions to process an ordered list or range of indexes and still have O(max i) performance.

One remaining question for purists: are the stepchildren still Purely Functional when they use mutation inside a function?

Notes

[1] Okasaki, Purely Functional Data Structures, 1998

[2] AltBinaryRandomAccessList

[3] Okasaki, 1998.

[4] See Okasaki, 1998, or Okasaki, Purely Functional Data Structures, 1996, for in-depth discussion of numerical representations of data structures.

[5] Times were taken on a not particularly fast dual-core 64-bit system. Time multiples were calculated at a finer granularity, and so may not perfectly match the timings in milliseconds. AltBinaryRandomAccessList is abbreviated to “RndAcc”.

[6] I hope to release it soon in an open source library.

Benchmarking F# Data Structures — Loading Data

In many disciplines engineers rely on performance characteristic reference tables of the materials and structures they work with. For a host of reasons these aids are lacking in the field of engineering application software. In the previous article in this series I introduced DS_Benchmark, an open source project for benchmarking F# data structure performance. Now I’m rolling out the first performance tables constructed with this tool. At the same time I’m launching the F# Data Structures pages as a centralized source of information on data structures across all available sources.

But won’t performance vary on different systems?

Of course. That has long been a problem with benchmarks expressed in terms of how much time operations take. I am using a different approach: comparing operations in terms of their relative speed. Running tests on 32- and 64-bit systems, I found the performance ratios remarkably consistent despite big differences in absolute performance.

Loading Data Structures

The Initial Data Performance tables rank data structure, originating data source, and data loading action from fastest to slowest, for data sizes ranging from 1 to 10^6. The column “TIME REL TO BEST” reports how many times longer a given combination takes compared to the top-ranked one.

Naturally you will be interested in specific comparisons. At the scale of 10^5 it is not surprising that loading a List from another List of ascending integers ranks fastest for the “Init” action (List.ofList). But say your interest is in comparing the performance of FSharpx DList to FSharpx PersistentVector when loading from an integer Array. Read down the chart until you find

FSharpx-DList ArrayIntAsc Init 1.0 96.1

and

FSharpx-PersistentVector ArrayIntAsc Init 1.0 199.4

The initial load of DList takes 96.1 times as long as the top-ranked combination, so divide PersistentVector’s number by this to learn that its performance is 2.1 times slower than DList’s.

For more information on how the Performance Tables are constructed and how to interpret them see Performance Metric Tables Methodology and Reading the Data Structure Performance Tables.

While the tables comparing performance across data structures will probably be your go-to tables, there are also tables comparing performance within a data structure: one comparing different methods of loading data (there are some surprising results), and another showing actual performance when scaling up data sizes.

Pathologies and other observations

Reasoning about any kind of software performance is notoriously hard. Modern CPU pipelining architectures and instruction and memory caches make it darn near impossible. And there are many considerations to take into account regarding the choice of language. You need to do your own profiling. These performance tables are no substitute, but meant to be a guide. Here are some interesting things that come to light with the performance tables:

  • Structures you would expect to display more or less O(n) data loading performance seldom do. It is much more common to see better performance up to some threshold (often 10^4) and worse than O(n) beyond it.
  • The Power Pack library’s HashMultiMap structure loads significantly faster using the AddOne action (a For...do loop) than the new() constructor, up to scale 10^5 where performance becomes roughly the same.
  • FSharpx’s RealTimeQueue will stack overflow at scale 10^5 on both the 32 and 64-bit systems I tested on.
  • FSharpx’s BootstrappedQueue scales remarkably well when loading from a List. (If you looked at the code for this structure you would understand why.) As a consequence the AddOne action scales badly in comparison to Init at scale 10^5.
  • Not only is Array generally the fastest structure to initialize at all scales, it is often the fastest source for loading another structure.

Caveats

It’s important to remember the data load performance charts are the result of the performance interplay of the source and target data structures. If you are loading from I/O or some other “neutral” source consider comparing performance characteristics of Array loads, since Array seems to be consistently fast, and therefore somewhat neutral.

You will also see some strange blips at certain scales and for certain data and ordering characteristics. While I can’t vouch for every one, the oddities I looked at were consistently repeatable. I was still doing some development work while compiling these statistics. There is the possibility of bad data, and I eventually intend to re-compile all these stats. I deemed it more important to publish what I have and move forward with the next phases of my data structure performance survey.

Still to come

Still a lot of data structures to add to the data structure pages, and I still have to compile performance characteristics of common actions like lookup, update, append, and iterate. So I’ll have my head down working on this for a while. Please send comments, suggestions, and requests.

Here are some more .NET related performance benchmarks, also using the ratio comparison method.

Benchmarking F# Data Structures — Introduction

Overview

DS_Benchmark is a new open source project for benchmarking F# data structures. I developed it to provide a platform for comparing performance across data structures, data types, common actions on structures, as well as scale-up performance. It currently supports almost all the structures in the FSharp Collections, FSharpX Data Structures, and Power Pack libraries. The project’s architecture allows incorporating additional structures and functions to be timed with minimal effort.

The choice of data structure in a functional language project profoundly influences architecture and performance. You may not have the time or inclination to read up on purely functional data structures, and even if you are familiar with many structures and their time complexities, how do you choose between several offering O(1) complexity in a crucial function? You should always do your own profiling, and especially profile performance-critical code.

Some of the benchmarks possible with this project include:

  • Compare similar functions across data structures.
  • Investigate initializing structures from different collection types.
  • Look for pathologies caused by certain initial data.
  • At what scale does a data structure become less or more performant compared to other choices?

Methodology

I intend the project to measure what I call “most common best case performance” in an “apples to apples” manner, so it does not readily tell you how many milliseconds a process took (although you can retrieve this information by drilling down only a little way into the project). I took all the precautions I could think of to isolate the event being measured: the project forces garbage collection prior to the event, and the timed event executes in its own exe. (It is still important to minimize all other activity on the machine during timings.)

I found through experiment that typically fewer than 15% of timing executions are outliers on the long side, so DS_Benchmark times 100 executions and uses the fastest 85 to calculate a median time (in ticks) and the deviation from the median.
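In outline, the reduction looks something like this (my sketch of the methodology as described, not the project’s actual code):

// time one execution, forcing garbage collection beforehand
let timeIt (action : unit -> unit) =
    System.GC.Collect()
    System.GC.WaitForPendingFinalizers()
    let sw = System.Diagnostics.Stopwatch.StartNew()
    action ()
    sw.ElapsedTicks

// time 100 executions, keep the fastest 85, report the median in ticks
let medianTicks action =
    let kept = List.init 100 (fun _ -> timeIt action) |> List.sort |> Seq.take 85 |> Seq.toArray
    kept.[kept.Length / 2]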

The output format is designed to enable further analysis within and across data structures by computing comparisons in terms of ratios (percents) rather than simply reporting milliseconds (which only apply to the host environment). Ratio comparisons are remarkably environment agnostic, as long as all original measurements are taken in the same environment. For instance I ran analyses based on measurements on two different machines, and even though the measured timings (in ticks) were wildly different, all of the comparative ratios I have analysed so far are very similar.

Simple Measurement Set-up

The console1 program in the solution provides documented examples of the most useful cases. Timings of individual scenarios print to the console, or you can generate batches of timings with a simple template model. The batched timings output to a tab-delimited file for easy import into Excel. (My next phase for this project is to persist the timing results and automate the analysis. The idea is to compare actions and inputs within and across data structures.)

For more in depth usage and architecture information see the project readme file.

Coming up Next

In following articles I will report on some interesting comparisons of data structures using DS_Benchmark. I am devoting a permanent section of this blog to F# Data Structures. Keep an eye on this space for benchmark comparisons as well as a single point of reference for many data structures and documentation, including time complexities, that otherwise are scattered all over the internet.

If you are interested in adding structures or otherwise contributing to the project, get in touch with me.