FSharpx Data Structures Discussion

In the ten weeks since I began submitting pull requests contributing Functional Data Structures to the FSharpx open source project, I’ve learned a lot: not just expanding my knowledge of the F# language, the .NET Framework, and Functional Data Structures, but also of Git, GitHub, and, perhaps most interesting of all, the practice of contributing to an open source project.

Steffen Forkmann and Mauricio Scheffer, two of the core FSharpx team, deserve special call-outs for helping me come up to speed. Steffen ported some of my work before I began doing Git pull requests, and they’ve both been very patient with my Git blunders and other mistakes. It took some time, but I’m starting to understand the somewhat anarchic governance structure of open source. With my background leading medium-sized production projects, I try to develop a big picture of the entire stack. I’m learning open source contributors tend to be much more focused on immediate issues, and I think for good reason: commit history is crucial to the health and long-term benefit of the project. It works like a self-organizing system, with big-picture benefits as emergent phenomena.

I still can’t resist formulating a bigger-picture framework for the FSharpx Data Structures sub-project. What follows is purely my opinion, and I hope FSharpx contributors and others will give feedback: correct me where I err, and encourage me where I am right.

FSharpx.DataStructures Namespace is Purely Functional

This namespace is for Purely Functional Data Structures. There is already one exception: TransientVector, which shares the IVector interface with the purely functional PersistentVector.

Mutable structures should have their own namespace, which currently doesn’t appear to exist. There is a “Collections” folder, but its components are in the FSharpx namespace. It includes extension modules for functional and mutable structures, as well as one functional structure, NonEmptyList.

Structure Type and Module

The structure’s type should exist directly in the DataStructures namespace, alongside a module with the same name as the type; the type should not be nested under the module. There are a few structures in the latter situation with no non-breaking remedy, except perhaps duplicating the entire type at the namespace level. The type should have all the members that operate on an instantiated type. The module should have let bindings to all the type members, as well as let bindings for all other “utility” functions on the type, like ofSeq and empty.
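To make the convention concrete, here is a minimal sketch using a hypothetical MyStack (illustrative only, not an actual FSharpx structure):

// the type lives directly in the namespace...
type MyStack<'a> =
    { items : 'a list }

    /// O(1). True if the stack holds no elements.
    member this.IsEmpty = List.isEmpty this.items

    /// O(n). The number of elements.
    member this.Length = List.length this.items

    /// O(1). Some(top, rest), or None if the stack is empty.
    member this.TryUncons =
        match this.items with
        | x :: xs -> Some(x, { items = xs })
        | [] -> None

// ...alongside (not under) a module of the same name
[<CompilationRepresentation(CompilationRepresentationFlags.ModuleSuffix)>]
module MyStack =
    /// O(1). The empty stack.
    let empty<'a> : MyStack<'a> = { items = [] }

    /// O(1). Surfaces the type member as a let binding.
    let isEmpty (s : MyStack<'a>) = s.IsEmpty

    /// O(n).
    let length (s : MyStack<'a>) = s.Length

    /// O(n). Builds a stack from any sequence.
    let ofSeq (xs : seq<'a>) = { items = List.ofSeq xs }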

Understand the difference between a property and a method

See the .NET Property Usage Guidelines. One rule of thumb: if it is not O(1), it is not a property. (The inverse does not necessarily hold.)

Pattern Discriminator

Include any relevant pattern discriminators in the module. Hint: if the type implements TryUncons or TryUnsnoc you can implement pattern discriminators.
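For example, given the TryUncons member from the hypothetical MyStack sketch above, the discriminators fall out directly:

let (|Cons|Nil|) (s : MyStack<'a>) =
    match s.TryUncons with
    | Some(head, tail) -> Cons(head, tail)
    | None -> Nil

// usage
let describe s =
    match s with
    | Cons(h, _) -> sprintf "top element: %A" h
    | Nil -> "empty"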

Try to Avoid Exceptions

If a structure’s method can throw an exception, include a TryMethodName method that returns option.
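Continuing the hypothetical MyStack sketch, the pairing looks like this:

/// O(1). Throws on an empty stack.
let head (s : MyStack<'a>) =
    match s.TryUncons with
    | Some(x, _) -> x
    | None -> invalidOp "stack is empty"

/// O(1). Returns None on an empty stack instead of throwing.
let tryHead (s : MyStack<'a>) =
    match s.TryUncons with
    | Some(x, _) -> Some x
    | None -> None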

Necessary Members

Every data structure should have a length member (even if length is O(n)) and an IsEmpty member in the type, and an ofSeq let binding in the module.

Additional Members

Of course implement all the members central to the structure, but there can be some nice O(1) or O(log n) values just waiting to be implemented, perhaps not directly addressed in the research paper because they were not interesting. A couple of examples: sometimes the internal structure of the type allows an O(1) rev or ofList.

Necessary Interface

Implement IEnumerable in the type. It should sequence elements in the inherent order of the data structure. For instance, if a structure implements the head/tail paradigm, IEnumerable should sequence elements in the order of unconsing the structure down to empty. Most every structure has some logical, if not physical, order to its elements when flattened. (I can’t think of an exception right now.)

Implementing ofSeq and IEnumerable makes practically every Data Structure composable into every other, a very useful feature!
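Sketched on the hypothetical MyStack (the interface must be declared with the type, so the type is repeated here with IEnumerable added):

open System.Collections
open System.Collections.Generic

type MyStack<'a> =
    { items : 'a list }

    // sequence elements in unconsing (top-to-bottom) order
    interface IEnumerable<'a> with
        member this.GetEnumerator() =
            (Seq.ofList this.items).GetEnumerator()

    interface IEnumerable with
        member this.GetEnumerator() =
            (Seq.ofList this.items).GetEnumerator() :> IEnumerator

// composability: anything exposing IEnumerable pours into anything exposing ofSeq
let stack = { items = [3; 1; 2] }
let asSet = (stack :> seq<int>) |> Set.ofSeq   // stack -> set
let back = { items = List.ofSeq asSet }        // set -> stack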

Type Interface

In theory data structures implementing the same paradigm, e.g. queue, should implement an interface that both restricts upcasting the interface to the original type and is a usable interface. Meeting both the restrictive and usable requirements has proven difficult. I haven’t given up on this project yet, just deferred it.

Hide unsafe underlying structures

When I first went to work on data structures in the FSharpx project, I noticed some of the structures exposing unsafe (in the sense that arbitrary use could lead to a dysfunctional structure) discriminated unions. I didn’t fully appreciate at the time how an OS project like FSharpx incrementally works toward ideal practices, and considered this an intentional practice, which I emulated.

As a result of Mauricio’s initiative to test for monoid law adherence, this [mal]practice has been exposed here and here.

This is a fair amount of clean-up work. Still, the exposed union does provide one more handle for the determined hacker to be inventive with, so perhaps the correct solution is to expose the union in the type and module in some manner that is clearly identified as unsafe, or even to collect all of the unsafe unions in a single unsafe module. “Unsafe” or “unchecked” would be ideal names, except they are already C# keywords.

Still thinking this over, and hope to get feedback on this.

Intellisense

Every public member and let binding should have intellisense documentation, making it user friendly rather than requiring a literature search or an inspection of the code to determine functionality. At the type level there should be intellisense documenting the purpose and use of the structure. Include URLs to additional literature where appropriate.

Data structures should give developers useful tools, not research projects.

Time Complexity

Adopting a practice from Haskell documentation, I put the time complexity for members and let bindings at the beginning of their intellisense documentation. I even went so far as to put O(1) on empty and singleton to accustom users to seeing the time complexity in the intellisense, even though, strictly speaking, time complexity has no meaning for these bindings.
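Continuing the hypothetical MyStack sketch:

/// O(1). Returns the empty stack. (Complexity noted even where, strictly
/// speaking, it has no meaning, so it always appears in the same place.)
let empty<'a> : MyStack<'a> = { items = [] }

/// O(1). Returns a stack holding exactly one element.
let singleton x = { items = [x] }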

Take time to evaluate the naming standard

While it’s a good practice to keep the name a structure is best known by, consider renaming properties and methods when it makes sense to fit in with F# paradigms. Use the FSharp.Collections and existing FSharpx.DataStructures namespaces as models; don’t just unthinkingly adopt the names from a research paper or from the implementation in another language. Use ofList and ofArray, not fromList or fromArray. I intentionally renamed the Heap members findMin and deleteMin to head and tail because I think that accurately conveys the head/tail paradigm shared by several other data structures (and I implemented Heaps as ascending or descending, so “min” is inappropriate). Stick with cons, snoc, head, tail, init, and last wherever possible.

IReadOnlyList and IReadOnlyDictionary

IReadOnlyCollection and IReadOnlyDictionary, new to .NET 4.5, would appear to be useful to implement; however, backwards compatibility of FSharpx raises a problem. A discussion of this issue is here.

Should FSharpx take over the PowerPack Data Structures?

Discussion: here

There are related threads for migrating other functions from PowerPack, for instance here and here.

I don’t know what the right answer is.

Conclusion

Data Structures make F# a richer multi-technology-stack language. I keep a running collection of the structures from Okasaki’s 1998 book not yet ported to FSharpx here. (There are at least one or two structures covered in the book that are not in this collection.) Feel free to request expediting the port of any or all of them, or work on a port yourself (I’m happy to provide any assistance I can). There are lots of other useful structures still “in the wild”.

Simple Lookups in F#, Choices and Trade-offs

What is your go-to data structure for lookups in F#? Most any choice for key-based lookups will work just fine, but if your requirements go beyond simple lookup, or you have a demanding edge case, what other criteria might you consider? Here’s an overview of the available lookup data structures:

Map

Purely functional and persistent. Over 20 methods providing various custom folds, mappings, filters, etc.
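For instance:

let m = Map.ofList [1, "a"; 2, "b"; 3, "c"]
let evens = m |> Map.filter (fun k _ -> k % 2 = 0)     // map [(2, "b")]
let keySum = m |> Map.fold (fun acc k _ -> acc + k) 0  // 6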

IntMap

Purely functional and persistent, but restricted to integer keys. Dozens of specialized methods including folds, mappings, filters, and inequality lookups (e.g. greater than key). Part of the FSharpx open source project.

Dictionary

Mutable. Almost always the fastest lookup. The fastest object to initialize with data, provided you initialize the dictionary with the expected capacity. If the capacity is not given, data-loading performance degrades significantly in the 100 to 10,000 element range. It has none of the ancillary functional methods available to native F# data structures.
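The capacity is just a constructor argument:

open System.Collections.Generic

let n = 10000
let pairs = Array.init n (fun i -> i, string i)

let fast = Dictionary<int, string>(n)   // capacity supplied: no resizing while loading
for (k, v) in pairs do fast.[k] <- v

let slow = Dictionary<int, string>()    // default capacity: repeated internal resizing
for (k, v) in pairs do slow.[k] <- v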

Hashtable

Mutable. Very good lookup and data-initialization performance on string keys. If the capacity is not given, data-loading performance degrades significantly at all element counts. It has none of the ancillary functional methods available to native F# data structures.

HashMultiMap

Mutable. Available in the PowerPack.Core.Community library. The distinguishing feature of this data structure is a FindAll method returning a list of all values bound to a hash key. A fold method is also available.

So what are the raw lookup performance numbers? (All timings are in milliseconds.)

Integer Key by Number of Elements in Lookup Structure, 10,000 random lookups

Elements   Map   IntMap   Dict   HashTbl   HshMltMap
10^2       0.7   0.7      0.2    0.3       1.8
10^3       0.9   0.9      0.2    0.3       1.8
10^4       1.2   1.3      0.3    0.4       1.9
10^5       2.1   2.2      0.6    0.8       2.3
10^6       3.3   3.1      0.9    1.0       2.6

String Key by Number of Elements in Lookup Structure, 10,000 random lookups

Elements   Map   IntMap   Dict   HashTbl   HshMltMap
10^2       1.3   n/a      0.4    0.3       1.5
10^3       1.7   n/a      0.4    0.4       1.5
10^4       3.0   n/a      0.7    0.7       1.8
10^5       5.3   n/a      1.5    1.2       2.4
10^6       8.4   n/a      1.6    1.5       6.3

Good old System.Collections.Generic.Dictionary consistently performs lookups about four times faster than Map, but all the numbers are in the very low milliseconds or even tenths of a millisecond. This couldn’t make a performance difference unless you are doing a bazillion lookups in some long-running machine learning system.

So what else distinguishes the lookup choices? Maybe your app repeatedly loads data into new lookup tables.

Integer Key by Number of Elements loaded into Lookup Structure

* indicates capacity was not set for Dictionary and Hashtable

Elements   Map    IntMap   Dict   Dict*   HashTbl   HashTbl*   HshMltMap
10^2       1.6    2.6      0.6    1.7     0.3       1.4        1.2
10^3       1.8    2.8      0.6    1.7     0.3       2.7        1.3
10^4       5.6    5.0      0.9    2.2     0.8       4.7        1.9
10^5       79.0   49.2     4.3    7.4     11.6      68.7       11.3

String Key by Number of Elements loaded into Lookup Structure

Elements   Map     IntMap   Dict   HashTbl   HshMltMap
10^2       4.0     n/a      0.6    0.3       1.2
10^3       4.5     n/a      0.7    0.4       1.3
10^4       12.1    n/a      1.3    1.3       2.4
10^5       144.0   n/a      11.2   13.5      19.6

Once again Dictionary outperforms the competition, and more so if you set the capacity properly in the constructor.

If you are doing updates, the performance differences become very pronounced. This is largely due to the mutable structures’ advantage of updating in place; in the case of Map there is no update-equivalent method, so an update requires a remove followed by an add, returning new map instances and generating work for the garbage collector.
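The two update styles side by side:

open System.Collections.Generic

// persistent: each update returns a new Map instance
let m = Map.ofList [1, "a"; 2, "b"]
let m' = m |> Map.remove 1 |> Map.add 1 "z"   // remove followed by add

// mutable: Dictionary updates in place
let d = Dictionary<int, string>(dict [1, "a"; 2, "b"])
d.[1] <- "z"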

Integer Key by Number of Elements in Lookup Structure, 10,000 random updates

Elements   Map    IntMap   Dict   HashTbl   HshMltMap
10^2       4.3    2.4      0.2    0.3       0.8
10^3       6.8    3.1      0.2    0.3       0.8
10^4       10.6   4.3      0.2    0.4       0.9
10^5       17.9   9.3      0.6    0.6       1.3

String Key by Number of Elements in Lookup Structure, 10,000 random updates

Elements   Map    IntMap   Dict   HashTbl   HshMltMap
10^2       7.5    n/a      0.3    0.4       0.8
10^3       12.0   n/a      0.4    0.5       0.8
10^4       17.9   n/a      0.6    0.7       1.1
10^5       34.6   n/a      1.2    1.2       1.7

In conclusion, for those rare applications doing huge volumes of lookups and/or repeatedly creating new lookup tables, Dictionary is your best choice, unless you require “functional” characteristics in the lookup table. Dictionary also enjoys a clear performance advantage in iterating over its contents as IEnumerable (but I’ve already bored you with enough tables), and because of this you can overcome the lack of “functional” methods (fold, etc.) by treating it as a seq and using the Seq module’s methods.
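For example:

open System.Collections.Generic

let d = Dictionary<string, int>(dict ["a", 1; "b", 2; "c", 3])

// Dictionary enumerates as a seq of KeyValuePair, so Seq supplies the
// "functional" methods it lacks:
let total = d |> Seq.fold (fun acc (KeyValue(_, v)) -> acc + v) 0   // 6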

Note on methodology: Integer keys are in the range 1 to number of elements and string keys range in length from 1 to 26 characters, evenly distributed. Both key types are shuffled before loading the data structure (i.e. the order is somewhat random). Data is presented in array format for loading. Timed events are isolated to the greatest extent possible. Each timed event is performed 50 times and the best 40 times are averaged. This is to discard arbitrarily long timed events which inevitably occur and would skew the results if included in the average. Taking the median of 50 timed events would serve the same purpose. Original timings are in ticks and divided by 10,000 to arrive at milliseconds. Timings measured using DS_Benchmark. The machine of record is a low-end dual-core processor running 64-bit Windows 7. Not covered by this analysis is garbage collection of large monolithic structures like large dictionaries or hashtables.
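In outline, the timing harness works like this (a simplified sketch; the published numbers came from DS_Benchmark):

open System.Diagnostics

// time an event 50 times, average the best 40 to discard the arbitrarily
// long outliers, and convert TimeSpan ticks to milliseconds
let timeMs (f : unit -> unit) =
    [ for _ in 1 .. 50 ->
        let sw = Stopwatch.StartNew()
        f ()
        sw.Stop()
        sw.Elapsed.Ticks ]
    |> List.sort
    |> List.take 40
    |> List.averageBy float
    |> fun ticks -> ticks / 10000.0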

Let’s Make a Bayesian Deal

Let’s have some fun with the Monty Hall problem. The set-up is three closed doors. Behind one is a nice prize, behind the other two are dud prizes. You (having no prior knowledge) choose a door, at which point the game’s host (with prior knowledge) reveals a dud prize behind one of the doors you did not choose and asks if you want to stay with the door you chose or switch and choose the other door that is still closed.

There are many published articles on why the proper strategy is to switch doors, including the article above, but when I tried to explain this to someone a few months ago I was told that I was “full of sh*t, the odds don’t change”. Well, actually, the odds do change, because the game’s host has revealed new information. It all has to do with Bayes’ Theorem, whose basic formula

P(A|B) = P(B|A)P(A) / P(B)

reads as follows:

The probability of event “A” occurring, given event “B” has occurred
equals
The probability of event “B” occurring, given event “A” has occurred
times
the probability of event “A”
divided by
The probability of event “B”.

Since I am apparently a failure at explaining this, I wrote an F# Monte Carlo program to do my talking for me. The F# code really expresses the way the game works. There is not much to it. Here is the type for an individual game:
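A sketch of the shape (field names as in the prose; the full program’s record may differ):

type Game =
    { doorWithPrize : int    // randomly generated
      fstChoice     : int    // the player's pick, also randomly generated
      revealedDoor  : int }  // the dud door the host opens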

It constructs each game from doorWithPrize (randomly generated), fstChoice (also randomly generated), and revealedDoor, the door the host opens during game play. There are two rules for choosing this door.

If the player happened to choose the door hiding the real prize, then the host arbitrarily chooses between the other two doors. Otherwise the host is constrained to open one and only one of the doors. Both rules appear in the sketch below.
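In outline (a sketch of the two rules, not the program’s exact code):

let rnd = System.Random()

let revealDoor doorWithPrize fstChoice =
    let doors = [1; 2; 3]
    if fstChoice = doorWithPrize then
        // rule 1: the player found the prize, so the host picks either dud arbitrarily
        let duds = List.filter (fun d -> d <> doorWithPrize) doors
        duds.[rnd.Next 2]
    else
        // rule 2: the host must open the one door that is neither the prize
        // door nor the player's choice
        List.find (fun d -> d <> doorWithPrize && d <> fstChoice) doors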

Over a large enough sample of games, the host is constrained in his choice of doors two-thirds of the time, because the player will make the wrong first choice two-thirds of the time. So the act of exposing one of the dud prizes reveals valuable information to you, the player. (At least it tells you something with better than 50/50 odds.) And that’s why switching works.
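A quick Monte Carlo check makes the two-thirds visible, since switching wins exactly when the first choice was wrong:

let switchWinRate games =
    let rnd = System.Random()
    let wins =
        Seq.init games (fun _ ->
            let prize = rnd.Next 3
            let choice = rnd.Next 3
            choice <> prize)          // switching wins when the first choice missed
        |> Seq.filter id
        |> Seq.length
    float wins / float games

switchWinRate 1000000    // ~0.667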

For your friends who still won’t believe, and think you have played some sort of trick, the program comes with an audit function that will print the records of up to 100 games. The source code is available here or on GitHub. Have fun with it.

The Miracle of F# Continuation Passing

One thing I consistently include in preparing functional data structures for the FSharpx.DataStructures library is the IEnumerable interface, which facilitates function composition by seamlessly flattening structures. Using Brian McNamara’s indispensable series on catamorphisms as a guide I set about to flatten the BinomialHeap.

Ignoring isMaximalist : bool for now, we see a recursive structure of

list -> tuple -> list -> tuple -> etc.

and we want to flatten the elements 'a to 'a list. A classic case for a continuation passing style (CPS) fold.

This code shows a first attempt at hard-coding the flatten we want to accomplish. There are only four patterns to account for, so reasoning about the steps to take is simple. The fifth “can’t get here” pattern exists purely to prevent a warning message. (The compiler cannot see that only four paths are possible.)
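In outline it looked like this (a sketch reconstructing the idea; the BinomialTree shape matches the Node(rank, element, children) patterns quoted below):

type BinomialTree<'a> = Node of int * 'a * BinomialTree<'a> list

let flatten (h : BinomialTree<'a> list) =
    let rec loop h cont =
        match h with
        | Node(_, a, [])::[] -> cont [a]
        | Node(_, a, [])::tl -> loop tl (fun acc -> cont (a :: acc))
        | Node(_, a, h')::[] -> loop h' (fun acc -> cont (a :: acc))
        | Node(_, a, h')::tl ->
            loop h' (fun lacc ->
                loop tl (fun racc -> cont (a :: (lacc @ racc))))   // the list append
        | _ -> failwith "can't get here"
    loop h (fun x -> x)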

This code does what we want, producing an 'a list of all the heap’s elements, but the continuation on the last pattern is appending lists, an inefficient procedure we want to avoid…but how?

Let’s see if we gain anything by abstracting out the flatten logic to make this a true generic fold.

So far I have been treating this problem as a classic example from part three of Brian’s series. In abstracting a CPS fold over a discriminated union, you end up with one continuation function for each case in the union. Our data type is not a four-case discriminated union, but I have been treating the number of match patterns as a proxy for one. (The second and third patterns take the same function. That should be a hint that something is wrong with this approach.)

Ultimately, abstracting over (a proxy for) a discriminated union has been a dead end. We do have an abstracted fold method using CPS, but there seems to be no way to get rid of that annoying list append.

The correct approach is through the tree CPS method Brian teaches us in part two. Except for the name “BinomialTree”, it is not immediately obvious how a recursive structure like

list -> tuple -> list -> tuple -> etc.

is any kind of tree, let alone a binary tree. Only when we write out the pattern

Node(_, a, h')::tl

do we see the binary tree:

                ('a element to flatten)
                 /                \
               h'                  tl
    (list within the           (tail of the list of which
     element's tuple)           the element is the head)

So the tree structure has been apparent from the time we wrote out the patterns for our match algorithm, and we know those patterns represent the complete set when iterating through our heap. But to accomplish a binary tree CPS fold you need a “null” pattern (a pattern which does not contribute data to the fold operation) at the bottom of each continuation, in order to thread through the identity function. This is why you see so many tree examples in functional languages ending in “leaf”, a seemingly useless end-of-branch marker carrying no data.

There is no “leaf” in our data structure, but let’s go through the motions of implementing the classic tree CPS fold, and maybe that will help us figure something out.

And then a miracle occurs

I don’t believe in software miracles, but all of a sudden our newest generation of the fold starts failing with “can’t get here”. We are now passing something new through our loop: the empty list. It turns out there was a “null” pattern all along. Change

| _ -> failwith "can't get here"

to

| _ -> cont leafV

and we have a working CPS fold on the BinomialHeap. (I’m using Brian’s naming standard for the fold parameters.)
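In code, the working fold looks like this (my reconstruction):

let foldHeap nodeF leafV h =
    let rec loop h cont =
        match h with
        | Node(_, a, h')::tl ->
            loop h' (fun lacc ->              // fold the node's child list...
                loop tl (fun racc ->          // ...then the rest of the current list
                    cont (nodeF a lacc racc)))
        | _ -> cont leafV                     // the empty list: our "null" pattern
    loop h (fun x -> x)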

Let me break apart the inOrder call to foldHeap to make the fold mechanism more explicit:
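A sketch of the broken-apart version:

let inOrder (h : BinomialTree<'a> list) =
    let x =
        foldHeap
            (fun a lacc racc acc -> lacc (a :: racc acc))   // nodeF
            (fun acc -> acc)                                // leafV
            h
    x []                                                    // the seed accumulator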

A heap sidebar

The code is now perfectly suited for exploring CPS execution. The inOrder member we have been evolving does an in-order tree traversal of the heap’s internal data representation, which in the case of heaps is entirely meaningless. We only care that a complete traversal takes place. We will sort the resulting list to get it into proper head/tail order.

You don’t believe the internal order is meaningless? Here’s the internal structure we will walk through:


Node (0, "b", [])
Node (1, "a", [Node (0, "c", [])])

In this case the isMaximalist : bool I told you to ignore above is set to false, so the resulting list should be

["a"; "b"; "c"]

but not as a result of the traversal, only after we sort the output list of the fold ourselves. The in-order traversal results in

["b"; "c"; "a"]

Now back to our story

The step four code makes it easier to follow the logic and clearly shows how the empty lists satisfy the null pattern. Let’s walk through it.

current node              executing code

                          cont = (fun x -> x)

(0, "b", [])              Node(_, a, [])::tl -> loop [] (fun lacc -> ...
(head of top list)
                          cont =
                            (fun lacc ->
                              loop [] (fun racc ->
                                (nodeF a lacc racc) |> (fun x -> x)))

                          cont leafV

The remaining in-order tree traversal builds out increasingly nested continuations:

                          Node(_, a, [])::tl -> loop [] (fun lacc ->
                            loop [] (fun racc -> ...

(1, "a", ...)             Node(_, a, h')::[] -> loop h' (fun lacc -> ...
(head of top right list)

(0, "c", [])              Node(_, a, [])::[] -> loop [] (fun lacc -> ...

                          cont leafV

                          Node(_, a, [])::tl -> loop [] (fun lacc ->
                            loop [] (fun racc -> ...

                          cont leafV

(1, "a", ...)             Node(_, a, h')::[] -> loop h' (fun lacc ->
                            loop [] (fun racc -> ...

                          cont leafV
Swapping in the nodeF parameter, and substituting its parameter names and the element parameter name for “x”, we get the crucial code fragment. I am also using the result-passing symbol (|>) for readability.

(fun a lacc racc acc -> lacc (a :: (racc acc))) |> (fun x -> x)

Another substitution: lacc becomes the identity function, applied with the pipe, and, coincidentally, the symbol a from the code gets replaced by the value “a”.

(fun a lacc racc acc -> ("a" :: (racc acc)))|> (fun acc -> acc) |> (fun x -> x))

One more result passing symbol for clarity.

(fun a lacc racc acc -> ("a" :: (acc |> racc)))|> (fun acc -> acc) |> (fun x -> x))

At the bottom of all the right trees we eventually hit an empty list, so we make one more substitution.

(fun acc -> ("a" :: (acc |> (fun acc -> acc)))) |> (fun acc -> acc) |> (fun x -> x))

(fun x -> x) starts the whole continuation unwinding (executing). Notice the anonymous function from nodeF has only three of its four parameters filled in. This is where the code we broke out in inOrder comes into play. I teased apart inOrder to show explicitly how the CPS pattern wraps the call to foldHeap in a function taking the implicit parameter acc : 'a list.

x []

We have built a partially applied function awaiting one last parameter. [] provides the seed empty list passed down through the nested acc parameters.

Here is a reasonable facsimile of the generated function and execution:
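Something along these lines (a facsimile, not actual generated output):

// for the sample heap the fold assembles, in effect, this function;
// applying it to the seed [] executes the nested conses
let generated : string list -> string list =
    fun acc -> "b" :: ("c" :: ("a" :: acc))

let result = generated []   // ["b"; "c"; "a"]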

And here is the finished product:
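Pulling the pieces together (again my reconstruction):

let foldHeap nodeF leafV h =
    let rec loop h cont =
        match h with
        | Node(_, a, h')::tl ->
            loop h' (fun lacc -> loop tl (fun racc -> cont (nodeF a lacc racc)))
        | _ -> cont leafV
    loop h (fun x -> x)

let inOrder (h : BinomialTree<'a> list) =
    foldHeap (fun a lacc racc acc -> lacc (a :: racc acc)) (fun acc -> acc) h []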

If not for the line breaks added for better readability, this is seven lines of code: six to create a generic fold structure, and one to perform an in-order traversal. However, this is a case where F#’s famous expressiveness suffers from its own succinctness. If you did not know beforehand how this works, good luck figuring it out!

The complete source code for the BinomialHeap can be found here.