Simple Lookups in F#, Choices and Trade-offs

What is your go-to data structure for lookups in F#? Most any choice for key-based lookups will work just fine, but if your requirements go beyond simple lookup or you have a demanding edge case what other criteria might you consider? Here’s an overview of the available lookup data structures:

Map

Purely functional and persistent. Over 20 methods providing various custom folds, mappings, filters, etc.

IntMap

Purely functional and persistent, but restricted to integer keys. Dozens of specialized methods including folds, mappings, filters, and inequality lookups (e.g. greater than key). Part of the FSharpx open source project.

Dictionary

Mutable. Almost always the fastest lookup. The fastest object to initialize with data, provided you initialize the dictionary with the expected capacity. If the capacity is not given data loading performance will degrade significantly in the 100 to 10,000 element range. It has none of the ancillary functional methods available to native F# data structures.

Hashtable

Mutable. Very good lookup performance and data initialization performance on string keys. If the capacity is not given data loading performance will degrade significantly at all ranges of element size. None of the ancillary functional methods available to native F# data structures.

HashMultiMap

Mutable. Available in the PowerPack.Core.Community library. The distinguishing feature of this data structure is a FindAll method returning a list of all values bound to a hash key. A fold method is also available.

So what are the raw lookup performance numbers? (All timings are in milliseconds.)

Integer Key by Number of Elements in Lookup Structure, 10,000 random lookups

  Map IntMap Dict HashTbl HshMltMap
102 0.7 0.7 0.2 0.3 1.8
103 0.9 0.9 0.2 0.3 1.8
104 1.2 1.3 0.3 0.4 1.9
105 2.1 2.2 0.6 0.8 2.3
106 3.3 3.1 0.9 1.0 2.6

String Key by Number of Elements in Lookup Structure, 10,000 random lookups

  Map IntMap Dict HashTbl HshMltMap
102 1.3 n/a 0.4 0.3 1.5
103 1.7 n/a 0.4 0.4 1.5
104 3.0 n/a 0.7 0.7 1.8
105 5.3 n/a 1.5 1.2 2.4
106 8.4 n/a 1.6 1.5 6.3

Good old System.Collections.Generic.Dictionary consistently performs lookups about 4 times faster than Map, but all the numbers are in the very low milliseconds or even tenths. This couldn’t make a performance difference unless you are doing a bazillion lookups in some long-running machine learning system.

So what else distinguishes the lookup choices? Maybe your app repeatedly loads data into new lookup tables.

Integer Key by Number of Elements loaded into Lookup Structure

* indicates capacity was not set for Dictionary and Hashtable

  Map IntMap Dict * HashTbl * HshMltMap
102 1.6 2.6 0.6 1.7 0.3 1.4 1.2
103 1.8 2.8 0.6 1.7 0.3 2.7 1.3
104 5.6 5.0 0.9 2.2 0.8 4.7 1.9
105 79.0 49.2 4.3 7.4 11.6 68.7 11.3

String Key by Number of Elements loaded into Lookup Structure

  Map IntMap Dict HashTbl HshMltMap
102 4.0 n/a 0.6 0.3 1.2
103 4.5 n/a 0.7 0.4 1.3
104 12.1 n/a 1.3 1.3 2.4
105 144.0 n/a 11.2 13.5 19.6

Once again Dictionary outperforms the competition, and more so if you set the capacity properly in the constructor.

If you are doing updates performance differences start becoming very pronounced. This is especially due to mutable structures’ advantage of doing updates in place, and in the case of Map there is no update-equivalent method, so updates require remove followed by add, returning new map instances and generating work for garbage collection.

Integer Key by Number of Elements in Lookup Structure, 10,000 random updates

  Map IntMap Dict HashTbl HshMltMap
102 4.3 2.4 0.2 0.3 0.8
103 6.8 3.1 0.2 0.3 0.8
104 10.6 4.3 0.2 0.4 0.9
105 17.9 9.3 0.6 0.6 1.3

String Key by Number of Elements in Lookup Structure, 10,000 random updates

  Map IntMap Dict HashTbl HshMltMap
102 7.5 n/a 0.3 0.4 0.8
103 12.0 n/a 0.4 0.5 0.8
104 17.9 n/a 0.6 0.7 1.1
105 34.6 n/a 1.2 1.2 1.7

In conclusion, for those rare applications doing huge volumes of lookups and/or repeatedly creating new lookup tables, Dictionary is your best choice, unless you require “functional” characteristics of the lookup table. Dictionary also enjoys a clear performance advantage in iterating over its contents as IEnumerable (but I’ve already bored you with enough tables), and because of this you can overcome the lack of “functional” methods (fold, etc.) by enumerating to the Seq module and use its methods.

Note on methodology: Integer keys are in the range 1 to number of elements and string keys range in length from 1 to 26 characters, evenly distributed. Both key types are shuffled before loading the data structure (i.e. the order is somewhat random). Data is presented in array format for loading. Timed events are isolated to the greatest extent possible. Each timed event is performed 50 times and the best 40 times are averaged. This is to discard arbitrarily long timed events which inevitably occur and would skew the results if included in the average. Taking the median of 50 timed events would serve the same purpose. Original timings are in ticks and divided by 10,000 to arrive at milliseconds. Timings measured using DS_Benchmark. The machine of record is a low-end dual-core processor running 64-bit Windows 7. Not covered by this analysis is garbage collection of large monolithic structures like large dictionaries or hashtables.

One thought on “Simple Lookups in F#, Choices and Trade-offs

  1. Pingback: F# Weekly #47, 2012 « Sergey Tihon's Blog

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>