What is your go-to data structure for lookups in F#? Almost any choice will work fine for simple key-based lookups, but if your requirements go beyond that, or you have a demanding edge case, what other criteria might you consider? Here’s an overview of the available lookup data structures:
Map: Purely functional and persistent. Over 20 methods providing various custom folds, mappings, filters, etc.
IntMap: Purely functional and persistent, but restricted to integer keys. Dozens of specialized methods including folds, mappings, filters, and inequality lookups (e.g. greater than a given key). Part of the FSharpx open source project.
Dictionary: Mutable. Almost always the fastest for lookups, and the fastest to initialize with data, provided you construct it with the expected capacity. If the capacity is not given, data-loading performance degrades significantly in the 100 to 10,000 element range. It has none of the ancillary functional methods available on the native F# data structures.
Hashtable: Mutable. Very good lookup and data-initialization performance on string keys. If the capacity is not given, data-loading performance degrades significantly at all element counts. It has none of the ancillary functional methods available on the native F# data structures.
HashMultiMap: Mutable. Available in the PowerPack.Core.Community library. Its distinguishing feature is a FindAll method returning a list of all values bound to a hash key. A fold method is also available.
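As a quick orientation, here is a sketch of loading the same data into Map, Dictionary, and Hashtable (IntMap and HashMultiMap are omitted only because they require the FSharpx and PowerPack library references):

```fsharp
open System.Collections.Generic

let data = [| for i in 1 .. 1000 -> i, string i |]

// Map: purely functional, built from the array in one call
let m = Map.ofArray data

// Dictionary: pass the expected capacity to avoid resizing during load
let d = Dictionary<int, string>(data.Length)
for (k, v) in data do d.[k] <- v

// Hashtable: non-generic, also sized up front; keys and values are boxed
let h = System.Collections.Hashtable(data.Length)
for (k, v) in data do h.[box k] <- box v
```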
So what are the raw lookup performance numbers? (All timings are in milliseconds.)
Integer Key by Number of Elements in Lookup Structure, 10,000 random lookups
String Key by Number of Elements in Lookup Structure, 10,000 random lookups
Good old System.Collections.Generic.Dictionary consistently performs lookups about 4 times faster than Map, but all the timings are in the very low milliseconds, or even tenths of a millisecond. This won’t make a noticeable difference unless you are doing a bazillion lookups in some long-running machine-learning system.
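For reference, the two lookup idioms being timed look like this:

```fsharp
// Map lookup vs Dictionary lookup for the same key
let m = Map.ofList [ 1, "one"; 2, "two" ]

let d = System.Collections.Generic.Dictionary<int, string>(2)
d.[1] <- "one"
d.[2] <- "two"

// Map.tryFind returns an option
let fromMap = Map.tryFind 2 m                    // Some "two"

// Dictionary.TryGetValue uses an out parameter,
// which F# exposes as a tuple for pattern matching
let fromDict =
    match d.TryGetValue 2 with
    | true, v -> Some v
    | _ -> None                                  // Some "two"
```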
So what else distinguishes the lookup choices? Maybe your app repeatedly loads data into new lookup tables.
Integer Key by Number of Elements loaded into Lookup Structure
* indicates capacity was not set for Dictionary and Hashtable
String Key by Number of Elements loaded into Lookup Structure
Once again Dictionary outperforms the competition, and even more so if you set the capacity properly in the constructor.
If you are doing updates, the performance differences become very pronounced. Mutable structures have the advantage of updating in place, while Map has no in-place update: Map.add on an existing key returns a new map instance with the binding replaced (a separate remove is not required), so every update allocates fresh map nodes and generates work for the garbage collector.
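A small sketch of the difference:

```fsharp
// Map "update": a new map instance is allocated; the original is untouched
let m = Map.ofList [ 1, "a"; 2, "b" ]
let m2 = Map.add 2 "B" m
// m.[2] is still "b"; m2.[2] is "B"

// Dictionary update: mutated in place, no new collection allocated
let d = System.Collections.Generic.Dictionary<int, string>()
d.[2] <- "b"
d.[2] <- "B"
```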
Integer Key by Number of Elements in Lookup Structure, 10,000 random updates
String Key by Number of Elements in Lookup Structure, 10,000 random updates
In conclusion, for those rare applications doing huge volumes of lookups and/or repeatedly creating new lookup tables, Dictionary is your best choice, unless you require the “functional” characteristics of a persistent lookup table. Dictionary also enjoys a clear performance advantage in iterating over its contents as IEnumerable (but I’ve already bored you with enough tables), so you can compensate for the missing “functional” methods (fold, etc.) by treating it as a sequence and using the Seq module’s functions.
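For example, since a Dictionary enumerates as a sequence of key/value pairs, the Seq module stands in for Map’s built-in fold, map, filter, and the rest:

```fsharp
// Dictionary<'K,'V> enumerates as seq<KeyValuePair<'K,'V>>
let d = System.Collections.Generic.Dictionary<string, int>()
d.["a"] <- 1
d.["b"] <- 2
d.["c"] <- 3

let total = d |> Seq.sumBy (fun kv -> kv.Value)        // 6

let evenKeys =
    d
    |> Seq.filter (fun kv -> kv.Value % 2 = 0)
    |> Seq.map (fun kv -> kv.Key)
    |> List.ofSeq                                      // ["b"]
```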
Note on methodology: Integer keys are in the range 1 to number of elements and string keys range in length from 1 to 26 characters, evenly distributed. Both key types are shuffled before loading the data structure (i.e. the order is somewhat random). Data is presented in array format for loading. Timed events are isolated to the greatest extent possible. Each timed event is performed 50 times and the best 40 times are averaged. This is to discard arbitrarily long timed events which inevitably occur and would skew the results if included in the average. Taking the median of 50 timed events would serve the same purpose. Original timings are in ticks and divided by 10,000 to arrive at milliseconds. Timings measured using DS_Benchmark. The machine of record is a low-end dual-core processor running 64-bit Windows 7. Not covered by this analysis is garbage collection of large monolithic structures like large dictionaries or hashtables.