Double-ended Queues for F#

Thanks to the FSharpx team I had the opportunity to add double-ended queues to FSharpx.Core. Think of double-ended queues, or deques (Okasaki's spelling, which I like), as lists that work in both directions. They have mirror-image values to the familiar head and tail, namely last and init, and a mirror-image operation of cons named snoc (cons spelled backwards). Imagine the possibilities a list offers when you can also add and remove elements at the end. For instance, you can pattern match on both head and last:

match q with
| Cons(hd, Snoc(mid, lst)) -> hd, mid, lst  //first, middle, and last
| _ -> ...

Which deque is right for you?

The four deques available in FSharpx.Core implement the following signature

inherit System.Collections.IEnumerable
inherit System.Collections.Generic.IEnumerable<'a>

Cons : 'a -> IDeque<'a>
Head : unit -> 'a
Init : unit -> IDeque<'a>
IsEmpty : unit -> bool
Last : unit -> 'a
Length : unit -> int
Snoc : 'a -> IDeque<'a>
Tail : unit -> IDeque<'a>
Uncons : unit -> 'a * IDeque<'a>
Unsnoc : unit -> IDeque<'a> * 'a

They also all have a module-level value, OfCatLists, which concatenates two lists to create a deque. Deques internally maintain two lists (or lazy lists), with the second list reversed; that is basically all it takes to allow operations at both ends. (For in-depth discussion of double-ended queues see either Okasaki's thesis or his book.) This makes the OfCatLists operation a very fast way to concatenate two lists. Across all four deque implementations the operation is competitive with the FSharpx data structure DList (difference list), which implements an append that is truly O(1). OfCatLists is not truly O(1), but two of the deque structures, Deque and BatchedDeque, actually out-perform DList right up through concatenating 10,000-element lists, with DList only barely out-performing them at 100,000. (To be sure, we are talking time scales of less than 0.3 milliseconds on a 64-bit system.)
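For instance (a sketch only; the module name and the casing of OfCatLists are taken from the description above and may differ slightly in a given FSharpx version):

// sketch only: module name and member casing are assumptions based on the description above
let q = BatchedDeque.OfCatLists [1; 2; 3] [4; 5; 6]   // deque containing 1 .. 6

let first, middle, last =
    match q with
    | Cons (hd, Snoc (mid, lst)) -> hd, mid, lst      // 1, deque of [2 .. 5], 6
    | _ -> failwith "deque too short"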

Two deques, BankersDeque and RealTimeDeque, implement additional values:

Lookup : int -> 'a
Remove : int -> BankersDeque<'a>
TryRemove : int -> BankersDeque<'a> option
Rev : unit -> BankersDeque<'a>
Update : int -> 'a -> BankersDeque<'a>
TryUpdate : int -> 'a -> BankersDeque<'a> option

Rev is always O(1), a nice feature of deques.

Although nowhere near as adept at the update operation as the AltBinaryRandomAccessList data structure (specifically designed for update), deques exhibit good across the board performance in the index enabled operations Lookup, Remove and Update. They have a theoretical worst case time complexity of ~O(n/2) on index operations because the elements are partitioned in two lists.

But watch out! The actual time complexity on index operations is O(d), where "d" is the displacement into the internal list containing the indexed element. The relative sizes of the internal lists can vary. BankersDeque and RealTimeDeque require a "front-back stream ratio constant" in their construction. I've implemented a default value of 2 where this is required, but you can also specify a higher value. Essentially it ensures neither list ever grows larger than a constant multiple of the size of the other. BatchedDeque does nothing with the relative sizes until you request a Tail or Init; then it rebalances so that neither list is more than twice the size of the other. And finally Deque is a naive implementation that does not regulate the relative sizes at all.
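To make the two-list idea concrete, here is a toy sketch (not the FSharpx code) showing why both ends are cheap and why indexing costs O(d):

// Toy two-list deque, illustration only (not the FSharpx implementation).
// 'Front' holds the first elements in order; 'Rear' holds the last elements reversed,
// so both ends of the deque sit at the head of an ordinary list.
type ToyDeque<'a> = { Front : 'a list; Rear : 'a list }

let cons x d = { d with Front = x :: d.Front }    // add at the front: O(1)
let snoc d x = { d with Rear  = x :: d.Rear }     // add at the back:  O(1)

// Indexing walks whichever internal list holds position i, hence the O(d) cost,
// where d is the displacement into that list.
let lookup i d =
    let frontLen = List.length d.Front
    if i < frontLen then List.item i d.Front
    else List.item (List.length d.Rear - 1 - (i - frontLen)) d.Rear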

They all implement IEnumerable and have available all the Seq module values, but be sure to always use the native Lookup and not Seq.nth. This is one of those cases where a straight-up comparison of time complexity (~O(n/4) versus O(n) over many random iterations) does not tell the whole performance story. Actual Seq.nth performance is much worse than the time complexity would suggest.

So, which deque is right for your purposes? As is always the case with complexity that makes reasoning about performance hard, do your own profiling!

Purely Functional Stepchildren

Lately I've taken a closer look at Julien's F# implementations of Chris Okasaki's Purely Functional Data Structures1. AltBinaryRandomAccessList2 caught my eye because, judging from its existing update function, it looked pretty simple to implement a remove function. As you may have noticed, in all the major collections of F# data structures (FSharp.Core.Collections, FSharpx.Core.DataStructures, and PowerPack) remove is pretty rare. And both update and remove seem to be neglected stepchildren in the purely functional data structure world. (By update and remove I am referring to functions updating and removing any element by index.)

Let’s look at Okasaki’s “pseudo-canonical”3 update for list, implemented in F#.

let rec loop i y (l:'a list) = 
    match (i, y, l) with
    | _, _, []        -> raise (System.Exception("subscript"))
    | 0, y', _ :: xs  -> y' :: xs
    | i', y', x :: xs -> x :: (loop (i' - 1) y' xs)

Do you see the problem? I’ll get back to it below.

How else might we implement update? We could just punt by converting to Array, updating, and back to list.

let punt i y (l:'a list) =
    let a = List.toArray l 
    a.[i] <- y
    List.ofArray a

And our last candidate is a hybrid: we progressively take the tail, collecting each head in an Array, until we pass the element to update, then cons the new element and the collected elements back onto the remaining list.

let hybrid i y (l:'a list) =
    if (i = 0) then List.Cons (y, (List.tail l))
    else 
        // walk the list, storing each element passed over in an array
        let rec loop i' (front:'a array) back =
            match i' with
            | x when x < 0 -> front, (List.tail back)   // drop the element being replaced
            | x ->  
                Array.set front x (List.head back)
                loop (x-1) front (List.tail back)

        let front', back' = loop (i - 1) (Array.create i y) l

        // cons the new element, then the saved elements, back onto the remaining list
        let rec loop2 i' frontLen (front'':'a array) back'' =
            match i' with
            | x when x > frontLen -> back''
            | x -> loop2 (x + 1) frontLen front'' (front''.[x]::back'')

        loop2 0 ((Seq.length front') - 1) front' (y::back')
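A quick sanity check in fsi shows all three produce the same result (the values in comments are what these functions return):

// all three update index 2 of a five-element list
let xs = [0; 1; 2; 3; 4]
loop   2 9 xs   // [0; 1; 9; 3; 4]
punt   2 9 xs   // [0; 1; 9; 3; 4]
hybrid 2 9 xs   // [0; 1; 9; 3; 4]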

AltBinaryRandomAccessList is a structure with the same basic functions as List, but optimized for update. Here’s the type structure:

type AltBinRndAccList<'a> =
    | Nil
    | Zero of AltBinRndAccList<'a * 'a>
    | One of 'a * AltBinRndAccList<'a * 'a>

The element grouping represents the binary number of the count of elements, with least significant digit to the left4. This allows for a member update function which runs in O(log n) time. (The best we have seen so far is O(i), where i is the index of the target element, and punt runs in O(n)).

static member fupdate : ('a -> 'a) * int * AltBinRndAccList<'a> -> AltBinRndAccList<'a> = function
    | f, i, Nil -> raise Subscript
    | f, 0, One(x, ps) -> One(f x, ps)
    | f, i, One (x, ps) -> AltBinRndAccList.cons x (AltBinRndAccList.fupdate (f, i-1, Zero ps))
    | f, i, Zero ps ->
        let f' (x, y) = if i % 2 = 0 then f x, y else x, f y
        Zero(AltBinRndAccList.fupdate(f', i/2, ps))
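To make the numerical representation concrete, here is how a five-element list is laid out under the type above (5 = 1 + 0*2 + 1*4, least significant digit first):

// [1; 2; 3; 4; 5]: digits One, Zero, One reading left to right
let five : AltBinRndAccList<int> =
    One (1, Zero (One (((2, 3), (4, 5)), Nil)))
// fupdate descends one level per binary digit, hence the O(log n) bound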

Now let's look at actual performance for our update alternatives5.

Size   10k Random Updates                     One-time Worst Case
       structure  x fastest  time             structure  x fastest  time

10²    PC         -          2.9 ms           Punt       -          0.2 ms
       Hybrid     1.4 x      4.0              PC         1.1 x      0.2
       Punt       1.5 x      4.5              Hybrid     4.1 x      0.8
       RndAcc     1.8 x      5.3              RndAcc     7.7 x      1.5

10³    RndAcc     -          8.2              Punt       -          0.2
       Hybrid     3.6 x      29.6             PC         1.1 x      0.2
       Punt       5.8 x      47.6             Hybrid     4.1 x      0.8
       PC         6.2 x      50.3             RndAcc     8.0 x      1.6

10⁴    RndAcc     -          11.5             Punt       -          0.3
       Hybrid     27.8 x     320.3            PC         1.3 x      0.4
       Punt       46.4 x     534.9            Hybrid     3.2 x      0.9
       PC         79.8 x     920.2            RndAcc     6.5 x      1.8

10⁵    RndAcc     -          14.8             Punt       -          1.0
       Hybrid     315.5 x    4.67 sec         Hybrid     1.5 x      1.5
       Punt       630.7 x    9.34             RndAcc     2.0 x      2.0
       PC         stack overflow

By way of comparison, Array, which is not a purely functional structure, performs 10,000 updates in 0.1 millisecond at all scales and really has no "worst case" single update. Notice also that, because its function is not tail recursive, Pseudo-Canonical eventually overflows the stack. Also of interest: starting at the scale of 1,000 elements, Punt outperforms Pseudo-Canonical in random updates, even though it is O(n) and the latter O(i). Time complexity does not tell the whole performance story.

So where does this leave remove? As it turns out, my original idea of modifying the update of AltBinaryRandomAccessList resulted in a function that was not tail recursive. Applying the hybrid technique to it produced a satisfactory algorithm6 and marginally improved performance, but like most every remove implemented in a purely functional list-like structure, the best time complexity you are going to get is O(i). (With a double-ended queue you can achieve ~O(i/2).)

Here’s the comparative summary of remove performance on list-like structures. This time I include Array, where I copy the retained elements into a new structure one element shorter.

Size   10k Random Deletes                     One-time Worst Case
       structure  x fastest  time             structure  x fastest  time

10²    Array      -          1.4 ms           Array      -          0.1 ms
       PC         1.9 x      2.6              PC         2.7 x      0.2
       Hybrid     2.7 x      3.7              Punt       4.4 x      0.3
       Punt       3.9 x      5.3              Hybrid     10.5 x     0.7
       RndAcc     36.5 x     49.7             RndAcc     40.4 x     2.7

10³    Array      -          11.2             Array      -          0.1
       Hybrid     2.6 x      28.7             PC         2.7 x      0.2
       PC         4.3 x      48.3             Punt       4.3 x      0.3
       Punt       4.4 x      49.3             Hybrid     10.4 x     0.7
       RndAcc     40.1 x     448.5            RndAcc     41.1 x     2.9

10⁴    Array      -          107.9            Array      -          0.1
       Hybrid     3.0 x      321.5            PC         3.2 x      0.3
       Punt       5.5 x      589.9            Punt       4.5 x      0.4
       PC         8.3 x      898.8            Hybrid     10.0 x     0.8
       RndAcc     48.8 x     5.27 sec         RndAcc     45.8 x     3.8

10⁵    Array      -          1.43 sec         Array      -          0.2
       Hybrid     3.2 x      4.51             Punt       6.0 x      1.1
       Punt       7.7 x      11.04            Hybrid     8.1 x      1.5
       RndAcc     42.9 x     61.39            RndAcc     89.8 x     16.1
       PC         stack overflow

As you can see, remove is a viable function for moderate-sized data structures, but suffers from the limitation of its time complexity as you scale up. You could easily modify remove functions to process an ordered list or range of indexes and still have O(max i) performance.

One remaining question for purists: are the stepchildren still Purely Functional when they use mutation inside a function?

Notes

1 Okasaki, Purely Functional Data Structures, 1998

2 AltBinaryRandomAccessList

3 Okasaki, 1998.

4 See Okasaki, 1998 or Okasaki, Purely Functional Data Structures, 1996 for in depth discussion of numerical representations of data structures.

5 Times were taken on a not particularly fast dual core 64-bit system. Time multiples were calculated at a finer granularity, and so may not perfectly match the timings in milliseconds. AltBinaryRandomAccessList is abbreviated to “RndAcc”.

6 I hope to release it soon in an open source library.

Benchmarking F# Data Structures — Loading Data

In many disciplines engineers rely on performance characteristic reference tables of the materials and structures they work with. For a host of reasons these aids are lacking in the field of engineering application software. In the previous article in this series I introduced DS_Benchmark, an open source project for benchmarking F# data structure performance. Now I’m rolling out the first performance tables constructed with this tool. At the same time I’m launching the F# Data Structures pages as a centralized source of information on data structures across all available sources.

But won’t performance vary on different systems?

Of course. That has long been a problem with benchmarks represented in terms of how much time operations take. I am using a different approach, which is to compare operations in terms of their relative speed. Running tests on a 32 and 64-bit system I found the performance ratios remarkably consistent despite big differences in absolute performance.

Loading Data Structures

The Initial Data Performance tables rank data structures, originating data source, and data loading action from fastest to slowest for data sizes ranging from 1 to 10⁶. The column "TIME REL TO BEST" reports how many times longer a given combination takes compared to the top ranked.

Naturally you will be interested in specific comparisons. At the scale 10⁵ it is not surprising that loading a List from another List of ascending integers ranks fastest for the "Init" action (List.ofList), but say your interest is in comparing the performance of FSharpx DList to FSharpx PersistentVector when loading from an integer Array. Read down the chart until you find

FSharpx-DList ArrayIntAsc Init 1.0 96.1

and

FSharpx-PersistentVector ArrayIntAsc Init 1.0 199.4

The initial load of DList takes 96.1 times as long as the top-ranked combination, so divide PersistentVector's number by this (199.4 / 96.1 ≈ 2.1) to learn that its load is about 2.1 times slower than DList's.

For more information on how the Performance Tables are constructed and how to interpret them see Performance Metric Tables Methodology and Reading the Data Structure Performance Tables.

While the tables comparing performance across data structures will probably be your go-to tables, there are also tables comparing performance within a data structure: one comparing different methods of loading data (there are some surprising results), as well as actual performance when scaling up data sizes.

Pathologies and other observations

Reasoning about any kind of software performance is notoriously hard. Modern CPU pipelining architectures and instruction and memory caches make it darn near impossible. And there are many considerations to take into account regarding the choice of language. You need to do your own profiling. These performance tables are no substitute, but meant to be a guide. Here are some interesting things that come to light with the performance tables:

  • Structures you would expect to display more or less O(n) data loading performance seldom do. It's much more common to see better performance up to some threshold (often 10⁴) and then worse than O(n) beyond it.
  • The Power Pack library's HashMultiMap structure loads significantly faster using the AddOne action (a For...do loop) than the new() constructor up to scale 10⁵, where performance becomes roughly the same.
  • FSharpx's RealTimeQueue will stack overflow at scale 10⁵ on both the 32 and 64-bit systems I tested on.
  • FSharpx's BootstrappedQueue scales remarkably well when loading from a List. (Of course if you looked at the code for this structure you would understand why.) And so the AddOne action scales badly in comparison to Init at scale 10⁵.
  • Not only is Array generally the fastest structure to initialize at all scales, it is often also the fastest source for loading another structure.

Caveats

It’s important to remember the data load performance charts are the result of the performance interplay of the source and target data structures. If you are loading from I/O or some other “neutral” source consider comparing performance characteristics of Array loads, since Array seems to be consistently fast, and therefore somewhat neutral.

You will also see some strange blips at certain scales and for certain data and ordering characteristics. While I can’t vouch for every one, the oddities I looked at were consistently repeatable. I was still doing some development work while compiling these statistics. There is the possibility of bad data, and I eventually intend to re-compile all these stats. I deemed it more important to publish what I have and move forward with the next phases of my data structure performance survey.

Still to come

Still a lot of data structures to add to the data structure pages, and I still have to compile performance characteristics on common actions, like lookup, update, append, and iterate. So I'll have my head down working on this for a while. Please send comments, suggestions, and requests.

Here are some more .NET related performance benchmarks, also using the ratio comparison method.

A Tip on Debugging F# Type Signatures

I recently had a problem with someone else’s code that raised a debugging issue not found in most languages. Somehow a complex let binding calling several other let bindings returned the wrong type signature. I was expecting a collection of generic type ‘a, but the compiler was inferring a collection of type int. It’s not unusual to have to explicitly annotate the parameter and return types in let bindings, and that was my first course of action, but the let binding still returned the wrong signature despite annotations. The compiler’s type inferencing works up and down the chain of functions, so how do you isolate a problem like this?

Here’s a technique you may find useful for problems like this. The key concept is to isolate the problem, so comment out all the code inside the let bindings in question and stub out a return of the correct type until the whole complex of functions returns the correct type. Then methodically start uncommenting code that returns a value, checking the let binding return type at each step. When the return type changes to incorrect, you have isolated the problem spot (or perhaps one of several).
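As a hypothetical illustration (the binding and step names here are made up), the stubbed state of one such let binding might look like this:

// annotate the expected signature, stub the body, confirm the compiler agrees,
// then uncomment the real code one piece at a time
let processAll (items : 'a list) : 'a list =
    // originalStep1 items
    // originalStep2 ...
    []   // stub of the correct return type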

In my case the author of the code in question incorrectly assigned the critical value of type 'a from a wrong value, which happened to be of type int. His only unit test case tested type int, and no other types for the generic binding, so this problem slipped through.

Signature mis-matches can be very difficult to work through, so first try to isolate the problem by simplifying the chain of signature match-ups as much as you can.

Another lesson here is although functional programming eliminates whole classes of potential bugs, it is still possible to write incorrect logic. Unit testing only catches a problem if you wrote a (correct) unit test for that particular case.

More About Signatures

Type signatures are an interesting and important topic in their own right. Here are some miscellaneous signature topics to brush up on.

How to read F# type signatures?

F# Generics

Automatic Generalization

F# Constraints

Equality and Comparison Constraints in F#

Flexible Types

A Stackoverflow thread on flexible types

The Git Article I Hoped not to Write

As I explained in my introductory article, Git is a core technology I am exploring. I had halfway expected, and hoped, I would have nothing to say on this topic. Of course hope is a plan doomed to failure, and in this case the go-to-hell-plan is to write an article.

Cutting to the Chase

The only resources you need for Git are

The Git Reference

Pro Git

The rest of this article is another data point in the seemingly endless Git literature all over the web. Continue at your own peril.

A little Background

Back in the dark ages of mainframe batch programming, source control was via delta decks and their text-file descendants. This apparently evolved into CVS, the first modern source control I was exposed to. I was PM of one quarter of a 40-member project, which in turn was part of a larger 300-person project. The 40-person team hired a contractor to set up and administer CVS for the team (he eventually had other duties as well). In this role I never got any closer than that to CVS, since I wasn't writing code. I wasn't sure how to interpret having a full-time person who did mostly source control. The project was doing several very inefficient things I had no control over.

SourceSafe is a ticking time-bomb if you don’t take serious back-up precautions. Nuff said. The first source control system I had the responsibility to set up from scratch was TFS. Ultimately all source control includes manual policies and procedures. The trick is understanding enough about a new source control system from reading the documentation and playing with it to develop your own complete system. (Back in the day there was less how-to literature available on this.) It’s been a long time now, and I’ve lost touch with whatever changes Microsoft has made, but we were happy with TFS. It fit in well with the team’s rapid-fire development style of constantly changing anything and everything, but corporate, in its wisdom, decreed all development teams must get on the company’s Subversion “Enterprise-class centralized version control for the masses” system. So another iteration of figuring out how this really works and how to fit it with our work style. For as long as I was with that team we were never happy with Subversion.

Back on my own again, I'm once more learning a new version control system, this time with no one to bounce ideas off or ask questions. I even embarrassed myself posting a stupid question yesterday (sorry Phil and GitHub!). Phil Haack graciously responded. I had read the Pro Git book over a year ago, but without putting Git into practice it all just leaked out of my mind.

At the critical level of functionality, Git is just another (though arguably superior) choice in managing the merging and auditing of source. It still takes gatekeeping policies and procedures to manage pull requests.

Why is Every GUI a Two-edged Sword Compared to the Command Line?

Coincident with my starting to use Git, GitHub came out with a Windows interface. Now this was both good and bad. The good was I didn't even glance at the Git Reference or Pro Git book to develop and publish my first open source project. The bad was I didn't even glance at the Git Reference or Pro Git book. Using the GUI worked well as long as I was developing in the master branch. My confusion started when I wanted to create a branch. Like all good GUIs, it just works. Unlike all good GUIs, there is no help feature to clue you in to what is going on. So having forgotten all I read about Git over a year ago, I created a branch and started looking for how to hook up the branch to its workspace. Doh! It's all the same workspace with Git.

…so off to the command line, which involves learning what's really going on, and off to set up a sandbox Git project, because I have a deep aversion to doing anything with production source code when I am not 100% sure of the consequences.

Here’s what I learned in a very short time:

The Windows interface saves some drudgery in setting up a repo. You can initialize a repo with the command line

>git init

(which I did in my sandbox), but the Windows interface sets up helpful parameters in .gitignore and .gitattributes files specifically for Visual Studio development. I’m happy not to have to dive too deep into the ins and outs of these files just yet.

The situation with commits and branches is more nuanced. So for now I’ll switch to the command line because that’s the only way I’ll learn the full range of useful things I can do with Git. I was thinking about installing a Visual Studio Git extension, and Git Source Control Provider appears to be the only option, but I think I will put that off for now.

In the mean time,

>git status

is my friend.

And Finally, the Punch Line

--everything-is-local

That’s the tag line on the git-scm.com site, and now I understand. You can use Git for simple source control, like any other SCM. And if I had been a little braver (or foolhardy) about using an undocumented GUI feature, I might still be there. Now, however, I see my way forward learning Git. It’s actually personal productivity software. Once you acquire a certain level of adeptness you can do things you would not have considered doing, either because it was too tedious or impossible. Things like exerting more control over the commit story your branch tells, or more frequent branching to allow experimentation that otherwise risks the integrity of another branch, or multiple versions, or things I haven’t thought of.

To update your local repository with work posted to the remote

>git fetch origin
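Note that fetch only updates your local copies of the remote's branches; to bring that work into your current branch, follow it with a merge (git pull combines the two steps). For example, assuming you are on a branch that tracks origin's master:

>git fetch origin
>git merge origin/master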

Benchmarking F# Data Structures — Introduction

Overview

DS_Benchmark is a new open source project for benchmarking F# data structures. I developed it to provide a platform for comparing performance across data structures, data types, common actions on structures, as well as scale-up performance. It currently supports almost all the structures in the FSharp Collections, FSharpX Data Structures, and Power Pack libraries. The project’s architecture allows incorporating additional structures and functions to be timed with minimal effort.

The choice of data structure in a functional language project profoundly influences architecture and performance. You may not have the time or inclination to read up on purely functional data structures, and even if you are familiar with a lot of data structures and their time complexities, how do you choose between several candidates all offering O(1) complexity in a crucial function? You should always do your own profiling, and especially profile performance critical code.

Some of the benchmarks possible with this project include:

Compare similar functions across data structures.

Investigate initializing structures from different collection types.

Look for pathologies caused by certain initial data.

At what scale does a data structure become less or more performant compared to other choices?

Methodology

I intend the project to measure what I call "most common best case performance" in an "apples to apples" manner, so it does not readily tell you how many milliseconds a process took (although you can retrieve this information by not drilling down very deep in the project). I took the precautions I could think of to best isolate the event being measured. The project forces garbage collection prior to the event, and the timed event executes in its own exe. (It's still important to minimize all other activity on the machine during timings.)

I found through experiment that less than 15% of timing executions are typically outliers on the long side, so DS_Benchmark times 100 executions and uses the fastest 85 to calculate a median time (in ticks) and the deviation from the median.
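A minimal sketch of that measurement loop (illustrative only, not the actual DS_Benchmark code, which also isolates each timed event in its own exe):

open System
open System.Diagnostics

// time an action 100 times, keep the fastest 85 runs, return the median in ticks
let medianTicks (action : unit -> unit) =
    let timings =
        [| for _ in 1 .. 100 do
             GC.Collect ()                      // force collection before the timed run
             GC.WaitForPendingFinalizers ()
             let sw = Stopwatch.StartNew ()
             action ()
             sw.Stop ()
             yield sw.ElapsedTicks |]
    let fastest85 = timings |> Array.sort |> Array.truncate 85
    fastest85.[fastest85.Length / 2]            // median of the retained runs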

The output format is designed to enable further analysis within and across data structures by computing comparisons in terms of ratios (percents) rather than simply reporting milliseconds (which only apply to the host environment). Ratio comparisons are remarkably environment agnostic, as long as all original measurements are taken in the same environment. For instance I ran analyses based on measurements on two different machines, and even though the measured timings (in ticks) were wildly different, all of the comparative ratios I have analysed so far are very similar.

Simple Measurement Set-up

The console1 program in the solution provides documented examples of the most useful cases. Timings of individual scenarios print to the console, or you can generate batches of timings with a simple template model. The batched timings output to a tab-delimited file for easy import to Excel. (My next phase for this project is to persist the timing results and automate the analysis. The idea is to compare actions and inputs within and across data structures.)

For more in depth usage and architecture information see the project readme file.

Coming up Next

In following articles I will report on some interesting comparisons of data structures using DS_Benchmark. I am devoting a permanent section of this blog to F# Data Structures. Keep an eye on this space for benchmark comparisons as well as a single point of reference for many data structures and documentation, including time complexities, that otherwise are scattered all over the internet.

If you are interested in adding structures or otherwise contributing to the project, get in touch with me.

Pinhole Astronomy

For Feliks.

I’m diverting from the usual blog topics to chronicle some goings-on in our neighborhood. Sol, Venus, and Luna have been aligning for some remarkable views from Earth. First up is the May 20, 2012 Annular Eclipse. I chose as my vantage the top of Sugarloaf Mountain at the head of Doney Creek on Lake Shasta, California.

I grew up summers on Doney Creek in the '60s and early '70s, and remember in those days the hillsides and mountains around the lake dominated by manzanita, the forests in the region having been largely destroyed by copper mining in the area around the turn of the 20th century; not so much the mining as the smelting. The dirt road to the lookout starts at Grandpa Fox's old house on Doney Creek and climbs nearly 2,000 feet (600 m) to the top of Sugarloaf. I hadn't made this trek in 40 years. The forest had grown up much more around the road than I remembered. At any rate I know I had never been up the mountain at this time of year, and it was really beautiful. A little late in the season for poppies, but the woolly bluecurls were blooming in abundance.

A half mile from the summit the road was closed. The Forest Service still had not opened the lookout for the fire season. Here was a little group of observers with “real” equipment (reflector telescopes and cameras with proper filters). But I was going all the way. So I slung my telescope, tripod, camp chair, and courier bag over my shoulders, donned my straw hat, and made for the lookout like some itinerant tinker.

The view from the top was well worth the climb: Mt. Shasta, Mt. Lassen, Lake Shasta, the Sacramento Valley, the Marysville Buttes, the Trinity Alps, and the Castle Crags. The party down the mountain had the better equipment, but I had the view…to myself.

My equipment was not that bad. Observing the sun with a pinhole telescope (camera obscura) is somehow sublime in a way I cannot describe, and I could not take a photograph that captures it either. None of the photos of the pinhole screen I attempted at maximum eclipse, the sublimest of moments, even turned out. I had the shutter speed set too fast. Held over the lens of my camera, good old shade 14 welder's glass produced much better photos.

Jim Welsh was in the party down the mountain from me with the real equipment. And while he didn’t get all the views I did, he put together this awesome (and I’m not using that overused word lightly) timelapse of the eclipse from Sugarloaf Mountain, California.

Well, here are the particulars of the event from my observation post, followed by photos. Notice the photos of the pinhole screen show the event vertically inverted (upside down). The focal length of my pinhole telescope was 63 inches (160 cm).

Lat.:  40.9144° N      Annular Solar Eclipse, Magnitude: 0.967
Long.: 122.4449° W     Duration of Annularity: 4m 44.5s

Event                    Date/Time (UT)           Alt      Azi       Date/Time (PDT)
Start partial eclipse    5/21/2012 12:10:58 AM    34.6°    267.5°    5/20/2012 5:10:58 PM
Start annular eclipse    5/21/2012 1:25:45 AM     20.6°    279.4°    5/20/2012 6:25:45 PM
Maximum eclipse          5/21/2012 1:28:08 AM     20.1°    279.8°    5/20/2012 6:28:08 PM
End annular eclipse      5/21/2012 1:30:30 AM     19.7°    280.1°    5/20/2012 6:30:30 PM
End partial eclipse      5/21/2012 2:36:03 AM     7.8°     290.1°    5/20/2012 7:36:03 PM

The Transit of Venus on June 5, 2012 was much less dramatic, both in the event and the observational circumstances (my back yard in Danville). Even so I was much better equipped, with a 129 inch (328 cm) pinhole telescope. This one projected an image of the sun an inch and a half (3.8 cm) in diameter. You could just discern sunspots, which was my criterion for success. Even so, I still couldn't adequately capture the images in photos. Shade 14 welder's glass once again provided the best photos.

…and the much more interesting timelapse, courtesy of Jim Welsh.

When Accuracy is the Wrong Evaluation Measure

Say you are developing a process returning a list of results, and you want to evaluate how well your process is doing. This is a process more complex than a database or dictionary lookup. You expect it cannot be perfect, but you have criteria for judging whether each result in the list is correct or not, and whether it returned all the intended results. Let’s call this general case information retrieval.1

Definition: Information Retrieval – A process returning zero to many results. Each returned result can be evaluated as correct or incorrect. Likewise results not returned can be evaluated as correctly or incorrectly rejected.

So information retrieval systems have two valid states in relation to any target in the potential result set, correct selection and correct rejection. There are also two ways the system can be in error, namely incorrectly rejecting a result and incorrectly returning a result.

Definition: Type I Error – false negative or incorrectly rejecting a result.

Definition: Type II Error – false positive or incorrectly returning a result.

            correct          incorrect
returned    good             Type II error
rejected    Type I error     good

The naive approach to evaluating an information retrieval system is a simple comparison of correct results to incorrect results, measuring either accuracy or error, usually as a percentage.

Definition: Accuracy – (correctly returned + correctly rejected) / (set of potential results)

Definition: Error – (Type I and II errors) / (set of potential results)

Accuracy and error are complements of each other, and while they are good measures for many day-to-day problems, their usefulness falls apart when we evaluate information retrieval systems where the set of items that should be rejected is large.

We need to measure something else to evaluate systems like this, precision and recall.

Definition: Precision – Fraction (percent) of returned results in an information retrieval system that are correct.

(correctly returned) / (correctly returned + Type II errors)

Definition: Recall – Fraction (percent) of correct results in an information retrieval system that are returned.

(correctly returned) / (correctly returned + Type I errors)

By themselves neither precision nor recall is a particularly good measure. Simply return nothing and precision will be "off the charts" (literally, since you will divide by zero!), but then recall will be zero. Alternatively, return everything and recall will be 100%, but precision suffers.

By calculating the harmonic mean of precision and recall we have a much better measure of information retrieval performance in one number.2

Definition: F Score – A measure of system accuracy derived from precision and recall. The most general case, giving equal weighting to both parameters, is the harmonic mean

F1 = (2 * precision * recall) / (precision + recall)

You can vary the weighting of precision and recall by applying the formula with an additional parameter.

Fa = ((1 + a²) * precision * recall) / ((a² * precision) + recall)

Setting a > 1 weights recall higher, while a < 1 weights precision higher.

There is one more measure, fallout, which measures the rate of Type II errors. It shows the probability of a false positive, which may be an important measure, depending on your system.3

Definition: Fallout – probability of a Type II error (false positive) in an information retrieval system.

(Type II errors) / (Type II errors + correctly rejected)
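Putting the definitions above into code (an illustrative sketch; the record and function names are mine, not from any library):

type RetrievalCounts =
    { Returned : int      // correctly returned
      Rejected : int      // correctly rejected
      TypeI    : int      // false negatives, incorrectly rejected
      TypeII   : int }    // false positives, incorrectly returned

let measures c =
    let ret, rej, t1, t2 = float c.Returned, float c.Rejected, float c.TypeI, float c.TypeII
    let accuracy  = (ret + rej) / (ret + rej + t1 + t2)
    let precision = ret / (ret + t2)
    let recall    = ret / (ret + t1)
    let f1        = (2.0 * precision * recall) / (precision + recall)
    let fallout   = t2 / (t2 + rej)
    accuracy, precision, recall, f1, fallout

// the weighted F score, for some a > 0
let fWeighted a precision recall =
    ((1.0 + a * a) * precision * recall) / (a * a * precision + recall)

// first row of the table below: 25 returned, 99 rejected, 100 Type I errors, 3 Type II errors
// yields roughly (0.546, 0.893, 0.200, 0.327, 0.029)
let example = measures { Returned = 25; Rejected = 99; TypeI = 100; TypeII = 3 }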

Finally, let's look at the numbers from some made-up information retrieval systems. In the first two sets notice that as we scale up the proportion of correctly rejected results in the set of potential results, accuracy and error get better and better, but precision, recall, and F score remain consistent (and the F score is consistently poor despite improving accuracy). In the last set notice the different ways we can achieve high accuracy with a poor F score.

correctly   correctly   Type I   Type II
returned    rejected    error    error     accuracy   error    precision   recall   F score   fallout
25          99          100      3         54.6%      45.4%    89.3%       20.0%    32.7%     2.9%
25          990         100      3         90.8%      9.2%     89.3%       20.0%    32.7%     0.3%
25          9,900       100      3         99.0%      1.0%     89.3%       20.0%    32.7%     0.0%
25          99,000      100      3         99.9%      0.1%     89.3%       20.0%    32.7%     0.0%

100         99          3        25        87.7%      12.3%    80.0%       97.1%    87.7%     20.2%
100         990         3        25        97.5%      2.5%     80.0%       97.1%    87.7%     2.5%
100         9,900       3        25        99.7%      0.3%     80.0%       97.1%    87.7%     0.3%
100         99,000      3        25        100.0%     0.0%     80.0%       97.1%    87.7%     0.0%

34          99,850      115      1         99.9%      0.1%     97.1%       22.8%    37.0%     0.0%
100         99,700      100      100       99.8%      0.2%     50.0%       50.0%    50.0%     0.1%
75          99,700      75       150       99.8%      0.2%     33.3%       50.0%    40.0%     0.2%
125         99,625      245      5         99.8%      0.3%     96.2%       33.8%    50.0%     0.0%
195         99,525      5        275       99.7%      0.3%     41.5%       97.5%    58.2%     0.3%

Notes

[1] This is a restricted definition of information retrieval. As you can see in the Wikipedia article the term frequently applies when results have varying degrees of relevancy.

[2] Since perfect precision and recall together result in an F score of 1.0, I represent it as a percentage, which would probably make a real statistician cringe.

[3] For consistency’s sake also shown as a percentage.

A Course of F# Study

I’ve been tinkering off and on with F# for about two years. Now I plan to dive in and do some serious code writing, so I’m gathering all my resources in one place (here) ready to move forward. For anyone with a beginning to intermediate level of experience with F#, this is my recommended course of study.

Development Environment

If you plan to write F# in Mono I can't help you with setting up the environment, but now is a great time to get started with F# using the Visual Studio 11 beta because it includes the upcoming F# 3.0, and the Intellisense for F# is better than ever in the beta. F# 3.0 includes the much anticipated Type Providers and F# LINQ queries against Type Provider data sources. Type Providers auto-magically provide strongly typed data structures from diverse sources at design time, meaning the data schema is available to you in Intellisense. Wow! Type Providers are available for a growing number of data providers1, and if you can't find the Type Provider you need, write your own.

You’ll want to install a couple VS extensions.

  1. F# Outlining lets you collapse sections of F# code, pretty much like you already can with C# in Visual Studio.

  2. I also recommend getting both the MsTest Project Template and FsUnit from Daniel Mohl. The template will get you started a little faster on unit tests, and FsUnit will make your unit tests more composable, a little more in tune with how you (hopefully) write the rest of your F# code (i.e. in a functional manner).

Core Course

All you need to learn F# is the Wikibook F# Programming and the MSDN Visual F# references pages, most notably the F# Language Reference and F# Core Library Reference.

So open up VS11 beta and the MSDN reference pages and work through each chapter of F# Programming. It covers every topic of the language you need to know (with the possible exception of Code Quotations and Type Providers). This course is complete and easy to follow. I still use it as a reference.

You'll also want to use the F# REPL, fsi, while writing code, or while studying code someone else has written. You have a lot of choices here. There's an fsi window built into Visual Studio, or you can run it from the command line or PowerShell. There are at least two versions available on web pages, Try F# and Try Joinads. There's even a Try F# iOS app.

You'll find you have to deliberately write F# one statement at a time. Of course that's literally true for other languages you've worked in, but especially while first learning you will find it challenging to get the syntax perfect as you attempt more sophisticated statements (and it has to be perfect). So take advantage of Intellisense in the Visual Studio code window, and of fsi for syntax debugging, until the statement builds correctly. If everything does not build in the VS code window (with the exception of the incomplete statement you are currently writing), Intellisense is not available. This is why you have to make sure each statement builds as you write it. But here's your reward: if your statement builds (and you are coding in a purely functional style) it probably works as you intend. More time up-front thinking through what you are doing, and less time typing.

Beyond the Basics

Once you finish Awesome Princess’ excellent course, you’ll want to continue your graduate studies along with whatever productive coding you are doing. Studying the code others have published is a good place to start. F# Snippets is a repository of open source F# snippets that do all sorts of things created by anyone who cares to post. Hat tip to Tomas Petricek for creating this great site and the technology behind it. You know what’s really cool? Mouse over the code in any snippet on the site and you get Intellisense tool tips!

And then there are the bloggers. So many smart folks writing great articles on the whys and how-tos of F# in particular and functional programming in general. Don Syme's WebLog on F# and Related Topics deserves special mention, as he is the designer and architect of the F# programming language. Here are a few others I've found useful (and I know there are many others). In some cases I link directly to the author's F# tagged articles.

Tomas Petricek’s homepage

Bug squash, Mauricio Scheffer’s blog

Clear Lines, Mathias Brandewinder’s blog

Random Ravings of a Red Headed Code Monkey, Daniel Mohl’s blog

Rash thoughts about .NET, C#, F# and Dynamics NAV, Steffen Forkmann’s blog

Faisal’s space, can’t find Faisal’s last name

Intellifactory Blog, creators of WebSharper

F# and Data Mining, Yin Zhu’s blog

Chris Smith’s completely unique view and then continued at a Chris Smith on Software, but apparently Chris hasn’t blogged about F# since 2010

Simon Cousins: almost functioning, Simon Tyler Cousins’ blog

MOIRAE Software Engineering Ltd, a software engineering company based in the north east of England

F# News, Jon Harrop’s blog

Matthew Podwysocki, can’t seem to filter on just F#

Inside F#, can’t find Brian’s last name

Geez this is getting kind of long.

Here’s the F# open-source community on GitHub. Any of these folks I missed probably have F# blogs too.

…and last, but not least

FPish – Online community of functional developers, more than a blog…

Theory Matters

Start getting used to the arrow notation F# uses to describe function signatures. This is a big deal. Function signatures are going to be constantly in your face, and sometimes subtly lurking in the background through everything you do in F#. So pay attention to them from the beginning. With a little effort you will get good at understanding them.
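For example, the signature of List.map already tells most of the story:

// List.map : ('a -> 'b) -> 'a list -> 'b list
// read: given a function from 'a to 'b and a list of 'a, it produces a list of 'b
let lengths = List.map String.length ["one"; "two"; "three"]   // [3; 3; 5]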

Once you complete the course, read a few articles, and write some functional code, you start realizing how different functional programming is from the old familiar imperative style. And you will probably run across articles that get pretty deep in the theoretical weeds. Inevitably you will read the term “monad”, and you probably won’t understand what the heck the article is about. I spent a long time reading and rereading articles about monads, studying code implementing monads, and I still didn’t get it.

Finally I found a concise, understandable description of monads. Here’s another article that is a little longer, a little more abstract, but still comprehensible. And finally you may find the Wikipedia article useful (after you start to catch-on). Some people might find an explanation illustrated in C# useful.

Don’t try to force-feed yourself on monads. You can get a lot done (even with monads) without fully understanding them.

Thankfully the helpful fellows on the FSharpx team have implemented a number of the most common monads, and a bunch of other basic F# functional constructs in their useful .NET library. It’s well worth the effort to download the source code and study it. While you are studying the monad code in FSharpX keep handy this set of Matthew Podwysocki articles:

Much Ado About Monads – State Edition

Much Ado About Monads – Reader Edition

Much Ado About Monads – List Edition

Much Ado About Monads – Maybe Edition

Much Ado About Monads – Creating Extended Builders

Much Ado About Monads – Creating Extended Builders Part II

A Kick in the Monads – Creating Extended Builders Part III

A Kick in the Monads – Writer Edition

OK, enough of monads. I'm going to throw one more series of articles at you. Monads are not a prerequisite; in fact, skip monads altogether and read this series. It gets to the heart of a broad area of functional programming, specifically using F#.

Catamorphisms, part one

Catamorphisms, part two

Catamorphisms, part three

Catamorphisms, part four

Catamorphisms, part five

Catamorphisms, part six

Catamorphisms, part seven

Catamorphisms, part eight

Purely functional data structures are an important part of functional programming. Chris Okasaki popularized this subject with his ground-breaking PhD thesis, and the expansion of his thesis into the book (adding more structures that weren’t part of his original research). F# programmers are fortunate that many, if not all, of the structures in Chris’ book have been implemented in F# (don’t forget to click “Older Entries” at the bottom of the page, there are two pages of headings). Hat tip to Steffen Forkmann for bringing this to my notice. Whoever Julien is, he needs to stand up and take a bow.

Had Enough?

If not, then Try Joinads, Tomas Petricek's excursion into an F# research extension for concurrent, parallel and asynchronous programming. I think his work on joinads is really valuable, and I hope Microsoft soon incorporates it into an F# 3.1 release, but for now it requires a hacked version of the open source F# compiler, and supposedly breaks Intellisense.

Notes

[1] F# 3.0 comes with a handful of built-in type providers and FSharpx provides more. At a minimum you should find Type Providers for OData services, SQL schemas and connections, WSDL web services, resource files, regex, file system, csv, Excel, JSON, XML, Registry, XAML, and App Settings.

Introducing craftyThoughts

I fell into The Craft through an accident of history. I had returned to this country at the height of the '82 recession, after three years of adventures abroad, with no special skills, looking for any kind of entry-level position to earn a living. It took nearly a year of night school and dead-end day jobs to land a position leading to System 370 assembler coding, developing health insurance claims adjudication systems. Ironically, my only brush with computers at university was a statistics course I dropped. Coding up punch cards was so uninspiring, but a tour of the data center exposed me to Wylbur.1 The possibilities from that brief encounter nurtured my thoughts for years.

More accidents of history steered me toward the Microsoft technology stack, and kept me out of Internet development until the 21st century. Eventually life led me to an Internet startup in the SaaS space that became successful.

The Craft

In some ways today’s Internet workforce of engineers, coders, hackers, and analysts resembles the medieval masons who built castles and cathedrals. We are virtually, if not physically, itinerant, going from company to company, industry to industry, even founding industries. The journeyman stonecutter had more freedom than contemporary agricultural serfs because acquiring skill required time and practice and contemporary society was in an expansive phase of stone building. And both workforces understand building things. Cutting and setting stone then, and journeyman Internet work today, can provide a good living and lots of personal choice, but entrée to The Craft requires more. It requires seriousness of purpose and love of understanding…

Lest I wax too philosophical, let me call out one Master of The Craft, Douglas Crockford, on the importance of studying history, Crockford on JavaScript – Volume 1: The Early Years. And lucky lazy student! The contiguous history of our craft only stretches back to Alan Turing’s seminal 1936 paper, On Computable Numbers, with an Application to the Entscheidungsproblem. There were significant precursors, most notably Boole and Babbage (and some might include Leibnitz and Lovelace).

One more bit of grounding from an established Master

“…there’s really no such thing as raw general intelligence–it’s all just computation.”

– Stephen Wolfram

Plan for craftyThoughts

So with applications experience spanning eight industries and a long-time interest in semantics, I’m launching this site to approach the problem of meaning (semantics) in data and automated systems. To that end I throw out this postulate (which I do not attempt to prove): theory matters. If you attempt anything at all sophisticated in automation, you must build on the foundational engineering work that has come before. Theory and history teach us what is possible.

I intend to scratch my own itch. Articles will take the form of topical notes organized for presentation. Rather than crafting persuasive arguments or rehashing information widely available I will link to where others have already done this. I don’t claim prior expertise in any of the areas I am investigating. I’m not sure I will ever present anything original, and if I do convey any synergy whatsoever, it is because I stand on a foundation built by others.

Among the enabling technologies and application theory I intend to investigate and present:

Functional Language

If you have some facility in the most important programming language today and you know about the good parts, then you know about functions as first-class citizens, anonymous functions, and closures. Congratulations! You have scratched the surface of functional programming, and along with the goodness this provides, you may also wonder if the trade of functionality for problems was a good deal, given the issues with debugging anonymous functions in JavaScript. This is hardly an endorsement of functional programming. In exchange for some useful tools you now write software that is harder to maintain.

Well, if JavaScript is your entire exposure to functional language, then you have not been entirely exposed. I’m going to make a bold claim: writing software using functional programming in a purely functional style (if not already enforced by the language) will result in far, far fewer defects right out of the box, more concise software (i.e. fewer lines of code), software whose source code more readily reveals its intent, and software that in many, if not most cases can easily be multi-threaded. (It is possible to write purely functional software in established object-oriented and imperative languages, but I don’t recommend it.)2

Functional languages have been around since 1958, when the grand-daddy of them all, Lisp, appeared. So if functional languages have been around so long, why aren’t they more “mainstream”? Some authors have put forth the “victim of own success” argument, but I think the root cause is more profound, having to do with how we were taught to think about programming. From John von Neumann to Donald Knuth the pioneers of Computer Science naturally fell into the imperative style of programming (with the notable exception of John McCarthy).

do this

do this

test this

depending on results

loop

branch

etc.

Imperative programming requires less thought up front. Indeed, most experienced programmers fall right into writing code without fully thinking through the problem at hand because they intuitively recognize a familiar design pattern (even if they cannot name or explain the pattern). The success of this technique depends on “just in time” thinking through problems as they arise. For your typical CRUD application this is good enough. In the best cases a skilled and careful programmer delivers a structured, simple application incorporating enough, but not too much object-orientedness. In the hands of lesser talent all sorts of problems arise. Over-reliance on modern debugging tools during the code writing phase can lead to hacks upon hacks, solutions to problems in code ever further removed from the application’s problem at hand. Over-exuberant OO style leads to less maintainable code, and on and on. Combating the problems inherent to imperative programming has spawned whole classes of techniques, products, and specialties: detailed coding specifications (beyond mere functional specs), shop-enforced coding standards, code walk-throughs, code quality analysis tools, code generation tools, TDD, BDD, etc.

Sixty years worth of tools and techniques arising from the problems of imperative programming still have their place for functional programming3, don’t get me wrong. And imperative programming is not going away. It is responsible for nearly all the foundations of the computer-enabled culture we live in, including the website you are now on. It’s just that in many cases, especially the hard cases, there is a better way going forward.

So why has functional programming, despite its advantages, not gone “mainstream” (i.e. why aren’t medium and large shops using it)? And why the intensity of interest in functional programming today? Let me answer the second question first: better access to educational opportunities. Remember I said theory matters? Well theory doesn’t stand still, especially in a field as young as computer science/engineering. The foundations upon which engineering students are educated and the avenues available for continuing education just keep getting better, practically on a daily basis. When Andrew Ng’s Machine Learning course at Stanford, which typically enrolled 350 students, was first offered as a free online course in 2012, reportedly over 100,000 people enrolled worldwide. With easy access to resources, and a solid theoretical basis, the Royal Road4 to understanding functional programming (along with virtually every other science and engineering topic) is open to all. There is a ground swell of young (and not so young) engineers discovering functional programming for themselves.

Finally, my take on the difficulty of establishing functional programming as a more “mainstream” development paradigm. It involves a radically different thought process. You have to do more thinking up-front and you have to do a different kind of thinking. Whereas the thinking in imperative development is along the lines of “how to do something”, functional development requires in-depth comprehension of “what we are doing”. Let that sink in for a minute. It is not trivial. The general course of development for functional programming is to abstract the problem at hand. This is what I mean when I say functional programs better reveal the “intent” of the task they accomplish. This is a completely different thought process. Do you remember in grammar school when most of your classmates hated math word problems? I read once that John von Neumann’s greatest skill was his ability to take a problem from any field and abstract it to the right mathematical form. Rather than a dozen or so design patterns to automatically fall into and start coding, as with imperative programming, you have literally all of abstract math and logic to choose from.

Fortunately that is not as daunting as it sounds. There are a couple of meta design patterns that become second nature in functional programming. One is recursion. Recursion gets talked about in imperative programming circles too. It's something you "should" know how to do. In practice it's not used all that often, perhaps because in imperative languages the implementation always feels stilted (and there's something called stack overflow to worry about if you recurse too many times). Functional languages, on the other hand, make recursion so easy and natural that with a little practice you will be seeing recursion in practically every problem.5 Another very useful tool, available in the ML family of languages and some others, is pattern matching, which lets you interrogate the "shape" of a data structure in very sophisticated ways. Some comparisons can be drawn to regular expressions, but don't take the analogy too far. Regular expressions are for matching patterns in strings, while pattern matching operates on data structures (which can still be a string), and you can generally rely on your pattern matching code to be more transparent than some of the crazy regular expressions you have seen (once again, revealing intent), and therefore more reliable (i.e. less prone to unintended behavior).
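As a tiny illustration of both ideas (the function here is just an example), recursion plus pattern matching on the shape of a list reads almost like the problem statement:

let rec describe xs =
    match xs with
    | []                      -> "empty"                           // the shape: no elements
    | [_]                     -> "exactly one element"             // the shape: a single element
    | x :: y :: _ when x = y  -> "starts with a repeated value"    // guard on the first two elements
    | _ :: rest               -> "one more element, then " + describe rest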

Another reason not to be daunted by the word problems, you seldom abstract the problem at hand into a single function. Rather the thought process usually starts with identifying the lowest level problems in the bigger assignment, tackling each in turn and building up to solving the whole problem.

Skeptic: But wait! We do this all the time in imperative and OO languages. We create class libraries and interfaces and API’s.

Me: In functional programming you create functions specific to your business at hand. What you end up with is a programming language specific to the area of your problem, a Domain Specific Language or DSL.6

Skeptic: That’s a distinction without a difference.

Me: OK, I’ve been holding out on you. In (purely) functional programming there are no variables. In fact, strike that word from your vocabulary when talking about functional programming. There are values bound to tokens. That’s it. It never changes. The beauty of this is once your statement builds, it probably works correctly (you’ll know soon enough). Go ahead and throw in a unit test. That’s good practice after all. But literally, instead of programmers using the debugger to guess how to get out of the kludge they coded themselves into, you use the REPL to quickly verify the statement you just wrote does what you thought through. After that you’re done. It never changes. That’s also why your functional programs are so easy to multi-thread.

Skeptic: You can’t code without variables!

Me: Well, it certainly is a paradigm shift. There’s a lot less typing (and even less mouse clicking) and a lot more thinking going on. Think back to word problems again. Now we’re talking algebra, calculus, and beyond. You know why so few people are good at them? Because it takes a lot of force to overcome the inertia of not thinking about them. And it’s much easier for teachers to stick to teaching the fundamentals than to mentor students on how to think through applications.
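To see what “values bound to tokens” looks like in practice, here is a minimal F# sketch (my own illustration) of the kind of thing you would paste into the REPL (F# Interactive) to verify as you go:

let total = List.sum [1; 2; 3]   // total is bound to 6 and never changes
// total <- 7                    // would not compile: total is not mutable
let total' = total + 1           // "changing" a value really means binding a new one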

Anyway, enough evangelizing about functional programming. There’s a lot more theory behind this: currying, monads, purely functional data structures (remember, theory matters). I barely scratched the surface, but I want to keep my articles closer to presenting information and ideas, rather than selling ideas. I’ll end with a link to Microsoft’s white paper on F# in the Enterprise and Awesome Princess’ article on F# Basic Concepts.

F#, by the way, is my functional language of choice (another accident of history). Maybe I should have gone “all in” and studied Haskell.[7]

noSQL, Schemaless, Document-based Database

The relational database, like object-oriented programming, has been such a great tool that it is easy to fall into the habit of assuming it is the right tool for every job. That it is a “good enough” tool for so many jobs perpetuates this confusion. Relational databases are not going away. In fact I expect the leading RDBMS’s to subsume more of the functionality currently staked out under the noSQL banner.[8] On the other hand, the need for alternative database models is also not going away. We have to evaluate the right tool for the job at hand.

The noSQL moniker pretty much covers every alternative database model, a real heterogeneous group. My immediate interest in this polyglot space is the so-called schemaless database (which of course is not schemaless at all), more frequently referred to as document-oriented, document-based, or just a document database. What interests me about this technology is the ability to store and reference data that logically belongs together but is hard to coerce into a consistent relational schema.

As a reference point, let’s compare and contrast generic relational and document databases.

RDBMS has tables. Doc-database has collections.

RDBMS has rows. Doc-database has documents.

RDBMS has columns. Doc-database has fields, but any given field may or may not be present in a document, with the exception of an identifier field. In other words, the table schema in an RDBMS is locked down (until some administrative action changes it, and then the change applies uniformly to all data in the table), while in a doc-database every single document in a collection could have a different schema (see the sketch after this comparison).

RDBMS has indices on selected table columns. Doc-database has indices on selected fields. There is always some limit to the number of indices allowed per table/collection. The practical design and implementation considerations regarding indices between the two technologies are remarkably similar.

RDBMS allows joining tables on columns. While in principle limited joining would be possible in a doc-database technology, I don’t think any have actually implemented it. Like everything else to do with the schema, joining is left up to the application.

RDBMS allows partitioning and/or sharding. Doc-database allows sharding.

RDBMS implements SQL, providing many native aggregation functions. Doc-database allows aggregation through map-reduce or some aggregation functions through a native query language.

RDBMS selects specific columns over single rows or ranges, depending on the query predicate. A doc-database typically returns entire documents, singly or across ranges, depending on the query.
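Here is the sketch promised above: a minimal F# illustration (my own, not tied to any particular database or driver) of two documents that could sit in the same hypothetical “properties” collection even though they share little beyond an identifier and a location.

// Two documents modeled as simple string-to-value maps.
// They share "_id" and "location" but otherwise have different fields.
let house =
    Map.ofList [ "_id", box 1;
                 "location", box "34.4208,-119.6982";
                 "lastSalePrice", box 450000 ]

let officeTower =
    Map.ofList [ "_id", box 2;
                 "location", box "40.7484,-73.9857";
                 "unleasedSqFt", box 125000 ]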

So in exchange for schema flexibility, the document database gives up joining data across collections. What good is that? It seems the sweet spot for this technology is collections of documents of variable composition that don’t interact too much with other data (at least not in a performance critical manner).

Here’s an example application: property data. There are lots of reasons why the schemas of attributes for two properties differ. Let me divide the differences into two categories: inherent and accidental. Inherent differences stem from no two properties being exactly alike. My house and the Empire State Building are both “properties”, yet they share few attributes, location and last sale price being a couple. Unleased space may be an attribute of a commercial property, but not of my house. Accidental schema differences derive from how and why information was collected and stored. The tax assessor, the real estate agent, and the environmental scientist have differing interests in properties, and so collect different data, or name the same attribute differently. Property data schemas can vary widely across (and within) different governmental agencies and different multiple listing services; even schemas for the same property can vary. An attribute as seemingly inherent as location can accidentally vary. (Is the latitude/longitude measured to the geometric centroid, or the property entrance, or something else?) And the tag (attribute name) itself can vary widely across schemas.

Collections like this, with varying schemas, are well suited to document databases. There are a limited number of important attributes for searching (which may be coerced into uniformity), but real-time access frequently focuses on one document at a time, and aggregation processing can run in the background. Schema processing (i.e. identifying and consuming fields) is left up to the application, whether by versioning schemas or by traversing the schema tree, which is typically formatted in JSON/BSON. Also, because of the lack of joining and because the database returns entire documents (I don’t believe the concept of a covering index is implemented), you will frequently abandon the relational design principle of “only store once”.
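Continuing the hypothetical documents sketched above, here is what leaving schema processing to the application can look like in F#; fields that may or may not be present naturally come back as options (again, my own illustration, not a driver API).

// Look up a field that may or may not exist in a given document.
let unleasedSqFt (doc: Map<string, obj>) =
    doc |> Map.tryFind "unleasedSqFt" |> Option.map unbox<int>

// unleasedSqFt house        // None  -- the field simply is not there
// unleasedSqFt officeTower  // Some 125000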

For my upcoming articles I had intended to investigate multiple noSQL technologies in parallel, but given limited resources (like time), I currently plan to work in MongoDB, although I may take a closer look at graph databases like Neo4j.

Source Control: Git

I’m not going to neglect the low-level practical aspects of engineering and managing production-ready systems, and nothing is more fundamental to that end than source control. I will be relying for the first time on Git for the upcoming projects. From all accounts it has a lot of advantages over other source control systems, especially when it comes to collaborative development. I may not have much to say about it in the future (the more source control stays out of your way, the better), but I will mention GitHub has just come out with a tailoring of their system specifically for Windows. Discussions can be found on their blog as well as Hacker News.

The following Git resources and evangelizing come courtesy of Phil Haack:

The Git Parable

Think Like (a) Git

Better Git with PowerShell

Posh-Git: A set of PowerShell scripts which provide Git/PowerShell integration

SeeGit – The Git Repository Visualizer

And of course links to actual reference documents are handy:

Git Reference

Pro Git book

Git Magic

Parsing

Parsing (syntactic analysis) is one of the best understood branches of computer science.

So begins the preface to the first edition of Parsing Techniques: A Practical Guide. All the better that the second edition of this authoritative work was published recently (2010). Parsing in information technology is best known as a means to the end of compiler construction, but the principles involved have much wider application, especially in the area of extracting semantic information from data, an ambitious undertaking where theory really does matter. Navigating through Dick Grune’s website you can find the complete text of the first edition, the second edition’s Annotated Bibliography, Sample Parsers, and other useful information and links.

Tools available to the F# developer include FParsec – A Parser Combinator Library for F#, originally derived from Parsec, a “monadic parser combinator library for Haskell”; and FsLex and FsYacc along with the Wikibooks article F Sharp Programming/Lexing and Parsing.
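As a small taste, here is a minimal sketch in the style of FParsec’s introductory example (assuming the FParsec package is referenced; the test helper is just for illustration): run a built-in parser against a string and pattern match on the result.

open FParsec

// Run a parser over an input string and report what happened.
let test parser input =
    match run parser input with
    | Success (result, _, _)   -> printfn "Parsed: %A" result
    | Failure (errorMsg, _, _) -> printfn "Error: %s" errorMsg

test pfloat "1.25"    // Parsed: 1.25
test pfloat "abc"     // Error: (FParsec explains where and why the parse failed)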

NLP

Unlike the well-understood field of parsing, Natural Language Processing is progressing at a breakneck pace. What I believe to be the (or at least “a”) core “entry-level” text, Foundations of Statistical Natural Language Processing, is already 13 years old. In one recent survey of the top 10 research papers in computer science, the number one paper, Latent Dirichlet Allocation, is in the NLP field, but was submitted and published after my core text! You can guess what’s next on my reading list.

The companion website to FSNLP includes plenty of links to NLP tools, and I anticipate drawing on the excellent resources at Wordnik in addition to whatever else I discover (or you tell me about).

Conclusion

So there you have it. I’ll be writing about F# functional programming and document database technology, combining one of the best-understood with one of the fastest-moving areas of computer science. I hope you find it interesting.

I look forward to interacting with you, my readers, on this and upcoming articles. Please leave your thoughtful comments and questions and I will respond to them as best I can. I hope to learn from you.

Notes

[1] A classmate was so inspired by Wylbur she published a poem about it in the UCSB Letters & Sciences Scholars magazine.

[2] There is a large corpus of literature on the web explaining how to adapt functional techniques to popular languages like C# and Java. I’m not linking to any of it because I think it is a bad idea. I understand the thinking of the authors. Their aim is to introduce functional techniques into languages already familiar to most programmers, but I find the resulting code hard to read and hard to understand, which makes it hard to follow the point of the exercise. Microsoft has been introducing more functional components into each release of C# (generic collections and LINQ, not strictly a part of the C# language, have been resounding successes), but may have reached the practical limit, since C# 5.0 offers no new functional components.

[3] From here on my use of “functional” refers to “purely functional”.

[4] Euclid reportedly told King Ptolemy there is no royal road to understanding geometry (i.e. it requires hard work and commitment). Of course there is no royal road to functional programming either, but an increasing number of students appear to be embarking on the journey.

[5] What about stack overflow? Languages enabling purely functional programming optimize tail-recursive calls, so the stack does not grow and the concern goes away.

[6] I’m sure I can’t adequately explain what a DSL formally is.

[7] At least two of Microsoft’s high-profile computer scientists, Dr. Erik Meijer and Dr. Brian Beckman (a physicist by training), are heavily involved in the Haskell community and have even recorded lectures on Haskell on the Microsoft Channel 9 site.

[8] The major RDBMS’s have long featured functionality not strictly under the relational algebra. Massive key-value stores, map-reduce, and document databases are all fair game for incorporation. That would only perpetuate the value to end users of a one-size-fits-all DBMS.