FSharpx Data Structures Discussion

In the ten weeks since I began submitting pull requests contributing Functional Data Structures to the FSharpx open source project I’ve learned a lot, not just expanding my knowledge of the F# language, .NET Framework, and Functional Data Structures, but Git, GitHub, and perhaps most interesting of all, contributing to an open source project.

Steffen Forkmann and Mauricio Scheffer, two of the core FSharpx team, deserve special call-outs for helping me come up to speed. Steffen ported some of my work before I began doing Git pull requests, and they’ve both been very patient with my Git blunders and other mistakes. It took some time, but I’m starting to understand the somewhat anarchic governance structure to open source. With my background leading medium-sized production projects, I try to develop a big picture of the entire stack. I’m learning OS contributors tend to be much more focused on immediate issues. I think this is for good reason as commit history is crucial to the health and long-term benefit of the project. It works like a self-organizing system with big picture benefits as emergent phenomena.

I still can’t resist formulating a bigger picture framework for the FSharpx Data Structures sub-project. What follows is purely my opinion, and I hope FSharpx contributors and others will correct me where I err, and encourage me where I am right with their feedback.

FSharpx.DataStructures Namespace is Purely Functional

This namespace is for Purely Functional Data Structures. There is already one exception, TransientVector which shares the IVector interface with the purely functional PersistentVector.

Mutable structures should have their own namespace, which currently doesn’t appear to exist. There is a “Collections” folder, but its components are in the FSharpx namespace. It includes extension modules to functional and mutable structures as well as one functional structure, NonEmptyList.

Structure Type and Module

The structure’s type should exist directly in the DataStructures namespace, as well as a module with the same name as the type, but not the type under the module. There are a few structures in the latter situation with no non-breaking remedy, except perhaps duplicating the entire type at the namespace level. The type should have all the members that operate on an instantiated type. The module should have let bindings to all the the type members as well as let bindings for all other “utility” functions on the type, like ofSeq and empty.

Understand the difference between a property and a method

Property Usage Guidelines One rule of thumb, if it is not O(1), it is not a property. (The inverse does not necessarily hold.)

Pattern Discriminator

Include any relevant pattern discriminators in the module. Hint: if the type implements TryUncons or TryUnsnoc you can implement pattern discriminators.

Try to Avoid Exceptions

If a structure’s method can throw an exception, include a TryMethodName method that returns option.

Necessary Members

Every data structure should have a length member (even if length is O(n)) and an IsEmpty member in the type and ofSeq in the module.

Additional Members

Of course implement all the members central to the structure, but there can be some nice O(1) or O(log n) values just waiting to be implemented that perhaps were not directly addressed in the research paper because they were not interesting. A couple of examples: the internal structure of the type allows O(1) rev or ofList.

Necessary Interface

Implement IEnumerable in the type. It should sequence elements in the inherent order of the data structure. For instance if a structure implements the head/tail paradigm, IEnumerable should sequence elements in the order of unconsing the structure down to empty. Most every structure has some logical, if not physical, order to its elements when flattened. (I can’t think of an exception right now.)

Implementing ofSeq and IEnumerable makes practically every Data Structure composable into every other, a very useful feature!

Type Interface

In theory data structures implementing the same paradigm, e.g. queue, should implement an interface that both restricts upcasting the interface to the original type and is a usable interface. Meeting both the restrictive and usable requirements has proven difficult. I haven’t given up on this project yet, just deferred it.

Hide unsafe underlying structures

When I first went to work on data structures in the FSharpx project, I noticed some of the structures exposing unsafe (in the sense that arbitrary use could lead to a dysfunctional structure) discriminated unions. I didn’t fully appreciate at the time how an OS project like FSharpx incrementally works toward ideal practices, and considered this an intentional practice, which I emulated.

As a result of Mauricio’s initiative to test for monoid law adherence, this [mal]practice has been exposed here and here.

This is a fair amount of clean-up work. Still, the exposed union does provide one more handle for the determined hacker to be inventive with, so perhaps the correct solution is to expose the union in the type and module in some manner that is clearly identified as unsafe. Or even collecting all of the unsafe unions in a single unsafe module. “Unsafe” or “unchecked” would be ideal names, except they are already C# key words.

Still thinking this over, and hope to get feedback on this.

Intellisense

Every public member and let binding should have intellisense documentation to make it user friendly and not require a literature search or inspection of the code to determine functionality. At the type level there should be intellisense documenting the purpose and use of the structure. Include URLs to additional literature where appropriate.

Data structures should give developers useful tools, not research projects.

Time Complexity

Adopting a practice from Haskell documentation, I put the time complexity for members and let bindings at the beginning of their intellisense documentation. I even went so far as to put O(1) on empty and singleton to accustom users to seeing the time complexity in the intellisense, even though strictly speaking time complexity for these bindings has no meaning.

Take time to evaluate the naming standard

While it’s a good practice to keep the name a structure is best known as, consider renaming properties and methods if it makes sense to fit in with F# paradigms. Use the FSharp.Collections and existing FSharpx.DataStructures namepaces as models, don’t just unthinkingly adopt the names from a research paper or the implementation in another language. Use ofList and ofArray not fromList or fromArray. I intentionally renamed the Heap members findMin and deleteMin to head and tail because I think that accurately conveys the head/tail paradigm like several other data structures (and I implemented Heaps as ascending or descending, so “min” is inappropriate). Stick with cons, snoc, head, tail, init, last wherever possible.

IReadOnlyList and IReadOnlyDictionary

IReadOnlyCollection and IReadOnlyDictionary, new to .NET 4.5, would appear to be useful to implement, however backwards compatibility of FSharpx raises a problem. A discussion of this issue is here.

Should FSharpx take over the PowerPack Data Structures?

Discussion: here

There are related threads for migrating other functions from PowerPack, for instance here and here

I don’t know what the right answer is.

Conclusion

Data Structures make F# a richer multi-technology-stack language. I keep a running collection of the structures from Okasaki’s 1998 book not yet ported to FSharpx here. (There’s at least one or two covered in the book not in this collection.) Feel free to request expediting porting any or all, or work on the port yourself (I’m happy to provide any assistance I can). There’s lots of other useful structures still “in the wild”.

Crisis in Open Source Land

Crisis #1 – WTF Happened to the Git Reference

For those of us who have not had the “aha” Git moment, and do not have every Git command memorized, a complete reference is indispensable. For G_d knows how long more topics in the only known complete reference have been broken than I care to count. Looking up every Git issue we encounter on Stackoverflow without a reference to put it all in context is a helluva way to come up to speed. This is like the Git cognoscenti putting a big sign on their clubhouse, “Hey, we learned it on our own”.

UPDATE: Github fixed this October 30. Thanks Github!

Crisis #2 — Microsoft Research Abandons Power Pack Collections Documentation

And the same goes for the Collections.Tagged documentation page. For now you can still get collections and collections.tagged on the Way Back Machine.

OK, not that many people care about F#, but if you come to my blog there is a good chance you do. MSR has probably been intending to take this page down for a long time. It was documenting an old release of the Power Pack Collections, but still better than nothing.

HashSet – Who cares. I think this got deprecated to something in the .Net generics collection.

HashMultiMap – Seems no one cares about this one either. Even though it is not a functional data structure, someone might find it useful, if only it were documented.

LazyList – <hyperbole>Everybody</hyperbole> uses LazyList. In fact, I’m baffled why the FSharp Team hasn’t integrated LazyList into FSharp.Core. It is in the Microsoft.FSharp.Collections namespace. So it’s bad enough F# newbies wonder why they can’t call up the docs in Visual Studio Help Viewer.

The Unifying Crisis

What I consider a bigger deal is the difficulty in formatting open source project documentation. Let’s face it. People who like to code open source projects like to code, and generally dislike writing documentation. My interest here is primarily .NET open source projects (my wish list applies to .NET projects), and admittedly I have not dug very deep into the solution. Open source authors and consumers would benefit greatly, and make project adoption much easier, if there were good tools for generating nicely formatted documentation outlines.

Ideally a markup language to generate Visual Studio Help Viewer format. (And ideally a way to integrate it into VS Help Viewer.)

Ideally a tool to generate a reference skeleton from the project code.

Less ideal would be generation and markup for any decent pseudo-standardized reference format. (There is probably something already available that will do this into Java Docs format, but really, do we have to?)

And I am sure there are some who are thinking “Unix man pages are just fine”.

The Git Article I Hoped not to Write

As I explained in my introductory article, Git is a core technology I am exploring. I had halfway expected, and hoped, I would have nothing to say on this topic. Of course hope is a plan doomed to failure, and in this case the go-to-hell-plan is to write an article.

Cutting to the Chase

The only resources you need for Git are

The Git Reference

Pro Git

The rest of this article is another data point in the seemingly endless Git literature all over the web. Continue at your own peril.

A little Background

Back in the dark ages of mainframe batch programming source control was via delta decks and their text file descendants. This apparently evolved into CVS, the first modern source control I was exposed to. I was PM of one quarter of a 40 member project, which in turn was part of a larger 300 person project. The 40 person team hired a contractor to set-up and administer CVS for the team (he eventually had other duties as well). In this role I never got any closer than that to CVS, since I wasn’t writing code. I wasn’t sure how to interpret having a full time person who did mostly source control. The project was doing several very inefficient things I had no control over.

SourceSafe is a ticking time-bomb if you don’t take serious back-up precautions. Nuff said. The first source control system I had the responsibility to set up from scratch was TFS. Ultimately all source control includes manual policies and procedures. The trick is understanding enough about a new source control system from reading the documentation and playing with it to develop your own complete system. (Back in the day there was less how-to literature available on this.) It’s been a long time now, and I’ve lost touch with whatever changes Microsoft has made, but we were happy with TFS. It fit in well with the team’s rapid-fire development style of constantly changing anything and everything, but corporate, in its wisdom, decreed all development teams must get on the company’s Subversion “Enterprise-class centralized version control for the masses” system. So another iteration of figuring out how this really works and how to fit it with our work style. For as long as I was with that team we were never happy with Subversion.

Back on my own again I’m once more learning a new version control system, this time with no one to bounce ideas off or ask questions, even embarrassed myself posting a stupid question yesterday (sorry Phil and Github!). Phil Haack graciously responed. I had read the Git Pro book over a year ago, but without putting Git into practice it all just leaked out of my mind.

At the critical level of functionality, Git is just another (though arguably superior) choice in managing the merging and auditing of source. It still takes gatekeeping policies and procedures to manage pull requests.

Why is Every GUI a Two-edged Sword Compared to the Command Line?

Coincident with my starting to use Git, Github came out with a Windows interface. Now this was both good and bad. The good was I didn’t even glance at the Git reference or Git Pro book to develop and publish my first open source project. The bad was I didn’t even glance at the Git reference or Git Pro book. Using the GUI worked well as long as I was developing in the master branch. My confusion started when I wanted to create a branch. Like all good GUIs, it just works. Unlike all good GUIs there is no help feature to clue you into what is going on. So having forgotten all I read about Git over a year ago, I created a branch and started looking for how to hook-up the branch to its workspace. Doh! It’s all the same workspace with Git.

…off to the command line, which involves learning what’s really going on; and set up a sandbox Git project, because I have a deep aversion to doing anything with production source code when I am not 100% sure of the consequences.

Here’s what I learned in a very short time:

The Windows interface saves some drudgery in setting up a repo. You can initialize a repo with the command line

>Git init

(which I did in my sandbox), but the Windows interface sets up helpful parameters in .gitignore and .gitattributes files specifically for Visual Studio development. I’m happy not to have to dive too deep into the ins and outs of these files just yet.

The situation with commits and branches is more nuanced. So for now I’ll switch to the command line because that’s the only way I’ll learn the full range of useful things I can do with Git. I was thinking about installing a Visual Studio Git extension, and Git Source Control Provider appears to be the only option, but I think I will put that off for now.

In the mean time,

>Git status

is my friend.

And Finally, the Punch Line

–everything-is-local

That’s the tag line on the git-scm.com site, and now I understand. You can use Git for simple source control, like any other SCM. And if I had been a little braver (or foolhardy) about using an undocumented GUI feature, I might still be there. Now, however, I see my way forward learning Git. It’s actually personal productivity software. Once you acquire a certain level of adeptness you can do things you would not have considered doing, either because it was too tedious or impossible. Things like exerting more control over the commit story your branch tells, or more frequent branching to allow experimentation that otherwise risks the integrity of another branch, or multiple versions, or things I haven’t thought of.

To update your local repository with work posted to the remote

>Git fetch origin