When Accuracy is the Wrong Evaluation Measure

Posted on May 31, 2012 by Jack Fox

Say you are developing a process returning a list of results, and you want to evaluate how well your process is doing. This is a process more complex than a database or dictionary lookup. You expect it cannot be perfect, but you have criteria for judging whether each result in the list is correct or not, and whether it returned all the intended results. Let’s call this general case information retrieval.¹

Definition: Information Retrieval – A process returning zero to many results. Each returned result can be evaluated as correct or incorrect. Likewise results not returned can be evaluated as correctly or incorrectly rejected.

So information retrieval systems have two valid states in relation to any target in the potential result set, correct selection and correct rejection. There are also two ways the system can be in error, namely incorrectly rejecting a result and incorrectly returning a result.

Definition: Type I Error – false negative or incorrectly rejecting a result.

Definition: Type II Error – false positive or incorrectly returning a result.

	correct	incorrect
returned:	good	Type II error
rejected:	Type I error	good

The naive approach to evaluating an information retrieval system is a simple comparison of correct results to incorrect results, measuring either accuracy or error, usually as a percentage.

Definition: Accuracy – (correctly returned + correctly rejected) / (set of potential results)

Definition: Error – (Type I and II errors) / (set of potential results)

Accuracy and error are complements of each other, and while they are good measures for many day-to-day problems, their usefulness falls apart when we evaluate information retrieval systems where the set of items that should be rejected is large.

We need to measure something else to evaluate systems like this, precision and recall.

Definition: Precision – Fraction (percent) of returned results in an information retrieval system that are correct.

(correctly returned) / (correctly returned + Type II errors)

Definition: Recall – Fraction (percent) of correct results in an information retrieval system that are returned.

(correctly returned) / (correctly returned + Type I errors)

By themselves neither precision or recall is a particularly good measure. Simply return nothing and precision will be “off the charts” (literally, since you will divide by zero!), but then recall will be zero. Alternately return everything and recall will be 100%, but precision suffers.

By calculating the harmonic mean of precision and recall we have a much better measure of information retrieval performance in one number.²

Definition: F Score – A measure of system accuracy derived from precision and recall. The most general case, giving equal weighting to both parameters, is the harmonic mean

F₁ = (2 * precision * recall) / (precision + recall)

You can vary the weighting of precision and recall by applying the formula with an additional parameter.

F_a = ((1 + a²) * precision * recall) / ((a² * precision) + recall)

Setting a > 1 weights recall higher, while a < 1 weights precision higher.

There is one more measure, fallout, which measures the rate of Type II errors. It shows the probability of a false positive, which may be an important measure, depending on your system.³

Definition: Fallout – probability of a Type II error (false positive) in an information retrieval system.

(Type II errors) / (Type II errors + correctly rejected)

Finally, lets look at the numbers from some made-up information retrieval systems. In the first two sets notice as we scale-up the proportion of of correctly rejected results in the set of potential results, accuracy and error get better and better, but precision, recall, and F score remain consistent (and the F score is consitently poor despite improving accuracy). In the last set notice the different ways we can achieve high accuracy with a poor F score.

correctly	correctly	Type I	Type II					F
returned	rejected	error	error	accuracy	error	precision	recall	score	fallout
25	99	100	3	54.6%	45.4%	89.3%	20.0%	32.7%	2.9%
25	990	100	3	90.8%	9.2%	89.3%	20.0%	32.7%	0.3%
25	9,900	100	3	99.0%	1.0%	89.3%	20.0%	32.7%	0.0%
25	99,000	100	3	99.9%	0.1%	89.3%	20.0%	32.7%	0.0%

100	99	3	25	87.7%	12.3%	80.0%	97.1%	87.7%	20.2%
100	990	3	25	97.5%	2.5%	80.0%	97.1%	87.7%	2.5%
100	9,900	3	25	99.7%	0.3%	80.0%	97.1%	87.7%	0.3%
100	99,000	3	25	100.0%	0.0%	80.0%	97.1%	87.7%	0.0%

34	99,850	115	1	99.9%	0.1%	97.1%	22.8%	37.0%	0.0%
100	99,700	100	100	99.8%	0.2%	50.0%	50.0%	50.0%	0.1%
75	99,700	75	150	99.8%	0.2%	33.3%	50.0%	40.0%	0.2%
125	99,625	245	5	99.8%	0.3%	96.2%	33.8%	50.0%	0.0%
195	99,525	5	275	99.7%	0.3%	41.5%	97.5%	58.2%	0.3%

Notes

[1] This is a restricted definition of information retrieval. As you can see in the Wikipedia article the term frequently applies when results have varying degrees of relevancy.

[2] Since perfect precision and recall together result in an F score of 1.0, I represent it as a percentage, which would probably make a real statistician cringe.

[3] For consistency’s sake also shown as a percentage.

A Course of F# Study

Posted on May 26, 2012 by Jack Fox

I’ve been tinkering off and on with F# for about two years. Now I plan to dive in and do some serious code writing, so I’m gathering all my resources in one place (here) ready to move forward. For anyone with a beginning to intermediate level of experience with F#, this is my recommended course of study.

Development Environment

If you plan to write F# in Mono I can’t help you setting up the environment, but now is a great time to get started with F# using Visual Studio 11 beta because it includes the upcoming F# 3.0, and the Intellisense for F# is better than ever in the beta. F# 3.0 includes the much anticipated Type Providers and F# LINQ queries against Type Providers data sources. Type Providers auto-magically provide strongly typed data structures from diverse sources at design time, meaning the data schema is available to you in Intellisense. Wow! Type Providers are available fora growing number of data providers¹, and if you can’t find the Type Provider you need, write your own.

You’ll want to install a couple VS extensions.

F# Outlining let’s you collapse sections of F# code, pretty much like you already can with C# in Visual Studio.
I also recommend getting both the MsTest Project Template and FsUnit from Daniel Mohl. The template will get you started a little faster on unit tests, and FsUnit will make your unit tests more composable, a little more in tune with how you (hopefully) write the rest of your F# code (i.e. in a functional manner).

Core Course

All you need to learn F# is the Wikibook F# Programming and the MSDN Visual F# references pages, most notably the F# Language Reference and F# Core Library Reference.

So open up VS11 beta and the MSDN reference pages and work through each chapter of F# Programming. It covers every topic of the language you need to know (with the possible exception of Code Quotations and Type Providers). This course is complete and easy to follow. I still use it as a reference.

You’ll also want to use the F# REPL, fsi, while writing code, or while studying code someone else has written. You have a lot of choices here. There’s a fsi window built into Visual Studio, or you can run it from the command line or PowerShell. There are at least two versions available on web pages, Try F# and Try Joinads. There’s even a Try F# iOS app.

You’ll find you have to deliberately write F# one statement at a time. Of course that’s literally true for other languages you’ve worked in, but especially while first learning you will find it challenging to get the syntax perfect as you attempt writing more sophisticated statements (and it has to be perfect). So take advantage of Intellisense in the Visual Studio code window, and fsi for syntax debugging until the statement builds correctly. If everything does not build in the VS code window (with the exception of the incomplete statement you are currently writing), Intellisense is not available . This is why you have to make sure each statement builds as you write it. But here’s your reward: if your statement builds (and you are coding in a purely functional style) it probably works as you intend. More time up-front thinking through what you are doing, and less time typing.

Beyond the Basics

Once you finish Awesome Princess’ excellent course, you’ll want to continue your graduate studies along with whatever productive coding you are doing. Studying the code others have published is a good place to start. F# Snippets is a repository of open source F# snippets that do all sorts of things created by anyone who cares to post. Hat tip to Tomas Petricek for creating this great site and the technology behind it. You know what’s really cool? Mouse over the code in any snippet on the site and you get Intellisense tool tips!

And then there are the bloggers. So many smart folks writing great articles on the why’s and how to’s of F# in particular and functional programming in general. Don Syme’s WebLog on F# and Related Topics deserves special mention, as he is the designer and architect of the F# programming language. Here’s a few others I’ve found useful (and I know there are many others). In some cases I link directly to the author’s F# tagged articles.

Tomas Petricek’s homepage

Bug squash, Mauricio Scheffer’s blog

Clear Lines, Mathias Brandewinder’s blog

Random Ravings of a Red Headed Code Monkey, Daniel Mohl’s blog

Rash thoughts about .NET, C#, F# and Dynamics NAV, Steffen Forkmann’s blog

Faisal’s space, can’t find Faisal’s last name

Intellifactory Blog, creators of WebSharper

F# and Data Mining, Yin Zhu’s blog

Chris Smith’s completely unique view and then continued at a Chris Smith on Software, but apparently Chris hasn’t blogged about F# since 2010

Simon Cousins: almost functioning, Simon Tyler Cousins’ blog

MOIRAE Software Engineering Ltd, a software engineering company based in the north east of England

F# News, Jon Harrop’s blog

Matthew Podwysocki, can’t seem to filter on just F#

Inside F#, can’t find Brian’s last name

Geez this is getting kind of long.

Here’s the F# open-source community on GitHub. Any of these folks I missed probably have F# blogs too.

…and last, but not least

FPish – Online community of functional developers, more than a blog…

Theory Matters

Start getting used to the arrow notation F# uses to describe function signatures. This is a big deal. Function signatures are going to be constantly in your face, and sometimes subtly lurking in the background through everything you do in F#. So pay attention to them from the beginning. With a little effort you will get good at understanding them.

Once you complete the course, read a few articles, and write some functional code, you start realizing how different functional programming is from the old familiar imperative style. And you will probably run across articles that get pretty deep in the theoretical weeds. Inevitably you will read the term “monad”, and you probably won’t understand what the heck the article is about. I spent a long time reading and rereading articles about monads, studying code implementing monads, and I still didn’t get it.

Finally I found a concise, understandable description of monads. Here’s another article that is a little longer, a little more abstract, but still comprehensible. And finally you may find the Wikipedia article useful (after you start to catch-on). Some people might find an explanation illustrated in C# useful.

Don’t try to force-feed yourself on monads. You can get a lot done (even with monads) without fully understanding them.

Thankfully the helpful fellows on the FSharpx team have implemented a number of the most common monads, and a bunch of other basic F# functional constructs in their useful .NET library. It’s well worth the effort to download the source code and study it. While you are studying the monad code in FSharpX keep handy this set of Matthew Podwysocki articles:

Much Ado About Monads – State Edition

Much Ado About Monads – Reader Edition

Much Ado About Monads – List Edition

Much Ado About Monads – Maybe Edition

Much Ado About Monads – Creating Extended Builders

Much Ado About Monads – Creating Extended Builders Part II

A Kick in the Monads – Creating Extended Builders Part III

A Kick in the Monads – Writer Edition

OK, enough of monads. I’m going to throw one more series of articles at you. Monads is not a prerequisite, in fact skip monads altogether and read this series. It gets to the heart of a broad area of functional programming, specifically using F#.

Catamorphisms, part one

Catamorphisms, part two

Catamorphisms, part three

Catamorphisms, part four

Catamorphisms, part five

Catamorphisms, part six

Catamorphisms, part seven

Catamorphisms, part eight

Purely functional data structures are an important part of functional programming. Chris Okasaki popularized this subject with his ground-breaking PhD thesis, and the expansion of his thesis into the book (adding more structures that weren’t part of his original research). F# programmers are fortunate that many, if not all, of the structures in Chris’ book have been implemented in F# (don’t forget to click “Older Entries” at the bottom of the page, there are two pages of headings). Hat tip to Steffen Forkmann for bringing this to my notice. Whoever Julien is, he needs to stand up and take a bow.

Had Enough?

If not, then Try Joinads, Tomas Petricek’s excursion into a F# research extension for concurrent, parallel and asynchronous programming. I think his work on joinads is really valuable, and I hope Microsoft soon incorporates it into a F# 3.1 release, but for now it requires a hacked version of the open source F# compiler, and supposedly breaks Intellisense.

Notes

[1] F#3.0 comes with a handful of built-in type providers and FSharpX provides more. At a minimum you should find Type Providers for OData services, SQL schemas and connections, WSDL web services, resource files, regex, file system, csv, Excel, JSON, XML, Registry, XAML, and App Settings.

Introducing craftyThoughts

Posted on May 23, 2012 by Jack Fox

I fell into The Craft through an accident of history. I had returned to this country at the height of the ’82 recession after three years of adventures abroad with no special skills looking for any kind of entry level position to earn a living. It took nearly a year of night school and dead-end day jobs to land a position leading to System 370 assembler coding, developing health insurance claims adjudication systems. Ironically, my only brush with computers at university was a statistics course I dropped. Coding up punch cards was so uninspiring, but a tour of the data center exposed me to Wylbur.¹ The possibilities from that brief encounter nurtured my thoughts for years.

More accidents of history steered me toward the Microsoft technology stack, and kept me out of Internet development until the 21st century. Eventually life led me to an Internet startup in the SaaS space that became successful.

The Craft

In some ways today’s Internet workforce of engineers, coders, hackers, and analysts resembles the medieval masons who built castles and cathedrals. We are virtually, if not physically, itinerant, going from company to company, industry to industry, even founding industries. The journeyman stonecutter had more freedom than contemporary agricultural serfs because acquiring skill required time and practice and contemporary society was in an expansive phase of stone building. And both workforces understand building things. Cutting and setting stone then, and journeyman Internet work today, can provide a good living and lots of personal choice, but entrée to The Craft requires more. It requires seriousness of purpose and love of understanding…

Lest I wax too philosophical, let me call out one Master of The Craft, Douglas Crockford, on the importance of studying history, Crockford on JavaScript – Volume 1: The Early Years. And lucky lazy student! The contiguous history of our craft only stretches back to Alan Turing’s seminal 1936 paper, On Computable Numbers, with an Application to the Entscheidungsproblem. There were significant precursors, most notably Boole and Babbage (and some might include Leibnitz and Lovelace).

One more bit of grounding from an established Master

“…there’s really no such thing as raw general intelligence–it’s all just computation.”

– Stephen Wolfram

Plan for craftyThoughts

So with applications experience spanning eight industries and a long-time interest in semantics, I’m launching this site to approach the problem of meaning (semantics) in data and automated systems. To that end I throw out this postulate (which I do not attempt to prove): theory matters. If you attempt anything at all sophisticated in automation, you must build on the foundational engineering work that has come before. Theory and history teach us what is possible.

I intend to scratch my own itch. Articles will take the form of topical notes organized for presentation. Rather than crafting persuasive arguments or rehashing information widely available I will link to where others have already done this. I don’t claim prior expertise in any of the areas I am investigating. I’m not sure I will ever present anything original, and if I do convey any synergy whatsoever, it is because I stand on a foundation built by others.

Among the enabling technologies and application theory I intend to investigate and present:

Functional Language

If you have some facility in the most important programming language today and you know about the good parts, then you know about functions as first-class citizens, anonymous functions, and closures). Congratulations! You have scratched the surface of functional programming, and along with the goodness this provides, you may also wonder if the trade of functionality for problems was a good deal given the issues with debugging anonymous functions in JavaScript. This is hardly an endorsement of functional programming. In exchange for some useful tools you now write software that is harder to maintain.

Well, if JavaScript is your entire exposure to functional language, then you have not been entirely exposed. I’m going to make a bold claim: writing software using functional programming in a purely functional style (if not already enforced by the language) will result in far, far fewer defects right out of the box, more concise software (i.e. fewer lines of code), software whose source code more readily reveals its intent, and software that in many, if not most cases can easily be multi-threaded. (It is possible to write purely functional software in established object-oriented and imperative languages, but I don’t recommend it.)²

Functional languages have been around since 1958, when the grand-daddy of them all, Lisp, appeared. So if functional languages have been around so long, why aren’t they more “mainstream”? Some authors have put forth the “victim of own success” argument, but I think the root cause is more profound, having to do with how we were taught to think about programming. From John von Neumann to Donald Knuth the pioneers of Computer Science naturally fell into the imperative style of programming (with the notable exception of John McCarthy).

do this

do this

test this

depending on results

loop

branch

etc.

Imperative programming requires less thought up front. Indeed, most experienced programmers fall right into writing code without fully thinking through the problem at hand because they intuitively recognize a familiar design pattern (even if they cannot name or explain the pattern). The success of this technique depends on “just in time” thinking through problems as they arise. For your typical CRUD application this is good enough. In the best cases a skilled and careful programmer delivers a structured, simple application incorporating enough, but not too much object-orientedness. In the hands of lesser talent all sorts of problems arise. Over-reliance on modern debugging tools during the code writing phase can lead to hacks upon hacks, solutions to problems in code ever further removed from the application’s problem at hand. Over-exuberant OO style leads to less maintainable code, and on and on. Combating the problems inherent to imperative programming has spawned whole classes of techniques, products, and specialties: detailed coding specifications (beyond mere functional specs), shop-enforced coding standards, code walk-throughs, code quality analysis tools, code generation tools, TDD, BDD, etc.

Sixty years worth of tools and techniques arising from the problems of imperative programming still have their place for functional programming³, don’t get me wrong. And imperative programming is not going away. It is responsible for nearly all the foundations of the computer-enabled culture we live in, including the website you are now on. It’s just that in many cases, especially the hard cases, there is a better way going forward.

So why has functional programming, despite its advantages, not gone “mainstream” (i.e. why aren’t medium and large shops using it)? And why the intensity of interest in functional programming today? Let me answer the second question first: better access to educational opportunities. Remember I said theory matters? Well theory doesn’t stand still, especially in a field as young as computer science/engineering. The foundations upon which engineering students are educated and the avenues available for continuing education just keep getting better, practically on a daily basis. When Andrew Ng’s Machine Learning course at Stanford, which typically enrolled 350 students, was first offered as a free online course in 2012, reportedly over 100,000 people enrolled worldwide. With easy access to resources, and a solid theoretical basis, the Royal Road⁴ to understanding functional programming (along with virtually every other science and engineering topic) is open to all. There is a ground swell of young (and not so young) engineers discovering functional programming for themselves.

Finally, my take on the difficulty of establishing functional programming as a more “mainstream” development paradigm. It involves a radically different thought process. You have to do more thinking up-front and you have to do a different kind of thinking. Whereas the thinking in imperative development is along the lines of “how to do something”, functional development requires in-depth comprehension of “what we are doing”. Let that sink in for a minute. It is not trivial. The general course of development for functional programming is to abstract the problem at hand. This is what I mean when I say functional programs better reveal the “intent” of the task they accomplish. This is a completely different thought process. Do you remember in grammar school when most of your classmates hated math word problems? I read once that John von Neumann’s greatest skill was his ability to take a problem from any field and abstract it to the right mathematical form. Rather than a dozen or so design patterns to automatically fall into and start coding, as with imperative programming, you have literally all of abstract math and logic to choose from.

Fortunately that is not as daunting as it sounds. There are a couple of meta design patterns that become second nature in functional programming. One is recursion. Recursion gets talked about in imperative programming circles too. It’s something you “should” know how to do. In practice it’s not used all that often, perhaps because in imperative languages the implementation always feels stilted (and there’s something called stack overflow to worry about if you recurse too many times). Functional languages, on the other hand, make recursion so easy and natural that with a little practice you will be seeing recursion in practically every problem.⁵ Another very useful tool, available in the ML family of languages and some others, is pattern matching, which let’s you interrogate the “shape” of a data structure in very sophisticated ways. Some comparisons can be drawn to regular expressions, but don’t take the analogy too far. Regular expressions is for matching patterns in strings, while “pattern matching” operates on data structures (which can still be a string), and you can generally rely on your pattern matching code to be more transparent than some of the crazy regular expressions you have seen (once again, revealing intent), and therefore more reliable (i.e. less prone to unintended behavior).

Another reason not to be daunted by the word problems, you seldom abstract the problem at hand into a single function. Rather the thought process usually starts with identifying the lowest level problems in the bigger assignment, tackling each in turn and building up to solving the whole problem.

Skeptic: But wait! We do this all the time in imperative and OO languages. We create class libraries and interfaces and API’s.

Me: In functional programming you create functions specific to your business at hand. What you end up with is a programming language specific to the area of your problem, a Domain Specific Language or DSL.⁶

Skeptic: That’s a distinction without a difference.

Me: OK, I’ve been holding out on you. In (purely) functional programming there are no variables. In fact, strike that word from your vocabulary when talking about functional programming. There are values bound to tokens. That’s it. It never changes. The beauty of this is once your statement builds, it probably works correctly (you’ll know soon enough). Go ahead and throw in a unit test. That’s good practice after all. But literally, instead of programmers using the debugger to guess how to get out of the kludge they coded themselves into, you use the REPL to quickly verify the statement you just wrote does what you thought through. After that you’re done. It never changes. That’s also why your functional programs are so easy to multi-thread.

Skeptic: You can’t code without variables!

Well, it certainly is a paradigm shift. There’s a lot less typing (even less mouse clicking) and a lot more thinking going on. Think back again to word problems again. Now we’re talking algebra, calculus, and beyond. You know why so few people are good at them? Because it takes a lot of force to overcome the inertia of not thinking about them. And it’s much easier for teachers to stick to teaching the fundamentals than mentoring students on how to think through applications.

Anyway, enough evangelizing about functional programming. There’s a lot more theory behind this: currying, monads, purely functional data structures (remember, theory matters). I barely scratched the surface, but I want to keep my articles closer to presenting information and ideas, rather than selling ideas. I’ll end with a link to Microsoft’s white paper on F# in the Enterprise and Awesome Princess’ article on F# Basic Concepts.

F#, by the way, is my functional language of choice (another accident of history). Maybe I should have gone “all in” and studied Haskell.⁷

noSQL, Schemaless, Document-based Database

Tha relational database, like object-oriented programming, has been such a great tool it is easy to fall into the habit of assuming it is the right tool for every job. That it is a “good enough” tool for so many jobs perpetuates this confusion. Relational databases are not going away. In fact I expect the leading RDBMS’s to subsume more of the functionality currently staked-out under the noSQL banner.⁸ On the other hand the need for alternative database models is also not going away. We have to evaluate the right tool for the job at hand.

The noSQL moniker pretty much covers every alternative database model, a real heterogeneous group. My immediate interest in this polyglot space is the so-called schemaless database (which of course is not schemaless at all), more frequently referred to as document-oriented. document-based, or just document database. What interests me about this technology is the ability to store and reference data that logically belongs together, but is hard to coerce into a consistent relational schema.

As a reference point, let’s compare and contrast generic relational and document databases.

RDBMS has tables. Doc-database has collections.

RDBMS has rows. Doc-database has documents.

RDBMS has columns. Doc-database has fields, but any given fields are arbitrarily present or not present in a document, with the exception of an identifier field. In other words, the table schema in an RDBMS is locked-down (until some administrative action changes it, and then the change applies uniformly to all data in the table), while in a doc-database every single document in a collection could have a different schema.

RDBMS has indices on selected table columns. Doc-database has indices on selected fields. There is always some limit to the number of indices allowed per table/collection. The practical design and implementation considerations regarding indices between the two technologies are remarkably similar.

RDBMS allows joining tables on columns. While in principle limited joining would be possible in a doc-database technology, I don’t think any have actually implemented it. Like everything else to do with the schema, joining is left up to the application.

RDBMS allows partitioning and/or sharding . Doc-database allows sharding.

RDBMS implements SQL, providing many native aggregation functions. Doc-database allows aggregation through map-reduce or some aggregation functions through a native query language.

RDBMS selects specific columns over single rows or ranges, depending on the query predicate. A doc-database typically returns entire documents, singly or across ranges, depending on the query.

So in exchange for schema flexibility, the document database gives up joining data across collections. What good is that? It seems the sweet spot for this technology is collections of documents of variable composition that don’t interact too much with other data (at least not in a performance critical manner).

Here’s an example application: property data. There are lots of reasons why the schema of attributes for two properties differ. Let me divide the differences into two categories: inherent and accidental. Inherent differences in the schema stem from no two properties being exactly alike. My house and the Empire State Building are both “properties” yet they share few attributes, location and last sale price being a couple. Un-leased space may be an attribute of a commercial property, but not of my house. Accidental schema differences derive from how and why information was collected and stored. The tax assessor, the real estate agent, and the environmental scientist have differing interests in properties, and so collect different data, or name the same attribute differently. Even across (and within) different governmental agencies and different multiple listing services property data schemas can vary widely. Schemas for the same property can vary. Even an attribute as seemingly inherent as location can accidentally vary. (Is the latitude/longitude measured to the geometric centroid, or the property entrance, or something else?) Also the tag (attribute name) can vary widely across schemas.

Collections like this with varying schemas are well suited to document databases. There are a limited number of important attributes for searching (which may be coerced into uniformity), but real time access frequently focuses on one document at a time, and aggregation processing can run in the background. Schema processing (i.e. identifying and consuming fields) is left up to the application. This can be done by versioning schemas, or traversing the schema tree, which is typically formatted in JSON/BSON. Whatever the method used, it is left up to the application. Also because of the lack of joining and the database returning entire documents (I don’t believe the concept of covering index is implemented) you will frequently abandon the relational design principle of “only store once”.

For my upcoming articles I had intended to investigate multiple noSQL technologies in parallel, but given limited resources (like time), I currently plan to work in MongoDB, although I may take a closer look at graph databases like Neo4j.

Source Control: Git

I’m not going to neglect the low-level practical aspects of engineering and managing production-ready systems, and nothing is more fundamental to that end than source control. I will be relying for the first time on Git for the upcoming projects. From all accounts it has a lot of advantages over other source control systems, especially when it comes to collaborative development. I may not have much to say about it in the future (the more source control stays out of your way, the better), but I will mention GitHub has just come out with a tailoring of their system specifically for Windows. Discussions can be found on their blog as well as Hacker News.

The following Git Resources and Evangelizings courtesy of Phil Haack:

The Git Parable

Think Like (a) Git

Better Git with PowerShell

Posh-Git: A set of PowerShell scripts which provide Git/PowerShell integration

SeeGit – The Git Repository Visualizer

And of course links to actual reference documents are handy:

Git Reference

Pro Git book

Git Magic

Parsing

Parsing (syntactic analysis) is one of the best understood branches of computer science.

So begins the preface to the first edition of Parsing Techniques a Practical Guide. All the better that the second edition of this authoritative work was published recently (2010). Parsing in information technology is best know as a means to the end of compiler construction, but the principles involved have much wider application, especially in the area of extracting semantic information from data, an ambitious undertaking where theory really does matter. Navigating through Dick Grune’s Website you can find the complete text of the first edition, the 2nd edition’s Annotated Bibliography, Sample Parsers, and other useful information and links.

Tools available to the F# developer include FParsec – A Parser Combinator Library for F#, originally derived from Parsec, a “monadic parser combinator library for Haskell”; and FsLex and FsYacc along with the Wikibooks article F Sharp Programming/Lexing and Parsing.

NLP

Unlike the well-understood field of parsing, Natural Language Processing is progressing at a break-neck pace. What I believe to be the (or at least “a”) core “entry-level” text, Foundations of Statistical Natural Language Processing, is already 13 years old. In one recent survey of the top 10 research papers in computer science the number one paper, Latent Dirichlet Allocation, is in the NLP field, but submitted and published after my core text! You can guess what’s next on my reading list.

The companion website to FSNLP includes plenty of links to NLP tools, and I anticipate drawing on the excellent resources at Wordnik in addition to whatever else I discover (or you tell me about).

Conclusion

So there you have it. I’ll be writing about F# functional programming and document database technology combining one of the best understood with one of the fastest moving areas of computer science. I hope you find it interesting.

I look forward to interacting with you, my readers, on this and upcoming articles. Please leave your thoughtful comments and questions and I will respond to them as best I can. I hope to learn from you.

Notes

[1] A classmate was so inspired by Wylbur she published a poem about it in the UCSB Letters & Sciences Scholars magazine.

[2] There is a large corpus of literature on the web explaining how to adopt functional techniques to popular languages like C# and Java. I’m not linking to any of it because I think it is a bad idea. I understand the thinking of the authors. Their aim is to introduce functional techniques into languages already familiar to most programmers, but I find the resulting code hard to read and hard to understand, which makes it hard to follow the point of the of the exercise. Microsoft had been introducing more functional components into each release of C# (generic collections and LINQ, not strictly a part of the C# language, have been resounding successes), but may have reached the practical limit, since C# 5.0 offers no new functional components.

[3] From here on my use of “functional” refers to “purely functional”.

[4] Euclid reportedly told King Ptolemy there is no royal road to understanding geometry (i.e. it requires hard work and commitment). Of course there is no royal road to functional programming either, but an increasing number of students appear to be embarking on the journey.

[5] What about stack overflow? Languages enabling purely functional programming support tail call recursion, obviating this concern.

[6] I’m sure I can’t adequately explain what a DSL formally is.

[7] At least two of Microsoft’s high profile computer scientists, Dr. Erik Meijer and Dr. Brian Beckman (a physicist by training), are heavily involved in the Haskell community and even have recorded lectures on Haskel on the Microsoft Channel 9 site.

[8] The major RDBMS’s have long featured functionality not strictly under the relational algebra. Massive key-value stores, map-reduce, and document databases are all fair game for incorporation. That would only perpetuate the value to end users of a one size fits all DBMS.

craftyThoughts

Composability -> Expressiveness -> Correctness -> Testability -> Intention

Monthly Archives: May 2012

When Accuracy is the Wrong Evaluation Measure

Notes

A Course of F# Study

Development Environment

Core Course

Beyond the Basics

Theory Matters

Had Enough?

Notes

Introducing craftyThoughts

The Craft

Plan for craftyThoughts

Functional Language

noSQL, Schemaless, Document-based Database

Source Control: Git

Parsing

NLP

Conclusion

Notes