NLP

I fell into The Craft through an accident of history. I had returned to this country at the height of the ’82 recession after three years of adventures abroad with no special skills looking for any kind of entry level position to earn a living. It took nearly a year of night school and dead-end day jobs to land a position leading to System 370 assembler coding, developing health insurance claims adjudication systems. Ironically, my only brush with computers at university was a statistics course I dropped. Coding up punch cards was so uninspiring, but a tour of the data center exposed me to Wylbur.¹ The possibilities from that brief encounter nurtured my thoughts for years.

More accidents of history steered me toward the Microsoft technology stack, and kept me out of Internet development until the 21st century. Eventually life led me to an Internet startup in the SaaS space that became successful.

The Craft

In some ways today’s Internet workforce of engineers, coders, hackers, and analysts resembles the medieval masons who built castles and cathedrals. We are virtually, if not physically, itinerant, going from company to company, industry to industry, even founding industries. The journeyman stonecutter had more freedom than contemporary agricultural serfs because acquiring skill required time and practice and contemporary society was in an expansive phase of stone building. And both workforces understand building things. Cutting and setting stone then, and journeyman Internet work today, can provide a good living and lots of personal choice, but entrée to The Craft requires more. It requires seriousness of purpose and love of understanding…

Lest I wax too philosophical, let me call out one Master of The Craft, Douglas Crockford, on the importance of studying history, Crockford on JavaScript – Volume 1: The Early Years. And lucky lazy student! The contiguous history of our craft only stretches back to Alan Turing’s seminal 1936 paper, On Computable Numbers, with an Application to the Entscheidungsproblem. There were significant precursors, most notably Boole and Babbage (and some might include Leibnitz and Lovelace).

One more bit of grounding from an established Master

“…there’s really no such thing as raw general intelligence–it’s all just computation.”

– Stephen Wolfram

Plan for craftyThoughts

So with applications experience spanning eight industries and a long-time interest in semantics, I’m launching this site to approach the problem of meaning (semantics) in data and automated systems. To that end I throw out this postulate (which I do not attempt to prove): theory matters. If you attempt anything at all sophisticated in automation, you must build on the foundational engineering work that has come before. Theory and history teach us what is possible.

I intend to scratch my own itch. Articles will take the form of topical notes organized for presentation. Rather than crafting persuasive arguments or rehashing information widely available I will link to where others have already done this. I don’t claim prior expertise in any of the areas I am investigating. I’m not sure I will ever present anything original, and if I do convey any synergy whatsoever, it is because I stand on a foundation built by others.

Among the enabling technologies and application theory I intend to investigate and present:

Functional Language

If you have some facility in the most important programming language today and you know about the good parts, then you know about functions as first-class citizens, anonymous functions, and closures). Congratulations! You have scratched the surface of functional programming, and along with the goodness this provides, you may also wonder if the trade of functionality for problems was a good deal given the issues with debugging anonymous functions in JavaScript. This is hardly an endorsement of functional programming. In exchange for some useful tools you now write software that is harder to maintain.

Well, if JavaScript is your entire exposure to functional language, then you have not been entirely exposed. I’m going to make a bold claim: writing software using functional programming in a purely functional style (if not already enforced by the language) will result in far, far fewer defects right out of the box, more concise software (i.e. fewer lines of code), software whose source code more readily reveals its intent, and software that in many, if not most cases can easily be multi-threaded. (It is possible to write purely functional software in established object-oriented and imperative languages, but I don’t recommend it.)²

Functional languages have been around since 1958, when the grand-daddy of them all, Lisp, appeared. So if functional languages have been around so long, why aren’t they more “mainstream”? Some authors have put forth the “victim of own success” argument, but I think the root cause is more profound, having to do with how we were taught to think about programming. From John von Neumann to Donald Knuth the pioneers of Computer Science naturally fell into the imperative style of programming (with the notable exception of John McCarthy).

do this

do this

test this

depending on results

loop

branch

etc.

Imperative programming requires less thought up front. Indeed, most experienced programmers fall right into writing code without fully thinking through the problem at hand because they intuitively recognize a familiar design pattern (even if they cannot name or explain the pattern). The success of this technique depends on “just in time” thinking through problems as they arise. For your typical CRUD application this is good enough. In the best cases a skilled and careful programmer delivers a structured, simple application incorporating enough, but not too much object-orientedness. In the hands of lesser talent all sorts of problems arise. Over-reliance on modern debugging tools during the code writing phase can lead to hacks upon hacks, solutions to problems in code ever further removed from the application’s problem at hand. Over-exuberant OO style leads to less maintainable code, and on and on. Combating the problems inherent to imperative programming has spawned whole classes of techniques, products, and specialties: detailed coding specifications (beyond mere functional specs), shop-enforced coding standards, code walk-throughs, code quality analysis tools, code generation tools, TDD, BDD, etc.

Sixty years worth of tools and techniques arising from the problems of imperative programming still have their place for functional programming³, don’t get me wrong. And imperative programming is not going away. It is responsible for nearly all the foundations of the computer-enabled culture we live in, including the website you are now on. It’s just that in many cases, especially the hard cases, there is a better way going forward.

So why has functional programming, despite its advantages, not gone “mainstream” (i.e. why aren’t medium and large shops using it)? And why the intensity of interest in functional programming today? Let me answer the second question first: better access to educational opportunities. Remember I said theory matters? Well theory doesn’t stand still, especially in a field as young as computer science/engineering. The foundations upon which engineering students are educated and the avenues available for continuing education just keep getting better, practically on a daily basis. When Andrew Ng’s Machine Learning course at Stanford, which typically enrolled 350 students, was first offered as a free online course in 2012, reportedly over 100,000 people enrolled worldwide. With easy access to resources, and a solid theoretical basis, the Royal Road⁴ to understanding functional programming (along with virtually every other science and engineering topic) is open to all. There is a ground swell of young (and not so young) engineers discovering functional programming for themselves.

Finally, my take on the difficulty of establishing functional programming as a more “mainstream” development paradigm. It involves a radically different thought process. You have to do more thinking up-front and you have to do a different kind of thinking. Whereas the thinking in imperative development is along the lines of “how to do something”, functional development requires in-depth comprehension of “what we are doing”. Let that sink in for a minute. It is not trivial. The general course of development for functional programming is to abstract the problem at hand. This is what I mean when I say functional programs better reveal the “intent” of the task they accomplish. This is a completely different thought process. Do you remember in grammar school when most of your classmates hated math word problems? I read once that John von Neumann’s greatest skill was his ability to take a problem from any field and abstract it to the right mathematical form. Rather than a dozen or so design patterns to automatically fall into and start coding, as with imperative programming, you have literally all of abstract math and logic to choose from.

Fortunately that is not as daunting as it sounds. There are a couple of meta design patterns that become second nature in functional programming. One is recursion. Recursion gets talked about in imperative programming circles too. It’s something you “should” know how to do. In practice it’s not used all that often, perhaps because in imperative languages the implementation always feels stilted (and there’s something called stack overflow to worry about if you recurse too many times). Functional languages, on the other hand, make recursion so easy and natural that with a little practice you will be seeing recursion in practically every problem.⁵ Another very useful tool, available in the ML family of languages and some others, is pattern matching, which let’s you interrogate the “shape” of a data structure in very sophisticated ways. Some comparisons can be drawn to regular expressions, but don’t take the analogy too far. Regular expressions is for matching patterns in strings, while “pattern matching” operates on data structures (which can still be a string), and you can generally rely on your pattern matching code to be more transparent than some of the crazy regular expressions you have seen (once again, revealing intent), and therefore more reliable (i.e. less prone to unintended behavior).

Another reason not to be daunted by the word problems, you seldom abstract the problem at hand into a single function. Rather the thought process usually starts with identifying the lowest level problems in the bigger assignment, tackling each in turn and building up to solving the whole problem.

Skeptic: But wait! We do this all the time in imperative and OO languages. We create class libraries and interfaces and API’s.

Me: In functional programming you create functions specific to your business at hand. What you end up with is a programming language specific to the area of your problem, a Domain Specific Language or DSL.⁶

Skeptic: That’s a distinction without a difference.

Me: OK, I’ve been holding out on you. In (purely) functional programming there are no variables. In fact, strike that word from your vocabulary when talking about functional programming. There are values bound to tokens. That’s it. It never changes. The beauty of this is once your statement builds, it probably works correctly (you’ll know soon enough). Go ahead and throw in a unit test. That’s good practice after all. But literally, instead of programmers using the debugger to guess how to get out of the kludge they coded themselves into, you use the REPL to quickly verify the statement you just wrote does what you thought through. After that you’re done. It never changes. That’s also why your functional programs are so easy to multi-thread.

Skeptic: You can’t code without variables!

Well, it certainly is a paradigm shift. There’s a lot less typing (even less mouse clicking) and a lot more thinking going on. Think back again to word problems again. Now we’re talking algebra, calculus, and beyond. You know why so few people are good at them? Because it takes a lot of force to overcome the inertia of not thinking about them. And it’s much easier for teachers to stick to teaching the fundamentals than mentoring students on how to think through applications.

Anyway, enough evangelizing about functional programming. There’s a lot more theory behind this: currying, monads, purely functional data structures (remember, theory matters). I barely scratched the surface, but I want to keep my articles closer to presenting information and ideas, rather than selling ideas. I’ll end with a link to Microsoft’s white paper on F# in the Enterprise and Awesome Princess’ article on F# Basic Concepts.

F#, by the way, is my functional language of choice (another accident of history). Maybe I should have gone “all in” and studied Haskell.⁷

noSQL, Schemaless, Document-based Database

Tha relational database, like object-oriented programming, has been such a great tool it is easy to fall into the habit of assuming it is the right tool for every job. That it is a “good enough” tool for so many jobs perpetuates this confusion. Relational databases are not going away. In fact I expect the leading RDBMS’s to subsume more of the functionality currently staked-out under the noSQL banner.⁸ On the other hand the need for alternative database models is also not going away. We have to evaluate the right tool for the job at hand.

The noSQL moniker pretty much covers every alternative database model, a real heterogeneous group. My immediate interest in this polyglot space is the so-called schemaless database (which of course is not schemaless at all), more frequently referred to as document-oriented. document-based, or just document database. What interests me about this technology is the ability to store and reference data that logically belongs together, but is hard to coerce into a consistent relational schema.

As a reference point, let’s compare and contrast generic relational and document databases.

RDBMS has tables. Doc-database has collections.

RDBMS has rows. Doc-database has documents.

RDBMS has columns. Doc-database has fields, but any given fields are arbitrarily present or not present in a document, with the exception of an identifier field. In other words, the table schema in an RDBMS is locked-down (until some administrative action changes it, and then the change applies uniformly to all data in the table), while in a doc-database every single document in a collection could have a different schema.

RDBMS has indices on selected table columns. Doc-database has indices on selected fields. There is always some limit to the number of indices allowed per table/collection. The practical design and implementation considerations regarding indices between the two technologies are remarkably similar.

RDBMS allows joining tables on columns. While in principle limited joining would be possible in a doc-database technology, I don’t think any have actually implemented it. Like everything else to do with the schema, joining is left up to the application.

RDBMS allows partitioning and/or sharding . Doc-database allows sharding.

RDBMS implements SQL, providing many native aggregation functions. Doc-database allows aggregation through map-reduce or some aggregation functions through a native query language.

RDBMS selects specific columns over single rows or ranges, depending on the query predicate. A doc-database typically returns entire documents, singly or across ranges, depending on the query.

So in exchange for schema flexibility, the document database gives up joining data across collections. What good is that? It seems the sweet spot for this technology is collections of documents of variable composition that don’t interact too much with other data (at least not in a performance critical manner).

Here’s an example application: property data. There are lots of reasons why the schema of attributes for two properties differ. Let me divide the differences into two categories: inherent and accidental. Inherent differences in the schema stem from no two properties being exactly alike. My house and the Empire State Building are both “properties” yet they share few attributes, location and last sale price being a couple. Un-leased space may be an attribute of a commercial property, but not of my house. Accidental schema differences derive from how and why information was collected and stored. The tax assessor, the real estate agent, and the environmental scientist have differing interests in properties, and so collect different data, or name the same attribute differently. Even across (and within) different governmental agencies and different multiple listing services property data schemas can vary widely. Schemas for the same property can vary. Even an attribute as seemingly inherent as location can accidentally vary. (Is the latitude/longitude measured to the geometric centroid, or the property entrance, or something else?) Also the tag (attribute name) can vary widely across schemas.

Collections like this with varying schemas are well suited to document databases. There are a limited number of important attributes for searching (which may be coerced into uniformity), but real time access frequently focuses on one document at a time, and aggregation processing can run in the background. Schema processing (i.e. identifying and consuming fields) is left up to the application. This can be done by versioning schemas, or traversing the schema tree, which is typically formatted in JSON/BSON. Whatever the method used, it is left up to the application. Also because of the lack of joining and the database returning entire documents (I don’t believe the concept of covering index is implemented) you will frequently abandon the relational design principle of “only store once”.

For my upcoming articles I had intended to investigate multiple noSQL technologies in parallel, but given limited resources (like time), I currently plan to work in MongoDB, although I may take a closer look at graph databases like Neo4j.

Source Control: Git

I’m not going to neglect the low-level practical aspects of engineering and managing production-ready systems, and nothing is more fundamental to that end than source control. I will be relying for the first time on Git for the upcoming projects. From all accounts it has a lot of advantages over other source control systems, especially when it comes to collaborative development. I may not have much to say about it in the future (the more source control stays out of your way, the better), but I will mention GitHub has just come out with a tailoring of their system specifically for Windows. Discussions can be found on their blog as well as Hacker News.

The following Git Resources and Evangelizings courtesy of Phil Haack:

The Git Parable

Think Like (a) Git

Better Git with PowerShell

Posh-Git: A set of PowerShell scripts which provide Git/PowerShell integration

SeeGit – The Git Repository Visualizer

And of course links to actual reference documents are handy:

Git Reference

Pro Git book

Git Magic

Parsing

Parsing (syntactic analysis) is one of the best understood branches of computer science.

So begins the preface to the first edition of Parsing Techniques a Practical Guide. All the better that the second edition of this authoritative work was published recently (2010). Parsing in information technology is best know as a means to the end of compiler construction, but the principles involved have much wider application, especially in the area of extracting semantic information from data, an ambitious undertaking where theory really does matter. Navigating through Dick Grune’s Website you can find the complete text of the first edition, the 2nd edition’s Annotated Bibliography, Sample Parsers, and other useful information and links.

Tools available to the F# developer include FParsec – A Parser Combinator Library for F#, originally derived from Parsec, a “monadic parser combinator library for Haskell”; and FsLex and FsYacc along with the Wikibooks article F Sharp Programming/Lexing and Parsing.

Unlike the well-understood field of parsing, Natural Language Processing is progressing at a break-neck pace. What I believe to be the (or at least “a”) core “entry-level” text, Foundations of Statistical Natural Language Processing, is already 13 years old. In one recent survey of the top 10 research papers in computer science the number one paper, Latent Dirichlet Allocation, is in the NLP field, but submitted and published after my core text! You can guess what’s next on my reading list.

The companion website to FSNLP includes plenty of links to NLP tools, and I anticipate drawing on the excellent resources at Wordnik in addition to whatever else I discover (or you tell me about).

Conclusion

So there you have it. I’ll be writing about F# functional programming and document database technology combining one of the best understood with one of the fastest moving areas of computer science. I hope you find it interesting.

I look forward to interacting with you, my readers, on this and upcoming articles. Please leave your thoughtful comments and questions and I will respond to them as best I can. I hope to learn from you.

Notes

[1] A classmate was so inspired by Wylbur she published a poem about it in the UCSB Letters & Sciences Scholars magazine.

[2] There is a large corpus of literature on the web explaining how to adopt functional techniques to popular languages like C# and Java. I’m not linking to any of it because I think it is a bad idea. I understand the thinking of the authors. Their aim is to introduce functional techniques into languages already familiar to most programmers, but I find the resulting code hard to read and hard to understand, which makes it hard to follow the point of the of the exercise. Microsoft had been introducing more functional components into each release of C# (generic collections and LINQ, not strictly a part of the C# language, have been resounding successes), but may have reached the practical limit, since C# 5.0 offers no new functional components.

[3] From here on my use of “functional” refers to “purely functional”.

[4] Euclid reportedly told King Ptolemy there is no royal road to understanding geometry (i.e. it requires hard work and commitment). Of course there is no royal road to functional programming either, but an increasing number of students appear to be embarking on the journey.

[5] What about stack overflow? Languages enabling purely functional programming support tail call recursion, obviating this concern.

[6] I’m sure I can’t adequately explain what a DSL formally is.

[7] At least two of Microsoft’s high profile computer scientists, Dr. Erik Meijer and Dr. Brian Beckman (a physicist by training), are heavily involved in the Haskell community and even have recorded lectures on Haskel on the Microsoft Channel 9 site.

[8] The major RDBMS’s have long featured functionality not strictly under the relational algebra. Massive key-value stores, map-reduce, and document databases are all fair game for incorporation. That would only perpetuate the value to end users of a one size fits all DBMS.

craftyThoughts

Composability -> Expressiveness -> Correctness -> Testability -> Intention

Category Archives: NLP

Introducing craftyThoughts