Cascalog on Windows 8

I am a complete novice in the *nix world, so running the first query of the first tutorial for Cascalog, a Clojure-based query language for Hadoop, took considerable learning and debugging on Windows 8. I’m pulling my notes together here because the information I needed was scattered over many places, and perhaps someone else can benefit. I imagine all of this applies to Windows 7 as well.

The *nix infrastructure you install on Windows expects an environment with no spaces in file paths. You will save yourself a lot of trouble by installing Java, Cygwin, Maven, and everything else directly under C:\ rather than “C:\Program Files”. As we will see, a whole bunch of OSS projects interact with each other here, so exact version numbers become very important.
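To make the no-spaces rule concrete, here is a minimal shell sketch that flags any path containing a space. It can be run from the Cygwin bash prompt once Cygwin is installed. The function name path_has_space and the list of paths are only illustrative assumptions; substitute your own install locations.

```shell
# Succeeds (exit status 0) when the given path contains a space.
path_has_space() {
  case "$1" in
    *" "*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Illustrative install locations (assumptions, adjust to your machine).
for p in 'C:\Java' 'C:\cygwin' 'C:\apache-maven-3.0.5' 'C:\Program Files\Java'; do
  if path_has_space "$p"; then
    printf 'WARNING, space in path: %s\n' "$p"
  else
    printf 'ok: %s\n' "$p"
  fi
done
```

Only the last path in the list should be reported with a warning, because it is the only one with a space in it.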

1) When the Cascalog documentation says to run on Java 1.6, it really means NOT on 1.7. Install both the JRE and the JDK, and be sure to override the install path to “C:\Java”. If the installer did not already do it, add the User Environment Variable “JAVA_HOME” with the value “C:\Java\jdk1.6.0_41”.
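A quick way to confirm which Java is actually being picked up is to look at the first line of the java -version banner. The sketch below assumes that java is on the PATH and that the banner has the usual java version "1.6.0_41" shape; is_java6 is a made-up helper name, not part of any tool.

```shell
# Succeeds when a "java -version" banner line reports a 1.6.x runtime.
is_java6() {
  case "$1" in
    *'"1.6.'*) return 0 ;;
    *)         return 1 ;;
  esac
}

if command -v java >/dev/null 2>&1; then
  # java -version writes its banner to stderr, so fold stderr into stdout.
  banner=$(java -version 2>&1 | head -n 1)
  if is_java6 "$banner"; then
    printf 'Java 1.6 found: %s\n' "$banner"
  else
    printf 'Not Java 1.6 (these notes expect 1.6): %s\n' "$banner"
  fi
else
  printf 'java is not on the PATH\n'
fi
```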

2) Install Apache Maven 3.0.5 to C:\apache-maven-3.0.5. Add the User Environment Variables “M2_HOME” with the value “C:\apache-maven-3.0.5” and “M2” with the value “%M2_HOME%\bin”. I also added the User Environment Variable “MAVEN_OPTS” with the value “-Xms512m -Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=512m” while fumbling around, but I do not think it is necessary.
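Once Cygwin from step 3 is in place, a small shell sketch like the following can confirm that the variables from steps 1 and 2 are visible to bash. It assumes that Cygwin passes the Windows environment variables through to the shell, which it normally does; var_is_set is a hypothetical helper name.

```shell
# Succeeds when the named environment variable is set and non-empty.
var_is_set() {
  [ -n "$(printenv "$1")" ]
}

# Variables set in steps 1 and 2.
for name in JAVA_HOME M2_HOME M2; do
  if var_is_set "$name"; then
    printf '%s=%s\n' "$name" "$(printenv "$name")"
  else
    printf '%s is not set\n' "$name"
  fi
done
```

If a variable shows up as not set, open a new shell after changing it in Windows, since a running shell keeps the environment it started with.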

3) Next up are Cygwin, to get that “Linux feeling on Windows” (I installed v1.7.17), and Hadoop (I installed v1.1.2). I’m going to refer you to this tutorial for installing both, but skip the part about Eclipse (if you really want Eclipse, you are on your own). A couple of things were either missing from the tutorial or I somehow overlooked them: add the System Environment Variable “CYGWIN” with the value “ntsec tty”, and add C:\cygwin\bin to the User Environment Variable “Path” as well as to the System Environment Variable “Path”.

Cygwin provides a bash shell. Here is the reference manual.

4) A couple of notes on the Cygwin install whose meaning I have forgotten, but they may make sense to you when you are in the middle of it:

After setting the environment variables: Start -> services.msc -> right-click the Cygwin service and start it.

Now run the command: ssh-host-config -y

5) You will need to install this patch for Windows.

…and I referenced this SO article for reasons I have forgotten. If you did not take my advice and installed Java in Program Files, you will need this SO article.

For completeness I will include this article on getting Hadoop to run on Cygwin 1.7 and JDK1.7 x64, but remember: Cascalog is not going to run on JDK 1.7!

6) Now we are up to installing Leiningen, which, as far as I can tell in my ignorance (I come from a .NET background), is a combination build tool and REPL for Clojure. It is a pretty important piece of the puzzle, so I recommend learning what you can about it. Copy “the batch file” (see the installation section of the README.md) to a path with no spaces. In Microsoft.PowerShell_profile.ps1, which lives in the WindowsPowerShell folder under your Documents folder, use set-alias to point lein at the lein.bat path. Open PowerShell and run lein self-install. This installs the most recent Leiningen (in my case 2.1.3), which IS NOT going to work with Cygwin and Cascalog.

7) So the next step is rolling Leiningen back to v2.1.0. Edit lein.bat so that the line near the top of the file reads “set LEIN_VERSION=2.1.0”. Now go find the file leiningen-2.1.0-standalone.jar on the internet (sorry, I goofed up and lost my reference to it, but you will find it), and drop it in the folder C:\Users\[your user]\.lein\self-installs. Run lein -v to verify that you are on v2.1.0 (and you should now be able to figure out how to switch Leiningen versions).
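If you would rather script the version check than eyeball it, something along these lines could work from a shell where lein is callable. It assumes the lein -v banner starts with the word Leiningen followed by the version number (for example "Leiningen 2.1.0 on Java 1.6.0_41 ..."); lein_version_of is a made-up helper name.

```shell
# Prints the version number from a "lein -v" banner.
# Assumed banner layout: "Leiningen <version> on Java ...".
lein_version_of() {
  printf '%s\n' "$1" | awk '$1 == "Leiningen" { print $2; exit }'
}

wanted='2.1.0'
if command -v lein >/dev/null 2>&1; then
  found=$(lein_version_of "$(lein -v 2>/dev/null)")
  if [ "$found" = "$wanted" ]; then
    printf 'Leiningen %s is active\n' "$found"
  else
    printf 'Expected Leiningen %s but found: %s\n' "$wanted" "$found"
  fi
else
  printf 'lein is not callable from this shell\n'
fi
```

Since these notes alias lein inside PowerShell, a bash shell may well report that lein is not callable; in that case running lein -v in PowerShell and reading the first line does the same job.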

8) To create a new Clojure project, change the current directory of your PowerShell session to the path you want and run lein new {project name}. Then change into the new folder that was just created and you are almost ready to run the Leiningen REPL (almost, because in nearly every case you will want to edit the project.clj file first to do something useful).

9) Last step: none of the current documentation properly addresses editing the project.clj file for running Cascalog under Cygwin. I got this cleared up on this thread in the very active cascalog-user Google group.

Here’s the working project.clj:

(defproject mycascalog "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [cascalog "1.10.1"]]
  ;; Hadoop goes in the :provided profile, not in the main :dependencies.
  :profiles {:provided
             {:dependencies [[org.apache.hadoop/hadoop-core "0.20.2-dev"]]}})

Run lein repl and you are ready to launch into the first Cascalog tutorial.