Joe Crobak's Website

Git Rebase Workflow

Posted on Jun 28, 2016

This blog post is an adaptation of an old presentation on some useful concepts and commands to keep your git history clean.

Goal

A clean git history, i.e.:

* cf43bac (HEAD, not-verbose) Title and Goal slides.
* e831535 Import project scaffolding
* 0d858b4 (master) initial commit

not:

* eb911c7 (HEAD, verbose) add bad example to gaol
* c01aacd add good example to goal
* f74382c fix type: Workflwo
* b9b376a Title and Goal slides.
* 7f492c6 Remove the example slide scaffolding
* 16ccdbb Import project scaffolding
* 0d858b4 (master) initial commit

Why?

A clean history = easier to see what happened when and why. e.g.:

A good git blame:

192f7d3c (Joe Crobak 2014-10-26 21:41:00 +0000 22) class: center, middle
192f7d3c (Joe Crobak 2014-10-26 21:41:00 +0000 23)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 24) # Git Rebase Workflow
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 25)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 26) ---
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 27)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 28) # Goal
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 29)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 30) A clean git history, i.e.:
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 31)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 32) ```
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 33) * cf43bac (HEAD, not-verbose) Title and Goal slides.
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 34) * e831535 Import project scaffolding
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 35) * 0d858b4 (master) initial commit
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 36) ```
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 37)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 38) not:
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 39)
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 40) ```
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 41) * eb911c7 (HEAD, verbose) add bad example to gaol
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 42) * c01aacd add good example to goal
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 43) * f74382c fix type: Workflwo
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 44) * b9b376a Title and Goal slides.
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 45) * 7f492c6 Remove the example slide scaffolding
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 46) * 16ccdbb Import project scaffolding
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 47) * 0d858b4 (master) initial commit
dafc1e27 (Joe Crobak 2014-10-26 21:46:03 +0000 48) ```
192f7d3c (Joe Crobak 2014-10-26 21:41:00 +0000 49)

Lines 24-48 are clearly one commit—you can tell by looking at the shas on the left side.

Versus an unclean history

16ccdbb4 (Joe Crobak 2014-10-26 21:41:00 +0000 22) class: center, middle
16ccdbb4 (Joe Crobak 2014-10-26 21:41:00 +0000 23)
f74382c7 (Joe Crobak 2014-10-26 21:46:36 +0000 24) # Git Rebase Workflow
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 25)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 26) ---
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 27)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 28) # Goal
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 29)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 30) A clean git history, i.e.:
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 31)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 32) ```
c01aacdc (Joe Crobak 2014-10-26 21:49:45 +0000 33) * cf43bac (HEAD, not-verbose) Title and Goal slides.
c01aacdc (Joe Crobak 2014-10-26 21:49:45 +0000 34) * e831535 Import project scaffolding
c01aacdc (Joe Crobak 2014-10-26 21:49:45 +0000 35) * 0d858b4 (master) initial commit
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 36) ```
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 37)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 38) not:
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 39)
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 40) ```
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 41) * eb911c7 (HEAD, verbose) add bad example to gaol
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 42) * c01aacd add good example to goal
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 43) * f74382c fix type: Workflwo
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 44) * b9b376a Title and Goal slides.
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 45) * 7f492c6 Remove the example slide scaffolding
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 46) * 16ccdbb Import project scaffolding
e449027d (Joe Crobak 2014-10-26 21:50:05 +0000 47) * 0d858b4 (master) initial commit
b9b376a1 (Joe Crobak 2014-10-26 21:46:03 +0000 48) ```
16ccdbb4 (Joe Crobak 2014-10-26 21:41:00 +0000 49)

Here the changes look like they were done piecemeal. See all the different shas on the left side. Is that information useful?

Let's look at the commit for the line with the title: f74382c7. What happened in that commit?

$ git show f74382c7
commit f74382c73b96f1f68cb1da7edeac14e0952ff6aa
Author: Joe Crobak <joe@undefined>
Date:   Sun Oct 26 21:46:36 2014 +0000

    fix type: Workflwo

diff --git a/presentation.html b/presentation.html
index 119f7c5..6786114 100644
--- a/presentation.html
+++ b/presentation.html
@@ -21,7 +21,7 @@

 class: center, middle

-# Git Rebase Workflwo
+# Git Rebase Workflow

 ---

Not very useful! (Note also that I had a typo in my commit message... I couldn't even spell 'typo' correctly.)

Utility of git commits

Think about what is useful to the future you or your teammate. e.g.

  • why was a block of code added?
  • when was a bug introduced?
  • can the commits be used to build a changelog?

How to keep git history clean

Go read about writing good git commit messages: http://tbaggery.com/2008/04/19/a-note-about-git-commit-messages.html

tl;dr:

  • First line is < 50 chars (used in git rebase -i, git log --pretty=oneline)
  • One item/feature per commit.
  • Descriptive commit messages.
  • Also, consider linking to the User Story/Ticket/etc item for the commit. This adds additional context for someone exploring a commit that changed a line of code.

Example of a good commit

Adding new feature 'bar'

Feature 'bar' improves response time by 10ms by x, y, and z.

Refs: http://jira.apache.org/path/to/jira/ticket.

Tools

Let's take a tour of several tools that we can use to keep our history clean:

  • git add --patch
  • git commit --amend
  • git cherry-pick
  • git rebase -i
  • git reset --soft
  • git reflog / git reset --hard

Caveat: these should be used before pushing to a remote—especially if someone else is using your remote branch! If you've already pushed to a remote, consider making a new one.

git add --patch

Useful when you have unrelated changes in a file (e.g. you fixed an unrelated typo and you want to keep it separate, or you combined a refactor and a new function). Drops you into an interactive session to select your changes for a single commit.

Ref: http://www.codeproject.com/Articles/650440/Git-Quick-Reference-Interactive-Patch-Staging-with
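For example, an interactive session looks roughly like this (the diff is the typo fix from earlier; the exact prompt options vary by git version and hunk):

$ git add --patch presentation.html
diff --git a/presentation.html b/presentation.html
index 119f7c5..6786114 100644
--- a/presentation.html
+++ b/presentation.html
@@ -21,7 +21,7 @@

 class: center, middle

-# Git Rebase Workflwo
+# Git Rebase Workflow

 ---
Stage this hunk [y,n,q,a,d,s,e,?]? y

Answering y stages the hunk, n skips it, s splits a large hunk into smaller pieces, and e lets you edit it by hand.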

git commit --amend

Useful when you found a typo, have a quick fix, or need to fix the commit message of the previous commit. Applies staged changes (if any) and lets you amend the commit message.
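A minimal sketch: stage the quick fix, then fold it into the previous commit.

$ git add presentation.html   # stage the quick fix
$ git commit --amend          # folds the staged change into the previous commit and opens $EDITOR

To keep the previous commit message untouched, use git commit --amend --no-edit.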

git cherry-pick

Useful when you want to pull apart a single branch into multiple branches so that you can submit separate branches/pull requests. Be careful about introducing merge conflicts; the cherry-picked commits should be truly independent.

e.g. let's say we have a branch that has two commits that implement separate features. We'd like to turn this branch into two separate feature branches (foo-feature-branch and bar-feature-branch):

[my-feature-branch] ~/code/example $ git lol
* 12d9b25 (HEAD, my-feature-branch) adding new feature 'bar'
* b68f1fa adding new feature 'foo'
* 94ba2ee (master) initial commit

Create a new feature branch for foo and cherry-pick the foo commit onto it.

[my-feature-branch] ~/code/example $ git checkout -b foo-feature-branch master
Switched to a new branch 'foo-feature-branch'
[foo-feature-branch] ~/code/example $ git cherry-pick b68f1fa
[foo-feature-branch ca2d000] adding new feature 'foo'
 1 file changed, 1 insertion(+)
 create mode 100644 foo.txt

Create a new feature branch for bar and cherry-pick the bar commit onto it.

[foo-feature-branch] ~/code/example $ git checkout -b bar-feature-branch master
Switched to a new branch 'bar-feature-branch'
[bar-feature-branch] ~/code/example $ git cherry-pick 12d9b25
[bar-feature-branch f0e95ca] adding new feature 'bar'
 1 file changed, 1 insertion(+)
 create mode 100644 bar.txt

Result:

[bar-feature-branch] ~/code/example $ git lol bar-feature-branch
* f0e95ca (HEAD, bar-feature-branch) adding new feature 'bar'
* 94ba2ee (master) initial commit
[bar-feature-branch] ~/code/example $ git lol foo-feature-branch
* ca2d000 (foo-feature-branch) adding new feature 'foo'
* 94ba2ee (master) initial commit

Note: this is only possible if you break up your commits initially!

git rebase -i

Useful when you have a bunch of local commits that you want to reorder or squash.

Run git rebase -i $commit (where $commit is an ancestor on your branch). Your editor will come up with something like:

pick 16ccdbb Import project scaffolding
fixup 7f492c6 Remove the example slide scaffolding
pick b9b376a Title and Goal slides.
fixup f74382c fix type: Workflwo

# Rebase 0d858b4..f74382c onto 0d858b4
#
# Commands:
#  p, pick = use commit
#  r, reword = use commit, but edit the commit message
#  e, edit = use commit, but stop for amending
#  s, squash = use commit, but meld into previous commit
#  f, fixup = like "squash", but discard this commit's log message
#  x, exec = run command (the rest of the line) using shell
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

Gotchas / notes

  • You can only rebase -i if you have a "clean" working directory.
  • git rebase -i pops you into your editor. Ensure you have $EDITOR set to something useful.
  • If you end up with a conflict (usually when reordering commits), you can abort the rebase with git rebase --abort.
  • This doesn't work well with merge commits (git rebase on the base branch rather than merge).
  • It rewrites history, so beware.
  • Prefix your commit messages with fixup: or squash: so they're easy to identify.

git reset --soft

Useful when you have a lot of local commits that are actually one commit. Do a "soft" reset to the commit before you started your work—all work ends up staged. Example:

current status:

[test] ~/code/example $ git lol
* 2c5d91c (HEAD, test) fixup: another fixup for bar
* 3706d1c fixup: bar
* 18af5f4 fixup: bar
* f0e95ca (bar-feature-branch) adding new feature 'bar'
* 94ba2ee (master) initial commit

reset to before I started working on bar:

[test] ~/code/example $ git reset --soft 94ba2ee

[test●] ~/code/example $ git lol
* 94ba2ee (HEAD, test, master) initial commit

[test●] ~/code/example $ git status
On branch test
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

  new file:   bar.txt

bar is staged, let's commit it:

[test●] ~/code/example $ git commit -m "add new feature 'bar'"

[test ae12373] add new feature 'bar'
 1 file changed, 1 insertion(+)
 create mode 100644 bar.txt

git reflog and git reset --hard

If things are screwed up, DON'T PANIC. The git reflog is your friend. It retains info on all local commits. For example, if I wanted to undo the git reset --soft from before:

Check the git reflog to find my old HEAD:

$ git reflog
ae12373 HEAD@{0}: commit: add new feature 'bar'
94ba2ee HEAD@{1}: reset: moving to 94ba2ee
2c5d91c HEAD@{2}: commit: fixup: another fixup for bar
3706d1c HEAD@{3}: commit (amend): fixup: bar
18dfb5d HEAD@{4}: commit: fixup: x
18af5f4 HEAD@{5}: commit: fixup: bar

Run git reset --hard to restore to 2c5d91c:

[test] ~/code/example $ git reset --hard 2c5d91c
HEAD is now at 2c5d91c fixup: another fixup for bar
[test] ~/code/example $ git lol
* 2c5d91c (HEAD, test) fixup: another fixup for bar
* 3706d1c fixup: bar
* 18af5f4 fixup: bar
* f0e95ca (bar-feature-branch) adding new feature 'bar'
* 94ba2ee (master) initial commit

Summary

Git is a powerful tool with a steep learning curve. This should be enough to help you get started with some of git's more advanced features. If you have questions or suggestions, you can find me on Twitter!

Appendix: My ~/.gitconfig

The following config sets up colors in git status, highlights whitespace in diffs, adds a number of useful aliases (git lol is my favorite, if you haven't noticed), and more.

$ cat ~/.gitconfig
[user]
  email = joe@undefined
  name = Joe Crobak
# great tips from http://cheat.errtheblog.com/s/git and
# http://mislav.uniqpath.com/2010/07/git-tips/
[color]
    ui = auto
[color "branch"]
    current = red bold
    local = blue
    remote = green
[color "diff"]
    meta = black
    frag = magenta
    old = red
    new = green
    whitespace = red reverse
[color "status"]
    added = green bold
    changed = yellow bold
    untracked = cyan bold
[core]
    # tabs are an error, as are trailing spaces
    whitespace=tab-in-indent,trailing-space
  excludesfile = /Users/joe/.gitignore
[alias]
    st = status
    ss = status -sb
    ci = commit
    br = branch
    co = checkout
    df = diff
    lg = log -p
    ls = log --oneline --decorate
    lol = log --graph --decorate --pretty=oneline --abbrev-commit
    lola = log --graph --decorate --pretty=oneline --abbrev-commit --all
    ls = ls-files
[difftool "sourcetree"]
    cmd = opendiff \"$LOCAL\" \"$REMOTE\"
    path =
[mergetool "sourcetree"]
    cmd = /Applications/SourceTree.app/Contents/Resources/opendiff-w.sh \"$LOCAL\" \"$REMOTE\" -ancestor \"$BASE\" -merge \"$MERGED\"
    trustExitCode = true
[filter "media"]
  clean = git-media-clean %f
  smudge = git-media-smudge %f

Luigi and AWS

Posted on Sep 7, 2014

Luigi and S3

One of Luigi’s two primitives is the Target. A Target is used to check for existence of data when determining if a Task can be run. It's also open to open a Target locally and read the data through a Luigi Task.

When working with data in S3, there are two ways to build targets. The first is to use Luigi's built-in support for shelling out to hadoop fs, which supports the s3n:// (S3-native) filesystem. For example:

from luigi.hdfs import HdfsTarget

target = HdfsTarget("s3n://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/1gram/data")
print(target.exists())

Configuration: Using an HdfsTarget requires setting fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in core-site.xml.
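For example, the relevant core-site.xml entries look like this (values are placeholders):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>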

Note: Shelling out to hadoop fs can be slow (you're starting up a JVM, so it's usually at least 1 second), which can add up when checking a lot of files.

The second option is to use Luigi's built-in S3Target (or one of its subclasses). In most situations, the S3EmrTarget is the most appropriate—this Target checks for existence of a _SUCCESS flag rather than existence of a "directory" (S3 doesn't really support directories—the closest approximation is a prefix query).

Note that this choice is important: when a MapReduce job outputs to HDFS, it typically renames the output atomically (this is only an operation on metadata in the NameNode) to move it to the final destination. But this is not possible with S3, which doesn't support atomic rename. Thus, the best solution is often to set mapreduce.fileoutputcommitter.marksuccessfuljobs to true and check for the existence of a _SUCCESS flag (which will be written after the files are all moved into their final destination).

In any case, here's a code example:

from luigi.s3 import S3Target, S3EmrTarget

target = S3Target("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/1gram/data")
print(target.exists())

target = S3EmrTarget("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/1gram/")
print(target.exists())  # checks for s3://.../1gram/_SUCCESS, which won't be there.

Configuration:

  • You must have boto installed.
  • You must have a [s3] section in your client.cfg with values for aws_access_key_id and aws_secret_access_key (IAM credentials should also be supported, but I haven't tried).
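The client.cfg section looks something like this (values are placeholders):

[s3]
aws_access_key_id=YOUR_ACCESS_KEY_ID
aws_secret_access_key=YOUR_SECRET_ACCESS_KEY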

Hybrid Targets

Luigi also supports a LocalFileSystem and local File. It can be useful to use these for local testing but S3 in production. In this case, it's easy to write a delegating Target. For example:

from luigi.file import File  # luigi's local-file Target
from luigi.s3 import S3EmrTarget

def DelegatingTarget(path, *args, **kwargs):
    # Factory function: return an S3 target for s3:// paths, a local File otherwise.
    if path.startswith("s3://"):
        return S3EmrTarget(path, *args, **kwargs)
    return File(path, *args, **kwargs)

s3_target = DelegatingTarget("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/1gram/")
local_target = DelegatingTarget("/tmp/foo")

Luigi and Redshift

Luigi has support for loading data stored in S3 into Amazon Redshift, which is a data warehousing system. The S3CopyToTable, S3JsonCopyToTable, and RedshiftManifestTask Tasks each implement a variant of loading data.

In order to keep track of which data has been loaded into Redshift, Luigi uses a marker table, which records when a table was updated. It uses the RedshiftTarget to insert entries into, and check for entries in, the marker table.

With the Task and Target, we have nearly all the pieces in place to build a Task to load data into Redshift. The last missing piece is some input data. A gotcha that tends to trip up folks new to luigi is that a Task can only require other Tasks. Thus, we need an ExternalTask for the input data (unless there is already a Luigi Task for generating the data). There are some pre-built ExternalTasks for data stored in S3—we'll be using S3PathTask.

Here's an example task to load the eng-gb 1grams into Redshift:

import luigi
from luigi.s3 import S3PathTask
from luigi.contrib.redshift import RedshiftTarget, S3CopyToTable


class OneGramsToRedshift(S3CopyToTable):
    s3_load_path = luigi.Parameter(default="s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/1gram/data")
    aws_access_key_id = luigi.Parameter()
    aws_secret_access_key = luigi.Parameter()

    def requires(self):
        return S3PathTask(path=self.s3_load_path)

    def create_table(self, connection):
        connection.cursor().execute(
            """CREATE TABLE {table} (

            )"""
        .format(table=self.table))

TODO


Migration to Jekyll

Posted on Sep 1, 2014

I've migrated my blog from self-hosted WordPress to Jekyll on S3. The site is more responsive (and much less of a security threat). Comments are now hosted on Disqus.

These articles were particularly helpful to me, and you might find them useful if you're doing a similar migration:


using jarjar to solve hive and pig antlr conflicts

Posted on Oct 20, 2012

Pig 0.9+ and Hive 0.7+ (and maybe older versions, too) both use antlr. Unfortunately, they use incompatible versions which causes problems if you try to pull in both pig and hive via ivy or maven. Oozie has come up with a deployment workaround for this problem, but HCatalog is still on Pig 0.8 because of the above issue.

Per the recommendation of a co-worker (thanks johng!), I checked out jarjar to shade the pig jar to avoid the conflicts of antlr and all the other dependencies in the pig "fat jar." Here are the steps that I used:

  1. Download jarjar 1.3 (note that the 1.4 release of jarjar seems to have a bad artifact).
  2. Generate a list of "rules" to rewrite the non-pig classes in the pig-withouthadoop jar:
    1. jar tf pig-0.10.0-cdh4.1.0-withouthadoop.jar | \
        grep "\.class" | grep -v org.apache.pig | \
        python -c "import sys;[sys.stdout.write('.'.join(arg.split('/')[:-1]) +'\n') for arg in sys.stdin]" | \
        sort | uniq | \
        awk '{ print "rule " $1 ".* org.apache.pig.jarjar.@0" }' > \
        pig-jarjar-automated.rules
    2. The above command generates one rule per package containing a class file, rewriting the class with a prefix of org.apache.pig.jarjar. The rules are stored in pig-jarjar-automated.rules. See below for the rules file that I generated.

  3. Run jarjar with the rules listed to generate a new jar:
    1. java -jar jarjar-1.3.jar process \
        pig-jarjar-automated.rules \
        pig-0.10.0-cdh4.1.0-withouthadoop.jar \
        pig-0.10.0-cdh4.1.0-withouthadoop-jarjar.jar
    2. Check out jarjar's command-line docs for more info, including the rules file format.
  4. Check the contents of your new jar.
    1.  jar tf pig-0.10.0-cdh4.1.0-withouthadoop-jarjar.jar | \
        egrep ".class$" | grep -c -v "org/apache/pig"
    2. The above command should return 0 to show that all classes have been rewritten under org/apache/pig.
That's it! We uploaded this jar into our internal nexus repo to use within tests. We're still using the vanilla pig jar for our pig CLI and within oozie, but we may do further testing to see what happens once we start rolling out hcatalog.

For pig-0.10.0-cdh4.1, the rules file looks like this:

rule com.google.common.annotations.* org.apache.pig.jarjar.@0
rule com.google.common.base.* org.apache.pig.jarjar.@0
rule com.google.common.base.internal.* org.apache.pig.jarjar.@0
rule com.google.common.cache.* org.apache.pig.jarjar.@0
rule com.google.common.collect.* org.apache.pig.jarjar.@0
rule com.google.common.eventbus.* org.apache.pig.jarjar.@0
rule com.google.common.hash.* org.apache.pig.jarjar.@0
rule com.google.common.io.* org.apache.pig.jarjar.@0
rule com.google.common.math.* org.apache.pig.jarjar.@0
rule com.google.common.net.* org.apache.pig.jarjar.@0
rule com.google.common.primitives.* org.apache.pig.jarjar.@0
rule com.google.common.util.concurrent.* org.apache.pig.jarjar.@0
rule dk.brics.automaton.* org.apache.pig.jarjar.@0
rule jline.* org.apache.pig.jarjar.@0
rule org.antlr.runtime.* org.apache.pig.jarjar.@0
rule org.antlr.runtime.debug.* org.apache.pig.jarjar.@0
rule org.antlr.runtime.misc.* org.apache.pig.jarjar.@0
rule org.antlr.runtime.tree.* org.apache.pig.jarjar.@0
rule org.apache.tools.bzip2r.* org.apache.pig.jarjar.@0
rule org.stringtemplate.v4.* org.apache.pig.jarjar.@0
rule org.stringtemplate.v4.compiler.* org.apache.pig.jarjar.@0
rule org.stringtemplate.v4.debug.* org.apache.pig.jarjar.@0
rule org.stringtemplate.v4.gui.* org.apache.pig.jarjar.@0
rule org.stringtemplate.v4.misc.* org.apache.pig.jarjar.@0


Workflow Engines for Hadoop

Posted on Jul 5, 2012

Over the past 2 years, I've had the opportunity to work with two open-source workflow engines for Hadoop. I used and contributed to Azkaban, written and open-sourced by LinkedIn, for over a year while I worked at Adconion. Recently, I've been working with Oozie, which is bundled as part of Cloudera's CDH3. Both systems have a lot of great features but also a number of weaknesses. The strengths and weaknesses of both systems don't always overlap, so I hope that each can learn from the other to improve the tools available for Hadoop.

In that vein, I'm going to provide a head-to-head comparison of the two systems, considering a number of different features. In the following comparisons, I'm considering the version of Azkaban found in master on github (with exceptions noted) and Oozie from CDH3u3.

Job Definition

Both systems support defining a workflow as a DAG (directed acyclic graph) made up of individual steps.

Azkaban

In Azkaban, a "job" is defined as a java properties file. You specify a job type, any parameters, and any dependencies that job has. Azkaban doesn't have any notion of a self-contained workflow -- a job can depend on any other job in the system. Each job has a unique identifier which is used to reference dependent jobs.

Oozie

In Oozie, a "jobs" are referred to as "actions". A workflow is defined in an XML file, which specifies a start action. There are special actions such as fork and join (which fork and join dependency graph), as well as the ability to reference a "sub-workflow" defined in another XML file.

Job Submission

Azkaban

To submit a job to Azkaban, one creates a tar.gz or zip archive and uploads it via Azkaban's web interface. The archive contains any jars necessary to run the workflow, which are automatically added to the job's classpath at launch time.

It's possible to bypass the archive upload (this is what we did at Adconion) and directly place the files on the filesystem, then tell Azkaban to reload the workflow definitions. I liked this approach because we were able to use RPMs to install workflows, which gave us the ability to roll back to a previous version.

Oozie

Oozie comes with a command-line program for submitting jobs. This command-line program interacts with the Oozie server via REST. Unfortunately, the REST api (at least in our version of Oozie) doesn't have very good error reporting. It's actually very easy to cause the server to 500, in which case you have to investigate Oozie's logs to guess at the problem.

Before submitting a job, the job definition, which is a folder containing xml and jar files, must be uploaded to HDFS. Any jars that are needed by the workflow should be placed in the "lib" directory of the workflow folder. Optionally, Oozie can include "system" libraries by setting a system library path in oozie-site and adding a property setting. Note that *only* HDFS is supported, which makes testing an Oozie workflow cumbersome since you must spin up a MiniDFS cluster.
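Staging a workflow folder into HDFS looks roughly like this (paths are placeholders):

$ hadoop fs -mkdir /user/joe/workflows/my-wf
$ hadoop fs -put workflow.xml /user/joe/workflows/my-wf/
$ hadoop fs -put lib /user/joe/workflows/my-wf/lib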

Running a Job

Azkaban

Azkaban provides a simple web-interface for running a job. Each job is given a name in its definition, and one can choose the appropriate job in the UI and click "Run Now". It's also easy to construct a HTTP POST to kick-off a job via curl or some other tool.

Determining which job to launch, though, can be quite confusing. With Azkaban, you don't launch via the "first" or "start" node in your DAG, but rather, you find the last node in your DAG and run it. This causes all the (recursive) dependent jobs to run.  This model means that you sometimes have to jump through some hoops to prevent duplicate work from occurring if you have multiple sinks with a common DAG.

Azkaban runs the driver program as a child process of the Azkaban process. This means that you're resource-constrained by the memory on the box, which caused us to DOS our box a few times (Azkaban does have a feature to limit the number of simultaneous jobs, which we did use to alleviate this problem. But then your job submission turns into FIFO).

Oozie

Once a workflow is uploaded to HDFS, one submits or runs a job using the Oozie client. You must give Oozie the full path to your workflow.xml file in HDFS as a parameter to the client. This can be cumbersome since the path changes if you version your workflows (and if you don't version your workflow, a re-submission could cause a running job to fail). Job submission typically references a java properties file that contains a number of parameters for the workflow.
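A typical submission looks roughly like this (host and paths are placeholders):

$ cat job.properties
nameNode=hdfs://namenode-host:8020
jobTracker=jobtracker-host:8021
oozie.wf.application.path=${nameNode}/user/joe/workflows/my-wf/workflow.xml

$ oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run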

Oozie runs the "driver" program (e.g. PigMain or HiveMain or your MapReduce program's main) as a MapTask. This has a few implications:

  1. If you have the wrong scheduler configuration, it's possible to end up with all map task slots occupied only by these "driver" tasks.
  2. If a map task dies (disable preemption and hope that your TaskTrackers don't die), you end up with an abandoned mapreduce job. Unless you kill the job, retrying at the Oozie level will likely fail.
  3. There's another level of indirection to determine what's happened if your job failed. You have to navigate from Oozie to Hadoop to the Hadoop job to the map task to the map task's output to see what happened.
  4. But also, you don't have a single box that you might DOS.

Scheduling a Job

Azkaban

Azkaban provides a WebUI for scheduling a job with cron-like precision. It's rather easy to recreate this HTTP POST from the command-line.

Oozie

Oozie has a great feature called "coordinators". A coordinator is an XML file that optionally describes datasets that a workflow consumes, and also describes the frequency of your dataset. For example, you can tell it that your dataset should be created daily at 1am. If there are input datasets described, then your workflow will only be launched when those datasets are available.

A coordinator requires a "startDate", which is rather annoying in the usual case (I just want to launch this workflow today and going forward… we have taken to making the startDate a parameter since we don't necessarily know when the coordinator will be released), but also makes it very easy to do a backfill of your data. E.g. if you have a new workflow that you want to run over all data from the first of the year onwards, just specify a startDate of Jan 1st.
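A bare-bones coordinator looks roughly like this (dataset declarations omitted; values are illustrative):

<coordinator-app name="daily-example" frequency="${coord:days(1)}"
                 start="2012-01-01T01:00Z" end="2013-01-01T01:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>${nameNode}/user/joe/workflows/my-wf</app-path>
    </workflow>
  </action>
</coordinator-app>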

Azkaban doesn't include anything like Oozie's coordinators. At Adconion, we wrote our own version of it, which also supported some nice features like reruns when data arrive late.

Security

Azkaban

Azkaban doesn't support secure Hadoop. That means that if you're running CDH3 or Hadoop 0.20.200+, all of your jobs will be submitted to Hadoop as a single user. There have been discussions about fixing this, and I know that Adconion was working on something. Even so, with the fair scheduler it's possible to assign jobs to different pools.

Oozie

Oozie has built-in support for secure Hadoop including kerberos. We haven't used this, but it does mean that you have to configure Hadoop to allow Oozie to proxy as other users. Thus, jobs are submitted to the cluster as the user that submitted the job to Oozie (although it's possible to override this in a non-kerberos setting).

Property Management

Azkaban

Azkaban has a notion of "global properties" that are embedded within Azkaban itself. These global properties can be referenced from within a workflow, and thus a generic workflow can be built as long as different values for these global properties are specified in different environments (e.g. testing, staging, production). Typical examples of global properties are things like the location of the pig global props and database usernames and passwords.

Azkaban determines which Hadoop cluster to talk to by checking for HADOOP_HOME and HADOOP_CONF_DIR directories containing core-site.xml and mapred-site.xml entries. This also allows you to specify things like the default number of reducers very easily.

Global properties are nice, because if you need to tweak one you don't have to redeploy the workflows that depend on them.

Oozie

Oozie doesn't have a notion of global properties. All properties must be submitted as part of every job run. This includes the jobtracker and the namenode (so make sure you have CNAMEs set up for those in case they ever change!). Also, Oozie doesn't let you refer to anything with a relative path (including sub-workflows!), so we've taken to setting a property called workflowBase that our tooling provides.

At foursquare, we've had to build a bunch of tooling around job submission so that we don't have to keep around all of these properties in each of our workflows. We're still stuck with resubmitting all coordinators, though, if we have to make a global change. Also, the jobtracker/namenode settings are extra annoying because you *must* specify these in each and every workflow action. Talk about boilerplate. I assume Yahoo has use-cases for supporting multiple clusters with a single Oozie instance, but the design over-complicates things for the typical case.

Reruns

Azkaban

A neat feature of Azkaban is partial reruns - i.e. if your 10 step workflow fails on step 8, then you can pick up from step 8 and just run the last 3 steps. This is possible via the UI. It was an attractive feature of Azkaban, but we didn't end up using it.

Oozie

In order to get a similar feature in Oozie, each action in your workflow must be a sub-workflow, then you can run the individual sub-workflows. At least in theory -- it turns out that you have to set so many properties that it becomes untenable, and even with the right magic incantation, I couldn't get this to work well.

Reruns of failed days in a coordinator are easy, but only in an all-or-nothing sense -- if the last step of the workflow failed, there's no easy way to rerun just that step.

UI

Azkaban

Azkaban has a phenomenal UI for viewing workflows (including visualizing the DAG!), run histories, submitting workflows, creating schedules, and more. The UI does have some bugs -- for example, when you run multiple instances of the same workflow, the history page gets confused. But in general, it's very easy to tell what the state of the system is.

Oozie

The Oozie UI, on the other hand, is not very useful. It's all Ajax, but is formatted in a window sized for a 1999 monitor. It's laggy, double-clicks don't always work, and things that should be links aren't. It's nearly impossible to navigate once you have a non-trivial number of jobs because jobs aren't named in any human-readable form, the UI doesn't support proper sorting, and it's too laggy.

Monitoring

Azkaban

Azkaban supports a global email notification whenever a job finishes. This is a nice and cheap mechanism to detect failures. Also, my Adconion colleague Don Pazel contributed a notification system that can be stitched up to detect failures, run times, etc., and expose these via JMX or HTTP. That's what we did at Adconion, but that piece wasn't open-sourced.

Oozie

With Oozie, it's possible to have an email action that mails on success or failure, but an action has to be defined for each workflow. Since there's no good way to detect failure, we've written a workflow that uses the Oozie REST api to check the status of jobs and then sends us a daily email. This is far from ideal since we sometimes don't learn about a failure until hours after it occurred.

Testing

Azkaban

Testing with Azkaban can be achieved by instantiating the Azkaban JobRunner and using the java api to submit a job. We had a lot of success with this at Adconion, and tests ran in a matter of seconds.

Oozie

Oozie has a LocalOozie utility, but it requires spinning up an HDFS cluster (since Oozie has lots of hard-coded checks that data lives in HDFS). Thus, integration testing is slow (on the order of a minute for a single workflow).

Oozie also has a class that can validate a schema, which we've incorporated into our build. But that doesn't catch things like parameter typos, or referencing non-existent actions.

Custom Job Types

Azkaban

Writing a custom job is fairly straightforward. The Azkaban API has some abstract classes you can subclass. Unfortunately, you must recompile Azkaban to expose a new job type.

Oozie

Admittedly, I haven't tried this. But the action classes that I've seen are well into the hundreds of lines of code.

Baked-in support

Azkaban

Azkaban has baked-in support for Pig, java, shell, and mapreduce jobs.

Oozie

Oozie has baked-in support for Pig, Hive, and Java. The shell and ssh actions have been deprecated. In addition, though, an Oozie action can have a "prepare" statement to clean up directories to which you might want to write. But these directories *must* be in HDFS, which means that if you use <prepare> then your code is less testable.

Storing State

Azkaban

Azkaban uses JSON files on the filesystem to store state. It caches these files in memory using an LRU cache, which makes the UI responsive. The JSON files are also easy to investigate if you want to poke around, bypassing the UI. Creating a backup is as easy as snapshotting a single directory on the filesystem.

Oozie

Oozie uses an RDBMS for storing state, so creating a backup means backing up that RDBMS via whatever mechanism is appropriate.

Documentation

Both systems feature rich documentation. Oozie's documentation tends to be much longer and larger since it includes XML fragments as well as details of the XML schemas. Over the years, Azkaban's documentation has at times fallen out of sync with the implementation, but the general docs are still maintained.

A note on Oozie versions

To be fair to Oozie, I haven't tried the latest version yet. Hopefully many of the issues I've noted are fixed, but if not, it'll be easier to file bug reports once I'm on the latest version. A number of bugs I found in the CDH3u3 version of Oozie were either fixed or not applicable to trunk, so it became difficult to keep track of what was what.

Summary

Both Azkaban and Oozie offer substantial features and are powerful workflow engines. The weaknesses and strengths of the systems tend to complement one-another, and it'd be fantastic if each system integrated the strengths of the other to improve.

It's worth noting that there are also a number of other workflow systems available that I haven't used. I am not discounting these systems, but I have zero authority to speak on them. Lots of folks also seem to be using in-house engines, and it'd be fantastic to see more work open-sourced.

Writing a general-purpose workflow engine is very hard, and there are certainly remnants of the "LinkedIn" or "Yahoo" way of doing things in each system. As communities grow, hopefully these engines will start to lose those types of annoyances.

The second half of 2012 could be very interesting with workflow-engine improvements. For example, there is talk of a new UI for building Oozie workflows, HCatalog and Oozie integration could be interesting, and YARN integration could make for a better solution for distributing the "driver" programs for workflows (I've heard rumblings that Oozie could go down this path).

Lastly, I realize that this post more than likely has mistakes or oversights. These are inadvertent -- if you find any, please make a note in the comments, and I will try to correct them.

