Hardscrabble 🍫

By Max Jacobson

chainable shell functions

06 Feb 2017

I learned a neat shell script refactoring strategy yesterday that I’d like to share. First some background:

I used to use a rubygem to help me write this blog. When I wanted to create a new post called “chainable shell functions” I would have run:

bin/poole draft "chainable shell functions"

And it would create a file called _drafts/chainable-shell-functions.md with some metadata in the first few lines.

Yesterday I got the urge to try replacing that rubygem with a custom shell script which does exactly the same thing.

I am an enthusiastic novice shell scripter.

I’m vaguely aware there are different dialects of shell scripting and that I’m probably using the wrong one.

Really, I’m no expert in this stuff.

But while writing this one I learned something interesting that I’m going to share with you now.

Here is the first draft (annotated with comments for your convenience):

#!/usr/bin/env bash

# fail fast: exit as soon as any command fails
set -e

# read all of the arguments into a string
title="$*"

# OK don't worry about this gnarly line, I'm going to break it down
slug=$(
  echo "$title" | sed "s/ /-/g" | tr -dc '[:alnum:]-' | tr '[:upper:]' '[:lower:]'
)

# the file we're going to create
filename="./_drafts/$slug.md"

# create the folder if it doesn't already exist
mkdir -p _drafts

# stop if the file already exists -- I don't want to overwrite an in-progress draft
if [[ -e "$filename" ]]; then
  echo "$filename already exists"
  exit 1
fi

# create the draft by redirecting a string into a file
# (the filename is quoted so an unusual slug can't be split into several words)
echo "---
title: $title
date: $(date '+%Y-%m-%d')
---

Alright, this is where your post goes." > "$filename"

# print a success message
echo "Created $filename"

OK did you read that? Great.

So you saw that line I promised I would break down? The idea with that line is that I want to take the input, which is the title of the post, and figure out what is an appropriate filename for the post. I’m figuring that out by applying a series of transformations to the title:

  • echo "$title"
    • just repeats the title, directing the output into a “pipe”, which the next command will read
  • sed "s/ /-/g"
    • sed is a “stream editor”; it reads in a stream of data and prints out a stream of data
    • here we’re using regular expressions to “s” or substitute all occurrences of a space with - (hyphen)
    • we want hyphens because they make for nicer looking URLs than spaces, which get escaped to %20.
    • the g at the end means “global”; without it, we would only substitute the first space
  • tr -dc '[:alnum:]-'
    • tr is short for “translate”
    • -d means “delete”
    • -c means “complement”
    • this command means “delete all the characters that complement this set of characters”
    • in other words, “delete all the characters that aren’t alphanumeric or a hyphen”
  • tr '[:upper:]' '[:lower:]'
    • “translate” again!
    • this time we’re translating all of the upper-case letters to lower-case letters
  • Finally, there are no more pipes, so the output of the last command is captured by the surrounding $( ) and saved in the slug variable.

OK so that’s a lot going on in one line, and because of the compact nature of these commands, it’s not super readable.

In other languages, when I have a lot going on in one function, I want to split out smaller, well-named functions. Can I do the same thing here?

At first I wasn’t sure. I knew it was possible to write functions that received arguments by checking $1, $2, etc in the function, but I wasn’t sure how to make them “return” values…

After a little googling I learned: you can just write a shell function that calls commands that read from a pipe, and pipe things to that function.

Let me show you what I mean.

Here’s the second (and, frankly, final) draft:

#!/usr/bin/env bash

set -e

function dashify() {
  sed "s/ /-/g"
}

function removeSpecialChars() {
  tr -dc '[:alnum:]-'
}

function downcase() {
  tr '[:upper:]' '[:lower:]'
}

title="$*"
slug=$(
  echo "$title" | dashify | removeSpecialChars | downcase
)
filename="./_drafts/$slug.md"
mkdir -p _drafts

if [[ -e "$filename" ]]; then
  echo "$filename already exists"
  exit 1
fi

echo "---
title: $title
date: $(date '+%Y-%m-%d')
---

Alright, this is where your post goes." > "$filename"

echo "Created $filename"

Look at that!

When to use defined? to memoize in Ruby

05 Feb 2017

Here’s a quick Ruby thing.

traditional memoization in Ruby

Let’s say you have an object whose responsibility is to give a haircut to a dog.

(I may have recently been reading about this)

class DogStylist
  def initialize(dog_id)
    @dog_id = dog_id
  end

  def perform
    if dog
      dog.sedate
      dog.groom
      dog.instagram
    end
  end

  private

  def dog
    Dog.find(@dog_id)
  end
end

This is kind of fine, but it has one problem: each reference to dog calls the dog method, which queries the database every time it’s called. You end up querying the database over and over when you really only need to do so once.

Better to write it like this:

def dog
  @dog ||= Dog.find(@dog_id)
end

Here you’re still calling the dog method over and over, but now it’s “memoizing” the result of the database query.

But what does that mean?

Here’s a more verbose version of the dog method that does the same thing:

def dog
  @dog = @dog || Dog.find(@dog_id)
end

You can see that ||= is a syntactical shorthand similar to +=.

In case you’re unfamiliar with +=, here’s an example. These two statements are equivalent:

count = count + 1
count += 1

Here’s an even more verbose version of the dog method that does the same thing:

def dog
  if @dog
    @dog
  else
    @dog = Dog.find(@dog_id)
  end
end

The goal here is to avoid evaluating the database query more than once. The first time the method is called, the @dog instance variable is not defined. In Ruby, it’s safe to reference an instance variable that isn’t defined. It will return nil. And nil is falsey, so the database query will be evaluated, and its result assigned to the instance variable.
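Here’s a minimal sketch of that behavior, if you’d like to see it run. The Dog class below is a hypothetical stand-in that only counts lookups (there’s no real database), and I’ve left the dog method public so the script can call it directly:

```ruby
# Hypothetical stand-in for a database-backed model; it only counts lookups.
class Dog
  @lookups = 0

  class << self
    attr_reader :lookups

    def find(id)
      @lookups += 1
      new(id)
    end
  end

  def initialize(id)
    @id = id
  end
end

class DogStylist
  def initialize(dog_id)
    @dog_id = dog_id
  end

  # Public here (unlike the original class) so the demo can call it.
  def dog
    @dog ||= Dog.find(@dog_id)
  end
end

stylist = DogStylist.new(1)
3.times { stylist.dog }
puts "lookups: #{Dog.lookups}"
```

The first call finds @dog unset (nil), runs the lookup, and stores the result. The next two calls see a truthy @dog and skip the lookup, so the script prints lookups: 1.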

This is where things get interesting.

Ponder this question: does this memoization strategy guarantee that the database query will only be executed once, no matter how many times the dog method is called?

It doesn’t.

Why????

I’ll tell you.

What if there is no dog with that ID? Let’s say Dog.find(4000) returns either a dog or nil (in Rails, that’s how find_by behaves; find raises an error instead). And, like we said earlier, nil is falsey. So hypothetically, if our perform method looked like this:

def perform
  dog
  dog
  dog
  dog
  dog
end

… then we would execute the database query five times, even though we made an effort to prevent that.

This is actually totally fine, because our perform method isn’t written like that (again, that was just a hypothetical). Our perform method only calls the dog method more than once if it’s truthy, so there’s no problem here.
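If you want to watch the hypothetical play out, here’s a small sketch. It assumes a fake Dog.find that never finds anything and always returns nil; the $lookups counter is only there for the demonstration:

```ruby
$lookups = 0

# Hypothetical finder that never finds a dog, standing in for a missing record.
class Dog
  def self.find(_id)
    $lookups += 1
    nil
  end
end

class DogStylist
  def initialize(dog_id)
    @dog_id = dog_id
  end

  # Public here so the demo can call it directly.
  def dog
    @dog ||= Dog.find(@dog_id)
  end
end

stylist = DogStylist.new(4000)
5.times { stylist.dog }
puts "lookups: #{$lookups}"
```

Because nil is stored back into @dog each time, every call looks just like the first one, and the script prints lookups: 5.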

memoization using defined?

Let’s consider another example, where things aren’t as hunky-dory. Hold please while I contrive one.

OK, I’ve got it.

Let’s say we only want to groom a dog when he or she is unkempt. When she logs into our web site, we want to pepper some subtle calls to action throughout the page encouraging her to book an appointment. We’ll need a method to check if she is unkempt, and we’re going to call it a few times. It looks like this:

class Dog
  HAIRS_THRESHOLD = 3_000_000

  def unkempt?
    Hair.count_for(self) > HAIRS_THRESHOLD
  end
end

That’s right: we’ve got a table in our database for all of the hairs on all of our dogs.

You can imagine this unkempt? method might be kind of “expensive”, which is to say “slow”.

Let’s try adding some memoization to this method:

def unkempt?
  @unkempt ||= Hair.count_for(self) > HAIRS_THRESHOLD
end

Here our goal is to prevent doing the expensive database query (Hair.count_for(self)) more than once.

Ponder this question: does our memoization strategy accomplish this goal?

Answer: it does not.

What?????

I know. Let me show you.

You can try running this Ruby script yourself:

$count = 0
class Hair
  def self.count_for(dog)
    $count += 1
    puts "called #{$count} times"
    2_000_000
  end
end

class Dog
  HAIRS_THRESHOLD = 3_000_000

  def unkempt?
    @unkempt ||= Hair.count_for(self) > HAIRS_THRESHOLD
  end
end

dog = Dog.new
puts "Is the dog unkempt? #{dog.unkempt?}"
puts "Is the dog unkempt? #{dog.unkempt?}"

It outputs the following:

called 1 times
Is the dog unkempt? false
called 2 times
Is the dog unkempt? false

In this script, I have a fake implementation of the Hair class. It’s meant to demonstrate that the count_for method is being called more than once, even though we specifically tried for it not to.

So what’s going on here?

Well, in a way, everything is working as it’s supposed to. The first time we call the unkempt? method, the @unkempt instance variable is not defined, which means it returns nil, which is falsey. When the instance variable is falsey, we evaluate the expression and assign its result, false, to the instance variable. The second time we call the unkempt? method, the @unkempt instance variable is defined, but its value is now false, which is also falsey (which you have to admit is only fair). So, again, because the instance variable is falsey, we evaluate the expression and assign its result, false, to the instance variable.

Shoot – that kind of makes sense.

So what to do? Here’s another way to write this:

def unkempt?
  if defined?(@unkempt)
    @unkempt
  else
    @unkempt = Hair.count_for(self) > HAIRS_THRESHOLD
  end
end

This approach uses Ruby’s built-in defined? keyword to check whether the instance variable is defined at all, rather than whether its value is truthy. This is more resilient to the possibility that your value may be falsey.
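To check that claim, here’s the earlier script again with only the unkempt? method swapped for the defined? version. The Hair class is still the same fake implementation:

```ruby
$count = 0

# Fake Hair class that counts how many times count_for is called.
class Hair
  def self.count_for(dog)
    $count += 1
    puts "called #{$count} times"
    2_000_000
  end
end

class Dog
  HAIRS_THRESHOLD = 3_000_000

  def unkempt?
    if defined?(@unkempt)
      # Already computed, even if the stored value is false.
      @unkempt
    else
      @unkempt = Hair.count_for(self) > HAIRS_THRESHOLD
    end
  end
end

dog = Dog.new
puts "Is the dog unkempt? #{dog.unkempt?}"
puts "Is the dog unkempt? #{dog.unkempt?}"
```

This time “called 1 times” shows up only once, before the first answer, and both answers are false.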

I wish there were a more succinct way to write this, because I think it’s generally how you actually want your code to behave when you use ||=.

To be fair, you can avoid defined? and instead write this method like this:

def unkempt?
  @hair_count ||= Hair.count_for(self)
  @hair_count > HAIRS_THRESHOLD
end

It’s really just a matter of taste if you prefer one over the other.
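One thing that makes the second version work: in Ruby, every integer is truthy, even 0, so ||= is safe for memoizing a count. Here’s a rough sketch with a hypothetical Hair stub that reports a completely bald dog:

```ruby
$count = 0

# Hypothetical stub: this dog has no hairs at all.
class Hair
  def self.count_for(_dog)
    $count += 1
    0
  end
end

class Dog
  HAIRS_THRESHOLD = 3_000_000

  def unkempt?
    # 0 is truthy in Ruby, so ||= keeps it and skips the second lookup.
    @hair_count ||= Hair.count_for(self)
    @hair_count > HAIRS_THRESHOLD
  end
end

dog = Dog.new
2.times { dog.unkempt? }
puts "count_for ran #{$count} time(s)"
```

It prints count_for ran 1 time(s). If count_for could ever return nil or false, this version would have the same problem as before, and defined? would be the safer choice.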

Alright, take care.

Using git to track git

21 Aug 2016

I made a screencast to share a fun idea I had while exploring a bit how git works.

You may know that when you use git to track a project, it creates a hidden .git directory with some files in it. But what actually goes on in there? And when do the contents of those files change?

Here’s the idea: I know a tool for tracking the changes to a directory over time, and that tool is git itself!

So in this screencast you can see me try and do that – I initialized a git repository, which created a .git folder, and then I initialized another git repository within that .git directory.

I still don’t have a really great understanding of how git represents the data. Mary Rose Cook’s very good essay on this topic, Git From The Inside Out, does contain those answers, but I read it a while ago and forgot the details.

But I feel like I learned a few things thru this little experiment, specifically about when the files in .git change.