Saturday, August 5, 2017

Drive Refactors with a Git Pre-Push Hook

When I'm writing code, if I see something I want to fix, I write a quick FIXME comment. When I'm working on my own projects, this is no big deal. But I don't like adding FIXMEs to code at work, or in open source repos. And really, I prefer not to keep them hanging around in my own repos, either. So I set up a git pre-push hook. It looks like this:

 ᕕ(ᐛ)ᕗ cat .git/hooks/pre-push
#!/bin/bash

if ag fixme . -i --ignore tmp/ --ignore log/ --ignore node_modules/ --ignore vendor/; then
  echo
  echo 'not pushing! clean up your fixmes'
  echo
  exit 1
else
  exit 0
fi

This code uses ag to make sure that there are no FIXMEs in my code, skipping dependency directories since they're not my problem, and either returns a Unix failure code, interrupting the push, if FIXMEs are present, or a success code, allowing the push to proceed, if they aren't.

In other words, if I tell git to push code which contains a FIXME, git effectively says no.

This worked well for a while. But soon, at work, somebody else added a FIXME to a repo. So I commented out my hook for a few days or weeks, and then, when I had a spare second, I fixed their FIXME. But soon enough, somebody else added a FIXME, and it was harder to fix. So I commented out the hook, and forgot all about it.

Eventually, though, I had to do some pretty complicated work, and I generated a lot of FIXMEs in the process. As an emergency hack, I uncommented the hook, did my main work, tried to push my branch, and got automatically chastised for trying to push a FIXME. So I went in, fixed them, and pushed the branch.

This is my routine now. If I'm dealing with a big enough chunk of work that I can't take a break to fix a FIXME, I just uncomment the hook for the period of time that I'm working on that topic branch. It's a really good way to stay on task while also putting together a to-do list for refactoring. You basically can't finish the branch until you go through the FIXME to-do list, but you're 100% free to ignore that to-do list, and focus on the primary task, until it's time to push the branch. So you can easily build refactoring into every topic branch without getting distracted (which is the primary risk of refactoring as you go).

It's probably pretty easy to modify this hook so that it only looks at unstaged changes, so I might do that later on, but it works pretty well as a habit already.

Sunday, June 4, 2017

Reddit Users Are A Subset Of Reddit Users

An interesting blog post on view counting at Reddit reminded me of an old post I wrote almost ten years ago. Reddit engineer Krishnan Chandra wrote:

Reddit has many visitors that consume content without voting or commenting. We wanted to build a system that could capture this activity by counting the number of views a post received.

He goes on to describe the scaling and accuracy problems in this apparently simple goal. But he never addresses a few really weird assumptions the Reddit team appears to have made: that every Reddit user is logged in, or even has an account in the first place; that every Reddit user has only one account; and that every account is only ever used by one person.

As I wrote back in the ancient days of 2008:

Assuming a one-to-one mapping between users and people utterly disregards practical experience and common sense. One human could represent themselves to your system with several logins. Several humans could represent themselves as one login. Both these things could happen. Both these things probably will happen.

In other words, "users" aren't real. Accounts and logins are real. The people who use those accounts and logins are also real. But the concept of "users" is a social fiction, and that includes the one-to-one mapping "users" implies between accounts and people. In some corners of the vast Reddit universe, creating more than one "user" per account is as common as it is in World of Warcraft.

As I say, this is all from a blog post back in 2008. More recently, in 2016, I decided that the best way to use Reddit (and Hacker News as well) is without logging in. In other words, the best use case is not to be a "user":

reading Reddit without logging in is much, much more pleasant than being a "user" of the site in the official sense. The same is true of Hacker News; I don't get a lot out of logging in and "participating" in the way that is officially expected and encouraged. Like most people who use Hacker News, I prefer to glance at the list of stories without logging in, briefly view one or two links, and then snark about it on Twitter...

neither of these sites seems to acknowledge that using the sites without being a "user" is a use case which exists. In the terms of both sites, a "user" seems to be somebody who has a username and has logged in. But in my opinion, both of these sites are more useful if you're not a "user." So the use case where you're not a "user" is the optimal use case.


On the Reddit blog, discussing all the technical aspects of view counting, Chandra said the team's goal was to make it easy for people to understand how much activity their posts were driving, and to include more passive activity like viewing without commenting. But there's this obvious logical disconnect here, where Reddit excludes from this analysis any user who isn't logged in. You could even make the case that they'd be better served just pulling the info from Google Analytics, although that'd be naive.

First, given the scale involved, there's likely to be enough accuracy issues that a custom solution would be a good idea in either case. Second, Reddit's view-counting effort made no attempt to identify when two or more accounts belonged to the same human being, or when two or more human beings shared the same login. An off-the-shelf analytics solution could probably provide insight into that, but not definitive answers.

My favorite explanation for all this is that it's just a simple institutional blind spot: people who work at Reddit are probably so used to thinking that people who don't log in "don't count" that it seemed perfectly reasonable to literally not count them. The concept of "users" is such a useful social fiction that the company's employees probably just genuinely forgot it was a social fiction at all (or never realized it in the first place). However, if your goal is to provide view counting with accuracy, skimming over these questions makes your goal impossible to achieve.