Finding Great (and Profitable) Ideas in the Computer Science Literature

I spend quite a bit of time trawling through recent computer science papers, looking for anything algorithmic that might improve my team’s product and Help People Get Jobs. It’s been a mixed bag so far, often turning up a bunch of pretty math that won’t scale at Indeed. But looking through the computer science literature can pay off big, and more of us should use the research to up our game as software developers.

Word cloud generated by entering the term 'inverted index'

Word cloud generated by WordItOut

Why read a computer science paper

The first question you might ask is why? Most working developers, after all, simply never read any computer science papers. Many smart developers look at me blankly when I even suggest that they do a literature search. “You mean look on StackOverflow?”

The short answer: to get an edge on your problem (and occasionally on your competition or your peers).

Some academic is looking into some deep generalization of whatever problem you are facing. They are hungry (sometimes literally, on academic salaries) to solve problems, and they give away the solutions. They are publishing papers at a ferocious pace, because otherwise their tenure committees will invite them to explore exciting opportunities elsewhere. Academics think up good, implementable approaches and give them away for free. And hardly anyone notices or cares, which is madness. But a smart developer can sometimes leverage this madness for big payouts. The key is knowing how to find and read what academics write.

Finding computer science papers

Thousands of computer science papers are published each year. How do you find a computer science paper worth reading? As with so many questions in this new century, the answer is Google, specifically Google Scholar.

As near as I can tell, Google Scholar includes almost all the academic papers ever written, for free. Almost every computer science paper since Alan Turing is accessible there. With Scholar, Google is providing one of the most amazing resources anyone has ever given away. Some links point to papers behind paywalls, but almost all those have extra links to copies that aren’t. I’ve read hundreds of papers and never paid for one.

Google doesn’t even attempt to monetize it. Nobody in the general public has heard about scholar.google.com. More surprisingly: according to my Google contacts, not many Googlers have heard about it either.

With Google Scholar, you’ve solved the problem of finding interesting papers.

Filtering computer science papers

Next, the problem is filtering and prioritizing the interesting papers you find.

Google Scholar search algorithms are powerful, but they aren’t magic. Even your best search skills will net you too many papers to read and understand. The chance that you are reading the one that will most help your work is small.

Here’s my basic strategy for quickly finding the best ones.

First, figure out the paper’s publication date. This seems like an obvious bit of metadata, but you’ll rarely find the date on the paper itself. Instead, look for clues in Google Scholar. You can also assume that it’s two years after the latest paper listed in the citations. This seems sloppy, but it’s effective. Computer science papers older than fifteen years are unlikely to contain anything of value beyond historical interest.

Next, read the first paragraph of the paper. This paragraph covers the problem the researchers are trying to solve, and why it’s important. If that problem sounds like yours, score! Otherwise, unless the authors have hooked you on the intrinsic interest of their results, dump it and move on to the next paper.

If things still seem promising, read the second paragraph. This paragraph covers what the authors did, describes some constraints, and lets you know the results (in broad strokes). If you can replicate what they did in your environment, accept the constraints, and the results are positive, awesome. You’ve determined the paper is worth reading!

How to read a computer science paper

The biggest trick to reading an academic paper is to know what to read and what not to read. Academic papers follow a structure only slightly more flexible than that of a sonnet. Some portions that look like they would help you understand will likely only confuse. Others that look pointless or opaque can hold the secrets to interpreting the paper’s deeper meanings.

Here’s how I like to do it.

Don’t read the abstract. The abstract conveys the gist of the paper to other researchers in the field. These are folks who’ve spent the last decade thinking about similar problems. You’re not there yet. The abstract will likely confuse you and possibly frighten you, but won’t help you understand the topic.

Don’t read the keywords. Adding keywords to papers was a bad idea that nonetheless seems to have stuck. Keywords tend to mislead and won’t add anything you wouldn’t get otherwise. Skip ’em, they’re not worth their feed.

Read the body of the paper closely. Do you remember the research techniques your teachers tried to drum into you in eighth grade? You’ll need them all. You’re trying to reverse engineer just what the researchers did and how they did it. This can be tricky. Papers tend to leave out many shared assumptions behind the research, as well as many details and small missteps. Read every word. Look up phrases or words you don’t know — Wikipedia is usually fine for this. Write down questions. Try to figure out not just what the researchers did, but what they didn’t do, and why.

Don’t read the code. This is counterintuitive, because the clearest way software developers communicate is through code — ideally with documentation, revision history, cross-references, test cases, and review comments.

It doesn’t work that way with academics. To a first approximation, code in academic papers is worthless. The skills necessary to code well are either orthogonal to or actively opposed to the skills necessary for interesting academic research. It’s a minor scandal that most code used in academics is unreviewed, not version-controlled, lacks any test cases, and is debugged only to the point of “it didn’t crash, mostly, today.” That’s the good stuff. The bad stuff is simply unavailable, and quite probably long-deleted by the time the paper got published. Yes, that’s atrocious. Yes, even in computer science.

Read the equations. Academics get mathematics, so their equations have all the virtues that software developers associate with the best software: precision, correctness, conciseness, evocativeness. Teams of smart people trying to find flaws offer painstaking reviews of the equations. In contrast, a bored grad student writes the code, which nobody reads.

Don’t read the conclusions section. It adds nothing.

Leveraging a computer science paper for further search

Academic papers offers a bounty of contextual data in references to other papers. Google Scholar excels at finding papers, but there’s no substitute for actually following the papers that researchers used to inform their work.

Follow the citations in the related work. Authors put evocative descriptions of the work that matters to them in “Related Work.” This provides an interesting contrast for interpreting their work. In some ways, this section memorializes the most important social aspects of academic work.

Follow the citations in the references. Long before HTML popularized hypertext, academic papers formed a dense thicket of cross-references, reified as citations. For even the best papers, half of the value is the contents, half is the links. Citations in papers aren’t clickable (yet), but following them is not hard with Google Scholar.

Repeated citations of older papers? There’s a good chance those are important in the field and useful for context. Repeated citations of new papers? Those papers give insight into the trajectory of the subject. Odd sounding papers with unclear connections to the subject? They are great for getting the sort of mental distance that can be useful in hypothesis generation.

Once you’ve done all that…

It’s just a simple matter of coding. Get to it!

Dave Griffith has been building software systems for over 20 years.