Time-Tested: 7 Ways to Improve Velocity When A/B Testing a New UX

A/B testing holistic redesigns can be tough. Here at Indeed, we learned this firsthand when we tried to take our UX from this to this:

[Image: job search UX, v1 and v2]

The Indeed mobile Search Engine Results Page (SERP) circa mid-2017 (left) and circa mid-2018 (right)

Things didn’t go so hot. We’d spent months and months coming up with a beautiful design vision, and then months and months trying to test all of those changes at once. A bunch of metrics moved (mostly down), and it was super confusing because we couldn’t figure out which UI changes caused which effects.

So, we took a new approach. In the middle of 2018, we founded the Job Search UI Lab, a cross-functional team with one goal: to scientifically test as many individual UI elements as we could to understand the levers on our job search experience. In just the last 12 months, our team ran over 52 tests with over 502 groups. We’ve since used our learnings to successfully overhaul the Job Search UX on both desktop and mobile browsers.

In this blog post, we share some of the A/B test-accelerating approaches we incorporated in the JSUI Lab — approaches that garnered us the 2018 Indeed Engineering Innovation Award. Whether you’re interested in doing a UX overhaul or just trying out a new feature, we think you can incorporate each of these tips into your A/B testing, too!

#1: Have a healthy backlog

No one should ever be waiting around to start development on a new test. Having a healthy backlog of prioritized A/B tests for developers helps you roll out A/B tests one after the other.

One way that the JSUI Lab creates a backlog is by gathering all of our teammates — regardless of their role — at the beginning of each quarter to brainstorm tests. We pull up the current UX on mobile and desktop and ask questions about how each element or feature works. Each question or idea ends up on its own sticky note. We end up with over 40 test ideas, which we then prioritize based on how each test might address a job seeker pain point or improve the design system while minimizing effort. And while we may not get to every test in our backlog, we never have to worry about not having tests lined up.

#2: Write down hypotheses ahead of time

In the hustle and bustle of product development, sometimes experimenters don’t take the time to specify their hypotheses for a given A/B test ahead of time. Sure, not writing hypotheses may save you 10–30 minutes up front. But this can come back to bite you once the test is completed, when your team is looking at dozens of metrics and trying to make a decision about what to do next.

Not only is it confusing to see some metrics go up while others go down; the more metrics you look at, the more likely you are to see false positives (also known as Type I errors). You may even catch yourself looking at metrics that couldn’t feasibly be affected by your test (e.g., “How does changing this UI element from orange to yellow affect whether or not a job seeker gets a call back from an employer?!”).

So do yourself a solid. Pick 3–4 metrics that your test could reasonably be expected to move, conduct a power analysis for each one, and write down your hypotheses for the test ahead of time.
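In Python, a quick power analysis for a conversion metric might look like the following sketch, which uses statsmodels to estimate the sample size needed per group. The baseline and target rates here are made-up numbers for illustration.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical rates: 10% baseline conversion, hoping to detect a lift to 10.5%.
effect_size = proportion_effectsize(0.105, 0.100)

# Users needed per group for alpha = 0.05 and 80% power.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    alternative="two-sided",
)
print(f"Need roughly {n_per_group:,.0f} users per group")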

#3: Test UI elements one at a time

This one’s a little counterintuitive. Testing each and every UI element separately might seem like it would increase the time it takes to pull off a holistic UX redesign. But by testing elements one at a time, we drew sounder conclusions from our tests. Why? Because we could more clearly establish causality.

Consequently, we were able to take all of the learnings from our tests and roll them into one big test that we were fairly confident would perform well. Rather than see metrics tank like the first time we did a holistic design test, we actually saw some of Indeed’s biggest user engagement wins for 2018, in less than half the time of the first attempt.

By running tests on UI elements one at a time, we were able to iterate on our design vision in a data-driven way and set up our holistic test for success.

[Image: Indeed mobile SERP, 2019]

Indeed’s mobile SERP circa mid-2019

So, what do these tests look like in practice? Below are a few examples of some of the groups we ran. You’ll notice that the only real difference between the treatments is a minor change, like font size or spacing.

[Image: examples of test groups with minor single-element changes]

#4: Consider multivariate tests

Multivariate tests (sometimes referred to as “factorial tests”) test all possible combinations of each of the factors of interest in your A/B test. So, in a way, they’re more like an A/B/C/D/E/… test! What’s cool about multivariate tests is that you end up with winning combinations that you would have missed had you tested each factor one at a time.

An example from the JSUI Lab illustrates this benefit. We knew from UX research that our job seekers really cared about salary when making the decision to learn more about a job. In 2018, this was how we displayed salary on each result:

[Image: an account coordinator job listing showing the 2018 salary display]

We wanted to see if increasing the salary’s visual prominence using color, font size, and bolding would increase job seeker engagement with search results. So, we developed four font size variants, four color variants, and two weight variants (bolded and unbolded). We ended up with 4 × 4 × 2 = 32 total groups, including control.

[Image: the salary display variants on the account coordinator listing]
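Enumerating the full set of combinations is mechanical. Here’s a minimal sketch, with made-up variant values standing in for the ones we actually tested:

import itertools

# Hypothetical levels for each factor.
font_sizes = ["12pt", "14pt", "16pt", "18pt"]
colors = ["gray", "black", "blue", "green"]
weights = ["normal", "bold"]

groups = list(itertools.product(font_sizes, colors, weights))
print(len(groups))  # 32 combinations, one of which serves as the control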

While multivariate tests can speed up how you draw conclusions about different UI elements, they’re not without their drawbacks. First and foremost, you’ll need to weigh the tradeoffs to statistical power, or the likelihood that you’ll detect a given effect if one actually exists. Without sufficient statistical power, you risk missing a real effect of your test (also known as Type II error).

Power calculations use a closed-form equation that requires your product team to make tradeoffs between your chosen α-level and β-level, your sample size (n), and the effect size you care about your treatment having (the difference between the baseline proportion p1 and the treatment proportion p2). At Indeed, we have the benefit of more than 220M unique users each month. That level of traffic may not be available to you and your team. So, to have sufficient statistical power, you’ll potentially need to run your experiment for longer, run groups at higher allocations, cut some groups, or be willing to accept more Type I error, depending on how small an effect you’d like to confidently detect.

[Image: the closed-form power calculation for a test between two proportions]
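For reference, the standard closed-form sample size per group for detecting a difference between two proportions p1 and p2 is:

n = \frac{\left(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}\right)^2}{(p_1-p_2)^2}, \qquad \bar{p} = \frac{p_1+p_2}{2}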

A typical A/B test is usually straightforward to analyze with a t-test. Multivariate tests, however, benefit from regression models with interaction terms, which allow you to suss out the effects of particular variables and their interaction effects. Here’s a simplified regression equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon

And an example of a regression equation for one of the tests we ran that modified both font size and the spacing on the job card:

y = \beta_0 + \beta_1(\text{font size}) + \beta_2(\text{spacing}) + \beta_3(\text{font size} \times \text{spacing}) + \varepsilon
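Fitting such a model takes only a few lines. Here’s a minimal sketch using statsmodels, with synthetic data and hypothetical column names (clicked, font_size, spacing) standing in for real experiment logs:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for real experiment logs.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "font_size": rng.choice(["12pt", "14pt"], size=1000),
    "spacing": rng.choice(["tight", "loose"], size=1000),
    "clicked": rng.integers(0, 2, size=1000),
})

# Logistic regression with main effects plus the interaction term.
# C(...) treats each factor as categorical rather than numeric.
model = smf.logit("clicked ~ C(font_size) * C(spacing)", data=df).fit()
print(model.summary())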

Another caveat of multivariate tests is that they can quickly become infeasible. If we had 10 factors with 2 levels each, we’d have a 2^10 multivariate test, with a whopping 1,024 test groups. In cases like these, running what’s called a fractional factorial experiment might make more sense.

Finally, multivariate tests may sometimes yield zany combinations. In our salary example above, our UX Design team was mildly mortified when we introduced the salary variant with 16pt, green, bolded font. We lovingly referred to this variant as “the Hulk.” In some cases, it may not be feasible to run a variant at all due to accessibility concerns. In the JSUI Lab, we decide on a case-by-case basis whether the added statistical rigor is worth a temporarily poor user experience.

[Image: a software engineer listing showing “the Hulk” salary variant]

#5: Deploy CSS and JavaScript changes differently

Sometimes a typical deploy cycle can get in the way of testing new features quickly. At Indeed, we developed a tool called CrashTest that allows us to sidestep the deploy cycle. CrashTest relies on a separate code base of CSS and JavaScript files that are injected into “hooks” in our main code base. While installing CrashTest hooks requires a standard deploy, once the hooks are set up, we can inject new CSS and JavaScript treatments and see the changes reflected in our product in just a few minutes.

In the JSUI Lab, we rely on our design technologist Christina to quickly develop CSS and JavaScript treatments for dozens of groups at a time. With CrashTest, Christina can develop her features and get them QAed by Cory. We can push them into production that same day using our open source experimentation management platform Proctor. Had we relied on the typical deploy cycle, it would have taken Christina’s work several more days to be seen by job seekers, and that much more time until we had results from our A/B tests.

#6: Have a democratized experimentation platform

Combing through logs and tables to figure out how your tests performed is not the best use of time. Instead, consider building or buying an experimentation platform for your team. As a data-driven company, Indeed has an internal tool for this called TestStats. The tool displays how each test group performed on key metrics and whether the test has enough statistical power to draw meaningful conclusions at the predetermined effect size. This makes it easy to share and discuss results with others.

#7: Level up everyone’s skills through cross-training

On the JSUI team, we firmly believe that allowing everyone to contribute to team decisions equally helps our team function better. Our teammates include product managers, UX designers, QA engineers, data scientists, program managers, and design technologists. Each of us brings a unique background to the team. Teaching each other the skills we use in our day-to-day jobs helps increase velocity for our A/B tests because we’re able to talk one another’s language more readily.

For instance, I’m a product scientist, and led a training on A/B testing. This allowed all of the other members of JSUI Lab to feel more empowered to make test design decisions without my direct guidance every time. Our UX designer Katie shadowed our product managers CJ and Kevin as they turned on tests. Katie now turns on tests herself. Not only does this kind of cross-training reduce the “bus factor” on your team, it can also be a great way of helping your teammates master their subject and improve their confidence in their own expertise.

Now it’s time to test!

Whether you adopt just one or two of these tips or all seven, they can be a great way to improve your velocity when running A/B tests. The Job Search UI Lab has already started sharing these simple steps with other teams at Indeed. We think they’re broadly applicable to other companies and hope you’ll give them a try, too.

And if you’re passionate about A/B testing methods, Indeed’s hiring!


Cross-posted on Medium.


The Evolving Language of Data Science 

…or Grokking the Bokeh of Scarse Meaning Increasement

“You keep using that word. I do not think it means what you think it means.” — Dr. Inigo Montoya


I’m a technical writer at Indeed. One of the many great things about my job is that I get to work with smart people every day. A fair amount of that work involves translating between them. They will all be speaking English, but still might not understand each other. This is a natural consequence of how knowledge advances in general, and how English develops in particular. 

As disciplines evolve, alternate meanings and new words develop to match. That can extend to creating new phrases to name the disciplines themselves (for example, what is a data scientist?). English’s adoption of such new words and meanings has always been pragmatic. Other Western languages have more formal approval processes, such as French’s Académie française and German’s reliance on a single prestigious dictionary. The closest to formal authorities for correct English are popular dictionaries such as the Oxford English Dictionary, the American Heritage Dictionary, and Merriam-Webster. None of them reign supreme.

This informal adoption of new words and meanings can lead to entire conversations in which people don’t realize they’re discussing different things. For example, consider another recently adopted word: “bokeh.” This started as a term in the dialect of professional photography for the aesthetically pleasing blurred look that a shallow depth of field can give a picture. “Bokeh” is also the name of a specific Python data visualization package. So “bokeh” may already be headed for a new meaning within the realm of data science.

As a further example of the fluid nature of English, “bokeh” comes from the Japanese word boke (暈け or ボケ). In its original form it meant “intentional blurring,” as well as sometimes “mental haze,” i.e., confusion.

 

[Image: bokeh of flowers: a row of flowers that becomes blurry in the distance (photo by Sergei Akulich on Unsplash)]

[Image: data science bokeh: a montage of images relating to the Bokeh data visualization package (https://bokeh.pydata.org/)]

The clouded meaning of “data”

A data scientist told me that when she hears “the data” she tends to think of a large amount of information, a set large enough to be comprehensive. She was surprised to see another team’s presentation of  “the data” turn out to be a small table inside a spreadsheet that listed a few numbers. 

This term can also cause confusion between technical fields. Data scientists often interpret “data” as quantitative, while UX researchers interpret “data” as qualitative.

Exploring evolving language with Ngram Viewer

A product science colleague introduced me to the Google Books Ngram Viewer. It’s a search engine that shows how often a word or phrase occurs in the mass of print books Google has scanned. Google’s collection contains most books published in English from AD 1500 to 2008.

I entered some new words that I had come across, and screened out occurrences that weren’t relevant, such as place or person names and abbreviations. I also set the search to start from 1800. Medieval data science could be interesting, but I expect it to be “scarse.” (That’s not a typo.)

Features

When I first came across this newer meaning of “features,” I wasn’t even aware that it had changed. From previous work with software development and UX, I took “features” to mean “aspects of a product that a user will hopefully find useful.” But in data science, a “feature” is one of the covariates in a model: in less technical English, a measurable property or characteristic of the phenomenon being observed.

This dual meaning led me to a fair amount of head-scratching when I was documenting an internal data science application. The application had software features for defining and manipulating data features. 

The following graph indicates this emerging meaning for “feature” by tracking the emergence of a related phrase, “model feature.” 

[Image: Ngram graph for the phrase “model feature”]

Diving into Ngram’s specific citations, the earliest mention I can find that’s near this meaning is in 1954. Interestingly, it’s from a book on management science:

[Image: Google Books snippet from Management Science, 1954]

The next use that seems exact turns up in 1969, in the Digest Record from the Association for Computing Machinery, the Society for Industrial and Applied Mathematics, and the Institute of Electrical and Electronics Engineers. Leaving aside the intervening comma, the example is so dead-on that I wonder if we’re looking at nearly the exact moment this new meaning was fully born:

[Image: Google Books snippet from The Digest Record, 1969]

To grok

“Grok” is an example of English going so far as to steal words from languages that don’t even exist. Robert A. Heinlein coined the word in his 1961 science fiction classic Stranger in a Strange Land. In the novel, the Martian phrase “grok” literally means “drink” and metaphorically means “understanding something so completely that you and it are one.” 

[Image: Ngram graph for the word “grok”]

Like many other aspects of science fiction and fantasy, computer programming culture absorbed the term. The Jargon File from 1983 shares an early defined example:

GROK (grahk) verb.
  To understand, usually in a global sense. Especially, to understand
all the implications and consequences of making a change. Example:
“JONL is the only one who groks the MACLISP compiler.”

Since then, computer jargon has absorbed “grok” and applied it in many different ways. One immediate example is OpenGrok, a source code search and cross-reference engine. It’s intended to let users “grok (profoundly understand) source code” and is developed in the open.

Salt

Salt is an example of a common word that has gone through two steps of technical change. First it gained a meaning relating to information security, and then an additional one in data science. 

As a verb and noun, “salt” originally meant what it sounds like: adding the substance chemically known as NaCl to food for flavoring and preservation. It gained what is perhaps its better-known technical meaning in information security, where adding “salt” to password hashing makes encrypted passwords more difficult to crack. In the word’s further and more recent permutations in data science, “salt” and “resalt” mean to partly randomize the results of an experiment by shuffling them. The following ngram graph tracks how these associations of “salt” have shifted over time (a code sketch of both technical senses follows the graph).

This was hard to parse out, and required diving deeply into Ngram’s options. I ended up graphing the different times “salt” modifies the words “food,” “password,” or “data.” Google stopped scanning in new books in 2008 – you can see the barest beginning of this new usage in 2007.

[Image: Ngram graph for the word “salt” modifying “food,” “password,” and “data”]
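To make the two technical senses concrete, here’s a minimal Python sketch. The specific salts, passwords, and user IDs are invented for illustration:

import hashlib
import os

# Security sense: a random salt makes identical passwords hash differently.
salt = os.urandom(16)
digest = hashlib.pbkdf2_hmac("sha256", b"hunter2", salt, 100_000)

# Experimentation sense: changing the salt reshuffles which bucket a user lands in.
def bucket(user_id: str, exp_salt: str, n_buckets: int = 2) -> int:
    h = hashlib.sha256(f"{exp_salt}:{user_id}".encode()).hexdigest()
    return int(h, 16) % n_buckets

print(bucket("user123", "experiment-v1"))  # stable until the experiment is resalted
print(bucket("user123", "experiment-v2"))  # may differ after a resalt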

Pickling

Traditionally, “pickling” refers to another way to treat food, this one almost entirely for preservation. In Python, it refers to the object serialization method provided by the pickle module. Data scientists have found increasing use for this term, in ways too recent to show up on Ngram.
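For readers who haven’t run into it, pickling a Python object is a two-liner; the dictionary here is just a stand-in for a trained model or dataset:

import pickle

# Serialize (“pickle”) an object to bytes, then restore it.
blob = pickle.dumps({"model": "example", "accuracy": 0.99})
restored = pickle.loads(blob)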

The bleeding edge of language?

Here are some words that may just be in the sprouting stage of wider usage.

Scarse

This came from an accidental jumble of words in a meeting, and has remained in use since. It describes situations where data is both scarce (there’s not a lot of it) and sparse (even when there is some, it’s pretty thin). 

This meaning for “scarse” doesn’t appear in the Ngram graph. So it appears we’re seeing mutation and evolution in word form in the wild. Will it take root and prosper, continuing to evolve? Only time will tell.

Increasement

“We should look for the source of that error message increasement.”

I’ve observed this word once in the wild: from me. “Increasement” came to me in a meeting, as a word for the amount of an increase over time. I had never used the word before. It just seemed like a word that could exist. It had a meaning similar to other words, and it fit those other words’ rules of construction.

In the context I used it, its meaning isn’t exactly the same as “increment.” An increment refers to a specific numeric increase. One wouldn’t refer, for example, to an increasing number of users as an increment. You might, however, refer to it as an increasement.

Searching for “increasement” revealed that this word previously existed but fell out of common usage, as the following graph shows.

[Image: Ngram graph for the word “increasement”]

Previous examples:

The Fathers of the English Church

Paul was, that he should return again to these Philippians, and abide, and continue amongst them, and that to their profit; both to the increasement of their faith


The Harleian miscellany; or, A collection of … pamphlets and tracts … in the late earl of Oxford’s library

…when she saw the man grown settled and staid, gave him an assistance, and advanced him to the treasurership, where he made amends to his house, for his mis-spent time, both in the increasement of his estate and honour…

Perhaps it’s time for “increasement” to be rebooted into common use?

Bottom line

Language is likely to continue evolving as long as we use language. Words in general, and English words in particular, and words in English technical dialects above all, are in a constant state of flux. Just like the many fields of knowledge they discuss.

So if you’re in a technical discussion and others’ responses aren’t quite what you expect, consider re-examining the technical phrases you’re using. 

The people you’re talking with might grok those words quite differently.

 

Cross-posted on Medium.


Recognize Class Imbalance with Baselines and Better Metrics

In my first machine learning course as an undergrad, I built a recommender system. Using a dataset from a social music website, I created a model to predict whether a given user would like a given artist. I was thrilled when initial experiments showed that for 99% of the points in my dataset, I gave the correct rating – I was wrong only 1% of the time!

When I proudly shared the results with my professor, he revealed that I wasn’t, in fact, a machine learning prodigy. I’d made a mistake called the base rate fallacy. The dataset I used exhibited a high degree of class imbalance. In other words, for 99% of the user–artist pairs, the user did not like the artist. This makes sense: there are many, many musicians in the world, and it’s unlikely that one person has even heard of half of them (let alone actually enjoys them).

[Image: illustration of class imbalance]

When we’re unprepared for it, class imbalance introduces problems by producing misleading metrics. The undergrad version of me ran face-first into this problem: accuracy alone tells us almost nothing. A trivial model that predicts that no users like any artists can achieve 99% accuracy, but it’s completely worthless. Using accuracy as a metric assumes that all errors are equally costly; this is frequently not the case.

Consider a medical example. If we incorrectly classify a tumor as malignant and request further screening, the cost of that error is worry for the patient and time for the hospital workers. By contrast, if we incorrectly state that a tumor is benign when it is in fact malignant, the patient may die.

Examine the distribution of classes

Moving beyond accuracy, there are a number of metrics to think about in an imbalanced problem. Knowing the distribution of classes is the first line of defense. As a rule of thumb, Prati, Batista, and Silva find that class imbalance doesn’t significantly harm performance in cases where the minority class makes up 10% or more of the dataset. If you find that your dataset is imbalanced more than this, pay special attention.

I recommend starting with an incredibly simple model: pick the most frequent class. scikit-learn implements this in the DummyClassifier. Had I done this with my music recommendation project, I would quickly have noticed that my fancy model wasn’t really learning anything.
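Here’s a minimal sketch of that baseline, using synthetic imbalanced data in place of my music dataset:

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 99% negative, 1% positive.
X, y = make_classification(n_samples=10_000, weights=[0.99], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Always predicts the most frequent class; any real model has to beat this.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
print(baseline.score(X_test, y_test))  # ~0.99 accuracy while learning nothing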

Evaluate the cost

In an ideal world, we could calculate the exact costs of a false negative and a false positive. When evaluating our models, we could multiply those costs by the false negative and false positive rates to come up with a number that describes the cost of our model. Unfortunately, these costs are often unknown in the real world, and improving the false positive rate usually harms the true positive rate.
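When you do have even rough cost estimates, scoring a model against them is short. Continuing the baseline sketch above, with invented per-error costs:

from sklearn.metrics import confusion_matrix

# Hypothetical costs: a false negative is 50x worse than a false positive.
COST_FP, COST_FN = 1.0, 50.0

tn, fp, fn, tp = confusion_matrix(y_test, baseline.predict(X_test)).ravel()
print(f"Total cost: {COST_FP * fp + COST_FN * fn:,.0f}")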

To visualize this tradeoff, we can use an ROC curve. Most classifiers can output probability of membership in a certain class. If we choose a threshold (50%, for example), we can declare that all points with probability over the threshold are members of the positive class. Varying the threshold from a low percentage to a high percentage produces different ways of classifying points that have different true positive and false positive rates. Plotting the false positive rate on the x-axis and the true positive rate on the y-axis, we get an ROC curve.
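scikit-learn’s roc_curve performs this threshold sweep for you. A sketch, reusing the synthetic split from above with a real (if simple) model:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]  # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)
print(f"AUC: {auc(fpr, tpr):.3f}")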

As an example, I trained a classifier on the yeast3 dataset from KEEL and created an ROC curve:

[Image: ROC curve for LogisticRegression on the yeast3 dataset]

While we could certainly write the code to draw an ROC curve, the yellowbrick library has this capability built in (and it’s compatible with scikit-learn models). These curves can suggest where to set the threshold for our model. Further, we can use the area under them to compare multiple models (though there are times when this isn’t a good metric).
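The yellowbrick version is a few lines; this sketch follows its standard visualizer pattern, again on the synthetic split from above:

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ROCAUC

viz = ROCAUC(LogisticRegression(max_iter=1000))
viz.fit(X_train, y_train)   # fit the underlying model
viz.score(X_test, y_test)   # compute and draw the ROC curve
viz.show()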

The next time you’re working on a machine learning problem, consider the distribution of the target variable. A huge first step towards solving class imbalance is recognizing the problem. By using better metrics and visualizations, we can start to talk about imbalanced problems much more clearly.

More on class imbalance

In my upcoming talk at ODSC West, I’ll dive deeper into the causes of class imbalance. I’ll also explore different ways to address this error. I hope to see you in October!


Cross-posted on Medium.
