Proctor: Indeed’s A/B Testing Framework

(Editor’s Note: This post is the first of a series about Proctor, Indeed’s open source A/B testing framework.)

A/B Testing at Indeed

Indeed’s mission is to help people get jobs. We are always asking ourselves the question “What’s best for the jobseeker?” We answer that question by testing and measuring everything. We strive to test every new feature and every improvement in every product at Indeed, and we measure the impact of those changes to ensure they are helping us achieve our mission.

In October 2013, Tom Bergman and Matt Schemmel presented an @IndeedEng talk on Proctor, Indeed’s A/B testing framework. In that talk, we announced that we had made Proctor available as open source. Since then, we have also open sourced Proctor Webapp, the web application that we use to manage Proctor test definitions.

In the October talk, Tom gave the example of a simple A/B test to determine if changing the background color of a button would improve user experience. Figure 1 shows control group A, in which we haven’t changed our Find Jobs button, and test group B, in which the button has a blue background.

Figure 1: Testing an existing black text on gray background find jobs button treatment (A) against a version with a white text and blue background (B)

As discussed in our logrepo talk and blog post, we log everything at Indeed so that we can analyze, learn, and improve our products. For this simple test, we logged the group (A or B) of each user visiting Indeed and the subsequent clicks. Then we used our analysis tools to determine that the test group led to more searches and greater overall user engagement.

The example above has one test behavior, but we typically try out multiple alternate behaviors in a given test. In this test, we would likely try several alternative background colors.

We can also test multiple ideas at the same time, as in the example in Figure 2, in which one test is for the button text and the other is for the background color. Testing multiple variables (like text and color) for a particular area of functionality is known as “multivariate testing.”

Figure 2: Running two tests on the same button simultaneously—where we test changes to text color, background, and text content (“find jobs” against “search”).

We’ve been doing A/B testing at Indeed for years, and many of the lessons we learned informed the development of Proctor. The October talk covers Proctor’s design decisions in more detail, along with ways to use it for more than just A/B testing. In this blog post, we focus on some of Proctor’s key features and concepts, and we explain the nuts and bolts of how we use Proctor at Indeed.

Proctor Features and Concepts

Standard representation

Proctor provides a standard JSON representation of test definitions and allows adjustments to those definitions to be deployed independently of code. We refer to the full set of test definitions as the test matrix. A test matrix can be distributed to multiple applications as a single file, allowing for greater agility when managing tests and for sharing of consistent test definitions across multiple applications. Figure 3 shows a very simple version of our button test, with 50% of users allocated to the control group A (bucket 0) and 50% to the test group B (bucket 1).

"buttontst": {
    "description": "backgroundcolortest",
    "salt": "buttontst",
    "buckets": [
        {
            "name": "control",
            "value": 0,
            "description": "current button treatment (A)"
        },
        {
            "name": "altcolor",
            "value": 1,
            "description": "test background color (B)"
        }
    ],
    "allocations": [
        {
            "ranges": [
                {
                    "length": 0.5,
                    "bucketValue": 0
                },
                {
                    "length": 0.5,
                    "bucketValue": 1
                }
            ]
        }
    ],
    "testType": "USER"
}


Figure 3: a simple Proctor test definition

To understand this example, here is a quick overview of some Proctor terminology:

• Every test has a testType. The most common type is USER, meaning that we use a user identifier to map to a test variation. More on test types later.
• Each test is made of an array of buckets and an array of allocations.
• A bucket is a variation, or group, within a Proctor test definition. Each bucket has a short name, an integer value, and a human-friendly description.
• An allocation specifies the size of the buckets as an array of ranges. Each range has a length between 0 and 1 and a reference to the bucketValue for the bucket. Ranges in an allocation must sum to 1. You can have more than one allocation if you use rules (more about that later).
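
The constraints on an allocation described above (each length between 0 and 1, lengths summing to 1) can be checked mechanically. The following is a minimal sketch in Java, not Proctor’s own validation code: `Range`, `AllocationCheck`, and the tolerance value are hypothetical names and choices made for illustration.

```java
import java.util.List;

/** Hypothetical stand-in for one range of an allocation; not a Proctor class. */
final class Range {
    final int bucketValue;
    final double length;

    Range(int bucketValue, double length) {
        this.bucketValue = bucketValue;
        this.length = length;
    }
}

final class AllocationCheck {
    // Assumed tolerance for floating-point rounding when summing lengths.
    private static final double TOLERANCE = 1e-6;

    /** Returns true when every length lies in [0, 1] and the lengths sum to 1. */
    static boolean isValid(List<Range> ranges) {
        double total = 0.0;
        for (Range range : ranges) {
            if (range.length < 0.0 || range.length > 1.0) {
                return false;
            }
            total += range.length;
        }
        return Math.abs(total - 1.0) < TOLERANCE;
    }
}
```

With the ranges from figure 3, `AllocationCheck.isValid(List.of(new Range(0, 0.5), new Range(1, 0.5)))` returns true, while a pair of ranges with lengths 0.5 and 0.4 would be rejected.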

Proctor Webapp

Using the Proctor Webapp, you can manage and deploy test definitions from a web browser. You can customize the application in a number of ways, allowing integration with:

• revision control systems for maintaining history of test changes,
• issue tracking systems for managing test modification workflow, and
• other external tools, such as build and deployment systems.

Figure 4: Screenshot of a test definition in the Proctor Webapp

Java code generation from JSON test specifications

Test specifications in Proctor are JSON files that are independent of the test definitions and allow applications to declare the tests and buckets of which they are aware. They can be used in the build process for Java code generation and at runtime to load the relevant subset of the test matrix.

Code generation is optional but provides compile-time type-safety, so you don’t have to litter your code with string literals containing test and bucket names. The generated classes also make it easier to work with tests in Java code and in template languages (figure 5 shows a JSP example). Furthermore, the generated Java objects enable serialization of test group membership into formats like JSON or XML.
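
To illustrate the type-safety point, here is a hand-written sketch of the kind of class a code generator could produce for the button test. It is not Proctor’s actual generated output: the class name `ButtonTestGroups`, the nested `Bucket` enum, and the constructor are assumptions. Only the accessor name is chosen to match the `groups.buttontstAltColor` property used in the JSP example.

```java
/** Hand-written illustration only; Proctor's real generated classes may differ. */
final class ButtonTestGroups {
    /** Bucket names and values copied from the test definition in figure 3. */
    enum Bucket {
        CONTROL(0), ALTCOLOR(1);

        final int value;

        Bucket(int value) {
            this.value = value;
        }
    }

    // The bucket value assigned to the current user for the "buttontst" test.
    private final int assignedValue;

    ButtonTestGroups(int assignedValue) {
        this.assignedValue = assignedValue;
    }

    // Bean-style accessors replace string comparisons against bucket names.
    public boolean isButtontstControl() {
        return assignedValue == Bucket.CONTROL.value;
    }

    public boolean isButtontstAltColor() {
        return assignedValue == Bucket.ALTCOLOR.value;
    }
}
```

A misspelled accessor such as `isButtontstAltColour()` fails at compile time, whereas a misspelled string literal would only misbehave at runtime.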

<c:if test="${groups.buttontstAltColor}">
.searchBtn { background-color: #2164f3; color: #ffffff; }
</c:if>

Figure 5: Conditional CSS based on test group membership in a JSP template

Rule-based contextual allocation

Using Proctor’s rule definition language, your system can apply tests and test allocations by evaluating rules against runtime context. For example, you can define your entire test to only be available for a certain segment of users, or you can adjust the allocation of test groups depending on the segment. Your test could be 50% A and 50% B for users in one country, and 25% each A/B/C/D for users in all other countries. Rule-based group assignment allows for great flexibility in how you roll out and evaluate your tests.

"allocations": [
    {
        "rule": "'US' == country && 'en' == userLanguage",
        "ranges": [
            {
                "length": 0.5,
                "bucketValue": 0
            },
            {
                "length": 0.5,
                "bucketValue": 1
            }
        ]
    },
    {
        "rule": null,
        "ranges": [
            {
                "length": 1.0,
                "bucketValue": -1
            }
        ]
    }
]

Figure 6: 50/50 test for US English, test inactive (bucket -1) for everyone else

Payloads

The ability to attach data payloads to test groups in test definitions allows you to simplify your code. In figures 7 and 8, we demonstrate how the color being tested for the button can be specified as a payload in the test definition and accessed in the template. Although in this example the total amount of template code is not reduced, if you had multiple test variations, each with its own color, the use of payloads would result in fewer lines of code.

"buckets": [
    {
        "name": "control",
        "value": 0,
        "description": "current button treatment (A)",
        "payload": {
            "stringValue": "#dddddd"
        }
    },
    {
        "name": "altcolor",
        "value": 1,
        "description": "test background color (B)",
        "payload": {
            "stringValue": "#2164f3"
        }
    }
]

Figure 7: Attaching a data payload containing a color value to the test group B

<style>
.searchBtn { background-color:${groups.buttontstPayload}; }
</style>


Figure 8: Using the data payload in CSS in a JSP template
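
Returning to the rule-based allocations of figure 6, one way to read that definition is as an ordered list in which the first allocation whose rule matches the runtime context is used, with the null rule acting as a catch-all. The sketch below models that reading in Java. It rests on assumptions: first-match ordering is inferred from the figure, Proctor’s rule expressions are replaced by plain Java predicates, and `RuledAllocation` and `AllocationSelector` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical stand-in for an allocation; a null rule means "always applies". */
final class RuledAllocation {
    final Predicate<Map<String, String>> rule;
    final String label;

    RuledAllocation(Predicate<Map<String, String>> rule, String label) {
        this.rule = rule;
        this.label = label;
    }
}

final class AllocationSelector {
    /** Returns the first allocation whose rule is null or accepts the context. */
    static RuledAllocation select(List<RuledAllocation> allocations,
                                  Map<String, String> context) {
        for (RuledAllocation allocation : allocations) {
            if (allocation.rule == null || allocation.rule.test(context)) {
                return allocation;
            }
        }
        throw new IllegalStateException("no allocation matched; add a null-rule fallback");
    }

    /** Mirrors figure 6: 50/50 for US English users, inactive for everyone else. */
    static List<RuledAllocation> figureSixAllocations() {
        return List.of(
            new RuledAllocation(
                ctx -> "US".equals(ctx.get("country")) && "en".equals(ctx.get("userLanguage")),
                "50/50 control/altcolor"),
            new RuledAllocation(null, "inactive (-1)"));
    }
}
```

Under these assumptions, a context of `country=US, userLanguage=en` selects the 50/50 allocation, and any other context falls through to the inactive one.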

Flexible test types

Proctor has a flexible concept of test types, allowing bucket determination to be based on user (typically defined by a tracking cookie value), account ID (which can be fixed across devices), email address, or completely random across requests. You can also extend Proctor with your own test types. Custom test types are useful, for example, when you want test group determination to be based on a context- or content-based attribute such as page URL or content category.

Unbiased, Independent Tests

To assign a bucket for a test, Proctor maps the input identifier (e.g., user ID) to an integer value using a uniformly distributed hash function. The ranges in the test’s allocation divide the integer space into spans, one per bucket, and the span that contains the hashed value determines the bucket. Figure 9 shows a simple example with a 50/50 control/test distribution. Since the hash function is uniform, the distribution of bucket assignments should be unbiased.

Figure 9: 50/50 control/test buckets mapped onto an integer range for use with hash function
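
The mapping in figure 9 can be sketched in a few lines of Java. This is an illustrative approximation rather than Proctor’s implementation: it scales a 32-bit hash to a fraction in [0, 1) and walks the cumulative range lengths, and the names `BucketRange` and `RangeMapper` are hypothetical.

```java
import java.util.List;

/** Hypothetical stand-in for one range of an allocation. */
final class BucketRange {
    final int bucketValue;
    final double length;

    BucketRange(int bucketValue, double length) {
        this.bucketValue = bucketValue;
        this.length = length;
    }
}

final class RangeMapper {
    /** Maps a 32-bit hash onto [0, 1) and returns the bucket whose span contains it. */
    static int bucketFor(int hash, List<BucketRange> ranges) {
        // Shift the signed int into [0, 2^32) and scale it to a fraction in [0, 1).
        double position = ((long) hash - Integer.MIN_VALUE) / 4294967296.0;
        double upperBound = 0.0;
        for (BucketRange range : ranges) {
            upperBound += range.length;
            if (position < upperBound) {
                return range.bucketValue;
            }
        }
        // Guard against rounding when the lengths sum to just under 1.
        return ranges.get(ranges.size() - 1).bucketValue;
    }
}
```

For a 50/50 allocation, hashes in the lower half of the integer space (`Integer.MIN_VALUE` through `-1`) land in the first bucket and the rest land in the second, so a uniform hash yields an even split.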

Furthermore, Proctor tests are independent, meaning that group membership in one test has no correlation with membership in another. This independence is accomplished by assigning a different salt to each test. The salt is used along with the identifiers as input to the hash function. Including the salt in the test definition allows for two advanced features:

1. You can intentionally align buckets in different tests (make them dependent) by sharing a salt (shared salts must start with “&”). In practice, we have very rarely seen the need to align two tests in this way.
2. You can “shuffle” the distribution of a test by changing its salt, resulting in completely different output from the hash function. This shuffling can be used to reassure yourself that there is no accidental bias in a test.
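
The role of the salt can be illustrated with a short Java sketch. The choice of MD5, the "|" separator, and the use of the first four digest bytes are assumptions made for this example, and `SaltedHash` is a hypothetical name; the source only states that the salt is hashed together with the identifier.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class SaltedHash {
    /** Hashes salt and identifier together into a 32-bit integer. */
    static int hash(String salt, String identifier) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                (salt + "|" + identifier).getBytes(StandardCharsets.UTF_8));
            // Fold the first four digest bytes into one 32-bit integer.
            return ((digest[0] & 0xFF) << 24)
                | ((digest[1] & 0xFF) << 16)
                | ((digest[2] & 0xFF) << 8)
                | (digest[3] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            // MD5 support is mandated for every Java platform implementation.
            throw new IllegalStateException("MD5 unavailable", e);
        }
    }
}
```

Two tests that share a salt produce the same hash for a given user and therefore aligned buckets, while two tests with different salts produce unrelated hashes. Changing a test’s salt reshuffles its users in the same way.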

Proctor at Indeed

Proctor has become a crucial part of Indeed’s data-driven approach to product development, with over 100 tests and 300 test variations currently in production. In our next post, we will provide more details on how we use Proctor at Indeed.

UPDATE: Our second post in this series, How Indeed Uses Proctor for A/B Testing, is now available.