SHAP Plots: The Crystal Ball for UI Test Ideas

Photo by Sam on Unsplash

 

Have you ever wanted a crystal ball that would predict the best A/B test to boost your product’s growth, or identify which part of your UI drives a target metric?

With a statistical model and a SHAP plot, you can identify impactful A/B test ideas in bulk. The Indeed Interview team used this methodology to generate optimal A/B tests, leading to a 5-10% increase in key business metrics.

Case study: Increasing interview invites

Indeed Interview aims to make interviewing as seamless as possible for job seekers and employers. The Indeed Interview team has one goal: to increase the number of interviews happening on the platform. For this case study, we wanted UI test ideas that would help us boost the number of invitations sent by employers. To do this, we needed to analyze their behavior on the employer dashboard and try to predict interview invitations.

Employer using Indeed Interview to virtually interview a candidate.

Convert UI elements into features

The first step of understanding employer behavior was to create a dataset. We needed to predict the probability of sending interview invitations based on an employer’s clicks in the dashboard.

We organized the dataset so each cell represented the number of times an employer clicked a specific UI element. We then used these features to predict our targeted action: clicking the Set up interview button vs. not clicking on the button.

Set up interview button on the employer dashboard
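
As a concrete illustration, a tiny version of such a dataset might look like the pandas sketch below; the column names are hypothetical stand-ins for real dashboard elements, not the ones we actually used:

```python
import pandas as pd

# One row per employer; each feature column counts clicks on one UI element.
# Column names are hypothetical stand-ins for real dashboard elements.
df = pd.DataFrame(
    {
        "clicks_feature_a": [3, 0, 7, 1],
        "clicks_feature_b": [5, 1, 0, 2],
        "clicks_feature_g": [0, 4, 2, 0],
        # Target: did the employer click "Set up interview"? (1 = yes, 0 = no)
        "clicked_set_up_interview": [1, 0, 1, 0],
    }
)

X = df.drop(columns=["clicked_set_up_interview"])
y = df["clicked_set_up_interview"]
```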

Train the model on the target variable

The next step was to train a model to make predictions based on the dataset. We selected a tree-based model, CatBoost, for its strong overall performance and its ability to detect interactions among features. And, like most models, it can be interpreted with our tool of choice, the SHAP plot.

We could have used correlation or logistic regression coefficients, but we chose a SHAP plot combined with a tree-based model because it offers unique advantages for model interpretation. Two features with similar correlation coefficients can have dramatically different interpretations in a SHAP plot, which also factors in feature importance. In addition, a tree-based model usually performs better than logistic regression, yielding a more accurate model. Combining a SHAP plot with a tree-based model gives us both performance and interpretability.
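
This is not our production pipeline, but a minimal sketch of this step with CatBoost and the shap package, reusing the X and y frames from the dataset sketch above, might look like:

```python
from catboost import CatBoostClassifier
import shap

# Train a gradient-boosted tree model on click counts vs. the target action.
model = CatBoostClassifier(iterations=500, verbose=False)
model.fit(X, y)

# TreeExplainer computes SHAP values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one value per (employer, feature) pair
```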

Interpret SHAP results into positive and negative predictors

Now that we have a dataset and a trained model, we can interpret the SHAP plot generated from them. SHAP works by showing how much each feature shifts the model's prediction. In the SHAP plot below, each row is a feature, and the features are ranked by descending importance: the ones at the top are the most important and have the strongest influence (positive or negative) on our targeted action of clicking Set up interview.

The dots for each feature are colored by the feature's value: a red dot means the employer clicked that UI element many times, and a blue dot means the employer clicked it only a few times. Each dot also has a SHAP value on the x axis, which indicates whether the feature pushes the prediction toward or away from the target, and how strongly. The farther a dot is from the center, the stronger the influence.
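
Assuming the shap_values computed above, generating this plot is a single call to the shap package:

```python
# Beeswarm summary plot: one row per feature, ranked by overall importance.
# Dot color encodes the feature value (click count); x position is the SHAP value.
shap.summary_plot(shap_values, X)
```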

SHAP plot displaying features A-O ranked by descending influence on the model (regardless of positive or negative). Each feature has red and blue dots (feature value) organized by SHAP value (impact on model output). Features outlined in red: A, B, D, F, H, I, K, L, and N. Features outlined in blue: E, G, M, and O.

SHAP plot with features outlined in red for positive predictors, and blue for negative predictors

Based on the color and location of the dots, we categorized the features as positive or negative predictors (a rough code sketch of this rule follows the list below).

  • Positive Predictor – A feature whose red dots sit to the right of the center.
    • Its red dots have positive SHAP values: heavy usage of this feature predicts the employer will send an interview invitation.
    • In the SHAP plot above, Feature B is a good example.
  • Negative Predictor – A feature whose red dots sit to the left of the center.
    • Its red dots have negative SHAP values: heavy usage of this feature predicts the employer will not send an interview invitation.
    • Feature G is a good example.
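
One rough way to automate this rule, reusing shap_values and X from the earlier sketches, is to check whether the high-usage rows for each feature carry mostly positive or mostly negative SHAP values. This heuristic is an illustration, not the exact procedure we followed:

```python
import numpy as np

# For each feature, look at employers with above-median usage (the "red dots")
# and check whether their SHAP values are mostly positive or mostly negative.
def classify_predictors(shap_values, X):
    labels = {}
    for i, col in enumerate(X.columns):
        heavy_users = (X[col] > X[col].median()).to_numpy()
        if not heavy_users.any():
            continue  # feature is never used heavily; skip it
        mean_impact = np.mean(shap_values[heavy_users, i])
        labels[col] = "positive" if mean_impact > 0 else "negative"
    return labels

print(classify_predictors(shap_values, X))
```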

Features with red dots on both sides of the center are more complex and need further investigation, using tools such as dependence plots (also in the SHAP package).
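
For example, continuing the hypothetical sketch from above, a dependence plot is a single call:

```python
# How the SHAP value for one feature (the hypothetical "clicks_feature_g")
# varies with its value and an automatically chosen interacting feature.
shap.dependence_plot("clicks_feature_g", shap_values, X)
```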

Note that this relationship between feature and target is not causal yet. A model can only claim causality when it assumes all confounding variables have been included, which is a strong assumption. While the relationships could be causal, we don’t know for certain until they are verified in A/B tests.

Generate test ideas

Our SHAP plot contains 9 positive predictors and 4 negative predictors, and each one yields a potential A/B test hypothesis about the relationship between that UI element and the target. We hypothesize that positive predictors boost target usage, and negative predictors hinder it.

To verify these hypotheses, we can test ways to make positive predictors more prominent and direct the employer's attention to them. After the employer clicks on the feature, we can direct attention to the target in order to boost its usage. Another option is to test ways to divert the employer's attention away from negative predictors: we can add good friction, making them harder to reach, and see whether usage of the target increases.

Boost positive predictors

We tested changes to the positive predictors from our SHAP plot to make them more prominent in our UI. We made Feature B more prominent on the dashboard, and directed the employer’s attention to it. After the employer clicked Feature B, we showed a redesigned UI with improved visuals to make the Set up interview button more attractive.

The result was a 6% increase in clicks to set up an interview.

Divert away from negative predictors

We also tested changes to the negative predictors from our SHAP plot in the hopes of increasing usage of the target. We ran a test to divert employer attention away from Feature G by placing it close to the Set up interview button on the dashboard. This way it was easier for the employer to choose to set up an interview instead.

This change boosted clicks to send interview invitations by 5%.

Gaze into your own crystal ball

A SHAP plot may not be an actual crystal ball. When used with a statistical model, however, it can generate UI A/B test ideas in bulk and boost target metrics for many products. You might find it especially suitable for products with a complex and nonlinear UI, such as user dashboards. The methodology also provides a glimpse of which UI elements drive the target metrics the most, allowing you to focus on testing features that have the most impact. So, what are you waiting for? Start using this method and good fortune will follow.

 

Cross-posted on Medium

Indeed SRE: An Inside Look

Photo by Kevin Ku on Unsplash

Indeed adds over 30 million jobs online every month, which helps connect 250 million job seekers to prospective employers. How do we keep our services available, fast, and scalable? That’s the ongoing challenge for our site reliability engineering (SRE) team.

What is SRE?

The idea behind SRE is simple: The team ensures that a company’s core infrastructure works effectively. SRE originated in 2003 when Google formed a small production engineering team to address reliability issues. Its initial focus was on-call, monitoring, release pipeline, and other operations work. The team established service-level indicators and objectives (SLIs and SLOs) to improve infrastructure across the company. Other companies took note, and SRE soon became an industry standard.

SRE is distinct from other engineering roles. Team members work across business areas to ensure that services built by software engineering (SWE) teams remain scalable, performant, and resilient. Working with platform teams, SRE helps manage and monitor infrastructure like Kubernetes. SRE teams build frameworks to automate processes for operations teams. They might also develop applications to handle DNS, load balancing, and service connections for network engineering teams.

These functions are crucial for any company competing in today’s tech world. However, because of the vast range of technologies and methods available, each SRE team takes a different approach.

SRE at Indeed

At Indeed, we established an SRE team in 2017 to increase attention on reliability goals and optimize value delivery for product development teams. Our SRE team uses an embedded model, where each team member works with a specific organization. They code custom solutions to automate critical processes and reduce toil for engineers.

Indeed SRE focuses on these key goals:

Promote reliability best practices. SRE helps product teams adopt and iterate on metrics, such as SLOs, SLIs, and error budget policies. They promote an Infrastructure as Code (IaC) model. That means they write code to automate management of data centers, SLOs, and other assets. They also drive important initiatives to improve reliability and velocity, like Indeed’s effort to migrate products to AWS.

Drive the creation of reliability roadmaps. At Indeed, the SRE team spends more than 50% of their time on strategic work for roadmaps. They analyze infrastructure to define how and when to adopt new practices, re-architect systems, switch to new technologies, or build new tools. Once product teams approve these proposals, SRE helps design and implement the necessary code changes.

Strive for operational excellence. SRE works with product teams to identify operational challenges and build more efficient tools. They also guide the process of responding to and learning from critical incidents, adding depth to individual team retrospectives. Their expertise in incident analysis helps them identify patterns and speed up improvements across the company.

Who works in Indeed SRE?

Our SRE team is diverse and global. We asked a few team members to talk about how they arrived at Indeed SRE.

Ted, Staff SRE

I love programming. Coming from a computer science background, I started my career as a software engineer. As I progressed in my role, I became interested in certain infrastructure-related challenges. How can we move a system to the cloud while minimizing costs? How do we scale a legacy service across several machines? What metrics should we collect—and how frequently—to tell if a service works as intended?

Later, I discovered that these questions are at the intersection of SWE and SRE. Without realizing it, I had implemented SRE methodology in every company I’d worked for! I decided to apply at Indeed, a company with an established SRE culture where I could learn—not only teach.

Working for Indeed SRE gives me more freedom to select my focus than working as a SWE. I can pick from a range of tasks: managing major outages, building internal tools, improving reliability and scalability, cleaning up deprecated infrastructure, migrating systems to new platforms. My work also has a broad impact. I can improve scalability for 20+ repositories in different programming languages in one go. Or I can migrate them to a new environment in a week. SRE has given me deeper knowledge of how services from container orchestration tools to front end applications are physically managed, which makes me a better engineer.

Jessica, Senior SRE

Before joining Indeed SRE, I tried many roles, from QA to full-stack web developer to back-end engineer. Over time, I realized that I liked being able to fix issues that I identify. I wanted to communicate and empathize with the customer instead of being part of a feature factory. Those interests led me to explore work in operations, infrastructure, and reliability. That’s when I decided on SRE.

Now I support a team that works on a set of role-based access control (RBAC) services for our clients. All our employer-facing services use this RBAC solution to determine whether a particular user is authorized to perform an action. Disruptions can lead to delays in our clients' hiring processes, so we have to make sure they get fast, consistent responses.

The best thing about being on the SRE team is working with a lot of very talented engineers. Together, we solve hard problems that software engineers aren’t often exposed to. The information transfer is amazing, and I get to help.

Xiaoyun, Senior SRE Manager

When I joined Indeed in 2015, I was a SWE and then a SWE manager. At first I worked on product features, but gradually my passion shifted to engineering work. I started improving the performance of services, e.g., making cron jobs run in minutes instead of hours. This led me to explore tools for streaming process logs and database technology for improving query latency.

Then I learned about SRE opportunities at Indeed that focused on those subjects. I was attracted to the breadth and depth offered by SRE. Since joining, I have worked with a range of technologies, services, and infrastructure across Indeed. At the same time, I’ve had the opportunity to dive deep into technologies like Kafka and Hadoop. My team has diagnosed and solved issues in several complex AWS managed services.

Indeed also encourages SRE to write reliability-focused code. This makes my background useful—I enjoy using my SWE skills to solve these kinds of challenges.

Yusuke, Staff SRE

I joined Indeed in 2018 as a new university graduate. In school, I studied computer science and did a lot of coding. I learned different technologies from infrastructure to web front-end and mobile apps. Eventually I decided to start my career in SRE, which I felt utilized my broad skill set better than a SWE role would.

I started on a back-end team that builds the platform to enable job search at Indeed. To begin, we defined SLIs and SLOs, set monitors for them, and established a regular process to plan capacity. Soon we were re-architecting the job processing system for better reliability and performance. We improved the deployment process with more resilient tooling. I helped adopt cloud native technologies and migrate applications to the cloud. To track and share investigation notes, we also started building an internal knowledge base tool.

I enjoy Indeed SRE because I can flex different skills. With the nature and the scale of the system we’re supporting, I get to share my expertise in coding, technologies, and infrastructure. SRE members with different backgrounds are always helping each other to solve problems.

Building your SRE career

Develop a broad skill set

SRE works with a variety of systems, so it’s important to diversify your technical skills. Besides SWE skills, you’ll need an understanding of the underlying infrastructure. A passion for learning and explaining new technologies is helpful when making broader policy and tool recommendations.

Focus on the wider organization

SRE takes a holistic view of reliability practices and core systems. When working with shared infrastructure, your decisions can affect systems across the company. To prioritize changes, you need to understand how others are using those systems and why. Working across different teams is a positive way to achieve personal and professional growth, and it advances your SRE journey.

Join us at Indeed

If you’re a software engineer, pivoting to SRE gives you exposure to the full stack of technologies that enable a service to run. If you’re currently doing operational work (in SRE or elsewhere), Indeed’s broad approach can add variety to your workload. Each team we work with has its own set of reliability challenges. You’ll be able to pick projects that interest you.

Indeed SRE also provides opportunities to grow. Our SRE culture is well established and always expanding. You’ll work with SWE and other roles, learning from each other along the way.

If you’re interested in challenging work that expands your horizons, browse our open positions today.

Speed Matters, But It Isn’t Everything

Photo by Jonathan Chng on Unsplash

Over the last few years at Indeed, we noticed our public-facing web applications were loading more slowly. We tested numerous ways to improve performance. Some were very successful, others were not.

We improved loading speeds by 40%, but we also learned that speed is not always the most important factor for user experience.

Performance metrics

We measured loading speed using two key metrics: firstContentfulPaint and domContentLoadedEventEnd.

We chose a weighted average of the two instead of a single metric. This provided a more accurate measure of perceived load time, and helped us answer two critical questions:

  • How long did the user wait before the page seemed responsive?
  • How long did the user wait before they could interact with the page?

Though these metrics came with tradeoffs, we decided to use them instead of Google Web Vitals because they gave the broadest coverage across our user base. After deciding on these metrics, we had simple, observable, and reportable data from hundreds of applications and across a variety of web browsers.
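
For illustration only, the weighted average mentioned above could be computed like the sketch below; the weights shown are placeholders rather than the ones we actually used:

```python
# Hypothetical weighted load-time score blending the two RUM metrics.
# The weights are illustrative only, not the ones used in practice.
def weighted_load_time(first_contentful_paint_ms: float,
                       dom_content_loaded_ms: float,
                       w_fcp: float = 0.4,
                       w_dcl: float = 0.6) -> float:
    """Blend 'seemed responsive' and 'could interact' into one number."""
    return w_fcp * first_contentful_paint_ms + w_dcl * dom_content_loaded_ms

print(weighted_load_time(1200, 2500))  # 0.4*1200 + 0.6*2500 = 1980.0 ms
```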

Successful methods for improving speed

While we tried many strategies, the following efforts provided the biggest increases in performance.

Flushing <Head/> early

Browsers generally use the most resources during page load when they are downloading and parsing static resources such as JS, CSS, and HTML files. To reduce this cost, we can send static content early, so the browser can begin to download and parse files even before those files are required. This eliminates much of the render-blocking time these resources introduce.

By flushing the HTML head early on multiple applications, we saw load time improvements of 5-10%.

This implementation comes with a few trade-offs, however, since flushing the HTML document in multiple chunks can result in confusing error modes. Once we’ve flushed the first part of the response, we’re no longer able to change parts of the response, such as status code or cookies. Even if an error occurs somewhere before the last part of the response, we can’t change these headers. We’ve implemented some common libraries that help with these complications.
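
Those shared libraries aren't shown here, but the general pattern can be sketched in Python with Flask; the asset paths and body-rendering helper below are hypothetical:

```python
from flask import Flask, Response, stream_with_context

app = Flask(__name__)

# Static, render-blocking assets the browser can start fetching immediately.
# The asset paths are hypothetical.
HEAD = """<!doctype html>
<html>
<head>
  <link rel="stylesheet" href="/static/app.css">
  <script defer src="/static/app.js"></script>
</head>
<body>
"""

def render_dashboard_body() -> str:
    # Placeholder for the slow, data-dependent part of the page.
    return "<main>...page content...</main>\n"

@app.route("/dashboard")
def dashboard():
    def generate():
        # Flush the <head> right away so the browser can download and parse
        # CSS/JS while the server does the slow work below.
        yield HEAD
        # Once this first chunk is sent, the status code, cookies, and other
        # headers can no longer change, even if an error occurs later.
        yield render_dashboard_body()
        yield "</body></html>"
    return Response(stream_with_context(generate()), mimetype="text/html")
```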

Reducing files on the critical path

Apart from the total number of bytes, one of the most important factors in page load time is the total number of resources – especially render-blocking resources – on the critical path for rendering. In general, the more blocking files you request, the slower the page. For example, a 100kB page served with 5 files will be significantly faster than a 100kB page served with 10 files.

In an A/B test, we reduced the number of render-blocking files from 30 to 12, a 60% reduction. The total amount of bytes shipped during page load was roughly identical. This test provided a 2+ second improvement for domContentLoadedEventEnd at the 95th percentile for our desktop and mobile search pages, as well as significant improvements in largestContentfulPaint.

To dive into this further, we explored the cost of a single extra CSS file. We ran a test on one of our highest trafficked pages to reduce the number of CSS files by 1. Page load times improved by a statistically significant amount, about 15ms at the 95th percentile.

Improving the runtime cost of CSS-in-JS

As more of our applications started using our newest component library, built on top of the Emotion library, we noticed 40% slower page loads.

The Emotion library supports CSS-in-JS, a growing industry trend. We determined that rendering CSS-in-JS components added extra bytes to our JavaScript bundles. The runtime cost of this new rendering strategy – along with the added bytes – caused the slowdown. We built a webpack plugin that precompiles many of our most commonly used components, reducing their render cost and helping address the problem.

This strategy resulted in a massive improvement, decreasing the slowdown from 40% to about 5% in aggregate at the 95th percentile. However, the CSS-in-JS approach still incurred more runtime cost than more traditional rendering approaches.

Factors outside our control

Along with testing improvements, we analyzed the types of users, locales, and devices that had an impact on page speeds.

Device type and operating system

For Android devices, which are generally lower powered than their iOS counterparts, we saw 63% slower timings for firstContentfulPaint, and 107% slower timings for domContentLoadedEventEnd.

Windows users saw 26% slower timings for domContentLoadedEventEnd compared to their iOS counterparts. These results were somewhat expected, since Windows devices tend to be older.
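
Comparisons like these come from aggregating real user metrics; a simplified pandas sketch, with hypothetical field names and values, might look like:

```python
import pandas as pd

# Hypothetical RUM beacons: one row per page load, with timings in ms.
rum = pd.DataFrame(
    {
        "os": ["Android", "iOS", "Windows", "iOS", "Android"],
        "first_contentful_paint_ms": [2400, 1500, 2100, 1400, 2600],
        "dom_content_loaded_ms": [5200, 2500, 3100, 2400, 5500],
    }
)

# 95th-percentile timings per operating system.
p95 = rum.groupby("os")[
    ["first_contentful_paint_ms", "dom_content_loaded_ms"]
].quantile(0.95)
print(p95)
```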

This data provided important takeaways:

  • The performance impact of features and additional code is non-linear: newer, more robust devices can take on 100kB of additional code without an impact on performance, while older devices see a much bigger slowdown from the same change.
  • Testing applications using real user metrics (RUM) is critical to understanding performance, since performance varies so widely based on device and the operating system’s capabilities.

Connection type and network latency

We used the Network Information API to collect information about various connection types. The API is not supported in all browsers, making this data incomplete; however, it did allow us to make notable observations:

  • 4G connection types were 4 times faster than 3G, 10 times faster than 2G, and 20 times faster than connections that were less than 2G. Put another way, network latency accounts for a huge percent of our total latency.
  • For browsers that report connection type information, 4G connection types make up 95% of total traffic. Including all browser types drops this number closer to 50%.

Networks vary greatly by country, and for some countries it takes over 20 seconds to load a page. By excluding expensive features such as big images or videos in certain regions, we deliver simpler, snappier experiences on slower networks.

Excluding features like this is by far the simplest way to improve performance on slow networks, but it comes at the cost of added product complexity.

Results of speed and other factors

The impact of performance on the web varies. Companies such as Amazon have reported that slowdowns of just 1 second could result in $1.6 billion in lost sales. However, other case studies have reported a more muddled understanding of the impact of performance.

Over the course of our testing, we saw some increases in engagement alongside performance improvements. But we're not convinced those gains came from the performance improvements alone.

Reliability vs speed

Our current understanding of these increases in engagement is that they are based on increased reliability rather than an improvement in loading speed.

In tests where we moved our static assets to a content delivery network (CDN), we saw engagement improvements, but we also saw indications of greater reliability and availability. In tests that improved performance but not reliability, we did not see strong improvements in engagement.

The impact of single, big improvements

In tests where we improved performance by a second or more (without improving reliability), we saw no significant changes in our key performance indicators.

Our data suggests that for non-commerce applications, small to medium changes in performance do not meaningfully improve engagement.

Engagement vs performance

Our observations reminded us not to equate performance with engagement when analyzing our metrics. One stark example of this point was the different performance metrics observed for mobile iOS users versus mobile Android users.

While Android users had nearly 2 times slower rendering, there was no observable drop in engagement when compared to iOS users.

So when does speed matter?

After a year of testing strategies to improve speed, we found several that were worth the effort. While these improvements were measurable, they were not significant enough to drive changes to key performance indicators.

The bigger lesson is that while a certain level of speed is required, other factors matter too. The user's device and connection play a large role in the overall experience. The silver lining is that, since we cannot fully control all these factors, we can be open to architectural strategies not specifically designed for speed. Making minor trade-offs in speed for improvements in other areas can result in an overall better user experience.

 

Cross-posted on Medium