New Eng Manager at Indeed? First: Write Some Code

I joined Indeed in March 2016 as an “industry hire” manager for software engineers. At Indeed, engineering managers start as individual contributors (ICs) before taking on more responsibilities. Working with my team as an IC better prepared me to move into a management role.


Before my first day, I talked with a few engineering managers about what to expect. They advised that I would spend about 3-6 months contributing as an individual developer. I would write unit tests and code, commit changes, do code reviews, fix bugs, write documentation, and more.

I was excited to hear about this approach, because in my recent years as an engineering manager, I had grudgingly stopped contributing at the code level. Instead, I lived vicariously through others by doing code reviews, participating in technical design reviews, and creating utilities and tools that boosted team productivity.

When new managers start in the Indeed engineering organization as an IC, they can rotate through several different teams or stay with a single team for about a quarter. I was in the latter camp and joined a team that works on revenue management.

Onboarding as an individual contributor

My manager helped onboard me and directed me to self-guided coursework on our wiki. I was impressed by the amount of content provided to familiarize new hires with the tools and technologies we use at Indeed. In my experience, most companies don’t invest enough in creating and maintaining useful documentation. Equally valuable, fellow Indeedians gladly answered my questions and helped me get unblocked when I hit technical hurdles. I really appreciated that support as a new employee.

During my time as an IC, I had no management responsibilities. That was a change for me, and it was wonderful! I focused on code. I built technical competence and knocked the rust off mental processes that I hadn’t needed to use for a while. I observed the practices and processes the team used to learn how I could become equally productive. I had a chance to dive deeper into Git usage. I wrote unit and DAO tests to increase code coverage. I learned how to deploy code into the production environment. For the first time in a long while, I wrote production code for new features in a product.

To accelerate my exposure to the 20 different projects owned by my team, I asked to be included on every code review. I knew I wouldn’t be able to contribute to all of the projects, but I wanted to be exposed to as many as possible. Prior to my request, the developer typically selected a few people to do a code review and nominated one to be the “primary” reviewer. Because I was included in every review, I saw code changes and the comments left by team members on how to improve the code. I won’t claim I understood everything I read in every code review, but I did gain an appreciation for the types of changes. I recommend this approach to every new member of a team, not just managers.

Other activities helped me integrate with people outside of my team. For example, I scheduled lunch meetings with everyone who had interviewed me. This was mostly other engineering managers, but I also met with folks in program management and technical writing. Everyone I contacted was open to meeting me. These lunch meetings gave me a feel for different roles: how they planned and prioritized work, their thoughts on going from IC to manager, and challenges they had faced during their tenure at Indeed. On-site lunches (with great food, by the way) let me meet both engineering veterans and people in other departments.

Transitioning into a managerial role

By the time I was close to the end of my first full quarter, I had contributed to several projects. I had been exposed to some of the important systems owned by my team. Around this time, my manager and I discussed my transition into a managerial role. We agreed that I had established enough of a foundation to build on. I took over 1-on-1 meetings, quarterly reviews, team meetings, and career growth discussions.

Maintaining a technical focus

Many software engineers who take on management roles struggle with the idea of giving up writing code. But in a leadership position, what matters more is engaging the team on a technical level. This engagement can take a variety of forms. Engineering managers at Indeed coach their teams on abstract skills and technical decisions. When managers have a deeper understanding of the technology, they can be more effective in their role.

I am glad that I had a chance to start as an IC so that I could earn my team’s trust and respect. A special shout out to the members of the Money team: Akbar, Ben, Cheng, Erica, Kevin, Li, and Richard.


Finding Anomalies in User Behavior with Python


In the course of helping over 200 million unique visitors every month find jobs, we end up with a lot of data. The data we collect can tell us a lot about the behavior of our users, and for the most part we observe predictable patterns in that behavior. But unexpected changes could be evidence of failures in our system or an actual shift in user behavior. When the data shows something strange, we want to understand.

Identifying and acting on anomalies in user behavior is a complex problem. To help detect these anomalies, we take advantage of several open-source libraries, chief among them Twitter’s AnomalyDetection library.

Observing anomalies in user behavior

Detecting anomalies in a large set of data can be easy if that data follows a regular, predictable pattern over time. If we saw the same range of pageviews for our job listings every day, as simulated in Figure 1, it would be easy to identify outliers.


Figure 1. Single outlier

But most of the data we collect is determined by user behavior, and such data does not follow patterns that we can easily observe. A variety of factors influence user behavior. For example, each of the following factors might affect what we consider a “normal” range of pageviews, depending on the geographic location of the user:

  • What day of the week is it?
  • What time of day is it?
  • Is it a holiday?

We might understand some factors in advance. We might understand others after analyzing the data. We might never fully understand some.

Our anomaly detection should account for as many of these variations as possible while still being precise enough to surface statistically significant outliers. Simply saying “traffic is normally higher on Monday morning” is too fuzzy: How much higher? For how long?

Figure 2 shows a variable range of expected data, while Figure 3 shows a range of actual data. Anomalies within the actual data are not immediately visible.


Figure 2. Expected data


Figure 3. Actual data

Figure 4 shows the actual and expected data overlaid. Figure 5 shows the difference between the two at any point in time. Viewing the data in this way highlights the significant anomaly in the final data point of the sequence.


Figure 4. Expected and actual overlaid


Figure 5. Difference between actual and expected data
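
To make the comparison in Figures 4 and 5 concrete, here is a minimal sketch (with made-up numbers) of the underlying arithmetic: subtract the expected series from the actual series, then flag the points whose deviation stands out. The fixed threshold is a crude stand-in for the statistical test described in the next section.

```python
import numpy as np

# Made-up hourly pageview counts: what we expected vs. what we measured.
expected = np.array([120, 95, 80, 130, 160, 150, 140, 90])
actual   = np.array([118, 97, 83, 128, 155, 152, 139, 30])

# The "Figure 5" view: difference between actual and expected at each point.
residual = actual - expected

# Flag points whose deviation from the typical residual is unusually large.
# (A fixed multiple of the standard deviation is a crude stand-in for the
# statistical test described below.)
outliers = np.where(np.abs(residual - residual.mean()) > 2 * residual.std())[0]
print(outliers)  # -> [7], the final data point
```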

We needed a sophisticated method to quickly identify and report on anomalies such as these. This method had to be able to analyze existing data to predict expected future data. And it had to be accurate enough so that we wouldn’t miss anomalies requiring action.

Step one: Solving the statistical problem

The hard mathematical part was actually easy, because Twitter already solved the problem and open sourced their AnomalyDetection library.

From the project description:

AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend. The AnomalyDetection package can be used in wide variety of contexts. For example, detecting anomalies in system metrics after a new software release, user engagement post an A/B test, or for problems in econometrics, financial engineering, political and social sciences.

To create this library, Twitter started with an extreme studentized deviate (ESD) test, also known as Grubbs’ test, and improved it to handle user behavior data. Originally, the test used the mean value of a data set to identify outliers. Twitter’s developers realized that using the median was more precise for web use, because user behavior can be volatile over time.
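
As a simplified illustration of why the median helps (this is only the robustness idea, not the full seasonal ESD algorithm): an extreme value inflates the very mean and standard deviation it is judged against, while the median and the median absolute deviation (MAD) barely move.

```python
import numpy as np

# Made-up pageview counts with one anomalous dip at the end.
views = np.array([100, 102, 98, 101, 99, 103, 97, 5])

# Mean-based score: the outlier drags the mean down and inflates the standard
# deviation, so its own score ends up looking only mildly unusual (~2.6).
mean_scores = np.abs(views - views.mean()) / views.std()

# Median-based score: the median and MAD are barely affected by the outlier,
# so its score is dramatic (~47) and easy to separate from normal points.
median = np.median(views)
mad = np.median(np.abs(views - median))
median_scores = np.abs(views - median) / mad

print(mean_scores.round(2))
print(median_scores.round(2))
```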

The result is a library that makes it easy to quickly identify anomalous results in time-based datasets with variable trends. Twitter’s data scientists use it for their own internal analysis and reporting. For example, they report on tweets per second and the CPU utilization of their internal servers.

Twitter’s library allowed us to use historical data to estimate a highly complicated set of user behavior and quickly identify anomalous behavior. There was only one problem: Twitter wrote the library using R, while our internal alerting systems are implemented in Python.

We decided to port Twitter’s library to Python so that it would work directly with our code.

Step two: Porting the library

Of course, porting code from one language to another always involves some refactoring and problem solving. Most of our work in porting the AnomalyDetection library dealt with differences between how math is supported in R and in Python.

Twitter’s code relies on several math functions that are not natively supported in Python. The most important is a seasonal decomposition algorithm called Seasonal and Trend decomposition using Loess (STL).

We were able to incorporate STL from the open-source pyloess library. We found many of the other math functions that Twitter used in the numpy and scipy libraries. This left us with only a few unsupported math functions, which we ported to our library directly by reading the R code and replicating the functionality in Python.
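
Putting the pieces together, the overall shape of the detection pipeline looks roughly like this: decompose the series, then test what remains for outliers. The sketch below uses the STL implementation from statsmodels as a stand-in for pyloess, and a simple median/MAD rule as a stand-in for the full ESD test, so it illustrates the approach rather than reproducing the ported library.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

# Simulated hourly pageviews: daily seasonality, noise, and one injected anomaly.
rng = np.random.default_rng(0)
n = 24 * 14
series = 100 + 40 * np.sin(2 * np.pi * np.arange(n) / 24) + rng.normal(0, 5, n)
series[-10] -= 120  # the dip we want the pipeline to find

# STL strips out the seasonal and trend components, leaving a remainder that
# should be unremarkable noise when nothing unusual is happening.
remainder = STL(series, period=24, robust=True).fit().resid

# Score the remainder against the median / MAD rather than the mean / stddev.
med = np.median(remainder)
mad = np.median(np.abs(remainder - med))
scores = np.abs(remainder - med) / mad
print(scores.argmax(), scores.max())  # points at the injected anomaly
```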

Taking advantage of the excellent work done by our neighbors in the open-source community allowed us to greatly reduce the effort required in porting the code. By using the pyloess, numpy, and scipy libraries to replicate the R math functions Twitter used, one developer completed most of the work in about a week.

Open-source Python AnomalyDetection

We participate in the open source community because, as engineers, we recognize the value in adapting and learning from the work of others. We are happy to make our Python port of AnomalyDetection available as open source. Download it, try it out, and reach out to us on GitHub or Twitter if you need any help.


A Funny Thing Happened on the Way to Java 8

Indeed is migrating from Java 7 to Java 8. We’re migrating in phases, starting with running our Java 7 compiled binaries on JRE 8, the Java 8 runtime. Rather than switch everything at once, we started with a few canary apps to get a sense for the kinds of problems we might find.

We selected our job search web app as our first canary. Last May, we switched a single job search server in production to JRE 8. We’d been running with JRE 8 in a QA environment for several weeks without any trouble. But when we switched to JRE 8 in production, the system load rose from 5 (a normal level) to 20 (an abnormal level): high enough to require a fix before proceeding. Fixing the problem required unusual measures.

Production metrics in Datadog

I started by looking at Datadog, a third-party, cloud-based metric and event correlation service that we use for real-time monitoring. The following Datadog graph shows the system load by host for every job search server in one of our data centers. The outlier (blue line) is the server running JRE 8.


Figure 1. System load for job search servers in a data center

Here is the median response time by host. Again, the server running JRE 8 is the outlier.


Figure 2. Median response time by host

The job search servers that were still on JRE 7 had normal system loads and normal median response times, so the problem was specific to the JRE 8 server.

It’s not unusual for a Java process to slow down because of garbage collection (GC) activity. We publish our GC metrics to Datadog, and the GC graphs for the JRE 8 instances looked normal.

Production log files

Next, I checked the web application log files. I retrieved every log message from the four hours when we used JRE 8, plus the log messages for the previous 24 hours. There were too many messages to read by hand. Instead, I focused on changes in the distribution of errors. I extracted the name of the associated Java class for each error and then generated histograms for different time periods.
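
Something like the following Python sketch captures the idea; the log format, regular expression, and file names are assumptions for illustration. The point is the shape of the analysis: bucket error lines by the Java class that logged them, then compare counts across time periods.

```python
import re
from collections import Counter

# Assumed log line format: "<date> <time> ERROR [com.example.SomeClass] message".
LOG_LINE = re.compile(r"^\S+ \S+ ERROR \[(?P<cls>[\w.$]+)\]")

def error_histogram(path):
    """Count error log lines per Java class in one log file."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if match:
                counts[match.group("cls")] += 1
    return counts

# Hypothetical file names: the 24 hours before the switch vs. the JRE 8 window.
before = error_histogram("jobsearch-before.log")
during = error_histogram("jobsearch-jre8.log")

# Rank classes by how much their error counts shifted between the two periods.
for cls in sorted(set(before) | set(during),
                  key=lambda c: abs(during[c] - before[c]), reverse=True)[:10]:
    print(cls, before[cls], "->", during[cls])
```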

The histograms showed a slight increase in memcache client timeouts during the JRE 8 period. They also showed we used an open-source memcache client library that is no longer maintained. A Google search didn’t turn up any JRE 8 specific problems with the library. I glanced through its source code but didn’t see any problems. I decided that the increase in timeouts was most likely just a side effect of the higher system load.

Reproducing the problem in QA

Since the evidence from our production environment didn’t identify the problem, the next step was to try to reproduce it in our QA environment. As I mentioned earlier, the web app had been running normally on JRE 8 in QA. That wasn’t surprising, because performance problems might not manifest themselves until the system is under enough load, and our QA servers don’t get as much traffic as our production servers. I hoped that if I increased traffic beyond normal QA levels, we would see the same symptoms.

I queried our Imhotep system to find the relative frequencies of our most common incoming requests. Then, using JMeter, I simulated that mixture of traffic and ran it twice while monitoring system load, once each with instances on JRE 7 and JRE 8. In both cases, the web app processed about 50 requests per second. The system load was about the same in both cases, too.
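
Deriving that traffic mixture is straightforward once you have a sample of request paths. A minimal sketch (with made-up paths, not an actual Imhotep query) looks like this; the resulting percentages become the weights in the JMeter test plan.

```python
from collections import Counter

# Made-up request paths standing in for a sample pulled from Imhotep.
requests = [
    "/jobs?q=nurse", "/jobs?q=java", "/viewjob?jk=abc", "/jobs?q=nurse",
    "/rpc/jobdesc", "/jobs?q=retail", "/viewjob?jk=def", "/jobs?q=java",
]

# Reduce each URL to its endpoint and compute relative frequencies.
endpoints = Counter(path.split("?")[0] for path in requests)
total = sum(endpoints.values())
for endpoint, count in endpoints.most_common():
    print(f"{endpoint}: {count / total:.1%}")
```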

Fifty requests per second is a higher traffic load than we normally simulate in QA, but it is still not at the level we receive in production. When we started this effort, our production job search instances each served approximately 350 requests per second. To reproduce the problem in QA, I would need to get closer to that traffic volume.

Ruling out JMeter as a bottleneck

Why was our application pegged at 50 requests/sec? Before blaming the QA environment, I needed to rule out JMeter as the bottleneck. I installed JMeter on a second server, configured it like the first JMeter instance, and then ran both at the same time. If the two instances combined had run at more than 50 requests/sec, I would know JMeter was the bottleneck. In that case, the next step would be to add more JMeter instances until we either saturated the web app or caused the system load to rise to 20.

It turned out JMeter wasn’t the bottleneck. When I ran both instances at the same time, the aggregate throughput was still 50 requests/sec.

Eliminating slow services

The job search web app depends on 22 services. If some of them are slow enough, the web app will slow down, too. A service in QA may be slower than its production counterpart. We sometimes replace slow services with fake but fast versions when isolating the dependent application for performance analysis. I decided to create fake versions of the five services that our application uses the most.

Using the new fake services, JMeter drove the web app at 100 requests/sec. JRE 8 was about 15% slower than JRE 7, but the system load still wasn’t very high, and the difference between JRE 7 and JRE 8 varied a lot from one run to the next. Still, since the numbers were different, I profiled the performance of both configurations using YourKit.

YourKit

YourKit is a Java profiler you attach to a Java process to collect and analyze runtime statistics. Once attached, you can create a performance snapshot: an aggregation of stack traces from every Java thread from multiple points in time. YourKit provides multiple ways to visualize a performance snapshot, including a hot spot view that shows which methods consumed the most time. Since our web app on JRE 8 consumed a lot more CPU than on JRE 7, I expected the respective hot spot views to differ.

The views were about the same, presumably because I still hadn’t put the systems under enough load in QA.

Taking performance profiles in production

At this point, I could have invested more time trying to reproduce the problem in QA, but instead I decided to use YourKit in production. We took performance snapshots in production with both JRE 7 and JRE 8. The only significant difference between the two was that the JRE 8 configuration seemed to use a lot more CPU. In fact, it used much more CPU than the performance snapshot could account for. It was as if the application was busy doing something in the background that YourKit couldn’t measure. We then decided to try profiling with the perf command.

Using the perf command

A Java process is more than a collection of Java code; it’s also a JVM composed of multi-threaded C++. A Java profiler can’t measure what that C++ code does; for that, you need a different kind of profiler, like the Linux perf command.

The perf command can collect call stacks from a Linux process. It uses the process’s debug symbol table to map a call stack’s addresses from virtual addresses to symbolic names of functions and methods. The JVM binary doesn’t contain a debug symbol table for your Java code, but there’s a way to query a Java process for its Java debug symbol table, too.

In production, we configured perf to capture the call stack for every web app thread every ~10ms for 30 seconds. That yielded about 11,000 call stacks. I aggregated the data into a flame graph (Figure 3).

In the flame graph, the y axis is stack depth. Any box at the top represents code that was on CPU when a sample was taken. The box below it is its caller, and so on. The flame graph in Figure 3 is really tall because our call stacks are deep. The x axis is just an alphabetical sort. The wider the box, the more call stacks it appears in. Green boxes are Java code, yellow is C++ code in the JVM, and red is C code in the JVM. Lightness and saturation are chosen for contrast and have no special meaning.


Figure 3. Flame graph showing job search call stacks

Figure 4 shows the bottom portion of the flame graph:


Figure 4. Flame graph zoomed view

The CompileBroker::compiler_thread_loop (left, in yellow area) appears in 39% of call stacks. That is the bytecode compiler, the part of the JVM that converts frequently used bytecode into machine instructions. The value 39% seemed like a lot, and it represented something in the background that we couldn’t measure with YourKit.
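
That 39% figure is just the fraction of sampled stacks that contain the frame. Assuming the perf samples have been collapsed into the one-line-per-stack “folded” format that flame graph tooling consumes (frames joined by “;”, followed by a sample count), a hypothetical helper to compute it might look like this; the file name is made up.

```python
def frame_frequency(folded_path, frame):
    """Fraction of sampled call stacks that contain the given frame."""
    total = hits = 0
    with open(folded_path) as folded:
        for line in folded:
            # Each line looks like "frame1;frame2;...;frameN <count>".
            stack, _, count = line.rpartition(" ")
            samples = int(count)
            total += samples
            if frame in stack.split(";"):
                hits += samples
    return hits / total

# Hypothetical file name for the collapsed perf output.
pct = frame_frequency("jobsearch.folded", "CompileBroker::compiler_thread_loop")
print(f"{pct:.0%} of sampled call stacks include the bytecode compiler")
```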

Why would it consume so much time? The bytecode compiler writes its machine code to a dedicated area of memory called the codecache, which has a configurable size. There are JVM settings to configure what happens when the codecache fills up, but by default, the JVM is supposed to make room by flushing older entries. YourKit showed that our codecache was close to full, so it seemed worthwhile to try increasing the maximum size.

Doubling the codecache

We doubled the codecache size in production from 64MB to 128MB, restarted the web app, and let it run over the weekend. The application behaved normally and the system load stayed at a normal level. We ran the perf command and flame-graphed the results (Figure 5).


Figure 5. Flame graph zoomed view after doubling the codecache

The width of the CompileBroker::compiler_thread_loop box (right) shows that the HotSpot compiler was less busy than before: about 20% of call stacks vs. about 40% before.

Codecache fluctuation

Even with a 128MB codecache, we eventually ran into trouble. Figure 6 shows what happened. The flat blue line is the maximum codecache size, and the purple curve below it is the amount of codecache used. At about 8am, the usage curve began to fluctuate and didn’t stabilize again until about 3am the next day.


Figure 6. Codecache graph from Datadog showing fluctuation

The codecache churn slowed response times. Figure 7 shows the median response times for every job search server in that data center. The outlier in orange is the JRE 8 server. It’s about 50% worse than the rest.


Figure 7. Job search web app median response times

The right size for our web app

Our new codecache size of 128MB was still only half of the Java 8 default. To try to address the performance problem, we decided to double the size again to 256MB. We hoped our codecache usage would level off below this size. If 256MB still wasn’t enough, we could try doubling once more. If that didn’t work, we would experiment with JVM optimizer settings to reduce the codecache usage.

One factor in our favor was our deployment schedule. We usually redeploy this web app at least once a week and often multiple times per week. Aside from the Christmas holidays, we never go two weeks without redeploying. We decided to configure a total of 8 instances with a 256MB codecache and let them run for at least a week. We didn’t detect any problems, so we declared it a success.

Why more codecache?

The default codecache size for JRE 8 is about 250MB, about five times larger than the 48MB default for JRE 7. Our experience is that JRE 8 needs that extra codecache. We have switched about ten services to JRE 8 so far, and all of them use about four times more codecache than before.

One big contributor to the 4x multiplier is tiered compilation. The JVM has two bytecode compilers: C1, optimized for fast startup, and C2, optimized for throughput of long-running processes. In earlier Java versions, C1 and C2 were mutually exclusive; you chose C1 with the ‑client switch or C2 with the ‑server switch. Java 7 introduced tiered compilation, which runs C1 at startup and then switches to C2 for the remainder of the process. Tiered compilation is intended to boost startup times for long-running server processes and, according to Oracle, can yield better long-term throughput than C2 alone. In Java 7 you have to explicitly enable it, but in Java 8 it is on by default. In QA, when I disabled tiered compilation on JRE 8, the codecache size dropped by 50%.

Java 8 also uses more codecache because it optimizes more aggressively than Java 7 does. For example, the InlineSmallCode option controls how small a previously compiled method must be to be eligible for inlining. Java 8 raised the default from 1000 to 2000, which means the Java 8 optimizer allows longer methods to be inlined than before. If you inline more methods, and the inlined methods are bigger, you need more codecache.

Do developers migrating to Java 8 need to worry about this?

If you don’t override the default codecache size and have 100MB per JVM to spare, you can probably ignore the codecache increase. Most Indeed services do not override the default codecache size. They will use more codecache on Java 8 than on Java 7, but the codecache is still a lot smaller than our average heap size.

If you do override the default codecache size but are not memory-constrained, it will probably suffice to remove the override. I recommend monitoring your codecache size just to be safe. This GitHub project includes an example that shows how to upload JMX codecache statistics to Datadog.

If you are already memory-constrained, you may want to measure how disabling tiered compilation affects your startup time, long-term performance, and memory consumption.


Acknowledgments

Connor Kelly and Mike Salsone in our Operations team supported this project, patiently putting up with lots of unusual requests.

Indeed software engineers David Wahler and Chen Yang helped me interpret the YourKit profiles. David also suggested using the perf command and helped me interpret the flame graphs.

References

Brendan Gregg is a Netflix engineer who specializes in performance measurement. His website is full of performance-related articles, and his talks on YouTube are worth watching. Brendan popularized using flame graphs for visualizing performance data. He also contributed JVM changes that made it possible to view Java symbols in perf data.

When I first learned about codecache, I checked the Oracle documentation for conceptual material and relevant JVM settings.
