Finding Anomalies in User Behavior with Python

In the course of helping over 200 million unique visitors every month find jobs, we end up with a lot of data. The data we collect can tell us a lot about the behavior of our users, and for the most part we observe predictable patterns in that behavior. But unexpected changes could be evidence of failures in our system or an actual shift in user behavior. When the data shows something strange, we want to understand why.

Identifying and acting on anomalies in user behavior is a complex problem. To help detect these anomalies, we take advantage of several open-source libraries, chief among them Twitter’s AnomalyDetection library.

Observing anomalies in user behavior

Detecting anomalies in a large set of data can be easy if that data follows a regular, predictable pattern over time. If we saw the same range of pageviews for our job listings every day, as simulated in Figure 1, it would be easy to identify outliers.

Figure 1. Single outlier on Day 10
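
To make the easy case concrete, here is a minimal sketch in Python that flags the Day 10 spike using a simple three-standard-deviations rule. The data and the threshold are simulated for illustration, much like Figure 1; this is not our production code.

```python
import numpy as np

# Simulated daily pageviews: steady around 1,000, with a spike on day 10
pageviews = np.array([1003, 998, 1001, 997, 1002, 999, 1000, 1004, 996, 1500,
                      1001, 998, 1002, 1000, 997])

# Flag any day more than three standard deviations from the mean
deviations = np.abs(pageviews - pageviews.mean())
outliers = np.where(deviations > 3 * pageviews.std())[0]

print(outliers)  # [9] -- the zero-indexed position of Day 10
```

This naive approach only works because the spike is the lone disturbance in an otherwise flat series; the sections below deal with data that is never this well behaved.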

But most of the data we collect is determined by user behavior, and such data does not follow patterns that we can easily observe. A variety of factors influence user behavior. For example, each of the following factors might affect what we consider a “normal” range of pageviews, depending on the geographic location of the user:

  • What day of the week is it?
  • What time of day is it?
  • Is it a holiday?

We might understand some factors in advance. We might understand others after analyzing the data. We might never fully understand some.

Our anomaly detection should account for as many of these variations as possible, yet still be precise enough to identify statistically significant outliers. Simply saying “traffic is normally higher on Monday morning” is too fuzzy: How much higher? For how long?

Figure 2 shows a variable range of expected data, while Figure 3 shows a range of actual data. Anomalies within the actual data are not immediately visible.

Figure 2. Variable range of expected data over time

Figure 3. Variable range of actual data over time

Figure 4 shows the actual and expected data overlaid. Figure 5 shows the difference between the two at any point in time. Viewing the data in this way highlights the significant anomaly in the final data point of the sequence.

Figure 4. Expected and actual data overlaid

Figure 5. Difference between actual and expected data, with the largest anomaly at the final point in the sequence
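
Producing a plot like Figure 5 is mechanical once an expected series exists: subtract it from the actual series and flag points where the residual is large. A minimal sketch, assuming both series are numpy arrays of equal length and using a purely illustrative threshold:

```python
import numpy as np

def flag_anomalies(actual, expected, threshold):
    """Return indices where actual diverges from expected by more than threshold."""
    residual = actual - expected  # the series plotted in Figure 5
    return np.where(np.abs(residual) > threshold)[0]

# Toy example: the final point diverges sharply from expectation
expected = np.array([120.0, 150.0, 130.0, 170.0, 160.0])
actual   = np.array([118.0, 154.0, 127.0, 173.0,  60.0])

print(flag_anomalies(actual, expected, threshold=50.0))  # [4]
```

The hard part, of course, is producing the expected series and choosing the threshold.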

We needed a sophisticated method to quickly identify and report on anomalies such as these. This method had to be able to analyze historical data to predict the expected range of future data. And it had to be accurate enough that we wouldn’t miss anomalies requiring action.

Step one: Solving the statistical problem

The hard mathematical part was actually easy, because Twitter had already solved the problem and open-sourced its AnomalyDetection library.

From the project description:

AnomalyDetection is an open-source R package to detect anomalies which is robust, from a statistical standpoint, in the presence of seasonality and an underlying trend. The AnomalyDetection package can be used in wide variety of contexts. For example, detecting anomalies in system metrics after a new software release, user engagement post an A/B test, or for problems in econometrics, financial engineering, political and social sciences.

To create this library, Twitter started with an extreme studentized deviate (ESD) test, also known as Grubbs’ test, and improved it to handle user behavior data. The original test measures how far each point falls from the mean of the dataset. Twitter’s developers found that the median, which is far less sensitive to extreme values, gave more reliable results for web data, where user behavior can be volatile over time.
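
The effect of that change is easiest to see in code. The following sketch implements a generalized ESD test in Python with the mean and standard deviation swapped for the median and median absolute deviation (MAD), in the spirit of Twitter’s modification; it illustrates the idea and is not the library’s actual implementation.

```python
import numpy as np
from scipy import stats

def robust_esd(series, max_outliers, alpha=0.05):
    """Generalized ESD test using median/MAD in place of mean/std."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    index = np.arange(n)          # track positions in the original series
    candidates, num_anoms = [], 0
    for i in range(1, max_outliers + 1):
        median = np.median(x)
        mad = 1.4826 * np.median(np.abs(x - median))  # MAD rescaled to std units
        if mad == 0:
            break
        deviations = np.abs(x - median) / mad
        j = int(np.argmax(deviations))
        r_i = deviations[j]
        candidates.append(index[j])
        x, index = np.delete(x, j), np.delete(index, j)
        # Critical value lambda_i from the t distribution (standard ESD formula)
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        if r_i > lam:
            num_anoms = i
    return candidates[:num_anoms]  # indices of detected anomalies
```

Because the median and MAD barely move when a handful of extreme points are added, the test stays reliable even when the anomalies themselves would have dragged the mean and standard deviation around.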

The result is a library that makes it easy to quickly identify anomalous results in time-based datasets with variable trends. Twitter’s data scientists use it to perform their own internal analysis and reporting. For example, they report on tweets per second and the CPU utilization of their internal servers.

Twitter’s library allowed us to use historical data to model a highly complicated set of user behaviors and quickly identify anomalies. There was only one problem: Twitter wrote the library in R, while our internal alerting systems are implemented in Python.

We decided to port Twitter’s library to Python so that it would work directly with our code.

Step two: Porting the library

Of course, porting code from one language to another always involves some refactoring and problem solving. Most of our work in porting the AnomalyDetection library dealt with differences between how math is supported in R and in Python.

Twitter’s code relies on several math functions that are not natively supported in Python. The most important is a seasonal decomposition algorithm called seasonal-trend decomposition using LOESS (STL).

We were able to incorporate STL from the open-source pyloess library. We found many of the other math functions that Twitter used in the numpy and scipy libraries. This left us with only a few unsupported math functions, which we ported to our library directly by reading the R code and replicating the functionality in Python.
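
To sketch how the pieces fit together, the example below strips out seasonality with an STL decomposition and then runs the robust ESD test on what remains. It uses the STL implementation from statsmodels as a stand-in for pyloess, and it reuses the robust_esd function from the earlier sketch; the residual construction follows the general shape of Twitter’s approach rather than reproducing our port line for line.

```python
import numpy as np
from statsmodels.tsa.seasonal import STL

def seasonal_esd(series, period, max_outliers, alpha=0.05):
    """Remove seasonality via STL, then apply the robust ESD test."""
    series = np.asarray(series, dtype=float)
    decomposition = STL(series, period=period, robust=True).fit()
    # Subtract the seasonal component and the overall median, leaving a
    # residual series that the ESD test can treat as roughly stationary
    residual = series - decomposition.seasonal - np.median(series)
    return robust_esd(residual, max_outliers, alpha)

# Example: hourly pageviews with a daily cycle would use period=24
```

This decompose-then-test structure is what lets a detector distinguish “Monday mornings are always busy” from “this Monday morning is busier than it has any right to be.”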

Taking advantage of the excellent work done by our neighbors in the open-source community allowed us to greatly reduce the effort required in porting the code. By using the pyloess, numpy, and scipy libraries to replicate the R math functions Twitter used, one developer completed most of the work in about a week.

Open-source Python AnomalyDetection

We participate in the open source community because, as engineers, we recognize the value in adapting and learning from the work of others. We are happy to make our Python port of AnomalyDetection available as open source. Download it, try it out, and reach out to us on GitHub or Twitter if you need any help.