Mesos at Indeed: Fostering Independence at Scale

Independent teams are vital to Indeed development. With a growing organization of over 600 engineers across nearly 100 teams, we strive to reduce the number of team dependencies. At Indeed, we let teams manage their own deployment infrastructure. This benefits velocity, quality, and architectural scalability.

Apache Mesos helps us eliminate operational bottlenecks and empower teams to more fully own their products.

The operations bottleneck

During Indeed’s early years, we manually provisioned and configured applications. For every new application, we sent a request to the operations team, which would then attempt to find a server with enough capacity to run the new application. If none could be found, operations would spin up new virtual machines or order additional servers. Provisioning a new application could take up to two months.

Subsequent deployments were faster and more self-service, but that first provisioning step was a definite problem. This led to developers optimizing for their own velocity at the expense of application design. Applications became bloated monoliths, as it was easier to bolt on new services than to undertake the time-consuming process of creating new applications. This didn’t scale. Something had to change.

Enter Mesos

Around three years ago, Indeed began using Mesos, which gave teams the freedom to configure, deploy, and monitor their applications themselves. Today, if an application needs more CPU, memory, or disk, the team adds it. The primary benefits of this are:

No gatekeepers. This increases velocity and the propensity for scalable architecture.

Teams know the performance profile of their application. Because teams must specify their own CPU, memory, and disk numbers, they become familiar with expected performance, troubleshooting, and areas for improvement.

Increased reliability. When a server goes down, applications are restarted elsewhere. Engineers no longer have to manage individual instances.

Indeed’s Mesos ecosystem

Mesos alone is not enough to reliably run applications. Our Mesos ecosystem incorporates many open source projects and in-house applications built to create a seamless experience for teams.

Marathon for daemons

We use Marathon to run daemons. Most Indeed developers are unaware of that, though. Since Indeed runs in ten data centers and Marathon can only run in one, we wrote a system called Marvin that coordinates deployment across all data centers. Developers independently specify their resource requirements, the version of the application to run, the number of instances, and in which data centers to run. An agent runs in each data center that compares the defined configuration with Marathon’s configuration. If they don’t match, the agent works directly with Marathon to scale up/scale down instances or initiate a new deployment.
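The comparison the per-data-center agent performs can be sketched as a small reconciliation function. This is a minimal illustration, not Marvin's actual code: the `AppSpec` fields, the action names, and the rule that a version change triggers a deployment rather than a scale operation are all assumptions. Resource requirements are left out for brevity, and the sketch makes no calls to Marathon.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class AppSpec:
    """Hypothetical per-data-center spec; field names are illustrative."""
    version: str
    instances: int


def plan_actions(
    desired: Dict[str, AppSpec], running: Dict[str, AppSpec]
) -> List[Tuple[str, str]]:
    """Return (action, app_id) pairs needed to make `running` match `desired`."""
    actions = []
    for app_id, want in sorted(desired.items()):
        have = running.get(app_id)
        if have is None or have.version != want.version:
            # Missing app or wrong version: start a new deployment.
            actions.append(("deploy", app_id))
        elif have.instances < want.instances:
            actions.append(("scale_up", app_id))
        elif have.instances > want.instances:
            actions.append(("scale_down", app_id))
        # Otherwise the app already matches and needs no action.
    return actions
```

An agent built this way would run `plan_actions` on a loop and hand each resulting action to Marathon.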

Our internal tool for batch and one-off jobs

We also run a large number of batch or “one-off” jobs. Marathon is not appropriate for this use case, so we built Orc, a Mesos Framework similar to Marathon that handles configuring and scheduling these jobs—tasks that previously fell on the operations team. Because we built the tool, we can make a number of optimizations, such as last-host affinity. This interacts nicely with our Resilient Artifact Distribution system so that jobs run close to the data they require. Developers can also schedule their jobs to run whenever they need to.

Orc, Indeed’s tool for configuring and scheduling batch and “one-off” jobs

HAProxy for load balancing

With Marathon constantly bringing up new instances on different servers and different ports, we needed an easy way to address these instances. We use HAProxy as a reverse proxy due to its well-known performance characteristics. We wrote a small application that discovers where daemons are running and generates HAProxy configurations to match. When the configuration changes, we try to dynamically update HAProxy using its Runtime API. If that’s not possible, we restart HAProxy using a seamless reload mechanism that ensures that no packets or requests are lost.
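As a rough sketch of the configuration-generation step, the function below renders an HAProxy `backend` section from a list of discovered instances. The function name, the server naming scheme, and the use of the `check` option are assumptions made for illustration; the discovery step and the Runtime API update are not shown.

```python
from typing import List, Tuple


def render_backend(name: str, instances: List[Tuple[str, int]]) -> str:
    """Render an HAProxy backend section for the given (host, port) pairs."""
    lines = [f"backend {name}"]
    # Sort so the same set of instances always yields the same configuration,
    # which makes it easy to tell whether anything actually changed.
    for index, (host, port) in enumerate(sorted(instances)):
        lines.append(f"    server {name}-{index} {host}:{port} check")
    return "\n".join(lines) + "\n"
```

Comparing the rendered text with the previously applied configuration is one simple way to decide whether HAProxy needs an update at all.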

Vault for configuration

Lastly, we required a robust way to configure our applications. Most applications at Indeed are configured via a simple, flat properties file. Prior to Mesos, we used Puppet to disseminate properties files to each data center, but this wasn’t self-service and there was a high degree of lag. We wanted to make it quick and easy for teams to securely configure their applications themselves, so we designed a system built around Vault, HashiCorp’s product for managing secrets. Before an application runs, we generate a short-lived token for retrieving the properties. We built a small Marathon plugin that does this for Marvin daemons, and we modified Orc to do this for batch jobs.

Result: Independent teams and scalable applications

All of these changes led to a 14% decrease in deployment time. They also reduced provisioning time from months to minutes and allowed our development teams to take more responsibility for their applications.

We’ve seen a sharp fall in configuration and deployment tickets, and we’ve reduced our average configuration ticket resolution time from 15.6 to 3.4 days. As a result, the operations team can focus on more pressing initiatives, like re-allocating resources to create a site reliability engineering practice.

We are now working toward Docker containerized deployments on Mesos. Developers will eventually roll their own Docker images, automatically scan them for vulnerabilities, and easily deploy their containerized apps on our cloud infrastructure. With these upcoming advances, we will continue enabling new capabilities on top of Mesos, allowing our engineering teams to independently create scalable applications.

Cross-posted on Medium.


Open Source at Indeed: Sponsoring the Python Software Foundation

At Indeed, we’re committed to taking a more active role in the open source community. Earlier this year, we joined the Cloud Native Computing Foundation. This week, we are pleased to announce that Indeed is sponsoring the Python Software Foundation.

We write lots of Python code at Indeed (it’s one of our major languages), so we benefit from a thriving Python ecosystem. Indeed is excited to join other industry leaders who support the Python Software Foundation. We believe Indeed has a lot to bring to the Python community, including participation, promotion, and sponsorships. Supporting the Python Software Foundation is a great place for us to start. We recognize that great open source software relies on engagement at all levels, and we are looking forward to becoming a steadfast supporter of the Python community.

As we continue to take a more active role in the open source community, Indeed will seek out additional partnerships, sponsorships, and memberships.

For updates on Indeed’s open source projects, visit our open source site. If you’re interested in open source roles at Indeed, visit our hiring page.

Cross-posted on Medium.


Improving Security with OAudit Toolbox

Trust is a big part of why Indeed continues to be the world’s number one job site. Users trust us to keep their information safe. We’ve always taken our responsibilities seriously, and as abuse of personal data continues to dominate the news, we cannot afford to let our diligence slip. That’s why we need to keep our own data safe: not just to protect Indeed employees and corporate data, but also to protect the data that our users entrust to us.

I work on Indeed’s Information Security team and in early 2017 started to address shortcomings in our Google GSuite implementation. Our solution — the OAudit Toolbox — is now available as an open source tool that you can use as well.

The problem: Third party apps can be risky

A major area of risk for Indeed’s GSuite implementation is integrating third party apps.

GSuite users can grant applications access to their account — anything from basic account info to full read/write access to Gmail. The OAuth Scopes presented in the authorization prompt control access to Google resources.

For businesses running GSuite, this presents a number of issues:

User education. Users may unknowingly grant access, or they might not understand the privacy or security implications of their choices.

Data sharing. Authorizing apps for certain scopes allows third parties access to sensitive data. Your business might not have the proper data sharing contracts in place for this kind of access.

Data exfiltration. Malicious applications can use this OAuth flow to effectively phish and exfiltrate data from Google accounts.

Limited tooling. At the time of my investigation, Google’s options for restricting apps were limited: the API only allowed for reactive blacklisting. The relatively new feature allowing for whitelisting of connected apps requires that you actually have a whitelist of apps.

Policy culture. Many companies allow overly permissive scope access. Implementing Google’s new whitelist functionality could require your security team to review, approve, and whitelist a giant backlog of apps — not to mention fielding requests from employees who feel their productivity depends on Sketchy Mail Tracker Pro™.

It wasn’t long before we had a real-world attack that highlighted these issues and helped us define our own solution.

The proof: A massive phishing attack

In May 2017, a massive phishing attack masquerading as a Google Docs invitation hit Google Apps users all over the world.


Users received an email, apparently sent from one of their contacts, requesting that they view a shared document. If they worked for a business using GSuite, this wasn’t an unusual request. What was unusual was this fake “Google Docs” requesting permission to access the recipient’s email and contacts. If users granted this access, the fake “Google Docs” app then sent the same phishing email onward to the victim’s contacts, using the victim’s account and masquerading as them.

As part of Indeed’s Security team, I was on the front lines of our company’s response to this attack. After several hours of panic and revocation of OAuth tokens, I felt oddly inspired. I wanted to fill in the gaps in GSuite’s available tooling, detect attacks like this one sooner, and find a way to educate users about the dangers of authorizing third party apps. Along with my coworker Dustin Decker, I started work on a set of tools that might get us closer to a perfect solution.

Our solution: OAudit Toolbox

OAudit Toolbox is a set of tools that detects third party app integrations and notifies users of their danger and scope.

How OAudit Toolbox works

OAudit Toolbox contains two major components:

  • Oaudit-collector indexes authorization events from the Google Admin API into Elasticsearch
  • Oaudit-notifier sends notifications with educational information about OAuth scopes and contains whitelisting/blacklisting logic

Blacklisting allows us to revoke access to a list of known bad apps in near-real time, including malicious apps and apps in violation of corporate policy. Access is revoked only for authorizations that occur after the app has been added to the blacklist.

Whitelisting allows us to stop notifications for trusted apps so that users don’t suffer alert fatigue.

  1. Oaudit-collector fetches authorization token event data using the Google Admin SDK API.
  2. Oaudit-collector indexes fetched data in Elasticsearch.
  3. Oaudit-notifier checks to see whether the authorized application is whitelisted, blacklisted, or unknown.
    • If the app is whitelisted, a notification is not sent. This is typically used for apps that have passed security review, and in the case of third party apps, have the appropriate data sharing agreements in place (if applicable).
    • If the app is blacklisted, a notification is sent to the user that the app is blacklisted and access has been revoked. This is used for malicious apps (such as the ones seen in the 2017 phishing attack) and apps that go against corporate policy.
    • Apps that are neither blacklisted nor whitelisted trigger a notification to the user informing them of the potential risk of authorizing untrusted applications and how to revoke unwanted access.
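The decision in step 3 can be sketched as follows. This is an illustrative outline rather than the Oaudit-notifier source: the names `Verdict`, `classify`, and `decide` are hypothetical, and giving the blacklist precedence when an app appears on both lists is an assumption.

```python
from enum import Enum
from typing import Dict, Set


class Verdict(Enum):
    WHITELISTED = "whitelisted"
    BLACKLISTED = "blacklisted"
    UNKNOWN = "unknown"


def classify(client_id: str, whitelist: Set[str], blacklist: Set[str]) -> Verdict:
    """Classify an authorized app by its OAuth client ID."""
    # Assumption: the blacklist wins if an app appears on both lists.
    if client_id in blacklist:
        return Verdict.BLACKLISTED
    if client_id in whitelist:
        return Verdict.WHITELISTED
    return Verdict.UNKNOWN


def decide(client_id: str, whitelist: Set[str], blacklist: Set[str]) -> Dict[str, object]:
    """Map a verdict to the actions described in the steps above."""
    verdict = classify(client_id, whitelist, blacklist)
    return {
        "verdict": verdict,
        # Only blacklisted apps have their access revoked.
        "revoke": verdict is Verdict.BLACKLISTED,
        # Whitelisted apps are silent; everything else notifies the user.
        "notify": verdict is not Verdict.WHITELISTED,
    }
```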

How OAudit Toolbox helps

OAudit Toolbox solves, or at least mitigates, each of the issues I identified with using GSuite.

Solution: User education

There are a million reasons why users might authorize third party apps:

  • A manager told them to
  • They don’t understand the authorization prompt
  • They haven’t heard of all the fun acronyms like DPA and MNDA that help us more securely share data with third parties
  • And many more!

With OAudit enabled, users receive an easy-to-understand visualization of the risks associated with each third party app, including user-friendly descriptions of the risks. Each scope is assigned a score based on the risk of sharing sensitive data, and highlighted with an associated color.
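To make the scoring idea concrete, here is a small sketch of how scopes might be mapped to a score and a color. The scope URLs are real Google OAuth scopes, but the scores, the color thresholds, and the default for unlisted scopes are invented for illustration and are not the values OAudit Toolbox uses.

```python
from typing import List, Tuple

# Hypothetical risk scores on a 1-10 scale; higher means more sensitive data.
SCOPE_RISK = {
    "https://mail.google.com/": 10,                       # full Gmail access
    "https://www.googleapis.com/auth/drive": 8,           # full Drive access
    "https://www.googleapis.com/auth/userinfo.email": 1,  # email address only
}
DEFAULT_RISK = 5  # Assumption: unlisted scopes get a middling score.


def color_for(score: int) -> str:
    """Pick a highlight color for a risk score (thresholds are illustrative)."""
    if score >= 8:
        return "red"
    if score >= 4:
        return "yellow"
    return "green"


def score_scopes(scopes: List[str]) -> List[Tuple[str, int, str]]:
    """Return (scope, score, color) rows, riskiest first."""
    rows = [(scope, SCOPE_RISK.get(scope, DEFAULT_RISK)) for scope in scopes]
    rows.sort(key=lambda row: (-row[1], row[0]))
    return [(scope, score, color_for(score)) for scope, score in rows]
```

Listing the riskiest scope first puts the most important warning at the top of the notification.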

After enabling this feature, we saw a significant increase in questions from our users about whether apps are safe to use. We received more requests for our application security team to review third party apps. We were also contacted by teams outside of our engineering organization. Nontechnical users had previously felt uneasy about using some third party tools but lacked the technical or security context to explain why.

Solution: Data sharing

Users add tools to Google Apps to do things like improve spreadsheet visualizations, enable mail marketing campaigns, or send themselves Google Form results.

To the typical user, it’s not obvious that these integrations exist on a third party’s server rather than within GSuite itself. Some users assume that Google thoroughly vets each app. Other users don’t realize that once this data resides on third party equipment, they have no control over it beyond contractual agreements regarding further usage and sharing. For companies that must comply with GDPR, this issue goes beyond security and becomes regulatory.

Using the OAudit Toolbox has helped us socialize the concept of high-risk data sharing. At the same time, we have been able to work with our privacy and contracts teams to get the appropriate agreements in place where needed and revoke access to those apps that don’t pass our assessments. Retroactively revoking access to unapproved (but non-malicious) apps using the blacklist has been reasonably effective, as the expected functionality of these apps does not involve immediate data exfiltration.

Solution: Data exfiltration

While there are legitimately useful third party apps, there are also malicious apps masquerading as useful tools, such as the “Google Docs” app involved in the phishing attack of 2017.

You can always revoke a malicious app’s access, but doing so is minimally helpful because these apps are likely to exfiltrate data and/or abuse your account as soon as possible. The reliability of this method also depends on the timeliness of token logs, which were as much as 12 hours behind during the Google Docs phishing attack. More recently, token activity lag time has been between 1 and 10 minutes, though Google claims this can be as much as a few hours.

Since OAudit Toolbox also sends data to Elasticsearch, we recommend setting up ElastAlert or Watcher to detect never-before-seen apps being authorized or a spike in a single app being authorized in a short period of time.
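The two detections can be illustrated in plain Python, independent of any alerting product. This sketch assumes you have already pulled the client IDs of apps authorized during one time window out of Elasticsearch; the function name and the threshold parameter are hypothetical, and it is not an ElastAlert or Watcher rule.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def find_suspicious(
    window_client_ids: Iterable[str], known_apps: Set[str], spike_threshold: int
) -> Tuple[List[str], List[str]]:
    """Return (never-before-seen apps, spiking apps) for one time window."""
    counts = Counter(window_client_ids)
    # Apps authorized in this window that were never seen before it.
    never_seen = sorted(app for app in counts if app not in known_apps)
    # Apps authorized at least `spike_threshold` times within the window.
    spiking = sorted(app for app, n in counts.items() if n >= spike_threshold)
    return never_seen, spiking
```

An app that lands in both lists, as the fake “Google Docs” app would have, is a strong candidate for immediate review.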

Because it warns users about dangers rather than blocking authorizations, OAudit Toolbox is more like an intrusion detection system (IDS), an early alerting system for malicious apps, and less like a proactive blocker such as an intrusion prevention system (IPS).

Solution: Available tooling

When development on OAudit Toolbox started, the ability to block app access was scattered across GSuite. A Google Apps administrator could block Marketplace Apps, Drive API, and Chrome extensions. That would have been too heavy-handed for us to successfully implement. Administratively, it was difficult to manage with minimal ROI since the solution was incomplete.

Google now allows you to block OAuth access to Google Apps such as Drive, Gmail, and Contacts using the Google Admin Security panel. Optionally, you can block only “high risk” scopes, but there is no documentation as to which scopes are considered high risk. There is also a whitelist available to allow use of trusted apps.

This is a useful addition if and when your team is ready and able to:

  • Block all apps in use
  • Know what you need to whitelist
  • Have a workflow in place for approving new applications
  • Have application security resources for reviewing those apps

We were glad to see Google take these steps, but OAudit Toolbox provided us a solution that was easier to implement and less disruptive to workflow.

Solution: Policy culture

A major hurdle for implementing this tool and the subsequent review process wasn’t technical, but human. Going from an open, “bring your own app” culture to a more restrictive, seemingly bureaucratic process is a struggle — especially when transparency is a keystone of company culture.

We found a few practices to be helpful:

  • Use tools (such as OAudit Toolbox), training, tech talks, or internal blog posts to introduce users to the risk posed by third party apps.
  • Instead of adding a pile of legalese to your Acceptable Use policy, keep any actions users need to perform simple and available on an easily accessed page: the intranet home page, a wiki page, or an IT Support landing page.
  • Bucket frequently used, high-risk tools into categories such as “Mail Merge” or “Spreadsheet automation.” Doing so makes it easier to rip-and-replace them with a single, approved app rather than evaluate 50 unique apps.
  • Accept that you might not solve the problem overnight. Even if you can only mitigate the problem by allowing continued access to some apps and approving or restricting new apps, that still puts you in a better place than you were before.

Get started with OAudit Toolbox

We’ve made OAudit Toolbox open source and available so you can benefit from our experience. By integrating it with your company’s GSuite implementation, you can take similar steps to improve organizational education and, hopefully, better resist the next phishing attack. Let’s stay ahead of these malicious actors and keep the web, or at least our little corners of it, safer.

Cross-posted on Medium.
