
Scaling Software Platforms: Moving to gRPC Java

This talk was held on Wednesday, March 13, 2019

Senior Software Engineer Mya Pitzeruse explores the challenges she faced while considering a large-scale move from Indeed’s proprietary services framework to gRPC. She describes her process of establishing testing parameters, building baseline metrics, simulating workloads, and the lessons learned from this investigation.


KRISTEN STARR: So first, I'd like to welcome Mya Pitzeruse to talk to us about gRPC Java. Mya is a senior software engineer working on service infrastructure. She's involved in several ongoing initiatives to improve Indeed's infrastructure and capabilities. One such effort is the migration from Indeed's proprietary services framework to gRPC.

Mya first joined Indeed in June 2013 working on employer products. And in 2017, she moved to Indeed's Service Infrastructure team full-time. Welcome, Mya.

MYA PITZERUSE: Hi, everyone. Today I'm here to talk to you about why Indeed is moving to gRPC Java. As Kristen had said, my name is Mya. I'm a senior software engineer working on service infrastructure. I'm the technical lead for our Service Architecture team. My team is responsible for the development of tools, services, and frameworks that drive Indeed's service-oriented architecture.

Historically, Indeed has built its systems using Boxcar, an internal distributed service protocol. It's known for its reliability, high throughput, and low latency. We primarily support Java and Go implementations, but we do have some secondary support for languages like Python and Node.

About a year and a half ago, we first started to explore integrations with gRPC. gRPC, for those of you who are unfamiliar, is an open source RPC framework. It came out of Google and is backed by HTTP/2 for the underlying transport. Out of the box, they support 12 different languages, compared to our four internally.

Recently they celebrated their first birthday. Now, I know what some of you might be thinking: why would we explore integration with such a young technology? With gRPC having only just celebrated its first birthday, there may be some risk in adopting it.

Well, companies like Google, Microsoft, and Netflix are all adopting gRPC internally. They're starting to build their systems and services using this piece of open source technology. This means that from an onboarding perspective, we can now hire engineers from industry with prior experience on this framework. From an engineering perspective, it means we have one less component of our infrastructure to maintain ourselves, and we're able to leverage an entire community of engineers to drive our service infrastructure.

Currently, engineers spend about a month in onboarding, learning about the different tools, technologies, and systems at Indeed, one of those components being Boxcar. Imagine joining a company where you don't have to learn a proprietary way of building systems. You're able to just leverage the tools and technologies that you already know.

We built our initial integration with gRPC. We started going around to different product teams and working with them to adopt our platform. As we had these discussions, a common theme arose among the teams we talked to, as well as some of the technical leaders. They wanted to know how Indeed's Boxcar framework compared to the open source gRPC one.

They were aware of the various features and additional capabilities that they would get from adopting an open source technology, but they were more concerned about the performance of their systems. They had SLOs in place, Service Level Objectives, that they were consistently shooting for, and they maintained agreements with other teams that they didn't want to violate.

At the time, my team had a very, very weak understanding of how these two systems really compared. We had a few test processes that vetted our integration, but beyond that, we didn't have anything sending production-like traffic.

And so we went back to the drawing board. We started to develop a system that would allow us to effectively test both of these frameworks head to head. Today, we'll go through and discuss the system that we built and our whole methodology for how we actually compared and vetted these two frameworks.

First, I'll provide an overview of the system that we built. Then, I'll enumerate the various test parameters that we were able to adjust throughout the course of our experimentation. After that, I'll demonstrate how we built a baseline off of our production traffic and how we were able to use it to vet and compare these two frameworks. We'll simulate a few of our different production workloads and monitor how changes to configuration impact the underlying performance. And finally, we'll reflect on what we learned throughout this process and offer some parting tips that you can bring back to your own company.

First, Indeed built a system called Hyperloop. It was originally developed to exhaustively test low-level components of Boxcar before a release to production. Some bugs that we've encountered in the past only appeared after letting a system run for an extended period of time, so we were able to leverage this framework to vet the system before releasing it.

With a few modifications, we were able to stand two copies of the system up side by side -- one dedicated to running the Boxcar transport, and the other to running gRPC. To monitor the system, we leveraged an open source tool called Prometheus. Prometheus is an open source time-series database that works by scraping different data backends for application-level metrics.

At Indeed, we use StatsD to emit this information in production. And so we were able to leverage the StatsD exporter provided by the Prometheus community and translate our metrics into something that the server could understand.

Hyperloop by itself is a rather simple process. We have a consumer and a service. The service can be tuned to reflect any latency that we see on a given production system. The consumer can be tuned to reflect any production workload that we see, inducing more or less load depending on a given set of configuration.
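As a rough illustration of the kind of tunable consumer described here -- a sketch, not Hyperloop itself -- the worker count, batch size, sleep time, and the issueCall() hook below are all placeholders:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of a tunable load generator: each worker issues a batch of calls,
// sleeps, and repeats. issueCall stands in for whatever stub invocation the
// real consumer makes; latency and request rates are recorded elsewhere.
public final class ConsumerLoop {
  public static void run(int workers, int batchSize, long sleepMillis, Runnable issueCall) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    for (int i = 0; i < workers; i++) {
      pool.submit(() -> {
        while (!Thread.currentThread().isInterrupted()) {
          for (int call = 0; call < batchSize; call++) {
            issueCall.run();
          }
          try {
            TimeUnit.MILLISECONDS.sleep(sleepMillis);
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }
      });
    }
  }
}
```

The runs described later in the talk (32 workers, batches of 2 to 10 calls, 100 milliseconds of sleep) map onto exactly these three knobs.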

Both consumers and services emit stats to their local exporters, which helps inform us about what each side actually perceives. The information we collect in Prometheus is then downloaded and rolled up locally for evaluation.
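Indeed's exporters are internal, but as one hedged example of where client-perceived latency could be captured in gRPC Java, a ClientInterceptor can time every call from creation to close; recordLatency() below is a placeholder for whatever StatsD or Prometheus client is in use.

```java
import io.grpc.CallOptions;
import io.grpc.Channel;
import io.grpc.ClientCall;
import io.grpc.ClientInterceptor;
import io.grpc.ForwardingClientCall.SimpleForwardingClientCall;
import io.grpc.ForwardingClientCallListener.SimpleForwardingClientCallListener;
import io.grpc.Metadata;
import io.grpc.MethodDescriptor;
import io.grpc.Status;

// Times each client call from creation to completion and hands the result
// to a metrics sink (StatsD, a Prometheus client, etc.).
public final class LatencyInterceptor implements ClientInterceptor {
  @Override
  public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
      MethodDescriptor<ReqT, RespT> method, CallOptions callOptions, Channel next) {
    long startNanos = System.nanoTime();
    return new SimpleForwardingClientCall<ReqT, RespT>(next.newCall(method, callOptions)) {
      @Override
      public void start(Listener<RespT> responseListener, Metadata headers) {
        super.start(new SimpleForwardingClientCallListener<RespT>(responseListener) {
          @Override
          public void onClose(Status status, Metadata trailers) {
            long elapsedMillis = (System.nanoTime() - startNanos) / 1_000_000;
            recordLatency(method.getFullMethodName(), elapsedMillis);
            super.onClose(status, trailers);
          }
        }, headers);
      }
    };
  }

  // Placeholder hook: emit a StatsD timer or observe a Prometheus histogram here.
  private static void recordLatency(String methodName, long millis) {}
}
```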

Once we had a system that we were able to monitor, test, and mutate, we wanted to identify which parameters were within the scope of our testing. Boxcar had a rather simple story: clients would specify the number of connections they would open to a given service, and services would specify the maximum number of connections they were willing to accept. But as we dug through the gRPC client and server code, we found many more knobs, and we didn't want to expose all of them to our application engineers. They really only wanted to know the things they needed to tune to keep their applications performant.

On the client, we identified two key components. The first is the executor. The executor is responsible for any communication with out of band load balancers, like the gRPC load balancer. Because Indeed doesn't leverage this component, we actually won't talk too much about how it's tuned in this talk. But it's important to be aware that it is there, should you choose to leverage this implementation.

The second component is an event loop group. The event loop group is responsible for all data being read from and written to the underlying connections. If you choose to leverage the Google API Extensions (GAX) library, you may also choose to leverage a channel pool.

And a channel pool is just a composition of more managed channels. It works by enabling you to scale out the number of connections open to a given backend. By default, gRPC only opens one connection to each backend service. So you can choose to leverage this to increase your throughput.
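Pulling those client-side knobs together, here's a minimal sketch of how they surface on grpc-netty's NettyChannelBuilder; the host, port, and pool sizes are purely illustrative:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;
import java.util.concurrent.Executors;

// Sketch of the client-side knobs on grpc-netty's channel builder.
// Host, port, and pool sizes are illustrative, not recommendations.
final class TunedChannel {
  static ManagedChannel build(String host, int port) {
    return NettyChannelBuilder.forAddress(host, port)
        // Executor: out-of-band work such as load-balancer communication.
        .executor(Executors.newFixedThreadPool(4))
        // Event loop group: all reads and writes on the underlying connection.
        // Sized explicitly rather than relying on the core-count-based default.
        .eventLoopGroup(new NioEventLoopGroup(2))
        // A custom event loop group needs a matching channel type.
        .channelType(NioSocketChannel.class)
        .usePlaintext()
        .build();
  }
}
```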

On our service side, we have three primary components. The boss event loop group is responsible for accepting all new connections coming into the system. If you use TLS, it then handshakes the connection. And when the connection is stable, it hands it off to the worker pool.

The worker event loop group is similar to your client-side event loop group. This one's responsible for all the I/O on the server side, so any reads and writes to underlying connections go through this group.

When a full request is read, it's handed off to your executor. The executor is responsible for directly invoking your backend service methods.
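For reference, a minimal sketch of how those three server-side components surface on NettyServerBuilder; the thread counts and the service are placeholders, not recommendations:

```java
import io.grpc.BindableService;
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import java.util.concurrent.Executors;

// Sketch of the three server-side knobs; sizes and service are placeholders.
final class TunedServer {
  static Server build(BindableService service, int port) {
    return NettyServerBuilder.forPort(port)
        // Boss group: accepts (and, with TLS, handshakes) incoming connections.
        .bossEventLoopGroup(new NioEventLoopGroup(1))
        // Worker group: all reads and writes on established connections.
        .workerEventLoopGroup(new NioEventLoopGroup(4))
        // Custom event loop groups need a matching channel type.
        .channelType(NioServerSocketChannel.class)
        // Executor: invokes the application's service methods once a request is read.
        .executor(Executors.newFixedThreadPool(8))
        .addService(service)
        .build();
  }
}
```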

Once we knew which components we wanted to tune and which we expected to impact performance, we started to build our baseline off of production traffic. But as we started to build this baseline, some larger questions arose. What was the scale of request loads that we were working with? Were we talking thousands, tens of thousands, hundreds of thousands? We weren't sure.

What did the distribution of these request loads look like across all of our applications? Each team knew roughly what their own systems did, but nobody had a grand view of every system at Indeed. And lastly, what latency did clients see as they performed these various workloads?

To answer the question about the scale of requests on our service side, I was able to pull the top 25 services at Indeed by request rate. Docservice, one that we quite often talk about in open source and tech talks, came in at number 4, with 32,000 operations a second. If we were to break that down across all the instances receiving traffic at the time, you can roughly figure that each instance of Docservice handled about 638 operations per second.

But Docservice wasn't our fastest. We had a top service that was at 55,000 operations a second, and there were far fewer instances of that application receiving traffic, which brought its per-instance request rate up to about 2,205 operations per second. But this only answered half the problem: we knew what request rates our services saw, but, as we've seen here with different numbers of instances, individual clients might be issuing very different request loads.

We then took some production metrics and graphed client applications not only by their request rates, but also by the number of connections they had open. On the x-axis, we have the number of connections an application has open; on the y-axis, its request rate. Each dot represents an application in our infrastructure.

On the far right, we can see some more cron-like tasks that have higher request rates. They spin up large connection pools, do a lot of work in a very short period of time, and everything gets torn down at the end. It makes sense for that kind of pattern.

But the other important part to note is where the majority of the dots lie: below 500 operations per second. So we could effectively use this information to get a rough idea of the distribution of systems in our production ecosystem. Knowing that most systems were under 500 operations per second, we had a pretty easy bar to target.

Docservice was on the higher end at 638, that top service at 2,205, and finally that highest client at 4,500. But what about latency? Request rates weren't too bad to get, but the latency was difficult to track down.

As we looked at existing production ecosystems, we found each data center had a different baseline for latency. Some data centers were much slower than others. And so we weren't able to leverage that information to build a baseline. It led us to actually do more of a comparative analysis using our clean room environment.

Once we knew what kind of rates we were looking to replicate, both on the client and on the server, we were able to start simulating. For our starting Boxcar configuration, we just picked whatever people set most often in production, which was about 48 connections.

But not having a baseline for the gRPC configuration, we ended up digging through some code. The event loop groups were sized off of the number of cores in your system. When you have a statically provisioned server, this works great: four cores, four threads.

When leveraging something like Mesos, where you have a clustered environment, the number of cores your system reports is the number of cores on the underlying host, not the number of cores you've actually been allocated. And so we didn't want these pools to spin up more threads or connections than we needed.
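As a small, hypothetical illustration of that point, the pools can be sized from an explicit allocation rather than from the reported core count (the environment variable name here is made up):

```java
import io.netty.channel.nio.NioEventLoopGroup;

// availableProcessors() can report the host's core count rather than the
// task's allocation, so size the pools from an explicit setting instead.
// "ALLOCATED_CPUS" is a hypothetical variable; use whatever your scheduler exposes.
final class EventLoopSizing {
  static NioEventLoopGroup workerGroup() {
    int allocatedCores =
        Integer.parseInt(System.getenv().getOrDefault("ALLOCATED_CPUS", "1"));
    return new NioEventLoopGroup(allocatedCores);
  }
}
```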

And so we assumed a single core per system. Picking some relatively small numbers to start with, we figured 32 workers issuing two requests and then sleeping for 100 milliseconds wouldn't get us very far. It wound up getting us 639 operations per second.

We were able to match request rates on the server. And then even when we looked at the 90th percentile on the client, the client-side response time was identical at 11 milliseconds. And that was for the full duration of an hour.

The way these tests ran is that we ran the system for about a two-hour window and then took an hour of that time frame to analyze the data. Next, we went up to four calls per batch, so we'd issue four calls and then sleep for 100 milliseconds. This time, 1,278 operations per second -- about double, more or less what we'd expect. Again, over that hour, the response time held at 11 milliseconds.

So we increased to six. At six, we were able to get up to 1,917 operations per second with no change in configuration. But this time we actually saw a slight deviation in our client-side response times. Since we were only talking about a millisecond, we didn't want to stop and fuss, because it could just be a network blip. So we pushed forward and increased our batch size again to eight. And this time, we actually saw our client-side request rates differ: Boxcar at 2,555 operations a second and gRPC trailing slightly at 2,551.

When we looked at these response times, we noticed that they sat a little bit farther apart -- a one- to two-millisecond difference. We also noticed that they sat a little bit higher and came down closer to the end. The graph may look a little funny, like a step function, but the difference we're dealing with here is only one or two milliseconds, so there isn't really much room to shift.

We increased again to 10. This time, Boxcar topped out at 2,909 operations per second, while gRPC trailed at 2,671, again sitting closer to that two-millisecond difference. So we learned that as we increased load on the system, we started seeing more latency, likely because one of these pools was constrained.

When we did our final test, we noticed that we actually started to top out. Earlier increments would go up by about 600 operations per second, and our last few were only going up by a few hundred. And this pointed to an implementation bottleneck that we had, one that was almost intentionally put in there.

With gRPC at 12 milliseconds a call, 32 workers issuing calls back to back could complete at most about 32 × (1,000 / 12), or roughly 2,666 operations per second; with Boxcar at 11 milliseconds, roughly 2,909. And so we decided to look at our numbers and go back to the point just before our system was under contention -- that 2,555 operations per second. The first thing we noticed was that the server was trailing in request rates. So what we decided to do, noticing that, was increase the server's event loop group. When we did, we brought the request rates back up to match that of Boxcar.

But one of the problems that we still saw was that the client still had some latency. And so we tried increasing the client-side pools as well. After increasing them twice, we noticed it wasn't really making much of a difference in the actual underlying performance of the system. And so we accepted that the one millisecond we had seen may just be the cost of adopting this open source framework.

Now, our first pass of testing covered three of our use cases, but there was still that one outlier: the 4,500 operations per second. And so we decided to double the number of worker threads, knowing that we were at the limit for the number of calls we could issue per batch. With that configuration, we were able to get up to 4,200 operations a second, pretty close to that top number we saw.

When we looked at the 90th and 99th percentile response times, we noticed that the Boxcar implementation actually trailed much further behind the gRPC one. With Boxcar topping out at about 62 milliseconds and gRPC at 35, we could easily see the benefits at the tail end of things.

Since we knew we were already putting the consumer under a lot of stress, we decided to go through and see how changing the number of channels impacted the performance in relation to changing the number of event loop groups, keeping in mind that increasing the number of channels that you have open will also increase the number of event loop groups.

And what we learned was that, even after increasing both of these, there was no real benefit to leveraging a second connection. You could get just as far by tuning your event loop group, and you use much less memory if you only hold a single connection open.
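Conceptually, a channel pool like the one provided by the GAX library is just several managed channels behind a selection policy. A minimal round-robin sketch, for illustration only:

```java
import io.grpc.ManagedChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Conceptual sketch of a channel pool: several ManagedChannels to the same
// backend, handed out round-robin so stubs spread across more connections.
final class RoundRobinChannelPool {
  private final List<ManagedChannel> channels;
  private final AtomicInteger next = new AtomicInteger();

  RoundRobinChannelPool(List<ManagedChannel> channels) {
    this.channels = new ArrayList<>(channels);
  }

  ManagedChannel pick() {
    // floorMod keeps the index non-negative even after integer overflow.
    return channels.get(Math.floorMod(next.getAndIncrement(), channels.size()));
  }
}
```

In the runs above, though, a single channel with an appropriately sized event loop group went just as far while holding fewer connections open.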

So what did we learn through this entire process? Well, the first thing we learned was that when tuning clients, you want to size your event loop group based on your expected I/O. The more traffic you see -- the more requests being issued -- the larger the pool you'll need to account for the writes to your underlying connections. If you leverage stream-based APIs, either client- or server-side, you'll need to scale up the number of workers writing to this pool, because you'll implicitly have a few more operations happening on the wire.

If you do choose to leverage a channel pool, make sure you're choosing it for the right reasons. There are a few cases where it makes sense, especially if you have long-lived streams. But there are other cases where it can lead you down a rabbit hole.

On the server side, you want to tune your worker event loop group similarly to the client's -- again, thinking about the request rates coming in and out of the server. An increase in load on the server means you may have to adjust both the executor and your event loop group.

And as I sat and thought about the first question we talked about -- how does gRPC compare to Boxcar? -- we were able to go out to the different product teams and ease some of their concerns. We could communicate to them that gRPC sees very similar performance to Boxcar, especially from the client's perspective. Servers are able to process very similar request rates despite using far fewer resources. In general, across both client and server, gRPC uses far fewer resources and is able to sustain a much larger request rate.

In parting, I suggest that you find a good set of defaults that work for your company. Keep in mind, every company's workloads may be different. And so different defaults may be required for your internal systems.

When you're working on establishing these defaults, start small and take small steps forward. It's really tempting to make larger jumps, but those larger deltas can lead to missing the key inflection points where you have an opportunity to suss out some latency.

And finally, a single connection can do a lot more than you might think. So give it a chance and work on tuning it and have some fun with it. Thank you.

[APPLAUSE]