Transcript
Ajila: I'm going to talk about optimizing JVMs for the cloud, and specifically how to do this effectively. I'll start off with a couple of data points, just to frame the discussion. Ninety-four percent of companies are using cloud computing. It's becoming rarer for developers to host machines on-prem now. Cloud computing offers the flexibility of increasing your computing resources. It decouples the task of maintaining hardware from writing software, so you can focus on writing applications and deriving business value, and you can offload the cost of maintaining infrastructure to your cloud provider. The result has been lower startup costs for a lot of small and medium-sized businesses, and faster time to market. Here's another data point. With all these benefits, a lot of what small to medium-sized businesses spend now goes to cloud computing, which makes sense. An increasing share of the budget is going there. It's very important to understand these costs and how to reduce them.
The main costs in providing cloud services are network, storage, and compute. The first, network, is the cost of buying the hardware, setting it up, and maintaining it. Likewise with storage. To some extent, the same is true of compute. However, for most people, compute is where the flexibility and the cost are. You typically pay for CPUs and memory, times the duration of use. This is where most people have the most flexibility in their costs. A lot of cloud providers offer a pay-as-you-go model, so you only pay for what you use. It's very important to pay less if you can. Most people don't want to pay more for the same service; you want to pay as little as possible. We saw on the previous slide that for compute, CPU and memory are typically the biggest costs. For most applications, demand isn't constant. It varies over time: certain days of the week or weeks of the year might be busier than others. Using a deployment strategy that scales with demand is the most effective way to handle this. This can be a scale-to-zero approach: you scale up when demand is high, and you scale down when demand is low. Again, this is not new; lots of people are doing it. Technologies like Knative and OpenFaaS make this possible. Another way to reduce costs in the cloud is to increase density, essentially to do more, or the same, with less. A lot of applications tend to be memory bound, so in many cases it means using less memory to achieve the same goals. What needs to be done is fairly straightforward: you have to scale and you have to increase density. However, achieving these goals can be challenging. To have a successful scale-to-zero strategy, you need very fast startup times. Typically, you have to be under a second, and the lower the better, or else you have to fall back to a scale-to-one approach. Equally important is how much memory your applications consume. If you can use 2 gigs instead of 3, again, the savings are quite large.
Here's the final one. Historically, Java has been the language of the enterprise. In the old days, Java applications were typically deployed as large monoliths: large applications with long running times. You typically didn't care about startup because your application was running for hours, days, sometimes even weeks. More and more, we're seeing Java workloads being moved to the cloud. Here, the stats show that 30% of Java applications are deployed in public clouds. This is from last year, and that's just Java applications; there's more than just Java that runs on top of the JVM. The JVM is used quite a lot in the cloud. There's a large shift away from monoliths and towards microservices. If you're running a JVM-based app server, then it's very important that your JVM starts up very quickly, and that you can tune your JVM to use as little memory as possible.
Background
In this presentation, I'm going to talk about techniques that you can use to improve startup time and reduce memory footprint. I am a JVM engineer on OpenJ9. Currently, my main focus is on startup enhancements, hence this talk. I've also done some work on Valhalla, so I've worked on the prototypes in OpenJ9, and I worked on the virtual threads and FFI implementations in OpenJ9 as well. OpenJ9 is an open source JVM. It was open sourced back in 2017. It's based on the IBM commercial JVM, so the JVM has been around for a couple of decades; OpenJ9 is the open source version from 2017. The branded name is Semeru; they're really the same thing. We'll talk about two things: how to improve JVM startup time and how to improve JVM memory density.
How to Improve JVM Startup Time
Now I'll talk about some of the solutions that are currently being looked at in industry. Some have been around for some time, but some are fairly new. I'll talk about the benefits and drawbacks of each one. First, we'll do a little review. Here's the JRE on the left: you have the class libraries, and then the JVM. The JVM is the engine that runs the applications. This is not a complete diagram by any means; it's a high-level illustration of the components. In the JVM, we have the class loader. The class loader is basically responsible for consuming the application. When you think about how applications are packaged, it's typically JARs or modules, and the main unit we're interested in is the class file. If you've seen the class file structure, it has a lot of non-uniform repeating types. It's not a structure that's very easy to use. When the JVM loads the class file, it translates it into an intermediate representation, which for the purposes of this talk I'll refer to as class metadata. Basically, it's an internal representation of the class file that's a lot easier to index and a lot easier to do something with. This translation takes time. Once you have parsed the class file, you have to verify the bytecodes, which also takes time. Then, once you've done that, you're ready to interpret the bytecodes. Interpretation can be quite slow: if you compare interpretation with a profiled, JIT-compiled method, the JIT-compiled method is about 100x faster. It takes quite a while to get to peak performance. You do your class loading, you do interpretation with profiling, you eventually walk up the compilation tiers, and then you get to peak performance. This is historically why the JVM has had slower startup times.
Now I'm going to show you some techniques to improve this. One of the approaches is class metadata caching. The idea is that you save some of that intermediate state that the JVM builds up, so on a subsequent run you don't have to parse the class file from scratch again; you just reuse what was there. You save time by not having to do a lot of the mechanics of class loading. OpenJ9 has a feature called the shared class cache. It's very easy to enable, just a command line option (-Xshareclasses), and on the newer releases it's on by default. When you turn this on, it will save the class metadata in a file, and then any other JVM invocation can make use of that cache. There are minimal changes to the application, and only if you're using a custom class loader that overrides loadClass; even then the changes are fairly trivial. If you're not doing any of that, then there is no change required. There's no impact on peak performance, because it's just the regular JVM, so there are no constraints on JVM features either. The drawback is that, in comparison to the other approaches I'll show you, the startup gains are relatively modest. You can get under a second with the Liberty pingperf application, which is pretty good. It's typically 30% to 40% faster for smaller applications. As you'll see, with the next approaches, things can get a lot faster.
Static compilation. The idea here is very similar to what you're probably familiar with from C++ compilers: you're essentially compiling the entire application and creating a Native Image. One popular example is GraalVM Native Image. There's another one that was being worked on by Red Hat called Qbicc. The idea is that you run your static initializers at build time, and you build up the object graph, the initial state. Once you have that state, you serialize the heap, then you compile all the methods, and that goes in your Native Image. That's it. The result is very fast startup times, under 90 milliseconds on some Quarkus native apps. There are also smaller footprints, because you're only keeping the things you need; you get to strip out all the other bits and pieces that you don't care about. The limitations go hand in hand with the benefits. Because it's a Native Image, you don't have any of the dynamic capabilities that JVMs offer, and a lot of the reason people like the JVM is those dynamic capabilities; people have built applications using them. Things like dynamic class loading and reflection are limited. Essentially, you have to inform the compiler at build time of every class and method that you're potentially going to need, so it remains in the image. Also, there are longer build times: compiling to native code takes a lot longer than compiling to a class file. The peak performance tends to not be as good as the JVM, because with a Native Image you don't have a dynamic compiler, you don't have a JIT, and you don't benefit from profile-guided recompilation and things like that, so peak performance tends to lag. Lastly, because it's not a JVM, you can't use your standard debugging tools. There's no JVMTI support, so you are limited in how you debug things.
The last one I'll show you is checkpoint/restore. The idea is that you run the JVM at build time, you pause it, and then you resume it at deployment. The benefit is that you get very fast startup, because you don't have to do all that work at deployment. The other benefit is that it's still a JVM, so there's no impact on peak performance, no constraint on JVM functionality; everything just works as it did. The drawback is that some code changes are required, because you do have to participate in the checkpoint. However, this might be an easier option for people who are migrating than Native Image, because you can still use the same JVM capabilities, but you do have to opt into it.
Checkpoint/Restore in Userspace (CRIU)
I'm going to spend a few more minutes talking about this approach, and I'll do a little demo just to show how it works. In order to checkpoint the JVM, we need a mechanism that can serialize the state and resume it at a later point in time. We did some experimentation at the JVM level, and we ran into a lot of hurdles, because there are actually a lot of OS resources required to recreate the initial state. We decided to use CRIU, which is Checkpoint/Restore In Userspace. It's a Linux utility which essentially lets you pause and serialize an application and resume it. The idea is that you can take a running container or process, suspend it, and serialize it to disk. It uses ptrace to query all the OS resources the application is using, writes all the mapped pages to file, and uses netlink to query the networking state. All that information is written to a bunch of files. Then, at restore time, it reads those files back in, replays the system calls that were used to create the initial state, maps the pages back into memory, restores the register values, and then resumes the process. This is similar to what you can do with hypervisors, if you've used KVM or VMware, where you can just suspend a virtual machine and resume it. It's the same idea. The benefit of doing this at the OS level is that you get a bit more control. Because you're closer to the application, you have more control over what is being serialized, so you can serialize only the application instead of the entire VM (VM as in the KVM virtual machine in this case).
Here, we'll look at an example of how this works. This would be the typical workflow: you compile, build, and link the application at build time, and then at runtime you run the application. I have two boxes for runtime because, as I said earlier, it does take time for the application to ramp up and reach optimal performance. That's indicated in dark blue, and the yellow is where it's at optimal performance. With checkpoint/restore, the idea is to shift as much of that startup and ramp-up into build time as possible, such that when you deploy your application, it's just ready to go, and you get better startup performance. The way you would do this, again, is to run the application at build time, then pause it and serialize it, ideally before you open any external connections, because it can be challenging to maintain those connections across a restore. Then at deployment, you initiate the restore. It's not instant, it does take some time; that's the red boxes there. But it's typically a fraction of the time it would take to start up the application. In the end, you really get something like this: there is some amount of time required to restore the application, but again, it's a much smaller fraction.
How can Java users take advantage of this? OpenJ9 has a feature called CRIUSupport. It provides an API for users to take a checkpoint and restore it. There are a few methods there just to query support for the feature on the system. At the bottom, there are some methods that basically let you control how the checkpoint is taken: where to save the image, the logging level, file locks. Then in the middle, there are a few methods that allow you to modify the state before and after the checkpoint. The reason you might need this is that when you take a checkpoint in one environment and restore it in another environment, in many cases there are going to be incompatibilities between the two. In the JVM, we try to compensate for a bunch of those things. Things like CPU counts, if you have certain CPU values baked in: for example, in java.util.concurrent, the default thread pool sizes itself based on the number of CPUs, so if you go from eight CPUs to two CPUs, you could hit some performance issues there. We try to compensate for machine resources like that which may be different on the other side. We also fix up time-aware instances: things like timers get automatic compensation so that they still make sense on the other end. There are going to be some cases where we don't know what the application is doing, so we provide hooks so application developers themselves can also do those fixups.
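To make that concrete, here is a minimal sketch of registering those hooks around a checkpoint. It assumes the org.eclipse.openj9.criu.CRIUSupport class with the builder-style methods described in this talk, so treat the package name and exact signatures as approximate rather than authoritative.

```java
import java.nio.file.Paths;

import org.eclipse.openj9.criu.CRIUSupport;

public class HookExample {
    public static void main(String[] args) throws Exception {
        if (CRIUSupport.isCRIUSupportEnabled()) {
            new CRIUSupport(Paths.get("checkpointData"))
                // Runs just before the process image is serialized:
                // close environment-specific resources such as connections or caches.
                .registerPreCheckpointHook(() -> System.out.println("pre-checkpoint fixups"))
                // Runs right after restore in the new environment:
                // re-read configuration, refresh anything CPU- or time-dependent.
                .registerPostRestoreHook(() -> System.out.println("post-restore fixups"))
                .checkpointJVM(); // execution resumes from this point on restore
        }
    }
}
```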
Demo
I'll do a little demo with a fairly rudimentary application, but it simulates what happens with a real application. A real application will start, and we're going to simulate loading and initializing classes with a few sleeps. Then we're just going to print "ready". Compile that and run it. It's doing exactly what we expected; it took about 3 seconds. Now we're going to try this with CRIUSupport. I've got a helper method here. All it's going to do is query whether CRIUSupport is available on the machine. Then it's going to setLeaveRunning to false, which basically means that once we take the checkpoint, we terminate the application. We're going to setShellJob to true; we have to inform CRIU that the shell is part of the process tree so it knows how to restore it properly. Then we setFileLocks to true, because the JVM internally uses file locks. Then there's some logging, and then we checkpoint the image. Compile that. Then I'll make the directory that we'll put the checkpoint data in. This option is not enabled by default, so we have to turn it on. That message there is from CRIU saying it has terminated the application. Then we'll restore it; I'll just use the CRIU CLI to restore it. It took about 60 milliseconds. This gives you an idea of how this would work in practice.
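For reference, here is a sketch of what the demo application and its helper might look like. The timing simulation, class name, and directory are illustrative, and the CRIUSupport method signatures are assumed from the description above rather than quoted from the actual demo source.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.eclipse.openj9.criu.CRIUSupport;

public class DemoApp {
    public static void main(String[] args) throws Exception {
        // Simulate loading and initializing classes, as in the demo.
        Thread.sleep(3000);

        checkpoint(); // on the first run, this terminates the JVM after the checkpoint

        System.out.println("ready"); // printed almost immediately after restore
    }

    static void checkpoint() throws Exception {
        if (!CRIUSupport.isCRIUSupportEnabled()) {
            return; // fall back to a normal run if the feature isn't available
        }
        Path imageDir = Paths.get("checkpointData");
        Files.createDirectories(imageDir);
        new CRIUSupport(imageDir)
            .setLeaveRunning(false) // terminate the application once the checkpoint is taken
            .setShellJob(true)      // tell CRIU the shell is part of the process tree
            .setFileLocks(true)     // the JVM uses file locks internally
            .setLogLevel(2)         // some logging, as in the demo
            .checkpointJVM();
    }
}
```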
Open Liberty (Web Services Framework)
That example was fairly simple, and again, we're interested in optimizing the JVM for the cloud; most people don't deploy Hello World applications to the cloud. So we're going to try this with something more real. Next, we'll look at Open Liberty. Open Liberty is a web services framework; it supports MicroProfile and Jakarta EE, and you can even run Spring Boot applications on it. We've been working with the Open Liberty team to integrate this feature. In Open Liberty, it's called InstantOn. The nice thing about doing this at the framework level is that you can abstract all the details of taking a checkpoint away from the user. As you'll see, with Liberty you don't have to change the application at all; the framework handles all that for you. It's just a matter of configuring it differently. We'll look at an example. A typical application would be set up like this. You start with the Liberty base image. Then you copy in your server config, which basically tells it the hostname, the port, and all that stuff. Then you run the features step; Liberty only installs the features that you'll use. Then you copy in your application WAR file. Then this last step here just runs the JVM and warms up the shared class cache, so it generates the class metadata and does some AOT compiles. Then, that's it. It's already built, so I'll just run it. We see here it started in about 3.4 seconds. Let me go to my web page. There we go, everything's working as expected.
Now we'll try the same thing, except we'll do it with InstantOn. We'll take a look at the Dockerfile. It's identical, except that we've added this last step here. Essentially what's going to happen is that we're going to take a checkpoint while we're building the container image. We run the server, pause it, serialize it, and save its state, bundling everything in the image, so that when we restore it, we can just resume it. Let's do that. One thing to note is that CRIU requires a lot of OS capabilities to serialize the state of the application. Before Linux 5.9, you would have had to do this with a privileged container. As of Linux 5.9, there is a new capability called CHECKPOINT_RESTORE, which encapsulates a lot of the functionality it needs. To do this in a container, you need CHECKPOINT_RESTORE, SYS_PTRACE, and SETPCAP; those are the only capabilities you need. When you're restoring, you only need CHECKPOINT_RESTORE and SETPCAP; you don't have to add SYS_PTRACE. We'll restore the image. As you see, it took about 0.275 seconds, which is about a 10x improvement. My machine's not the fastest; on a performance machine, the numbers are a lot better. The idea is that we're seeing a 10x reduction in startup time, which is what we want. We want to do as much as possible at build time so that we have less to do at deployment time. As I was saying, these are the kinds of performance numbers you would get if you were doing this on a performance machine: typically a 10x to 18x improvement. With a small application, you're around the 130-millisecond range. It's not quite Native Image, but it's pretty close, and you don't have to give up any of the dynamic JVM capabilities. Again, our first response time is pretty low, too.
How to Improve JVM Density
Now we'll transition to talking about how to improve JVM density. When you look at the big contributors to footprint in the JVM, it's typically classes, the Java heap, native memory, and the JIT. There are other contributors, like stack memory and logging, but these tend to be the big four. With classes, again, JVMs typically don't operate on class files; they operate on some intermediate structure. If you've loaded a lot of classes and compiled a lot of code, you're going to have a lot of metadata associated with that, and it consumes a lot of memory. There are ways to reduce this cost. Earlier, I talked about shared classes. OpenJ9 creates its class metadata in two phases. The first part we call the ROM class: essentially, this is the static part of the class, the part that will never change throughout the lifetime of the JVM. Things like bytecodes, string literals, constants, that kind of stuff. Then everything else is in the RAM class: resolution state, linkage state, things like that, and class layout, because if you're using a different GC policy the layout is different. The thing about ROM classes is that if you parse the same class, the result is always the same; it's identical. If you have multiple JVMs on a node, you're going to create the same ROM class for each one if you're loading the same classes. Typically, you'll at the very least be loading the JCL, which is identical. There are a lot of classes being generated that are identical. The idea is that you can use the cache, generate the ROM class once, and then all the other JVMs can make use of it. It's not only a startup improvement; it also reduces the footprint, because now you only have one copy of the ROM class. With some applications, you can see up to a 20% reduction in footprint if you have a lot of classes.
Another one is the Java heap. The Java heap captures the application state. A lot of JVMs are configurable: there are a lot of different options you can pass in to set the heap geometry. You can also make heap expansion less aggressive. A lot of these things come at a cost: if you make heap expansion less aggressive, you often have more GCs and you take a throughput hit, so it's a bit of a tradeoff how to approach this one. With native memory, there aren't as many options. Anything that's using a direct byte buffer or Unsafe is going to be allocating native memory. There is an option, MaxDirectMemorySize, that puts a hard limit on the amount of direct memory you can use. Again, if your application needs more, it just means that you hit an out-of-memory error.
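To make the direct memory point concrete, here is a small, self-contained sketch: direct ByteBuffers are backed by native memory rather than the Java heap, and if you cap that memory with -XX:MaxDirectMemorySize, exceeding the cap surfaces as an OutOfMemoryError. The 64 MB chunk size and the 256m cap are arbitrary values chosen for the illustration.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with, for example: -XX:MaxDirectMemorySize=256m
public class DirectMemoryDemo {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        try {
            for (int i = 0; i < 1024; i++) {
                // Each direct buffer is allocated in native memory, outside the Java heap.
                buffers.add(ByteBuffer.allocateDirect(64 * 1024 * 1024));
                System.out.println("Allocated " + ((i + 1) * 64) + " MB of direct memory");
            }
        } catch (OutOfMemoryError e) {
            // With a 256m cap, this is hit after roughly four 64 MB buffers.
            System.out.println("Hit the direct memory limit: " + e);
        }
    }
}
```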
Lastly, the JIT. There are some things you can do to minimize the amount of memory used by the JIT. You can try to drive more AOT compilations with heuristics. AOT code can be shared between multiple JVMs, so you have one copy instead of many, whereas JIT-ed code cannot be shared; JIT-ed code is always local to the JVM. The JIT can also be configured to be more conservative with inlining. When you inline a method, you're essentially making a copy of it: the more copies you have, the more memory you're using, and the less inlining you do, the less memory you use. Again, there's a tradeoff, because the less inlining you do, the lower your throughput. In a lot of these cases, the solutions require a throughput tradeoff. There are a lot of options that you can play around with to optimize the JVM, but not everyone's deeply familiar with JVM internals, so it can be tricky to figure out which options to use. OpenJ9 has an ergonomics option called -Xtune:virtualized. The idea is that this option tunes the JVM for a constrained environment. We always assume that you're limited in terms of CPU, limited in terms of memory, and we try to be as conservative as possible. The effect is that you can reduce your memory consumption by 25%, and it comes at a roughly 5% throughput cost. If that is a cost you're willing to live with, this option is something you can use.
OpenJ9: JITServer (Cloud compiler)
The last thing I'll talk about is something called JITServer, or the cloud compiler. Essentially, it's a remote compilation technique. The idea is that if you're running a configuration with multiple JVMs, you're going to be doing a lot of compilation, and there's going to be a large overlap in the classes being compiled. Each JVM is going to have its own JIT that compiles the same class each time, which is wasteful. The idea is to decouple the JIT from the JVM and run the JIT as a service, with the JVMs as clients of the JIT, so that the JVMs request compiles from the JITServer. The big benefit of this is that it makes the job of provisioning your nodes a lot simpler. You don't have to account for the memory spikes that occur when you do JIT compilation. If you look at the memory profile of an application, when there's a lot of JIT activity the memory goes up, because you need a lot of scratch space to do the compilations. With a JITServer, it's a lot easier to tune more conservatively in terms of memory usage, because you know there won't be that variance in memory usage. Another benefit is that you get more predictable performance: you don't have the JIT stealing CPU from you. You also get improved ramp-up performance, because the CPU on the node is used to run the application, and the compilation is done remotely. This is very noticeable with smaller, shorter-lived applications.
Demo - Improve Ramp-up Time with JITServer
I'll do a little demo here. There are going to be three configurations: one JVM with 400 Megs, another with 200, and another with 200 using JITServer. Then we'll drive some load and output the results to Grafana. The first step is to start the JITServer; that's the command we're using there. Essentially, we're going to give it 8 CPUs and 1 gigabyte of memory. We're also going to keep track of the metrics. You don't actually have to do that when you deploy, but for the purposes of this demo I chose to. They show up there. Here are the metrics for the JITServer. On the top left is the number of clients, on the right is CPU utilization. The bottom left is the number of JIT compilation threads; we start with one, and it's just idling. Then we have memory on the right. Now we'll start the nodes. There are three worker nodes. The first two are not connected to the JITServer: one is just a 400-Meg node, and the other is a 200-Meg node. The last is a 200-Meg node that is connected to the JITServer. You'll see the option above; that's basically how you connect to a JITServer, and then you provide the other options to tell it which host and which port to connect to. We'll start those. Once it's connected, we should see the number of clients spike up to one, and there we go. On the right, you can see there's an increase in CPU activity. When you start up a JVM, there's compilation that occurs to compile the classes in the class library. Typically, it doesn't take very long, so you can see we've already gone back down; CPU utilization is back down to zero, and we only needed one JIT compilation thread. Now we're going to apply load. You can see that CPU is already starting to go up; now it's just spiked. We've gone up to three compilation threads, because there's more load applied to the application, so there are more JIT requests and there's increased JIT activity. Now we're up to four compilation threads. We see that the memory usage is also increasing. We started at 1000; I think at this stage we're down to 900 Megs, so we're using about 100 Megs of memory. The important thing is that this is memory being used by the JITServer and not the node. Typically, you'd have to account for this when sizing your node; with JITServer, you only have to account for it on the server. You basically size your nodes for steady state, because you don't have these spikes.
Now we'll look at the three nodes. The top left is the 400-Meg node, the top right is the 200-Meg node, and the bottom left is the 200-Meg node with JITServer. You can already see that the ramp-up is much faster with JITServer. The 400-Meg node will get there eventually, but it'll take a bit more time. The 200-Meg node is going to stay bounded, because when you're constrained you have to limit the amount of compilation activity you can do. As you can see, we're back down to one JIT thread on the JITServer; that's because we've done most of the compilation. The CPU utilization has gone back down, and the memory consumption is reducing now, because we don't need the scratch space for compilation anymore. These are the spikes that you would typically incur in every node; with the remote server, you only incur them in the JITServer. You can see that the 400-Meg node there is still ramping up, but the 200-Meg node with JITServer has already reached steady state. That's basically how it works.
Here's another example with two configurations. At the top, we have three nodes with roughly 8 Gigs, and at the bottom, we have two nodes with 9 Gigs using JITServer. You can achieve the same throughput with both of those configurations. The savings come from the bottom configuration being sized more conservatively per node, because you don't have to account for the variation in memory usage. Here are some charts that show the performance in a constrained environment. The left one is unconstrained: you can see that with JITServer you do get a ramp-up improvement, and peak throughput is the same. As you constrain it, you see that peak throughput drops. When you have less memory, you have less memory for the code cache and less memory for scratch space; overall, it limits what you can do from a performance perspective. The more you constrain the node, the more limited you get, which is why the more constrained the environment is, the better JITServer tends to do.
When Should You Use JITServer?
When should you use it? There are tradeoffs to this approach. Like any remote service, latency is very important. Typically, the requirement is that you have less than a millisecond of latency; if you're above that, the latency overheads dominate and there really isn't a point to using it at all. It's really good in resource-constrained environments: if you're packing a lot into a node, that's where it's going to perform well. It also tends to perform well when you're scaling out.
Summary
We've looked at the main requirements for optimizing your experience in the cloud. We looked at startup and why that's important, and why memory density is important, and we looked at a few approaches for each. For memory density, the big one is definitely remote compilation; it seems to have the biggest impact. For startup, there are a few things you can do, and there are tradeoffs for each approach. Technologies like checkpoint/restore with CRIUSupport are a way to get the best of static compilation and the existing class metadata caching techniques.
Azul's OpenJDK CRaC
Azul has a feature called OpenJDK CRaC, which is very similar to CRIUSupport. There are a couple of differences in the API, but for the most part, it achieves the same thing.
Questions and Answers
Beckwith: When I last looked at it, maybe five years ago, with respect to JITServer, there were different optimizations that were tried. At first, the focus was mostly on how to deliver to the worker nodes, and things like that. Now I see there are these memory limit constraints. Are there any optimizations based just on the memory limit constraints, or is it equal for everything?
Ajila: The default policy when using JITServer is to go to the JITServer first and see if it's available. There is a fallback; I'm not sure what the default heuristics are there, but if it's not available and it would be faster to compile locally, you can compile locally. We don't actually jettison the local JIT, we just don't use it, because if you don't use it, there's no real cost to having it there. Yes.
Beckwith: Is it a timeout or is it just based on the resources?
Ajila: I do know that it checks the latency, and if the latency goes beyond a threshold, then yes.
Beckwith: There was a graph that you showed at the JVMLS when talking about startup and the [inaudible 00:45:01]. Maybe you can shed light on where the boundaries are? In the JVM world, what do we call startup?
Ajila: Again, people sometimes measure this differently. Startup, in the JVM context, is basically how long it takes you to get to main. For a lot of end user applications, main is actually where things start; it's after main that you start to load your framework code and so on. Main is like the beginning in a lot of those cases, and there's a lot of code that has to run before your application does something useful. From that perspective, startup can actually extend later. Even then, after you've received that first request, the performance is not optimal yet; it does take time to get there. A lot of JVMs are profile dependent, meaning that you have to execute the code a number of times before you generate something optimal, something that performs very well. That's the ramp-up phase. That's the phase where there's a lot of JIT activity: the JVM is profiling the code paths, figuring out which branches are taken and which are not, and recompiling. Then after that phase, when JIT activity goes down, you're past ramp-up, and you're at steady state.