Transcript
Küber: I'm Esteban Küber, a member of the Rust Compiler team and a Principal Engineer at Amazon Web Services. The talk that I originally drafted was going to focus on how to apply some specific libraries and approaches to reduce the footprint of your applications, to do more with the same amount of hardware, and it was going to focus heavily on Rust. I am going to be talking about Rust, but while working on this presentation, the flow of the talk changed.
Rust
I have a lot of background in Rust. Rust is a systems language aimed at being performant and reliable, and at allowing high productivity when building complex systems. The motto of the language is empowering everybody to write performant code. Personally, I like to say that the unofficial ethos of Rust is: the problem is hard, but if we restrict the scope, we can statically check it. What this means is, in order to allow developers to write performant and reliable code, the language needs to restrict the kind of patterns that can actually be expressed. When you introduce those restrictions in a targeted manner, what you end up with is a subset of the complex problem of memory safety, for example, where you can have the compiler do most of the work that you otherwise would do yourself. C developers need to manage memory manually. In Rust, even though it doesn't have a garbage collector, the feel for how you interact with memory is much more similar to a garbage collected language than it is to a systems language like C.
Rust and Java
Rust doesn't exist in a vacuum. At AWS, we use a lot of Java. It seems like a natural counterpoint to contrast what Rust actually is. Both Rust and Java have a very similar reason to exist. When Java was originally invented in the mid-90s, it was intended to be a memory safe language to work in embedded devices. When Rust was conceived of in the mid-2000s, the idea was to be able to write systems software in a safe manner, while bringing in all of the new developments of language design from more academic sources. You may not know this, but the first several iterations of the Rust compiler were actually written in OCaml. There is a lot of sourcing of ideas and patterns from OCaml, at least early on. These two languages have a very similar idea. Effectively, they want to bring the same level of performance that code written in C++ has, but make it memory safe and reduce the number of footguns, the number of things that developers can get wrong, and help them to deliver reliable software. Both languages have evolved from this. They are no longer tied to their original niche, they have expanded.
Both languages also have very different, and in some cases diametrically opposed, design considerations and decisions. For example, Java uses a virtual machine in order to gain platform independence and be able to run the same generated binary on multiple different machines, whereas Rust is compiled down to a binary that your platform has to run. It doesn't have a runtime. It does in the same sense that C has a runtime: it has a layer of APIs to interact with your platform. Effectively, it doesn't have a runtime. Java uses dynamic dispatch by default. The JIT compilers are very capable of removing some of the pointer indirection introduced by dynamic dispatch, but the default is dynamic dispatch. Rust, on the other hand, monomorphizes by default, which means that if you have a function with a generic type parameter, the compiler will wait until it has found every single call to that function, look at the types, and generate binary code for that function for each combination of types. When you are calling a method in Rust, by default, it will be just as efficient as calling any other function, regardless of the fact that it has generics.
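To make the dispatch distinction concrete, here is a minimal sketch. The function names `describe_static` and `describe_dynamic` are made up for illustration and do not come from the talk. The generic version is monomorphized per concrete type, while the `dyn` version is the opt-in, vtable-based form.

```rust
use std::fmt::Display;

// Generic function: the compiler emits one specialized copy for each concrete
// type used at a call site (monomorphization), so calls are statically dispatched.
fn describe_static<T: Display>(value: T) -> String {
    format!("value = {value}")
}

// Trait-object version: one copy of the function exists, and the `Display`
// implementation is found through a vtable at run time (dynamic dispatch).
fn describe_dynamic(value: &dyn Display) -> String {
    format!("value = {value}")
}

fn main() {
    // Two instantiations are generated here: one for i32 and one for &str.
    println!("{}", describe_static(42));
    println!("{}", describe_static("hello"));

    // Dynamic dispatch has to be requested explicitly with `dyn`.
    println!("{}", describe_dynamic(&42));
    println!("{}", describe_dynamic(&"hello"));
}
```

Both forms produce the same output; the difference is in how the call is resolved and how much code the compiler generates.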
Of course, Rust also has dynamic dispatch, but it is opt-in. Another huge difference is that Rust is stack allocated by default. In fact, if you want to allocate something on the heap, you have to explicitly use a special type called Box, which is just a fancy way of expressing in the type system that this is a pointer to something in the heap. In Java, except for primitive types, everything is allocated on the heap. Every single object that the developer writes ends up allocated on the heap. Rust, by its very nature, is only ahead-of-time compiled, which means that once you have generated a binary, if the compiler didn't pick the optimizations that matter for your specific use case, in this individual execution of the binary, then you're stuck. Your performance will be predictable, but it will never be able to take into account what your application is actually doing once it starts running. Java, because it has a virtual machine, does have a JIT, and that means that as you're running a long-lived service with a very consistent workload, you can get close to perfect performance as long as these JIT optimizations kick in.
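As a small illustration of the allocation default described above, the following sketch uses a hypothetical `Point` type (not from the talk). Local values need no heap allocation, and the heap is only involved where `Box` is written out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

// Takes and returns plain values; no heap allocation is involved.
fn midpoint(a: Point, b: Point) -> Point {
    Point {
        x: (a.x + b.x) / 2.0,
        y: (a.y + b.y) / 2.0,
    }
}

fn main() {
    // Ordinary local values: stored inline, not behind a pointer.
    let a = Point { x: 0.0, y: 0.0 };
    let b = Point { x: 2.0, y: 4.0 };

    // Heap allocation is explicit: `Box<Point>` is an owning pointer to
    // a `Point` stored on the heap.
    let boxed: Box<Point> = Box::new(midpoint(a, b));
    println!("{:?}", *boxed);

    // The heap memory is released when `boxed` goes out of scope,
    // without a garbage collector.
}
```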
There are other differences. Rust has Async/Await as a primary part of the language. This is not that different from the same feature in C#, Python, and other languages. In Java, this is not yet part of the language; it relies on APIs that you call. There is one big difference, which is that Rust has the Safe/Unsafe dichotomy. Unsafe code is regular Rust code that can also interact with pointers. It can dereference pointers and it can call functions that are external to Rust, meaning operating system syscalls and C functions from external libraries. That's all that unsafe does. It doesn't turn off the borrow checker, it just lets you operate with pointers. Safe Rust, on the other hand, doesn't have pointers, but it checks every single reference, every borrow, which compiles down to a pointer access, statically at compile time for validity. This means that you get the same memory safety guarantees that you get from a garbage collector, but at compile time.
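A minimal sketch of what the unsafe boundary looks like in practice follows. The helper name `read_through_raw_pointer` is hypothetical. Creating a raw pointer is allowed in safe code; only the dereference needs an `unsafe` block.

```rust
// Reads an integer through a raw pointer derived from a reference.
fn read_through_raw_pointer(value: &i32) -> i32 {
    // Creating a raw pointer from a reference is allowed in safe code.
    let ptr: *const i32 = value;

    // Dereferencing it is not: the compiler cannot verify that a raw pointer
    // is valid, so the programmer has to vouch for it with `unsafe`.
    // This is sound here because `ptr` comes from a live reference.
    unsafe { *ptr }
}

fn main() {
    let n = 7;
    println!("{}", read_through_raw_pointer(&n));
}
```

References used inside an `unsafe` block are still checked by the borrow checker; the block only unlocks the extra operations.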
These differences do have far-reaching implications. If we look at the memory layout, Rust ends up storing data very efficiently in the stack, which means that the data of your application is going to have great cache locality, which in turn translates to a very good performance. Whereas in Java, because objects are heap allocated, the default is that the cache locality suffers greatly, hence performance suffers greatly. The just-in-time submodule of the JVM will be able to perform escape analysis and improve performance under a consistent workload, but it takes some time to kick in. It cannot do everything that you can, because you cannot specify in your code that something needs to be stack allocated. That means that in a refactor you may be breaking some of the optimizations that you were used to seeing in production.
Rust Tradeoffs
Of course, everything is tradeoffs. The speed benefits that Rust has, particularly around cache locality, don't come for free. The Rust type system encourages unique ownership. You cannot freely alias the same area of memory. It requires you to think of new ways of working. There are some old patterns that plainly do not work in Safe Rust. The go-to example for everybody on the internet is doubly linked lists. A singly linked list has a very clear chain of ownership. You have a start node, and every node owns the next one. In a doubly linked list, you effectively have a graph, and in Safe Rust the type system does not like it when there isn't a single owner. There are ways of working around this. You can have arenas and replace your pointer access with indices into a region of memory. You can use Unsafe and just rely on pointers. When the things that you try at first don't work, it can be somewhat difficult to understand why. Because the language is compiled ahead-of-time, you don't have runtime reflection, which limits some of the patterns that you may be able to express. Rust does not have a stable ABI, so linking into dynamic libraries is somewhat fraught. You can use a C API, and there is first party support for that, but it does make it much harder than it is in Java. All of this contributes to a feeling of Rust being a difficult language, even though we've spent a lot of time trying to make it as easy as possible.
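The arena workaround mentioned above can be sketched roughly as follows. This is an illustrative toy, not code from the talk: the `List` and `Node` types are hypothetical, and removal is left out to keep it short. All nodes are owned by one `Vec`, and the `prev`/`next` links are indices into it rather than references.

```rust
// A node stores indices into the arena instead of references to its neighbours.
struct Node<T> {
    value: T,
    prev: Option<usize>,
    next: Option<usize>,
}

// The `Vec` is the single owner of every node, which satisfies the type system.
struct List<T> {
    nodes: Vec<Node<T>>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl<T> List<T> {
    fn new() -> Self {
        List { nodes: Vec::new(), head: None, tail: None }
    }

    // Appends a value and returns the index of the new node.
    fn push_back(&mut self, value: T) -> usize {
        let index = self.nodes.len();
        self.nodes.push(Node { value, prev: self.tail, next: None });
        match self.tail {
            Some(old_tail) => self.nodes[old_tail].next = Some(index),
            None => self.head = Some(index),
        }
        self.tail = Some(index);
        index
    }

    // Walks the `next` links from the head.
    fn forward(&self) -> Vec<&T> {
        let mut out = Vec::new();
        let mut cursor = self.head;
        while let Some(i) = cursor {
            out.push(&self.nodes[i].value);
            cursor = self.nodes[i].next;
        }
        out
    }

    // Walks the `prev` links from the tail.
    fn backward(&self) -> Vec<&T> {
        let mut out = Vec::new();
        let mut cursor = self.tail;
        while let Some(i) = cursor {
            out.push(&self.nodes[i].value);
            cursor = self.nodes[i].prev;
        }
        out
    }
}

fn main() {
    let mut list = List::new();
    for word in ["a", "b", "c"] {
        list.push_back(word);
    }
    println!("{:?}", list.forward());
    println!("{:?}", list.backward());
}
```

The tradeoff is that an index can go stale if nodes are ever removed, so a fuller version would need a strategy for reusing or invalidating slots.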
JVM Tradeoffs
On the JVM side, you also have tradeoffs. You have garbage collection, which means that you have very easy memory management, but if you have a long running service that is high throughput, you are going to notice garbage collection pauses, for example. Allocating objects is very fast, but you're going to be allocating a lot of objects. You do have native compilation: with Graal Native in particular, you can precompile all of your Java code into a platform dependent binary, but not all Java code will work. Anything that relies on reflection will not compile natively, which means that you are limited in the libraries that you can actually choose. Of course, because you have a runtime, the JVM, startup time becomes considerable compared to a native binary.
An Experiment: Magic-Wormhole Replication
Because at Amazon we do have a lot of Java, and the team I belong to, the Rust Platform Team, falls under the same umbrella as a sister team working on Java, we decided to perform an experiment. We wanted to compare the two languages in a representative way, and we ended up coming up with a small test case. There is this Python based tool called magic-wormhole. All it does is send files from one file system to another. You have a sender and a receiver. The idea behind this was to just set the developers free to implement it in any way they wanted, measure the results, and observe what the process was like. We wanted to capture the developer experience. What tools were used? What were the different tradeoffs that people made? What could we not use? As important, how did each perform? The design is incredibly simple. You have a receiver service. All it has to do is listen on a TCP port for an incoming file and write it to disk. A sender, all it does is send files to a receiver. We have a registry, which is a very simple HTTP service that operates like a DNS service. The receiver registers itself under an assumed name. The registry associates that name with an IP address and a TCP port. The sender can then say, I want to send to this name, and gets the networking information from the registry. This is literally the simplest networked application that you can write.
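The registry's core job, as described, is a name-to-address mapping. The sketch below shows only that in-memory part and leaves out the HTTP layer entirely. The `Registry` type, its method names, and the example name and address are assumptions for illustration, not the code from the experiment.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

// Maps the name a receiver registered under to its IP address and TCP port.
#[derive(Default)]
struct Registry {
    entries: HashMap<String, SocketAddr>,
}

impl Registry {
    // Called on behalf of a receiver. Returns the address previously
    // registered under this name, if there was one.
    fn register(&mut self, name: &str, addr: SocketAddr) -> Option<SocketAddr> {
        self.entries.insert(name.to_string(), addr)
    }

    // Called on behalf of a sender to find out where to connect.
    fn lookup(&self, name: &str) -> Option<SocketAddr> {
        self.entries.get(name).copied()
    }
}

fn main() {
    let mut registry = Registry::default();
    let addr: SocketAddr = "127.0.0.1:9000".parse().expect("valid socket address");
    registry.register("my-laptop", addr);
    println!("{:?}", registry.lookup("my-laptop"));
    println!("{:?}", registry.lookup("unknown"));
}
```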
We wanted to timebox this, we didn't want this to go on for too long. We said, let's give each developer two days to write this. We want the code to be somewhat small, we don't need something big. We were envisioning around 300 lines. Of course, the best plans never survive first contact, so there were some changes in this. For the Rust code, I originally used Async/Await. There are around 300 lines of code. The meat of the code was incredibly simple. Preallocate the buffer. Read contents from a stream in a loop. Write to that buffer. Then from that buffer, write straight to a file. Nothing too complex. I'm highlighting the logic for the command line for the looping operation. This is the meat of it. Fairly simple. You're taking bytes from a TCP connection and writing them to a file. We tried to stick to fairly established libraries. We didn't go too crazy. In the case of Rust, it's a very regular stack. In the case of Java, we went through a few iterations and used different libraries. We tried multiple approaches to gather as much data as we could. One particularly interesting tidbit was that we not only used the JVM, we also compiled with GraalVM to see what the real-life benefits of switching are.
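The read-and-write loop described above might look roughly like this in synchronous Rust. It is a sketch under assumptions, not the code from the talk: the function name `copy_stream` is made up, it uses a small fixed-size buffer rather than one sized to the whole file, and it is generic over `Read` and `Write` so it can run without a network. In the real tool the source would be a `TcpStream` and the sink a `File`.

```rust
use std::io::{self, Read, Write};

// Copies everything from `source` to `sink` through a preallocated buffer
// and returns the number of bytes copied.
fn copy_stream<R: Read, W: Write>(
    source: &mut R,
    sink: &mut W,
    buf_size: usize,
) -> io::Result<u64> {
    // Allocate the buffer once, outside the loop.
    let mut buffer = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        // `read` returns how many bytes it placed in the buffer; 0 means end of stream.
        let n = source.read(&mut buffer)?;
        if n == 0 {
            break;
        }
        // Write only the bytes that were actually read.
        sink.write_all(&buffer[..n])?;
        total += n as u64;
    }
    sink.flush()?;
    Ok(total)
}

fn main() -> io::Result<()> {
    // In-memory stand-ins for the TCP connection and the output file.
    let mut source = io::Cursor::new(b"file contents".to_vec());
    let mut sink = Vec::new();
    let copied = copy_stream(&mut source, &mut sink, 4)?;
    println!("copied {copied} bytes");
    Ok(())
}
```

The standard library's `std::io::copy` does the same job and also retries interrupted reads; the loop is spelled out here only to mirror the steps named in the talk.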
Benchmarking
We of course benchmarked everything on the same machine. We got the measurements for CPU time. All of the metrics that we collected, we got through the time utility. We decided to send larger files in order to amortize the JVM startup cost, meaning that we were making more of an apples-to-apples comparison, particularly if you are more interested in long running services. Both the Java and Rust senders were sending one file at a time. We weren't trying multi-threading at the time. The initial results with the initial Rust and Java versions were somewhat interesting. The original Java version was somewhat slow. It used an HTTP library to perform the actual sending. It wasn't just sending bytes on the wire, it was using a library that did so much more. I also tried writing the same application in Rust without using async, and there wasn't much difference, which made me very happy. Of course, as soon as you tell an engineer that something they wrote is not as fast as it could be, they go back to the drawing board and now it becomes a competition.
The other developer, William, went back and, to his credit, rewrote the code of the application that he had written and came back with an incredibly efficient Java application. Even though there wasn't much difference, I believe, between the Graal and JVM versions, it was orders of magnitude faster than the initial implementation and faster than the Rust application. That's when I started measuring, because now I was on the back foot, and I was really intrigued why we were being twice as slow as Java. That's when I found, for example, that Tokio, even when not used directly, was introducing some contention in the IO as long as it was running for the entire duration of the program. I completely ripped out Tokio for the non-async version of Rust, and I got the same performance as Java. We continued iterating, and I wrote a new version that was able to send multiple files in parallel, but using a thread pool instead of Async/Await and Tokio. As you can see, there are still timing differences but it's all within the same ballpark. Sending files was as efficient as it could be. I even made a further iteration using the sendfile syscall directly and it didn't make much difference in the final results. I think there are probably more things that we can do here to further improve. At this point, I think we are fairly limited by the networking stack and the IO that we can push down the wire, and not as much by the code. This is great. Java and Rust are on very similar footing.
Then we look at the memory. Again, this is the initial Java implementation next to the Rust versions. The initial Java implementation was pulling in a full-blown HTTP stack. It makes sense that it would require more memory. In the following iterations, we still have several orders of magnitude of difference in how much memory Java consumes. For Rust, all we needed was, for each file that is being sent, a buffer of the size of the final file. That's all. Of everything else in the binary, almost none of it ended up in the heap. All of the stack allocations were fairly small. For Java on the JVM, by default, because it relies on the JIT to be fast, it requires a huge heap and lots of tiny allocations. When you compile down with Graal, you get a huge improvement, particularly in this tiny application. It's still orders of magnitude more than in Rust, where you have to opt in to using more memory.
The interesting thing is we found that writing the applications was relatively easy in both languages. There were hiccups and false starts. There were things that we had to rework. We found surprises, like in the case of file system IO, and the interaction of file system IO and Tokio. It was trivial to get something working in the first place. We noticed that, for very small files, the JVM was really slow to start. When sending small files one at a time, that cost dominated the charts. There was a huge impact in some cases if we were trying to fsync to disk. The IO throughput is what we ended up measuring in this benchmark. We were somewhat surprised at how well Java held up. The final numbers are in some iterations better than Rust. That was not something that we initially expected. We did expect Rust to be significantly better at memory utilization, and that came through. We also know that for multi-threaded applications and long-lived services, there are things that Java can do that Rust can't. In particular, writing lock-free data structures for high throughput is much easier in Java, because you're relying on these optimizations. Of course, we also found something that is to be expected given the age of the languages: Java had way more libraries than Rust. Rust, even though it's ready for production, is still a very green environment.
The End Result
By the end, we had written three different versions of the Rust app. We dropped the Tokio runtime in order to improve throughput. We added a way of using the sendfile syscall directly. We ended up using a thread pool based on rayon, which lets you send as many files as you want in parallel, at the number of logical CPUs that your machine has. We measured that and we realized that it's effectively the same as sending one file at a time, one after the other. For the Java version, we had multiple iterations using native IO, using the standard library IO, and using a full-blown HTTP service. We realized, for example, that syncing the file during teardown was introducing an additional delay, which is why we ended up writing files to tmpfs to remove any variation introduced by the SSD.
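The talk's version used a rayon thread pool. As a dependency-free stand-in, the sketch below gets a similar shape from the standard library alone: one worker per logical CPU, each pulling the next item off a shared counter. The function name `process_in_parallel` and the fake "send" step are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Runs `work` on every item using one worker thread per logical CPU and
// returns the results in the same order as the input.
fn process_in_parallel<T: Sync, R: Send>(
    items: &[T],
    work: impl Fn(&T) -> R + Sync,
) -> Vec<R> {
    // Number of logical CPUs, falling back to a single worker.
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Shared counter: each worker claims the next unprocessed index.
    let next = AtomicUsize::new(0);

    let mut results: Vec<(usize, R)> = thread::scope(|scope| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                scope.spawn(|| {
                    let mut local = Vec::new();
                    loop {
                        let i = next.fetch_add(1, Ordering::Relaxed);
                        if i >= items.len() {
                            break;
                        }
                        local.push((i, work(&items[i])));
                    }
                    local
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    });

    // Workers finish in arbitrary order, so restore the input order.
    results.sort_by_key(|entry| entry.0);
    results.into_iter().map(|entry| entry.1).collect()
}

fn main() {
    let files = ["a.bin", "b.bin", "c.bin"];
    // Stand-in for "send this file": just report the length of its name.
    let sizes = process_in_parallel(&files, |name| name.len());
    println!("{sizes:?}");
}
```

With rayon, the same idea is usually written as a parallel iterator over the file list; the manual version here only avoids the external dependency.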
Benchmarking Languages on Graviton
Of course, we are not the only ones that did some benchmarking within the company. Other teams inside of the company have looked at a relatively simple CRUD application running on Graviton, which is the Arm-based Lambda stack for AWS, and implemented it in multiple languages with different libraries, different stacks, and different versions of each language. There are different requests, but two main things were measured. One is the hot start latency, meaning you already have a Lambda function running. What is the latency that the user perceives when they send the request? Even though there is variation, particularly in the tail latencies, the average case is comparable across languages. Rust does have some benefits. Tail latencies, for example, were better in Go than in Rust. The garbage collected languages have worse tail latencies in these cases, which makes sense when you have to deal with garbage collection pauses. At the same time, because they are all doing effectively the same thing, and it's a CRUD application, what you end up measuring is effectively a benchmark of your database access. That, I think, is what this chart represents, and why there is so little variation in the p50 and p90 cases. When we move to cold start, that's where we really see the big differences in latency. The JVM has a high cost for startup. People are working to improve this, so these graphs may no longer be accurate in the future. Even if we remove the outlier here, Java, we can see that there is a consistent cost in cold start latency for all of the garbage collected languages, all of the languages with a runtime. One thing that I also wanted to highlight is how closely clustered the values are, particularly in Rust. Your p50 is incredibly close to your p99, which is usually what you want to see. You want the performance that you see to be as consistent as possible.
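For readers less familiar with the p50 and p99 notation, here is one common way to compute such values, the nearest-rank method. This is a sketch only: the talk does not say how its percentiles were computed, the function name `percentile` is hypothetical, and the latencies in `main` are made-up numbers.

```rust
// Nearest-rank percentile over an already sorted slice.
// `p` is a whole percentage from 0 to 100; returns None for empty input
// or an out-of-range percentage.
fn percentile(sorted: &[u64], p: usize) -> Option<u64> {
    if sorted.is_empty() || p > 100 {
        return None;
    }
    // rank = ceil(p * n / 100), computed with integer arithmetic.
    let rank = (p * sorted.len() + 99) / 100;
    // Ranks are 1-based; p = 0 maps to the smallest sample.
    Some(sorted[rank.max(1) - 1])
}

fn main() {
    // Made-up latencies in milliseconds, for illustration only.
    let mut latencies = vec![12, 11, 13, 12, 95, 11, 12, 14, 12, 13];
    latencies.sort_unstable();
    println!(
        "p50 = {:?}, p99 = {:?}",
        percentile(&latencies, 50),
        percentile(&latencies, 99)
    );
}
```

A single slow request barely moves p50 but dominates p99, which is why a small gap between the two indicates consistent performance.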
All of those charts that I showed, the benchmarks, the code, and the results of the benchmarks can be seen in these repos. What I wanted to get across to you is not "look at how great Rust is." The fact that Rust uses less memory means that you get better cache locality when you are executing, which directly translates to better performance on the same hardware. This means that if you get a 10% improvement in the throughput of your application, purely by improving your memory utilization patterns, or by using a language or a library that is more efficient, you can materially impact how much hardware you're using and how much energy you're using. You're doing more with the same hardware, or even with less. Moving from one language to another is a huge cost. It's not something to be taken lightly. As I showed in the charts earlier, you can iterate on Java, and you're going to find wins. There is incredibly good tooling to find where your performance is going and what your application is doing. By spending some time doing that you can, particularly at scale, have huge wins. I also want to show that if you are running a service written in Python on 1000 machines, and it's CPU intensive or memory intensive, by rewriting it in Rust you can reduce the size of your fleet by a considerable amount. That translates to lower use of resources and, of course, lower cost of operation.
Both Rust and Java allow you to write code for any platform. That is true of most languages, but in those two ecosystems there is a huge emphasis on making sure that libraries work regardless of what platform you're running on. If you're already running on Lambda and you move to Graviton2, for example, the Arm-based version of Lambda instead of x86, depending on your workload, you may get 34% better performance for the same price. There are multiple case studies that we have. The numbers vary a lot depending on what you're measuring and what your application is doing. By moving to hardware that is just as powerful but with lower resource consumption, you are also effectively reducing your energy consumption, without changing your code.
Conclusion
All of this is to say, measure twice, cut once. Finding that there are improvements that you're leaving on the table is the catalyst to huge wins. We are going to continue doing these kinds of comparative studies. We are not going to stop here. Do please look at the code in the provided links. The important thing is you can have lower resource utilization with relatively small changes. If you are looking in that direction, there are huge ways to do that by choosing your technologies carefully.