Transcript
Evans: My name is Ben Evans. I'm a Senior Principal Software Engineer at Red Hat. Before joining Red Hat, I was lead architect for instrumentation at New Relic. Before that, I co-founded a Java performance company called jClarity, which was acquired by Microsoft in 2019. Before that, I spent a lot of time working with banks and financial companies, and also in gaming. In addition to my career work, I'm also known for some of my work in the community. I'm a Java Champion and a JavaOne Rockstar speaker. For six years, I served on the Java Community Process Executive Committee, which is the body that oversees all new Java standards. I was deeply involved with the London Java Community, which is one of the largest and most influential Java user groups in the world.
Outline
What are we going to talk about? We're going to talk about observability. I think there's some context that we really need to give around observability, because it's talked about quite a lot, and I think there are still a lot of people, especially in the Java world, who find it confusing or a bit vague, or they're not quite sure exactly what it is. Actually, that's silly, because observability is really not all that conceptually difficult to understand. It does have some concepts which you might not be used to, but it doesn't actually take that much to explain them. I want to explain a bit about what observability is. I want to explain OpenTelemetry, which is an open source project and a set of open standards that fit into the general framework of observability. Then with those two bits of theory in hand, we can turn and look at a technology called JFR, or JDK Flight Recorder, which is a fantastic piece of engineering and a great source of data that can be really useful for Java developers who care about observability. Then we'll take a quick look at where we are and take the temperature of our current status. Then we'll talk a little bit about the future and roadmap, because I know that developers always love that.
Why Observability?
Let's kick off by thinking about what observability is. In order to really do that, I want to start from this question: why do we want to do it? Why is it necessary? I've got some interesting numbers here. The one I want to draw your attention to is the one on the left-hand side, which says roughly 63% of JVMs that are running in production currently are containerized. This number comes from our friends at New Relic, who publish this data. Since I put this deck together, they have a nice new result out which says that the 2022 numbers are a bit higher: now they're seeing roughly 70% of all JVM-based applications being containerized. For fun, on the right-hand side here, I'm also showing you the breakdown of Java versions. Again, these numbers are about a year out of date; if we looked at them again today, we would see that Java 11 has increased even more than that. Java 11 is now in the lead, very slightly ahead of Java 8. I know that people are always curious about these numbers. Obviously, they're not a perfect proxy for the Java market as a whole because it's just New Relic's customers, but it still represents a sample of tens of millions of JVMs. I think Gartner estimates that around 1% of all production JVMs show up in the New Relic data. Not a perfect dataset by any means, but certainly a very interesting one.
The big takeaway that I want you to get from here is that cloud native is increasingly our reality: 70% of applications are containerized, and that number is still rising, and rising very quickly. It depends upon the market segment, of course, and upon the maturity that individual organizations have, but it is still a big number. It is still a trend that I think we need to take seriously for many reasons, but particularly because it has been such a fast-growing segment. Containerization has happened remarkably quickly. When an industry adopts a new practice as rapidly and as wholesale as it has in this case, then I think that's a sign that you need to take it seriously and pay some attention to it.
Why has this happened? Because observability really helps solve a problem which exists in other architectures but is particularly apparent in cloud native, and that's an increase in complexity. We see this with things like microservices, and with certain other aspects of cloud native architectures as well. Because there's just more stuff in a cloud native architecture, more services, and all kinds of new technologies, traditional APM (Application Performance Monitoring) approaches just aren't really as suitable for cloud native. We need to do something new and something which is more suitable.
History of APM (Application Performance Monitoring)
To put this into some context and justify it a little, we can look back 15 years, to 2007. I was working at Morgan Stanley, and we certainly had APM software that we were deploying into our production environments. These were the first generation of those types of technologies, but they did exist 15 years ago, and we did get useful information out of them. Let's remember what the world of software development was like 15 years ago; it was a completely different world. We had release cycles that we measured in months, not in days or hours. Quite often, for the applications I was working with back in those days, we would have maybe a release every six weeks, maybe a release every couple of months. That was the cadence at which new versions of the software came out. This was before microservices. We had a service-based architecture, but these were large-scale, quite monolithic services. Of course, we ran this all in our own data centers or rented data centers. There was no notion of an on-demand cloud in the same way that we have these days.
What this means is two things. Because the architectures are stable for a period of months, a good operations team can get a handle on how the architecture behaves; they can develop intuition for how the different pieces of the architecture fit together and the things that can go wrong. And if you have a sense of what can go wrong, you can make sure that you gather data at those points and see whether things are going to go wrong. You end up with a typical view of an architecture like this, a traditional 3-tier architecture: a classic data tier, a JVM tier for application services, web servers, and some clustering and load balancing technology. Pretty standard stuff. What can break? The load balancers can break. The web servers are mostly just serving static content, and aren't doing a great deal. Yes, you could push a bad config or some bad routing to the web layer, but in practice, if you do that, you're going to find it pretty quickly. The clustering software can have some slightly odd failure modes, and so on. It's not that complicated. There's just not the same level of stuff that can go wrong that we see for cloud native.
Distributed System Running On OpenShift
Here's a more modern example. I work for Red Hat, so of course, I have to show you at least one slide which has got OpenShift on it. There we have a bunch of different things. What you'll notice here is that this is a much more complex and much more sophisticated architecture. We have some bespoke services. We've got an EAP service there. We've got Quarkus, which is Red Hat's Kubernetes-native Java stack. We've even got some things which aren't written in Java; we've got Node.js. We've also got some things which are still labeled as services, but they're actually much more like appliances. Take Kafka, for example: Kafka is a data transport layer. It's moving information from place to place and sharing it between services. There's not a lot of bespoke coding going on there; instead, it is something which is more like infrastructure than a piece of bespoke code. Here, the clear separation between the tiers is much more blurred. We've got a great admixture of microservices and infrastructural components like Kafka, and so on. The data layer is still there, but it's now augmented by much greater complexity in the services in that part of the architecture.
IoT/Cloud Example
We also have architectures which look nothing like traditional 3-tier architectures. This is a serverless example. This one really is cloud native; this really is the kind of thing that would be very difficult to build with traditional IT architectures. Here we have IoT, the internet of things. We have a bunch of sensors coming in from anywhere. Then we have some sort of server or even serverless provisioning, which produces an IoT stream job which is fed into a main datastore. Then we have other components which are watching that datastore, and have some machine learning model that's being applied over the top of it. Now, the components are actually simpler in some ways. A lot of the complexity has been hidden, and is being handled by the cloud provider for us. This is much closer to a serverless type of deployment.
How Do We Understand Cloud-Native Apps?
This basically brings us to the heart of how and why cloud native applications are different. They're much more complex. They have more services. They have more components. The topology, the way that the services interconnect with each other, is far more complicated. There are more sources of change, and that change is occurring more rapidly. This has moved us a long way away from the sorts of architectures that I would have been dealing with at the early point in my career. Not only are that complexity and that more rapid change major factors, we also must understand that there are new technologies with genuinely new behaviors of the type that we have never seen before: services which scale dynamically, containers, things like Kafka, function-as-a-service and serverless technologies. Then finally, of course, there is Kubernetes, which is a huge topic in its own right. That's our world. Those are the things that we have to face. Those are the challenges. That's why we need to do things in a different way.
User Perspective
Having said that, despite all of that additional complexity and all of that additional change in our landscape, there are certain questions, certain aspects, we still need answers to. We still need answers to questions like: what is the overall health of the solution? What about root cause analysis? What about performance bottlenecks? Is this change bad? Have I introduced some regression by changing the software and doing a rollout? Overall, what does the customer think about all of this? These key questions hold true for every type of architecture you deploy, from an old-school 3-tier architecture all the way through to the latest and greatest cloud native architecture. These concerns, these things that we care about, are still the same. That is why observability. We have a new world of cloud native, and we require answers to some of the same old questions, and maybe a few new answers to a few new questions as well. Broadly, we need to adapt our notion of what it is to provide good service, and to have the tools and the capabilities to do that. That's why observability.
What Is Observability?
What is observability, exactly? A lot of people have talked about this, and I think that a lot of the discussion around it is overcomplicated. I don't think that observability is actually that difficult to understand conceptually. The way that I will explain it is like this. First of all, we instrument our systems and applications to collect the data that we need to answer those user-level questions we were just talking about a moment or two ago. You send that data outside of your production system, to somewhere completely different, which is an isolated external system. Why? Because if you attempt to store and analyze that data within your production system, then when your system is down you may not be able to understand or analyze the data, because you may have a dependency on the very system which is causing the outage. For that reason, you send it to somewhere that's isolated and external.
Once you have that data, you can then use things like a query language, or almost an experimental approach of looking at the data, digging into it and trying to see what's going on by asking open-ended questions. That flexibility is key, because that is what provides you with the insights. You don't necessarily know what you're going to need to ask when you start trying to figure out what the root cause of an outage is, or why you are seeing problems in the system. That flexibility, the unknown unknowns, the questions you didn't know you needed to ask: that's what makes a system an observability system rather than just a monitoring system. Ultimately, of course, the foundation of this is control theory: how well can we understand the internal state of a system from outside it? That's a fairly theoretical underpinning, though. We're interested in the practitioner approach here. We're interested in the insights that could lead you to take action about your entire system. Can you observe not just a single piece, but all of it?
Complexity of Microservice Architectures
Now the complexity of microservice architectures starts to come in. It's not just that there are larger numbers of smaller services. It's not just that there are multiple groups of people who care about this: Dev, DevOps, and management. It's also things like heterogeneous tech stacks; in modern applications, you don't build every service or every component out of the same tech stack. Then finally, as we've already touched on with Kubernetes, services scale, and quite often that's done dynamically or automatically these days. That additional layer of complexity is added to what we have with microservices.
The Three Pillars
To help with diagnosing all of this, we have a concept of what's called the three pillars of observability. This concept is a little bit controversial. Some of the providers of observability solutions and some of the thinkers in the space claim that this is not actually that helpful a model. My take on it is that, especially for people who are just coming to the field and who are new to observability, this is actually a pretty good mental model, because these are things that people may already be slightly familiar with. It can provide them with a useful onramp to get into the data and into the observability mindset. Then they can decide whether or not to discard the mental model later. The three pillars are metrics, logs, and traces. These are very different data types. They behave differently and have different properties.
A metric is just a number that describes a particular process or activity: the number of transactions in, let's say, a 10-second window, or the CPU utilization on a particular container. Notice that it's basically a timestamp and a single number, measured over a fixed interval of time. A log is an immutable record of an event that happened at a point in time, which blurs the distinction between a log and an event. A log might just be an entry in a syslog or an application log, good old Log4j or something like that, but it might be something else as well. Then a trace: a trace is a piece of data which is used to show what was triggered by an individual user-level request. Metrics are not really tied to particular requests; traces are very much tied to a particular request; and logs are somewhere in the middle. We'll talk more about the different aspects of data that these things have.
Isn't This Just APM with New Marketing Terms?
If you were of a cynical mind, you might ask: isn't this just APM with new marketing? Here are five reasons why I think it's not. First, vastly reduced vendor lock-in. The open specification of the protocols on the wire, and the open sourcing of at least some of the components, especially the client-side components that you put into your application, hugely help to reduce vendor lock-in. That helps keep vendors in the space competitive, and it helps keep them honest, because if you have the ability to switch wire protocol, and maybe you only need to change a client component, then you can easily migrate to another vendor should you wish to. Related to that, you will also see standardized architecture patterns. Because people are now cooperating on protocols, on standards, and on the client components, we can start to have a discourse amongst architects and practitioners as to how we build this stuff out in a reliable and sustainable way. That leads to better architecture practice, which then feeds back into the protocols and components. Moving on from that, we also see that the client components are not the only pieces that are being developed; there is an increasing quantity and quality of backend components as well.
Open Source Approach
In this new approach, we start from the point of view of instrumenting the client side, which in this case really means the applications. In fact, most of these things are going to be server components; it's just typically thought of as the client side for the observability protocols. This will mean things like Java agents and other components that we place into our code, whether that's bespoke code or the infrastructural components which we'll also need to integrate with. From there, we send the data over the wire into a separate system, which is marked here as data collection. This component too is likely to be open source, at least for the receiving part. Then we also require some data processing. The first two steps are now very heavily dominated by open source components; for data processing, that process is still ongoing, and it is still possible to use either an open source component or a vendor for that part. The next step, which closes the loop and brings it back around to the user again, is visualization. Again, there are good stories here both from vendor code and from open source solutions. The market is still developing for these final two pieces.
Observability Market Today
In terms of today's market and what is actually in use, there was a recent survey by the CNCF, the Cloud Native Computing Foundation. They found that Prometheus, which is a slightly older metrics technology, is probably the most widely used observability technology around today, used by roughly 86% of all projects that they surveyed. This is of course a self-reported survey, and only the people who were actively interested and involved with observability will have responded to it. It's important to treat this data with a suitable amount of seasoning: it's a big number, but it may not have as much statistical validity as we might think. The project that we're going to spend a lot of time talking about, OpenTelemetry, was the second most widely used project, at 49%. Then there are some other tools as well, like Fluentd and Jaeger.
What takeaways do we have from this? One point which is interesting is that 72% of respondents employ up to 9 different tools. There is still a lack of consolidation. Even amongst the folks who are already interested in observability, and producing and adopting it within their organizations, over one-third complain that their organization lacks a proper strategy for this. It is still early days. We are already starting to see some signs of consolidation, though. The reason why we're so focused on and interested in OpenTelemetry is because OpenTelemetry usage is rising sharply. It's risen to 49% in just a couple of years. Prometheus has been around for a lot longer, and it seems to have mostly reached market saturation. OpenTelemetry, by contrast, is in some aspects still only moving out of beta; it's not fully GA yet. Yet it's already being used by about half of the folks who are adopting observability as a whole. In particular, Jaeger, which is a tracing solution, has decided to end-of-life its client libraries. Jaeger is pivoting to being a tracing backend, and for its client and data ingest libraries it is switching over completely to OpenTelemetry. That is just one sign of how the market is already beginning to consolidate.
This is part of a process where APM, traditionally dominated by proprietary vendors, is reaching an inflection point: we're moving from proprietary to open source led solutions. More of the vendors are switching to open source. When I was at New Relic, I was one of the people who led the switch of New Relic's code base from being primarily proprietary on the instrumentation side to being completely open source. One of the last things I did at New Relic before I left was to help oversee, over the course of seven months, the open sourcing of about $600 million worth of intellectual property. The market is definitely heading in this general direction, and one of the key technologies behind this is OpenTelemetry. Let's take a look and see what OpenTelemetry actually is.
What Is OpenTelemetry?
OpenTelemetry is a set of formats, open standards, and libraries. It is not about data ingest, backends, or providing visualizations; it is about the components which end users fit into their applications and their infrastructure. It is designed to be very flexible, and it is very explicitly cross-platform; it is not just a Java standard. Java is just one implementation of it. There are others for all of the major languages you can think of, at different levels of maturity. Java is a very mature implementation, and things like .NET, Node, and Go are all fairly mature as well. Other languages, Python, Ruby, PHP, Rust, are at varying stages of that maturity lifecycle. It is possible to get OpenTelemetry to work on top of bare metal or just in VMs, but there is no getting away from the fact that it is very definitely a cloud-first technology. The CNCF have fostered this, and they are in charge of the standard.
What Are Components of OpenTelemetry?
There are really three pieces to it that you might want to look at. The two big ones are the API and the SDK. The API is what the developers of instrumentation, and of the OpenTelemetry standard itself, tend to use, because it contains the interfaces; from there, you can do things like write an event exporter or write attribute libraries. The actual users, the application owners and end users, will typically configure the SDK. The SDK is an implementation of the API, and it's the one you get by default: when you download OpenTelemetry, you get the API, and you also get the SDK as a default implementation of that API. That is then the basis you have for instrumenting your application using OpenTelemetry, and it will be your starting point if you're new to the project. There are also the plugin interfaces, which are used by a smaller group of folks who are interested in creating new plugins and extending the OpenTelemetry framework.
What I want to draw your attention to is the support guarantees they describe. The API is guaranteed for three years; the plugin interfaces are guaranteed for one year, and so is the SDK, basically. It's worth noting that the different components, metrics, logs, and tracing, have different statuses and are at different points in their lifecycle. Currently, the only thing which is considered in scope for support is tracing, although the metrics piece will probably also come into support very soon, when it reaches 1.0. Some organizations, depending upon the way they think about support, might consider these not particularly long timescales. It will be interesting to see what individual vendors do, in terms of whether they honor these guarantees as-is or treat them as a minimum and, in fact, support for longer than this.
Here are our components. This is really what makes up OpenTelemetry. The specification comprises the API, the SDK, and the data and semantic conventions. Those are cross-language and cross-platform: all implementations must have the same view, as far as possible, as to what those things mean. Each individual language then needs not only an API and an SDK; we also need to instrument all of the libraries and frameworks and applications that we have available, and that should work, as far as possible, completely out of the box. That instrumentation piece is a separate component from the specification and the SDK. Finally, one other very important component of the OpenTelemetry suite is what we call the collector. The collector is a slightly problematic name, because when people think of a collector, they think of something which is going to store and process their data for them. It doesn't do that. What it really is, is a very capable network protocol terminator. It's able to speak a whole variety of different network formats, and it effectively acts as a switching station, or a router, or a traffic terminator. It's all about receiving, processing, and re-exporting telemetry data in whatever format it finds it in. Those are the primary OpenTelemetry components.
JDK Flight Recorder (JFR)
The next section is all about JFR. It is a pretty nice profiling tool. It's been around for a long time: it first appeared in Java 7, the first release of Java from Oracle, which is now well over 10 years ago. It's got an interesting history, because Oracle didn't invent it; they acquired it when they bought BEA Systems. Long before they did the deal with Sun Microsystems, they bought BEA, and BEA had their own JVM called JRockit. JFR originally stood for JRockit Flight Recorder. When they merged it into HotSpot with Java 7, it became Java Flight Recorder. From Java 7 up to Java 11, JFR was a proprietary tool: it didn't have an open source implementation, and you could only use it in production if you were prepared to pay Oracle for a license. In Java 11, it was added to OpenJDK, renamed to JDK Flight Recorder, and now everybody can use it.
It's a very nice profiling tool, and it's extremely low overhead. Oracle claim that it gives you about a 1% impact. I think that's probably overstating the case; it depends, of course, a great deal on what you actually collect. The more data you collect, the more you disturb the process that's under observation. It's almost like quantum mechanics: the more you look at something and the more you observe it, the more you disturb it and mess around with it. With a reasonable data collection profile I've certainly seen around 3%. If you're prepared to be lighter touch on the data collection, maybe you can get it down even further.
Traditionally, JFR data is displayed in a GUI console called Mission Control, or JMC. That's fine, but it has two problems that we're going to talk about. JFR by default generates an output file, a recording file like an airplane black box, and JMC, Mission Control, only allows you to load in one file at a time. Then you have the problem that, if you're looking across an entire cluster, you need lots of GUI windows open in order to see the telemetry data from the different machines. That's not typically how we want to do things for observability. At first sight, then, it doesn't look like JFR is all that suitable. We'll have to talk about how we get around that.
Using Flight Recorder
How does it work? You can start it with a command line flag. It generates this output file, and there are a couple of pre-configured profiles, as they call them, which can be used to determine what data is captured. Because it generates an output file and dumps it to disk, and because of the usage of command line flags, this can be a bit of a challenge in containers, as we'll see. Here's what some of the startup flags might look like: we've got java -XX:StartFlightRecorder, then a duration, and then a filename to dump it out to. The bottom example will start a flight recording when the process starts; it will run for 200 seconds and then dump out the file. For long-running processes, this is obviously not great, because what you get is only the first 200 seconds of the VM. If your process is up for days, that's actually not all that helpful.
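Concretely, the flags might look something like this (the filename and jar name here are just placeholders, and the exact options vary slightly between JDK versions):

    java -XX:StartFlightRecorder=duration=200s,filename=/tmp/startup.jfr -jar myapp.jar
    java -XX:StartFlightRecorder=settings=profile,duration=200s,filename=/tmp/startup.jfr -jar myapp.jar

The settings option selects one of those pre-configured profiles (default or profile), which controls how much data is captured.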
There is a command called jcmd. Jcmd is used not just to control JFR, but it can be used to control many aspects of the Java virtual machine. If you're on the machine's console, you can start and stop and control JFR from the command line. Again, this is not really that useful for containers and for DevOps, because in many cases, with modern containers and modern deployments, you can't log into the machine. How do you get into it, in order to issue the command, in order to start the recording? There are all sorts of practices you can do to mitigate this. You can set things up so that JFR is configured as a ring buffer. What that means is the buffer is constantly running and it's recording the last however many seconds or however many megabytes of JFR information, and then you can trigger JFR to dump that buffer out as a file.
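As a rough sketch of that ring-buffer approach, assuming a JDK 11+ JVM (the name, sizes, and filename are placeholders): start a continuous recording bounded by age or size, and dump it on demand with jcmd:

    java -XX:StartFlightRecorder=name=ring,maxage=10m,maxsize=100m -jar myapp.jar
    jcmd <pid> JFR.dump name=ring filename=/tmp/last-10-minutes.jfr

Because no duration is given, the recording runs continuously and keeps only the most recent data, and the dump can be triggered from whatever access you do have to the container, for example a sidecar or an exec probe.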
Demo - JFR Command line
Here's one I made earlier. This application is called heapothesys, by our friends and colleagues at Amazon. It is a memory benchmarking tool. We don't want to do too much, so let's give this a duration of 30 seconds to run rather than the 3 minutes, and let's just change the filename as well so I don't obliterate the last one that I have. There we go. You can see that I've started this up, and you can see that the recording is working. In about 30 seconds we should get an output to say that we've finished. The HyperAlloc benchmark, which is part of the heapothesys project, is a very useful benchmark for playing with the memory subsystem. I use it a lot for some of my testing and some of my research into garbage collection. Ok, so here we go, we have now got a new file, there it is, hyperalloc_qcon. From the command line, there's actually a jfr command. Here we go, jfr print. There's loads of data: lots of things to do with GC configuration, code cache statistics, all sorts of things that we might want, lots of things to do with the module system.
Here are lots of CPULoad events. If you look very carefully, you can see that they arrive about once a second. It's providing ticks which could easily be turned into metrics for CPU utilization, and so forth. You see, we've got lots of nice numbers here: we've got the jvmUser, the jvmSystem, and the total for the machine as well. We can do these types of things with the command line. What else can we do from the command line? Let's just reset this back to 180. Now I'm just going to take all of that detail out, so we're not going to start recording at startup. Instead, I'm going to run that, look at jps from here, and now I can do jcmd. We'll just leave that running for a short amount of time. Now we can stop it. I forgot to give it a filename and to dump it: as well as the start and stop commands, you also need a JFR.dump in there. That's just a brief example of showing you how you could do some of this with the command line.
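For reference, the full jcmd sequence I was aiming for looks roughly like this (the recording name and filenames are arbitrary):

    jcmd <pid> JFR.start name=demo
    jcmd <pid> JFR.dump name=demo filename=/tmp/demo.jfr
    jcmd <pid> JFR.stop name=demo
    jfr print --events CPULoad /tmp/demo.jfr

The last line shows the jfr tool filtering the printed output down to just the CPULoad events, rather than dumping everything.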
The other thing which you can do is programmatic. You can take a file, and here's one I made earlier. Within the modern 11-plus JDK, you can see that we have a couple of classes, RecordedEvent and RecordingFile. These enable us to process the file. Down here, for example, on line 19, we can take in a RecordingFile, and then process it in a while loop where we take individual events, which are of this type, jdk.jfr.consumer.RecordedEvent. Then we need some way of processing the events. I use a pattern for programmatically handling JFR events which involves building handlers. I have an interface called a RecordedEventHandler, which combines both a consumer and a predicate. Effectively, you test to see whether or not you will handle this event; then, if you can, you consume it. Here's the test method, which is the predicate, and the other method is the consumer, which is accept. Then, basically, what this boils down to is something like a G1 handler. This one can handle a bunch of different events: G1HeapSummary, GCHeapSummary, and GCPhaseParallel. The accept method looks like this: we basically look at the incoming event's name, figure out which of these it is, and then delegate to an overload of accept. That's just some code for programmatically handling events like this and for generating CSV files from them.
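A minimal sketch of that RecordingFile loop, assuming a JDK 11+ runtime, might look like the following. The class name and CSV output are invented for the example; the field names are those of the jdk.CPULoad event:

    import java.io.IOException;
    import java.nio.file.Path;
    import jdk.jfr.consumer.RecordedEvent;
    import jdk.jfr.consumer.RecordingFile;

    public class JfrFileToCsv {
        public static void main(String[] args) throws IOException {
            // Open the recording file passed on the command line
            try (RecordingFile recording = new RecordingFile(Path.of(args[0]))) {
                while (recording.hasMoreEvents()) {
                    RecordedEvent event = recording.readEvent();
                    // A real handler chain would delegate on event type;
                    // here we just pick out CPU load samples and emit CSV lines
                    if ("jdk.CPULoad".equals(event.getEventType().getName())) {
                        System.out.printf("%s,%.4f,%.4f,%.4f%n",
                                event.getStartTime(),
                                event.getFloat("jvmUser"),
                                event.getFloat("jvmSystem"),
                                event.getFloat("machineTotal"));
                    }
                }
            }
        }
    }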
JFR Event Streaming
One of the other things which has also happened with recent versions of JFR is a move away from dealing with files. JFR files are great if what you're doing is fundamentally performance analysis. Unfortunately, they have problems for observability and for long-term, always-on production profiling. What we need instead is a telemetry stream of information. The first step towards this came in Java 14, which came out over two years ago now. That basically provided a mode for JFR where you could get a callback. Instead of having to start and stop recordings and control them, you could just set up a thread which said, every time one of these events that I've registered appears, please call me back, and I will respond to the event.
Example JFR Java Agent
Of course, one way that you might want to do this is with a Java agent. You could, for example, produce some very simple code like this. This is actually a complete working Java agent. We've got a premain method, so we will attach. Then we have a run method. I've cheated a little tiny bit, because there's a StreamEventSender object which I haven't implemented; basically, it sends the events on to anywhere that we would want. You might imagine that they just go over the network. Now, instead of having a RecordingFile, we have a RecordingStream. Then all we need to do is to tell it which events we want to enable, so CPULoad. There's also one called JavaMonitorEnter. This basically is an event which lets you know when you're holding a lock for too long, so we'll get a JFR event triggered every time a synchronized lock is held by any thread for more than 10 milliseconds. Effectively, what you can detect with that is long-held locks. You set those two up with callbacks, which are the onEvent lines. Then finally, you call start. That method does not return, because your thread has now been set up as an event loop, and it will receive events from the JFR subsystem as things happen.
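Putting that together, a minimal sketch of such an agent might look like this. The class name is invented, and the send method is just a placeholder standing in for the StreamEventSender described above (a real one would ship the events over the network):

    import java.time.Duration;
    import jdk.jfr.consumer.RecordedEvent;
    import jdk.jfr.consumer.RecordingStream;

    public class JfrStreamingAgent {
        // Requires a Premain-Class entry in the agent jar's manifest
        public static void premain(String agentArgs) {
            Thread worker = new Thread(JfrStreamingAgent::run, "jfr-stream");
            worker.setDaemon(true);
            worker.start();
        }

        private static void run() {
            try (RecordingStream rs = new RecordingStream()) {
                // Sample CPU load roughly once a second
                rs.enable("jdk.CPULoad").withPeriod(Duration.ofSeconds(1));
                // Fire an event when a monitor is held for more than 10 ms
                rs.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(10));
                rs.onEvent("jdk.CPULoad", JfrStreamingAgent::send);
                rs.onEvent("jdk.JavaMonitorEnter", JfrStreamingAgent::send);
                rs.start(); // blocks: this thread becomes the event loop
            }
        }

        private static void send(RecordedEvent event) {
            // Placeholder for a real StreamEventSender
            System.out.println(event);
        }
    }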
What Is the Current Status of OpenTelemetry?
How can we marry up JFR with OpenTelemetry? Let's take a quick look at what the status of OpenTelemetry actually is. Traces are 1.0; they've been 1.0 for, I think, about a year now. They allow you to track the progress of a single request. They are basically replacing older open standards, including OpenTracing and Jaeger's client libraries. Distributed tracing within OpenTelemetry is eating the lunch of all of those projects. It seems very clear that that is how the industry, not just in Java, is going to do tracing going forwards. Metrics is very close to hitting 1.0; in fact, it may go 1.0 as early as this week. For the JVM, that means both application and runtime metrics. There is still some work to do on the JVM metrics, the ones that are produced directly by the VM itself, that is, the ones we'll use JFR for, to get those to align completely. That is the focus of ongoing work. Metrics is now very close as well. Logging is still in a draft state. We do not expect that we will get a 1.0 log standard until late 2022 at the earliest. Anything which is not a trace or a metric is considered to be a log. There's some debate about whether, as well as logs, we need events as a related type or a subtype of logs.
Different Areas Have Different Competitors
The maturities are different in some ways. For traces, OTel is basically out in front. For metrics, there are already a lot of folks using Prometheus, especially for Kubernetes; however, it's less well established elsewhere, and it hasn't really moved a lot lately. I think that is a space where OTel, and a combined approach which uses OTel traces and OTel metrics, can really potentially make some headway. The logging landscape is more complicated, because there are lots of existing solutions out there. It's not clear to me that OTel logging will make that much of an impact yet; it's very early days for that last one. In general, OpenTelemetry is going to be declared 1.0 as soon as traces and metrics are done. The overall standard as a whole will go 1.0 very soon.
Java and OpenTelemetry
Let's talk about Java and OpenTelemetry. We've talked about some of these concepts already, but now let's try and weave the threads together, and bring it into the realm of what a Java developer or Java DevOps person will be expected to do day-to-day. First of all, we need to talk a little bit about manual versus automatic instrumentation. In Java, unlike some other languages, there are really two ways of doing things. There is manual instrumentation, where you have full control. You can write whatever you like and instrument whatever you like, but you have to do it all yourself, and you have a direct coupling to the observability libraries and APIs. There's also the horrible possibility of human error here, because what happens if you don't instrument the right things, or you think something isn't important, and it turns out to be important? Not only do you not have the data, but you may not know that you don't have it. Manual instrumentation can be error prone.
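As an illustration of what manual instrumentation looks like, here is a hedged sketch using the OpenTelemetry tracing API. The class, method, and span names are invented for the example, and this is exactly the kind of direct coupling to the API that I'm talking about:

    import io.opentelemetry.api.GlobalOpenTelemetry;
    import io.opentelemetry.api.trace.Span;
    import io.opentelemetry.api.trace.Tracer;
    import io.opentelemetry.context.Scope;

    public class CheckoutService {
        private static final Tracer TRACER =
                GlobalOpenTelemetry.getTracer("com.example.checkout");

        void processOrder(String orderId) {
            // Every span has to be created, made current, and ended by hand
            Span span = TRACER.spanBuilder("processOrder").startSpan();
            try (Scope scope = span.makeCurrent()) {
                span.setAttribute("order.id", orderId);
                // ... business logic ...
            } finally {
                span.end();
            }
        }
    }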
Alternatively, some people like automatic instrumentation. This requires you to use a Java agent, or to use a framework which automatically supports OpenTelemetry. Quarkus, for example, has automatic inbuilt OTel support; you don't need a Java agent, and you don't need to instrument everything manually. Instead, the framework will do a lot to support you. It's not a free lunch: you still require some config. In particular, when you've got a complex application, you may have to tell it certain things not to instrument, just to make sure you don't drown in too much data. The downside of automatic instrumentation is that there could be a startup time impact if you're using a Java agent, and there might be some performance penalties as well. You have to measure that. You have to determine for yourself which of these two routes is right for you. There's also a bit of a hybrid approach which you could take as well. Different applications will reach different solutions.
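By contrast, automatic instrumentation with the OpenTelemetry Java agent is mostly a matter of startup flags. A rough example, where the service name, endpoint, and the particular instrumentation being switched off are all placeholders:

    java -javaagent:./opentelemetry-javaagent.jar \
         -Dotel.service.name=checkout-service \
         -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
         -Dotel.instrumentation.jdbc.enabled=false \
         -jar app.jar

The last property is an example of the kind of configuration just mentioned, where you tell the agent not to instrument something in order to keep the data volume down.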
Within the open-telemetry GitHub org, there are three main projects that we care about within the Java world. There's opentelemetry-java, which is the main repo: it includes the API and the SDK. There is opentelemetry-java-instrumentation, which is the instrumentation for libraries and other components and things that you can't directly modify; it also provides an agent which enables you to instrument your applications as well. There's also opentelemetry-java-contrib, which contains the standalone libraries, the things which accompany the main projects. It's also where anything which is intended for the main repos, either the main OTel Java repo or the Java instrumentation repo, goes first. The biggest pieces of work in Java contrib right now are the gathering of metrics by JMX, and JFR support, which is still very much in beta; we haven't finished it yet, and we are still working on it.
This leads us to an architecture which looks a lot like this. You have applications with libraries which depend directly upon the API. Then we have an SDK, which provides us with exporters that send the data across the wire. For tracing, we will always require some configuration, because we need to say where the traces are sent. Typically, traces will be sampled: it is not normally possible to collect data about every single transaction and every single user request that comes in. We need to sample, and the question is, how do we do the sampling? Do we sample everything at the same rate? Some people, notably the Honeycomb folks, very much want to sample errors more frequently. There is an argument to be made that errors should be sampled at 100%; 200 OKs, maybe not. There's also the question of whether you should sample uniformly or whether you should use some other distribution for determining how you sample. In particular, could you do some long-tail sampling, where slow requests are sampled more heavily than the requests which complete closer to the mean? Metrics collection is also handled by the SDK. We have a metrics provider, which is usually global, as an entry point. We have three things that we care about: counters, which only ever increase, so a transaction count, something like that; measures, which are values aggregated over time; and observers, which are the most complex type, and effectively provide a callback.
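To make that concrete, here is a hedged sketch of wiring up the SDK by hand: an OTLP exporter for traces, a simple ratio-based sampler standing in for the more sophisticated sampling strategies just discussed, and a counter. The endpoint and names are placeholders, and the exact builder APIs vary a little between SDK versions:

    import io.opentelemetry.api.metrics.LongCounter;
    import io.opentelemetry.api.metrics.Meter;
    import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
    import io.opentelemetry.sdk.OpenTelemetrySdk;
    import io.opentelemetry.sdk.trace.SdkTracerProvider;
    import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
    import io.opentelemetry.sdk.trace.samplers.Sampler;

    public class OtelSetup {
        public static OpenTelemetrySdk init() {
            // Traces: batch them up and export over OTLP/gRPC, sampling 10% of requests
            SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
                    .addSpanProcessor(BatchSpanProcessor.builder(
                            OtlpGrpcSpanExporter.builder()
                                    .setEndpoint("http://otel-collector:4317")
                                    .build())
                            .build())
                    .setSampler(Sampler.traceIdRatioBased(0.1))
                    .build();

            OpenTelemetrySdk sdk = OpenTelemetrySdk.builder()
                    .setTracerProvider(tracerProvider)
                    .buildAndRegisterGlobal();

            // Metrics: a counter only ever increases, e.g. a transaction count
            Meter meter = sdk.getMeter("com.example.checkout");
            LongCounter transactions = meter.counterBuilder("transactions.total").build();
            transactions.add(1);

            return sdk;
        }
    }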
Aggregation in OpenTelemetry
One of the things which we should also say about OpenTelemetry is that it is a big-scale project. It is designed to scale up to very large systems. In some ways, it's an example of a system which is built for the big scale, but is still usable at medium and small scales. Because it's designed for big systems, it aggregates. Aggregation happens not in your app code or under the control of the user, but in the SDKs. It's possible to build complex architectures which do multiple aggregations at multiple scales.
Status of OTel Metrics
Where are we with metrics? Metrics for manually instrumented code are stable. The wire format is stable. We are 100% production ready on the code. The one thing on which there still might be a slight bit of variation, at least until the next release drops, is the exact nature or meaning of the data that's being collected from OTel metrics. If you are ready to start deploying OpenTelemetry, I would not hold back at this point on taking the OTel metrics as well.
Problems with Manual Instrumentation
There are a lot of problems with manual instrumentation. Trying to keep it up to date is difficult. You have confirmation biases: you may not know what's important, and what counts as important will probably change as the application changes over time. There's a nasty problem with manual instrumentation, which is that you quite often only find out what is really important to your application in an outage, and that goes against the whole purpose of observability. The whole purpose of observability is to not have to predict what is important, to be able to ask those questions you didn't know you'd need to ask at the outset. Manual instrumentation works against that goal. For that reason, lots of people like to use automatic instrumentation.
Java Agents
Basically, Java agents install a hook. I did show an example of this earlier on, which contains a premain method. That's a pre-registration hook: it runs before the main method of your Java application. It allows you to install transformer classes, which have the ability to rewrite code as it's seen. Basically, there is an API with a very simple hook: there's a class called Instrumentation. You can write bytecode transformers and weavers, and then add them as class transformers into Instrumentation. That's where the real work is done, so that when the premain method exits, those transformers have been registered. Those transformers are then able to rewrite code and insert bytecode into classes as they're loaded. There are key libraries for doing this. In OpenTelemetry we use one called Byte Buddy. There's also a very popular bytecode rewriting library called ASM, which is used internally by the JDK.
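Stripped right down, the shape of that hook is something like the sketch below. The class name is invented, the transformer here deliberately does nothing (returning null leaves a class unchanged), and a real agent would hand the bytes to something like Byte Buddy or ASM:

    import java.lang.instrument.ClassFileTransformer;
    import java.lang.instrument.Instrumentation;
    import java.security.ProtectionDomain;

    public class MinimalWeavingAgent {
        public static void premain(String agentArgs, Instrumentation inst) {
            // Register a transformer that sees every class as it is loaded
            inst.addTransformer(new ClassFileTransformer() {
                @Override
                public byte[] transform(ClassLoader loader, String className,
                                        Class<?> classBeingRedefined,
                                        ProtectionDomain protectionDomain,
                                        byte[] classfileBuffer) {
                    // Inspect or rewrite classfileBuffer here; null means "no change"
                    return null;
                }
            });
            // When premain returns, main() runs with the transformer in place
        }
    }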
The Java agent that's provided by OpenTelemetry can attach to any Java 8 and above application. It dynamically injects bytecode to capture the traces. It supports a lot of the popular libraries and frameworks completely out of the box. It uses the OTLP exporter. OTLP is the OpenTelemetry protocol: a network protocol which is really Google Protocol Buffers over gRPC, which is an HTTP/2 style of protocol.
Resources
If you want to have a look at the projects, the OpenTelemetry Java repo is probably the best place to start. It is a large and sophisticated project, and I would very much recommend that you take some time to look through it if you're interested in becoming a developer on it. If you just want to be a user, I would simply consume a published artifact from Maven Central or from your vendor.
Conclusion
Observability is a growing trend for cloud native developers. There are still plenty of people using things like Prometheus and Jaeger today, but OpenTelemetry is coming. It is quite staggering how quickly it is growing and how many new developers are onboarding to it. Java has great data sources which could be used to drive OpenTelemetry, including technologies like Java agents and JFR. There is active open source work to bring these two strands together.