Key Takeaways
- To assess whether current LLM tools can make programmers more productive, an experiment was conducted using improved unit test code coverage as an objective measure.
- No-cost LLMs were chosen to participate in this experiment: ChatGPT, CodeWhisperer, codellama:34b, codellama:70b, and Gemini. These are all free offerings, which is why GitHub Copilot is not on this list.
- The experiment was designed to test each of the selected LLMs' ability to generate unit tests for an already coded, non-trivial web service. Each LLM was given the same problem and the same prompting. The output was then combined with the existing open source project, which was compiled and its unit tests run. A record was kept of all the corrections needed to get the build to pass again.
- None of the LLMs could perform the task successfully without human supervision and intervention, but most were able to accelerate the unit test coding process to some degree.
It hasn't even been two years since OpenAI announced ChatGPT, the first mainstream Large Language Model based on a generative pre-trained transformer to be made available to the public in a way that is very easy to use.
This release triggered lots of excitement and activity, from Wall Street to the White House. Just about every Fortune 500 company and tech startup is trying to figure out how to capitalize on LLMs. The developer ecosystem is ablaze with supporting tools and infrastructure, such as LangChain, that accelerate the integration of existing applications with LLMs.
At a recent tech conference in Denver, Eric Evans (creator of Domain-Driven Design) "encouraged everyone to start learning about LLMs and conducting experiments now, and sharing the results and learnings from those experiments with the community."
This article documents my personal contribution to that effort. This research is not based on any requirements or direction from any of my employers past or present.
The Experiment
I decided to come up with an experiment that I could perform on each of the more prevalent LLMs available, in order to make comparisons between them and to explore the boundaries of what LLMs are capable of, at least for the near term. I am not interested in, or even worried about, how LLMs might replace coders. You still need experienced developers when using LLMs because you need to be particularly vigilant about reviewing the suggestions. I am more interested in how LLMs can help coders become more productive by automating the more time-consuming and menial, yet still very important, parts of writing code. I am referring to unit tests, of course.
Many will claim that unit tests are not a good fit for LLMs because, in theory, with Test-Driven Development you write the tests first and the code afterwards. I have had experience at a fair number of companies where the unit tests were almost an afterthought and the amount of code covered by the tests was directly proportional to the amount of time remaining in the sprint. If coders could write more unit tests faster with LLMs, then there would be more code coverage and higher quality. I can already hear the TDD die-hards up in arms. Sorry, but that is just the cold hard truth of it. Besides, how are you going to mock the external dependencies in the tests without first knowing the internal details of the implementations?
I have an open source repository where I implement identical microservices in various programming languages and tech stacks. Each implementation includes access to MySQL fronted by Redis. I put each microservice through the same load test, where I collect and analyze the performance data in order to make comparisons. I took a service class from the Java on Spring Boot implementation and scooped out all but three of the public, routable methods. Then I took the unit test code and removed all but one of the unit tests. I left the imports, the setup, and the annotation-based dependency injection in place. In the prompting, I asked for the other two unit tests to be generated. The entire prompt is 250 lines (about 1,300 words) in length. Why did I use the Java on Spring Boot stack? There is a lot of code readily available online, without licensing restrictions, which was used as some of the training data for the LLMs. Spring Boot's approach is also heavily annotation-based, which requires deeper understanding than simply studying the code itself.
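To give a feel for the shape of that prompt, here is a heavily condensed sketch. Every class, method, and field name below is a hypothetical stand-in for the real code in the repo, and the trivial type definitions are included only so the sketch compiles on its own; the actual 250-line prompt was the repo's own code.

```java
// A condensed, hypothetical sketch of the kind of material in the prompt:
// a cache-fronted service and a test class with the imports, setup,
// annotation-driven mock injection, and one example test left in place.
package com.example.sketch;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.mockito.Mockito.when;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

record Outbound(long id, String payload) {}              // stand-in entity

interface OutboundDao { Outbound fetch(long id); }       // MySQL access

interface OutboundCache { Outbound get(long id); }       // Redis front

class OutboundService {                                  // class under test
    private final OutboundCache cache;
    private final OutboundDao dao;

    OutboundService(OutboundCache cache, OutboundDao dao) {
        this.cache = cache;
        this.dao = dao;
    }

    Outbound getOutbound(long id) {
        Outbound cached = cache.get(id);
        return cached != null ? cached : dao.fetch(id);
    }

    // (the real service in the prompt kept two more public, routable
    // methods, omitted here)
}

@ExtendWith(MockitoExtension.class)
class OutboundServiceTest {

    @Mock
    private OutboundDao dao;

    @Mock
    private OutboundCache cache;

    @InjectMocks
    private OutboundService service;

    // The one unit test left in place as a worked example for the LLM.
    @Test
    void getOutbound_onCacheMiss_returnsWhatTheDaoReturns() {
        Outbound expected = new Outbound(1L, "hello");
        when(cache.get(1L)).thenReturn(null);
        when(dao.fetch(1L)).thenReturn(expected);

        assertEquals(expected, service.getOutbound(1L));
    }

    // The prompt asked for the unit tests covering the other two methods
    // to be generated here.
}
```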
I have approached this problem as if it were a scientific experiment, but it is not a very good one. An experiment is valuable only if it is reproducible, yet all of the technologies under evaluation are constantly evolving, and the owning organizations spend a lot of money innovating on these products and, hopefully, improving them. It is highly unlikely that you would arrive at the same results were you to conduct the same experiment today. Nevertheless, I parked a copy of the prompt that I used here just in case you wished to try.
OpenAI’s ChatGPT
It was the early days of ChatGPT, at the end of 2022, that brought transformer-based Large Language Models and OpenAI into the media spotlight. Of course I had to include ChatGPT in this experiment. There were a lot of rough patches in those early days: a small context window, low quality output, prompt forgetting, confident hallucinations. It's in a much better place now. Like all of the other technologies evaluated here, I used the free version. There are concerns around revealing proprietary information via prompt leaking when using these commercial LLMs. That is why I based the experiment on open source; there is nothing proprietary to leak. Prompt leaking is unavoidable because your prompts are used to fine-tune the LLM, which, over time, improves the quality of its future answers.
How did ChatGPT perform? It did okay. The explanation of the results was concise and accurate. The output was useful, but it did have bugs, especially in the dependency injection and mocking areas. The test coverage was just okay: the unit test code had assertions for individual properties, a not-found case, and not-null checks. Even though there were bugs, I still considered the output to be useful because it would have taken me more time to type the code myself than to fix the bugs in the generated code.
Amazon CodeWhisperer
The next technology that I evaluated in this way was Amazon CodeWhisperer, a plugin for Visual Studio Code. The experience is basically enhanced statement completion. You start typing and it finishes the line; sometimes it completes a block of code. You then choose to either accept or reject the proposed change. Most of the time, I accepted the change and then made whatever corrections were required to fix any bugs in the generated code. I felt the most productive with CodeWhisperer.
I believe that CodeWhisperer is similar in approach to GitHub Copilot, which I did not evaluate because it costs money whereas CodeWhisperer was free. When it comes to the internals of CodeWhisperer, Amazon keeps its cards close to the vest. It's probably more than just an LLM, but it does have that LLM feel to it. I suspect that CodeWhisperer was phoning home all the time, because the IDE would freeze up often. Disabling CodeWhisperer would make the IDE responsive again.
Code Llama
Ollama is an open source platform that allows you to build and run LLMs locally. It makes available a fair number of open source models that have already been pre-trained, including Meta's Code Llama model. There is a lot of interest in these open source LLMs. Noted venture capitalist Bill Gurley identified Llama 2 (included in Ollama's library of models) as OpenAI's biggest threat at a summit at UCLA last year. Remember that prompt leaking issue that I mentioned earlier? Because you are hosting Ollama on VMs that are directly under your control, there is little possibility of prompt leaking.
Although it is not required, you are definitely going to want to run this on a system with a reasonably powerful GPU. I don't have a GPU on my personal laptop, so I provisioned a machine-learning Linux image with CUDA on an a2-highgpu-1g VM with an nvidia-tesla-a100 GPU (312 TFLOPS) from Google Cloud Platform to run the tests. More specifically, I used the codellama:34b model. From the Meta blog that introduced this model: "Code Llama is a code-specialized version of Llama 2 that was created by further training Llama 2 on its code-specific datasets, sampling more data from that same dataset for longer. Essentially, Code Llama features enhanced coding capabilities, built on top of Llama 2." I ran the same test with codellama:70b on an NVIDIA A100 40GB and did see a slight improvement in the code coverage of the resulting generated code. Provisioning that instance cost $3.67 per hour at the time of this writing.
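For anyone wanting to reproduce this kind of run, one way to push a long prompt at a locally hosted model is Ollama's local REST API (the same thing can be done interactively with `ollama run codellama:34b`). Here is a minimal sketch, assuming Ollama is listening on its default port, the model has already been pulled, and the prompt has been saved to a hypothetical prompt.txt:

```java
// Minimal sketch: send a prompt file to a locally hosted codellama:34b
// through Ollama's REST API (default port 11434) and print the response.
// Assumes the model has already been pulled; prompt.txt is hypothetical.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class GenerateUnitTests {

    public static void main(String[] args) throws Exception {
        String prompt = Files.readString(Path.of("prompt.txt"));

        // stream=false asks Ollama for a single JSON document instead of
        // a stream of chunks; the generated text is in its "response" field.
        String body = String.format(
                "{\"model\": \"codellama:34b\", \"prompt\": %s, \"stream\": false}",
                jsonQuote(prompt));

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:11434/api/generate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Printed raw to keep the sketch dependency-free; a real harness
        // would parse the JSON and pull out the "response" field.
        System.out.println(response.body());
    }

    // Crude JSON string escaping, good enough for a sketch.
    private static String jsonQuote(String s) {
        return '"' + s.replace("\\", "\\\\")
                      .replace("\"", "\\\"")
                      .replace("\r", "\\r")
                      .replace("\n", "\\n")
                      .replace("\t", "\\t") + '"';
    }
}
```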
The results were a little bit worse than ChatGPT's, but not too bad. There were compile errors, missing packages and imports, and mocking and dependency injection bugs. With the 34b model, the only code coverage was an assertion that the result was not null. With the 70b model, that was replaced with an assertion that what was returned from the service call matched what was injected into the mock of the underlying DAO call. Code Llama's results included no explanation or online references. The generated output did include code comments, but they did not add much value.
Google Gemini
The last LLM on which I performed this experiment was Gemini, which is Google's rebranding of Bard. Like Code Llama, the generated output neglected to include the package statement or the imports, which were available in the input. That was easy to fix. I am not sure if that was a mistake or simply a different interpretation of what I was asking for. Like all of the chat-based technologies, the output also had similar bugs in the dependency injection and mocking code. The code coverage was a little bit better, as there were also tests for both a cache hit and a cache miss. The explanation was slightly better than ChatGPT's, and it cited the source that it used the most, which was not the open source repo where all of the code in the experiment came from. This was useful in the same way that ChatGPT was useful: it took less time to fix the bugs than it would have to code the two methods from scratch.
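For illustration only (this is not Gemini's actual output), a cache hit/miss pair written against the hypothetical OutboundService from the earlier prompt sketch might look roughly like this, with `verify`, `never`, and `anyLong` statically imported from Mockito in addition to the imports shown earlier:

```java
// Illustrative cache hit / cache miss tests against the hypothetical
// OutboundService from the earlier sketch; not Gemini's actual output.
@Test
void getOutbound_onCacheHit_neverTouchesTheDao() {
    Outbound cached = new Outbound(1L, "hello");
    when(cache.get(1L)).thenReturn(cached);

    assertEquals(cached, service.getOutbound(1L));
    verify(dao, never()).fetch(anyLong());   // Redis answered, MySQL skipped
}

@Test
void getOutbound_onCacheMiss_fallsThroughToTheDao() {
    Outbound stored = new Outbound(2L, "world");
    when(cache.get(2L)).thenReturn(null);
    when(dao.fetch(2L)).thenReturn(stored);

    assertEquals(stored, service.getOutbound(2L));
    verify(dao).fetch(2L);                   // cache missed, MySQL consulted
}
```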
Conclusion
It can be quite challenging to obtain accurate and reliable numeric measurements for something as subjective as software quality. The following table attempts to summarize these findings in a numeric, and therefore comparative, way. The Myers diff algorithm is used to measure the number of lines of code added and deleted (a modified line is counted as both an add and a delete) to fix the bugs in the generated unit test code (because you are definitely going to have to fix the generated code). The Jacoco code coverage figure is the number of instructions (think Java bytecode) covered by the unit tests divided by the total number of instructions (both covered and missed), expressed as a percentage.
LLM-Based Generation of Unit Tests: Results Summary by the Numbers

| Generative AI Offering | Explanatory Analysis | Myers Diff (lines added + deleted) | Jacoco Code Coverage |
|---|---|---|---|
| ChatGPT | yes | 8 | 29.12% |
| CodeWhisperer | no | 26 | 27.81% |
| codellama:34b | no | 117 | 23.42% |
| Gemini | yes | 69 | 31.23% |
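As a quick illustration of how the Jacoco column is derived, here is the arithmetic spelled out. The instruction counts below are hypothetical; they were simply chosen so the result lands on the ChatGPT row's percentage.

```java
// Illustration of the Jacoco calculation: covered instructions divided by
// total (covered + missed) instructions, expressed as a percentage.
// The counts are hypothetical, not numbers taken from the experiment.
public class CoverageMath {
    public static void main(String[] args) {
        long covered = 265;   // hypothetical covered instruction count
        long missed  = 645;   // hypothetical missed instruction count

        double percent = 100.0 * covered / (covered + missed);
        System.out.printf("Jacoco instruction coverage: %.2f%%%n", percent);
        // prints: Jacoco instruction coverage: 29.12%
    }
}
```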
From these experiments, it became quite obvious that there was no Artificial General Intelligence present in the generation of any of this unit test code. The lack of any professional-level comprehension of annotation-based dependency injection and mocking made it clear to me that there was nothing deeply intelligent behind the figurative curtain. Quite simply, an LLM encodes a large number of documents by tokenizing the input and capturing the context of that input, based on its structure, in the form of a transformer-based neural network with a large number of weights, known as a model.
When asked a question (such as a coding assignment), the model generates a response by predicting the most probable continuation or completion of the input. It considers the context provided by the input and generates a response that is coherent, relevant, and contextually appropriate, but not necessarily correct. In that way, you can think of LLMs as a kind of complex, contextual form of search (albeit way more sophisticated than the Inverse Document Frequency or PageRank weighted, stemmed, term-based search over skip lists that you find in web-based search engines or Lucene circa 2020).
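To make "predicting the most probable continuation" concrete, here is a deliberately tiny toy: a hypothetical "model" (just a lookup table keyed on the previous token) scores candidate next tokens, and a loop greedily appends the highest-scoring one. A real LLM conditions on the entire tokenized context with billions of weights and typically samples from a distribution rather than always taking the maximum; none of that is captured here.

```java
// A toy illustration of "predicting the most probable continuation."
// The MODEL map is a hypothetical stand-in for a transformer: it scores
// candidate next tokens keyed only on the previous token, whereas a real
// LLM conditions on the entire tokenized context.
import java.util.Map;

public class ToyNextTokenLoop {

    static final Map<String, Map<String, Double>> MODEL = Map.of(
            "assert", Map.of("NotNull", 0.6, "Equals", 0.4),
            "NotNull", Map.of("(", 0.9, ";", 0.1),
            "(", Map.of("result", 0.7, ")", 0.3));

    // Greedy decoding: always pick the highest-scoring candidate.
    static String nextToken(String lastToken) {
        return MODEL.getOrDefault(lastToken, Map.of("<eos>", 1.0))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .orElseThrow()
                .getKey();
    }

    public static void main(String[] args) {
        StringBuilder output = new StringBuilder("assert");
        String token = "assert";
        while (true) {
            token = nextToken(token);
            if (token.equals("<eos>")) {
                break;               // model predicts end of sequence
            }
            output.append(token);
        }
        System.out.println(output);  // prints: assertNotNull(result
    }
}
```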
What I did find in the LLMs included in this experiment was a really good code search capability that turned out to be useful to an experienced developer. What was a disappointment for me with ChatGPT, Ollama, and Gemini was that I have been conditioned to expect human intelligence on the other side of a chat window. However, I have no such conditioning with statement completion. CodeWhisperer didn't disappoint, not because the AI was better, but because the user experience did a better job of managing my expectations.
What’s Next?
There are also a few corporate concerns that may need to be addressed before the adoption of generative AI for unit test completion can exit the experimental phase.
I have already discussed prompt leaking. That should be a big issue for corporations because a lot of the corporation's code will need to be included in the prompting, and most corporations view their code as proprietary. If your prompting doesn't go back to a model instance that is shared with other corporations, then you don't have to worry about prompt leaking to other corporations. One option is to run the LLM locally (such as with Ollama), which requires that every developer work on a machine with a sufficiently powerful GPU. Another option is to subscribe to a single-tenant, non-shared version of ChatGPT or Gemini. Yet another option is to turn off prompt-driven fine-tuning of the model entirely. The third option is currently available, but the second is not.
The other concern is around cost. I suspect that generative AI pricing today is focused on increasing market share and does not yet cover all the costs. In order to transition from growth to profitability, those prices are going to have to increase. The NVIDIA A100 40GB I mentioned in the Code Llama section above costs around $10,000 today. There is also the question of power consumption. There is ongoing innovation in this area, but typically GPUs consume about three times as much power as CPUs. The single-tenant approach is more secure but also more expensive, because the vendor cannot benefit from economies of scale outside of building the foundation model. The improvement in productivity was only marginal, so cost is going to factor into the equation of long-term usage.
As I mentioned earlier, this space is moving fast. I did most of these experiments in early 2024. Since then, Amazon CodeWhisperer got its own enterprise upsell. There are early versions of both ChatGPT- and Gemini-fueled plugins for both IntelliJ and VS Code. Meta has released Llama 3 and, yes, it's already available in Ollama. Where will we be this time next year? In terms of product placement, strategic positioning, government regulation, employee disruption, and vendor winners and losers, who can say? Will there be AGI? Nope.