variation, selection, and optimization, EvoSuite generates JUnit test
cases and provides a report on the effectiveness of the produced
test suite based on metrics like code coverage and mutation score.
Pynguin [21] is another tool that utilizes SBST to generate unit tests for programs written in the Python programming language. Because Python assigns variable types dynamically at runtime, generating unit tests for it is difficult. Pynguin examines a Python module to gather details about the declared classes, functions, and methods. It then creates a test cluster containing all relevant information about the module under test and, during the generation process, chooses classes, methods, and functions from the test cluster to build the test cases. We use Pynguin as our baseline for comparing the suitability and effectiveness of LLMs in unit test generation.
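To make the test-cluster idea concrete, the following is a minimal sketch, using Python's standard inspect module, of how a generator might collect a module's callables; the dictionary layout is an illustrative assumption and does not mirror Pynguin's internal data structures:

    import inspect

    def build_test_cluster(module):
        # Schematic "test cluster": the pool of classes, functions, and
        # methods a generator could draw from when assembling test cases.
        cluster = {"classes": [], "functions": [], "methods": []}
        for name, obj in inspect.getmembers(module):
            if name.startswith("_"):
                continue  # skip private/internal members
            if inspect.isclass(obj):
                cluster["classes"].append(obj)
                cluster["methods"].extend(
                    fn for _, fn in inspect.getmembers(obj, inspect.isfunction))
            elif inspect.isfunction(obj):
                cluster["functions"].append(obj)
        return cluster

For a pure-Python module, build_test_cluster(my_module) would yield the raw material from which call sequences can be assembled.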
Randomized test generation techniques: Randoop [26] uses feedback-directed random testing to generate test cases. The basic idea behind this technique is to generate random sequences of method calls and inputs that exercise different paths through the program.
program. As the test runs, Randoop collects information about
the code coverage achieved by the test, as well as any exceptions
that are thrown. Based on this feedback, Randoop tries to generate
more test cases that are likely to increase code coverage or trigger
previously unexplored behaviour.
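To make the idea concrete, the following is a minimal, self-contained sketch of feedback-directed random test generation in Python; it is schematic rather than Randoop's actual implementation (Randoop targets Java), and the Counter class is an illustrative stand-in for a class under test:

    import random

    class Counter:
        # Illustrative class under test (an assumption for this sketch).
        def __init__(self):
            self.value = 0
        def increment(self):
            self.value += 1
        def decrement(self):
            if self.value == 0:
                raise ValueError("cannot go below zero")
            self.value -= 1

    METHODS = ["increment", "decrement"]

    def generate_sequences(budget=100, max_len=5):
        # Feedback-directed loop: extend sequences that ran without error,
        # keep error-raising sequences as separate exception-revealing tests.
        passing, failing = [], []
        for _ in range(budget):
            base = random.choice(passing) if passing else []
            seq = base + [random.choice(METHODS)]
            if len(seq) > max_len:
                continue
            obj = Counter()
            try:
                for name in seq:
                    getattr(obj, name)()
                passing.append(seq)   # feedback: a reusable, passing prefix
            except Exception:
                failing.append(seq)   # feedback: documents a thrown exception
        return passing, failing

    passing, failing = generate_sequences()
    print(len(passing), "passing sequences,", len(failing), "exception-revealing")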
AI-based techniques: TiCoder [18] presents an innovative Test-Driven User-Intent Formalisation (TDUIF) approach for generating code from natural language with minimal formal semantics. Their system, TiCoder, demonstrates improved code generation accuracy and the ability to create non-trivial functional unit tests aligned with user intent through minimal user queries. TOGA [11] introduces a neural method for Test Oracle Generation, using a transformer-based approach to infer exceptional and assertion test oracles based on the context of the focal method.
CODAMOSA [19] introduces an algorithm that enhances Search-Based Software Testing (SBST) by utilizing pre-trained large language models (LLMs) like OpenAI's Codex [25]. The approach combines test case generation with mutation to produce high-coverage test cases and asks Codex to provide sample test cases for under-covered functions. Our paper confirms some of the findings from CODAMOSA. For instance, our results also show that a combination of an LLM and Pynguin (SBST-based) can lead to better coverage. At the same time, CODAMOSA does not explore some of the research questions that we explore in this paper, such as (1) How correct are the assertions generated by LLMs? (2) Do the LLM-generated tests align with the intended functionality of the code? (3) How does the performance of the LLM improve over multiple iterations of prompting?
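As a rough illustration of this style of combination (a sketch of the general idea, not CODAMOSA's actual pipeline), the loop below falls back to an LLM whenever search-based generation stalls; sbst_step, coverage_of, under_covered, and llm_generate_tests are all hypothetical callables supplied by the caller:

    def hybrid_generation(sbst_step, coverage_of, under_covered,
                          llm_generate_tests, rounds=10, stall=0.01):
        # Schematic SBST + LLM loop: when search-based coverage growth
        # plateaus, ask the LLM for tests targeting under-covered functions.
        suite, last_cov = [], 0.0
        for _ in range(rounds):
            suite = sbst_step(suite)          # one search-based evolution step
            cov = coverage_of(suite)
            if cov - last_cov < stall:        # coverage has stalled
                for fn in under_covered(suite):
                    suite.extend(llm_generate_tests(fn))  # LLM-proposed tests
            last_cov = cov
        return suite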
5 Conclusion
In this study, we discovered that ChatGPT and Pynguin demonstrated nearly identical coverage for both small and large code samples, with no statistically significant differences in average coverage across all categories. When iteratively prompting ChatGPT to enhance coverage by providing the indices of statements missed in the previous iteration, improvements were notable for categories 2 and 3, reaching saturation at 4 iterations, while no improvement occurred for category 1.
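A minimal sketch of this feedback loop, using the real coverage.py API but with hypothetical ask_chatgpt and run_tests helpers standing in for the chat API call and test execution:

    import coverage

    def iterative_prompting(source_file, run_tests, ask_chatgpt, iterations=4):
        # Re-prompt the model with the line numbers its tests missed,
        # mirroring the iterative-improvement setup described above.
        prompt = f"Write pytest unit tests for the code in {source_file}."
        tests = ask_chatgpt(prompt)
        for _ in range(iterations):
            cov = coverage.Coverage()
            cov.start()
            run_tests(tests)                  # execute the generated suite
            cov.stop()
            # analysis2 returns (filename, statements, excluded, missing, formatted)
            _, _, _, missing, _ = cov.analysis2(source_file)
            if not missing:
                break                         # full statement coverage reached
            prompt = (f"Your tests missed lines {sorted(missing)} of "
                      f"{source_file}. Extend the tests to cover them.")
            tests = ask_chatgpt(prompt)
        return tests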
Notably, the statements missed by each tool showed minimal overlap, hinting at the potential for a combined approach to yield higher coverage. Lastly, our assessment of the correctness of ChatGPT-generated tests revealed a decreasing trend in the percentage of incorrect assertions from Category 1 to 3, which may suggest that assertions generated by ChatGPT are more effective in cases where code units are well defined.
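The implication can be seen with simple set arithmetic over missed-statement indices (the numbers below are illustrative, not our measured data): a combined suite misses only the statements missed by both tools, so a small overlap implies a large coverage gain.

    missed_by_chatgpt = {12, 18, 31, 44}   # hypothetical statement indices
    missed_by_pynguin = {18, 27, 52, 60}   # hypothetical statement indices

    # Running both suites together, only statements missed by BOTH stay uncovered.
    missed_by_both = missed_by_chatgpt & missed_by_pynguin
    print(missed_by_both)   # {18}: small overlap, so combined coverage is much higher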
ChatGPT operates with a focus on understanding and generating content in natural language rather than being explicitly tailored for programming languages. While ChatGPT may be capable of achieving high statement coverage in the generated unit tests, a high percentage of the assertions within those tests might be incorrect. This raises the concern that ChatGPT prioritizes coverage over the accuracy of the generated assertions, which is a potential limitation in using ChatGPT for generating unit tests; an approach based on the actual semantics of the code might be needed to generate correct assertions.

Future research endeavors could delve into several promising avenues based on the findings of this study. Firstly, exploring how ChatGPT refactors code from procedural scripts, and assessing whether the refactored code preserves the original functionality, could provide valuable insights into the model's code transformation capabilities. Additionally, investigating the scalability of ChatGPT and Pynguin to larger codebases and more complex projects may offer a broader understanding of their performance in real-world scenarios. Furthermore, a comprehensive exploration of the combined use of ChatGPT and Pynguin, considering their complementary strengths, could be undertaken to maximize test coverage and effectiveness. Lastly, examining the generalizability of our observations across diverse programming languages and application domains would contribute to a more comprehensive understanding of the applicability and limitations of these tools.
References
[1] [n. d.]. OpenAI Platform. https://platform.openai.com
[2] 2022. OpenAI's ChatGPT: Optimizing Language Models for Dialogue – cloudHQ. https://blog.cloudhq.net/openais-chatgpt-optimizing-language-models-for-dialogue/
[3] Toufique Ahmed and Premkumar Devanbu. 2023. Few-Shot Training LLMs for Project-Specific Code-Summarization. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (Rochester, MI, USA) (ASE '22). Association for Computing Machinery, New York, NY, USA, Article 177, 5 pages. https://doi.org/10.1145/3551349.3559555
[4] James H. Andrews, Tim Menzies, and Felix C.H. Li. 2011. Genetic Algorithms for Randomized Unit Testing. IEEE Transactions on Software Engineering 37, 1 (2011), 80–94. https://doi.org/10.1109/TSE.2010.46
[5] Luciano Baresi and Matteo Miraz. 2010. TestFul: Automatic Unit-Test Generation for Java Classes. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2 (Cape Town, South Africa) (ICSE '10). Association for Computing Machinery, New York, NY, USA, 281–284. https://doi.org/10.1145/1810295.1810353
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
[7] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep Reinforcement Learning from Human Preferences. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30.