[20]
Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, and Mark
Steedman. Sources of hallucination by large language models on inference tasks, 2023.
[21]
Rodrigo Pessoa Medeiros, Geber Lisboa Ramalho, and Taciana Pontual Falcão. A systematic
literature review on teaching and learning introductory programming in higher education. IEEE
Transactions on Education, 62(2):77–90, 2019. doi: 10.1109/TE.2018.2864133.
[22]
Martin Monperrus. The living review on automated program repair. Technical Report
hal-01956501, HAL Archives Ouvertes, 2018. URL
https://www.monperrus.net/martin/repair-living-review.pdf.
[23] OpenAI. GPT-4 Technical Report, 2023.
[24] OpenAI Codex. https://openai.com/blog/openai-codex.
[25] OpenAI Pricing. https://openai.com/pricing.
[26]
Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn,
Rupak Majumdar, Adish Singla, and Gustavo Soares. Generative AI for programming education:
Benchmarking chatgpt, gpt-4, and human tutors. CoRR, abs/2306.17156, 2023. doi:
10.48550/arXiv.2306.17156. URL https://doi.org/10.48550/arXiv.2306.17156.
[27]
Tung Phung, Victor-Alexandru Pădurean, Anjali Singh, Christopher Brooks, José Cambronero,
Sumit Gulwani, Adish Singla, and Gustavo Soares. Automating human tutor-style programming
feedback: Leveraging gpt-4 tutor model for hint generation and gpt-3.5 student model for hint
validation. arXiv preprint arXiv:2310.03780, 2023.
[28]
Julian Aron Prenner and Romain Robbes. Automatic program repair with openai’s codex:
Evaluating quixbugs. arXiv preprint arXiv:2111.03922, 2021.
[29]
Julian Aron Prenner, Hlib Babii, and Romain Robbes. Can openai’s codex fix bugs? an
evaluation on quixbugs. In Proceedings of the Third International Workshop on Automated
Program Repair, pages 69–75, 2022.
[30]
Coursera programming assignment grading.
https://www.coursera.support/s/article/209818753-Programming-assignments.
[31]
Anthony Robins. Learning edge momentum: A new account of outcomes in cs1. Computer
Science Education, 20(1):37–71, 2010.
[32]
Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models
for code. arXiv preprint arXiv:2308.12950, 2023.
[33] Esther Shein. The cs teacher shortage. Communications of the ACM, 62(10):17–18, 2019.
[34]
Chunqiu Steven Xia and Lingming Zhang. Conversational automated program repair. arXiv
preprint arXiv:2301.13246, 2023.
[35]
Michihiro Yasunaga and Percy Liang. Break-it-fix-it: Unsupervised learning for program repair.
In International Conference on Machine Learning, pages 11941–11952. PMLR, 2021.
[36]
Jooyong Yi, Umair Z Ahmed, Amey Karkare, Shin Hwei Tan, and Abhik Roychoudhury. A
feasibility study of using automated program repair for introductory programming assignments.
In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (FSE),
pages 740–751, 2017.
[37]
Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and
Gust Verbruggen. Repairing bugs in python assignments using large language models. arXiv
preprint arXiv:2209.14876, 2022.
[38]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang,
Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and
chatbot arena. arXiv preprint arXiv:2306.05685, 2023.