A team of computer scientists at the University of Massachusetts Amherst recently unveiled a new method for automatically generating whole proofs that can be used to prevent software bugs and to verify that the underlying code is correct.
The new method, dubbed Baldur, leverages the artificial-intelligence power of large language models (LLMs) and, when combined with the state-of-the-art tool Thor, achieves an efficacy of nearly 66%. The team was honored with a Distinguished Paper award at the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE).
“Sadly, we’ve grown accustomed to encountering bugs in our software, despite its ubiquity and daily use by everyone,” says Yuriy Brun, a professor in the Manning College of Information and Computer Sciences at UMass Amherst and the paper’s senior author. The impact of software bugs ranges from the mildly irritating, such as erratic formatting or unexpected shutdowns, to the potentially catastrophic, as when a security breach occurs or when precision software guiding a space mission or controlling a medical device fails.
Historically, there have been several ways to check that software is correct. One common approach is to have a person review the code line by line, by hand, to make sure it contains no mistakes. Another is to run the code and compare its behavior to the expected outcome: if, for instance, you press the “return” key in your text editor expecting a new line but get a question mark instead, you know there is a flaw in the code. Both strategies, however, are susceptible to human error, and checking for every conceivable failure is prohibitively labor-intensive, expensive, and unrealistic for all but the simplest systems.
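To make the second approach concrete, here is a minimal automated test in Python. The function and its behavior are hypothetical, invented purely for illustration; the point is that a test can only check the specific inputs someone thought to write down.

    import unittest

    def handle_return_key(text: str, cursor: int) -> str:
        """Hypothetical editor routine: insert a newline at the cursor."""
        return text[:cursor] + "\n" + text[cursor:]

    class TestReturnKey(unittest.TestCase):
        def test_inserts_newline(self):
            # One concrete input and its expected output. A bug triggered
            # by any input we never test goes undetected.
            self.assertEqual(handle_return_key("ab", 1), "a\nb")

    if __name__ == "__main__":
        unittest.main()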
A far more rigorous, but far more difficult, strategy is to write a mathematical proof demonstrating that the code does what it is expected to do, and then to use a theorem prover to verify that the proof itself is correct. This procedure is known as machine-checking. Writing these proofs by hand, however, is an arduous task that demands deep expertise. “These proofs can be several times the length of the software code itself,” explains Emily First, the paper’s lead author, who completed this research as part of her doctoral dissertation at UMass Amherst.
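For a sense of what a machine-checked proof looks like, here is a tiny example in Isabelle/HOL, the proof language used in this research (described below). The lemma is a standard textbook illustration, not one taken from the paper.

    (* Reversing a list twice gives back the original list. The proof
       proceeds by induction on the list; Isabelle's "auto" method then
       discharges both the base case and the inductive step. *)
    theorem rev_rev: "rev (rev xs) = xs"
      by (induct xs) auto

A theorem prover replays such a proof mechanically, so an accepted proof is guaranteed to be valid; proofs about real software are built from the same ingredients, only at much greater length.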
The emergence of LLMs, of which ChatGPT is a notable example, suggests a possible solution: automatically generating such proofs. But, as Brun notes, “a big issue with LLMs is their tendency toward inaccuracy; instead of failing outright, which would flag the problem, they may silently give a wrong answer that appears correct. And often, a silent failure is the worst kind.”
First’s research, conducted at Google, used Minerva, an LLM initially trained on a large corpus of natural-language text and then fine-tuned on 118GB of scientific papers and web pages containing mathematical expressions. The model was fine-tuned further on Isabelle/HOL, a language in which mathematical proofs are written. Baldur generates an entire proof at once and hands it to a theorem prover, which checks its correctness. If the checker finds an error, the faulty proof, together with information about the error, is fed back to the LLM, which can learn from its mistake and generate a new, and hopefully correct, proof. This feedback loop markedly improves accuracy: Thor, previously the best tool for automatically generating proofs, succeeded 57% of the time; combined, Baldur and Thor can generate proofs 65.7% of the time.
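In outline, Baldur’s generate-and-check loop looks something like the following Python sketch. This is not the authors’ code: generate_proof and check_proof are hypothetical stand-ins for the fine-tuned LLM and the Isabelle proof checker described above, and the retry budget is an assumption.

    from typing import Optional

    def generate_proof(theorem: str,
                       failed_proof: Optional[str] = None,
                       error: Optional[str] = None) -> str:
        """Ask the LLM for a whole proof. When a previous attempt and its
        error message are supplied, include them in the prompt so the
        model can repair its mistake. (Stub; details are assumptions.)"""
        raise NotImplementedError

    def check_proof(theorem: str, proof: str) -> Optional[str]:
        """Run the proof checker on a candidate proof. Return None if the
        proof is accepted, otherwise the checker's error message."""
        raise NotImplementedError

    def prove(theorem: str, max_attempts: int = 2) -> Optional[str]:
        proof, error = None, None
        for _ in range(max_attempts):
            # First attempt: generate from the theorem statement alone.
            # Later attempts: also supply the failed proof and the error.
            proof = generate_proof(theorem, failed_proof=proof, error=error)
            error = check_proof(theorem, proof)
            if error is None:
                return proof  # machine-checked: the proof is valid
        return None  # no checked proof found within the budget

The crucial design choice is that the checker, not the LLM, has the final word: a proof is only ever reported as correct after the proof checker has verified it, so the LLM’s silent failures cannot slip through.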