Are we close to having a fully automated software engineer?

Introduction

In the fast-paced world of software development, engineering leaders constantly seek innovative solutions to enhance productivity, reduce time-to-market, and ensure high-quality code. Language model (LM) agents in software engineering workflows promises the possibility to revolutionise how teams approach coding, testing, and maintenance tasks. However, the potential of these agents is often limited by their ability to effectively interact with complex development environments

To address this challenge researchers at Princeton published a paper discussing the possibility of a super smart SWE-agent, an advanced system that can maximise the output of LM agents in software engineering tasks using an agent computer interface or ACI, that can navigate code repositories, perform precise code edits, and execute rigorous testing protocols.

We will discuss key motivations and findings from this research that can help engineering leaders prepare for the future that GenAI might is promising to create for all of us which we should not afford to ignore

What is the need for this?

Traditional methods of coding, testing, and maintenance are time-consuming and prone to human error. LM agents have the capability to automate these tasks, but their effectiveness is limited by the challenges they face in interacting with development environments.

If LM agents can be made to be more effective at executing software engineering work, it can help engineering managers reduce the workload on human developers, accelerating development cycles, and improving overall software reliability

What was their Approach?

SWE-agent: a system that facilitates LM agents to autonomously use computers to solve software engineering tasks. SWE-agent’s custom agent-computer interface (ACI) significantly enhances an agent’s ability to create and edit code files, navigate entire repositories, and execute tests and other programs.

SWE-agent is an LM interacting with a computer through an agent-computer interface (ACI), which includes the commands the agent uses and the format of the feedback from the computer.

LM agents have been so far only used for code generation with moderation and feedback. Applying agents to more complex code tasks like software engineering remained unexplored

LM agents are typically designed to use existing applications, such as the Linux shell or Python interpreter. However, to perform more complex programming tasks such as software engineering, human engineers benefit from sophisticated applications like VSCode with powerful tools and extensions. Inspired by human-computer interaction.

LM agents represent a new category of end user, with their own needs and abilities. Specialised applications like IDEs (e.g., VSCode, PyCharm) make scientists and software engineers more efficient and effective at computer tasks. Similarly, ACI design aims to create a suitable interface that makes LM agents more effective at digital work such as software engineering

The researchers assumed a fixed LM and focused on designing the ACI to improve its performance. This meant shaping their actions, their documentation, and environment feedback to complement an LM’s limitations and abilities

Experimental Set-up

DataSets: We primarily evaluate on the SWE-bench dataset, which includes 2,294 task instances from 12 different repositories of popular Python packages. We report our main agent results on the full SWE-bench test set and ablations and analysis on the SWE-bench Lite test set. SWE-bench Lite is a canonical subset of 300 instances from SWE-bench that focus on evaluating self-contained functional bug fixes. We also test SWE-agent’s basic code editing abilities with HumanEvalFix, a short-form code debugging benchmark.

Models: All results, ablations, and analyses are based on two leading LMs, GPT-4 Turbo (gpt-4-1106-preview) and Claude 3 Opus (claude-3-opus-20240229). We experimented with a number of additional closed and open source models, including Llama 3 and DeepSeek Coder, but found their performance in the agent setting to be subpar. GPT-4 Turbo and Claude 3 Opus have 128k and 200k token context windows, respectively, which provides sufficient room for the LM to interact for several turns after being fed the system prompt, issue description, and optionally, a demonstration.

Baselines: We compare SWE-agent to two baselines. The first setting is the non-interactive, retrieval augmented generation (RAG) baselines. Here, a retrieval system retrieves the most relevant codebase files using the issue as the query; given these files, the model is asked to directly generate a patch file that resolves the issue. The second setting, called Shell-only, is adapted from the interactive coding framework introduced in Yang et al. Following the InterCode environment, this baseline system asks the LM to resolve the issue by interacting with a shell process on Linux. Like SWE-agent, model prediction is generated automatically based on the final state of the codebase after interaction.

Metrics. We report % Resolved or pass@1 as the main metric, which is the proportion of instances for which all tests pass successfully after the model generated patch is applied to the repository

Results

The result demonstrated that LM agent called SWE-agent that worked with custom agent-computer-interface or ACI was able to resolve 7 times more software tasks that pass the test bench compare to a RAG using the same underlying models i.e. GPT-4 Turbo and Claude 3 Opus and 64% better performance to Shell-only.

This research ably demonstrates the direction that agentic architecture is making (with the right supporting tools) in making a fully functional software engineer a distant but possible eventuality

Read the complete paper here and let us know if you believe if this is a step in the positive direction

Would you like an autonomous software engineer in your team?

Yes