How to Diagnose Multi-Agent System Failures with Automated Attribution: A Step-by-Step Guide
Introduction
Large Language Model (LLM) multi-agent systems are powerful tools for tackling complex tasks through collaboration. However, when these systems fail, developers often face a tedious debugging process: sifting through massive logs to pinpoint which agent caused the failure and at what step. Researchers from Penn State University, Duke University, and partners like Google DeepMind recently introduced a solution called Automated Failure Attribution, along with the Who&When benchmark dataset. This guide will walk you through applying this method to your own multi-agent projects, transforming a frustrating “needle-in-a-haystack” hunt into a streamlined diagnostic workflow.

What You Need
- Access to the Who&When dataset (hosted on Hugging Face) and the open-source code (available on GitHub). Both are linked in the original research paper.
- Basic familiarity with LLM multi-agent systems and their interaction logs (e.g., agent messages, task states).
- A Python environment with libraries such as PyTorch and Transformers.
- A sample failing multi-agent scenario (either from your own system or a simulated one) to test the attribution methods.
Step-by-Step Guide
Step 1: Understand the Automated Failure Attribution Problem
Before diving into code, grasp the core challenge: given a failed multi-agent task, you need to identify which agent was responsible at which point in the interaction chain. This is not about blame, but about root-cause localization. The research defines this as a new problem—Automated Failure Attribution—and provides the first benchmark to evaluate solutions. Read the paper (linked above) to understand the formal definition and existing manual debugging pitfalls.
Step 2: Set Up the Who&When Dataset
The Who&When dataset contains multiple multi-agent task scenarios with labeled failure points. To get started:
- Visit the Hugging Face dataset page for `Kevin355/Who_and_When`.
- Download the dataset using the `datasets` library: `from datasets import load_dataset; dataset = load_dataset('Kevin355/Who_and_When')`
- Familiarize yourself with the structure: each entry includes a log of agent interactions, the final outcome (success/failure), and ground-truth labels for the responsible agent and the step at which the error occurred.
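To get a feel for that structure before loading the real data, the sketch below tallies ground-truth labels over a few hand-written records. The field names (`history`, `is_correct`, `mistake_agent`, `mistake_step`) and the sample records are assumptions made for illustration; check the dataset card for the actual schema.

```python
from collections import Counter

# Hypothetical records mimicking the described structure: an interaction log,
# the final outcome, and ground-truth labels. Field names are assumptions.
records = [
    {
        "history": [
            {"step": 0, "agent": "planner", "content": "Split the task into two searches."},
            {"step": 1, "agent": "searcher", "content": "Returned results for the wrong year."},
            {"step": 2, "agent": "writer", "content": "Summarized the results."},
        ],
        "is_correct": False,
        "mistake_agent": "searcher",
        "mistake_step": 1,
    },
    {
        "history": [
            {"step": 0, "agent": "planner", "content": "Look up the figure, then report it."},
            {"step": 1, "agent": "searcher", "content": "Found the figure."},
        ],
        "is_correct": True,
        "mistake_agent": None,
        "mistake_step": None,
    },
    {
        "history": [
            {"step": 0, "agent": "planner", "content": "Skipped the verification step."},
            {"step": 1, "agent": "writer", "content": "Reported an unverified answer."},
        ],
        "is_correct": False,
        "mistake_agent": "planner",
        "mistake_step": 0,
    },
    {
        "history": [
            {"step": 0, "agent": "planner", "content": "Search, then summarize."},
            {"step": 1, "agent": "writer", "content": "Asked for sources."},
            {"step": 2, "agent": "searcher", "content": "Cited a page that does not exist."},
        ],
        "is_correct": False,
        "mistake_agent": "searcher",
        "mistake_step": 2,
    },
]


def summarize_failures(records):
    """Count failed tasks per responsible agent and collect the labeled steps."""
    failed = [r for r in records if not r["is_correct"]]
    by_agent = Counter(r["mistake_agent"] for r in failed)
    steps = [r["mistake_step"] for r in failed]
    return by_agent, steps


by_agent, steps = summarize_failures(records)
print(by_agent.most_common())  # agents ordered by number of attributed failures
print(steps)                   # labeled failure step for each failed task
```

A tally like this shows quickly whether failures cluster on one agent or early in the interaction, which is useful context when judging an attribution method's output later.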
Step 3: Choose an Automated Attribution Method
The paper evaluates several methods. This guide focuses on two: the simplest baseline, Rule-based Chain-of-Thought (CoT), and the most effective one, Attribution via Agent Tracing (AAT). Implementations are in the GitHub repository.
- Clone the repository: `git clone https://github.com/mingyin1/Agents_Failure_Attribution`
- Install dependencies: `pip install -r requirements.txt`
- Run the baseline method on a sample from Who&When to confirm setup: `python baseline_cot.py --dataset who_and_when`
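Whichever method you run, its predictions can be checked against the ground-truth labels at two levels: whether it named the right agent, and whether it named the right step. The following is a minimal scoring sketch, assuming predictions and labels are dicts with hypothetical `agent` and `step` keys. It is not the repository's evaluation code.

```python
def attribution_accuracy(predictions, labels):
    """Return agent-level and step-level accuracy over paired predictions/labels.

    Each item is a dict with hypothetical keys "agent" and "step".
    Step accuracy is counted independently of whether the agent was right.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return {"agent_accuracy": 0.0, "step_accuracy": 0.0}

    agent_hits = sum(p["agent"] == g["agent"] for p, g in zip(predictions, labels))
    step_hits = sum(p["step"] == g["step"] for p, g in zip(predictions, labels))
    n = len(labels)
    return {"agent_accuracy": agent_hits / n, "step_accuracy": step_hits / n}


# Illustrative data only: four failed tasks with made-up predictions.
labels = [
    {"agent": "searcher", "step": 1},
    {"agent": "planner", "step": 0},
    {"agent": "writer", "step": 4},
    {"agent": "searcher", "step": 2},
]
predictions = [
    {"agent": "searcher", "step": 1},  # agent and step both correct
    {"agent": "planner", "step": 2},   # agent correct, step wrong
    {"agent": "searcher", "step": 3},  # both wrong
    {"agent": "searcher", "step": 5},  # agent correct, step wrong
]

scores = attribution_accuracy(predictions, labels)
print(scores)  # {'agent_accuracy': 0.75, 'step_accuracy': 0.25}
```

Reporting the two numbers separately matters because pinpointing the exact step is a stricter test than naming the agent, so a method can look strong on one and weak on the other.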
Step 4: Apply Attribution to Your Own System’s Logs
To diagnose failures in your own multi-agent system, you’ll need to format your logs to match the dataset’s structure. The code expects a JSON or CSV file with fields for agent names, timestamps, and messages. Follow these sub-steps:

- Extract interaction logs from your system. Each message should include sender, recipient, content, and a sequential step number.
- Add a label column if you already know the failure cause (for testing). Otherwise, leave it blank.
- Run the AAT model on your logs: `python aat_model.py --input my_logs.json --output attributions.json`
- For each log, the output lists the predicted responsible agent and the step where the error likely occurred.
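The log-formatting sub-steps above can be sketched as a small converter. The `Message` class, the `to_attribution_log` helper, and the top-level `messages`/`label` layout are all hypothetical; the exact field names the repository's scripts expect may differ, so treat this as a starting point rather than the required format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Message:
    """One interaction-log entry with the four fields named in this step."""
    step: int
    sender: str
    recipient: str
    content: str


def to_attribution_log(raw_events, label=None):
    """Convert chronological (sender, recipient, content) tuples into a log dict.

    Step numbers are assigned sequentially from 0. `label` stays None when the
    failure cause is unknown, or holds a known cause for testing.
    """
    messages = [
        Message(step=i, sender=s, recipient=r, content=c)
        for i, (s, r, c) in enumerate(raw_events)
    ]
    return {"messages": [asdict(m) for m in messages], "label": label}


# Illustrative events from an imaginary three-agent system.
raw_events = [
    ("planner", "searcher", "Find the 2023 revenue figure."),
    ("searcher", "writer", "Revenue was 4.1M (source: 2022 report)."),
    ("writer", "user", "The 2023 revenue was 4.1M."),
]

log = to_attribution_log(raw_events)
payload = json.dumps([log], indent=2)  # a list, so several logs can share one file
print(payload)
```

Writing `payload` to `my_logs.json` would produce the input file used in the command above, assuming the script accepts this layout.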
Step 5: Interpret Results and Iterate
With attributions in hand, you can now efficiently debug. For each failure:
- Review the predicted agent’s actions around the identified step.
- Check for communication errors, misunderstood instructions, or knowledge gaps.
- Apply fixes—e.g., adjust prompt, improve agent memory, or add validation checks.
- Re-run the system to verify the fix.
The benchmark shows that automated attribution accelerates debugging by up to 3× compared to manual log archaeology.
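Reviewing "the predicted agent's actions around the identified step" is easier with a helper that pulls out just the neighboring messages. The sketch below assumes each message is a dict with hypothetical `step`, `sender`, and `content` keys; the function names and the default radius are illustrative choices.

```python
def context_window(messages, step, radius=2):
    """Return the messages whose step lies within `radius` of the given step."""
    lo = max(0, step - radius)
    hi = step + radius
    return [m for m in messages if lo <= m["step"] <= hi]


def format_window(messages, step, radius=2):
    """Render the window as text, marking the attributed step with '>>'."""
    lines = []
    for m in context_window(messages, step, radius):
        marker = ">>" if m["step"] == step else "  "
        lines.append(f'{marker} [{m["step"]}] {m["sender"]}: {m["content"]}')
    return "\n".join(lines)


# Illustrative log: six messages, with the attribution pointing at step 3.
messages = [
    {"step": 0, "sender": "planner", "content": "Plan: search, verify, write."},
    {"step": 1, "sender": "searcher", "content": "Found two candidate sources."},
    {"step": 2, "sender": "verifier", "content": "Source A looks outdated."},
    {"step": 3, "sender": "writer", "content": "Using source A for the answer."},
    {"step": 4, "sender": "verifier", "content": "No further checks requested."},
    {"step": 5, "sender": "writer", "content": "Final answer delivered."},
]

print(format_window(messages, step=3, radius=1))
```

Here the window shows the verifier's warning at step 2 directly above the writer's decision at step 3, which is the kind of misread instruction the checklist above asks you to look for.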
Tips for Success
- Start with provided examples before using your own logs. The Who&When dataset includes diverse failure types (e.g., planning, retrieval, reasoning errors).
- Combine methods: Use rule-based CoT for a quick first pass, then AAT for deeper analysis.
- Log verbosely: The more structured your logs, the better the attribution accuracy. Include task context and final outputs.
- Contribute back: The dataset is open for expansion—if you encounter a unique failure, consider adding it to Who&When.
- Stay updated: The research was accepted as a Spotlight at ICML 2025; watch for future improvements in attribution models.
By following this guide, you’ll turn the daunting task of debugging multi-agent failures into a systematic, automated process. Happy diagnosing!