LLM Response Evaluation with Spring AI: Building LLM-as-a-Judge with Recursive Advisors

Engineering | Christian Tzolov | November 10, 2025 | ...

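Evaluating Large Language Model (LLM) outputs is a critical challenge for notoriously non-deterministic AI applications, especially once they move into production.

Traditional metrics such as ROUGE and BLEU fall short when assessing the nuanced, context-dependent responses that modern LLMs produce. Human evaluation, while accurate, is expensive, slow, and does not scale.

Enter LLM-as-a-Judge: a powerful technique that uses LLMs themselves to evaluate the quality of AI-generated content. Research shows that sophisticated judge models can agree with human judgment up to 85% of the time, which is actually higher than human-to-human agreement (81%).

In this article, we explore how Spring AI's Recursive Advisors provide an elegant framework for implementing the LLM-as-a-Judge pattern, enabling you to build self-improving AI systems with automated quality control. To learn more about the Recursive Advisors API, see our previous article: Create Self-Improving AI Agents Using Spring AI Recursive Advisors.

💡 Demo: Find the complete example implementation in evaluation-recursive-advisor-demo.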

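Understanding LLM-as-a-Judge

LLM-as-a-Judge is an evaluation method in which a Large Language Model (LLM) assesses the quality of content generated by other models or by itself. Rather than relying solely on human evaluators or traditional automated metrics, LLM-as-a-Judge uses an LLM to score, classify, or compare responses against predefined criteria.

Why does it work? Evaluation is fundamentally easier than generation. When you use an LLM as a judge, you ask it to perform a simpler, more focused task (assessing specific properties of existing text) rather than the complex task of creating original content while balancing multiple constraints. A good analogy: it is easier to critique than to create, and easier to spot problems than to prevent them.

There are two main LLM-as-a-Judge evaluation patterns:

Direct assessment (point-wise scoring): the judge evaluates an individual response and provides feedback that can be used to refine the prompt through self-refinement
Pairwise comparison: the judge selects the better of two candidate responses (common in A/B testing)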
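
The two patterns can be pictured as two small interfaces. The following is only a sketch in plain Java: the names `Verdict`, `PointwiseJudge`, `PairwiseJudge`, and `JudgeModes` are hypothetical and not part of Spring AI, and the keyword-based judge is a stand-in for a real call to a judge model.

```java
/** Hypothetical result of a point-wise evaluation: an integer score plus feedback. */
record Verdict(int rating, String feedback) {}

/** Direct assessment (point-wise): score a single answer to a question. */
interface PointwiseJudge {
    Verdict evaluate(String question, String answer);
}

/** Pairwise comparison: pick the better of two candidate answers. */
interface PairwiseJudge {
    /** Returns 0 if answerA is preferred, 1 if answerB is preferred. */
    int prefer(String question, String answerA, String answerB);
}

final class JudgeModes {

    /** Derives a pairwise judge from a point-wise one by comparing scores (ties go to A). */
    static PairwiseJudge fromPointwise(PointwiseJudge judge) {
        return (q, a, b) -> judge.evaluate(q, a).rating() >= judge.evaluate(q, b).rating() ? 0 : 1;
    }

    /** Stand-in judge for illustration only: a real judge would call an LLM here. */
    static PointwiseJudge keywordJudge(String requiredKeyword) {
        return (q, answer) -> answer.contains(requiredKeyword)
                ? new Verdict(4, "Addresses the question.")
                : new Verdict(1, "Mention " + requiredKeyword + ".");
    }
}
```

The point-wise form returns feedback that can be fed into a retry, which is why the rest of this article builds on it; the pairwise form only yields a preference.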

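LLM judges evaluate quality dimensions such as relevance, factual accuracy, faithfulness to sources, instruction adherence, and overall coherence and clarity, across domains such as healthcare, finance, RAG systems, and dialogue.

Choosing the Right Judge Model

While general-purpose models such as GPT-4 and Claude can serve as effective judges, dedicated LLM-as-a-Judge models consistently outperform them on evaluation tasks. The Judge Arena Leaderboard tracks how various models perform on judging tasks.

Spring AI: The Perfect Foundation

Spring AI's ChatClient provides a fluent API that is well suited to implementing the LLM-as-a-Judge pattern. Its Advisors system lets you intercept, modify, and enhance AI interactions in a modular, reusable way.

The recently introduced Recursive Advisors extend this further by enabling looping patterns, which are a natural fit for self-refining evaluation workflows: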

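// Illustrative skeleton: evaluationPasses(...) and addEvaluationFeedback(...) are placeholders,
// and the getName()/getOrder() methods required by the Advisor contract are omitted.
public class MyRecursiveAdvisor implements CallAdvisor {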
    
    @Override
    public ChatClientResponse adviseCall(ChatClientRequest request, CallAdvisorChain chain) {
        
        // Call the chain initially
        ChatClientResponse response = chain.nextCall(request);
        
        // Check if we need to retry based on evaluation
        // (a real advisor should also cap the number of retries)
        while (!evaluationPasses(response)) {

            // Modify the request based on evaluation feedback
            ChatClientRequest modifiedRequest = addEvaluationFeedback(request, response);
            
            // Create a sub-chain and recurse
            response = chain.copy(this).nextCall(modifiedRequest);
        }
        
        return response;
    }
}

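We will implement a SelfRefineEvaluationAdvisor that embodies the LLM-as-a-Judge pattern using Spring AI's Recursive Advisors. The advisor automatically evaluates AI responses and retries failed attempts with feedback-driven improvement: generate a response → evaluate its quality → retry with feedback if needed → repeat until the quality threshold is met or the retry limit is reached.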
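
Stripped of the Spring AI types, that flow is a bounded loop. The sketch below is plain Java with hypothetical names (`JudgeResult`, `SelfRefineLoop`, and the `generator` and `judge` functions stand in for the chat model and the judge model); it is meant to show the control flow only, not the advisor itself.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

/** Hypothetical judge result: integer rating plus feedback for the next attempt. */
record JudgeResult(int rating, String feedback) {}

final class SelfRefineLoop {

    /**
     * Generates an answer, evaluates it, and retries with feedback until the rating
     * reaches successRating or maxRepeatAttempts retries have been used.
     */
    static String run(String question,
                      Function<String, String> generator,            // prompt -> answer
                      BiFunction<String, String, JudgeResult> judge, // (question, answer) -> result
                      int successRating,
                      int maxRepeatAttempts) {
        if (maxRepeatAttempts < 0) {
            throw new IllegalArgumentException("maxRepeatAttempts must be >= 0");
        }
        String prompt = question;
        String answer = null;
        // One initial attempt plus at most maxRepeatAttempts retries.
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {
            answer = generator.apply(prompt);
            JudgeResult result = judge.apply(question, answer);
            if (result.rating() >= successRating) {
                return answer; // quality threshold reached
            }
            // Rebuild the prompt from the original question plus the latest feedback.
            prompt = question + "\nPrevious response evaluation failed with feedback: "
                    + result.feedback();
        }
        return answer; // retry limit reached: return the last answer anyway
    }
}
```

Note that each retry starts again from the original question, so feedback from earlier rounds does not pile up in the prompt.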

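Let's look at the implementation, which demonstrates these evaluation patterns in more depth.

The SelfRefineEvaluationAdvisor Implementation

This implementation demonstrates the direct assessment pattern, in which a judge model evaluates individual responses using a point-wise scoring system (a 1-4 scale). It combines this with a self-refinement strategy that automatically retries failed evaluations by incorporating specific feedback into subsequent attempts, creating an iterative improvement loop.

The advisor embodies two key LLM-as-a-Judge concepts:

Point-wise evaluation: each response receives an individual quality score based on predefined criteria
Self-refinement: failed responses trigger retry attempts that carry constructive feedback to guide improvement

(Based on the article: Using LLM-as-a-judge 🧑‍⚖️ for an automated and versatile evaluation)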

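// Abridged listing: the fields (chatClient, evaluationPromptTemplate, maxRepeatAttempts,
// successRating, logger), the builder, getName()/getOrder(), and the getPromptQuestion(...)
// and getAssistantAnswer(...) helpers referenced below are not shown here.
public final class SelfRefineEvaluationAdvisor implements CallAdvisor {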

    private static final PromptTemplate DEFAULT_EVALUATION_PROMPT_TEMPLATE = new PromptTemplate(
        """
        You will be given a user_question and assistant_answer couple.
        Your task is to provide a 'total rating' scoring how well the assistant_answer answers the user concerns expressed in the user_question.
        Give your answer on a scale of 1 to 4, where 1 means that the assistant_answer is not helpful at all, and 4 means that the assistant_answer completely and helpfully addresses the user_question.

        Here is the scale you should use to build your answer:
        1: The assistant_answer is terrible: completely irrelevant to the question asked, or very partial
        2: The assistant_answer is mostly not helpful: misses some key aspects of the question
        3: The assistant_answer is mostly helpful: provides support, but still could be improved
        4: The assistant_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

        Provide your feedback as follows:

        \\{
            "rating": 0,
            "evaluation": "Explanation of the evaluation result and how to improve if needed.",
            "feedback": "Constructive and specific feedback on the assistant_answer."
        \\}

        Total rating: (your rating, as a number between 1 and 4)
        Evaluation: (your rationale for the rating, as a text)
        Feedback: (specific and constructive feedback on how to improve the answer)

        You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

        Now here are the question and answer.

        Question: {question}
        Answer: {answer}

        Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.

        Evaluation:
        """);

    @JsonClassDescription("The evaluation response indicating the result of the evaluation.")
    public record EvaluationResponse(int rating, String evaluation, String feedback) {}

    @Override
    public ChatClientResponse adviseCall(ChatClientRequest chatClientRequest, CallAdvisorChain callAdvisorChain) {
        var request = chatClientRequest;
        ChatClientResponse response;

        // One initial attempt plus up to maxRepeatAttempts retries
        for (int attempt = 1; attempt <= maxRepeatAttempts + 1; attempt++) {

            // Make the inner call to the rest of the chain (i.e., the generation model)
            response = callAdvisorChain.copy(this).nextCall(request);

            // Perform evaluation
            EvaluationResponse evaluation = this.evaluate(chatClientRequest, response);

            // If evaluation passes, return the response
            if (evaluation.rating() >= this.successRating) {
                logger.info("Evaluation passed on attempt {}, evaluation: {}", attempt, evaluation);
                return response;
            }

            // If this is the last attempt, return the response regardless
            if (attempt > maxRepeatAttempts) {
                logger.warn(
                    "Maximum attempts ({}) reached. Returning last response despite failed evaluation. Use the following feedback to improve: {}",
                    maxRepeatAttempts, evaluation.feedback());
                return response;
            }

            // Retry with evaluation feedback
            logger.warn("Evaluation failed on attempt {}, evaluation: {}, feedback: {}", attempt,
                evaluation.evaluation(), evaluation.feedback());

            request = this.addEvaluationFeedback(chatClientRequest, evaluation);
        }

        // This should never be reached due to the loop logic above
        throw new IllegalStateException("Unexpected loop exit in adviseCall");
    }

    /**
     * Performs the evaluation using the LLM-as-a-Judge and returns the result.
     */
    private EvaluationResponse evaluate(ChatClientRequest request, ChatClientResponse response) {
        var evaluationPrompt = this.evaluationPromptTemplate.render(
            Map.of("question", this.getPromptQuestion(request), "answer", this.getAssistantAnswer(response)));

        // Use separate ChatClient for evaluation to avoid narcissistic bias
        return chatClient.prompt(evaluationPrompt).call().entity(EvaluationResponse.class);
    }

    /**
     * Creates a new request with evaluation feedback for retry.
     */
    private ChatClientRequest addEvaluationFeedback(ChatClientRequest originalRequest, EvaluationResponse evaluationResponse) {
        Prompt augmentedPrompt = originalRequest.prompt()
            .augmentUserMessage(userMessage -> userMessage.mutate().text(String.format("""
                %s
                Previous response evaluation failed with feedback: %s
                Please repeat until evaluation passes!
                """, userMessage.getText(), evaluationResponse.feedback())).build());

        return originalRequest.mutate().prompt(augmentedPrompt).build();
    }
}

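Key Implementation Features

Recursive pattern implementation: The advisor uses callAdvisorChain.copy(this).nextCall(request) to create a sub-chain for each recursive call, enabling multiple evaluation rounds while preserving the correct advisor ordering.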

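Structured evaluation output: Using Spring AI's structured output capabilities, the evaluation result is parsed into an EvaluationResponse record containing a rating (1-4), the evaluation rationale, and specific feedback for improvement.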
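
One thing worth guarding against: the JSON example in the prompt template uses "rating": 0 as a placeholder, so a judge that echoes the template back would yield a rating outside the 1-4 scale. A minimal sketch of a stricter record follows; `ValidatedEvaluation` and its `passes` method are hypothetical and not part of the demo.

```java
/** Hypothetical variant of the EvaluationResponse record that rejects out-of-scale ratings. */
record ValidatedEvaluation(int rating, String evaluation, String feedback) {

    static final int MIN_RATING = 1;
    static final int MAX_RATING = 4;

    // Compact constructor: runs whenever the record is constructed.
    ValidatedEvaluation {
        if (rating < MIN_RATING || rating > MAX_RATING) {
            throw new IllegalArgumentException(
                    "rating must be between " + MIN_RATING + " and " + MAX_RATING + " but was " + rating);
        }
        // Normalize missing text fields so callers never have to null-check them.
        evaluation = evaluation == null ? "" : evaluation;
        feedback = feedback == null ? "" : feedback;
    }

    /** True when the rating meets the configured success threshold. */
    boolean passes(int successRating) {
        return rating >= successRating;
    }
}
```

Whether an out-of-scale rating should throw or simply count as a failed evaluation is a design choice; throwing makes a misbehaving judge visible instead of silently burning retries.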

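Separate evaluation model: The advisor uses a dedicated LLM-as-a-Judge model (avcodes/flowaicom-flow-judge:q4) through a different ChatClient instance to mitigate model bias. Set spring.ai.chat.client.enabled=false to enable the use of multiple chat models.

Feedback-driven improvement: Failed evaluations carry specific feedback that is folded into the retry attempt, enabling the system to learn from evaluation failures.

Configurable retry logic: The advisor supports a configurable maximum number of attempts and degrades gracefully when the evaluation limit is reached.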

Putting It All Together

Here is how to integrate the SelfRefineEvaluationAdvisor into a complete Spring AI application:

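import java.util.Random;

import org.springframework.ai.anthropic.AnthropicChatModel;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.ollama.OllamaChatModel;
import org.springframework.ai.tool.annotation.Tool;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication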
public class EvaluationAdvisorDemoApplication {

    @Bean
    CommandLineRunner commandLineRunner(AnthropicChatModel anthropicChatModel, OllamaChatModel ollamaChatModel) {
        return args -> {
            
            ChatClient chatClient = ChatClient.builder(anthropicChatModel) // @formatter:off
                    .defaultTools(new MyTools())
                    .defaultAdvisors(
                        
                        SelfRefineEvaluationAdvisor.builder()
                            .chatClientBuilder(ChatClient.builder(ollamaChatModel)) // Separate model for evaluation
                            .maxRepeatAttempts(15)
                            .successRating(4)
                            .order(0)
                            .build(),
                        
                        new MyLoggingAdvisor(2))
                .build(); 
                
            var answer = chatClient
                .prompt("What is current weather in Paris?")
                .call()
                .content();

            System.out.println(answer);
        };
    }

    static class MyTools {
        final int[] temperatures = {-125, 15, -255};
        private final Random random = new Random();
        
        @Tool(description = "Get the current weather for a given location")
        public String weather(String location) {
            int temperature = temperatures[random.nextInt(temperatures.length)];
            System.out.println(">>> Tool Call responseTemp: " + temperature);
            return "The current weather in " + location + " is sunny with a temperature of " + temperature + "°C.";
        }
    }
}

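This configuration uses Anthropic Claude for generation and Ollama for evaluation (avoiding bias), requires a rating of 4, and allows up to 15 retries. It includes a weather tool that returns randomized responses in order to trigger evaluations: the tool produces an invalid value in 2 out of 3 cases.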
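
As a rough sanity check on those numbers, the sketch below estimates how often the retry budget would be exhausted. It rests on simplifying assumptions that the article does not state: each attempt makes exactly one independent tool call, and the judge rejects every invalid temperature and accepts the valid one. `RetryOdds` is a hypothetical helper.

```java
final class RetryOdds {

    /** Probability that every one of totalAttempts independent attempts fails. */
    static double allFail(double failureProbability, int totalAttempts) {
        return Math.pow(failureProbability, totalAttempts);
    }

    /** Expected number of attempts until the first success (geometric distribution). */
    static double expectedAttempts(double failureProbability) {
        return 1.0 / (1.0 - failureProbability);
    }
}
```

With a failure probability of 2/3 and 16 total attempts (1 initial attempt plus 15 retries), `RetryOdds.allFail(2.0 / 3.0, 16)` is about 0.0015, and `RetryOdds.expectedAttempts(2.0 / 3.0)` is about 3. Under these assumptions the run nearly always ends with a valid answer, typically after a few attempts.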

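The SelfRefineEvaluationAdvisor (order 0) evaluates response quality and retries with feedback when needed. It is followed by MyLoggingAdvisor (order 2), which logs the final request and response for observability.

When run, you will see output like this: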

REQUEST: [{"role":"user","content":"What is current weather in Paris?"}]

>>> Tool Call responseTemp: -255
Evaluation failed on attempt 1, evaluation: The response contains unrealistic temperature data, feedback: The temperature of -255°C is physically impossible and indicates a data error.
 
>>> Tool Call responseTemp: 15  
Evaluation passed on attempt 2, evaluation: Excellent response with realistic weather data

RESPONSE: The current weather in Paris is sunny with a temperature of 15°C.

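🚀 Try it yourself: A complete runnable demo with configuration examples, including different model combinations and evaluation scenarios, is available in the evaluation-recursive-advisor-demo project.

Conclusion

Spring AI's Recursive Advisors make implementing the LLM-as-a-Judge pattern both elegant and production-ready. The SelfRefineEvaluationAdvisor demonstrates how to build self-improving AI systems that automatically assess response quality, retry with feedback, and scale evaluation without human intervention.

Key benefits include automated quality control, bias mitigation through a separate judge model, and seamless integration with existing Spring AI applications. This approach provides a foundation for reliable, scalable quality assurance across chatbots, content generation, and complex AI workflows.

Key success factors when implementing LLM-as-a-Judge techniques include:

Use dedicated judge models for better performance (see the Judge Arena Leaderboard)
Mitigate bias by using separate generation and evaluation models
Ensure deterministic results (temperature = 0)
Design prompts with integer scales and few-shot examples
Keep human oversight for high-stakes decisions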

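⚠️ Important Notes

Recursive Advisors are a new, experimental feature in Spring AI 1.1.0-M4+. Currently they support non-streaming calls only, require careful advisor ordering, and can increase costs because of the multiple LLM calls.

Be especially careful with inner advisors that maintain external state: they may need extra attention to stay correct across iterations.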
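
To see why state needs care, consider a plain-Java stand-in for the situation. `RecordingStep` plays the role of an inner advisor that keeps external state, and `callRepeatedly` plays the role of a recursive advisor sending the same request down the chain several times; both names are hypothetical and no Spring AI types are involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class StatefulInnerStepDemo {

    /** Stand-in for an inner advisor that keeps external state: it records every request. */
    static final class RecordingStep implements UnaryOperator<String> {
        final List<String> history = new ArrayList<>();

        @Override
        public String apply(String request) {
            history.add(request); // state grows on every pass through the chain
            return "answer to: " + request;
        }
    }

    /** Stand-in for a recursive advisor that sends the request down the chain `passes` times. */
    static String callRepeatedly(UnaryOperator<String> inner, String request, int passes) {
        String response = null;
        for (int i = 0; i < passes; i++) {
            response = inner.apply(request);
        }
        return response;
    }
}
```

After three passes for a single user request, `history` holds three entries. An inner advisor that stores conversation history in this way would need to deduplicate or reset between iterations to avoid recording the same turn repeatedly.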

Always set termination conditions and retry limits to prevent infinite loops.

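Resources

Spring AI Documentation

LLM-as-a-Judge Research

Judge Arena Leaderboard - Current rankings of the best-performing judge models
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - The foundational paper that introduced the LLM-as-a-Judge paradigm
Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement - Introduces a two-step benchmark that evaluates 54 LLMs as judges by testing their correlation with human judgment and their agreement patterns, finding that 27 models reach top-tier performance through human-like or super-consistent judging behavior regardless of size
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge (2024) - A survey covering the full LLM-as-a-Judge landscape, with a systematic taxonomy and current challenges
LLM-as-a-Judge Resource Hub - A central repository of paper lists, tools, and ongoing research
Preference Leakage: A Contamination Problem in LLM-as-a-judge - Recent research on bias in judge models
Who's Your Judge? On the Detectability of LLM-Generated Judgments - Emerging research on judgment detection and transparency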
