feat: support separate LLM models for agent, answer, and judge in ben… by cudaMancpy · Pull Request #1376 · MemMachine/MemMachine

cudaMancpy · 2026-04-28T06:02:57Z

Introduce support for three separate language models in the retrieval-agent benchmark framework:

Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
Answer model (retrieval_agent.answer_llm_model): Used for answer generation
Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each model falls back to llm_model if not explicitly set.

Purpose of the change

The retrieval-agent benchmark framework previously used a single language model (retrieval_agent.llm_model) for all operations: agent planning, answer generation, and LLM judge evaluation. This limited flexibility when users wanted to use different models for different tasks (e.g., a smaller/faster model for agent planning, a more capable model for answer generation, and a separate model for evaluation).

Description

Introduce support for three separate language models in the retrieval-agent benchmark framework:

Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
Answer model (retrieval_agent.answer_llm_model): Used for answer generation
Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each optional model (answer_llm_model, judge_llm_model) falls back to llm_model if not explicitly set, ensuring backward compatibility.
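
The fallback rule described above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: the `RetrievalAgentModels` dataclass and the `resolve_model_id` helper are hypothetical names, and only the three field names come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalAgentModels:
    """Hypothetical stand-in for the retrieval_agent section of configuration.yml."""

    llm_model: Optional[str] = None
    answer_llm_model: Optional[str] = None
    judge_llm_model: Optional[str] = None


def resolve_model_id(conf: RetrievalAgentModels, role: str) -> str:
    """Return the model ID for a role, falling back to llm_model when unset."""
    overrides = {
        "agent": None,  # the agent always uses llm_model
        "answer": conf.answer_llm_model,
        "judge": conf.judge_llm_model,
    }
    if role not in overrides:
        raise KeyError(f"unknown role: {role}")
    model_id = overrides[role] or conf.llm_model
    if model_id is None:
        raise ValueError(f"no model configured for role '{role}'")
    return model_id


# A configuration that sets only llm_model behaves as before the change.
legacy = RetrievalAgentModels(llm_model="agent_model")
resolved = {role: resolve_model_id(legacy, role) for role in ("agent", "answer", "judge")}
```

With only `llm_model` set, all three roles resolve to the same ID, which is the backward-compatible case the description refers to.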

Changes include:

Update init_memmachine_params() in agent_utils.py to return two models and their IDs
Update all benchmark callers (wikimultihop, locomo, hotpotqa, longmemeval) to handle the new return signature
Update llm_judge.py to resolve judge_llm_model with fallback to llm_model
Update evaluate.py to record judge_model_id in evaluation results
Update README.md with three-model configuration documentation
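
For callers, the first two items above mean unpacking a longer tuple. The sketch below uses a stub in place of init_memmachine_params(); the five-element shape follows the signature shown in the review diff, while the meaning and order of the two trailing strings (agent model ID, then answer model ID) are an assumption based on the review summary.

```python
import asyncio


async def init_params_stub() -> tuple[object, object, object, str, str]:
    """Hypothetical stand-in for init_memmachine_params(); returns placeholders."""
    memory, model, agent = object(), object(), object()
    # Assumed order of the trailing strings: agent model ID, answer model ID.
    return memory, model, agent, "agent_model", "answer_model"


async def run_benchmark() -> dict[str, str]:
    # Callers that previously unpacked three values now unpack five.
    _memory, _model, _agent, agent_model_id, answer_model_id = await init_params_stub()
    # Record which models were used alongside the benchmark output.
    return {"agent_model_id": agent_model_id, "answer_model_id": answer_model_id}


attributes = asyncio.run(run_benchmark())
```

A caller that still unpacks three values would raise a ValueError at runtime, which is why every benchmark script needed the update.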

The changes are limited to the evaluation/retrieval_agent/ and evaluation/utils/ directories. No changes to core MemMachine server functionality. The configuration.yml schema extends the existing retrieval_agent section with two optional fields.

Fixes/Closes

Fixes #1264

Type of change

New feature (non-breaking change which adds functionality)
Documentation update

How Has This Been Tested?

Unit Test
Test Script (please provide)

Test Results:

./run_test.sh wikimultihop multi_model search retrieval_agent 100 --search-concurrency 1 --judge-concurrency 2

(.venv) dkjeong@Supermicro-EMR:~/mm/evaluation/retrieval_agent$ cd result/
(.venv) dkjeong@Supermicro-EMR:~/mm/evaluation/retrieval_agent/result$ ll
total 1744
drwxrwxr-x 3 dkjeong dkjeong   4096 Apr 28 14:58 ./
drwxrwxr-x 4 dkjeong dkjeong   4096 Apr 28 14:34 ../
drwxrwxr-x 2 dkjeong dkjeong   4096 Apr 28 14:58 final_score/
-rw-rw-r-- 1 dkjeong dkjeong 887178 Apr 28 14:58 wikimultihop_retrieval_agent_evaluation_metrics_multi_model.json
-rw-rw-r-- 1 dkjeong dkjeong 883476 Apr 28 14:58 wikimultihop_retrieval_agent_output_multi_model.json

wikimultihop_retrieval_agent_output_multi_model.json — Contains the generated answers along with answer_model_id indicating which answer model was used.
wikimultihop_retrieval_agent_evaluation_metrics_multi_model.json — Contains the evaluation metrics along with judge_model_id indicating which judge model was used.

Checklist

Maintainer Checklist

Confirmed all checks passed
Contributor has signed the commit(s)
Reviewed the code
Run, Tested, and Verified the change(s) work as expected

honggyukim · 2026-04-28T06:07:37Z

Hi @sscargal, @cudaMancpy is also our team member. Please have a look. Thanks!

cudaMancpy · 2026-04-28T08:24:13Z

There was a typo in the branch name, so I renamed the branch, which closed the pull request. I renamed it back to its previous name and reopened the pull request. Sorry for cluttering the conversation!

malatewang · 2026-05-02T00:49:33Z

Can you fix the lint problem?

honggyukim · 2026-05-02T01:51:47Z

@malatewang Thanks for the review!

@cudaMancpy You also have to remove the merge commit in the commit list.

Merge branch 'main' into dongkyu/seperate-llms
- https://github.com/MemMachine/MemMachine/pull/1376/commits

Copilot

Pull request overview

This PR extends the retrieval-agent benchmark configuration to support role-specific LLM selection (agent/planner vs answer generation vs LLM judge), while preserving backward compatibility via fallback to retrieval_agent.llm_model.

Changes:

Add answer_llm_model and judge_llm_model to retrieval-agent configuration (both fallback to llm_model).
Update benchmark initialization to resolve separate agent/answer models and record agent_model_id / answer_model_id in output attributes.
Update LLM judge + evaluation pipeline to resolve/record judge_model_id, and document the three-model setup in the benchmark README.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


| File | Description |
| --- | --- |
| packages/server/src/memmachine_server/common/configuration/retrieval_config.py | Extends `RetrievalAgentConf` with `answer_llm_model` and `judge_llm_model` fields. |
| evaluation/utils/agent_utils.py | Resolves separate agent vs answer language models and returns resolved model IDs to callers. |
| evaluation/retrieval_agent/wikimultihop_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/wikimultihop_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/README.md | Documents three-role LLM configuration and provides an example config. |
| evaluation/retrieval_agent/longmemeval_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/llm_judge.py | Resolves judge model via `judge_llm_model` with fallback to `llm_model`. |
| evaluation/retrieval_agent/hotpotQA_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/evaluate.py | Records `judge_model_id` in evaluation results and selects judge model with fallback. |

Comments suppressed due to low confidence (1)

evaluation/retrieval_agent/evaluate.py:18

evaluate.py imports memmachine_server.common.configuration before adding the repo's package roots (packages/server/src) to sys.path. When running this script directly from the repo (without installing the server package), this import will fail. Move the sys.path setup above this import, or append the same PACKAGE_ROOTS used by the other retrieval_agent scripts (REPO_ROOT/packages/*/src).

from memmachine_server.common.configuration import Configuration
from tqdm import tqdm

REPO_ROOT = Path(__file__).resolve().parents[2]
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))
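
One way to apply the suggested reordering is to extend sys.path in a small helper that runs before the memmachine_server import. The sketch below is an assumption about how that could look, not the change made in this PR; `add_package_roots` is a hypothetical name, and the packages/*/src layout is taken from the review comment above.

```python
import sys
from pathlib import Path


def add_package_roots(repo_root: Path) -> list[str]:
    """Append repo_root and each packages/*/src directory to sys.path.

    Returns the entries that were newly added, in the order they were appended.
    """
    candidates = [repo_root, *sorted(repo_root.glob("packages/*/src"))]
    added = []
    for path in candidates:
        entry = str(path)
        # Skip missing directories and entries that are already importable.
        if path.is_dir() and entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    return added


# In a script two directories below the repository root, the call would sit
# ABOVE the first `from memmachine_server... import ...` line, for example:
# add_package_roots(Path(__file__).resolve().parents[2])
```

Keeping the path setup in a function makes the ordering explicit: the import that depends on it can only succeed after the call has run.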


…chmarks

Introduce support for three separate language models in the retrieval-agent benchmark framework:

- Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
- Answer model (retrieval_agent.answer_llm_model): Used for answer generation
- Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each model falls back to llm_model if not explicitly set.

Signed-off-by: Dongkyu Jeong <dongkyu1.jeong@sk.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

 async def init_memmachine_params(
    resource_manager: ResourceManagerImpl,
    session_id: str = "",
    agent_name: str = "ToolSelectAgent",
    message_sentence_chunking: bool = False,
-) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase]:
+) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase, str, str]:
    """Initialize MemMachine components from a ResourceManagerImpl.


+            "Neither retrieval_agent.judge_llm_model nor"
+            "retrieval_agent.llm_model is set in configuration.yml"
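
The two added lines above are adjacent string literals, which Python joins at compile time without inserting any separator. The snippet below shows the effect of a missing trailing space on the first literal. Whether this is what the outdated review thread on llm_judge.py flagged is not stated on this page, so treat the link as an assumption.

```python
# Adjacent literals are concatenated exactly as written.
broken_message = (
    "Neither retrieval_agent.judge_llm_model nor"
    "retrieval_agent.llm_model is set in configuration.yml"
)

# A trailing space on the first literal keeps the words apart.
fixed_message = (
    "Neither retrieval_agent.judge_llm_model nor "
    "retrieval_agent.llm_model is set in configuration.yml"
)

print(broken_message)
print(fixed_message)
```

The first message runs "nor" and "retrieval_agent" together; the second reads as intended.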


+retrieval_agent:
+  llm_model: agent_model         # Used for retrieval/planning agent
+  answer_llm_model: answer_model  # Optional: used for answer generation (falls back to llm_model)
+  judge_llm_model: judge_model   # Optional: used for LLM judge (falls back to llm_model)
+  reranker: bm25_reranker


cudaMancpy force-pushed the dongkyu/seperate-llms branch from 8125b45 to fe68fc5 Compare April 28, 2026 06:13

cudaMancpy closed this Apr 28, 2026

cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:08

cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:09

cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:09

cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:12

cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:15

cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:17

cudaMancpy reopened this Apr 28, 2026

malatewang approved these changes Apr 30, 2026


cudaMancpy force-pushed the dongkyu/seperate-llms branch 4 times, most recently from 4345a2b to 761b321 Compare May 4, 2026 07:40

sscargal requested review from Copilot, edwinyyyu and sscargal May 5, 2026 01:25

sscargal added this to the v0.3.8 milestone May 5, 2026

sscargal requested a review from Tianyang-Zhang May 5, 2026 01:25

Copilot started reviewing on behalf of sscargal May 5, 2026 01:26

Copilot AI reviewed May 5, 2026


Comment thread evaluation/utils/agent_utils.py

Comment thread evaluation/retrieval_agent/README.md Outdated

Comment thread evaluation/retrieval_agent/llm_judge.py Outdated

cudaMancpy force-pushed the dongkyu/seperate-llms branch 3 times, most recently from 7f2bce5 to ddc9cc6 Compare May 7, 2026 06:35

edwinyyyu approved these changes May 7, 2026


cudaMancpy force-pushed the dongkyu/seperate-llms branch from ddc9cc6 to d98bd92 Compare May 8, 2026 01:45

cudaMancpy force-pushed the dongkyu/seperate-llms branch from d98bd92 to abeb86e Compare May 8, 2026 02:22

sscargal requested a review from Copilot May 8, 2026 16:55

Copilot started reviewing on behalf of sscargal May 8, 2026 16:56

Copilot AI reviewed May 8, 2026


sscargal merged commit b221491 into MemMachine:main May 8, 2026
48 checks passed

honggyukim deleted the dongkyu/seperate-llms branch May 11, 2026 04:14

sscargal merged 1 commit into MemMachine:main from skhynix:dongkyu/seperate-llms

6 participants
