feat: support separate LLM models for agent, answer, and judge in ben…#1376

Merged
sscargal merged 1 commit into
MemMachine:main from
skhynix:dongkyu/seperate-llms
May 8, 2026

Conversation

@cudaMancpy

@cudaMancpy cudaMancpy commented Apr 28, 2026

Contributor

Introduce support for three separate language models in the retrieval-agent benchmark framework:

  • Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
  • Answer model (retrieval_agent.answer_llm_model): Used for answer generation
  • Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each model falls back to llm_model if not explicitly set.

Purpose of the change

The retrieval-agent benchmark framework previously used a single language model (retrieval_agent.llm_model) for all operations: agent planning, answer generation, and LLM judge evaluation. This limited flexibility when users wanted to use different models for different tasks (e.g., a smaller/faster model for agent planning, a more capable model for answer generation, and a separate model for evaluation).

Description

Introduce support for three separate language models in the retrieval-agent benchmark framework:

  • Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
  • Answer model (retrieval_agent.answer_llm_model): Used for answer generation
  • Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each optional model (answer_llm_model, judge_llm_model) falls back to llm_model if not explicitly set, ensuring backward compatibility.
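
The fallback rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the `RoleModels` dataclass and its property names are hypothetical, and only the three configuration keys (`llm_model`, `answer_llm_model`, `judge_llm_model`) come from the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoleModels:
    """Hypothetical stand-in for the retrieval_agent config section."""

    llm_model: str
    answer_llm_model: Optional[str] = None
    judge_llm_model: Optional[str] = None

    @property
    def agent(self) -> str:
        # The agent role always uses llm_model.
        return self.llm_model

    @property
    def answer(self) -> str:
        # Falls back to llm_model when answer_llm_model is unset.
        return self.answer_llm_model or self.llm_model

    @property
    def judge(self) -> str:
        # Falls back to llm_model when judge_llm_model is unset.
        return self.judge_llm_model or self.llm_model


# A legacy config with only llm_model resolves every role to the same model.
legacy = RoleModels(llm_model="agent_model")
# Overriding one optional field changes only that role.
split = RoleModels(llm_model="agent_model", judge_llm_model="judge_model")
print(legacy.agent, legacy.answer, legacy.judge)
print(split.agent, split.answer, split.judge)
```

Because both optional fields default to unset, an existing configuration that only defines `llm_model` behaves as before.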

Changes include:

  • Update init_memmachine_params() in agent_utils.py to return two models and their IDs
  • Update all benchmark callers (wikimultihop, locomo, hotpotqa, longmemeval) to handle the new return signature
  • Update llm_judge.py to resolve judge_llm_model with fallback to llm_model
  • Update evaluate.py to record judge_model_id in evaluation results
  • Update README.md with three-model configuration documentation
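
As a rough sketch of what the caller-side change might look like: the return annotation quoted later in the review is `tuple[EpisodicMemory, LanguageModel, AgentToolBase, str, str]`, so each caller has to unpack five values instead of three. The stub below only mimics that shape. The placeholder objects, the order of the two string IDs, and the names `agent_model_id` / `answer_model_id` as local variables are assumptions made for illustration.

```python
import asyncio
from typing import Any


async def init_params_stub() -> tuple[Any, Any, Any, str, str]:
    """Hypothetical stand-in with the same return shape as the new annotation.

    Plain objects replace EpisodicMemory, LanguageModel, and AgentToolBase.
    """
    memory, model, agent = object(), object(), object()
    # Assumed order: agent model ID first, answer model ID second.
    return memory, model, agent, "agent_model", "answer_model"


async def main() -> dict[str, str]:
    # Callers that previously unpacked three values now unpack five.
    memory, model, agent, agent_model_id, answer_model_id = await init_params_stub()
    # The IDs can then be recorded alongside the benchmark output.
    return {"agent_model_id": agent_model_id, "answer_model_id": answer_model_id}


attributes = asyncio.run(main())
print(attributes)
```

A caller that still unpacks three values would raise `ValueError: too many values to unpack`, which is why every benchmark script listed above had to be touched.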

The changes are limited to the evaluation/retrieval_agent/ and evaluation/utils/ directories. There are no changes to core MemMachine server functionality. The configuration.yml schema extends the existing retrieval_agent section with two optional fields.

Fixes/Closes

Fixes #1264

Type of change

  • New feature (non-breaking change which adds functionality)
  • Documentation update

How Has This Been Tested?


  • Unit Test
  • Test Script (please provide)

Test Results:

./run_test.sh wikimultihop multi_model search retrieval_agent 100 --search-concurrency 1 --judge-concurrency 2
(.venv) dkjeong@Supermicro-EMR:~/mm/evaluation/retrieval_agent$ cd result/
(.venv) dkjeong@Supermicro-EMR:~/mm/evaluation/retrieval_agent/result$ ll
total 1744
drwxrwxr-x 3 dkjeong dkjeong   4096 Apr 28 14:58 ./
drwxrwxr-x 4 dkjeong dkjeong   4096 Apr 28 14:34 ../
drwxrwxr-x 2 dkjeong dkjeong   4096 Apr 28 14:58 final_score/
-rw-rw-r-- 1 dkjeong dkjeong 887178 Apr 28 14:58 wikimultihop_retrieval_agent_evaluation_metrics_multi_model.json
-rw-rw-r-- 1 dkjeong dkjeong 883476 Apr 28 14:58 wikimultihop_retrieval_agent_output_multi_model.json
  • wikimultihop_retrieval_agent_output_multi_model.json — Contains the generated answers along with answer_model_id indicating which answer model was used.
  • wikimultihop_retrieval_agent_evaluation_metrics_multi_model.json — Contains the evaluation metrics along with judge_model_id indicating which judge model was used.
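
One way to check which models a run actually recorded is to search the result files for these keys. The sketch below makes no assumption about where `answer_model_id` or `judge_model_id` sit inside the JSON, since the file layout is not shown here; it simply walks the whole structure. The helper names and the in-memory sample are hypothetical.

```python
import json
from typing import Any, Iterator


def find_values(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_values(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


def model_ids_in_file(path: str, key: str) -> set[str]:
    """Collect the distinct values recorded under `key` in a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return {str(value) for value in find_values(json.load(handle), key)}


# Made-up structure, used only to show the traversal.
sample = {"run": {"answer_model_id": "answer_model"}, "items": [{"answer_model_id": "answer_model"}]}
print(set(find_values(sample, "answer_model_id")))
```

For example, `model_ids_in_file("wikimultihop_retrieval_agent_output_multi_model.json", "answer_model_id")` would return the set of answer model IDs found in the output file.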

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

Further comments


@honggyukim

Contributor

Hi @sscargal, @cudaMancpy is also our team member. Please have a look. Thanks!

@cudaMancpy cudaMancpy force-pushed the dongkyu/seperate-llms branch from 8125b45 to fe68fc5 Compare April 28, 2026 06:13
@cudaMancpy cudaMancpy closed this Apr 28, 2026
@cudaMancpy cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:08
@cudaMancpy cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:09
@cudaMancpy cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:09
@cudaMancpy cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:12
@cudaMancpy cudaMancpy deleted the dongkyu/seperate-llms branch April 28, 2026 08:15
@cudaMancpy cudaMancpy restored the dongkyu/seperate-llms branch April 28, 2026 08:17
@cudaMancpy cudaMancpy reopened this Apr 28, 2026
@cudaMancpy

Contributor Author

There was a typo in the branch name, so I renamed the branch, which closed the pull request. I renamed it back to its previous name and reopened the pull request. Sorry for cluttering the conversation!

@malatewang

Contributor

Can you fix the lint problem?

@honggyukim

Contributor

@malatewang Thanks for the review!

@cudaMancpy You also have to remove the merge commit in the commit list.

@cudaMancpy cudaMancpy force-pushed the dongkyu/seperate-llms branch 4 times, most recently from 4345a2b to 761b321 Compare May 4, 2026 07:40
@sscargal sscargal requested review from Copilot, edwinyyyu and sscargal May 5, 2026 01:25
@sscargal sscargal added this to the v0.3.8 milestone May 5, 2026
@sscargal sscargal requested a review from Tianyang-Zhang May 5, 2026 01:25

Copilot AI left a comment

Contributor

Pull request overview

This PR extends the retrieval-agent benchmark configuration to support role-specific LLM selection (agent/planner vs answer generation vs LLM judge), while preserving backward compatibility via fallback to retrieval_agent.llm_model.

Changes:

  • Add answer_llm_model and judge_llm_model to retrieval-agent configuration (both fall back to llm_model).
  • Update benchmark initialization to resolve separate agent/answer models and record agent_model_id / answer_model_id in output attributes.
  • Update LLM judge + evaluation pipeline to resolve/record judge_model_id, and document the three-model setup in the benchmark README.
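
The judge-model resolution mentioned in the last bullet could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in llm_judge.py: the function name and the plain-mapping input are hypothetical. The error text follows the two string literals quoted later in this review, with a trailing space added after "nor", because Python joins adjacent string literals without any separator.

```python
from typing import Mapping, Optional


def resolve_judge_model_id(retrieval_agent: Mapping[str, Optional[str]]) -> str:
    """Return judge_llm_model if set, otherwise llm_model; fail if neither is set."""
    model_id = retrieval_agent.get("judge_llm_model") or retrieval_agent.get("llm_model")
    if not model_id:
        # Adjacent literals are concatenated as-is, so the space after "nor"
        # is what keeps the two halves of the message apart.
        raise ValueError(
            "Neither retrieval_agent.judge_llm_model nor "
            "retrieval_agent.llm_model is set in configuration.yml"
        )
    return model_id


print(resolve_judge_model_id({"llm_model": "agent_model"}))
print(resolve_judge_model_id({"llm_model": "agent_model", "judge_llm_model": "judge_model"}))
```

Raising early with both key names in the message tells the user which of the two settings to add.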

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| packages/server/src/memmachine_server/common/configuration/retrieval_config.py | Extends RetrievalAgentConf with answer_llm_model and judge_llm_model fields. |
| evaluation/utils/agent_utils.py | Resolves separate agent vs answer language models and returns resolved model IDs to callers. |
| evaluation/retrieval_agent/wikimultihop_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/wikimultihop_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/README.md | Documents three-role LLM configuration and provides an example config. |
| evaluation/retrieval_agent/longmemeval_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/llm_judge.py | Resolves judge model via judge_llm_model with fallback to llm_model. |
| evaluation/retrieval_agent/hotpotQA_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/evaluate.py | Records judge_model_id in evaluation results and selects judge model with fallback. |
Comments suppressed due to low confidence (1)

evaluation/retrieval_agent/evaluate.py:18

  • evaluate.py imports memmachine_server.common.configuration before adding the repo's package roots (packages/server/src) to sys.path. When running this script directly from the repo (without installing the server package), this import will fail. Move the sys.path setup above this import, or append the same PACKAGE_ROOTS used by the other retrieval_agent scripts (REPO_ROOT/packages/*/src).
from memmachine_server.common.configuration import Configuration
from tqdm import tqdm

REPO_ROOT = Path(__file__).resolve().parents[2]
if str(REPO_ROOT) not in sys.path:
    sys.path.append(str(REPO_ROOT))
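
A minimal sketch of the reordering suggested in this comment, assuming the package roots follow the `REPO_ROOT/packages/*/src` layout it mentions. The helper name `add_import_roots` is hypothetical; the point is only that the path setup runs before the first-party import.

```python
import sys
from pathlib import Path


def add_import_roots(repo_root: Path) -> list[str]:
    """Append the repo root and every packages/*/src directory to sys.path.

    Returns the entries that were newly added.
    """
    roots = [repo_root, *sorted((repo_root / "packages").glob("*/src"))]
    added = []
    for root in roots:
        entry = str(root)
        if entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    return added


# In evaluate.py this would run before any first-party import, for example:
#   add_import_roots(Path(__file__).resolve().parents[2])
#   from memmachine_server.common.configuration import Configuration  # noqa: E402
```

With the path setup first, the script can be run directly from a checkout without installing the server package.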


Comment thread evaluation/utils/agent_utils.py
Comment thread evaluation/retrieval_agent/README.md Outdated
Comment thread evaluation/retrieval_agent/llm_judge.py Outdated
@cudaMancpy cudaMancpy force-pushed the dongkyu/seperate-llms branch 3 times, most recently from 7f2bce5 to ddc9cc6 Compare May 7, 2026 06:35
@cudaMancpy cudaMancpy force-pushed the dongkyu/seperate-llms branch from ddc9cc6 to d98bd92 Compare May 8, 2026 01:45
…chmarks

Introduce support for three separate language models in the retrieval-agent benchmark framework:
- Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
- Answer model (retrieval_agent.answer_llm_model): Used for answer generation
- Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each model falls back to llm_model if not explicitly set.

Signed-off-by: Dongkyu Jeong <dongkyu1.jeong@sk.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment on lines 383 to 389
  async def init_memmachine_params(
      resource_manager: ResourceManagerImpl,
      session_id: str = "",
      agent_name: str = "ToolSelectAgent",
      message_sentence_chunking: bool = False,
- ) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase]:
+ ) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase, str, str]:
      """Initialize MemMachine components from a ResourceManagerImpl.
Comment on lines +65 to +66
"Neither retrieval_agent.judge_llm_model nor"
"retrieval_agent.llm_model is set in configuration.yml"
Comment on lines +421 to +425
retrieval_agent:
  llm_model: agent_model # Used for retrieval/planning agent
  answer_llm_model: answer_model # Optional: used for answer generation (falls back to llm_model)
  judge_llm_model: judge_model # Optional: used for LLM judge (falls back to llm_model)
  reranker: bm25_reranker
@sscargal sscargal merged commit b221491 into MemMachine:main May 8, 2026
48 checks passed
@honggyukim honggyukim deleted the dongkyu/seperate-llms branch May 11, 2026 04:14
Development

Successfully merging this pull request may close these issues.

[Feat]: Support role-specific LLM configuration in retrieval-agent benchmarks

6 participants