feat: support separate LLM models for agent, answer, and judge in ben…#1376
Hi @sscargal, @cudaMancpy is also our team member. Please have a look. Thanks!
The branch name had a typo, so I renamed the branch, which closed the pull request. I have restored the previous name and re-opened the pull request. Sorry for cluttering the conversation!
Can you fix the lint problem?
@malatewang Thanks for the review! @cudaMancpy You also have to remove the merge commit from the commit list.
Pull request overview
This PR extends the retrieval-agent benchmark configuration to support role-specific LLM selection (agent/planner vs answer generation vs LLM judge), while preserving backward compatibility via fallback to retrieval_agent.llm_model.
Changes:
- Add answer_llm_model and judge_llm_model to the retrieval-agent configuration (both fall back to llm_model).
- Update benchmark initialization to resolve separate agent/answer models and record agent_model_id/answer_model_id in output attributes.
- Update the LLM judge and evaluation pipeline to resolve and record judge_model_id, and document the three-model setup in the benchmark README.
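The fallback rule described above can be illustrated with a minimal sketch. The class `RetrievalAgentConfSketch` and the function `resolve_role_models` are hypothetical names invented for this example; they are not the PR's actual `RetrievalAgentConf` or its resolution code, and only the three field names come from the overview.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RetrievalAgentConfSketch:
    """Hypothetical stand-in for the retrieval_agent config section."""

    llm_model: Optional[str] = None
    answer_llm_model: Optional[str] = None
    judge_llm_model: Optional[str] = None


def resolve_role_models(conf: RetrievalAgentConfSketch) -> dict:
    """Map each role to a model ID, falling back to llm_model when unset."""

    def pick(role_specific: Optional[str]) -> Optional[str]:
        # An unset (None or empty) role-specific field defers to llm_model.
        return role_specific or conf.llm_model

    return {
        "agent_model_id": conf.llm_model,
        "answer_model_id": pick(conf.answer_llm_model),
        "judge_model_id": pick(conf.judge_llm_model),
    }


# A legacy config that only sets llm_model keeps working unchanged.
legacy = resolve_role_models(RetrievalAgentConfSketch(llm_model="agent_model"))
```

With only `llm_model` set, all three roles resolve to the same model, which is the backward-compatible behavior the overview describes.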
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/server/src/memmachine_server/common/configuration/retrieval_config.py | Extends RetrievalAgentConf with answer_llm_model and judge_llm_model fields. |
| evaluation/utils/agent_utils.py | Resolves separate agent vs answer language models and returns resolved model IDs to callers. |
| evaluation/retrieval_agent/wikimultihop_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/wikimultihop_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/README.md | Documents three-role LLM configuration and provides an example config. |
| evaluation/retrieval_agent/longmemeval_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_search.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/locomo_ingest.py | Updates unpacking to match the new init return signature. |
| evaluation/retrieval_agent/llm_judge.py | Resolves judge model via judge_llm_model with fallback to llm_model. |
| evaluation/retrieval_agent/hotpotQA_test.py | Uses the answer model for generation and records agent/answer model IDs in outputs. |
| evaluation/retrieval_agent/evaluate.py | Records judge_model_id in evaluation results and selects judge model with fallback. |
Comments suppressed due to low confidence (1)
evaluation/retrieval_agent/evaluate.py:18
- evaluate.py imports memmachine_server.common.configuration before adding the repo's package roots (packages/server/src) to sys.path. When running this script directly from the repo (without installing the server package), this import will fail. Move the sys.path setup above this import, or append the same PACKAGE_ROOTS used by the other retrieval_agent scripts (REPO_ROOT/packages/*/src).
from memmachine_server.common.configuration import Configuration
from tqdm import tqdm
REPO_ROOT = Path(__file__).resolve().parents[2]
if str(REPO_ROOT) not in sys.path:
sys.path.append(str(REPO_ROOT))
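One way to address the import-order concern is sketched below, under the assumption that the package sources live at REPO_ROOT/packages/*/src as the comment states. The helper name `add_package_roots` is hypothetical and does not appear in the PR.

```python
import sys
from pathlib import Path


def add_package_roots(repo_root: Path) -> list:
    """Append repo_root and every packages/*/src directory to sys.path.

    Returns the entries that were newly added, in order.
    """
    candidates = [repo_root, *sorted(repo_root.glob("packages/*/src"))]
    added = []
    for path in candidates:
        entry = str(path)
        # Skip missing directories and entries that are already present.
        if path.is_dir() and entry not in sys.path:
            sys.path.append(entry)
            added.append(entry)
    return added


# In evaluate.py this call would need to run before the server import:
#   add_package_roots(Path(__file__).resolve().parents[2])
#   from memmachine_server.common.configuration import Configuration
```

The point is ordering: the path setup has to execute before any import that depends on it, so it belongs above the `memmachine_server` import and not below it.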
…chmarks

Introduce support for three separate language models in the retrieval-agent benchmark framework:
- Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
- Answer model (retrieval_agent.answer_llm_model): Used for answer generation
- Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each model falls back to llm_model if not explicitly set.

Signed-off-by: Dongkyu Jeong <dongkyu1.jeong@sk.com>
  async def init_memmachine_params(
      resource_manager: ResourceManagerImpl,
      session_id: str = "",
      agent_name: str = "ToolSelectAgent",
      message_sentence_chunking: bool = False,
- ) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase]:
+ ) -> tuple[EpisodicMemory, LanguageModel, AgentToolBase, str, str]:
      """Initialize MemMachine components from a ResourceManagerImpl.
"Neither retrieval_agent.judge_llm_model nor "
"retrieval_agent.llm_model is set in configuration.yml"
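Python joins adjacent string literals with no separator, so a message split across lines needs an explicit space at the boundary. The sketch below shows the message with that space plus a judge-model lookup that raises it. The function names `judge_model_error_message` and `resolve_judge_model_id` are hypothetical, and the choice of `ValueError` is an assumption, not necessarily what llm_judge.py raises.

```python
from typing import Optional


def judge_model_error_message() -> str:
    # The trailing space after "nor" keeps the two option names apart
    # once the adjacent literals are concatenated.
    return (
        "Neither retrieval_agent.judge_llm_model nor "
        "retrieval_agent.llm_model is set in configuration.yml"
    )


def resolve_judge_model_id(
    judge_llm_model: Optional[str], llm_model: Optional[str]
) -> str:
    """Prefer judge_llm_model, fall back to llm_model, else fail loudly."""
    model_id = judge_llm_model or llm_model
    if not model_id:
        raise ValueError(judge_model_error_message())
    return model_id
```

Without the trailing space the rendered message would read "norretrieval_agent.llm_model", which is easy to miss in review because each source line looks correct on its own.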
retrieval_agent:
  llm_model: agent_model  # Used for retrieval/planning agent
  answer_llm_model: answer_model  # Optional: used for answer generation (falls back to llm_model)
  judge_llm_model: judge_model  # Optional: used for LLM judge (falls back to llm_model)
  reranker: bm25_reranker
Purpose of the change
The retrieval-agent benchmark framework previously used a single language model (retrieval_agent.llm_model) for all operations: agent planning, answer generation, and LLM judge evaluation. This limited flexibility when users wanted to use different models for different tasks (e.g., a smaller/faster model for agent planning, a more capable model for answer generation, and a separate model for evaluation).
Description
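Introduce support for three separate language models in the retrieval-agent benchmark framework:
- Agent model (retrieval_agent.llm_model): Used for retrieval/planning agent
- Answer model (retrieval_agent.answer_llm_model): Used for answer generation
- Judge model (retrieval_agent.judge_llm_model): Used for LLM judge evaluation

Each optional model (answer_llm_model, judge_llm_model) falls back to llm_model if not explicitly set, ensuring backward compatibility.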
Changes include:
The changes are limited to the evaluation/retrieval_agent/ and evaluation/utils/ directories. There are no changes to core MemMachine server functionality. The configuration.yml schema extends the existing retrieval_agent section with two optional fields.
Fixes/Closes
Fixes #1264
Type of change
How Has This Been Tested?
Test Results:
- wikimultihop_retrieval_agent_output_multi_model.json: Contains the generated answers along with answer_model_id, indicating which answer model was used.
- wikimultihop_retrieval_agent_evaluation_metrics_multi_model.json: Contains the evaluation metrics along with judge_model_id, indicating which judge model was used.

Checklist
Maintainer Checklist
Screenshots/Gifs
Further comments