Retrieval agent implementation and benchmark updates#1103
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds an “agent_mode” option to route long-term memory search through a new retrieval-agent orchestration (multi-hop, split-query, tool-selection), updates benchmarks/scripts, and propagates config + SDK changes to support an LLM model for the retrieval agent.
Changes:
- Add retrieval-agent framework (ToolSelect / ChainOfQuery / SplitQuery / MemMachine retriever) and wire it into
LongTermMemory.search_scored(..., agent_mode=...). - Extend configs + SDKs (Python + TS) to support
agent_modein search and addllm_modelto long-term memory config/resources. - Add/refresh evaluation scripts + docs for new benchmark workflows (LoCoMo, WikiMultiHop, HotpotQA) and simplify CLI usage.
Reviewed changes
Copilot reviewed 69 out of 89 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/memmachine/server/api_v2/test_spec.py | Asserts default agent_mode behavior in request spec parsing. |
| tests/memmachine/server/api_v2/test_router.py | Verifies router passes agent_mode through to service layer. |
| tests/memmachine/retrieval_agent/test_retrieval_agent.py | New unit tests covering retrieval-agent behaviors and reranking. |
| tests/memmachine/rest_client/test_memory.py | Adds Python REST client tests for agent_mode payload field. |
| tests/memmachine/main/test_memmachine_mock.py | Updates mocked config defaults to include llm_model for LTM. |
| tests/memmachine/common/session_manager/test_session_manager.py | Adds llm_model to episodic memory test configuration. |
| tests/memmachine/common/configuration/test_configuration.py | Validates merge/update behavior for new llm_model field. |
| src/memmachine/server/api_v2/service.py | Passes agent_mode through to MemMachine.query_search(); adds logging. |
| src/memmachine/retrieval_agent/common/agent_api.py | Introduces retrieval-agent base API, policies, and rerank helper. |
| src/memmachine/retrieval_agent/agents/tool_select_agent.py | Adds LLM-based tool routing agent. |
| src/memmachine/retrieval_agent/agents/split_query_agent.py | Adds query splitting agent for multi-entity single-hop queries. |
| src/memmachine/retrieval_agent/agents/memmachine_retriever.py | Adds direct MemMachine memory retriever agent wrapper. |
| src/memmachine/retrieval_agent/agents/coq_agent.py | Adds chain-of-query rewriting + sufficiency-check agent. |
| src/memmachine/retrieval_agent/agents/init.py | Exposes retrieval-agent classes in package exports. |
| src/memmachine/rest_client/memory.py | Adds agent_mode parameter to Python SDK Memory.search(). |
| src/memmachine/rest_client/README.md | Documents agent_mode usage in Python REST client README. |
| src/memmachine/main/memmachine.py | Propagates agent_mode through episodic search; validates llm_model. |
| src/memmachine/episodic_memory/long_term_memory/service_locator.py | Loads llm_model resource for long-term memory params. |
| src/memmachine/episodic_memory/long_term_memory/long_term_memory.py | Implements agent-mode branching and initializes retrieval agent. |
| src/memmachine/episodic_memory/episodic_memory.py | Plumbs agent_mode through episodic query path; adds logging. |
| src/memmachine/common/language_model/openai_responses_language_model.py | Adds reasoning_effort and token-usage-returning API method. |
| src/memmachine/common/filter/filter_parser.py | Marks FilterExpr as @runtime_checkable. |
| src/memmachine/common/errors.py | Adds language-model config errors for missing/default LLM model. |
| src/memmachine/common/configuration/language_model_conf.py | Adds contains_language_model() helper. |
| src/memmachine/common/configuration/episodic_config.py | Adds llm_model to long-term memory config/partial config. |
| src/memmachine/common/configuration/init.py | Adds default/check helpers for long-term-memory llm_model. |
| src/memmachine/common/api/spec.py | Adds agent_mode field to SearchMemoriesSpec. |
| src/memmachine/common/api/doc.py | Documents agent_mode and adds examples. |
| src/memmachine-ts/rest_client/tests/memmachine-memory.spec.ts | Adds TS client test asserting agent_mode in search payload. |
| src/memmachine-ts/rest_client/src/memory/memmachine-memory.types.ts | Adds agent_mode option to TS SearchMemoriesOptions. |
| src/memmachine-ts/rest_client/src/memory/memmachine-memory.ts | Includes agent_mode field in TS search request payload. |
| src/memmachine-ts/rest_client/README.md | Documents TS usage example for agent_mode. |
| sample_configs/episodic_memory_config.gpu.sample | Adds llm_model to long-term memory sample config. |
| sample_configs/episodic_memory_config.cpu.sample | Adds llm_model to long-term memory sample config. |
| memmachine-compose.sh | Updates config generation to rewrite LTM llm_model value. |
| evaluation/utils/memmachine_helper_restapiv2.py | Adds agent_mode to evaluation REST helper payload. |
| evaluation/utils/agent_utils.py | New helper utilities for retrieval-agent benchmark workflows. |
| evaluation/retrieval_agent/wikimultihop_search.py | Adds WikiMultiHop search benchmark script (agent/memmachine/llm modes). |
| evaluation/retrieval_agent/wikimultihop_ingest.py | Adds WikiMultiHop ingestion script. |
| evaluation/retrieval_agent/run_test.sh | New unified runner for locomo/wikimultihop/hotpotqa ingest/search. |
| evaluation/retrieval_agent/locomo_search.py | Adds LoCoMo search benchmark script using new agent utils. |
| evaluation/retrieval_agent/locomo_ingest.py | Adds LoCoMo ingestion script using new agent utils. |
| evaluation/retrieval_agent/locomo_delete.py | Simplifies LoCoMo delete script signature/typing. |
| evaluation/retrieval_agent/llm_judge.py | Updates judging model + parsing; simplifies IO paths. |
| evaluation/retrieval_agent/hotpotQA_test.py | Adds HotpotQA ingest/search benchmark script. |
| evaluation/retrieval_agent/generate_scores.py | Adds consolidated scoring report for retrieval-agent benchmark outputs. |
| evaluation/retrieval_agent/evaluate.py | Updates evaluation pipeline to new output schema fields. |
| evaluation/locomo/episodic_memory/locomo_ingest.py | Removes older legacy ingestion script. |
| evaluation/locomo/episodic_memory/locomo_config.yaml | Removes legacy YAML config stub. |
| evaluation/locomo/episodic_memory/README.md | Removes legacy locomo episodic_memory README. |
| evaluation/locomo/episodic_agent/run_experiments.py | Removes legacy episodic agent runner. |
| evaluation/locomo/episodic_agent/restapiv2_locomo_search_agent.py | Removes legacy restapiv2 search agent implementation. |
| evaluation/locomo/episodic_agent/memmachine_locomo.py | Removes legacy MemMachine locomo runner. |
| evaluation/locomo/episodic_agent/locomo_agent.py | Removes legacy OpenAI Agents-sdk locomo agent code. |
| evaluation/locomo/episodic_agent/generate_scores.py | Removes legacy scoring script. |
| evaluation/locomo/episodic_agent/init.py | Removes legacy package marker. |
| evaluation/locomo/episodic_agent/README.md | Removes legacy episodic agent README. |
| evaluation/episodic_memory/restapiv2_locomo_search.py | Adds CLI flag to enable agent_mode in legacy episodic workflow. |
| evaluation/episodic_memory/README.md | New README for legacy episodic workflow (with agent-mode option). |
| evaluation/README.md | Replaces top-level evaluation guide with retrieval-agent workflow docs. |
| docs/open_source/configuration.mdx | Documents llm_model under long-term memory config. |
| docs/install_guide/integrate/GPTStore.mdx | Documents agent_mode in schema snippet for GPT Store integration. |
| docs/getting_started/benchmarks.mdx | Points users to updated evaluation guide; expands benchmark mentions. |
| docs/examples/rest.mdx | Documents agent_mode in REST search example. |
| docs/api_reference/ts-rest/interfaces/SearchMemoryResult.mdx | Updates TS REST docs for renamed interfaces/options and adds agent_mode. |
| docs/api_reference/ts-rest/classes/MemMachineMemory.mdx | Updates TS class docs for new option type names and response structure. |
| docs/api_reference/python/memory_api.mdx | Updates Python SDK docs for expanded search() signature + agent_mode. |
| USAGE.md | Documents REST + Python SDK defaults including agent_mode. |
| README.md | Adds bibtex reference section relevant to agent prompts. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
af8e56a to
53a2d98
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 87 out of 105 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (6)
evaluation/utils/agent_utils.py:1
- Corrected spelling of 'matrics' to 'metrics'.
evaluation/retrieval_agent/wikimultihop_ingest.py:1 - Corrected spelling of 'ketword' to 'keyword'.
evaluation/retrieval_agent/hotpotQA_test.py:1 - Corrected spelling of 'needel' to 'needle'.
src/memmachine/retrieval_agent/agents/tool_select_agent.py:1 - Corrected spelling of 'intput' to 'input'.
src/memmachine/main/memmachine.py:1 - Corrected spelling of 'Faield' to 'Failed'.
src/memmachine/common/language_model/language_model.py:1 - Corrected spelling of 'intput' to 'input'.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e5aa675 to
4db1a12
Compare
sscargal
left a comment
There was a problem hiding this comment.
Suggested some changes.
ed1fd66 to
6a21bbc
Compare
2ea643c to
322e8a6
Compare
malatewang
left a comment
There was a problem hiding this comment.
I went through the code and also reviewed the PR with codex. There were not major problems. As we discussed offline, the agent should work on top of semantic and episodic memory. It can be addressed by furture PR
| self, query: QueryParam, episodes: list[Episode] | ||
| ) -> list[Episode]: | ||
| if query.limit <= 0: | ||
| return sorted(episodes, key=lambda x: x.timestamp) |
There was a problem hiding this comment.
should apply the top-k here
There was a problem hiding this comment.
In this case query.limit <= 0 indicates the 'k' is 0/none, therefore no top-k is applied.
97f37d5 to
aaa53a2
Compare
|
Refactored to only use top-level EpisodicMemory API in retrieval agent, and updated corresponding tests, docs. Please take a look, thanks! @malatewang @sscargal |
There was a problem hiding this comment.
Hi! This LGTM. I may keep the content the same, but change the wording in retrieval_agent_architecture.mdx to match the rest of the site's cohesive language style at some future point (not part of this PR).
Also - due to changes that were in flight at the same time as this PR, there are 4 conflicts in our /docs that must be resolved before we can merge. Please resolve those. Once you do, you should be able to squash and merge.
ec840da to
f9a5a64
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 95 out of 117 changed files in this pull request and generated 7 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| COMBINED_SUFFICIENCY_AND_REWRITE_PROMPT = """You are a meticulous expert in retrieval-augmented question answering evaluation and query rewriting. | ||
|
|
||
| Task: Given (1) an original user query, (2) rewritten queries already tried, and (3) retrieved documents, you must decide whether the documents are sufficient to answer the orig | ||
| inal query directly, completely, and explicitly. If insufficient, generate the NEXT BEST rewritten subquery to retrieve the missing evidence. If sufficient, set the rewritten qu | ||
| ery to the original query. | ||
|
|
||
| Hard constraints: | ||
| - Use ONLY the provided retrieved documents for sufficiency judgment. Do NOT use external knowledge, browsing, assumptions, or plausibility. |
There was a problem hiding this comment.
COMBINED_SUFFICIENCY_AND_REWRITE_PROMPT also contains mid-word hard wraps (e.g., orig\ninal, qu\nery). These look accidental and can materially affect the chain-of-query rewrite/sufficiency behavior. Suggest reflowing the prompt to remove mid-word newlines so instructions are unambiguous.
| if isinstance(result.index, int): | ||
| result["type"] = result.index.map(lambda x: categories[x - 1]) | ||
|
|
There was a problem hiding this comment.
if isinstance(result.index, int): will never be true because result.index is a pandas Index object, so the type column mapping won’t run. If you want to map numeric category IDs to names, consider using pd.api.types.is_integer_dtype(result.index) (or attempt result.index.astype(int) in a try/except) before applying the mapping.
| async def _do_rerank( | ||
| self, query: QueryParam, episodes: list[Episode] | ||
| ) -> list[Episode]: | ||
| if query.limit <= 0: | ||
| return sorted(episodes, key=lambda x: x.created_at) | ||
|
|
||
| if len(episodes) <= query.limit or self._reranker is None: | ||
| if len(episodes) == 0: | ||
| return episodes | ||
| return sorted(episodes, key=lambda x: x.created_at) | ||
|
|
There was a problem hiding this comment.
_do_rerank ignores query.limit when self._reranker is None (or when len(episodes) <= query.limit), returning all episodes sorted by created_at. This can violate the caller’s top_k/limit contract and potentially return a much larger payload than expected. Consider always truncating to query.limit when it’s > 0 (even without a reranker), and only then applying the final chronological sort if that ordering is required.
| TOOL_SELECT_PROMPT = """You are a tool router. Your task is to select exactly ONE tool name from the provided list that best fits the user query. Do not call any tools. Use only the text in {query}; d | ||
| o not assume external context or missing metadata. You may interpret dependency only from explicit linguistic structure (e.g., "use X to find Y", possessive/relationship chains) | ||
| . | ||
|
|
||
| GOAL | ||
| - Choose exactly one of: {coq}, {split_query}, {memory_retrieval} | ||
| - Output NONE only when the query type cannot be determined from {query} (e.g., empty/invalid/malformed). | ||
|
|
||
| MECHANISM (what to do, then how to do it) | ||
|
|
||
| 1) Validate input | ||
| - If {query} is empty, whitespace-only, null-like, non-linguistic garbage, or otherwise not classifiable as a query/request -> output NONE. | ||
|
|
||
| 2) Classify the query type from {query} using ONLY the operational criteria below (pick exactly one): | ||
|
|
||
| A) MULTI-HOP (dependency chain; later step depends on earlier result) | ||
| Choose MULTI-HOP if the query requires two or more dependent steps where you must first determine X, then use X to determine Y (or more). | ||
| - Explicit dependency signals: "then", "after", "using that", "based on that result", "from there", "which of those", "once you find", "given the answer to", "trace/derive". | ||
| - Relationship-chain dependency: queries that require resolving an intermediate entity/attribute to proceed (e.g., "X's spouse's company", "author of the book that inspired the | ||
| film"). | ||
| - Comparison/timeline becomes MULTI-HOP ONLY if it requires derived attributes that must be found first (e.g., "Which is older, the CEO of A or the founder of B?" requires findi | ||
| ng CEO/founder first). | ||
| - Tie-breaker: If any explicit dependency chain exists, classify as MULTI-HOP even if multiple entities/keywords appear. | ||
|
|
||
| B) SINGLE-HOP WITH MULTIPLE ENTITIES/KEYWORDS (independent sub-queries; no dependency) | ||
| Choose this when the query contains multiple entities/subjects that can be answered via separate independent lookups and then combined, without needing any earlier result to for | ||
| m the later lookup. | ||
| - Signals: "A and B", "A, B, and C", "for each of", "separately", "list ... for these items". |
There was a problem hiding this comment.
TOOL_SELECT_PROMPT appears to have accidental hard line breaks mid-word (e.g., ...{query}; d\no not..., findi\nng, for\nm). This will degrade the routing prompt quality and can lead to incorrect tool selection. Suggest reformatting the prompt to remove mid-word newlines/stray characters and keep sentences intact.
| selected_tool = None | ||
| for tool in self._children_tools: | ||
| if tool.agent_name in rsp: | ||
| selected_tool = tool | ||
| break | ||
| if selected_tool is None: | ||
| logger.warning("Tool %s not found", rsp) | ||
| if selected_tool is None and self._default_tool is not None: | ||
| logger.warning("No tool selected, using default tool") | ||
| return self._default_tool, input_token, output_token | ||
| return selected_tool, input_token, output_token |
There was a problem hiding this comment.
Tool selection currently checks if tool.agent_name in rsp, which is a substring match across the full model output. If the model ever returns extra text (e.g., repeats multiple tool names, includes examples, etc.), this can select the wrong tool nondeterministically. Consider strict parsing (e.g., take the first non-empty line, strip it, and require an exact match against known tool names; treat anything else as NONE/fallback).
| SPLIT_QUERY_PROMPT = """You are a search expert. Transform the input query into either: | ||
| - multiple single-hop sub-queries (2-6 lines), or | ||
| - the original query unchanged (1 line), | ||
| following the rules below. | ||
|
|
||
| Query | ||
| {query} | ||
|
|
||
| Mechanism (what to do, then how to do it) | ||
|
|
||
| 1) Decide whether to split (default: do NOT split) | ||
| - Do NOT split if the answer can be retrieved from a single page/infobox/database record/field set for the same entity and timeframe (i.e., one co-located source would likely co | ||
| ntain it). | ||
| - Split ONLY if you must retrieve >=2 distinct facts that are not co-located OR are for different entities and/or different timepoints/locations/contexts. | ||
| - Tie-breaker: when unsure, prefer NOT splitting. | ||
|
|
||
| 2) Special cases that limit splitting | ||
| - Multi-constraint single-entity queries (e.g., "A's birth date and birthplace"): | ||
| - Keep as ONE query if those attributes are typically co-located in the same reference entry for A. | ||
| - Split only if the attributes are typically separate lookups (or clearly not co-located). | ||
| - List-style questions ("all/each/every ... and their ..."): | ||
| - Keep as-is unless the query explicitly names 2-6 specific entities/subjects that can each become one sub-query. | ||
| - Do not split if it would likely require more than 6 lines. | ||
|
|
||
| 3) If splitting: produce single-hop fact-lookups only | ||
| - Each sub-query must be directly answerable by one fact lookup (one field/value) with no derived operations. | ||
| - Explicit ban: do NOT use derived/operation wording in sub-queries, including (non-exhaustive) "compare," "difference," "between," "rate," "top," "average," "change," "increase | ||
| ," "decrease," "percent," "rank," "versus," "more/less than." | ||
| - Rewrite derived intents into pure fact retrievals (e.g., "between X and Y" -> ask for X and ask for Y; "compare A and B in 2023" -> ask A in 2023 and ask B in 2023). | ||
|
|
||
| 4) Preserve intent and attach constraints correctly | ||
| - Keep the same entities/aliases and the same constraints (timeframe, location, context, units) from the original query. | ||
| - For mixed structures, maintain left-to-right entity/subject order AND attach any paired constraints to every relevant sub-query (e.g., "in 2023," "in Paris," "during WWII" app | ||
| ears on each line it applies to). | ||
| - Do not add assumptions or extra constraints. | ||
|
|
||
| 5) Handling common structures | ||
| - Conjunctions ("A and B"): one sub-query per entity/subject for the same attribute and constraints. | ||
| - Multi-entity multi-attribute: split by entity first in left-to-right order; within each entity, include only the minimal single-hop attribute per line needed to cover the orig | ||
| inal query (while respecting the 2-6 line cap). | ||
| - Relational questions ("A's relationship to B"): |
There was a problem hiding this comment.
SPLIT_QUERY_PROMPT contains mid-word line breaks (e.g., co\nntain, increase\n ,, app\nears) which look unintentional. These artifacts can confuse the model and reduce split quality. Suggest removing the hard wraps so the prompt text is contiguous and readable.
| # Rerank base on all queries concatenated | ||
| param = query.model_copy() | ||
| if len(sub_queries) > 1: | ||
| param.query += "\n".join(sub_queries) | ||
| final_episodes = await self._do_rerank(param, result) |
There was a problem hiding this comment.
When building the final rerank query, param.query += "\n".join(sub_queries) appends without a leading newline, so the original query and the first sub-query can run together (e.g., "orig?sub1?\nsub2?"). Consider replacing the query with a clearly delimited combined string (or add an explicit leading newline) so reranking uses the intended text.
cd47711 to
3d65e07
Compare
|
To be included in the memmachine_server package, file organization needs to be updated. |
3d65e07 to
3e650ab
Compare
3e650ab to
021739a
Compare
021739a to
6d4c664
Compare
Purpose of the change
Adding an option to involve the retrieval agent during memory search to improve the memory retrieval accuracy.
Description
retrieval_agentconfiguration.WikiMultihopandHotpotQAbenchmarks.Type of change
[Please delete options that are not relevant.]
How Has This Been Tested?
Test via benchmark scripts and unit tests
[Please delete options that are not relevant.]
Test Results: [Attach logs, screenshots, or relevant output]
As shown in the
Sample outputpart underevaluation/README.md.Checklist
[Please delete options that are not relevant.]
Maintainer Checklist
Further comments
Future improvements: