add retry logic for LLM judge and semantic memory consolidation by sscargal · Pull Request #1361 · MemMachine/MemMachine

sscargal · 2026-04-21T20:52:18Z

Purpose of the change

Both evaluate_llm_judge and _deduplicate_features now retry once on JSON parse failures (missing/malformed 'label' key or None response) instead of crashing or silently dropping results. Consolidation also gains context-length error handling, matching the feature-update path.

Description

Changes made:

evaluation/retrieval_agent/llm_judge.py

Added import logging, logger, and _MAX_JUDGE_ATTEMPTS = 2
Replaced the single-shot json_repair.loads(raw)["label"] with a retry loop: safe dict access via .get("label"), retries once on missing/non-dict result, defaults to 0 (WRONG) after all attempts are exhausted

packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py

Added _MAX_CONSOLIDATION_ATTEMPTS = 2
Extracted a new _try_consolidate helper method that: retries on None (parse failure), handles context-length errors (logs warning, returns None without propagating), handles other exceptions (respects debug_fail_loudly)
Simplified _deduplicate_features to call the helper and act on its result — this also resolves the C901 complexity violation

evaluation/retrieval_agent/test_llm_judge.py (new file)

12 unit tests covering: correct/wrong/unknown labels, retry on missing label, exhausted retries default, non-dict responses, call count verification, parametrized label values

packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py

3 new tests: retry succeeds on second attempt, exhausted retries leaves features unchanged, context-length error is handled gracefully

Fixes/Closes

Fixes #1332

Type of change

Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Unit Test

Test Results: [Attach logs, screenshots, or relevant output]

New LLM judge tests (fast, no dependencies):
uv run pytest evaluation/retrieval_agent/test_llm_judge.py -v

Semantic ingestion tests (includes the 3 new consolidation tests):
uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py -v

Filter to just the new tests if you want a quicker check:
uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py -v -k "retry or exhaust or context_length_error"

Full server suite (validates nothing was broken):
uv run --frozen --all-extras pytest packages/server/server_tests -q

Lint/format (mirrors CI):
uv run ruff check evaluation/retrieval_agent/ packages/server/src/memmachine_server/semantic_memory/
uv run ruff format --check evaluation/retrieval_agent/ packages/server/src/memmachine_server/semantic_memory/

Checklist

I have signed the commit(s) within this pull request
My code follows the style guidelines of this project (See STYLE_GUIDE.md)
I have performed a self-review of my own code
I have commented my code
My changes generate no new warnings
I have added unit tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I have checked my code and corrected any misspellings

Maintainer Checklist

Confirmed all checks passed
Contributor has signed the commit(s)
Reviewed the code
Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

None

Both evaluate_llm_judge and _deduplicate_features now retry once on JSON parse failures (missing/malformed 'label' key or None response) instead of crashing or silently dropping results. Consolidation also gains context-length error handling matching the feature-update path. Refs: MemMachine#1332 Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>

Copilot

Pull request overview

This PR adds defensive retry handling around structured-output failures for two LLM-driven paths: the evaluation-time LLM judge and server-side semantic feature consolidation, aiming to reduce transient JSON/validation failures from corrupting results or skipping consolidation.

Changes:

Add a 2-attempt retry loop to evaluate_llm_judge() when the parsed JSON is missing/invalid for the label field, defaulting to WRONG after retries.
Refactor semantic consolidation to route through a new _try_consolidate() helper that retries when llm_consolidate_features() returns None, and explicitly skips context-length errors.
Add server tests covering consolidation retry success, retry exhaustion, and context-length error handling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`evaluation/retrieval_agent/llm_judge.py`	Adds retry + logging when judge output is missing/invalid, defaults to `WRONG` after retries.
`packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py`	Introduces `_try_consolidate()` and retry handling for consolidation, plus context-length error skip behavior.
`packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py`	Adds tests for consolidation retry behavior and context-length error handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- retry consolidation on parse/validation exceptions, not only None - raise context-length as a sentinel so debug_fail_loudly does not trip on the graceful-skip path, matching the feature-update path - tighten llm judge label validation: require CORRECT/WRONG string - correct consolidation log messages to mention consolidation + set_id

sscargal · 2026-04-21T22:21:27Z

The latest commit addresses the CoPilot comments - 4 of 5 valid, all addressed:

#	Finding	Action
1	`_try_consolidate` didn't retry on validation/parse exceptions	Now retries non-context exceptions and only surfaces/re-raises after attempts are exhausted
2	`debug_fail_loudly` raised for the context-length-skip case	`_try_consolidate` now raises a private `_ConsolidationContextLengthError`; `_deduplicate_features` catches it and returns silently, matching the feature-update path at semantic_ingestion.py:212-221
3	Misleading "Failed to update features" log inside consolidation	Updated to "Failed to consolidate features … for set %s on attempt %d/%d"
4	evaluate_llm_judge accepted non-string/arbitrary labels	Now requires label to be a string, strips+upper-cases, and only accepts CORRECT/WRONG; otherwise retries
5	Missing test_llm_judge.py	Already fixed by commit `68d17db` — no action

Tests added (all passing):

evaluation/retrieval_agent/test_llm_judge.py: non-string label retries, dict label retries, whitespace/case normalization, retry-on-unknown-label
packages/server/server_tests/.../test_semantic_ingestion.py: retry-on-exception-success, retry-then-reraise under debug_fail_loudly, context-length-no-raise under debug_fail_loudly

Verification:

uv run pytest evaluation/retrieval_agent/test_llm_judge.py — 14 passed
uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py — 40 passed
Full server suite — 1253 passed, 2 skipped
ruff check / ruff format --check — clean on touched files
ty check — one pre-existing error at llm_judge.py:100 (unrelated to this PR)

junttang · 2026-04-22T03:53:30Z

Thank you for adding the retry logic!

sscargal requested review from Copilot, edwinyyyu and malatewang April 21, 2026 20:52

Copilot started reviewing on behalf of sscargal April 21, 2026 20:53 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Add missing file from PR

68d17db

Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>

sscargal requested a review from Copilot April 21, 2026 21:52

Copilot started reviewing on behalf of sscargal April 21, 2026 21:55 View session

Copilot AI reviewed Apr 21, 2026

View reviewed changes

Comment thread packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py

Comment thread evaluation/retrieval_agent/test_llm_judge.py

Comment thread packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py

sscargal added this to the v0.3.6 milestone Apr 21, 2026

jealous approved these changes Apr 21, 2026

View reviewed changes

edwinyyyu approved these changes Apr 21, 2026

View reviewed changes

sscargal merged commit 542a851 into MemMachine:main Apr 21, 2026
43 of 44 checks passed

sscargal deleted the feat/add-retry-logic-for-llm-judge branch April 21, 2026 23:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add retry logic for LLM judge and semantic memory consolidation#1361

add retry logic for LLM judge and semantic memory consolidation#1361
sscargal merged 3 commits into
MemMachine:mainfrom
sscargal:feat/add-retry-logic-for-llm-judge

sscargal commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sscargal commented Apr 21, 2026

Uh oh!

Uh oh!

junttang commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

sscargal commented Apr 21, 2026

Purpose of the change

Description

Fixes/Closes

Type of change

How Has This Been Tested?

Checklist

Maintainer Checklist

Screenshots/Gifs

Further comments

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sscargal commented Apr 21, 2026

Uh oh!

Uh oh!

junttang commented Apr 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants