Skip to content

add retry logic for LLM judge and semantic memory consolidation#1361

Merged
sscargal merged 3 commits into
MemMachine:mainfrom
sscargal:feat/add-retry-logic-for-llm-judge
Apr 21, 2026
Merged

add retry logic for LLM judge and semantic memory consolidation#1361
sscargal merged 3 commits into
MemMachine:mainfrom
sscargal:feat/add-retry-logic-for-llm-judge

Conversation

@sscargal

Copy link
Copy Markdown
Contributor

Purpose of the change

Both evaluate_llm_judge and _deduplicate_features now retry once on JSON parse failures (missing/malformed 'label' key or None response) instead of crashing or silently dropping results. Consolidation also gains context-length error handling, matching the feature-update path.

Description

Changes made:

evaluation/retrieval_agent/llm_judge.py

  • Added import logging, logger, and _MAX_JUDGE_ATTEMPTS = 2
  • Replaced the single-shot json_repair.loads(raw)["label"] with a retry loop: safe dict access via .get("label"), retries once on missing/non-dict result, defaults to 0 (WRONG) after all attempts are exhausted

packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py

  • Added _MAX_CONSOLIDATION_ATTEMPTS = 2
  • Extracted a new _try_consolidate helper method that: retries on None (parse failure), handles context-length errors (logs warning, returns None without propagating), handles other exceptions (respects debug_fail_loudly)
  • Simplified _deduplicate_features to call the helper and act on its result — this also resolves the C901 complexity violation

evaluation/retrieval_agent/test_llm_judge.py (new file)

  • 12 unit tests covering: correct/wrong/unknown labels, retry on missing label, exhausted retries default, non-dict responses, call count verification, parametrized label values

packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py

  • 3 new tests: retry succeeds on second attempt, exhausted retries leaves features unchanged, context-length error is handled gracefully

Fixes/Closes

Fixes #1332

Type of change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

  • Unit Test

Test Results: [Attach logs, screenshots, or relevant output]

New LLM judge tests (fast, no dependencies):
uv run pytest evaluation/retrieval_agent/test_llm_judge.py -v

Semantic ingestion tests (includes the 3 new consolidation tests):
uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py -v

Filter to just the new tests if you want a quicker check:
uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py -v -k "retry or exhaust or context_length_error"

Full server suite (validates nothing was broken):
uv run --frozen --all-extras pytest packages/server/server_tests -q

Lint/format (mirrors CI):
uv run ruff check evaluation/retrieval_agent/ packages/server/src/memmachine_server/semantic_memory/
uv run ruff format --check evaluation/retrieval_agent/ packages/server/src/memmachine_server/semantic_memory/

Checklist

  • I have signed the commit(s) within this pull request
  • My code follows the style guidelines of this project (See STYLE_GUIDE.md)
  • I have performed a self-review of my own code
  • I have commented my code
  • My changes generate no new warnings
  • I have added unit tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have checked my code and corrected any misspellings

Maintainer Checklist

  • Confirmed all checks passed
  • Contributor has signed the commit(s)
  • Reviewed the code
  • Run, Tested, and Verified the change(s) work as expected

Screenshots/Gifs

N/A

Further comments

None

  Both evaluate_llm_judge and _deduplicate_features now retry once on
  JSON parse failures (missing/malformed 'label' key or None response)
  instead of crashing or silently dropping results. Consolidation also
  gains context-length error handling matching the feature-update path.

  Refs: MemMachine#1332

Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds defensive retry handling around structured-output failures for two LLM-driven paths: the evaluation-time LLM judge and server-side semantic feature consolidation, aiming to reduce transient JSON/validation failures from corrupting results or skipping consolidation.

Changes:

  • Add a 2-attempt retry loop to evaluate_llm_judge() when the parsed JSON is missing/invalid for the label field, defaulting to WRONG after retries.
  • Refactor semantic consolidation to route through a new _try_consolidate() helper that retries when llm_consolidate_features() returns None, and explicitly skips context-length errors.
  • Add server tests covering consolidation retry success, retry exhaustion, and context-length error handling.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
evaluation/retrieval_agent/llm_judge.py Adds retry + logging when judge output is missing/invalid, defaults to WRONG after retries.
packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py Introduces _try_consolidate() and retry handling for consolidation, plus context-length error skip behavior.
packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py Adds tests for consolidation retry behavior and context-length error handling.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/server/src/memmachine_server/semantic_memory/semantic_ingestion.py Outdated
Comment thread evaluation/retrieval_agent/llm_judge.py Outdated
Comment thread evaluation/retrieval_agent/llm_judge.py
Signed-off-by: Steve Scargall <37674041+sscargal@users.noreply.github.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread evaluation/retrieval_agent/test_llm_judge.py
  - retry consolidation on parse/validation exceptions, not only None
  - raise context-length as a sentinel so debug_fail_loudly does not trip
    on the graceful-skip path, matching the feature-update path
  - tighten llm judge label validation: require CORRECT/WRONG string
  - correct consolidation log messages to mention consolidation + set_id
@sscargal

Copy link
Copy Markdown
Contributor Author

The latest commit addresses the CoPilot comments - 4 of 5 valid, all addressed:

# Finding Action
1 _try_consolidate didn't retry on validation/parse exceptions Now retries non-context exceptions and only surfaces/re-raises after attempts are exhausted
2 debug_fail_loudly raised for the context-length-skip case _try_consolidate now raises a private _ConsolidationContextLengthError; _deduplicate_features catches it and returns silently, matching the feature-update path at semantic_ingestion.py:212-221
3 Misleading "Failed to update features" log inside consolidation Updated to "Failed to consolidate features … for set %s on attempt %d/%d"
4 evaluate_llm_judge accepted non-string/arbitrary labels Now requires label to be a string, strips+upper-cases, and only accepts CORRECT/WRONG; otherwise retries
5 Missing test_llm_judge.py Already fixed by commit 68d17db — no action

Tests added (all passing):

  • evaluation/retrieval_agent/test_llm_judge.py: non-string label retries, dict label retries, whitespace/case normalization, retry-on-unknown-label
  • packages/server/server_tests/.../test_semantic_ingestion.py: retry-on-exception-success, retry-then-reraise under debug_fail_loudly, context-length-no-raise under debug_fail_loudly

Verification:

  • uv run pytest evaluation/retrieval_agent/test_llm_judge.py — 14 passed
  • uv run --frozen --all-extras pytest packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_ingestion.py — 40 passed
  • Full server suite — 1253 passed, 2 skipped
  • ruff check / ruff format --check — clean on touched files
  • ty check — one pre-existing error at llm_judge.py:100 (unrelated to this PR)

@sscargal sscargal added this to the v0.3.6 milestone Apr 21, 2026
@sscargal sscargal merged commit 542a851 into MemMachine:main Apr 21, 2026
43 of 44 checks passed
@sscargal sscargal deleted the feat/add-retry-logic-for-llm-judge branch April 21, 2026 23:52
@junttang

Copy link
Copy Markdown
Contributor

Thank you for adding the retry logic!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feat]: Add retry logic for LLM judge and semantic memory consolidation on structured output failures

5 participants