[GH-1147] Fix null byte crash in semantic memory ingestion#1165
Conversation
974e192 to
1664e78
Compare
Add sanitize_pg_text utility to strip null bytes from text before writing to PostgreSQL. Apply sanitization in PG storage add_feature and update_feature, and add validators to SemanticCommand and LLMReducedFeature to strip null bytes at model boundary. Includes regression tests for null byte handling. (Rebased onto main via path translation — packaging refactor applied)
1664e78 to
e5c0b5f
Compare
There was a problem hiding this comment.
Pull request overview
This PR addresses PostgreSQL ingestion crashes caused by null bytes (\x00) in LLM-generated strings by stripping null bytes at both the Pydantic model boundary and the PostgreSQL storage boundary.
Changes:
- Added
sanitize_pg_text()utility and applied it to key text fields in the SQLAlchemy pgvector semantic storage writes. - Added Pydantic
field_validators to strip null bytes from LLM-parsed semantic command/feature payloads early. - Added unit + regression tests covering sanitization behavior and the original crash scenario.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/server/src/memmachine_server/semantic_memory/storage/text_sanitizer.py | Introduces a shared PostgreSQL text sanitization helper with warning logging. |
| packages/server/src/memmachine_server/semantic_memory/storage/sqlalchemy_pgvector_semantic.py | Applies sanitization when inserting/updating semantic features in PostgreSQL. |
| packages/server/src/memmachine_server/semantic_memory/semantic_model.py | Adds Pydantic validator to strip null bytes from SemanticCommand fields. |
| packages/server/src/memmachine_server/semantic_memory/semantic_llm.py | Adds Pydantic validator to strip null bytes from LLMReducedFeature fields. |
| packages/server/server_tests/memmachine_server/semantic_memory/test_semantic_model.py | Adds tests verifying Pydantic null-byte stripping for SemanticCommand and LLMReducedFeature. |
| packages/server/server_tests/memmachine_server/semantic_memory/storage/test_text_sanitizer.py | Adds unit tests for sanitize_pg_text() including warning logging behavior. |
| packages/server/server_tests/memmachine_server/semantic_memory/storage/test_semantic_storage.py | Adds regression test to ensure add_feature() with null bytes doesn’t crash. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
malatewang
left a comment
There was a problem hiding this comment.
Besides of the null byte, are there any other special character can cause error? When access the database, probably we should catch the exception, log it and handle it properly
Looking at google. The null byte is the only one that Postgres doesn't allow. |
|
@o-love, could you please resolve the merge conflict so we can proceed with the merge? Thanks. |
Signed-off-by: Oscar Love <olabradalove@gmail.com>
Purpose of the change
Prevent
asyncpg.CharacterNotInRepertoireErrorcrashes caused by null bytes (\x00) in LLM-generated strings inserted into PostgreSQL TEXT columns.Description
Implements defense-in-depth sanitization at two layers:
field_validatoronSemanticCommandandLLMReducedFeaturestrips null bytes as soon as LLM output is parsed, before it reaches any storage code.sanitize_pg_text()utility is called inadd_feature()andupdate_feature()to strip null bytes fromtag,feature, andvaluefields before insertion — catching any path that bypasses models (e.g. REST API direct inserts).The
sanitize_pg_textutility logs a warning when it strips null bytes so data-quality issues are visible in monitoring.Fixes/Closes
Fixes #1147
Type of change
How Has This Been Tested?
Tests added:
TestSanitizePgText— 10 unit tests covering clean passthrough, null byte stripping (single, multiple, leading, trailing, all-null), warning logging, and no-log for clean strings.TestSemanticCommandNullByteStripping— 4 tests verifying the Pydantic validator strips null bytes fromfeature,tag, andvaluefields onSemanticCommand.TestLLMReducedFeatureNullByteStripping— 4 tests verifying the same forLLMReducedFeature.test_add_feature_with_null_bytes_does_not_crash— Regression test confirming null bytes in feature values do not crash storage insertion.Test Results: 1329 passed, 9 skipped, 0 failures. Ruff lint and format clean.
Checklist
Screenshots/Gifs
N/A
Further comments
sanitize_pg_textutility is intentionally kept separate from the Pydantic validators so it can serve as a backstop for all PostgreSQL insertion paths, not just model-validated ones.