PR18: GGML Import & Pipeline Modules #10451

Open
agibsonccc wants to merge 4 commits into master from pr/18-ggml-import-pipeline

@agibsonccc commented Jun 15, 2026

Contributor

Summary

PR 18 of 22 PRs in the ag_new_release_updates_2 branch split. [Merge Layer 6]

  • GGUF format support: GGUFReader handles magic 0x46554747, v1/v2/v3 binary layout; GGMLFormatDetector distinguishes GGUF from legacy GGML via 4-byte magic
  • 18 architecture handlers: map GGUF tensor name patterns to SameDiff variables for LLaMA 1/2/3/4, Gemma 2/3, Mistral, Phi-3/3.5, ChatGLM, IBM Granite, LFM2 (SSM hybrid), Nemotron, OLMo, OpenELM, GPT-OSS, SmolVLM2, Qwen3-VL, MiniCPM-V, Whisper, and a generic fallback
  • Full quantization codec suite: byte-exact dequantizers for all GGML formats — Q4_0 through Q8_K, k-quant super-blocks (Q2_K through Q6_K), importance-matrix quants (IQ1–IQ4), and ternary formats (TQ1_0, TQ2_0)
  • Round-trip export: quantizers for Q4_0, Q4_1, Q4_K, Q5_0, Q5_1, Q5_K, Q6_K, Q8_0; GGMLModelExport + SameDiffToGGMLConverter for export
  • Adaptive quantization: AdaptiveLayerQuantizer walks Q2_K → F32 until size budget is met; protects embeddings, LM head, and first/last transformer blocks
  • Multimodal split files: MultimodalGGUFLoader handles separate encoder/decoder GGUF shards (Qwen3-VL, SmolVLM2)
  • SPI pipeline framework: samediff-pipeline-core defines AutoModel.fromPretrained() dispatching by file extension/magic; format loaders registered via ServiceLoader
  • SafeTensors pipeline: samediff-pipeline-safetensors includes SmolVLM2SafeTensorsBuilder and Qwen3VLSafeTensorsBuilder for direct SafeTensors loading
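
The adaptive quantization bullet above can be illustrated with a small sketch. It is not the PR's AdaptiveLayerQuantizer: the class name, the tensor-name prefixes used to spot protected layers, the fixed precision floor for protected tensors, and the reading of "walks Q2_K → F32" as "keep the highest-precision type whose estimated size still fits the budget" are all assumptions made for illustration. The Q2_K bits-per-weight figure is the one quoted in this PR; the other figures are the commonly cited llama.cpp values.

```java
import java.util.Map;

// Hypothetical sketch of a budget-driven walk up the quantization ladder.
final class AdaptiveBudgetSketch {
    // Ladder ordered from smallest to largest bits per weight (bpw).
    static final String[] NAMES = {"Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_0", "F16", "F32"};
    static final double[] BPW = {2.5625, 3.4375, 4.5, 5.5, 6.5625, 8.5, 16.0, 32.0};

    // Assumption: protected tensors are recognised by GGUF-style name prefixes
    // (embeddings, LM head, first and last transformer block).
    static boolean isProtected(String tensorName, int blockCount) {
        return tensorName.startsWith("token_embd.")
                || tensorName.startsWith("output.")
                || tensorName.startsWith("blk.0.")
                || tensorName.startsWith("blk." + (blockCount - 1) + ".");
    }

    // Estimated size in bytes when ordinary tensors use `bpw` and protected
    // tensors never drop below `protectedBpw`.
    static long estimateBytes(Map<String, Long> elementCounts, int blockCount,
                              double bpw, double protectedBpw) {
        double bits = 0.0;
        for (Map.Entry<String, Long> e : elementCounts.entrySet()) {
            double effective = isProtected(e.getKey(), blockCount)
                    ? Math.max(bpw, protectedBpw)
                    : bpw;
            bits += e.getValue() * effective;
        }
        return (long) Math.ceil(bits / 8.0);
    }

    // Walks the ladder upward and returns the last type that fits the budget,
    // or null when even the smallest type is too large.
    static String choose(Map<String, Long> elementCounts, int blockCount,
                         long budgetBytes, double protectedBpw) {
        String best = null;
        for (int i = 0; i < NAMES.length; i++) {
            if (estimateBytes(elementCounts, blockCount, BPW[i], protectedBpw) <= budgetBytes) {
                best = NAMES[i];
            } else {
                break; // sizes only grow from here on
            }
        }
        return best;
    }
}
```

A real implementation would also have to account for block-size rounding and per-tensor metadata, which this estimate ignores.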

What Changed

nd4j/nd4j-ggml — Core GGML/GGUF Module (87 new files)

Entry points:

  • GGMLModelImport.java: importModel(File), convertToSDZ(src, dst), inspectModel(File)
  • GGMLModelExport.java: exportModel(SameDiff, File, ExportOptions)
  • GGMLImportException.java / GGMLExportException.java — checked exception hierarchy

Format layer (format/):

  • GGUFReader.java — GGUF v1/v2/v3 binary parser (magic, version, KV metadata, tensor descriptors, data sections)
  • GGUFWriter.java — GGUF binary output with alignment padding
  • GGMLReader.java / GGMLWriter.java — legacy GGML format support
  • GGMLFormat.java / GGMLFormatDetector.java — auto-detection from 4-byte magic
  • GGMLDataType.java — enum of raw GGML dtype codes
  • GGMLHeader.java / GGMLTensorInfo.java / GGMLMetadata.java — format descriptor types
  • MultimodalGGUFLoader.java — loads split multimodal GGUF shards
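
To make the magic-based detection and the writer's alignment padding concrete, here is a minimal sketch. It is not the PR's GGMLFormatDetector or GGUFWriter. The GGUF magic is the value quoted in the summary; the three legacy magics and the default alignment of 32 bytes come from the llama.cpp and GGUF conventions as I understand them and should be treated as assumptions about what this module checks.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: classify a file by its first four bytes and compute
// the padding needed before an aligned tensor data section.
final class GgufMagicSketch {
    // Bytes 'G','G','U','F' read as a little-endian 32-bit integer.
    static final int GGUF_MAGIC = 0x46554747;
    // Legacy container magics (assumed set): 'ggml', 'ggmf', 'ggjt'.
    static final int GGML_MAGIC = 0x67676d6c;
    static final int GGMF_MAGIC = 0x67676d66;
    static final int GGJT_MAGIC = 0x67676a74;
    // Assumed default when the file does not override the alignment.
    static final int DEFAULT_ALIGNMENT = 32;

    enum Format { GGUF, LEGACY_GGML, UNKNOWN }

    static Format detect(byte[] head) {
        if (head == null || head.length < 4) {
            return Format.UNKNOWN;
        }
        int magic = ByteBuffer.wrap(head, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if (magic == GGUF_MAGIC) {
            return Format.GGUF;
        }
        if (magic == GGML_MAGIC || magic == GGMF_MAGIC || magic == GGJT_MAGIC) {
            return Format.LEGACY_GGML;
        }
        return Format.UNKNOWN;
    }

    // Number of zero bytes to write so that `offset` becomes a multiple of `alignment`.
    static long padding(long offset, int alignment) {
        return (alignment - (offset % alignment)) % alignment;
    }
}
```

For example, a writer positioned at byte 100 with the default alignment would emit 28 padding bytes so the next section starts at byte 128.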

Architecture layer (architecture/):

  • ModelArchitecture.java — interface: isCompatible(GGMLMetadata), buildSameDiff(GGUFReader, options)
  • ArchitectureRegistry.java — priority-ordered; auto-discovers via ServiceLoader
  • LayerTensorDiscovery.java — maps GGUF tensor name patterns to SameDiff variable names
  • Architecture handlers: LLaMAArchitecture.java, LLaMAExportArchitecture.java, Llama4Architecture.java, GemmaArchitecture.java, MistralArchitecture.java, PhiArchitecture.java, GLMArchitecture.java, GraniteArchitecture.java, LFM2Architecture.java, NemotronArchitecture.java, OLMoArchitecture.java, OpenELMArchitecture.java, GptOssArchitecture.java, SmolVLM2Architecture.java, Qwen3VLArchitecture.java, MiniCPMVArchitecture.java, WhisperArchitecture.java, GenericArchitecture.java
  • ArchitectureConfig.java / ExportArchitecture.java / ExportArchitectureRegistry.java — export-side counterparts
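
The priority-ordered lookup described for ArchitectureRegistry could work roughly as follows. This is a sketch under assumptions: the Handler interface, its priority() method, and the use of a plain string map for metadata are hypothetical stand-ins for ModelArchitecture and GGMLMetadata, and handlers are registered by hand here rather than discovered through ServiceLoader. The "general.architecture" key is the standard GGUF metadata key for the model family.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a registry that asks handlers, highest priority
// first, whether they accept a model's metadata.
final class ArchitectureRegistrySketch {
    interface Handler {
        String name();
        int priority();
        boolean isCompatible(Map<String, String> metadata);
    }

    // Builds a handler that matches one architecture string; a null
    // architecture matches everything and acts as the generic fallback.
    static Handler forArchitecture(String handlerName, int handlerPriority, String architecture) {
        return new Handler() {
            public String name() { return handlerName; }
            public int priority() { return handlerPriority; }
            public boolean isCompatible(Map<String, String> metadata) {
                return architecture == null
                        || architecture.equals(metadata.get("general.architecture"));
            }
        };
    }

    private final List<Handler> handlers = new ArrayList<>();

    void register(Handler handler) {
        handlers.add(handler);
        // Keep the list sorted so the most specific handlers are tried first.
        handlers.sort(Comparator.comparingInt(Handler::priority).reversed());
    }

    Handler resolve(Map<String, String> metadata) {
        for (Handler handler : handlers) {
            if (handler.isCompatible(metadata)) {
                return handler;
            }
        }
        throw new IllegalArgumentException("No compatible architecture handler registered");
    }
}
```

Giving the generic handler the lowest priority means it is only reached when no specific handler accepts the metadata.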

Quantization layer (quantization/):

  • GGMLQuantType.java — enum from Q2_K (2.5625 bpw) through F32 (32 bpw)
  • Quantizer.java / Dequantizer.java / QuantizerFactory.java / DequantizerFactory.java / QuantizationInfo.java — quantization interfaces and dispatch
  • Standard dequantizers: Q4_0Dequantizer.java, Q4_1Dequantizer.java, Q5_0Dequantizer.java, Q5_1Dequantizer.java, Q8_0Dequantizer.java, Q8_KDequantizer.java
  • K-quant dequantizers: Q2_KDequantizer.java, Q3_KDequantizer.java, Q4_KDequantizer.java, Q5_KDequantizer.java, Q6_KDequantizer.java
  • IQ and ternary dequantizers: IQ1_MDequantizer.java, IQ1_SDequantizer.java, IQ2_SDequantizer.java, IQ2_XSDequantizer.java, IQ2_XXSDequantizer.java, IQ3_SDequantizer.java, IQ3_XXSDequantizer.java, IQ4_NLDequantizer.java, IQ4_XSDequantizer.java, TQ1_0Dequantizer.java, TQ2_0Dequantizer.java
  • Export quantizers: Q4_0Quantizer.java, Q4_1Quantizer.java, Q4_KQuantizer.java, Q5_0Quantizer.java, Q5_1Quantizer.java, Q5_KQuantizer.java, Q6_KQuantizer.java, Q8_0Quantizer.java
  • Adaptive: AdaptiveLayerQuantizer.java, AdaptiveQuantConfig.java, DynamicQuantizationAnalyzer.java, DynamicQuantConfig.java
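
As an illustration of what one of the simpler dequantizers has to do, the sketch below decodes Q4_0 blocks. It follows the Q4_0 layout as used by ggml (32 weights per block: a 2-byte fp16 scale followed by 16 bytes of packed 4-bit values, with the low nibbles holding the first 16 weights and the high nibbles the last 16). It is not the PR's Q4_0Dequantizer; the class and method names are hypothetical, and the half-precision conversion is written out by hand to keep the example independent of the Java version.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch of Q4_0 block dequantization.
final class Q4_0Sketch {
    static final int BLOCK_WEIGHTS = 32;
    static final int BLOCK_BYTES = 18; // 2-byte fp16 scale + 16 bytes of nibbles

    // Converts IEEE 754 half-precision bits to a float.
    static float halfToFloat(short half) {
        int bits = half & 0xFFFF;
        int sign = (bits >>> 15) & 1;
        int exponent = (bits >>> 10) & 0x1F;
        int mantissa = bits & 0x3FF;
        float value;
        if (exponent == 0) {
            value = mantissa * (1.0f / (1 << 24)); // zero or subnormal: mantissa * 2^-24
        } else if (exponent == 31) {
            value = (mantissa == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
        } else {
            value = (1.0f + mantissa / 1024.0f) * (float) Math.pow(2.0, exponent - 15);
        }
        return (sign == 0) ? value : -value;
    }

    static float[] dequantize(byte[] data, int blockCount) {
        ByteBuffer buffer = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        float[] out = new float[blockCount * BLOCK_WEIGHTS];
        for (int b = 0; b < blockCount; b++) {
            int base = b * BLOCK_BYTES;
            float scale = halfToFloat(buffer.getShort(base));
            for (int j = 0; j < 16; j++) {
                int packed = data[base + 2 + j] & 0xFF;
                // Stored values are unsigned 0..15; subtracting 8 recentres them to -8..7.
                out[b * BLOCK_WEIGHTS + j] = ((packed & 0x0F) - 8) * scale;
                out[b * BLOCK_WEIGHTS + j + 16] = ((packed >>> 4) - 8) * scale;
            }
        }
        return out;
    }
}
```

The k-quant and IQ formats use larger super-blocks with nested scales or lookup tables, so their decoders are considerably more involved than this one.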

Conversion layer:

  • convert/GGMLToSameDiffConverter.java — reads GGUF tensors, dequantizes, creates SameDiff variables
  • convert/ConversionOptions.java — quantization mode, forTraining flag, architecture override
  • export/SameDiffToGGMLConverter.java — reverse path: SameDiff variables → GGUF tensor stream
  • export/ExportOptions.java / TensorExportInfo.java — export configuration

nd4j/samediff-pipeline-core — Pipeline Framework (12 new files)

  • Pipeline.java — interface: generate(input), embed(text), classify(input)
  • PipelineLoader.java — SPI interface for format-specific loaders
  • PipelineLoaderRegistry.java — discovers loaders via ServiceLoader
  • AutoModel.java: AutoModel.fromPretrained(path) dispatches by file extension or header magic
  • ModelFormat.java — enum: GGUF, SAFETENSORS, ONNX, SDZ, TORCHSCRIPT
  • ChatTemplate.java / TokenizerConfig.java / GenerationConfig.java / ModelManifest.java / ModelIndex.java / WeightMapIndex.java / SchedulerConfig.java / PreprocessorConfig.java / SpecialTokensMap.java — inference and manifest types
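
The dispatch behind AutoModel.fromPretrained() might look something like the sketch below: try the file extension first, fall back to the header magic, then pick a loader found through ServiceLoader. Everything here is illustrative. LoaderSpi is a hypothetical stand-in for PipelineLoader, the extension-to-format mapping (in particular ".pt" for TorchScript) is an assumption, and only the GGUF magic is checked in the fallback.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.ServiceLoader;

// Hypothetical sketch of extension-then-magic format dispatch.
final class AutoModelDispatchSketch {
    // Mirrors the ModelFormat values listed above.
    enum ModelFormat { GGUF, SAFETENSORS, ONNX, SDZ, TORCHSCRIPT }

    // Hypothetical SPI type; real loaders would be listed under META-INF/services.
    interface LoaderSpi {
        boolean supports(ModelFormat format);
    }

    static Optional<ModelFormat> detect(String fileName, byte[] head) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".gguf")) return Optional.of(ModelFormat.GGUF);
        if (lower.endsWith(".safetensors")) return Optional.of(ModelFormat.SAFETENSORS);
        if (lower.endsWith(".onnx")) return Optional.of(ModelFormat.ONNX);
        if (lower.endsWith(".sdz")) return Optional.of(ModelFormat.SDZ);
        if (lower.endsWith(".pt")) return Optional.of(ModelFormat.TORCHSCRIPT); // assumed extension
        // Extension was not conclusive: fall back to the leading bytes.
        if (head != null && head.length >= 4
                && head[0] == 'G' && head[1] == 'G' && head[2] == 'U' && head[3] == 'F') {
            return Optional.of(ModelFormat.GGUF);
        }
        return Optional.empty();
    }

    // Collects every provider registered for the SPI on the classpath.
    static List<LoaderSpi> discoverLoaders() {
        List<LoaderSpi> found = new ArrayList<>();
        for (LoaderSpi loader : ServiceLoader.load(LoaderSpi.class)) {
            found.add(loader);
        }
        return found;
    }

    static Optional<LoaderSpi> loaderFor(ModelFormat format, List<LoaderSpi> loaders) {
        return loaders.stream().filter(l -> l.supports(format)).findFirst();
    }
}
```

Because providers are looked up at runtime, a new format module only needs to ship its own service registration file; the core module does not change.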

nd4j/samediff-pipeline-ggml — GGML Pipeline Loader (5 new files)

  • GGMLPipelineLoader.java — implements PipelineLoader; detects architecture then delegates to ArchitectureRegistry
  • GGUFReader.java / GGUFHeader.java / GGUFMetadataType.java / GGUFType.java — lightweight reader for metadata-only extraction
  • Registered via META-INF/services/org.eclipse.deeplearning4j.pipeline.PipelineLoader

nd4j/samediff-pipeline-safetensors — SafeTensors Pipeline Loader (8 new files)

  • SafeTensorsReader.java — JSON header + memory-mapped tensor data
  • SafeTensorsHeader.java / SafeTensorsDtype.java — format types
  • SafeTensorsPipelineLoader.java: PipelineLoader for .safetensors files
  • architecture/SafeTensorsArchitecture.java / SafeTensorsArchitectureRegistry.java — architecture registry
  • architecture/SmolVLM2SafeTensorsBuilder.java — builds SameDiff graph from SmolVLM2 shards
  • architecture/Qwen3VLSafeTensorsBuilder.java — builds SameDiff graph from Qwen3-VL shards
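
For readers unfamiliar with the SafeTensors layout that SafeTensorsReader parses, here is a minimal sketch of the header step. A .safetensors file starts with an 8-byte little-endian unsigned length N, followed by N bytes of UTF-8 JSON describing each tensor (dtype, shape, data_offsets), followed by the raw tensor bytes. The class below is hypothetical and is not the PR's reader: it works on any ByteBuffer (which could be a memory-mapped one) and stops at extracting the JSON text, since the JDK has no built-in JSON parser.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: split a SafeTensors buffer into JSON header and data offset.
final class SafeTensorsHeaderSketch {
    final String json;     // raw JSON header text
    final long dataOffset; // absolute offset where tensor bytes begin

    private SafeTensorsHeaderSketch(String json, long dataOffset) {
        this.json = json;
        this.dataOffset = dataOffset;
    }

    static SafeTensorsHeaderSketch read(ByteBuffer file) {
        // duplicate() leaves the caller's position untouched; byte order must be set again.
        ByteBuffer buffer = file.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        if (buffer.limit() < 8) {
            throw new IllegalArgumentException("Buffer too small for a SafeTensors header");
        }
        long headerLength = buffer.getLong(0);
        if (headerLength < 0 || headerLength > buffer.limit() - 8L) {
            throw new IllegalArgumentException("Invalid header length: " + headerLength);
        }
        byte[] headerBytes = new byte[(int) headerLength];
        buffer.position(8);
        buffer.get(headerBytes);
        String json = new String(headerBytes, StandardCharsets.UTF_8);
        // data_offsets in the JSON are relative to this position.
        return new SafeTensorsHeaderSketch(json, 8L + headerLength);
    }
}
```

A tensor whose entry lists data_offsets [begin, end] then occupies bytes dataOffset + begin up to dataOffset + end of the file.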

Dependencies

  • Depends on: PR12 (nd4j-api, SameDiff), PR13 (nd4j-native backend)
  • Required by: PR19 (samediff-llm uses GGUF import for model loading), PR22 (platform tests: GGMLModelImportTest, GGUFReaderTest, TestAdaptiveLayerQuantizer, RoundTripTest, etc.)

Merge Order

This PR is in Layer 6.

| Layer | Stage | PRs |
|---|---|---|
| 0 | no deps | PR01, PR02, PR20 |
| 1 | build/infra | PR03, PR04 |
| 2 | native core | PR05, PR06, PR07 |
| 3 | native feat | PR08, PR09, PR10, PR11 |
| 4 | java core | PR12, PR13, PR14, PR15 |
| 5 | java feat | PR16 |
| 6 | import/gen | PR17, PR18, PR19, PR21 |
| 7 | validation | PR22 |

Part of the 22-PR split of ag_new_release_updates_2 branch.
Merge layer: 6 (import/gen)
Files: 121

See pr-plans/00-master-plan.md for the full split plan and merge order.

Copilot AI left a comment

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

…plementation

The samediff-pipeline-ggml module contained its own parallel GGUFReader (337
lines, RandomAccessFile) plus companion GGUFHeader, GGUFType, and
GGUFMetadataType classes, duplicating functionality already present in
nd4j-ggml/format/ (443 lines, memory-mapped I/O).

Consolidation:
- Add open() static factories and high-level INDArray-returning methods
  (readTensor, readAllTensors, readTensors, getTensorNames, getTensorCount,
  getTensorInfo) to org.nd4j.ggml.format.GGUFReader. Dequantization delegates
  to DequantizerFactory which supports the full Q4_0..TQ2_0 type set.
- Add nd4j-ggml dependency to samediff-pipeline-ggml/pom.xml.
- Rewrite GGMLPipelineLoader to import org.nd4j.ggml.format.{GGUFReader,GGUFHeader}
  directly; all header accessor methods (getArchitecture, getModelName,
  getContextLength, getEmbeddingLength, getBlockCount) already exist on the
  canonical GGUFHeader.
- Delete the four now-redundant pipeline classes: GGUFReader, GGUFHeader,
  GGUFType, GGUFMetadataType.

Also fixes a latent bug: the pipeline GGUFType had wrong integer-type IDs
(I8=24, I16=25, I32=26, I64=27) which conflict with IQ1_M at ID 24 per the
GGUF spec. The canonical GGMLDataType uses the correct IDs (I8=25..I64=28).
…rties

samediff-pipeline-ggml (8 files, 1,054 LOC):
- Duplicate GGML/GGUF reader module with zero external imports
- Canonical GGML support lives in nd4j-ggml module
- GGMLPipelineLoader, GGUFReader, GGUFHeader, GGUFType, GGUFMetadataType
  all duplicates of nd4j-ggml equivalents
- Removed module declaration from nd4j/pom.xml

aeron.properties:
- Orphaned test resource from removed Aeron networking support
- Zero references in any test file
@agibsonccc
Contributor Author

Architecture Overview

This PR implements GGML/GGUF model import with full quantization codec support and the SPI-based pipeline framework that enables AutoModel.fromPretrained() dispatch by file extension or magic bytes. It handles 18 model architectures and round-trip export for 8 quantization formats.

Highlights

  • Full GGML quantization codec suite — byte-exact dequantizers for all GGML formats: Q4_0 through Q8_K, k-quant super-blocks (Q2_K–Q6_K), importance-matrix quants (IQ1–IQ4), and ternary formats (TQ1_0, TQ2_0); plus AdaptiveLayerQuantizer that walks Q2_K→F32 until size budget is met while protecting embeddings, LM head, and first/last transformer blocks
  • 18 architecture handlers with multimodal support — maps GGUF tensor name patterns to SameDiff variables for LLaMA 1-4, Gemma 2-3, Mistral, Phi-3/3.5, ChatGLM, Granite, LFM2 (SSM hybrid), Nemotron, OLMo, OpenELM, SmolVLM2, Qwen3-VL, MiniCPM-V, Whisper; MultimodalGGUFLoader handles separate encoder/decoder GGUF shards
