Skip to content

Native serialization#20702

Open
metascroy wants to merge 5 commits into
mainfrom
metascroy/native-serialization
Open

Native serialization#20702
metascroy wants to merge 5 commits into
mainfrom
metascroy/native-serialization

Conversation

@metascroy

Copy link
Copy Markdown
Contributor

Supports multi-backend "fat PTE" containers that bundle a native fallback alongside accelerator-specific specializations (e.g., MLX).

backends/native/partitioner.py — NativePartitioner delegates all core ATen ops plus a curated set of non-core ops. Handles EdgeOpOverload unwrapping for edge dialect compatibility. Accepts an optional specializations list to enable fat PTE generation.

backends/native/preprocess.py — NativeBackend.preprocess() pipeline: SpecPropPass → reinplace (functional ops → in-place variants like relu → relu_) → output spec repair → buffer mutation handling → external constants → memory planning → emit → serialize. When specializations are requested, deep-copies the program for each specialization recipe and packs results into a fat PTE.

backends/native/fat_pte.py — Binary container format (magic "NFAT") that packs multiple backend payloads with a header table. pack_fat_blob() builds the binary, build_fat_result() wraps it as a PreprocessResult.

backends/native/specializations.py — Registry for SpecializationRecipe = Callable[[ExportedProgram], bytes]. Includes a built-in MLX recipe using get_default_passes() + MLXPartitioner. Auto-imported by NativePartitioner to populate the registry at import time.

backends/native/ir/GraphTypes.h — C++ graph IR interface for the native runtime.

metascroy added 3 commits July 1, 2026 16:39
…ET), and graph IR interface

Cherry-picked from metascroy/gnn and adapted:
- backends/native/partitioner/ - NativePartitioner marks all ops as supported
- backends/native/preprocess.py - NativeBackend preprocess pipeline (ET-within-ET serialization)
- backends/native/ir/ - GraphTypes.h, Plan.h, Step.h graph interface headers

reinplace_pass: uses main's existing exir/passes/reinplace.py interface
(ops_to_inplace as frozenset, auto-derived in-place forms via schema matching)
@pytorch-bot

pytorch-bot Bot commented Jul 2, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20702

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Unrelated Failure

As of commit dc9c564 with merge base 48a87f9 (image):

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@metascroy metascroy requested a review from SS-JIA July 2, 2026 19:25
@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 2, 2026
@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

@metascroy metascroy requested a review from digantdesai July 2, 2026 19:25
@github-actions

github-actions Bot commented Jul 2, 2026

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented Jul 2, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 2m 36s —— View job


Code Review: Native Serialization

  • Read changed files
  • Analyze partitioner, preprocess, fat_pte, specializations
  • Review tests
  • Post review feedback

Overall this is a clean, well-tested addition — the fat-PTE container format is simple and the test coverage (partitioner op-support matrix, blob packing, round-trip flatbuffer validation) is solid. Below are the issues worth addressing, roughly in priority order.


🔴 Passing specializations via mutable class state

partitioner.py:100 sets NativeBackend._specialization_names = self.specializations, which preprocess.py:105 later reads. This is global mutable state on the backend class used to thread data from partitioning into preprocessing.

# partitioner.py
NativeBackend._specialization_names = self.specializations

This is fragile:

  • Not concurrency-safe — two lowering jobs (or two NativePartitioner instances in one to_edge_transform_and_lower call) race on the same class attribute.
  • Ordering-dependent — assumes every partition() is immediately followed by the matching preprocess(), which isn't guaranteed by the delegation API.
  • It's exactly the "dynamic/implicit state" pattern CLAUDE.md steers away from.

The idiomatic channel already exists: DelegationSpec carries compile_specs, and preprocess(program, module_compile_spec) receives them. Encode the specialization names as a CompileSpec on the delegation spec (partitioner.py:74) and read them from module_compile_spec in preprocess, dropping _specialization_names entirely.

Fix this →


🟠 Brittle "already partitioned" detection

partitioner.py:80-82 decides the graph is already lowered by string-matching node names:

if node.op == "get_attr" and "lowered_module" in node.name:
    return ([], None)

Matching on "lowered_module" in node.name is easy to break (naming can change; a user attr could coincidentally contain the substring). Consider checking node semantics instead — e.g. whether any node is an executorch_call_delegate / get_attr resolving to a LoweredBackendModule — rather than a name heuristic.


🟠 Output-spec repair silently truncates on length mismatch

preprocess.py:147-157:

actual_names = [arg.name for arg in output_node.args[0] if hasattr(arg, "name")]
...
for spec, name in zip(program.graph_signature.output_specs, actual_names):

actual_names filters out args lacking .name (e.g. constant/None outputs), but output_specs is not filtered the same way. If they differ in length, zip silently drops trailing specs and the rebuilt new_output_specs is shorter than the original — a corrupt signature that may surface far downstream. Worth asserting len(actual_names) == len(output_specs) (or aligning by position including the non-named args) so a mismatch fails loudly.


🟡 pack_fat_blob can split a multibyte backend id

fat_pte.py:42:

bid = backend_id.encode("utf-8")[:32].ljust(32, b"\x00")

Slicing encoded bytes at [:32] can cut a UTF-8 sequence mid-character, producing an invalid id. Backend ids are ASCII today so it's latent, but a defensive check (reject/validate ids > 32 bytes, or truncate on the decoded string) would avoid a silent corruption. The truncation is also silent — a name collision after truncation ("MyLongBackendName...A" vs "...B") would be undetectable; consider raising instead of truncating.


🟡 Named-data key collisions when merging data stores

fat_pte.py:57-63 merges each result's named data into one store keyed by FQN. If the native program and a specialization both emit an external constant under the same FQN (very likely — they share weights), add_named_data is called twice with the same key. Please confirm NamedDataStore.add_named_data dedupes identical keys rather than raising or double-storing; if it doesn't dedupe, the fat PTE will either error or carry duplicate weight blobs, defeating the "share a single PTD" goal stated in the module docstring.


🟡 C++ GraphTypes.h questions

  • mem_obj_id uses only the low 32 bits of the offset (GraphTypes.h:197-198): off = alloc->memory_offset_low() and the sort key is (pool << 32) | off. If the schema also has a memory_offset_high (the _low suffix implies it does), offsets ≥ 4 GiB — or any nonzero high part — collapse to the same key and two distinct storage slots get merged into one mem_obj_id. Worth confirming the high part is always zero here, or folding it in.
  • num_instructions() vs get_op() (GraphTypes.h:395-422): num_instructions counts all instructions in chain 0, but get_op ET_CHECKs that the instruction is a KernelCall. Any non-kernel instruction (delegate/free/jump) in the chain would make a for (i in num_instructions) get_op(i) loop abort. Fine if native programs only ever emit KernelCalls, but that invariant is implicit — a comment or a kernel-only count would make it safe.
  • Only chains()->Get(0) is ever read (:224, :398, :404); single-chain assumption is reasonable but undocumented.

🟢 Minor

  • preprocess.py:215 program.graph_module.encounter_to_out_var_failure = True sets a dynamic attribute; a short comment on what consumes it would help (it reads like a typo for a flag).
  • _AnyOp verifier (preprocess.py:67) uses dialect = "TRAINING" to disable op-type checks during internal transforms — a one-line note on why would save the next reader a double-take.
  • partitioner.py imports specializations at module load (:15) purely for the registration side effect. The # noqa comment is good; just flagging that import-time registration means importing the partitioner pulls in the MLX recipe's module path resolution lazily (it's deferred inside _mlx_recipe, so this is fine — no hard MLX dependency at import).

The tests are a strong point — test_preprocess.py round-trips the native slice out of the fat blob and validates op names, and the partitioner op-support tests cover the edge/core/HOP cases well.
· branch metascroy/native-serialization

Comment thread backends/native/ir/GraphTypes.h Outdated
@@ -0,0 +1,672 @@
/*
* Copyright (c) Meta Platforms, Inc. and affiliates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend deferring this interface to another PR, or split it into its own PR, since it technically is out-of-scope for what the rest of the PR seeks to accomplish (i.e. implementing serialization) and is also not referenced elsewhere in the PR, as far as I can tell.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can drop this from the PR. It's the abstraction through which native backends would access the serialized artifact, but has no AOT dependencies.

@SS-JIA SS-JIA Jul 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, for accessing the serialized artifact, I think the C++ interface generated by flatbuffers is good enough TBH.

In my mind the path would be

flatbuffer --> deserialize with C++ flatbuffer interface --> construct graph IR --> pass graph IR to engines

so backends would never have to deal with the deserialized artifact directly. Although I'm seeing now that this approach is perhaps slightly different than what you had in mind 😅

Comment thread backends/native/ir/GraphTypes.h Outdated
* outputs: [uint];
* mutable_buffers: [uint]; // values that persist across executes
* operators: [OperatorDef]; // deduped op-name registry
* instructions: [KernelCall]; // flat list of operator calls

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I understand it, the current header presents the runtime graph/program IR as a "superset" of the ExecuTorch program schema. However, one drawback is that the program schema is very barebones (essentially representing the model as a linear list of function calls), which makes it difficult to do things like graph transforms, which we will want in order to do stuff like operator fusion (i.e. fusing dq(weight)->op chains into quantized ops at runtime), runtime memory planning, etc.

I think it would be valuable to try to aim for a more fx.Graph-like runtime IR, which also encapsulates relationships between values in the graph (i.e. which op produces/writes data into a given value, what other ops use the value). Now that we have a serializer, I'm planning to sketch this out next week; it's also the reason I'm suggesting we defer this header to another PR 😅

@metascroy metascroy Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't a linear list of function calls w/ args implicitly capture dependencies? We just need helper functions in C++ to extract this relationship info (e.g., get_users)?

On "fusing dq(weight)->op", I think we will probably start with "gguf_linear" as an allowed op, so you won't have to worry about that fusion.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does but we will have to reconstruct it. The serialization format is already sufficient, but my aim is to make the graph IR we construct while deserializing carry a bit more information, i.e. Value also tracks ops which write into it, as well as ops which use it as input.

On "fusing dq(weight)->op", I think we will probably start with "gguf_linear" as an allowed op, so you won't have to worry about that fusion.

Quant chain -> quant op fusion is just one example, but there are other applications as well. For example certain fused ops like rms_norm, some engines may have it some other may not. So it's useful to keep the decomposed form in the IR and engines can apply passes to fuse if they want to do it. Generally I think it would be a good idea to try and keep the serialized model as "un-opinionated" as possible, and push as much graph manipulation to the runtime as we can.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved GraphTypes.h into #20704 (along with some fx like functionality like looking up users).

On fused ops: TBH, I think we should not be decomposing things like rms_norm, rope, linear, etc. They should all be special ops. An engine is free to implement it by calling decomposed implementations. There aren't that many special fused ops for LLM, and we should standardize on them.

The reason is re-composing ops from the decomposed dialect was hard enough in python (and was a motivation for to_edge_transform_and_lower over to_edge); I definitely don't want to do it in C++.

@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

Comment thread backends/native/preprocess.py Outdated

# Non-core ops the native backend supports. Also preserved (not decomposed).
_SUPPORTED_NON_CORE_OPS = [
torch.ops.aten.matmul.default,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is where we add other special native ops like torch.ops.native.linear_gguf.default @SS-JIA.

The IR is core aten + growing list of special ops. All backends and specializations are required to handle all ops in _SUPPORTED_NON_CORE_OPS

# ---------------------------------------------------------------------------


def _mlx_recipe(ep: ExportedProgram) -> bytes:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you try to add Vulkan and XNNPACK as specializations as well? Just to see if they work 👀 Can roll those into this PR, or a follow up PR if you'd like

@metascroy metascroy Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can add, but I don't think Vulkan / XNNPACK would meet the requirements we want for native / specializations out of the box.

I think specializations must meet all dtype (BF16) / quantization support (GGUF) / runtime features (KV cache mangement) we want to support in native, so that users are guaranteed a uniform experience.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, okay to skip them for now then 👍

@metascroy

Copy link
Copy Markdown
Contributor Author

@claude review this code

@claude

claude Bot commented Jul 3, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 1m 54s —— View job


Code Review: Native Serialization (re-review)

  • Read changed files (current state on branch)
  • Analyze partitioner, preprocess, fat_pte, specializations
  • Review reinplace pass changes + tests
  • Post review feedback

Since the last review, most of the earlier findings have been resolved. Confirming what's fixed, then a few remaining items.

✅ Resolved since last review

  • Mutable class state gone_specialization_names is replaced by a CompileSpec (native_specializations) encoded on the DelegationSpec in partitioner.py:80-86 and read back in preprocess.py:97-100. This is the idiomatic channel and fixes the concurrency/ordering fragility.
  • "Already partitioned" detection — now checks node semantics (node.target is executorch_call_delegate, partitioner.py:96) instead of substring-matching node.name.
  • Output-spec repair fails loudly — the fix moved into reinplace.py:519-521 and now assert len(output_specs) == len(out_args) before zipping, so a length mismatch raises instead of silently truncating.
  • pack_fat_blob backend-id truncation — now encode("ascii") and raises ValueError on > 32 bytes (fat_pte.py:42-46) rather than silently slicing a multibyte sequence.
  • Named-data key collisions — confirmed NamedDataStore._add_named_data_to_map (exir/_serialize/_named_data_store.py:127-135) dedupes identical key+data and raises on same-key/different-data, so merging shared weights across results is safe.
  • BACKEND_INPLACE_OPS — now a plain frozenset literal (per SS-JIA's nit).
  • encounter_to_out_var_failure / _AnyOp dialect — both now have explanatory comments (preprocess.py:195, :61-62).
  • GraphTypes.h — dropped from the PR and moved to Introduce graph interface over serialization #20704, per the review discussion.

🟠 Docstring vs. reality: specializations do not share a PTD

fat_pte.py:9-10 states "All specializations share a single PTD (named data store)." But the built-in MLX recipe returns a fully serialized PTE:

# specializations.py:43-44
et_program = lowered.to_executorch()
return et_program.buffer

That buffer bakes MLX's constants inside itself and comes back as PreprocessResult(processed_bytes=payload) with no data_store_output (preprocess.py:122). So build_fat_result only ever merges the native slice's named data — MLX weights are duplicated inside its own payload, not shared. The docstring's "single PTD" guarantee holds only for recipes that return a data_store_output, which the one shipped recipe doesn't. Either the docstring should be softened, or recipes should return (bytes, data_store_output) so their constants participate in the merge. Fix this →

🟡 Minor

  • Hardcoded native backend id (preprocess.py:117): the native entry is labeled "NativeBackend" as a string literal while specialization ids come from the registry. If someone registers a recipe named "NativeBackend", the fat blob would contain two entries with the same backend_id and the registry check at :106 wouldn't catch it. Using cls.__name__ (matching how the DelegationSpec is built in partitioner.py:87) would keep the two in sync.
  • _apply_passes ordering (preprocess.py:75-84): the isinstance(p, MemoryPlanningPass) branch precedes the ExportPass/PassBase branch, which is correct only because MemoryPlanningPass also matches the second — worth a one-line note since the branch order is load-bearing.
  • import copy / inline imports: copy (:112) and several from ... import statements are function-local. A couple (reinplace, external_constants_pass) read as deliberate lazy imports, but copy and torch.export.graph_signature could move to module scope for readability. Non-blocking.

Tests

Coverage remains strong — test_fat_pte.py now exercises the >32 byte id rejection (:68-71) and multi-specialization offset math, test_partitioner.py covers the edge/core/HOP matrix, and test_preprocess.py round-trips the native slice out of the fat blob. Nothing missing that I'd block on.

Overall this is in good shape — the one item I'd genuinely want addressed before merge is the PTD-sharing docstring mismatch, since it affects the on-disk size story the fat PTE is meant to deliver.
· branch metascroy/native-serialization

@claude

claude Bot commented Jul 3, 2026

Copy link
Copy Markdown

Claude finished @metascroy's task in 2m 7s —— View job


Code Review: Native Serialization (re-review)

I re-read the current state of the branch against my earlier review. Nearly every item from the last pass has been resolved cleanly — nice work.

✅ Resolved since last review

  • 🔴 Mutable class state → CompileSpec. _specialization_names is gone. partitioner.py:80-87 now encodes the specialization list as a CompileSpec on the DelegationSpec, and preprocess.py:97-100 reads it back from module_compile_spec. Concurrency-safe and no implicit ordering dependency. 👍
  • 🟠 Brittle "already partitioned" detection. partitioner.py:95-97 now checks node semantics (node.target is executorch_call_delegate) instead of substring-matching "lowered_module" in node.name.
  • 🟠 Output-spec repair silent truncation. The repair moved into exir/passes/reinplace.py:511-529 and now asserts len(output_specs) == len(out_args) before zipping — a mismatch fails loudly. Good.
  • 🟡 pack_fat_blob multibyte/silent truncation. fat_pte.py:42-47 now encode("ascii") and raises ValueError when len(encoded) > 32 instead of silently slicing. Covered by test_backend_id_over_32_bytes_raises.
  • 🟡 Named-data key collisions. Confirmed NamedDataStore._add_named_data_to_map (exir/_serialize/_named_data_store.py:127-135) dedupes: identical key+data is a no-op, and identical data under different keys shares one buffer. So the fat PTE genuinely shares a single PTD when native and a specialization emit the same weight. (See caveat below.)
  • 🟢 BACKEND_INPLACE_OPS is now a frozenset literal (preprocess.py:49-57), per SS-JIA's nit.
  • 🟢 encounter_to_out_var_failure now has an explanatory comment (preprocess.py:195).
  • GraphTypes.h moved out to Introduce graph interface over serialization #20704, per the review discussion — keeps this PR scoped to serialization.

🟡 One remaining edge case worth a guard

fat_pte.py:62-68 merges each result's external constants into one store keyed by FQN. Dedup only holds when the bytes are identical. If a specialization transforms a weight it externalizes under the same FQN as native — e.g. an MLX/GGUF recipe that quantizes model.….weight in place — the two payloads share the FQN but differ in bytes, and _add_named_data_to_map will raise:

ValueError: Duplicate key <fqn> with different data.

Today's built-in MLX recipe returns et_program.buffer (a self-contained PTE, constants inline) rather than externalizing into the shared store, so this doesn't fire yet. But it's a latent trap for the next recipe author who does externalize quantized weights. Consider either namespacing each specialization's external-constant keys by backend id when merging, or catching the collision and re-emitting under a prefixed key — and documenting the "specializations must share byte-identical constants, or namespace them" contract in the module docstring.

Fix this →

🟢 Minor / non-blocking

  • fat_pte.py:70-71 defaults debug_handle_map to results[0]'s (the native slice). Reasonable, but the fat blob carries N programs whose debug handles are dropped — fine as long as debugging targets the native slice; a one-line note would set expectations.
  • preprocess.py:60-68 _AnyOp verifier now has the dialect = "TRAINING" comment explaining why — good.

Overall this is in good shape and I don't see anything blocking. The remaining FQN-collision item is a latent contract issue rather than a bug in current behavior.
· branch metascroy/native-serialization

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants