Native serialization#20702
Conversation
…ET), and graph IR interface Cherry-picked from metascroy/gnn and adapted: - backends/native/partitioner/ - NativePartitioner marks all ops as supported - backends/native/preprocess.py - NativeBackend preprocess pipeline (ET-within-ET serialization) - backends/native/ir/ - GraphTypes.h, Plan.h, Step.h graph interface headers reinplace_pass: uses main's existing exir/passes/reinplace.py interface (ops_to_inplace as frozenset, auto-derived in-place forms via schema matching)
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20702
Note: Links to docs will display an error until the docs builds have been completed. ❌ 3 New Failures, 1 Unrelated FailureAs of commit dc9c564 with merge base 48a87f9 ( NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@claude review this code |
This PR needs a
|
|
Claude finished @metascroy's task in 2m 36s —— View job Code Review: Native Serialization
Overall this is a clean, well-tested addition — the fat-PTE container format is simple and the test coverage (partitioner op-support matrix, blob packing, round-trip flatbuffer validation) is solid. Below are the issues worth addressing, roughly in priority order. 🔴 Passing specializations via mutable class state
# partitioner.py
NativeBackend._specialization_names = self.specializationsThis is fragile:
The idiomatic channel already exists: 🟠 Brittle "already partitioned" detection
if node.op == "get_attr" and "lowered_module" in node.name:
return ([], None)Matching on 🟠 Output-spec repair silently truncates on length mismatch
actual_names = [arg.name for arg in output_node.args[0] if hasattr(arg, "name")]
...
for spec, name in zip(program.graph_signature.output_specs, actual_names):
🟡
|
| @@ -0,0 +1,672 @@ | |||
| /* | |||
| * Copyright (c) Meta Platforms, Inc. and affiliates. | |||
There was a problem hiding this comment.
I would recommend deferring this interface to another PR, or split it into its own PR, since it technically is out-of-scope for what the rest of the PR seeks to accomplish (i.e. implementing serialization) and is also not referenced elsewhere in the PR, as far as I can tell.
There was a problem hiding this comment.
I can drop this from the PR. It's the abstraction through which native backends would access the serialized artifact, but has no AOT dependencies.
There was a problem hiding this comment.
Oh, for accessing the serialized artifact, I think the C++ interface generated by flatbuffers is good enough TBH.
In my mind the path would be
flatbuffer --> deserialize with C++ flatbuffer interface --> construct graph IR --> pass graph IR to enginesso backends would never have to deal with the deserialized artifact directly. Although I'm seeing now that this approach is perhaps slightly different than what you had in mind 😅
| * outputs: [uint]; | ||
| * mutable_buffers: [uint]; // values that persist across executes | ||
| * operators: [OperatorDef]; // deduped op-name registry | ||
| * instructions: [KernelCall]; // flat list of operator calls |
There was a problem hiding this comment.
As I understand it, the current header presents the runtime graph/program IR as a "superset" of the ExecuTorch program schema. However, one drawback is that the program schema is very barebones (essentially representing the model as a linear list of function calls), which makes it difficult to do things like graph transforms, which we will want in order to do stuff like operator fusion (i.e. fusing dq(weight)->op chains into quantized ops at runtime), runtime memory planning, etc.
I think it would be valuable to try to aim for a more fx.Graph-like runtime IR, which also encapsulates relationships between values in the graph (i.e. which op produces/writes data into a given value, what other ops use the value). Now that we have a serializer, I'm planning to sketch this out next week; it's also the reason I'm suggesting we defer this header to another PR 😅
There was a problem hiding this comment.
Doesn't a linear list of function calls w/ args implicitly capture dependencies? We just need helper functions in C++ to extract this relationship info (e.g., get_users)?
On "fusing dq(weight)->op", I think we will probably start with "gguf_linear" as an allowed op, so you won't have to worry about that fusion.
There was a problem hiding this comment.
Yes, it does but we will have to reconstruct it. The serialization format is already sufficient, but my aim is to make the graph IR we construct while deserializing carry a bit more information, i.e. Value also tracks ops which write into it, as well as ops which use it as input.
On "fusing dq(weight)->op", I think we will probably start with "gguf_linear" as an allowed op, so you won't have to worry about that fusion.
Quant chain -> quant op fusion is just one example, but there are other applications as well. For example certain fused ops like rms_norm, some engines may have it some other may not. So it's useful to keep the decomposed form in the IR and engines can apply passes to fuse if they want to do it. Generally I think it would be a good idea to try and keep the serialized model as "un-opinionated" as possible, and push as much graph manipulation to the runtime as we can.
There was a problem hiding this comment.
I moved GraphTypes.h into #20704 (along with some fx like functionality like looking up users).
On fused ops: TBH, I think we should not be decomposing things like rms_norm, rope, linear, etc. They should all be special ops. An engine is free to implement it by calling decomposed implementations. There aren't that many special fused ops for LLM, and we should standardize on them.
The reason is re-composing ops from the decomposed dialect was hard enough in python (and was a motivation for to_edge_transform_and_lower over to_edge); I definitely don't want to do it in C++.
|
@claude review this code |
|
|
||
| # Non-core ops the native backend supports. Also preserved (not decomposed). | ||
| _SUPPORTED_NON_CORE_OPS = [ | ||
| torch.ops.aten.matmul.default, |
There was a problem hiding this comment.
Here is where we add other special native ops like torch.ops.native.linear_gguf.default @SS-JIA.
The IR is core aten + growing list of special ops. All backends and specializations are required to handle all ops in _SUPPORTED_NON_CORE_OPS
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| def _mlx_recipe(ep: ExportedProgram) -> bytes: |
There was a problem hiding this comment.
can you try to add Vulkan and XNNPACK as specializations as well? Just to see if they work 👀 Can roll those into this PR, or a follow up PR if you'd like
There was a problem hiding this comment.
We can add, but I don't think Vulkan / XNNPACK would meet the requirements we want for native / specializations out of the box.
I think specializations must meet all dtype (BF16) / quantization support (GGUF) / runtime features (KV cache mangement) we want to support in native, so that users are guaranteed a uniform experience.
There was a problem hiding this comment.
Fair point, okay to skip them for now then 👍
|
@claude review this code |
|
Claude finished @metascroy's task in 1m 54s —— View job Code Review: Native Serialization (re-review)
Since the last review, most of the earlier findings have been resolved. Confirming what's fixed, then a few remaining items. ✅ Resolved since last review
🟠 Docstring vs. reality: specializations do not share a PTD
# specializations.py:43-44
et_program = lowered.to_executorch()
return et_program.bufferThat buffer bakes MLX's constants inside itself and comes back as 🟡 Minor
TestsCoverage remains strong — Overall this is in good shape — the one item I'd genuinely want addressed before merge is the PTD-sharing docstring mismatch, since it affects the on-disk size story the fat PTE is meant to deliver. |
|
Claude finished @metascroy's task in 2m 7s —— View job Code Review: Native Serialization (re-review)I re-read the current state of the branch against my earlier review. Nearly every item from the last pass has been resolved cleanly — nice work. ✅ Resolved since last review
🟡 One remaining edge case worth a guard
Today's built-in MLX recipe returns 🟢 Minor / non-blocking
Overall this is in good shape and I don't see anything blocking. The remaining FQN-collision item is a latent contract issue rather than a bug in current behavior. |
Supports multi-backend "fat PTE" containers that bundle a native fallback alongside accelerator-specific specializations (e.g., MLX).
backends/native/partitioner.py — NativePartitioner delegates all core ATen ops plus a curated set of non-core ops. Handles EdgeOpOverload unwrapping for edge dialect compatibility. Accepts an optional specializations list to enable fat PTE generation.
backends/native/preprocess.py — NativeBackend.preprocess() pipeline: SpecPropPass → reinplace (functional ops → in-place variants like relu → relu_) → output spec repair → buffer mutation handling → external constants → memory planning → emit → serialize. When specializations are requested, deep-copies the program for each specialization recipe and packs results into a fat PTE.
backends/native/fat_pte.py — Binary container format (magic "NFAT") that packs multiple backend payloads with a header table. pack_fat_blob() builds the binary, build_fat_result() wraps it as a PreprocessResult.
backends/native/specializations.py — Registry for SpecializationRecipe = Callable[[ExportedProgram], bytes]. Includes a built-in MLX recipe using get_default_passes() + MLXPartitioner. Auto-imported by NativePartitioner to populate the registry at import time.
backends/native/ir/GraphTypes.h — C++ graph IR interface for the native runtime.