AI: [G3] dequantize_per_tensor asym SIMD: integer zero-point subtract by zonglinpeng · Pull Request #20703 · pytorch/executorch

zonglinpeng · 2026-07-02T22:31:24Z

Summary:
Optimizes the per-tensor dequantize_per_tensor_out SIMD fast path on Fusion-G3 for int8 / uint8 / int16 / uint16 -> float32.

What is optimized: the zero-point subtract stays in the int32 domain (`PDX_SUB_MX32`), and the int->float conversion is folded into the multiply by feeding the widened int32 codes straight into the mixed-type `PDX_MUL_MXF32(int32, scale)` (the idiom the vendor NNLib dequantize kernel uses). Per element, the symmetric path is then a single fused multiply and the asymmetric path is one integer subtract plus one fused multiply, with no separate int->float convert instruction on either path. The result is bit-identical to the previous code for 8/16-bit inputs: `q - zp` is exact in int32 and converts exactly to float, so it matches `float(q) - float(zp)`.
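The bit-identity argument can be checked on a host machine with a scalar model. The sketch below is not the PDX kernel: it assumes the mixed-type multiply behaves like an exact int32->float conversion followed by an ordinary float multiply, and every function name in it is hypothetical. It compares the float-domain and integer-domain forms by bit pattern over a range of codes.

```cpp
#include <cstdint>
#include <cstring>

// Float-domain form (previous behaviour): convert both operands to float,
// subtract, then scale.
inline float dequant_float_domain(int32_t q, int32_t zp, float scale) {
  return (static_cast<float>(q) - static_cast<float>(zp)) * scale;
}

// Integer-domain form (this PR's idea, modelled in scalar code): subtract in
// int32, convert once, then scale. ASSUMPTION: the fused mixed-type multiply
// is equivalent to this convert-then-multiply sequence.
inline float dequant_int_domain(int32_t q, int32_t zp, float scale) {
  return static_cast<float>(q - zp) * scale;
}

// Compare two floats by bit pattern, so +0.0 vs -0.0 would count as different.
inline bool same_bits(float a, float b) {
  uint32_t ua = 0;
  uint32_t ub = 0;
  std::memcpy(&ua, &a, sizeof ua);
  std::memcpy(&ub, &b, sizeof ub);
  return ua == ub;
}

// Count codes in [lo, hi] for which the two forms disagree.
inline long count_mismatches(int32_t lo, int32_t hi, int32_t zp, float scale) {
  long bad = 0;
  for (int32_t q = lo; q <= hi; ++q) {
    if (!same_bits(dequant_float_domain(q, zp, scale),
                   dequant_int_domain(q, zp, scale))) {
      ++bad;
    }
  }
  return bad;
}
```

For 8- and 16-bit codes, both `q` and `zp` and their difference stay well below 2^24 in magnitude, so each conversion and the float subtract are exact. That is why `count_mismatches` is expected to return 0 for ranges such as `[-128, 127]` or `[0, 65535]`.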

What is improved from 00b95cf2b4: that revision moved the subtract into the integer domain but kept an explicit `(xb_vecMxf32)` convert, so the asymmetric FP pipe still did convert + multiply (a `sub -> convert -> mul` dependency chain) and the symmetric path was left untouched (also convert + multiply). This diff removes that convert on both paths via the fused mixed-type multiply, shortening the asymmetric chain to `sub -> fused-mul` and the symmetric chain to a single fused multiply. That clears the residual large-tensor asymmetric regression that 00b95cf2b4 still had (e.g. `objv1_1x400x400x1_u8`, 160000 elems, goes 0.75x -> 1.00x vs stock and `dpev26_4x60x4x4_u16` goes 0.89x -> 1.17x) while keeping every symmetric case at parity or better. The object file is also ~144 B smaller.

Differential Revision: D110529287

…20499)

Summary: Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through `operator_fallback.bzl`, it places the PDX SIMD fast path directly into the existing executorch operator `dequantize_per_tensor_out` in `executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp` (per-tensor function only; `per_channel`/`tensor`/`tensor_args` variants are untouched).

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op: the same float-domain affine `(x - zero_point) * scale`.

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, `PDX_SUB_MX32`); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (`ASYM_DEQUANTIZE_IMPL_CHANNEL`/`SYM_DEQUANTIZE_IMPL_CHANNEL`) get the `static_cast<CTYPE_OUT>((x - zp) * scale)` parenthesization required to build clean under the G3 `dev` mode's `-Werror,-Wdouble-promotion`.

For A/B measurement this also adds `op_dequantize_baseline.cpp` under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the `-Wdouble-promotion` fix). It defines `impl::G3::native::dequantize_per_tensor_out`, so the shared benchmark source from D109441948 is linked into two binaries, `_optimized` (against the real executorch op) and `_stock` (against the snapshot), and compared on the cycle-accurate G3 ISS. `operators_header` visibility is extended to the Jarvis test package so the snapshot can include `operators.h`.

Reviewed By: mvartani-meta

Differential Revision: D109500113
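The alignment gate described in that commit message can be illustrated with a small portable sketch. This is not the code from `op_dequantize.cpp`: the helper names `is_aligned_16` and `can_use_simd_path` are hypothetical, and the real `dequant_simd_aligned` check may differ in detail. The sketch only shows one common way to test that both buffers sit on a 16-byte boundary before choosing a vector loop over a fallback.

```cpp
#include <cstdint>

// Hypothetical helper: true when the pointer's address is a multiple of 16.
inline bool is_aligned_16(const void* p) {
  return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}

// Hypothetical gate: take the SIMD fast path only when BOTH the input and the
// output buffer are 16-byte aligned; otherwise the caller would use the
// fallback (the NNLib path in the PR).
inline bool can_use_simd_path(const void* in, const void* out) {
  return is_aligned_16(in) && is_aligned_16(out);
}
```

A caller would evaluate `can_use_simd_path(input_ptr, output_ptr)` once per call and branch to the vector loop or the fallback, so unaligned tensors still get a correct result at the cost of the slower path.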


pytorch-bot · 2026-07-02T22:31:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20703

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f81044d with merge base 8965e51:

NEW FAILURE - The following job has failed:

pull / test-binary-size-linux-gcc / linux-job (gh)
RuntimeError: Command docker exec -t 6dda945d2a0ec659ea0ec427c1d0984b98688f066190cde5556e1a2c8c41e633 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-07-02T22:31:33Z

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110529287.

github-actions · 2026-07-02T22:33:25Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

zonglinpeng added 2 commits July 2, 2026 15:31

meta-cla Bot added the CLA Signed label Jul 2, 2026

meta-codesync Bot added the meta-exported label Jul 2, 2026

meta-codesync Bot temporarily deployed to cadence July 2, 2026 22:31 Inactive

meta-codesync Bot temporarily deployed to cadence July 2, 2026 22:59 Inactive

zonglinpeng wants to merge 2 commits into pytorch:main from zonglinpeng:export-D110529287
