AI: [G3] dequantize_per_tensor asym SIMD: integer zero-point subtract #20703

Open
zonglinpeng wants to merge 2 commits into pytorch:main from zonglinpeng:export-D110529287
Conversation

@zonglinpeng

Contributor

Summary:
Optimizes the per-tensor `dequantize_per_tensor_out` SIMD fast path on Fusion-G3 for int8 / uint8 / int16 / uint16 -> float32.

What is optimized: the zero-point subtract stays in the int32 domain (`PDX_SUB_MX32`), and the int->float conversion is folded into the multiply by feeding the widened int32 codes straight into the mixed-type `PDX_MUL_MXF32(int32, scale)` (the idiom the vendor NNLib dequantize kernel uses). Per element, the symmetric path is then a single fused multiply and the asymmetric path is one integer subtract plus one fused multiply, with no separate int->float convert instruction on either path. The output is bit-identical to the previous code for 8/16-bit inputs: `q - zp` is exact in int32 and converts exactly to float, so it matches `float(q) - float(zp)`.
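The bit-identity argument can be checked on a host machine with a plain scalar model. The sketch below is not the PDX kernel: `dequant_float_sub`, `dequant_int_sub`, and `paths_agree` are hypothetical helper names, and it assumes the zero point lies within the range of the quantized type, so that `q - zp` stays well below 2^24 in magnitude and is exactly representable in float32.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Reference form: convert both operands to float, subtract, then scale.
template <typename Q>
float dequant_float_sub(Q q, std::int32_t zp, float scale) {
  return (static_cast<float>(q) - static_cast<float>(zp)) * scale;
}

// Integer form: subtract in int32, convert once, then scale.
template <typename Q>
float dequant_int_sub(Q q, std::int32_t zp, float scale) {
  const std::int32_t diff = static_cast<std::int32_t>(q) - zp;
  return static_cast<float>(diff) * scale;
}

// Compare raw bit patterns so that +0.0f and -0.0f are not treated as equal.
inline bool same_bits(float a, float b) {
  std::uint32_t ua = 0;
  std::uint32_t ub = 0;
  std::memcpy(&ua, &a, sizeof ua);
  std::memcpy(&ub, &b, sizeof ub);
  return ua == ub;
}

// Exhaustively walk every value of Q for one (zero point, scale) pair.
template <typename Q>
bool paths_agree(std::int32_t zp, float scale) {
  const std::int32_t lo = std::numeric_limits<Q>::min();
  const std::int32_t hi = std::numeric_limits<Q>::max();
  for (std::int32_t v = lo; v <= hi; ++v) {
    const Q q = static_cast<Q>(v);
    if (!same_bits(dequant_float_sub(q, zp, scale),
                   dequant_int_sub(q, zp, scale))) {
      return false;
    }
  }
  return true;
}
```

Calling `paths_agree<std::int16_t>(zp, scale)` for the extreme zero points of each type covers the largest differences the kernel can see under the stated assumption.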

What is improved over 00b95cf2b4: that revision moved the subtract into the integer domain but kept an explicit `(xb_vecMxf32)` convert, so the asymmetric FP pipe still did convert + multiply (a `sub -> convert -> mul` dependency chain), and the symmetric path was left untouched (also convert + multiply). This diff removes that convert on both paths via the fused mixed-type multiply, shortening the asymmetric chain to `sub -> fused-mul` and the symmetric chain to a single fused multiply. That clears the residual large-tensor asymmetric regression that 00b95cf2b4 still had: for example, `objv1_1x400x400x1_u8` (160000 elements) goes 0.75x -> 1.00x vs stock and `dpev26_4x60x4x4_u16` goes 0.89x -> 1.17x. Every symmetric case stays at parity or better, and the object file is ~144 B smaller.
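As a rough picture of the per-element work on each path, here is a scalar stand-in for the two loops. It is only a sketch: the real kernel uses PDX vector intrinsics that are not reproduced here, the function name `dequantize_per_tensor_model` is hypothetical, and selecting the symmetric path with `zero_point == 0` is an assumption about how the paths are chosen.

```cpp
#include <cstddef>
#include <cstdint>

// Scalar model of the two per-tensor paths (not the PDX implementation).
template <typename Q>
void dequantize_per_tensor_model(const Q* in, float* out, std::size_t n,
                                 float scale, std::int32_t zero_point) {
  if (zero_point == 0) {
    // Symmetric: widen to int32, then one multiply. In the vector kernel the
    // int->float convert is folded into this multiply.
    for (std::size_t i = 0; i < n; ++i) {
      const std::int32_t code = static_cast<std::int32_t>(in[i]);
      out[i] = static_cast<float>(code) * scale;
    }
  } else {
    // Asymmetric: one integer subtract, then the same single multiply.
    for (std::size_t i = 0; i < n; ++i) {
      const std::int32_t centered =
          static_cast<std::int32_t>(in[i]) - zero_point;
      out[i] = static_cast<float>(centered) * scale;
    }
  }
}
```

In this model the only difference between the paths is the integer subtract, which mirrors the `sub -> fused-mul` versus single fused-multiply chains described above.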

Differential Revision: D110529287

…20499)

Summary:

Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through `operator_fallback.bzl`, it places the PDX SIMD fast path directly into the existing executorch operator `dequantize_per_tensor_out` in `executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp` (per-tensor function only; `per_channel`/`tensor`/`tensor_args` variants are untouched).

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op — the same float-domain affine `(x - zero_point) * scale`.
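The alignment gate described above might look roughly like the following. This is a sketch under assumptions: the source names `dequant_simd_aligned` but does not show its signature, so the helpers `is_aligned_to`, `buffers_simd_aligned`, and `choose_path` and the `DequantPath` enum are hypothetical illustrations rather than the operator's actual code.

```cpp
#include <cstddef>
#include <cstdint>

// True when the address of p is a multiple of `alignment` bytes.
inline bool is_aligned_to(const void* p, std::size_t alignment) {
  return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Both buffers must be 16-byte aligned for the inline SIMD loop.
inline bool buffers_simd_aligned(const void* in, const void* out) {
  constexpr std::size_t kVectorAlignment = 16;
  return is_aligned_to(in, kVectorAlignment) &&
         is_aligned_to(out, kVectorAlignment);
}

enum class DequantPath { kInlineSimd, kNnlibFallback };

// Pick the inline SIMD loop when aligned, else the NNLib fallback.
inline DequantPath choose_path(const void* in, const void* out) {
  return buffers_simd_aligned(in, out) ? DequantPath::kInlineSimd
                                       : DequantPath::kNnlibFallback;
}
```

Checking both pointers up front keeps the vector loop free of per-iteration alignment handling, at the cost of sending any misaligned tensor down the fallback path.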

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, `PDX_SUB_MX32`); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (`ASYM_DEQUANTIZE_IMPL_CHANNEL`/`SYM_DEQUANTIZE_IMPL_CHANNEL`) get the `static_cast<CTYPE_OUT>((x - zp) * scale)` parenthesization required to build clean under the G3 `dev` mode's `-Werror,-Wdouble-promotion`.
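To illustrate why the placement of the cast matters for `-Wdouble-promotion`, here is a minimal sketch. It assumes the scale is held as a `double` and the output type is `float`; the source does not show the original expression, so the pre-fix form mentioned in the comment is only one plausible way the warning could arise, and `dequantize_one` is a hypothetical helper.

```cpp
#include <cstdint>

// Assumption: scale is a double, CTYPE_OUT is float, CTYPE_IN is an integer
// type no wider than 16 bits (so x - zp is computed as int).
template <typename CTYPE_OUT, typename CTYPE_IN>
CTYPE_OUT dequantize_one(CTYPE_IN x, std::int32_t zp, double scale) {
  // A form such as `static_cast<CTYPE_OUT>(x - zp) * scale` would multiply a
  // float by a double, implicitly promoting the float operand to double,
  // which is the pattern -Wdouble-promotion reports.
  //
  // Here the integer difference is multiplied by the double directly
  // (an int -> double conversion, not a float -> double promotion), and the
  // whole product is narrowed to the output type once.
  return static_cast<CTYPE_OUT>((x - zp) * scale);
}
```

With this shape the only floating-point conversion is the explicit narrowing cast at the end.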

For A/B measurement this also adds `op_dequantize_baseline.cpp` under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the `-Wdouble-promotion` fix). It defines `impl::G3::native::dequantize_per_tensor_out`, so the shared benchmark source from D109441948 is linked into two binaries — `_optimized` (against the real executorch op) and `_stock` (against the snapshot) — and compared on the cycle-accurate G3 ISS. `operators_header` visibility is extended to the Jarvis test package so the snapshot can include `operators.h`.

Reviewed By: mvartani-meta

Differential Revision: D109500113
@pytorch-bot

pytorch-bot Bot commented Jul 2, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20703

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit f81044d with merge base 8965e51 (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jul 2, 2026
@meta-codesync

meta-codesync Bot commented Jul 2, 2026

Contributor

@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110529287.

@github-actions

github-actions Bot commented Jul 2, 2026


This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

