AI: [G3] dequantize_per_tensor asym SIMD: integer zero-point subtract #20703
zonglinpeng wants to merge 2 commits into
…20499)

Summary:
Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through `operator_fallback.bzl`, it places the PDX SIMD fast path directly into the existing executorch operator `dequantize_per_tensor_out` in `executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp` (per-tensor function only; the `per_channel`/`tensor`/`tensor_args` variants are untouched).

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op: the same float-domain affine `(x - zero_point) * scale`.

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, `PDX_SUB_MX32`); it uses the float-domain asymmetric path from D108798741 as requested.

The macro fast paths (`ASYM_DEQUANTIZE_IMPL_CHANNEL`/`SYM_DEQUANTIZE_IMPL_CHANNEL`) get the `static_cast<CTYPE_OUT>((x - zp) * scale)` parenthesization required to build clean under the G3 `dev` mode's `-Werror,-Wdouble-promotion`.

For A/B measurement this also adds `op_dequantize_baseline.cpp` under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the `-Wdouble-promotion` fix). It defines `impl::G3::native::dequantize_per_tensor_out`, so the shared benchmark source from D109441948 is linked into two binaries, `_optimized` (against the real executorch op) and `_stock` (against the snapshot), which are compared on the cycle-accurate G3 ISS. `operators_header` visibility is extended to the Jarvis test package so the snapshot can include `operators.h`.

Reviewed By: mvartani-meta

Differential Revision: D109500113
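The dispatch described above (SIMD fast path when both buffers are 16-byte aligned, fallback otherwise) and the float-domain affine can be illustrated with a scalar reference. This is only a sketch under stated assumptions: `is_aligned_16` and `dequantize_ref` are hypothetical names, the real signature of `dequant_simd_aligned` is not shown in this PR text, and no PDX intrinsics are used here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the alignment check the commit calls
// `dequant_simd_aligned`: true when both pointers are 16-byte aligned.
inline bool is_aligned_16(const void* in, const void* out) {
  return (reinterpret_cast<std::uintptr_t>(in) % 16 == 0) &&
         (reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
}

// Scalar reference for the float-domain affine (x - zero_point) * scale.
// The whole product is parenthesized inside the cast, mirroring the
// static_cast<CTYPE_OUT>((x - zp) * scale) form mentioned in the commit.
template <typename T>
void dequantize_ref(const T* in, float* out, std::size_t n,
                    float scale, std::int32_t zero_point) {
  assert(in != nullptr && out != nullptr);
  for (std::size_t i = 0; i < n; ++i) {
    const float x = static_cast<float>(in[i]);
    const float zp = static_cast<float>(zero_point);
    out[i] = static_cast<float>((x - zp) * scale);
  }
}
```

A caller would test `is_aligned_16(in, out)` first and only take a vectorized path when it holds; a scalar routine like `dequantize_ref` can serve as the oracle when comparing a fast path against the original behavior.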
Summary:
Optimizes the per-tensor `dequantize_per_tensor_out` SIMD fast path on Fusion-G3 for int8 / uint8 / int16 / uint16 -> float32.

What is optimized: the zero-point subtract is kept in the int32 domain (`PDX_SUB_MX32`) and the int->float conversion is folded into the multiply by feeding the widened int32 codes straight into the mixed-type `PDX_MUL_MXF32(int32, scale)` (the idiom the vendor NNLib dequantize kernel uses). Per element, the symmetric path is then a single fused multiply, and the asymmetric path is one integer subtract plus one fused multiply, with no separate int->float convert instruction on either path. The result is bit-identical to the previous code for 8/16-bit inputs, since `q - zp` is exact in int32 and converts exactly to float, matching `float(q) - float(zp)`.

What is improved from 00b95cf2b4: that revision moved the subtract into the integer domain but kept an explicit `(xb_vecMxf32)` convert, so the asymmetric FP pipe still did convert + multiply (a `sub -> convert -> mul` dependency chain) and the symmetric path was left untouched (also convert + multiply). This diff removes that convert on both paths via the fused mixed-type multiply, shortening the asymmetric chain to `sub -> fused-mul` and the symmetric chain to a single fused multiply. That clears the residual large-tensor asymmetric regression 00b95cf2b4 still had (e.g. `objv1_1x400x400x1_u8`, 160000 elems, goes 0.75x -> 1.00x vs stock and `dpev26_4x60x4x4_u16` goes 0.89x -> 1.17x) while keeping every symmetric case at parity or better, and the object file is ~144 B smaller.

Differential Revision: D110529287
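The bit-identity claim above rests on `q - zp` staying far below 2^24 in magnitude for 8/16-bit codes, so both the integer difference and the float difference are exact. The following host-side scalar sketch checks that reasoning; it is not the PDX kernel. The function names are hypothetical, and it assumes the zero point lies in the input type's range.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");

// Raw bit pattern of a float, so the comparison is truly bit-for-bit.
inline std::uint32_t float_bits(float v) {
  std::uint32_t b;
  std::memcpy(&b, &v, sizeof b);
  return b;
}

// Integer-domain form: subtract in int32, convert once, multiply.
inline float dequant_int_sub(std::int32_t q, std::int32_t zp, float scale) {
  return static_cast<float>(q - zp) * scale;
}

// Float-domain form: convert both operands, subtract, multiply.
inline float dequant_float_sub(std::int32_t q, std::int32_t zp, float scale) {
  return (static_cast<float>(q) - static_cast<float>(zp)) * scale;
}

// Compares both forms over T's range, visiting every `stride`-th code
// and zero point (assumption: zp is drawn from the same range as q).
template <typename T>
bool forms_agree(float scale, std::int32_t stride) {
  assert(stride > 0);
  const std::int32_t lo = std::numeric_limits<T>::min();
  const std::int32_t hi = std::numeric_limits<T>::max();
  for (std::int32_t q = lo; q <= hi; q += stride) {
    for (std::int32_t zp = lo; zp <= hi; zp += stride) {
      if (float_bits(dequant_int_sub(q, zp, scale)) !=
          float_bits(dequant_float_sub(q, zp, scale))) {
        return false;
      }
    }
  }
  return true;
}
```

With `stride` set to 1 the 8-bit types are covered exhaustively; for the 16-bit types a stride of 255 keeps the run short while still reaching both ends of the range.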
Dr. CI: ❌ 1 New Failure as of commit f81044d with merge base 8965e51.
@zonglinpeng has exported this pull request. If you are a Meta employee, you can view the originating Diff in D110529287.