Skip to content

Memoize file contents by path in CachedParser to skip redundant reads#5928

Open
SanderMuller wants to merge 1 commit into
phpstan:2.2.xfrom
SanderMuller:cachedparser-memoize-file-contents
Open

Memoize file contents by path in CachedParser to skip redundant reads#5928
SanderMuller wants to merge 1 commit into
phpstan:2.2.xfrom
SanderMuller:cachedparser-memoize-file-contents

Conversation

@SanderMuller

Copy link
Copy Markdown
Contributor

CachedParser::parseFile() reads the whole file via FileReader::read() on every call, before the content-keyed node cache. The same file is parsed many times (a trait file once per class that uses it), so the read repeats even when nothing changed. On Tempest that is 73,498 parseFile calls for 2,327 distinct files, 96.8% of them redundant; hot trait files are re-read thousands of times and the 256-entry content cache thrashes on them.

This memoizes the contents by path, keyed by mtime, and skips the re-read when the file is unchanged. clearstatcache() before the mtime check keeps it correct in long-running processes (PHPStan Pro, the fixer worker), where a file can change between calls: an edited file gets a new mtime, so it is read and parsed again. A test covers both paths, that an unchanged file is not re-read and a file with a newer mtime is.

Effect (CPU = user+sys, /usr/bin/time, result-cold), measured on Tempest and on phpstan-src's own source:

  • single process: sys −45%, CPU ~−5%
  • parallel (8 workers): sys −16%, CPU ~−5%, wall unchanged

This is a CPU / shared-runner saving on result-cold runs (the path taken after composer update or a config change in CI), not a latency win: parallel analysis is CPU-bound, so wall time is unchanged. The saving is a roughly constant amount of syscall time, so it is a larger fraction on lighter-per-file projects. Output is byte-identical before and after on both codebases.

Memory is about +4 MB on Tempest for the contents held by path; the node ASTs are not duplicated, since the existing content cache still holds those.

CachedParser::parseFile() read the whole file via FileReader::read() on every
call, before the content-keyed node cache. The same file is parsed many times
(a trait file once per class that uses it - on Tempest, 73 498 parseFile calls
for 2 327 distinct files, 96.8% redundant reads; the 256-entry content cache
thrashes on hot traits), so the read is repeated even when nothing changed.

Memoize the contents by path, keyed by mtime, and skip the re-read when the
file is unchanged. clearstatcache() before the mtime check keeps this correct
in long-running processes (PHPStan Pro, fixer worker) where a file may be
edited between calls, so an edited file is always re-read and re-parsed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@staabm

staabm commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

How to reproduce/measure the performance improvement?

@SanderMuller

Copy link
Copy Markdown
Contributor Author

Here's the A/B I used to measure it. Swap only CachedParser.php between this branch and its parent (bad7874ec), keep the same vendor, run cold (delete the tmp/cache dir before each run) and single-process, and compare sys and total CPU (user+sys) rather than wall. Wall is overlapped by the parallel workers and is too noisy on a shared machine to read a few-percent change from.

# in the phpstan checkout, swap just the one file between the two versions:
git show <ver>:src/Parser/CachedParser.php > src/Parser/CachedParser.php   # <ver> = this branch, then bad7874ec
# from the target project, cold + single-process, 3 reps:
rm -rf <tmpDir> && /usr/bin/time php <phpstan>/bin/phpstan analyse -l 8 --no-progress <paths>

The mechanism: parseFile() is called once per class that uses a given trait (and for other shared includes), so the same file's contents get read from disk many times in a single run. The memo keys the contents by path + mtime and reads each file once.

So the win tracks how much cross-file re-read redundancy a codebase has, which makes it corpus-dependent. Single-process, cold, 3 reps, median:

  • Tempest (multi-package framework, many files pulling the same shared base classes/traits): sys 2.62s → 1.79s (-32%), total CPU 34.9s → 34.0s (-2.6%), user flat. The saving is in read syscalls, as expected; user time doesn't move because parsing itself is unchanged.
  • rector-src (~1150 files, parses each roughly once): no measurable change (sys ~1.60s either way).

So it's a real win on high-fan-in codebases and neutral on low-redundancy ones. The content memo costs about 4MB regardless of whether the project benefits. If that always-paid cost is the concern, a bounded variant (memoize only files that get read more than once) would make the memory track the benefit; happy to do that if you'd prefer it.

The red CI check is the flaky RegressionBench wall-time assertion (it fails the same way on unrelated PRs that can't affect analysis time), not a real regression.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants