feat(lib): capture client-attested build provenance#454

Open: max-parke-scale wants to merge 3 commits into next from maxparke/agx1-418-build-provenance-capture

max-parke-scale (Contributor) commented Jun 30, 2026

Adds agentex.lib.utils.build_provenance — the shared capture util for client-attested build provenance: git coordinates (repo/commit/ref/subpath), a deterministic working_tree_hash over the build inputs (not the tarball), a dirty flag (Go vcs.modified / Nix dirtyRev shape), and normalize_remote. Capture is best-effort and never raises into a build. Also makes the build archive’s member order deterministic via a sorted enumeration shared with the hash.
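The archive-ordering change can be illustrated with a minimal sketch. It assumes a single sorted enumeration feeds both the hash and the tarball; the function names `iter_context_files`, `build_archive`, and `member_names` are hypothetical stand-ins here, not the PR's actual code.

```python
import io
import tarfile
from pathlib import Path


def iter_context_files(root: Path) -> list[Path]:
    """Return files and symlinks under root, sorted by POSIX relative path."""
    entries = [p for p in root.rglob("*") if p.is_symlink() or p.is_file()]
    return sorted(entries, key=lambda p: p.relative_to(root).as_posix())


def build_archive(root: Path) -> bytes:
    """Write an uncompressed tar whose member order follows iter_context_files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for path in iter_context_files(root):
            # recursive=False: each member is added exactly once, in sorted order.
            tar.add(path, arcname=path.relative_to(root).as_posix(), recursive=False)
    return buf.getvalue()


def member_names(archive: bytes) -> list[str]:
    """List archive members in the order they were written."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return tar.getnames()
```

Because the tar writer consumes the same enumeration as the hash, the member order no longer depends on filesystem directory order.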

First of three surfaces for AGX1-418 (Phase 1, client-attested). Provenance is delivered via the build-record sink — source_* columns on POST /v5/builds (Surface C, scaleapi) consumed by the sgpctl + CI uploaders (Surface B, scaleapi/sgp). This PR lands the util + archive determinism where agentex.lib lives; the uploaders/columns follow.

Scope notes

  • No build-info.json / runtime sink. An earlier revision wrote build-info.json into the build context for the register_agent() → registration_metadata path. Greptile (T-Rex) correctly flagged it as dead-on-arrival (written to the archive root, which the templates’ Dockerfiles don’t COPY and locate_build_info_path() doesn’t read). It’s also redundant: AgentexCloudDeploy.build_id is an FK to AgentexCloudBuild, so a deployment’s source provenance derives from the build record over that join, the same Build→Deploy edge that lineage already traverses. Dropped; can be revived (correctly placed) if a real consumer for deployment-history provenance ever appears.

Identity model

working_tree_hash is always computed (content identity); commit/ref/repo anchor it to source when in a git work tree; dirty records uncommitted changes (None outside git).
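A rough sketch of the git side of this model, under stated assumptions: the helper `_git`, the `GitCoordinates` dataclass, and `capture_git_coordinates` are hypothetical names, and the real util may use different git invocations. The sketch only shows how the coordinates and a `dirty` flag can be gathered best-effort, with `None` everywhere outside a git work tree.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


def _git(cwd: Path, *args: str) -> Optional[str]:
    """Run git in cwd; return stripped stdout, or None on any failure."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True, timeout=10, check=True
        )
    except (OSError, subprocess.SubprocessError):
        # Missing git binary, missing directory, non-zero exit, or timeout.
        return None
    return result.stdout.strip()


@dataclass(frozen=True)
class GitCoordinates:
    repo: Optional[str]
    commit: Optional[str]
    ref: Optional[str]
    subpath: Optional[str]
    dirty: Optional[bool]


def capture_git_coordinates(context_root: Path) -> GitCoordinates:
    """Best-effort capture; never raises, and yields all-None outside git."""
    if _git(context_root, "rev-parse", "--is-inside-work-tree") != "true":
        return GitCoordinates(None, None, None, None, None)
    status = _git(context_root, "status", "--porcelain")
    return GitCoordinates(
        repo=_git(context_root, "remote", "get-url", "origin"),  # None if no remote
        commit=_git(context_root, "rev-parse", "HEAD"),  # None on an unborn HEAD
        ref=_git(context_root, "symbolic-ref", "--short", "-q", "HEAD"),  # None when detached
        subpath=_git(context_root, "rev-parse", "--show-prefix") or None,  # empty at repo root
        dirty=None if status is None else bool(status),  # untracked files count as dirty
    )
```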

Tests

20 provenance unit tests (clean/dirty/untracked/detached-HEAD/no-remote/non-git/monorepo-subpath, hash determinism + one-byte/added/exec-bit/symlink sensitivity, and a never-raises-on-hash-failure guard). ruff/pyright clean; full lib suite green.
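The sensitivity properties listed above (one-byte change, added file, exec bit, symlink) can be sketched with a small content hash. This is an illustration only: the use of SHA-256, the field encoding, and the function body are assumptions, not the PR's implementation.

```python
import hashlib
import os
import stat
from pathlib import Path


def working_tree_hash(root: Path) -> str:
    """Hash paths, file kinds, and contents under root in sorted order."""
    digest = hashlib.sha256()
    paths = sorted(
        (p for p in root.rglob("*") if p.is_symlink() or p.is_file()),
        key=lambda p: p.relative_to(root).as_posix(),
    )
    for path in paths:
        rel = path.relative_to(root).as_posix().encode()
        if path.is_symlink():
            # Hash the link target itself rather than what it points to.
            kind, payload = b"L", os.fsencode(os.readlink(path))
        else:
            executable = bool(path.stat().st_mode & stat.S_IXUSR)
            kind = b"X" if executable else b"F"
            payload = path.read_bytes()
        # Length-prefix every field so adjacent fields cannot run together.
        for field in (kind, rel, payload):
            digest.update(len(field).to_bytes(8, "big"))
            digest.update(field)
    return digest.hexdigest()
```

Sorting by relative path makes the result independent of directory enumeration order, and the kind byte is what makes an exec-bit flip or a file-to-symlink swap change the hash.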

🧑‍💻🤖 — posted via Claude Code

Greptile Summary

This PR adds client-side build provenance for agent build contexts. The main changes are:

  • New provenance utilities for git coordinates and deterministic context hashing.
  • Deterministic archive member ordering shared with the hash enumeration.
  • Tests for provenance capture, hashing, remotes, git states, and cloud-build packaging behavior.
  • Lockfile version updates for the editable packages.

Confidence Score: 4/5

The main provenance utility work is well-scoped, but the cloud build packaging path currently omits the generated build metadata needed for the feature to function.

The review focused on the changed provenance and packaging flow, and runtime evidence confirmed that the returned build context archive can be produced without build-info.json or exposed provenance metadata.

src/agentex/lib/cli/handlers/agent_handlers.py

What T-Rex did

  • The team reproduced the provenance verification by running a Python repro through prepare_cloud_build_context with a minimal manifest, a Dockerfile, and an app file to generate a CloudBuildContext and inspect the archive contents.
  • The provenance identity rule was examined across repository commits, confirming that the post-change state enriches BuildProvenance with source_fields and build_info fields.
  • The tarball ordering was checked and found to be unsorted before, then normalized to a sorted order after the change, verified by equality of iter_context_files_order and tar_member_order.

Comments Outside Diff (1)

  1. General comment

    P1 Clean committed git provenance still emits working_tree_hash

    • Bug
      • The stated identity rule requires clean committed repositories to key identity on the clean commit and omit/null working_tree_hash. Runtime evidence from the head checkout shows clean_git_committed_tree returns commit 9d5b3e883d0470449c554424c322277f5b0ddaf4 with dirty=false, but also returns working_tree_hash 06ae68b1d1ef47dbe60829f2fd3bff3367e51853bea8781c5fb647f222cf85a0 in BuildProvenance, source_fields, and build_info.
    • Cause
      • capture_build_provenance computes tree_hash before checking git state and always passes working_tree_hash=tree_hash into BuildProvenance for git captures. The changed return path is anchored at src/agentex/lib/utils/build_provenance.py:224-230, specifically working_tree_hash=tree_hash on line 229; the docstring/comments also say it is always computed.
    • Fix
      • Only compute/assign working_tree_hash when there is no clean commit identity: non-git, unborn/no HEAD, or dirty work tree. For clean committed git captures, return working_tree_hash=None so source_fields/build_info omit it while keeping commit/ref/repo and dirty=false.

Reviews (3): Last reviewed commit: "refactor(lib): drop the build-info.json ..."

Add agentex.lib.utils.build_provenance — the single producer of source
identity for agent builds (git coordinates + a deterministic content hash
of the build context). prepare_cloud_build_context now writes
build-info.json into the staged context (populates runtime
registration_metadata with no server change) and exposes provenance on
CloudBuildContext so the upload can send source_* fields. Archive member
order is now deterministic via a sorted enumeration shared with the hash.

The hash is computed only when there is no clean commit to identify the
build (dirty tree or non-git context). First of three surfaces for
AGX1-418 (Phase 1, client-attested); the SGP build-record columns and the
sgpctl/Gitea uploaders follow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
max-parke-scale changed the base branch from main to next on June 30, 2026 01:04
Comment thread src/agentex/lib/utils/build_provenance.py Outdated
Comment thread src/agentex/lib/utils/build_provenance.py Outdated
Address Greptile review on the build-provenance capture util:

- Always compute working_tree_hash (drop the "skip on clean commit"
  path). A `git status` clean tree can still contain .gitignore'd-but-not-
  .dockerignore'd files the commit can't reproduce; an always-present
  content hash identifies the exact shipped bytes and closes that gap.
- Guard the hash (_safe_working_tree_hash) so a permission error or
  filesystem race degrades to None instead of aborting the build — the
  module contract is that capture never raises into a build.
- Record dirtiness as a first-class `dirty` flag (surfaced as `source_dirty`
  / `dirty`) rather than overloading hash-presence, matching Go's
  vcs.modified and Nix's dirtyRev. None outside a git work tree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
max-parke-scale (Contributor, Author) commented:

Addressed both Greptile findings in cf9994d:

  1. Ignored files lose hashing — fixed by removing the "skip hash on clean commit" path entirely: working_tree_hash is now always computed over the staged context, so a .gitignored-but-not-.dockerignored file is captured by the content hash regardless of git status. (Identity = the always-present hash; commit anchors it to source. Dedupe is unaffected since the hash is deterministic.)
  2. Hash failures abort builds — wrapped the computation in _safe_working_tree_hash, which degrades to None and logs on any error, honoring the “capture never raises into a build” contract.

Also, per design discussion: dirtiness is now a first-class dirty flag (surfaced as source_dirty / dirty) rather than implied by hash-presence — matching Go’s vcs.modified and Nix’s dirtyRev; None outside a git work tree.
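A minimal sketch of the two behaviours described in this comment, assuming a guard that swallows any hashing error and a record that carries `dirty` as its own field. The names `safe_working_tree_hash` and `Provenance`, and the keys `source_commit` and `source_working_tree_hash`, are hypothetical; only `source_dirty` is named in the discussion.

```python
import logging
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def safe_working_tree_hash(root: Path, compute: Callable[[Path], str]) -> Optional[str]:
    """Run compute(root); on any error, log and return None instead of raising."""
    try:
        return compute(root)
    except Exception:
        logger.warning("working_tree_hash failed for %s; continuing without it", root, exc_info=True)
        return None


@dataclass(frozen=True)
class Provenance:
    commit: Optional[str]
    working_tree_hash: Optional[str]
    dirty: Optional[bool]  # None outside a git work tree

    def source_fields(self) -> dict[str, object]:
        """Emit only the fields that were captured; False is kept, None is dropped."""
        fields = {
            "source_commit": self.commit,
            "source_working_tree_hash": self.working_tree_hash,
            "source_dirty": self.dirty,
        }
        return {key: value for key, value in fields.items() if value is not None}
```

Filtering on `is not None` rather than truthiness matters here: a clean tree must still report `source_dirty` as `False`, which is distinct from the field being absent outside git.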

🧑‍💻🤖 — posted via Claude Code

context_root=build_context_root,
content_root=staged_root,
)
(staged_root / _BUILD_INFO_FILENAME).write_text(json.dumps(provenance.build_info(), indent=2, sort_keys=True))

P1 Build info not copied

build-info.json is added to the archive root, but the generated Dockerfiles only copy the project subdirectory contents into the image, such as COPY {{ project_path_from_build_root }}/project /app/{{ project_path_from_build_root }}/project. FastACP.locate_build_info_path() then looks next to the importing project/acp.py. For the default, temporal, and sync templates, the file can be present in the tarball but absent from /app/<agent>/project/build-info.json, so runtime registration sends no provenance metadata. Please stage the file under the path that is copied and read at runtime, or update the Dockerfiles/runtime lookup to handle the root-level file.
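The mismatch can be reproduced in miniature. The sketch below is not the T-Rex repro script; it assumes a lookup that checks only beside the importing module (`locate_next_to` is a hypothetical stand-in for the runtime lookup) and mimics the `COPY <agent>/project` step with `shutil.copytree`.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional

BUILD_INFO = "build-info.json"


def locate_next_to(module_file: Path) -> Optional[Path]:
    """Hypothetical lookup: only checks the directory of the importing module."""
    candidate = module_file.parent / BUILD_INFO
    return candidate if candidate.is_file() else None


def simulate(stage_in_project: bool) -> bool:
    """Return True if the file staged in the context is found inside the 'image'."""
    with tempfile.TemporaryDirectory() as tmp:
        context = Path(tmp) / "context"
        project = context / "my_agent" / "project"
        project.mkdir(parents=True)
        (project / "acp.py").write_text("# entrypoint\n")
        # Stage the file either at the archive root or inside the project dir.
        target = project if stage_in_project else context
        (target / BUILD_INFO).write_text("{}")
        # Mimic the Dockerfile COPY: only the project directory reaches the image.
        image_project = Path(tmp) / "app" / "my_agent" / "project"
        shutil.copytree(project, image_project)
        return locate_next_to(image_project / "acp.py") is not None
```

With `stage_in_project=False` the file exists in the build context but never reaches the image, which is the situation the comment describes.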

Artifacts

Repro: focused build context mismatch script

  • Contains supporting evidence from the run (text/x-python; charset=utf-8).

Repro: script output showing root-only build-info.json and missing project build-info path

  • Keeps the command output available without making the summary code-heavy.

Comment location: src/agentex/lib/cli/handlers/agent_handlers.py, line 279

max-parke-scale (Contributor, Author) replied:

Good catch; fixed in 923a110 by removing the build-info.json write entirely. It was dead-on-arrival as you found (written to the archive root, not COPY'd into the image and not read by locate_build_info_path()), and it's redundant anyway: AgentexCloudDeploy.build_id → AgentexCloudBuild lets a deployment's source provenance derive from the build record (the source_* columns this work adds) over the FK, so there is no need to denormalize onto registration_metadata. The runtime sink can be revived (correctly placed) if a real consumer for deployment-history provenance shows up.
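The join argument can be sketched with an in-memory SQLite database. The table and column names below (`cloud_build`, `cloud_deploy`, `source_commit`) are simplified assumptions for illustration, not the actual AgentexCloudBuild / AgentexCloudDeploy schema; only the `build_id` foreign key and the `source_*` naming come from the discussion.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Create a toy schema: deployments reference builds through build_id."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE cloud_build (
            id TEXT PRIMARY KEY,
            source_commit TEXT,
            source_dirty INTEGER
        );
        CREATE TABLE cloud_deploy (
            id TEXT PRIMARY KEY,
            build_id TEXT NOT NULL REFERENCES cloud_build(id)
        );
        INSERT INTO cloud_build VALUES ('build-1', 'abc123', 0);  -- illustrative values
        INSERT INTO cloud_deploy VALUES ('deploy-1', 'build-1');
        """
    )
    return conn


def deployment_provenance(conn: sqlite3.Connection, deploy_id: str) -> Optional[dict]:
    """Derive a deployment's provenance from its build record over the FK join."""
    row = conn.execute(
        """
        SELECT b.source_commit, b.source_dirty
        FROM cloud_deploy AS d
        JOIN cloud_build AS b ON b.id = d.build_id
        WHERE d.id = ?
        """,
        (deploy_id,),
    ).fetchone()
    if row is None:
        return None
    return {"source_commit": row[0], "source_dirty": bool(row[1])}
```

Since the deployment row already points at its build, nothing needs to be copied onto the deployment side for this lookup to work.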

🧑‍💻🤖 — posted via Claude Code

Greptile (T-Rex repro) showed build-info.json was written to the archive
root, which the templates' Dockerfiles don't COPY and the runtime
locate_build_info_path() doesn't read — so it never reached the image and
the registration_metadata sink stayed empty.

Beyond the placement bug, the sink is redundant: AgentexCloudDeploy.build_id
is an FK to AgentexCloudBuild, so a deployment's source provenance derives
from the build record (the source_* columns this work adds, Surface C) over
that join — the same Build->Deploy edge lineage already traverses. No need
to denormalize provenance onto registration_metadata/DeploymentHistory
(which has had no producer since its read path landed 2025-09, so its git
fields have never been populated).

#454 now ships only the shared capture util (agentex.lib.utils.build_provenance)
plus a deterministic build-archive ordering. Provenance is delivered via the
build-record sink; the runtime sink can be revived (correctly placed) if a
real consumer for deployment-history provenance ever appears.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@socket-security

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Updated | pypi/agentex-sdk@0.13.0 ⏵ 0.16.2 | 94 (+1) | 100 | 100 | 100 (+50) | 100 |
| Updated | pypi/agentex-client@0.13.0 ⏵ 0.16.2 | 99 (+1) | 100 | 100 | 100 | 100 |
