acc: simulate dashboard eventual consistency and harden dashboard tests#5724
acc: simulate dashboard eventual consistency and harden dashboard tests#5724denik wants to merge 13 commits into
Conversation
Approval status: pending
|
Integration test reportCommit: 9745574
23 interesting tests: 13 SKIP, 10 RECOVERED
Top 5 slowest tests (at least 2 minutes):
|
Add a testserver eventual-consistency mode for dashboards: under the EC token (opt-in via TESTS_STALE_ONCE=1, direct engine only) the first GET of a dashboard after it is created returns 404, reproducing deterministically the cloud propagation window these tests run against (Cloud = true). Enable it for the dashboards directory and wrap the first lakeview get after each create with retry.py, so reads right after deploy are retried rather than assumed to succeed. This is a follow-up to #5694 and de-flakes the same class of cloud reads. Test-only: no engine changes. Co-authored-by: Isaac
Several dashboard tests read the same dashboard twice right after deploy (once to capture the etag for a replacement, once to display it). Capture the response once and derive both from it, removing the redundant API call. Also clarify the DashboardTrash comment: like DashboardUpdate, it applies immediately; only DashboardCreate stages the eventual-consistency 404. Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
Co-authored-by: Isaac
It's not about retrying: the terraform provider reads the dashboard back after create and fails on the 404. Co-authored-by: Isaac
retry.py is a shebang script; under MSYS_NO_PATHCONV=1 git-bash passes its own path to the native python unconverted, so python can't find it. The dir-wide MSYS_NO_PATHCONV in dashboards/test.toml was redundant (only detect-change needs it, and it sets its own), so drop it; detect-change keeps it and runs retry.py via `env -u MSYS_NO_PATHCONV`. Co-authored-by: Isaac
Move the eventual-consistency read logic off the call sites and into an EventualMap[K,V] that owns the map and the EC flag. Handlers now do s.Dashboards.Read / ReadStrong / Write / Put instead of fetching the EventualValue and branching on s.EventualConsistency. Co-authored-by: Isaac
On Windows git-bash passes retry.py's own path to the native Python unconverted under MSYS_NO_PATHCONV, so Python can't find the script (same problem the envsubst() wrapper already guards against). Add a retry() wrapper mirroring envsubst() and route all dashboard retry.py calls through it. Restore the dir-wide MSYS_NO_PATHCONV that an earlier commit removed. Co-authored-by: Isaac
Document that immediate-visibility on update is a deliberate simplification: staging a stale value would produce a successful-but-stale 200 that breaks plan-time reads the engine doesn't yet tolerate. Only create stages a 404. Co-authored-by: Isaac
Use Write (not Put) on update so the first read after an update returns the stale pre-update value, like a real backend. Tests that read right after an update now wait for the new value with a content-aware retry (retry --until / --until-not). detect-change opts out: its out-of-band-modification check reads inside bundle plan, an engine-level read no test retry can fix. Co-authored-by: Isaac
f55a64e to
ed61463
Compare
Capturing $DASHBOARD in the function but reading it at the top level was confusing. Register the DASHBOARD_ID and ETAG replacements where the value is read, inside deploy_dashboard. Co-authored-by: Isaac
| if isTruePtr(config.IsServicePrincipal) { | ||
| token = testserver.ServicePrincipalTokenPrefix + tokenSuffix | ||
| testUser = testserver.TestUserSP | ||
| } else if staleOnceEnabled(testEnv) { |
There was a problem hiding this comment.
Having an environment variable for this in the test server and special token seems unnecessarily complex. Why not add an option to the acc test framework and just propogate that when creating the test server?
This test server is shared between multiple tests. Are we sure that the current test is isolated to the scope of the test that set the env var?
There was a problem hiding this comment.
Special token is needed so that shared testserver can switch the behaviour for a given client.
Regarding env var vs config field - both would work but env var can be composed with EnvMatrix and I think it would be useful to run some tests under both modes to assert that tests are resilient towards eventual consistency but not overfitting to it.
Follow-up to #5694.
We've seen the dashboard API be eventually consistent: a GET right after a create
can 404 before the write propagates. The dashboard acceptance tests run on cloud
(
Cloud = true) and read a dashboard immediately after deploy, so they are exposedto this.
(opt-in via
TESTS_STALE_ONCE=1, direct engine only) the first GET of a dashboardafter it is created returns 404. This reproduces the cloud window deterministically.
lakeview getafter eachcreate with
retry.py, so reads right after deploy are retried rather than assumedto succeed.
Test-only: no engine changes. Making the direct engine itself resilient to the same
404 (plan/refresh reads) is a separate follow-up.
This pull request and its description were written by Isaac.