Technical Advisory 123371

Publication date: June 17, 2024

Description

In all versions of CockroachDB v22.2, v23.1.0 to v23.1.21, v23.2.0 to v23.2.5, and testing versions of v24.1 through v24.1.0-rc.1, changefeeds could drop events during the initial scan in some cases, causing changefeed consumers to receive incomplete data. The bug was introduced by a code change intended to reduce the number of duplicates sent when a changefeed restarts during an initial scan: when the job resumes, it uses the checkpoint to determine which spans can be skipped. That change introduced non-determinism in another part of the codebase, which would sometimes incorrectly forward the progress of every span a node was tracking to the lowest checkpoint timestamp, even though some of those spans might not have been scanned yet. This bug is now fixed by PR #123625.

Symptoms of the bug:

  • Initial scan completes but the emitted row count is less than the row count of the targeted tables.
  • The cr.node.changefeed.backfill_pending_ranges metric for some of the nodes in the cluster drops suddenly to zero without a corresponding increase in the metric on other nodes.
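
The two symptoms above can be checked mechanically. The following is a minimal sketch, not part of CockroachDB: the function names, the `min_drop` threshold, and the input shapes are hypothetical, and it assumes you have already collected the number of distinct rows your consumer received, the row counts of the targeted tables, and periodic per-node samples of cr.node.changefeed.backfill_pending_ranges.

```python
def missing_rows(emitted_distinct_rows, table_row_counts):
    """Return how many rows the initial scan appears to have dropped.

    emitted_distinct_rows: distinct rows received by the consumer (duplicates
        removed, since a restarted initial scan may re-emit rows).
    table_row_counts: hypothetical mapping of table name -> row count.
    """
    expected = sum(table_row_counts.values())
    return max(0, expected - emitted_distinct_rows)


def sudden_drops(samples, min_drop=10):
    """Flag nodes whose pending-range count falls straight to zero.

    samples: list of {node_id: pending_ranges} dicts, ordered by time.
    min_drop: assumed threshold for what counts as a "sudden" drop.
    Returns a list of (sample_index, node_id) pairs.
    """
    flagged = []
    for i, (prev, curr) in enumerate(zip(samples, samples[1:]), start=1):
        for node, before in prev.items():
            after = curr.get(node, 0)
            if before < min_drop or after != 0:
                continue
            # A drop is only suspicious if the work did not move elsewhere.
            others_before = sum(v for n, v in prev.items() if n != node)
            others_after = sum(v for n, v in curr.items() if n != node)
            if others_after <= others_before:
                flagged.append((i, node))
    return flagged


if __name__ == "__main__":
    print(missing_rows(950, {"orders": 600, "customers": 400}))
    print(sudden_drops([{1: 40, 2: 35}, {1: 38, 2: 0}, {1: 30, 2: 0}]))
```

A nonzero result from `missing_rows`, or any entry returned by `sudden_drops`, is only a hint that the changefeed may have hit this bug; neither check is conclusive on its own.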

Requirements for the bug to occur:

  • A changefeed that is running with initial_scan = 'only' or initial_scan = 'yes'.
  • The changefeed restarts (e.g. because of a transient error or because one of the nodes participating in the plan restarted) before the initial scan completes and after a checkpoint has been written.
  • The last span that a changefeed aggregator processes, in its non-deterministic processing order, is fully enclosed by the most recent checkpoint.

Factors that increase the likelihood of the bug occurring:

  • Cluster is on v23.2 or later and has the changefeed.shutdown_checkpoint.enabled cluster setting set to true.
  • The changefeed.frontier_checkpoint_frequency or the changefeed.frontier_highwater_lag_checkpoint_threshold cluster setting is set low enough that the initial scan takes many multiples of this value to complete.
  • The changefeed targets multiple tables with significantly different row counts.
  • The targeted tables are large (have many ranges).
  • The initial scan takes a long time to complete (an hour or longer).

Statement

This is resolved in CockroachDB by PR #123625, which prevents incorrect forwarding of progress for spans that have not yet been scanned by the initial scan. The fix has been applied to the maintenance releases CockroachDB v23.1.22 and v23.2.6 and to the testing release v24.1.0-rc.2. This public issue is tracked by #123371.

Mitigation

Users of CockroachDB v22.2, v23.1.0 to v23.1.21, and v23.2.0 to v23.2.5 are encouraged to upgrade to v23.1.22, v23.2.6, v24.1.0, or a later version.
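
To decide whether a given cluster needs the upgrade, the version ranges in this advisory can be encoded in a small check. This is an illustrative sketch only: `is_listed_as_affected` is a hypothetical helper, it assumes version strings of the form v23.1.21 or v24.1.0-rc.1, and it assumes that pre-release tags on series other than v24.1 can be ignored, which this advisory does not address. Versions older than v22.2 are not listed in the advisory, so the sketch reports them as not listed rather than as safe.

```python
import re

# Highest affected patch release per series, as stated in this advisory.
# None means every release in the series is affected.
AFFECTED_PATCH_MAX = {
    (22, 2): None,
    (23, 1): 21,
    (23, 2): 5,
}

_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-([a-z]+)\.(\d+))?$")


def is_listed_as_affected(version):
    """Return True if `version` falls in a range this advisory lists as affected."""
    m = _VERSION_RE.match(version)
    if m is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = (int(m.group(i)) for i in (1, 2, 3))
    tag, tag_num = m.group(4), m.group(5)

    if (major, minor) == (24, 1):
        # Only testing versions of v24.1.0 through rc.1 are affected.
        if patch != 0 or tag is None:
            return False
        if tag == "rc":
            return int(tag_num) <= 1
        # Assumption: any other tag (alpha, beta) predates rc.1.
        return True

    if (major, minor) not in AFFECTED_PATCH_MAX:
        return False
    max_patch = AFFECTED_PATCH_MAX[(major, minor)]
    # Assumption: pre-release tags on these series are ignored.
    return max_patch is None or patch <= max_patch


if __name__ == "__main__":
    for v in ("v23.1.21", "v23.1.22", "v23.2.6", "v24.1.0-rc.1", "v24.1.0"):
        print(v, is_listed_as_affected(v))
```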

Impact

Changefeeds could drop events during the initial scan in some cases, causing changefeed consumers to receive incomplete data. Versions affected include CockroachDB v22.2, v23.1.0 to v23.1.21, v23.2.0 to v23.2.5, and testing versions of v24.1 through v24.1.0-rc.1.

Questions about any technical alert can be directed to our support team.

