Root Cause Analysis (RCA) on disappeared SLE BCI 15 SP5 based containers on registry.suse.com

Marcus Meissner meissner at suse.de
Mon Jul 1 10:03:03 UTC 2024


Hi,

(original at https://github.com/SUSE/bci/discussions/25 )

Between Saturday June 22nd 04:15 UTC and Tuesday June 25th 09:30 UTC
several containers from SLE BCI disappeared or were replaced by previous
versions on registry.suse.com.

On Saturday morning, a release event of an updated SLE BCI container led
to all SLE BCI containers being deleted from registry.suse.com. This was
recognized and reported on Saturday by internal and external users. SUSE
investigated the reports and identified that the container build and
publishing system failed to complete an important aggregation step.
Rather than collecting the containers to publish to the registry,
it incorrectly finished with no aggregations due to an erroneous code
change in the build system. This led to the temporary loss of containers
with metadata in the public registry.suse.com instance. In some cases,
prior versions of the containers reappeared under the ":latest" or
other respective floating tags. These containers reverted to the state
of the registry as of approximately July 2023. This caused a number of
failures in CI systems, and patches applied since July 2023 may no
longer have been effective in the affected containers. Some users
may also have experienced difficulties launching SLE BCI containers.

A restoration of the missing containers in registry.suse.com started on
Monday, June 24th around 08:30 UTC. This process was completed by Tuesday,
June 25 at 09:32 UTC.

No data was modified or tampered with from the outside. No intrusion
occurred, and no integrity was violated within our systems.


Technical details


On Friday, June 20 2024, the SUSE Build Operations Team deployed to the
Internal Build Service a code change intended to aggregate Helm charts.
Due to a logic error not caught by tests, this code change led to the
deletion rather than the aggregation of containers.

On Saturday morning, the usual automated updating pipeline of SLE BCI
ran, leading to a new aggregation with a now empty result. This resulted
in the temporary deletion of all SLE 15 SP5 based containers on the
registry.
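To illustrate why an empty aggregation result is so destructive, here is
a minimal Python sketch of a mirror-style publish step. It is a simplified
model and not the Build Service code: the function plan_sync and the tag
names are hypothetical, and it assumes the publisher makes the registry
match whatever the aggregation step produced.

```python
def plan_sync(registry_tags, aggregated_tags):
    """Return (to_push, to_delete) so the registry mirrors the aggregation result."""
    registry = set(registry_tags)
    aggregated = set(aggregated_tags)
    to_push = sorted(aggregated - registry)    # new content to publish
    to_delete = sorted(registry - aggregated)  # content no longer aggregated
    return to_push, to_delete


# Hypothetical tag names, for illustration only.
registry = ["bci/bci-base:15.5", "bci/python:3.11"]

# Normal run: the aggregation collects everything, so nothing is deleted.
print(plan_sync(registry, ["bci/bci-base:15.5", "bci/python:3.11"]))

# Faulty run: the aggregation finishes empty, so every tag is scheduled
# for deletion.
print(plan_sync(registry, []))
```

In this model the publish step itself behaves as designed; the damage comes
entirely from trusting an empty input.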

Within hours, users and customers began alerting SUSE to the issue. SUSE
analyzed the issue and escalated it internally to the responsible
teams through Saturday and Sunday. On Sunday, part of the incident (the
issue that triggered HTTP 500 errors on some registry pages) was
resolved. On Monday around 08:13 UTC, the remaining issue was resolved by
deploying a fix and re-running the aggregation step. Over the course of
about 24 hours, all content was restored on the SUSE registry. Once the
restoration was complete, the issues described above were resolved.


Learnings


SUSE has taken the following learnings from this incident, which have
already been implemented or will be implemented shortly.

- Implemented automated test cases in the Open Build Service codebase
  to ensure aggregation behaves correctly.
- Adding an extra safeguard that prevents an automated registry
  push in the event critical tags have been removed.
- Add a build service configuration setting to not publish
  container architectures individually, which will prevent
  inconsistencies and mismatched architectures that were
  experienced in the described incident.
- Adding an extra safeguard for the monitoring of critical
  tags, downgrades, or failures to update.
- Improve response times in the event of incidents.
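
The safeguards in the list above could take the shape of a check that runs
before any automated push. The following Python sketch is only an
illustration under stated assumptions: check_publish, parse_version,
CRITICAL_TAGS, and the tag and version values are all hypothetical, and
versions are assumed to be purely numeric dotted strings.

```python
# Hypothetical set of tags that must never disappear from the registry.
CRITICAL_TAGS = {"bci/bci-base:latest", "bci/bci-base:15.5"}


def parse_version(version):
    """Turn a dotted version such as '15.5.36' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def check_publish(current, planned, critical_tags=CRITICAL_TAGS):
    """Return reasons to block the push; an empty list means it may proceed.

    current and planned map a tag to the version string it points to.
    """
    problems = []
    if not planned:
        problems.append("planned publish set is empty")
    for tag in sorted(critical_tags):
        if tag in current and tag not in planned:
            problems.append(f"critical tag removed: {tag}")
    for tag, new_version in sorted(planned.items()):
        old_version = current.get(tag)
        if old_version and parse_version(new_version) < parse_version(old_version):
            problems.append(f"downgrade: {tag} {old_version} -> {new_version}")
    return problems


current = {"bci/bci-base:latest": "15.5.36", "bci/bci-base:15.5": "15.5.36"}

# An empty aggregation result is rejected before anything is pushed.
print(check_publish(current, {}))

# A floating tag pointing back to an older version is flagged as a downgrade.
print(check_publish(current, {"bci/bci-base:latest": "15.5.10",
                              "bci/bci-base:15.5": "15.5.36"}))
```

Returning a list of reasons rather than a boolean lets the same check feed
both the push gate and the monitoring alerts.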

