[sle-beta] Transactional Updates in Leap 15

Thu Apr 5 06:37:08 MDT 2018

On Thu, 2018-04-05 at 13:25 +0100, Joe Doupnik wrote:
> 
> ---------
>      The referenced doc is interesting to read and think about. Alas, 
> patching nirvana is still on back-order.
>      My thinking (yours will likely vary) is as follows. The 
> snapshot-like temp area can be as large as the main o/s file system 
> (root) because we put more into that area than just the o/s and we try 
> to avoid over partitioning etc. Of particular concern is the arrow of 
> time, which means changes occur on the running system after snapping and 
> where also patches will eventually appear. Thus the snapshot becomes out 
> of date after the first such change to the running system. Think about 
> databases, linked systems, memory cached data and the like. Thus a 
> snapshot may not be a safe item to restore in many cases.

And that problem is addressed by the fact that the root filesystem is read-only in the system role in
question.
"The arrow of time" can't change anything in the root filesystem.

Note, by "root filesystem", we mean the contents of the "root subvolume" on the btrfs partition.
Any other subvolume is not snapshotted, not considered part of the transactional update, and not read-only.
This includes (but is not limited to) gems such as /var, /root and /home, where we expect users and services
to continue their merry tasks of writing nonsense and worthwhile data to those locations.
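
To make that division concrete, here is a minimal sketch in Python. It is only a toy model, not how btrfs or
snapper decide anything: the function name is made up for illustration, and the subvolume list is just the
three examples named above (a real installation has more).

```python
from pathlib import PurePosixPath

# Illustrative assumption: only the three subvolumes mentioned in the text.
# A real system defines more separate subvolumes than these.
WRITABLE_SUBVOLUMES = ("/var", "/root", "/home")


def in_root_snapshot(path: str) -> bool:
    """Return True if `path` falls in the read-only, snapshotted root subvolume.

    Hypothetical helper: anything at or below a separate subvolume is treated
    as writable and outside the transactional update.
    """
    p = PurePosixPath(path)
    for sub in WRITABLE_SUBVOLUMES:
        s = PurePosixPath(sub)
        if p == s or s in p.parents:
            return False
    return True


if __name__ == "__main__":
    for candidate in ("/usr/bin/zypper", "/etc/fstab", "/var/lib/mysql", "/home/joe"):
        state = "read-only, snapshotted" if in_root_snapshot(candidate) else "writable, not snapshotted"
        print(f"{candidate}: {state}")
```

In this model a database under /var keeps writing happily, and nothing it writes is ever part of a snapshot
that could later be rolled back underneath it.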

This division between "the OS root filesystem" and "everything else" is working quite well, hence the
promotion of this feature from its previous incarnation, available only in CaaSP & Kubic, to one suitable
for a broader audience in Leap and Tumbleweed.

>      What makes more sense to me is patch a quiescent system. That would 
> mean accumulate the new change sets and then bring up the system in 
> memory based rescue mode where the regular file systems is/are otherwise 
> not enabled. The scheme then tries applying patches one by one (with a 
> log to revert), and if a failure occurs then consider undoing them all, 
> with optional variations about accepting some regardless and so forth. 
> This eliminates concerns about open files, memory caches, partial 
> transactions, interaction amongst machines, huge extra disk space, not 
> being restricted to BTRFS, and likely a few more nuances. It also avoids 
> yet another installation-time-only option and thus can be used well 
> after a machine has been built in an ordinary manner.

This approach would be incredibly long-winded. Would users really be willing to have their systems offline for
so long while they patch so slowly and serially? One of the benefits of our approach is that we can patch in a
threaded manner: every package applies its changeset as fast and as much in parallel as the system allows, but
none of those changes are written to the running snapshot, so the running system is not affected until the
next reboot.
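
The flow described above can be sketched as a toy model in Python. This is a sketch under stated assumptions,
not the actual implementation: the class, its methods, and the package names and versions are all hypothetical,
and a plain dict stands in for a btrfs snapshot.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ToySystem:
    """Hypothetical model: snapshots are dicts mapping package name to version."""

    def __init__(self, packages):
        self.snapshots = {1: dict(packages)}
        self.running = 1   # snapshot the system is currently booted from
        self.default = 1   # snapshot that the next boot will use
        self._lock = threading.Lock()

    def transactional_update(self, updates):
        """Apply all updates to a new snapshot; return True on success."""
        candidate = dict(self.snapshots[self.default])  # copy, never the running one

        def apply(item):
            name, version = item
            if version is None:  # stand-in for a package that fails to install
                raise RuntimeError(f"failed to update {name}")
            with self._lock:
                candidate[name] = version

        try:
            with ThreadPoolExecutor() as pool:
                list(pool.map(apply, updates.items()))  # packages applied in parallel
        except RuntimeError:
            return False  # candidate is discarded, nothing else changes

        new_id = max(self.snapshots) + 1
        self.snapshots[new_id] = candidate
        self.default = new_id  # takes effect only at the next reboot
        return True

    def reboot(self):
        self.running = self.default

    def installed(self):
        return self.snapshots[self.running]


if __name__ == "__main__":
    system = ToySystem({"kernel": "4.12", "glibc": "2.26"})
    system.transactional_update({"kernel": "4.13", "glibc": "2.27"})
    print("before reboot:", system.installed())
    system.reboot()
    print("after reboot:", system.installed())
```

The point of the model is that the running snapshot is never written to: either every change lands in the new
snapshot and becomes active at the next boot, or the new snapshot is thrown away and the system carries on
exactly as it was.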

Doesn't this strike the best balance of using the system's hardware efficiently and minimising downtime, while
still ensuring updates happen atomically?

-- 
Richard Brown
Linux Distribution Engineer - Future Technology Team
Chairman - openSUSE

Phone +4991174053-361
SUSE Linux GmbH,  Maxfeldstr. 5,  D-90409 Nuernberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton, 
HRB 21284 (AG Nürnberg)