Does snapshot happen atomically at a SINGLE point in time?

Suppose I memory map a 100G file and I have a single process which is constantly changing the values everywhere (e.g., a banking app with 1B accounts doing 1M DB/CR per second randomly over 100G of RAM).

If I do a snapshot on the volume that the file is in, is it done atomically so that all pages of the file are frozen in my app at a specific point in time? Or does it take the snap incrementally over the pages in my file, so that some pages might be snapshotted at a slightly different time (within a msec)?

I’m guessing that all files in a volume are atomically snapshotted at a single moment in time and that I have to wait for the snapshot to return (a few msec) until I can be sure it is done and I can continue. So I should stop processing, snap and wait for the snap to return, then start processing again, so that I know exactly what my state was when the snap was taken. I don’t want to snap in the middle of a DB/CR txn, since that would be an inconsistent state; I always want to snap after a known transaction has completed.

Is there a way to avoid the 3 msec wait for the snap to complete, or is that the best you can do? I’m guessing I have to wait and I should just call it synchronously from within my app at a time when everything is consistent.

And presumably doing fsync or sync before a snap makes no difference whatsoever as far as what snap is produced.

As far as I know, a snapshot is done atomically.

This is mostly handled by CoW (copy on write). When a disk block has to be updated, the new data is written to a new disk block and the old disk block remains intact. So creating a snapshot is mostly just generating an index for blocks assigned to files.

However, I don’t personally use “btrfs”, and I don’t claim any expertise on the details.
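The copy-on-write idea described above can be sketched with a toy model. This is only an illustration: the class `CowVolume` and its methods are made up for this example, and real btrfs shares B-tree nodes between the volume and its snapshots rather than copying a flat index as done here.

```python
class CowVolume:
    """Toy copy-on-write volume: an index maps logical blocks to immutable data blocks."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes, never overwritten in place
        self.index = {}       # logical block number -> block id (the live volume)
        self.snapshots = {}   # snapshot name -> frozen copy of an index
        self._next_id = 0

    def write(self, logical, data):
        # Copy on write: allocate a new block instead of overwriting the old one.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        self.index[logical] = block_id

    def snapshot(self, name):
        # Only the index is copied; the data blocks themselves are shared.
        self.snapshots[name] = dict(self.index)

    def read(self, logical, snapshot=None):
        index = self.index if snapshot is None else self.snapshots[snapshot]
        return self.blocks[index[logical]]


vol = CowVolume()
vol.write(0, b"balance=100")
vol.snapshot("before")
vol.write(0, b"balance=250")
print(vol.read(0))                     # b'balance=250': live volume sees the new block
print(vol.read(0, snapshot="before"))  # b'balance=100': snapshot still points at the old block
```

Because the old block is never modified, taking the snapshot costs only the index copy, no matter how much data the volume holds.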

From the btrfs point of view a snapshot is atomic. But that does not automatically make it consistent from the point of view of applications using the filesystem. So either your application needs to implement extended logic to recover from incomplete data (intent journal, rollback, etc.) or you need to do exactly as you describe: pause processing while the application is in a consistent state, capture this state in a snapshot, then continue processing. That is how it is implemented by virtually every backup application out there.
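A minimal sketch of the "pause, snapshot, resume" approach for a memory-mapped file follows. It assumes a single process in which every transaction takes the same lock. The names `txn_lock`, `apply_transfer` and `snapshot_when_consistent`, and the 8-byte balance layout, are hypothetical. The only external command used is `btrfs subvolume snapshot -r <source> <dest>`, and the `run` parameter exists only so the command can be swapped out when trying the code without a btrfs volume.

```python
import mmap
import subprocess
import threading

# Hypothetical lock: every debit/credit transaction holds it while it runs.
txn_lock = threading.Lock()


def apply_transfer(mm, debit_off, credit_off, amount):
    """Move `amount` between two 8-byte little-endian balances as one unit."""
    with txn_lock:
        debit = int.from_bytes(mm[debit_off:debit_off + 8], "little", signed=True)
        credit = int.from_bytes(mm[credit_off:credit_off + 8], "little", signed=True)
        mm[debit_off:debit_off + 8] = (debit - amount).to_bytes(8, "little", signed=True)
        mm[credit_off:credit_off + 8] = (credit + amount).to_bytes(8, "little", signed=True)


def snapshot_when_consistent(mm, source, dest, run=subprocess.run):
    """Block new transactions, flush the mapping, snapshot, then resume."""
    with txn_lock:                       # no transaction is half applied here
        mm.flush()                       # msync: push dirty mapped pages to the file
        cmd = ["btrfs", "subvolume", "snapshot", "-r", source, dest]
        run(cmd, check=True)             # returns once the snapshot command has finished
    return cmd
```

Holding the lock for the duration of the snapshot command is exactly the few-msec pause the original question asks about. The sketch accepts that pause in exchange for knowing which transactions are in the snapshot.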

Note that snapshots do not image everything on the system; essentially the target is the system files. Unless you modify the rules or put user data in system areas, your working files are not imaged.

Snaps are taken automatically before and after an update, and also periodically, once per week I think.

Technically,
Although I think all understand what is meant and intended in the original post,

Snapshots, and in particular as created by BTRFS cannot be considered atomic.
If you look up the definition, “atomic” normally describes a specific transaction flow where, if data integrity is threatened (i.e. the snapshot is attempted in the middle of a long-running, incomplete transaction), the transaction is rolled back.

That doesn’t happen when you create a snapshot, instead the state is frozen and changes are managed separately… Then after the snapshot has been created the changes are appended/merged to the snapshot to update to current.

So, your question should have been more along the line of how and whether a snapshot can guarantee its own integrity.

The description by @nrickert can be considered mostly correct. My only criticism is that, as I understand it, ordinary file changes without snapshotting can happen that way as well in various filesystems (what actually happens varies from one fs to another). The description by @arvidjaar is accurate.

TSU

More accurately, the current default snapper policy creates a snapshot on every bootup, shutdown and whenever zypper executes a package operation (eg install, update) and retains/purges snapshots on its own schedule.

For more info about BTRFS,
I’ve compiled what I consider authoritative references at the following link.
https://en.opensuse.org/User:Tsu2#BTRFS

TSU

Great this is what I was looking for. Basically the snap is the state of all files (both mmapped and on disk) in that volume at that EXACT moment in time when it twiddles some magic counter or variable that anyone writing to a page will subsequently reference before modifying a page.

So at that magical time, I would imagine that any modifications “in process” at the time of the snap will finish up, so that the data being written becomes included in the snapshot (since otherwise you’d have to check for a new snap for every byte you wrote, which would be insane).

So I’d guess the snap is technically not atomic (which for my question I define as “at a single point in time”) because it doesn’t stop I/O in progress.

But maybe I am wrong…maybe snap waits for everyone modifying any file to finish before it says “OK, from now on, it’s COW”.

I’m guessing halting the system to wait for all writing to cease would be a bit “over the top” so my guess is that snapping is pretty close to an atomic snap WITH the exception of any writes that were in process at the time which would be considered to be part of the snapshot even though they technically did not finish at the moment the snapshot took “effect”.

Practically speaking, I’ll halt, snap, and resume, so this doesn’t matter to me; it’s more a matter of intellectual curiosity. It’s a pretty cool feature.

You are assuming there is a continuous flow of IO. That is not how it works. btrfs performs IO in a series of transactions. Either a transaction is completed and all its modifications are applied, or none are. A transaction is the minimal unit of filesystem changes that can be applied.

maybe snap waits for everyone modifying any file to finish before it says “OK, from now on, it’s COW”.

You are mistaken again. Snapshots do not work at the file level, they work at the block level. When a snapshot is created, new metadata records the shared state of blocks (extents) in the source subvolume, and a transaction is initiated that writes this metadata to disk. So a snapshot is atomic with respect to the on-disk filesystem state: either the transaction is committed and the snapshot is present, or the transaction is aborted for whatever reason and the snapshot is not present. It is impossible (sans bugs) to have a partial snapshot of some subvolume content, or a snapshot that contains data from later transactions.

That said, there is no serialization with other processes doing IO, which means that if process A writes to a file and process B creates a snapshot at the same time, it is undefined whether the changes made by process A will be part of the snapshot.

"An “atomic COW snapshot”—easily the most hilarious-sounding feature ever to grace a filesystem—is an image of the entire filesystem in exactly the condition it was in at a given instant in time, no matter what else was transpiring at the time

So if you take a snapshot of a filesystem at 8:13 and 32 seconds pm on December 19, 2013, that snapshot will contain every single byte of that filesystem at exactly 8:13 and 32 seconds pm on December 19, 2013—period, no ifs, ands, or buts. This helps keep high-activity structures like databases consistent. As long as the database uses journaling (and if it doesn’t, upgrade!), its journal will be consistent in the snapshot. Any partially completed transactions can be cleanly rolled back instead of leaving the database in an inconsistent state."

Bitrot and atomic COWs: Inside “next-gen” filesystems
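The rollback the quoted article mentions can be illustrated with a toy journal. This is a sketch under stated assumptions, not how any particular database does it: the record format (`BEGIN`, `UPDATE`, `COMMIT` tuples) and the function `recover` are invented for the example, and the recovery simply ignores transactions that never reached `COMMIT`.

```python
def recover(journal):
    """Replay a toy journal, keeping only transactions that reached COMMIT."""
    balances = {}
    pending = {}   # txn id -> list of (account, delta) not yet committed
    for record in journal:
        kind, txn = record[0], record[1]
        if kind == "BEGIN":
            pending[txn] = []
        elif kind == "UPDATE":
            _, _, account, delta = record
            pending[txn].append((account, delta))
        elif kind == "COMMIT":
            for account, delta in pending.pop(txn):
                balances[account] = balances.get(account, 0) + delta
    # Anything still in `pending` was in flight at snapshot time: discard it.
    return balances, sorted(pending)


# Journal as it might look inside a snapshot taken mid-transaction.
journal = [
    ("BEGIN", 1), ("UPDATE", 1, "alice", -50), ("UPDATE", 1, "bob", +50), ("COMMIT", 1),
    ("BEGIN", 2), ("UPDATE", 2, "bob", -20),   # snapshot taken here, no COMMIT for txn 2
]
balances, rolled_back = recover(journal)
print(balances)      # {'alice': -50, 'bob': 50}
print(rolled_back)   # [2]
```

The snapshot itself is still a faithful image of the disk at one instant. It is the application's journal that turns that image into a consistent application state.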

Unfortunately,
You actually describe the most common scenario that is actually broken by snapshotting…
When you’re talking about a database and database transactions, you can have long-running transactions (sequential flow, very large data changes), and since a snapshot is completely unaware of these kinds of operations, it cannot guarantee data integrity.

This is why snapshots should expressly never be enabled where there is a database unless some day the BTRFS snapshot is made “application aware.”
So, for example, in the MS Windows world there is a snapshot technology called “Volume Shadow Copy”, where plugins are written to make it aware of specific database and mail applications, so that the application (and its activity) is issued a suspend command, and only once the suspend has happened is the snapshot created.

BTRFS snapshots have no such “application awareness” so will create snapshots immediately regardless of application state and if data is in flight it’s anyone’s guess what will be in your snapshot.

So yes… although there is no such thing as an atomic snapshot (go ahead and look that up, you won’t find anything except a fairly specialized situation that has nothing to do with what we’re talking about here), there definitely is such a thing as an “atomic transaction.”

BTW -
Disregard anything anyone has to say about bitrot… It’s FUD.
I’ve spoken to disk manufacturers directly who do extensive testing on their own products, and they say that as long as you care for your disk there is no such thing as bitrot.

TSU

That is exactly the answer to the question I was asking and I’ll leave it at that! Thank you!

Originally posted by arvidjaar (https://forums.opensuse.org/showthread.php?p=2907772#post2907772):
That said, there is no serialization with other processes doing IO, which means that if process A writes to a file and process B creates a snapshot at the same time, it is undefined whether the changes made by process A will be part of the snapshot.

Actually, the problem is not if a process has written or not at the time the snapshot is taken…

An example of the actual problem is if you were to modify the data in multiple linked tables in a database, the database would have to make changes in each and every linked table for the data’s integrity to be preserved.
Take for instance if you deposit money in your bank account.
The process of depositing money might require multiple steps, removing money from one account and placing in another account, and each step might further need to go through additional steps like authentication and authorization, ensuring that the proposed transaction won’t violate banking rules, and more.
You want the entire transaction to succeed, with the money verified in your account. If the transaction was interrupted, you don’t want to walk away believing your money was deposited when it wasn’t; if you know the transaction failed, you’d want to cancel it and try again. From the bank’s perspective, it wouldn’t want money withdrawn from one account and then never deposited at its destination, leaving the electronic funds in some unaccountable limbo.

The above is a rough example of what an atomic transaction is supposed to accomplish, verifying the completion of the transaction which ensures the integrity of the data… so you can’t have half-transactions that corrupt the data.

Note that when part of the data has been written and part has not, each write function has completed successfully at the “write data” level. The issue is the more complex idea that multiple writes make up a larger collection of data that must itself have transactional integrity.
In the above money deposit example, you wouldn’t want to have the transaction fully completed until every step has been completed and both the deposited and withdrawn funds verified in their respective accounts.

A BTRFS snapshot that is unaware of a situation like the one described is like cutting the power as soon as funds have been withdrawn but not yet deposited, or vice versa. If any part of the transaction was uncompleted, you’d have “successful writes”, i.e. parts of the transaction completed successfully but other parts not, and whatever cut the power would have no way of knowing what was completed. In the same way, a BTRFS snapshot has no idea what might be happening in a database application at the moment the snapshot is taken. All the snapshot knows is that writes have completed; it does not know whether more writes are needed to preserve data integrity.
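The "funds in limbo" situation can be shown with a toy example. Everything here is an assumption for illustration: `transfer` is a made-up function, and a plain dictionary copy stands in for a snapshot taken between the two writes.

```python
def transfer(accounts, src, dst, amount, snapshot_between=False):
    """Apply a two-step transfer; optionally copy the state between the steps."""
    snap = None
    accounts[src] -= amount            # step 1: debit, a complete write by itself
    if snapshot_between:
        snap = dict(accounts)          # stand-in for a snapshot at this instant
    accounts[dst] += amount            # step 2: credit
    return snap


accounts = {"alice": 100, "bob": 100}
snap = transfer(accounts, "alice", "bob", 30, snapshot_between=True)

print(sum(accounts.values()))  # 200: the live data is consistent again
print(sum(snap.values()))      # 170: the snapshot caught the money in limbo
```

Both individual writes succeeded, yet the copy taken between them violates the invariant that the total stays at 200. This is why the application has to either pause at a transaction boundary or be able to recover from a mid-transaction image.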

HTH,
TSU

Fortunately I am only quoting a bold claim from the above linked article. I don’t buy that claim and therefore posted it for comment. Btrfs snapshots are not atomic in a strict sense, but they do represent exactly the condition the filesystem was in at a given instant in time. I perfectly agree regarding application awareness.

Bitrot did indeed occur with some HDDs, e.g. 100 sectors flagged as “198 Offline_Uncorrectable” by smartctl on a ST380020. No more failed sectors for a decade!