**Tomas Vondra** @tomasv@fosstodon.org · May 25

What's a good way to simulate crashes / unexpected reboots on Linux?

The goal is to do better testing of crash recovery, i.e. recovery after the system unexpectedly "dies" without a chance to flush stuff to disk. It needs to be as close to losing power as possible, but scriptable/cheap to do in a loop (so preferably not a full reboot of a physical machine).

I have a PoC using LVM snapshots for this - take a snapshot, try to recover from that, do some consistency checks, etc.

Is there a better way?
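
The snapshot loop described in this post could be sketched roughly as follows. This is a minimal illustration, not the author's actual PoC: the volume group, logical volume, snapshot name, size, mount point and the recovery/check command are all hypothetical placeholders, and only the standard `lvcreate`, `mount`, `umount` and `lvremove` invocations are assumed.

```python
import subprocess


def snapshot_cycle_commands(vg, lv, check_cmd, snap="crashtest",
                            size="1G", mountpoint="/mnt/crashtest"):
    """Build the command list for one snapshot / recover / cleanup iteration.

    All names (vg, lv, snap, mountpoint) and check_cmd are placeholders
    chosen for illustration.
    """
    return [
        # Snapshot the volume while the workload is still running; the
        # snapshot then looks like the disk after a sudden stop.
        ["lvcreate", "--snapshot", "--name", snap, "--size", size, f"{vg}/{lv}"],
        # Mounting lets the filesystem replay its journal.
        ["mount", f"/dev/{vg}/{snap}", mountpoint],
        # Placeholder: start the database on the snapshot and verify it.
        list(check_cmd),
        ["umount", mountpoint],
        ["lvremove", "--force", f"{vg}/{snap}"],
    ]


def run_cycle(commands, dry_run=True):
    """Print the commands (dry run) or execute them, stopping on failure."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `run_cycle(snapshot_cycle_commands("vg0", "pgdata", ["./check.sh"]))` only prints the commands; passing `dry_run=False` would execute them and needs root. Note that this captures whatever has reached the block layer, so it says nothing about a drive's volatile write cache.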

**AndresFreundTec** @AndresFreundTec@mastodon.social · May 25

@tomasv Do you want to include drive-level effects? I.e. what happens with the write-back caches on power loss?

That'd be rather hard to test without actually simulating power loss somewhere.

**AndresFreundTec** @AndresFreundTec@mastodon.social · May 25

@tomasv You could probably just detach the PCIe device in sysfs and reattach it. That won't model power-loss capacitors etc., though.
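
For illustration, the detach/reattach idea might look like the sketch below. It assumes the usual Linux sysfs files `bus/pci/devices/<address>/remove` and `bus/pci/rescan`; the device address is a made-up example, and the `sysfs_root` parameter exists only so the functions can be tried against a scratch directory instead of the real `/sys`.

```python
import time
from pathlib import Path


def detach_pci_device(address, sysfs_root="/sys"):
    """Ask the kernel to remove one PCI device (needs root on a real system)."""
    Path(sysfs_root, "bus/pci/devices", address, "remove").write_text("1")


def rescan_pci_bus(sysfs_root="/sys"):
    """Ask the kernel to rescan the PCI bus so removed devices reappear."""
    Path(sysfs_root, "bus/pci/rescan").write_text("1")


def bounce_pci_device(address, delay=1.0, sysfs_root="/sys"):
    """Detach the device, wait a moment, then rescan to reattach it."""
    detach_pci_device(address, sysfs_root)
    time.sleep(delay)
    rescan_pci_bus(sysfs_root)
```

On a real machine this would be something like `bounce_pci_device("0000:01:00.0")`, with the address taken from `lspci -D`. The drive itself stays powered throughout, so anything sitting in its write cache is presumably not lost.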

**Tomas Vondra** @tomasv@fosstodon.org · May 25

@AndresFreundTec No, not really. My assumption is that if the drive can lose data, any sort of data corruption is possible, and we can't guarantee anything. So I'm assuming the disks behave nicely.

Or did you mean disks with power-loss protection, but losing unconfirmed writes? That might be nice, but I'll leave that for the future.

**AndresFreundTec** @AndresFreundTec@mastodon.social · May 25

@tomasv I rather strongly suspect that if you don't take write cache effects into account, you will hide a lot of FS or PG level corruption. If either FS or PG somehow misses issuing a write cache flush, a "software level" reset won't lead to the write cache being lost.

**Tomas Vondra** @tomasv@fosstodon.org · May 25

@AndresFreundTec Maybe. It's possible (likely) just killing the VM using "virsh destroy" does not discard writes that made it to the "volatile" write cache on the device. But in my current setup the VM is backed by a file, not by an actual physical device (well, not directly - the file ultimately ends up on a disk, of course).

I wonder if there's a way to discard the write cache for the image file ...

**Chris Ellis** @intrbiz@bergamot.social · May 25

@tomasv if you used NBD you could sigkill the server or maybe modify it to simulate different types of the disk doing odd things.

**Tomas Vondra** @tomasv@fosstodon.org · May 25

@intrbiz Using NBD is an interesting idea. I'm not sure just killing the NBD server would give much - it has the same issue as just killing the VM. But maybe it'd be possible to write a "proxy" that queues the writes and only sends them to the server on NBD_CMD_FLUSH.
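
The core behavior of such a proxy can be modeled in memory, as in the sketch below. It does not speak the NBD wire protocol: `write`, `flush` and `crash` are hypothetical stand-ins for handling a write request, handling `NBD_CMD_FLUSH`, and killing the proxy. Writes are queued and reach the "durable" image only on flush; a crash drops everything still queued. A real proxy would presumably also have to honor FUA writes, which this model ignores.

```python
class WriteBufferingDisk:
    """Toy block device with a volatile write queue in front of durable storage."""

    def __init__(self, size):
        self.durable = bytearray(size)  # contents that survive a crash
        self.pending = []               # (offset, data) writes not yet flushed

    def write(self, offset, data):
        """Queue a write; it is visible to reads but not yet durable."""
        if offset < 0 or offset + len(data) > len(self.durable):
            raise ValueError("write outside device")
        self.pending.append((offset, bytes(data)))

    def read(self, offset, length):
        """Read durable data overlaid with queued writes, oldest first."""
        view = bytearray(self.durable[offset:offset + length])
        for w_off, data in self.pending:
            start = max(offset, w_off)
            end = min(offset + length, w_off + len(data))
            if start < end:
                view[start - offset:end - offset] = data[start - w_off:end - w_off]
        return bytes(view)

    def flush(self):
        """Apply all queued writes to durable storage, in order."""
        for w_off, data in self.pending:
            self.durable[w_off:w_off + len(data)] = data
        self.pending.clear()

    def crash(self):
        """Simulate power loss: every unflushed write is discarded."""
        self.pending.clear()
```

A test loop would issue writes, call `crash()` at a random point, and then check that whatever is left in `durable` can still be recovered. Dropping only a subset of the queued writes, or applying them out of order, would be an obvious extension for modeling less well-behaved caches.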

**Chris Ellis** @intrbiz@bergamot.social · 2025-05-25T17:52:31Z

@tomasv yeh, that was more my thought, it's a very simple protocol and gives lots of scope for injecting any specifics you wanted.
