What's a good way to simulate crashes / unexpected reboots on Linux?
The goal is to do better testing of crash recovery, i.e. recovery after the system unexpectedly "dies" without a chance to flush stuff to disk. It needs to be as close to losing power as possible, but scriptable and cheap to run in a loop (so preferably not a full reboot of a physical machine).
I have a PoC using LVM snapshots for this - take a snapshot, try to recover from that, do some consistency checks, etc.
Is there a better way?
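One iteration of the snapshot approach could look roughly like the sketch below. It is not the actual PoC: the volume group, volume names, mount point and check command are all hypothetical placeholders, and `runner` is only there so the command sequence can be exercised without root.

```python
import subprocess


def run_snapshot_cycle(vg, lv, snap, mountpoint, check_cmd,
                       size="1G", runner=subprocess.run):
    """Snapshot vg/lv, mount the snapshot, run a check, then clean up.

    Returns True if check_cmd exited with status 0.
    All names passed in are placeholders chosen by the caller.
    """
    # The snapshot is a block-level image of whatever had reached the
    # volume at that instant, much like the state after a sudden stop.
    runner(["lvcreate", "--snapshot", "--size", size,
            "--name", snap, f"{vg}/{lv}"], check=True)
    try:
        runner(["mount", f"/dev/{vg}/{snap}", mountpoint], check=True)
        try:
            # check_cmd is assumed to start recovery on the mounted copy
            # and verify consistency; a non-zero exit means failure.
            result = runner(list(check_cmd), check=False)
            return result.returncode == 0
        finally:
            runner(["umount", mountpoint], check=True)
    finally:
        # Always drop the snapshot so the next iteration starts clean.
        runner(["lvremove", "--yes", f"{vg}/{snap}"], check=True)
```

Calling this in a loop while a write-heavy workload runs against the origin volume gives many different "crash points" without rebooting anything.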
@tomasv Do you want to include drive-level effects? I.e. what happens with the write-back caches on power loss?
That'd be rather hard to test without actually simulating power loss somewhere.
@tomasv You could probably just detach the PCIe device in sysfs and reattach it. That won't model power-loss capacitors etc., though.
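A minimal sketch of the sysfs detach/reattach idea, assuming the standard Linux `remove` and `rescan` attributes under `/sys/bus/pci`. The PCI address is a placeholder, the delay is an arbitrary guess, and running it for real needs root and will yank the device out from under anything using it.

```python
import time
from pathlib import Path


def _write_sysfs(path, value):
    # Writing "1" to these attributes triggers the kernel action.
    Path(path).write_text(value)


def pci_detach_reattach(address, delay=1.0,
                        sysfs_pci="/sys/bus/pci", write=_write_sysfs):
    """Hot-remove one PCI device, wait, then rescan the bus.

    address is a full PCI address such as "0000:01:00.0" (placeholder).
    Returns the two sysfs paths that were written.
    """
    remove_path = f"{sysfs_pci}/devices/{address}/remove"
    rescan_path = f"{sysfs_pci}/rescan"
    write(remove_path, "1")   # device disappears from the system
    time.sleep(delay)         # give in-flight I/O time to fail
    write(rescan_path, "1")   # kernel rediscovers the device
    return remove_path, rescan_path
```

Since the device itself never loses power, this only tests how the software stack reacts to the disk vanishing, not what the drive does with its own cache.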
@AndresFreundTec No, not really. My assumption is that if the drive can lose data, any sort of data corruption is possible, and we can't guarantee anything. So I'm assuming the disks behave nicely.
Or did you mean disks with powerloss protection, but losing unconfirmed writes? That might be nice, but I'll leave that for the future.
@tomasv I rather strongly suspect that if you don't take write cache effects into account, you will hide a lot of FS or PG level corruption. If either the FS or PG somehow fails to issue a write cache flush, a "software level" reset won't expose that, because the contents of the write cache are not lost.
@AndresFreundTec Maybe. It's possible (likely) that just killing the VM using "virsh destroy" does not discard writes that made it to the "volatile" write cache on the device. But in my current setup the VM is backed by a file, not by an actual physical device (well, not directly - the file ultimately ends up on a disk, of course).
I wonder if there's a way to discard the write cache for the image file ...
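For reference, the kill-the-VM loop could be scripted along these lines, using `virsh destroy` (an immediate, ungraceful stop) followed by `virsh start`. The domain name and the `check` callable are hypothetical, and as noted above this does not discard writes the host has already accepted for the image file.

```python
import subprocess


def hard_reset_domain(domain, runner=subprocess.run):
    """Stop the guest without a clean shutdown, then boot it again."""
    runner(["virsh", "destroy", domain], check=True)
    runner(["virsh", "start", domain], check=True)


def crash_loop(domain, iterations, check, runner=subprocess.run):
    """Repeat the hard reset and count iterations where check() fails.

    check is a caller-supplied function (an assumption of this sketch)
    that waits for the guest to recover and returns True if the data
    inside looks consistent.
    """
    failures = 0
    for _ in range(iterations):
        hard_reset_domain(domain, runner)
        if not check():
            failures += 1
    return failures
```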
@tomasv If you used NBD, you could SIGKILL the server, or maybe modify it to simulate the disk misbehaving in different ways.
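The SIGKILL part might be sketched as below. The server command line is left entirely to the caller (no particular NBD server or flags are assumed), and `workload` is a hypothetical callable that drives writes through the exported device. Whether unflushed writes are actually lost depends on how the server buffers them, which this sketch does not control.

```python
import signal
import subprocess
import time


def run_then_sigkill(server_cmd, workload, settle=0.5):
    """Start a server process, run a workload against it, then SIGKILL it.

    Returns the server's return code; on POSIX a process ended by
    SIGKILL reports the negated signal number.
    """
    proc = subprocess.Popen(server_cmd)
    try:
        time.sleep(settle)   # crude wait for the server to come up
        workload()           # caller-supplied writes via the NBD device
    finally:
        # SIGKILL cannot be caught, so the server gets no chance to
        # flush or shut down cleanly.
        proc.send_signal(signal.SIGKILL)
        proc.wait()
    return proc.returncode
```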