What's a good way to simulate crashes / unexpected reboots on Linux?
The goal is to do better testing of crash recovery, i.e. recovery after the system unexpectedly "dies" without a chance to flush stuff to disk. It needs to be as close to losing power as possible, but also scriptable and cheap to run in a loop (so preferably not a full reboot of a physical machine).
I have a PoC using LVM snapshots for this: take a snapshot, try to recover from it, do some consistency checks, etc.
Is there a better way?
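For reference, a minimal sketch of what one iteration of such a snapshot loop could look like in Python. The function name `snapshot_cycle`, the snapshot name `crashtest`, and the `check` callback are hypothetical and not taken from the PoC; only the `lvcreate`/`lvremove` invocations are standard LVM commands. Running it for real needs root and an existing volume group.

```python
import subprocess


def snapshot_cycle(vg, lv, check, snap="crashtest", size="1G", run=subprocess.run):
    """Snapshot vg/lv, run check(device_path) on it, then drop the snapshot.

    `check` is a hypothetical callback that attempts recovery on the
    snapshot device and returns whatever verdict the caller wants.
    """
    origin = f"{vg}/{lv}"
    device = f"/dev/{vg}/{snap}"
    # Create a point-in-time snapshot of the origin volume.
    run(["lvcreate", "--snapshot", "--name", snap, "--size", size, origin], check=True)
    try:
        return check(device)
    finally:
        # Always remove the snapshot so the next iteration starts clean.
        run(["lvremove", "--force", f"{vg}/{snap}"], check=True)
```

The `run` parameter exists only so the command sequence can be exercised without touching real volumes; in a real loop the default `subprocess.run` would be used.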
@tomasv Do you want to include drive-level effects? I.e. what happens with the write-back caches on power loss?
That'd be rather hard to test without actually simulating power loss somewhere.
@tomasv You could probably just detach the PCIe device in sysfs and reattach it. That won't model power-loss capacitors etc., though.
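A rough sketch of the sysfs detach/reattach idea, assuming the standard Linux PCI sysfs files (`remove` under the device directory, `rescan` under the bus). The function name and the `sysfs_root` parameter are made up for illustration, and the device address has to be looked up separately (e.g. with `lspci -D`). This needs root, and whether the drive's volatile cache is actually lost this way is not something the sketch can guarantee.

```python
from pathlib import Path


def pci_remove_and_rescan(address, sysfs_root="/sys/bus/pci"):
    """Detach a PCI device by address (e.g. "0000:01:00.0"), then rescan the bus."""
    root = Path(sysfs_root)
    # Writing "1" to the device's `remove` file asks the kernel to detach it.
    (root / "devices" / address / "remove").write_text("1")
    # Writing "1" to the bus-level `rescan` file re-enumerates devices,
    # which should bring the detached device back.
    (root / "rescan").write_text("1")
```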
@AndresFreundTec No, not really. My assumption is that if the drive can lose data, any sort of data corruption is possible, and we can't guarantee anything. So I'm assuming the disks behave nicely.
Or did you mean disks with powerloss protection, but losing unconfirmed writes? That might be nice, but I'll leave that for the future.
@tomasv I, rather strongly, suspect that if you don't take write cache effects into account you will hide a lot of FS or PG level corruption. If either FS or PG somehow misses issuing a write cache flush, a "software level" reset won't cause the write cache to be lost, so the missing flush goes unnoticed.
@tomasv yeh, that was more my thought, it's a very simple protocol and gives lots of scope for injecting any specifics you wanted.
@intrbiz Using NBD is an interesting idea. I'm not sure just killing the NBD server would give much - it has the same issue as just killing the VM. But maybe it'd be possible to write a "proxy" that queues the writes and only sends them to the server on NBD_CMD_FLUSH.
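To make the proxy idea concrete, here is a small in-memory model of the queuing logic only. It does not speak the NBD protocol, and the class and method names are invented for the example. Writes sit in a pending queue, `flush()` plays the role of NBD_CMD_FLUSH, and `crash()` drops whatever was never flushed.

```python
class FlushGatedDevice:
    """Toy block device: writes become durable only after flush()."""

    def __init__(self, size):
        self.durable = bytearray(size)  # contents that survive a crash
        self.pending = []               # (offset, data) written but not flushed

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > len(self.durable):
            raise ValueError("write outside device bounds")
        self.pending.append((offset, bytes(data)))

    def read(self, offset, length):
        # Reads see pending writes, like a drive answering from its cache.
        view = bytearray(self.durable)
        for off, data in self.pending:
            view[off:off + len(data)] = data
        return bytes(view[offset:offset + length])

    def flush(self):
        # Apply queued writes in order; only now do they become durable.
        for off, data in self.pending:
            self.durable[off:off + len(data)] = data
        self.pending.clear()

    def crash(self):
        # Simulated power loss: everything not yet flushed is gone.
        self.pending.clear()


dev = FlushGatedDevice(8)
dev.write(0, b"AB")
dev.flush()
dev.write(2, b"CD")      # never flushed
before = dev.read(0, 4)  # sees the cached write
dev.crash()
after = dev.read(0, 4)   # the unflushed write has disappeared
```

Dropping all pending writes is the simplest policy. A harsher variant might apply a random subset of them on `crash()`, since a volatile cache is generally free to persist writes in any order, but that goes beyond what this sketch does.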