Troubleshooting NixOS and ZFS
Incident on 21/10/2021. NixOS ZFS partition on laptop reported 0 available bytes on all pools.
Teardown:
- Tried to set up Android Studio with emulator capabilities on my NixOS machine.
  - Installed a bunch of packages.
  - Nix, the package manager, sucks up a lot of storage if the store is not garbage-collected regularly.
  - Installed and built so much on my machine that suddenly nothing worked anymore.
  - `du` reported 100% full on everything.
    - Things got so weird that my command line prompt started printing random stuff, since it couldn’t create a `tmp` file.
    - I could not collect garbage with `nix-collect-garbage`, since it couldn’t create a lock file.
- Rebooted the machine, wrong call.
  - Most services enabled at boot reported failure to start.
  - Got to the login screen.
  - Non-existent users were denied entry.
  - Existing users (`root` and `mycookie`) passed the login but, with no space available, were thrown back to the login screen.
- Troubleshooting with built-in tools.
  - Laptop firmware settings were unable to help.
  - Lenovo has SMART tools available.
  - They did not report any hardware error on the disk.
  - All SMART attributes seemed in order.
- Nikola flashed a USB with an ISO of NixOS.
  - Able to access a command line with my system.
  - `smartctl -a/x` confirmed no hardware errors present.
  - Also tried Finnix, which could not help because its built-in version of ZFS was too old (v23 vs v28 on Nix).
    - Thought NixOS wouldn’t have tools for ZFS, but both `zpool` and `zfs` are available.
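- Tried to get access to my ZFS pool (`rpool`).
  - Imported all available ZFS pools.
  - Loaded the encryption keys for the pool.
  - Mounted the dataset through `mount`, as the datasets used `legacy` mountpoints.

        zpool import -a                     # import every pool found on the attached disks
        zfs list                            # show the datasets in the imported pools
        zfs load-key -a                     # load the encryption keys for all datasets
        mount -t zfs rpool/home /mnt/rec    # legacy mountpoint, so mount is called directly
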
- Attempted to back up `/home` to an external drive.
  - 70 GB of data.
  - The external drive was connected via USB, so it was extremely slow.
  - Learnt that `cp` has no built-in progress reporting.
  - You can, however, get the `cp` process ID and then watch its open files and their current offsets under `/proc/<id>`.

        pgrep -x cp                   # -x matches the name exactly; prints only the PID
        ls -l /proc/<id>/fd           # lists the files the process has open
        cat /proc/<id>/fdinfo/<fd>    # shows the current offset (pos) of descriptor <fd>

  - Due to the enormous write size (and impatience), the external (NTFS) drive’s MFT got corrupted by a lazy `umount`.
    - Tried to salvage it with `ntfsfix` on the NixOS command line, no success.
    - The final option is to use `chkdsk` on a Windows machine, but that can take around 8 hours per TB.
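
The `/proc` trick for watching `cp` can be turned into a rough progress estimate. This is only a minimal sketch, assuming Linux with GNU `stat`; the helper names `percent_done` and `fd_progress` are made up for illustration, and the descriptor number of the source file has to be looked up with `ls -l /proc/<id>/fd` first.

```shell
# Sketch: estimate how far a process has read through one of its open files,
# based on the offset the kernel exposes in /proc/<pid>/fdinfo/<fd>.

# percent_done OFFSET TOTAL -> integer percentage (hypothetical helper)
percent_done() {
    offset=$1
    total=$2
    if [ "$total" -eq 0 ]; then
        # An empty file counts as fully copied.
        echo 100
    else
        echo $(( offset * 100 / total ))
    fi
}

# fd_progress PID FD -> prints "<path>: <n>%" for one open descriptor
fd_progress() {
    pid=$1
    fd=$2
    # /proc/<pid>/fd/<fd> is a symlink to the file behind the descriptor.
    path=$(readlink "/proc/$pid/fd/$fd")
    # The "pos:" line in fdinfo holds the current offset in bytes.
    pos=$(awk '/^pos:/ { print $2 }' "/proc/$pid/fdinfo/$fd")
    # File size in bytes (GNU stat syntax, an assumption).
    size=$(stat -c %s "$path")
    echo "$path: $(percent_done "$pos" "$size")%"
}
```

Something like `fd_progress "$(pgrep -x cp)" 3` would then report on descriptor 3, if that happens to be the source file. This only covers the file currently being copied, not the whole 70 GB tree.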
- Attempted to garbage collect `rpool/root/nixos` manually.
  - `nix-env` allows selecting a specific profile with `-p`, such as the system profile.
  - The idea was to delete old NixOS generations to free up space.

        nix-env -p /mnt/rec/nix/var/nix/profiles/system

  - Failed again due to no disk space.
- Started reading up on ZFS snapshots.
  - ZFS snapshots take up no storage space until something changes on disk.
  - My machine takes automatic ZFS snapshots at a monthly, weekly, daily, hourly, and “frequent” cadence.
  - THEORY:
    - Because the automatic snapshots still referenced the blocks of deleted files, the storage reported by `zfs list` stayed at the full size.
    - Thus deleting files didn’t increase the available space.
  - Deleted some of the newer snapshots to free 14.1 GB.
  - Deleting old generations now worked.

        nix-env -p /mnt/rec/nix/var/nix/profiles/system --delete-generations 30d

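
To check the snapshot theory before deleting anything, the space held per snapshot can be totalled from the scripted output of `zfs list`. This is a rough sketch, not what I actually ran at the time; `sum_snapshot_used` is a hypothetical helper name, and it expects tab-separated `name` and `used` columns in bytes on stdin.

```shell
# sum_snapshot_used: read "name<TAB>used-bytes" lines on stdin and print
# the total number of bytes. Prints 0 for empty input.
sum_snapshot_used() {
    awk -F '\t' '{ total += $2 } END { printf "%.0f\n", total }'
}

# Assumed usage on a machine with the pool imported:
#   zfs list -Hp -t snapshot -o name,used -r rpool | sum_snapshot_used
#
# For comparison, the per-dataset breakdown including the USEDSNAP column:
#   zfs list -o space -r rpool
```

The sum of the individual `used` values tends to underestimate the real figure, because blocks shared by several snapshots are not charged to any single one of them. The `USEDSNAP` column from `zfs list -o space` is the more reliable number, and `zfs destroy -nv <snapshot>` gives a dry run showing how much space a deletion would reclaim.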
- Finally able to log in to the system and GC.
  - Followed the NixOS documentation to GC properly.
- Next steps are:
  - to implement the documentation steps for automated GC,
  - to set up proper backups to my server for critical files,
  - to invest in a proper hard drive that snapshots and larger files can be exported to.