Incident on 21/10/2021. NixOS ZFS partition on laptop reported 0 available bytes on all pools.

Teardown:

  • Tried to set up Android Studio with emulator support on my NixOS machine

    • Installed a bunch of packages.
    • Nix, the package manager, eats a lot of storage if the store is not garbage-collected regularly.
    • Installed and built so much on my machine that suddenly nothing worked anymore.
  • df reported 100% full on everything.

    • Things got weird enough that my shell prompt started printing garbage, since it couldn’t create a tmp file.
    • I could not collect garbage with nix-collect-garbage since it couldn’t create a lock file.
  • Rebooted the machine, wrong call.

    • Most services enabled at boot reported failure to start.
    • Got to login screen.
      • Non-existent users were denied entry.
      • Existing users (root and my cookie) passed the login, but with no space available they were thrown straight back to the login screen.
  • Troubleshooting with built-in tools.

    • Laptop firmware settings unable to help.
    • Lenovo has SMART tools available.
  • Nikola flashed a USB with an ISO of NixOS.

    • Got a command line with access to my system’s disk.
    • smartctl -a/x confirmed no hardware errors present.
    • Also tried Finnix, but it could not help because its built-in ZFS was too old (v23 vs v28 on Nix).
      • Thought the NixOS ISO wouldn’t ship tools for ZFS, but both zpool and zfs are available.
  • Tried to get access to my ZFS pool (rpool).

    • Imported all ZFS pools available.

    • Loaded encryption keys for pool.

    • Mounted the datasets with mount, since their mountpoint was set to legacy.

      zpool import -a
      zfs list -t all
      zfs load-key -a
      mount -t zfs rpool/home /mnt/rec
      
  • Attempt to back up /home to external drive.

    • 70GB of data.

    • External drive was connected over USB, so extremely slow.

      • Learnt that cp does not have an internal mechanism for progress report.

      • You can, however, get the cp process ID and then watch its open files and the current offset into each one under /proc/<id>.

        pgrep -x cp                  # -x matches the exact name; prints the PID
        ls -l /proc/<id>/fd          # lists open files, one symlink per descriptor
        cat /proc/<id>/fdinfo/<fd>   # "pos:" is the current offset into that file
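
The /proc lookup above can be wrapped into a small helper. This is only a sketch, assuming Linux and GNU stat; the function name fd_progress is made up for illustration. For every regular file the process has open it prints the current offset, the file size, and the path, so the source file's line shows how far cp has got.

```shell
# Print "<offset> <size> <path>" for every regular file a process has open.
# Assumes Linux /proc; inspecting another user's process needs root.
fd_progress() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        target=$(readlink "$fd") || continue
        [ -f "$target" ] || continue      # skip pipes, ttys, sockets
        # fdinfo/<fd> carries a "pos:" line with the current file offset
        pos=$(awk '/^pos:/ { print $2 }' "/proc/$pid/fdinfo/${fd##*/}")
        size=$(stat -c %s "$target")      # GNU stat: size in bytes
        echo "$pos $size $target"
    done
}

# Usage (polls every 5 seconds until interrupted):
#   while sleep 5; do fd_progress "$(pgrep -x cp)"; done
```

Tools such as rsync with --info=progress2 report progress on their own, which may be simpler than polling /proc.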
        
    • Because of the enormous write size (and my impatience), the external NTFS drive’s MFT got corrupted by a lazy umount.

      • Tried to salvage with ntfsfix on NixOS command line, no success.
      • Final option is to use chkdsk on a Windows machine, but that can take around 8 hours per TB.
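
A lazy umount detaches the mount point immediately and finishes the cleanup later, so buffered writes may still be pending when the drive is unplugged. A minimal sketch of a more patient routine follows, assuming Linux; the function names pending_writeback_kb and safe_unmount are hypothetical.

```shell
# Print the amount of dirty + in-flight page cache in kB (from /proc/meminfo).
pending_writeback_kb() {
    awk '/^(Dirty|Writeback):/ { total += $2 } END { print total + 0 }' /proc/meminfo
}

# Flush buffers, then do a normal (non-lazy) umount, which fails with
# "target is busy" instead of detaching while the filesystem is in use.
safe_unmount() {
    sync
    umount "$1"
}

# Usage:
#   pending_writeback_kb      # watch this drop towards 0 during a big copy
#   safe_unmount /mnt/ext     # /mnt/ext is a placeholder mount point
```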
  • Attempt to garbage collect rpool/root/nixos manually.

    • nix-env can be pointed at a specific profile with -p, in this case the system profile.

    • Idea was to delete old NixOS generations to free up space.

      nix-env -p /mnt/rec/nix/var/nix/profiles/system --delete-generations old
      
    • Failed again due to no disk space.

  • Started reading up on ZFS snapshots.

  • Finally able to log in to the system and run GC.

  • Next steps are:

    • to implement the documented steps for automated GC,
    • set up proper backups to my server for critical files,
    • invest in a proper hard drive that snapshots and larger files can be exported to.
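
Until automated GC is in place, a stopgap could be a small check run from cron or a timer. The sketch below is an assumption, not the documented NixOS mechanism: usage_percent and gc_if_full are hypothetical names, and the 85% threshold and 30-day retention are arbitrary picks.

```shell
# Print the used-space percentage (0-100) of the filesystem holding $1.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Collect garbage once usage reaches the threshold (default 85%).
gc_if_full() {
    path=$1
    threshold=${2:-85}
    used=$(usage_percent "$path")
    [ -n "$used" ] || return 1            # df failed, e.g. path missing
    if [ "$used" -ge "$threshold" ]; then
        echo "usage ${used}% >= ${threshold}%, collecting garbage"
        nix-collect-garbage --delete-older-than 30d
    else
        echo "usage ${used}% is below ${threshold}%, nothing to do"
    fi
}

# Usage:
#   gc_if_full /nix 85
```

Running the collection before the disk is full matters here, because at 100% nix-collect-garbage could not even create its lock file.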