NFS, CSI, and Nomad


After far too many days of failure, I got an NFS share working with the CSI Plugin running on Nomad (version 1.3.3)!

I have a three-node Pi cluster running Nomad. I created the NFS share on the Pi that has an external SSD plugged into it and, after mounting the share, was able to create new files and directories on it from all three Pis. There was one odd permissions quirk: I had to use sudo on the node sharing the drive in order to make a directory, but after that everyone could use it.
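For anyone retracing my steps, the setup was roughly this (a sketch from memory, assuming Raspberry Pi OS / Debian packages; the server IP is the one from my error logs, and /mnt/scratch is just an arbitrary client-side mount point):

# on the Pi with the SSD (the NFS server)
$ sudo apt install nfs-kernel-server
$ sudo nano /etc/exports          # add the share (the exact line is below)
$ sudo exportfs -ra               # apply the exports

# on each client Pi
$ sudo apt install nfs-common
$ sudo mkdir -p /mnt/scratch
$ sudo mount -t nfs 169.254.116.188:/mnt/usb/scratch /mnt/scratch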

I deployed the Kubernetes NFS CSI Driver, but couldn’t get it to work. I was able to deploy the controller and node jobs, similar to how I’ve seen other people deploy them, but when I tried to run

$ nomad volume create foo.hcl
Error creating volume: Unexpected response code: 500 (rpc error: 1 error occurred:
        * controller create volume: rpc error: controller create volume: CSI.ControllerCreateVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = failed to mount nfs server: rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 169.254.116.188:/scratch/gitea /tmp/test
Output: mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.

Weird. Not sure about the rpc.statd stuff. And I don’t want to keep locks local…
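For reference, foo.hcl was a CSI volume spec along these lines (a sketch, not my exact file: the plugin_id is whatever ID the plugin jobs registered, and the parameters block follows the csi-driver-nfs convention of a server address plus a share path, under which the controller creates a per-volume subdirectory):

# foo.hcl (illustrative)
id        = "gitea"
name      = "gitea"
type      = "csi"
plugin_id = "nfs"

capability {
  access_mode     = "multi-node-multi-writer"
  attachment_mode = "file-system"
}

parameters {
  server = "169.254.116.188"
  share  = "/scratch"
}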

I found another NFS CSI driver, from rocketduck, which takes a slightly different configuration, but whatever, I just wanted it to work. I again got to the point where I could deploy the controller and node jobs, but was still receiving errors. Fortunately the errors were different, though still not entirely helpful.

$ nomad volume create foo.hcl
Error creating volume: Unexpected response code: 500 (rpc error: 1 error occurred:
        * controller create volume: rpc error: controller create volume: CSI.ControllerCreateVolume: rpc error: code = Unknown desc = Exception calling application: [Errno 13] Permission denied: '/tmp/tmpkqgru5r8/gitea'

Ultimately, the errors being thrown weren’t entirely relevant (yes, the rocketduck error was significantly more helpful, but my frustrated monkey-brain was just seeing red text). Then I remembered having to be sudo on the node sharing the NFS drive to be able to create a directory, and it clicked: something was wrong with my NFS configuration!

I’m not going to proclaim I understand which of my changes fixed the problem or what it’s doing under the hood. I still need to learn that. But this is what I did.

Debugging NFS

The original, problematic /etc/exports:

# /etc/exports
/mnt/usb/scratch 169.254.0.0/16(rw,sync)

First, I thought it could be the CIDR range I provided. My cluster is currently on

  • WiFi
  • Plugged into a switch
  • Tailscale

The CIDR range is for the network switch, but in Nomad I have the share addresses set to the IPs handed out by the Tailscale network. (Tailscale assigns addresses from the 100.64.0.0/10 CGNAT range, which would never match a 169.254.0.0/16 export.)

Was it the CIDR range?

# /etc/exports
/mnt/usb/scratch *(rw,sync)
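One gotcha worth flagging: edits to /etc/exports don’t take effect on their own; they need to be re-applied, and showmount is a handy sanity check of what’s actually being exported:

$ sudo exportfs -ra          # re-read /etc/exports and apply it
$ showmount -e localhost     # list the server's active exports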

Alas, I tried to create a new volume in Nomad, to no avail. It’s not the network.

Oh, yeah, what about that sudo problem?

blindly copy-pastes stuff from the internet…

# /etc/exports
/mnt/usb/scratch *(rw,sync,no_subtree_check,no_root_squash)
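My loose understanding of those two copy-pasted options (hedged; exports(5) is the real authority):

# no_subtree_check: skip verifying that each request resolves to a file
#   inside the exported subtree (the modern default, and cheaper)
# no_root_squash: don't remap a client's root user to the anonymous
#   user, so root (and sudo) keeps its powers on the share
/mnt/usb/scratch *(rw,sync,no_subtree_check,no_root_squash)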

Hey, now I can create files and directories on the share without sudo! And I can do so from other nodes, and not just the one connected to the drive! Ayo!

Oh yeah, it’s not the network…

# /etc/exports
/mnt/usb/scratch 169.254.0.0/16(rw,sync,no_subtree_check,no_root_squash)

And…it still works!

And…I can create a volume in Nomad! Hurrah!
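With the volume created, a job can claim it with a group-level volume block and mount it into a task (a rough sketch, reusing the illustrative gitea volume from above):

job "gitea" {
  group "gitea" {
    # claim the CSI volume created earlier
    volume "data" {
      type            = "csi"
      source          = "gitea"
      access_mode     = "multi-node-multi-writer"
      attachment_mode = "file-system"
    }

    task "gitea" {
      driver = "docker"

      config {
        image = "gitea/gitea:latest"
      }

      # mount the claimed volume into the container
      volume_mount {
        volume      = "data"
        destination = "/data"
      }
    }
  }
}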

Which driver

Well, honestly, I’m still getting an error with the Kubernetes NFS CSI driver:

Error creating volume: Unexpected response code: 500 (rpc error: 1 error occurred:
        * controller create volume: rpc error: controller create volume: CSI.ControllerCreateVolume: controller plugin returned an internal error, check the plugin allocation logs for more information: rpc error: code = Internal desc = failed to mount nfs server: rpc error: code = Internal desc = mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 169.254.116.188:/scratch/gitea /tmp/gitea
Output: mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
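If I ever circle back to the Kubernetes driver, my guess (and it is just a guess) is that the fix is making sure the statd daemon is running on every client node; on Debian-flavored systems it ships with nfs-common:

$ sudo apt install nfs-common            # provides rpc.statd
$ sudo systemctl enable --now rpc-statd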

But I do have a somewhat urgent reason to just be happy with the rocketduck driver and move on with my life.