Why every now and then I had to restart PureLB's lbnodeagent to get the LoadBalancer responding again, and how I fixed it for good.

For a while I had this silly ritual on my home cluster (the arenero): every now and then, access to my services through the LoadBalancer IP would just die. No clear errors, no logs screaming at me. 192.168.90.210 (the VIP of my HAProxy ingress) simply stopped answering. And like a robot, I would run this and everything came back to life:

kubectl rollout restart daemonset/lbnodeagent -n purelb

It always worked. Which is exactly the most infuriating part, because fixing the symptom without understanding the cause is a band-aid that keeps falling off. So one day I sat down to figure out what the hell was going on.

First clue: Cloudflare was still alive

The detail that changed everything: when things “went down”, anything coming in through Cloudflare Tunnel kept working perfectly. And that makes a lot of sense once you think about it: cloudflared enters the cluster through the pod network (ClusterIP), it doesn’t need anyone answering for the LoadBalancer IP. The only thing dying was traffic hitting the .210 directly over the LAN.

Translation: this is not an internal routing problem, nor PureLB failing to allocate the IP. It’s an L2 announcement / ARP problem. Someone had to be answering ARP for the .210 on the network… and they weren’t.
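
If you want to check the ARP theory from another Linux box on the same LAN, the neighbour table tells you whether anyone answered for the VIP. This is only a sketch: it assumes iproute2's `ip neigh` output format, and `arp_state` is a helper name I made up for illustration (on macOS you would read `arp -n` instead).

```shell
# Classify one line of `ip neigh show <ip>` output, read from stdin.
# Prints "resolved <mac>" if a MAC address was learned, "unresolved" otherwise.
# Note: a STALE entry still carries the old MAC, so ping the VIP first
# to force a fresh ARP lookup before trusting a "resolved" answer.
arp_state() {
  awk '
    { for (i = 1; i < NF; i++) if ($i == "lladdr") { print "resolved", $(i + 1); found = 1; exit } }
    END { if (!found) print "unresolved" }
  '
}

# Usage (not run here):
#   ping -c1 192.168.90.210 >/dev/null; ip neigh show 192.168.90.210 | arp_state
```

An "unresolved" result while the pods are healthy points at the announcement, not at the service behind it.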

Confirming the mess

PureLB in local mode works like this: since the VIP lives in the same subnet as the nodes (192.168.90.0/24), it elects one node via gossip (memberlist) and adds the IP as a secondary address on its interface, where it answers ARP. The Service annotation tells you who’s in charge:

$ kubectl get svc haproxy-kubernetes-ingress -n ingress-controller \
    -o jsonpath='{.metadata.annotations.purelb\.io/announcing-IPv4}'
worker1,eno1

OK, so PureLB thinks worker1 is announcing the .210 on eno1. Let’s go to worker1 and look at the interface:

$ ip -4 addr show eno1
2: eno1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
    inet 192.168.90.21/24 brd 192.168.90.255 scope global eno1

There’s the node’s .21… but the .210 is not there. PureLB says it’s announcing it, yet the IP is on no interface at all. In fact I checked all 4 workers and it was nowhere. Not the .210, not the rest of the pool. A ping from my laptop confirmed it:

$ ping -c2 192.168.90.210
2 packets transmitted, 0 packets received, 100.0% packet loss

Totally down. PureLB’s logical state and the actual network state were divorced. And that’s why the rollout restart fixed it: starting from scratch, the lbnodeagent walks the Services and adds the address again (the equivalent of an ip addr add). But… who had been yanking the IP off the interface?
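
That "annotation says one thing, kernel says another" check can be scripted. The sketch below is not anything PureLB ships: the three helper names are hypothetical, it assumes the annotation always has the `node,interface` shape seen above, and it assumes you can SSH to the nodes.

```shell
# Split a PureLB announcing annotation such as "worker1,eno1" into its parts.
announcer_node()  { printf '%s\n' "${1%%,*}"; }
announcer_iface() { printf '%s\n' "${1#*,}"; }

# Succeeds if the given IPv4 address appears in `ip -o -4 addr show` output
# on stdin. Compares the whole address, so .21 does not match .210.
has_addr() {
  awk -v ip="$1" '{ split($4, a, "/"); if (a[1] == ip) found = 1 } END { exit !found }'
}

# Usage (not run here; needs kubectl and SSH access):
#   ann=$(kubectl get svc haproxy-kubernetes-ingress -n ingress-controller \
#           -o jsonpath='{.metadata.annotations.purelb\.io/announcing-IPv4}')
#   ssh "$(announcer_node "$ann")" ip -o -4 addr show dev "$(announcer_iface "$ann")" \
#     | has_addr 192.168.90.210 || echo "DRIFT: annotation says $ann, kernel disagrees"
```

The exact match matters here: a plain `grep 192.168.90.21` would also match 192.168.90.210 and hide the problem.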

The culprit: systemd-networkd and unattended-upgrades

The key was in how eno1 is configured. It’s static and managed by systemd-networkd (netplan, no DHCP):

network:
  ethernets:
    eno1:
      dhcp4: false
      addresses:
        - 192.168.90.21/24

PureLB adds the .210 as a foreign address (an IP networkd doesn’t know about, because it’s not in its config). And here’s the kicker: every time systemd-networkd restarts, it re-applies the netplan config, which only declares the .21, and purges the .210. PureLB doesn’t watch the interface, so it neither notices nor puts it back.

And why does networkd restart “every now and then”? unattended-upgrades. I checked the journal and the timestamps matched to the minute: networkd reconfigured eno1 exactly when unattended-upgrades upgraded systemd, netplan.io or udev (daily, around 06:00 UTC). Upgrade those packages → networkd restarts → goodbye VIP.
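
To repeat that timestamp matching on your own nodes, one option is to pull the relevant upgrades out of dpkg's log and read them next to the networkd journal. A rough sketch: it assumes the usual `/var/log/dpkg.log` line layout (date, time, action, package:arch, versions), the function name is made up, and the package list is just the three that bit me.

```shell
# Print "date time package" for each upgrade of systemd, netplan.io or udev.
# Reads dpkg.log-formatted lines on stdin; other packages are ignored.
networkd_trigger_upgrades() {
  awk '$3 == "upgrade" && $4 ~ /^(systemd|netplan\.io|udev)(:|$)/ { print $1, $2, $4 }'
}

# Usage (not run here):
#   networkd_trigger_upgrades < /var/log/dpkg.log
#   journalctl -u systemd-networkd --since "2 days ago" --no-pager
```

Both logs default to local time, so compare like with like (journalctl has a `--utc` flag if you prefer UTC).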

It’s the same classic pattern as this Netplan + keepalived bug: you add a virtual IP by hand, restart networkd, and it eats it.

The acid test

Before touching 4 nodes blindly, I reproduced it in isolation on worker1 with a throwaway IP, so I wouldn’t depend on PureLB’s timing:

# Add a foreign IP, just like PureLB does
sudo ip addr add 192.168.90.250/24 dev eno1
sudo systemctl restart systemd-networkd
ip -4 addr show eno1 | grep 192.168.90.250 # <-- empty: PURGED

Confirmed: the networkd restart eats it. Now with the fix in place:

sudo mkdir -p /etc/systemd/network/10-netplan-eno1.network.d
printf '[Network]\nKeepConfiguration=yes\n' | \
    sudo tee /etc/systemd/network/10-netplan-eno1.network.d/keep-purelb.conf
sudo networkctl reload

sudo ip addr add 192.168.90.250/24 dev eno1
sudo systemctl restart systemd-networkd
ip -4 addr show eno1 | grep 192.168.90.250 # <-- SURVIVES :)

KeepConfiguration=yes makes networkd keep the addresses that aren’t its own when it restarts. On systemd 255 (Ubuntu 24.04) it works like a charm, though heads up: the docs are ambiguous and I myself doubted it would cover foreign addresses. That’s why I tested it instead of trusting it. Always test.
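
In the same "test, don't trust" spirit, it's worth confirming that networkd really sees the drop-in merged into the netplan-generated file. A small sketch, assuming `systemd-analyze cat-config` resolves the generated `.network` file together with its drop-ins on your systemd version; `keep_configuration` is a hypothetical helper name.

```shell
# Print the effective KeepConfiguration= value from merged .network config
# text on stdin (main file followed by drop-ins; the last assignment wins).
# Prints nothing if the key is never set inside a [Network] section.
keep_configuration() {
  awk '
    /^\[/ { section = $0; next }
    section == "[Network]" && /^KeepConfiguration[ \t]*=/ { sub(/^[^=]*=[ \t]*/, ""); value = $0 }
    END { if (value != "") print value }
  '
}

# Usage (not run here):
#   systemd-analyze cat-config systemd/network/10-netplan-eno1.network | keep_configuration
```

If this prints nothing, the drop-in directory name probably doesn't match the generated file name.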

The permanent fix

The drop-in goes on every worker running the lbnodeagent (4 in my case; the master is tainted and never announces, so it doesn’t need it):

sudo mkdir -p /etc/systemd/network/10-netplan-eno1.network.d
printf '[Network]\nKeepConfiguration=yes\n' | \
    sudo tee /etc/systemd/network/10-netplan-eno1.network.d/keep-purelb.conf
sudo networkctl reload
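
If your nodes don't all use the same interface name, the same steps can be wrapped in a small function. This is just a sketch of the commands above with the interface as a parameter; the function name and the optional target-directory argument (handy for dry runs) are my own additions.

```shell
# Write the KeepConfiguration drop-in for one netplan-managed interface.
# $1 = interface name, $2 = systemd network dir (default: /etc/systemd/network).
# Safe to run repeatedly: the file is overwritten with the same content.
install_keep_dropin() {
  local iface=$1
  local netdir=${2:-/etc/systemd/network}
  local dropin_dir="$netdir/10-netplan-$iface.network.d"
  mkdir -p "$dropin_dir" &&
    printf '[Network]\nKeepConfiguration=yes\n' > "$dropin_dir/keep-purelb.conf"
}

# Usage as root on each worker (not run here):
#   install_keep_dropin eno1 && networkctl reload
```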

One last kubectl rollout restart daemonset/lbnodeagent -n purelb so PureLB adds the VIP again (this time protected), and everything is back to life:

$ ping -c3 192.168.90.210
3 packets transmitted, 3 packets received, 0.0% packet loss

Lesson learned: if your PureLB VIP (or MetalLB in L2 mode, or keepalived) disappears “on its own” every few days on a box running systemd-networkd, don’t go hunting for ghosts in the load balancer. Look at your automatic updates restarting the network underneath you.

Cheers!
