TL;DR Keep control of the entire cluster pool of IPs from your networking plane. Avoid potential IP conflicts and streamline automated deployments with DHCP-managed, albeit statically reserved, assignments.
ORIGINAL POST DHCP setup of a cluster
PVE static network configuration ^ is not actually a real prerequisite, not even for clusters. The intended use case for this guide is to cover a rather stable environment, but allow for centralised management.
CAUTION While it actually is possible to change IPs or hostnames without a reboot (more on that below), you WILL suffer from the same issues as with static network configuration in terms of managing the transition.
IMPORTANT This guide assumes that the nodes satisfy all of the below requirements, at the latest before you start adding them to the cluster and at all times after. The nodes:
- have reserved their IP address at the DHCP server;
- obtain a reasonable lease time for those IPs;
- get the nameserver handed out via DHCP Option 6; and
- can reliably resolve their hostname via DNS lookup.
TIP There is also a much simpler guide for single node DHCP setups which does not pose any special requirements.
Taking dnsmasq ^ as an example, you will need at least the equivalent of the following (excerpt):

```
dhcp-range=set:DEMO_NET,10.10.10.100,10.10.10.199,255.255.255.0,1d
domain=demo.internal,10.10.10.0/24,local
dhcp-option=tag:DEMO_NET,option:domain-name,demo.internal
dhcp-option=tag:DEMO_NET,option:router,10.10.10.1
dhcp-option=tag:DEMO_NET,option:dns-server,10.10.10.11
dhcp-host=aa:bb:cc:dd:ee:ff,set:DEMO_NET,10.10.10.101
host-record=pve1.demo.internal,10.10.10.101
```
There are appliance-like solutions, e.g. VyOS, ^ that allow for this in an error-proof way.
Some tools that will help with troubleshooting during the deployment:
```
ip -c a
```

should reflect the dynamically assigned IP address (excerpt):

```
2: enp1s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether aa:bb:cc:dd:ee:ff brd ff:ff:ff:ff:ff:ff
    inet 10.10.10.101/24 brd 10.10.10.255 scope global dynamic enp1s0
```
```
hostnamectl
```

checks the hostname; if the static one is unset or set to localhost, the transient one is decisive (excerpt):

```
Static hostname: (unset)
Transient hostname: pve1
```
```
dig nodename
```

confirms correct DNS name lookup (excerpt):

```
;; ANSWER SECTION:
pve1.    50    IN    A    10.10.10.101
```
```
hostname -I
```

can essentially verify that all is well, in the same way the official docs actually suggest.
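The checks above can be tied together in one go. Below is a minimal, hedged sketch (not official tooling) comparing the transient hostname, its name lookup, and the first address actually assigned to the node:

```shell
# Sanity check: does the hostname's lookup match the assigned address?
name=$(hostname)
dns_ip=$(getent hosts "$name" 2>/dev/null | awk '{print $1; exit}')
have_ip=$(hostname -I 2>/dev/null | awk '{print $1}')
echo "hostname=$name lookup=${dns_ip:-none} assigned=${have_ip:-none}"
if [ -n "$dns_ip" ] && [ "$dns_ip" = "$have_ip" ]; then
    echo "OK: lookup matches the assigned address"
else
    echo "CHECK: lookup and assigned address differ (or are missing)"
fi
```

Note that getent honours nsswitch.conf ordering, so this reflects what corosync will actually resolve, not just raw DNS.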
You may use either of the two manual installation methods. Unattended install is out of scope here.
The ISO installer ^ leaves you with a static configuration. Change this by editing /etc/network/interfaces - your vmbr0 stanza will look like this (excerpt):

```
iface vmbr0 inet dhcp
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0
```
Remove the FQDN hostname entry from /etc/hosts and remove the /etc/hostname file. Reboot.
See below for more details.
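The interfaces edit can be sketched as follows - demonstrated on a sample copy rather than the live /etc/network/interfaces (the vmbr0/enp1s0 names and the sample static stanza are assumptions matching the excerpt above):

```shell
# Sample of a static stanza as left by the ISO installer:
cat > /tmp/interfaces.sample <<'EOF'
auto vmbr0
iface vmbr0 inet static
        address 10.10.10.101/24
        gateway 10.10.10.1
        bridge-ports enp1s0
        bridge-stp off
        bridge-fd 0
EOF
# Switch the stanza to DHCP and drop the now-superfluous address/gateway:
sed -i -e 's/^iface vmbr0 inet static$/iface vmbr0 inet dhcp/' \
       -e '/^[[:space:]]*address /d' \
       -e '/^[[:space:]]*gateway /d' /tmp/interfaces.sample
cat /tmp/interfaces.sample
```

On the real node, apply the same edit to /etc/network/interfaces (as root) and reboot.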
There is an official Debian installation walkthrough; ^ simply skip the initial (static) part, i.e. install plain Debian (i.e. with DHCP). You can fill in any hostname (even localhost) and any domain (or no domain at all) in the installer.
After the installation, upon the first boot, remove the static hostname file:
```
rm /etc/hostname
```

The static hostname will be unset and the transient one will start showing in the hostnamectl output.
NOTE If your initially chosen hostname was localhost, you could actually get away with keeping this file populated.
It is also necessary to remove the 127.0.1.1 hostname entry from /etc/hosts.
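A hedged sketch of that removal, demonstrated on a sample copy (the hostname in the sample is an assumption; on the node itself, apply the same filter to /etc/hosts as root):

```shell
# Sample of an /etc/hosts as left by the Debian installer:
cat > /tmp/hosts.sample <<'EOF'
127.0.0.1       localhost
127.0.1.1       pve1.demo.internal pve1
::1     localhost ip6-localhost ip6-loopback
EOF
# Drop the 127.0.1.1 line; non-loopback lookups are then handled by DNS:
grep -v '^127\.0\.1\.1[[:space:]]' /tmp/hosts.sample > /tmp/hosts.cleaned
cat /tmp/hosts.cleaned
```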
Your /etc/hosts will be plain, like this:

```
127.0.0.1 localhost
# NOTE: Non-loopback lookup managed via DNS
# The following lines are desirable for IPv6 capable hosts
::1     localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
```
This is also where you should actually start the official guide - “Install Proxmox VE”. ^
TIP This guide may ALSO be used to set up a SINGLE NODE. Simply do NOT follow the instructions beyond this point.
This part logically follows the manual installs.
Unfortunately, PVE tooling populates the cluster configuration (corosync.conf) ^ with resolved IP addresses upon its inception.
Creating a cluster from scratch:
```
pvecm create demo-cluster
```

```
Corosync Cluster Engine Authentication key generator.
Gathering 2048 bits for key from /dev/urandom.
Writing corosync key to /etc/corosync/authkey.
Writing corosync config to /etc/pve/corosync.conf
Restart corosync and cluster filesystem
```
While all is well, the hostname got resolved and put into cluster configuration as an IP address:
```
cat /etc/pve/corosync.conf
```

```
logging {
  debug: off
  to_syslog: yes
}

nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.101
  }
}

quorum {
  provider: corosync_votequorum
}

totem {
  cluster_name: demo-cluster
  config_version: 1
  interface {
    linknumber: 0
  }
  ip_version: ipv4-6
  link_mode: passive
  secauth: on
  version: 2
}
```
This will of course work just fine, but it defeats the purpose. You may choose to do the following now (one by one, as nodes are added), or you may defer the repetitive work until you have gathered all nodes for your cluster. The below demonstrates the former.
All there is to do is to replace the ringX_addr with the hostname. The official docs ^ are rather opinionated about how such edits should be performed.
CAUTION Be sure to include the domain as well in case your nodes do not share one. Do NOT change the name entry for the node.
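The edit itself can be sketched as below - shown on a sample snippet in /tmp (file path and the sample contents are illustrative assumptions). The official procedure amounts to editing a COPY of /etc/pve/corosync.conf, bumping config_version, and only then moving it into place:

```shell
# Sample snippet of a freshly created corosync.conf:
cat > /tmp/corosync.conf.new <<'EOF'
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.101
  }
}
totem {
  config_version: 1
}
EOF
# Replace the resolved IP with the node's FQDN and bump config_version:
sed -i -e 's/ring0_addr: 10\.10\.10\.101/ring0_addr: pve1.demo.internal/' \
       -e 's/config_version: 1/config_version: 2/' /tmp/corosync.conf.new
cat /tmp/corosync.conf.new
# On a live node, the final step would be (as root):
# mv /tmp/corosync.conf.new /etc/pve/corosync.conf
```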
At any point, you may check journalctl -u pve-cluster to see that all went well:

```
[dcdb] notice: wrote new corosync config '/etc/corosync/corosync.conf' (version = 2)
[status] notice: update cluster info (cluster name demo-cluster, version = 2)
```
Now, when you are going to add a second node to the cluster (in the CLI, this is done, counter-intuitively, from the to-be-added node, referencing a node already in the cluster):
```
pvecm add pve1.demo.internal
```

```
Please enter superuser (root) password for 'pve1.demo.internal': **********
Establishing API connection with host 'pve1.demo.internal'
The authenticity of host 'pve1.demo.internal' can't be established.
X509 SHA256 key fingerprint is 52:13:D6:A1:F5:7B:46:F5:2E:A9:F5:62:A4:19:D8:07:71:96:D1:30:F2:2E:B7:6B:0A:24:1D:12:0A:75:AB:7E.
Are you sure you want to continue connecting (yes/no)? yes
Login succeeded.
check cluster join API version
No cluster network links passed explicitly, fallback to local node IP '10.10.10.102'
Request addition of this node
cluster: warning: ring0_addr 'pve1.demo.internal' for node 'pve1' resolves to '10.10.10.101' - consider replacing it with the currently resolved IP address for stability
Join request OK, finishing setup locally
stopping pve-cluster service
backup old database to '/var/lib/pve-cluster/backup/config-1726922870.sql.gz'
waiting for quorum...OK
(re)generate node files
generate new node certificate
merge authorized SSH keys
generated new node certificate, restart pveproxy and pvedaemon services
successfully added node 'pve2' to cluster.
```
It hints at using the resolved IP as a static entry (fallback to local node IP '10.10.10.102') for this action (despite the hostname having been provided), and indeed you would have to change this second incarnation of corosync.conf again.
So your nodelist (after the second change) should look like this:
```
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: pve1.demo.internal
  }
  node {
    name: pve2
    nodeid: 2
    quorum_votes: 1
    ring0_addr: pve2.demo.internal
  }
}
```
NOTE If you wonder about the warnings on “stability” and how corosync actually supports resolving names, you may wish to consult ^ (excerpt):
```
ADDRESS RESOLUTION

corosync resolves ringX_addr names/IP addresses using the getaddrinfo(3)
call with respect of totem.ip_version setting.

getaddrinfo() function uses a sophisticated algorithm to sort node
addresses into a preferred order and corosync always chooses the first
address in that list of the required family. As such it is essential that
your DNS or /etc/hosts files are correctly configured so that all
addresses for ringX appear on the same network (or are reachable with
minimal hops) and over the same IP protocol.
```

CAUTION At this point, it is suitable to point out the importance of the ip_version parameter (it defaults to ipv6-4 when unspecified, but PVE actually populates it as ipv4-6), ^ but also the configuration of hosts in nsswitch.conf. ^
You may want to check whether everything is well with your cluster at this point, either with pvecm status ^ or the generic corosync-cfgtool. Note you will still see IP addresses and IDs in this output, as they got resolved.
Particularly useful to check at any time is netstat (you may need to install net-tools):

```
netstat -pan | egrep '5405.*corosync'
```
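If you prefer not to install net-tools, the same view is available via ss from iproute2. A hedged equivalent (with a fallback message so the pipeline does not simply fail when corosync is not running or not bound):

```shell
# Show corosync's UDP socket on port 5405, or say so if there is none:
ss -pan 2>/dev/null | grep -E '5405.*corosync' || echo "no corosync socket on :5405"
```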
This is especially true if you are wondering why your node is missing from the cluster. Why could this happen? If you have, e.g., improperly configured DHCP and your node suddenly gets a new IP leased, corosync will NOT automatically take this into account:
```
DHCPREQUEST for 10.10.10.103 on vmbr0 to 10.10.10.11 port 67
DHCPNAK from 10.10.10.11
DHCPDISCOVER on vmbr0 to 255.255.255.255 port 67 interval 4
DHCPOFFER of 10.10.10.113 from 10.10.10.11
DHCPREQUEST for 10.10.10.113 on vmbr0 to 255.255.255.255 port 67
DHCPACK of 10.10.10.113 from 10.10.10.11
bound to 10.10.10.113 -- renewal in 57 seconds.
[KNET  ] link: host: 2 link: 0 is down
[KNET  ] link: host: 1 link: 0 is down
[KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
[KNET  ] host: host: 2 has no active links
[KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
[KNET  ] host: host: 1 has no active links
[TOTEM ] Token has not been received in 2737 ms
[TOTEM ] A processor failed, forming new configuration: token timed out (3650ms), waiting 4380ms for consensus.
[QUORUM] Sync members[1]: 3
[QUORUM] Sync left[2]: 1 2
[TOTEM ] A new membership (3.9b) was formed. Members left: 1 2
[TOTEM ] Failed to receive the leave message. failed: 1 2
[QUORUM] This node is within the non-primary component and will NOT provide any services.
[QUORUM] Members[1]: 3
[MAIN  ] Completed service synchronization, ready to provide service.
[status] notice: node lost quorum
[dcdb] notice: members: 3/1080
[status] notice: members: 3/1080
[dcdb] crit: received write while not quorate - trigger resync
[dcdb] crit: leaving CPG group
```
This is because corosync still has its link bound to the old IP. What is worse, however, even if you restart the corosync service on the affected node, it will NOT be sufficient; the remaining cluster nodes will keep rejecting its traffic with:
```
[KNET  ] rx: Packet rejected from 10.10.10.113:5405
```
It is necessary to restart corosync on ALL nodes to get them back into (eventually) the primary component of the cluster. Finally, you ALSO need to restart the pve-cluster service on the affected node (only).
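From an admin host, that recovery sequence might be scripted roughly as below. Node names pve1..pve3 and root SSH access are assumptions; it is shown as a dry-run (the commands are printed via echo, not executed), so remove the echo once you have verified it matches your cluster:

```shell
# Dry-run: print the restart commands instead of executing them.
for n in pve1 pve2 pve3; do
  echo ssh "root@$n" systemctl restart corosync
done
# pve3 stands in for the affected node, which alone needs pve-cluster restarted:
echo ssh root@pve3 systemctl restart pve-cluster
```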
TIP If you see a wrong IP address even after the restart, and you have all the correct configuration in corosync.conf, you need to troubleshoot starting with journalctl -t dhclient (and checking the DHCP server configuration if necessary), but eventually you may even need to check nsswitch.conf ^ and gai.conf. ^