Install
To install or reinstall the new FIONA node:
- If the node is already in the cluster, first make sure it is not part of Ceph and is not a gateway (metallb, ingress, etc.). Ceph nodes can only be taken out one at a time, allowing time for Ceph to recover after each one is brought back.
- Find the network settings: IP, subnet, gateway, DNS (if not Google/Cloudflare)
- Note the current disk setup: whether the node has similar OS drives for md RAID
- Drain the node
- Log in to the node's IPMI screen
- Attach the Ubuntu 20.04 image via the virtual media
- Reboot the node
- Trigger the boot menu (usually F10) and choose to boot from virtual media
- Start the install with media check off
- Agree to everything it asks
- Set up the network:
  - DNS can be 1.1.1.1,8.8.8.8
  - Disable unused networks
  - A subnet calculator can be used to figure out the subnet
- For the disk: if the node has an OS drive mirror, use a custom layout:
  - Delete all existing md arrays
  - Click the drives you're going to use and choose reformat
  - Add unformatted GPT partitions to the drives
  - Create an md array with those partitions
  - For the 2nd disk choose "Add as another boot device"
  - Create an ext4 GPT partition on the created md array
  - Proceed with installation
- For the username choose nautilus
- Choose to install the SSH server; optionally import a key from GitHub
- Don't install any additional packages
- At the end, disconnect the media and reboot
- After the node boots, make the nautilus user a sudoer with NOPASSWD: run sudo visudo and set:
  %sudo ALL=(ALL:ALL) NOPASSWD:ALL
- Add mtu: 9000 to /etc/netplan/00-installer-config.yaml, then run netplan apply. The mtu key goes under the ethernets device.
- Make changes in the ansible inventory file:
  - Remove runtime: docker (containerd is the default)
  - Add:
      ansible_user: nautilus
      ansible_become: true
  - Add lv_devices and containers_lv_size if needed
- If the node was previously installed with CentOS, delete the k8s VG: vgremove k8s (answer yes to all prompts)
- Generate a join_token by logging into the controller and running: kubeadm token create
- Run the ansible playbook according to the docs: ansible-playbook setup.yml -l <node> -e join_token=...
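The last two steps can be chained from a workstation. The following is only a sketch under assumptions not stated above: the controller is reachable over SSH with enough privileges to run kubeadm, the command is run from the directory containing setup.yml, and the function name join_node and its two arguments are hypothetical.

```shell
# Sketch: create a join token on the controller, then run the playbook
# for a single node. Both host names are supplied by the caller.
join_node() {
  local controller="$1" node="$2" token

  # kubeadm prints the new token on stdout (assumes SSH access to the controller)
  token="$(ssh "$controller" kubeadm token create)" || return 1
  if [ -z "$token" ]; then
    echo "empty join token" >&2
    return 1
  fi

  # Same invocation as in the step above, limited to the one node
  ansible-playbook setup.yml -l "$node" -e "join_token=${token}"
}
```

It would be called as join_node <controller> <node>. The token is passed on the command line, so it may show up in shell history or process listings.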
Labels done by cluster admins:
Check that proper labels were added by ansible:
Add the following labels:
nautilus.io/network: "10000" - network speed (10000/40000/100000), needed for perfsonar maddash
netbox.io/site: UNL - SLUG for the netbox site (should exist)
topology.kubernetes.io/region: us-central - region (us-west, us-east, etc.)
topology.kubernetes.io/zone: unl - zone
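If a label is missing, it can be set by hand with kubectl. The sketch below is one possible way to do it, not necessarily how the cluster admins apply labels; the function name label_node and its argument order are hypothetical, and the values shown in the usage line are the examples from the list above.

```shell
# Sketch: apply the four node labels in one kubectl call.
# Arguments: node name, network speed, netbox site slug, region, zone.
label_node() {
  local node="$1" speed="$2" site="$3" region="$4" zone="$5"

  # --overwrite replaces a label that is already set to a different value
  kubectl label node "$node" \
    "nautilus.io/network=${speed}" \
    "netbox.io/site=${site}" \
    "topology.kubernetes.io/region=${region}" \
    "topology.kubernetes.io/zone=${zone}" \
    --overwrite
}

# Usage: label_node <node> 10000 UNL us-central unl
# Check the result with: kubectl get node <node> --show-labels
```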
If a new zone was added, add it to the clockwork topology in the clockwork/ttcs-mesh-config config map and restart the clockwork controller pod. Make sure the new zone appears in clockwork. If it does not, the node pod might need to be restarted.
Both netbox/netbox-agent-* and clockwork/ttcs-agent-mod-* pods read the node topology labels on start and do NOT watch for changes. If labels were changed or added, these pods need a restart to work properly.
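One way to restart those pods for a single node is to delete them and let their controller recreate them. This is a sketch under assumptions: netbox and clockwork are read as namespaces, the pods are managed by something that recreates them (for example a DaemonSet), and the function name restart_agent_pods is hypothetical.

```shell
# Sketch: delete the netbox-agent and ttcs-agent-mod pods running on one node
# so they start again and pick up the current topology labels.
restart_agent_pods() {
  local node="$1" target ns prefix pod

  for target in netbox/netbox-agent- clockwork/ttcs-agent-mod-; do
    ns="${target%%/*}"      # part before the slash, assumed to be the namespace
    prefix="${target#*/}"   # pod name prefix

    # List pods scheduled on the node, keep only the matching name prefix
    kubectl get pods -n "$ns" --field-selector "spec.nodeName=${node}" -o name |
      grep "^pod/${prefix}" |
      while read -r pod; do
        kubectl delete -n "$ns" "$pod" </dev/null
      done
  done
}

# Usage: restart_agent_pods <node>
```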