Moving a diskless cluster to an iSCSI diskless cluster. This is how we did it.

The goal: iSCSI Boot via iPXE

The goal was to move our diskless cluster from “nfsroot + aufs” to iSCSI disks for each node and boot from iSCSI via iPXE. The bootstrapping of the node will be done with FAI and the final configuration with salt upon first boot. Once the initial configuration with FAI is stable, changes to the nodes' configuration will only be made via salt.

Documentation for the individual pieces is generally available, but it is thinly spread and no single source combines all of the tools used here. It was quite a bumpy road, with numerous small roadblocks, to finally get a booting system.

But in the end: It works!

Making FAI iSCSI “aware”

The first thing needed is a working PXE boot environment. We use dnsmasq; the setup can be found here.

On boot, the NIC will send a DHCP request. The DHCP server will answer with the IP address, gateway, etc. and a boot image.

FAI needs to be set up too, obviously. Make sure to include the package open-iscsi in /etc/fai/NFSROOT, or chroot into your nfsroot and install it there. It is not necessary to make it available in the initial ramdisk (it will be crucial for the nodes, though). FAI will boot into its nfsroot and then perform the installation on the target. To make iSCSI disks available there, some modifications to the startup are necessary. By itself, FAI will ignore iSCSI disks even if you manage to connect them. Even if you somehow make the disk known to FAI, iSCSI disks are always added to the device tree later than local disks, so if other, physical disks are installed in the system, the iSCSI disk will always be the last disk. I did not find a way to tell FAI (setup-storage) to use the last disk, but I found a way to use an iSCSI disk.

On startup, FAI executes the files in the folder class in the FAI config space. I modified 20-hwdetect.sh to include the iSCSI disk in the disklist variable, and then used the device-name globbing feature of the disk_config partition setup. The file 20-hwdetect.sh had to be renamed to 55-hwdetect.sh; otherwise it had no access to the host's classes, which are only defined later, in 50-host-classes. Here is the 55-hwdetect.sh:

    #! /bin/bash
    # (c) Thomas Lange, 2002-2013, lange@informatik.uni-koeln.de
    # NOTE: Files named *.sh will be evaluated, but their output ignored.
    [ $do_init_tasks -eq 1 ] || return 0 # Do only execute when doing install
    echo 0 > /proc/sys/kernel/printk
    kernelmodules=iscsi_tcp
    # here, you can load modules depending on the kernel version
    case $(uname -r) in
        2.6*) kernelmodules="$kernelmodules mptspi dm-mod md-mod aes dm-crypt" ;;
          3*) kernelmodules="$kernelmodules mptspi dm-mod md-mod aes dm-crypt" ;;
          4*) kernelmodules="$kernelmodules mptspi dm-mod md-mod aes dm-crypt" ;;
    esac
    for mod in $kernelmodules; do
        [ "$verbose" ] && echo Loading kernel module $mod
        modprobe -a $mod 1>/dev/null 2>&1
    done
    
    ip ad show up | egrep -iv 'loopback|127.0.0.1|::1/128|_lft'
    
    echo $printk > /proc/sys/kernel/printk
    
    # here comes the iSCSI part
    # this will start iSCSI and make the disk visible to the system
    if ifclass ISCSI; then
        echo "ISCSI class defined: logging in to target"
        echo "Initiatorname: iqn.2010-04.org.ipxe:$HOSTNAME "
        iscsistart -i iqn.2010-04.org.ipxe:$HOSTNAME \
               -t iqn.2003-01.cluster.nas1:root \
               -g 1 \
               -a $SERVER \
               -p 3260
        sleep 5
        lsscsi | grep LIO-ORG
        # match the last 4 letters of the LIO-ORG device line, i.e. sdb<SPACE>
        # this selects the iSCSI disk, may be better to use /dev/disk/by-path/* ?
        disklist=$(lsscsi | grep LIO-ORG | grep -o "....$")
        #disklist=${iscsi_disk: -4}
        echo "ISCSI disk found: $disklist"
    fi
    
    
    odisklist=$disklist
    set_disk_info  # recalculate list of available disks
    if [ "$disklist" != "$odisklist" ]; then
        tmp_disklist="$disklist $odisklist"
        echo New disklist: $tmp_disklist
        echo disklist=\"$tmp_disklist\" >> $LOGDIR/additional.var
    fi
    
    save_dmesg     # save new boot messages (from loading modules)
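
The "last 4 letters" match in the script above depends on the exact width of the device name. Here is a minimal sketch of an alternative, assuming lsscsi prints the device node as the last column. The helper name lio_disks is made up for this example and is not part of FAI.

```shell
# Read lsscsi output on stdin and print the kernel name (sdb, sdaa, ...)
# of every line that mentions the LIO-ORG vendor string.
# lio_disks is a hypothetical helper name, not an FAI function.
lio_disks() {
    awk '/LIO-ORG/ {
        n = split($NF, part, "/")   # $NF is the device node, e.g. /dev/sdb
        print part[n]               # keep only the last path component
    }'
}

# Possible use inside 55-hwdetect.sh (untested assumption):
#   disklist=$(lsscsi | lio_disks)
```

Because awk ignores trailing whitespace when splitting fields, the result does not carry the trailing space that the grep variant picks up.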

For the iSCSI login in 55-hwdetect.sh to work, there needs to be an iSCSI target set up with the proper ACLs for the current initiator. Here is a small script which creates a small disk image in the fileio backstore and makes it the first LUN for the initiator:

    #!/bin/bash
    set -e
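    # abort early if no hostname was given, instead of creating a nameless image
    TARGET_HOSTNAME=${1:?"usage: $0 HOSTNAME"}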
    ISCSI_TARGET="iqn.2003-01.cluster.nas1:root"
    ISCSI_INITIATOR="iqn.2010-04.org.ipxe:$TARGET_HOSTNAME"
    
    # Create file image
    targetcli backstores/fileio create $TARGET_HOSTNAME /srv/iscsi-images/$TARGET_HOSTNAME.img 24G true
    
    # Create ACL for initiator
    targetcli iscsi/$ISCSI_TARGET/tpg1/acls create $ISCSI_INITIATOR add_mapped_luns=false
    
    # create lun on ACL
    targetcli iscsi/$ISCSI_TARGET/tpg1/acls/$ISCSI_INITIATOR  create mapped_lun=0 tpg_lun_or_backstore=/backstores/fileio/$TARGET_HOSTNAME
    targetcli / saveconfig

And finally the disk config for FAI’s setup-storage (/srv/fai/config/disk_config/ISCSI). The glob expression will match the first iSCSI disk (LUN 0) and create a root and swap partition.

    disk_config /dev/disk/by-path/*iscsi*-lun-0 fstabkey:uuid align-at:2M disklabel:gpt-bios
    primary /       4G-     xfs     defaults,relatime
    primary swap    8G      swap    sw
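
To see which device the glob in the disk_config would pick, something like the following can be run in the FAI environment after the iSCSI login. It is only a sketch: the function name iscsi_lun0_disk is made up, and the optional directory argument exists only so the function can be tried against a scratch directory.

```shell
# Print the kernel name (e.g. sdb) of the first device matching the
# same glob as in the disk_config. iscsi_lun0_disk is a hypothetical
# helper name; the directory defaults to /dev/disk/by-path.
iscsi_lun0_disk() {
    dir=${1:-/dev/disk/by-path}
    for link in "$dir"/*iscsi*-lun-0; do
        [ -e "$link" ] || continue          # the glob matched nothing
        basename "$(readlink -f "$link")"   # resolve the symlink to the device
        return 0
    done
    return 1                                # no iSCSI LUN 0 found
}
```

Partition links such as ...-lun-0-part1 do not match, because the pattern has to end in lun-0.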

Client Configuration

For the client configuration I mostly followed this configuration, specifically the part on iSCSI boot configuration.

In the FAI config space:

  • make sure open-iscsi is marked for installation in the package_config
  • create an empty file files/etc/iscsi/iscsi.initramfs/ISCSI
  • add DEVICE=eth0 in files/etc/initramfs-tools/initramfs.conf/ISCSI (otherwise there were problems with booting on multi-NIC hosts)
  • set GRUB_CMDLINE_LINUX="ip=dhcp iscsi_auto=true" in files/etc/default/grub/ISCSI
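
The file edits from the list above can be scripted. The following is a minimal sketch, assuming the ISCSI files for initramfs.conf and grub start out as copies of the stock files. The helper set_conf_key is a made-up name, and the paths in the usage comments assume the config space lives in /srv/fai/config.

```shell
# Set KEY=VALUE in a shell-style config file: drop any existing
# assignment of KEY and append the new one. set_conf_key is a
# hypothetical helper name.
set_conf_key() {
    file=$1
    key=$2
    value=$3
    mkdir -p "$(dirname "$file")"
    [ -f "$file" ] || : > "$file"                 # create the file if missing
    tmp=$(mktemp)
    grep -v "^${key}=" "$file" > "$tmp" || true   # grep exits 1 if nothing is left
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    cat "$tmp" > "$file"                          # keeps owner and mode of the file
    rm -f "$tmp"
}

# Possible use (paths are assumptions):
#   cfg=/srv/fai/config/files
#   set_conf_key "$cfg/etc/initramfs-tools/initramfs.conf/ISCSI" DEVICE eth0
#   set_conf_key "$cfg/etc/default/grub/ISCSI" GRUB_CMDLINE_LINUX '"ip=dhcp iscsi_auto=true"'
```

Note that the new assignment ends up on the last line of the file, which is harmless for files that are sourced as shell variables.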

Now comes a hook which creates the initiator login information (hooks/chboot.ISCSI):

    #!/bin/bash
    # this script is executable
    
    # we skip the normal chboot task
    skiptask chboot

    # this will create the ipxe boot script
    ssh -l $LOGUSER $SERVER /usr/local/bin/enable-iscsi $HOSTNAME $SERVER

    # configure initiator name
    echo "InitiatorName=iqn.2010-04.org.ipxe:$HOSTNAME" > $target/etc/iscsi/initiatorname.iscsi
    
    # iscsi configuration
    cat << EOF > $target/etc/iscsi/iscsi.initramfs
    ISCSI_INITIATOR="iqn.2010-04.org.ipxe:$HOSTNAME"
    ISCSI_TARGET_NAME=iqn.2003-01.cluster.nas1:root
    ISCSI_TARGET_IP=$SERVER
    ISCSI_TARGET_PORT=3260
    ISCSI_TARGET_GROUP=1
    EOF

    chroot $target update-initramfs -u

    exit 0

The hook runs a command on the server (/usr/local/bin/enable-iscsi) which writes a PXELINUX configuration for the host. That configuration loads iPXE, which then boots from the iSCSI target:

    #!/bin/bash
    HOST=$1
    SERVER=$2
    FAI_TFTP_CONFIG=/srv/tftp/fai/pxelinux.cfg
    ISCSI_TARGET="iqn.2003-01.cluster.nas1:root"
    HEX_HOST=$(sipcalc -d $HOST |grep "Host address (hex)" | awk  '{print $5}')

    cat << EOF > $FAI_TFTP_CONFIG/$HEX_HOST
    default iscsi
    label iscsi
    kernel ipxe.lkrn
    append dhcp && sanboot iscsi:$SERVER::::$ISCSI_TARGET
    EOF
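
PXELINUX looks for a configuration file named after the client's IPv4 address written as eight uppercase hex digits, which is what the sipcalc call computes. As an illustration, here is a sketch of the same conversion in plain shell. It takes an IPv4 address rather than a hostname, and ip_to_hex is a made-up name.

```shell
# Convert a dotted IPv4 address to the uppercase hex form PXELINUX
# uses for per-host config files. ip_to_hex is a hypothetical helper;
# unlike "sipcalc -d" it does not resolve hostnames.
ip_to_hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    [ $# -eq 4 ] || return 1
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip_to_hex 192.168.1.10   # prints C0A8010A
```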

Now, upon reboot, the host should boot via iSCSI. On the first boot the host will start salt-call to configure and install the rest of the system (slurm, ganglia, ssh keys etc.).
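
If a node does not find its root disk, one thing worth checking is the iscsi.initramfs written by the chboot hook, for example whether $SERVER was empty when the hook ran. The following is a minimal sketch of such a check. The function name and the choice of required keys are assumptions based on the hook shown earlier.

```shell
# Return non-zero if one of the keys written by the chboot hook is
# missing or empty. check_iscsi_initramfs is a hypothetical helper name.
check_iscsi_initramfs() {
    file=$1
    for key in ISCSI_INITIATOR ISCSI_TARGET_NAME ISCSI_TARGET_IP; do
        # require KEY=<value>, with or without double quotes around the value
        if ! grep -Eq "^${key}=\"?[^\"[:space:]]+" "$file"; then
            echo "missing or empty: $key" >&2
            return 1
        fi
    done
}

# Possible use in the hook, before update-initramfs (assumption):
#   check_iscsi_initramfs "$target/etc/iscsi/iscsi.initramfs" || exit 1
```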