Ubuntu 18.04: Installing Slurm

tl;dr

  • Install the job scheduler Slurm on Ubuntu 18.04
  • Target a single-host, single-node setup rather than a PC cluster

Environment

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:    18.04
Codename:   bionic

Introduction

  • slurmctld
    • the controller (management) daemon
    • installed on the master node
  • slurmd
    • the compute daemon
    • installed on each compute node
  • Since this is a single-host, single-node setup, install both on the same machine
  • MUNGE
    • the authentication service
    • without it, starting the Slurm daemons later fails
  • Slurm can also be installed via apt, but here we build the latest version from source (see the version check below)
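
For reference, you can check which version apt would give you; on Ubuntu 18.04 the metapackage is typically slurm-wlm, which is considerably older than the source build used below:

$ apt-cache policy slurm-wlm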

Installation

This follows [Slurm - Quick Start Admin] https://slurm.schedmd.com/quickstart_admin.html

Installing MUNGE

Install MUNGE for authentication. Make sure that all nodes in your cluster have the same munge.key. Make sure the MUNGE daemon, munged is started before you start the Slurm daemons.

Building it from source is a hassle, so install it via apt:

$ apt search munge
Sorting... Done
Full Text Search... Done
invada-studio-plugins-lv2/bionic 1.2.0+repack0-8 amd64
  Invada Studio Plugins - a set of LV2 audio plugins

libdata-munge-perl/bionic 0.097-1 all
  collection of various utility functions

libmoosex-mungehas-perl/bionic 0.007-3 all
  munge your "has" (works with Moo, Moose and Mouse)

libmunge-dev/bionic,now 0.5.13-1 amd64 [installed]
  authentication service for credential -- development package

libmunge-maven-plugin-java/bionic 1.0-2 all
  Maven plugin to pre-process Java code

libmunge2/bionic,now 0.5.13-1 amd64 [installed,automatic]
  authentication service for credential -- library package

libpod-elemental-perlmunger-perl/bionic 0.200006-1 all
  Perl module that rewrites Perl documentation

libstring-flogger-perl/bionic 1.101245-2 all
  module to munge strings for loggers

libterm-ttyrec-plus-perl/bionic 0.09-1 all
  module for reading a ttyrec

munge/bionic,now 0.5.13-1 amd64 [installed]
  authentication service to create and validate credentials

$ apt install -y munge libmunge-dev

Check that the munged daemon is running and that the MUNGE key exists:

$ systemctl list-unit-files --type=service | grep munge
munge.service                          enabled
$ systemctl start munge
$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 12:08:51 UTC; 2h 59min ago
     Docs: man:munged(8)
 Main PID: 9372 (munged)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/munge.service
           └─9372 /usr/sbin/munged

$ find /etc | grep munge.key
/etc/munge/munge.key
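
As a sanity check, a credential can be generated and decoded locally; if this round trip succeeds, MUNGE is working:

$ munge -n | unmunge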

Installing Slurm

This follows the Slurm - Download page.

Fetch the latest source from GitHub - slurm:

$ cd /tmp
$ curl -fsSL https://github.com/SchedMD/slurm/archive/slurm-19-05-0-1.tar.gz | tar zx
$ cd slurm-slurm-19-05-0-1
$ ./configure
$ make -j 10
$ make install
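
Note that ./configure accepts the standard autoconf options, so if you want the binaries or the config directory somewhere other than the /usr/local defaults, something like the following should work (paths illustrative):

$ ./configure --prefix=/usr/local --sysconfdir=/usr/local/etc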

Check for the daemon unit files and configuration directories:

$ systemctl list-unit-files --type=service | grep slurm
$ find /usr/local -type d | grep slurm
/usr/local/share/doc/slurm-19.05.0
/usr/local/share/doc/slurm-19.05.0/html
/usr/local/lib/slurm
/usr/local/lib/slurm/src
/usr/local/lib/slurm/src/sattach
/usr/local/lib/slurm/src/srun
/usr/local/include/slurm
$ find /etc -type d | grep slurm
$ find /var -type d | grep slurm

Since neither the systemd unit files nor the configuration directories exist after installation, they have to be created by hand. Copy the unit files from the source tree:

$ cp ./etc/slurmd.service /etc/systemd/system/
$ cp ./etc/slurmctld.service /etc/systemd/system/
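
After copying in new unit files, tell systemd to reload its unit database so slurmd and slurmctld become visible to systemctl:

$ systemctl daemon-reload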

Type ldconfig -n <library_location> so that the Slurm libraries can be found by applications that intend to use Slurm APIs directly.

$ ldconfig -n /usr/local/lib/slurm

Create the configuration file

It can be generated with the HTML forms shipped with the source (listed below), but when running the latest version it is easier to use the online Slurm - configurator.

$ ls /usr/local/share/doc/slurm-19.05.0/html | grep configurator
configurator.easy.html
configurator.html
$ slurmd -V
slurm 19.05.0

Place the generated file at /usr/local/etc/slurm.conf:

$ cat << EOS > /usr/local/etc/slurm.conf
SlurmctldHost=<my_host_name>
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmdDebug=info
NodeName=<my_node_name> NodeAddr=<my_node_ip_address> RealMemory=96333 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=<my_partition_name> Nodes=<my_node_name> Default=YES MaxTime=INFINITE State=UP
EOS
  • Mostly the default settings are used
  • The changes are:
    • SlurmUser: root
      • if you run the daemons as a dedicated user such as slurm, create that user and group and set permissions appropriately
    • SelectType: cons_res
    • SelectTypeParameters: CR_CPU
    • RealMemory, Sockets, CoresPerSocket, and ThreadsPerCore can be determined with the commands below
$ grep physical.id /proc/cpuinfo | sort -u | wc -l  # number of sockets
$ grep cpu.cores /proc/cpuinfo | sort -u            # cores per socket
$ grep processor /proc/cpuinfo | wc -l              # total logical CPUs (sockets x cores x threads)
$ free -m                                           # total memory in MiB, for RealMemory
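
Alternatively, slurmd can print the detected hardware directly in slurm.conf format:

$ slurmd -C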

The parent directories for Slurm's log files, process ID files, state save directories, etc. are not created by Slurm. They must be created and made writable by SlurmUser as needed prior to starting Slurm daemons.

Create the directories and state files that Slurm does not generate automatically:

$ mkdir -p /var/spool/slurmd
$ touch /var/spool/node_state
$ touch /var/spool/job_state
$ touch /var/spool/resv_state
$ touch /var/spool/trigger_state
$ touch /var/run/slurmctld.pid
$ touch /var/run/slurmd.pid
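
With SlurmUser=root as configured above, root ownership is enough. If you instead run the daemons as a dedicated user (here hypothetically slurm), the spool and state paths must be writable by that user, e.g.:

$ chown -R slurm:slurm /var/spool/slurmd /var/spool/*_state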

Start the daemons

$ systemctl enable slurmctld
$ systemctl start slurmctld
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 15:34:48 UTC; 5s ago
  Process: 9512 ExecStart=/usr/local/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9520 (slurmctld)
    Tasks: 7
   CGroup: /system.slice/slurmctld.service
           └─9520 /usr/local/sbin/slurmctld

$ systemctl enable slurmd
$ systemctl start slurmd
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 15:34:26 UTC; 1min 1s ago
  Process: 9455 ExecStart=/usr/local/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9479 (slurmd)
    Tasks: 1
   CGroup: /system.slice/slurmd.service
           └─9479 /usr/local/sbin/slurmd

Test

$ sinfo
$ srun -l sleep 60 &
$ srun -l sleep 60 &
$ squeue
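
sinfo should show the node available, and squeue should list the two sleep jobs. As a further check, a minimal batch script can be submitted with sbatch (the script path, job name, and output pattern here are illustrative):

$ cat << 'EOS' > /tmp/test.sh
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/tmp/test-%j.out
hostname
sleep 10
EOS
$ sbatch /tmp/test.sh
$ cat /tmp/test-*.out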