Ubuntu 18.04: Installing Slurm
tl;dr
Environment
```
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.2 LTS
Release:        18.04
Codename:       bionic
```
Introduction
`slurmctld`
- the controller (management) daemon
- installed on the master node

`slurmd`
- the compute daemon
- installed on each compute node
- since this setup uses a single host as its only node, both daemons go on the same machine

MUNGE
- the authentication system
- without it, the Slurm daemons failed to start later on

`slurm` can also be installed via apt, but here we build the latest version from source.
Installation
This section follows [Slurm - Quick Start Admin](https://slurm.schedmd.com/quickstart_admin.html).
Installing MUNGE
> Install MUNGE for authentication. Make sure that all nodes in your cluster have the same munge.key. Make sure the MUNGE daemon, munged is started before you start the Slurm daemons.
Building it ourselves would be a hassle, so install it via apt:
```
$ apt search munge
Sorting... Done
Full Text Search... Done
invada-studio-plugins-lv2/bionic 1.2.0+repack0-8 amd64
  Invada Studio Plugins - a set of LV2 audio plugins

libdata-munge-perl/bionic 0.097-1 all
  collection of various utility functions

libmoosex-mungehas-perl/bionic 0.007-3 all
  munge your "has" (works with Moo, Moose and Mouse)

libmunge-dev/bionic,now 0.5.13-1 amd64 [installed]
  authentication service for credential -- development package

libmunge-maven-plugin-java/bionic 1.0-2 all
  Maven plugin to pre-process Java code

libmunge2/bionic,now 0.5.13-1 amd64 [installed,automatic]
  authentication service for credential -- library package

libpod-elemental-perlmunger-perl/bionic 0.200006-1 all
  Perl module that rewrites Perl documentation

libstring-flogger-perl/bionic 1.101245-2 all
  module to munge strings for loggers

libterm-ttyrec-plus-perl/bionic 0.09-1 all
  module for reading a ttyrec

munge/bionic,now 0.5.13-1 amd64 [installed]
  authentication service to create and validate credentials

$ apt install -y munge libmunge-dev
```
Check that the daemon is running and locate the MUNGE key:
```
$ systemctl list-unit-files --type=service | grep munge
munge.service                          enabled
$ systemctl start munge
$ systemctl status munge
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 12:08:51 UTC; 2h 59min ago
     Docs: man:munged(8)
 Main PID: 9372 (munged)
    Tasks: 4 (limit: 4915)
   CGroup: /system.slice/munge.service
           └─9372 /usr/sbin/munged
$ find /etc | grep munge.key
/etc/munge/munge.key
```
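Beyond checking the unit status, MUNGE itself can be smoke-tested by round-tripping a credential; on a healthy install, `unmunge` reports a `STATUS` of `Success (0)`. A minimal sketch (the `command -v` guard simply skips the check on machines where the munge tools are absent):

```shell
# Encode a dummy credential and decode it on the same host;
# munge and unmunge ship with the munge package installed above.
if command -v munge >/dev/null 2>&1; then
    munge -n | unmunge
fi
```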
Installing Slurm
This part follows Slurm - Download.
Get the latest source from GitHub - slurm:
```
$ cd /tmp
$ curl -fsSL https://github.com/SchedMD/slurm/archive/slurm-19-05-0-1.tar.gz | tar zx
$ cd slurm-slurm-19-05-0-1
$ ./configure
$ make -j 10
$ make install
```
Check for the daemon units and configuration directories:
```
$ systemctl list-unit-files --type=service | grep slurm
$ find /usr/local -type d | grep slurm
/usr/local/share/doc/slurm-19.05.0
/usr/local/share/doc/slurm-19.05.0/html
/usr/local/lib/slurm
/usr/local/lib/slurm/src
/usr/local/lib/slurm/src/sattach
/usr/local/lib/slurm/src/srun
/usr/local/include/slurm
$ find /etc -type d | grep slurm
$ find /var -type d | grep slurm
```
Neither the service units nor the directories for configuration exist yet, so they have to be created by hand:
```
$ cp ./etc/slurmd.service /etc/systemd/system/
$ cp ./etc/slurmctld.service /etc/systemd/system/
$ systemctl daemon-reload    # make systemd pick up the new unit files
```
> Type ldconfig -n \<library_location\> so that the Slurm libraries can be found by applications that intend to use Slurm APIs directly.
```
$ ldconfig -n /usr/local/lib/slurm
```
Creating the configuration file
The configuration file can be built with the HTML forms below, but when running the latest version it is easier to use the online Slurm - configurator.
```
$ ls /usr/local/share/doc/slurm-19.05.0/html | grep configurator
configurator.easy.html
configurator.html
$ slurmd -V
slurm 19.05.0
```
Place the generated file at `/usr/local/etc/slurm.conf`:
```
$ cat << EOS > /usr/local/etc/slurm.conf
SlurmctldHost=<my_host_name>
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=root
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=info
SlurmdDebug=info
NodeName=<my_node_name> NodeAddr=<my_node_ip_address> RealMemory=96333 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
PartitionName=<my_partition_name> Nodes=<my_node_name> Default=YES MaxTime=INFINITE State=UP
EOS
```
- The defaults are kept for the most part
- The changed settings are:
  - `SlurmUser`: root
    - to run as a dedicated user such as `slurm` instead, create the user and group and set permissions on the relevant directories accordingly
  - `SelectType`: cons_res
  - `SelectTypeParameters`: CR_CPU
  - `RealMemory`, `Sockets`, `CoresPerSocket`, and `ThreadsPerCore` are determined with the commands below
```
$ grep physical.id /proc/cpuinfo | sort -u | wc -l
$ grep cpu.cores /proc/cpuinfo | sort -u
$ grep processor /proc/cpuinfo | wc -l
$ free -m
```
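The values from these commands have to be combined into the `NodeName` line by hand. A small sketch that does it in one go, assuming a Linux host (the fallbacks to 1 cover platforms where `/proc/cpuinfo` omits the `physical id` or `cpu cores` fields):

```shell
# Count physical CPU packages (sockets).
sockets=$(grep 'physical id' /proc/cpuinfo | sort -u | wc -l)
[ "$sockets" -ge 1 ] || sockets=1
# Cores per socket, as reported by the first CPU entry.
cores=$(awk -F: '/cpu cores/ {print $2+0; exit}' /proc/cpuinfo)
[ "${cores:-0}" -ge 1 ] || cores=1
# Total logical processors (hardware threads).
threads=$(grep -c '^processor' /proc/cpuinfo)
tpc=$(( threads / (sockets * cores) ))
[ "$tpc" -ge 1 ] || tpc=1
# Total memory in MB, straight from /proc/meminfo (kB -> MB).
mem_mb=$(awk '/^MemTotal/ {print int($2/1024)}' /proc/meminfo)
echo "RealMemory=${mem_mb} Sockets=${sockets} CoresPerSocket=${cores} ThreadsPerCore=${tpc}"
```

The printed line can be pasted directly into the `NodeName=` entry of slurm.conf.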
> The parent directories for Slurm's log files, process ID files, state save directories, etc. are not created by Slurm. They must be created and made writable by SlurmUser as needed prior to starting Slurm daemons.
Create the files and directories that are not generated automatically:
```
mkdir -p /var/spool/slurmd
touch /var/spool/node_state
touch /var/spool/job_state
touch /var/spool/resv_state
touch /var/spool/trigger_state
touch /var/run/slurmctld.pid
touch /var/run/slurmd.pid
```
Starting the daemons
```
$ systemctl enable slurmctld
$ systemctl start slurmctld
$ systemctl status slurmctld
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/etc/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 15:34:48 UTC; 5s ago
  Process: 9512 ExecStart=/usr/local/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9520 (slurmctld)
    Tasks: 7
   CGroup: /system.slice/slurmctld.service
           └─9520 /usr/local/sbin/slurmctld
$ systemctl enable slurmd
$ systemctl start slurmd
$ systemctl status slurmd
● slurmd.service - Slurm node daemon
   Loaded: loaded (/etc/systemd/system/slurmd.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-06-16 15:34:26 UTC; 1min 1s ago
  Process: 9455 ExecStart=/usr/local/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 9479 (slurmd)
    Tasks: 1
   CGroup: /system.slice/slurmd.service
           └─9479 /usr/local/sbin/slurmd
```
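If `sinfo` later reports the node as `down` (for example after a reboot or a `slurmd` restart), it can be returned to service by hand. A sketch, using the same `<my_node_name>` placeholder as slurm.conf above:

```shell
$ scontrol update NodeName=<my_node_name> State=RESUME
```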
Test
```
$ sinfo
$ srun -l sleep 60 &
$ srun -l sleep 60 &
$ squeue
```
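For anything beyond one-off `srun` commands, jobs are usually submitted as batch scripts. A minimal sketch (`hello.sh` and the job name are made up for illustration):

```shell
# Write a minimal batch script; the #SBATCH lines are options to sbatch,
# and %j in the output path expands to the job ID.
cat << 'EOS' > hello.sh
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out
srun hostname
EOS
```

Submit it with `sbatch hello.sh`; `squeue` shows it while it runs, and the output lands in `hello-<jobid>.out`.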