python - Running TensorFlow on a Slurm Cluster? -


I have access to a computing cluster, specifically one node with two 12-core CPUs, which is running the Slurm Workload Manager.

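I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this, or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job and cannot directly execute python/tensorflow via ssh.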

Does anyone have an idea, a tutorial, or any kind of source on this topic?

It's relatively simple.

Under the simplifying assumption that you request one process per host, Slurm will provide all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.

For example, you can initialize your task index, the number of tasks and the node list as follows:

import os
from hostlist import expand_hostlist
# Slurm exports these variables in upper case for every task of the job.
task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
tf_hostlist = [ ("%s:22222" % host) for host in
                expand_hostlist( os.environ['SLURM_NODELIST']) ]

Note that Slurm gives you a host list in its compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the module hostlist by Kent Engström, available here: https://pypi.python.org/pypi/python-hostlist
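To make the compressed format concrete, here is a minimal sketch of what the expansion does. It is not the python-hostlist implementation; `expand_simple_hostlist` is a hypothetical helper that only handles a single host name or a single bracket expression, so for real jobs the library's `expand_hostlist` remains the better choice.

```python
import re

def expand_simple_hostlist(hostlist):
    """Expand one 'prefix[a-b,c]' expression. Illustration only."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]+)\]", hostlist)
    if match is None:
        # No bracket expression: treat the input as a single host name.
        return [hostlist]
    prefix, ranges = match.groups()
    hosts = []
    for part in ranges.split(","):
        if "-" in part:
            start, end = part.split("-")
            width = len(start)  # preserve zero padding such as "01"
            for number in range(int(start), int(end) + 1):
                hosts.append("%s%0*d" % (prefix, width, number))
        else:
            hosts.append(prefix + part)
    return hosts

print(expand_simple_hostlist("myhost[11-13]"))
# → ['myhost11', 'myhost12', 'myhost13']
```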

At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:

import tensorflow as tf
# One job named "your_taskname", with one task per host in tf_hostlist.
cluster = tf.train.ClusterSpec( {"your_taskname" : tf_hostlist } )
server  = tf.train.Server( cluster.as_cluster_def(),
                           job_name   = "your_taskname",
                           task_index = task_index )

And you're set! You can now perform TensorFlow node placement on a specific host of your allocation with the usual syntax:

for idx in range(n_tasks):
    with tf.device("/job:your_taskname/task:%d" % idx ):
        ...

A flaw with the code reported above is that all your jobs will instruct TensorFlow to install servers listening at the fixed port 22222. If multiple such jobs happen to be scheduled on the same node, the second one will fail to listen on 22222.
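The collision can be reproduced without TensorFlow. The sketch below uses plain sockets on the loopback interface as a stand-in for two servers; the function name `second_listener_collides` is made up for this illustration, and the port is chosen by the operating system rather than being 22222.

```python
import socket

def second_listener_collides():
    """Return True if a second listener on an already used port fails."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            # Same address and port as the first listener.
            second.bind(("127.0.0.1", port))
            second.listen(1)
            return False
        except OSError as err:
            print("second listener failed:", err)
            return True
    finally:
        second.close()
        first.close()

collided = second_listener_collides()
```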

A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that it allows you to ask for ports with the --resv-ports option. In practice, this requires asking them to add a line like the following to their slurm.conf:

MpiParams=ports=15000-19999

Before you bug your Slurm admin, check what options are already configured, e.g., with:

scontrol show config | grep MpiParams

If your site already uses an old version of OpenMPI, there's a chance an option like this is already in place.

Then, amend my first snippet of code as follows:

import os
from hostlist import expand_hostlist
task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
# Use the first port of the range that Slurm reserved for this job step.
port        = int( os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0] )
tf_hostlist = [ ("%s:%s" % (host,port)) for host in
                expand_hostlist( os.environ['SLURM_NODELIST']) ]
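If the same script should also run on a site where no ports are reserved, the port lookup can be wrapped in a small helper. This is only a sketch: `first_reserved_port` and `build_tf_hostlist` are hypothetical names, the value of SLURM_STEP_RESV_PORTS is assumed to look like "15000-15003" (possibly with further comma-separated entries), and the fallback to 22222 is my own addition, which brings back the collision risk described above.

```python
def first_reserved_port(resv_ports):
    """Return the first port from a string such as '15000-15003'."""
    first_item = resv_ports.split(",")[0]    # assumed: comma-separated entries
    return int(first_item.split("-")[0])     # assumed: 'low-high' ranges

def build_tf_hostlist(hosts, environ, default_port=22222):
    """Pair every host with the reserved port, or with the fixed default."""
    resv = environ.get("SLURM_STEP_RESV_PORTS")
    port = first_reserved_port(resv) if resv else default_port
    return ["%s:%d" % (host, port) for host in hosts]

# Example with a hand-written environment instead of os.environ:
print(build_tf_hostlist(["myhost11", "myhost12"],
                        {"SLURM_STEP_RESV_PORTS": "15000-15001"}))
# → ['myhost11:15000', 'myhost12:15000']
```

In a real job you would pass `os.environ` and the output of `expand_hostlist` instead of the hand-written values.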

good luck!

