python - Running TensorFlow on a Slurm Cluster?
I have access to a computing cluster, specifically one node with two 12-core CPUs, which is running the Slurm Workload Manager.
I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this, or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job and cannot directly execute python/tensorflow via ssh.
Does anyone have an idea, a tutorial, or any kind of source on this topic?
It's relatively simple.
Under the simplifying assumption that you request one process per host, Slurm will provide you with all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.
For example, you can initialize your task index, the number of tasks, and the node list as follows:
import os
from hostlist import expand_hostlist

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
tf_hostlist = [ ("%s:22222" % host) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]
Note that Slurm gives you the host list in its compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available here: https://pypi.python.org/pypi/python-hostlist
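Just to illustrate what the expansion does, here is a tiny standalone example (my own, with a made-up host name):

from hostlist import expand_hostlist
print( expand_hostlist("myhost[11-13]") )   # ['myhost11', 'myhost12', 'myhost13']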
At that point, you can go right ahead and create the TensorFlow cluster specification and start a server with the information you have available, e.g.:
cluster = tf.train.ClusterSpec( {"your_taskname" : tf_hostlist} )
server  = tf.train.Server( cluster.as_cluster_def(),
                           job_name   = "your_taskname",
                           task_index = task_index )
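As a side note (my addition, not part of the snippets above): many distributed TensorFlow setups use separate "ps" and "worker" jobs instead of a single job name. A minimal sketch, assuming the first node of the allocation acts as the parameter server:

# Sketch only: the job names and the one-parameter-server split are assumptions.
ps_hosts     = tf_hostlist[:1]
worker_hosts = tf_hostlist[1:]
cluster      = tf.train.ClusterSpec( {"ps": ps_hosts, "worker": worker_hosts} )
if task_index == 0:
    server = tf.train.Server( cluster, job_name="ps", task_index=0 )
else:
    server = tf.train.Server( cluster, job_name="worker",
                              task_index=task_index - 1 )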
And you're set! You can now perform TensorFlow node placement on a specific host of your allocation with the usual syntax:
for idx in range(n_tasks):
    with tf.device("/job:your_taskname/task:%d" % idx):
        ...
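What each process does next depends on its role. A minimal sketch (my addition, assuming the single-job setup above and at least two tasks): task 0 drives the computation while every other task just serves the graph:

# Continues from the snippets above (tf, server and task_index are defined there).
if task_index == 0:
    with tf.device("/job:your_taskname/task:1"):
        c = tf.constant("hello from task 1")
    with tf.Session(server.target) as sess:
        print( sess.run(c) )   # the op executes on task 1's server
else:
    server.join()              # serve the graph until the job is killed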
A flaw with the code reported above is that all your jobs will instruct TensorFlow to start servers listening at the fixed port 22222. If multiple such jobs happen to be scheduled to the same node, the second one will fail to listen on 22222.
A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that it allows you to request ports with the --resv-ports option. In practice, this requires asking them to add a line like the following to their slurm.conf:
MpiParams=ports=15000-19999
Before you bug your Slurm admin, check which options are already configured, e.g., with:
scontrol show config | grep MpiParams
If your site already uses an old version of OpenMPI, there's a chance such an option is already in place.
Then, amend my first snippet of code as follows:
import os
from hostlist import expand_hostlist

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
port        = int( os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0] )
tf_hostlist = [ ("%s:%s" % (host, port)) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]
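For completeness, here is a sketch of what a submission script for such a job might look like (the node count and script name are placeholders, not from the answer above):

#!/bin/bash
#SBATCH --nodes=4              # placeholder node count
#SBATCH --ntasks-per-node=1    # one TensorFlow task per host, as assumed above

# --resv-ports makes Slurm reserve ports from the MpiParams range and
# export them to each task via SLURM_STEP_RESV_PORTS.
srun --resv-ports python my_tf_script.py   # my_tf_script.py is hypothetical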
Good luck!