Questions tagged [slurm]
SLURM is a workload manager for Linux clusters
100 questions
0 votes
0 answers
18 views
sacctmgr hanging with slurmdbd reporting error: read(13): No error
I've been trying to configure slurmdbd in a small test cluster. So far it's able to start successfully. However, when I try to do anything with sacctmgr, like sacctmgr list cluster, the command hangs, ...
2 votes
1 answer
67 views
Slurm allocates job requesting entire GPU to same GPU as jobs requesting shards already running on that GPU
I'm working on a "cluster" that currently has only one compute node, with 8x H100 GPUs. Slurm is configured such that each GPU is available either as a whole GPU, or as 20 shards. The (from ...
0 votes
1 answer
81 views
InfluxQL query returns a "partial" answer over http. Can I curl the whole thing?
What’s the right way to pull a complete answer to an InfluxQL query over HTTP? I’m using the acct_gather plugin for a Slurm cluster. It sends resource usage data to an InfluxDB v1 database. So if I ...
0 votes
1 answer
456 views
SLURM error and output files with custom variable name
I'd like my log files to be named after a variable. Since this isn't possible: #SBATCH --output some_software.${var}.out #SBATCH --error some_software.${var}.err I came across this workaround but ...
1 vote
1 answer
364 views
Get the output filename(s) for a running slurm job
If I run a Slurm job with sbatch, I can specify output and error filenames with a custom format. But how can I look up these filenames, given the job ID (e.g. 123456) of a running job? For example: ...
0 votes
0 answers
125 views
How to change Slurm squeue output column label?
I'd like to make a shell alias for customers to use like squeue --format="%.9i %10j %8u %8T %.12M %6D %20R %16P %.4C %m" The right-most %m format tag shows how much memory a job requested ...
1 vote
0 answers
96 views
Slurm jobs ignore GPU skipping in gres.conf
When I specify in gres.conf to omit the first GPU, processes in Slurm still use the first one. If I allow Slurm to manage both, the second concurrent process properly goes onto the second GPU. Why? ...
0 votes
1 answer
483 views
Is sbatch-inside-sbatch a bad idea?
On a slurm cluster, is there ever a time when it’s appropriate to use sbatch inside an sbatch script? Or is it always a bad pattern? I’ve seen this in use, and it looks iffy: #SBATCH -J ...
0 votes
0 answers
233 views
single node Slurm machine, munge authentication problem
I'm in the process of setting up a single-node Slurm workstation machine and I believe I followed the process closely and everything is working just fine. See below: sudo systemctl restart slurmdbd &...
0 votes
0 answers
123 views
persistent - slurmdbd: error: mysql_query failed: 1193 Unknown system variable 'wsrep_on'
Hi, I was working on installing Slurm and got most things sorted out, but upon running sudo journalctl -fu slurmdbd I get the following: Jan 25 12:49:49 ... systemd[1]: Stopped slurmdbd.service - ...
0 votes
1 answer
104 views
Where is documentation for `/boot/config-<kernel_version>`?
I am trying to understand how the cgroups memory resource controller is enabled on Ubuntu 20.04. I have several Ubuntu machines that make up a Slurm 23.02.7 cluster. In cgroup.conf, SchedMD ...
0 votes
0 answers
431 views
Slurm IO error, could not open stdoutfile
I am new to Slurm. I have set it up on the cluster. On some nodes of a partition, jobs run perfectly fine, but on other nodes of the same partition, the jobs do not run. They get cancelled the ...
0 votes
0 answers
90 views
SLURM job script - why is the tmp local directory deleted before archiving can occur? How to prevent this?
I wrote a SLURM job script to run a computational chemistry calculation using the CREST program (part of the xtb software package). In the script, I create a temporary directory on the local storage ...
0 votes
1 answer
86 views
slurm - minimizing effect of offline CPUs
I am running experiments to see how Slurm behaves when it finds offline CPUs. In my experiments, Slurm provides configurations that make too few CPUs available. Here are a few examples from an 8-CPU node ...
0 votes
1 answer
588 views
slurm - is it possible to query slurmctld/slurmd to know if they are using the right slurm.conf version?
I am facing a problem where slurmctld and slurmd are not using the same slurm.conf file, so we get this: error: Node node1 appears to have a different slurm.conf than the slurmctld....