Questions tagged [slurm]

Question 1

I've been trying to configure slurmdbd in a small test cluster. So far it's able to start successfully. However, when I try to do anything with sacctmgr, like sacctmgr list cluster, the command hangs, ...

Question 2

I'm working on a "cluster" that currently has only one compute node, with 8x H100 GPUs. Slurm is configured such that each GPU is available either as a whole GPU, or as 20 shards. The (from ...

Question 3

What’s the right way to pull a complete answer to an InfluxQL query over HTTP? I’m using the acct_gather plugin for a Slurm cluster. It sends resource usage data to an InfluxDB v1 database. So if I ...

Question 4

I'd like my log files to be named after a variable. Since this isn't possible:
#SBATCH --output some_software.${var}.out
#SBATCH --error some_software.${var}.err
I came across this workaround but ...
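One commonly used approach (not necessarily the workaround the asker refers to) is to pass the filenames as `sbatch` command-line options, where the variable is expanded before submission and the options take precedence over `#SBATCH` directives in the script. A minimal sketch, assuming a hypothetical variable value `run42` and a hypothetical script name `job.sh`:

```python
import shlex


def build_sbatch_command(var: str, script: str) -> list[str]:
    """Build an sbatch argv whose log filenames embed `var`.

    The filename pattern mirrors the one in the question; `var` and
    `script` are supplied by the caller.
    """
    return [
        "sbatch",
        f"--output=some_software.{var}.out",
        f"--error=some_software.{var}.err",
        script,
    ]


if __name__ == "__main__":
    # "run42" and "job.sh" are hypothetical placeholders.
    cmd = build_sbatch_command("run42", "job.sh")
    # Print the command for inspection instead of submitting it.
    print(shlex.join(cmd))
```

The list can be handed to `subprocess.run` as-is once the printed command looks right.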

Question 5

If I run a Slurm job with sbatch, I can specify output and error filenames with a custom format. But how can I look up these filenames, given the job ID (e.g. 123456) of a running job? For example: ...
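For a job the controller still knows about, `scontrol show job <jobid>` prints `StdOut=` and `StdErr=` fields. A minimal sketch of pulling those two fields out of that output; the helper names are hypothetical, and the regex assumes the paths contain no whitespace:

```python
import re
import subprocess


def parse_std_paths(scontrol_text: str) -> dict[str, str]:
    """Extract the StdOut/StdErr values from `scontrol show job` output."""
    paths = {}
    for key in ("StdOut", "StdErr"):
        # Assumption: the path has no spaces, so \S+ captures all of it.
        match = re.search(rf"\b{key}=(\S+)", scontrol_text)
        if match:
            paths[key] = match.group(1)
    return paths


def job_output_paths(job_id: str) -> dict[str, str]:
    """Run scontrol for one job ID and return its output/error paths."""
    result = subprocess.run(
        ["scontrol", "show", "job", job_id],
        capture_output=True,
        text=True,
        check=True,  # raise if scontrol reports an error (e.g. unknown job)
    )
    return parse_std_paths(result.stdout)
```

Calling `job_output_paths("123456")` on a submit host would return a dictionary with whichever of the two keys were present in the output.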

Question 6

I'd like to make a shell alias for customers to use like squeue --format="%.9i %10j %8u %8T %.12M %6D %20R %16P %.4C %m" The right-most %m format tag shows how much memory a job requested ...

Question 7

When I specify in gres.conf to omit the first GPU, processes in Slurm still use the first one. If I allow Slurm to manage both, the second concurrent process properly goes onto the second GPU. Why? ...

Question 8

On a slurm cluster, is there ever a time when it’s appropriate to use sbatch inside an sbatch script? Or is it always a bad pattern? I’ve seen this in use, and it looks iffy: #SBATCH -J ...

Question 9

I'm in the process of setting up a single-node Slurm workstation, and I believe I followed the process closely and everything is working just fine. See below: sudo systemctl restart slurmdbd &...

Question 10

Hi, I was working on installing Slurm and got most things sorted out, but upon launching sudo journalctl -fu slurmdbd I get the following: Jan 25 12:49:49 ... systemd[1]: Stopped slurmdbd.service - ...

Question 11

I am trying to understand how the cgroups memory resource controller is enabled on Ubuntu 20.04. I have several Ubuntu machines that make up a Slurm 23.02.7 cluster. In cgroup.conf, SchedMD ...

Question 12

I am new to Slurm. I have set it up on the cluster. On some nodes of a partition, jobs run perfectly fine, but on other nodes of the same partition, jobs do not run. They get cancelled the ...

Question 13

I wrote a SLURM job script to run a computational chemistry calculation using the CREST program (part of the xtb software package). In the script, I create a temporary directory on the local storage ...

Question 14

I am experimenting to see how Slurm behaves when it finds offline CPUs. In my experiments, Slurm provides configurations that make too few CPUs available. Here are a few examples from an 8-CPU node ...

Question 15

I am facing a problem where slurmctld and slurmd are not using the same slurm.conf file, so we get this: error: Node node1 appears to have a different slurm.conf than the slurmctld....

sacctmgr hanging with slurmdbd reporting error: read(13): No error

Slurm allocates job requesting entire GPU to same GPU as jobs requesting shards already running on that GPU

InfluxQL query returns a "partial" answer over http. Can I curl the whole thing?

SLURM error and output files with custom variable name

Get the output filename(s) for a running slurm job

How to change Slurm squeue output column label?

Slurm jobs ignore GPU skipping in gres.conf

Is sbatch-inside-sbatch a bad idea?

single node Slurm machine, munge authentication problem

persistent - slurmdbd: error: mysql_query failed: 1193 Unknown system variable 'wsrep_on'

Where is documentation for `/boot/config-<kernel_version>`?

Slurm IO error, could not open stdoutfile

SLURM job script - why is the tmp local directory deleted before archiving can occur? How to prevent this?

slurm - minimizing effect of offline CPUs

slurm - is it possible to query slurmctld/slurmd to know if they are using the right slurm.conf version?
