
How can I get reasonable parallelisation on multi-core nodes without saturating resources? As in many similar questions, the real question is how to learn to tweak GNU Parallel to get reasonable performance.

In the following example, I can't get processes to run in parallel without saturating resources, or else everything seems to run on one CPU after using some -j -N options.

From inside a Bash script running on a multi-core machine, the following loop is piped to GNU Parallel:

for BAND in $(seq 1 "$BANDS"); do
    echo "gdalmerge_and_clean $VARIABLE $YEAR $BAND $OUTPUT_PIXEL_SIZE_X $OUTPUT_PIXEL_SIZE_Y"
done | parallel

However, this saturates the machine and slows down processing.
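The same list of command lines could also be generated by Parallel itself via its {} replacement string; a sketch equivalent to the loop above, assuming the variables are set as in my script:

# Feed the band numbers straight to parallel; {} is replaced by each
# input line, so no command strings need to be echoed by hand:
seq 1 "$BANDS" |
    parallel gdalmerge_and_clean "$VARIABLE" "$YEAR" {} "$OUTPUT_PIXEL_SIZE_X" "$OUTPUT_PIXEL_SIZE_Y"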

In man parallel I read:

--jobs -N
-j -N
--max-procs -N
-P -N

Subtract N from the number of CPU threads.

Run this many jobs in parallel. If the evaluated number is less than 1 then 1 will be used.

See also: --number-of-threads --number-of-cores --number-of-sockets
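To see what a given -j value evaluates to in practice, the job-slot replacement string {%} can be used; the highest slot number printed equals the number of jobs that actually ran concurrently (a sketch with a throwaway workload, not my real one):

# On a 40-thread node, -j -3 should evaluate to 37 concurrent jobs.
# Each fake job sleeps briefly so the slots actually fill up; {%}
# expands to the job slot, so the maximum printed value is the
# effective concurrency:
seq 100 | parallel -j -3 'sleep 1; echo {%}' | sort -n | tail -1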

Based on that, I tried:

|parallel -j -3 

but this, for some reason, uses only one CPU out of the 40. Checking with [h]top, only one CPU shows high use; the rest are down to 0. Should -j -3 not use 'number of CPU threads' - 3, i.e. 37 CPUs in this example?
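A quick way to rule out a problem in generating the command lines is --dryrun, which makes Parallel print the commands it would run without executing them:

# Print, without running, what parallel receives; the number of
# printed lines should equal $BANDS:
for BAND in $(seq 1 "$BANDS"); do
    echo "gdalmerge_and_clean $VARIABLE $YEAR $BAND $OUTPUT_PIXEL_SIZE_X $OUTPUT_PIXEL_SIZE_Y"
done | parallel --dryrun -j -3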

I then extended the previous call to:

-j -3 --use-cores-instead-of-threads 

admittedly doing so blindly. I've read https://unix.stackexchange.com/a/114678/13011, and I know from the admins of the cluster where I run such parallel jobs that hyperthreading is disabled. This still runs on one CPU.

I am now trying to use the following:

for BAND in $(seq 1 "$BANDS"); do
    echo "gdalmerge_and_clean $VARIABLE $YEAR $BAND $OUTPUT_PIXEL_SIZE_X $OUTPUT_PIXEL_SIZE_Y"
done | parallel -j 95%

or with |parallel -j 95% --use-cores-instead-of-threads.

Note

For the record, this is part of a batch job scheduled via HTCondor, with each job running on a separate node with some 40 physical CPUs available.

Above, I kept only the essentials; the complete for loop piped to parallel is:

for BAND in $(seq 1 "$BANDS"); do
    # Do not extract, unscale and merge if the scaled map exists already!
    SCALED_MAP="era5_and_land_${VARIABLE}_${YEAR}_band_${BAND}_merged_scaled.nc"
    MERGED_MAP="era5_and_land_${VARIABLE}_${YEAR}_band_${BAND}_merged.nc"
    if [ ! -f "$SCALED_MAP" ]; then
        echo "log $LOG_FILE Action=Merge, Output=$MERGED_MAP, Pixel size=$OUTPUT_PIXEL_SIZE_X $OUTPUT_PIXEL_SIZE_Y, Timestamp=$(timestamp)"
        echo "gdalmerge_and_clean $VARIABLE $YEAR $BAND $OUTPUT_PIXEL_SIZE_X $OUTPUT_PIXEL_SIZE_Y"
    else
        echo "warning 'Scaled map $SCALED_MAP exists already! Skipping merging.'"
    fi
done | parallel -j 95%
log "$LOG_FILE" "Action=Merge, End=$(timestamp)"
where `log` and `warning` are custom functions.
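Since log and warning (and presumably gdalmerge_and_clean) are Bash functions, they also need to be exported before the command lines reach Parallel, because GNU Parallel runs each job in a fresh shell; a sketch:

# Bash functions are invisible to the fresh shells GNU Parallel
# starts unless exported first (env_parallel is the alternative):
export -f log warning gdalmerge_and_clean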
  • Does parallel correctly detect your CPU cores? What does parallel --number-of-cores print? Also, does it work as expected if you explicitly specify the number of cores to be used, e.g. -j 17? Commented Nov 29, 2022 at 14:25
  • Yes, parallel correctly detects all CPU cores. In one job running on some node, for example, it reports 40. Not all nodes are the same, however; jobs will only be assigned to machines with at least 24 CPUs. The idea behind -j 95% was to avoid hardcoding the number of CPUs. Nonetheless, in my latest attempts I just put -j 20. Commented Dec 1, 2022 at 14:56

1 Answer

To debug this, I suggest you first run something simpler than gdalmerge_and_clean.

Try:

seq 100 | parallel 'seq {} 100000000 | gzip | wc -c' 

Does this correctly run one job per CPU thread?

seq 100 | parallel -j 95% 'seq {} 100000000 | gzip | wc -c' 

Does this correctly run 19 jobs for every 20 CPU threads?

My guess is that gdalmerge_and_clean is actually started in the correct number of instances, but that it is I/O-bound and spends its time waiting: your disk or network is pushed to the limit while the CPUs sit idle.

You can verify that the correct number of copies is started with ps aux | grep gdalmerge_and_clean.
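For a live count while the batch runs, a sketch using standard tools (pgrep -f matches against the full command line, like the grep above):

# Refresh a count of running gdalmerge_and_clean instances every second:
watch -n 1 'pgrep -fc gdalmerge_and_clean'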

You can see whether your disks are busy with iostat -dkx 1 (from the sysstat package).
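If the disks turn out to be the bottleneck, running fewer jobs usually finishes sooner than running more. Two standard knobs, sketched here with the gzip test above rather than your real workload:

# Cap the number of simultaneous jobs well below the thread count:
seq 100 | parallel -j 8 'seq {} 100000000 | gzip | wc -c'

# Or defer starting new jobs while the load average exceeds one per
# CPU thread:
seq 100 | parallel --load 100% 'seq {} 100000000 | gzip | wc -c'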

  • I think you are right. At some point, I observed via [h]top multiple CPUs kicking in! I guess when resources were freed. Background: jobs run in JRC's BDAP platform, scheduled via HTCondor. Physical machines that compose the cluster offer shared resources to Docker containers in which, ideally, one process runs on one CPU. My workflow isn't exactly a fit for an HTC system; it would run better on an HPC system. Or I should design a workflow using DAGMan. GNU Parallel rocks, however! Commented Dec 1, 2022 at 14:47
  • What can then be done from the user's side about network and/or disk I/O bottlenecks? Can GNU Parallel be tweaked in this regard, or is that irrelevant and outside its territory? Commented Dec 1, 2022 at 14:51
  • Also: grep correctly shows some 20 jobs now (after parallel -j 20). Commented Dec 1, 2022 at 14:56
  • @NikosAlexandris You will need to find the optimal number to run in parallel. You do that by increasing the number of jobs in parallel until the average runtime increases. Commented Dec 1, 2022 at 23:45
