How to scatter a global 3D field from a root to all MPI processes (from the plugin-side)

marcowurth · March 3, 2026, 5:00pm

Hi, cool to see such a modern new forum, this will be great for open-source development!

Coincidentally I just came across a task I couldn’t find a good resolution for yet and maybe it will be of use to others if we discuss this here…

So I’m developing a larger ComIn plugin which in a callback function called once at the beginning is reading in global icon fields from a netcdf file and this data is written into the host icon fields, it is basically an externally written initialization of icon fields. The problem arises because with MPI-only parallelization (which we use at KIT, although I never really understood why not run hybrid MPI-OpenMP) if all MPI processes of a node are reading in the global 3D fields from the file, the node runs out of memory. Then okay, I switched to reading in the global 3D file data from a single root MPI process only but then broadcasting this global field to all processes also needs to much memory.

What would be the easiest way to manually scatter the global field from one to all processes, so that every processes receives their p_patch data only (therefore keeping memory usage low)? And should ComIn have a helper function for this kind of task?

marcowurth · March 3, 2026, 5:59pm

A rather complicated way I can think of would be to have all receiving MPI processes each prepare a list of all global indices of their respective p_patch cells and gather this lists from the root process. Then have the root subset the global field and send the cut out p_patch data manually to all the other processes with MPI_Send/MPI_IRecv. Or split the global field into a new padded array (because cell numbers vary) and then send it with MPI_Scatter at once.

In the global descriptive data that ComIn provides, is there something like a reversed glb_index() that returns the individual process ranks for given global indices which the root process could use to recreate the domain decomposition?

Am I thinking to complicated in all this while trying to do external file I/O?

k202077 · March 4, 2026, 7:56am

I just recently removed some of the last traces of root-based input…please do not add it again.

Root-based IO has:

slow read performance
slow redistribution performance
bad memory scaling (will always crash for high enough grid resolutions)

It is better to use a distributed read approach:

the global array is split into M contiguous parts
- M is a subset of N (total number of processes)
- M is selected such that there is only one or very few processes per node
each of the M processes reads in its chunk of the global data
data redistribution from IO to ICON decomposition is handled by dedicated communication interface (`mo_communication` or `yaxt`)

Check out `src/io/shared/distributed` on details on how this is implemented in ICON.

k202077 · March 4, 2026, 8:08am

In case the original data is not on the ICON grid and in a much coarse resolution than (e.g. ERA5 data) the ICON grid, it makes more sense to read in the original data and map it on-the-fly to the ICON grid.

The reading of the coarse data can be done by dedicated read processes and the mapping to the ICON grid via YAC.

Check out `etc/`, which contains a couple of these read programs.

Nils3er · March 4, 2026, 11:16am

Hi Marco,

Thanks for bringing this up here.

This problem arises basically for all input data that ICON needs, and as Moritz already pointed out, there are also some utilities that solve the problem, though not using ComIn. In ComIn’s plugins/ directory you can find a yaxt_c (also yaxt_fortran if you prefer) plugin that uses yaxt to redistribute data across the compute nodes. You could easily alter that to read in data on a subset of processes and then scatter it globally.

Also, as Moritz already wrote: If the data is actually living on a different grid than ICON, you could use YAC to interpolate (and distribute) it. Examples using YAC can be found in plugins/yac_input (already almost your use case) or in plugins/python_adapter/examples/yac_example.py.

Remark: The examples usually use an external (non-ICON) process to read the data and then communicate the data to the ICON processes. Alternatively, you could also read in the data on ICON processes and redistribute from there. This makes the plugin a bit more complex but doesn’t need the MMD setup in the runscript. Which is better depends on your use case, I would say.

Hope that helps!
Nils

marcowurth · March 5, 2026, 2:27pm

Thanks for the convincing reply! Having something like 1 process per node to do I/O and then communicate this via yaxt to all core ICON processes seems like a very good idea indeed. The data is on the icon grid, so no interpolation needed in my case (only vertical but that’s already handled).

So using my global plugin communicator how I can check if my ICON process is the single “chosen“ one of my node? Or how can I subset my N processes, so that every one of the M lies on a different node?

I will have a look at what’s done in src/io/shared/mo_read_netcdf_distributed.f90 but well after all from the plugin I can’t use any ICON routines or functionalities that are not explicitely provided.

marcowurth · March 5, 2026, 2:44pm

Hi Nils, I was not aware at all of what yaxt actually is or does and until now I tried to keep it simple and stay away from adding more external code because building everything just keeps getting more complicated then. But after looking into the yaxt documentation and the yaxt_fortran example a bit (1000 thanks for providing that example!!!) it seems like exactly what I need to communicate the global data properly to the domain decomposition. As I said in the other reply, I’m a little clueless how to spawn these external I/O processes (what is a MMD setup?) or how to subset the ICON processes. In my case I don’t need asynchronous compute & I/O and neither parallel I/O if that’s what you are refering to…

k202077 · March 5, 2026, 2:58pm

In YAC this is handled by the routine check_io_max_num_ranks. You can get some inspiration from there. If built with YAC, you can also use the C routine yac_get_io_ranks directly (also available in the Python interface). The IO ranks can be configured via a couple of environment variables (see YAC: Configuration of parallel IO in YAC).

Nils3er · March 5, 2026, 3:57pm

how can I subset my N processes, so that every one of the M lies on a different node?

I think the easiest would be to specify a number like ranks_per_node (like 128) and then use MPI_Comm_split with color=rank/ranks_per_node and then use the root rank (rank==0) as I/O process for the new communicator.

On that communicator you can gather the global indices, read the data for these indices from the file and than scatter it back. This wouldn’t even require yaxt.

MPMD (sorry “MMD” is a typo) stands for Multiple Process Multiple Data. In the context of MPI it means that there a different executables run by the processes. This is then a bit more complex to setup in the runscript. With mpirun you can simple do something like

mpirun -n 42 ./icon : -n 3 python script.py

to run different executables. With slurm it is a bit different.

Regarding installing yaxt: You probably have to set the --prefix flag with configure. This sets the path where to install yaxt.

k202077 · March 5, 2026, 7:51pm

Sorry, I posted the wrong routine before:
check_io_max_num_ranks_per_node ( YAC: src/core/io_utils.c File Reference ).
It internally splits the given communicator into multiple communicators (one per node).

marcowurth · March 6, 2026, 10:39am

Nice, the MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, ...) is the secret sauce I was looking for. The MPI library basically fills the MPI_COMM_TYPE_SHARED with underlying topology information, that’s some cool stuff!

I was a little afraid to use a MPI_Comm_split with color values from a truncated integer division as you mentioned Nils because I know some processes that I start ICON with get reserved to be doing I/O and some do prefetching and I don’t know which ranks do that. MPI_Comm_split_type seems more robust to that!

The MPMD MPI setup seems like black magic to me already, I think that would be overkill for my case, cool to see what’s possible though.

Unfortunately as I feared a bit, trying to build ComIn with YAXT is slowly becoming a nightmare as I keep getting linking problems, no matter which intel compiler version I try, I did set -DCMAKE_PREFIX_PATH=path_to_yaxt_build and the LD_LIBRARY_PATH to include it. I guess I’ll create an ComIn issue for that and report there.

Nils3er · March 6, 2026, 10:47am

yaxt doesn’t provide a CMake config file. So it relies on the FindYAXT.cmake file in ComIn. Setting YAXT_ROOT (environment variable) to the install prefix of yaxt should work.

marcowurth · March 6, 2026, 12:54pm

Tried that also but doesn’t work, YAXT was already found apparently by cmake.

https://gitlab.dkrz.de/icon-comin/comin/-/issues/316