Self nesting#


Warning

This site is Work in Progress.

ToDo:

  • add description of particle transfer
  • add ocean-coupling related descriptions if necessary

General Features#

See the guide section for general features and usage of self-nesting.

Nesting modes#

Nesting data transfer between parent and child models can operate in two alternative modes:

  • One-way means that children receive solution data from their parents for setting boundary conditions on their nested boundaries, but parents do not use any solution data from their children.

  • Two-way means that, in addition, parents use solution data received from their children to modify the parent solution in the area occupied by a child domain using the so-called post-insertion procedure (Clark and Hall, 1991).

In case of two-way coupling and a cascading arrangement of more than one nested domain, the data-transfer operations (from parent to child and from child to parent) can be carried out in three alternative orders:

  • cascade: first from parent to child consecutively all the way down to the lowest-level child, and then from child to parent consecutively all the way up to the root level
  • overlap: simultaneous parent-to-child and child-to-parent operations
  • mixed (default): simultaneous operations from parent to child and consecutive operations from child to parent

Cascade follows the formally correct order of operations and is the slowest data-transfer mode because it involves more waiting time than the other modes. Overlap is the fastest one, but it ignores the information transferred on previous stages of the cascade. Mixed is the default and the recommended data-transfer mode because it is faster than cascade and does not ignore the data anterpolated on the previous stages of the cascade. In practice, this matters only for the child-to-parent transfer (anterpolation), in which 3-D data is substituted into the parent domain. For the parent-to-child transfer it is not critical, because data is only substituted on the child boundaries, which are located in different places on the different-level nested domains within the cascade.

Child boundary conditions on the nested boundaries are set by interpolation, which is described and discussed by Hellsten et al. (2021). The post-insertion of child data to the parent (called anterpolation), the phases of the time-step algorithm at which these actions take place, as well as some other features are also reported in detail in Hellsten et al. (2021). It should be noted that the interpolation and anterpolation operations are performed on the intermediate velocity field, i.e. before the pressure-correction step. The reason is that otherwise a second pressure-correction step would be needed for the anterpolated fields, which would be computationally very costly.

Overall code arrangement#

The nesting data-transfer pattern is complicated since both child and parent models are internally parallelized and their subdomain decompositions are generally different and non-matching. Therefore, a sending process typically has to send data to more than one receiving process, a receiving process has to receive data from more than one sending process, and subdomains may be only partially involved in the transfer. This is illustrated in Figure 1 by a simple example. This complicated data-transfer pattern is implemented in the PALM Model Coupler (PMC). It is largely based on a hierarchy of defined data types and creative use of pointers.

parent_child_datatransfer_pattern
Figure 1: A simple example of the data-transfer pattern between parent and child in a horizontal plane. Blue represents the parent and red the child. Designations like P0P and P0C stand for process zero of the parent and process zero of the child, respectively, and so on. The list on the left side shows to which child processes each parent process sends data, and the list on the right side shows to which parent processes each child process sends data. Grid lines are drawn as thin dotted lines and the thin solid lines are subdomain borders. The red dashed rectangle shows the full horizontal extent of the child's parent-grid arrays.

The data transfer is based on the parent grid. A parent only sends and receives data, while interpolation from the parent grid to the child grid and anterpolation from the child grid to the parent grid are always performed by the child. This significantly reduces the amount of data to be transferred. Children have so-called child's parent-grid arrays which are defined on the parent grid and span the child-occupied subvolume of the parent domain, with two layers of ghost nodes beyond the outer boundaries but not beyond the internal boundaries.

The PMC consists of the PMC core (five modules), the PMC-interface module, and the PMC particle-interface module. The PMC core is connected to the rest of the code through the PMC interface only. The PMC core modules do not access any modules other than the other PMC modules, with the exception of module kinds, which contains the data-type kind definitions. The PMC module source-code files are:

pmc_interface_mod.f90
pmc_particle_interface.f90 
    pmc_general_mod.f90
    pmc_handle_communicator_mod.f90 
    pmc_parent_mod.f90
    pmc_child_mod.f90 
    pmc_mpi_wrapper_mod.f90 

The PMC core was not originally developed for PALM but for global-to-regional climate and weather forecast model coupling, and it was originally written in the C language; only later, during the PALM-nesting development, was it translated to Fortran. It therefore looks very different from the rest of the PALM source code, and for this reason the PMC source code is described here in more detail than other parts of the PALM code. The interface modules pmc_interface and pmc_particle_interface were developed specifically for PALM to provide the interface to the PMC core.

The pmc_interface is a relatively large module containing subroutines for the upper-level nesting initiation and preparation operations, all child-solution initialization operations, the interpolation and anterpolation operations, as well as calling routines to the actual data-transfer routines in the PMC core. All the subroutine names in the PMC interface begin with pmci_. Similarly, all subroutines in pmc_general start with pmc_g_, in pmc_parent with pmc_p_ and in pmc_child with pmc_c_.

Nesting initiation and initial preparations#

Initiation of a nested run. At the beginning of the PALM main program, subroutine pmci_init in the pmc_interface is called. This subroutine calls pmc_init_model, which resides in the pmc_handle_communicator module, one of the five PMC core modules. By calling pmc_parin, this subroutine first detects whether the &nesting_parameters namelist is given in the root namelist file (suffix _p3d). If it is not found, pmc_init_model returns such that the run continues as a non-nested run. If it is found, pmc_parin reads the nesting_parameters namelist and then pmc_init_model determines the processing-element space for each model and creates the model-specific communicators for each model's process group according to which model a process belongs to. These process groups are said to have different "colors". This is done by splitting the global world communicator MPI_COMM_WORLD by calling MPI_COMM_SPLIT, using the couple id m_my_cpl_id as the color and the global rank (the rank in MPI_COMM_WORLD) m_my_cpl_rank of the current process as the key.

Next, pmc_init_model creates inter-communicators for each parent-child pair by combining the local communicators of different colors by calling MPI_INTERCOMM_CREATE. This is done using a copy of MPI_COMM_WORLD as a peer communicator, meaning that MPI_INTERCOMM_CREATE is given the local communicator, the peer communicator and the global rank of the first process of the remote process group, through which it is able to identify the remote group. pmc_init_model counts the number of children of the current model and calls MPI_INTERCOMM_CREATE to create the inter-communicators between the current model and all of its children, m_to_child_comm(i), and between the current model and its parent, m_to_parent_comm.
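
To make the communicator handling more concrete, the following minimal, stand-alone sketch (not the PMC code; the toy model decomposition and all variable names are illustrative) performs the same two steps: MPI_COMM_SPLIT with the model id as the color and the global rank as the key, followed by MPI_INTERCOMM_CREATE with a duplicate of MPI_COMM_WORLD as the peer communicator.

    PROGRAM split_and_intercomm_sketch
       USE MPI
       IMPLICIT NONE
       INTEGER ::  color, ierr, local_comm, peer_comm, remote_leader, to_remote_comm, world_rank

       CALL MPI_INIT( ierr )
       CALL MPI_COMM_RANK( MPI_COMM_WORLD, world_rank, ierr )
    !
    !-- Toy decomposition (assumes at least three processes): ranks 0-1 form model 1 (the root),
    !-- all remaining ranks form model 2 (its child). The model (couple) id acts as the color
    !-- and the global rank as the key, so the rank order inside each model communicator
    !-- follows the global rank order.
       color = MERGE( 1, 2, world_rank < 2 )
       CALL MPI_COMM_SPLIT( MPI_COMM_WORLD, color, world_rank, local_comm, ierr )
    !
    !-- A duplicate of MPI_COMM_WORLD serves as the peer communicator; the remote group is
    !-- identified by the global rank of its first process (the remote leader).
       CALL MPI_COMM_DUP( MPI_COMM_WORLD, peer_comm, ierr )
       remote_leader = MERGE( 2, 0, color == 1 )
       CALL MPI_INTERCOMM_CREATE( local_comm, 0, peer_comm, remote_leader, 1, to_remote_comm,   &
                                  ierr )

       CALL MPI_FINALIZE( ierr )
    END PROGRAM split_and_intercomm_sketch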

The next call from the main program to the PMC interface is to pmci_modelconfiguration. It makes the initial preparations for all models by calling other PMC-interface subroutines which, for example, set up the coordinates of each model domain (pmci_setup_coordinates), determine the number of variables to be coupled (pmci_num_arrays), and set up the children (pmci_setup_child) and parents (pmci_setup_parent). The last two are large subroutines making calls to several lower-level subroutines.

Child setup. The subroutine pmci_setup_child performs the following operations. First it calls pmc_c_childinit, which is in module pmc_child. It collects the information on the local model communicator m_model_comm and on the inter-communicator m_to_parent_comm created by pmc_init_model into a childdef-type variable me. The childdef type is defined in the module pmc_general. This information includes the local model communicator, the inter-communicator, the rank of the current process, the size of the model communicator and the remote size of the inter-communicator. Then it creates an intra-communicator from the inter-communicator by calling MPI_INTERCOMM_MERGE (a minimal sketch of this merge is given after the type listing below). The reason why the inter-communicators have to be merged into intra-communicators is that the Remote Memory Access (RMA) windows used in the one-sided MPI communication cannot employ inter-communicators; they must be based on intra-communicators. The intra-communicator me%intra_comm first includes all the parent processes active for me (the current child), and then the current child processes. Next, it allocates a pointer array pes with one entry for each parent process connected with the current child. This array is an element of the childdef-type variable me and is itself of type pedef. Each element of pes in turn includes the arraydef-type array array_list, which is also allocated here. To understand the data structures, it is important to get familiar with the data types arraydef, pedef, and childdef defined in module pmc_general. Their definitions are listed below.


    TYPE ::  ij_index  !< pair of indices in horizontal plane
       INTEGER(iwp) ::  i  !<
       INTEGER(iwp) ::  j  !<
    END TYPE

    TYPE ::  arraydef
       CHARACTER(LEN=da_namelen) ::  name  !< name of array

       INTEGER(iwp) ::  coupleindex  !< ID of array
       INTEGER(iwp) ::  dimkey       !< key for NR dimensions and array type (2 = 2d-real, 3 = 3d-real, or 22  = 2d-integer*8))
       INTEGER(iwp) ::  nrdims       !< number of dimensions
       INTEGER(iwp) ::  recvsize     !< size in receive buffer
       INTEGER(iwp) ::  sendsize     !< size in send buffer
       INTEGER(idp) ::  recvindex    !< index in receive buffer
       INTEGER(idp) ::  sendindex    !< index in send buffer
       INTEGER(iwp) ::  ks           !< start index in z direction (3d arrays on parent only)
       INTEGER(iwp) ::  ke           !< end   index in z direction (3d arrays on parent only)

       INTEGER(iwp), DIMENSION(4) ::  a_dim  !< size of dimensions

       TYPE(C_PTR) ::  data     !< pointer of data in parent space
                                !< c pointers are used because they can point to different data types
                                !< e.g. REAL(2d), REAL(3d), INTEGER, etc.
                                !< otherwise, separate arraydef types would be required for each of them
       TYPE(C_PTR) ::  sendbuf  !< data pointer in send buffer
       TYPE(C_PTR) ::  recvbuf  !< data pointer in receive buffer

       TYPE(arraydef), POINTER ::  next  !<

       TYPE(C_PTR), DIMENSION(2) ::  po_data  !< base pointers, pmc_p_set_active_data_array
                                              !< sets active pointer to respective time level (e.g. u_1 or u_2)
    END TYPE arraydef

    TYPE ::  pedef
       INTEGER(iwp) ::  nr_arrays = 0  !< number of arrays which will be transfered
       INTEGER(iwp) ::  nrele          !< number of elements along x and y, same for all arrays

       TYPE(arraydef), POINTER, DIMENSION(:) ::  array_list  !< list of data arrays to be transfered

       TYPE(ij_index), POINTER, DIMENSION(:) ::  locind  !< i,j index local array for remote PE
    END TYPE pedef

    TYPE ::  childdef
       INTEGER(iwp) ::  inter_comm        !< inter communicator model and child
       INTEGER(iwp) ::  inter_npes        !< number of PEs child model
       INTEGER(iwp) ::  intra_comm        !< intra communicator model and child
       INTEGER(iwp) ::  intra_rank        !< rank within intra_comm
       INTEGER(iwp) ::  model_comm        !< communicator of this model
       INTEGER(iwp) ::  model_npes        !< number of PEs this model
       INTEGER(iwp) ::  model_rank        !< rank of this model
       INTEGER(idp) ::  totalbuffersize   !< size of RMA window
       INTEGER(iwp) ::  win_parent_child  !< MPI RMA window object for preparing data on parent AND child side

       TYPE(pedef), DIMENSION(:), POINTER ::  pes  !< list of all child PEs on parent or list of all parents on child
    END TYPE childdef
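
As mentioned above, the inter-communicators must be merged into intra-communicators because the RMA windows require them. A minimal sketch of this merge (not the PMC code; the wrapper and argument names are illustrative) is given below; the high argument decides which group is placed last in the merged communicator.

    !-- Merge a parent-child inter-communicator into an intra-communicator. Passing
    !-- high = .TRUE. on the child side places the child processes after the parent
    !-- processes in the merged communicator, matching the ordering of me%intra_comm
    !-- described above.
    SUBROUTINE merge_to_intra_comm( to_remote_comm, i_am_child, intra_comm )
       USE MPI
       IMPLICIT NONE
       INTEGER, INTENT(IN)  ::  to_remote_comm  !< inter-communicator, e.g. m_to_parent_comm
       LOGICAL, INTENT(IN)  ::  i_am_child      !< .TRUE. on the child side
       INTEGER, INTENT(OUT) ::  intra_comm      !< resulting intra-communicator

       INTEGER ::  ierr

       CALL MPI_INTERCOMM_MERGE( to_remote_comm, i_am_child, intra_comm, ierr )
    END SUBROUTINE merge_to_intra_comm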

After returning from pmc_c_childinit, pmci_setup_child sets the names for the child's parent-grid arrays by calling pmc_c_set_dataarray_name in module pmc_child. It sets the name information of each variable to be coupled in a da_namedef-type structure myname. At the end of pmc_c_set_dataarray_name, subroutine pmc_g_setname of module pmc_general is called. It sets myname%couple_index and myname%nameonchild into the childdef-type data structure me.

Next, pmci_setup_child exchanges child- and parent-grid information with its parent using the transfer subroutines pmc_send_to_parent and pmc_recv_from_parent defined in the module pmc_mpi_wrapper_mod. After this, pmci_map_child_grid_to_parent_grid is called to determine the local index bounds ipl, ipr, jps, and jpn of the data-exchange area in the parent-grid horizontal index space. This is the area shown by the red dashed line in Figure 1. Next, the child receives the so-called index list from its parent by calling pmc_c_get_2d_index_list. The index list is described below in the context of the parent setup. The next task is to create the child's parent-grid arrays by calling pmci_create_childs_parent_grid_arrays sequentially for each variable to be coupled. The list of variables to be coupled is a linked-list type construction controlled by the function pmc_c_getnextarray and based on the name information already set by pmc_c_set_dataarray_name. Subroutine pmci_create_childs_parent_grid_arrays allocates the requested child's parent-grid array, for example uc for u, and associates a temporary pointer p_3d (or p_2d if the variable in question is a 2-D array, or i_2d if it is an integer 2-D array). Then it passes this temporary pointer to the subroutine pmc_c_set_dataarray, which sets this array pointer and the array dimensions inside the structure me for every process on the parent side, i = 1, me%inter_npes. The core of this subroutine is given here as an example of how the data structures are typically operated on in the PMC.

    dims    = 1
    nrdims  = 3
    dims(1) = SIZE( array, 1 )
    dims(2) = SIZE( array, 2 )
    dims(3) = SIZE( array, 3 )

    array_adr = C_LOC( array )
!
!-- Fill the array_list structure for every parent PE that is communicating with this child.
    DO  i = 1, me%inter_npes
       ape => me%pes(i)
       ar  => ape%array_list(next_array_in_list)
       ar%nrdims = nrdims
       ar%dimkey = nrdims
       ar%a_dim  = dims
       ar%data   = array_adr
    ENDDO

Note that the type of ar%data is a C pointer. This is because the one-sided MPI communication operates on raw pointers (bare starting memory addresses) and thus cannot handle Fortran pointers. Therefore the PMC code includes quite a lot of pointer conversions and memory allocations returning C pointers via MPI_ALLOC_MEM, which is embedded in the wrapper routine pmc_alloc_mem in the module pmc_mpi_wrapper_mod. For this reason there is the USE statement

    USE, INTRINSIC :: ISO_C_BINDING

in the declaration section of most of the PMC modules. A C pointer to an array is simply the starting memory address of the array; it is obtained here by the function C_LOC(). The childdef-type data structure me includes the pedef-type array pes, which contains an element for every parent process communicating with the present child. It in turn contains the arraydef-type array array_list, here abbreviated as ar, which describes all the arrays to be coupled. This is the structure from which the final transfer buffer is filled in the actual child-to-parent data transfer in pmc_c_putbuffer, as explained in section Data transfer. Because ar%data is just a starting address, the dimensions also have to be stored in the data structure.
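
The following minimal, stand-alone sketch (not PMC code; the program and variable names are illustrative) shows the C_LOC / C_F_POINTER round trip on which these conversions are based.

    !-- C_LOC turns a Fortran array into a bare starting address, and C_F_POINTER turns such
    !-- an address (plus the separately stored dimensions) back into a usable Fortran pointer.
    PROGRAM c_pointer_roundtrip
       USE, INTRINSIC ::  ISO_C_BINDING
       IMPLICIT NONE
       REAL(C_DOUBLE), DIMENSION(:,:,:), ALLOCATABLE, TARGET ::  field
       REAL(C_DOUBLE), DIMENSION(:,:,:), POINTER             ::  field_view
       INTEGER, DIMENSION(3) ::  dims
       TYPE(C_PTR) ::  raw_adr

       ALLOCATE( field(4,3,2) )
       field = 1.0_C_DOUBLE

       dims    = SHAPE( field )  ! the bare address carries no shape, so it is stored separately
       raw_adr = C_LOC( field )  ! starting memory address, as stored e.g. in ar%data

       CALL C_F_POINTER( raw_adr, field_view, dims )  ! reattach the shape to the bare address
       PRINT*, 'sum over the reconstructed view: ', SUM( field_view )
    END PROGRAM c_pointer_roundtrip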

The next phase is to prepare for the data receiving from the parent that will be done in pmc_c_getbuffer and the data passing to the parent that will be done in pmc_c_putbuffer; see section Data transfer for more details. These preparations are made by pmc_c_setind_and_allocmem. It is important to understand that the RMA window, which is created by the parent, covers all the processes on the receiving side (i = 1, me%inter_npes) and all the coupled variables (j = 1, me%pes(i)%nr_arrays). However, the child gets the data one by one in pmc_c_getbuffer. Therefore, start indices into the window are needed. These start indices are received from the parent side by MPI_ALLTOALL in a two-dimensional array myindex_r. The elements of myindex_r are then stored in me%pes(i)%array_list(j)%recvindex in process (i) and array (j) loops. In the same nested loops the receive-buffer size is determined. The receive buffer does not cover the whole RMA window; instead, it is only of the size of a single child's parent-grid array, i.e. ar%a_dim(1)*ar%a_dim(2)*ar%a_dim(3). The maximum size over all i and j values is used for all of them. Next, base_array_pc is allocated by calling pmc_alloc_mem, which is a wrapper routine in module pmc_mpi_wrapper_mod that calls MPI_ALLOC_MEM and assigns the resulting C pointer to a Fortran pointer by C_F_POINTER. Next, the C pointer base_ptr to the receive buffer is set for all i and j, although it is the same for all of them because only one buffer is allocated and it is reused for all transfer iterations i and j in pmc_c_getbuffer. This completes the preparations for data receiving from the parent on the child side.

Next, the preparations for the child-to-parent data transfer are carried out in the latter part of pmc_c_setind_and_allocmem. First the send-buffer size sendsize (the size of one transfer unit in the i-j loop) and the start indices myindex_s are determined in loops as follows.


    bufsize = 8
    myindex_s = 0
    myindex_r = 0

    DO  i = 1, me%inter_npes
       ape => me%pes(i)
       DO  j = 1, ape%nr_arrays
          ar => ape%array_list(j)
          IF ( ar%nrdims == 2 )  THEN
             arlen = ape%nrele
          ELSEIF( ar%nrdims == 3 )  THEN
             arlen = ape%nrele * ar%a_dim(1)
          ENDIF
          IF ( ape%nrele > 0 )  THEN
             ar%sendindex = myindex
          ELSE
             ar%sendindex = noindex
          ENDIF
          myindex_s(j,i-1) = ar%sendindex
          IF ( ape%nrele > 0 )  THEN
             ar%sendsize = arlen
             myindex     = myindex + arlen
             bufsize     = bufsize + arlen
          ENDIF
       ENDDO
    ENDDO

The initial value of bufsize must not be zero, because an RMA window cannot be of zero size; therefore an arbitrary small size of 8 is given initially. The size of a horizontal layer of an individual array to be transferred is given by nrele and its vertical dimension by ar%a_dim(1). Note that nrele may be smaller than the horizontal size of a child's parent-grid array on the sending process (which is (jpn-jps+1)*(ipr-ipl+1)), because the receiving parent-process subdomain may overlap the sending child subdomain only partially, see Figure 1. To indicate a passive parent process (no overlap at all and nrele = 0), the send index is set to noindex, which equals -1. The index i-1 in myindex_s(j,i-1) is needed because MPI process id-numbering always starts from zero, while the process loop i here starts from 1. After this, the start indices are sent to the parent by MPI_ALLTOALL. Next, base_array_cp is allocated using pmc_alloc_mem and the RMA window is created on the intra-communicator me%intra_comm by calling MPI_WIN_CREATE. Finally, the buffer pointers are set as follows


    DO  i = 1, me%inter_npes
       ape => me%pes(i)
       DO  j = 1, ape%nr_arrays
          ar => ape%array_list(j)
          IF ( ape%nrele > 0 )  THEN
             ar%sendbuf = C_LOC( base_array_cp(ar%sendindex) )
             ...
          ENDIF
       ENDDO
    ENDDO

C_LOC returns a C pointer to the entry of base_array_cp indicated by ar%sendindex. The three dots stand for seven code lines containing a buffer-size check and error messaging, which are omitted here for clarity and conciseness. This completes the child-side preparations for the child-to-parent transfer and the subroutine pmc_c_setind_and_allocmem.
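
The pattern used here can be summarized by the following sketch (not the PMC code; the routine and variable names are illustrative): a buffer is allocated with MPI_ALLOC_MEM, made usable in Fortran with C_F_POINTER, and exposed as an RMA window on the merged intra-communicator with MPI_WIN_CREATE.

    !-- Allocate a transfer buffer with MPI_ALLOC_MEM and expose it as an RMA window on an
    !-- intra-communicator, in the way base_array_cp is handled above.
    SUBROUTINE create_rma_window( intra_comm, nr_elements, buf, win )
       USE, INTRINSIC ::  ISO_C_BINDING
       USE MPI
       IMPLICIT NONE
       INTEGER, INTENT(IN)  ::  intra_comm   !< merged intra-communicator
       INTEGER, INTENT(IN)  ::  nr_elements  !< number of 8-byte real elements in the buffer
       REAL(C_DOUBLE), DIMENSION(:), POINTER, INTENT(OUT) ::  buf  !< Fortran view of the buffer
       INTEGER, INTENT(OUT) ::  win          !< resulting RMA window

       INTEGER ::  ierr
       INTEGER(KIND=MPI_ADDRESS_KIND) ::  winsize
       TYPE(C_PTR) ::  base_ptr

       winsize = INT( nr_elements, MPI_ADDRESS_KIND ) * 8_MPI_ADDRESS_KIND  ! size in bytes
       CALL MPI_ALLOC_MEM( winsize, MPI_INFO_NULL, base_ptr, ierr )
       CALL C_F_POINTER( base_ptr, buf, (/ nr_elements /) )
       CALL MPI_WIN_CREATE( buf, winsize, 8, MPI_INFO_NULL, intra_comm, win, ierr )
    END SUBROUTINE create_rma_window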

Next, pmci_setup_child calls its internal subroutine pmci_define_index_mapping, which defines the mappings from the parent grid to the child grid. For example, iflo(ip) and ifuo(ip) indicate the lowest and highest child-grid indices in the x-direction that correspond to the parent-grid point ip. Such mapping arrays are defined for all three directions and separately for scalar nodes and for the staggered velocity-component nodes in each direction. These mappings are needed in the interpolation and anterpolation. The subroutine also determines and saves the number of child-grid nodes used for anterpolation for each parent-grid node.
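
The sketch below (not the PALM code; only the names iflo and ifuo follow the text, everything else is illustrative) indicates how such a mapping could be built under the simplifying assumptions of aligned grids and an integer grid-spacing ratio.

    !-- Build a parent-to-child index mapping in the x-direction, assuming the child grid is
    !-- aligned with the parent grid and the grid-spacing ratio igsr = dx_parent / dx_child
    !-- is an integer.
    SUBROUTINE build_index_mapping( ipl, ipr, igsr, ioff, iflo, ifuo )
       IMPLICIT NONE
       INTEGER, INTENT(IN) ::  ipl, ipr  !< parent-grid index bounds of the child-covered area
       INTEGER, INTENT(IN) ::  igsr      !< grid-spacing ratio in the x-direction
       INTEGER, INTENT(IN) ::  ioff      !< child-grid index of the first child node inside parent cell ipl
       INTEGER, DIMENSION(ipl:ipr), INTENT(OUT) ::  iflo  !< lowest  child index inside parent cell ip
       INTEGER, DIMENSION(ipl:ipr), INTENT(OUT) ::  ifuo  !< highest child index inside parent cell ip

       INTEGER ::  ip

       DO  ip = ipl, ipr
          iflo(ip) = ioff + ( ip - ipl ) * igsr
          ifuo(ip) = iflo(ip) + igsr - 1
       ENDDO
    END SUBROUTINE build_index_mapping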

At the end of pmci_setup_child there are still a few operations left. First, the consistency of the nested-grid setup is checked by pmci_check_grid_matching. Then the child-domain boundary-face areas are computed by pmci_compute_face_areas to be used in the mass-conservation control in pmci_ensure_nest_mass_conservation. Lastly, the anterpolation starting height as a function of x and y is determined by pmci_compute_kpb_anterp. This is needed for canopy-restricted anterpolation.

Parent setup. The subroutine pmci_setup_parent performs largely the same operations as pmci_setup_child, but there are differences, too. One difference is that a parent may have several children while a child can only have one parent. Therefore pmci_setup_parent handles the parent-child pairs (called couplers) in childid-loops over the number of children of the present parent. Another difference is that on the parent side, in pmci_set_array_pointer, the temporary pointer p_3d is set to the whole array, for instance u, instead of a coupling array such as uc on the child side; such coupling arrays do not exist on the parent side. Moreover, because of the time-level swapping of the Runge-Kutta time-integration scheme, another temporary pointer p_3d_sec also has to be set for the second time-level variable, for instance u_2. Also, on the parent side dims has four elements instead of three as on the child side. The first three are the dimensions of the actual prognostic array, for instance u, and the fourth is the vertical dimension of the corresponding child's parent-grid array, for example uc. Perhaps the largest difference to pmci_setup_child is that the parent creates the index list, prepares the complete parent-child mapping and finally transfers the mapping information to the child.

The index list is an integer array which maps the child index and subdomain space onto the parent index and subdomain space. Its dimension is (6,index_list_size), where the six elements of the first dimension are the i- and j-index on the parent grid, the i- and j-index in the child's parent-grid arrays, the child process id and the parent process id. The second dimension index_list_size is the number of grid points of the parent-grid x-y-plane subspace covered by the child domain, including the ghost-node layers, see Figure 1. Note that the index list contains only horizontal index information. This is a consequence of the two-dimensional domain decomposition. The only vertical index information needed is the vertical dimension of the transfer domains, and this is provided by dims(4). The index list is created by parent process 0 in subroutine pmci_create_index_list. After creating the index list, it calls subroutine pmc_p_split_and_broadcast_index_list of module pmc_parent, which splits the index list into local index lists according to the parent process id of the index-list elements and sends each local index list to the parent process it belongs to. Then it calls pmc_p_prepare_index_list_for_child, which arranges the data according to the child process id it will be sent to, and defines the start indices, the number of elements to be sent to each child process and the i and j indices of the child's parent-grid array. Finally it sends this information to the child processes, and the pmc_child module subroutine pmc_c_get_2d_index_list receives it as mentioned above in the description of pmci_setup_child, storing the number of elements in me%pes(i)%nrele and the i and j grid indices in me%pes(i)%locind(j)%i and me%pes(i)%locind(j)%j, respectively. Here the loop index i runs over all the remote processes of the inter-communicator me%inter_comm (usually many of them are not involved in the current transfer, and in those loop iterations nothing is done). The loop index j runs over the array elements on each child process (note that these i and j are not grid indices). These actions complete the mapping of the child index and subdomain space onto the parent index and subdomain space.

After setting up the parent-child mapping, the variables to be coupled are set up by calling pmci_set_array_pointer to associate the variables to be coupled with the temporary pointers as already mentioned above. After that step, pmc_p_setind_and_allocmem is called for each child of the current parent. It makes the preparations for the actual data transfer, which is made through the RMA window that is created here. These parent-side preparations are made largely in a similar fashion as the corresponding operations on the child side in pmc_c_setind_and_allocmem described above. As in pmc_c_setind_and_allocmem, there are two base arrays: base_array_pc for the parent-to-child transfer and base_array_cp for the child-to-parent transfer. pmc_p_setind_and_allocmem first determines the start indices myindex_s and the necessary size of base_array_pc. It stores the start indices in children(childid)%pes(i)%array_list(j)%sendindex according to the process and variable. Then it sends myindex_s to the child side using MPI_ALLTOALL over the intra-communicator children(childid)%intra_comm, which is the same communicator as me%intra_comm on the child side. The corresponding receive has already been discussed above in the description of pmci_setup_child. After this, the base array is allocated and the RMA window is created on children(childid)%intra_comm by calling MPI_WIN_CREATE. Finally, it sets the buffer pointer to the base array in children(childid)%pes(i)%array_list(j)%sendbuf, which is of type C_PTR. Note that base_array_pc and base_array_cp themselves are Fortran pointers. The part of pmc_p_setind_and_allocmem doing this is shown below. The three dots stand for five code lines containing a buffer-size check and error messaging, which are omitted here for clarity and conciseness.

    DO  i = 1, children(childid)%inter_npes
       ape => children(childid)%pes(i)
       DO  j = 1, ape%nr_arrays
          ar => ape%array_list(j)
          ar%sendbuf = C_LOC( base_array_pc(ar%sendindex) )
          ...
       ENDDO
    ENDDO

After this point, the corresponding preparations for the child-to-parent transfer are made in a similar fashion as the parent-to-child transfer preparations on the child side in pmc_c_setind_and_allocmem.

Child initialization#

Nested-domain solutions can be initialized by interpolating from an existing root solution. The first-level child initial conditions can be interpolated from the root solution (which is either from a precursor run or just the initial root setting). Then the second-level child initial conditions can be interpolated from their parents, which are first-level children, and so on until all children have obtained their initial conditions. There are two alternative methods for this initial-condition interpolation: three-dimensional initialization and homogeneous initialization, both described below.

In the pmc_interface_mod, there are two subroutines, pmci_child_initialize and pmci_parent_initialize, called from the main program before the beginning of the time integration. pmci_child_initialize is called first to make sure that models being both child and parent have first obtained their own initial condition before transferring data further to their own children.

In case of three-dimensional initialization, pmci_parent_initialize simply fills the buffers of the current parent by calling pmc_p_fillbuffer, and its children receive the data by calling pmc_c_getbuffer in pmci_child_initialize. The subroutines pmc_p_fillbuffer and pmc_c_getbuffer are described in section Data transfer. After this, pmci_child_initialize calls pmci_interp_all and exchange_horiz for all variables to be coupled. The subroutine pmci_interp_all employs zeroth-order interpolation, which simply means that the parent-grid value in the child's parent-grid array, for instance uc, is copied to all child-grid points inside the current parent-grid cell.
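
The essence of this zeroth-order copy is illustrated by the following sketch (not the PALM code; all names, kinds and array bounds are illustrative), using index mappings of the kind built in pmci_define_index_mapping.

    !-- Zeroth-order initialization interpolation: copy the value of the child's parent-grid
    !-- array fc to every child-grid point inside the corresponding parent-grid cell.
    SUBROUTINE interp_all_sketch( f, fc, kflo, kfuo, jflo, jfuo, iflo, ifuo,                    &
                                  kcb, kct, jps, jpn, ipl, ipr )
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       INTEGER, INTENT(IN) ::  kcb, kct, jps, jpn, ipl, ipr  !< parent-grid index bounds of the child-covered area
       INTEGER, DIMENSION(kcb:kct), INTENT(IN) ::  kflo, kfuo  !< k-direction index mapping
       INTEGER, DIMENSION(jps:jpn), INTENT(IN) ::  jflo, jfuo  !< j-direction index mapping
       INTEGER, DIMENSION(ipl:ipr), INTENT(IN) ::  iflo, ifuo  !< i-direction index mapping
       REAL(wp), DIMENSION(0:,0:,0:), INTENT(INOUT)    ::  f   !< child-grid array (k,j,i)
       REAL(wp), DIMENSION(kcb:,jps:,ipl:), INTENT(IN) ::  fc  !< child's parent-grid array (k,j,i)

       INTEGER ::  ip, jp, kp

       DO  ip = ipl, ipr
          DO  jp = jps, jpn
             DO  kp = kcb, kct
                f(kflo(kp):kfuo(kp),jflo(jp):jfuo(jp),iflo(ip):ifuo(ip)) = fc(kp,jp,ip)
             ENDDO
          ENDDO
       ENDDO
    END SUBROUTINE interp_all_sketch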

In case of homogeneous initialization, pmci_parent_initialize first calls pmci_send_domain_averaged_profiles in a loop for all children of the current parent. The subroutine pmci_send_domain_averaged_profiles calls pmci_compute_average to compute a horizontally averaged z-profile of the parent solution and then sends it to the child by calling pmc_send_to_child. The horizontal averaging spans only the horizontal area occupied by the current child. This process is repeated for all variables to be coupled, except w which is simply set to zero. In pmci_child_initialize, the received profiles are then interpolated to the child grid by vertical zeroth-order interpolation (pmci_interp_1d) and used as horizontally homogeneous initial conditions.
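
A minimal sketch of the profile computation (not the PALM code; all names are illustrative and the given index ranges are assumed to address the local part of the child-covered area) is shown below; sending the profile to the child with pmc_send_to_child is a separate step.

    !-- Compute a horizontally averaged z-profile of a parent variable over the area covered
    !-- by the child: local column sums are formed first and then combined over all parent
    !-- processes of the parent model communicator.
    SUBROUTINE average_profile_sketch( var, ips, ipe, jps, jpe, nr_columns_total, comm,         &
                                       profile )
       USE MPI
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       REAL(wp), DIMENSION(:,:,:), INTENT(IN) ::  var  !< parent variable (k,j,i), local subdomain
       INTEGER, INTENT(IN) ::  ips, ipe, jps, jpe      !< local part of the child-covered area
       INTEGER, INTENT(IN) ::  nr_columns_total        !< total number of columns in the child-covered area
       INTEGER, INTENT(IN) ::  comm                    !< parent model communicator
       REAL(wp), DIMENSION(SIZE( var, 1 )), INTENT(OUT) ::  profile  !< averaged z-profile

       INTEGER ::  i, ierr, j
       REAL(wp), DIMENSION(SIZE( var, 1 )) ::  local_sum

       local_sum = 0.0_wp
       DO  i = ips, ipe
          DO  j = jps, jpe
             local_sum(:) = local_sum(:) + var(:,j,i)
          ENDDO
       ENDDO
       CALL MPI_ALLREDUCE( local_sum, profile, SIZE( profile ), MPI_DOUBLE_PRECISION, MPI_SUM,  &
                           comm, ierr )
       profile = profile / REAL( nr_columns_total, KIND=wp )
    END SUBROUTINE average_profile_sketch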

Data transfer#

Upper-level data transfer subroutines. The upper-level data-transfer subroutines pmci_datatrans, pmci_child_datatrans, and pmci_parent_datatrans are found in the pmc_interface module, while the lower-level data-transfer subroutines are found in pmc_parent_mod and pmc_child_mod. The hierarchy of these subroutines, including the upper-level interpolation and anterpolation subroutines, is shown below.

pmci_datatrans
    pmci_parent_datatrans
        pmc_p_fillbuffer
        pmc_p_getbuffer
    pmci_child_datatrans
        pmc_c_getbuffer
        pmci_interpolation
        pmci_anterpolation
        pmc_c_putbuffer

The subroutine pmci_datatrans is called from the subroutine time_integration. It branches according to the control parameters nesting_mode and nesting_datatransfer_mode and calls pmci_parent_datatrans and pmci_child_datatrans in the order determined by these control parameters. Subroutine pmci_parent_datatrans calls the actual data-transfer routine pmc_p_fillbuffer or pmc_p_getbuffer depending on the transfer direction (parent_to_child or child_to_parent). After calling pmc_p_getbuffer, it ensures as a safety measure that the velocity components are zero inside buildings and under the terrain surface after the anterpolation. In the parent_to_child direction, subroutine pmci_child_datatrans first calls pmc_c_getbuffer and then pmci_interpolation. In the child_to_parent direction it first calls pmci_anterpolation and then pmc_c_putbuffer.

Parent send. The subroutine pmc_p_fillbuffer in module pmc_parent_mod takes care of sending data to the children. It is called separately for each child of the current parent and childid comes in as a parameter. Then loops over the processes (ip) and arrays (j) are run, and within them the transfer-buffer size is first determined and stored in buf_shape. In case of normal three-dimensional data of type REAL(wp), it is children(childid)%pes(ip)%nrele * children(childid)%pes(ip)%array_list(j)%a_dim(4). The latter factor is the vertical dimension of the transfer domain, not the whole vertical size of the parent domain. Next, the send buffer children(childid)%pes(ip)%array_list(j)%sendbuf, which is a C pointer, is associated with the Fortran pointer array buf by calling C_F_POINTER. Similarly, children(childid)%pes(ip)%array_list(j)%data is associated with data_3d. Then the send buffer is filled through buf. See the truncated source-code insert below.


    DO  ip = 1, children(childid)%inter_npes
        ape => children(childid)%pes(ip)
        DO  j = 1, ape%nr_arrays
            ar => ape%array_list(j)
            myindex = 1
            ...

            buf_shape(1) = ape%nrele*ar%a_dim(4)
            CALL C_F_POINTER( ar%sendbuf, buf, buf_shape )
            CALL C_F_POINTER( ar%data, data_3d, ar%a_dim(1:3) )
            ...
!
!--         Copy from PALM 3d-array (e.g. u,v,...) to sendbuf in RMA window.
            DO  ij = 1, ape%nrele
                buf(myindex:myindex+ar%a_dim(4)-1) =                                            &
                              data_3d(ar%ks:ar%ke+2,ape%locind(ij)%j,ape%locind(ij)%i)
                myindex = myindex + ar%a_dim(4)
            ENDDO
            ...
        ENDDO
    ENDDO

Note that the ij-loop runs over the horizontal dimension of the transfer domain (nrele), substituting the whole vertical stride of length ar%a_dim(4) into buf at each loop iteration. For those ip-loop iterations for which nrele = 0, nothing is substituted into buf and myindex is not incremented. After completing the loops shown above, the send buffer is completely filled and there is a call to MPI_BARRIER to make sure that the child side does not proceed to receive the data until the buffer is completely filled. There is a corresponding call to MPI_BARRIER on the child side at the beginning of pmc_c_getbuffer. There is also a second call to MPI_BARRIER right after the first one to let the parent know that the current child has completed its receive, so that the parent may go on to the next call to send to its next child, if any. The counterpart of this second MPI_BARRIER is at the end of pmc_c_getbuffer.

For setting buf_shape and for the C_F_POINTER calls, there are also corresponding branches for the transfer of two-dimensional real-valued data and of two-dimensional integer-valued data. These are not shown above but are replaced by three dots for clarity and conciseness.

The PMC also includes an option for using two-sided communication, but it is not recommended in normal use as it is slower than one-sided communication. However, one-sided communication causes problems with some MPI libraries, which can only be circumvented by setting use_one_sided_communication = .F.. If two-sided communication is activated, the data is substituted one variable (array) at a time into bufall(ip), which is an array of the locally defined type bufdef, and ip is the child target-process index as in the code inserts above. It is statically allocated for all children(childid)%inter_npes child processes involved, and its elements contain the transfer buffers buf for real-valued data and ibuf for integer-valued data; these buffers are dynamically re-allocated for each transfer of each variable. Here the order of the process and array loops is reversed, i.e. the array loop is the outer one and the process loop the inner one. Data for each child target process is substituted into buf (or ibuf) and sent separately to each receiving child process by calling pmc_send_to_child.

Child receive. The subroutine pmc_c_getbuffer in module pmc_child_mod takes care of receiving data from the parent through the RMA window, which is filled by pmc_p_fillbuffer as described above. It begins with a call to MPI_BARRIER, which has its counterpart in pmc_p_fillbuffer after the filling operations as described above, to make sure that the so-called access epoch is not started before the parent has completed its buffer filling. Then again loops over the processes (ip) and arrays (j) are run, and within them the transfer-buffer size is first determined and stored in nr, which in case of normal three-dimensional data of type REAL(wp) is me%pes(ip)%nrele * me%pes(ip)%array_list(j)%a_dim(1). Then the actual transfer buffer of type C_PTR is associated with the Fortran pointer buf by calling C_F_POINTER. After this, if nr > 0, the subroutine starts the epoch by calling MPI_WIN_LOCK, gets the data into buf by calling MPI_GET, and ends the epoch by calling MPI_WIN_UNLOCK. This method of synchronization is known as passive target synchronization. Next, the data is substituted into the three-dimensional array data_3d. See the truncated source-code insert below.


    DO  ip = 1, me%inter_npes
        ape => me%pes(ip)
        DO  j = 1, ape%nr_arrays
            ar => ape%array_list(j)
            ...
            myindex = 1
            ...
            CALL C_F_POINTER( ar%data, data_3d, ar%a_dim(1:3) )
            DO  ij = 1, ape%nrele
                data_3d(:,ape%locind(ij)%j,ape%locind(ij)%i) = buf(myindex:myindex+ar%a_dim(1)-1)
                myindex = myindex+ar%a_dim(1)
            ENDDO
        ...
        ENDDO
    ENDDO

Note that the ij-loop runs over the horizontal dimension of the transfer domain (nrele), substituting the whole vertical stride of length ar%a_dim(1) at each iteration. This completes the array and process loops and the whole receive sequence, and finally MPI_BARRIER is called again to inform the parent side that the receive is completed. There is a corresponding call to MPI_BARRIER on the parent side.
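
The passive-target epoch itself can be summarized by the following sketch (not the PMC code; the routine and variable names are illustrative).

    !-- Passive target synchronization: open an access epoch to the parent-side window with
    !-- MPI_WIN_LOCK, fetch the data with MPI_GET and close the epoch with MPI_WIN_UNLOCK.
    SUBROUTINE get_from_window_sketch( win, target_rank, target_disp, nr, buf )
       USE MPI
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       INTEGER, INTENT(IN) ::  win          !< RMA window created on the intra-communicator
       INTEGER, INTENT(IN) ::  target_rank  !< rank of the parent process within the intra-communicator
       INTEGER(KIND=MPI_ADDRESS_KIND), INTENT(IN) ::  target_disp  !< start index (recvindex) in the window
       INTEGER, INTENT(IN) ::  nr           !< number of elements to fetch
       REAL(wp), DIMENSION(nr), INTENT(OUT) ::  buf  !< local receive buffer

       INTEGER ::  ierr

       CALL MPI_WIN_LOCK( MPI_LOCK_SHARED, target_rank, 0, win, ierr )
       CALL MPI_GET( buf, nr, MPI_DOUBLE_PRECISION, target_rank, target_disp, nr,               &
                     MPI_DOUBLE_PRECISION, win, ierr )
       CALL MPI_WIN_UNLOCK( target_rank, win, ierr )
    END SUBROUTINE get_from_window_sketch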

For the optional two-sided communication mode, pmc_c_getbuffer has a receive buffer bufall similar to that on the parent side, statically allocated for all involved me%inter_npes parent processes. Its receive buffers buf and ibuf are dynamically re-allocated for each transfer of each variable. Naturally, the array and process loops run in the same order as on the parent side, i.e. the array loop is the outer one. Data is received into bufall(ip)%buf or bufall(ip)%ibuf by calling pmc_recv_from_parent. All the data for the current variable is received into bufall within this process loop. After this, but within the same array loop, there is a second process loop to substitute the data from the buffer into the three-dimensional array. This procedure is relatively similar to that of the one-sided communication branch shown above.

In the child-to-parent transfer, pmc_c_putbuffer sends data from the child to the parent. It works very much the same way as pmc_p_fillbuffer and can be understood by studying how pmc_p_fillbuffer works. Therefore pmc_c_putbuffer is not explained further here.

Also, pmc_p_getbuffer works similarly to pmc_c_getbuffer and can be almost completely understood by understanding pmc_c_getbuffer. However, there is one exception worth explaining here. The range of child processes in scope can be limited to one by the optional argument child_process_nr. If this optional argument is present, the process loop is restricted to only this particular process indicated by child_process_nr. If it is not present, the process loops run normally from 1 to children(childid)%inter_npes. The reason for this arrangement lies in the particle coupling; it is never used in the coupling of the Eulerian field variables.

Interpolation of child's boundary conditions#

As soon as a child has received all the coupled variables into its child's parent-grid arrays, the interpolation of the child's boundary conditions on its nested boundaries starts, as already explained in section Data transfer. In the pmc_interface_mod, there is an intermediate-level calling routine pmci_interpolation for calling the actual interpolation routines. There are three interpolation subroutines: pmci_interp_lr for handling the left and right boundaries, pmci_interp_sn for handling the south and north boundaries, and pmci_interp_bt for handling the bottom and top boundaries. Originally, it was assumed that all child domains sit on the bottom boundary of the root domain and it was not possible to set the bottom boundary as nested, but an elevated-nest feature was implemented later and the bottom-boundary handling was added to pmci_interp_t, which was then renamed pmci_interp_bt. The calling routine pmci_interpolation first calls pmci_interp_lr for all necessary variables if nesting_bounds is not set to 'vertical_only' or 'cyclic_along_x'; it is called first to set the boundary conditions on the left boundary and then on the right boundary. Then, a similar sequence of calls to pmci_interp_sn is conducted to set the boundary conditions first on the south boundary and then on the north boundary if nesting_bounds is not set to 'vertical_only' or 'cyclic_along_y'. Finally, the boundary conditions are set on the top and (if the nested domain is elevated) bottom boundaries by calling pmci_interp_bt.

The actual interpolation routines, for instance pmci_interp_lr, start by setting the necessary indices, which depend on the variable in question (staggered or non-staggered as projected on the boundary plane) and on the boundary itself, in this case left or right.

Next, a so-called boundary work array is updated. Such a boundary work array, in this case work_arr_lr, which extends by one parent-grid cell past the internal subdomain boundaries, is needed because the child's parent-grid arrays, here designated as parent_array, must not reach beyond any internal boundaries; otherwise the anterpolation would be corrupted around the internal boundaries. The work arrays are first filled from parent_array and then their ghost nodes are exchanged by calling MPI_SENDRECV. These work arrays, work_arr_lr, work_arr_sn, and work_arr_bt, are allocated by pmci_allocate_workarrays, and the exchange data types facilitating their swift exchange are defined by pmci_create_workarray_exchange_datatypes. These two subroutines are called from pmci_define_index_mapping, which in turn is called by pmci_setup_child in the initial preparation phase.
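
A minimal sketch of such a ghost-node exchange (not the PALM code, which uses the derived MPI data types created in pmci_create_workarray_exchange_datatypes; all names here are illustrative) is given below for one direction.

    !-- Exchange one ghost column on each side of a 2-D work array with the left and right
    !-- neighbours using MPI_SENDRECV; column 1 and column nj are the ghost columns.
    SUBROUTINE exchange_ghost_columns_sketch( work, comm, left, right )
       USE MPI
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       REAL(wp), DIMENSION(:,:), INTENT(INOUT) ::  work  !< work array (k,j) with one ghost column per side
       INTEGER, INTENT(IN) ::  comm                      !< child model communicator
       INTEGER, INTENT(IN) ::  left, right               !< neighbour ranks (MPI_PROC_NULL at outer boundaries)

       INTEGER ::  ierr, nj, nk
       INTEGER, DIMENSION(MPI_STATUS_SIZE) ::  status

       nk = SIZE( work, 1 )
       nj = SIZE( work, 2 )
    !
    !-- Send the last inner column to the right neighbour, receive the left ghost column.
       CALL MPI_SENDRECV( work(:,nj-1), nk, MPI_DOUBLE_PRECISION, right, 1,                     &
                          work(:,1),    nk, MPI_DOUBLE_PRECISION, left,  1,                     &
                          comm, status, ierr )
    !
    !-- Send the first inner column to the left neighbour, receive the right ghost column.
       CALL MPI_SENDRECV( work(:,2),  nk, MPI_DOUBLE_PRECISION, left,  2,                       &
                          work(:,nj), nk, MPI_DOUBLE_PRECISION, right, 2,                       &
                          comm, status, ierr )
    END SUBROUTINE exchange_ghost_columns_sketch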

Then the actual interpolation operations follow, with separate branches for the velocity components and for scalar variables. All the algorithms employed here are introduced and explained in detail by Hellsten et al. (2021) and are therefore not repeated here.

Anterpolation#

Anterpolation means using the child solution to replace the parent solution within the subvolume occupied by the child domain (or a subset of it). The child solution must first be restricted to the parent grid. This is done by integrating the child solution such that the velocity components are anterpolated onto two-dimensional faces while scalars are anterpolated into three-dimensional volumes (grid cells), see (Clark and Hall, 1991) and (Hellsten et al., 2021). The resulting fields are stored in the child's parent-grid arrays. This operation is done by subroutine pmci_anterp_var as described below. After this phase is completed, pmci_child_datatrans calls pmc_c_putbuffer and the parent receives all the anterpolated data directly into its variables by calling pmc_p_getbuffer.

There is a calling routine pmci_anterpolation called from pmci_child_datatrans. This calling routine calls the actual anterpolation routine pmci_anterp_var for all necessary variables with the index-mapping arrays passed as arguments. Note that the index mappings are different for the velocity components (staggered) and the scalar variables (non-staggered). The numbers of child-grid faces for each velocity component and of child-grid cells for scalars are precomputed for all parent-grid nodes, stored in ijkfc, and passed here as an argument. The anterpolation is just an integration (summation) over the child-grid nodes on the parent-grid face for velocity components, or in the parent-grid cell for scalars. The result is stored in the corresponding child's parent-grid array.
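
The principle is illustrated for a scalar variable by the following sketch (not the PALM code; all names, kinds and array bounds are illustrative), using index mappings of the kind built in pmci_define_index_mapping and the precomputed cell counts ijkfc.

    !-- Anterpolation of a scalar: sum the child-grid values inside each parent-grid cell,
    !-- divide by the precomputed number of contributing child-grid cells and store the
    !-- result in the child's parent-grid array fc. Velocity components are treated
    !-- analogously but on parent-grid faces.
    SUBROUTINE anterp_scalar_sketch( f, fc, ijkfc, kflo, kfuo, jflo, jfuo, iflo, ifuo,          &
                                     kcb, kct, jps, jpn, ipl, ipr )
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       INTEGER, INTENT(IN) ::  kcb, kct, jps, jpn, ipl, ipr  !< parent-grid bounds of the anterpolation region
       INTEGER, DIMENSION(kcb:kct), INTENT(IN) ::  kflo, kfuo  !< k-direction index mapping
       INTEGER, DIMENSION(jps:jpn), INTENT(IN) ::  jflo, jfuo  !< j-direction index mapping
       INTEGER, DIMENSION(ipl:ipr), INTENT(IN) ::  iflo, ifuo  !< i-direction index mapping
       INTEGER, DIMENSION(kcb:,jps:,ipl:), INTENT(IN) ::  ijkfc  !< number of child-grid cells per parent-grid cell
       REAL(wp), DIMENSION(0:,0:,0:), INTENT(IN)        ::  f    !< child-grid array (k,j,i)
       REAL(wp), DIMENSION(kcb:,jps:,ipl:), INTENT(OUT) ::  fc   !< child's parent-grid array (k,j,i)

       INTEGER  ::  i, ip, j, jp, k, kp
       REAL(wp) ::  cellsum

       DO  ip = ipl, ipr
          DO  jp = jps, jpn
             DO  kp = kcb, kct
                cellsum = 0.0_wp
                DO  i = iflo(ip), ifuo(ip)
                   DO  j = jflo(jp), jfuo(jp)
                      DO  k = kflo(kp), kfuo(kp)
                         cellsum = cellsum + f(k,j,i)
                      ENDDO
                   ENDDO
                ENDDO
                fc(kp,jp,ip) = cellsum / REAL( ijkfc(kp,jp,ip), KIND=wp )
             ENDDO
          ENDDO
       ENDDO
    END SUBROUTINE anterp_scalar_sketch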

In practice, the anterpolation cannot span the whole child domain. Narrow buffer zones are defined just inside the nested boundaries of the child domain. The default width of these buffers is two parent-grid spacings. This can be increased by setting the parameter anterpolation_buffer_width in the &nesting_parameters namelist to a value higher than two. The reason for these buffers is that the nodes directly involved in the interpolation should not be anterpolated, because this would typically lead to an unstable feedback loop and a blow-up of the solution.

In some cases, anterpolation may produce unphysical features in the solution. For example, if a canopy drag (plant or building canopy) becomes markedly different in the child and the parent due to the better resolution of the child, the mean-flow velocities in the parent may increase or decrease in the area occupied by the child. This usually induces unphysical secondary flow phenomena in the parent solution. To avoid such problems, the anterpolation can be switched off in the lowest part of the child domain by setting the parameter anterpolation_starting_height in the &nesting_parameters namelist to a suitable value. Note that this parameter is a grid index, not a height in metres, and it is relative to the local terrain surface. Note also that it influences all the domains in the same way, i.e. it cannot be specified separately for different domains. This feature has been tested and studied by Hellsten et al. (2021).

Mass conservation correction#

The subroutine time_integration calls the pmc_interface_mod subroutine pmci_ensure_nest_mass_conservation right before the pressure-correction step to make sure that the overall mass conservation is not violated in any of the child domains. This is carried out by first integrating the mass fluxes through all the nested boundaries (including the bottom boundary in case of an elevated child) and summing them over the domain to find the possible mass imbalance. The correction velocity is then defined by dividing the imbalance by the total area of the nested boundaries. These areas are precalculated within the initial preparations. The correction velocity is then added on the nested boundaries with the correct sign on each boundary.
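
The correction principle can be summarized by the following sketch (not the PALM code; all names are illustrative, only the left and right boundaries are shown, and the globally summed imbalance is assumed to be already available).

    !-- Mass-conservation correction principle: divide the net mass imbalance integrated over
    !-- all nested boundaries by the total nested-boundary area and add the resulting
    !-- correction velocity to the boundary-normal velocity components with the correct sign
    !-- (outflow reduced where the outward normal points in +x, inflow increased where it
    !-- points in -x). The remaining boundaries are treated analogously.
    SUBROUTINE mass_conservation_sketch( net_outflow, face_area_total, u_right, u_left )
       IMPLICIT NONE
       INTEGER, PARAMETER ::  wp = KIND( 1.0D0 )

       REAL(wp), INTENT(IN) ::  net_outflow      !< integrated outflow minus inflow over all nested boundaries
       REAL(wp), INTENT(IN) ::  face_area_total  !< total area of all nested boundaries
       REAL(wp), DIMENSION(:,:), INTENT(INOUT) ::  u_right  !< u on the right boundary plane (k,j)
       REAL(wp), DIMENSION(:,:), INTENT(INOUT) ::  u_left   !< u on the left  boundary plane (k,j)

       REAL(wp) ::  u_corr

       u_corr  = net_outflow / face_area_total
       u_right = u_right - u_corr
       u_left  = u_left  + u_corr
    END SUBROUTINE mass_conservation_sketch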

PMC finalization#

At the end of a nested run, the main program calls pmci_finalize, which calls pmc_p_finalize and pmc_c_finalize. These subroutines call MPI_FREE_MEM to release the memory allocated for the base arrays of the parent and the child, respectively. Then they call MPI_WIN_FREE to free the RMA windows.

Instructions for adding a new variable to the coupling#

If a new variable needs to be added to the set of variables to be coupled, or an existing one removed, intervention is needed in three parts of the code. These are:

  • in pmci_setup_child
  • in pmci_num_arrays
  • within the child initialization routines pmci_child_initialize and pmci_send_domain_averaged_profiles

First, in the part of pmci_setup_child where pmc_c_set_dataarray_name is called for each coupled variable, one has to add (or remove) CALL pmc_c_set_dataarray_name('varname'), where 'varname' is the name of the variable in question.

In pmci_num_arrays one must ensure that pmc_max_array is set equal to the actual number of variables to be coupled, including the added new variables, as pmc_max_array is used to allocate me%pes(i)%array_list() and children(childid)%pes(j)%array_list().

Within pmci_child_initialize, one has to add (or remove) the corresponding call to pmci_interp_all for three-dimensional initialization. For homogeneous initialization, one has to add (or remove) the corresponding calls to pmci_interp_1d in pmci_child_initialize and to pmci_compute_average and pmc_send_to_child in pmci_send_domain_averaged_profiles.

References#

Clark, T. and Hall, W. (1991): Multi-domain simulations of the time dependent Navier-Stokes equations: benchmark error analysis of some nesting procedures, J. Comput. Phys., 92, 456–481.

Hellsten, A., Ketelsen, K., Suehring, M., Auvinen, M., Maronga, B., Knigge, C., Barmpas, F., Tsegas, G., Moussiopoulos, N., and Raasch, S. (2021): A Nested Multi-Scale System Implemented in the Large-Eddy Simulation Model PALM model system 6.0, Geosci. Model Dev., 14, 3185–3214, https://doi.org/10.5194/gmd-14-3185-2021.