Creating Merger Forests

Merger tree forests are created from the HACC treenode files, cf. [Rangel2020]. The code rearranges the snapshot-by-snapshot catalog format into a convenient and efficient tree format. Each processed tree file is self-contained, i.e. no tree is split among multiple files. See Merger Forest Layout for more information about the final product.

The MPI-enabled Python script in haccytrees/scripts/treenodes2forest.py is the main executable; it sets up the configuration and starts the conversion routine contained in haccytrees.mergertrees.assemble.catalog2tree(). The settings have to be provided to the script through a configuration file; see Configuration for an example.

Installing the haccytrees Python module (see Installation) will add haccytrees-convert (a wrapper of treenodes2forest.py) to your PATH. The code can then be executed as follows:

CONFIG=treenodes2forest.cfg
mpirun -n <NPROC> haccytrees-convert ${CONFIG}

Note

Alternatively, you can run mpirun -n <NPROC> python <PATH_TO_treenodes2forest.py> ${CONFIG}

Note

All treenode catalog files need to be located in a single folder. If MergerTrees_updated is being used, this folder usually does not contain the first snapshot. One strategy is to symlink all catalogs in MergerTrees_updated as well as the first snapshot from the MergerTrees folder into a new folder.
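The symlink strategy mentioned above can be sketched as follows. The directory names and step numbers here are hypothetical; the `mkdir`/`touch` lines only simulate an existing catalog layout for demonstration:

```shell
# Simulate an existing catalog layout (hypothetical step numbers and paths):
mkdir -p MergerTrees MergerTrees_updated treenodes
touch MergerTrees/m000p-43.treenodes
touch MergerTrees_updated/m000p-44.treenodes MergerTrees_updated/m000p-45.treenodes

# Link the first snapshot from the original MergerTrees folder ...
ln -sf "$PWD/MergerTrees/m000p-43.treenodes" treenodes/
# ... and all later snapshots from MergerTrees_updated into the same folder.
ln -sf "$PWD"/MergerTrees_updated/*.treenodes treenodes/
```

The new `treenodes` folder then contains all snapshots and can be used as `treenode_base` in the configuration.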

Routine Outline

The assembling of the merger forests proceeds in three main steps:

  • distributing the halo catalogs among the MPI ranks such that each individual tree is fully contained on a rank.

  • finding the correct ordering of the halos, corresponding to the layout described in Merger Forest Layout.

  • writing the forests to HDF5 files.

Distributing the data

The rank to which a tree is assigned is determined by the position of the root halo at the last step, i.e. \(z=0\). The partitioning of the simulation volume is determined by the mpipartition.Partition class, using MPI’s 3D Cartesian communicator.

We start by distributing the halos in the final snapshot, using the abstracted distribution function mpipartition.distribute(). We then iterate backwards over the snapshots. The halos in each snapshot are first distributed by their position. Afterwards, halos that may have crossed rank boundaries are accounted for by marking all halos that don’t have a local descendant halo. Those halos are then communicated with the 26 directly neighboring ranks, using an MPI graph communicator that symmetrically connects each rank to its 26 neighbors. If there are still unaccounted halos left, those are assigned using an all2all exchange. This exchange functionality is implemented in mpipartition.exchange().
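The orphan-marking step can be illustrated with plain NumPy. The arrays below are toy data, not haccytrees' actual data structures: after distributing a snapshot by position, any halo whose descendant index is not among the locally held descendants has crossed a rank boundary and must be exchanged:

```python
import numpy as np

# Toy data (not haccytrees' actual arrays): tree node indices of the
# descendant halos currently held on this rank ...
local_desc_index = np.array([10, 11, 12])
# ... and the descendant index recorded for each halo of the earlier snapshot.
desc_node_index = np.array([10, 12, 99, 11])

# Halos whose descendant is missing locally are "orphans" and have to be
# communicated to neighboring ranks (or all2all as a fallback).
is_orphan = ~np.isin(desc_node_index, local_desc_index)
print(is_orphan)  # only the halo pointing at descendant 99 is an orphan
```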

At each step, we also take note of the descendant halo array index in the previous step. This information then simplifies the next step, the reordering of the halos to form a self-contained tree.

Finding the halo ordering

After the reading and distributing phase, each rank now contains all the data it needs to generate its own self-contained forest. From the descendant index stored during the previous phase we can then determine where in the final array each halo has to go in order to obtain the required layout.

In a first step, we calculate the size of each subtree, starting from the earliest snapshot. Halos that are leaves of the tree have size 1 by definition. We can then iteratively add each halo’s size to its descendant in the next snapshot. After processing the latest snapshot, we know the size of each self-contained tree.
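This accumulation can be sketched in a few lines of NumPy. The snapshot layout and descendant links below are hypothetical and heavily simplified (one size array per snapshot, earliest first):

```python
import numpy as np

# Hypothetical example: 3 snapshots with 4, 2, and 1 halos (earliest first).
# desc_idx[s][i] is the array index of halo i's descendant in snapshot s+1,
# as recorded during the distribution phase.
n_halos = [4, 2, 1]
desc_idx = [np.array([0, 0, 1, 1]), np.array([0, 0])]

# Every halo starts with size 1 (leaves keep size 1); iterating from the
# earliest to the latest snapshot, each halo's accumulated subtree size is
# added to its descendant.
sizes = [np.ones(n, dtype=np.int64) for n in n_halos]
for s in range(len(n_halos) - 1):
    np.add.at(sizes[s + 1], desc_idx[s], sizes[s])

print(sizes[-1])  # root size = total number of halos in this tree
```

Note the use of `np.add.at`, which accumulates correctly even when several progenitors point at the same descendant index.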

Knowing the size of each subtree, we can then determine each halo’s position, starting from the latest snapshot. Each root halo in the list is offset from the previous root halo by the size of that halo’s subtree. In the earlier snapshots, each halo is positioned at its descendant halo’s position plus the subtree sizes of the halos that share the same descendant and come earlier in the halo list. Because the halos in each snapshot were previously ordered by decreasing mass, halos that share a descendant halo are automatically ordered the same way.

After this step, we know at which array position every halo in the snapshot-ordered catalog has to go during the next phase.
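The offset bookkeeping can be sketched as below. This is a simplified toy calculation, not haccytrees' actual code; in particular, the `+ 1` assumes that the first (most massive) progenitor is placed directly after its descendant in the depth-first layout:

```python
import numpy as np

# Roots: each root is offset from the previous root by that root's subtree
# size (hypothetical sizes from the previous step).
root_sizes = np.array([7, 3])
root_offsets = np.concatenate(([0], np.cumsum(root_sizes)[:-1]))

# Progenitors sharing one descendant (at position 0 here): the first one
# follows the descendant directly (assumed "+ 1"), later ones are shifted by
# the subtree sizes of the earlier, more massive siblings.
desc_position = 0
prog_sizes = np.array([4, 2])  # mass-ordered, so sibling order is preserved
prog_offsets = desc_position + 1 + np.concatenate(([0], np.cumsum(prog_sizes)[:-1]))
print(root_offsets, prog_offsets)
```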

Writing the data

In order to minimize the memory requirements, all the rank-local data of each snapshot is stored into temporary containers (see Temporary Storage). The previous step, for example, only requires the descendant index to be in memory. During the writing step, we now read the temporary files field-by-field, reorder the data according to the previously determined halo ordering, and store that field into an HDF5 dataset. By iterating over the fields individually, we only need to keep one array in memory at a time. For the full Last Journey dataset for example, 32 nodes were more than sufficient to process the forests.
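The reorder-and-write loop can be illustrated with h5py. This sketch uses an in-memory dictionary in place of the temporary files, and the field values, group name, and output filename are purely illustrative:

```python
import numpy as np
import h5py

# order[i] = target forest position of halo i in the snapshot-ordered catalog
# (toy values; in haccytrees this comes from the ordering phase).
order = np.array([2, 0, 3, 1])
fields = {
    "tree_node_mass": np.array([4.0, 3.0, 2.0, 1.0]),
    "fof_halo_count": np.array([400, 300, 200, 100]),
}

with h5py.File("forest_example.hdf5", "w") as f:
    g = f.create_group("forest")
    for name, data in fields.items():
        out = np.empty_like(data)
        out[order] = data  # scatter each halo to its forest position
        g.create_dataset(name, data=out)
        # only one field's array is alive at a time, keeping memory low
```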

Configuration

The configuration file is in the ini style (cf. configparser) containing information about the simulation, the data location, the field names in the treenode catalogs and the forest output, as well as some switches to optimize the routine. See the definition of the parameters in the following example configuration:

Example configuration for Last Journey
################################################################################
# This is an example configuration for the Last Journey simulation             #
#                                                                              #
# The required sections are:                                                   #
# - simulation: name of the simulation, data location                          #
# - columns:    the naming of the columns in the treenode catalogs             #
# - output:     the output file and the columns to store / calculate           #
# - algorithm:  some additional switches, failure triggers, verbosity          #
################################################################################

[simulation]
# the simulation is used to determine cosmotools steps, scale factors, etc.
# has to be supported by haccytrees/simulations.py
simulation         = LastJourney
# the base-path of the treenode catalogs. The files are assumed to be named e.g.
# ./treenodes/m000p-499.treenodes
treenode_base      = ./treenodes/m000p-
# alternatively, you can specify it like this: (useful for galaxymergertrees!)
# treenode_base     = ./treenodes/m000p-#.treenodes


[columns]
# how the fields in the treenode catalogs are named – can change from simulation
# to simulation, e.g. fof_halo_mean vs fof_halo_com.
fof_halo_center_x = fof_halo_center_x
fof_halo_center_y = fof_halo_center_y
fof_halo_center_z = fof_halo_center_z
fof_halo_com_x    = fof_halo_mean_x
fof_halo_com_y    = fof_halo_mean_y
fof_halo_com_z    = fof_halo_mean_z
sod_halo_center_x = sod_halo_min_pot_x
sod_halo_center_y = sod_halo_min_pot_y
sod_halo_center_z = sod_halo_min_pot_z
sod_halo_com_x    = sod_halo_mean_x
sod_halo_com_y    = sod_halo_mean_y
sod_halo_com_z    = sod_halo_mean_z
sod_halo_mass     = sod_halo_mass
sod_halo_radius   = sod_halo_radius
tree_node_index   = tree_node_index
desc_node_index   = desc_node_index
tree_node_mass    = tree_node_mass

# This determines which field is used to distribute the halos among the ranks.
# Has to be available for all halos, e.g. not SOD positions.
node_position_x   = fof_halo_center_x
node_position_y   = fof_halo_center_y
node_position_z   = fof_halo_center_z

[output]
# The base of the output files. If `split_output` is enabled, files will be
# named e.g. `m000p.forest.001.hdf5` for rank 0.
output_base    = m000p.forest
# If false, haccytrees will store the forest trees in a single file. Not
# suitable for large simulations.
split_output   = yes
# The base path where temporary files will be stored. If this setting is
# commented out, all the data will be kept in memory instead.
temporary_path = tmp/m000p.temp
# The fields that will be copied to the final output. If a list of two names
# are given, the treenode data given by the first one will be stored under the
# second one (renaming).
copy_fields    = [
    "tree_node_index",
    "desc_node_index",
    "tree_node_mass",
    "fof_halo_tag",
    "fof_halo_count",
    "fof_halo_mass",
    "fof_halo_center_x",
    "fof_halo_center_y",
    "fof_halo_center_z",
    "sod_halo_count",
    "sod_halo_mass",
    "sod_halo_radius",
    "sod_halo_cdelta",
    "sod_halo_cdelta_error",
    ["sod_halo_c_acc_mass", "sod_halo_cdelta_accum"],
    ["sod_halo_c_peak_mass", "sod_halo_cdelta_peak"]
    ]
# Additional fields that are derived from other fields. See
# haccytrees/mergertrees/assemble/derived_fields.py for the supported options.
derived_fields  = ["xoff_fof", "xoff_sod", "xoff_com"]

[algorithm]
# If no, the exchange of orphaned halos is amongst the 26 neighbors first, and
# only the remaining orphans are distributed all-to-all. If yes, the orphans are
# directly exchanged amongst all ranks (slower and more memory intensive).
do_all2all_exchange    = no
# If yes, the code will abort if an orphan cannot be assigned to a rank. If no,
# the orphan will be treated as a new root, and the code continues
fail_on_desc_not_found = yes
# A waittime in seconds until additional communicator (3D cart and 26-neighbor
# graph) are constructed. Can mitigate some `PG not found` MPI errors that occur
# on cooley without waittime (may be due to some bug in mpi4py)
mpi_waittime           = 5
# Increasing verbosity. Currently the max is 2, printing distribute and exchange
# information from each rank.
verbose                = 0
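Such an ini file can be read with the standard configparser module. This is an assumed parsing approach for illustration, not necessarily haccytrees' internal reader; note that the list-valued `copy_fields` entry happens to be valid JSON and can be decoded with json.loads:

```python
import configparser
import json

# A minimal excerpt of the configuration above, parsed the standard way.
cfg = configparser.ConfigParser()
cfg.read_string("""
[output]
output_base  = m000p.forest
split_output = yes
copy_fields  = ["tree_node_mass", ["sod_halo_c_acc_mass", "sod_halo_cdelta_accum"]]
""")

output_base = cfg["output"]["output_base"]
split_output = cfg["output"].getboolean("split_output")  # "yes" -> True
copy_fields = json.loads(cfg["output"]["copy_fields"])   # list incl. renames
print(output_base, split_output, copy_fields)
```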

Example: Last Journey

Creating the Last Journey merger forest took 3.2 hours on cooley, using 32 nodes with 12 MPI ranks each (384 ranks in total). The majority of the time was spent reading the treenode files (1.3 hours) and doing temporary IO (1 hour) in order to keep the memory consumption of each rank low. A much smaller fraction of the total time, about 40 minutes, was spent in MPI communication (distribute and exchange). Calculating the subtree sizes and the data ordering took an insignificant amount of time.

Figure (../_images/LJ_timing_32nodes.svg): Processing time of Last Journey, split by task

Galaxy Merger Trees

The code also works for galaxy merger trees. See haccytrees/scripts/treenodes2forest.galaxymergertree.example.cfg for an example configuration file.

References

haccytrees.mergertrees.assemble.catalog2tree(simulation, treenode_base, fields_config, output_file, *, temporary_path=None, do_all2all_exchange=False, split_output=False, fail_on_desc_not_found=True, mpi_waittime=0, logger=None, verbose=False)[source]

The main routine that converts treenode catalogs to HDF5 merger forests.

The algorithm proceeds in three phases: distributing the halo catalogs among the MPI ranks such that each tree is fully contained on one rank, determining the halo ordering required by the forest layout, and writing the reordered data to HDF5 files (see Routine Outline).

Parameters:
  • simulation (Simulation) – a Simulation instance containing the cosmotools steps

  • treenode_base (str) – the base path for the treenode files. If treenode_base contains #, # will be replaced by the current step number; otherwise, the path will be completed by appending [step].treenodes.

  • fields_config (FieldsConfig) – a FieldsConfig instance, containing the treenode field names

  • output_file (str) – base path for the output file(s)

  • temporary_path (str | None) – base path for temporary files. Note: folders must exist.

  • do_all2all_exchange (bool) – if False, will exchange among neighboring ranks first and then all2all. If True, will do all2all directly

  • split_output (bool) – if True, forests will be stored in multiple HDF5 files (one per rank). If False, all data will be combined in a single file (might not be feasible for large simulations)

  • fail_on_desc_not_found (bool) – if True, will abort if a descendant halo cannot be found among all ranks. If False, the orphaned halo will become the root of the subtree.

  • mpi_waittime (float) – time in seconds for which the code will wait for the MPI to be initialized. Can help with some MPI errors (on cooley)

  • logger (Callable[[str], None] | None) – a logging function, e.g. print

  • verbose (bool | int) – verbosity level, either 0, 1, or 2

Return type:

None
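The treenode_base completion described above can be sketched as follows. The helper name `treenode_path` is hypothetical, written only to illustrate the documented behavior:

```python
# Hypothetical helper illustrating the documented treenode_base completion:
# a '#' placeholder is replaced by the step number; otherwise the path is
# completed by appending "<step>.treenodes".
def treenode_path(treenode_base: str, step: int) -> str:
    if "#" in treenode_base:
        return treenode_base.replace("#", str(step))
    return f"{treenode_base}{step}.treenodes"

print(treenode_path("./treenodes/m000p-", 499))
print(treenode_path("./treenodes/m000p-#.treenodes", 499))
```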

[Rangel2020]

Rangel et al. (2020) arXiv:2008.08519