Segmented Runs

The segmented runs feature enables a single run description YAML file to be used to launch a sequence of linked runs for a date range (e.g. a year) that is longer than can be robustly executed in a single run. It is available in Version 19.1 onward.

The segmented runs feature is activated by including a segmented run section in the run description YAML file. This section describes the segmented run section. Please see NEMO-3.6 Run Description File for other details of run description files.

Warning

There are some restrictions imposed on the contents of run description YAML files when the segmented runs feature is in use. Please see Segmented Runs YAML File Restrictions below for details.

An example of a segmented run section is:

segmented run:
  start date: 2016-04-30
  start time step: 2730241
  end date: 2016-12-31
  days per segment: 30
  first segment number: 0
  segment walltime: 12:00:00
  namelists:
    namrun: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.time
    namdom: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.domain

Note

The presence of a segmented run section in a run description YAML file is what enables the feature. If you have a YAML file that contains a segmented run section but you want to execute the run without segmentation, please be sure to delete or comment out the segmented run section.

All key-value pairs in the segmented run section are required; salishsea run will raise an error if any are missing.

start date

The start date for the first run in the sequence of linked runs, formatted as YYYY-MM-DD. In the example above, the sequence of runs will start on 2016-04-30.

start time step

The time step number on which to start the first run in the sequence, formatted as an integer. If you are initializing the segmented run from restart file(s), the start time step value is the time step number of the restart file(s) given in the restart Section plus 1. In the example above, the run will start with time step 2730241.

end date

The end date for the sequence of linked runs, formatted as YYYY-MM-DD. In the example above, the sequence of runs will end on 2016-12-31.

days per segment

The number of days to use for each segment of the sequence of runs, formatted as an integer. In the example above, the run segments will be 30 days long. The length of the final segment in the sequence is adjusted to be the appropriate number of days required to bring the sequence to an end on end date; i.e. it is not necessary for the value of days per segment to divide evenly into the span of start date to end date.
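The way the final segment is shortened can be illustrated with a minimal sketch. It is not the salishsea run implementation; the function name segment_date_ranges is hypothetical, and it assumes that end date is the last simulated day (inclusive).

```python
from datetime import date, timedelta


def segment_date_ranges(start_date, end_date, days_per_segment):
    """Return a list of (first_day, last_day) tuples, one per segment.

    Assumption: end_date is the last simulated day (inclusive).
    """
    segments = []
    first_day = start_date
    while first_day <= end_date:
        # A full-length segment, clipped so that it never runs past end_date
        last_day = min(first_day + timedelta(days=days_per_segment - 1), end_date)
        segments.append((first_day, last_day))
        first_day = last_day + timedelta(days=1)
    return segments


# Values from the example segmented run section above
segments = segment_date_ranges(date(2016, 4, 30), date(2016, 12, 31), 30)
for index, (first_day, last_day) in enumerate(segments):
    n_days = (last_day - first_day).days + 1
    print(index, first_day, last_day, n_days)
```

Under that assumption the example values give eight 30-day segments followed by a final segment of 6 days.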

first segment number

The 0-based index number of the first segment in the sequence to run. This value is normally 0. A non-zero value is used if you are restarting a segmented run after recovering from a failure of one of the run segments. Please see Restarting After a Segment Failure for details of how to restart a segmented run after failures such as running out of walltime.

segment walltime

The wall-clock time to request for each segment of the run, formatted as HH:MM:SS. The value of walltime in the Basic Run Configuration section of the run description YAML file is ignored. In the example above, each segment of the run will have a walltime of 12:00:00.

The namelists sub-section provides paths to the namelist files containing the namrun and namdom namelists that are needed to calculate the namrun values for each run segment.

namrun

Absolute path to the namelist file containing the namrun namelist. If you follow the recommended pattern of breaking namelist_cfg into different files (see SS-run-sets/v201905/), the name of this file is namelist.time. If you use a monolithic namelist_cfg file, the name of this file is probably namelist_cfg.

Warning

This path must appear identically in the namelist_cfg sub-section of the namelists Section of the run description YAML file.

namdom

Absolute path to the namelist file containing the namdom namelist. If you follow the recommended pattern of breaking namelist_cfg into different files (see SS-run-sets/v201905/), the name of this file is namelist.domain. If you use a monolithic namelist_cfg file, the name of this file is probably namelist_cfg.

Segmented Runs YAML File Restrictions

There are a few restrictions on how your run description YAML file must be structured for it to be usable for a segmented run in contrast to a single job run. These restrictions arise due to the processing that salishsea run has to do to construct run description and namelist files for each segment of a segmented run.

  • All paths must be absolute; i.e. start with a / or with an environment variable whose value starts with a /. That means (for example) you should use $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.time instead of ./namelist.time. Paths may contain ~ or $HOME as alternative spellings of your home directory, and $USER as an alternative spelling of your userid. You can also use system-defined environment variable values like $PROJECT and $SCRATCH.

  • The path associated with the namrun key in the namelists sub-section under segmented run must appear identically in the namelist_cfg sub-section of the namelists Section of the run description YAML file.
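Both restrictions can be checked before launching a run. The sketch below is not part of salishsea run. The function names and the /project/hypothetical value are made up for illustration. It assumes a POSIX system, and a run description already loaded into a dict in which namelist_cfg is a list of paths.

```python
import os


def is_absolute_after_expansion(path):
    """True if path is absolute once ~ and environment variables are expanded."""
    expanded = os.path.expanduser(os.path.expandvars(path))
    return os.path.isabs(expanded)


def namrun_path_listed(run_desc):
    """True if the segmented run namrun path appears identically in namelist_cfg."""
    namrun = run_desc["segmented run"]["namelists"]["namrun"]
    return namrun in run_desc["namelists"]["namelist_cfg"]


# Hypothetical value so that the demo runs on systems without $PROJECT defined
os.environ.setdefault("PROJECT", "/project/hypothetical")

namrun_path = "$PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.time"
# Assumed layout of the relevant parts of a run description dict
run_desc = {
    "segmented run": {"namelists": {"namrun": namrun_path}},
    "namelists": {"namelist_cfg": [namrun_path]},
}

print(is_absolute_after_expansion(namrun_path))
print(is_absolute_after_expansion("./namelist.time"))
print(namrun_path_listed(run_desc))
```

Note that an undefined environment variable is left unexpanded by os.path.expandvars, so such a path is reported as not absolute.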

How Segmented Runs Work

This section describes how the salishsea run command prepares and queues the sequence of linked runs that is generated when the segmented run section is included in a run description YAML file.

The process begins by calculating several pieces of information for each segment of the sequence:

  • the segment run description dict; that is a copy of the run description dict read from the run description YAML file given in the salishsea run command with values calculated for the particular run segment

  • the file name in which the segment run description dict will be stored as YAML; that is the name of the run description YAML file given in the salishsea run command with the 0-based index of the segment appended to the name. For example, if the command-line YAML file is BR5_12SKOG2016.yaml, the first segment’s YAML file will be BR5_12SKOG2016_0.yaml, the second will be BR5_12SKOG2016_1.yaml, etc. Those are the names of the run description YAML files that will be stored in the segment results directories.

  • the directory name in which the segment run results will be stored; that is the results directory name given in the salishsea run command with the 0-based index of the segment appended to it. For example, if the command-line results directory is $SCRATCH/SKOG_graham_BASERUN/BR_2016/, the first segment’s results will be stored in $SCRATCH/SKOG_graham_BASERUN/BR_2016_0/, the second will be in $SCRATCH/SKOG_graham_BASERUN/BR_2016_1/, etc.

  • the f90nml patch dict that will be applied to the namrun namelist to set the values of nn_it000, nn_itend, and nn_date0 for the segment
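The namrun values for each segment can be sketched as follows. This is an illustration rather than the actual salishsea run code. The function name namrun_patches is hypothetical. The 40 second time step is an assumed value standing in for rn_rdt read from the namdom namelist. The sketch also assumes that end date is inclusive, and that nn_date0 is set to the first day of each segment.

```python
from datetime import date, timedelta

SECONDS_PER_DAY = 24 * 60 * 60


def namrun_patches(start_date, start_time_step, end_date, days_per_segment, rn_rdt):
    """Return one {"namrun": {...}} patch dict per segment.

    rn_rdt is the model time step in seconds; it is assumed to divide
    evenly into a day. end_date is assumed to be inclusive.
    """
    steps_per_day = int(SECONDS_PER_DAY // rn_rdt)
    patches = []
    first_day, nn_it000 = start_date, start_time_step
    while first_day <= end_date:
        last_day = min(first_day + timedelta(days=days_per_segment - 1), end_date)
        n_days = (last_day - first_day).days + 1
        # Last time step of the segment; the next segment starts one step later
        nn_itend = nn_it000 + n_days * steps_per_day - 1
        patches.append(
            {
                "namrun": {
                    "nn_it000": nn_it000,
                    "nn_itend": nn_itend,
                    "nn_date0": int(first_day.strftime("%Y%m%d")),
                }
            }
        )
        first_day = last_day + timedelta(days=1)
        nn_it000 = nn_itend + 1
    return patches


# Values from the example segmented run section; rn_rdt=40 is an assumption
patches = namrun_patches(date(2016, 4, 30), 2730241, date(2016, 12, 31), 30, rn_rdt=40)
for patch in patches:
    print(patch["namrun"])
```

A dict of this shape is the kind of patch that the f90nml package can apply to a template namelist file.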

Next, the namelist files containing the namrun namelist for the segments and the segment run description YAML files are written in temporary storage directories (one for each segment) that exist only while the salishsea run command is being executed. Each segment’s namrun namelist file is created by using the value associated with the namrun key as a template namelist file to which the f90nml patches calculated above are applied. The segment run descriptions calculated above are updated with:

  • the path of the namrun namelist for the segments

  • the path(s) of the restart file(s) that will be produced by the previous run segment

  • the segment walltime value

The segment run descriptions are stored with the YAML file names calculated above.

With all of that preparation completed, temporary run directories for each segment are created in the directory given by the runs directory key in the paths Section of the run description YAML file from the command-line. Then the run segments are submitted in order, each with a --waitjob dependency on successful completion of the previous segment.

The run ids of the segments are the value associated with the run_id key in the YAML file from the command-line, prefixed with the 0-based index of the run segment. For example, if the run_id value is SKOG_2016_BASE, the run id of the first queued segment will be 0_SKOG_2016_BASE, the second will be 1_SKOG_2016_BASE, etc. The run ids are prefixed with their segment number (in contrast to YAML files and results directories which are suffixed) so that the segment numbers are easily visible in the output of squeue or qstat even if the base run id is long.
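The naming conventions for segment YAML files, results directories, and run ids can be summarized in a short sketch. The function name segment_names is hypothetical, and this is an illustration of the conventions described above, not the code that salishsea run uses.

```python
from pathlib import Path


def segment_names(yaml_file, results_dir, run_id, segment_numbers):
    """Return the YAML file name, results directory, and run id for each segment."""
    yaml_path = Path(yaml_file)
    names = []
    for seg in segment_numbers:
        names.append(
            {
                # Segment number is appended to the YAML file name stem
                "yaml file": f"{yaml_path.stem}_{seg}{yaml_path.suffix}",
                # Segment number is appended to the results directory name
                "results dir": f"{results_dir.rstrip('/')}_{seg}/",
                # Segment number is prefixed to the run id
                "run id": f"{seg}_{run_id}",
            }
        )
    return names


names = segment_names(
    "BR5_12SKOG2016.yaml",
    "$SCRATCH/SKOG_graham_BASERUN/BR_2016/",
    "SKOG_2016_BASE",
    range(3),
)
for entry in names:
    print(entry)
```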

The salishsea run command returns a space-separated list of job ids of the queued run segments.

Restarting After a Segment Failure

If a segmented run fails part way through, you can restart it from the last restart file(s) it produced. To do so, you need to update your run description YAML file, or create a new one, with the following changes:

  • Set the value of start date to the date (YYYY-MM-DD) on which you want the run to resume.

  • Set the value of start time step to the time step of the restart file(s) plus 1.

  • Set the value(s) in the restart Section to the path(s) of the restart file(s) that you want the run to restart from.

  • Set the value of first segment number to the segment number in which the restart files were produced plus 1.

So, for example, let’s say you started a segmented run with a YAML file that contained:

segmented run:
  start date: 2016-04-30
  start time step: 2730241
  end date: 2016-12-31
  days per segment: 30
  first segment number: 0
  segment walltime: 12:00:00
  namelists:
    namrun: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.time
    namdom: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.domain

...

  restart:
    restart.nc: $SCRATCHDIR/SKOG/SKOG_02730240_restart.nc
    restart_trc.nc: $SCRATCHDIR/SKOG/SKOG_02730240_restart_trc.nc

Now let’s say it fails (perhaps due to exceeding walltime) during segment 2 so that you have restart files:

  • $SCRATCHDIR/SKOG_2/SKOG_02892240_restart.nc

  • $SCRATCHDIR/SKOG_2/SKOG_02892240_restart_trc.nc

corresponding to a run date of 2016-07-14. You can restart the run by editing your YAML file to:

segmented run:
  start date: 2016-07-15
  start time step: 2892241
  end date: 2016-12-31
  days per segment: 30
  first segment number: 3
  segment walltime: 12:00:00
  namelists:
    namrun: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.time
    namdom: $PROJECT/SalishSeaCast/hindcast-sys/SS-run-sets/v201812/namelist.domain

...

  restart:
    restart.nc: $SCRATCHDIR/SKOG_2/SKOG_02892240_restart.nc
    restart_trc.nc: $SCRATCHDIR/SKOG_2/SKOG_02892240_restart_trc.nc
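The edited values in the restart example can be derived mechanically from the restart file name, as in the sketch below. The function name restart_settings is hypothetical. The sketch assumes that restart file names embed the time step as 8 digits before _restart, as in the examples above, and that the run resumes on the day after the restart file's run date.

```python
import os
import re
from datetime import date, timedelta


def restart_settings(restart_file, restart_date, restart_segment_number):
    """Return the segmented run values to use when resuming from restart_file.

    Assumption: the file name looks like <name>_<8-digit time step>_restart.nc
    """
    match = re.search(r"_(\d{8})_restart", os.path.basename(restart_file))
    if match is None:
        raise ValueError(f"no time step found in file name: {restart_file}")
    restart_time_step = int(match.group(1))
    return {
        # Resume on the day after the run date of the restart file
        "start date": restart_date + timedelta(days=1),
        # Time step of the restart file plus 1
        "start time step": restart_time_step + 1,
        # Segment that produced the restart file plus 1
        "first segment number": restart_segment_number + 1,
    }


settings = restart_settings(
    "$SCRATCHDIR/SKOG_2/SKOG_02892240_restart.nc", date(2016, 7, 14), 2
)
print(settings)
```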