How to make a simulation?

Allow a user, assumed to be already acquainted with the tori, yuki or vargas computing environments, to technically design and run a CNRM-CM experiment (coupled or not).

Article published online on 3 October 2012
last modified on 26 September 2016

by senesi, voldoire

The available CNRM-CM components are:

  • the atmospheric model, Arpege, which embeds the Land Surface Model, Surfex
  • the ocean model Nemo, which may embed a sea-ice model, either Gelato or LIM, and which can perform its I/O using a companion process, called the I/O server.
  • the river routing model TRIP
  • the coupler Oasis, which manages the exchange of information between these three model groups
  • a set of scripts used to define and run experiments, named ECLIS (Environment for CLImate Simulations)

A fully coupled CNRM-CM experiment involves 5 or 6 models (depending on whether a sea-ice model is included) and the coupler, but you can also run these configurations:

  • Arpege/Aladin only, with or without Surfex, forced by prescribed oceanic boundary conditions,
  • Surfex only (the so-called ’offline mode’), with prescribed atmospheric forcings, and oceanic ones if applicable,
  • NEMO (possibly with a sea-ice model), with prescribed atmospheric forcings.

All models can be provided with nudging data.

The 7 models (coupler included) each have a namelist, almost all have a restart file, and most have their own output format. Each model group has a binary. This information is stored in a so-called ’param file’ used by ECLIS. The binaries are launched together using MPI. A CNRM-CM run is made of an automated sequence of "macro-jobs", where each "macro-job" iterates over a number of months (usually one year).

CNRM-CM runs at Météo-France on beaufix, tori and yuki, and at IDRIS on vargas.

 Setting up your environment
 Designing an experiment
 Launching the experiment
 Monitoring the experiment’s progress
 Managing a crashed experiment
 Cleaning an experiment

Setting up your environment

  • If you want an easy life and are OK with adding some directories to your PATH (with low priority), and you use bash: add the following lines to your .profile (or .bash_profile if you have one):
    • export ECLIS=xxx, where xxx is an ECLIS location, as described in the page on ECLIS management. At the time of writing, on beaufix, you may write export ECLIS=~senesi/eclis/V6.2 or export ECLIS=~senesi/eclis/current
    • source $ECLIS/setup
  • otherwise, please read the guidelines in file ’setup’
    • if you want to run old experiments (i.e. created using a version of ECLIS older than V5.4), you can keep your environment setup and still create new experiments (using an ECLIS version above 5.7)
    • for old versions of ECLIS, see footnote [1]
  • At the Météo-France computing center, to enable the ftput and ftget utilities, execute the command ftmotpasse -u xxxxx -h cougar (where xxxxx is your cougar user name). At IDRIS, make sure that mfget and mfput work for transfers to gaya without password prompting
  • if you want to use the ’delexp’ script for cleaning experiment outputs, you have to enable simple ftp without password prompting: create a $HOME/.netrc file (with permissions 600) with a line such as machine cougar login my_MF_login password my_MF_password (or, at IDRIS, machine gaya login my_gaya_login password my_gaya_password). This is however not mandatory for installing or running experiments. A summary sketch is given after this list.
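
To sum up, a minimal setup could look like the sketch below; the ECLIS location is one of the examples quoted above, and the login and password values are placeholders to adapt to your own account:

  # lines to add to ~/.profile (or ~/.bash_profile)
  export ECLIS=~senesi/eclis/current
  source $ECLIS/setup

  # optional, for delexp: enable password-less ftp to the archive machine
  # (use "machine gaya ..." instead at IDRIS)
  echo "machine cougar login my_MF_login password my_MF_password" > $HOME/.netrc
  chmod 600 $HOME/.netrc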

Designing an experiment

It consists in preparing a so-called "param file", named after the pattern "param_EXPID", where EXPID is the experiment name (which must not exceed 16 characters, a limit due to Nemo). You may have a look at the example files in directory ECLIS/params and find directions there for preparing such a file; but you should rather refer to the more comprehensive and up-to-date (though less documented) set of "param" files in the tori and beaufix directories of ECLIS/testing/ (see the README file).

Please also note in the examples that parameter files should follow bash syntax and explicitly invoke bash through their first line (#!/bin/bash).

Please refer to the list of all parameters available in param files for an exhaustive point of view; there is also an index for these parameters.
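
For orientation only, here is a hypothetical skeleton of such a file; apart from GROUP, RELDIR and EXPARCHDIR, which are discussed further down this page, parameter names must be taken from the example files and the parameter list mentioned above:

  #!/bin/bash
  # Hypothetical skeleton of a param_MYEXP01 file (MYEXP01 is a placeholder
  # experiment name, at most 16 characters). Real, complete examples are in
  # ECLIS/params and ECLIS/testing/.
  GROUP=MYGROUP                # group name, used in directory layouts
  EXPARCHDIR=$GROUP/MYEXP01    # output location on the archive machine
  RELDIR=$HOME/relances        # where the 'relance' directory is created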

Install the experiment by executing the param_EXPID file: ./param_EXPID
(do not forget to first make it executable: chmod u+x param_EXPID)

This will create a number of directories and fetch the experiment input files. It will also create an experiment configuration file named EXPID.conf which, together with the ’history’ file, is located in the so-called ’relance’ directory, usually under $HOME/relances (your param file may set this location through parameter RELDIR).
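
You can then check that the installation went well, for instance (with GROUP and EXPID as set in your param file, and assuming the default RELDIR):

  ls $HOME/relances/GROUP/EXPID    # should show EXPID.conf and the 'history' file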

Launching the experiment

This is even more simply done by typing "relan EXPID" (from anywhere), or by answering ’yes’ after experiment installation. You will get some lines of output, ending with the output of a qsub/llsubmit command. What actually happens is that the "relan" system launches the main CNRM-CM5 script, which itself creates a "macro-job" for simulating a given number of months (as chosen in the param file); this "macro-job" follows a syntax which is interpreted by the Perl script "mtool"; the latter creates four actual jobs from it, each one launching the next one using qsub or llsubmit:

  • the first one, which is scalar/single-proc, fetches input data from cougar/gaya if they are not yet in due place on the supercomputer file system (GFS);
  • the second one actually runs the models in the parallel/vector queue;
  • the third one, which is a scalar/mono-proc job, sends the results from the supercomputer to the archive machine and uses the "relan" system to launch another "macro-job" running CNRM-CM5 on the next set of months (or the next year), and so on until the whole period has been simulated;
  • the last, fourth job performs some house-keeping.
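
The chain itself is generated by mtool; the sketch below is only a plain-bash illustration of this four-job sequence, with echo stand-ins for the job bodies, not the actual implementation:

  #!/bin/bash
  # Illustration of the four chained jobs of one macro-job (stand-in bodies;
  # the real jobs are generated by mtool and chained via qsub/llsubmit).
  step1_fetch()   { echo "fetch missing input files from cougar/gaya to GFS"; }
  step2_compute() { echo "run the models in the parallel/vector queue"; }
  step3_archive() { echo "send outputs to the archive; relan the next months"; }
  step4_clean()   { echo "house-keeping"; }
  step1_fetch && step2_compute && step3_archive && step4_clean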

The model outputs end up on the archive machine (cougar or gaya) at the location indicated by parameter EXPARCHDIR in param_EXPID (usually equal to GROUP/EXPID); they are sent there at the end of each macro-job.

Monitoring the experiment’s progress

This can be done in various ways:

  • using squeue -u $USER (on beaufix): each of the jobs of a macro-job is named EXPID_YYYYMM after the date of its first month (on IBM, use llq -l | grep "Job Name"); a few commands summarizing these checks are sketched after this list
  • looking at files in the experiment relance directory (which you can easily reach by typing "cde EXPID", and which actually is $HOME/relances/GROUP/EXPID, where GROUP has been set in the experiment parameter file)
    • file EXPID_his keeps a record of the first date of the last launched run (line DACT) and of the last date completed (line DAFC). It also records the launch and end dates of each macro-job (lines INFA and INFD); line DATF indicates the end date of the simulation period; you may modify it to smoothly stop a running experiment by shortening the period. You can also lengthen the period; in that case, if the experiment is no longer running, you can restart it with "relan EXPID"
    • macro-job outputs and models logfiles : see below
    • subdirectory ’steps’ (new with V5.4): it holds the job inputs and outputs for the four jobs described in the previous section; on vargas, step02.out is updated all along the computation step
    • subdirectory ’ftexp’ stores the model outputs until they are moved to the archive machine; it holds one sub-directory per model and per month
  • checking the existence of the model outputs on cougar/gaya (see above).
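
For instance, on beaufix, with EXPID as a placeholder for the experiment name:

  cde EXPID                            # jump to the experiment relance directory
  grep -E 'DACT|DAFC|DATF' EXPID_his   # last launched, last completed, end dates
  squeue -u $USER | grep EXPID         # jobs currently queued or running
  ls steps ftexp                       # per-step job files, pending model outputs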

Managing a crashed experiment

  • mails: the CNRM-CM5 script will send a mail (to the address set in the param_EXPID file) if any one of the first three jobs forming a macro-job fails. The mail provides a hint on what went wrong: fetching the input, computing, or sending the outputs to the archive machine
  • CNRM-CM5 macro-job outputs: they provide additional information w.r.t. the mail; you will find them in the experiment relance directory (use "cde "); they are named after the pattern EXPID_.o; if you do not find the macro-job output for a date for which you know that a job was launched, you will have to go to the submit directory, which holds the jobs and job outputs for each of the 4 steps of the macro-job; this directory is under the symbolic link ’steps’ in the relance directory (from version V5.4) or can be reached using command "cds "
  • searching for the model crash reason: you will have to dig into the model logfiles in the relance directory, which are named after each model’s usual logfile name (assuming the usual namelist content):
    • last_stdall: stdout and stderr of each process, for each model
    • last_NODE.001_01: Arpege logfile
    • last_OUTPUT_LISTING: Surfex logfile
    • last_cplout: Oasis coupler logfile
    • last_gltout_0xx: Gelato logfiles
    • last_ocean.output: Nemo logfile
    • last_output.trip: Trip logfile
  • usual problems: in the compute phase, and apart from errors in the preparation of the model run, the well-known crash symptoms are:
    • a too strong current in the ocean model, which is diagnosed by the words ’the zonal velocity is larger than 20 m/s’ in last_ocean.output, and which may be solved by relaunching with a very slight and temporary change of parameter "rn_addhft" in the Nemo namelist (named "namelist") in the relance directory
    • a too strong wind in Arpege, which is diagnosed by the words "WIND TOO STRONG" in last_NODE.001 and which is solved by re-launching with a very slight change of parameter HDIRDIV in the Arpege namelist (named "fort.4") in the relance directory
    • a numerical stability problem in Arpege causing routines larmes or ... to crash, which is solved by slightly and temporarily varying the diffusion parameter RRDXTAU
  • re-launching the experiment: once you have cured the problem, just type relan <EXPID>; you will be asked for a confirmation, because the relan system diagnoses that a job was launched earlier which did not send back a notice of successful completion. Mind also that, for the two model crash cases described above, you should set a start date (at line DACT of the "_his" file) which is at least a full month before the crash date. In the other cases, you may set DACT to the beginning of the month of the crash. The CNRM-CM5 script will nicely handle this change in the macro-job start month. A session example is sketched after this list.
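
For the first crash case above (a too strong current in Nemo), a recovery session could look like the following sketch; EXPID and the choice of editor are placeholders, and the exact namelist change must follow the advice above:

  cde EXPID                                # go to the relance directory
  grep 'zonal velocity' last_ocean.output  # confirm the crash symptom
  vi namelist                              # very slight, temporary change of rn_addhft
  vi EXPID_his                             # set DACT at least a full month before the crash
  relan EXPID                              # relaunch, and confirm when prompted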

Cleaning an experiment

The delexp script has been designed to enable cleaning files on the computing machine and on the archive machine (namely tori and cougar for most users). This script works with all configurations. delexp -h will list the available options.
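
For example (that delexp takes the experiment name as argument is an assumption here; check delexp -h for the actual calling convention):

  delexp -h       # list the available options
  delexp EXPID    # presumably cleans the files of experiment EXPID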


Documents
param_EXAMPLE_before_V5-4-3.txt 10.6 KB / text

param_AMIP_before_V5.4 9.3 KB / text

param_EXAMPLE 10.6 KB / text

param_EXAMPLE_before_V5.4 10.6 KB / text

param_example_couple 12.4 KB / text

param_example_forced 3.5 KB / text

param_OCEAN 2.6 KB / text

param_OCEAN_before_V5.7 8 KB / text
