HWRF, gsi_d02_wrapper, returncode=174

Submitted by lliu on Sun, 08/01/2021 - 10:16
We are installing and running HWRF version 4.0a on a computer that uses the SLURM scheduler. The code compiled without problems.
 
Running HWRF, we completed Steps 1, 2, and 3 with no obvious errors in the information file, and result data appeared in the output directory. Everything seemed fine.

However, when we ran the Step 4 wrapper, "gsi_d02_wrapper", the program stopped with this error message in the log file:

07/30 09:57:55.421 hwrf.gsi_d02 (gsi.py:942) CRITICAL: [MainThread] GSI failed for <WRFDomain name=storm1ghost_parent> domain: exe('/usr/bin/srun')['--export=ALL','--cpu_bind=core','--distribution=block:block','/projects/ees/dhs-crc/aniya/HWRFmodel/hwrfrun/sorc/GSI/src/gsi.exe'].in('gsiparm.anl',string=False).out('stdout',append=False).env(OMP_NUM_THREADS=1, KMP_AFFINITY=scatter, KMP_NUM_THREADS=1, OMP_STACKSIZE=128M): non-zero exit status (returncode=174)

I think this is GSI-related. What is this problem about? Is it a memory issue? I can attach the complete information file if needed.
 
Greatly appreciate your help!
 
Liping

Hi Liping,

I can't tell exactly what the problem is from this, but there is a good chance it is a memory issue. Can you try OMP_STACKSIZE=1024M and see if anything changes?
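If the failure is stack-related, it may also be worth checking the stack limit that the job inherits from the shell, since OMP_STACKSIZE only sizes the stacks of OpenMP worker threads while the main thread is governed by the process limit (ulimit -s). A minimal sketch using Python's standard resource module (Unix only); the helper names describe_limit and describe_stack_limit are made up for illustration and are not part of HWRF:

```python
import resource


def describe_limit(value):
    """Render an rlimit value as text; RLIM_INFINITY means no limit."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value // (1024 * 1024)} MiB"


def describe_stack_limit():
    """Report the soft and hard stack limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    return f"stack soft={describe_limit(soft)} hard={describe_limit(hard)}"


# Example use inside a batch job, before GSI starts:
#     print(describe_stack_limit())
```

Running this inside the batch job, rather than on a login node, shows the limit that the GSI processes would actually see.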

Thanks,

 

Will

In reply to wmayfield

Where do I set OMP_STACKSIZE=1024M?
 
Thank you so much for answering my question! I am really puzzled: I can't find any detailed information about the error. I searched the gsi_d02 output directory, including the stdout and gsiparm.anl files and some other status files, and found nothing about the exit error.
 
The only thing I noticed is that the executed command has OMP_NUM_THREADS=1, whereas we have "export OMP_NUM_THREADS  2" in the gsi_d02_wrapper file.
 
Where do I go to change the environment settings that srun receives, such as OMP_NUM_THREADS and OMP_STACKSIZE?

Hi Liping,

I have very little experience with HWRF, but it looks to me like these options are set in the run_gsi_exe method in ./hwrfrun/ush/hwrf/gsi.py, which starts around line 902, so that could be a place to try changing those values. In stdout, can you tell what stage GSI has reached when it ends? If you post the last several lines of stdout, that could help identify where the failure happens.
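One way to pull out the end of a large stdout file is sketched below. It is plain Python with no HWRF dependencies; the function name tail_lines and the default of 20 lines are arbitrary choices for illustration:

```python
from collections import deque


def tail_lines(path, count=20):
    """Return the last `count` lines of a text file.

    The deque keeps only `count` lines in memory while the file is read,
    so this also works for very long GSI stdout files.
    """
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


# Example use from the gsi_d02 work directory:
#     for line in tail_lines("stdout", 30):
#         print(line)
```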

Thanks,

Will

In reply to wmayfield

Hi Will,

Thank you so much for your guidance! You are amazing! I found the "gsi.py" file, and in it the following lines related to running "gsi.exe" (lines 902-949; I bold-faced the lines most closely related to the messages in the log file I got):

    def run_gsi_exe(self):
        """!Runs the actual GSI executable."""
        logger=self.log()

        cmd = mpi(self.getexe('gsi'))

        threads=os.environ.get('GSI_THREADS','1')
        threads=int(threads)
        if threads>1:
            cmd=openmp(cmd,threads=threads)

        cmd = mpirun(cmd,allranks=True).env(OMP_STACKSIZE='128M')
        if threads==1:
            cmd=openmp(cmd,threads=threads)

        cmd = cmd < 'gsiparm.anl'

        # Redirection is mandatory for the GSI so we can process the
        # output later on.
        cmd = cmd > 'stdout'
        print("Strange")
        print(cmd)
        logger.warning(repr(cmd))
        # Send a message at the highest logging level before and after
        # GSI so the NCEP-wide jlogfile has a message about the status
        # of GSI.
        self.postmsg('Starting GSI for %s domain'%(str(self.domain),))

        sleeptime=self.conffloat('sleeptime',30.0)

        ntries=0
        maxtries=1
        while ntries<maxtries:
            try:
                ntries+=1
                getrlimit(logger=logger)
                with rusage(logger=logger):
                    checkrun(cmd,sleeptime=sleeptime)
                break
            except Exception as e:
                if ntries>=maxtries:
                    logger.critical('GSI failed for %s domain: %s'
                                    %(str(self.domain),str(e)),exc_info=True)

                    raise
                else:
                    logger.warning('GSI failed for %s domain: %s, will retry %d more time(s)'
                                 %(str(self.domain),str(e),maxtries-ntries),
                                 exc_info=True)
        self.postmsg('GSI succeeded for %s domain'%(str(self.domain),))
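Judging only from the method quoted above and the logged command, the thread count comes from the GSI_THREADS environment variable (default 1) rather than from OMP_NUM_THREADS, and OMP_STACKSIZE is hard-coded to 128M, which would explain why the wrapper's OMP_NUM_THREADS setting had no visible effect. The sketch below mimics that selection logic in isolation. The function resolve_gsi_env and the variable GSI_OMP_STACKSIZE are hypothetical and do not exist in HWRF; they only illustrate one way the stack size could be made configurable:

```python
import os


def resolve_gsi_env(environ=None, default_stacksize="128M"):
    """Mimic how the quoted run_gsi_exe appears to pick OpenMP settings.

    GSI_THREADS (not OMP_NUM_THREADS) selects the thread count, as in the
    quoted code. GSI_OMP_STACKSIZE is a HYPOTHETICAL variable added here to
    show how the hard-coded '128M' could be overridden.
    """
    environ = os.environ if environ is None else environ
    threads = int(environ.get("GSI_THREADS", "1"))
    stacksize = environ.get("GSI_OMP_STACKSIZE", default_stacksize)
    return {"OMP_NUM_THREADS": str(threads), "OMP_STACKSIZE": stacksize}


# Setting only OMP_NUM_THREADS leaves the thread count at 1:
#     resolve_gsi_env({"OMP_NUM_THREADS": "2"})
```

In the real gsi.py, the simplest equivalent change is probably to edit the '128M' string in the .env(OMP_STACKSIZE=...) call directly and to export GSI_THREADS in the wrapper, though that has not been verified here.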

As for the "stdout" file, here are the last few lines (sorry for the long message):

 CREATE_PCP_RANDOM:  iseed=  2017090512
 radiance_obstype_init: total_rad_type=          18  types are: hirs
 goes_img  airs      amsua     amsub     mhs       ssmi      amsre
 ssmis     sndr      iasi      seviri    atms      cris      amsr2
 gmi       saphir    ahi
 PCPINFO_READ:  no pcpbias file.  set predxp=0.0
 glbsoi: starting ...
  before compute nlat_ens,nlon_ens, nlat,nlon,nlat_ens,nlon_ens,r_e,eps=
         315         315         629         315   1.00000000000000
  0.000000000000000E+000
  dual_res,nlat,nlon,nlat_ens,nlon_ens,r_e,eps= F         315         315
         315         315   1.00000000000000       0.000000000000000E+000
 gsi_metguess_mod*create_: alloc() for met-guess done
 guess_grids*create_chemges_grids: trouble getting number of chem/gases
 in read_wrf_nmm_netcdf: it filename =            1 sigf03
 in read_wrf_nmm_netcdf: it filename =            2 sigf06
 in read_wrf_nmm_netcdf: it filename =            3 sigf09