Bestman errors

From CEDPS

Revision as of 05:34, 23 July 2009 by Dang (Talk | contribs)

Background

This page shows how to troubleshoot BeStMan errors using the raw logs, the NetLogger logs, and the log database.

Scenarios

(1) Server runs out of memory

This occurred on the NERSC PDSF system (pdsfsrm.nersc.gov) on or around 16 July 2009. In the raw logs, it appeared only as a "gap" in the data: the two entries below are almost two days apart, with nothing logged in between.

Raw log

2009-07-15T11:11:02-07:00 pdsfgrid2.nersc.gov event_srm_log ts=2009-07-15T11:11:03.200Z level=Info
class=gov.lbl.srm.impl.TSRMService event=outcoming.SrmStatusOfCopyRequestSRM_SUCCESS tid=Thread-11

2009-07-17T07:14:36-07:00 pdsfgrid2.nersc.gov event_srm_log ts=2009-07-17T07:14:37.317Z level=Info  
class=gov.lbl.srm.server.TSRMServer event=incoming.srmStatusOfCopyRequest rid=didenko:19_COPY_701829945 tid=Thread-15

This snippet translates into the roughly equivalent NetLogger Best Practices (BP) log entries below.

NetLogger BP log

ts=2009-07-15T11:11:03.200Z event=srm.server.impl.TSRMService.outcoming.SrmStatusOfCopyRequestSRM_SUCCESS 
level=Info req.id=unknown th.id=11

ts=2009-07-17T07:14:37.317Z event=srm.server.TSRMServer.incoming.srmStatusOfCopyRequest 
level=Info req.id=didenko:19_COPY_701829945 th.id=15
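
Since each BP entry is a flat sequence of name=value pairs, it is easy to load into a dictionary for ad hoc inspection. The following is a minimal sketch, not part of NetLogger itself; the function name parse_bp_line is hypothetical, and it assumes each record sits on one logical line (the entries above are wrapped for display).

```python
import shlex

def parse_bp_line(line):
    """Split one BP log line of name=value pairs into a dict of strings."""
    fields = {}
    # shlex.split keeps quoted values such as msg="two words" together.
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not a name=value pair: {token!r}")
        fields[key] = value
    return fields

# Second BP entry from above, joined back into a single line.
record = parse_bp_line(
    "ts=2009-07-17T07:14:37.317Z "
    "event=srm.server.TSRMServer.incoming.srmStatusOfCopyRequest "
    "level=Info req.id=didenko:19_COPY_701829945 th.id=15"
)
print(record["ts"], record["req.id"])
```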

Analysis

Discussion with Junmin Gu, the main programmer for BeStMan, revealed that no errors are expected in the logs in this case: the server simply exits silently with an OutOfMemory error.

What would really be useful here, then, is a simple way to recognize "gaps" in the data. With this in mind, we developed a tool, nl_gap, which examines the database and prints any days that have no data. Besides the location of the database, its input includes a date range to restrict the search (the default is the whole database) and an event prefix, which in this case would be "srm.".
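
The idea of reporting days without data can be illustrated with a few lines of Python. This is only a sketch of the concept, not the nl_gap implementation: the function name find_gap_days is hypothetical, and it works on an in-memory list of timestamps rather than on the database.

```python
from datetime import date, datetime, timedelta

def find_gap_days(timestamps, start, end):
    """Return every date in [start, end) on which no timestamp falls."""
    seen = {ts.date() for ts in timestamps}
    gaps = []
    day = start
    while day < end:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps

# The two event times on either side of the gap in the logs above.
events = [
    datetime(2009, 7, 15, 11, 11, 3),
    datetime(2009, 7, 17, 7, 14, 37),
]
for day in find_gap_days(events, date(2009, 7, 15), date(2009, 7, 18)):
    print(day.isoformat())  # prints 2009-07-16
```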

The example below was run on osp.nersc.gov, the collector host for the pdsfsrm.nersc.gov logs. Because of the date and namespace restrictions, the query is very fast even though the database holds 1.3M records in total, over 124,000 of them SRM records.

$ time nl_gap --url=mysql://localhost --db=nldb --timerange=2009-07::2009-08 --ns=srm
Gaps:
   2009-07-16
real	0m0.568s
user	0m0.163s
sys	0m0.042s

Options explained:

--url=mysql://localhost 
Connect to MySQL server on localhost
--db=nldb 
Use database "nldb"
--timerange=2009-07::2009-08 
Restrict to July, 2009
--ns=srm 
Only examine events whose name starts with "srm" (a namespace guaranteed by the NetLogger parser)
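
To make the --timerange value concrete, the sketch below shows one plausible reading of the START::END form, in which a bare year-month means the first instant of that month, so that 2009-07::2009-08 covers July 2009. The function parse_timerange is hypothetical, and the accepted date forms and the treatment of the end point are assumptions; nl_gap may parse its argument differently.

```python
from datetime import datetime

def parse_timerange(spec):
    """Split 'START::END' into two datetimes (assumed YYYY-MM or YYYY-MM-DD)."""
    def parse_side(text):
        # Try the more specific format first, then fall back to year-month.
        for fmt in ("%Y-%m-%d", "%Y-%m"):
            try:
                return datetime.strptime(text, fmt)
            except ValueError:
                continue
        raise ValueError(f"unrecognized date: {text!r}")

    start_text, sep, end_text = spec.partition("::")
    if not sep:
        raise ValueError("expected START::END")
    return parse_side(start_text), parse_side(end_text)

start, end = parse_timerange("2009-07::2009-08")
print(start.isoformat(), end.isoformat())
```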

One possible snag in this plan, for SRM, is that the server could be alive but not logging because no transfers are occurring. Further discussions with Junmin led to an agreement that the SRM server will log an occasional "heartbeat" message regardless of whether it is doing any work.
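
With a heartbeat in place, any silence much longer than the heartbeat interval points to a dead server rather than an idle one. The sketch below illustrates that check. The function name find_silences, the 5-minute interval, and the 15-minute threshold are all hypothetical, since the actual heartbeat period is not specified here.

```python
from datetime import datetime, timedelta

def find_silences(timestamps, max_silence):
    """Return (last_seen, next_seen) pairs more than max_silence apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_silence
    ]

# Hypothetical heartbeats every 5 minutes, then nothing until two days later.
beats = [
    datetime(2009, 7, 15, 11, 0),
    datetime(2009, 7, 15, 11, 5),
    datetime(2009, 7, 15, 11, 10),
    datetime(2009, 7, 17, 7, 15),
]
# Tolerate up to three missed heartbeats before reporting a silence.
for last_seen, next_seen in find_silences(beats, timedelta(minutes=15)):
    print(f"no events between {last_seen.isoformat()} and {next_seen.isoformat()}")
```

Unlike a per-day scan, a threshold on the time between consecutive events also catches outages that start and end within the same day.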