Bestman errors
From CEDPS
Background
This page shows how to troubleshoot BeStMan errors from the raw logs, NetLogger logs and database.
Scenarios
(1) Server runs out of memory
This occurred on the NERSC PDSF system, pdsfsrm.nersc.gov, on or around 7/16/2009. In the raw logs, it appeared only as a "gap" in the data.
Raw log
2009-07-15T11:11:02-07:00 pdsfgrid2.nersc.gov event_srm_log ts=2009-07-15T11:11:03.200Z level=Info class=gov.lbl.srm.impl.TSRMService event=outcoming.SrmStatusOfCopyRequestSRM_SUCCESS tid=Thread-11
2009-07-17T07:14:36-07:00 pdsfgrid2.nersc.gov event_srm_log ts=2009-07-17T07:14:37.317Z level=Info class=gov.lbl.srm.server.TSRMServer event=incoming.srmStatusOfCopyRequest rid=didenko:19_COPY_701829945 tid=Thread-15
This snippet translates into roughly equivalent NetLogger (Best Practices) logs.
NetLogger BP log
ts=2009-07-15T11:11:03.200Z event=srm.server.impl.TSRMService.outcoming.SrmStatusOfCopyRequestSRM_SUCCESS level=Info req.id=unknown th.id=11
ts=2009-07-17T07:14:37.317Z event=srm.server.TSRMServer.incoming.srmStatusOfCopyRequest level=Info req.id=didenko:19_COPY_701829945 th.id=15
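To make the "gap" concrete, the two BP lines above can be parsed and their timestamps subtracted. The following is a minimal Python sketch, not part of NetLogger: the helper name parse_bp_line is hypothetical, and it assumes every field is a whitespace-separated key=value pair whose value contains no spaces, which holds for this sample but may not hold for every log line.

```python
from datetime import datetime, timezone


def parse_bp_line(line):
    """Split one 'key=value key=value ...' log line into a dict.

    Hypothetical helper for illustration; assumes values contain no spaces.
    """
    fields = dict(pair.split("=", 1) for pair in line.split())
    # Timestamps in the sample look like 2009-07-15T11:11:03.200Z (UTC).
    fields["ts"] = datetime.strptime(
        fields["ts"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    return fields


before = parse_bp_line(
    "ts=2009-07-15T11:11:03.200Z "
    "event=srm.server.impl.TSRMService.outcoming.SrmStatusOfCopyRequestSRM_SUCCESS "
    "level=Info req.id=unknown th.id=11"
)
after = parse_bp_line(
    "ts=2009-07-17T07:14:37.317Z "
    "event=srm.server.TSRMServer.incoming.srmStatusOfCopyRequest "
    "level=Info req.id=didenko:19_COPY_701829945 th.id=15"
)

# Time during which the server logged nothing at all.
silence = after["ts"] - before["ts"]
print(silence)  # 1 day, 20:03:34.117000
```

Nothing in either line indicates an error; the only symptom is the silence of nearly two days between them.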
Analysis
Discussion with Junmin Gu, the main programmer for BeStMan, revealed that no errors are expected in the logs in this case: the server simply exits silently with an OutOfMemory error.
What is really useful here, then, is the simple ability to recognize "gaps" in the data. With this in mind, we developed a tool, nl_gap, which examines the database and prints any days that have no data. Besides the location of the database, its input includes a date range to restrict the search (the default is the whole database) and an event prefix, which in this case would be "srm.".
The following example was run on osp.nersc.gov, the collector host for the pdsfsrm.nersc.gov logs. Because of the date and namespace restrictions, the query is very fast even though there are 1.3M total records and over 124,000 SRM records.
$ time nl_gap --url=mysql://localhost --db=nldb --timerange=2009-07::2009-08 --ns=srm
Gaps: 2009-07-16
real 0m0.568s
user 0m0.163s
sys 0m0.042s
Options explained:
- --url=mysql://localhost
- Connect to MySQL server on localhost
- --db=nldb
- Use database "nldb"
- --timerange=2009-07::2009-08
- Restrict to July, 2009
- --ns=srm
- Only examine events whose name starts with "srm" (a namespace guaranteed by the NetLogger parser)
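The idea behind the gap check can be sketched in a few lines of Python. This is an illustration only, not the nl_gap implementation: the function find_gap_days and the sample events are hypothetical, the events are held in memory rather than read from the MySQL database, and the date range is taken as a half-open interval [start, end), which is an assumption about how a range such as 2009-07::2009-08 restricts the search to July.

```python
from datetime import date, datetime, timedelta


def find_gap_days(events, start, end, prefix="srm."):
    """Return the days in [start, end) with no event whose name starts with prefix.

    events is an iterable of (timestamp, event_name) pairs.
    """
    seen = {
        ts.date()
        for ts, name in events
        if name.startswith(prefix) and start <= ts.date() < end
    }
    gaps = []
    day = start
    while day < end:
        if day not in seen:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


# Hypothetical data: one SRM event per day in July 2009, except on the 16th.
events = [
    (datetime(2009, 7, d, 12, 0), "srm.server.TSRMServer.incoming.srmStatusOfCopyRequest")
    for d in range(1, 32)
    if d != 16
]
# An event outside the "srm." namespace does not fill the gap.
events.append((datetime(2009, 7, 16, 9, 0), "other.example.event"))

gaps = find_gap_days(events, date(2009, 7, 1), date(2009, 8, 1))
print([d.isoformat() for d in gaps])  # ['2009-07-16']
```

Filtering by prefix and date range before looking for missing days mirrors the restrictions described above, which are what keep the real query fast on a large database.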
One possible snag in this plan, for SRM, is that the server could be alive but not logging because no transfers are occurring. Further discussions with Junmin led to an agreement that the SRM server will log an occasional "heartbeat" message regardless of whether it is doing anything.
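Once heartbeat messages are present, any silence longer than a few heartbeat intervals can be treated as suspicious, which allows a finer check than whole days. The sketch below assumes a 10-minute heartbeat and a 30-minute threshold; both values, the function find_silences, and the sample data are hypothetical, since the actual heartbeat interval is not specified here.

```python
from datetime import datetime, timedelta


def find_silences(timestamps, max_silence):
    """Return (start, end) pairs where consecutive events are more than max_silence apart."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_silence
    ]


# Hypothetical data: heartbeats every 10 minutes, interrupted by an outage.
t0 = datetime(2009, 7, 16, 0, 0)
beats = [t0 + timedelta(minutes=10 * i) for i in range(6)]            # 00:00 to 00:50
beats += [t0 + timedelta(hours=4, minutes=10 * i) for i in range(6)]  # 04:00 to 04:50

silences = find_silences(beats, max_silence=timedelta(minutes=30))
for start, end in silences:
    print(start.isoformat(), "->", end.isoformat())
```

With this data the only reported silence runs from 00:50 to 04:00; the normal 10-minute spacing between heartbeats stays below the threshold and is not reported.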
