Techx portal integration
From CEDPS
Contents |
Overview
See NetLogger Mantis issue #219 for updates.
At the Star meeting in Davis we agreed to try to do the following proof-of-concept demo, with a goal of Aug 1 for the first, simple version of the demo.
Demo the Tech-X portal connecting to both the UCM database and the CEDPS Nersc database, and allow a Star production manager to drill down into the CEDPS logs when UCM logging detects a problem.
sample options:
- show all NERSC error logs for a given job
- show all NERSC logs for a given host
- show all NERSC that might possibly be related (based on hostname and time)
Action items
Dan/Brian:
- talk to NERSC security team and get permission to open up a mySQL port
to 1 host and BNL and 1 host at LBL for the portal to talk to.
- create a sample database of Star logs from Nersc logs and send to David.
- create sample queries for Star logs and send to David.
David:
- need new query engine for GridSphere portal to CEDPS DB
- work with BNL to instrument SUMS with UCM
Doug:
- look into changing the bestman configuration to use NERSC server instead of the BNL server.
- ask Jeff Porter about that state of condor-G testing with GT4, and consider switching to using WS-Gram on PDSF.
Sample queries
I put a dumpfile for database "osg_test" at http://acs.lbl.gov/~dang/CEDPS/osg_test.dump.bz2
Despite the dumpfile size, This is a relatively small DB with 181K entries, 122K of which are SGE jobs, and roughly 56K of those are STAR (group = 'rhstar') jobs. The table structure is the "old" one, which only differs from current by the absence of the 'hash' column in the 'event' table. This shouldn't affect much as long as you just assume that the jobs are all unique.
DB size
mysql> select count(*) from event; +----------+ | count(*) | +----------+ | 181908 | +----------+ 1 row in set (0.00 sec) mysql> select count(*) from event where name = 'sge.job'; +----------+ | count(*) | +----------+ | 122861 | +----------+ 1 row in set (1.01 sec) mysql> select count(*) from attr where name = 'group' and value = 'rhstar'; +----------+ | count(*) | +----------+ | 56361 | +----------+
num. of errors
How many errors were there in STAR jobs (answer = 1566 for this dataset)
select count(event.id) from attr as a1
join event on a1.e_id = event.id
join attr as a2 on a2.e_id = event.id
where a1.name = 'exit_status' and a1.value != '0' and
a2.name = 'group' and a2.value = 'rhstar' and
event.name = 'sge.job';
jobs with errors
List all SGE job_number 's for STAR jobs with errors -- #rows is same as result from (1)
select jobnum.value from attr as a1
join event on a1.e_id = event.id
join attr as a2 on a2.e_id = event.id
join attr as jobnum on jobnum.e_id = event.id
where a1.name = 'exit_status' and a1.value != '0' and
a2.name = 'group' and a2.value = 'rhstar' and
jobnum.name = 'job_number' and event.name = 'sge.job';
attrs of a job
You can use the previous query as a template for getting other attributes for a STAR job by simply adding joins to new aliases of the 'attr' table, selecting on the value and adding <alias>.name = 'attr_name' to the where clause, e.g., to also get 'job_name':
select jobnum.value as 'jobnum', jobname.value as 'jobname' from attr as a1 join event on a1.e_id = event.id join attr as a2 on a2.e_id = event.id join attr as jobnum on jobnum.e_id = event.id join attr as jobname on jobname.e_id = event.id where a1.name = 'exit_status' and a1.value != '0' and a2.name = 'group' and a2.value = 'rhstar' and jobnum.name = 'job_number' and jobname.name = 'job_name' and event.name = 'sge.job';
The full list of available attributes is:
mysql> select name from attr where e_id = 155; +-----------------+ | name | +-----------------+ | exit_status | | ru_inblock | | ru_minflt | | io | | owner | | slots | | job_number | | category | | ru_nvcsw | | group | | ru_isrss | | ru_stime | | ru_nsignals | | hostname | | priority | | failed | | ru_nivcsw | | department | | pe_taskid | | dur | | job_name | | status | | task_number | | granted_pe | | mem | | start_time | | ru_msgsnd | | ru_wallclock | | ru_utime | | submission_time | | maxvmem | | ru_oublock | | ru_nswap | | ru_majflt | | account | | iow | | qname | | project | | ru_ixrss | | ru_ismrss | | end_time | | ru_idrss | | ru_maxrss | | ru_msgrcv | | cpu | +-----------------+
