Pegasus Sample Queries

From CEDPS

Jump to: navigation, search

Contents

Background

The Pegasus workflow tool produces "quickstart" invocation logs in XML.

The CEDPS project has a general-purpose pipeline for parsing and loading data into a database.

This page describes an experiment where we parsed and loaded a large set of quickstart records into the CEDPS database, then performed some queries on the data.

Note: This was done on MySQL. The date functions and perhaps some other ones will be different for other servers.

Sample dataset

Sample results for a 800K invocation record data set are shown.

Each invocation record was mapped to a single NetLogger event with several attributes.

The NetLogger events look like this (all on one line):

ts=2007-11-30T07:03:44.125-06:00 event=pegasus.invocation level=Info 
status=0 host=tg-c382.ncsa.teragrid.org user=scottcal duration=0.121 
guid=2d2e3415a5872cfccb79112404951195 type=scec::PeakValCalc_Okaya:1.0

Parse/Load Performance

The time to parse/load the data serially was:

  • Parse 800K invocation records: ~40 minutes (41 min, 5 sec)
  • Load parsed data into database: ~40 minutes


Basic queries

Answer these questions

1. how many jobs ran on a given day

2. what was the cumulative runtime of these jobs

3. how many jobs ran on given hosts

4. how many jobs of a given type ran on a given day

5. how many jobs failed

6. how many jobs succeeded

Database setup

  • Use UTC time zone
set time_zone = '-00:00';
  • Get date range of data
select from_unixtime(min(time)), from_unixtime(max(time)) from event;
  • Get time in seconds since epoch for a given day
 select unix_timestamp('2007-11-30');

Queries

The day chosen was Nov. 30, 2007, which had the majority (85%) of the data.

How many jobs ran on a given day

select count(id) from event 
where time >= unix_timestamp('2007-11-30') and time < unix_timestamp('2007-12-01')
and name = 'pegasus.invocation';
Results
+-----------+
| count(id) |
+-----------+
|    754063 | 
+-----------+
1 row in set (0.81 sec)

What was the cumulative runtime of these jobs

select sum(value) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time < unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation'  and attr.name = 'duration';
Results
+-----------------+
| sum(value)      |
+-----------------+
| 18973394.794999 | 
+-----------------+
1 row in set (25.75 sec)


How many jobs ran on given hosts

select count(id), value from event join attr on e_id = id 
where event.name = 'pegasus.invocation' and attr.name = 'host' 
group by value;
Results
+-----------+---------------------------+
| count(id) | value                     |
+-----------+---------------------------+
|      2290 | tg-c037.ncsa.teragrid.org | 
|      2540 | tg-c038.ncsa.teragrid.org | 
|      2830 | tg-c040.ncsa.teragrid.org | 
.....
|       580 | tg-c921.ncsa.teragrid.org | 
|       720 | tg-c922.ncsa.teragrid.org | 
|       600 | tg-c923.ncsa.teragrid.org | 
+-----------+---------------------------+
361 rows in set (34.96 sec)

If you dump the results to an output file, e.g.:

select ... into outfile "foo.out"

Then you can graph the results very simply with the R statistical package

library(lattice)
d <- read.table("foo.out")
densityplot(V2 ~ V1, d, main="Kernel density plot of #jobs per host")

The plot produced is shown below. Here's a zoomable hi-res PDF version.

Jobs per Host
Broken down by hour

This time, I ran within R using its RMySQL package. Here is the entire set of R commands:

# connect to DB
c <- dbConnect(MySQL(), dbname="pegasus2")
# query
q2 <- "select count(id) as 'count', hour(from_unixtime(time)) as 'hour', value from event join attr on e_id = id
where event.name = 'pegasus.invocation' and attr.name = 'host' and 
time >= unix_timestamp('2007-11-30') and time <  unix_timestamp('2007-12-01') 
group by value, hour(from_unixtime(time))"
# get result
d2 <- dbGetQuery(c, q2)
# plot result
histogram(~count|as.factor(hour), d2, scales=list(x=list(log=10)), 
  main="histogram of number of jobs per host, by hour\n11/30/2007 cybershake run", type="count")
Results
Jobs per Host, broken down by hour

zoomable hi-res PDF version

What happened to hour 1?
After some investigation, it turns out that MySQL thinks that times between 23:59:59.508 and 23:59:59.971 are actually hour 0 on the next day. AFAICT this means it rounds the fractional seconds up (!!) So, the run started on hour 1 (UTC) and the data from hour 0 is actually from the next day.

How many jobs of a given type ran on a given day

select attr.value, count(id) from attr join event on e_id = id 
where time >= unix_timestamp('2007-11-30') and time <= unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'type' 
group by attr.value;
Results
+--------------------------------+-----------+
| value                          | count(id) |
+--------------------------------+-----------+
| scec::extract_sgt:1.0          |      7907 | 
| scec::PeakValCalc_Okaya:1.0    |    339369 | 
| scec::seismogram_synthesis:1.0 |    406787 | 
+--------------------------------+-----------+
3 rows in set (24.47 sec)

How many jobs failed

Note: This result suggests that we're grabbing the wrong indication of 'success' from the original data.

select count(id) from attr join event on e_id = id  
where time >= unix_timestamp('2007-11-30') and time <= unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'status' and attr.value != '0';
Results
+-----------+
| count(id) |
+-----------+
|         0 | 
+-----------+
1 row in set (9.42 sec)

How many jobs succeeded

Note: in addition to the previous note, this suggests it's quicker to do the previous query and subtract the number of records! This is probably an artifact of the string comparison, which itself may be eliminated (for numeric values) in future versions of the schema.

select count(id) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time <=  unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'status' and attr.value = '0';
Results
+-----------+
| count(id) |
+-----------+
|    754063 | 
+-----------+
1 row in set (25.93 sec)

More queries

From Karan Vahi, 7/28/2008

Queries Per Workflow Where Workflow ID Is a DAX Label

  • Total number of jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
where  attr.name = 'status' and ident.name='workflow' and ident.value
LIKE 'CyberShake_WNGC%';
  • Total number of succeeded jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id
where  attr.name = 'status' and attr.value = '0' and
ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
  • Breakdown of jobs
select attr.value, count(attr.e_id) from attr 
join ident on attr.e_id = ident.e_id
where  ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%'  and
       attr.name='type' group by attr.value;


  • Total Runtime of jobs
select sum(attr.value) from attr join ident on attr.e_id=ident.e_id
where attr.name='duration' and ident.name='workflow' and ident.value
LIKE 'CyberShake_WNGC%';

Queries Per Workflow Per Job Type

  • Runtime Breakdown by job type per workflow
 select TRANSFORMATION, count(TRANSFORMATION) as number
 ,round(sum(attr.value),2) as sum_seconds,
 round(sum(attr.value)/(3600),2) as sum_hours, round(avg(attr.value),2)
 as avg_seconds from attr join (select attr.e_id as event_id,
 attr.value as TRANSFORMATION from  attr join ident on
 attr.e_id=ident.e_id  where attr.name='type' and
 ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident
 on attr.e_id=event_id WHERE attr.name='duration' group by
 TRANSFORMATION;
  • Number of failures by job type per workflow
select TRANSFORMATION, count(TRANSFORMATION) as failures from attr
join (select attr.e_id as event_id, attr.value as TRANSFORMATION from
attr join ident on attr.e_id=ident.e_id  where attr.name='type' and
ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on
attr.e_id=event_id WHERE attr.name = 'status' and attr.value != '0'
group by TRANSFORMATION;

Queries Per Unit Time Per Workflow

  • Jobs Per Day Per Workflow
select count(id) as 'count', day(from_unixtime(time)) as day from event 
  join attr on attr.e_id = event.id 
  join ident on attr.e_id=ident.e_id 
  where event.name = 'pegasus.invocation' and attr.name = 'host' and 
        ident.name='workflow' and 
        ident.value LIKE 'CyberShake_CCP%' 
  group by day;
  • Jobs Per Hour Per Day Per Workflow
select count(id) as 'count', hour(from_unixtime(time)) as hour,
day(from_unixtime(time)) as day from event join attr on attr.e_id =
event.id join ident on attr.e_id=ident.e_id where event.name =
'pegasus.invocation' and attr.name = 'host' and ident.name='workflow'
and ident.value LIKE 'CyberShake_CCP%' group by hour, day;
  • Jobs Per Host Per Hour Per Workflow
The below is not complete as I am not bounding on days in it.
select count(id) as 'count', hour(from_unixtime(time)) as 'hour',
attr.value  as value from event join attr on attr.e_id = event.id join
ident on attr.e_id=ident.e_id where event.name = 'pegasus.invocation'
and attr.name = 'host' and ident.name='workflow' and ident.value LIKE
'CyberShake_USC%' group by value, hour;
Personal tools