Pegasus Sample Queries
From CEDPS
Contents |
Background
The Pegasus workflow tool produces "quickstart" invocation logs in XML.
The CEDPS project has a general-purpose pipeline for parsing and loading data into a database.
This page describes an experiment where we parsed and loaded a large set of quickstart records into the CEDPS database, then performed some queries on the data.
Note: This was done on MySQL. The date functions and perhaps some other ones will be different for other servers.
Sample dataset
Sample results for a 800K invocation record data set are shown.
Each invocation record was mapped to a single NetLogger event with several attributes.
The NetLogger events look like this (all on one line):
ts=2007-11-30T07:03:44.125-06:00 event=pegasus.invocation level=Info status=0 host=tg-c382.ncsa.teragrid.org user=scottcal duration=0.121 guid=2d2e3415a5872cfccb79112404951195 type=scec::PeakValCalc_Okaya:1.0
Parse/Load Performance
The time to parse/load the data serially was:
- Parse 800K invocation records: ~40 minutes (41 min, 5 sec)
- Load parsed data into database: ~40 minutes
Basic queries
Answer these questions
1. how many jobs ran on a given day
2. what was the cumulative runtime of these jobs
3. how many jobs ran on given hosts
4. how many jobs of a given type ran on a given day
5. how many jobs failed
6. how many jobs succeeded
Database setup
- Use UTC time zone
set time_zone = '-00:00';
- Get date range of data
select from_unixtime(min(time)), from_unixtime(max(time)) from event;
- Get time in seconds since epoch for a given day
select unix_timestamp('2007-11-30');
Queries
The day chosen was Nov. 30, 2007, which had the majority (85%) of the data.
How many jobs ran on a given day
select count(id) from event
where time >= unix_timestamp('2007-11-30') and time < unix_timestamp('2007-12-01')
and name = 'pegasus.invocation';
- Results
+-----------+ | count(id) | +-----------+ | 754063 | +-----------+ 1 row in set (0.81 sec)
What was the cumulative runtime of these jobs
select sum(value) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time < unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'duration';
- Results
+-----------------+ | sum(value) | +-----------------+ | 18973394.794999 | +-----------------+ 1 row in set (25.75 sec)
How many jobs ran on given hosts
select count(id), value from event join attr on e_id = id where event.name = 'pegasus.invocation' and attr.name = 'host' group by value;
- Results
+-----------+---------------------------+ | count(id) | value | +-----------+---------------------------+ | 2290 | tg-c037.ncsa.teragrid.org | | 2540 | tg-c038.ncsa.teragrid.org | | 2830 | tg-c040.ncsa.teragrid.org | ..... | 580 | tg-c921.ncsa.teragrid.org | | 720 | tg-c922.ncsa.teragrid.org | | 600 | tg-c923.ncsa.teragrid.org | +-----------+---------------------------+ 361 rows in set (34.96 sec)
If you dump the results to an output file, e.g.:
select ... into outfile "foo.out"
Then you can graph the results very simply with the R statistical package
library(lattice)
d <- read.table("foo.out")
densityplot(V2 ~ V1, d, main="Kernel density plot of #jobs per host")
The plot produced is shown below. Here's a zoomable hi-res PDF version.
Broken down by hour
This time, I ran within R using its RMySQL package. Here is the entire set of R commands:
# connect to DB
c <- dbConnect(MySQL(), dbname="pegasus2")
# query
q2 <- "select count(id) as 'count', hour(from_unixtime(time)) as 'hour', value from event join attr on e_id = id
where event.name = 'pegasus.invocation' and attr.name = 'host' and
time >= unix_timestamp('2007-11-30') and time < unix_timestamp('2007-12-01')
group by value, hour(from_unixtime(time))"
# get result
d2 <- dbGetQuery(c, q2)
# plot result
histogram(~count|as.factor(hour), d2, scales=list(x=list(log=10)),
main="histogram of number of jobs per host, by hour\n11/30/2007 cybershake run", type="count")
- Results
- What happened to hour 1?
- After some investigation, it turns out that MySQL thinks that times between 23:59:59.508 and 23:59:59.971 are actually hour 0 on the next day. AFAICT this means it rounds the fractional seconds up (!!) So, the run started on hour 1 (UTC) and the data from hour 0 is actually from the next day.
How many jobs of a given type ran on a given day
select attr.value, count(id) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time <= unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'type'
group by attr.value;
- Results
+--------------------------------+-----------+ | value | count(id) | +--------------------------------+-----------+ | scec::extract_sgt:1.0 | 7907 | | scec::PeakValCalc_Okaya:1.0 | 339369 | | scec::seismogram_synthesis:1.0 | 406787 | +--------------------------------+-----------+ 3 rows in set (24.47 sec)
How many jobs failed
Note: This result suggests that we're grabbing the wrong indication of 'success' from the original data.
select count(id) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time <= unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'status' and attr.value != '0';
- Results
+-----------+ | count(id) | +-----------+ | 0 | +-----------+ 1 row in set (9.42 sec)
How many jobs succeeded
Note: in addition to the previous note, this suggests it's quicker to do the previous query and subtract the number of records! This is probably an artifact of the string comparison, which itself may be eliminated (for numeric values) in future versions of the schema.
select count(id) from attr join event on e_id = id
where time >= unix_timestamp('2007-11-30') and time <= unix_timestamp('2007-12-01')
and event.name = 'pegasus.invocation' and attr.name = 'status' and attr.value = '0';
- Results
+-----------+ | count(id) | +-----------+ | 754063 | +-----------+ 1 row in set (25.93 sec)
More queries
From Karan Vahi, 7/28/2008
Queries Per Workflow Where Workflow ID Is a DAX Label
- Total number of jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id where attr.name = 'status' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
- Total number of succeeded jobs
select count(attr.e_id) from attr join ident on attr.e_id = ident.e_id where attr.name = 'status' and attr.value = '0' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
- Breakdown of jobs
select attr.value, count(attr.e_id) from attr
join ident on attr.e_id = ident.e_id
where ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%' and
attr.name='type' group by attr.value;
- Total Runtime of jobs
select sum(attr.value) from attr join ident on attr.e_id=ident.e_id where attr.name='duration' and ident.name='workflow' and ident.value LIKE 'CyberShake_WNGC%';
Queries Per Workflow Per Job Type
- Runtime Breakdown by job type per workflow
select TRANSFORMATION, count(TRANSFORMATION) as number ,round(sum(attr.value),2) as sum_seconds, round(sum(attr.value)/(3600),2) as sum_hours, round(avg(attr.value),2) as avg_seconds from attr join (select attr.e_id as event_id, attr.value as TRANSFORMATION from attr join ident on attr.e_id=ident.e_id where attr.name='type' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on attr.e_id=event_id WHERE attr.name='duration' group by TRANSFORMATION;
- Number of failures by job type per workflow
select TRANSFORMATION, count(TRANSFORMATION) as failures from attr join (select attr.e_id as event_id, attr.value as TRANSFORMATION from attr join ident on attr.e_id=ident.e_id where attr.name='type' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%') ident on attr.e_id=event_id WHERE attr.name = 'status' and attr.value != '0' group by TRANSFORMATION;
Queries Per Unit Time Per Workflow
- Jobs Per Day Per Workflow
select count(id) as 'count', day(from_unixtime(time)) as day from event
join attr on attr.e_id = event.id
join ident on attr.e_id=ident.e_id
where event.name = 'pegasus.invocation' and attr.name = 'host' and
ident.name='workflow' and
ident.value LIKE 'CyberShake_CCP%'
group by day;
- Jobs Per Hour Per Day Per Workflow
select count(id) as 'count', hour(from_unixtime(time)) as hour, day(from_unixtime(time)) as day from event join attr on attr.e_id = event.id join ident on attr.e_id=ident.e_id where event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_CCP%' group by hour, day;
- Jobs Per Host Per Hour Per Workflow
The below is not complete as I am not bounding on days in it. select count(id) as 'count', hour(from_unixtime(time)) as 'hour', attr.value as value from event join attr on attr.e_id = event.id join ident on attr.e_id=ident.e_id where event.name = 'pegasus.invocation' and attr.name = 'host' and ident.name='workflow' and ident.value LIKE 'CyberShake_USC%' group by value, hour;


