SG Job Timing and Status Statistics
From EGEE-see WIki
In order to provide basic information to application developers and users on how to benchmark the performance of their applications on the SEE-GRID infrastructure, as well as the performance of the infrastructure itself, the following guide is prepared by AEGIS01-PHY-SCL site admins of the Institute of Physics Belgrade.
It is possible to develop various benchmarking methods for different attributes of Grid system in order to measure its quality of service. Quantities that define QoS are: response time, total performance, security, scalability of resources, etc. Acquiring timing information regarding different states in job life cycle can provide valuable information to both developers aiming to improve performance of Grid services and users who plan gridification of their applications.
Job life cycle, in the case of successfully executed job, consists of following states: submitted, waiting, ready, scheduled, running, done, cleared. Also, job can be aborted, when the conditions for its execution are not met, or canceled by the user. Different information about job status can be obtained by querying Logging and Bookkeeping (LB) service. An example of calculating time that elapsed from the start of submitting job to the entering in the specific state is presented below.
At the beginning of procedure you should record time which will be referenced as start time. It is accomplished by executing command date +%s (seconds since 1970-01-01 00:00:00 UTC) immediately before job submission. Recording time immediately after execution glite-job-submit command makes possible to calculate time (in seconds) needed for execution this command.
echo "Submitting job $JOB" > $DIR/jobsubmit.txt date +%s >> $DIR/jobsubmit.txt glite-job-submit -o $DIR/jid $JOB >> $DIR/jobsubmit.txt 2>&1 date +%s >> $DIR/jobsubmit.txt
Details of job submission are in jobsubmit.txt:
Submitting job hello.jdl 1190888105 Selected Virtual Organisation name (from proxy certificate extension): aegis Connecting to host wms.phy.bg.ac.yu, port 7772 Logging to host wms.phy.bg.ac.yu, port 9002 ================================ glite-job-submit Success ===================================== The job has been successfully submitted to the Network Server. Use glite-job-status command to check job current status. Your job identifier is: - https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ The job identifier has been saved in the following file: /home/neda/wiki1/jid ============================================================================================= 1190888108
Different information can be obtained by querying Logging and Bookkeeping service using glite-job-status command. Amount of available information is defined by option -v (verbose). When verbosity level 3 is used, data about enter times in different states are provided.
$ glite-job-status -v 3 -i $DIR/jid > $DIR/jobstatus.txt
$ cat $DIR/jobstatus.txt
*************************************************************
BOOKKEEPING INFORMATION:
Status info for the Job : https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ
Current Status: Done (Success)
Exit code: 0
Status Reason: Job terminated successfully
Destination: ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
Submitted: Thu Sep 27 12:15:05 2007 CEST
---
- cancelling = 0
- ce_node = gt2 ce.phy.bg.ac.yu:2119/jobmanager-pbs
- children_num = 0
- condorId = 1409293
- cpuTime = 0
- destination = ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
- done_code = 0
- expectUpdate = 0
- jobtype = 0
- lastUpdateTime = Thu Sep 27 12:21:12 2007 CEST
- location = none
- network_server = wms.phy.bg.ac.yu:7772
- owner = /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka
- Payload_Running = 0
- Possible_Ce_Nodes = gt2 ce.phy.bg.ac.yu:2119/jobmanager-pbs
- Possible Destinations = ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
- resubmitted = 0
- seed = uLU0BArrdV98O41PLThJ5Q
- subjob_failed = 0
---
- children_hist = 0
- condor_jdl =
stream_error = False
+edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ"
Arguments = 'UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000'
GlobusScheduler = ce.phy.bg.ac.yu:2119/jobmanager-pbs
Transfer_Executable = True
+ce_id = "ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis"
Output = /var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardOutput
Submit_Event_Notes = (https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ) (UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=
000000:APP=000000) (0)
Copy_to_Spool = False
Executable = /var/glite/jobcontrol/submit/VL/JobWrapper.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.sh
X509UserProxy = /var/glite/spool/glite-renewd/331cd2bc3a3193ac9cd8fe562c6f1893.1
error = /var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardError
+LB_sequence_code = "UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000"
Notification = never
stream_output = False
GlobusRSL = (queue=aegis)(jobtype=single)
+Type = "job"
Universe = grid
+UserSubjectName = "/C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka"
Log = /var/glite/logmonitor/CondorG.log/CondorG.1190844213.log
grid_type = globus
+CondorSubmitFile = "/var/glite/jobcontrol/submit/VL/Condor.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.submit"
Queue 1
- jdl =
[
requirements = ( ( RegExp("^ce.phy.bg.ac.yu*",other.GlueCEUniqueID) ) && ( other.GlueCEStateStatus == "Production" ) ) && ( other.GlueCEStateStatus == "Production" );
RetryCount = 3;
edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ";
Arguments = "";
MyProxyServer = "myproxy.phy.bg.ac.yu";
JobType = "normal";
Executable = "/bin/hostname";
StdOutput = "message.txt";
VOMS_FQAN = "/aegis/Role=NULL/Capability=NULL";
OutputSandbox = { "message.txt","stderror" };
LB_sequence_code = "UI=000002:NS=0000000000:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000";
VirtualOrganisation = "aegis";
rank = -other.GlueCEStateEstimatedResponseTime;
Type = "job";
StdError = "stderror";
DefaultRank = -other.GlueCEStateEstimatedResponseTime
]
- matched_jdl =
[
Arguments =
[
JobAd =
[
stream_error = false;
edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ";
GlobusScheduler = "ce.phy.bg.ac.yu:2119/jobmanager-pbs";
Transfer_Executable = true;
ce_id = "ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis";
Output = "/var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardOutput";
Copy_to_Spool = false;
Executable = "/var/glite/jobcontrol/submit/VL/JobWrapper.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.sh";
X509UserProxy = "/var/glite/spool/glite-renewd/331cd2bc3a3193ac9cd8fe562c6f1893.1";
Error_ = "/var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardError";
LB_sequence_code = "UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000";
Notification = "never";
stream_output = false;
GlobusRSL = "(queue=aegis)(jobtype=single)";
Type = "job";
Universe = "grid";
UserSubjectName = "/C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka";
Log = "/var/glite/logmonitor/CondorG.log/CondorG.log";
grid_type = "globus"
]
];
Command = "Submit";
Source = 2;
Protocol = "1.0.0"
]
- rsl =
(queue=aegis)(jobtype=single)
- stateEnterTimes =
Submitted : Thu Sep 27 12:15:05 2007 CEST
Waiting : Thu Sep 27 12:15:07 2007 CEST
Ready : Thu Sep 27 12:15:10 2007 CEST
Scheduled : Thu Sep 27 12:15:28 2007 CEST
Running : Thu Sep 27 12:20:54 2007 CEST
Done : Thu Sep 27 12:21:12 2007 CEST
Cleared : ---
Aborted : ---
Cancelled : ---
Unknown : ---
*************************************************************
The current status of job and state entering times can be extracted :
$ egrep 'Current Status | Submitted | Waiting | Ready | Scheduled | Running | Done | Cleared | Aborted | Cancelled | Unknown' jobstatus.txt > $DIR/timestatus.txt
$ cat $DIR/timestatus.txt
Current Status: Done (Success)
Submitted : Thu Sep 27 12:15:05 2007 CEST
Waiting : Thu Sep 27 12:15:07 2007 CEST
Ready : Thu Sep 27 12:15:10 2007 CEST
Scheduled : Thu Sep 27 12:15:28 2007 CEST
Running : Thu Sep 27 12:20:54 2007 CEST
Done : Thu Sep 27 12:21:12 2007 CEST
Cleared : ---
Aborted : ---
Cancelled : ---
Unknown : ---
By knowing these times and start time (the moment the user submitted the job) it is possible to calculate amount of time needed for job to enter any of these states. For example, for the state Submitted, entering time can be expressed in seconds since 1970-01-01 00:00:00 UTC
$ grep 'Submitted' $DIR/timestatus.txt | grep -v 'Current Status' | awk '{print $4, $5, $6}'
Sep 27 12:15:05
$ date -d "Sep 27 12:15:05" +%s
1190888105
and after that, the difference between start time and this time can be calculated. In this manner you can calculate time elapsed from the moment the user started submission of job to the moment of entering a state you are interested for.
This method also can be used when the user submits a sequence of jobs and acquire data about total time needed to submit jobs, avarage time of job submission, average state consuming time per job, etc.
As seen above, information of job status also can be obtained by querying LB service. For this it is not necessary to use verbosity level 3 of glite-job-status command:
$ egrep 'Current | Reason' $DIR/jobstatus.txt Current Status: Done (Success) Status Reason: Job terminated successfully
This information can be used for calculating percentage of successfully executed jobs or other state of interest (Aborted, Cancelled, etc). If this is combined with timing data mentioned above, statistic about number of jobs in particular state in certain moment can be created.
In this example job was submitted via Network Server, so the related set of commands was used.The same approach can be used when jobs are submitted via WMProxy using adequate set of commands (glite-wms-job-...).
