SG Job Timing and Status Statistics

From EGEE-see WIki

Jump to: navigation, search

In order to provide basic information to application developers and users on how to benchmark the performance of their applications on the SEE-GRID infrastructure, as well as the performance of the infrastructure itself, the following guide is prepared by AEGIS01-PHY-SCL site admins of the Institute of Physics Belgrade.

It is possible to develop various benchmarking methods for different attributes of Grid system in order to measure its quality of service. Quantities that define QoS are: response time, total performance, security, scalability of resources, etc. Acquiring timing information regarding different states in job life cycle can provide valuable information to both developers aiming to improve performance of Grid services and users who plan gridification of their applications.

Job life cycle, in the case of successfully executed job, consists of following states: submitted, waiting, ready, scheduled, running, done, cleared. Also, job can be aborted, when the conditions for its execution are not met, or canceled by the user. Different information about job status can be obtained by querying Logging and Bookkeeping (LB) service. An example of calculating time that elapsed from the start of submitting job to the entering in the specific state is presented below.

At the beginning of procedure you should record time which will be referenced as start time. It is accomplished by executing command date +%s (seconds since 1970-01-01 00:00:00 UTC) immediately before job submission. Recording time immediately after execution glite-job-submit command makes possible to calculate time (in seconds) needed for execution this command.

echo "Submitting job $JOB" > $DIR/jobsubmit.txt 
date +%s >> $DIR/jobsubmit.txt
glite-job-submit -o $DIR/jid $JOB >> $DIR/jobsubmit.txt 2>&1
date +%s >> $DIR/jobsubmit.txt

Details of job submission are in jobsubmit.txt:

Submitting job hello.jdl
 1190888105
 
 Selected Virtual Organisation name (from proxy certificate extension): aegis
 Connecting to host wms.phy.bg.ac.yu, port 7772
 Logging to host wms.phy.bg.ac.yu, port 9002
 
 ================================ glite-job-submit Success =====================================
  The job has been successfully submitted to the Network Server.
  Use glite-job-status command to check job current status. Your job identifier is:
 
  - https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ
 
  The job identifier has been saved in the following file:
  /home/neda/wiki1/jid
 =============================================================================================
 
 1190888108

Different information can be obtained by querying Logging and Bookkeeping service using glite-job-status command. Amount of available information is defined by option -v (verbose). When verbosity level 3 is used, data about enter times in different states are provided.

$ glite-job-status -v 3 -i $DIR/jid > $DIR/jobstatus.txt
$ cat $DIR/jobstatus.txt
 *************************************************************
BOOKKEEPING INFORMATION:

Status info for the Job : https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ
Current Status:     Done (Success)
Exit code:          0
Status Reason:      Job terminated successfully
Destination:        ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
Submitted:          Thu Sep 27 12:15:05 2007 CEST
---

- cancelling              =    0
- ce_node                 =    gt2 ce.phy.bg.ac.yu:2119/jobmanager-pbs
- children_num            =    0
- condorId                =    1409293
- cpuTime                 =    0
- destination             =    ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
- done_code               =    0
- expectUpdate            =    0
- jobtype                 =    0
- lastUpdateTime          =    Thu Sep 27 12:21:12 2007 CEST
- location                =    none
- network_server          =    wms.phy.bg.ac.yu:7772
- owner                   =    /C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka
- Payload_Running         =    0
- Possible_Ce_Nodes       =    gt2 ce.phy.bg.ac.yu:2119/jobmanager-pbs
- Possible Destinations   =    ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis
- resubmitted             =    0
- seed                    =    uLU0BArrdV98O41PLThJ5Q
- subjob_failed           =    0
---
- children_hist   =   0
     
- condor_jdl              =    
      stream_error = False
      +edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ"
      Arguments = 'UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000'
      GlobusScheduler = ce.phy.bg.ac.yu:2119/jobmanager-pbs
      Transfer_Executable = True 
      +ce_id = "ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis"
      Output = /var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardOutput
      Submit_Event_Notes = (https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ) (UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=
000000:APP=000000) (0)
      Copy_to_Spool = False
      Executable = /var/glite/jobcontrol/submit/VL/JobWrapper.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.sh
      X509UserProxy = /var/glite/spool/glite-renewd/331cd2bc3a3193ac9cd8fe562c6f1893.1
      error = /var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardError
      +LB_sequence_code = "UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000"
      Notification = never
      stream_output = False
      GlobusRSL = (queue=aegis)(jobtype=single)
      +Type = "job"
      Universe = grid
      +UserSubjectName = "/C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka"
      Log = /var/glite/logmonitor/CondorG.log/CondorG.1190844213.log
      grid_type = globus
      +CondorSubmitFile = "/var/glite/jobcontrol/submit/VL/Condor.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.submit"
      Queue 1
- jdl             =   
      
       [
        requirements = ( ( RegExp("^ce.phy.bg.ac.yu*",other.GlueCEUniqueID) ) && ( other.GlueCEStateStatus == "Production" ) ) && ( other.GlueCEStateStatus == "Production" ); 
        RetryCount = 3; 
        edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ"; 
        Arguments = ""; 
        MyProxyServer = "myproxy.phy.bg.ac.yu"; 
        JobType = "normal"; 
        Executable = "/bin/hostname"; 
        StdOutput = "message.txt"; 
        VOMS_FQAN = "/aegis/Role=NULL/Capability=NULL"; 
        OutputSandbox = { "message.txt","stderror" }; 
        LB_sequence_code = "UI=000002:NS=0000000000:WM=000000:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000"; 
        VirtualOrganisation = "aegis"; 
        rank =  -other.GlueCEStateEstimatedResponseTime; 
        Type = "job"; 
        StdError = "stderror"; 
        DefaultRank =  -other.GlueCEStateEstimatedResponseTime
       ]
- matched_jdl     =         
       [
        Arguments = 
         [
          JobAd = 
           [
            stream_error = false; 
            edg_jobid = "https://wms.phy.bg.ac.yu:9000/VLb70oyxfvqY4xUxm4VRvQ"; 
            GlobusScheduler = "ce.phy.bg.ac.yu:2119/jobmanager-pbs"; 
            Transfer_Executable = true; 
            ce_id = "ce.phy.bg.ac.yu:2119/jobmanager-pbs-aegis"; 
            Output = "/var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardOutput"; 
            Copy_to_Spool = false; 
            Executable = "/var/glite/jobcontrol/submit/VL/JobWrapper.https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ.sh"; 
            X509UserProxy = "/var/glite/spool/glite-renewd/331cd2bc3a3193ac9cd8fe562c6f1893.1"; 
            Error_ = "/var/glite/jobcontrol/condorio/VL/https_3a_2f_2fwms.phy.bg.ac.yu_3a9000_2fVLb70oyxfvqY4xUxm4VRvQ/StandardError"; 
            LB_sequence_code = "UI=000002:NS=0000000003:WM=000004:BH=0000000000:JSS=000000:LM=000000:LRMS=000000:APP=000000"; 
            Notification = "never"; 
            stream_output = false; 
            GlobusRSL = "(queue=aegis)(jobtype=single)"; 
            Type = "job"; 
            Universe = "grid"; 
            UserSubjectName = "/C=RS/O=AEGIS/OU=Institute of Physics Belgrade/CN=Neda Svraka"; 
            Log = "/var/glite/logmonitor/CondorG.log/CondorG.log"; 
            grid_type = "globus"
           ]
         ]; 
        Command = "Submit"; 
        Source = 2; 
        Protocol = "1.0.0"
       ]
- rsl             =   
      (queue=aegis)(jobtype=single)
- stateEnterTimes =   
      Submitted        : Thu Sep 27 12:15:05 2007 CEST
      Waiting          : Thu Sep 27 12:15:07 2007 CEST
      Ready            : Thu Sep 27 12:15:10 2007 CEST
      Scheduled        : Thu Sep 27 12:15:28 2007 CEST
      Running          : Thu Sep 27 12:20:54 2007 CEST
      Done             : Thu Sep 27 12:21:12 2007 CEST
      Cleared          :                ---
      Aborted          :                ---
      Cancelled        :                ---
      Unknown          :                ---

*************************************************************

The current status of job and state entering times can be extracted :

$ egrep 'Current Status | Submitted | Waiting | Ready | Scheduled | Running | Done | Cleared | Aborted | Cancelled | Unknown' jobstatus.txt > $DIR/timestatus.txt
$ cat $DIR/timestatus.txt
Current Status:     Done (Success)
      Submitted        : Thu Sep 27 12:15:05 2007 CEST
      Waiting          : Thu Sep 27 12:15:07 2007 CEST
      Ready            : Thu Sep 27 12:15:10 2007 CEST
      Scheduled        : Thu Sep 27 12:15:28 2007 CEST
      Running          : Thu Sep 27 12:20:54 2007 CEST
      Done             : Thu Sep 27 12:21:12 2007 CEST
      Cleared          :                ---
      Aborted          :                ---
      Cancelled        :                ---
      Unknown          :                ---

By knowing these times and start time (the moment the user submitted the job) it is possible to calculate amount of time needed for job to enter any of these states. For example, for the state Submitted, entering time can be expressed in seconds since 1970-01-01 00:00:00 UTC

$ grep 'Submitted' $DIR/timestatus.txt | grep -v 'Current Status' | awk '{print $4, $5, $6}' 
Sep 27 12:15:05
$ date -d "Sep 27 12:15:05" +%s
1190888105

and after that, the difference between start time and this time can be calculated. In this manner you can calculate time elapsed from the moment the user started submission of job to the moment of entering a state you are interested for.

This method also can be used when the user submits a sequence of jobs and acquire data about total time needed to submit jobs, avarage time of job submission, average state consuming time per job, etc.

As seen above, information of job status also can be obtained by querying LB service. For this it is not necessary to use verbosity level 3 of glite-job-status command:

$ egrep 'Current | Reason'  $DIR/jobstatus.txt 
Current Status:     Done (Success)
Status Reason:      Job terminated successfully

This information can be used for calculating percentage of successfully executed jobs or other state of interest (Aborted, Cancelled, etc). If this is combined with timing data mentioned above, statistic about number of jobs in particular state in certain moment can be created.

In this example job was submitted via Network Server, so the related set of commands was used.The same approach can be used when jobs are submitted via WMProxy using adequate set of commands (glite-wms-job-...).

Personal tools