Advanced Sandbox Management
From EGEE-see WIki
This guide is part of the SEE-GRID Gridification Guide. It explains different ways of providing and storing job input and output files, and gives a couple of examples of JDL files that use local and network files in their sandboxes.
We assume that you already know how to create a proxy and submit a job, and that you have access to a configured glite-UI. If this is not the case, please read the Job Submission guide first. All examples in this guide are executed on a glite-UI.
Locating the files
Developers of applications and data producers should publish the locations of their files within their user communities or Virtual Organization (VO). Different communities have different ways of doing that: through their web pages, various implementations of data catalogs, specific search tools, etc.
The location of the data and applications is usually decided at the VO level. If you don't already know which storage element (SE) you should use, and you are free to choose where to store your files, here is an example of how you can make your pick.
First, choose a storage element from the list of SEs that support your VO. The available SEs for the seegrid VO can be listed with the following command:
[danica@ui simpleJob]$ lcg-infosites --vo seegrid se
Avail Space(Kb)  Used Space(Kb)  Type  SEs
----------------------------------------------------------
1045812474       1101671174      n.a   se.ngcc.acad.bg
734025271        229837432       n.a   se001.imbm.bas.bg
10319728715      5770829040      n.a   se.ipb.ac.rs
33328817         1873434         n.a   grid-ce.etf.bg.ac.rs
.....
Now pick one of the SEs from this list that you think will suit your needs, for example grid-ce.etf.bg.ac.rs. The next step is to find the location on this SE where you can store your data. It is usually of the form <host_name>/dpm/<site_domain>/home/<voname>. You can use the lcg-ls command with the srm protocol for this purpose (add srm:// in front of the SE hostname):
[danica@ui simpleJob]$ lcg-ls srm://grid-ce.etf.bg.ac.rs/
//dpm
[danica@ui simpleJob]$ lcg-ls srm://grid-ce.etf.bg.ac.rs/dpm
/dpm/etf.bg.ac.rs
....
[danica@ui simpleJob]$ lcg-ls srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/
/dpm/etf.bg.ac.rs/home/seegrid/
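The usual path layout mentioned above can be sketched as a small Python helper. This is only an illustration: the function name dpm_home_surl and its parameters are hypothetical, and the layout itself is the "usual" form, so the result should still be confirmed with lcg-ls on the SE you actually pick.

```python
def dpm_home_surl(se_host, site_domain, vo, *subdirs):
    """Build an SRM URL of the usual form
    srm://<host_name>/dpm/<site_domain>/home/<voname>[/<subdirs>...].

    Hypothetical helper: it only assembles a string and never contacts the SE.
    """
    parts = ["dpm", site_domain, "home", vo, *subdirs]
    # Strip stray slashes so the pieces join cleanly.
    path = "/".join(part.strip("/") for part in parts)
    return "srm://" + se_host + "/" + path


# Candidate location for the hello-job directory used later in this guide.
print(dpm_home_surl("grid-ce.etf.bg.ac.rs", "etf.bg.ac.rs", "seegrid", "hello-job"))
```

The printed URL is a candidate only. Sites may use a different layout, which is why the step-by-step lcg-ls listing above remains the reliable check.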
Uploading the files manually
We would like to register our application (hello.sh) in a sub-directory named hello-job. We use lcg-cr for that purpose:
[danica@ui simpleJob]$ lcg-cr -d srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/hello.sh \
    -l /grid/seegrid/hello-job/hello.sh hello.sh
guid:3ac8147e-de67-44ca-ae05-1a088bee190f
This command creates the hello-job sub-directory on the SE (if it doesn't exist), copies the file to the specified location using the srm protocol, registers the file in the LFC catalog (specified in $LFC_HOST) under the logical file name /grid/seegrid/hello-job/hello.sh, and returns the file's unique identifier (GUID). The target directory in your LFC catalog must already exist, or the registration of the file will fail.
Preparing directory for output files
If we want to store our output files on the same SE, in a subdirectory called, for example, output, we first need to create this directory. To do so, we use a little trick: we create an empty file, upload it to the path where we want to store the output of our job, and then simply remove that file.
[danica@ui simpleJob]$ touch temp
[danica@ui simpleJob]$ lcg-cr -d srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/output/temp temp
guid:3247e05c-d9a4-49d5-a23b-3c5cc748419f
[danica@ui simpleJob]$ lcg-del -a guid:3247e05c-d9a4-49d5-a23b-3c5cc748419f
[danica@ui simpleJob]$ rm temp
With lcg-ls we can now check the content of the hello-job directory:
[danica@ui simpleJob]$ lcg-ls srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job
/dpm/etf.bg.ac.rs/home/seegrid/hello-job/hello.sh
/dpm/etf.bg.ac.rs/home/seegrid/hello-job/output
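If you need to prepare several output directories, the placeholder trick above can be scripted. The following Python sketch only builds the command lines, mirroring the lcg-cr and lcg-del invocations shown in this section. The function names are hypothetical, and nothing is executed: on a UI you could pass each list to subprocess.run and read the GUID from the lcg-cr output.

```python
def placeholder_upload_command(dir_surl, placeholder="temp"):
    """Command that uploads an empty local placeholder file into dir_surl,
    which creates the directory on the SE as a side effect.
    Mirrors: lcg-cr -d <dir_surl>/<placeholder> <placeholder>
    """
    return ["lcg-cr", "-d", dir_surl.rstrip("/") + "/" + placeholder, placeholder]


def placeholder_delete_command(guid):
    """Command that removes the placeholder again, given the GUID printed
    by lcg-cr. Mirrors: lcg-del -a <guid>
    """
    return ["lcg-del", "-a", guid]


output_dir = "srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/output"
print(" ".join(placeholder_upload_command(output_dir)))
print(" ".join(placeholder_delete_command("guid:3247e05c-d9a4-49d5-a23b-3c5cc748419f")))
```

The empty local file still has to be created before the upload (touch temp) and removed afterwards (rm temp), exactly as in the manual session.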
Sandbox files on grid
Let's now take a look at our job JDL:
[danica@ui simpleJob]$ cat simpleJob.jdl
[
Executable = "/bin/bash";
Arguments = "hello.sh" ;
StdOutput = "stdout.txt";
StdError = "stderr.txt";
InputSandbox = {"gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/hello.sh"};
OutputSandbox = {"stdout.txt", "stderr.txt"};
OutputSandboxBaseDestURI = "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/output/";
]
The first thing that we changed here is the content of InputSandbox. Instead of pointing to the local file hello.sh, it now points to the location on the SE where we previously stored it. Note that here we have to use the gsiftp protocol instead of srm.
The OutputSandbox remains the same, but a new attribute, OutputSandboxBaseDestURI, is added. It specifies the location where the job will upload the output files. If we omitted this attribute, the output files would have to be retrieved manually with glite-wms-job-output. The advantage of using an SE for storing output is that the files are available as soon as the job finishes, and you do not depend on WMS availability to get your results.
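Since the sandbox attributes expect gsiftp URIs while lcg-cr and lcg-ls were used with srm URLs, a tiny conversion helper can be handy. The sketch below assumes, as in the example above, that the same host and path are reachable over gsiftp. That holds for the SE used in this guide, but it is not guaranteed everywhere. The function name srm_to_gsiftp is hypothetical.

```python
def srm_to_gsiftp(surl):
    """Rewrite srm://host/path as gsiftp://host/path.

    Assumption: the SE exposes the same path over gsiftp, as in the
    examples in this guide. Only the scheme is changed.
    """
    prefix = "srm://"
    if not surl.startswith(prefix):
        raise ValueError("expected an srm:// URL, got: " + surl)
    return "gsiftp://" + surl[len(prefix):]


print(srm_to_gsiftp(
    "srm://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/hello.sh"))
```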
Multiple input files at the same location
If you have more than one file in the same location on an SE, instead of writing the whole path for every file you can make use of InputSandboxBaseURI, like this:
InputSandbox = {"hello.sh","inputA.txt","myInput/B.txt"};
InputSandboxBaseURI = "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/";
which is equivalent to
InputSandbox = {"gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/hello.sh",
"gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/inputA.txt",
"gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/myInput/B.txt"
};
Multiple input files in different locations
If, in addition to files stored on an SE, you also have an input file stored locally on your UI, you should specify that by adding the file:// protocol in front of the absolute path of your file. If you want to add an input file stored on another SE, you can do so by specifying the full path to it (including gsiftp). Here is an example of a mixed InputSandbox:
InputSandbox = {"hello.sh","inputA.txt","myInput/B.txt",
"/tmp/hello/local-input-1.txt",
"file:///tmp/hello/local-input-2.txt",
"gsiftp://se.ipb.ac.rs/dpm/ipb.ac.rs/home/seegrid/hello/myInput"};
InputSandboxBaseURI = "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/";
If we omit InputSandboxBaseURI from this JDL, the first three files would be considered to be stored locally on the UI. The fourth file is considered local even when InputSandboxBaseURI is specified, because it is an absolute path (it starts with a slash).
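The resolution rules described in this section can be summarised in a short Python sketch. It is not the WMS implementation, only an illustration of the rules as stated here. The function name resolve_input_sandbox is hypothetical, and writing local absolute paths in file:// form is a presentation choice made for this example.

```python
def resolve_input_sandbox(entries, base_uri=None):
    """Illustrate where each InputSandbox entry is taken from.

    Rules, as described in this guide:
      - entries with an explicit protocol (gsiftp://, file://) are used as is;
      - absolute paths are local files on the UI, even if a base URI is set;
      - other names are relative to InputSandboxBaseURI when it is given,
        and local to the UI otherwise.
    """
    resolved = []
    for entry in entries:
        if "://" in entry:
            resolved.append(entry)
        elif entry.startswith("/"):
            resolved.append("file://" + entry)
        elif base_uri is not None:
            resolved.append(base_uri.rstrip("/") + "/" + entry)
        else:
            resolved.append(entry)  # local file, relative to the UI directory
    return resolved


sandbox = ["hello.sh", "myInput/B.txt", "/tmp/hello/local-input-1.txt",
           "file:///tmp/hello/local-input-2.txt"]
base = "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/"
for location in resolve_input_sandbox(sandbox, base):
    print(location)
```

Calling the function without base_uri leaves the relative names untouched, which corresponds to the case where InputSandboxBaseURI is omitted and those files are read from the UI.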
Multiple output files at the same location
As we saw in the example above, if you want all output files to be uploaded to the same path on an SE, you would write something like:
OutputSandbox = {"stdout.txt", "stderr.txt"};
OutputSandboxBaseDestURI = "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/hello-job/output/";
Unlike with InputSandbox, you cannot specify the final destination directly in OutputSandbox. The names of the output files are relative to the working directory on the WN. If we want to upload all of them to the same location, that location must be specified in the OutputSandboxBaseDestURI attribute.
Multiple output files in different locations
If you want to upload output files to different locations, you have to use the attribute OutputSandboxDestURI. It is a list of destinations for all output files, each including the protocol and file name (not just the path to the destination directory), listed in the same order as the files in OutputSandbox.
OutputSandbox = {"file.1", "file.2", "out/myOutput"};
OutputSandboxDestURI = { "gsiftp://se.ipb.ac.rs/dpm/ipb.ac.rs/home/seegrid/hello/myInput",
"gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/file.2",
"myOutput"
};
The first two files will be uploaded to the specified SEs, while the third will be uploaded to the WMS, where it will wait for the user to retrieve it with glite-wms-job-output.
Note: The number of file destinations in OutputSandboxDestURI must be the same as the number of files in OutputSandbox. Otherwise the JDL will be considered invalid.
Note: OutputSandboxDestURI and OutputSandboxBaseDestURI cannot both be present in a job description.
Note: All output destination directories must exist at the time of job execution, or the job will fail to upload data.
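The first two notes can be checked before submission with a few lines of Python. This is a minimal sketch of those two rules only. The function name check_output_sandbox is hypothetical, it does not parse JDL, and it does not check whether the destination directories exist.

```python
def check_output_sandbox(output_sandbox, dest_uris=None, base_dest_uri=None):
    """Return a list of problems found in the output sandbox attributes.

    An empty list means the two rules from the notes above are satisfied.
    """
    problems = []
    if dest_uris is not None and base_dest_uri is not None:
        problems.append(
            "OutputSandboxDestURI and OutputSandboxBaseDestURI cannot both be present")
    if dest_uris is not None and len(dest_uris) != len(output_sandbox):
        problems.append(
            "OutputSandboxDestURI has %d entries but OutputSandbox has %d"
            % (len(dest_uris), len(output_sandbox)))
    return problems


files = ["file.1", "file.2", "out/myOutput"]
destinations = [
    "gsiftp://se.ipb.ac.rs/dpm/ipb.ac.rs/home/seegrid/hello/myInput",
    "gsiftp://grid-ce.etf.bg.ac.rs/dpm/etf.bg.ac.rs/home/seegrid/file.2",
    "myOutput",
]
print(check_output_sandbox(files, dest_uris=destinations))
```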
Submitting the job
Once you have prepared your JDL, uploaded all the input files to SEs (or located input files that were already uploaded by your developers, data producers or your other jobs), and made sure that all output directories exist, you can submit your JDL file as explained in the Job Submission Guide.
