How to limit per-process physical memory consumption on WNs

From EGEE-see WIki

Introduction

This text summarizes the problems of physical memory shortage faced at the HG-01-GRNET and HG-06-EKT sites and the steps taken to increase the reliability of their Worker Nodes when the jobs submitted for execution have very large memory requirements.

Description of the problem

The problem arose when certain jobs submitted to our sites allocated excessive amounts of memory, eventually driving the system into an Out-of-Memory condition. This happens because Linux overcommits memory by default: requests to grow the virtual address space of a process succeed, but pages of physical memory are only allocated when the process first touches the corresponding virtual addresses. If the system runs out of physical pages at that point, the kernel invokes the OOM killer, which starts terminating processes based on a set of criteria in order to reclaim physical memory.

However, it is very common for processes other than the offending one to be killed. Very often, one of the processes selected for termination was mmfsd, the GPFS daemon, which led to critical GPFS mounts (e.g. /home) being unavailable for all processes on this WN. Since the WN would now appear idle and there was no way for the Local Resource Management System (Torque) to detect the malfunction of GPFS, Torque would continue to forward jobs to it. The jobs would die instantly with a "Stale NFS handle" error message for /home and the cycle would start again.

A short-term solution to the problem of the GPFS daemon dying could have been to use an even larger swapfile. However, this would not address the real cause of the problem, and on an SMP machine it would impose an unjustified slowdown on jobs that happen to share a WN with a job allocating much more than its fair share of memory. Since execution becomes orders of magnitude slower once the system starts to swap, in all probability the jobs would exceed the wallclock time limit associated with the Torque queue and would be terminated by the scheduling system.

Limiting memory consumption per process

So, the decision was made to limit the amount of memory that each process can allocate, by employing the appropriate mechanism, as provided by the Linux kernel (see also the limits.conf manpage). For each process, the kernel defines a series of resource limits, which can be inspected at the shell, using the ulimit command:


(vangelis@daedalus#/dev/pts/6)-(~)
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
max nice (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 8175
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
max rt priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 8175
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Although many of the defined limits relate to memory, the only one that seems to be regularly enforced throughout the Linux v2.4 series kernels is RLIMIT_AS, the "address space" limit, which caps the size of the virtual address space available to a process. Essentially, setting a hard limit for RLIMIT_AS means that malloc() calls over a certain size fail instantly for the calling process, before any pages of physical memory are actually allocated. The limit can be set with the ulimit -v command, and whether it is enforced can be checked with a tool like xmalloc.c (C source in the Sample Code section below):

[vkoukis@ui01 limitmem]$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) 4
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited
[vkoukis@ui01 limitmem]$ ./xmalloc 262144
Malloc was successful
[vkoukis@ui01 limitmem]$ ulimit -v 131072
[vkoukis@ui01 limitmem]$ ./xmalloc 262144
malloc failed: Cannot allocate memory

To make the changes permanent, the administrator of a Grid node can set the value of RLIMIT_AS per pool account group in /etc/security/limits.conf, thus:

[root@wn001 root]# cat /etc/security/limits.conf
<snip>
@biomed hard as 1331200
@atlas hard as 1331200
@alice hard as 1331200
@lhcb hard as 1331200
<snip>

The values in this file are enforced by the login mechanism, and more specifically by the PAM (Pluggable Authentication Modules) library:

[root@wn001 pam.d]# cat /etc/pam.d/system-auth |grep limits
session required /lib/security/$ISA/pam_limits.so

Consequently, these values do not apply to processes spawned by Torque, since those are children of the pbs_mom process and inherit its resource limits. A quick and dirty solution would have been to run the ulimit -v command at the top of /etc/init.d/pbs_mom, but this would not allow for different resource limits depending on the VO, and would also make clean upgrades of Torque impossible. A much better solution is to set per-queue resource limits and have them enforced by "pbs_mom" itself (see the pbs_resources_linux manpage and pbs_resmom/linux/mom_mach.c in the Torque source tree) before the job begins running. The name of the corresponding queue limit is vmem, thus the following fragment:

[root@ce01 JobManager]# qmgr
Max open servers: 4
Qmgr: set queue dteam resources_max.vmem=524288k
Qmgr: list queue dteam@ce01
Queue dteam
queue_type = Execution
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0
Exiting:0
resources_max.cput = 48:00:00
resources_max.vmem = 524288kb
resources_max.walltime = 72:00:00
acl_group_enable = True
acl_groups = dteam,dteamsgm,dteamprd
mtime = Tue Sep 4 13:50:55 2007
resources_assigned.nodect = 0
resources_assigned.vmem = 0b
enabled = True
started = True

does not allow the VM address space of jobs submitted to the local "dteam" queue to grow over 512MB.

Even so, setting these limits is not enough to confine parallel (MPI) jobs, because most MPICH setups use mpirun, which in turn uses ssh to spawn the peer processes of the parallel job across WNs. Processes started over ssh on a remote WN are not children of the pbs_mom that launched the job, so they pick up their limits from the login mechanism instead. Thus, to ensure that the limits are enforced not only for the first process but for all processes belonging to a parallel job, the values must be defined in both places: as properties of the Torque queue and in /etc/security/limits.conf.

With these changes in place, the system is much more stable and GPFS has not aborted in this way since. The limit has been set to half the physical memory of each WN, since they are dual-processor machines with two PBS job slots defined per node, along with a swapfile of suitable size (about half the available physical memory).

Verifying correct operation

To verify the correct operation of the limiting mechanism, one can submit an xmalloc job with suitable arguments. Verifying that the limits also hold for MPI processes can be done by spawning xmalloc on remote nodes over ssh:

[root@wn001 root]# su - dteam001
[dteam001@wn001 dteam001]$ ssh wn002 ./xmalloc 1500000
malloc failed: Cannot allocate memory
[dteam001@wn001 dteam001]$ ssh wn002 ./xmalloc 131072
Malloc was successful

Sample Code

The C source for the xmalloc.c utility follows:

/*
 * xmalloc.c
 *
 * Just allocate a chunk of memory
 * and make sure it gets written, so that
 * the demand-paging mechanism of Linux actually
 * has to find physical pages or swap for it
 *
 * Vangelis Koukis <vkoukis@cslab.ece.ntua.gr>
 * July 2007
 */ 
  
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h> 

int main(int argc, char *argv[])
{
	char *p;
	size_t i;
	size_t n; 

	/* I'm too bored to do proper cmdline parsing */
	if (argc != 2 || atol(argv[1]) <= 0 ) {
		fprintf(stderr, "I'm bored... Give me the size of the memory chunk in KB\n");
		return 1;
	}
	n = 1024 * atol(argv[1]);

	if (! (p = malloc(n))) {
		perror("malloc failed");
		return 2;
	} 

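	/*
	 * Temp, just want to check malloc: this is enough to test RLIMIT_AS.
	 * Because of the early return, the code below it never runs; remove
	 * the return to have the buffer touched as well.
	 */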
	printf("Malloc was successful\n");
	return 0; 

	/* Touch all of the buffer, to make sure it gets allocated */
	for (i = 0; i < n; i++)
		p[i] = 'A';


	printf("Allocated and touched buffer, sleeping for 60 sec...\n");
	sleep(60);
	printf("Done!\n"); 

	return 0;
}

vkoukis 13:57, 23 November 2007 (EET)
