Platform LSF Version 9 Release 1.3 Configuration Reference

SC27-5306-03

Note
Before using this information and the product it supports, read the information in "Notices".

First edition

This edition applies to version 9, release 1 of IBM Platform LSF (product numbers 5725G82 and 5725L25) and to all subsequent releases and modifications until otherwise indicated in new editions. Significant changes or additions to the text and illustrations are indicated by a vertical line (|) to the left of the change.

If you find an error in any Platform Computing documentation, or you have a suggestion for improving it, please let us know. In the IBM Knowledge Center, add your comments and feedback to any topic. You can also send your suggestions, comments, and questions to the following email address: [email protected]

Be sure to include the publication title and order number, and, if applicable, the specific location of the information about which you have comments (for example, a page number or a browser URL). When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

© Copyright IBM Corporation 1992, 2014.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

Contents

Chapter 1. Configuration Files
  cshrc.lsf and profile.lsf
  dc_conf.cluster_name.xml parameters
  hosts
  install.config
  lim.acct
  lsb.acct
  lsb.applications
  lsb.events
  lsb.hosts
  lsb.modules
  lsb.params
  lsb.queues
  lsb.resources
  lsb.serviceclasses
  lsb.users
  lsf.acct
  lsf.cluster
  lsf.conf
  lsf.datamanager
  lsf.licensescheduler
  lsf.shared
  lsf.sudoers
  lsf.task
  setup.config
  slave.config

Chapter 2. Environment Variables
  Environment variables set for job execution
  Environment variables for resize notification command
  Environment variables for session scheduler (ssched)
  Environment variable reference

Notices
  Trademarks
  Privacy policy considerations

Chapter 1. Configuration Files

Important: Specify any domain names in all uppercase letters in all configuration files.

cshrc.lsf and profile.lsf

About cshrc.lsf and profile.lsf

The user environment shell files cshrc.lsf and profile.lsf set the LSF operating environment on an LSF host. They define machine-dependent paths to LSF commands and libraries as environment variables:
- cshrc.lsf sets the C shell (csh or tcsh) user environment for LSF commands and libraries
- profile.lsf sets and exports the Bourne shell/Korn shell (sh, ksh, or bash) user environment for LSF commands and libraries

Tip: LSF administrators should make sure that cshrc.lsf or profile.lsf is available for users to set the LSF environment variables correctly for the host type running LSF.
Location

cshrc.lsf and profile.lsf are created by lsfinstall during installation. After installation, they are located in LSF_CONFDIR (LSF_TOP/conf/).

Format

cshrc.lsf and profile.lsf are conventional UNIX shell scripts:
- cshrc.lsf runs under /bin/csh
- profile.lsf runs under /bin/sh

What cshrc.lsf and profile.lsf do

cshrc.lsf and profile.lsf determine the binary type (BINARY_TYPE) of the host and set environment variables for the paths to the following machine-dependent LSF directories, according to the LSF version (LSF_VERSION) and the location of the top-level installation directory (LSF_TOP) defined at installation:
- LSF_BINDIR
- LSF_SERVERDIR
- LSF_LIBDIR
- XLSF_UIDDIR

cshrc.lsf and profile.lsf also set the following user environment variables:
- LSF_ENVDIR
- LD_LIBRARY_PATH
- PATH, to include the paths to:
  - LSF_BINDIR
  - LSF_SERVERDIR
- MANPATH, to include the path to the LSF man pages

If EGO is enabled

If EGO is enabled in the LSF cluster (LSF_ENABLE_EGO=Y and LSF_EGO_ENVDIR are defined in lsf.conf), cshrc.lsf and profile.lsf set the following environment variables:
- EGO_BINDIR
- EGO_CONFDIR
- EGO_ESRVDIR
- EGO_LIBDIR
- EGO_LOCAL_CONFDIR
- EGO_SERVERDIR
- EGO_TOP

Setting the LSF environment with cshrc.lsf and profile.lsf

Before using LSF, you must set the LSF execution environment. After logging on to an LSF host, use one of the following shell environment files to set your LSF environment:
- For example, in csh or tcsh:
  source /usr/lsf/lsf_9/conf/cshrc.lsf
- For example, in sh, ksh, or bash:
  .
/usr/lsf/lsf_9/conf/profile.lsf

Making your cluster available to users with cshrc.lsf and profile.lsf

To set the LSF user environment, run one of the following two shell files:
- LSF_CONFDIR/cshrc.lsf (for csh or tcsh)
- LSF_CONFDIR/profile.lsf (for sh, ksh, or bash)

Tip: LSF administrators should make sure all LSF users include one of these files at the end of their own .cshrc or .profile file, or run one of these two files before using LSF.

For csh or tcsh

Add cshrc.lsf to the end of the .cshrc file for all users:
- Copy the cshrc.lsf file into .cshrc, or
- Add a line similar to the following to the end of .cshrc:
  source /usr/lsf/lsf_9/conf/cshrc.lsf

After running cshrc.lsf, use setenv to see the environment variable settings. For example:

setenv
PATH=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
...
MANPATH=/usr/lsf/lsf_9/9.1/man
...
LSF_BINDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
LSF_SERVERDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc
LSF_LIBDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
LD_LIBRARY_PATH=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
XLSF_UIDDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib/uid
LSF_ENVDIR=/usr/lsf/lsf_9/conf

Note: These variable settings are an example only. Your system may set additional variables.

For sh, ksh, or bash

Add profile.lsf to the end of the .profile file for all users:
- Copy the profile.lsf file into .profile, or
- Add a line similar to the following to the end of .profile:
  . /usr/lsf/lsf_9/conf/profile.lsf

After running profile.lsf, use the set command to see the environment variable settings. For example:

set
...
LD_LIBRARY_PATH=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
LSF_BINDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
LSF_ENVDIR=/usr/lsf/lsf_9/conf
LSF_LIBDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
LSF_SERVERDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc
MANPATH=/usr/lsf/lsf_9/9.1/man
PATH=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
...
XLSF_UIDDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib/uid
...

Note: These variable settings are an example only. Your system may set additional variables.

cshrc.lsf and profile.lsf on dynamically added LSF slave hosts

Dynamically added LSF hosts that will not be master candidates are slave hosts. Each dynamic slave host has its own LSF binaries, local lsf.conf, and shell environment scripts (cshrc.lsf and profile.lsf).

LSF environment variables set by cshrc.lsf and profile.lsf

LSF_BINDIR

Syntax
LSF_BINDIR=dir

Description
Directory where LSF user commands are installed.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv LSF_BINDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
- Set and exported in sh, ksh, or bash by profile.lsf:
  LSF_BINDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin

Values
- In cshrc.lsf for csh and tcsh:
  setenv LSF_BINDIR $LSF_TOP/$LSF_VERSION/$BINARY_TYPE/bin
- Set and exported in profile.lsf for sh, ksh, or bash:
  LSF_BINDIR=$LSF_TOP/$LSF_VERSION/$BINARY_TYPE/bin

LSF_ENVDIR

Syntax
LSF_ENVDIR=dir

Description
Directory containing the lsf.conf file. By default, lsf.conf is installed by creating a shared copy in LSF_CONFDIR and adding a symbolic link from /etc/lsf.conf to the shared copy. If LSF_ENVDIR is set, the symbolic link is installed in LSF_ENVDIR/lsf.conf.

The lsf.conf file is a global environment configuration file for all LSF services and applications. The LSF default installation places the file in LSF_CONFDIR.
Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv LSF_ENVDIR /usr/lsf/lsf_9/conf
- Set and exported in sh, ksh, or bash by profile.lsf:
  LSF_ENVDIR=/usr/lsf/lsf_9/conf

Values
- In cshrc.lsf for csh and tcsh:
  setenv LSF_ENVDIR $LSF_TOP/conf
- Set and exported in profile.lsf for sh, ksh, or bash:
  LSF_ENVDIR=$LSF_TOP/conf

LSF_LIBDIR

Syntax
LSF_LIBDIR=dir

Description
Directory where LSF libraries are installed. Library files are shared by all hosts of the same type.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv LSF_LIBDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
- Set and exported in sh, ksh, or bash by profile.lsf:
  LSF_LIBDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib

Values
- In cshrc.lsf for csh and tcsh:
  setenv LSF_LIBDIR $LSF_TOP/$LSF_VERSION/$BINARY_TYPE/lib
- Set and exported in profile.lsf for sh, ksh, or bash:
  LSF_LIBDIR=$LSF_TOP/$LSF_VERSION/$BINARY_TYPE/lib

LSF_SERVERDIR

Syntax
LSF_SERVERDIR=dir

Description
Directory where LSF server binaries and shell scripts are installed. These include lim, res, nios, sbatchd, mbatchd, and mbschd. If you use elim, eauth, eexec, esub, and so on, they are also installed in this directory.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv LSF_SERVERDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc
- Set and exported in sh, ksh, or bash by profile.lsf:
  LSF_SERVERDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc

Values
- In cshrc.lsf for csh and tcsh:
  setenv LSF_SERVERDIR $LSF_TOP/$LSF_VERSION/$BINARY_TYPE/etc
- Set and exported in profile.lsf for sh, ksh, or bash:
  LSF_SERVERDIR=$LSF_TOP/$LSF_VERSION/$BINARY_TYPE/etc

XLSF_UIDDIR

Syntax
XLSF_UIDDIR=dir

Description
(UNIX and Linux only) Directory where Motif User Interface Definition files are stored. These files are platform-specific.
Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv XLSF_UIDDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib/uid
- Set and exported in sh, ksh, or bash by profile.lsf:
  XLSF_UIDDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib/uid

Values
- In cshrc.lsf for csh and tcsh:
  setenv XLSF_UIDDIR $LSF_TOP/$LSF_VERSION/$BINARY_TYPE/lib/uid
- Set and exported in profile.lsf for sh, ksh, or bash:
  XLSF_UIDDIR=$LSF_TOP/$LSF_VERSION/$BINARY_TYPE/lib/uid

EGO environment variables set by cshrc.lsf and profile.lsf

EGO_BINDIR

Syntax
EGO_BINDIR=dir

Description
Directory where EGO user commands are installed.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_BINDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_BINDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/bin

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_BINDIR $LSF_BINDIR
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_BINDIR=$LSF_BINDIR

EGO_CONFDIR

Syntax
EGO_CONFDIR=dir

Description
Directory containing the ego.conf file.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_CONFDIR /usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_CONFDIR=/usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_CONFDIR /usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_CONFDIR=/usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel

EGO_ESRVDIR

Syntax
EGO_ESRVDIR=dir

Description
Directory where the EGO service controller configuration files are stored.
Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_ESRVDIR /usr/lsf/lsf_9/conf/ego/lsf702/eservice
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_ESRVDIR=/usr/lsf/lsf_9/conf/ego/lsf702/eservice

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_ESRVDIR /usr/lsf/lsf_9/conf/ego/lsf702/eservice
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_ESRVDIR=/usr/lsf/lsf_9/conf/ego/lsf702/eservice

EGO_LIBDIR

Syntax
EGO_LIBDIR=dir

Description
Directory where EGO libraries are installed. Library files are shared by all hosts of the same type.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_LIBDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_LIBDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/lib

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_LIBDIR $LSF_LIBDIR
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_LIBDIR=$LSF_LIBDIR

EGO_LOCAL_CONFDIR

Syntax
EGO_LOCAL_CONFDIR=dir

Description
The local EGO configuration directory containing the ego.conf file.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_LOCAL_CONFDIR /usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_LOCAL_CONFDIR=/usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_LOCAL_CONFDIR /usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_LOCAL_CONFDIR=/usr/lsf/lsf_9/conf/ego/lsf1.2.3/kernel

EGO_SERVERDIR

Syntax
EGO_SERVERDIR=dir

Description
Directory where EGO server binaries and shell scripts are installed. These include vemkd, pem, egosc, and shell scripts for EGO startup and shutdown.
Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_SERVERDIR /usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_SERVERDIR=/usr/lsf/lsf_9/9.1/linux2.6-glibc2.3-x86/etc

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_SERVERDIR $LSF_SERVERDIR
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_SERVERDIR=$LSF_SERVERDIR

EGO_TOP

Syntax
EGO_TOP=dir

Description
The top-level installation directory. The path to EGO_TOP must be shared and accessible to all hosts in the cluster. Equivalent to LSF_TOP.

Examples
- Set in csh and tcsh by cshrc.lsf:
  setenv EGO_TOP /usr/lsf/lsf_9
- Set and exported in sh, ksh, or bash by profile.lsf:
  EGO_TOP=/usr/lsf/lsf_9

Values
- In cshrc.lsf for csh and tcsh:
  setenv EGO_TOP /usr/lsf/lsf_9
- Set and exported in profile.lsf for sh, ksh, or bash:
  EGO_TOP=/usr/lsf/lsf_9

dc_conf.cluster_name.xml parameters

This file contains Dynamic Cluster configuration parameters.

ResourceGroupConf section

Example:

<ResourceGroupConf>
  <HypervisorResGrps>
    <ResourceGroup>
      <Name>KVMRedHat_Hosts</Name>
      <Template>rhel55hv</Template>
      <MembersAreAlsoPhysicalHosts>Yes</MembersAreAlsoPhysicalHosts>
    </ResourceGroup>
  </HypervisorResGrps>
</ResourceGroupConf>

The following parameters are configured in the <ResourceGroupConf> section of the file.

MembersAreAlsoPhysicalHosts

Syntax
Yes

Description
Defines a host as a hypervisor that can also run physical machine workload on idle resources.

Default
Not defined.

Name

Syntax
KVMRedHat_Hosts

Description
The name of the resource group to which hypervisors of the given template belong. The only valid value is KVMRedHat_Hosts.

Default
Not defined.

Template

Syntax
template_name

Description
The name of the template associated with the hypervisor resource group. This must match a template name defined in the Templates section.
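Because the ResourceGroupConf section is plain XML, it can be inspected with a standard parser. A minimal sketch in Python, using only the element names from the example above (the function name hypervisor_groups and the embedded sample are illustrative, not part of LSF):

```python
# Sketch: read <ResourceGroup> entries from a <ResourceGroupConf> section.
# Element names (Name, Template, MembersAreAlsoPhysicalHosts) are taken
# from the example in this reference; everything else is an assumption.
import xml.etree.ElementTree as ET

SAMPLE = """
<ResourceGroupConf>
  <HypervisorResGrps>
    <ResourceGroup>
      <Name>KVMRedHat_Hosts</Name>
      <Template>rhel55hv</Template>
      <MembersAreAlsoPhysicalHosts>Yes</MembersAreAlsoPhysicalHosts>
    </ResourceGroup>
  </HypervisorResGrps>
</ResourceGroupConf>
"""

def hypervisor_groups(xml_text):
    """Return (name, template, members_flag) for each <ResourceGroup>."""
    root = ET.fromstring(xml_text)
    return [(rg.findtext("Name"),
             rg.findtext("Template"),
             rg.findtext("MembersAreAlsoPhysicalHosts"))
            for rg in root.iter("ResourceGroup")]
```

Such a check can be useful for validating, for instance, that the Name element carries the only documented value, KVMRedHat_Hosts.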
Default
Not defined.

Parameters section

Most parameters are configured as shown:

<ParametersConf>
  <Parameter name="PARAMETER_1">
    <Value>value_1</Value>
  </Parameter>
  <Parameter name="PARAMETER_2">
    <Value>value_2</Value>
  </Parameter>
</ParametersConf>

The following parameters are configured in the <ParametersConf> section of the file.

DC_CLEAN_PERIOD

Syntax
seconds

Description
Time to keep provision requests in memory for query purposes after they are completed. Before this period is completed, the requests appear in the output of the bdc action command. After this period expires, they only appear in the output of bdc hist. Specify the value in seconds.

Default
1800

DC_CONNECT_STRING

Syntax
admin_user_name::directory_path

Description
A connection string to specify the Platform Cluster Manager Advanced Edition administrator account name and installation directory.

Default
Admin::/opt/platform

The Platform Cluster Manager Advanced Edition administrator is Admin and the installation directory is /opt/platform.

DC_EVENT_FILES_MAX_NUM

Syntax
integer

Description
Maximum number of Dynamic Cluster event log files (dc.events) to keep.

Dependency
DC_EVENT_FILES_MIN_SWITCH_PERIOD

Default
10

DC_EVENT_FILES_MIN_SWITCH_PERIOD

Syntax
seconds

Description
The minimum elapsed time period before archiving the Dynamic Cluster event log. Define the value in seconds.

Works together with DC_PROVISION_REQUESTS_MAX_FINISHED to control how frequently Dynamic Cluster archives the file dc.events. The number of finished requests in the file is evaluated regularly, at the interval defined by this parameter. The file is archived if the number of requests has reached the threshold defined by DC_PROVISION_REQUESTS_MAX_FINISHED.

The event log file names are switched when a new file is archived. The new file is named dc.events, the archived file is named dc.events.1, and the previous dc.events.1 is renamed dc.events.2, and so on.
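The file-name switching just described is a simple rotation of numeric suffixes. A sketch of the naming rule only (rotate_names is a hypothetical helper; Dynamic Cluster itself performs the actual archiving):

```python
# Sketch of the dc.events rotation naming: on archive, dc.events becomes
# dc.events.1, the old dc.events.1 becomes dc.events.2, and so on.
# Pure name manipulation, no files are touched.
def rotate_names(existing):
    """Map each current log name to its name after one rotation."""
    renames = {}
    for name in existing:
        if name == "dc.events":
            renames[name] = "dc.events.1"
        elif name.startswith("dc.events."):
            n = int(name.rsplit(".", 1)[1])
            renames[name] = "dc.events.%d" % (n + 1)
    return renames
```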
Dependency
DC_PROVISION_REQUESTS_MAX_FINISHED

Default
1800

DC_LIVEMIG_MAX_DOWN_TIME

Syntax
seconds

Description
For KVM hypervisors only. The maximum amount of time that a VM can be down during a live migration. This is the amount of time from when the VM is stopped on the source hypervisor to when it is started on the target hypervisor.

If the live migration cannot guarantee this down time, the system continues to retry the live migration until it can guarantee this maximum down time (or until the DC_LIVEMIG_MAX_EXEC_TIME parameter value is reached).

Specify the value in seconds, or specify 0 to use the hypervisor default for the down time.

Default
0

DC_LIVEMIG_MAX_EXEC_TIME

Syntax
seconds

Description
For KVM hypervisors only. The maximum amount of time that the system can attempt a live migration. If the live migration cannot guarantee the down time (as specified by the DC_LIVEMIG_MAX_DOWN_TIME parameter) within this amount of time, the live migration fails.

Specify the value in seconds, from 1 to 2147483646.

Default
2147483646

DC_MACHINE_MAX_LIFETIME

Syntax
minutes

Description
Limits the lifetime of a dynamically created virtual machine. After the specified time period has elapsed since a VM's creation, if the VM becomes idle, the system automatically powers off and destroys the VM.

This parameter is useful when propagating updates to a shared template. If a shared template is updated, all VM instances that were generated from this template still run with the old template, even if they were powered off. Setting this parameter to a finite value causes VMs to be uninstalled (deleted from disk) after the specified period has elapsed and any running workload has completed. Any further requests for the shared template must install a new VM, which is then based on the new version of the template.
Therefore, administrators can be sure that the template update has been propagated throughout the system after the specified time period.

Default
Not defined. Infinite time.

DC_MACHINE_MIN_TTL

Syntax
minutes

Description
Minimum time to live of the Dynamic Cluster host or virtual machine before it can be reprovisioned. This parameter is used to prevent system resources from being reprovisioned too often and generating unnecessary load on the infrastructure. For example, if the value is set to 60, any freshly provisioned machine will not be reprovisioned in less than 60 minutes.

Default
0

DC_PROVISION_REQUESTS_MAX_FINISHED

Syntax
integer

Description
Number of finished provisioning requests in the Dynamic Cluster event log before it is archived.

Works together with DC_EVENT_FILES_MIN_SWITCH_PERIOD to control how frequently Dynamic Cluster archives the file dc.events. The number of jobs in the file is evaluated regularly, at the interval defined by DC_EVENT_FILES_MIN_SWITCH_PERIOD. The file is archived if the number of jobs has reached or exceeded the threshold defined by DC_PROVISION_REQUESTS_MAX_FINISHED.

The event log file names are switched when a new file is archived. The new file is named dc.events, the archived file is named dc.events.1, the previous dc.events.1 is renamed dc.events.2, and so on.

Dependency
DC_EVENT_FILES_MIN_SWITCH_PERIOD

Default
5000

DC_REPROVISION_GRACE_PERIOD

Syntax
seconds

Description
After each job finishes, allow a grace period before the machine can accept another provision request. Specify the value in seconds.

By default, when a job completes, the machine it was running on becomes eligible for reprovisioning. However, some jobs have post-execution processes that may be
This parameter configures a grace period after job termination during which the host cannot be reprovisioned, which gives these processes a chance to complete. To ensure that the machine is not reprovisioned until post-execution processing is done, specify JOB_INCLUDE_POSTPROC=Y in lsb.params. Default 0 DC_VM_UNINSTALL_PERIOD Syntax minutes Description Time period to uninstall (delete from storage) a dynamically created VM in the off state. Specify the value in minutes. A virtual machine can be created to meet peak workload demands. However, after peak loads pass, those virtual machines will be powered off and stored in storage. Those dynamic virtual machines can be configured to be deleted if they remain off for a long time. Note: VMs in the OFF status still hold an IP address reservation. To release this IP address, the VM must be uninstalled (deleted from disk). To have VMs release their IP reservation immediately when powered down, specify 0 as the value of this parameter to uninstall them immediately." Default 1440 DC_VM_MEMSIZE_DEFINED Syntax integer Multiple values allowed. This parameter wraps each value in <memsize> instead of <Value>. For example: <memsize> integer </memsize> Description Specify one or more choices for VM memory size. Specify the value in MB. Separate values in a list with space. The memory size of any new VM is the smallest of all the choices that satisfy the job’s resource requirement. 14 Platform LSF Configuration Reference dc_conf.cluster_name.xml For example, if a job requires 500 MB memory, and this parameter is set to "1024 1536", the VM is created with 1024 MB memory. If a job requires 1025 MB memory, a VM is created with 1536 MB memory. Using this feature helps prevent the hypervisor hosts from being fragmented with multiple VMs of different size. When the VMs have standardized memory size, they can easily be reused for jobs with similar memory requirements. Dependency Dynamic Cluster only. 
If this parameter is used, DC_VM_MEMSIZE_STEP in lsb.params is ignored.

Valid Values
512 minimum

Default
Not defined

DC_VM_PREFIX

Syntax
string

Description
Prefix for naming a new VM that is created by Dynamic Cluster. Specify a text string.

You can specify more than one name using multiple <Value/> entries, but only the first value is used for dynamically creating new VMs. However, all VMs named with any of these prefixes are treated as dynamically created VMs even if they were manually created. They are subject to DC_VM_UNINSTALL_PERIOD and DC_MACHINE_MAX_LIFETIME.

Default
vm_lsf_dyn

DC_VM_RESOURCE_GROUPS

Syntax
resource_group_name

Description
Specify the names of Dynamic Cluster resource groups that are allowed to create new VMs. For KVM hypervisors, the only valid value is KVMRedHat_Host.

DC_WORKDIR

Syntax
directory_path

Description
Dynamic Cluster working directory. This is the location where the dc.events file is stored.

Default
/opt/lsf/work/cluster_name/dc

Templates section

For more information, see the setup instructions.

hosts

For hosts with multiple IP addresses and different official host names configured at the system level, this file associates the host names and IP addresses in LSF.

By default, LSF assumes each host in the cluster:
- Has a unique official host name
- Can resolve its IP address from its name
- Can resolve its official name from its IP address

Hosts with only one IP address, or hosts with multiple IP addresses that already resolve to a unique official host name, should not be configured in this file: they are resolved using the default method for your system (for example, local configuration files like /etc/hosts, or through DNS).
The LSF hosts file is used in environments where:
- Machines in the cluster have multiple network interfaces and cannot be set up in the system with a unique official host name
- DNS is slow or not configured properly
- Machines have special topology requirements; for example, in HPC systems where it is desirable to map multiple actual hosts to a single head end host

The LSF hosts file is not installed by default. It is usually located in the directory specified by LSF_CONFDIR. The format of LSF_CONFDIR/hosts is similar to the format of the /etc/hosts file on UNIX machines.

hosts file structure

One line for each IP address, consisting of the IP address, followed by the official host name, optionally followed by host aliases, all separated by spaces or tabs. Each line has the form:

ip_address official_name [alias [alias ...]]

IP addresses can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. You can use IPv6 addresses if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf; you do not have to map IPv4 addresses to an IPv6 format.

Use consecutive lines for IP addresses belonging to the same host. You can assign different aliases to different addresses.

Use a pound sign (#) to indicate a comment (the rest of the line is not read by LSF). Do not use #if, as this is reserved syntax for time-based configuration.

IP address

Written using an IPv4 or IPv6 format. LSF supports both formats; you do not have to map IPv4 addresses to an IPv6 format (if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf).
- IPv4 format: nnn.nnn.nnn.nnn
- IPv6 format: nnnn:nnnn:nnnn:nnnn:nnnn:nnnn:nnnn:nnnn

Official host name

The official host name. Single character names are not allowed.

Specify -GATEWAY or -GW as part of the host name if the host serves as a GATEWAY.

Specify -TAC as the last part of the host name if the host is a TAC and is a DoD host.
Specify the host name in the format defined in Internet RFC 952, which states:

A name (Net, Host, Gateway, or Domain name) is a text string up to 24 characters drawn from the alphabet (A-Z), digits (0-9), minus sign (-), and period (.). Periods are only allowed when they serve to delimit components of domain style names. (See RFC 921, Domain Name System Implementation Schedule, for background). No blank or space characters are permitted as part of a name. No distinction is made between upper and lower case. The first character must be an alpha character. The last character must not be a minus sign or a period.

RFC 952 has been modified by RFC 1123 to relax the restriction on the first character being a digit. For maximum interoperability with the Internet, you should use host names no longer than 24 characters for the host portion (exclusive of the domain component).

Aliases

Optional. Aliases to the host name.

The default host file syntax

ip_address official_name [alias [alias ...]]

is powerful and flexible, but it is difficult to configure in systems where a single host name has many aliases, and in multihomed host environments. In these cases, the hosts file can become very large and unmanageable, and configuration is prone to error.

The syntax of the LSF hosts file supports host name ranges as aliases for an IP address. This simplifies the host name alias specification. To use host name ranges as aliases, the host names must consist of a fixed node group name prefix and node indices, specified in a form like:

host_name[index_x-index_y, index_m, index_a-index_b]

For example:

atlasD0[0-3,4,5-6, ...]

is equivalent to:

atlasD0[0-6, ...]

The node list does not need to be a continuous range (some nodes can be configured out). Node indices can be numbers or letters (both upper case and lower case). For example, some systems map internal compute nodes to single LSF host names.
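A bracketed host name range of this form can be expanded mechanically. A minimal sketch in Python (expand_range is a hypothetical helper; it handles numeric ranges and single indices, and leaves letter indices as written):

```python
# Sketch: expand a host name range alias such as "atlas[0-3,5]" into
# individual host names. Numeric ranges and single indices only; letter
# indices (also allowed by the syntax above) are appended unexpanded.
import re

def expand_range(alias):
    m = re.match(r"^(.+)\[(.+)\]$", alias)
    if not m:
        return [alias]               # plain alias, no range
    prefix, body = m.groups()
    names = []
    for part in (p.strip() for p in body.split(",")):
        lo, sep, hi = part.partition("-")
        if sep and lo.isdigit() and hi.isdigit():
            names.extend("%s%d" % (prefix, i)
                         for i in range(int(lo), int(hi) + 1))
        else:
            names.append(prefix + part)
    return names
```

For example, expand_range("atlas[0-3,5]") yields atlas0 through atlas3 plus atlas5.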
A host file might contain 64 lines, each specifying an LSF host name and 32 node names that correspond to each LSF host:

...
177.16.1.1 atlasD0 atlas0 atlas1 atlas2 atlas3 atlas4 ... atlas31
177.16.1.2 atlasD1 atlas32 atlas33 atlas34 atlas35 atlas36 ... atlas63
...

In the new format, you still map the nodes to the LSF hosts, so the number of lines remains the same, but the format is simplified because you only have to specify ranges for the nodes, not each node individually as an alias:

...
177.16.1.1 atlasD0 atlas[0-31]
177.16.1.2 atlasD1 atlas[32-63]
...

You can use either an IPv4 or an IPv6 format for the IP address (if you define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf).

IPv4 Example

192.168.1.1 hostA hostB
192.168.2.2 hostA hostC host-C

In this example, hostA has 2 IP addresses and 3 aliases. The alias hostB specifies the first address, and the aliases hostC and host-C specify the second address. LSF uses the official host name, hostA, to identify that both IP addresses belong to the same host.

IPv6 Example

3ffe:b80:3:1a91::2 hostA hostB
3ffe:b80:3:1a91::3 hostA hostC host-C

In this example, hostA has 2 IP addresses and 3 aliases. The alias hostB specifies the first address, and the aliases hostC and host-C specify the second address. LSF uses the official host name, hostA, to identify that both IP addresses belong to the same host.

install.config

About install.config

The install.config file contains options for LSF installation and configuration. Use lsfinstall -f install.config to install LSF using the options specified in install.config.

Template location

A template install.config is included in the installer script package lsf9.1.3_lsfinstall.tar.Z and is located in the lsf9.1.3_lsfinstall directory created when you uncompress and extract the installer script package. Edit the file and uncomment the options you want in the template file.
Replace the example values with your own settings to specify the options for your new installation.

Important: The sample values in the install.config template file are examples only. They are not default installation values.

After installation, the install.config containing the options you specified is located in LSF_TOP/9.1/install/.

Format

Each entry in install.config has the form:

NAME="STRING1 STRING2 ..."

The equal sign = must follow each NAME, even if no value follows, and there should be no spaces around the equal sign.

A value that contains multiple strings separated by spaces must be enclosed in quotation marks.

Blank lines and lines starting with a pound sign (#) are ignored.

Parameters
- CONFIGURATION_TEMPLATE
- EGO_DAEMON_CONTROL
- ENABLE_DYNAMIC_HOSTS
- ENABLE_EGO
- ENABLE_STREAM
- LSF_ADD_SERVERS
- LSF_ADD_CLIENTS
- LSF_ADMINS
- LSF_CLUSTER_NAME
- LSF_DYNAMIC_HOST_WAIT_TIME
- LSF_ENTITLEMENT_FILE
- LSF_MASTER_LIST
- LSF_QUIET_INST
- LSF_SILENT_INSTALL_TARLIST
- LSF_TARDIR
- LSF_TOP
- PATCH_BACKUP_DIR
- PATCH_HISTORY_DIR
- SILENT_INSTALL

CONFIGURATION_TEMPLATE

Syntax
CONFIGURATION_TEMPLATE="DEFAULT" | "PARALLEL" | "HIGH_THROUGHPUT"

Description
LSF Standard Edition on UNIX or Linux only. Selects the configuration template for this installation, which determines the initial LSF configuration parameters specified when the installation is complete. The following are valid values for this parameter:

DEFAULT
This template should be used for clusters with mixed workload. This configuration can serve different types of workload with good performance, but is not specifically tuned for a particular type of cluster.

PARALLEL
This template provides extra support for large parallel jobs. This configuration is designed for long-running parallel jobs, and should not be used for clusters that mainly run short jobs, due to the longer reporting time for each job.
HIGH_THROUGHPUT This template is designed to be used for clusters that mainly run short jobs, where over 80% of jobs finish within one minute. This high turnover rate requires LSF to be more responsive and fast acting. However, this configuration will consume more resources as the daemons become busier. The installer uses the DEFAULT configuration template when installing LSF Standard Edition on Windows. Note: Do not specify CONFIGURATION_TEMPLATE for LSF Express Edition and Advanced Edition. These editions have their own default configuration templates for all installations. The installer specifies the following initial configuration file parameter values based on the selected configuration template: v DEFAULT – lsf.conf: DAEMON_SHUTDOWN_DELAY=180 LSF_LINUX_CGROUP_ACCT=Y LSF_PROCESS_TRACKING=Y – lsb.params: JOB_DEP_LAST_SUB=1 JOB_SCHEDULING_INTERVAL=1 MAX_JOB_NUM=10000 NEWJOB_REFRESH=Y SBD_SLEEP_TIME=7 v PARALLEL – lsf.conf: LSB_SHORT_HOSTLIST=1 LSF_LINUX_CGROUP_ACCT=Y LSF_PROCESS_TRACKING=Y LSF_ENABLE_EXTSCHEDULER=Y LSF_HPC_EXTENSIONS="CUMULATIVE_RUSAGE LSB_HCLOSE_BY_RES SHORT_EVENTFILE" Refer to the Enable LSF HPC Features section for a full description. 
20 Platform LSF Configuration Reference install.config – lsb.params: JOB_DEP_LAST_SUB=1 JOB_SCHEDULING_INTERVAL=1 NEWJOB_REFRESH=Y v HIGH_THROUGHPUT – lsf.conf: LSB_MAX_PACK_JOBS=300 LSB_SHORT_HOSTLIST=1 – lsb.params: CONDENSE_PENDING_REASONS=Y JOB_SCHEDULING_INTERVAL=50ms MAX_INFO_DIRS=500 MAX_JOB_ARRAY_SIZE=10000 MAX_JOB_NUM=100000 MIN_SWITCH_PERIOD=1800 NEWJOB_REFRESH=Y PEND_REASON_UPDATE_INTERVAL=60 SBD_SLEEP_TIME=3 The installer specifies the following initial configuration parameters for all configuration templates: v lsf.conf: EGO_ENABLE_AUTO_DAEMON_SHUTDOWN=Y LSB_DISABLE_LIMLOCK_EXCL=Y LSB_MOD_ALL_JOBS=Y LSF_DISABLE_LSRUN=Y LSB_SUBK_SHOW_EXEC_HOST=Y LSF_PIM_LINUX_ENHANCE=Y LSF_PIM_SLEEPTIME_UPDATE=Y LSF_STRICT_RESREQ=Y LSF_UNIT_FOR_LIMITS=MB v lsb.params: ABS_RUNLIMIT=Y DEFAULT_QUEUE=normal interactive JOB_ACCEPT_INTERVAL=0 MAX_CONCURRENT_QUERY=100 MAX_JOB_NUM=10000 MBD_SLEEP_TIME=10 PARALLEL_SCHED_BY_SLOT=Y In addition, the installer enables the following features for all configuration templates: v Fairshare scheduling (LSF Standard Edition and Advanced Edition): All queues except admin and license have fairshare scheduling enabled as follows in lsb.queues: Begin Queue ... FAIRSHARE=USER_SHARES[[default, 1]] ... End Queue v Host groups (LSF Standard Edition on UNIX or Linux): Master candidate hosts are assigned to the master_hosts host group. v User groups (LSF Standard Edition on UNIX or Linux): LSF administrators are assigned to the lsfadmins user group. v Affinity scheduling in both lsb.modules and lsb.hosts. Chapter 1. Configuration Files 21 install.config Example CONFIGURATION_TEMPLATE="HIGH_THROUGHPUT" Default DEFAULT (the default configuration template is used) EGO_DAEMON_CONTROL Syntax EGO_DAEMON_CONTROL="Y" | "N" Description Enables EGO to control LSF res and sbatchd. Set the value to "Y" if you want EGO Service Controller to start res and sbatchd, and restart if they fail. 
To avoid conflicts, leave this parameter undefined if you use a script to start up LSF daemons. Note: If you specify ENABLE_EGO="N", this parameter is ignored. Example EGO_DAEMON_CONTROL="N" Default N (res and sbatchd are started manually) ENABLE_DYNAMIC_HOSTS Syntax ENABLE_DYNAMIC_HOSTS="Y" | "N" Description Enables dynamically adding and removing hosts. Set the value to "Y" if you want to allow dynamically added hosts. If you enable dynamic hosts, any host can connect to the cluster. To enable security, configure LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name after installation and restrict the hosts that can connect to your cluster. Example ENABLE_DYNAMIC_HOSTS="N" Default N (dynamic hosts not allowed) ENABLE_EGO Syntax ENABLE_EGO="Y" | "N" Description Enables EGO functionality in the LSF cluster. ENABLE_EGO="Y" causes lsfinstall to uncomment LSF_EGO_ENVDIR and set LSF_ENABLE_EGO="Y" in lsf.conf. ENABLE_EGO="N" causes lsfinstall to comment out LSF_EGO_ENVDIR and set LSF_ENABLE_EGO="N" in lsf.conf. Set the value to "Y" if you want to take advantage of the following LSF features that depend on EGO: v LSF daemon control by EGO Service Controller v EGO-enabled SLA scheduling Default N (EGO is disabled in the LSF cluster) ENABLE_STREAM Syntax ENABLE_STREAM="Y" | "N" Description Enables LSF event streaming. Enable LSF event streaming if you intend to install IBM Platform Analytics or IBM Platform Application Center. Default N (Event streaming is disabled) LSF_ADD_SERVERS Syntax LSF_ADD_SERVERS="host_name [ host_name...]" Description List of additional LSF server hosts. The hosts in LSF_MASTER_LIST are always LSF servers. You can specify additional server hosts. Specify a list of host names in one of two ways: v Host names separated by spaces v Name of a file containing a list of host names, one host per line. Valid Values Any valid LSF host name.
Example 1 List of host names: LSF_ADD_SERVERS="hosta hostb hostc hostd" Example 2 Host list file: LSF_ADD_SERVERS=:lsf_server_hosts The file lsf_server_hosts contains a list of hosts: hosta hostb hostc hostd Default Only hosts in LSF_MASTER_LIST are LSF servers. LSF_ADD_CLIENTS Syntax LSF_ADD_CLIENTS="host_name [ host_name...]" Description List of LSF client-only hosts. Tip: After installation, you must manually edit lsf.cluster.cluster_name to include the host model and type of each client listed in LSF_ADD_CLIENTS. Valid Values Any valid LSF host name. Example 1 List of host names: LSF_ADD_CLIENTS="hoste hostf" Example 2 Host list file: LSF_ADD_CLIENTS=:lsf_client_hosts The file lsf_client_hosts contains a list of hosts: 24 Platform LSF Configuration Reference install.config hoste hostf Default No client hosts installed. LSF_ADMINS Syntax LSF_ADMINS="user_name [ user_name ... ]" Description Required. List of LSF administrators. The first user account name in the list is the primary LSF administrator. It cannot be the root user account. Typically this account is named lsfadmin. It owns the LSF configuration files and log files for job events. It also has permission to reconfigure LSF and to control batch jobs submitted by other users. It typically does not have authority to start LSF daemons. Usually, only root has permission to start LSF daemons. All the LSF administrator accounts must exist on all hosts in the cluster before you install LSF. Secondary LSF administrators are optional. CAUTION: You should not configure the root account as the primary LSF administrator. Valid Values Existing user accounts Example LSF_ADMINS="lsfadmin user1 user2" Default None—required variable LSF_CLUSTER_NAME Syntax LSF_CLUSTER_NAME="cluster_name" Description Required. The name of the LSF cluster. Example LSF_CLUSTER_NAME="cluster1" Chapter 1. Configuration Files 25 install.config Valid Values Any alphanumeric string containing no more than 39 characters. 
The name cannot contain white spaces. Important: Do not use the name of any host, user, or user group as the name of your cluster. Default None—required variable LSF_DYNAMIC_HOST_WAIT_TIME Syntax LSF_DYNAMIC_HOST_WAIT_TIME=seconds Description Time in seconds that the slave LIM waits after startup before calling the master LIM to add the slave host dynamically. This parameter only takes effect if you set ENABLE_DYNAMIC_HOSTS="Y" in this file. If the slave LIM receives the master announcement while it is waiting, it does not call the master LIM to add itself. Recommended value Up to 60 seconds for every 1000 hosts in the cluster, for a maximum of 15 minutes. Selecting a smaller value will result in a quicker response time for new hosts at the expense of an increased load on the master LIM. Example LSF_DYNAMIC_HOST_WAIT_TIME=60 A host waits 60 seconds from startup to receive an acknowledgement from the master LIM. If it does not receive the acknowledgement within the 60 seconds, it sends a request for the master LIM to add it to the cluster. Default Slave LIM waits forever LSF_ENTITLEMENT_FILE Syntax LSF_ENTITLEMENT_FILE=path Description Full path to the LSF entitlement file. LSF uses the entitlement file to determine which features to enable or disable, based on the edition of the product. The entitlement file for LSF Standard Edition is platform_lsf_std_entitlement.dat. For LSF Express Edition, the file is platform_lsf_exp_entitlement.dat. For LSF Advanced Edition, the file is platform_lsf_adv_entitlement.dat. The entitlement file is installed as <LSF_TOP>/conf/lsf.entitlement. You must download the entitlement file for the edition of the product you are running, and set LSF_ENTITLEMENT_FILE to the full path to the entitlement file you downloaded. Once LSF is installed and running, run the lsid command to see which edition of LSF is enabled.
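The recommended value for LSF_DYNAMIC_HOST_WAIT_TIME described above (up to 60 seconds for every 1000 hosts, to a maximum of 15 minutes) can be expressed as a small helper. This is an illustrative sketch only; the function name is ours, and LSF does not compute this value for you:

```python
import math

def recommended_dynamic_host_wait_time(num_hosts):
    """Suggest a LSF_DYNAMIC_HOST_WAIT_TIME value in seconds:
    60 seconds per (started) block of 1000 hosts, capped at
    15 minutes (900 seconds)."""
    wait = 60 * math.ceil(num_hosts / 1000)
    return min(wait, 900)
```

A 5000-host cluster would get a suggested wait of 300 seconds, while very large clusters are capped at 900 seconds.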
Example LSF_ENTITLEMENT_FILE=/usr/share/lsf_distrib/lsf.entitlement Default None — required variable LSF_MASTER_LIST Syntax LSF_MASTER_LIST="host_name [ host_name ...]" Description Required for a first-time installation. List of LSF server hosts to be master or master candidates in the cluster. You must specify at least one valid server host to start the cluster. The first host listed is the LSF master host. During upgrade, specify the existing value. Valid Values LSF server host names Example LSF_MASTER_LIST="hosta hostb hostc hostd" Default None—required variable LSF_QUIET_INST Syntax LSF_QUIET_INST="Y" | "N" Description Enables quiet installation. Set the value to Y if you want to hide the LSF installation messages. Example LSF_QUIET_INST="Y" Default N (installer displays messages during installation) LSF_SILENT_INSTALL_TARLIST Syntax LSF_SILENT_INSTALL_TARLIST="ALL" | "Package_Name ..." Description A string that contains all LSF package names to be installed. This name list only applies to the silent install mode. Supports the keywords "all", "ALL", and "All", which install all packages in LSF_TARDIR. Example LSF_SILENT_INSTALL_TARLIST="ALL" | "lsf9.1.3_linux2.6-glibc2.3x86_64.tar.Z" Default None LSF_TARDIR Syntax LSF_TARDIR="/path" Description Full path to the directory containing the LSF distribution tar files. Example LSF_TARDIR="/usr/share/lsf_distrib" Default The parent directory of the current working directory. For example, if lsfinstall is running under /usr/share/lsf_distrib/lsf_lsfinstall, the LSF_TARDIR default value is /usr/share/lsf_distrib. LSF_TOP Syntax LSF_TOP="/path" Description Required. Full path to the top-level LSF installation directory. Valid Value The path to LSF_TOP must be shared and accessible to all hosts in the cluster. It cannot be the root directory (/).
The file system containing LSF_TOP must have enough disk space for all host types (approximately 300 MB per host type). Example LSF_TOP="/usr/share/lsf" Default None - required variable PATCH_BACKUP_DIR Syntax PATCH_BACKUP_DIR="/path" Description Full path to the patch backup directory. This parameter is used when you install a new cluster for the first time, and is ignored for all other cases. The file system containing the patch backup directory must have sufficient disk space to back up your files (approximately 400 MB per binary type if you want to be able to install and roll back one enhancement pack and a few additional fixes). It cannot be the root directory (/). If the directory already exists, it must be writable by the cluster administrator (lsfadmin). If you need to change the directory after installation, edit PATCH_BACKUP_DIR in LSF_TOP/patch.conf and move the saved backup files to the new directory manually. Example PATCH_BACKUP_DIR="/usr/share/lsf/patch/backup" Default LSF_TOP/patch/backup PATCH_HISTORY_DIR Syntax PATCH_HISTORY_DIR="/path" Description Full path to the patch history directory. This parameter is used when you install a new cluster for the first time, and is ignored for all other cases. It cannot be the root directory (/). If the directory already exists, it must be writable by lsfadmin. The location is saved as PATCH_HISTORY_DIR in LSF_TOP/patch.conf. Do not change the directory after installation. Example PATCH_HISTORY_DIR="/usr/share/lsf/patch" Default LSF_TOP/patch SILENT_INSTALL Syntax SILENT_INSTALL="Y" | "N" Description Setting this parameter to Y enables the silent installation, which means that you accept the license agreement. Default N lim.acct The lim.acct file is the log file for Load Information Manager (LIM). Produced by lsmon, lim.acct contains host load information collected and distributed by LIM.
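To recap the LSF_TOP constraints above in code form, here is a hedged sketch. The function name and the free-space argument are our own illustration; the LSF installer performs its own checks:

```python
def check_lsf_top(path, free_mb, num_host_types):
    """Validate an LSF_TOP candidate against the documented rules:
    it cannot be the root directory (/), and the file system needs
    roughly 300 MB of free space per host type."""
    errors = []
    if path.rstrip("/") == "":
        errors.append("LSF_TOP cannot be the root directory (/)")
    required = 300 * num_host_types
    if free_mb < required:
        errors.append("need approximately %d MB free" % required)
    return errors
```

For example, a cluster with two host types needs roughly 600 MB free on the shared file system that holds LSF_TOP.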
lim.acct structure The first line of lim.acct contains a list of load index names separated by spaces. This list of load index names can be specified in the lsmon command line. The default list is "r15s r1m r15m ut pg ls it swp mem tmp". Subsequent lines in the file contain the host’s load information at the time the information was recorded. Fields Fields are ordered in the following sequence: time (%ld) The time when the load information is written to the log file host name (%s) The name of the host. status of host (%d) An array of integers. The first integer marks the operation status of the host. Additional integers are used as a bit map to indicate load status of the host. An integer can be used for 32 load indices. If the number of user defined load indices is not more than 21, only one integer is used for both built-in load indices and external load indices. See the hostload structure in ls_load(3) for the description of these fields. indexvalue (%f) 30 Platform LSF Configuration Reference lim.acct A sequence of load index values. Each value corresponds to the index name in the first line of lim.acct. The order in which the index values are listed is the same as the order of the index names. lsb.acct The lsb.acct file is the batch job log file of LSF. The master batch daemon (see mbatchd(8)) generates a record for each job completion or failure. The record is appended to the job log file lsb.acct. The file is located in LSB_SHAREDIR/cluster_name/logdir, where LSB_SHAREDIR must be defined in lsf.conf(5) and cluster_name is the name of the LSF cluster, as returned by lsid(1). See mbatchd(8) for the description of LSB_SHAREDIR. The bacct command uses the current lsb.acct file for its output. lsb.acct structure The job log file is an ASCII file with one record per line. The fields of a record are separated by blanks. 
If the value of some field is unavailable, a pair of double quotation marks ("") is logged for character string, 0 for time and number, and -1 for resource usage. Configuring automatic archiving The following parameters in lsb.params affect how records are logged to lsb.acct: ACCT_ARCHIVE_AGE=days Enables automatic archiving of LSF accounting log files, and specifies the archive interval. LSF archives the current log file if the length of time from its creation date exceeds the specified number of days. By default there is no limit to the age of lsb.acct. ACCT_ARCHIVE_SIZE=kilobytes Enables automatic archiving of LSF accounting log files, and specifies the archive threshold. LSF archives the current log file if its size exceeds the specified number of kilobytes. By default, there is no limit to the size of lsb.acct. ACCT_ARCHIVE_TIME=hh:mm Enables automatic archiving of LSF accounting log file lsb.acct, and specifies the time of day to archive the current log file. By default, no time is set for archiving lsb.acct. MAX_ACCT_ARCHIVE_FILE=integer Enables automatic deletion of archived LSF accounting log files and specifies the archive limit. By default, lsb.acct.n files are not automatically deleted. Records and fields The fields of a record are separated by blanks. The first string of an event record indicates its type. The following types of events are recorded: v JOB_FINISH Chapter 1. Configuration Files 31 lsb.acct v EVENT_ADRSV_FINISH v JOB_RESIZE JOB_FINISH A job has finished. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.acct file format. 
The fields in order of occurrence are: Event type (%s) Which is JOB_FINISH Version Number (%s) Version number of the log file format Event Time (%d) Time the event was logged (in seconds since the epoch) jobId (%d) ID for the job userId (%d) UNIX user ID of the submitter options (%d) Bit flags for job processing numProcessors (%d) Number of processors initially requested for execution submitTime (%d) Job submission time beginTime (%d) Job start time – the job should be started at or after this time termTime (%d) Job termination deadline – the job should be terminated by this time startTime (%d) Job dispatch time – time job was dispatched for execution userName (%s) User name of the submitter queue (%s) Name of the job queue to which the job was submitted resReq (%s) Resource requirement specified by the user dependCond (%s) Job dependency condition specified by the user 32 Platform LSF Configuration Reference lsb.acct preExecCmd (%s) Pre-execution command specified by the user fromHost (%s) Submission host name cwd (%s) | | Current working directory (up to 4094 characters for UNIX or 512 characters for Windows), or the current working directory specified by bsub -cwd if that command was used. inFile (%s) Input file name (up to 4094 characters for UNIX or 512 characters for Windows) outFile (%s) output file name (up to 4094 characters for UNIX or 512 characters for Windows) errFile (%s) Error output file name (up to 4094 characters for UNIX or 512 characters for Windows) jobFile (%s) Job script file name numAskedHosts (%d) Number of host names to which job dispatching will be limited askedHosts (%s) List of host names to which job dispatching will be limited (%s for each); nothing is logged to the record for this value if the last field value is 0. 
If there is more than one host name, then each additional host name will be returned in its own field numExHosts (%d) Number of processors used for execution If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in the execHosts field. Logged value reflects the allocation at job finish time. execHosts (%s) List of execution host names (%s for each); nothing is logged to the record for this value if the last field value is 0. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. The logged value reflects the allocation at job finish time. jStatus (%d) Job status. The number 32 represents EXIT, 64 represents DONE hostFactor (%f) CPU factor of the first execution host. Chapter 1. Configuration Files 33 lsb.acct jobName (%s) Job name (up to 4094 characters). command (%s) Complete batch job command specified by the user (up to 4094 characters for UNIX or 512 characters for Windows). | lsfRusage The following fields contain resource usage information for the job (see getrusage(2)). If the value of some field is unavailable (due to job exit or the difference among the operating systems), -1 will be logged. Times are measured in seconds, and sizes are measured in KB. 
ru_utime (%f) User time used ru_stime (%f) System time used ru_maxrss (%f) Maximum shared text size ru_ixrss (%f) Integral of the shared text size over time (in KB seconds) ru_ismrss (%f) Integral of the shared memory size over time (valid only on Ultrix) ru_idrss (%f) Integral of the unshared data size over time ru_isrss (%f) Integral of the unshared stack size over time ru_minflt (%f) Number of page reclaims ru_majflt (%f) Number of page faults ru_nswap (%f) Number of times the process was swapped out ru_inblock (%f) Number of block input operations ru_oublock (%f) Number of block output operations ru_ioch (%f) Number of characters read and written (valid only on HP-UX) ru_msgsnd (%f) Number of System V IPC messages sent ru_msgrcv (%f) 34 Platform LSF Configuration Reference lsb.acct Number of messages received ru_nsignals (%f) Number of signals received ru_nvcsw (%f) Number of voluntary context switches ru_nivcsw (%f) Number of involuntary context switches ru_exutime (%f) Exact user time used (valid only on ConvexOS) mailUser (%s) Name of the user to whom job related mail was sent projectName (%s) LSF project name exitStatus (%d) UNIX exit status of the job maxNumProcessors (%d) Maximum number of processors specified for the job loginShell (%s) Login shell used for the job timeEvent (%s) Time event string for the job - Platform Process Manager only idx (%d) Job array index maxRMem (%d) Maximum resident memory usage in KB of all processes in the job maxRSwap (%d) Maximum virtual memory usage in KB of all processes in the job inFileSpool (%s) Spool input file (up to 4094 characters for UNIX or 512 characters for Windows) commandSpool (%s) Spool command file (up to 4094 characters for UNIX or 512 characters for Windows) rsvId %s Advance reservation ID for a user group name less than 120 characters long; for example, "user2#0" If the advance reservation user group name is longer than 120 characters, the rsvId field output appears last. sla (%s) Chapter 1. 
Configuration Files 35 lsb.acct SLA service class name under which the job runs exceptMask (%d) Job exception handling Values: v J_EXCEPT_OVERRUN 0x02 v J_EXCEPT_UNDERUN 0x04 v J_EXCEPT_IDLE 0x80 additionalInfo (%s) Placement information of HPC jobs exitInfo (%d) Job termination reason, mapped to corresponding termination keyword displayed by bacct. warningAction (%s) Job warning action warningTimePeriod (%d) Job warning time period in seconds chargedSAAP (%s) SAAP charged to a job licenseProject (%s) Platform License Scheduler project name app (%s) Application profile name postExecCmd (%s) Post-execution command to run on the execution host after the job finishes runtimeEstimation (%d) Estimated run time for the job, calculated as the CPU factor of the submission host multiplied by the runtime estimate (in seconds). jobGroupName (%s) Job group name requeueEvalues (%s) Requeue exit value options2 (%d) Bit flags for job processing resizeNotifyCmd (%s) Resize notification command to be invoked on the first execution host upon a resize request. lastResizeTime (%d) Last resize time. The latest wall clock time when a job allocation is changed. rsvId %s 36 Platform LSF Configuration Reference lsb.acct Advance reservation ID for a user group name more than 120 characters long. If the advance reservation user group name is longer than 120 characters, the rsvId field output appears last. jobDescription (%s) Job description (up to 4094 characters). submitEXT Submission extension field, reserved for internal use. Num (%d) Number of elements (key-value pairs) in the structure. key (%s) Reserved for internal use. value (%s) Reserved for internal use. options3 (%d) Bit flags for job processing | bsub -W(%d) Job submission runtime limit numHostRusage(%d) | | The number of host-based resource usage entries (hostRusage) that follow. Enabled by default. hostRusage The following fields contain host-based resource usage information for the job. 
To disable reporting of hostRusage set LSF_HPC_EXTENSIONS=NO_HOST_RUSAGE in lsf.conf. hostname (%s) Name of the host. mem(%d) Total resident memory usage of all processes in the job running on this host. swap(%d) The total virtual memory usage of all processes in the job running on this host. utime(%d) User time used on this host. stime(%d) System time used on this host. hHostExtendInfo(%d) Number of following key-value pairs containing extended host information (PGIDs and PIDs). Set to 0 in lsb.events, lsb.acct, and lsb.stream files. srcJobId (%d) Chapter 1. Configuration Files 37 lsb.acct The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster effectiveResReq (%s) The runtime resource requirements used for the job. network (%s) Network requirements for IBM Parallel Environment (PE) jobs. totalProvisionTime (%d) Platform Dynamic Cluster only - time in seconds that the job has been in the provisioning (PROV) state. runTime (%d) Time in seconds that the job has been in the run state. runTime includes the totalProvisionTime. cpu_frequency(%d) CPU frequency at which the job ran. options4 (%d) Bit flags for job processing | numAllocSlots(%d) Number of allocated slots. | | allocSlots(%s) List of execution host names where the slots are allocated. | EVENT_ADRSV_FINISH An advance reservation has expired. 
The fields in order of occurrence are: Event type (%s) Which is EVENT_ADRSV_FINISH Version Number (%s) Version number of the log file format Event Logging Time (%d) Time the event was logged (in seconds since the epoch); for example, "1038942015" Reservation Creation Time (%d) Time the advance reservation was created (in seconds since the epoch); for example, 1038938898 Reservation Type (%d) Type of advance reservation request: v User reservation (RSV_OPTION_USER, defined as 0x001) v User group reservation (RSV_OPTION_GROUP, defined as 0x002) v System reservation (RSV_OPTION_SYSTEM, defined as 0x004) v Recurring reservation (RSV_OPTION_RECUR, defined as 0x008) For example, 9 is a recurring reservation created for a user. Creator ID (%d) UNIX user ID of the reservation creator; for example, 30408 Reservation ID (rsvId %s) For example, user2#0 User Name (%s) User name of the reservation user; for example, user2 Time Window (%s) Time window of the reservation: v One-time reservation in seconds since the epoch; for example, 1033761000-1033761600 v Recurring reservation; for example, 17:50-18:00 Creator Name (%s) User name of the reservation creator; for example, user1 Duration (%d) Duration of the reservation, in hours, minutes, seconds; for example, 600 is 6 hours, 0 minutes, 0 seconds Number of Resources (%d) Number of reserved resource pairs in the resource list; for example, 2 indicates 2 resource pairs (hostA 1 hostB 1) Host Name (%s) Reservation host name; for example, hostA Number of CPUs (%d) Number of reserved CPUs; for example, 1 JOB_RESIZE When there is an allocation change, LSF logs the event after mbatchd receives a JOB_RESIZE_NOTIFY_DONE event. From lastResizeTime and eventTime, you can calculate the duration of the previous job allocation. The fields in order of occurrence are: Version number (%s) The version number. Event Time (%d) Time the event was logged (in seconds since the epoch). jobId (%d) ID for the job.
idx (%d) Job array index. startTime (%d) The start time of the running job. userId (%d) UNIX user ID of the user invoking the command userName (%s) User name of the submitter resizeType (%d) Resize event type: 0 means grow, 1 means shrink. lastResizeTime(%d) The wall clock time when the job allocation was previously changed. The first lastResizeTime is the job start time. numExecHosts (%d) The number of execution hosts before the allocation was changed. Supports LSF_HPC_EXTENSIONS="SHORT_EVENTFILE". execHosts (%s) Execution host list before the allocation was changed. Supports LSF_HPC_EXTENSIONS="SHORT_EVENTFILE". numResizeHosts (%d) Number of processors used for execution during resize. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in short format. resizeHosts (%s) List of execution host names during resize. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. numAllocSlots(%d) Number of allocated slots. allocSlots(%s) List of execution host names where the slots are allocated. numResizeSlots (%d) Number of slots allocated for executing the resize. resizeSlots (%s) List of execution host names where slots are allocated for resizing. lsb.applications The lsb.applications file defines application profiles. Use application profiles to define common parameters for the same type of jobs, including the execution requirements of the applications, the resources they require, and how they should be run and managed. This file is optional. Use the DEFAULT_APPLICATION parameter in lsb.params to specify a default application profile for all jobs. LSF does not automatically assign a default application profile. This file is installed by default in LSB_CONFDIR/cluster_name/configdir.
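Looking back at the Reservation Type field of EVENT_ADRSV_FINISH records: the RSV_OPTION_* values combine by bitwise OR, so a value of 9 (0x001 | 0x008) is a recurring reservation created for a user. Decoding can be sketched as follows; the helper name and the label strings are our own:

```python
# RSV_OPTION_* bit values as documented for EVENT_ADRSV_FINISH records
RSV_OPTIONS = {
    0x001: "user",        # RSV_OPTION_USER
    0x002: "user group",  # RSV_OPTION_GROUP
    0x004: "system",      # RSV_OPTION_SYSTEM
    0x008: "recurring",   # RSV_OPTION_RECUR
}

def decode_reservation_type(value):
    """Return the reservation attributes encoded in the bit flags,
    in ascending bit order."""
    return [name for bit, name in sorted(RSV_OPTIONS.items()) if value & bit]
```

Decoding 9 yields the "user" and "recurring" attributes, matching the example in the field description.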
Changing lsb.applications configuration After making any changes to lsb.applications, run badmin reconfig to reconfigure mbatchd. Configuration changes apply to pending jobs only. Running jobs are not affected. lsb.applications structure Each application profile definition begins with the line Begin Application and ends with the line End Application. The application name must be specified. All other parameters are optional. Example Begin Application NAME = catia DESCRIPTION = CATIA V5 CPULIMIT = 24:0/hostA # 24 hours of host hostA FILELIMIT = 20000 DATALIMIT = 20000 # jobs data segment limit CORELIMIT = 20000 TASKLIMIT = 5 # job processor limit REQUEUE_EXIT_VALUES = 55 34 78 End Application See the lsb.applications template file for additional application profile examples. #INCLUDE Syntax #INCLUDE "path-to-file" Description A MultiCluster environment allows common configurations to be shared by all clusters. Use #INCLUDE to centralize the configuration work for groups of clusters when they all need to share a common configuration. Using #INCLUDE lets you avoid having to manually merge these common configurations into each local cluster's configuration files. To make the new configuration active, use badmin reconfig, then use bapp to confirm the changes. After configuration, both common resources and local resources will take effect on the local cluster. Examples #INCLUDE "/scratch/Shared/lsf.applications.common.g" #INCLUDE "/scratch/Shared/lsf.applications.common.o" Begin Application ... Time-based configuration supports MultiCluster configuration in terms of shared configuration for groups of clusters. That means you can include a common configuration file by using the time-based feature in local configuration files. If you want to use the time-based function with an include file, the time-based #include should be placed before all sections.
For example: #if time(11:00-20:00) #include "/scratch/Shared/lsf.applications.common.grape" #endif All #include lines must be inserted at the beginning of the local configuration file. If placed within or after any other sections, LSF reports an error. Default Not defined. ABS_RUNLIMIT Syntax ABS_RUNLIMIT=y | Y Description If set, absolute (wall-clock) run time is used instead of normalized run time for all jobs submitted with the following values: v Run time limit or run time estimate specified by the -W or -We option of bsub v RUNLIMIT queue-level parameter in lsb.queues v RUNLIMIT application-level parameter in lsb.applications v RUNTIME parameter in lsb.applications The runtime estimates and limits are not normalized by the host CPU factor. Default Not defined. Run limit and runtime estimate are normalized. BIND_JOB BIND_JOB specifies the processor binding policy for sequential and parallel job processes that run on a single host. On Linux execution hosts that support this feature, job processes are hard bound to selected processors. Syntax BIND_JOB=NONE | BALANCE | PACK | ANY | USER | USER_CPU_LIST Description Note: BIND_JOB is deprecated in LSF Standard Edition and LSF Advanced Edition. You should enable LSF CPU and memory affinity scheduling with the AFFINITY parameter in lsb.hosts. If both BIND_JOB and affinity scheduling are enabled, affinity scheduling takes effect, and LSF_BIND_JOB is disabled. BIND_JOB and LSF_BIND_JOB are the only affinity options available in LSF Express Edition. If the processor binding feature is not configured with the BIND_JOB parameter in an application profile in lsb.applications, the LSF_BIND_JOB configuration setting in lsf.conf takes effect. The application profile configuration for processor binding overrides the lsf.conf configuration.
For backwards compatibility:
v BIND_JOB=Y is interpreted as BIND_JOB=BALANCE
v BIND_JOB=N is interpreted as BIND_JOB=NONE

Supported platforms

Linux with kernel version 2.6 or higher

Default

Not defined. Processor binding is disabled.

CHKPNT_DIR

Syntax

CHKPNT_DIR=chkpnt_dir

Description

Specifies the checkpoint directory for automatic checkpointing for the application. To enable automatic checkpoint for the application profile, administrators must specify a checkpoint directory in the configuration of the application profile.

If CHKPNT_PERIOD, CHKPNT_INITPERIOD or CHKPNT_METHOD was set in an application profile but CHKPNT_DIR was not set, a warning message is issued and those settings are ignored.

The checkpoint directory is the directory where the checkpoint files are created. Specify an absolute path or a path relative to the current working directory for the job. Do not use environment variables in the directory path.

If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides the queue-level configuration. If checkpoint-related configuration is specified in the queue, application profile, and at job level:
v Application-level and job-level parameters are merged. If the same parameter is defined at both job level and in the application profile, the job-level value overrides the application profile value.
v The merged result of job-level and application profile settings overrides the queue-level configuration.

To enable checkpointing of MultiCluster jobs, define a checkpoint directory in an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) of both the submission cluster and the execution cluster. LSF uses the directory specified in the execution cluster. Checkpointing is not supported if a job runs on a leased host.
The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.

Default

Not defined

CHKPNT_INITPERIOD

Syntax

CHKPNT_INITPERIOD=init_chkpnt_period

Description

Specifies the initial checkpoint period in minutes. CHKPNT_DIR must be set in the application profile for this parameter to take effect. The periodic checkpoint specified by CHKPNT_PERIOD does not happen until the initial period has elapsed.

Specify a positive integer.

Job-level command line values override the application profile configuration.

If administrators specify an initial checkpoint period and do not specify a checkpoint period (CHKPNT_PERIOD), the job only checkpoints once.

If the initial checkpoint period of a job is specified, and you run bchkpnt to checkpoint the job at a time before the initial checkpoint period, the initial checkpoint period is not changed by bchkpnt. The first automatic checkpoint still happens after the specified number of minutes.

Default

Not defined

CHKPNT_PERIOD

Syntax

CHKPNT_PERIOD=chkpnt_period

Description

Specifies the checkpoint period for the application in minutes. CHKPNT_DIR must be set in the application profile for this parameter to take effect. The running job is checkpointed automatically every checkpoint period.

Specify a positive integer.

Job-level command line values override the application profile and queue-level configurations. Application profile level configuration overrides the queue-level configuration.

Default

Not defined

CHKPNT_METHOD

Syntax

CHKPNT_METHOD=chkpnt_method

Description

Specifies the checkpoint method. CHKPNT_DIR must be set in the application profile for this parameter to take effect.

Job-level command line values override the application profile configuration.
Default

Not defined

CHUNK_JOB_SIZE

Syntax

CHUNK_JOB_SIZE=integer

Description

Chunk jobs only. Allows jobs submitted to the same application profile to be chunked together and specifies the maximum number of jobs allowed to be dispatched together in a chunk. Specify a positive integer greater than or equal to 1.

All of the jobs in the chunk are scheduled and dispatched as a unit, rather than individually.

Specify CHUNK_JOB_SIZE=1 to disable job chunking for the application. This value overrides chunk job dispatch configured in the queue.

Use the CHUNK_JOB_SIZE parameter to configure application profiles that chunk small, short-running jobs. The ideal candidates for job chunking are jobs that have the same host and resource requirements and typically take 1 to 2 minutes to run.

Job chunking can have the following advantages:
v Reduces communication between sbatchd and mbatchd and reduces scheduling overhead in mbschd.
v Increases job throughput in mbatchd and CPU utilization on the execution hosts.

However, throughput can deteriorate if the chunk job size is too big. Performance may decrease on profiles with CHUNK_JOB_SIZE greater than 30. You should evaluate the chunk job size on your own systems for best performance.

With the MultiCluster job forwarding model, this parameter does not affect MultiCluster jobs that are forwarded to a remote cluster.
Compatibility

This parameter is ignored and jobs are not chunked under the following conditions:
v CPU limit greater than 30 minutes (CPULIMIT parameter in lsb.queues or lsb.applications)
v Run limit greater than 30 minutes (RUNLIMIT parameter in lsb.queues or lsb.applications)
v Runtime estimate greater than 30 minutes (RUNTIME parameter in lsb.applications)

If CHUNK_JOB_DURATION is set in lsb.params, chunk jobs are accepted regardless of the value of CPULIMIT, RUNLIMIT or RUNTIME.

Default

Not defined

CORELIMIT

Syntax

CORELIMIT=integer

Description

The per-process (soft) core file size limit for all of the processes belonging to a job from this application profile (see getrlimit(2)). Application-level limits override any default limit specified in the queue, but must be less than the hard limit of the submission queue. The job-level core limit (bsub -C) overrides queue-level and application-level limits.

By default, the limit is specified in KB. Use LSF_UNIT_FOR_LIMITS in lsf.conf to specify a larger unit for the limit (MB, GB, TB, PB, or EB).

Default

Unlimited

CPU_FREQUENCY

Syntax

CPU_FREQUENCY=[float_number][unit]

Description

Specifies the CPU frequency for an application profile. All jobs submitted to the application profile require the specified CPU frequency. The value is a positive float number with units (GHz, MHz, or KHz). If no units are set, the default is GHz.

This value can also be set using the command bsub -freq. The submission value overrides the application profile value, and the application profile value overrides the queue value.

Default

Not defined (nominal CPU frequency is used)

CPULIMIT

Syntax

CPULIMIT=[hour:]minute[/host_name | /host_model]

Description

Normalized CPU time allowed for all processes of a job running in the application profile. The name of a host or host model specifies the CPU time normalization host to use.
Limits the total CPU time the job can use. This parameter is useful for preventing runaway jobs or jobs that use up too many resources.

When the total CPU time for the whole job has reached the limit, a SIGXCPU signal is sent to all processes belonging to the job. If the job has no signal handler for SIGXCPU, the job is killed immediately. If the SIGXCPU signal is handled, blocked, or ignored by the application, then after the grace period expires, LSF sends SIGINT, SIGTERM, and SIGKILL to the job to kill it.

If a job dynamically spawns processes, the CPU time used by these processes is accumulated over the life of the job. Processes that exist for fewer than 30 seconds may be ignored.

By default, jobs submitted to the application profile without a job-level CPU limit (bsub -c) are killed when the CPU limit is reached. Application-level limits override any default limit specified in the queue.

The number of minutes may be greater than 59. For example, three and a half hours can be specified either as 3:30 or 210.

If no host or host model is given with the CPU time, LSF uses the default CPU time normalization host defined at the queue level (DEFAULT_HOST_SPEC in lsb.queues) if it has been configured; otherwise it uses the default CPU time normalization host defined at the cluster level (DEFAULT_HOST_SPEC in lsb.params) if it has been configured; otherwise it uses the host with the largest CPU factor (the fastest host in the cluster).

On Windows, a job that runs under a CPU time limit may exceed that limit by up to SBD_SLEEP_TIME. This is because sbatchd periodically checks if the limit has been exceeded.

On UNIX systems, the CPU limit can be enforced by the operating system at the process level. You can define whether the CPU limit is a per-process limit enforced by the OS or a per-job limit enforced by LSF with LSB_JOB_CPULIMIT in lsf.conf.
Default

Unlimited

DATALIMIT

Syntax

DATALIMIT=integer

Description

The per-process (soft) data segment size limit (in KB) for all of the processes belonging to a job running in the application profile (see getrlimit(2)).

By default, jobs submitted to the application profile without a job-level data limit (bsub -D) are killed when the data limit is reached. Application-level limits override any default limit specified in the queue, but must be less than the hard limit of the submission queue.

Default

Unlimited

DESCRIPTION

Syntax

DESCRIPTION=text

Description

Description of the application profile. The description is displayed by bapp -l. The description should clearly describe the service features of the application profile to help users select the proper profile for each job.

The text can include any characters, including white space. The text can be extended to multiple lines by ending the preceding line with a backslash (\). The maximum length for the text is 512 characters.

DJOB_COMMFAIL_ACTION

Syntax

DJOB_COMMFAIL_ACTION="KILL_TASKS|IGNORE_COMMFAIL"

Description

Defines the action LSF should take if it detects a communication failure with one or more remote parallel or distributed tasks. If defined with "KILL_TASKS", LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure. If defined with "IGNORE_COMMFAIL", failures are ignored and the job continues. If not defined, LSF terminates all tasks and shuts down the entire job.

This parameter only applies to the blaunch distributed application framework.

When defined in an application profile, the LSB_DJOB_COMMFAIL_ACTION variable is set when running bsub -app for the specified application.

Default

Not defined. Terminate all tasks, and shut down the entire job.
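For example, a profile could be configured so that a blaunch job keeps running even if individual remote tasks lose contact. This is a minimal sketch; the profile name and description are illustrative, not part of the product:

```
Begin Application
NAME        = mpi_resilient        # hypothetical profile name
DESCRIPTION = blaunch jobs that tolerate task communication failures
DJOB_COMMFAIL_ACTION = "IGNORE_COMMFAIL"
End Application
```

Jobs submitted with bsub -app mpi_resilient would then continue when LSF detects a task communication failure, instead of the default behavior of terminating all tasks.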
DJOB_DISABLED

Syntax

DJOB_DISABLED=Y | N

Description

Disables the blaunch distributed application framework.

Default

Not defined. Distributed application framework is enabled.

DJOB_ENV_SCRIPT

Syntax

DJOB_ENV_SCRIPT=script_name

Description

Defines the name of a user-defined script for setting and cleaning up the parallel or distributed job environment. The specified script must support a setup argument and a cleanup argument. The script is executed by LSF with the setup argument before launching a parallel or distributed job, and with the cleanup argument after the job is finished.

The script runs as the user, and is part of the job.

If a full path is specified, LSF uses the path name for the execution. Otherwise, LSF looks for the executable from $LSF_BINDIR.

This parameter only applies to the blaunch distributed application framework.

When defined in an application profile, the LSB_DJOB_ENV_SCRIPT variable is set when running bsub -app for the specified application.

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

If DJOB_ENV_SCRIPT=openmpi_rankfile.sh is set in lsb.applications, LSF creates a host rank file and sets the environment variable LSB_RANK_HOSTFILE.

Default

Not defined.

DJOB_HB_INTERVAL

Syntax

DJOB_HB_INTERVAL=seconds

Description

Value in seconds used to calculate the heartbeat interval between the task RES and job RES of a parallel or distributed job.

This parameter only applies to the blaunch distributed application framework.

When DJOB_HB_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:

max(DJOB_HB_INTERVAL, 10) + host_factor

where host_factor = 0.01 * number of hosts allocated for the job

Default

Not defined. Interval is the default value of LSB_DJOB_HB_INTERVAL.
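The scaling rule above can be sketched as a small calculation. This is an illustration of the documented formula only, not LSF code; the function name is hypothetical:

```python
def djob_hb_interval(configured_seconds, num_hosts):
    """Effective heartbeat interval (seconds) between task RES and job RES.

    Mirrors the documented scaling rule:
        max(DJOB_HB_INTERVAL, 10) + host_factor
    where host_factor = 0.01 * number of hosts allocated for the job.
    """
    host_factor = 0.01 * num_hosts
    return max(configured_seconds, 10) + host_factor

# A configured value below the 10-second floor is raised to 10 before scaling.
print(djob_hb_interval(5, 200))   # -> 12.0
print(djob_hb_interval(30, 50))   # -> 30.5
```

The same rule applies to DJOB_RU_INTERVAL, substituting that parameter for DJOB_HB_INTERVAL.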
DJOB_RESIZE_GRACE_PERIOD

Syntax

DJOB_RESIZE_GRACE_PERIOD=seconds

Description

When a resizable job releases resources, the LSF distributed parallel job framework terminates running tasks if a host has been completely removed. DJOB_RESIZE_GRACE_PERIOD defines a grace period in seconds for the application to clean up tasks itself before LSF forcibly terminates them.

Default

No grace period.

DJOB_RU_INTERVAL

Syntax

DJOB_RU_INTERVAL=seconds

Description

Value in seconds used to calculate the resource usage update interval for the tasks of a parallel or distributed job.

This parameter only applies to the blaunch distributed application framework.

When DJOB_RU_INTERVAL is specified, the interval is scaled according to the number of tasks in the job:

max(DJOB_RU_INTERVAL, 10) + host_factor

where host_factor = 0.01 * number of hosts allocated for the job

Default

Not defined. Interval is the default value of LSB_DJOB_RU_INTERVAL.

DJOB_TASK_BIND

Syntax

DJOB_TASK_BIND=Y | y | N | n

Description

For CPU and memory affinity scheduling jobs launched with the blaunch distributed application framework.

To enable LSF to bind each task to the proper CPUs or NUMA nodes, you must use blaunch to start tasks. You must set DJOB_TASK_BIND=Y in lsb.applications or LSB_DJOB_TASK_BIND=Y in the submission environment before submitting the job. When set, only the CPU and memory bindings allocated to the task itself are set in each task's environment. If DJOB_TASK_BIND=N or LSB_DJOB_TASK_BIND=N, or they are not set, each task has the same CPU or NUMA node binding on one host.

If you do not use blaunch to start tasks, and use another MPI mechanism such as IBM Platform MPI or IBM Parallel Environment, you should not set DJOB_TASK_BIND, or set it to N.

Default

N

ENV_VARS

Syntax

ENV_VARS="name='value'[,name1='value1'] [,name2='value2',...
]"

Description

ENV_VARS defines application-specific environment variables that will be used by jobs for the application. Use this parameter to define name/value pairs as environment variables. These environment variables are also used in the pre/post-execution environment.

You can include spaces within the single quotation marks when defining a value. Commas and double quotation marks are reserved by LSF and cannot be used as part of the environment variable name or value. If the same environment variable is named multiple times in ENV_VARS and given different values, the last value in the list is the one which takes effect. LSF does not allow environment variables to contain other environment variables to be expanded on the execution side.

Do not redefine LSF environment variables in ENV_VARS.

To define a NULL environment variable, use single quotes with nothing inside. For example:
ENV_VARS="TEST_CAR=''"

Any variable set in the user's environment will overwrite the value in ENV_VARS. The application profile value will overwrite the execution host environment value.

After changing the value of this parameter, run badmin reconfig to have the changes take effect. The changes apply to pending jobs only. Running jobs are not affected.

Default

Not defined.

FILELIMIT

Syntax

FILELIMIT=integer

Description

The per-process (soft) file size limit (in KB) for all of the processes belonging to a job running in the application profile (see getrlimit(2)). Application-level limits override any default limit specified in the queue, but must be less than the hard limit of the submission queue.

Default

Unlimited

HOST_POST_EXEC

Syntax

HOST_POST_EXEC=command

Description

Enables host-based post-execution processing at the application level. The HOST_POST_EXEC command runs on all execution hosts after the job finishes.
If job-based post-execution (POST_EXEC) was defined at the queue level, application level, or job level, the HOST_POST_EXEC command runs after POST_EXEC of any level.

Host-based post-execution commands can be configured at the queue and application level, and run in the following order:
1. The application-level command
2. The queue-level command

The supported command rule is the same as the existing POST_EXEC for the queue section. See the POST_EXEC topic for details.

Note: The host-based post-execution command cannot be executed on Windows platforms. This parameter cannot be used to configure job-based post-execution processing.

Default

Not defined.

HOST_PRE_EXEC

Syntax

HOST_PRE_EXEC=command

Description

Enables host-based pre-execution processing at the application level. The HOST_PRE_EXEC command runs on all execution hosts before the job starts. If job-based pre-execution (PRE_EXEC) was defined at the queue level, application level, or job level, the HOST_PRE_EXEC command runs before PRE_EXEC of any level.

Host-based pre-execution commands can be configured at the queue and application level, and run in the following order:
1. The queue-level command
2. The application-level command

The supported command rule is the same as the existing PRE_EXEC for the queue section. See the PRE_EXEC topic for details.

Note: The host-based pre-execution command cannot be executed on Windows platforms. This parameter cannot be used to configure job-based pre-execution processing.

Default

Not defined.

JOB_CWD

Syntax

JOB_CWD=directory

Description

Current working directory (CWD) for the job in the application profile. The path can be absolute or relative to the submission directory. The path can include the following dynamic patterns (which are case sensitive):
v %J - job ID
v %JG - job group (if not specified, it will be ignored)
v %I - job index (default value is 0)
v %EJ - execution job ID
v %EI - execution job index
v %P - project name
v %U - user name
v %G - user group

Unsupported patterns are treated as text. If this parameter is changed, then any newly submitted jobs with the -app option will use the new value for CWD if bsub -cwd is not defined.

JOB_CWD supports all LSF path conventions such as UNIX, UNC and Windows formats. In a mixed UNIX/Windows cluster it can be specified with one value for UNIX and another value for Windows separated by a pipe character (|):

JOB_CWD=unix_path|windows_path

The first part of the path must be for UNIX and the second part must be for Windows. Both paths must be full paths.

Default

Not defined.

JOB_CWD_TTL

Syntax

JOB_CWD_TTL=hours

Description

Specifies the time-to-live (TTL) for the current working directory (CWD) of a job. LSF cleans created CWD directories after a job finishes based on the TTL value. LSF deletes the CWD for the job if LSF created that directory for the job. The following options are available:
v 0 - sbatchd deletes the CWD when all processes related to the job finish.
v 2147483647 - Never delete the CWD for a job.
v 1 to 2147483646 - Delete the CWD for a job after the timeout expires.

The system checks the directory list every 5 minutes with regards to cleaning and deletes only the last directory of the path to avoid conflicts when multiple jobs share some parent directories. The TTL is calculated after the post-exec script finishes. When LSF (sbatchd) starts, it checks the directory list file and deletes expired CWDs.

If the value for this parameter is not set in the application profile, LSF checks to see if it is set at the cluster-wide level. If neither is set, the default value is used.

Default

Not defined. The value of 2147483647 is used, meaning the CWD is not deleted.
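The two parameters above are often used together. The following is a minimal sketch, assuming a profile named scratch_cwd and a /scratch file system, neither of which is part of the product:

```
Begin Application
NAME        = scratch_cwd          # hypothetical profile name
JOB_CWD     = /scratch/%U/%J_%I    # per-user directory named by job ID and index
JOB_CWD_TTL = 24                   # delete the created CWD 24 hours after the job finishes
End Application
```

With this profile, a job submitted by user alice with job ID 1001 and index 0 would run in /scratch/alice/1001_0, and LSF would remove that directory 24 hours after the job (and its post-exec script) finishes.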
JOB_INCLUDE_POSTPROC

Syntax

JOB_INCLUDE_POSTPROC=Y | N

Description

Specifies whether LSF includes the post-execution processing of the job as part of the job. When set to Y:
v Prevents a new job from starting on a host until post-execution processing is finished on that host
v Includes the CPU and run times of post-execution processing with the job CPU and run times
v sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time

The variable LSB_JOB_INCLUDE_POSTPROC in the user environment overrides the value of JOB_INCLUDE_POSTPROC in an application profile in lsb.applications. JOB_INCLUDE_POSTPROC in an application profile in lsb.applications overrides the value of JOB_INCLUDE_POSTPROC in lsb.params.

For CPU and memory affinity jobs, if JOB_INCLUDE_POSTPROC=Y, LSF does not release affinity resources until post-execution processing has finished, since slots are still occupied by the job during post-execution processing.

For SGI cpusets, if JOB_INCLUDE_POSTPROC=Y, LSF does not release the cpuset until post-execution processing has finished, even though post-execution processes are not attached to the cpuset.

Default

N. Post-execution processing is not included as part of the job, and a new job can start on the execution host before post-execution processing finishes.

JOB_POSTPROC_TIMEOUT

Syntax

JOB_POSTPROC_TIMEOUT=minutes

Description

Specifies a timeout in minutes for job post-execution processing. The specified timeout must be greater than zero.

If post-execution processing takes longer than the timeout, sbatchd reports that post-execution has failed (POST_ERR status). On UNIX and Linux, it kills the entire process group of the job's post-execution processes. On Windows, only the parent process of the post-execution command is killed when the timeout expires; the child processes of the post-execution command are not killed.
If JOB_INCLUDE_POSTPROC=Y, and sbatchd kills the post-execution processes because the timeout has been reached, the CPU time of the post-execution processing is set to 0, and the job's CPU time does not include the CPU time of post-execution processing.

JOB_POSTPROC_TIMEOUT defined in an application profile in lsb.applications overrides the value in lsb.params. JOB_POSTPROC_TIMEOUT cannot be defined in the user environment.

When running host-based post-execution processing, set JOB_POSTPROC_TIMEOUT to a value that gives the process enough time to run.

Default

Not defined. Post-execution processing does not time out.

JOB_PREPROC_TIMEOUT

Syntax

JOB_PREPROC_TIMEOUT=minutes

Description

Specify a timeout in minutes for job pre-execution processing. The specified timeout must be an integer greater than zero. If the job's pre-execution processing takes longer than the timeout, LSF kills the job's pre-execution processes, kills the job with a pre-defined exit value of 98, and then requeues the job to the head of the queue. However, if the number of pre-execution retries has reached the limit, LSF suspends the job with PSUSP status instead of requeuing it.

JOB_PREPROC_TIMEOUT defined in an application profile in lsb.applications overrides the value in lsb.params. JOB_PREPROC_TIMEOUT cannot be defined in the user environment.

On UNIX and Linux, sbatchd kills the entire process group of the job's pre-execution processes. On Windows, only the parent process of the pre-execution command is killed when the timeout expires; the child processes of the pre-execution command are not killed.

Default

Not defined. Pre-execution processing does not time out. However, when running host-based pre-execution processing, you cannot use the infinite value or it will fail. You must configure a reasonable value.

JOB_SIZE_LIST

Syntax

JOB_SIZE_LIST=default_size [size ...]
Description

A list of job sizes (number of tasks) that are allowed on this application.

When submitting a job or modifying a pending job that requests a job size by using the -n or -R options for bsub and bmod, the requested job size must be a single fixed value that matches one of the values that JOB_SIZE_LIST specifies, which are the job sizes that are allowed on this application profile. LSF rejects the job if the requested job size is not in this list. In addition, when using bswitch to switch a pending job with a requested job size to another queue, the requested job size in the pending job must also match one of the values in JOB_SIZE_LIST for the new queue.

The first value in this list is the default job size, which is the assigned job size request if the job was submitted without requesting one. The remaining values are the other job sizes allowed in the queue, and may be defined in any order.

When defined in both a queue (lsb.queues) and an application profile, the job size request must satisfy both requirements. In addition, JOB_SIZE_LIST overrides any TASKLIMIT parameters defined at the same level. Job size requirements do not apply to queues and application profiles with no job size lists, nor do they apply to other levels of job submissions (that is, host-level or cluster-level job submissions).

Note: An exclusive job may allocate more slots on the host than is required by the tasks. For example, if JOB_SIZE_LIST=8 and an exclusive job requesting -n8 runs on a 16-slot host, all 16 slots are assigned to the job. The job runs as expected, since the 8 tasks specified for the job matches the job size list.

Valid values

A space-separated list of positive integers between 1 and 2147483646.
Default

Undefined

JOB_STARTER

Syntax

JOB_STARTER=starter [starter] ["%USRCMD"] [starter]

Description

Creates a specific environment for submitted jobs prior to execution. An application-level job starter overrides a queue-level job starter.

starter is any executable that can be used to start the job (that is, it can accept the job as an input argument). Optionally, additional strings can be specified.

By default, the user commands run after the job starter. A special string, %USRCMD, can be used to represent the position of the user's job in the job starter command line. The %USRCMD string and any additional commands must be enclosed in quotation marks (" ").

Example

JOB_STARTER=csh -c "%USRCMD;sleep 10"

In this case, if a user submits a job

bsub myjob arguments

the command that actually runs is:

csh -c "myjob arguments;sleep 10"

Default

Not defined. No job starter is used.

LOCAL_MAX_PREEXEC_RETRY

Syntax

LOCAL_MAX_PREEXEC_RETRY=integer

Description

The maximum number of times to attempt the pre-execution command of a job on the local cluster. When this limit is reached, the default behavior of the job is defined by the LOCAL_MAX_PREEXEC_RETRY_ACTION parameter in lsb.params, lsb.queues, or lsb.applications.

Valid values

0 < LOCAL_MAX_PREEXEC_RETRY < INFINIT_INT

INFINIT_INT is defined in lsf.h.

Default

Not defined. The number of pre-execution retry times is unlimited.

See also

LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.params, lsb.queues, and lsb.applications.

LOCAL_MAX_PREEXEC_RETRY_ACTION

Syntax

LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT

Description

The default behavior of a job when it reaches the maximum number of times to attempt its pre-execution command on the local cluster (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications).
v If set to SUSPEND, the job is suspended and its status is set to PSUSP.
v If set to EXIT, the job exits and its status is set to EXIT. The job exits with the same exit code as the last pre-execution fail exit code.

This parameter is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The action specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.

Default

Not defined. If not defined in lsb.queues or lsb.params, the default action is SUSPEND.

See also

LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, and lsb.applications.

MAX_JOB_PREEMPT

Syntax

MAX_JOB_PREEMPT=integer

Description

The maximum number of times a job can be preempted. Applies to queue-based preemption only.

Valid values

0 < MAX_JOB_PREEMPT < INFINIT_INT

INFINIT_INT is defined in lsf.h.

Default

Not defined. The number of preemption times is unlimited.

MAX_JOB_REQUEUE

Syntax

MAX_JOB_REQUEUE=integer

Description

The maximum number of times to requeue a job automatically.

Valid values

0 < MAX_JOB_REQUEUE < INFINIT_INT

INFINIT_INT is defined in lsf.h.

Default

Not defined. The number of requeue times is unlimited.

MAX_PREEXEC_RETRY

Syntax

MAX_PREEXEC_RETRY=integer

Description

Use REMOTE_MAX_PREEXEC_RETRY instead. This parameter is only maintained for backwards compatibility.

MultiCluster job forwarding model only. The maximum number of times to attempt the pre-execution command of a job from a remote cluster. If the job's pre-execution command fails all attempts, the job is returned to the submission cluster.

Valid values

0 < MAX_PREEXEC_RETRY < INFINIT_INT

INFINIT_INT is defined in lsf.h.

Default

5

MAX_TOTAL_TIME_PREEMPT

Syntax

MAX_TOTAL_TIME_PREEMPT=integer

Description

The accumulated preemption time in minutes after which a job cannot be preempted again, where minutes is wall-clock time, not normalized time.
Setting this parameter in lsb.applications overrides the parameter of the same name in lsb.queues and in lsb.params.

Valid values

Any positive integer greater than or equal to one (1)

Default

Unlimited

MEMLIMIT

Syntax

MEMLIMIT=integer

Description

The per-process (soft) process resident set size limit for all of the processes belonging to a job running in the application profile. Sets the maximum amount of physical memory (resident set size, RSS) that may be allocated to a process.

By default, the limit is specified in KB. Use LSF_UNIT_FOR_LIMITS in lsf.conf to specify a larger unit for the limit (MB, GB, TB, PB, or EB).

By default, jobs submitted to the application profile without a job-level memory limit are killed when the memory limit is reached. Application-level limits override any default limit specified in the queue, but must be less than the hard limit of the submission queue.

LSF has two methods of enforcing memory usage:
v OS Memory Limit Enforcement
v LSF Memory Limit Enforcement

OS memory limit enforcement

OS memory limit enforcement is the default MEMLIMIT behavior and does not require further configuration. OS enforcement usually allows the process to eventually run to completion. LSF passes MEMLIMIT to the OS, which uses it as a guide for the system scheduler and memory allocator. The system may allocate more memory to a process if there is a surplus. When memory is low, the system takes memory from and lowers the scheduling priority (re-nice) of a process that has exceeded its declared MEMLIMIT. Only available on systems that support RLIMIT_RSS for setrlimit().

Not supported on:
v Sun Solaris 2.x
v Windows

LSF memory limit enforcement

To enable LSF memory limit enforcement, set LSB_MEMLIMIT_ENFORCE in lsf.conf to y. LSF memory limit enforcement explicitly sends a signal to kill a running process once it has allocated memory past MEMLIMIT.
You can also enable LSF memory limit enforcement by setting LSB_JOB_MEMLIMIT in lsf.conf to y. The difference between LSB_JOB_MEMLIMIT set to y and LSB_MEMLIMIT_ENFORCE set to y is that with LSB_JOB_MEMLIMIT, only the per-job memory limit enforced by LSF is enabled; the per-process memory limit enforced by the OS is disabled. With LSB_MEMLIMIT_ENFORCE set to y, both the per-job memory limit enforced by LSF and the per-process memory limit enforced by the OS are enabled.

Available for all systems on which LSF collects total memory usage.

Default

Unlimited

MEMLIMIT_TYPE

Syntax

MEMLIMIT_TYPE=JOB [PROCESS] [TASK]

MEMLIMIT_TYPE=PROCESS [JOB] [TASK]

MEMLIMIT_TYPE=TASK [PROCESS] [JOB]

Description

A memory limit is the maximum amount of memory a job is allowed to consume. Jobs that exceed the limit are killed. You can specify different types of memory limits to enforce. Use any combination of JOB, PROCESS, and TASK.

By specifying a value in the application profile, you override these three parameters: LSB_JOB_MEMLIMIT, LSB_MEMLIMIT_ENFORCE, and LSF_HPC_EXTENSIONS (TASK_MEMLIMIT).

Note: A task list is a list in LSF that keeps track of the default resource requirements for different applications and task eligibility for remote execution.

v PROCESS: Applies a memory limit by OS process, which is enforced by the OS on the slave machine (where the job is running). When the memory allocated to one process of the job exceeds the memory limit, LSF kills the job.

v TASK: Applies a memory limit based on the task list file. It is enforced by LSF. LSF terminates the entire parallel job if any single task exceeds the limit setting for memory and swap limits.

v JOB: Applies a memory limit identified in a job and enforced by LSF. When the sum of the memory allocated to all processes of the job exceeds the memory limit, LSF kills the job.
v PROCESS TASK: Enables both the process-level memory limit enforced by the OS and the task-level memory limit enforced by LSF.

v PROCESS JOB: Enables both the process-level memory limit enforced by the OS and the job-level memory limit enforced by LSF.

v TASK JOB: Enables both the task-level and the job-level memory limits enforced by LSF.

v PROCESS TASK JOB: Enables the process-level memory limit enforced by the OS, and the task-level and job-level memory limits enforced by LSF.

Default

Not defined. The memory limit level is still controlled by LSF_HPC_EXTENSIONS=TASK_MEMLIMIT, LSB_JOB_MEMLIMIT, and LSB_MEMLIMIT_ENFORCE.

MIG

Syntax

MIG=minutes

Description

Enables automatic job migration and specifies the migration threshold for checkpointable or rerunnable jobs, in minutes. LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes. A value of 0 specifies that a suspended job is migrated immediately. The migration threshold applies to all jobs running on the host.

A job-level command-line migration threshold overrides the threshold configured in the application profile and the queue. Application profile configuration overrides queue-level configuration. When a host migration threshold is specified and is lower than the value for the job, the queue, or the application, the host value is used.

Members of a chunk job can be migrated. Chunk jobs in WAIT state are removed from the job chunk and put into PEND state.

Does not affect MultiCluster jobs that are forwarded to a remote cluster.

Default

Not defined. LSF does not migrate checkpointable or rerunnable jobs automatically.

NAME

Syntax

NAME=string

Description

Required. Unique name for the application profile.

Specify any ASCII string up to 60 characters long. You can use letters, digits, underscores (_), dashes (-), periods (.), or spaces in the name. The application profile name must be unique within the cluster.
Note: If you want to specify the ApplicationVersion in a JSDL file, include the version when you define the application profile name. Separate the name and version by a space, as shown in the following example:

NAME=myapp 1.0

Default

You must specify this parameter to define an application profile. LSF does not automatically assign a default application profile name.

NETWORK_REQ

Syntax

NETWORK_REQ="network_res_req"

network_res_req has the following syntax:

[type=sn_all | sn_single] [:protocol=protocol_name[(protocol_number)][,protocol_name[(protocol_number)]]] [:mode=US | IP] [:usage=dedicated | shared] [:instance=positive_integer]

Description

For LSF IBM Parallel Environment (PE) integration. Specifies the network resource requirements for a PE job.

If any network resource requirement is specified in the job, queue, or application profile, the job is treated as a PE job. PE jobs can only run on hosts where the IBM PE pnsd daemon is running.

The network resource requirement string network_res_req has the same syntax as the bsub -network option.

The -network bsub option overrides the value of NETWORK_REQ defined in lsb.queues or lsb.applications. The value of NETWORK_REQ defined in lsb.applications overrides queue-level NETWORK_REQ defined in lsb.queues.

The following IBM LoadLeveler job command file options are not supported in LSF:
v collective_groups
v imm_send_buffers
v rcxtblocks

The following network resource requirement options are supported:

type=sn_all | sn_single

Specifies the adapter device type to use for message passing: either sn_all or sn_single.

sn_single

When used for switch adapters, specifies that all windows are on a single network.

sn_all

Specifies that one or more windows are on each network, and that striped communication should be used over all available switch networks. The networks specified must be accessible by all hosts selected to run the PE job.
See the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information about submitting jobs that use striping.

If mode is IP and type is specified as sn_all or sn_single, the job will only run on InfiniBand (IB) adapters (IPoIB). If mode is IP and type is not specified, the job will only run on Ethernet adapters (IPoEth). For IPoEth jobs, LSF ensures the job is running on hosts where pnsd is installed and running. For IPoIB jobs, LSF ensures the job is running on hosts where pnsd is installed and running, and that IB networks are up.

Because IP jobs do not consume network windows, LSF does not check if all network windows are used up or the network is already occupied by a dedicated PE job.

Equivalent to the PE MP_EUIDEVICE environment variable and the -euidevice PE flag. See the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information. Only sn_all and sn_single are supported by LSF. The other types supported by PE are not supported for LSF jobs.

protocol=protocol_name[(protocol_number)]

Network communication protocol for the PE job, indicating which message passing API is being used by the application. The following protocols are supported by LSF:

mpi

The application makes only MPI calls. This value applies to any MPI job regardless of the library that it was compiled with (PE MPI, MPICH2).

pami

The application makes only PAMI calls.

lapi

The application makes only LAPI calls.

shmem

The application makes only OpenSHMEM calls.

user_defined_parallel_api

The application makes only calls from a parallel API that you define. For example: protocol=myAPI or protocol=charm.

The default value is mpi.

LSF also supports an optional protocol_number (for example, mpi(2)), which specifies the number of contexts (endpoints) per parallel API instance.
The number must be a power of 2, no greater than 128 (1, 2, 4, 8, 16, 32, 64, or 128). LSF passes the communication protocols to PE without any change, and reserves network windows for each protocol.

When you specify multiple parallel API protocols, you cannot make calls to both LAPI and PAMI (lapi, pami) or LAPI and OpenSHMEM (lapi, shmem) in the same application. Protocols can be specified in any order.

See the MP_MSG_API and MP_ENDPOINTS environment variables and the -msg_api and -endpoints PE flags in the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information about the communication protocols that are supported by IBM PE.

mode=US | IP

The network communication system mode used by the specified communication protocol: US (User Space) or IP (Internet Protocol). A US job can only run with adapters that support user space communications, such as the IB adapter. IP jobs can run with either Ethernet adapters or IB adapters. When IP mode is specified, the instance number cannot be specified, and network usage must be unspecified or shared.

Each instance of US mode requested by a task running on switch adapters requires an adapter window. For example, if a task requests both the MPI and LAPI protocols such that both protocol instances require US mode, two adapter windows will be used.

The default value is US.

usage=dedicated | shared

Specifies whether the adapter can be shared with tasks of other job steps: dedicated or shared. Multiple tasks of the same job can share one network even if usage is dedicated.

The default usage is shared.

instance=positive_integer

The number of parallel communication paths (windows) per task made available to the protocol on each network. The number actually used depends on the implementation of the protocol subsystem.

The default value is 1.
If the specified value is greater than MAX_PROTOCOL_INSTANCES in lsb.params or lsb.queues, LSF rejects the job.

LSF_PE_NETWORK_NUM must be defined to a non-zero value in lsf.conf for NETWORK_REQ to take effect. If LSF_PE_NETWORK_NUM is not defined or is set to 0, NETWORK_REQ is ignored with a warning message.

Example

The following network resource requirement string specifies the requirements for an sn_all job (one or more windows are on each network, and striped communication should be used over all available switch networks). The PE job uses MPI API calls (protocol), runs in user-space network communication system mode, and requires 1 parallel communication path (window) per task.

NETWORK_REQ = "protocol=mpi:mode=us:instance=1:type=sn_all"

Default

No default value, but if you specify no value (NETWORK_REQ=""), the job uses the following in the application profile:

protocol=mpi:mode=US:usage=shared:instance=1

NICE

Syntax

NICE=integer

Description

Adjusts the UNIX scheduling priority at which jobs from the application execute.

A value of 0 (zero) maintains the default scheduling priority for UNIX interactive jobs. This value adjusts the run-time priorities for batch jobs to control their effect on other batch or interactive jobs. See the nice(1) manual page for more details.

On Windows, this value is mapped to Windows process priority classes as follows:
v nice>=0 corresponds to a priority class of IDLE
v nice<0 corresponds to a priority class of NORMAL

LSF on Windows does not support HIGH or REAL-TIME priority classes.

When set, this value overrides NICE set at the queue level in lsb.queues.

Default

Not defined.

NO_PREEMPT_INTERVAL

Syntax

NO_PREEMPT_INTERVAL=minutes

Description

Prevents preemption of jobs for the specified number of minutes of uninterrupted run time, where minutes is wall-clock time, not normalized time.
NO_PREEMPT_INTERVAL=0 allows immediate preemption of jobs as soon as they start or resume running.

Setting this parameter in lsb.applications overrides the parameter of the same name in lsb.queues and in lsb.params.

Default

0

NO_PREEMPT_FINISH_TIME

Syntax

NO_PREEMPT_FINISH_TIME=minutes | percentage

Description

Prevents preemption of jobs that will finish within the specified number of minutes or the specified percentage of the estimated run time or run limit.

Specifies that jobs due to finish within the specified number of minutes or percentage of job duration should not be preempted, where minutes is wall-clock time, not normalized time. The percentage must be greater than 0% and less than 100% (between 1% and 99%).

For example, if the job run limit is 60 minutes and NO_PREEMPT_FINISH_TIME=10%, the job cannot be preempted after it runs 54 minutes or longer.

If you specify a percentage for NO_PREEMPT_FINISH_TIME, a run time (bsub -We or RUNTIME in lsb.applications) or a run limit (bsub -W, RUNLIMIT in lsb.queues, or RUNLIMIT in lsb.applications) must be specified for the job.

NO_PREEMPT_RUN_TIME

Syntax

NO_PREEMPT_RUN_TIME=minutes | percentage

Description

Prevents preemption of jobs that have been running for the specified number of minutes or the specified percentage of the estimated run time or run limit.

Specifies that jobs that have been running for the specified number of minutes or longer should not be preempted, where minutes is wall-clock time, not normalized time. The percentage must be greater than 0% and less than 100% (between 1% and 99%).

For example, if the job run limit is 60 minutes and NO_PREEMPT_RUN_TIME=50%, the job cannot be preempted after it has been running for 30 minutes or longer.
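The percentage arithmetic used by NO_PREEMPT_FINISH_TIME and NO_PREEMPT_RUN_TIME can be sketched as follows (an illustration only, not an LSF API; the function names are hypothetical):

```python
def no_preempt_finish_cutoff(run_limit_min: float, pct: float) -> float:
    """Run time (minutes) after which NO_PREEMPT_FINISH_TIME=pct% blocks
    preemption: the job is within pct% of its run limit."""
    return run_limit_min * (100 - pct) / 100

def no_preempt_run_cutoff(run_limit_min: float, pct: float) -> float:
    """Run time (minutes) after which NO_PREEMPT_RUN_TIME=pct% blocks
    preemption: the job has run pct% of its run limit."""
    return run_limit_min * pct / 100

# With a 60-minute run limit:
print(no_preempt_finish_cutoff(60, 10))  # 54.0 -> no preemption after 54 minutes
print(no_preempt_run_cutoff(60, 50))     # 30.0 -> no preemption after 30 minutes
```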
If you specify a percentage for NO_PREEMPT_RUN_TIME, a run time (bsub -We or RUNTIME in lsb.applications) or a run limit (bsub -W, RUNLIMIT in lsb.queues, or RUNLIMIT in lsb.applications) must be specified for the job.

PERSISTENT_HOST_ORDER

Syntax

PERSISTENT_HOST_ORDER=Y | yes | N | no

Description

Applies when migrating parallel jobs in a MultiCluster environment. Setting PERSISTENT_HOST_ORDER=Y ensures that jobs are restarted on hosts based on alphabetical names of the hosts, preventing them from being restarted on the same hosts that they ran on before migration.

Default

PERSISTENT_HOST_ORDER=N. Migrated jobs in a MultiCluster environment could run on the same hosts that they ran on before.

POST_EXEC

Syntax

POST_EXEC=command

Description

Enables post-execution processing at the application level. The POST_EXEC command runs on the execution host after the job finishes. Post-execution commands can be configured at the job, application, and queue levels.

If both application-level (POST_EXEC in lsb.applications) and job-level post-execution commands are specified, job-level post-execution overrides application-level post-execution commands. Queue-level post-execution commands (POST_EXEC in lsb.queues) run after application-level and job-level post-execution commands.

The POST_EXEC command uses the same environment variable values as the job, and runs under the user account of the user who submits the job.

When a job exits with one of the application profile's REQUEUE_EXIT_VALUES, LSF requeues the job and sets the environment variable LSB_JOBPEND. The post-execution command runs after the requeued job finishes.

When the post-execution command is run, the environment variable LSB_JOBEXIT_STAT is set to the exit status of the job. If the execution environment for the job cannot be set up, LSB_JOBEXIT_STAT is set to 0 (zero).
The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

For UNIX:
v The pre- and post-execution commands run in the /tmp directory under /bin/sh -c, which allows the use of shell features in the commands. The following example shows valid configuration lines:
PRE_EXEC= /usr/share/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC= /usr/share/lsf/misc/testq_post | grep -v "Hey!"
v LSF sets the PATH environment variable to PATH="/bin /usr/bin /sbin /usr/sbin"
v The stdin, stdout, and stderr are set to /dev/null
v To allow UNIX users to define their own post-execution commands, an LSF administrator specifies the environment variable $USER_POSTEXEC as the POST_EXEC command. A user then defines the post-execution command:
setenv USER_POSTEXEC /path_name
Note: The path name for the post-execution command must be an absolute path. This parameter cannot be used to configure host-based post-execution processing.

For Windows:
v The pre- and post-execution commands run under cmd.exe /c
v The standard input, standard output, and standard error are set to NULL
v The PATH is determined by the setup of the LSF Service
Note: For post-execution commands that execute on a Windows Server 2003, x64 Edition platform, users must have read and execute privileges for cmd.exe.

Default

Not defined. No post-execution commands are associated with the application profile.

PREEMPT_DELAY

Syntax

PREEMPT_DELAY=seconds

Description

Preemptive jobs will wait the specified number of seconds from the submission time before preempting any lower-priority preemptable jobs. During the grace period, preemption is not triggered, but the job can be scheduled and dispatched by other scheduling policies.

This feature can provide flexibility to tune the system to reduce the number of preemptions.
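For example, the grace period might be set at the application level as follows (the profile name and value are illustrative):

```
Begin Application
NAME          = myapp   # hypothetical profile name
PREEMPT_DELAY = 120     # preemptive jobs wait 2 minutes before triggering preemption
End Application
```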
It can improve overall performance and job throughput: when the low-priority jobs are short, allowing high-priority jobs to wait a while for them to finish avoids preemption entirely. If the job is still pending after the grace period has expired, preemption is triggered.

The waiting time applies to preemptive jobs in the pending state only. It does not affect preemptive jobs that are suspended. The time is counted from the submission time of the job. The submission time means the time mbatchd accepts a job, which includes newly submitted jobs, restarted jobs (by brestart), and jobs forwarded from a remote cluster.

When the preemptive job is waiting, the pending reason is:

The preemptive job is allowing a grace period before preemption.

If you use an older version of bjobs, the pending reason is:

Unknown pending reason code <6701>;

The parameter is defined in lsb.params, lsb.queues (overrides lsb.params), and lsb.applications (overrides both lsb.params and lsb.queues). Run badmin reconfig to make your changes take effect.

Default

Not defined (if the parameter is not defined anywhere, preemption is immediate).

PRE_EXEC

Syntax

PRE_EXEC=command

Description

Enables pre-execution processing at the application level. The PRE_EXEC command runs on the execution host before the job starts. If the PRE_EXEC command exits with a non-zero exit code, LSF requeues the job to the front of the queue.

Pre-execution commands can be configured at the application, queue, and job levels and run in the following order:
1. The queue-level command
2. The application-level or job-level command. If you specify a command at both the application and job levels, the job-level command overrides the application-level command; the application-level command is ignored.
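The ordering rules above can be illustrated with a minimal configuration sketch (the queue name, profile name, and script paths are hypothetical):

```
# lsb.queues -- the queue-level command runs first
Begin Queue
QUEUE_NAME = normal
PRE_EXEC   = /usr/local/scripts/queue_pre
End Queue

# lsb.applications -- runs second, unless the job supplies its own
# pre-execution command (bsub -E), which overrides this one
Begin Application
NAME     = myapp
PRE_EXEC = /usr/local/scripts/app_pre
End Application
```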
The PRE_EXEC command uses the same environment variable values as the job, and runs under the user account of the user who submits the job.

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

For UNIX:
v The pre- and post-execution commands run in the /tmp directory under /bin/sh -c, which allows the use of shell features in the commands. The following example shows valid configuration lines:
PRE_EXEC= /usr/share/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC= /usr/share/lsf/misc/testq_post | grep -v "Hey!"
v LSF sets the PATH environment variable to PATH="/bin /usr/bin /sbin /usr/sbin"
v The stdin, stdout, and stderr are set to /dev/null

For Windows:
v The pre- and post-execution commands run under cmd.exe /c
v The standard input, standard output, and standard error are set to NULL
v The PATH is determined by the setup of the LSF Service

Note: For pre-execution commands that execute on a Windows Server 2003, x64 Edition platform, users must have read and execute privileges for cmd.exe.

This parameter cannot be used to configure host-based pre-execution processing.

Default

Not defined. No pre-execution commands are associated with the application profile.

PROCESSLIMIT

Syntax

PROCESSLIMIT=integer

Description

Limits the number of concurrent processes that can be part of a job.

By default, jobs submitted to the application profile without a job-level process limit are killed when the process limit is reached. Application-level limits override any default limit specified in the queue.

SIGINT, SIGTERM, and SIGKILL are sent to the job in sequence when the limit is reached.

Default

Unlimited

REMOTE_MAX_PREEXEC_RETRY

Syntax

REMOTE_MAX_PREEXEC_RETRY=integer

Description

MultiCluster job forwarding model only.
The maximum number of times to attempt the pre-execution command of a job from a remote cluster. If the job's pre-execution command fails all attempts, the job is returned to the submission cluster.

Valid values

Any value up to INFINIT_INT. INFINIT_INT is defined in lsf.h.

Default

5

REQUEUE_EXIT_VALUES

Syntax

REQUEUE_EXIT_VALUES=[exit_code ...] [EXCLUDE(exit_code ...)]

Description

Enables automatic job requeue and sets the LSB_EXIT_REQUEUE environment variable. Use spaces to separate multiple exit codes. Application-level exit values override queue-level values. Job-level exit values (bsub -Q) override application-level and queue-level values.

exit_code has the following form:

"[all] [~number ...] | [number ...]"

The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

Jobs are requeued to the head of the queue. The output from the failed run is not saved, and the user is not notified by LSF.

Define an exit code as EXCLUDE(exit_code) to enable exclusive job requeue, ensuring the job does not rerun on the same host. Exclusive job requeue does not work for parallel jobs.

For MultiCluster jobs forwarded to a remote execution cluster, the exit values specified in the submission cluster with the EXCLUDE keyword are treated as if they were non-exclusive.

You can also requeue a job if the job is terminated by a signal. If a job is killed by a signal, the exit value is 128+signal_value. The sum of 128 and the signal value can be used as the exit code in the parameter REQUEUE_EXIT_VALUES. For example, if you want a job to rerun if it is killed with signal 9 (SIGKILL), the exit value would be 128+9=137. You can configure the following requeue exit value to allow a job to be requeued if it was killed by signal 9:

REQUEUE_EXIT_VALUES=137

In Windows, if a job is killed by a signal, the exit value is signal_value.
The signal value can be used as the exit code in the parameter REQUEUE_EXIT_VALUES. For example, if you want to rerun a job after it was killed with signal 7 (SIGKILL), the exit value would be 7. You can configure the following requeue exit value to allow a job to requeue after it was killed by signal 7:

REQUEUE_EXIT_VALUES=7

You can configure the following requeue exit value to allow a job to requeue on both Linux and Windows after it was killed:

REQUEUE_EXIT_VALUES=137 7

If mbatchd is restarted, it does not remember the previous hosts from which the job exited with an exclusive requeue exit code. In this situation, it is possible for a job to be dispatched to hosts on which the job has previously exited with an exclusive exit code.

You should configure REQUEUE_EXIT_VALUES for interruptible backfill queues (INTERRUPTIBLE_BACKFILL=seconds).

Example

REQUEUE_EXIT_VALUES=30 EXCLUDE(20) means that jobs with exit code 30 are requeued, jobs with exit code 20 are requeued exclusively, and jobs with any other exit code are not requeued.

Default

Not defined. Jobs are not requeued.

RERUNNABLE

Syntax

RERUNNABLE=yes | no

Description

If yes, enables automatic job rerun (restart) for any job associated with the application profile. Rerun is disabled when RERUNNABLE is set to no. The yes and no arguments are not case-sensitive.

Members of a chunk job can be rerunnable. If the execution host becomes unavailable, rerunnable chunk job members are removed from the job chunk and dispatched to a different execution host.

Job-level rerun (bsub -r) overrides the RERUNNABLE value specified in the application profile, which overrides the queue specification. Using bmod -rn to make rerunnable jobs non-rerunnable overrides both the application profile and the queue.

Default

Not defined.

RES_REQ

Syntax

RES_REQ=res_req

Description

Resource requirements used to determine eligible hosts. Specify a resource requirement string as usual.
The resource requirement string lets you specify conditions in a more flexible manner than using the load thresholds. The following resource requirement sections are supported:
v select
v rusage
v order
v span
v same
v cu
v affinity

Resource requirement strings can be simple (applying to the entire job), compound (applying to the specified number of slots), or can contain alternative resources (alternatives between two or more simple and/or compound requirements). When a compound resource requirement is set at the application level, it will be ignored if any job-level resource requirements (simple or compound) are defined.

Compound and alternative resource requirements follow the same set of rules for determining how resource requirements are merged between the job, application, and queue levels. In the event no job-level resource requirements are set, the compound or alternative application-level requirements interact with queue-level resource requirement strings in the following ways:

v If no queue-level resource requirement is defined, or a compound or alternative queue-level resource requirement is defined, the application-level requirement is used.
v If a simple queue-level requirement is defined, the application-level and queue-level requirements combine as follows:

select
Both levels must be satisfied; the queue requirement applies to all terms.

same
The queue level is ignored.

order and span
The application-level section overwrites the queue-level section (if a given level is present); the queue requirement (if used) applies to all terms.

rusage
v Both levels merge.
v The queue requirement, if a job-based resource, is applied to the first term only; otherwise it applies to all terms.
v If conflicts occur, the application-level section overwrites the queue-level section.
For example: if the application-level requirement is num1*{rusage[R1]} + num2*{rusage[R2]} and the queue-level requirement is rusage[RQ] where RQ is a job resource, the merged requirement is num1*{rusage[merge(R1,RQ)]} + num2*{rusage[R2]}

Compound or alternative resource requirements do not support the cu section, or the || operator within the rusage section. Alternative resource strings use the || operator as a separator for each alternative resource.

Multiple -R strings cannot be used with multi-phase rusage resource requirements.

For internal load indices and duration, jobs are rejected if they specify resource reservation requirements at the job or application level that exceed the requirements specified in the queue.

By default, memory (mem) and swap (swp) limits in select[] and rusage[] sections are specified in MB. Use LSF_UNIT_FOR_LIMITS in lsf.conf to specify a larger unit for these limits (GB, TB, PB, or EB).

When LSF_STRICT_RESREQ=Y is configured in lsf.conf, resource requirement strings in select sections must conform to a more strict syntax. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, cu, or affinity).
When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

select section

For simple resource requirements, the select sections defined at the application, queue, and job level must all be satisfied.

rusage section

The rusage section can specify additional requests. To do this, use the OR (||) operator to separate additional rusage strings. The job-level rusage section takes precedence.

Note: Compound resource requirements do not support use of the || operator within the component rusage simple resource requirements. Multiple rusage strings cannot be used with multi-phase rusage resource requirements.

When both job-level and application-level rusage sections are defined using simple resource requirement strings, the rusage section defined for the job overrides the rusage section defined in the application profile. The rusage definitions are merged, with the job-level rusage taking precedence. Any queue-level requirements are then merged with that result.

For example:

Application-level RES_REQ:
RES_REQ=rusage[mem=200:lic=1] ...

For the job submission:
bsub -R "rusage[mem=100]" ...

the resulting requirement for the job is

rusage[mem=100:lic=1]

where mem=100 specified by the job overrides mem=200 specified by the application profile. However, lic=1 from the application profile is kept, since the job does not specify it.

Application-level RES_REQ threshold:
RES_REQ = rusage[bwidth=2:threshold=5] ...

For the job submission:
bsub -R "rusage[bwidth=1:threshold=6]" ...

the resulting requirement for the job is

rusage[bwidth=1:threshold=6]

Application-level RES_REQ with decay and duration defined:
RES_REQ=rusage[mem=200:duration=20:decay=1] ...

For a job submission with no decay or duration:
bsub -R "rusage[mem=100]" ...
the resulting requirement for the job is:

rusage[mem=100:duration=20:decay=1]

Application-level duration and decay are merged with the job-level specification, and mem=100 for the job overrides mem=200 specified by the application profile. However, duration=20 and decay=1 from the application profile are kept, since the job does not specify them.

Application-level RES_REQ with multi-phase job-level rusage:
RES_REQ=rusage[mem=(200 150):duration=(10 10):decay=(1),swap=100] ...

For a multi-phase job submission:
bsub -app app_name -R "rusage[mem=(600 350):duration=(20 10):decay=(0 1)]" ...

the resulting requirement for the job is:

rusage[mem=(600 350):duration=(20 10):decay=(0 1),swap=100]

The job-level values for mem, duration, and decay override the application-level values. However, swap=100 from the application profile is kept, since the job does not specify swap.

Application-level RES_REQ with multi-phase application-level rusage:
RES_REQ=rusage[mem=(200 150):duration=(10 10):decay=(1)] ...

For a job submission:
bsub -app app_name -R "rusage[mem=200:duration=15:decay=0]" ...

the resulting requirement for the job is:

rusage[mem=200:duration=15:decay=0]

Job-level values override the application-level multi-phase rusage string.

Note: The merged application-level and job-level rusage consumable resource requirements must satisfy any limits set by the parameter RESRSV_LIMIT in lsb.queues, or the job will be rejected.

order section

For simple resource requirements, the order section defined at the job level overrides any application-level order section. An application-level order section overrides a queue-level specification. The order section defined at the application level is ignored if any resource requirements are specified at the job level. If none of the resource requirements include an order section, the default order r15s:pg is used.
The syntax of the order section is:
[!][-]resource_name [: [-]resource_name]
For example:
bsub -R "order[!ncpus:mem]" myjob

"!" only works with consumable resources, because consumable resources can be specified in the rusage[] section and their values may change within a scheduling cycle (for example, slots or memory). In the LSF scheduler, slots under RUN, SSUSP, USUSP, and RSV may be freed in different scheduling phases, so the slot value may change between scheduling cycles.

span section

For simple resource requirements, the span section defined at the job level overrides an application-level span section, which overrides a queue-level span section.

Note: Define span[hosts=-1] in the application profile or in the bsub -R resource requirement string to disable the span section setting in the queue.

same section

For simple resource requirements, all same sections defined at the job level, application level, and queue level are combined before the job is dispatched.

cu section

For simple resource requirements, the job-level cu section overrides the application-level, and the application-level cu section overrides the queue-level.

affinity section

For simple resource requirements, the job-level affinity section overrides the application-level, and the application-level affinity section overrides the queue-level.

Default

select[type==local] order[r15s:pg]
If this parameter is defined and a host model or Boolean resource is specified, the default type is any.

RESIZABLE_JOBS

Syntax

RESIZABLE_JOBS = [Y|N|auto]

Description

N|n: The resizable job feature is disabled in the application profile. Under this setting, no jobs attached to this application profile are resizable, and all bresize and bsub -ar commands are rejected with a proper error message.

Y|y: Resizing is enabled in the application profile, and all jobs belonging to the application are resizable by default.
Under this setting, users can run bresize commands to cancel pending resource allocation requests for the job or release resources from an existing job allocation, or use bsub to submit an autoresizable job.

auto: All jobs belonging to the application are autoresizable.

Resizable jobs must be submitted with an application profile that defines RESIZABLE_JOBS as either auto or Y. If the application defines RESIZABLE_JOBS=auto, but the administrator changes it to N and reconfigures LSF, jobs without the job-level autoresizable attribute are no longer autoresizable. For running jobs that are in the middle of the notification stage, LSF lets the current notification complete and stops scheduling. Changing the RESIZABLE_JOBS configuration does not affect jobs with the job-level autoresizable attribute. (This behavior is the same as for exclusive jobs, bsub -x, and the EXCLUSIVE parameter at the queue level.)

Auto-resizable jobs cannot be submitted with compute unit resource requirements. In the event a bswitch call or queue reconfiguration results in an auto-resizable job running in a queue with compute unit resource requirements, the job is no longer auto-resizable.

Resizable jobs cannot have compound resource requirements.

Default

If the parameter is undefined, the default value is N.

RESIZE_NOTIFY_CMD

Syntax

RESIZE_NOTIFY_CMD = notification_command

Description

Defines an executable command to be invoked on the first execution host of a job when a resize event occurs. The maximum length of the notification command is 4 KB.

Default

Not defined. No resize notification command is invoked.

RESUME_CONTROL

Syntax

RESUME_CONTROL=signal | command

Remember: Unlike the JOB_CONTROLS parameter in lsb.queues, the RESUME_CONTROL parameter does not require square brackets ([ ]) around the action.

v signal is a UNIX signal name. The specified signal is sent to the job. The same set of signals is not supported on all UNIX systems.
To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
v command specifies a /bin/sh command line to be invoked. Do not quote the command line inside an action definition. Do not specify a signal followed by an action that triggers the same signal. For example, do not specify RESUME_CONTROL=bresume. This causes a deadlock between the signal and the action.

Description

Changes the behavior of the RESUME action in LSF.
v The contents of the configuration line for the action are run with /bin/sh -c, so you can use shell features in the command.
v The standard input, output, and error of the command are redirected to the NULL device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
v The command is run as the user of the job.
v All environment variables set for the job are also set for the command action. The following additional environment variables are set:
– LSB_JOBPGIDS - a list of current process group IDs of the job
– LSB_JOBPIDS - a list of current process IDs of the job
v If the command fails, LSF retains the original job status.

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

Default

v On UNIX, by default, RESUME sends SIGCONT.
v On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them.

RTASK_GONE_ACTION

Syntax

RTASK_GONE_ACTION="[KILLJOB_TASKDONE | KILLJOB_TASKEXIT] [IGNORE_TASKCRASH]"

Description

Defines the actions LSF takes if it detects that a remote task of a parallel or distributed job is gone.
This parameter applies only to the blaunch distributed application framework.

IGNORE_TASKCRASH
A remote task crashes. LSF does nothing. The job continues to launch the next task.
KILLJOB_TASKDONE
A remote task exits with a zero value. LSF terminates all tasks in the job.
KILLJOB_TASKEXIT
A remote task exits with a non-zero value. LSF terminates all tasks in the job.

Environment variable

When defined in an application profile, the LSB_DJOB_RTASK_GONE_ACTION variable is set when running bsub -app for the specified application. You can also use the environment variable LSB_DJOB_RTASK_GONE_ACTION to override the value set in the application profile.

Example

RTASK_GONE_ACTION="IGNORE_TASKCRASH KILLJOB_TASKEXIT"

Default

Not defined. LSF does nothing.

RUNLIMIT

Syntax

RUNLIMIT=[hour:]minute[/host_name | /host_model]

Description

The default run limit. The name of a host or host model specifies the runtime normalization host to use.

By default, jobs that are in the RUN state for longer than the specified run limit are killed by LSF. You can optionally provide your own termination job action to override this default.

Jobs submitted with a job-level run limit (bsub -W) that is less than the run limit are killed when their job-level run limit is reached. Jobs submitted with a run limit greater than the maximum run limit are rejected. Application-level limits override any default limit specified in the queue.

Note: If you want to provide an estimated run time for scheduling purposes without killing jobs that exceed the estimate, define the RUNTIME parameter in the application profile, or submit the job with -We instead of a run limit.

The run limit is in the form [hour:]minute. The minutes can be specified as a number greater than 59. For example, three and a half hours can be specified either as 3:30 or as 210. The run limit you specify is the normalized run time.
This is done so that the job does approximately the same amount of processing, even if it is sent to a host with a faster or slower CPU. Whenever a normalized run time is given, the actual time on the execution host is the specified time multiplied by the CPU factor of the normalization host, then divided by the CPU factor of the execution host.

If ABS_RUNLIMIT=Y is defined in lsb.params or in the application profile, the run time limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted to an application profile with a run limit configured.

Optionally, you can supply a host name or a host model name defined in LSF. You must insert '/' between the run limit and the host name or model name. (See lsinfo(1) to get host model information.)

If no host or host model is given, LSF uses the default runtime normalization host defined at the queue level (DEFAULT_HOST_SPEC in lsb.queues) if it has been configured; otherwise, LSF uses the default CPU time normalization host defined at the cluster level (DEFAULT_HOST_SPEC in lsb.params) if it has been configured; otherwise, the host with the largest CPU factor (the fastest host in the cluster).

For MultiCluster jobs, if no other CPU time normalization host is defined and information about the submission host is not available, LSF uses the host with the largest CPU factor (the fastest host in the cluster).

Jobs submitted to a chunk job queue are not chunked if RUNLIMIT is greater than 30 minutes.

Default

Unlimited

RUNTIME

Syntax

RUNTIME=[hour:]minute[/host_name | /host_model]

Description

The RUNTIME parameter specifies an estimated run time for jobs associated with an application. LSF uses the RUNTIME value for scheduling purposes only, and does not kill jobs that exceed this value unless the jobs also exceed a defined RUNLIMIT. The format of the runtime estimate is the same as for the RUNLIMIT parameter.
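The run-limit normalization described above (actual time on the execution host equals the specified limit multiplied by the normalization host's CPU factor, divided by the execution host's CPU factor) can be sketched as follows. The helper names and the CPU factor values are illustrative assumptions, not LSF code or real host data.

```python
# Illustrative sketch of [hour:]minute parsing and run-limit
# normalization; NOT LSF code. CPU factors below are hypothetical.

def parse_limit(spec):
    """Parse '[hour:]minute' into total minutes, e.g. '3:30' -> 210."""
    parts = spec.split(":")
    if len(parts) == 2:
        return int(parts[0]) * 60 + int(parts[1])
    return int(parts[0])

def actual_limit(minutes, norm_cpu_factor, exec_cpu_factor):
    """Normalized limit times the normalization host's CPU factor,
    divided by the execution host's CPU factor."""
    return minutes * norm_cpu_factor / exec_cpu_factor

limit = parse_limit("3:30")           # 210 minutes, normalized
print(actual_limit(limit, 2.0, 4.0))  # faster execution host -> 105.0
```

A host with twice the CPU factor of the normalization host thus gets half the wall-clock time, so the job does approximately the same amount of processing.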
The job-level runtime estimate specified by bsub -We overrides the RUNTIME setting in an application profile.

The following LSF features use the RUNTIME value to schedule jobs:
v Job chunking
v Advance reservation
v SLA
v Slot reservation
v Backfill

Default

Not defined

STACKLIMIT

Syntax

STACKLIMIT=integer

Description

The per-process (soft) stack segment size limit for all of the processes belonging to a job from this queue (see getrlimit(2)). Application-level limits override any default limit specified in the queue, but must be less than the hard limit of the submission queue.

By default, the limit is specified in KB. Use LSF_UNIT_FOR_LIMITS in lsf.conf to specify a larger unit for the limit (MB, GB, TB, PB, or EB).

Default

Unlimited

SUCCESS_EXIT_VALUES

Syntax

SUCCESS_EXIT_VALUES=[exit_code ...]

Description

Specifies the exit values that LSF uses to determine whether a job completed successfully. Use spaces to separate multiple exit codes. Job-level success exit values specified with the LSB_SUCCESS_EXIT_VALUES environment variable override the configuration in the application profile.

Use SUCCESS_EXIT_VALUES for applications that successfully exit with non-zero values so that LSF does not interpret non-zero exit codes as job failure. exit_code should be a value between 0 and 255.

If both SUCCESS_EXIT_VALUES and REQUEUE_EXIT_VALUES are defined with the same exit code, REQUEUE_EXIT_VALUES takes precedence: the job is set to PEND state and requeued.

Default

0

SUSPEND_CONTROL

Syntax

SUSPEND_CONTROL=signal | command | CHKPNT

Remember: Unlike the JOB_CONTROLS parameter in lsb.queues, the SUSPEND_CONTROL parameter does not require square brackets ([ ]) around the action.

v signal is a UNIX signal name (for example, SIGTSTP). The specified signal is sent to the job. The same set of signals is not supported on all UNIX systems.
To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
v command specifies a /bin/sh command line to be invoked.
– Do not quote the command line inside an action definition.
– Do not specify a signal followed by an action that triggers the same signal. For example, do not specify SUSPEND_CONTROL=bstop. This causes a deadlock between the signal and the action.
v CHKPNT is a special action, which causes the system to checkpoint the job. The job is checkpointed and then stopped by sending the SIGSTOP signal to the job automatically.

Description

Changes the behavior of the SUSPEND action in LSF.
v The contents of the configuration line for the action are run with /bin/sh -c, so you can use shell features in the command.
v The standard input, output, and error of the command are redirected to the NULL device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
v The command is run as the user of the job.
v All environment variables set for the job are also set for the command action. The following additional environment variables are set:
– LSB_JOBPGIDS - a list of current process group IDs of the job
– LSB_JOBPIDS - a list of current process IDs of the job
– LSB_SUSP_REASONS - an integer representing a bitmap of suspending reasons as defined in lsbatch.h. The suspending reason can allow the command to take different actions based on the reason for suspending the job.
– LSB_SUSP_SUBREASONS - an integer representing the load index that caused the job to be suspended
v If the command fails, LSF retains the original job status.

When the suspending reason SUSP_LOAD_REASON (suspended by load) is set in LSB_SUSP_REASONS, LSB_SUSP_SUBREASONS is set to one of the load index values defined in lsf.h.
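As an illustration, a custom SUSPEND_CONTROL command might inspect these two variables to react differently when the job was suspended by load. This is a hedged sketch: `describe_suspension` is a hypothetical helper, and the bit value used for SUSP_LOAD_REASON below is an assumption; the real values are defined in lsbatch.h and lsf.h on your system.

```python
# Illustrative sketch of a custom suspend-control command; NOT LSF code.

SUSP_LOAD_REASON = 0x4  # hypothetical bit value; check lsbatch.h

def describe_suspension(environ):
    """Classify a suspension from the LSB_SUSP_* environment variables."""
    reasons = int(environ.get("LSB_SUSP_REASONS", "0"))
    if reasons & SUSP_LOAD_REASON:
        # LSB_SUSP_SUBREASONS names the load index that triggered it.
        load_index = int(environ.get("LSB_SUSP_SUBREASONS", "-1"))
        return "suspended by load, load index %d" % load_index
    return "suspended for another reason (bitmap 0x%x)" % reasons

# A real control script would pass os.environ; here a fake environment:
print(describe_suspension({"LSB_SUSP_REASONS": "4",
                           "LSB_SUSP_SUBREASONS": "1"}))
```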
Use LSB_SUSP_REASONS and LSB_SUSP_SUBREASONS together in your custom job control to determine the exact load threshold that caused a job to be suspended.
v If an additional action is necessary for the SUSPEND command, that action should also send the appropriate signal to the application. Otherwise, a job can continue to run even after being suspended by LSF. For example, SUSPEND_CONTROL=bkill $LSB_JOBPIDS; command

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

Default

v On UNIX, by default, SUSPEND sends SIGTSTP for parallel or interactive jobs and SIGSTOP for other jobs.
v On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them.

SWAPLIMIT

Syntax

SWAPLIMIT=integer

Description

Limits the total virtual memory for the job. This limit applies to the whole job, no matter how many processes the job may contain. Application-level limits override any default limit specified in the queue.

The action taken when a job exceeds its SWAPLIMIT or PROCESSLIMIT is to send SIGQUIT, SIGINT, SIGTERM, and SIGKILL in sequence. For CPULIMIT, SIGXCPU is sent before SIGINT, SIGTERM, and SIGKILL.

By default, the limit is specified in KB. Use LSF_UNIT_FOR_LIMITS in lsf.conf to specify a larger unit for the limit (MB, GB, TB, PB, or EB).

Default

Unlimited

TASKLIMIT

Syntax

TASKLIMIT=[minimum_limit [default_limit]] maximum_limit

Description

Note: TASKLIMIT replaces PROCLIMIT as of LSF 9.1.3.

For parallel jobs, the maximum number of tasks that can be allocated to the job.
Queue-level TASKLIMIT has the highest priority, over application-level TASKLIMIT and job-level TASKLIMIT. Application-level TASKLIMIT has higher priority than job-level TASKLIMIT. Job-level limits must fall within the maximum and minimum limits of the application profile and the queue.

Note: If you also defined JOB_SIZE_LIST in the same application profile where you defined TASKLIMIT, the TASKLIMIT parameter is ignored.

Optionally specifies the minimum and default number of job tasks. All limits must be positive numbers greater than or equal to 1 that satisfy the following relationship:
1 <= minimum <= default <= maximum

In the MultiCluster job forwarding model, the local cluster considers the receiving queue's TASKLIMIT on remote clusters before forwarding jobs. If the receiving queue's TASKLIMIT definition in the remote cluster cannot satisfy the job's task requirements, the job is not forwarded to that remote queue.

Default

Unlimited; the default number of tasks is 1

TERMINATE_CONTROL

Syntax

TERMINATE_CONTROL=signal | command | CHKPNT

Remember: Unlike the JOB_CONTROLS parameter in lsb.queues, the TERMINATE_CONTROL parameter does not require square brackets ([ ]) around the action.

v signal is a UNIX signal name (for example, SIGTERM). The specified signal is sent to the job. The same set of signals is not supported on all UNIX systems. To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
v command specifies a /bin/sh command line to be invoked.
– Do not quote the command line inside an action definition.
– Do not specify a signal followed by an action that triggers the same signal. For example, do not specify TERMINATE_CONTROL=bkill. This causes a deadlock between the signal and the action.
v CHKPNT is a special action, which causes the system to checkpoint the job. The job is checkpointed and killed automatically.
Description

Changes the behavior of the TERMINATE action in LSF.
v The contents of the configuration line for the action are run with /bin/sh -c, so you can use shell features in the command.
v The standard input, output, and error of the command are redirected to the NULL device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
v The command is run as the user of the job.
v All environment variables set for the job are also set for the command action. The following additional environment variables are set:
– LSB_JOBPGIDS - a list of current process group IDs of the job
– LSB_JOBPIDS - a list of current process IDs of the job

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

Default

v On UNIX, by default, TERMINATE sends SIGINT, SIGTERM, and SIGKILL in that order.
v On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.

THREADLIMIT

Syntax

THREADLIMIT=integer

Description

Limits the number of concurrent threads that can be part of a job. Exceeding the limit causes the job to terminate. The system sends the following signals in sequence to all processes belonging to the job: SIGINT, SIGTERM, and SIGKILL.

By default, jobs submitted to the queue without a job-level thread limit are killed when the thread limit is reached. Application-level limits override any default limit specified in the queue.

The limit must be a positive integer.
Default

Unlimited

USE_PAM_CREDS

Syntax

USE_PAM_CREDS=y | n

Description

If USE_PAM_CREDS=y, applies PAM limits to an application when its job is dispatched to a Linux host using PAM. PAM limits are system resource limits defined in limits.conf. When USE_PAM_CREDS is enabled, PAM limits override other resource limits.

If the execution host does not have PAM configured and this parameter is enabled, the job fails.

For parallel jobs, this parameter only takes effect on the first execution host.

Overrides MEMLIMIT_TYPE=Process. Overridden (for the CPU limit only) by LSB_JOB_CPULIMIT=y. Overridden (for memory limits only) by LSB_JOB_MEMLIMIT=y.

Default

n

Automatic time-based configuration

Use if-else constructs and time expressions to define time windows in the file. Configuration defined within a time window applies only during the specified time period; configuration defined outside of any time window applies at all times. After editing the file, run badmin reconfig to reconfigure the cluster.

Time expressions in the file are evaluated by LSF every 10 minutes, based on the mbatchd start time. When an expression evaluates true, LSF changes the configuration in real time, without restarting mbatchd, providing continuous system availability.

Time-based configuration also supports MultiCluster configuration in terms of shared configuration for groups of clusters (using the #include parameter). That means you can include a common configuration file by using the time-based feature in local configuration files.

Example

Begin application
NAME=app1
#if time(16:00-18:00)
CPULIMIT=180/hostA
#else
CPULIMIT=60/hostA
#endif
End application

In this example, for two hours every day, the configuration is the following:

Begin application
NAME=app1
CPULIMIT=180/hostA
End application

The rest of the time, the configuration is the following:
Begin application
NAME=app1
CPULIMIT=60/hostA
End application

lsb.events

The LSF batch event log file lsb.events is used to display LSF batch event history and for mbatchd failure recovery. Whenever a host, job, or queue changes status, a record is appended to the event log file. The file is located in LSB_SHAREDIR/cluster_name/logdir, where LSB_SHAREDIR must be defined in lsf.conf(5) and cluster_name is the name of the LSF cluster, as returned by lsid. See mbatchd(8) for the description of LSB_SHAREDIR.

The bhist command searches the most current lsb.events file for its output.

lsb.events structure

The event log file is an ASCII file with one record per line. For the lsb.events file, the first line has the format # history_seek_position, which indicates the file position of the first history event after a log switch. For the lsb.events.# file, the first line has the format # timestamp_most_recent_event, which gives the timestamp of the most recent event in the file.

Limiting the size of lsb.events

Use MAX_JOB_NUM in lsb.params to set the maximum number of finished jobs whose events are to be stored in the lsb.events log file. Once the limit is reached, mbatchd starts a new event log file. The old event log file is saved as lsb.events.n, with subsequent sequence number suffixes incremented by 1 each time a new log file is started. Event logging continues in the new lsb.events file.

Records and fields

The fields of a record are separated by blanks. The first string of an event record indicates its type.
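Because records are blank-separated and led by the event type, a minimal reader can be sketched as follows. This is illustrative only: `parse_record` is a hypothetical helper, the sample record is fabricated, and real lsb.events parsing must also handle quoted strings containing embedded blanks.

```python
# Illustrative sketch of splitting one lsb.events-style record; NOT LSF
# code. Real records quote string fields, which a full parser must honor.

def parse_record(line):
    """Return (event_type, remaining_fields) for one event record."""
    fields = line.split()
    return fields[0], fields[1:]

# Fabricated sample record: type, version, event time, job ID.
sample = '"JOB_NEW" "9.13" 1404239963 1234'
event_type, rest = parse_record(sample)
print(event_type)
# -> "JOB_NEW"
```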
The following types of events are recorded:
v JOB_NEW
v JOB_FORWARD
v JOB_ACCEPT
v JOB_ACCEPTACK
v JOB_CHKPNT
v JOB_START
v JOB_START_ACCEPT
v JOB_STATUS
v JOB_SWITCH
v JOB_SWITCH2
v JOB_MOVE
v QUEUE_CTRL
v HOST_CTRL
v MBD_START
v MBD_DIE
v UNFULFILL
v LOAD_INDEX
v JOB_SIGACT
v MIG
v JOB_MODIFY2
v JOB_SIGNAL
v JOB_EXECUTE
v JOB_REQUEUE
v JOB_CLEAN
v JOB_EXCEPTION
v JOB_EXT_MSG
v JOB_ATTA_DATA
v JOB_CHUNK
v SBD_UNREPORTED_STATUS
v PRE_EXEC_START
v JOB_FORCE
v GRP_ADD
v GRP_MOD
v LOG_SWITCH
v JOB_RESIZE_NOTIFY_START
v JOB_RESIZE_NOTIFY_ACCEPT
v JOB_RESIZE_NOTIFY_DONE
v JOB_RESIZE_RELEASE
v JOB_RESIZE_CANCEL
v HOST_POWER_STATUS
v JOB_PROV_HOST

JOB_NEW

A new job has been submitted. The fields in order of occurrence are:

Version number (%s)
The version number
Event time (%d)
The time of the event
jobId (%d)
Job ID
userId (%d)
UNIX user ID of the submitter
options (%d)
Bit flags for job processing
numProcessors (%d)
Number of processors requested for execution
submitTime (%d)
Job submission time
beginTime (%d)
Start time - the job should be started on or after this time
termTime (%d)
Termination deadline - the job should be terminated by this time
sigValue (%d)
Signal value
chkpntPeriod (%d)
Checkpointing period
restartPid (%d)
Restart process ID
userName (%s)
User name
rLimits
Soft CPU time limit (%d), see getrlimit(2)
rLimits
Soft file size limit (%d), see getrlimit(2)
rLimits
Soft data segment size limit (%d), see getrlimit(2)
rLimits
Soft stack segment size limit (%d), see getrlimit(2)
rLimits
Soft core file size limit (%d), see getrlimit(2)
rLimits
Soft memory size limit (%d), see getrlimit(2)
rLimits
Reserved (%d)
rLimits
Reserved (%d)
rLimits
Reserved (%d)
rLimits
Soft run time limit (%d), see getrlimit(2)
rLimits
Reserved (%d)
hostSpec (%s)
Model or host name for normalizing CPU time and run time
hostFactor (%f)
CPU factor of the above host
umask (%d)
File creation mask for this job
queue (%s)
Name of the job queue to which the job was submitted
resReq (%s)
Resource requirements
fromHost (%s)
Submission host name
cwd (%s)
Current working directory (up to 4094 characters for UNIX or 255 characters for Windows)
chkpntDir (%s)
Checkpoint directory
inFile (%s)
Input file name (up to 4094 characters for UNIX or 255 characters for Windows)
outFile (%s)
Output file name (up to 4094 characters for UNIX or 255 characters for Windows)
errFile (%s)
Error output file name (up to 4094 characters for UNIX or 255 characters for Windows)
subHomeDir (%s)
Submitter's home directory
jobFile (%s)
Job file name
numAskedHosts (%d)
Number of candidate host names
askedHosts (%s)
List of names of candidate hosts for job dispatching
dependCond (%s)
Job dependency condition
preExecCmd (%s)
Job pre-execution command
jobName (%s)
Job name (up to 4094 characters)
command (%s)
Job command (up to 4094 characters for UNIX or 255 characters for Windows)
nxf (%d)
Number of files to transfer
xf (%s)
List of file transfer specifications
mailUser (%s)
Mail user name
projectName (%s)
Project name
niosPort (%d)
Callback port if batch interactive job
maxNumProcessors (%d)
Maximum number of processors
schedHostType (%s)
Execution host type
loginShell (%s)
Login shell
timeEvent (%d)
Time event, for the job dependency condition; specifies when the time event ended
userGroup (%s)
User group
exceptList (%s)
Exception handlers for the job
options2 (%d)
Bit flags for job processing
idx (%d)
Job array index
inFileSpool (%s)
Spool input file (up to 4094 characters for UNIX or 255 characters for Windows)
commandSpool (%s)
Spool command file (up to 4094 characters for UNIX or 255 characters for Windows)
jobSpoolDir (%s)
Job spool directory (up to 4094 characters for UNIX or 255 characters for Windows)
userPriority (%d)
User priority
rsvId (%s)
Advance reservation ID; for example, "user2#0"
jobGroup (%s)
The job group under which the job runs
sla (%s)
SLA service class name under which the job runs
rLimits
Thread number limit
extsched (%s)
External scheduling options
warningAction (%s)
Job warning action
warningTimePeriod (%d)
Job warning time period in seconds
SLArunLimit (%d)
Absolute run time limit of the job for SLA service classes
licenseProject (%s)
IBM Platform License Scheduler project name
options3 (%d)
Bit flags for job processing
app (%s)
Application profile name
postExecCmd (%s)
Post-execution command to run on the execution host after the job finishes
runtimeEstimation (%d)
Estimated run time for the job
requeueEValues (%s)
Job exit values for automatic job requeue
resizeNotifyCmd (%s)
Resize notification command to run on the first execution host to inform the job of a resize event
jobDescription (%s)
Job description (up to 4094 characters)
submitEXT
Submission extension field, reserved for internal use.
Num (%d)
Number of elements (key-value pairs) in the structure.
key (%s)
Reserved for internal use.
value (%s)
Reserved for internal use.
srcJobId (%d)
The submission cluster job ID
srcCluster (%s)
The name of the submission cluster
dstJobId (%d)
The execution cluster job ID
dstCluster (%s)
The name of the execution cluster
network (%s)
Network requirements for IBM Parallel Environment (PE) jobs.
cpu_frequency (%d)
CPU frequency at which the job runs.
options4 (%d)
Bit flags for job processing
nStinFile (%d)
(Platform Data Manager for LSF) The number of requested input files
stinFiles
(Platform Data Manager for LSF) List of input data requirement files requested. The list has the following elements:
options (%d)
Bit field that identifies whether the data requirement is an input file or a tag.
host (%s)
Source host of the input file. This field is empty if the data requirement is a tag.
name (%s)
Full path to the input data requirement file on the host. This field is empty if the data requirement is a tag.
hash (%s)
Hash key computed for the data requirement file at job submission time. This field is empty if the data requirement is a tag.
size (%lld)
Size of the data requirement file at job submission time, in bytes.
modifyTime (%d)
Last modified time of the data requirement file at job submission time.

JOB_FORWARD

A job has been forwarded to a remote cluster (IBM Platform MultiCluster only).

If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.
The fields in order of occurrence are:

Version number (%s)
The version number
Event time (%d)
The time of the event
jobId (%d)
Job ID
numReserHosts (%d)
Number of reserved hosts in the remote cluster
If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in the reserHosts field.
cluster (%s)
Remote cluster name
reserHosts (%s)
List of names of the reserved hosts in the remote cluster
If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format.
idx (%d)
Job array index
srcJobId (%d)
The submission cluster job ID
srcCluster (%s)
The name of the submission cluster
dstJobId (%d)
The execution cluster job ID
dstCluster (%s)
The name of the execution cluster
effectiveResReq (%s)
The runtime resource requirements used for the job.

JOB_ACCEPT

A job from a remote cluster has been accepted by this cluster. The fields in order of occurrence are:

Version number (%s)
The version number
Event time (%d)
The time of the event
jobId (%d)
Job ID at the accepting cluster
remoteJid (%d)
Job ID at the submission cluster
cluster (%s)
Job submission cluster name
idx (%d)
Job array index
srcJobId (%d)
The submission cluster job ID
srcCluster (%s)
The name of the submission cluster
dstJobId (%d)
The execution cluster job ID
dstCluster (%s)
The name of the execution cluster

JOB_ACCEPTACK

Contains remote and local job ID mapping information. The default number for the ID is -1 (which means that this is not applicable to the job), and the default value for the cluster name is "" (empty string).
The fields in order of occurrence are:

Version number (%s)
The version number
Event time (%d)
The time of the event
jobId (%d)
The ID number of the job at the execution cluster
idx (%d)
The job array index
jobRmtAttr (%d)
Remote job attributes from:
v Remote batch job on the submission side
v Lease job on the submission side
v Remote batch job on the execution side
v Lease job on the execution side
v Lease job re-synchronization during restart
v Remote batch job re-running on the execution cluster
srcCluster (%s)
The name of the submission cluster
srcJobId (%d)
The submission cluster job ID
dstCluster (%s)
The name of the execution cluster
dstJobId (%d)
The execution cluster job ID

JOB_CHKPNT

Contains job checkpoint information. The fields in order of occurrence are:

Version number (%s)
The version number
Event time (%d)
The time of the event
jobId (%d)
The ID number of the job at the execution cluster
period (%d)
The new checkpointing period
jobPid (%d)
The process ID of the checkpointing process, which is a child sbatchd
ok (%d)
v 0 means the checkpoint started
v 1 means the checkpoint succeeded
flags (%d)
Checkpoint flags, see <lsf/lsbatch.h>:
v LSB_CHKPNT_KILL: Kill the process if the checkpoint is successful
v LSB_CHKPNT_FORCE: Force the checkpoint even if non-checkpointable conditions exist
v LSB_CHKPNT_MIG: Checkpoint for the purpose of migration
idx (%d)
Job array index (must be 0 in JOB_NEW)
srcJobId (%d)
The submission cluster job ID
srcCluster (%s)
The name of the submission cluster
dstJobId (%d)
The execution cluster job ID
dstCluster (%s)
The name of the execution cluster

JOB_START

A job has been dispatched.

If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID jStatus (%d) Job status (4, indicating the RUN status of the job) jobPid (%d) Job process ID jobPGid (%d) Job process group ID hostFactor (%f) CPU factor of the first execution host numExHosts (%d) Number of processors used for execution If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in the execHosts field. execHosts (%s) List of execution host names If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. queuePreCmd (%s) Pre-execution command queuePostCmd (%s) Post-execution command jFlags (%d) Job processing flags userGroup (%s) User group name idx (%d) Job array index additionalInfo (%s) Placement information of HPC jobs preemptBackfill (%d) How long a backfilled job can run. Used for preemption backfill jobs. jFlags2 (%d) Job flags srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster effectiveResReq (%s) The runtime resource requirements used for the job. num_network (%d) The number of the allocated network for IBM Parallel Environment (PE) jobs. networkID (%s) Network ID of the allocated network for IBM Parallel Environment (PE) jobs. num_window (%d) Number of allocated windows for IBM Parallel Environment (PE) jobs. cpu_frequency (%d) CPU frequency at which the job runs. numAllocSlots (%d) Number of allocated slots. allocSlots (%s) List of execution host names where the slots are allocated.

JOB_START_ACCEPT A job has started on the execution host(s).
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID jobPid (%d) Job process ID jobPGid (%d) Job process group ID idx (%d) Job array index srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_STATUS The status of a job changed after dispatch. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID jStatus (%d) New status, see <lsf/lsbatch.h> For JOB_STAT_EXIT (32) and JOB_STAT_DONE (64), host-based resource usage information is appended to the JOB_STATUS record in the fields numHostRusage and hostRusage. reason (%d) Pending or suspended reason code, see <lsf/lsbatch.h> subreasons (%d) Pending or suspended subreason code, see <lsf/lsbatch.h> cpuTime (%f) CPU time consumed so far endTime (%d) Job completion time ru (%d) Resource usage flag lsfRusage (%s) Resource usage statistics, see <lsf/lsf.h> exitStatus (%d) Exit status of the job, see <lsf/lsbatch.h> idx (%d) Job array index exitInfo (%d) Job termination reason, see <lsf/lsbatch.h> duration4PreemptBackfill How long a backfilled job can run. Used for preemption backfill jobs. numHostRusage (%d) For a jStatus of JOB_STAT_EXIT (32) or JOB_STAT_DONE (64), this field contains the number of host-based resource usage entries (hostRusage) that follow. 0 unless LSF_HPC_EXTENSIONS="HOST_RUSAGE" is set in lsf.conf. hostRusage For a jStatus of JOB_STAT_EXIT (32) or JOB_STAT_DONE (64), these fields contain host-based resource usage information for parallel jobs when LSF_HPC_EXTENSIONS="HOST_RUSAGE" is set in lsf.conf. hostname (%s) Name of the host.
mem (%d) Total resident memory usage of all processes in the job running on this host. swap (%d) Total virtual memory usage of all processes in the job running on this host. utime (%d) User time used on this host. stime (%d) System time used on this host. hHostExtendInfo (%d) Number of following key-value pairs containing extended host information (PGIDs and PIDs). Set to 0 in lsb.events, lsb.acct, and lsb.stream files. maxMem Peak memory usage (in Mbytes) avgMem Average memory usage (in Mbytes) srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_SWITCH A job switched from one queue to another (bswitch). The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event userId (%d) UNIX user ID of the user invoking the command jobId (%d) Job ID queue (%s) Target queue name idx (%d) Job array index userName (%s) Name of the job submitter srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_SWITCH2 A job array switched from one queue to another (bswitch).
The fields are: Version number (%s) The version number Event time (%d) The time of the event userId (%d) UNIX user ID of the user invoking the command jobId (%d) Job ID queue (%s) Target queue name userName (%s) Name of the job submitter indexRangeCnt (%s) The number of ranges indicating successfully switched elements indexRangeStart1 (%d) The start of the first index range indexRangeEnd1 (%d) The end of the first index range indexRangeStep1 (%d) The step of the first index range indexRangeStart2 (%d) The start of the second index range indexRangeEnd2 (%d) The end of the second index range indexRangeStep2 (%d) The step of the second index range indexRangeStartN (%d) The start of the last index range indexRangeEndN (%d) The end of the last index range indexRangeStepN (%d) The step of the last index range srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID rmtCluster (%d) The destination cluster to which the remote jobs belong rmtJobCtrlId (%d) Unique identifier for the remote job control session in the MultiCluster. numSuccJobId (%d) The number of jobs that were successful during this remote control operation. succJobIdArray (%d) Contains IDs for all the jobs that were successful during this remote control operation. numFailJobId (%d) The number of jobs that failed during this remote control session. failJobIdArray (%d) Contains IDs for all the jobs that failed during this remote control operation. failReason (%d) Contains the failure code and reason for each failed job in the failJobIdArray. To prevent JOB_SWITCH2 from getting too long, the number of index ranges is limited to 500 per JOB_SWITCH2 event log. Therefore, if switching a large job array, several JOB_SWITCH2 events may be generated.

JOB_MOVE A job moved toward the top or bottom of its queue (bbot or btop).
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event userId (%d) UNIX user ID of the user invoking the command jobId (%d) Job ID position (%d) Position number base (%d) Operation code (TO_TOP or TO_BOTTOM), see <lsf/lsbatch.h> idx (%d) Job array index userName (%s) Name of the job submitter

QUEUE_CTRL A job queue has been altered. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event opCode (%d) Operation code, see <lsf/lsbatch.h> queue (%s) Queue name userId (%d) UNIX user ID of the user invoking the command userName (%s) Name of the user ctrlComments (%s) Administrator comment text from the -C option of badmin queue control commands qclose, qopen, qact, and qinact

HOST_CTRL A batch server host changed status. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event opCode (%d) Operation code, see <lsf/lsbatch.h> host (%s) Host name userId (%d) UNIX user ID of the user invoking the command userName (%s) Name of the user ctrlComments (%s) Administrator comment text from the -C option of badmin host control commands hclose and hopen

MBD_START The mbatchd has started. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event master (%s) Master host name cluster (%s) Cluster name numHosts (%d) Number of hosts in the cluster numQueues (%d) Number of queues in the cluster

MBD_DIE The mbatchd died.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event master (%s) Master host name numRemoveJobs (%d) Number of finished jobs that have been removed from the system and logged in the current event file exitCode (%d) Exit code from mbatchd ctrlComments (%s) Administrator comment text from the -C option of badmin mbdrestart

UNFULFILL Actions that were not taken because the mbatchd was unable to contact the sbatchd on the job execution host. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID notSwitched (%d) Not switched: the mbatchd has switched the job to a new queue, but the sbatchd has not been informed of the switch sig (%d) Signal: this signal has not been sent to the job sig1 (%d) Checkpoint signal: the job has not been sent this signal to checkpoint itself sig1Flags (%d) Checkpoint flags, see <lsf/lsbatch.h> chkPeriod (%d) New checkpoint period for job notModified (%s) If set to true, then parameters for the job cannot be modified. idx (%d) Job array index

LOAD_INDEX mbatchd restarted with these load index names (see lsf.cluster(5)). The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event nIdx (%d) Number of index names name (%s) List of index names

JOB_SIGACT An action on a job has been taken.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID period (%d) Action period pid (%d) Process ID of the child sbatchd that initiated the action jstatus (%d) Job status reasons (%d) Job pending reasons flags (%d) Action flags, see <lsf/lsbatch.h> actStatus (%d) Action status: 1: Action started 2: One action preempted other actions 3: Action succeeded 4: Action failed signalSymbol (%s) Action name, accompanied by actFlags idx (%d) Job array index

MIG A job has been migrated (bmig). The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID numAskedHosts (%d) Number of candidate hosts for migration askedHosts (%s) List of names of candidate hosts userId (%d) UNIX user ID of the user invoking the command idx (%d) Job array index userName (%s) Name of the job submitter srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_MODIFY2 This is created when the mbatchd modifies a previously submitted job with bmod. Version number (%s) The version number Event time (%d) The time of the event jobIdStr (%s) Job ID options (%d) Bit flags for job modification options processing options2 (%d) Bit flags for job modification options processing delOptions (%d) Delete options for the options field userId (%d) UNIX user ID of the submitter userName (%s) User name submitTime (%d) Job submission time umask (%d) File creation mask for this job numProcessors (%d) Number of processors requested for execution. The value 2147483646 means the number of processors is undefined.
beginTime (%d) Start time – the job should be started on or after this time termTime (%d) Termination deadline – the job should be terminated by this time sigValue (%d) Signal value restartPid (%d) Restart process ID for the original job jobName (%s) Job name (up to 4094 characters) queue (%s) Name of job queue to which the job was submitted numAskedHosts (%d) Number of candidate host names askedHosts (%s) List of names of candidate hosts for job dispatching; blank if the last field value is 0. If there is more than one host name, then each additional host name will be returned in its own field resReq (%s) Resource requirements rLimits Soft CPU time limit (%d), see getrlimit(2) rLimits Soft file size limit (%d), see getrlimit(2) rLimits Soft data segment size limit (%d), see getrlimit(2) rLimits Soft stack segment size limit (%d), see getrlimit(2) rLimits Soft core file size limit (%d), see getrlimit(2) rLimits Soft memory size limit (%d), see getrlimit(2) rLimits Reserved (%d) rLimits Reserved (%d) rLimits Reserved (%d) rLimits Soft run time limit (%d), see getrlimit(2) rLimits Reserved (%d) hostSpec (%s) Model or host name for normalizing CPU time and run time dependCond (%s) Job dependency condition timeEvent (%d) Time Event, for job dependency condition; specifies when time event ended subHomeDir (%s) Submitter’s home directory inFile (%s) Input file name (up to 4094 characters for UNIX or 255 characters for Windows) outFile (%s) Output file name (up to 4094 characters for UNIX or 255 characters for Windows) errFile (%s) Error output file name (up to 4094 characters for UNIX or 255 characters for Windows) command (%s) Job command (up to 4094 characters for UNIX or 255 characters for Windows) chkpntPeriod (%d) Checkpointing period chkpntDir (%s) Checkpoint directory nxf (%d) Number of files to transfer xf (%s) List of file transfer specifications jobFile (%s) Job
file name fromHost (%s) Submission host name cwd (%s) Current working directory (up to 4094 characters for UNIX or 255 characters for Windows) preExecCmd (%s) Job pre-execution command mailUser (%s) Mail user name projectName (%s) Project name niosPort (%d) Callback port if batch interactive job maxNumProcessors (%d) Maximum number of processors. The value 2147483646 means the maximum number of processors is undefined. loginShell (%s) Login shell schedHostType (%s) Execution host type userGroup (%s) User group exceptList (%s) Exception handlers for the job delOptions2 (%d) Delete options for the options2 field inFileSpool (%s) Spool input file (up to 4094 characters for UNIX or 255 characters for Windows) commandSpool (%s) Spool command file (up to 4094 characters for UNIX or 255 characters for Windows) userPriority (%d) User priority rsvId (%s) Advance reservation ID; for example, "user2#0" extsched (%s) External scheduling options warningTimePeriod (%d) Job warning time period in seconds warningAction (%s) Job warning action jobGroup (%s) The job group to which the job is attached sla (%s) SLA service class name that the job is to be attached to licenseProject (%s) IBM Platform License Scheduler project name options3 (%d) Bit flags for job processing delOption3 (%d) Delete options for the options3 field app (%s) Application profile name apsString (%s) Absolute priority scheduling (APS) value set by administrator postExecCmd (%s) Post-execution command to run on the execution host after the job finishes runtimeEstimation (%d) Estimated run time for the job requeueEValues (%s) Job exit values for automatic job requeue resizeNotifyCmd (%s) Resize notification command to run on the first execution host to inform job of a resize event. jobdescription (%s) Job description (up to 4094 characters).
srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster network (%s) Network requirements for IBM Parallel Environment (PE) jobs. cpu_frequency (%d) CPU frequency at which the job runs. options4 (%d) Bit flags for job processing nStinFile (%d) (Platform Data Manager for LSF) The number of requested input files stinFiles (Platform Data Manager for LSF) List of input data requirement files requested. The list has the following elements: options (%d) Bit field that identifies whether the data requirement is an input file or a tag. host (%s) Source host of the input file. This field is empty if the data requirement is a tag. name (%s) Full path to the input data requirement file on the host. This field is empty if the data requirement is a tag. hash (%s) Hash key computed for the data requirement file at job submission time. This field is empty if the data requirement is a tag. size (%lld) Size of the data requirement file at job submission time in bytes. modifyTime (%d) Last modified time of the data requirement file at job submission time.

JOB_SIGNAL This is created when a job is signaled with bkill or deleted with bdel. The fields are in the order they are appended: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID userId (%d) UNIX user ID of the user invoking the command runCount (%d) Number of runs signalSymbol (%s) Signal name idx (%d) Job array index userName (%s) Name of the job submitter srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_EXECUTE This is created when a job is actually running on an execution host.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID execUid (%d) Mapped UNIX user ID on execution host jobPGid (%d) Job process group ID execCwd (%s) Current working directory the job used on the execution host (up to 4094 characters for UNIX or 255 characters for Windows) execHome (%s) Home directory the job used on the execution host execUsername (%s) Mapped user name on execution host jobPid (%d) Job process ID idx (%d) Job array index additionalInfo (%s) Placement information of HPC jobs SLAscaledRunLimit (%d) Run time limit for the job scaled by the execution host execRusage An internal field used by LSF. Position An internal field used by LSF. duration4PreemptBackfill How long a backfilled job can run; used for preemption backfill jobs srcJobId (%d) The submission cluster job ID srcCluster (%s) The name of the submission cluster dstJobId (%d) The execution cluster job ID dstCluster (%s) The name of the execution cluster

JOB_REQUEUE This is created when a job has ended and been requeued by mbatchd. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID idx (%d) Job array index

JOB_CLEAN This is created when a job is removed from the mbatchd memory. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID idx (%d) Job array index

JOB_EXCEPTION This is created when an exception condition is detected for a job.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID exceptMask (%d) Exception Id 0x01: missched 0x02: overrun 0x04: underrun 0x08: abend 0x10: cantrun 0x20: hostfail 0x40: startfail 0x100: runtime_est_exceeded actMask (%d) Action Id 0x01: kill 0x02: alarm 0x04: rerun 0x08: setexcept timeEvent (%d) Time Event, for missched exception specifies when time event ended. exceptInfo (%d) Except Info, pending reason for missched or cantrun exception, the exit code of the job for the abend exception, otherwise 0. idx (%d) Job array index

JOB_EXT_MSG An external message has been sent to a job. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID idx (%d) Job array index msgIdx (%d) Index in the list userId (%d) Unique user ID of the user invoking the command dataSize (%ld) Size of the data if it has any, otherwise 0 postTime (%ld) Message sending time dataStatus (%d) Status of the attached data desc (%s) Text description of the message userName (%s) Name of the author of the message Flags (%d) Used for internal flow control

JOB_ATTA_DATA An update on the data status of a message for a job has been sent. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID idx (%d) Job array index msgIdx (%d) Index in the list dataSize (%ld) Size of the data if it has any, otherwise 0 dataStatus (%d) Status of the attached data fileName (%s) File name of the attached data

JOB_CHUNK This is created when a job is inserted into a chunk. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, older daemons and commands (pre-LSF Version 6.0) cannot recognize the lsb.events file format.
The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event membSize (%ld) Size of array membJobId membJobId (%ld) Job IDs of jobs in the chunk numExHosts (%ld) Number of execution hosts If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in the execHosts field. execHosts (%s) Execution host name array If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format.

SBD_UNREPORTED_STATUS This is created when an unreported status change occurs. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID actPid (%d) Acting processing ID jobPid (%d) Job process ID jobPGid (%d) Job process group ID newStatus (%d) New status of the job reason (%d) Pending or suspending reason code, see <lsf/lsbatch.h> suspreason (%d) Pending or suspending subreason code, see <lsf/lsbatch.h> lsfRusage The following fields contain resource usage information for the job (see getrusage(2)). If the value of some field is unavailable (due to job exit or the difference among the operating systems), -1 will be logged. Times are measured in seconds, and sizes are measured in KB. ru_utime (%f) User time used ru_stime (%f) System time used ru_maxrss (%f) Maximum shared text size ru_ixrss (%f) Integral of the shared text size over time (in KB seconds) ru_ismrss (%f) Integral of the shared memory size over time (valid only on Ultrix) ru_idrss (%f) Integral of the unshared data size over time ru_isrss (%f) Integral of the unshared stack size over time ru_minflt (%f) Number of page reclaims ru_majflt (%f)
Number of page faults ru_nswap (%f) Number of times the process was swapped out ru_inblock (%f) Number of block input operations ru_oublock (%f) Number of block output operations ru_ioch (%f) Number of characters read and written (valid only on HP-UX) ru_msgsnd (%f) Number of System V IPC messages sent ru_msgrcv (%f) Number of messages received ru_nsignals (%f) Number of signals received ru_nvcsw (%f) Number of voluntary context switches ru_nivcsw (%f) Number of involuntary context switches ru_exutime (%f) Exact user time used (valid only on ConvexOS) exitStatus (%d) Exit status of the job, see <lsf/lsbatch.h> execCwd (%s) Current working directory the job used on the execution host (up to 4094 characters for UNIX or 255 characters for Windows) execHome (%s) Home directory the job used on the execution host execUsername (%s) Mapped user name on execution host msgId (%d) ID of the message actStatus (%d) Action status 1: Action started 2: One action preempted other actions 3: Action succeeded 4: Action failed sigValue (%d) Signal value seq (%d) Sequence status of the job idx (%d) Job array index jRusage The following fields contain resource usage information for the job. If the value of some field is unavailable (due to job exit or the difference among the operating systems), -1 will be logged. Times are measured in seconds, and sizes are measured in KB. mem (%d) Total resident memory usage in KB of all currently running processes in a given process group swap (%d) Total virtual memory usage in KB of all currently running processes in the given process groups utime (%d) Cumulative total user time in seconds stime (%d) Cumulative total system time in seconds npids (%d) Number of currently active processes in the given process groups.
This entry has four sub-fields: pid (%d) Process ID of the child sbatchd that initiated the action ppid (%d) Parent process ID pgid (%d) Process group ID jobId (%d) Process Job ID npgids (%d) Number of currently active process groups exitInfo (%d) Job termination reason, see <lsf/lsbatch.h>

PRE_EXEC_START A pre-execution command has been started. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID jStatus (%d) Job status (4, indicating the RUN status of the job) jobPid (%d) Job process ID jobPGid (%d) Job process group ID hostFactor (%f) CPU factor of the first execution host numExHosts (%d) Number of processors used for execution execHosts (%s) List of execution host names queuePreCmd (%s) Pre-execution command queuePostCmd (%s) Post-execution command jFlags (%d) Job processing flags userGroup (%s) User group name idx (%d) Job array index additionalInfo (%s) Placement information of HPC jobs effectiveResReq (%s) The runtime resource requirements used for the job.

JOB_FORCE A job has been forced to run with brun. Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID userId (%d) UNIX user ID of the user invoking the command idx (%d) Job array index options (%d) Bit flags for job processing numExecHosts (%ld) Number of execution hosts If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in the execHosts field. execHosts (%s) Execution host name array If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. userName (%s) Name of the user queue (%s) Name of queue if a remote brun job ran; otherwise, this field is empty. For MultiCluster this is the name of the receive queue at the execution cluster.
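Each record described above, including JOB_FORCE, is logged as a single line whose fields appear in the documented order, with %s fields conventionally enclosed in double quotes. As a rough, unofficial sketch of how such a line can be split into the common leading fields (event type, version number, event time) and the event-specific remainder, assuming that double-quoting convention (the sample record below is invented for illustration):

```python
import shlex

def parse_event_line(line):
    # Tokenize one lsb.events record. shlex.split honors the double
    # quotes around string fields (e.g. "JOB_FORCE" "9.13"), so quoted
    # values that contain spaces stay together as one token.
    fields = shlex.split(line)
    event_type = fields[0]        # event record name
    version = fields[1]           # Version number (%s)
    event_time = int(fields[2])   # Event time (%d), epoch seconds
    return event_type, version, event_time, fields[3:]

# Invented JOB_FORCE-style record, for illustration only
etype, ver, when, rest = parse_event_line(
    '"JOB_FORCE" "9.13" 1392676412 2850 501 0 0')
```

The remaining tokens in `rest` would then be interpreted against the field list of the specific event type, since each record type has its own layout.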
GRP_ADD This is created when a job group is added. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event userId (%d) UNIX user ID of the job group owner submitTime (%d) Job submission time userName (%s) User name of the job group owner depCond (%s) Job dependency condition timeEvent (%d) Time Event, for job dependency condition; specifies when time event ended groupSpec (%s) Job group name delOptions (%d) Delete options for the options field delOptions2 (%d) Delete options for the options2 field sla (%s) SLA service class name that the job group is to be attached to maxJLimit (%d) Job group limit set by bgadd -L groupType (%d) Job group creation method: v 0x01 - job group was created explicitly v 0x02 - job group was created implicitly

GRP_MOD This is created when a job group is modified. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event userId (%d) UNIX user ID of the job group owner submitTime (%d) Job submission time userName (%s) User name of the job group owner depCond (%s) Job dependency condition timeEvent (%d) Time Event, for job dependency condition; specifies when time event ended groupSpec (%s) Job group name delOptions (%d) Delete options for the options field delOptions2 (%d) Delete options for the options2 field sla (%s) SLA service class name that the job group is to be attached to maxJLimit (%d) Job group limit set by bgmod -L

LOG_SWITCH This is created when switching the event file lsb.events. The fields in order of occurrence are: Version number (%s) The version number Event time (%d) The time of the event jobId (%d) Job ID

JOB_RESIZE_NOTIFY_START LSF logs this event when a resize (shrink or grow) request has been sent to the first execution host. The fields in order of occurrence are: Version number (%s) The version number.
Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. notifyId (%d) Identifier or handle for notification. numResizeHosts (%d) Number of processors used for execution. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in short format. resizeHosts (%s) List of execution host names. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. numResizeSlots (%d) Number of allocated slots for executing resize. resizeSlots (%s) List of execution host names where slots are allocated for resizing.

JOB_RESIZE_NOTIFY_ACCEPT LSF logs this event when a resize request has been accepted from the first execution host of a job. The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. notifyId (%d) Identifier or handle for notification. resizeNotifyCmdPid (%d) Resize notification executable process ID. If no resize notification executable is defined, this field will be set to 0. resizeNotifyCmdPGid (%d) Resize notification executable process group ID. If no resize notification executable is defined, this field will be set to 0. status (%d) Status field used to indicate possible errors. 0 success, 1 failure.

JOB_RESIZE_NOTIFY_DONE LSF logs this event when the resize notification command completes. The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. notifyId (%d) Identifier or handle for notification. status (%d) Resize notification exit value (0: success, 1: failure, 2: failure but cancel request).

JOB_RESIZE_RELEASE LSF logs this event when a resource release request is received from a client.
The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. reqid (%d) Request identifier or handle. options (%d) Release options. userId (%d) UNIX user ID of the user invoking the command. userName (%s) User name of the submitter. resizeNotifyCmd (%s) Resize notification command to run on the first execution host to inform job of a resize event. numResizeHosts (%d) Number of processors used for execution during resize. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is the number of hosts listed in short format. resizeHosts (%s) List of execution host names during resize. If LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is specified in lsf.conf, the value of this field is logged in a shortened format. numResizeSlots (%d) Number of allocated slots for executing resize. resizeSlots (%s) List of execution host names where slots are allocated for resizing.

JOB_RESIZE_CANCEL LSF logs this event when a cancel request is received from a client. The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. userId (%d) UNIX user ID of the user invoking the command. userName (%s) User name of the submitter.

HOST_POWER_STATUS LSF logs this event when a host power status is changed, whether by power policy, job, or by the command badmin hpower. The HOST_POWER_STATUS event is logged to reflect the power status changes. The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. Request Id (%d) The power operation request ID to identify a power operation. Op Code (%d) Power operation type. Trigger (%d) The power operation trigger: power policy, job, or badmin hpower.
Status (%d) The power operation status. Trigger Name (%s) If the operation is triggered by a power policy, this is the power policy name. If the operation is triggered by an administrator, this is the administrator user name. Number (%d) Number of hosts on which the power operation occurred. Hosts (%s) The hosts on which the power operation occurred. JOB_PROV_HOST When a job is dispatched to a power-saved host (or hosts), it triggers a power state change for the host and the job is placed in the PROV state. This event logs those PROV cases. The fields in order of occurrence are: Version number (%s) The version number. Event time (%d) The time of the event. jobId (%d) The job ID. idx (%d) Job array index. status (%d) Indicates whether the provision has started, completed, or failed. num (%d) Number of hosts that need to be provisioned. hostNameList (%s) Names of hosts that need to be provisioned. hostStatusList (%d) Host status for the provisioning result. lsb.hosts The lsb.hosts file contains host-related configuration information for the server hosts in the cluster. It is also used to define host groups, host partitions, and compute units. This file is optional. All sections are optional. By default, this file is installed in LSB_CONFDIR/cluster_name/configdir. Changing lsb.hosts configuration After making any changes to lsb.hosts, run badmin reconfig to reconfigure mbatchd. Host section Description Optional. Defines the hosts, host types, and host models used as server hosts, and contains per-host configuration information. If this section is not configured, LSF uses all hosts in the cluster (the hosts listed in lsf.cluster.cluster_name) as server hosts.
Each host, host model, or host type can be configured to do the following: v Limit the maximum number of jobs run in total v Limit the maximum number of jobs run by each user v Run jobs only under specific load conditions v Run jobs only within specific time windows The entries in a line for a host override the entries in a line for its model or type. When you modify the cluster by adding or removing hosts, no changes are made to lsb.hosts. This does not affect the default configuration, but if hosts, host models, or host types are specified in this file, you should check this file whenever you make changes to the cluster and update it manually if necessary. Host section structure The first line consists of keywords identifying the load indices that you wish to configure on a per-host basis. The keyword HOST_NAME must be used; the others are optional. Load indices not listed on the keyword line do not affect scheduling decisions. Each subsequent line describes the configuration information for one host, host model, or host type. Each line must contain one entry for each keyword. Use empty parentheses ( ) or a dash (-) to specify the default value for an entry. HOST_NAME Required. Specify the name, model, or type of a host, or the keyword default. host name The name of a host defined in lsf.cluster.cluster_name. host model A host model defined in lsf.shared. host type A host type defined in lsf.shared. default The reserved host name default indicates all hosts in the cluster not otherwise referenced in the section (by name or by listing its model or type). CHKPNT Description If C, checkpoint copy is enabled. With checkpoint copy, all opened files are automatically copied to the checkpoint directory by the operating system when a process is checkpointed. Example
HOST_NAME  CHKPNT
hostA      C
Compatibility Checkpoint copy is only supported on Cray systems.
Default No checkpoint copy DISPATCH_WINDOW Description The time windows in which jobs from this host, host model, or host type are dispatched. Once dispatched, jobs are no longer affected by the dispatch window. Default Not defined (always open) EXIT_RATE Description Specifies a threshold for exited jobs. Specify a number of jobs. If the number of jobs that exit over a period of time specified by JOB_EXIT_RATE_DURATION in lsb.params (5 minutes by default) exceeds the number of jobs you specify as the threshold in this parameter, LSF invokes LSF_SERVERDIR/eadmin to trigger a host exception. EXIT_RATE for a specific host overrides a default GLOBAL_EXIT_RATE specified in lsb.params. Example The following Host section defines a job exit rate of 20 jobs for all hosts, and an exit rate of 10 jobs on hostA.
Begin Host
HOST_NAME  MXJ  EXIT_RATE  # Keywords
Default    !    20
hostA      !    10
End Host
Default Not defined JL/U Description Per-user job slot limit for the host. Maximum number of job slots that each user can use on this host. Example
HOST_NAME  JL/U
hostA      2
Default Unlimited MIG Syntax MIG=minutes Description Enables automatic job migration and specifies the migration threshold for checkpointable or rerunnable jobs, in minutes. LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes. Specify a value of 0 to migrate jobs immediately upon suspension. The migration threshold applies to all jobs running on the host. A job-level command-line migration threshold overrides the threshold configured in the application profile and queue. Application profile configuration overrides queue-level configuration. When a host migration threshold is specified, and is lower than the value for the job, the queue, or the application, the host value is used. Does not affect MultiCluster jobs that are forwarded to a remote cluster. Default Not defined.
LSF does not migrate checkpointable or rerunnable jobs automatically. MXJ Description The number of job slots on the host. With the MultiCluster resource leasing model, this is the number of job slots on the host that are available to the local cluster. Use ! to make the number of job slots equal to the number of CPUs on a host. For the reserved host name default, ! makes the number of job slots equal to the number of CPUs on all hosts in the cluster not otherwise referenced in the section. By default, the number of running and suspended jobs on a host cannot exceed the number of job slots. If preemptive scheduling is used, the suspended jobs are not counted as using a job slot. On multiprocessor hosts, to fully use the CPU resource, make the number of job slots equal to or greater than the number of processors. Default Unlimited load_index Syntax load_index loadSched[/loadStop] Specify io, it, ls, mem, pg, r15s, r1m, r15m, swp, tmp, ut, or a non-shared (host-based) dynamic custom external load index as a column. Specify multiple columns to configure thresholds for multiple load indices. Description Scheduling and suspending thresholds for dynamic load indices supported by LIM, including external load indices. Each load index column must contain either the default entry or two numbers separated by a slash (/), with no white space. The first number is the scheduling threshold for the load index; the second number is the suspending threshold. Queue-level scheduling and suspending thresholds are defined in lsb.queues. If both files specify thresholds for an index, those that apply are the most restrictive ones.
Example
HOST_NAME  mem     swp
hostA      100/10  200/30
This example translates into a loadSched condition of mem>=100 && swp>=200 and a loadStop condition of mem < 10 || swp < 30. Default Not defined AFFINITY Syntax AFFINITY=Y | y | N | n | cpu_list Description Specifies whether the host can be used to run affinity jobs, and if so, which CPUs are eligible to do so. The syntax accepts Y, N, a list of CPUs, or a CPU range. Examples The following configuration enables affinity scheduling and tells LSF to use all CPUs on hostA for affinity jobs:
HOST_NAME  MXJ  r1m  AFFINITY
hostA      !    ()   (Y)
The following configuration specifies a CPU list for affinity scheduling:
HOST_NAME  MXJ  r1m  AFFINITY
hostA      !    ()   (CPU_LIST="1,3,5,7-10")
This configuration enables affinity scheduling on hostA and tells LSF to use only CPUs 1, 3, 5, and CPUs 7-10 to run affinity jobs. The following configuration disables affinity scheduling:
HOST_NAME  MXJ  r1m  AFFINITY
hostA      !    ()   (N)
Default Not defined. Affinity scheduling is not enabled. Example of a Host section
Begin Host
HOST_NAME  MXJ  JL/U  r1m      pg     DISPATCH_WINDOW
hostA      1    -     0.6/1.6  10/20  (5:19:00-1:8:30 20:00-8:30)
Linux      1    -     0.5/2.5  ()     23:00-8:00
default    2    1     0.6/1.6  20/40  ()
End Host
Linux is a host type defined in lsf.shared. This example Host section configures one host and one host type explicitly and configures default values for all other load-sharing hosts. HostA runs one batch job at a time. A job will only be started on hostA if the r1m index is below 0.6 and the pg index is below 10; the running job is stopped if the r1m index goes above 1.6 or the pg index goes above 20. HostA only accepts batch jobs from 19:00 on Friday evening until 8:30 Monday morning and overnight from 20:00 to 8:30 on all other days. For hosts of type Linux, the pg index does not have host-specific thresholds and such hosts are only available overnight from 23:00 to 8:00.
The entry with host name default applies to each of the other hosts in the cluster. Each host can run up to two jobs at the same time, with at most one job from each user. These hosts are available to run jobs at all times. Jobs may be started if the r1m index is below 0.6 and the pg index is below 20. HostGroup section Description Optional. Defines host groups. The name of the host group can then be used in other host group, host partition, and queue definitions, as well as on the command line. Specifying the name of a host group has exactly the same effect as listing the names of all the hosts in the group. Structure Host groups are specified in the same format as user groups in lsb.users. The first line consists of two mandatory keywords, GROUP_NAME and GROUP_MEMBER, as well as the optional keywords CONDENSE and GROUP_ADMIN. Subsequent lines name a group and list its membership. The sum of all host groups, compute groups, and host partitions cannot be more than 1024. GROUP_NAME Description An alphanumeric string representing the name of the host group. You cannot use the reserved name all, and group names must not conflict with host names. CONDENSE Description Optional. Defines condensed host groups. Condensed host groups are displayed in a condensed output format for the bhosts and bjobs commands. If you configure a host to belong to more than one condensed host group, bjobs can display any of the host groups as the execution host name. Valid values Y or N. Default N (the specified host group is not condensed) GROUP_MEMBER Description A space-delimited list of host names or previously defined host group names, enclosed in one pair of parentheses. You cannot use more than one pair of parentheses to define the list. The names of hosts and host groups can appear on multiple lines because hosts can belong to multiple groups. The reserved name all specifies all hosts in the cluster. An exclamation mark (!)
indicates an externally-defined host group, which the egroup executable retrieves. Pattern definition You can use string literals and special characters when defining host group members. Each entry cannot contain any spaces, as the list itself is space-delimited. When a leased-in host joins the cluster, the host name is in the form host@cluster. For these hosts, only the host part of the host name is subject to pattern definitions. You can use the following special characters to specify host group members: v Use a tilde (~) to exclude specified hosts or host groups from the list. v Use an asterisk (*) as a wildcard character to represent any number of characters. v Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a host name. The first integer must be less than the second integer. v Use square brackets with commas ([integer1, integer2 ...]) to define individual non-negative integers at the end of a host name. v Use square brackets with commas and hyphens (for example, [integer1-integer2, integer3, integer4-integer5]) to define different ranges of non-negative integers at the end of a host name. Restrictions v You cannot use more than one set of square brackets in a single host group definition. – The following example is not correct: ... (hostA[1-10]B[1-20] hostC[101-120]) – The following example is correct: ... (hostA[1-20] hostC[101-120]) v You cannot define subgroups that contain wildcards and special characters. GROUP_ADMIN Description Host group administrators have the ability to open or close the member hosts for the group they are administering. The GROUP_ADMIN field is a space-delimited list of user names or previously defined user group names, enclosed in one pair of parentheses. You cannot use more than one pair of parentheses to define the list.
The names of users and user groups can appear on multiple lines because users can belong to and administer multiple groups. Host group administrator rights are inherited. For example, if the user admin2 is an administrator for host group hg1 and host group hg2 is a member of hg1, admin2 is also an administrator for host group hg2. When host group administrators (who are not also cluster administrators) open or close a host, they must specify a comment with the -C option. Valid values Any existing user or user group can be specified. A user group whose membership is defined by an external list is also allowed; however, in this location you use the name of that user group, rather than the exclamation mark (!) itself. Restrictions v You cannot specify any wildcards or special characters (for example: *, !, $, #, &, ~). v You cannot specify an external group (egroup). v You cannot use the keyword ALL and you cannot administer any group that has ALL as its members. v User names and user group names cannot have spaces. Example HostGroup sections Example 1
Begin HostGroup
GROUP_NAME  GROUP_MEMBER          GROUP_ADMIN
groupA      (hostA hostD)         (user1 user10)
groupB      (hostF groupA hostK)  ()
groupC      (!)                   ()
End HostGroup
This example defines three host groups: v groupA includes hostA and hostD and can be administered by user1 and user10. v groupB includes hostF and hostK, along with all hosts in groupA. It has no administrators (only the cluster administrator can control the member hosts). v The group membership of groupC is defined externally and retrieved by the egroup executable.
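Because groupC's membership is defined externally, LSF obtains the member list by running the egroup executable in LSF_SERVERDIR. The following is a minimal sketch of such a script, assuming LSF invokes it as egroup -m host_group_name and reads one line of space-separated host names from stdout; the host names shown are placeholders, not part of the examples above.

```shell
#!/bin/sh
# Hypothetical external egroup script (would be installed as LSF_SERVERDIR/egroup).
# Assumed contract: called as "egroup -m <host_group_name>"; prints the member
# host names, space-separated, on a single line of stdout.
egroup_members() {
    if [ "$1" = "-m" ] && [ "$2" = "groupC" ]; then
        echo "hostP hostQ hostR"   # placeholder host names
    fi
}

egroup_members "$@"
```

A real script would typically query a site database or LDAP rather than hard-code the list; the shape of the output is what matters to LSF.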
Example 2
Begin HostGroup
GROUP_NAME  GROUP_MEMBER             GROUP_ADMIN
groupA      (all)                    ()
groupB      (groupA ~hostA ~hostB)   (user11 user14)
groupC      (hostX hostY hostZ)      ()
groupD      (groupC ~hostX)          (usergroupB)
groupE      (all ~groupC ~hostB)     ()
groupF      (hostF groupC hostK)     ()
End HostGroup
This example defines the following host groups: v groupA contains all hosts in the cluster and is administered by the cluster administrator. v groupB contains all the hosts in the cluster except for hostA and hostB and is administered by user11 and user14. v groupC contains only hostX, hostY, and hostZ and is administered by the cluster administrator. v groupD contains the hosts in groupC except for hostX. Note that hostX must be a member of host group groupC to be excluded from groupD. usergroupB is the administrator for groupD. v groupE contains all hosts in the cluster excluding the hosts in groupC and hostB and is administered by the cluster administrator. v groupF contains hostF, hostK, and the three hosts in groupC and is administered by the cluster administrator. Example 3
Begin HostGroup
GROUP_NAME  CONDENSE  GROUP_MEMBER   GROUP_ADMIN
groupA      N         (all)          ()
groupB      N         (hostA hostB)  (usergroupC user1)
groupC      Y         (all)          ()
End HostGroup
This example defines the following host groups: v groupA shows uncondensed output, contains all hosts in the cluster, and is administered by the cluster administrator. v groupB shows uncondensed output, and contains hostA and hostB. It is administered by all members of usergroupC and by user1. v groupC shows condensed output, contains all hosts in the cluster, and is administered by the cluster administrator.
Example 4
Begin HostGroup
GROUP_NAME  CONDENSE  GROUP_MEMBER                            GROUP_ADMIN
groupA      Y         (host*)                                 (user7)
groupB      N         (*A)                                    ()
groupC      N         (hostB* ~hostB[1-50])                   ()
groupD      Y         (hostC[1-50] hostC[101-150])            (usergroupJ)
groupE      N         (hostC[51-100] hostC[151-200])          ()
groupF      Y         (hostD[1,3] hostD[5-10])                ()
groupG      N         (hostD[11-50] ~hostD[15,20,25] hostD2)  ()
End HostGroup
This example defines the following host groups: v groupA shows condensed output, and contains all hosts starting with the string host. It is administered by user7. v groupB shows uncondensed output, and contains all hosts ending with the string A, such as hostA, and is administered by the cluster administrator. v groupC shows uncondensed output, and contains all hosts starting with the string hostB except for the hosts from hostB1 to hostB50, and is administered by the cluster administrator. v groupD shows condensed output, and contains all hosts from hostC1 to hostC50 and all hosts from hostC101 to hostC150, and is administered by the members of usergroupJ. v groupE shows uncondensed output, and contains all hosts from hostC51 to hostC100 and all hosts from hostC151 to hostC200, and is administered by the cluster administrator. v groupF shows condensed output, and contains hostD1, hostD3, and all hosts from hostD5 to hostD10, and is administered by the cluster administrator. v groupG shows uncondensed output, and contains all hosts from hostD11 to hostD50 except for hostD15, hostD20, and hostD25. groupG also includes hostD2. It is administered by the cluster administrator. HostPartition section Description Optional. Used with host partition user-based fairshare scheduling. Defines a host partition, which defines a user-based fairshare policy at the host level. Configure multiple sections to define multiple partitions. The members of a host partition form a host group with the same name as the host partition.
Restriction: You cannot use host partitions and host preference simultaneously. Limitations on queue configuration v If you configure a host partition, you cannot configure fairshare at the queue level. v If a queue uses a host that belongs to a host partition, it should not use any hosts that do not belong to that partition. All the hosts in the queue should belong to the same partition. Otherwise, you might notice unpredictable scheduling behavior: – Jobs in the queue sometimes may be dispatched to the host partition even though hosts not belonging to any host partition have a lighter load. – If some hosts belong to one host partition and some hosts belong to another, only the priorities of one host partition are used when dispatching a parallel job to hosts from more than one host partition. Shared resources and host partitions v If a resource is shared among hosts included in host partitions and hosts that are not included in any host partition, jobs in queues that use the host partitions always get the shared resource first, regardless of queue priority. v If a resource is shared among host partitions, jobs in queues that use the host partitions listed first in the HostPartition section of lsb.hosts always have priority to get the shared resource first. To allocate shared resources among host partitions, LSF considers host partitions in the order they are listed in lsb.hosts. Structure Each host partition always consists of three lines, defining the name of the partition, the hosts included in the partition, and the user share assignments. HPART_NAME Syntax HPART_NAME=partition_name Description Specifies the name of the partition. The name must be 59 characters or less. HOSTS Syntax HOSTS=[[~]host_name | [~]host_group | all]... Description Specifies the hosts in the partition, in a space-separated list. A host cannot belong to multiple partitions. A host group cannot be empty.
Hosts that are not included in any host partition are controlled by the FCFS scheduling policy instead of the fairshare scheduling policy. Optionally, use the reserved host name all to configure a single partition that applies to all hosts in a cluster. Optionally, use the not operator (~) to exclude hosts or host groups from the list of hosts in the host partition. Examples HOSTS=all ~hostK ~hostM The partition includes all the hosts in the cluster, except for hostK and hostM. HOSTS=groupA ~hostL The partition includes all the hosts in host group groupA except for hostL. USER_SHARES Syntax USER_SHARES=[user, number_shares]... Description Specifies user share assignments. v Specify at least one user share assignment. v Enclose each user share assignment in square brackets, as shown. v Separate a list of multiple share assignments with a space between the square brackets. v user: Specify users who are also configured to use the host partition. You can assign the shares: – To a single user (specify user_name). To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name). – To users in a group, individually (specify group_name@) or collectively (specify group_name). To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\group_name). – To users not included in any other share assignment, individually (specify the keyword default) or collectively (specify the keyword others). By default, when resources are assigned collectively to a group, the group members compete for the resources according to FCFS scheduling. You can use hierarchical fairshare to further divide the shares among the group members. When resources are assigned to members of a group individually, the share assignment is recursive. Members of the group and of all subgroups always compete for the resources according to FCFS scheduling, regardless of hierarchical fairshare policies.
v number_shares: – Specify a positive integer representing the number of shares of the cluster resources assigned to the user. – The number of shares assigned to each user is only meaningful when you compare it to the shares assigned to other users or to the total number of shares. The total number of shares is just the sum of all the shares assigned in each share assignment. Example of a HostPartition section
Begin HostPartition
HPART_NAME = Partition1
HOSTS = hostA hostB
USER_SHARES = [groupA@, 3] [groupB, 7] [default, 1]
End HostPartition
ComputeUnit section Description Optional. Defines compute units. Once defined, a compute unit can be used in other compute unit and queue definitions, as well as on the command line. Specifying the name of a compute unit has the same effect as listing the names of all the hosts in the compute unit. Compute units are similar to host groups, with the added feature of granularity, allowing the construction of structures that mimic the network architecture. Job scheduling using compute unit resource requirements effectively spreads jobs over the cluster based on the configured compute units. To enforce consistency, compute unit configuration has the following requirements: v Hosts and host groups appear in the finest granularity compute unit type, and nowhere else. v Hosts appear in only one compute unit of the finest granularity. v All compute units of the same type have the same type of compute units (or hosts) as members. Structure Compute units are specified in the same format as host groups in lsb.hosts. The first line consists of three mandatory keywords, NAME, MEMBER, and TYPE, as well as the optional keywords CONDENSE and ADMIN. Subsequent lines name a compute unit and list its membership. The sum of all host groups, compute groups, and host partitions cannot be more than 1024. NAME Description An alphanumeric string representing the name of the compute unit.
You cannot use the reserved names all, allremote, others, and default. Compute unit names must not conflict with host names, host partitions, or host group names. CONDENSE Description Optional. Defines condensed compute units. Condensed compute units are displayed in a condensed output format for the bhosts and bjobs commands. The condensed compute unit format includes the slot usage for each compute unit. Valid values Y or N. Default N (the specified compute unit is not condensed) MEMBER Description A space-delimited list of host names or previously defined compute unit names, enclosed in one pair of parentheses. You cannot use more than one pair of parentheses to define the list. The names of hosts and host groups can appear only once, and only in a compute unit type of the finest granularity. An exclamation mark (!) indicates an externally-defined host group, which the egroup executable retrieves. Pattern definition You can use string literals and special characters when defining compute unit members. Each entry cannot contain any spaces, as the list itself is space-delimited. You can use the following special characters to specify host and host group compute unit members: v Use a tilde (~) to exclude specified hosts or host groups from the list. v Use an asterisk (*) as a wildcard character to represent any number of characters. v Use square brackets with a hyphen ([integer1-integer2]) to define a range of non-negative integers at the end of a host name. The first integer must be less than the second integer. v Use square brackets with commas ([integer1, integer2...]) to define individual non-negative integers at the end of a host name. v Use square brackets with commas and hyphens (for example, [integer1-integer2, integer3, integer4-integer5]) to define different ranges of non-negative integers at the end of a host name.
Restrictions v You cannot use more than one set of square brackets in a single compute unit definition. – The following example is not correct: ... (enclA[1-10]B[1-20] enclC[101-120]) – The following example is correct: ... (enclA[1-20] enclC[101-120]) v Compute unit names cannot be used in compute units of the finest granularity. v You cannot include host or host group names except in compute units of the finest granularity. v You must not skip levels of granularity. For example, if lsb.params contains COMPUTE_UNIT_TYPES=enclosure rack cabinet, then a compute unit of type cabinet can contain compute units of type rack, but not of type enclosure. v The keywords all, allremote, all@cluster, others, and default cannot be used when defining compute units. TYPE Description The type of the compute unit, as defined in the COMPUTE_UNIT_TYPES parameter of lsb.params. ADMIN Description Compute unit administrators have the ability to open or close the member hosts for the compute unit they are administering. The ADMIN field is a space-delimited list of user names or previously defined user group names, enclosed in one pair of parentheses. You cannot use more than one pair of parentheses to define the list. The names of users and user groups can appear on multiple lines because users can belong to and administer multiple compute units. Compute unit administrator rights are inherited. For example, if the user admin2 is an administrator for compute unit cu1 and compute unit cu2 is a member of cu1, admin2 is also an administrator for compute unit cu2. When compute unit administrators (who are not also cluster administrators) open or close a host, they must specify a comment with the -C option. Valid values Any existing user or user group can be specified. A user group whose membership is defined by an external list is also allowed; however, in this location you use the name of that user group, rather than the exclamation mark (!) itself.
Restrictions v You cannot specify any wildcards or special characters (for example: *, !, $, #, &, ~). v You cannot specify an external group (egroup). v You cannot use the keyword ALL and you cannot administer any group that has ALL as its members. v User names and user group names cannot have spaces. Example ComputeUnit sections Example 1 (For the lsb.params entry COMPUTE_UNIT_TYPES=enclosure rack cabinet)
Begin ComputeUnit
NAME   MEMBER         TYPE
encl1  (host1 host2)  enclosure
encl2  (host3 host4)  enclosure
encl3  (host5 host6)  enclosure
encl4  (host7 host8)  enclosure
rack1  (encl1 encl2)  rack
rack2  (encl3 encl4)  rack
cbnt1  (rack1 rack2)  cabinet
End ComputeUnit
This example defines seven compute units: v encl1, encl2, encl3, and encl4 are the finest granularity, and each contains two hosts. v rack1 is of coarser granularity and contains two levels. At the enclosure level rack1 contains encl1 and encl2. At the lowest level rack1 contains host1, host2, host3, and host4. v rack2 has the same structure as rack1, and contains encl3 and encl4. v cbnt1 contains two racks (rack1 and rack2), four enclosures (encl1, encl2, encl3, and encl4), and all eight hosts. Compute unit cbnt1 is the coarsest granularity in this example. Example 2 (For the lsb.params entry COMPUTE_UNIT_TYPES=enclosure rack cabinet)
Begin ComputeUnit
NAME   CONDENSE  MEMBER                    TYPE       ADMIN
encl1  Y         (hg123 ~hostA ~hostB)     enclosure  (user11 user14)
encl2  Y         (hg456)                   enclosure  ()
encl3  N         (hostA hostB)             enclosure  (usergroupB)
encl4  N         (hgroupX ~hostB)          enclosure  ()
encl5  Y         (hostC* ~hostC[101-150])  enclosure  (usergroupJ)
encl6  N         (hostC[101-150])          enclosure  ()
rack1  Y         (encl1 encl2 encl3)       rack       ()
rack2  N         (encl4 encl5)             rack       (usergroupJ)
rack3  N         (encl6)                   rack       ()
cbnt1  Y         (rack1 rack2)             cabinet    ()
cbnt2  N         (rack3)                   cabinet    (user14)
End ComputeUnit
This example defines 11 compute units: v All six enclosures (finest granularity) contain only hosts and host groups. All three racks contain only enclosures.
Both cabinets (coarsest granularity) contain only racks. v encl1 contains all the hosts in host group hg123 except for hostA and hostB and is administered by user11 and user14. Note that hostA and hostB must be members of host group hg123 to be excluded from encl1. encl1 shows condensed output. v encl2 contains host group hg456 and is administered by the cluster administrator. encl2 shows condensed output. v encl3 contains hostA and hostB. usergroupB is the administrator for encl3. encl3 shows uncondensed output. v encl4 contains host group hgroupX except for hostB. Since each host can appear in only one enclosure and hostB is already in encl3, it cannot be in encl4. encl4 is administered by the cluster administrator. encl4 shows uncondensed output. v encl5 contains all hosts starting with the string hostC except for hosts hostC101 to hostC150, and is administered by usergroupJ. encl5 shows condensed output. v rack1 contains encl1, encl2, and encl3. rack1 shows condensed output. v rack2 contains encl4 and encl5. rack2 shows uncondensed output. v rack3 contains encl6. rack3 shows uncondensed output. v cbnt1 contains rack1 and rack2. cbnt1 shows condensed output. v cbnt2 contains rack3. Even though rack3 only contains encl6, cbnt2 cannot contain encl6 directly because that would mean skipping the level associated with compute unit type rack. cbnt2 shows uncondensed output. Automatic time-based configuration Variable configuration is used to automatically change LSF configuration based on time windows. You define automatic configuration changes in lsb.hosts by using if-else constructs and time expressions. After you change the files, reconfigure the cluster with the badmin reconfig command. The expressions are evaluated by LSF every 10 minutes based on mbatchd start time. When an expression evaluates true, LSF dynamically changes the configuration based on the associated configuration statements.
Reconfiguration is done in real time without restarting mbatchd, providing continuous system availability.

Example

In the following example, #if, #else, and #endif are not interpreted as comments by LSF but as if-else constructs.

Begin Host
HOST_NAME  r15s  r1m  pg
host1      3/5   3/5  12/20
#if time(5:16:30-1:8:30 20:00-8:30)
host2      3/5   3/5  12/20
#else
host2      2/3   2/3  10/12
#endif
host3      3/5   3/5  12/20
End Host

lsb.modules

The lsb.modules file contains configuration information for LSF scheduler and resource broker modules. The file contains only one section, named PluginModule. This file is optional. If no scheduler or resource broker modules are configured, LSF uses the default scheduler plugin modules named schmod_default and schmod_fcfs.

The lsb.modules file is stored in the directory LSB_CONFDIR/cluster_name/configdir, where LSB_CONFDIR is defined in lsf.conf.

Changing lsb.modules configuration

After making any changes to lsb.modules, run badmin reconfig to reconfigure mbatchd.

PluginModule section

Description

Defines the plugin modules for the LSF scheduler and LSF resource broker. If this section is not configured, LSF uses the default scheduler plugin modules named schmod_default and schmod_fcfs, which enable the LSF default scheduling features.

Example PluginModule section

The following PluginModule section enables all scheduling policies provided by LSF:

Begin PluginModule
SCH_PLUGIN         RB_PLUGIN  SCH_DISABLE_PHASES
schmod_default     ()         ()
schmod_fairshare   ()         ()
schmod_fcfs        ()         ()
schmod_limit       ()         ()
schmod_parallel    ()         ()
schmod_reserve     ()         ()
schmod_preemption  ()         ()
schmod_advrsv      ()         ()
schmod_mc          ()         ()
schmod_jobweight   ()         ()
schmod_cpuset      ()         ()
schmod_pset        ()         ()
schmod_ps          ()         ()
schmod_aps         ()         ()
schmod_affinity    ()         ()
End PluginModule

PluginModule section structure

The first line consists of the following keywords:
v SCH_PLUGIN
v RB_PLUGIN
v SCH_DISABLE_PHASES

They identify the scheduler plugins, resource broker plugins, and the scheduler phases to be disabled for the plugins that you wish to configure.

Each subsequent line describes the configuration information for one scheduler plugin module, resource broker plugin module, and scheduler phase, if any, to be disabled for the plugin. Each line must contain one entry for each keyword. Use empty parentheses ( ) or a dash (-) to specify the default value for an entry.

SCH_PLUGIN

Description

Required. The SCH_PLUGIN column specifies the shared module name for the LSF scheduler plugin. Scheduler plugins are called in the order they are listed in the PluginModule section.

By default, all shared modules for scheduler plugins are located in LSF_LIBDIR. On UNIX, you can also specify a full path to the name of the scheduler plugin.

The following modules are supplied with LSF:

schmod_default
Enables the default LSF scheduler features.

schmod_fcfs
Enables the first-come, first-served (FCFS) scheduler features. schmod_fcfs can appear anywhere in the SCH_PLUGIN list. By default, if schmod_fcfs is not configured in lsb.modules, it is loaded automatically along with schmod_default.

Source code (sch.mod.fcfs.c) for the schmod_fcfs scheduler plugin module is installed in the directory LSF_TOP/9.1/misc/examples/external_plugin/

Use the LSF scheduler plugin SDK to modify the FCFS scheduler module code to suit the job scheduling requirements of your site.
See IBM Platform LSF Programmer's Guide for more detailed information about writing, building, and configuring your own custom scheduler plugins.

schmod_fairshare
Enables the LSF fairshare scheduling features.

schmod_limit
Enables the LSF resource allocation limit features.

schmod_parallel
Enables scheduling of parallel jobs submitted with bsub -n.

schmod_reserve
Enables the LSF resource reservation features. To enable processor reservation, backfill, and memory reservation for parallel jobs, you must configure both schmod_parallel and schmod_reserve in lsb.modules. If only schmod_reserve is configured, backfill and memory reservation are enabled only for sequential jobs, and processor reservation is not enabled.

schmod_preemption
Enables the LSF preemption scheduler features.

schmod_advrsv
Handles jobs that use advance reservations (brsvadd, brsvs, brsvdel, bsub -U).

schmod_cpuset
Handles jobs that use SGI cpusets (bsub -ext[sched] "CPUSET[cpuset_options]"). The schmod_cpuset plugin name must be configured after the standard LSF plugin names in the PluginModule list.

schmod_mc
Enables MultiCluster job forwarding.

schmod_ps
Enables resource ownership functionality of EGO-enabled SLA scheduling policies.

schmod_aps
Enables absolute priority scheduling (APS) policies configured by APS_PRIORITY in lsb.queues. The schmod_aps plugin name must be configured after the schmod_fairshare plugin name in the PluginModule list, so that the APS value can override the fairshare job ordering decision.

schmod_affinity
Enables CPU and memory affinity scheduling configured by AFFINITY in lsb.hosts.

Scheduler plugin SDK

Use the LSF scheduler plugin SDK to write customized scheduler modules that give you more flexibility and control over job scheduling. Enable your custom scheduling policies by configuring your modules under SCH_PLUGIN in the PluginModule section of lsb.modules.
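As a sketch only, a custom plugin built with the SDK would be appended to the standard plugin list. The module name schmod_mysched is hypothetical, and the exact phase-list notation in SCH_DISABLE_PHASES is an assumption based on the four-phase numbering described under SCH_DISABLE_PHASES below:

```
Begin PluginModule
SCH_PLUGIN       RB_PLUGIN  SCH_DISABLE_PHASES
schmod_default   ()         ()
schmod_fcfs      ()         ()
schmod_mysched   ()         (3,4)    #hypothetical custom module; skips order/allocation and post-processing
End PluginModule
```

After editing lsb.modules this way, the text above indicates that badmin reconfig is needed for mbatchd to pick up the change.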
The directory LSF_TOP/9.1/misc/examples/external_plugin/ contains sample plugin code. See IBM Platform LSF Programmer's Guide for more detailed information about writing, building, and configuring your own custom scheduler plugins.

SCH_DISABLE_PHASES

Description

SCH_DISABLE_PHASES specifies which scheduler phases, if any, to disable for the plugin. LSF scheduling has four phases:
1. Preprocessing - the scheduler checks the readiness of the job for scheduling and prepares a list of ready resource seekers. It also checks the start time of a job and evaluates any job dependencies.
2. Match/limit - the scheduler evaluates the job resource requirements and prepares candidate hosts for jobs by matching jobs with resources. It also applies resource allocation limits. Jobs with all required resources matched go on to the order/allocation phase. Not all jobs are matched to every potentially available resource. Jobs without any matching resources do not go through the order/allocation phase, but can go through the post-processing phase, where preemption may be applied to get the resources the job needs to run.
3. Order/allocation - the scheduler sorts jobs with matched resources and allocates resources for each job, assigning job slots, memory, and other resources to the job. It also checks whether the allocation satisfies all constraints defined in configuration, such as the queue slot limit and the deadline for the job.
a. In the order phase, the scheduler applies policies such as FCFS, fairshare, and host partition, and considers job priorities within user groups and share groups. By default, job priority within a pool of jobs from the same user is based on how long the job has been pending.
b. For resource-intensive jobs (jobs requiring many CPUs or a large amount of memory), resource reservation is performed so that these jobs are not starved.
c. When all the currently available resources are allocated, jobs go on to post-processing.
4.
Post-processing - the scheduler prepares jobs from the order/allocation phase for dispatch, and applies preemption or backfill policies to obtain resources for the jobs that have completed the preprocessing or match/limit phases but did not have resources available to enter the next scheduling phase.

Each scheduler plugin module invokes one or more scheduler phases. The processing for a given phase can be disabled or skipped if the plugin module does not need to do any processing for that phase, or if the processing has already been done by a previous plugin module in the list.

Default

Undefined

lsb.params

The lsb.params file defines general parameters used by the LSF system. This file contains only one section, named Parameters. mbatchd uses lsb.params for initialization. The file is optional. If not present, the LSF-defined defaults are assumed.

Some of the parameters that can be defined in lsb.params control timing within the system. The default settings provide good throughput for long-running batch jobs while adding a minimum of processing overhead in the batch daemons.

This file is installed by default in LSB_CONFDIR/cluster_name/configdir.

Changing lsb.params configuration

After making any changes to lsb.params, run badmin reconfig to reconfigure mbatchd.

Automatic time-based configuration

Variable configuration is used to automatically change LSF configuration based on time windows. You define automatic configuration changes in lsb.params by using if-else constructs and time expressions. After you change the files, reconfigure the cluster with the badmin reconfig command.

The expressions are evaluated by LSF every 10 minutes based on mbatchd start time. When an expression evaluates true, LSF dynamically changes the configuration based on the associated configuration statements. Reconfiguration is done in real time without restarting mbatchd, providing continuous system availability.
Example

# if 18:30-19:30 is your short job express period, but
# you want all jobs going to the short queue by default
# and be subject to the thresholds of that queue
# for all other hours, normal is the default queue
#if time(18:30-19:30)
DEFAULT_QUEUE=short
#else
DEFAULT_QUEUE=normal
#endif

Parameters section

This section and all the keywords in this section are optional. If keywords are not present, the default values are assumed.

Parameters set at installation

The following parameter values are set at installation for the purpose of testing a new cluster:

Begin Parameters
DEFAULT_QUEUE = normal      #default job queue name
MBD_SLEEP_TIME = 10         #time used for calculating parameter values (60 secs is default)
SBD_SLEEP_TIME = 7          #sbatchd scheduling interval (30 secs is default)
JOB_ACCEPT_INTERVAL = 1     #interval for any host to accept a job (default is 1, one-fold of MBD_SLEEP_TIME)
End Parameters

With this configuration, jobs submitted to the LSF system will be started on server hosts quickly. If this configuration is not suitable for your production use, you should either remove the parameters to take the default values, or adjust them as needed. For example, to avoid having jobs start when host load is high, increase JOB_ACCEPT_INTERVAL so that the job scheduling interval is longer, giving hosts more time to adjust load indices after accepting jobs.

In production use, you should define DEFAULT_QUEUE to the normal queue, MBD_SLEEP_TIME to 60 seconds (the default), and SBD_SLEEP_TIME to 30 seconds (the default).
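The production recommendation above can be sketched as a Parameters fragment (the values are exactly the defaults named in the text, restated here for illustration):

```
Begin Parameters
DEFAULT_QUEUE = normal     #route jobs to the normal queue by default
MBD_SLEEP_TIME = 60        #the documented default mbatchd interval
SBD_SLEEP_TIME = 30        #the documented default sbatchd interval
End Parameters
```

Because these are the defaults, simply deleting the installation-time values achieves the same effect; spelling them out makes the intent explicit in the file.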
ABS_RUNLIMIT

Syntax

ABS_RUNLIMIT=y | Y

Description

If set, absolute (wall-clock) run time is used instead of normalized run time for all jobs submitted with the following values:
v Run time limit or run time estimate specified by the -W or -We option of bsub
v RUNLIMIT queue-level parameter in lsb.queues
v RUNLIMIT application-level parameter in lsb.applications
v RUNTIME parameter in lsb.applications

The run time estimates and limits are not normalized by the host CPU factor.

Default

Set to Y at time of installation. If otherwise undefined, then N.

ACCT_ARCHIVE_AGE

Syntax

ACCT_ARCHIVE_AGE=days

Description

Enables automatic archiving of LSF accounting log files, and specifies the archive interval. LSF archives the current log file if the length of time from its creation date exceeds the specified number of days.

See also
v ACCT_ARCHIVE_SIZE also enables automatic archiving
v ACCT_ARCHIVE_TIME also enables automatic archiving
v MAX_ACCT_ARCHIVE_FILE enables automatic deletion of the archives

Default

Not defined; no limit to the age of lsb.acct.

ACCT_ARCHIVE_SIZE

Syntax

ACCT_ARCHIVE_SIZE=kilobytes

Description

Enables automatic archiving of LSF accounting log files, and specifies the archive threshold. LSF archives the current log file if its size exceeds the specified number of kilobytes.

See also
v ACCT_ARCHIVE_AGE also enables automatic archiving
v ACCT_ARCHIVE_TIME also enables automatic archiving
v MAX_ACCT_ARCHIVE_FILE enables automatic deletion of the archives

Default

Not defined; no limit to the size of lsb.acct.

ACCT_ARCHIVE_TIME

Syntax

ACCT_ARCHIVE_TIME=hh:mm

Description

Enables automatic archiving of the LSF accounting log file lsb.acct, and specifies the time of day to archive the current log file.
See also
v ACCT_ARCHIVE_AGE also enables automatic archiving
v ACCT_ARCHIVE_SIZE also enables automatic archiving
v MAX_ACCT_ARCHIVE_FILE enables automatic deletion of the archives

Default

Not defined (no time set for archiving lsb.acct)

ADVRSV_USER_LIMIT

Syntax

ADVRSV_USER_LIMIT=integer

Description

Sets the number of advance reservations each user or user group can have in the system.

Valid values

1-10000

Default

100

BJOBS_RES_REQ_DISPLAY

Syntax

BJOBS_RES_REQ_DISPLAY=none | brief | full

Description

This parameter controls how many levels of resource requirements bjobs -l displays. You can set the parameter to one of the following values:
v none: bjobs -l does not display any level of resource requirement.
v brief: bjobs -l displays only the combined and effective resource requirements. This would include, for example, the following:
RESOURCE REQUIREMENT DETAILS:
Combined : res_req
Effective : res_req
v full: bjobs -l displays the job, application, queue, combined, and effective resource requirements. This would include, for example, the following:
RESOURCE REQUIREMENT DETAILS:
Job-level : res_req
App-level : res_req
Queue-level: res_req
Combined : res_req
Effective : res_req

Combined resource requirements are the result of mbatchd merging job, application, and queue level resource requirements for each job. Effective resource requirements are the resource requirements used by the scheduler to dispatch jobs. Only the rusage can be changed for running jobs (for example, with bmod -R).

When the job finishes, the effective resource requirement that the job last used persists in JOB_FINISH of lsb.acct and JOB_FINISH2 of lsb.stream. If mbatchd was restarted, LSF recovers the job's effective resource requirement from the one persisted in the JOB_START event. When replaying the JOB_EXECUTE event, the job's effective resource requirement is recovered from the one persisted in JOB_EXECUTE.
After modifying this parameter, use badmin reconfig or badmin mbdrestart to make the new value take effect.

Default

brief

BSWITCH_MODIFY_RUSAGE

Syntax

BSWITCH_MODIFY_RUSAGE=Y|y|N|n

Description

By default, LSF does not modify effective resource requirements and job resource usage when running the bswitch command. The effective resource requirement string for scheduled jobs represents the resource requirement used by the scheduler to make a dispatch decision. Enable this parameter to allow bswitch to update job resource usage according to the resource requirements in the new queue.

Default

Not defined. bswitch does not update the effective resource usage of the job.

CHUNK_JOB_DURATION

Syntax

CHUNK_JOB_DURATION=minutes

Description

Specifies a CPU limit, run limit, or estimated run time for jobs submitted to a chunk job queue to be chunked.

When CHUNK_JOB_DURATION is set, the CPU limit or run limit set at the queue level (CPULIMIT or RUNLIMIT), application level (CPULIMIT or RUNLIMIT), or job level (-c or -W bsub options), or the run time estimate set at the application level (RUNTIME), must be less than or equal to CHUNK_JOB_DURATION for jobs to be chunked.

If CHUNK_JOB_DURATION is set, jobs are not chunked if:
v No CPU limit, run time limit, or run time estimate is specified at any level, or
v A CPU limit, run time limit, or run time estimate is greater than the value of CHUNK_JOB_DURATION.

The value of CHUNK_JOB_DURATION is displayed by bparams -l.

Examples
v CHUNK_JOB_DURATION is not defined:
– Jobs with no CPU limit, run limit, or run time estimate are chunked
– Jobs with a CPU limit, run limit, or run time estimate less than or equal to 30 are chunked
– Jobs with a CPU limit, run limit, or run time estimate greater than 30 are not chunked
v CHUNK_JOB_DURATION=90:
– Jobs with no CPU limit, run limit, or run time estimate are not chunked
– Jobs with a CPU limit, run limit, or run time estimate less than or equal to 90 are chunked
– Jobs with a CPU limit, run limit, or run time estimate greater than 90 are not chunked

Default

Not defined.

CLEAN_PERIOD

Syntax

CLEAN_PERIOD=seconds

Description

The amount of time that finished job records are kept in mbatchd core memory. Users can still see all jobs after they have finished by using the bjobs command. For jobs that finished more than CLEAN_PERIOD seconds ago, use the bhist command.

Default

3600 (1 hour)

CLEAN_PERIOD_DONE

Syntax

CLEAN_PERIOD_DONE=seconds

Description

Controls the amount of time during which successfully finished jobs are kept in mbatchd core memory. This applies to DONE and PDONE (post job execution processing) jobs.

If CLEAN_PERIOD_DONE is not defined, the clean period for DONE jobs is defined by CLEAN_PERIOD in lsb.params. If CLEAN_PERIOD_DONE is defined, its value must be less than CLEAN_PERIOD; otherwise it will be ignored and a warning message will appear. CLEAN_PERIOD_DONE is limited to one week.

Default

Not defined.

COMMITTED_RUN_TIME_FACTOR

Syntax

COMMITTED_RUN_TIME_FACTOR=number

Description

Used only with fairshare scheduling. Committed run time weighting factor.

In the calculation of a user's dynamic priority, this factor determines the relative importance of the committed run time in the calculation. If the -W option of bsub is not specified at job submission and a RUNLIMIT has not been set for the queue, the committed run time is not considered.

This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.
Valid values

Any positive number between 0.0 and 1.0

Default

0.0

COMPUTE_UNIT_TYPES

Syntax

COMPUTE_UNIT_TYPES=type1 type2...

Description

Used to define valid compute unit types for topological resource requirement allocation.

The order in which compute unit types appear specifies the containment relationship between types. Finer grained compute unit types appear first, followed by the coarser grained type that contains them, and so on.

At most one compute unit type in the list can be followed by an exclamation mark designating it as the default compute unit type. If no exclamation mark appears, the first compute unit type in the list is taken as the default type.

Valid values

Any space-separated list of alphanumeric strings.

Default

Not defined

Example

COMPUTE_UNIT_TYPES=cell enclosure! rack

Specifies three compute unit types, with the default type enclosure. Compute units of type rack contain type enclosure, and compute units of type enclosure contain type cell.

CONDENSE_PENDING_REASONS

Syntax

CONDENSE_PENDING_REASONS=ALL | PARTIAL | N

Description

Set to ALL, condenses all host-based pending reasons into one generic pending reason. This is equivalent to setting CONDENSE_PENDING_REASONS=Y.

Set to PARTIAL, condenses all host-based pending reasons except shared resource pending reasons into one generic pending reason.

If enabled, you can request a full pending reason list by running the following command:
badmin diagnose jobId

Tip: You must be an LSF administrator or a queue administrator to run this command.

Examples
v CONDENSE_PENDING_REASONS=ALL
If a job has no other pending reason, bjobs -p or bjobs -l displays the following:
Individual host based reasons
v CONDENSE_PENDING_REASONS=N
The pending reasons are not suppressed. Host-based pending reasons are displayed.

Default

Set to Y at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then N.
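Putting the preceding settings together, a minimal sketch of condensing reasons cluster-wide and then expanding them for one job looks like this (the job ID 1234 is hypothetical; badmin diagnose is the command documented above):

```
# In lsb.params:
Begin Parameters
CONDENSE_PENDING_REASONS = PARTIAL   #condense host-based reasons, keep shared resource reasons
End Parameters

# After badmin reconfig, an administrator can expand the full reason list for one job:
badmin diagnose 1234
```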
CPU_TIME_FACTOR

Syntax

CPU_TIME_FACTOR=number

Description

Used only with fairshare scheduling. CPU time weighting factor.

In the calculation of a user's dynamic share priority, this factor determines the relative importance of the cumulative CPU time used by a user's jobs.

This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default

0.7

DEFAULT_APPLICATION

Syntax

DEFAULT_APPLICATION=application_profile_name

Description

The name of the default application profile. The application profile must already be defined in lsb.applications.

When you submit a job to LSF without explicitly specifying an application profile, LSF associates the job with the specified application profile.

Default

Not defined. When a user submits a job without explicitly specifying an application profile, and no default application profile is defined by this parameter, LSF does not associate the job with any application profile.

DEFAULT_HOST_SPEC

Syntax

DEFAULT_HOST_SPEC=host_name | host_model

Description

The default CPU time normalization host for the cluster. The CPU factor of the specified host or host model is used to normalize the CPU time limit of all jobs in the cluster, unless the CPU time normalization host is specified at the queue or job level.

Default

Not defined

DEFAULT_JOB_CWD

Syntax

DEFAULT_JOB_CWD=directory

Description

Cluster wide current working directory (CWD) for the job. The path can be absolute or relative to the submission directory. The path can include the following dynamic patterns (which are case sensitive):
v %J - job ID
v %JG - job group (if not specified, it is ignored)
v %I - job index (default value is 0)
v %EJ - execution job ID
v %EI - execution job index
v %P - project name
v %U - user name
v %G - user group (new for both CWD and the job output directory)

Unsupported patterns are treated as text.
LSF only creates the directory if CWD includes dynamic patterns. For example:
DEFAULT_JOB_CWD=/scratch/jobcwd/%U/%J_%I

The job CWD is created by LSF before the job starts running, based on the CWD parameter values. For every job, the CWD is determined by the following sequence:
1. bsub -cwd. If not defined, LSF goes to steps 2 and 3.
2. Environment variable LSB_JOB_CWD. If not defined, LSF goes to steps 3 and 4.
3. Application profile based JOB_CWD parameter. If not defined, LSF goes to step 4.
4. Cluster wide CWD (DEFAULT_JOB_CWD). If not defined, there is no CWD and the submission directory is used instead.

DEFAULT_JOB_CWD supports all LSF path conventions such as UNIX, UNC, and Windows formats. In a mixed UNIX/Windows cluster, CWD can be specified with one value for UNIX and another value for Windows, separated by a pipe character (|):
DEFAULT_JOB_CWD=unix_path|windows_path

The first part of the path must be for UNIX and the second part must be for Windows. Both paths must be full paths.

Default

Not defined.

DEFAULT_JOB_OUTDIR

Syntax

DEFAULT_JOB_OUTDIR=directory

Description

Set this parameter for LSF to create a cluster wide output directory for the job. Once set, the system starts using the new directory and always tries to create the directory if it does not exist.

The directory path can be absolute or relative to the submission directory with dynamic patterns. The output directory can include the following dynamic patterns (which are case sensitive):
v %J - job ID
v %JG - job group (if not specified, it is ignored)
v %I - job index (default value is 0)
v %EJ - execution job ID
v %EI - execution job index
v %P - project name
v %U - user name
v %G - user group (new for the job output directory)

Unsupported patterns are treated as text. For example:
DEFAULT_JOB_OUTDIR=/scratch/%U/%J | \\samba\scratch\%U\%J

LSF creates the output directory even if the path does not include dynamic patterns.
LSF checks the directories from the beginning of the path. If a directory does not exist, the system tries to create that directory. If it fails to create that directory, the system deletes all created directories and uses the submission directory for output. LSF creates all directories with 700 permissions, owned by the submission user.

DEFAULT_JOB_OUTDIR supports all LSF path conventions such as UNIX, UNC, and Windows formats. In a mixed UNIX/Windows cluster, the directory can be specified with one value for UNIX and another value for Windows, separated by a pipe character (|). For example:
DEFAULT_JOB_OUTDIR=unix_path|windows_path

The first part of the path must be for UNIX and the second part must be for Windows. Both paths must be full paths.

An output directory can also be created for a checkpointed job.

Default

Not defined. The system uses the submission directory for job output.

DEFAULT_JOBGROUP

Syntax

DEFAULT_JOBGROUP=job_group_name

Description

The name of the default job group.

When you submit a job to LSF without explicitly specifying a job group, LSF associates the job with the specified job group. The LSB_DEFAULT_JOBGROUP environment variable overrides the setting of DEFAULT_JOBGROUP. The bsub -g job_group_name option overrides both LSB_DEFAULT_JOBGROUP and DEFAULT_JOBGROUP.

Default job group specification supports macro substitution for project name (%p) and user name (%u). When you specify bsub -P project_name, the value of %p is the specified project name. If you do not specify a project name at job submission, %p is the project name defined by setting the environment variable LSB_DEFAULTPROJECT, or the project name specified by DEFAULT_PROJECT in lsb.params. The default project name is default.

For example, a default job group name specified by DEFAULT_JOBGROUP=/canada/%p/%u is expanded to the value for the LSF project name and the user name of the job submission user (for example, /canada/projects/user1).
Job group names must follow this format:
v Job group names must start with a slash character (/). For example, DEFAULT_JOBGROUP=/A/B/C is correct, but DEFAULT_JOBGROUP=A/B/C is not correct.
v Job group names cannot end with a slash character (/). For example, DEFAULT_JOBGROUP=/A/ is not correct.
v Job group names cannot contain more than one slash character (/) in a row. For example, job group names like DEFAULT_JOBGROUP=/A//B or DEFAULT_JOBGROUP=A////B are not correct.
v Job group names cannot contain spaces. For example, DEFAULT_JOBGROUP=/A/B C/D is not correct.
v Project names and user names used for macro substitution with %p and %u cannot start or end with a slash character (/).
v Project names and user names used for macro substitution with %p and %u cannot contain spaces or more than one slash character (/) in a row.
v Project names or user names containing a slash character (/) create separate job groups. For example, if the project name is canada/projects, DEFAULT_JOBGROUP=/%p results in the job group hierarchy /canada/projects.

Example

DEFAULT_JOBGROUP=/canada/projects

Default

Not defined. When a user submits a job without explicitly specifying a job group name, and the LSB_DEFAULT_JOBGROUP environment variable is not defined, LSF does not associate the job with any job group.

DEFAULT_PROJECT

Syntax

DEFAULT_PROJECT=project_name

Description

The name of the default project. Specify any string. Project names can be up to 59 characters long.

When you submit a job without specifying any project name, and the environment variable LSB_DEFAULTPROJECT is not set, LSF automatically assigns the job to this project.

Default

default

DEFAULT_QUEUE

Syntax

DEFAULT_QUEUE=queue_name ...

Description

Space-separated list of candidate default queues (candidates must already be defined in lsb.queues).
When you submit a job to LSF without explicitly specifying a queue, and the environment variable LSB_DEFAULTQUEUE is not set, LSF puts the job in the first queue in this list that satisfies the job's specifications, subject to other restrictions such as requested hosts, queue status, etc.

Default

This parameter is set at installation to DEFAULT_QUEUE=normal interactive.

When a user submits a job to LSF without explicitly specifying a queue, and there are no candidate default queues defined (by this parameter or by the user's environment variable LSB_DEFAULTQUEUE), LSF automatically creates a new queue named default, using the default configuration, and submits the job to that queue.

DEFAULT_RESREQ_ORDER

Syntax

DEFAULT_RESREQ_ORDER=order_string

Description

The order_string is [!][-]resource_name[:[-]resource_name]...

Specify the global LSF default sorting order for resource requirements so the scheduler can find the right candidate host. You can specify any built-in or external load index or static resource. When an index name is preceded by a minus sign (-), the sorting order is reversed so that hosts are ordered from worst to best on that index. A value that contains multiple strings separated by spaces must be enclosed in quotation marks.

Default

r15s:pg

DEFAULT_SLA_VELOCITY

Syntax

DEFAULT_SLA_VELOCITY=num_slots

Description

For EGO-enabled SLA scheduling, the number of slots that the SLA should request for parallel jobs running in the SLA.

By default, an EGO-enabled SLA requests slots from EGO based on the number of jobs the SLA needs to run. If the jobs themselves require more than one slot, they will remain pending. To avoid this for parallel jobs, set DEFAULT_SLA_VELOCITY to the total number of slots that are expected to be used by parallel jobs.

Default

1
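As an illustration only (the values shown are hypothetical, not recommendations), several of the preceding cluster-wide defaults could be combined in a single Parameters section:

```
Begin Parameters
DEFAULT_QUEUE = normal interactive    #first candidate queue that fits the job wins
DEFAULT_RESREQ_ORDER = r15s:mem       #order candidate hosts by r15s, then mem
DEFAULT_SLA_VELOCITY = 8              #request 8 slots for parallel jobs in an EGO-enabled SLA
End Parameters
```

The r15s:mem ordering string is a sketch; any built-in or external load index or static resource can be substituted, per the DEFAULT_RESREQ_ORDER description above.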
DEFAULT_USER_GROUP

Syntax

DEFAULT_USER_GROUP=default_user_group

Description

When DEFAULT_USER_GROUP is defined, all submitted jobs must be associated with a user group. Jobs without a user group specified are associated with default_user_group, where default_user_group is a group configured in lsb.users that contains all as a direct member. DEFAULT_USER_GROUP can only contain one user group.

If the default user group does not have shares assigned in a fairshare queue, jobs can still run from the default user group and are charged to the highest priority account the user can access in the queue. A job submitted to a user group without shares in a specified fairshare queue is transferred to the default user group, where the job can run. A job modified or moved using bmod or bswitch may similarly be transferred to the default user group.

Note: The default user group should be configured in most queues and have shares in most fairshare queues to ensure jobs run smoothly.

Jobs linked to a user group, either through the default_user_group or a user group specified at submission using bsub -G, allow the user group administrator to issue job control operations. User group administrator rights are configured in the UserGroup section of lsb.users, under GROUP_ADMIN.

When DEFAULT_USER_GROUP is not defined, jobs do not require a user group association.

After adding or changing DEFAULT_USER_GROUP in lsb.params, use badmin reconfig to reconfigure your cluster.

Default

Not defined. When a user submits a job without explicitly specifying a user group name, LSF does not associate the job with any user group.

See also

STRICT_UG_CONTROL, ENFORCE_ONE_UG_LIMIT

DETECT_IDLE_JOB_AFTER

Syntax

DETECT_IDLE_JOB_AFTER=time_minutes

Description

The minimum job run time before mbatchd reports that the job is idle.
Default

20 (mbatchd checks if the job is idle after 20 minutes of run time)

DIAGNOSE_LOGDIR

Syntax

DIAGNOSE_LOGDIR=<full directory path>

Description

You must enable the ENABLE_DIAGNOSE parameter for DIAGNOSE_LOGDIR to take effect. Set DIAGNOSE_LOGDIR to specify the file location for the collected information. The log file shows who issued these requests, where the requests came from, and the data size of the query.

If you do not modify this parameter, the default location for the log file is LSF_LOGDIR. The name of the log file is query_info.querylog.<host_name>.

You can dynamically set the path from the command line with badmin diagnose -c query -f log_name, where log_name can be a full path. This overrides any other setting for the path. However, if you restart or reconfigure mbatchd, this path setting is lost and it defaults back to the setting you specified in this parameter.

The output of the log file is:
14:13:02 2011,bjobs,server02,user1,1020, 0x0001

Where:
v The 1st field is the timestamp.
v The 2nd field is the trigger query command. Only mbatchd query commands are supported. Some commands may trigger multiple queries/entries.
v The 3rd field is the sending host. If the host name cannot be resolved, LSF uses the IP address if available, or -.
v The 4th field is the user. If the user name is unknown, LSF displays -.
v The 5th field is the maximum data size in KB. The real data size may be a little smaller in some cases.
v The 6th field is for query options. Check lsbatchd.h for details. If not applied, LSF displays -.

The values are separated by commas to make it easier to write a script for analyzing the data.

Default

LSF_LOGDIR/

DISABLE_UACCT_MAP

Syntax

DISABLE_UACCT_MAP=y | Y

Description

Specify y or Y to disable user-level account mapping.
Default
N

EADMIN_TRIGGER_DURATION

Syntax
EADMIN_TRIGGER_DURATION=minutes

Description
Defines how often LSF_SERVERDIR/eadmin is invoked once a job exception is detected. Used in conjunction with the job exception handling parameters JOB_IDLE, JOB_OVERRUN, and JOB_UNDERRUN in lsb.queues.

Tip: Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.

Example
EADMIN_TRIGGER_DURATION=5

Default
1 minute

EGO_SLOTBASED_VELOCITY_SLA

Syntax
EGO_SLOTBASED_VELOCITY_SLA=Y|N

Description
Enables slot-based requirements for EGO-enabled SLA. If the value is N, LSF calculates how many slots you need by the number of jobs. If the value is Y, LSF calculates how many slots you need by the number of job slots instead of the number of jobs.

Default
Y

ENABLE_DEFAULT_EGO_SLA

Syntax
ENABLE_DEFAULT_EGO_SLA=service_class_name | consumer_name

Description
The name of the default service class or EGO consumer name for EGO-enabled SLA scheduling. If the specified SLA does not exist in lsb.serviceclasses, LSF creates one with the specified consumer name, a velocity of 1, a priority of 1, and a time window that is always open. If the name of the default SLA is not configured in lsb.serviceclasses, it must be the name of a valid EGO consumer.

ENABLE_DEFAULT_EGO_SLA is required to turn on EGO-enabled SLA scheduling. All LSF resource management is delegated to EGO, and all LSF hosts are under EGO control. When all jobs running in the default SLA finish, all allocated hosts are released to EGO after the default idle timeout of 120 seconds (configurable by MAX_HOST_IDLE_TIME in lsb.serviceclasses).

When you submit a job to LSF without explicitly using the -sla option to specify a service class name, LSF puts the job in the default service class specified by service_class_name.

Default
Not defined.
When a user submits a job to LSF without explicitly specifying a service class, and there is no default service class defined by this parameter, LSF does not attach the job to any service class.

ENABLE_DIAGNOSE

Syntax
ENABLE_DIAGNOSE=query

Description
Enable this parameter for mbatchd to write query source information to a log file (see DIAGNOSE_LOGDIR in lsb.params). The log file shows information about the source of mbatchd queries, allowing you to troubleshoot problems. It records who issued these requests and where they came from, collecting key information such as the query name, user name, host name, and the data size of the query. You can write a script to format the output.

Default
Disabled

ENABLE_EVENT_STREAM

Syntax
ENABLE_EVENT_STREAM=Y | N

Description
Used only with event streaming for system performance analysis tools.

Default
N (event streaming is not enabled)

ENABLE_EXIT_RATE_PER_SLOT

Syntax
ENABLE_EXIT_RATE_PER_SLOT=Y | N

Description
Scales the actual exit rate thresholds on a host according to the number of slots on the host. For example, if EXIT_RATE=2 in lsb.hosts or GLOBAL_EXIT_RATE=2 in lsb.params, and the host has 2 job slots, the job exit rate threshold will be 4.

Default
N

ENABLE_HIST_RUN_TIME

Syntax
ENABLE_HIST_RUN_TIME=y | Y | n | N

Description
Used with fairshare scheduling and global fairshare scheduling. If set, enables the use of historical run time in the calculation of fairshare scheduling priority.

Whether or not ENABLE_HIST_RUN_TIME is set for a global fairshare queue, the historical run time for share accounts in the global fairshare queue is reported to GPD. When GPD receives historical run time from one cluster, it broadcasts the historical run time to other clusters. The remote historical run time received from GPD is not used in the calculation for fairshare scheduling priority for the queue.
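For example, to factor historical run time into fairshare priority calculations at the cluster level (as with other lsb.params changes, run badmin reconfig afterward):

```
ENABLE_HIST_RUN_TIME=Y
```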
This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default
N

ENABLE_HOST_INTERSECTION

Syntax
ENABLE_HOST_INTERSECTION=Y | N

Description
When enabled, allows job submission to any host that belongs to the intersection created when considering the queue the job was submitted to, any advance reservation hosts, and any hosts specified by bsub -m at the time of submission.

When disabled, job submission with hosts specified can be accepted only if the specified hosts are a subset of the hosts defined in the queue.

The following commands are affected by ENABLE_HOST_INTERSECTION:
v bsub
v bmod
v bmig
v brestart
v bswitch

If no hosts exist in the intersection, the job is rejected.

Default
N

ENABLE_JOB_INFO_BY_ADMIN_ROLE

Syntax
ENABLE_JOB_INFO_BY_ADMIN_ROLE = [usergroup] [queue] [cluster]

Description
By default, an administrator’s access to job details is determined by the setting of SECURE_JOB_INFO_LEVEL, the same as a regular user. The parameter ENABLE_JOB_INFO_BY_ADMIN_ROLE in lsb.params allows you to give user group, queue, and cluster administrators the right to access job detail information for jobs in the user groups, queues, and clusters they manage, even when the administrator has no right based on the configuration of SECURE_JOB_INFO_LEVEL.

You may define one or more of the values usergroup, queue, or cluster.

Default
NULL (Not defined)

ENABLE_USER_RESUME

Syntax
ENABLE_USER_RESUME=Y | N

Description
Defines job resume permissions. When this parameter is defined:
v If the value is Y, users can resume their own jobs that have been suspended by the administrator.
v If the value is N, jobs that are suspended by the administrator can only be resumed by the administrator or root; users do not have permission to resume a job suspended by another user or the administrator. Administrators can resume jobs suspended by users or administrators.
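For example, to let users resume their own jobs that an administrator has suspended:

```
ENABLE_USER_RESUME=Y
```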
Default
N (users cannot resume jobs suspended by the administrator)

ENFORCE_ONE_UG_LIMITS

Syntax
ENFORCE_ONE_UG_LIMITS=Y | N

Description
Upon job submission with the -G option and when user groups have overlapping members, defines whether only the specified user group’s limits (or those of any parent group) are enforced or whether the most restrictive user group limits of any overlapping user/user group are enforced.
v If the value is Y, only the limits defined for the user group that you specify with -G during job submission apply to the job, even if there are overlapping members of groups. If you have nested user groups, the limits of a user's parent group also apply. View existing limits by running blimits.
v If the value is N and the user group has members that overlap with other user groups, the strictest possible limits (which you can view by running blimits) defined for any of the member user groups are enforced for the job.

If the user group specified at submission is no longer valid when the job runs and ENFORCE_ONE_UG_LIMIT=Y, only the user limit is applied to the job. This can occur if the user group is deleted or the user is removed from the user group.

Default
N

ENFORCE_UG_TREE

Syntax
ENFORCE_UG_TREE=Y | N

Description
When ENFORCE_UG_TREE=Y is defined, user groups must form a tree-like structure, with each user group having at most one parent. User group definitions in the UserGroup section of lsb.users are checked in configuration order, and any user group appearing in GROUP_MEMBER more than once is ignored after the first occurrence.

After adding or changing ENFORCE_UG_TREE in lsb.params, use badmin reconfig to reconfigure your cluster.

Default
N (Not defined.)
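As an illustration, a UserGroup section in lsb.users like the following (group and user names are hypothetical) satisfies the tree requirement, because each group appears in GROUP_MEMBER of at most one parent:

```
Begin UserGroup
GROUP_NAME    GROUP_MEMBER
groupA        (user1 user2)
groupB        (user3 user4)
parentG       (groupA groupB)
End UserGroup
```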
See also
DEFAULT_USER_GROUP, ENFORCE_ONE_UG_LIMIT, STRICT_UG_CONTROL

EVALUATE_JOB_DEPENDENCY

Syntax
EVALUATE_JOB_DEPENDENCY=integer

Description
Sets the maximum number of job dependencies mbatchd evaluates in one scheduling cycle. This parameter limits the amount of time mbatchd spends evaluating job dependencies in a scheduling cycle, which limits the amount of time the job dependency evaluation blocks services. Job dependency evaluation is the process used to check whether each job's dependency condition is satisfied. When a job's dependency condition is satisfied, the job sets a ready flag and allows itself to be scheduled by mbschd.

When EVALUATE_JOB_DEPENDENCY is set, a configured number of jobs are evaluated. Not all of the dependency-satisfied jobs may be set to READY status in the same session. Therefore, jobs intended to be dispatched in one scheduling session may be dispatched in different scheduling sessions. Also, the job dependency evaluation process starts from the last evaluation end location, so it may prevent some dependency-satisfied jobs that occur before the end location from being set to READY status in that particular session. This may cause one job to be dispatched before another job that was ready first. LSF starts the job dependency evaluation from the endpoint in the next session. LSF evaluates all dependent jobs every 10 minutes regardless of the configuration of EVALUATE_JOB_DEPENDENCY.

Default
Unlimited.

EVENT_STREAM_FILE

Syntax
EVENT_STREAM_FILE=file_path

Description
Determines the path to the event data stream file used by system performance analysis tools.

Default
LSF_TOP/work/cluster_name/logdir/stream/lsb.stream

EVENT_UPDATE_INTERVAL

Syntax
EVENT_UPDATE_INTERVAL=seconds

Description
Used with duplicate logging of event and accounting log files. LSB_LOCALDIR in lsf.conf must also be specified.
Specifies how often to back up the data and synchronize the directories (LSB_SHAREDIR and LSB_LOCALDIR). If you do not define this parameter, the directories are synchronized when data is logged to the files, or when mbatchd is started on the first LSF master host. If you define this parameter, mbatchd synchronizes the directories only at the specified time intervals. Use this parameter if NFS traffic is too high and you want to reduce network traffic.

Valid values
1 to 2147483647

Recommended values
Between 10 and 30 seconds, or longer depending on the amount of network traffic.

Default
Not defined.

See also
LSB_LOCALDIR in lsf.conf

EXIT_RATE_TYPE

Syntax
EXIT_RATE_TYPE=[JOBEXIT | JOBEXIT_NONLSF] [JOBINIT] [HPCINIT]

Description
When host exception handling is configured (EXIT_RATE in lsb.hosts or GLOBAL_EXIT_RATE in lsb.params), specifies the type of job exit to be handled.

JOBEXIT
Job exited after it was dispatched and started running.

JOBEXIT_NONLSF
Job exited with exit reasons related to LSF and not related to a host problem (for example, user action or LSF policy). These jobs are not counted in the exit rate calculation for the host.

JOBINIT
Job exited during initialization because of an execution environment problem. The job did not actually start running.

HPCINIT
HPC job exited during initialization because of an execution environment problem. The job did not actually start running.

Default
JOBEXIT_NONLSF

EXTEND_JOB_EXCEPTION_NOTIFY

Syntax
EXTEND_JOB_EXCEPTION_NOTIFY=Y | y | N | n

Description
Sends extended information about a job exception in the notification email sent when a job exception occurs. Extended information includes:
v JOB_ID
v RUN_TIME
v IDLE_FACTOR (Only applicable if the job has been idle.)
v USER
v QUEUE
v EXEC_HOST
v JOB_NAME

You can also set format options of the email in the eadmin script, located in the LSF_SERVERDIR directory. Valid values are fixed or full.

Default
N (Notification for a job exception is standard and includes only the job ID and either run time or idle factor.)

FAIRSHARE_ADJUSTMENT_FACTOR

Syntax
FAIRSHARE_ADJUSTMENT_FACTOR=number

Description
Used only with fairshare scheduling. Fairshare adjustment plugin weighting factor. In the calculation of a user’s dynamic share priority, this factor determines the relative importance of the user-defined adjustment made in the fairshare plugin (libfairshareadjust.*). A positive float number both enables the fairshare plugin and acts as a weighting factor.

This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default
0 (user-defined adjustment made in the fairshare plugin not used)

GLOBAL_EXIT_RATE

Syntax
GLOBAL_EXIT_RATE=number

Description
Specifies a cluster-wide threshold for exited jobs. Specify a number of jobs. If EXIT_RATE is not specified for the host in lsb.hosts, GLOBAL_EXIT_RATE defines a default exit rate for all hosts in the cluster. Host-level EXIT_RATE overrides the GLOBAL_EXIT_RATE value.

If the number of jobs that exit over the period of time specified by JOB_EXIT_RATE_DURATION (5 minutes by default) exceeds the number of jobs that you specify as the threshold in this parameter, LSF invokes LSF_SERVERDIR/eadmin to trigger a host exception.

Example
GLOBAL_EXIT_RATE=10 defines a job exit rate of 10 jobs for all hosts.

Default
2147483647 (Unlimited threshold.)

HIST_HOURS

Syntax
HIST_HOURS=hours

Description
Used only with fairshare scheduling. Determines a rate of decay for cumulative CPU time, run time, and historical run time.
To calculate dynamic user priority, LSF scales the actual CPU time and the run time using a decay factor, so that 1 hour of recently-used time is equivalent to 0.1 hours after the specified number of hours has elapsed.

To calculate dynamic user priority with decayed run time and historical run time, LSF scales the accumulated run time of finished jobs and the run time of running jobs using the same decay factor, so that 1 hour of recently-used time is equivalent to 0.1 hours after the specified number of hours has elapsed.

When HIST_HOURS=0, CPU time and run time accumulated by running jobs are not decayed.

This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default
5

JOB_ACCEPT_INTERVAL

Syntax
JOB_ACCEPT_INTERVAL=integer

Description
The number you specify is multiplied by the value of MBD_SLEEP_TIME in lsb.params (60 seconds by default). The result of the calculation is the number of seconds to wait after dispatching a job to a host before dispatching a second job to the same host.

If 0 (zero), a host may accept more than one job. By default, there is no limit to the total number of jobs that can run on a host, so if this parameter is set to 0, a very large number of jobs might be dispatched to a host all at once. This can overload your system to the point that it is unable to create any more processes. It is not recommended to set this parameter to 0.

JOB_ACCEPT_INTERVAL set at the queue level (lsb.queues) overrides JOB_ACCEPT_INTERVAL set at the cluster level (lsb.params).

Note: The parameter JOB_ACCEPT_INTERVAL only applies when there are running jobs on a host. In other words, when there are no running jobs on a host, a new job can go right away to this host. When the first job finishes before the next job accept interval expires, the interval is ignored and a job is dispatched to the same host.
For example, job1 is dispatched to host A. If job1's run time is 10 minutes, the job accept interval is 1, and mbd_sleep_time is 60 seconds, then no second job is dispatched to host A within 60 seconds. However, if job1's run time is 5 seconds on host A, then after job1 completes, the host is available, and the JOB_ACCEPT_INTERVAL policy allows one job to be dispatched to host A as soon as possible.

Default
Set to 0 at time of installation. If otherwise undefined, then set to 1.

JOB_ATTA_DIR

Syntax
JOB_ATTA_DIR=directory

Description
The shared directory in which mbatchd saves the attached data of messages posted with the bpost command.

Use JOB_ATTA_DIR if you use bpost and bread to transfer large data files between jobs and want to avoid using space in LSB_SHAREDIR. By default, the bread command reads attachment data from the JOB_ATTA_DIR directory.

JOB_ATTA_DIR should be shared by all hosts in the cluster, so that any potential LSF master host can reach it. Like LSB_SHAREDIR, the directory should be owned and writable by the primary LSF administrator. The directory must have at least 1 MB of free space.

The attached data will be stored under the directory in the format:
JOB_ATTA_DIR/timestamp.jobid.msgs/msg$msgindex

On UNIX, specify an absolute path. For example:
JOB_ATTA_DIR=/opt/share/lsf_work

On Windows, specify a UNC path or a path with a drive letter. For example:
JOB_ATTA_DIR=\\HostA\temp\lsf_work
or
JOB_ATTA_DIR=D:\temp\lsf_work

After adding JOB_ATTA_DIR to lsb.params, use badmin reconfig to reconfigure your cluster.

Valid values
JOB_ATTA_DIR can be any valid UNIX or Windows path up to a maximum length of 256 characters.

Default
Not defined

If JOB_ATTA_DIR is not specified, job message attachments are saved in LSB_SHAREDIR/info/.

JOB_CWD_TTL

Syntax
JOB_CWD_TTL=hours

Description
Specifies the time-to-live (TTL) for the current working directory (CWD) of a job.
LSF cleans created CWD directories after a job finishes based on the TTL value. LSF deletes the CWD for the job if LSF created that directory for the job. The following options are available:
v 0 - sbatchd deletes the CWD when all processes related to the job finish.
v 2147483647 - Never delete the CWD for a job.
v 1 to 2147483646 - Delete the CWD for a job after the timeout expires.

The system checks the directory list for cleaning every 5 minutes and deletes only the last directory of the path to avoid conflicts when multiple jobs share some parent directories. The TTL is calculated after the post-exec script finishes. When LSF (sbatchd) starts, it checks the directory list file and deletes expired CWDs.

If the value for this parameter is not set in the application profile, LSF checks to see if it is set at the cluster-wide level in lsb.params. If neither is set, the default value is used.

Default
Not defined. The value of 2147483647 is used, meaning the CWD is not deleted.

JOB_DEP_LAST_SUB

Description
Used only with job dependency scheduling.

If set to 1, whenever dependency conditions use a job name that belongs to multiple jobs, LSF evaluates only the most recently submitted job. Otherwise, all the jobs with the specified name must satisfy the dependency condition.

Running jobs are not affected when JOB_DEP_LAST_SUB is changed. To reevaluate job dependencies after changing JOB_DEP_LAST_SUB, run badmin reconfig.

Default
Set to 1 at time of installation for the DEFAULT and PARALLEL configuration templates. If otherwise undefined, then 0 (turned off).

JOB_DISTRIBUTE_ON_HOST

Syntax
JOB_DISTRIBUTE_ON_HOST=pack | balance | any

Description
For NUMA CPU and memory affinity scheduling. Specifies how LSF distributes tasks for different jobs. The parameter has the following values:

pack
LSF attempts to pack tasks as tightly as possible across jobs.
Topology nodes with fewer available resources will be favored for task allocations.

JOB_DISTRIBUTE_ON_HOST is not the same as the distribute clause on the command-line affinity resource requirement. JOB_DISTRIBUTE_ON_HOST decides how to distribute tasks between jobs, rather than within a job. Use pack to allow your application to use memory locality.

balance
LSF attempts to distribute tasks equally across the hosts' topology, while considering the allocations of all jobs. Topology nodes with more available resources will be favored for task allocations.

any
LSF attempts no job task placement optimization. LSF chooses the first available processor units for task placement.

When JOB_DISTRIBUTE_ON_HOST is not defined, any is the default value.

Default
Not defined. JOB_DISTRIBUTE_ON_HOST=any is used.

JOB_EXIT_RATE_DURATION

Description
Defines how long LSF waits before checking the job exit rate for a host. Used in conjunction with EXIT_RATE in lsb.hosts for LSF host exception handling.

If the job exit rate is exceeded for the period specified by JOB_EXIT_RATE_DURATION, LSF invokes LSF_SERVERDIR/eadmin to trigger a host exception.

Tuning
Tip: Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.

Example
JOB_EXIT_RATE_DURATION=10

Default
5 minutes

JOB_GROUP_CLEAN

Syntax
JOB_GROUP_CLEAN=Y | N

Description
If JOB_GROUP_CLEAN=Y, implicitly created job groups that are empty and have no limits assigned to them are automatically deleted. Job groups can only be deleted automatically if they have no limits specified (directly or in descendent job groups), have no explicitly created children job groups, and haven’t been attached to an SLA.

Default
N (Implicitly created job groups are not automatically deleted unless they are deleted manually with bgdel.)
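For example, to clean up empty, limit-free, implicitly created job groups automatically instead of deleting them manually with bgdel:

```
JOB_GROUP_CLEAN=Y
```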
JOB_INCLUDE_POSTPROC

Syntax
JOB_INCLUDE_POSTPROC=Y | N

Description
Specifies whether LSF includes the post-execution processing of the job as part of the job. When set to Y:
v Prevents a new job from starting on a host until post-execution processing is finished on that host
v Includes the CPU and run times of post-execution processing with the job CPU and run times
v sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time

In the MultiCluster job forwarding model, the JOB_INCLUDE_POSTPROC value in the receiving cluster applies to the job. In the MultiCluster job lease model, the JOB_INCLUDE_POSTPROC value applies to jobs running on remote leased hosts as if they were running on local hosts.

The variable LSB_JOB_INCLUDE_POSTPROC in the user environment overrides the value of JOB_INCLUDE_POSTPROC in an application profile in lsb.applications. JOB_INCLUDE_POSTPROC in an application profile in lsb.applications overrides the value of JOB_INCLUDE_POSTPROC in lsb.params.

For CPU and memory affinity jobs, if JOB_INCLUDE_POSTPROC=Y, LSF does not release affinity resources until post-execution processing has finished, since slots are still occupied by the job during post-execution processing. For SGI cpusets, if JOB_INCLUDE_POSTPROC=Y, LSF does not release the cpuset until post-execution processing has finished, even though post-execution processes are not attached to the cpuset.

Default
N (Post-execution processing is not included as part of the job, and a new job can start on the execution host before post-execution processing finishes.)

JOB_POSITION_CONTROL_BY_ADMIN

Syntax
JOB_POSITION_CONTROL_BY_ADMIN=Y | N

Description
Allows LSF administrators to control whether users can use btop and bbot to move jobs to the top and bottom of queues. When JOB_POSITION_CONTROL_BY_ADMIN=Y,
only the LSF administrator (including any queue administrators) can use bbot and btop to move jobs within a queue.

Default
N

See also
bbot, btop

JOB_POSTPROC_TIMEOUT

Syntax
JOB_POSTPROC_TIMEOUT=minutes

Description
Specifies a timeout in minutes for job post-execution processing. The specified timeout must be greater than zero.

If post-execution processing takes longer than the timeout, sbatchd reports that post-execution has failed (POST_ERR status), and kills the entire process group of the job’s post-execution processes on UNIX and Linux. On Windows, only the parent process of the post-execution command is killed when the timeout expires. The child processes of the post-execution command are not killed.

If JOB_INCLUDE_POSTPROC=Y, and sbatchd kills the post-execution processes because the timeout has been reached, the CPU time of the post-execution processing is set to 0, and the job’s CPU time does not include the CPU time of post-execution processing.

JOB_POSTPROC_TIMEOUT defined in an application profile in lsb.applications overrides the value in lsb.params. JOB_POSTPROC_TIMEOUT cannot be defined in the user environment.

In the MultiCluster job forwarding model, the JOB_POSTPROC_TIMEOUT value in the receiving cluster applies to the job. In the MultiCluster job lease model, the JOB_POSTPROC_TIMEOUT value applies to jobs running on remote leased hosts as if they were running on local hosts.

When running host-based post-execution processing, set JOB_POSTPROC_TIMEOUT to a value that gives the process enough time to run.

Default
2147483647 (Unlimited; post-execution processing does not time out.)

JOB_PREPROC_TIMEOUT

Syntax
JOB_PREPROC_TIMEOUT=minutes

Description
Specify a timeout in minutes for job pre-execution processing. The specified timeout must be an integer greater than zero.
If the job's pre-execution processing takes longer than the timeout, LSF kills the job's pre-execution processes, kills the job with a pre-defined exit value of 98, and then requeues the job to the head of the queue. However, if the number of pre-execution retries has reached the limit, LSF suspends the job with PSUSP status instead of requeuing it.

JOB_PREPROC_TIMEOUT defined in an application profile in lsb.applications overrides the value in lsb.params. JOB_PREPROC_TIMEOUT cannot be defined in the user environment.

On UNIX and Linux, sbatchd kills the entire process group of the job's pre-execution processes. On Windows, only the parent process of the pre-execution command is killed when the timeout expires; the child processes of the pre-execution command are not killed.

In the MultiCluster job forwarding model, JOB_PREPROC_TIMEOUT and the number of pre-execution retries defined in the receiving cluster apply to the job. When the number of attempts reaches the limit, the job returns to the submission cluster and is rescheduled. In the MultiCluster job lease model, JOB_PREPROC_TIMEOUT and the number of pre-execution retries defined in the submission cluster apply to jobs running on remote leased hosts, as if they were running on local hosts.

Default
Not defined. Pre-execution processing does not time out. However, when running host-based pre-execution processing, you cannot use the infinite value or it may fail. You must configure a reasonable value.

JOB_PRIORITY_OVER_TIME

Syntax
JOB_PRIORITY_OVER_TIME=increment/interval

Description
JOB_PRIORITY_OVER_TIME enables automatic job priority escalation when MAX_USER_PRIORITY is also defined.

Valid values

increment
Specifies the value used to increase job priority every interval minutes. Valid values are positive integers.

interval
Specifies the frequency, in minutes, to increment job priority. Valid values are positive integers.

Default
Not defined.
Example
JOB_PRIORITY_OVER_TIME=3/20 specifies that the job priority of pending jobs is incremented by 3 every 20 minutes.

See also
MAX_USER_PRIORITY

JOB_RUNLIMIT_RATIO

Syntax
JOB_RUNLIMIT_RATIO=integer | 0

Description
Specifies a ratio between a job run limit and the runtime estimate specified by bsub -We or bmod -We, -We+, -Wep. The ratio does not apply to the RUNTIME parameter in lsb.applications.

This ratio can be set to 0, in which case no restrictions are applied to the runtime estimate. JOB_RUNLIMIT_RATIO prevents abuse of the runtime estimate. The value of this parameter is the ratio of the run limit divided by the runtime estimate. By default, the ratio value is 0. Only administrators can set or change this ratio. If the ratio changes, it only applies to newly submitted jobs. The changed value does not retroactively reapply to already submitted jobs.

If the ratio value is greater than 0:
v If users specify a runtime estimate only (bsub -We), the job-level run limit is automatically set to runtime_ratio * runtime_estimate. Jobs running longer than this run limit are killed by LSF. If the job-level run limit is greater than the hard run limit in the queue, the job is rejected.
v If users specify a runtime estimate (-We) and a job run limit (-W) at job submission, and the run limit is greater than runtime_ratio * runtime_estimate, the job is rejected.
v If users modify the run limit to be greater than runtime_ratio * runtime_estimate, they must increase the runtime estimate first (bmod -We). Then they can increase the default run limit.
v LSF remembers whether the run limit was set explicitly with bsub -W or converted from runtime_ratio * runtime_estimate. When users modify the run limit with bmod -Wn, the run limit is automatically set to runtime_ratio * runtime_estimate. If the run limit was set from runtime_ratio, LSF rejects the run limit modification.
v If users modify the runtime estimate with bmod -We and the run limit was set by the user, the run limit is MIN(new_estimate * new_ratio, run_limit). If the run limit was set by runtime_ratio, the run limit is set to new_estimate * new_ratio.
v If users modify the runtime estimate by using bmod -Wen and the run limit was set by the user, it is not changed. If the run limit was set by runtime_ratio, it is set to unlimited.

In the MultiCluster job forwarding model, the JOB_RUNLIMIT_RATIO values in both the sending and receiving clusters apply to the job. The run limit in the receiving cluster cannot be greater than the value of runtime * JOB_RUNLIMIT_RATIO in the receiving cluster. Some examples:
v Runtime estimate (for example, with bsub -We) is 10, JOB_RUNLIMIT_RATIO=5 in the sending cluster, JOB_RUNLIMIT_RATIO=0 in the receiving cluster: run limit=50, and the job will run
v Runtime estimate is 10, JOB_RUNLIMIT_RATIO=5 in the sending cluster, JOB_RUNLIMIT_RATIO=3 in the receiving cluster: run limit=50, and the job will pend
v Runtime estimate is 10, JOB_RUNLIMIT_RATIO=5 in the sending cluster, JOB_RUNLIMIT_RATIO=6 in the receiving cluster: run limit=50, and the job will run
v Runtime estimate is 10, JOB_RUNLIMIT_RATIO=0 in the sending cluster, JOB_RUNLIMIT_RATIO=5 in the receiving cluster: run limit=50, and the job will run

In the MultiCluster job lease model, the JOB_RUNLIMIT_RATIO value applies to jobs running on remote leased hosts as if they were running on local hosts.

Default
0

JOB_SCHEDULING_INTERVAL

Syntax
JOB_SCHEDULING_INTERVAL=seconds | milliseconds ms

Description
Time interval at which mbatchd sends jobs for scheduling to the scheduling daemon mbschd, along with any collected load information. Specify in seconds, or include the keyword ms to specify in milliseconds. If set to 0, there is no interval between job scheduling sessions.
The smaller the value of this parameter, the quicker jobs are scheduled. However, when the master batch daemon spends more time doing job scheduling, it has less time to respond to user commands. To balance the speed of job scheduling against responsiveness to LSF commands, start with a setting of 0 or 1, and increase it if users see the message “Batch system not responding...”.

Valid values
Number of seconds or milliseconds greater than or equal to zero (0).

Default
Set at time of installation to 1 second for the DEFAULT and PARALLEL configuration templates, and to 50 ms for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then set to 5 seconds.

JOB_SPOOL_DIR

Syntax
JOB_SPOOL_DIR=dir

Description
Specifies the directory for buffering batch standard output and standard error for a job.

When JOB_SPOOL_DIR is defined, the standard output and standard error for the job is buffered in the specified directory. Files are copied from the submission host to a temporary file in the directory specified by JOB_SPOOL_DIR on the execution host. LSF removes these files when the job completes. If JOB_SPOOL_DIR is not accessible or does not exist, files are spooled to the default directory $HOME/.lsbatch.

For bsub -is and bsub -Zs, JOB_SPOOL_DIR must be readable and writable by the job submission user, and it must be shared by the master host and the submission host. If JOB_SPOOL_DIR is specified but the directory is not accessible or does not exist, bsub -is cannot write to the default directory LSB_SHAREDIR/cluster_name/lsf_indir, bsub -Zs cannot write to the default directory LSB_SHAREDIR/cluster_name/lsf_cmddir, and the job will fail.

As LSF runs jobs, it creates temporary directories and files under JOB_SPOOL_DIR. By default, LSF removes these directories and files after the job is finished. See bsub for information about job submission options that specify the disposition of these files.
On UNIX, specify an absolute path. For example:
JOB_SPOOL_DIR=/home/share/lsf_spool
On Windows, specify a UNC path or a path with a drive letter. For example:
JOB_SPOOL_DIR=\\HostA\share\spooldir
or
JOB_SPOOL_DIR=D:\share\spooldir
In a mixed UNIX/Windows cluster, specify one path for the UNIX platform and one for the Windows platform. Separate the two paths with a pipe character (|):
JOB_SPOOL_DIR=/usr/share/lsf_spool | \\HostA\share\spooldir
Note: The first path must be UNIX and the second path must be Windows.
Valid value
JOB_SPOOL_DIR can be any valid path. The entire path, including JOB_SPOOL_DIR, can be up to 4094 characters on UNIX and Linux, or up to 255 characters on Windows. This maximum path length includes:
v All directory and file paths attached to the JOB_SPOOL_DIR path
v Temporary directories and files that the LSF system creates as jobs run
The path you specify for JOB_SPOOL_DIR should be as short as possible to avoid exceeding this limit.
Default
Not defined. Batch job output (standard output and standard error) is sent to the .lsbatch directory on the execution host:
v On UNIX: $HOME/.lsbatch
v On Windows: %windir%\lsbtmpuser_id\.lsbatch
If %HOME% is specified in the user environment, LSF uses that directory instead of %windir% for spooled output.
JOB_SWITCH2_EVENT
Syntax
JOB_SWITCH2_EVENT=Y|N
Description
Specify Y to allow mbatchd to generate the JOB_SWITCH2 event log when switching a job array to another queue. If this parameter is not enabled, mbatchd generates the old JOB_SWITCH event instead. The JOB_SWITCH event is generated for the switch of each array element. If the job array is very large, many JOB_SWITCH events are generated, causing mbatchd to use large amounts of memory to replay all the JOB_SWITCH events. This causes performance problems when mbatchd starts up. JOB_SWITCH2 logs the switching of the array to another queue as one event instead of logging each array element separately.
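For a cluster that routinely switches large job arrays between queues, a minimal sketch of enabling this parameter in the lsb.params Parameters section (the value shown is the documented setting; no other assumptions):

```
# lsb.params -- illustrative fragment
Begin Parameters
JOB_SWITCH2_EVENT = Y    # log one JOB_SWITCH2 event per array switch
End Parameters
```

With this set, switching a large array with bswitch writes a single event to lsb.events rather than one event per element, which shortens event replay at mbatchd startup.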
JOB_SWITCH2 has these advantages:
v Reduces memory usage of mbatchd when replaying bswitch destination_queue job_ID, where job_ID is the job ID of the job array on which to operate
v Reduces the time for reading records from lsb.events when mbatchd starts up
v Reduces the size of lsb.events
Default
N
JOB_TERMINATE_INTERVAL
Syntax
JOB_TERMINATE_INTERVAL=seconds
Description
UNIX only. Specifies the time interval in seconds between sending SIGINT, SIGTERM, and SIGKILL when terminating a job. When a job is terminated, the job is sent SIGINT, SIGTERM, and SIGKILL in sequence, with a sleep time of JOB_TERMINATE_INTERVAL between signals. This allows the job to clean up if necessary.
Default
10 (seconds)
LOCAL_MAX_PREEXEC_RETRY
Syntax
LOCAL_MAX_PREEXEC_RETRY=integer
Description
The maximum number of times to attempt the pre-execution command of a job on the local cluster. When this limit is reached, the default behavior of the job is defined by the LOCAL_MAX_PREEXEC_RETRY_ACTION parameter in lsb.params, lsb.queues, or lsb.applications.
| Valid values
| 0 < LOCAL_MAX_PREEXEC_RETRY < 2147483647
| Default
| 2147483647 (Unlimited number of pre-execution retry attempts.)
| See also
| LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.params, lsb.queues, and lsb.applications.
| LOCAL_MAX_PREEXEC_RETRY_ACTION
| Syntax
| LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT
| Description
| The default behavior of a job when it reaches the maximum number of attempts of its pre-execution command on the local cluster (LOCAL_MAX_PREEXEC_RETRY).
| v If set to SUSPEND, the job is suspended and its status is set to PSUSP.
| v If set to EXIT, the job exits and its status is set to EXIT. The job exits with the same exit code as the last failed pre-execution attempt.
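As a sketch, a cluster that wants failed pre-execution commands to give up quickly and exit the job rather than suspend it might combine the two parameters as follows (the retry count is illustrative, not a recommendation):

```
# lsb.params -- illustrative fragment
Begin Parameters
LOCAL_MAX_PREEXEC_RETRY = 3            # give up after 3 local pre-exec attempts
LOCAL_MAX_PREEXEC_RETRY_ACTION = EXIT  # job exits (status EXIT) instead of PSUSP
End Parameters
```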
| This parameter is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The action specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.
| Default
| SUSPEND
| See also
| LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, and lsb.applications.
EGROUP_UPDATE_INTERVAL
Syntax
EGROUP_UPDATE_INTERVAL=hours
Description
Specifies a time interval, in hours, at which dynamic user group information in lsb.users is updated automatically; there is no need to run badmin reconfig. An executable named egroup in LSF_SERVERDIR manages the user group members. When EGROUP_UPDATE_INTERVAL is set, the updated members are retrieved from egroup each time the interval elapses. If this parameter is not set, you must update the user groups manually by running badmin reconfig.
Default
Not defined.
LSB_SYNC_HOST_STAT_LIM
Syntax
LSB_SYNC_HOST_STAT_LIM=Y|y|N|n
Description
Improves the speed with which mbatchd obtains host status, and therefore the speed with which LSF reschedules rerunnable jobs: the sooner LSF knows that a host has become unavailable, the sooner LSF reschedules any rerunnable jobs executing on that host. Useful for a large cluster.
This parameter is enabled by default. It allows mbatchd to periodically obtain the host status from the master LIM, and to verify the status by polling each sbatchd at an interval. Disabling this parameter is not recommended because status updates may then take longer.
Default
Y
See also
MBD_SLEEP_TIME in lsb.params
LSB_MAX_PROBE_SBD in lsf.conf
MAX_ACCT_ARCHIVE_FILE
Syntax
MAX_ACCT_ARCHIVE_FILE=integer
Description
Enables automatic deletion of archived LSF accounting log files and specifies the archive limit.
Compatibility
ACCT_ARCHIVE_SIZE or ACCT_ARCHIVE_AGE should also be defined.
Example
MAX_ACCT_ARCHIVE_FILE=10
LSF maintains the current lsb.acct and up to 10 archives. Every time the old lsb.acct.9 becomes lsb.acct.10, the old lsb.acct.10 gets deleted.
See also
v ACCT_ARCHIVE_AGE also enables automatic archiving
v ACCT_ARCHIVE_SIZE also enables automatic archiving
v ACCT_ARCHIVE_TIME also enables automatic archiving
Default
Not defined. No deletion of lsb.acct.n files.
MAX_CONCURRENT_QUERY
Syntax
MAX_CONCURRENT_QUERY=integer
Description
This parameter applies to all query commands and defines the maximum number of concurrent batch queries (including job queries) that mbatchd can handle.
MAX_CONCURRENT_QUERY controls the maximum number of concurrent query commands under the following conditions:
v LSB_QUERY_PORT is not defined
v LSB_QUERY_PORT is defined and LSB_QUERY_ENH is Y
If the specified threshold is reached, the query commands retry.
If LSB_QUERY_PORT is defined and LSB_QUERY_ENH is N, MAX_CONCURRENT_QUERY controls two thresholds separately:
v Maximum number of concurrent job-related query commands
v Maximum number of concurrent other query commands
If either of the specified thresholds is reached, the query commands retry.
Valid values
1-100
Default
Set to 100 at time of installation. If otherwise undefined, then unlimited.
MAX_EVENT_STREAM_FILE_NUMBER
Syntax
MAX_EVENT_STREAM_FILE_NUMBER=integer
Description
Determines the maximum number of different lsb.stream.utc files that mbatchd uses. When MAX_EVENT_STREAM_FILE_NUMBER is reached, every time the size of the lsb.stream file reaches MAX_EVENT_STREAM_SIZE, the oldest lsb.stream file is overwritten.
Default
10
MAX_EVENT_STREAM_SIZE
Syntax
MAX_EVENT_STREAM_SIZE=integer
Description
Determines the maximum size, in MB, of the lsb.stream file used by system performance analysis tools.
When the MAX_EVENT_STREAM_SIZE size is reached, LSF logs a special event, EVENT_END_OF_STREAM, closes the stream, moves it to lsb.stream.0, and opens a new stream. Once the EVENT_END_OF_STREAM event is logged, all applications that read the file should close the file and reopen it.
Recommended value
2000 MB
Default
1024 MB
MAX_INFO_DIRS
Syntax
MAX_INFO_DIRS=num_subdirs
Description
The number of subdirectories under the LSB_SHAREDIR/cluster_name/logdir/info directory. When MAX_INFO_DIRS is enabled, mbatchd creates the specified number of subdirectories in the info directory. These subdirectories are given integers as their names, starting with 0 for the first subdirectory.
Important: If you are using local duplicate event logging, you must run badmin mbdrestart after changing MAX_INFO_DIRS for the changes to take effect.
Valid values
1-1024
Default
Set to 500 at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 0 (no subdirectories under the info directory; mbatchd writes all jobfiles to the info directory).
Example
MAX_INFO_DIRS=10
mbatchd creates ten subdirectories, from LSB_SHAREDIR/cluster_name/logdir/info/0 to LSB_SHAREDIR/cluster_name/logdir/info/9.
MAX_JOB_ARRAY_SIZE
Syntax
MAX_JOB_ARRAY_SIZE=integer
Description
Specifies the maximum number of jobs in a job array that can be created by a user for a single job submission. The maximum number of jobs in a job array cannot exceed this value. A large job array allows a user to submit a large number of jobs to the system with a single job submission.
Valid values
Specify a positive integer between 1 and 2147483646.
Default
Set to 10000 at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 1000.
MAX_JOB_ATTA_SIZE
Syntax
MAX_JOB_ATTA_SIZE=integer | 0
Specify any number less than 20000.
Description
Maximum attached data size, in KB, that can be transferred to a job; that is, the maximum size of data attached to a job with the bpost command. Useful if you use bpost and bread to transfer large data files between jobs and you want to limit the usage in the current working directory. 0 indicates that jobs cannot accept attached data files.
Default
2147483647 (Unlimited; LSF does not set a maximum size of job attachments.)
MAX_JOB_NUM
Syntax
MAX_JOB_NUM=integer
Description
The maximum number of finished jobs whose events are to be stored in the lsb.events log file. Once the limit is reached, mbatchd starts a new event log file. The old event log file is saved as lsb.events.n, with subsequent sequence number suffixes incremented by 1 each time a new log file is started. Event logging continues in the new lsb.events file.
Default
Set at time of installation to 10000 for the DEFAULT configuration template and 100000 for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 1000.
MAX_JOB_PREEMPT
Syntax
MAX_JOB_PREEMPT=integer
Description
The maximum number of times a job can be preempted. Applies to queue-based preemption only.
Valid values
0 < MAX_JOB_PREEMPT < 2147483647
Default
2147483647 (Unlimited number of preemptions.)
MAX_JOB_PREEMPT_RESET
Syntax
MAX_JOB_PREEMPT_RESET=Y|N
Description
If MAX_JOB_PREEMPT_RESET=N, the job preemption count for MAX_JOB_PREEMPT is not reset when a started job is requeued, migrated, or rerun.
Default
Y. The job preemption counter resets to 0 once a started job is requeued, migrated, or rerun.
MAX_JOB_REQUEUE
Syntax
MAX_JOB_REQUEUE=integer
Description
The maximum number of times to requeue a job automatically.
Valid values
0 < MAX_JOB_REQUEUE < 2147483647
Default
2147483647 (Unlimited number of requeues.)
MAX_JOBID
Syntax
MAX_JOBID=integer
Description
The job ID limit.
The job ID limit is the highest job ID that LSF will ever assign, and also the maximum number of jobs in the system.
By default, LSF assigns job IDs up to 6 digits, which means that no more than 999999 jobs can be in the system at once. Specify any integer from 999999 to 2147483646 (for practical purposes, you can use any 10-digit integer less than this value). You cannot lower the job ID limit, but you can raise it to 10 digits. This allows longer-term job accounting and analysis, means you can have more jobs in the system, and makes job ID numbers roll over less often.
LSF assigns job IDs in sequence. When the job ID limit is reached, the count rolls over, so the next job submitted gets job ID "1". If the original job 1 remains in the system, LSF skips that number and assigns job ID "2", or the next available job ID. If you have so many jobs in the system that the low job IDs are still in use when the maximum job ID is assigned, jobs with sequential numbers could have totally different submission times.
Example
MAX_JOBID=125000000
Default
999999
MAX_JOBINFO_QUERY_PERIOD
Syntax
MAX_JOBINFO_QUERY_PERIOD=integer
Description
Maximum time for job information query commands (for example, bjobs) to wait. When this time is reached, the query command processes exit and all associated threads are terminated. If the parameter is not defined, query command processes wait for all threads to finish. Specify a multiple of MBD_REFRESH_TIME.
Valid values
Any positive integer greater than or equal to one (1)
Default
2147483647 (Unlimited wait time.)
See also
LSB_BLOCK_JOBINFO_TIMEOUT in lsf.conf
MAX_PEND_JOBS
Syntax
MAX_PEND_JOBS=integer
Description
The maximum number of pending jobs in the system. This is the hard system-wide pending job threshold. No user or user group can exceed this limit unless the job is forwarded from a remote cluster.
If the user or user group submitting the job has reached the pending job threshold specified by MAX_PEND_JOBS, LSF rejects any further job submission requests sent by that user or user group. The system continues to send the job submission requests at the interval specified by SUB_TRY_INTERVAL in lsb.params until it has made a number of attempts equal to the LSB_NTRIES environment variable. If LSB_NTRIES is not defined and LSF rejects the job submission request, the system continues to send the job submission requests indefinitely (the default behavior).
Default
2147483647 (Unlimited number of pending jobs.)
See also
SUB_TRY_INTERVAL
MAX_PREEXEC_RETRY
Syntax
MAX_PREEXEC_RETRY=integer
Description
MultiCluster job forwarding model only. The maximum number of times to attempt the pre-execution command of a job from a remote cluster. If the job's pre-execution command fails all attempts, the job is returned to the submission cluster.
Valid values
0 < MAX_PREEXEC_RETRY < 2147483647
Default
5
MAX_PROTOCOL_INSTANCES
Syntax
MAX_PROTOCOL_INSTANCES=integer
Description
For LSF IBM Parallel Environment (PE) integration. Specifies the number of parallel communication paths (windows) available to the protocol on each network. If the number of windows specified for the job (with the instances option of bsub -network, or the NETWORK_REQ parameter in lsb.queues or lsb.applications) is greater than the specified maximum value, LSF rejects the job.
Specify MAX_PROTOCOL_INSTANCES in a queue (lsb.queues) or cluster-wide in lsb.params. The value specified in a queue overrides the value specified in lsb.params.
LSF_PE_NETWORK_NUM must be defined to a non-zero value in lsf.conf for MAX_PROTOCOL_INSTANCES to take effect and for LSF to run PE jobs. If LSF_PE_NETWORK_NUM is not defined or is set to 0, the value of MAX_PROTOCOL_INSTANCES is ignored with a warning message.
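As an illustrative sketch of the two settings working together for the PE integration (the values are examples, not recommendations):

```
# lsf.conf
LSF_PE_NETWORK_NUM=2          # must be non-zero for LSF to run PE jobs

# lsb.params -- illustrative fragment
Begin Parameters
MAX_PROTOCOL_INSTANCES = 4    # up to 4 windows per network, cluster-wide
End Parameters
```

A queue-level MAX_PROTOCOL_INSTANCES in lsb.queues would override the cluster-wide value shown here.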
Default
2
MAX_SBD_CONNS
Sets the maximum number of open file connections between mbatchd and sbatchd. The system sets MAX_SBD_CONNS automatically during mbatchd startup.
Syntax
MAX_SBD_CONNS=integer
Description
MAX_SBD_CONNS and LSB_MAX_JOB_DISPATCH_PER_SESSION affect the number of file descriptors. To decrease the load on the master LIM, you should not configure the master host as the first host for the LSF_SERVER_HOSTS parameter.
The default values for MAX_SBD_CONNS and LSB_MAX_JOB_DISPATCH_PER_SESSION are set during mbatchd startup. They are not changed dynamically. If hosts are added dynamically, mbatchd does not increase their values; once all the hosts have been added, you must run badmin mbdrestart to set the correct values. If you know in advance that your cluster will dynamically grow or shrink, configure these parameters beforehand.
Default
MAX_SBD_CONNS = numOfHosts + (2 * LSB_MAX_JOB_DISPATCH_PER_SESSION) + 200. This formula does not provide the exact number of sbatchd connections because it also counts the lost and found hosts. Therefore, the calculated number of connections might be a few more than this theoretical number.
MAX_SBD_FAIL
Syntax
MAX_SBD_FAIL=integer
Description
The maximum number of retries for reaching a non-responding slave batch daemon, sbatchd. The minimum interval between retries is defined by MBD_SLEEP_TIME/10. If mbatchd fails to reach a host and has retried MAX_SBD_FAIL times, the host is considered unavailable or unreachable, and mbatchd reports the host status accordingly. When a host becomes unavailable, mbatchd assumes that all jobs running on that host have exited and that all rerunnable jobs (jobs submitted with the bsub -r option) are scheduled to be rerun on another host.
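As a worked example of the retry timing (both values are hypothetical):

```
# Hypothetical settings: MBD_SLEEP_TIME=60, MAX_SBD_FAIL=3
# minimum retry interval = MBD_SLEEP_TIME / 10 = 6 seconds
# host marked unavailable after 3 failed retries, i.e. roughly
# 18 seconds of consecutive non-response after the initial failed contact
```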
Default
3
MAX_TOTAL_TIME_PREEMPT
Syntax
MAX_TOTAL_TIME_PREEMPT=integer
Description
The accumulated preemption time in minutes after which a job cannot be preempted again, where minutes is wall-clock time, not normalized time.
The parameter of the same name in lsb.queues overrides this parameter. The parameter of the same name in lsb.applications overrides both this parameter and the parameter of the same name in lsb.queues.
Valid values
Any positive integer greater than or equal to one (1)
Default
Unlimited
MAX_USER_PRIORITY
Syntax
MAX_USER_PRIORITY=integer
Description
Enables user-assigned job priority and specifies the maximum job priority a user can assign to a job. LSF and queue administrators can assign a job priority higher than the specified value for jobs they own.
Compatibility
User-assigned job priority changes the behavior of btop and bbot.
Example
MAX_USER_PRIORITY=100
Specifies that 100 is the maximum job priority that can be specified by a user.
Default
Not defined.
See also
v bsub, bmod, btop, bbot
v JOB_PRIORITY_OVER_TIME
MBD_EGO_CONNECT_TIMEOUT
Syntax
MBD_EGO_CONNECT_TIMEOUT=seconds
Description
For EGO-enabled SLA scheduling, the timeout parameter for the network I/O connection with the EGO vemkd.
Default
0 seconds
MBD_EGO_READ_TIMEOUT
Syntax
MBD_EGO_READ_TIMEOUT=seconds
Description
For EGO-enabled SLA scheduling, the timeout parameter for network I/O reads from the EGO vemkd after connection with EGO.
Default
0 seconds
MBD_EGO_TIME2LIVE
Syntax
MBD_EGO_TIME2LIVE=minutes
Description
For EGO-enabled SLA scheduling, specifies how long EGO should keep information about host allocations in case mbatchd restarts.
Default
0 minutes
MBD_QUERY_CPUS
Syntax
MBD_QUERY_CPUS=cpu_list
cpu_list defines the list of master host CPUs on which the mbatchd child query processes can run. Format the list as a white-space delimited list of CPU numbers.
For example, if you specify MBD_QUERY_CPUS=1 2 3, the mbatchd child query processes run only on CPU numbers 1, 2, and 3 on the master host.
Description
This parameter allows you to specify the master host CPUs on which mbatchd child query processes can run (hard CPU affinity). This improves mbatchd scheduling and dispatch performance by binding query processes to specific CPUs so that higher priority mbatchd processes can run more efficiently. When you define this parameter, LSF runs mbatchd child query processes only on the specified CPUs; the operating system can still assign other processes to run on the same CPUs, but the child query processes never run on unlisted CPUs.
Important:
1. You can specify CPU affinity only for master hosts that use one of the following operating systems:
v Linux 2.6 or higher
v Solaris 8 or higher
2. If failover to a master host candidate occurs, LSF maintains the hard CPU affinity, provided that the master host candidate has the same CPU configuration as the original master host. If the configuration differs, LSF ignores the CPU list and reverts to default behavior.
Related parameters
To improve scheduling and dispatch performance of all LSF daemons, use MBD_QUERY_CPUS together with EGO_DAEMONS_CPUS (in ego.conf), which controls LIM CPU allocation, and LSF_DAEMONS_CPUS, which binds mbatchd and mbschd daemon processes to specific CPUs so that higher priority daemon processes can run more efficiently. For best performance, each of the four daemons should be assigned its own CPUs.
For example, on a 4-CPU SMP host, the following configuration gives the best performance:
EGO_DAEMONS_CPUS=0
LSF_DAEMONS_CPUS=1:2
MBD_QUERY_CPUS=3
Default
Not defined
See also
LSF_DAEMONS_CPUS in lsf.conf
MBD_REFRESH_TIME
Syntax
MBD_REFRESH_TIME=seconds [min_refresh_time]
where min_refresh_time defines the minimum time (in seconds) that the child mbatchd will stay to handle queries.
Description
The time interval, in seconds, at which mbatchd forks a new child mbatchd to service query requests, to keep the information sent back to clients updated. A child mbatchd processes query requests by creating threads.
MBD_REFRESH_TIME applies only to UNIX platforms that support thread programming. To enable MBD_REFRESH_TIME, you must specify LSB_QUERY_PORT in lsf.conf. The child mbatchd listens to the port number specified by LSB_QUERY_PORT and creates threads to service requests until the job changes status, a new job is submitted, or MBD_REFRESH_TIME has expired.
v If MBD_REFRESH_TIME < min_refresh_time, the child mbatchd exits at MBD_REFRESH_TIME even if the job changes status or a new job is submitted before MBD_REFRESH_TIME expires.
v If MBD_REFRESH_TIME > min_refresh_time:
– the child mbatchd exits at min_refresh_time if a job changes status or a new job is submitted before the min_refresh_time
– the child mbatchd exits after the min_refresh_time when a job changes status or a new job is submitted
v If MBD_REFRESH_TIME > min_refresh_time and no job changes status or a new job is submitted, the child mbatchd exits at MBD_REFRESH_TIME
The value of this parameter must be between 0 and 300. Any values specified out of this range are ignored, and the system default value is applied.
The bjobs command may not display up-to-date information if two consecutive query commands are issued before a child mbatchd expires, because child mbatchd job information is not updated.
If you use the bjobs command and do not get up-to-date information, you may need to decrease the value of this parameter. Note, however, that the lower the value of this parameter, the more you negatively affect performance.
The number of concurrent requests is limited by the number of concurrent threads that a process can have. This number varies by platform:
v Sun Solaris, 2500 threads per process
v AIX, 512 threads per process
v Digital, 256 threads per process
v HP-UX, 64 threads per process
Valid values
5-300 seconds
Default
The default value for the minimum refresh time is adjusted automatically based on the number of jobs in the system:
v If there are fewer than 500,000 jobs in the system, the default value is 10 seconds.
v If there are more than 500,000 jobs in the system, the default value is 10 seconds + (#jobs - 500,000)/100,000.
See also
LSB_QUERY_PORT in lsf.conf
MBD_SLEEP_TIME
Syntax
MBD_SLEEP_TIME=seconds
Description
Used in conjunction with the parameters SLOT_RESERVE, MAX_SBD_FAIL, and JOB_ACCEPT_INTERVAL. The amount of time, in seconds, used for calculating parameter values.
Default
Set at installation to 10 seconds. If not defined, 60 seconds.
MBD_USE_EGO_MXJ
Syntax
MBD_USE_EGO_MXJ=Y | N
Description
By default, when EGO-enabled SLA scheduling is configured, EGO allocates an entire host to LSF, which uses its own MXJ definition to determine how many slots are available on the host. LSF gets its host allocation from EGO, and runs as many jobs as the LSF-configured MXJ for that host dictates.
MBD_USE_EGO_MXJ forces LSF to use the job slot maximum configured in the EGO consumer. This allows partial sharing of hosts (for example, a large SMP computer) among different consumers or workload managers. When MBD_USE_EGO_MXJ is set, LSF schedules jobs based on the number of slots allocated from EGO. For example, if hostA has 4 processors but EGO allocates only 2 slots to an EGO-enabled SLA consumer,
LSF can schedule a maximum of 2 jobs from that SLA on hostA.
Default
N (mbatchd uses the LSF MXJ)
MC_PENDING_REASON_PKG_SIZE
Syntax
MC_PENDING_REASON_PKG_SIZE=kilobytes | 0
Description
MultiCluster job forwarding model only. Pending reason update package size, in KB. Defines the maximum amount of pending reason data this cluster will send to submission clusters in one cycle. Specify the keyword 0 (zero) to disable the limit and allow any amount of data in one package.
Default
512
MC_PENDING_REASON_UPDATE_INTERVAL
Syntax
MC_PENDING_REASON_UPDATE_INTERVAL=seconds | 0
Description
MultiCluster job forwarding model only. Pending reason update interval, in seconds. Defines how often this cluster will update submission clusters about the status of pending MultiCluster jobs. Specify the keyword 0 (zero) to disable pending reason updating between clusters.
Default
300
MC_PLUGIN_SCHEDULE_ENHANCE
Syntax
MC_PLUGIN_SCHEDULE_ENHANCE=RESOURCE_ONLY
MC_PLUGIN_SCHEDULE_ENHANCE=COUNT_PREEMPTABLE [HIGH_QUEUE_PRIORITY] [PREEMPTABLE_QUEUE_PRIORITY] [PENDING_WHEN_NOSLOTS]
MC_PLUGIN_SCHEDULE_ENHANCE=DYN_CLUSTER_WEIGHTING
Note: When any one of HIGH_QUEUE_PRIORITY, PREEMPTABLE_QUEUE_PRIORITY, or PENDING_WHEN_NOSLOTS is defined, COUNT_PREEMPTABLE is enabled automatically.
Description
MultiCluster job forwarding model only. The MC_PLUGIN_SCHEDULE_ENHANCE parameter enhances the scheduler for the MultiCluster job forwarding model based on the settings selected. Use it in conjunction with MC_PLUGIN_UPDATE_INTERVAL to set the data update interval between remote clusters. MC_PLUGIN_UPDATE_INTERVAL must be a non-zero value to enable the MultiCluster enhanced scheduler.
With the MC_PLUGIN_SCHEDULE_ENHANCE parameter set to a valid value, remote resources are considered as if MC_PLUGIN_REMOTE_RESOURCE=Y, regardless of the actual setting.
In addition, the submission cluster scheduler considers specific execution queue resources when scheduling jobs. See Using IBM Platform MultiCluster for details about the specific values for this parameter.
Note: The MC_PLUGIN_SCHEDULE_ENHANCE parameter was introduced in LSF Version 7 Update 6. All clusters within a MultiCluster configuration must be running a version of LSF containing this parameter to enable the enhanced scheduler.
After a MultiCluster connection is established, counters take the time set in MC_PLUGIN_UPDATE_INTERVAL to update. Scheduling decisions made before this first interval has passed do not accurately account for remote queue workload.
Default
Not defined. The enhanced MultiCluster scheduler is not used. If MC_PLUGIN_REMOTE_RESOURCE=Y in lsf.conf, remote resource availability is considered before jobs are forwarded to the queue with the most available slots.
See also
MC_PLUGIN_UPDATE_INTERVAL in lsb.params.
MC_PLUGIN_REMOTE_RESOURCE in lsf.conf.
MC_PLUGIN_UPDATE_INTERVAL
Syntax
MC_PLUGIN_UPDATE_INTERVAL=seconds | 0
Description
MultiCluster job forwarding model only; set for the execution cluster. The number of seconds between data updates between clusters.
A non-zero value enables collection of remote cluster queue data for use by the submission cluster enhanced scheduler. The suggested value when enabled is MBD_SLEEP_TIME (default is 20 seconds). A value of 0 disables collection of remote cluster queue data.
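A minimal sketch of enabling remote queue data collection on the execution cluster (the interval shown follows the suggested value above; the fragment is illustrative):

```
# Execution cluster, lsb.params -- illustrative fragment
Begin Parameters
MC_PLUGIN_UPDATE_INTERVAL = 20    # non-zero: send queue data to submission clusters every 20 seconds
End Parameters
```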
Description This parameter is configured on the MultiCluster execution side and defines numeric and string resources for the execution cluster to pass back to the submission cluster. The execution cluster makes the submission cluster aware of what resources and their values are listed so that the submission cluster can make better forwarding decisions. You can define resources that meet the following criteria: v User defined numeric and string resources v Host based resources, for example: – Resources defined in the RESOURCE column in the Host section of the lsf.cluster file. – Resource location as [default] or value@[default] in the Resource Map section of the lsf.cluster file. v Non-consumable resources. Although you can configure dynamic resources as criterion, they should be as close to static as possible to make forwarding decisions accurately. The number of criterion and values for each resource should be limited within a reasonable range to prevent deterioration of forward scheduling performance. The behavior for MC_PLUGIN_REMOTE_RESOURCE is the default behavior and is kept for compatibility. Chapter 1. Configuration Files 201 lsb.params Default None MC_RUSAGE_UPDATE_INTERVAL Syntax MC_RUSAGE_UPDATE_INTERVAL=seconds Description MultiCluster only. Enables resource use updating for MultiCluster jobs running on hosts in the cluster and specifies how often to send updated information to the submission or consumer cluster. Default 300 MIN_SWITCH_PERIOD Syntax MIN_SWITCH_PERIOD=seconds Description The minimum period in seconds between event log switches. Works together with MAX_JOB_NUM to control how frequently mbatchd switches the file. mbatchd checks if MAX_JOB_NUM has been reached every MIN_SWITCH_PERIOD seconds. If mbatchd finds that MAX_JOB_NUM has been reached, it switches the events file. To significantly improve the performance of mbatchd for large clusters, set this parameter to a value equal to or greater than 600. 
This causes mbatchd to fork a child process that handles event switching, thereby reducing the load on mbatchd. mbatchd terminates the child process and appends delta events to the new events file after the MIN_SWITCH_PERIOD has elapsed.
Default
Set to 1800 at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 0 (no minimum period; log switch frequency is not restricted).
See also
MAX_JOB_NUM
NEWJOB_REFRESH
Syntax
NEWJOB_REFRESH=Y | N
Description
Enables a child mbatchd to get up-to-date information about new jobs from the parent mbatchd. When set to Y, job queries with bjobs display new jobs submitted after the child mbatchd was created.
If you have enabled multithreaded mbatchd support, the bjobs command may not display up-to-date information if two consecutive query commands are issued before a child mbatchd expires, because child mbatchd job information is not updated. Use NEWJOB_REFRESH=Y to enable the parent mbatchd to push new job information to a child mbatchd.
When NEWJOB_REFRESH=Y, as users submit new jobs, the parent mbatchd pushes the new job event to the child mbatchd. The parent mbatchd transfers the following kinds of new jobs to the child mbatchd:
v Newly submitted jobs
v Restarted jobs
v Remote lease model jobs from the submission cluster
v Remote forwarded jobs from the submission cluster
When NEWJOB_REFRESH=Y, you should set MBD_REFRESH_TIME to a value greater than 10 seconds.
Required parameters
LSB_QUERY_PORT must be enabled in lsf.conf.
Restrictions
The parent mbatchd only pushes the new job event to a child mbatchd. The child mbatchd is not aware of status changes of existing jobs, and will not reflect the results of job control commands (bmod, bmig, bswitch, btop, bbot, brequeue, bstop, bresume, and so on) invoked after the child mbatchd is created.
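A sketch of the settings involved in enabling this behavior (the port number and refresh time are illustrative assumptions):

```
# lsf.conf
LSB_QUERY_PORT=6891          # required: dedicated query port for the child mbatchd

# lsb.params -- illustrative fragment
Begin Parameters
MBD_REFRESH_TIME = 30        # keep above 10 seconds when NEWJOB_REFRESH=Y
NEWJOB_REFRESH = Y           # parent mbatchd pushes new job events to the child
End Parameters
```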
Default

Set to Y at time of installation for the DEFAULT and PARALLEL configuration templates. If otherwise undefined, then N (new jobs are not pushed to the child mbatchd).

See also

MBD_REFRESH_TIME

NO_PREEMPT_FINISH_TIME

Syntax

NO_PREEMPT_FINISH_TIME=minutes | percentage

Description

Prevents preemption of jobs that will finish within the specified number of minutes or the specified percentage of the estimated run time or run limit.

Specifies that jobs due to finish within the specified number of minutes or percentage of job duration should not be preempted, where minutes is wall-clock time, not normalized time. The percentage must be greater than 0% and less than 100% (between 1% and 99%). For example, if the job run limit is 60 minutes and NO_PREEMPT_FINISH_TIME=10%, the job cannot be preempted after it has been running for 54 minutes or longer.

If you specify a percentage for NO_PREEMPT_FINISH_TIME, a run time (bsub -We, or RUNTIME in lsb.applications) or a run limit (bsub -W, RUNLIMIT in lsb.queues, or RUNLIMIT in lsb.applications) must be specified for the job.

Default

Not defined.

NO_PREEMPT_INTERVAL

Syntax

NO_PREEMPT_INTERVAL=minutes

Description

Prevents preemption of jobs for the specified number of minutes of uninterrupted run time, where minutes is wall-clock time, not normalized time. NO_PREEMPT_INTERVAL=0 allows immediate preemption of jobs as soon as they start or resume running.

The parameter of the same name in lsb.queues overrides this parameter. The parameter of the same name in lsb.applications overrides both this parameter and the parameter of the same name in lsb.queues.

Default

0

NO_PREEMPT_RUN_TIME

Syntax

NO_PREEMPT_RUN_TIME=minutes | percentage

Description

Prevents preemption of jobs that have been running for the specified number of minutes or the specified percentage of the estimated run time or run limit.
Specifies that jobs that have been running for the specified number of minutes or longer should not be preempted, where minutes is wall-clock time, not normalized time. The percentage must be greater than 0% and less than 100% (between 1% and 99%). For example, if the job run limit is 60 minutes and NO_PREEMPT_RUN_TIME=50%, the job cannot be preempted after it has been running for 30 minutes or longer.

If you specify a percentage for NO_PREEMPT_RUN_TIME, a run time (bsub -We, or RUNTIME in lsb.applications) or a run limit (bsub -W, RUNLIMIT in lsb.queues, or RUNLIMIT in lsb.applications) must be specified for the job.

Default

Not defined.

MAX_JOB_MSG_NUM

Syntax

MAX_JOB_MSG_NUM=integer | 0

Description

Maximum number of message slots for each job. Maximum number of messages that can be posted to a job with the bpost command. 0 indicates that jobs cannot accept external messages.

Default

128

ORPHAN_JOB_TERM_GRACE_PERIOD

Syntax

ORPHAN_JOB_TERM_GRACE_PERIOD=seconds

Description

If defined, enables automatic orphan job termination at the cluster level, which applies to all dependent jobs; otherwise it is disabled. This parameter also defines a cluster-wide termination grace period that tells LSF how long to wait before killing orphan jobs. Once configured, automatic orphan job termination applies to all dependent jobs in the cluster.
v ORPHAN_JOB_TERM_GRACE_PERIOD = 0: Automatic orphan job termination is enabled in the cluster, but no termination grace period is defined. A dependent job can be terminated as soon as it is found to be an orphan.
v ORPHAN_JOB_TERM_GRACE_PERIOD > 0: Automatic orphan job termination is enabled and the termination grace period is set to the specified number of seconds. This is the minimum time LSF will wait before terminating an orphan job.
In a multi-level job dependency tree, the grace period is not repeated at each level, and all direct and indirect orphans of the parent job can be terminated by LSF automatically after the grace period has expired.

The valid range of values is any integer greater than or equal to 0 and less than 2147483647.

Default

Not defined. Automatic orphan termination is disabled.

PARALLEL_SCHED_BY_SLOT

Syntax

PARALLEL_SCHED_BY_SLOT=y | Y

Description

If defined, LSF schedules jobs based on the number of slots assigned to the hosts instead of the number of CPUs. For example, if MXJ is set to "-", then LSF considers the default value for the number of CPUs for that host. These slots can be defined by host in lsb.hosts or by slot limit in lsb.resources.

All slot-related messages still show the word "processors", but actually refer to "slots" instead. Similarly, all scheduling activities also use slots instead of processors.

Default

Set to Y at time of installation. If otherwise undefined, then N (disabled).

See also

v JL/U and MXJ in lsb.hosts
v SLOTS and SLOTS_PER_PROCESSOR in lsb.resources

PEND_REASON_MAX_JOBS

Syntax

PEND_REASON_MAX_JOBS=integer

Description

Number of jobs for each user per queue for which pending reasons are calculated by the scheduling daemon mbschd. Pending reasons are calculated at a time period set by PEND_REASON_UPDATE_INTERVAL.

Default

20

PEND_REASON_UPDATE_INTERVAL

Syntax

PEND_REASON_UPDATE_INTERVAL=seconds

Description

Time interval that defines how often pending reasons are calculated by the scheduling daemon mbschd.

Default

Set to 60 seconds at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 30 seconds.

PERFORMANCE_THRESHOLD_FILE

Syntax

PERFORMANCE_THRESHOLD_FILE=full_file_path

Description

Specifies the location of the performance threshold file for the cluster.
This file contains the cluster-level threshold values for the minimize energy and minimize time policies, used with the energy aware scheduling automatic select CPU frequency feature.

Default

$LSF_ENVDIR/lsbatch/cluster_name/configdir/lsb.threshold

PG_SUSP_IT

Syntax

PG_SUSP_IT=seconds

Description

The time interval that a host should be interactively idle (it > 0) before jobs suspended because of a threshold on the pg load index can be resumed. This parameter is used to prevent a batch job from being suspended and resumed too often, because the job raises the paging rate while running and lowers it while suspended. If you are not concerned with the interference with interactive jobs caused by paging, the value of this parameter may be set to 0.

Default

180 seconds

PMR_UPDATE_SUMMARY_INTERVAL

Syntax

PMR_UPDATE_SUMMARY_INTERVAL=seconds

Description

Platform MapReduce Accelerator only. Specifies the interval after which LSF uses bpost to update the MapReduce job summary. Used by the pmr command to work with MapReduce jobs. If set to 0, LSF does not update the MapReduce job summary.

Valid values

0 - 2147483646

Default

60 seconds.

POWER_ON_WAIT

Syntax

POWER_ON_WAIT=time_seconds

Description

Configures a wait time (in seconds) after a host is resumed and enters ok status, before dispatching a job. This allows other services on the host to restart and enter a ready state. The default value is 0 and is applied globally.

Default

0

POWER_RESET_CMD

Syntax

POWER_RESET_CMD=command

Description

Defines the reset operation script that is called when handling a power reset request.

To allow the command to parse all its arguments as a host list, LSF uses the command in the format:

command host [host ...]

To show each host with its execution result (success (0) or fail (1)), the return value of the command follows the format:

host 0
host 1
...

Default

Not defined.
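The POWER_RESET_CMD, POWER_RESUME_CMD, and POWER_SUSPEND_CMD scripts all follow the same contract: LSF passes a host list as arguments and reads back one "host status" line per host. The following sketch illustrates that contract; the power_cycle function is a hypothetical stub, and a real script would invoke your site's power management tool (for example, ipmitool) there.

```shell
#!/bin/sh
# Sketch of a POWER_RESET_CMD-style script. LSF invokes it as:
#   command host1 host2 ...
# and expects one "host status" line per host (0 = success, 1 = failure).

power_cycle() {
    # Hypothetical stub: pretend the reset always succeeds.
    # Replace with a real power action, e.g. a BMC call via ipmitool.
    true
}

report_results() {
    for host in "$@"; do
        if power_cycle "$host"; then
            echo "$host 0"
        else
            echo "$host 1"
        fi
    done
}

report_results "$@"
```

Invoked as `script hostA hostB`, this prints one result line per host in the format shown above.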
POWER_RESUME_CMD

Syntax

POWER_RESUME_CMD=command

Description

Defines the resume operation script that is called when handling a resume request. It is the opposite operation to POWER_SUSPEND_CMD.

To allow the command to parse all its arguments as a host list, LSF uses the command in the format:

command host [host ...]

To show each host with its execution result (success (0) or fail (1)), the return value of the command follows the format:

host 0
host 1
...

Default

Not defined.

POWER_STATUS_LOG_MAX

Syntax

POWER_STATUS_LOG_MAX=number

Description

Configures a trigger value for event switching. The default value is 10000. This value takes effect only if PowerPolicy (in lsb.resources) is enabled. If the number of finished jobs has not reached MAX_JOB_NUM, the event switch can also be triggered by POWER_STATUS_LOG_MAX, which works with MIN_SWITCH_PERIOD.

POWER_STATUS_LOG_MAX is not available with LSF Express edition.

Default

10000

POWER_SUSPEND_CMD

Syntax

POWER_SUSPEND_CMD=command

Description

Defines the suspend operation script that is called when handling a suspend request.

To allow the command to parse all its arguments as a host list, LSF uses the command in the format:

command host [host ...]

To show each host with its execution result (success (0) or fail (1)), the return value of the command follows the format:

host 0
host 1
...

Default

Not defined.

POWER_SUSPEND_TIMEOUT

Syntax

POWER_SUSPEND_TIMEOUT=integer

Description

Defines the timeout value (in seconds) for power suspend, resume, and reset actions. When a power operation is not successful (for example, sbatchd does not reconnect when resuming a host) within the specified number of seconds, the action is considered failed.
Default

600

PREEMPT_DELAY

Syntax

PREEMPT_DELAY=seconds

Description

Preemptive jobs wait the specified number of seconds from their submission time before preempting any lower priority preemptable jobs. During the grace period, preemption is not triggered, but the job can be scheduled and dispatched by other scheduling policies.

This feature provides flexibility to tune the system to reduce the number of preemptions, which helps improve performance and job throughput. When the low priority jobs are short, if high priority jobs can wait a while for the low priority jobs to finish, preemption can be avoided and cluster performance is improved. If the job is still pending after the grace period has expired, preemption is triggered.

The waiting time applies to preemptive jobs in the pending state only; it does not affect preemptive jobs that are suspended. The time is counted from the submission time of the job. The submission time is the time mbatchd accepts a job, which includes newly submitted jobs, restarted jobs (by brestart), and jobs forwarded from a remote cluster.

While the preemptive job is waiting, the pending reason is:

The preemptive job is allowing a grace period before preemption.

If you use an older version of bjobs, the pending reason is:

Unknown pending reason code <6701>;

The parameter is defined in lsb.params, lsb.queues (overrides lsb.params), and lsb.applications (overrides both lsb.params and lsb.queues). Run badmin reconfig to make your changes take effect.

Default

Not defined (if the parameter is not defined anywhere, preemption is immediate).
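As a sketch, a cluster that wants high-priority jobs to wait briefly before preempting could set (the 60-second value is illustrative, not a recommendation):

```
# lsb.params: let short low-priority jobs finish before preemption kicks in
Begin Parameters
PREEMPT_DELAY = 60    # seconds, counted from the preemptive job's submission time
End Parameters
```

Run badmin reconfig to apply the change; as noted above, a PREEMPT_DELAY set in lsb.queues or lsb.applications would override this cluster-wide value.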
PREEMPT_FOR

Syntax

PREEMPT_FOR=[GROUP_JLP] [GROUP_MAX] [HOST_JLU] [LEAST_RUN_TIME] [MINI_JOB] [USER_JLP] [OPTIMAL_MINI_JOB]

Description

If preemptive scheduling is enabled, this parameter is used to disregard suspended jobs when determining if a job slot limit is exceeded, to preempt jobs with the shortest running time, and to optimize preemption of parallel jobs. If preemptive scheduling is enabled, more lower-priority parallel jobs may be preempted than necessary to start a high-priority parallel job.

Both running and suspended jobs are counted when calculating the number of job slots in use, except for the following limits:
v The total job slot limit for hosts, specified at the host level
v The total job slot limit for individual users, specified at the user level (by default, suspended jobs still count against the limit for user groups)

Specify one or more of the following keywords. Use spaces to separate multiple keywords.

GROUP_JLP

Counts only running jobs when evaluating if a user group is approaching its per-processor job slot limit (SLOTS_PER_PROCESSOR, USERS, and PER_HOST=all in the lsb.resources file). Suspended jobs are ignored when this keyword is used.

GROUP_MAX

Counts only running jobs when evaluating if a user group is approaching its total job slot limit (SLOTS, PER_USER=all, and HOSTS in the lsb.resources file). Suspended jobs are ignored when this keyword is used. When preemptive scheduling is enabled, suspended jobs never count against the total job slot limit for individual users.

HOST_JLU

Counts only running jobs when evaluating if a user or user group is approaching its per-host job slot limit (SLOTS and USERS in the lsb.resources file). Suspended jobs are ignored when this keyword is used.

LEAST_RUN_TIME

Preempts the job that has been running for the shortest time. Run time is wall-clock time, not normalized run time.

MINI_JOB
Optimizes the preemption of parallel jobs by preempting only enough parallel jobs to start the high-priority parallel job.

OPTIMAL_MINI_JOB

Optimizes preemption of parallel jobs by preempting only low-priority parallel jobs, based on the least number of jobs that must be suspended to allow the high-priority parallel job to start. User limits and user group limits can interfere with preemption optimization of OPTIMAL_MINI_JOB. You should not configure OPTIMAL_MINI_JOB if you have user or user group limits configured. You should configure PARALLEL_SCHED_BY_SLOT=Y when using OPTIMAL_MINI_JOB.

USER_JLP

Counts only running jobs when evaluating if a user is approaching their per-processor job slot limit (SLOTS_PER_PROCESSOR, USERS, and PER_HOST=all in the lsb.resources file). Suspended jobs are ignored when calculating the per-processor job slot limit for individual users. When preemptive scheduling is enabled, suspended jobs never count against the total job slot limit for individual users.

Default

0 (The parameter is not defined.) Both running and suspended jobs are included in job slot limit calculations, except for job slot limits for hosts and individual users, where only running jobs are ever included.

PREEMPT_JOBTYPE

Syntax

PREEMPT_JOBTYPE=[EXCLUSIVE] [BACKFILL] [AFFINITY]

Description

If preemptive scheduling is enabled, this parameter enables preemption of exclusive, backfill, and affinity jobs. Specify one or more of the following keywords. Separate keywords with a space.

EXCLUSIVE

Enables preemption of and preemption by exclusive jobs. LSB_DISABLE_LIMLOCK_EXCL=Y in lsf.conf must also be defined.

BACKFILL

Enables preemption of backfill jobs. Jobs from higher priority queues can preempt jobs from backfill queues that are either backfilling reserved job slots or running as normal jobs.

AFFINITY

Enables affinity resource preemption.
Affinity resources (thread, core, socket, and NUMA) held by a suspended job can be used by a pending job through queue-based preemption, or through License Scheduler preemption.

Default

Not defined. Exclusive and backfill jobs are only preempted if the exclusive low priority job is running on a different host than the one used by the preemptive high priority job.

PREEMPTABLE_RESOURCES

Syntax

PREEMPTABLE_RESOURCES=res1 [res2] [res3] ...

Description

Enables preemption for resources (in addition to slots) when preemptive scheduling is enabled (has no effect if queue preemption is not enabled) and specifies the resources that will be preemptable. Specify shared resources (static or dynamic) that are numeric, decreasing, and releasable. One of the resources can be the built-in resource mem.

The default preemption action is to suspend the job. To force a job to release resources instead of suspending them, set TERMINATE_WHEN=PREEMPT in lsb.queues, or set JOB_CONTROLS in lsb.queues and specify brequeue as the SUSPEND action. Some applications will release resources when sent the SIGTSTP signal. Use JOB_CONTROLS to send this signal to suspend the job.

To enable memory preemption, include mem in the PREEMPTABLE_RESOURCES list in lsb.params. When preempting a job for memory, LSF does not free the memory occupied by the job. Rather, it suspends the job and dispatches another job to the host. It relies on the operating system to swap out the pages of the stopped job as the memory of the running job grows.

Default

Not defined (if preemptive scheduling is configured, LSF preempts on job slots only)

PREEMPTION_WAIT_TIME

Syntax

PREEMPTION_WAIT_TIME=seconds

Description

You must also specify PREEMPTABLE_RESOURCES in lsb.params. The amount of time LSF waits, after preempting jobs, for preemption resources to become available. Specify at least 300 seconds.
If LSF does not get the resources after this time, LSF might preempt more jobs.

Default

300 (seconds)

PREEXEC_EXCLUDE_HOST_EXIT_VALUES

Syntax

PREEXEC_EXCLUDE_HOST_EXIT_VALUES=all [~exit_value] | exit_value [exit_value] [...]

Description

Specify one or more values (between 1 and 255, but not 99) that correspond to the exit codes your pre-execution script exits with in the case of failure. LSF excludes any hosts that attempt to run the pre-exec script and exit with a value specified in PREEXEC_EXCLUDE_HOST_EXIT_VALUES. The exclusion list exists for this job until the mbatchd restarts.

Specify more than one value by separating them with a space. 99 is a reserved value. For example, PREEXEC_EXCLUDE_HOST_EXIT_VALUES=1 14 19 20 21.

Exclude values using a "~": PREEXEC_EXCLUDE_HOST_EXIT_VALUES=all ~40

In the case of failures that could be avoided by retrying on the same host, add the retry process to the pre-exec script. Use in combination with MAX_PREEXEC_RETRY in lsb.params to limit the total number of hosts that are tried. In a MultiCluster environment, use in combination with LOCAL_MAX_PREEXEC_RETRY and REMOTE_MAX_PREEXEC_RETRY.

Default

None.

PRIVILEGED_USER_FORCE_BKILL

Syntax

PRIVILEGED_USER_FORCE_BKILL=y | Y

Description

If Y, only root or the LSF administrator can successfully run bkill -r. For any other users, -r is ignored. If not defined, any user can run bkill -r.

Default

Not defined.

REMOVE_HUNG_JOBS_FOR

Syntax

REMOVE_HUNG_JOBS_FOR=runlimit[,wait_time=min] | host_unavail[,wait_time=min] | runlimit[,wait_time=min]:host_unavail[,wait_time=min] | all[,wait_time=min]

Description

Hung jobs are removed under the following conditions:
v host_unavail: Hung jobs are automatically removed if the first execution host is unavailable and a timeout is reached, as specified by wait_time in the parameter configuration. The default value of wait_time is 10 minutes.
Hung jobs of any status (RUN, SSUSP, and so on) are candidates for removal by LSF when the timeout is reached.
v runlimit: Removes the hung job after the job's run limit has been reached. You can use the wait_time option to specify a timeout for removal after reaching the run limit. The default value of wait_time is 10 minutes. For example, if REMOVE_HUNG_JOBS_FOR is defined with runlimit, wait_time=5 and JOB_TERMINATE_INTERVAL is not set, the job is removed by mbatchd 5 minutes after the job run limit is reached. Hung jobs in RUN status are considered for removal if the runlimit + wait_time have expired.

For backwards compatibility with earlier versions of LSF, REMOVE_HUNG_JOBS_FOR = runlimit is handled as previously: the grace period is 10 minutes + MAX(6 seconds, JOB_TERMINATE_INTERVAL), where JOB_TERMINATE_INTERVAL is specified in lsb.params. The grace period only begins once a job's run limit has been reached.
v all: Specifies hung job removal for all conditions (both runlimit and host_unavail). The hung job is removed when the first condition is satisfied. For example, if a job has a run limit, but it becomes hung because a host is unavailable before the run limit is reached, the job (running, suspended, and so on) is removed 10 minutes after the host became unavailable.

The job is placed in EXIT status by mbatchd. For a host_unavail condition, the wait_time count starts from the moment mbatchd detects that the host is unavailable. Running badmin mbdrestart or badmin reconfig while the timeout is in progress restarts the timeout countdown from 0. For a runlimit condition, wait_time is the time that a job in the UNKNOWN state takes to reach the run limit.

Default

Not defined.

REMOTE_MAX_PREEXEC_RETRY

Syntax

REMOTE_MAX_PREEXEC_RETRY=integer

Description

The maximum number of times to attempt the pre-execution command of a job from the remote cluster.
Valid values

0 < REMOTE_MAX_PREEXEC_RETRY < 2147483647

Default

5

RESOURCE_RESERVE_PER_TASK

Syntax

RESOURCE_RESERVE_PER_TASK=Y | y | N | n

Description

If set to Y, mbatchd reserves resources based on job tasks instead of per-host. By default, mbatchd only reserves resources for parallel jobs on a per-host basis. For example, by default, the command:

bsub -n 4 -R "rusage[mem=500]" -q reservation my_job

requires the job to reserve 500 MB on each host where the job runs.

Some parallel jobs need to reserve resources based on job tasks, rather than by host. In this example, if per-task reservation is enabled by RESOURCE_RESERVE_PER_TASK, the job my_job must reserve 500 MB of memory for each job task (4*500=2 GB) on the host in order to run.

If RESOURCE_RESERVE_PER_TASK is set, the following command reserves the resource my_resource for all 4 job tasks instead of only 1 on the host where the job runs:

bsub -n 4 -R "my_resource > 0 rusage[my_resource=1]" myjob

Default

N (Not defined; reserve resources per-host.)

RUN_JOB_FACTOR

Syntax

RUN_JOB_FACTOR=number

Description

Used only with fairshare scheduling. Job slots weighting factor. In the calculation of a user's dynamic share priority, this factor determines the relative importance of the number of job slots reserved and in use by a user. This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default

3.0

RUN_TIME_DECAY

Syntax

RUN_TIME_DECAY=Y | y | N | n

Description

Used only with fairshare scheduling. Enables decay for run time at the same rate as the decay set by HIST_HOURS for cumulative CPU time and historical run time. In the calculation of a user's dynamic share priority, this factor determines whether run time is decayed. This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.
Restrictions

Running badmin reconfig or restarting mbatchd during a job's run time results in the decayed run time being recalculated. When a suspended job using run time decay is resumed, the decay time is based on the elapsed time.

Default

N

RUN_TIME_FACTOR

Syntax

RUN_TIME_FACTOR=number

Description

Used only with fairshare scheduling. Run time weighting factor. In the calculation of a user's dynamic share priority, this factor determines the relative importance of the total run time of a user's running jobs. This parameter can also be set for an individual queue in lsb.queues. If defined, the queue value takes precedence.

Default

0.7

SBD_SLEEP_TIME

Syntax

SBD_SLEEP_TIME=seconds

Description

The interval at which LSF checks the load conditions of each host, to decide whether jobs on the host must be suspended or resumed. The job-level resource usage information is updated at a maximum frequency of every SBD_SLEEP_TIME seconds. The update is done only if the value for the CPU time, resident memory usage, or virtual memory usage has changed by more than 10 percent from the previous update, or if a new process or process group has been created.

The LIM marks the host SBDDOWN if it does not receive the heartbeat within 1 minute. Therefore, setting SBD_SLEEP_TIME greater than 60 seconds causes the host to be frequently marked SBDDOWN and triggers an mbatchd probe, thus slowing performance. After modifying this parameter, use badmin hrestart -f all to restart the sbatchd daemons and let the modified value take effect.

Default

30 seconds.

SCHED_METRIC_ENABLE

Syntax

SCHED_METRIC_ENABLE=Y | N

Description

Enable scheduler performance metric collection. Use badmin perfmon stop and badmin perfmon start to dynamically control performance metric collection.
Default

N

SCHED_METRIC_SAMPLE_PERIOD

Syntax

SCHED_METRIC_SAMPLE_PERIOD=seconds

Description

Set a default performance metric sampling period in seconds. Cannot be less than 60 seconds. Use badmin perfmon setperiod to dynamically change the performance metric sampling period.

Default

60 seconds

SCHED_PER_JOB_SORT

Syntax

SCHED_PER_JOB_SORT=Y | N

Description

Enable this parameter to use the per-job sorting feature in the scheduler. This feature allows LSF to schedule jobs accurately in one scheduling cycle, at the cost of extra load on the scheduler.

For example, jobs 1 to 3 are submitted to LSF as follows:
v bsub -R "order[slots]" job1
v bsub -R "order[slots]" job2
v bsub -R "order[slots]" job3

Each job requests to be dispatched to the host with the most unused slots. The request is not completely satisfied in one scheduling cycle if per-job sorting is disabled. For example:
v host1 3 slots
v host2 3 slots
v host3 3 slots

Jobs 1 to 3 are dispatched to the same host (for example, host1) in one scheduling cycle because the hosts are only sorted once per scheduling cycle. If the per-job sorting feature is enabled, the candidate hosts are sorted again before each job is actually scheduled, so these three jobs are dispatched to different hosts. This result is exactly what you want, but the cost is that scheduler performance may be significantly lower. For example, if there are 5000 hosts in the cluster and the master host is a machine with 24 GB memory and 2 physical CPUs, each with 2 cores and 6 threads, each job with ! in the ORDER[] section consumes about an extra 10 ms in one scheduling cycle. If there are many jobs with ! in the ORDER[] section, as controlled by LSB_MAX_JOB_DISPATCH_PER_SESSION, then a lot of extra time is consumed in one scheduling cycle. To get accurate scheduling results without impacting scheduler performance, you can set JOB_ACCEPT_INTERVAL to a non-zero value.
This is because the hosts do not need to be sorted again by the per-job sorting feature.

Default

N

See also

JOB_ACCEPT_INTERVAL in lsb.params and lsb.queues. LSB_MAX_JOB_DISPATCH_PER_SESSION in lsf.conf.

SCHEDULER_THREADS

Syntax

SCHEDULER_THREADS=integer

Description

Set the number of threads the scheduler uses to evaluate resource requirements. Multithreaded resource evaluation is useful for large scale clusters with large numbers of hosts. The idea is to do resource evaluation for hosts concurrently. For example, with 6,000 hosts in a cluster, the scheduler may create six threads to do the evaluation concurrently, each thread in charge of 1,000 hosts. To set an effective value for this parameter, consider the number of available CPUs on the master host, the number of hosts in the cluster, and the scheduling performance metrics. Set the number of threads between 1 and the number of cores in the master host. A value of 0 means that the scheduler does not create any threads to evaluate resource requirements.

Note: SCHEDULER_THREADS is available only in LSF Advanced Edition.

Default

0

SECURE_INFODIR_USER_ACCESS

Syntax

SECURE_INFODIR_USER_ACCESS=Y | N

Description

By default (SECURE_INFODIR_USER_ACCESS=N or not defined), any user can view other users' job information in the lsb.events and lsb.acct files using the bhist or bacct commands. Specify Y to prevent users (all users except the primary administrator) from accessing other users' job information using bhist or bacct.

With SECURE_INFODIR_USER_ACCESS enabled, a regular user does not have rights to call the API to get data under LSB_SHAREDIR/cluster/logdir, which will be readable only by the primary administrator. Only the primary administrator has rights to run bhist -t; regular and administrator users do not, and they will only see their own job information.
The LSF primary administrator can always view all users' job information in lsb.events and lsb.acct, no matter what the setting.

After enabling this feature, you must set the setuid bit of the LSF primary administrator on the bhist and bacct binaries under LSF_BINDIR. Once the setuid bit is set, bhist and bacct call mbatchd to check whether the parameter is set or not.

To disable this feature, specify N for SECURE_INFODIR_USER_ACCESS and, to avoid bhist and bacct calling mbatchd, remove the setuid bit from the bhist and bacct binaries under LSF_BINDIR. When disabled, the permissions on LSB_SHAREDIR/cluster/logdir return to normal after mbatchd is reconfigured (run badmin reconfig).

Note: This feature is only supported when LSF is installed on a file system that supports the setuid bit on files. Therefore, this feature does not work on Windows platforms.

Note: If LSB_LOCALDIR has been enabled to duplicate LSB_SHAREDIR, LSB_LOCALDIR will also be readable only by the primary administrator after setting SECURE_INFODIR_USER_ACCESS=Y.

Default

N

SECURE_JOB_INFO_LEVEL

Syntax

SECURE_JOB_INFO_LEVEL=0 | 1 | 2 | 3 | 4

Description

Defines an access control level for all users. Specify a level (0 to 4) to control which jobs users and administrators (except the primary administrator) can see.

For LSF users, there are three types of jobs:
v The user's own jobs.
v Jobs that belong to other users in the same user group.
v Jobs that do not belong to users in the same user group.

There are two kinds of job information that can be viewed by users:
v Summary information: Obtained from bjobs with options other than -l, such as -aps, -fwd, -p, -ss, -sum, -W, -WF, -WP, -WL, and so on.
v Detail information: Obtained from bjobs -l, bjobs -UF, bjobs -N, bjdepinfo, bread, and bstatus.

There are two kinds of user rights that determine what kind of information a user can view for a job:
v Basic rights: User can see all summary information.
v Detail rights: User can see all detail information.

When a user or administrator enters one of the commands to see job information (bjobs, bjdepinfo, bread, or bstatus), SECURE_JOB_INFO_LEVEL controls whether they see:
v Just their own jobs' information. (level 4)
v Their own jobs, and summary information from jobs in the same user group. (level 3)
v Their own jobs, and summary and detail information from jobs in the same user group. (level 2)
v Their own jobs, summary and detail information from jobs in the same user group, and summary information from jobs outside their user group. (level 1)
v Summary and detail job information for all jobs. (level 0)

Note: If SECURE_JOB_INFO_LEVEL is set to level 1, 2, 3, or 4, check whether SECURE_INFODIR_USER_ACCESS is enabled (set to Y). If it is not enabled, access to bjobs functions will be restricted, but access to bhist and bacct will be available.

Note: In a MultiCluster environment, the SECURE_JOB_INFO_LEVEL definition still applies when a user attempts to view job information from a remote cluster through the bjobs -m remotecluster command. The security level configuration of the specified cluster takes effect.

Interaction with bsub -G and bjobs -o
v If a user submits a job using bsub -G, the job is treated as a member of the -G specified user group (or default user group). For example: UserA belongs to user groups UG1 and UG2. UserA submits Job1 using bsub -G UG1, and submits Job2 without using bsub -G. Job1 is treated as belonging to UG1 only; Job2 is treated as belonging to both UG1 and UG2. The result is that members of UG1 can view both Job1 and Job2 details if they are given access rights to view jobs in the same user group. Members of UG2 can view only Job2 if they are given access rights to view jobs in the same user group.
- If a user has only basic rights, bjobs -o returns only values in the basic fields (others display as "-"): jobid, stat, user, queue, job_name, proj_name, pids, from_host, exec_host, nexec_host, first_host, submit_time, start_time, time_left, finish_time, %complete, cpu_used, slots, mem, swap, forward_cluster, forward_time, run_time.

Limitations
- An administrator may not have permission to see a job, but can still control the job (for example, kill it) using the appropriate commands.
- When job information security is enabled, pre-LSF 9.1 bjobs and bjdepinfo commands are rejected no matter who issues them, because mbatchd cannot get the command user name. A "No job found" message is returned.
- When job information security is enabled, users may have rights to view only job summary information and no rights to view job detail information. In that case, a user sees job information when viewing summary information (using bjobs <jobid>), but an error (job <jobid> is not found) is returned when the user tries to view job detail information (using bjobs -l <jobid>).

Default
0

SLA_TIMER

Syntax
SLA_TIMER=seconds

Description
For EGO-enabled SLA scheduling. Controls how often each service class is evaluated and a network message is sent to EGO communicating host demand.

Valid values
Positive integer between 2 and 21474847

Default
0 (Not defined.)

SSCHED_ACCT_DIR

Syntax
SSCHED_ACCT_DIR=directory

Description
Used by IBM Platform Session Scheduler (ssched). A universally accessible and writable directory that stores Session Scheduler task accounting files. Each Session Scheduler session (each ssched instance) creates one accounting file. Each file contains one accounting entry for each task. The accounting file is named job_ID.ssched.acct. If no directory is specified, accounting records are not written.

Valid values
Specify any string up to 4096 characters long

Default
Not defined.
No task accounting file is created.

SSCHED_MAX_RUNLIMIT

Syntax
SSCHED_MAX_RUNLIMIT=seconds

Description
Used by IBM Platform Session Scheduler (ssched). Maximum run time for a task. Users can override this value with a lower value. Specify a value greater than or equal to zero (0).

Recommended value
For very short-running tasks, a reasonable value is twice the typical runtime. Because LSF does not release slots allocated to the session until all tasks are completed and ssched exits, avoid setting a large value for SSCHED_MAX_RUNLIMIT.

Valid values
Specify a positive integer between 0 and 2147483645

Default
600 seconds (10 minutes)

SSCHED_MAX_TASKS

Syntax
SSCHED_MAX_TASKS=integer

Description
Used by Session Scheduler (ssched). Maximum number of tasks that can be submitted to Session Scheduler. Session Scheduler exits if this limit is reached. Specify a value greater than or equal to zero (0).

Valid values
Specify a positive integer between 0 and 2147483645

Default
50000 tasks

SSCHED_REQUEUE_LIMIT

Syntax
SSCHED_REQUEUE_LIMIT=integer

Description
Used by Session Scheduler (ssched). Number of times Session Scheduler tries to requeue a task as a result of the REQUEUE_EXIT_VALUES (ssched -Q) setting. SSCHED_REQUEUE_LIMIT=0 means never requeue. Specify a value greater than or equal to zero (0).

Valid values
Specify a positive integer between 0 and 2147483645

Default
3 requeue attempts

SSCHED_RETRY_LIMIT

Syntax
SSCHED_RETRY_LIMIT=integer

Description
Used by Session Scheduler (ssched). Number of times Session Scheduler tries to retry a task that fails during dispatch or setup. SSCHED_RETRY_LIMIT=0 means never retry. Specify a value greater than or equal to zero (0).
Valid values
Specify a positive integer between 0 and 2147483645

Default
3 retry attempts

SSCHED_UPDATE_SUMMARY_BY_TASK

Syntax
SSCHED_UPDATE_SUMMARY_BY_TASK=integer

Description
Used by Platform Session Scheduler (ssched). Updates the Session Scheduler task summary through bpost after the specified number of tasks finish. Specify a value greater than or equal to zero (0).

If both SSCHED_UPDATE_SUMMARY_INTERVAL and SSCHED_UPDATE_SUMMARY_BY_TASK are set to zero (0), bpost is not run.

Valid values
Specify a positive integer between 0 and 2147483645

Default
0

See also
SSCHED_UPDATE_SUMMARY_INTERVAL

SSCHED_UPDATE_SUMMARY_INTERVAL

Syntax
SSCHED_UPDATE_SUMMARY_INTERVAL=seconds

Description
Used by Platform Session Scheduler (ssched). Updates the Session Scheduler task summary through bpost after the specified number of seconds. Specify a value greater than or equal to zero (0).

If both SSCHED_UPDATE_SUMMARY_INTERVAL and SSCHED_UPDATE_SUMMARY_BY_TASK are set to zero (0), bpost is not run.

Valid values
Specify a positive integer between 0 and 2147483645

Default
60 seconds

See also
SSCHED_UPDATE_SUMMARY_BY_TASK

STRICT_UG_CONTROL

Syntax
STRICT_UG_CONTROL=Y | N

Description
When STRICT_UG_CONTROL=Y is defined:
- Jobs submitted with -G usergroup specified can only be controlled by the user group administrator of the specified user group.
- User group administrators can be defined for user groups that have all as a member.

After adding or changing STRICT_UG_CONTROL in lsb.params, use badmin reconfig to reconfigure your cluster.

Default
N (Not defined.)

See also
DEFAULT_USER_GROUP, ENFORCE_ONE_UG_LIMIT, ENFORCE_UG_TREE

STRIPING_WITH_MINIMUM_NETWORK

Syntax
STRIPING_WITH_MINIMUM_NETWORK=y | n

Description
For LSF IBM Parallel Environment (PE) integration.
Specifies whether nodes that have more than half of their networks in the READY state are considered for PE jobs with type=sn_all specified in the network resource requirements (in the bsub -network option or the NETWORK_REQ parameter in lsb.queues or lsb.applications). This makes certain that at least one network is UP and in the READY state between any two nodes assigned to the job.

When set to y, nodes that have more than half the minimum number of networks in the READY state are considered for sn_all jobs. When set to n, only nodes that have all networks in the READY state are considered for sn_all jobs.

Note: LSF_PE_NETWORK_NUM must be defined with a value greater than 0 for STRIPING_WITH_MINIMUM_NETWORK to take effect.

Example
In a cluster with 8 networks, due to hardware failure, only 3 networks are OK on hostA, and 5 networks are OK on hostB. If STRIPING_WITH_MINIMUM_NETWORK=n, an sn_all job cannot run on either hostA or hostB. If STRIPING_WITH_MINIMUM_NETWORK=y, an sn_all job can run on hostB, but it cannot run on hostA.

Default
n

SUB_TRY_INTERVAL

Syntax
SUB_TRY_INTERVAL=integer

Description
The number of seconds for the requesting client to wait before resubmitting a job. This value is sent by mbatchd to the client.

Default
60 seconds

See also
MAX_PEND_JOBS

SYSTEM_MAPPING_ACCOUNT

Syntax
SYSTEM_MAPPING_ACCOUNT=user_account

Description
Enables Windows workgroup account mapping, which allows LSF administrators to map all Windows workgroup users to a single Windows system account, eliminating the need to create multiple users and passwords in LSF. Users can submit and run jobs using their local user names and passwords, and LSF runs the jobs using the mapped system account name and password. With Windows workgroup account mapping, all users have the same permissions because all users map to the same system account.

To specify the user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name).
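For illustration, a single lsb.params line of the following form maps all workgroup users to one system account (the domain and account names here are hypothetical placeholders, not values from this reference):

```
# lsb.params: map all Windows workgroup users to one system account
SYSTEM_MAPPING_ACCOUNT=MYDOMAIN\lsfsvc
```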
Define this parameter for LSF Windows Workgroup installations only.

Default
Not defined

USE_SUSP_SLOTS

Syntax
USE_SUSP_SLOTS=Y | N

Description
If USE_SUSP_SLOTS=Y, allows jobs from a low priority queue to use slots held by suspended jobs in a high priority queue that has a preemption relationship with the low priority queue.

Set USE_SUSP_SLOTS=N to prevent low priority jobs from using slots held by suspended jobs in a high priority queue that has a preemption relationship with the low priority queue.

Default
Y

lsb.queues

The lsb.queues file defines batch queues. Numerous controls are available at the queue level to allow cluster administrators to customize site policies.

This file is optional. If no queues are configured, LSF creates a queue that is named default, with all parameters set to default values.

This file is installed by default in LSB_CONFDIR/cluster_name/configdir.

Changing lsb.queues configuration
After you change lsb.queues, run badmin reconfig to reconfigure mbatchd. Some parameters, such as run window and runtime limit, do not take effect immediately for running jobs unless you run mbatchd restart or sbatchd restart on the job execution host.

lsb.queues structure
Each queue definition begins with the line Begin Queue and ends with the line End Queue. The queue name must be specified; all other parameters are optional.

ADMINISTRATORS

Syntax
ADMINISTRATORS=user_name | user_group ...

Description
List of queue administrators. To specify a Windows user account or user group, include the domain name in uppercase letters (DOMAIN_NAME\user_name or DOMAIN_NAME\user_group).

Queue administrators can operate on any user job in the queue, and on the queue itself.

Default
Not defined. You must be a cluster administrator to operate on this queue.

APS_PRIORITY

Syntax
APS_PRIORITY=WEIGHT[[factor, value] [subfactor, value]...]...] LIMIT[[factor, value] [subfactor, value]...]...]
GRACE_PERIOD[[factor, value] [subfactor, value]...]...]

Description
Specifies calculation factors for absolute priority scheduling (APS). Pending jobs in the queue are ordered according to the calculated APS value.

If the weight of a subfactor is defined, but the weight of the parent factor is not defined, the parent factor weight is set to 1.

The WEIGHT and LIMIT factors are floating-point values. Specify a value for GRACE_PERIOD in seconds (values), minutes (valuem), or hours (valueh). The default unit for grace period is hours. For example, the following sets a grace period of 10 hours for the MEM factor, 10 minutes for the JPRIORITY factor, 10 seconds for the QPRIORITY factor, and 10 hours (default) for the RSRC factor:
GRACE_PERIOD[[MEM,10h] [JPRIORITY, 10m] [QPRIORITY,10s] [RSRC, 10]]

You cannot specify 0 (zero) for the WEIGHT, LIMIT, and GRACE_PERIOD of any factor or subfactor.

APS queues cannot configure cross-queue fairshare (FAIRSHARE_QUEUES). The QUEUE_GROUP parameter replaces FAIRSHARE_QUEUES, which is obsolete in LSF 7.0.

Suspended (bstop) jobs and migrated jobs (bmig) are always scheduled before pending jobs. For migrated jobs, LSF keeps the existing job priority information. If LSB_REQUEUE_TO_BOTTOM and LSB_MIG2PEND are configured in lsf.conf, the migrated jobs keep their APS information. When LSB_REQUEUE_TO_BOTTOM and LSB_MIG2PEND are configured, the migrated jobs must compete with other pending jobs based on the APS value. To reset the APS value, use brequeue, not bmig.

Default
Not defined

BACKFILL

Syntax
BACKFILL=Y | N

Description
If Y, enables backfill scheduling for the queue.

A possible conflict exists if BACKFILL and PREEMPTION are specified together. If PREEMPT_JOBTYPE=BACKFILL is set in the lsb.params file, a backfill queue can be preemptable. Otherwise, a backfill queue cannot be preemptable. If BACKFILL is enabled, do not also specify PREEMPTION=PREEMPTABLE.
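As an illustration, a minimal backfill queue could be sketched as follows. This is a hypothetical configuration, not one taken from this reference; the queue name, priority, and run limits are placeholders:

```
# lsb.queues: a backfill queue with a default run limit of 60 minutes
# and a maximum of 12 hours (run limits help backfill estimate job length)
Begin Queue
QUEUE_NAME = night_backfill
PRIORITY   = 20
BACKFILL   = Y
RUNLIMIT   = 60 720
End Queue
```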
BACKFILL is required for interruptible backfill queues (INTERRUPTIBLE_BACKFILL=seconds).

When MAX_SLOTS_IN_POOL, SLOT_RESERVE, and BACKFILL are defined for the same queue, jobs in the queue cannot backfill with slots that are reserved by other jobs in the same queue.

Default
Not defined. No backfilling.

CHKPNT

Syntax
CHKPNT=chkpnt_dir [chkpnt_period]

Description
Enables automatic checkpointing for the queue. All jobs that are submitted to the queue are checkpointable.

The checkpoint directory is the directory where the checkpoint files are created. Specify an absolute path or a path relative to CWD; do not use environment variables.

Specify the optional checkpoint period in minutes.

You can checkpoint only running members of a chunk job.

If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides the queue-level configuration.

Checkpoint-related configuration that is specified in the queue, application profile, and at job level has the following effect:
- Application-level and job-level parameters are merged. If the same parameter is defined at both job level and in the application profile, the job-level value overrides the application profile value.
- The merged result of job-level and application profile settings overrides queue-level configuration.

To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) of both the submission cluster and the execution cluster. LSF uses the directory that is specified in the execution cluster. To make a MultiCluster job checkpointable, both submission and execution queues must enable checkpointing, and the application profile or queue setting on the execution cluster determines the checkpoint directory.
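For instance, a queue that checkpoints its jobs to a shared directory every 30 minutes could be sketched like this (the queue name and directory path are hypothetical, not taken from this reference):

```
# lsb.queues: checkpoint jobs to a shared directory every 30 minutes
Begin Queue
QUEUE_NAME = ckpt_queue
PRIORITY   = 40
CHKPNT     = /share/lsf/chkpnt 30
End Queue
```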
Checkpointing is not supported if a job runs on a leased host.

The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.

Default
Not defined

CHUNK_JOB_SIZE

Syntax
CHUNK_JOB_SIZE=integer

Description
Chunk jobs only. Enables job chunking and specifies the maximum number of jobs that are allowed to be dispatched together in a chunk. Specify a positive integer greater than 1.

The ideal candidates for job chunking are jobs that have the same host and resource requirements and typically take 1 - 2 minutes to run.

Job chunking can have the following advantages:
- Reduces communication between sbatchd and mbatchd, and reduces scheduling overhead in mbschd.
- Increases job throughput in mbatchd and CPU usage on the execution hosts.

However, throughput can deteriorate if the chunk job size is too large. Performance might decrease on queues with CHUNK_JOB_SIZE greater than 30. Evaluate the chunk job size on your own systems for best performance.

With the MultiCluster job forwarding model, this parameter does not affect MultiCluster jobs that are forwarded to a remote cluster.

Compatibility
This parameter is ignored in the following kinds of queues and applications:
- Interactive (INTERACTIVE=ONLY parameter)
- CPU limit greater than 30 minutes (CPULIMIT parameter)
- Run limit greater than 30 minutes (RUNLIMIT parameter)
- Runtime estimate greater than 30 minutes (RUNTIME parameter in lsb.applications only)

If CHUNK_JOB_DURATION is set in lsb.params, chunk jobs are accepted regardless of the value of CPULIMIT, RUNLIMIT, or RUNTIME.
Example
The following example configures a queue that is named chunk, which dispatches up to four jobs in a chunk:

Begin Queue
QUEUE_NAME = chunk
PRIORITY = 50
CHUNK_JOB_SIZE = 4
End Queue

Default
Not defined

COMMITTED_RUN_TIME_FACTOR

Syntax
COMMITTED_RUN_TIME_FACTOR=number

Description
Used only with fairshare scheduling. Committed runtime weighting factor.

In the calculation of a user dynamic priority, this factor determines the relative importance of the committed run time in the calculation. If the -W option of bsub is not specified at job submission and a RUNLIMIT is not set for the queue, the committed run time is not considered.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Valid values
Any positive number between 0.0 and 1.0

Default
Not defined.

CORELIMIT

Syntax
CORELIMIT=integer

Description
The per-process (hard) core file size limit (in KB) for all of the processes that belong to a job from this queue (see getrlimit(2)).

Default
Unlimited

CPU_FREQUENCY

Syntax
CPU_FREQUENCY=[float_number][unit]

Description
Specifies the CPU frequency for a queue. All jobs submitted to the queue require the specified CPU frequency. The value is a positive float number with units (GHz, MHz, or KHz). If no units are set, the default is GHz.

You can also use bsub -freq to set this value. The submission value overrides the application profile value, and the application profile value overrides the queue value.

Default
Not defined (Nominal CPU frequency is used)

CPULIMIT

Syntax
CPULIMIT=[default_limit] maximum_limit

where default_limit and maximum_limit have the following format:
[hour:]minute[/host_name | /host_model]

Description
Maximum normalized CPU time and, optionally, the default normalized CPU time that is allowed for all processes of jobs that run in this queue. The name of a host or host model specifies the CPU time normalization host to use.
Limits the total CPU time the job can use. This parameter is useful for preventing runaway jobs or jobs that use up too many resources.

When the total CPU time for the whole job reaches the limit, a SIGXCPU signal is sent to all processes that belong to the job. If the job has no signal handler for SIGXCPU, the job is killed immediately. If the SIGXCPU signal is handled, blocked, or ignored by the application, then after the grace period expires, LSF sends SIGINT, SIGTERM, and SIGKILL to the job to kill it.

If a job dynamically creates processes, the CPU time that is used by these processes is accumulated over the life of the job. Processes that exist for fewer than 30 seconds might be ignored.

By default, if a default CPU limit is specified, jobs submitted to the queue without a job-level CPU limit are killed when the default CPU limit is reached. If you specify only one limit, it is the maximum, or hard, CPU limit. If you specify two limits, the first one is the default, or soft, CPU limit, and the second one is the maximum CPU limit.

The number of minutes can be greater than 59. Therefore, three and a half hours can be specified either as 3:30 or 210.

If no host or host model is given with the CPU time, LSF uses the default CPU time normalization host that is defined at the queue level (DEFAULT_HOST_SPEC in lsb.queues) if it is configured. Otherwise, the default CPU time normalization host that is defined at the cluster level (DEFAULT_HOST_SPEC in lsb.params) is used if it is configured. Otherwise, the host with the largest CPU factor (the fastest host in the cluster) is used.

Because sbatchd periodically checks whether a CPU time limit was exceeded, a Windows job that runs under a CPU time limit can exceed that limit by up to SBD_SLEEP_TIME.

On UNIX systems, the CPU limit can be enforced by the operating system at the process level.
You can define whether the CPU limit is a per-process limit that is enforced by the OS or a per-job limit enforced by LSF with LSB_JOB_CPULIMIT in lsf.conf.

Jobs that are submitted to a chunk job queue are not chunked if CPULIMIT is greater than 30 minutes.

Default
Unlimited

CPU_TIME_FACTOR

Syntax
CPU_TIME_FACTOR=number

Description
Used only with fairshare scheduling. CPU time weighting factor.

In the calculation of a user dynamic share priority, this factor determines the relative importance of the cumulative CPU time that is used by a user's jobs.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Default
0.7

DATALIMIT

Syntax
DATALIMIT=[default_limit] maximum_limit

Description
The per-process data segment size limit (in KB) for all of the processes that belong to a job from this queue (see getrlimit(2)).

By default, if a default data limit is specified, jobs submitted to the queue without a job-level data limit are killed when the default data limit is reached.

If you specify only one limit, it is the maximum, or hard, data limit. If you specify two limits, the first one is the default, or soft, data limit, and the second one is the maximum data limit.

Default
Unlimited

DATA_TRANSFER

Syntax
DATA_TRANSFER=Y | N

Description
Configures the queue as a data transfer queue for LSF data management. DATA_TRANSFER=Y enables the queue for data transfer.

Only one queue in a cluster can be a data transfer queue. Any transfer jobs that are submitted by the data manager go to this queue. If the lsf.datamanager file exists, then at least one queue must define the DATA_TRANSFER parameter. If this parameter is set, a corresponding lsf.datamanager file must exist.

Regular jobs that are submitted to this queue through bsub are rejected.

Use bstop, bresume, and bkill to stop, resume, and kill your own jobs in a data transfer queue.
LSF administrators and queue administrators can additionally use btop and bbot to move jobs in the queue. All other commands on jobs in a data transfer queue are rejected. Jobs from other queues cannot be switched to a data transfer queue (bswitch).

If you change this parameter, LSF data manager transfer jobs that were in the previous queue remain in that queue and are scheduled and run as normal. The LSF data manager is notified of their success or failure.

The following queue parameters cannot be used together with a queue that defines this parameter:
- INTERACTIVE=ONLY
- SNDJOBS_TO
- RCVJOBS_FROM
- MAX_RSCHED_TIME
- REQUEUE_EXIT_VALUES
- SUCCESS_EXIT_VALUES
- RERUNNABLE

A data transfer queue cannot appear in the list of default queues that are defined by DEFAULT_QUEUE in lsb.params. Jobs that are submitted to the data transfer queue are not attached to the application specified by DEFAULT_APPLICATION in lsb.params.

Default
N

DEFAULT_EXTSCHED

Syntax
DEFAULT_EXTSCHED=external_scheduler_options

Description
Specifies default external scheduling options for the queue. -extsched options on the bsub command are merged with DEFAULT_EXTSCHED options, and -extsched options override any conflicting queue-level options set by DEFAULT_EXTSCHED.

Default
Not defined

DEFAULT_HOST_SPEC

Syntax
DEFAULT_HOST_SPEC=host_name | host_model

Description
The default CPU time normalization host for the queue. The CPU factor of the specified host or host model is used to normalize the CPU time limit of all jobs in the queue, unless the CPU time normalization host is specified at the job level.

Default
Not defined. The queue uses the DEFAULT_HOST_SPEC defined in lsb.params. If DEFAULT_HOST_SPEC is not defined in either file, LSF uses the fastest host in the cluster.
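To illustrate how a normalization host and a CPU limit combine, consider the following sketch. The queue and host names are hypothetical placeholders, not values from this reference:

```
# lsb.queues: normalize CPU limits to hostD's CPU factor
Begin Queue
QUEUE_NAME        = normalized
PRIORITY          = 30
DEFAULT_HOST_SPEC = hostD
# 3.5 hours of CPU time, measured relative to hostD
CPULIMIT          = 3:30
End Queue
```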
DESCRIPTION

Syntax
DESCRIPTION=text

Description
Description of the job queue that is displayed by bqueues -l.

Use a description that clearly describes the service features of this queue to help users select the proper queue for each job. The text can include any characters, including white space. The text can be extended to multiple lines by ending the preceding line with a backslash (\). The maximum length for the text is 512 characters.

DISPATCH_BY_QUEUE

Syntax
DISPATCH_BY_QUEUE=Y | y | N | n

Description
Set this parameter to increase queue responsiveness. The scheduling decision for the specified queue is published without waiting for the whole scheduling session to finish. The scheduling decision for the jobs in the specified queue is final and these jobs cannot be preempted within the same scheduling cycle.

Tip: Set this parameter only for your highest priority queue (such as for an interactive queue) to ensure that this queue has the highest responsiveness.

Default
N

DISPATCH_ORDER

Syntax
DISPATCH_ORDER=QUEUE

Description
Defines an ordered cross-queue fairshare set. DISPATCH_ORDER indicates that jobs are dispatched according to the order of queue priorities first, then user fairshare priority.

By default, a user has the same priority across the master and subordinate queues. If the same user submits several jobs to these queues, user priority is calculated by taking into account all the jobs the user submits across the master-subordinate set.

If DISPATCH_ORDER=QUEUE is set in the master queue, jobs are dispatched according to queue priorities first, then user priority. Jobs from users with lower fairshare priorities who have pending jobs in higher priority queues are dispatched before jobs in lower priority queues. This behavior prevents users with higher fairshare priority from having their jobs dispatched only from low-priority queues.
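A minimal sketch of an ordered cross-queue fairshare set follows. The queue names, priorities, and share assignments are hypothetical, chosen only to show where DISPATCH_ORDER, FAIRSHARE, and FAIRSHARE_QUEUES are placed (all three in the master queue):

```
# lsb.queues: master queue defines the fairshare policy and dispatch order
Begin Queue
QUEUE_NAME       = master_q
PRIORITY         = 60
FAIRSHARE        = USER_SHARES[[default, 1]]
FAIRSHARE_QUEUES = subordinate_q
DISPATCH_ORDER   = QUEUE
End Queue

# subordinate queue inherits the master queue's fairshare policy
Begin Queue
QUEUE_NAME = subordinate_q
PRIORITY   = 40
End Queue
```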
Jobs in queues with the same priority are dispatched according to user priority.

Queues that are not part of the cross-queue fairshare can have any priority; they are not limited to fall outside of the priority range of cross-queue fairshare queues.

Default
Not defined

DISPATCH_WINDOW

Syntax
DISPATCH_WINDOW=time_window ...

Description
The time windows in which jobs from this queue are dispatched. After jobs are dispatched, they are no longer affected by the dispatch window.

Default
Not defined. Dispatch window is always open.

ENABLE_HIST_RUN_TIME

Syntax
ENABLE_HIST_RUN_TIME=y | Y | n | N

Description
Used only with fairshare scheduling. If set, enables the use of historical run time in the calculation of fairshare scheduling priority.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Default
Not defined.

EXCLUSIVE

Syntax
EXCLUSIVE=Y | N | CU[cu_type]

Description
If Y, specifies an exclusive queue.

If CU, CU[], or CU[cu_type], specifies an exclusive queue as well as a queue exclusive to compute units of type cu_type (as defined in lsb.params). If no type is specified, the default compute unit type is used.

Jobs that are submitted to an exclusive queue with bsub -x are only dispatched to a host that has no other running jobs. Jobs that are submitted to a compute unit exclusive queue with bsub -R "cu[excl]" only run on a compute unit that has no other running jobs.

For hosts shared under the MultiCluster resource leasing model, jobs are not dispatched to a host that has running jobs, even if the jobs are from another cluster.

Note: EXCLUSIVE=Y or EXCLUSIVE=CU[cu_type] must be configured to enable affinity jobs to use CPUs exclusively, when the alljobs scope is specified in the exclusive option of an affinity[] resource requirement string.

Default
N

FAIRSHARE

Syntax
FAIRSHARE=USER_SHARES[[user, number_shares] ...]

- Specify at least one user share assignment.
- Enclose the list in square brackets, as shown.
- Enclose each user share assignment in square brackets, as shown.
- user: Specify users who are also configured to use the queue. You can assign the shares to the following types of users:
  - A single user (specify user_name). To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name).
  - Users in a group, individually (specify group_name@) or collectively (specify group_name). To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\group_name).
  - Users not included in any other share assignment, individually (specify the keyword default) or collectively (specify the keyword others).
    - By default, when resources are assigned collectively to a group, the group members compete for the resources on a first-come, first-served (FCFS) basis. You can use hierarchical fairshare to further divide the shares among the group members.
    - When resources are assigned to members of a group individually, the share assignment is recursive. Members of the group and of all subgroups always compete for the resources according to FCFS scheduling, regardless of hierarchical fairshare policies.
- number_shares
  - Specify a positive integer that represents the number of shares of the cluster resources that are assigned to the user.
  - The number of shares that are assigned to each user is only meaningful when you compare it to the shares assigned to other users or to the total number of shares. The total number of shares is just the sum of all the shares that are assigned in each share assignment.

Description
Enables queue-level user-based fairshare and specifies share assignments. Only users with share assignments can submit jobs to the queue.

Compatibility
Do not configure hosts in a cluster to use fairshare at both queue and host levels. However, you can configure user-based fairshare and queue-based fairshare together.
Default
Not defined. No fairshare.

FAIRSHARE_ADJUSTMENT_FACTOR

Syntax
FAIRSHARE_ADJUSTMENT_FACTOR=number

Description
Used only with fairshare scheduling. Fairshare adjustment plug-in weighting factor.

In the calculation of a user dynamic share priority, this factor determines the relative importance of the user-defined adjustment that is made in the fairshare plug-in (libfairshareadjust.*).

A positive float number both enables the fairshare plug-in and acts as a weighting factor.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Default
Not defined.

FAIRSHARE_QUEUES

Syntax
FAIRSHARE_QUEUES=queue_name [queue_name ...]

Description
Defines cross-queue fairshare. When this parameter is defined:
- The queue in which this parameter is defined becomes the "master queue".
- Queues that are listed with this parameter are subordinate queues and inherit the fairshare policy of the master queue.
- A user has the same priority across the master and subordinate queues. If the same user submits several jobs to these queues, user priority is calculated by taking into account all the jobs that the user submitted across the master-subordinate set.

Notes
- By default, the PRIORITY range that is defined for queues in cross-queue fairshare cannot be used with any other queues. For example, you have four queues: queue1, queue2, queue3, and queue4. You assign the following cross-queue fairshare priorities:
  - queue1 priority: 30
  - queue2 priority: 40
  - queue3 priority: 50
- By default, the priority of queue4 (which is not part of the cross-queue fairshare) cannot fall within the priority range of the cross-queue fairshare queues (30-50). It can be any number up to 29 or higher than 50. It does not matter whether queue4 is a fairshare queue or an FCFS queue.
- If DISPATCH_ORDER=QUEUE is set in the master queue, the priority of queue4 (which is not part of the cross-queue fairshare) can be any number, including a priority that falls within the priority range of the cross-queue fairshare queues (30-50).
- FAIRSHARE must be defined in the master queue. If it is also defined in the queues that are listed in FAIRSHARE_QUEUES, it is ignored.
- Cross-queue fairshare can be defined more than once within lsb.queues. You can define several sets of master-subordinate queues. However, a queue cannot belong to more than one master-subordinate set. For example, you can define:
  - In queue normal: FAIRSHARE_QUEUES=short
  - In queue priority: FAIRSHARE_QUEUES=night owners
  Restriction: You cannot, however, define night, owners, or priority as subordinates in the queue normal; or normal and short as subordinates in the priority queue; or short, night, and owners as master queues of their own.
- Cross-queue fairshare cannot be used with host partition fairshare. It is part of queue-level fairshare.
- Cross-queue fairshare cannot be used with absolute priority scheduling.

Default
Not defined

FILELIMIT

Syntax
FILELIMIT=integer

Description
The per-process (hard) file size limit (in KB) for all of the processes that belong to a job from this queue (see getrlimit(2)).

Default
Unlimited

HIST_HOURS

Syntax
HIST_HOURS=hours

Description
Used only with fairshare scheduling. Determines a rate of decay for cumulative CPU time, run time, and historical run time.

To calculate dynamic user priority, LSF uses a decay factor to scale the actual CPU time and run time. One hour of recently used time is equivalent to 0.1 hours after the specified number of hours elapses.
To calculate dynamic user priority with decayed run time and historical run time, LSF uses the same decay factor to scale the accumulated run time of finished jobs and the run time of running jobs, so that one hour of recently used time is equivalent to 0.1 hours after the specified number of hours elapses.
When HIST_HOURS=0, CPU time and run time that is accumulated by running jobs is not decayed.
If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.
Default
Not defined.

HJOB_LIMIT
Syntax
HJOB_LIMIT=integer
Description
Per-host job slot limit. Maximum number of job slots that this queue can use on any host. This limit is configured per host, regardless of the number of processors it might have.
Example
The following queue runs a maximum of one job on each of hostA, hostB, and hostC:
Begin Queue
...
HJOB_LIMIT = 1
HOSTS=hostA hostB hostC
...
End Queue
Default
Unlimited

HOST_POST_EXEC
Syntax
HOST_POST_EXEC=command
Description
Enables host-based post-execution processing at the queue level. The HOST_POST_EXEC command runs on all execution hosts after the job finishes. If job-based post-execution POST_EXEC was defined at the queue level, application level, or job level, the HOST_POST_EXEC command runs after POST_EXEC of any level.
Host-based post-execution commands can be configured at the queue and application level, and run in the following order:
1. The application-level command
2. The queue-level command
The supported command rule is the same as the existing POST_EXEC for the queue section. See the POST_EXEC topic for details.
Note: The host-based post-execution command cannot run on Windows systems.
This parameter cannot be used to configure job-based post-execution processing.
Default
Not defined.

HOST_PRE_EXEC
Syntax
HOST_PRE_EXEC=command
Description
Enables host-based pre-execution processing at the queue level.
The HOST_PRE_EXEC command runs on all execution hosts before the job starts. If job-based pre-execution PRE_EXEC was defined at the queue level, application level, or job level, the HOST_PRE_EXEC command runs before PRE_EXEC of any level.
Host-based pre-execution commands can be configured at the queue and application level, and run in the following order:
1. The queue-level command
2. The application-level command
The supported command rule is the same as the existing PRE_EXEC for the queue section. See the PRE_EXEC topic for details.
Note: The host-based pre-execution command cannot be executed on Windows platforms.
This parameter cannot be used to configure job-based pre-execution processing.
Default
Not defined.

HOSTLIMIT_PER_JOB
Syntax
HOSTLIMIT_PER_JOB=integer
Description
Per-job host limit. The maximum number of hosts that a job in this queue can use. LSF verifies the host limit during the allocation phase of scheduling. If the number of hosts requested for a parallel job exceeds this limit and LSF cannot satisfy the minimum number of requested slots, the parallel job pends. However, for resumed parallel jobs, this parameter does not stop the job from resuming, even if the job's host allocation exceeds the per-job host limit specified in this parameter.
Default
Unlimited

HOSTS
Syntax
HOSTS=host_list | none
v host_list is a space-separated list of the following items:
– host_name[@cluster_name][[!] | +pref_level]
– host_partition[+pref_level]
– host_group[[!] | +pref_level]
– compute_unit[[!] | +pref_level]
– [~]host_name
– [~]host_group
– [~]compute_unit
v The list can include the following items only once:
– all@cluster_name
– others[+pref_level]
– all
– allremote
v The none keyword is only used with the MultiCluster job forwarding model, to specify a remote-only queue.
Description
A space-separated list of hosts on which jobs from this queue can be run.
If compute units, host groups, or host partitions are included in the list, the job can run on any host in the unit, group, or partition. All the members of the host list should either belong to a single host partition or not belong to any host partition. Otherwise, job scheduling may be affected.
Some items can be followed by a plus sign (+) and a positive number to indicate the preference for dispatching a job to that host. A higher number indicates a higher preference. If a host preference is not given, it is assumed to be 0. If there are multiple candidate hosts, LSF dispatches the job to the host with the highest preference; hosts at the same level of preference are ordered by load.
If compute units, host groups, or host partitions are assigned a preference, each host in the unit, group, or partition has the same preference.
Use the keyword others to include all hosts not explicitly listed.
Use the keyword all to include all hosts not explicitly excluded.
Use the keyword all@cluster_name hostgroup_name or allremote hostgroup_name to include lease-in hosts.
Use the not operator (~) to exclude hosts from the all specification in the queue. This is useful if you have a large cluster but only want to exclude a few hosts from the queue definition. The not operator can only be used with the all keyword. It is not valid with the keywords others and none. The not operator (~) can be used to exclude host groups.
For parallel jobs, specify first execution host candidates when you want to ensure that a host has the required resources or runtime environment to handle processes that run on the first execution host. To specify one or more hosts, host groups, or compute units as first execution host candidates, add the exclamation point (!) symbol after the name.
Follow these guidelines when you specify first execution host candidates:
v If you specify a compute unit or host group, you must first define the unit or group in the file lsb.hosts.
v Do not specify a dynamic host group as a first execution host.
v Do not specify all, allremote, others, or a host partition as a first execution host.
v Do not specify a preference (+) for a host identified by (!) as a first execution host candidate.
v For each parallel job, specify enough regular hosts to satisfy the CPU requirement for the job. Once LSF selects a first execution host for the current job, the other first execution host candidates
– Become unavailable to the current job
– Remain available to other jobs as either regular or first execution hosts
v You cannot specify first execution host candidates when you use the brun command.
Restriction: If you have enabled EGO, host groups and compute units are not honored.
With the MultiCluster resource leasing model, use the format host_name@cluster_name to specify a borrowed host. LSF does not validate the names of remote hosts. The keyword others indicates all local hosts not explicitly listed. The keyword all indicates all local hosts not explicitly excluded. Use the keyword allremote to specify all hosts borrowed from all remote clusters. Use all@cluster_name to specify the group of all hosts borrowed from one remote cluster. You cannot specify a host group or partition that includes remote resources, unless it uses the keyword allremote to include all remote hosts. You cannot specify a compute unit that includes remote resources.
With the MultiCluster resource leasing model, the not operator (~) can be used to exclude local hosts or host groups. You cannot use the not operator (~) with remote hosts.
Restriction: Hosts that participate in queue-based fairshare cannot be in a host partition.
Behavior with host intersection
Host preferences specified by bsub -m combine intelligently with the queue specification and advance reservation hosts. The jobs run on the hosts that are both specified at job submission and belong to the queue or have advance reservation.
Example 1
HOSTS=hostA+1 hostB hostC+1 hostD+3
This example defines three levels of preferences: run jobs on hostD as much as possible, otherwise run on either hostA or hostC if possible, otherwise run on hostB. Jobs should not run on hostB unless all other hosts are too busy to accept more jobs.
Example 2
HOSTS=hostD+1 others
Run jobs on hostD as much as possible, otherwise run jobs on the least-loaded host available.
With the MultiCluster resource leasing model, this queue does not use borrowed hosts.
Example 3
HOSTS=all ~hostA
Run jobs on all hosts in the cluster, except for hostA.
With the MultiCluster resource leasing model, this queue does not use borrowed hosts.
Example 4
HOSTS=Group1 ~hostA hostB hostC
Run jobs on hostB, hostC, and all hosts in Group1 except for hostA.
With the MultiCluster resource leasing model, this queue uses borrowed hosts if Group1 uses the keyword allremote.
Example 5
HOSTS=hostA! hostB+ hostC hostgroup1!
Runs parallel jobs using either hostA or a host defined in hostgroup1 as the first execution host. If the first execution host cannot run the entire job due to resource requirements, runs the rest of the job on hostB. If hostB is too busy to accept the job, or if hostB does not have enough resources to run the entire job, runs the rest of the job on hostC.
Example 6
HOSTS=computeunit1! hostB hostC
Runs parallel jobs using a host in computeunit1 as the first execution host. If the first execution host cannot run the entire job due to resource requirements, runs the rest of the job on other hosts in computeunit1, followed by hostB and finally hostC.
Example 7
HOSTS=hostgroup1!
computeunitA computeunitB computeunitC
Runs parallel jobs using a host in hostgroup1 as the first execution host. If additional hosts are required, runs the rest of the job on other hosts in the same compute unit as the first execution host, followed by hosts in the remaining compute units in the order they are defined in the lsb.hosts ComputeUnit section.
Default
all (the queue can use all hosts in the cluster, and every host has equal preference)
With the MultiCluster resource leasing model, this queue can use all local hosts, but no borrowed hosts.

IGNORE_DEADLINE
Syntax
IGNORE_DEADLINE=Y
Description
If Y, disables deadline constraint scheduling (starts all jobs regardless of deadline constraints).

IMPT_JOBBKLG
Syntax
IMPT_JOBBKLG=integer | infinit
Description
MultiCluster job forwarding model only.
Specifies the MultiCluster pending job limit for a receive-jobs queue. This represents the maximum number of MultiCluster jobs that can be pending in the queue; once the limit has been reached, the queue stops accepting jobs from remote clusters.
Use the keyword infinit to make the queue accept an unlimited number of pending MultiCluster jobs.
Default
50

IMPT_TASKBKLG
Syntax
IMPT_TASKBKLG=integer | infinit
Description
MultiCluster job forwarding model only.
Specifies the MultiCluster pending job task limit for a receive-jobs queue. In the submission cluster, if the total of requested job tasks and the number of imported pending tasks in the receiving queue is greater than IMPT_TASKBKLG, the queue stops accepting jobs from remote clusters, and the job is not forwarded to the receiving queue.
Specify an integer between 0 and 2147483646 for the number of tasks.
Use the keyword infinit to make the queue accept an unlimited number of pending MultiCluster job tasks.
Set IMPT_TASKBKLG to 0 to forbid any job from being forwarded to the receiving queue.
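For example, a receive-jobs queue that stops accepting forwarded work once 200 tasks are pending might look like the following sketch (the queue name, priority, remote cluster name, and limit are hypothetical values chosen for illustration):

```
Begin Queue
QUEUE_NAME    = recv_mc
PRIORITY      = 40
RCVJOBS_FROM  = cluster2
IMPT_TASKBKLG = 200
End Queue
```

With this configuration, a job requesting 50 tasks is not forwarded to this queue if 151 or more imported tasks are already pending in it.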
Note: IMPT_SLOTBKLG has been changed to IMPT_TASKBKLG, and the concept has changed from slot to task, as of LSF 9.1.3.
Default
infinit (The queue accepts an unlimited number of pending MultiCluster job tasks.)

INTERACTIVE
Syntax
INTERACTIVE=YES | NO | ONLY
Description
YES causes the queue to accept both interactive and non-interactive batch jobs, NO causes the queue to reject interactive batch jobs, and ONLY causes the queue to accept interactive batch jobs and reject non-interactive batch jobs.
Interactive batch jobs are submitted via bsub -I.
Default
YES. The queue accepts both interactive and non-interactive jobs.

INTERRUPTIBLE_BACKFILL
Syntax
INTERRUPTIBLE_BACKFILL=seconds
Description
Configures the interruptible backfill scheduling policy, which allows reserved job slots to be used by low priority small jobs that are terminated when the higher priority large jobs are about to start.
There can only be one interruptible backfill queue. It should be the lowest priority queue in the cluster.
Specify the minimum number of seconds for the job to be considered for backfilling. This minimal time slice depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created if a site has jobs of distinctively different classes.
An interruptible backfill job:
v Starts as a regular job and is killed when it exceeds the queue runtime limit, or
v Is started for backfill whenever there is a backfill time slice longer than the specified minimal time, and killed before the slot-reservation job is about to start
The queue RUNLIMIT corresponds to a maximum time slice for backfill, and should be configured so that the wait period for the new jobs submitted to the queue is acceptable to users. 10 minutes of runtime is a common value.
You should configure REQUEUE_EXIT_VALUES for interruptible backfill queues.
BACKFILL and RUNLIMIT must be configured in the queue.
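Putting the requirements above together, a sketch of an interruptible backfill queue could look like the following (the queue name, requeue exit value, and the 60-second minimal slice are hypothetical; RUNLIMIT is in minutes):

```
Begin Queue
QUEUE_NAME             = ib_queue
PRIORITY               = 1
INTERRUPTIBLE_BACKFILL = 60
BACKFILL               = Y
RUNLIMIT               = 10
REQUEUE_EXIT_VALUES    = 125
End Queue
```

Here, jobs are considered for backfill only when a time slice of at least 60 seconds is available, run for at most the 10-minute RUNLIMIT, and are requeued on the site-defined exit value when killed.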
The queue is disabled if BACKFILL and RUNLIMIT are not configured.
Assumptions and limitations:
v The interruptible backfill job holds the slot-reserving job start until its calculated start time, in the same way as a regular backfill job. The interruptible backfill job is not preempted in any way other than being killed when its time comes.
v While the queue is checked for the consistency of the interruptible backfill, backfill, and runtime specifications, the requeue exit value clause is not verified, nor executed automatically. Configure requeue exit values according to your site policies.
v The interruptible backfill job must be able to do at least one unit of useful calculations and save its data within the minimal time slice, and be able to continue its calculations after it has been restarted.
v The interruptible backfill paradigm does not explicitly prohibit running parallel jobs, distributed across multiple nodes; however, the chance of success of such a job is close to zero.
Default
Not defined. No interruptible backfilling.

JOB_ACCEPT_INTERVAL
Syntax
JOB_ACCEPT_INTERVAL=integer
Description
The number you specify is multiplied by the value of lsb.params MBD_SLEEP_TIME (60 seconds by default). The result of the calculation is the number of seconds to wait after dispatching a job to a host, before dispatching a second job to the same host.
If 0 (zero), a host may accept more than one job in each dispatch turn. By default, there is no limit to the total number of jobs that can run on a host, so if this parameter is set to 0, a very large number of jobs might be dispatched to a host all at once. This can overload your system to the point that it is unable to create any more processes. It is not recommended to set this parameter to 0.
JOB_ACCEPT_INTERVAL set at the queue level (lsb.queues) overrides JOB_ACCEPT_INTERVAL set at the cluster level (lsb.params).
Note: The parameter JOB_ACCEPT_INTERVAL only applies when there are running jobs on a host. A host running a short job that finishes before JOB_ACCEPT_INTERVAL has elapsed is free to accept a new job without waiting.
Default
Not defined. The queue uses JOB_ACCEPT_INTERVAL defined in lsb.params, which has a default value of 1.

JOB_ACTION_WARNING_TIME
Syntax
JOB_ACTION_WARNING_TIME=[hour:]minute
Description
Specifies the amount of time before a job control action occurs that a job warning action is to be taken. For example, 2 minutes before the job reaches its runtime limit or termination deadline, or the queue's run window is closed, an URG signal is sent to the job.
Job action warning time is not normalized.
A job action warning time must be specified with a job warning action in order for job warning to take effect.
The warning time specified by the bsub -wt option overrides JOB_ACTION_WARNING_TIME in the queue. JOB_ACTION_WARNING_TIME is used as the default when no command line option is specified.
Example
JOB_ACTION_WARNING_TIME=2
Default
Not defined

JOB_CONTROLS
Syntax
JOB_CONTROLS=SUSPEND[signal | command | CHKPNT] RESUME[signal | command] TERMINATE[signal | command | CHKPNT]
v signal is a UNIX signal name (for example, SIGTSTP or SIGTERM). The specified signal is sent to the job. The same set of signals is not supported on all UNIX systems. To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.
v command specifies a /bin/sh command line to be invoked.
Restriction: Do not quote the command line inside an action definition. Do not specify a signal followed by an action that triggers the same signal. For example, do not specify JOB_CONTROLS=TERMINATE[bkill] or JOB_CONTROLS=TERMINATE[brequeue]. This causes a deadlock between the signal and the action.
v CHKPNT is a special action, which causes the system to checkpoint the job.
Only valid for SUSPEND and TERMINATE actions:
– If the SUSPEND action is CHKPNT, the job is checkpointed and then stopped by sending the SIGSTOP signal to the job automatically.
– If the TERMINATE action is CHKPNT, then the job is checkpointed and killed automatically.
Description
Changes the behavior of the SUSPEND, RESUME, and TERMINATE actions in LSF.
v The contents of the configuration line for the action are run with /bin/sh -c, so you can use shell features in the command.
v The standard input, output, and error of the command are redirected to the NULL device, so you cannot tell directly whether the command runs correctly. The default null device on UNIX is /dev/null.
v The command is run as the user of the job.
v All environment variables set for the job are also set for the command action. The following additional environment variables are set:
– LSB_JOBPGIDS: a list of current process group IDs of the job
– LSB_JOBPIDS: a list of current process IDs of the job
v For the SUSPEND action command, the following environment variables are also set:
– LSB_SUSP_REASONS: an integer representing a bitmap of suspending reasons as defined in lsbatch.h. The suspending reason can allow the command to take different actions based on the reason for suspending the job.
– LSB_SUSP_SUBREASONS: an integer representing the load index that caused the job to be suspended. When the suspending reason SUSP_LOAD_REASON (suspended by load) is set in LSB_SUSP_REASONS, LSB_SUSP_SUBREASONS is set to one of the load index values defined in lsf.h. Use LSB_SUSP_REASONS and LSB_SUSP_SUBREASONS together in your custom job control to determine the exact load threshold that caused a job to be suspended.
v If an additional action is necessary for the SUSPEND command, that action should also send the appropriate signal to the application. Otherwise, a job can continue to run even after being suspended by LSF.
For example, JOB_CONTROLS=SUSPEND[kill $LSB_JOBPIDS; command]
v If the job control command fails, LSF retains the original job status.
v If you set preemption with the signal SIGTSTP and you use IBM Platform License Scheduler, define LIC_SCHED_PREEMPT_STOP=Y in lsf.conf for License Scheduler preemption to work.
Default
On UNIX, by default, SUSPEND sends SIGTSTP for parallel or interactive jobs and SIGSTOP for other jobs. RESUME sends SIGCONT. TERMINATE sends SIGINT, SIGTERM, and SIGKILL in that order.
On Windows, actions equivalent to the UNIX signals have been implemented to do the default job control actions. Job control messages replace the SIGINT and SIGTERM signals, but only customized applications are able to process them. Termination is implemented by the TerminateProcess() system call.

JOB_IDLE
Syntax
JOB_IDLE=number
Description
Specifies a threshold for idle job exception handling. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes LSF_SERVERDIR/eadmin to trigger the action for a job idle exception.
The minimum job run time before mbatchd reports that the job is idle is defined as DETECT_IDLE_JOB_AFTER in lsb.params.
Valid values
Any positive number between 0.0 and 1.0
Example
JOB_IDLE=0.10
A job idle exception is triggered for jobs with an idle value (CPU time/runtime) less than 0.10.
Default
Not defined. No job idle exceptions are detected.

JOB_OVERRUN
Syntax
JOB_OVERRUN=run_time
Description
Specifies a threshold for job overrun exception handling. If a job runs longer than the specified run time, LSF invokes LSF_SERVERDIR/eadmin to trigger the action for a job overrun exception.
Example
JOB_OVERRUN=5
A job overrun exception is triggered for jobs running longer than 5 minutes.
Default
Not defined. No job overrun exceptions are detected.

JOB_SIZE_LIST
Syntax
JOB_SIZE_LIST=default_size [size ...]
Description
A list of job sizes (number of tasks) that are allowed on this queue.
When submitting a job or modifying a pending job that requests a job size by using the -n or -R options for bsub and bmod, the requested job size must be a single fixed value that matches one of the values that JOB_SIZE_LIST specifies, which are the job sizes that are allowed on this queue. LSF rejects the job if the requested job size is not in this list. In addition, when using bswitch to switch a pending job with a requested job size to another queue, the requested job size in the pending job must also match one of the values in JOB_SIZE_LIST for the new queue.
The first value in this list is the default job size, which is the assigned job size request if the job was submitted without requesting one. The remaining values are the other job sizes allowed in the queue, and may be defined in any order.
When defined in both a queue and an application profile (lsb.applications), the job size request must satisfy both requirements. In addition, JOB_SIZE_LIST overrides any TASKLIMIT parameters defined at the same level. Job size requirements do not apply to queues and application profiles with no job size lists, nor do they apply to other levels of job submissions (that is, host-level or cluster-level job submissions).
Note: An exclusive job may allocate more slots on the host than is required by the tasks. For example, if JOB_SIZE_LIST=8 and an exclusive job requesting -n8 runs on a 16-slot host, all 16 slots are assigned to the job. The job runs as expected, since the 8 tasks specified for the job matches the job size list.
Valid values
A space-separated list of positive integers between 1 and 2147483646.
Default
Undefined

JOB_STARTER
Syntax
JOB_STARTER=starter [starter] ["%USRCMD"] [starter]
Description
Creates a specific environment for submitted jobs prior to execution.
starter is any executable that can be used to start the job (that is, it can accept the job as an input argument). Optionally, additional strings can be specified.
By default, the user commands run after the job starter. A special string, %USRCMD, can be used to represent the position of the user's job in the job starter command line. The %USRCMD string and any additional commands must be enclosed in quotation marks (" ").
If your job starter script runs on a Windows execution host and includes symbols (like & or |), you can use the JOB_STARTER_EXTEND=preservestarter parameter in lsf.conf and set JOB_STARTER=preservestarter in lsb.queues. A customized userstarter can also be used.
Example
JOB_STARTER=csh -c "%USRCMD;sleep 10"
In this case, if a user submits a job
% bsub myjob arguments
the command that actually runs is:
% csh -c "myjob arguments;sleep 10"
Default
Not defined. No job starter is used.

JOB_UNDERRUN
Syntax
JOB_UNDERRUN=run_time
The warning action specified by the bsub -wa option overrides JOB_WARNING_ACTION in the queue. JOB_WARNING_ACTION is used as the default when no command line option is specified. Example JOB_WARNING_ACTION=URG Default Not defined LOAD_INDEX Syntax load_index=loadSched[/loadStop] Specify io, it, ls, mem, pg, r15s, r1m, r15m, swp, tmp, ut, or a non-shared custom external load index. Specify multiple lines to configure thresholds for multiple load indices. Chapter 1. Configuration Files 255 lsb.queues Specify io, it, ls, mem, pg, r15s, r1m, r15m, swp, tmp, ut, or a non-shared custom external load index as a column. Specify multiple columns to configure thresholds for multiple load indices. Description Scheduling and suspending thresholds for the specified dynamic load index. The loadSched condition must be satisfied before a job is dispatched to the host. If a RESUME_COND is not specified, the loadSched condition must also be satisfied before a suspended job can be resumed. If the loadStop condition is satisfied, a job on the host is suspended. The loadSched and loadStop thresholds permit the specification of conditions using simple AND/OR logic. Any load index that does not have a configured threshold has no effect on job scheduling. LSF does not suspend a job if the job is the only batch job running on the host and the machine is interactively idle (it>0). The r15s, r1m, and r15m CPU run queue length conditions are compared to the effective queue length as reported by lsload -E, which is normalized for multiprocessor hosts. Thresholds for these parameters should be set at appropriate levels for single processor hosts. Example MEM=100/10 SWAP=200/30 These two lines translate into a loadSched condition of mem>=100 && swap>=200 and a loadStop condition of mem < 10 || swap < 30 Default Not defined LOCAL_MAX_PREEXEC_RETRY Syntax LOCAL_MAX_PREEXEC_RETRY=integer Description The maximum number of times to attempt the pre-execution command of a job on the local cluster. 
When this limit is reached, the default behavior of the job is defined by the LOCAL_MAX_PREEXEC_RETRY_ACTION parameter in lsb.params, lsb.queues, or lsb.applications.
Valid values
0 < LOCAL_MAX_PREEXEC_RETRY < INFINIT_INT
INFINIT_INT is defined in lsf.h.
Default
Not defined. The number of pre-execution retry times is unlimited.
See also
LOCAL_MAX_PREEXEC_RETRY_ACTION in lsb.params, lsb.queues, and lsb.applications.

LOCAL_MAX_PREEXEC_RETRY_ACTION
Syntax
LOCAL_MAX_PREEXEC_RETRY_ACTION=SUSPEND | EXIT
Description
The default behavior of a job when it reaches the maximum number of times to attempt its pre-execution command on the local cluster (LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, or lsb.applications).
v If set to SUSPEND, the local or leased job is suspended and its status is set to PSUSP.
v If set to EXIT, the local or leased job exits and its status is set to EXIT. The job exits with the same exit code as the last pre-execution fail exit code.
This parameter is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). The action specified in lsb.applications overrides lsb.queues, and lsb.queues overrides the lsb.params configuration.
Default
Not defined. If not defined in lsb.params, the default action is SUSPEND.
See also
LOCAL_MAX_PREEXEC_RETRY in lsb.params, lsb.queues, and lsb.applications.

MANDATORY_EXTSCHED
Syntax
MANDATORY_EXTSCHED=external_scheduler_options
Description
Specifies mandatory external scheduling options for the queue.
-extsched options on the bsub command are merged with MANDATORY_EXTSCHED options, and MANDATORY_EXTSCHED options override any conflicting job-level options set by -extsched.
Default
Not defined

MAX_JOB_PREEMPT
Syntax
MAX_JOB_PREEMPT=integer
Description
The maximum number of times a job can be preempted.
Applies to queue-based preemption only.
Valid values
0 < MAX_JOB_PREEMPT < INFINIT_INT
INFINIT_INT is defined in lsf.h.
Default
Not defined. The number of preemption times is unlimited.

MAX_JOB_REQUEUE
Syntax
MAX_JOB_REQUEUE=integer
Description
The maximum number of times to requeue a job automatically.
Valid values
0 < MAX_JOB_REQUEUE < INFINIT_INT
INFINIT_INT is defined in lsf.h.
Default
Not defined. The number of requeue times is unlimited.

MAX_PREEXEC_RETRY
Syntax
MAX_PREEXEC_RETRY=integer
Description
Use REMOTE_MAX_PREEXEC_RETRY instead. This parameter is maintained for backwards compatibility.
MultiCluster job forwarding model only. The maximum number of times to attempt the pre-execution command of a job from a remote cluster. If the job's pre-execution command fails all attempts, the job is returned to the submission cluster.
Valid values
0 < MAX_PREEXEC_RETRY < INFINIT_INT
INFINIT_INT is defined in lsf.h.
Default
5

MAX_PROTOCOL_INSTANCES
Syntax
MAX_PROTOCOL_INSTANCES=integer
Description
For LSF IBM Parallel Environment (PE) integration. Specify the number of parallel communication paths (windows) available to the protocol on each network. If the number of windows specified for the job (with the instances option of bsub -network, or the NETWORK_REQ parameter in lsb.queues or lsb.applications) is greater than the specified maximum value, LSF rejects the job.
Specify MAX_PROTOCOL_INSTANCES in a queue (lsb.queues) or cluster-wide in lsb.params. The value specified in a queue overrides the value specified in lsb.params.
LSF_PE_NETWORK_NUM must be defined to a non-zero value in lsf.conf for MAX_PROTOCOL_INSTANCES to take effect and for LSF to run PE jobs. If LSF_PE_NETWORK_NUM is not defined or is set to 0, the value of MAX_PROTOCOL_INSTANCES is ignored with a warning message.
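As a sketch of the two settings working together (the queue name and the values are hypothetical, chosen only for illustration), PE support is enabled in lsf.conf and the per-queue window limit is set in lsb.queues:

```
# lsf.conf: enable the PE integration (non-zero number of networks)
LSF_PE_NETWORK_NUM=2

# lsb.queues: reject PE jobs in this queue that request more
# than 2 windows per network
Begin Queue
QUEUE_NAME             = pe_queue
MAX_PROTOCOL_INSTANCES = 2
End Queue
```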
For best performance, set MAX_PROTOCOL_INSTANCES so that the communication subsystem uses every available adapter before it reuses any of the adapters.
Default
No default value

MAX_RSCHED_TIME
Syntax
MAX_RSCHED_TIME=integer | infinit
Description
MultiCluster job forwarding model only. Determines how long a MultiCluster job stays pending in the execution cluster before returning to the submission cluster. The remote timeout limit in seconds is:
MAX_RSCHED_TIME * MBD_SLEEP_TIME=timeout
Specify infinit to disable remote timeout (jobs always get dispatched in the correct FCFS order because MultiCluster jobs never get rescheduled, but MultiCluster jobs can be pending in the receive-jobs queue forever instead of being rescheduled to a better queue).
Note: This parameter applies to the queue in the submission cluster only. It is ignored by the receiving queue.
The remote timeout limit never affects advance reservation jobs. Jobs that use an advance reservation always behave as if remote timeout is disabled.
Default
20 (20 minutes by default)

MAX_SLOTS_IN_POOL
Syntax
MAX_SLOTS_IN_POOL=integer
Description
Queue-based fairshare only. Maximum number of job slots available in the slot pool the queue belongs to for queue-based fairshare.
Defined in the first queue of the slot pool. Definitions in subsequent queues have no effect.
When defined together with other slot limits (QJOB_LIMIT, HJOB_LIMIT, or UJOB_LIMIT in lsb.queues, or queue limits in lsb.resources), the lowest limit defined applies.
When MAX_SLOTS_IN_POOL, SLOT_RESERVE, and BACKFILL are defined for the same queue, jobs in the queue cannot backfill using slots reserved by other jobs in the same queue.
Valid values
MAX_SLOTS_IN_POOL can be any number from 0 to INFINIT_INT, where INFINIT_INT is defined in lsf.h.
Default
Not defined

MAX_TOTAL_TIME_PREEMPT

Syntax
MAX_TOTAL_TIME_PREEMPT=integer

Description
The accumulated preemption time in minutes after which a job cannot be preempted again, where minutes is wall-clock time, not normalized time.

Setting the parameter of the same name in lsb.applications overrides this parameter; setting this parameter overrides the parameter of the same name in lsb.params.

Valid values
Any positive integer greater than or equal to one (1)

Default
Unlimited

MEMLIMIT

Syntax
MEMLIMIT=[default_limit] maximum_limit

Description
The per-process (hard) process resident set size limit (in KB) for all of the processes belonging to a job from this queue (see getrlimit(2)). Sets the maximum amount of physical memory (resident set size, RSS) that may be allocated to a process.

By default, if a default memory limit is specified, jobs submitted to the queue without a job-level memory limit are killed when the default memory limit is reached. If you specify only one limit, it is the maximum, or hard, memory limit. If you specify two limits, the first one is the default, or soft, memory limit, and the second one is the maximum memory limit.

LSF has two methods of enforcing memory usage:
v OS Memory Limit Enforcement
v LSF Memory Limit Enforcement

OS memory limit enforcement
OS memory limit enforcement is the default MEMLIMIT behavior and does not require further configuration. OS enforcement usually allows the process to eventually run to completion. LSF passes MEMLIMIT to the OS, which uses it as a guide for the system scheduler and memory allocator. The system may allocate more memory to a process if there is a surplus. When memory is low, the system takes memory from, and lowers the scheduling priority of (re-nices), any process that has exceeded its declared MEMLIMIT. Only available on systems that support RLIMIT_RSS for setrlimit().
Not supported on:
v Sun Solaris 2.x
v Windows

LSF memory limit enforcement
To enable LSF memory limit enforcement, set LSB_MEMLIMIT_ENFORCE in lsf.conf to y. LSF memory limit enforcement explicitly sends a signal to kill a running process once it has allocated memory past MEMLIMIT.

You can also enable LSF memory limit enforcement by setting LSB_JOB_MEMLIMIT in lsf.conf to y. The difference between LSB_JOB_MEMLIMIT set to y and LSB_MEMLIMIT_ENFORCE set to y is that with LSB_JOB_MEMLIMIT, only the per-job memory limit enforced by LSF is enabled; the per-process memory limit enforced by the OS is disabled. With LSB_MEMLIMIT_ENFORCE set to y, both the per-job memory limit enforced by LSF and the per-process memory limit enforced by the OS are enabled.

Available for all systems on which LSF collects total memory usage.

Example
The following configuration defines a queue with a memory limit of 5000 KB:
Begin Queue
QUEUE_NAME  = default
DESCRIPTION = Queue with memory limit of 5000 kbytes
MEMLIMIT    = 5000
End Queue

Default
Unlimited

MIG

Syntax
MIG=minutes

Description
Enables automatic job migration and specifies the migration threshold for checkpointable or rerunnable jobs, in minutes. LSF automatically migrates jobs that have been in the SSUSP state for more than the specified number of minutes. Specify a value of 0 to migrate jobs immediately upon suspension. The migration threshold applies to all jobs running on the host.

A job-level command-line migration threshold overrides threshold configuration in the application profile and queue. Application profile configuration overrides queue-level configuration.

When a host migration threshold is specified, and is lower than the value for the job, the queue, or the application, the host value is used.

Members of a chunk job can be migrated. Chunk jobs in WAIT state are removed from the job chunk and put into PEND state.
Does not affect MultiCluster jobs that are forwarded to a remote cluster.

Default
Not defined. LSF does not migrate checkpointable or rerunnable jobs automatically.

NETWORK_REQ

Syntax
NETWORK_REQ="network_res_req"

network_res_req has the following syntax:
[type=sn_all | sn_single] [:protocol=protocol_name[(protocol_number)][,protocol_name[(protocol_number)]]] [:mode=US | IP] [:usage=dedicated | shared] [:instance=positive_integer]

Description
For LSF IBM Parallel Environment (PE) integration. Specifies the network resource requirements for a PE job. If any network resource requirement is specified in the job, queue, or application profile, the job is treated as a PE job. PE jobs can only run on hosts where the IBM PE pnsd daemon is running.

The network resource requirement string network_res_req has the same syntax as the bsub -network option.

The -network bsub option overrides the value of NETWORK_REQ defined in lsb.queues or lsb.applications. The value of NETWORK_REQ defined in lsb.applications overrides queue-level NETWORK_REQ defined in lsb.queues.

The following IBM LoadLeveler job command file options are not supported in LSF:
v collective_groups
v imm_send_buffers
v rcxtblocks

The following network resource requirement options are supported:

type=sn_all | sn_single
Specifies the adapter device type to use for message passing: either sn_all or sn_single.

sn_single
When used for switch adapters, specifies that all windows are on a single network.

sn_all
Specifies that one or more windows are on each network, and that striped communication should be used over all available switch networks. The networks specified must be accessible by all hosts selected to run the PE job. See the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information about submitting jobs that use striping.
If mode is IP and type is specified as sn_all or sn_single, the job will only run on InfiniBand (IB) adapters (IPoIB). If mode is IP and type is not specified, the job will only run on Ethernet adapters (IPoEth). For IPoEth jobs, LSF ensures the job is running on hosts where pnsd is installed and running. For IPoIB jobs, LSF ensures the job is running on hosts where pnsd is installed and running, and that IB networks are up. Because IP jobs do not consume network windows, LSF does not check if all network windows are used up or the network is already occupied by a dedicated PE job.

Equivalent to the PE MP_EUIDEVICE environment variable and -euidevice PE flag. See the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information. Only sn_all or sn_single are supported by LSF. The other types supported by PE are not supported for LSF jobs.

protocol=protocol_name[(protocol_number)]
Network communication protocol for the PE job, indicating which message passing API is being used by the application. The following protocols are supported by LSF:

mpi
The application makes only MPI calls. This value applies to any MPI job regardless of the library that it was compiled with (PE MPI, MPICH2).

pami
The application makes only PAMI calls.

lapi
The application makes only LAPI calls.

shmem
The application makes only OpenSHMEM calls.

user_defined_parallel_api
The application makes only calls from a parallel API that you define. For example: protocol=myAPI or protocol=charm.

The default value is mpi.

LSF also supports an optional protocol_number (for example, mpi(2)), which specifies the number of contexts (endpoints) per parallel API instance. The number must be a power of 2, but no greater than 128 (1, 2, 4, 8, 16, 32, 64, 128). LSF will pass the communication protocols to PE without any change. LSF will reserve network windows for each protocol.
When you specify multiple parallel API protocols, you cannot make calls to both LAPI and PAMI (lapi, pami) or LAPI and OpenSHMEM (lapi, shmem) in the same application. Protocols can be specified in any order.

See the MP_MSG_API and MP_ENDPOINTS environment variables and the -msg_api and -endpoints PE flags in the Parallel Environment Runtime Edition for AIX: Operation and Use guide (SC23-6781-05) for more information about the communication protocols that are supported by IBM Parallel Environment.

mode=US | IP
The network communication system mode used by the specified communication protocol: US (User Space) or IP (Internet Protocol). A US job can only run with adapters that support user space communications, such as the IB adapter. IP jobs can run with either Ethernet adapters or IB adapters. When IP mode is specified, the instance number cannot be specified, and network usage must be unspecified or shared.

Each instance in US mode requested by a task running on switch adapters requires an adapter window. For example, if a task requests both the MPI and LAPI protocols such that both protocol instances require US mode, two adapter windows will be used.

The default value is US.

usage=dedicated | shared
Specifies whether the adapter can be shared with tasks of other job steps: dedicated or shared. Multiple tasks of the same job can share one network even if usage is dedicated.

The default usage is shared.

instances=positive_integer
The number of parallel communication paths (windows) per task made available to the protocol on each network. The number actually used depends on the implementation of the protocol subsystem.

The default value is 1.

If the specified value is greater than MAX_PROTOCOL_INSTANCES in lsb.params or lsb.queues, LSF rejects the job.

LSF_PE_NETWORK_NUM must be defined to a non-zero value in lsf.conf for NETWORK_REQ to take effect.
If LSF_PE_NETWORK_NUM is not defined or is set to 0, NETWORK_REQ is ignored with a warning message.

Example
The following network resource requirement string specifies the requirements for an sn_all job (one or more windows are on each network, and striped communication should be used over all available switch networks). The PE job uses MPI API calls (protocol), runs in user-space network communication system mode, and requires 1 parallel communication path (window) per task.
NETWORK_REQ = "protocol=mpi:mode=us:instance=1:type=sn_all"

Default
No default value, but if you specify no value (NETWORK_REQ=""), the job uses the following in the queue: protocol=mpi:mode=US:usage=shared:instance=1

NEW_JOB_SCHED_DELAY

Syntax
NEW_JOB_SCHED_DELAY=seconds

Description
The number of seconds that a new job waits before being scheduled. A value of zero (0) means the job is scheduled without any delay. The scheduler still periodically fetches jobs from mbatchd; once it gets jobs, the scheduler schedules them without any delay. This may speed up job scheduling somewhat, but it also generates some communication overhead. Therefore, you should only set it to 0 for high-priority, urgent, or interactive queues with small workloads.

If NEW_JOB_SCHED_DELAY is set to a non-zero value, the scheduler periodically fetches new jobs from mbatchd, after which it sets the job scheduling time to job submission time + NEW_JOB_SCHED_DELAY.

Default
0 seconds

NICE

Syntax
NICE=integer

Description
Adjusts the UNIX scheduling priority at which jobs from this queue execute. The default value of 0 (zero) maintains the default scheduling priority for UNIX interactive jobs. This value adjusts the run-time priorities for batch jobs on a queue-by-queue basis, to control their effect on other batch or interactive jobs. See the nice(1) manual page for more details.
On Windows, this value is mapped to Windows process priority classes as follows:
v nice>=0 corresponds to a priority class of IDLE
v nice<0 corresponds to a priority class of NORMAL

LSF on Windows does not support HIGH or REAL-TIME priority classes.

This value is overwritten by the NICE setting in lsb.applications, if defined.

Default
0 (zero)

NO_PREEMPT_INTERVAL

Syntax
NO_PREEMPT_INTERVAL=minutes

Description
Prevents preemption of jobs for the specified number of minutes of uninterrupted run time, where minutes is wall-clock time, not normalized time. NO_PREEMPT_INTERVAL=0 allows immediate preemption of jobs as soon as they start or resume running.

Setting the parameter of the same name in lsb.applications overrides this parameter; setting this parameter overrides the parameter of the same name in lsb.params.

Default
0

PJOB_LIMIT

Syntax
PJOB_LIMIT=float

Description
Per-processor job slot limit for the queue. Maximum number of job slots that this queue can use on any processor. This limit is configured per processor, so that multiprocessor hosts automatically run more jobs.

Default
Unlimited

POST_EXEC

Syntax
POST_EXEC=command

Description
Enables post-execution processing at the queue level. The POST_EXEC command runs on the execution host after the job finishes. Post-execution commands can be configured at the application and queue levels. Application-level post-execution commands run before queue-level post-execution commands.

The POST_EXEC command uses the same environment variable values as the job, and, by default, runs under the user account of the user who submits the job. To run post-execution commands under a different user account (such as root for privileged operations), configure the parameter LSB_PRE_POST_EXEC_USER in lsf.sudoers.

When a job exits with one of the queue's REQUEUE_EXIT_VALUES, LSF requeues the job and sets the environment variable LSB_JOBPEND.
The post-execution command runs after the requeued job finishes.

When the post-execution command is run, the environment variable LSB_JOBEXIT_STAT is set to the exit status of the job. If the execution environment for the job cannot be set up, LSB_JOBEXIT_STAT is set to 0 (zero).

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

For UNIX:
v The pre- and post-execution commands run in the /tmp directory under /bin/sh -c, which allows the use of shell features in the commands. The following example shows valid configuration lines:
PRE_EXEC= /usr/share/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC= /usr/share/lsf/misc/testq_post | grep -v "Hey!"
v LSF sets the PATH environment variable to PATH='/bin /usr/bin /sbin /usr/sbin'
v The stdin, stdout, and stderr are set to /dev/null
v To allow UNIX users to define their own post-execution commands, an LSF administrator specifies the environment variable $USER_POSTEXEC as the POST_EXEC command. A user then defines the post-execution command:
setenv USER_POSTEXEC /path_name
Note: The path name for the post-execution command must be an absolute path. Do not define POST_EXEC=$USER_POSTEXEC when LSB_PRE_POST_EXEC_USER=root. This parameter cannot be used to configure host-based post-execution processing.

For Windows:
v The pre- and post-execution commands run under cmd.exe /c
v The standard input, standard output, and standard error are set to NULL
v The PATH is determined by the setup of the LSF Service

Note: For post-execution commands that execute on a Windows Server 2003, x64 Edition platform, users must have read and execute privileges for cmd.exe.

Default
Not defined. No post-execution commands are associated with the queue.

PRE_EXEC

Syntax
PRE_EXEC=command

Description
Enables pre-execution processing at the queue level.
The PRE_EXEC command runs on the execution host before the job starts. If the PRE_EXEC command exits with a non-zero exit code, LSF requeues the job to the front of the queue.

Pre-execution commands can be configured at the queue, application, and job levels and run in the following order:
1. The queue-level command
2. The application-level or job-level command. If you specify a command at both the application and job levels, the job-level command overrides the application-level command; the application-level command is ignored.

The PRE_EXEC command uses the same environment variable values as the job, and runs under the user account of the user who submits the job. To run pre-execution commands under a different user account (such as root for privileged operations), configure the parameter LSB_PRE_POST_EXEC_USER in lsf.sudoers.

The command path can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory, file name, and expanded values for %J (job_ID) and %I (index_ID).

For UNIX:
v The pre- and post-execution commands run in the /tmp directory under /bin/sh -c, which allows the use of shell features in the commands. The following example shows valid configuration lines:
PRE_EXEC= /usr/share/lsf/misc/testq_pre >> /tmp/pre.out
POST_EXEC= /usr/share/lsf/misc/testq_post | grep -v "Hey!"
v LSF sets the PATH environment variable to PATH='/bin /usr/bin /sbin /usr/sbin'
v The stdin, stdout, and stderr are set to /dev/null

For Windows:
v The pre- and post-execution commands run under cmd.exe /c
v The standard input, standard output, and standard error are set to NULL
v The PATH is determined by the setup of the LSF Service

Note: For pre-execution commands that execute on a Windows Server 2003, x64 Edition platform, users must have read and execute privileges for cmd.exe.

This parameter cannot be used to configure host-based pre-execution processing.
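As an illustration, a queue might use a pre-execution command to verify that a shared file system is available before the job starts (the queue name and script path are hypothetical; because a non-zero exit code requeues the job, the script should exit 0 only when the host is ready):

```
Begin Queue
QUEUE_NAME = data_jobs
PRE_EXEC   = /usr/local/lsf/scripts/check_mount.sh
End Queue
```

In this sketch, check_mount.sh would be a short /bin/sh script that tests for the mount point (for example, test -d /shared/data) and exits non-zero otherwise, causing LSF to requeue the job to the front of the queue rather than run it on an unprepared host.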
Default
Not defined. No pre-execution commands are associated with the queue.

PREEMPTION

Syntax
PREEMPTION=PREEMPTIVE[[low_queue_name[+pref_level]...]]
PREEMPTION=PREEMPTABLE[[hi_queue_name...]]
PREEMPTION=PREEMPTIVE[[low_queue_name[+pref_level]...]] PREEMPTABLE[[hi_queue_name...]]

Description
PREEMPTIVE
Enables preemptive scheduling and defines this queue as preemptive. Jobs in this queue preempt jobs from the specified lower-priority queues, or from all lower-priority queues if the parameter is specified with no queue names. PREEMPTIVE can be combined with PREEMPTABLE to specify that jobs in this queue can preempt jobs in lower-priority queues, and can be preempted by jobs in higher-priority queues.

PREEMPTABLE
Enables preemptive scheduling and defines this queue as preemptable. Jobs in this queue can be preempted by jobs from specified higher-priority queues, or from all higher-priority queues, even if the higher-priority queues are not preemptive. PREEMPTABLE can be combined with PREEMPTIVE to specify that jobs in this queue can be preempted by jobs in higher-priority queues, and can preempt jobs in lower-priority queues.

low_queue_name
Specifies the names of lower-priority queues that can be preempted. To specify multiple queues, separate the queue names with a space, and enclose the list in a single set of square brackets.

+pref_level
Specifies to preempt this queue before preempting other queues. When multiple queues are indicated with a preference level, an order of preference is indicated: queues with higher relative preference levels are preempted before queues with lower relative preference levels set.

hi_queue_name
Specifies the names of higher-priority queues that can preempt jobs in this queue. To specify multiple queues, separate the queue names with a space and enclose the list in a single set of square brackets.
Example: configure selective, ordered preemption across queues
The following example defines four queues, as follows:
v high
– Has the highest relative priority of 99
– Jobs from this queue can preempt jobs from all other queues
v medium
– Has the second-highest relative priority, at 10
– Jobs from this queue can preempt jobs from normal and low queues, beginning with jobs from low, as indicated by the preference (+1)
v normal
– Has the second-lowest relative priority, at 5
– Jobs from this queue can preempt jobs from low, and can be preempted by jobs from both high and medium queues
v low
– Has the lowest relative priority, which is also the default priority, at 1
– Jobs from this queue can be preempted by jobs from all preemptive queues, even though it does not have the PREEMPTABLE keyword set

Begin Queue
QUEUE_NAME=high
PREEMPTION=PREEMPTIVE
PRIORITY=99
End Queue

Begin Queue
QUEUE_NAME=medium
PREEMPTION=PREEMPTIVE[normal low+1]
PRIORITY=10
End Queue

Begin Queue
QUEUE_NAME=normal
PREEMPTION=PREEMPTIVE[low] PREEMPTABLE[high medium]
PRIORITY=5
End Queue

Begin Queue
QUEUE_NAME=low
PRIORITY=1
End Queue

PREEMPT_DELAY

Syntax
PREEMPT_DELAY=seconds

Description
Preemptive jobs wait the specified number of seconds from the submission time before preempting any lower-priority preemptable jobs. During the grace period, preemption is not triggered, but the job can be scheduled and dispatched by other scheduling policies.

This feature provides flexibility to tune the system to reduce the number of preemptions, improving performance and job throughput. When the low-priority jobs are short, if high-priority jobs can wait a while for the low-priority jobs to finish, preemption can be avoided and cluster performance is improved. If the job is still pending after the grace period has expired, preemption is triggered.

The waiting time is for preemptive jobs in the pending status only.
It does not impact preemptive jobs that are suspended.

The time is counted from the submission time of the jobs. The submission time means the time mbatchd accepts a job, which includes newly submitted jobs, restarted jobs (by brestart), and jobs forwarded from a remote cluster.

When the preemptive job is waiting, the pending reason is:
The preemptive job is allowing a grace period before preemption.

If you use an older version of bjobs, the pending reason is:
Unknown pending reason code <6701>;

The parameter is defined in lsb.params, lsb.queues (overrides lsb.params), and lsb.applications (overrides both lsb.params and lsb.queues).

Run badmin reconfig to make your changes take effect.

Default
Not defined (if the parameter is not defined anywhere, preemption is immediate).

PRIORITY

Syntax
PRIORITY=integer

Description
Specifies the relative queue priority for dispatching jobs. A higher value indicates a higher job-dispatching priority, relative to other queues. LSF schedules jobs from one queue at a time, starting with the highest-priority queue. If multiple queues have the same priority, LSF schedules all the jobs from these queues in first-come, first-served order.

LSF queue priority is independent of the UNIX scheduler priority system for time-sharing processes. In LSF, the NICE parameter is used to set the UNIX time-sharing priority for batch jobs.

integer
Specify a number greater than or equal to 1, where 1 is the lowest priority.

Default
1

PROCESSLIMIT

Syntax
PROCESSLIMIT=[default_limit] maximum_limit

Description
Limits the number of concurrent processes that can be part of a job.

By default, if a default process limit is specified, jobs submitted to the queue without a job-level process limit are killed when the default process limit is reached.

If you specify only one limit, it is the maximum, or hard, process limit.
If you specify two limits, the first one is the default, or soft, process limit, and the second one is the maximum process limit.

Default
Unlimited

QJOB_LIMIT

Syntax
QJOB_LIMIT=integer

Description
Job slot limit for the queue. Total number of job slots that this queue can use.

Default
Unlimited

QUEUE_GROUP

Syntax
QUEUE_GROUP=queue1, queue2 ...

Description
Configures absolute priority scheduling (APS) across multiple queues.

When APS is enabled in the queue with APS_PRIORITY, the FAIRSHARE_QUEUES parameter is ignored. The QUEUE_GROUP parameter replaces FAIRSHARE_QUEUES, which is obsolete in LSF 7.0.

Default
Not defined

QUEUE_NAME

Syntax
QUEUE_NAME=string

Description
Required. Name of the queue.

Specify any ASCII string up to 59 characters long. You can use letters, digits, underscores (_), or dashes (-). You cannot use blank spaces. You cannot specify the reserved name default.

Default
You must specify this parameter to define a queue. The default queue automatically created by LSF is named default.

RCVJOBS_FROM

Syntax
RCVJOBS_FROM=cluster_name ... | allclusters

Description
MultiCluster only. Defines a MultiCluster receive-jobs queue.

Specify cluster names, separated by a space. The administrator of each remote cluster determines which queues in that cluster forward jobs to the local cluster. Use the keyword allclusters to specify any remote cluster.

Example
RCVJOBS_FROM=cluster2 cluster4 cluster6
This queue accepts remote jobs from clusters 2, 4, and 6.

REMOTE_MAX_PREEXEC_RETRY

Syntax
REMOTE_MAX_PREEXEC_RETRY=integer

Description
MultiCluster job forwarding model only. Applies to the execution cluster. Defines the maximum number of times to attempt the pre-execution command of a job from the remote cluster.

Valid values
0 - INFINIT_INT

INFINIT_INT is defined in lsf.h.

Default
5

REQUEUE_EXIT_VALUES

Syntax
REQUEUE_EXIT_VALUES=[exit_code ...]
[EXCLUDE(exit_code ...)]

Description
Enables automatic job requeue and sets the LSB_EXIT_REQUEUE environment variable. Use spaces to separate multiple exit codes. Application-level exit values override queue-level values. Job-level exit values (bsub -Q) override application-level and queue-level values.

exit_code has the following form:
"[all] [~number ...] | [number ...]"

The reserved keyword all specifies all exit codes. Exit codes are typically between 0 and 255. Use a tilde (~) to exclude specified exit codes from the list.

Jobs are requeued to the head of the queue. The output from the failed run is not saved, and the user is not notified by LSF.

Define an exit code as EXCLUDE(exit_code) to enable exclusive job requeue, ensuring the job does not rerun on the same host. Exclusive job requeue does not work for parallel jobs.

For MultiCluster jobs forwarded to a remote execution cluster, the exit values specified in the submission cluster with the EXCLUDE keyword are treated as if they were non-exclusive.

You can also requeue a job if the job is terminated by a signal. If a job is killed by a signal, the exit value is 128+signal_value. The sum of 128 and the signal value can be used as the exit code in the parameter REQUEUE_EXIT_VALUES. For example, if you want a job to rerun if it is killed with a signal 9 (SIGKILL), the exit value would be 128+9=137. You can configure the following requeue exit value to allow a job to be requeued if it was killed by signal 9:
REQUEUE_EXIT_VALUES=137

In Windows, if a job is killed by a signal, the exit value is signal_value. The signal value can be used as the exit code in the parameter REQUEUE_EXIT_VALUES. For example, if you want to rerun a job after it was killed with a signal 7, the exit value would be 7.
You can configure the following requeue exit value to allow a job to requeue after it was killed by signal 7:
REQUEUE_EXIT_VALUES=7

You can configure the following requeue exit value to allow a job to requeue on both Linux and Windows after it was killed:
REQUEUE_EXIT_VALUES=137 7

If mbatchd is restarted, it does not remember the previous hosts from which the job exited with an exclusive requeue exit code. In this situation, it is possible for a job to be dispatched to hosts on which the job has previously exited with an exclusive exit code.

You should configure REQUEUE_EXIT_VALUES for interruptible backfill queues (INTERRUPTIBLE_BACKFILL=seconds).

Example
REQUEUE_EXIT_VALUES=30 EXCLUDE(20)
means that jobs with exit code 30 are requeued, jobs with exit code 20 are requeued exclusively, and jobs with any other exit code are not requeued.

Default
Not defined. Jobs are not requeued.

RERUNNABLE

Syntax
RERUNNABLE=yes | no

Description
If yes, enables automatic job rerun (restart). Rerun is disabled when RERUNNABLE is set to no. The yes and no arguments are not case sensitive.

For MultiCluster jobs, the setting in the submission queue is used, and the setting in the execution queue is ignored.

Members of a chunk job can be rerunnable. If the execution host becomes unavailable, rerunnable chunk job members are removed from the job chunk and dispatched to a different execution host.

Default
no

RESOURCE_RESERVE

Syntax
RESOURCE_RESERVE=MAX_RESERVE_TIME[integer]

Description
Enables processor reservation and memory reservation for pending jobs for the queue. Specifies the number of dispatch turns (MAX_RESERVE_TIME) over which a job can reserve job slots and memory.

Overrides the SLOT_RESERVE parameter. If both RESOURCE_RESERVE and SLOT_RESERVE are defined in the same queue, an error is displayed when the cluster is reconfigured, and SLOT_RESERVE is ignored.
Job slot reservation for parallel jobs is enabled by RESOURCE_RESERVE if the LSF scheduler plug-in module names for both resource reservation and parallel batch jobs (schmod_parallel and schmod_reserve) are configured in the lsb.modules file. The schmod_parallel name must come before schmod_reserve in lsb.modules.

If a job has not accumulated enough memory or job slots to start by the time MAX_RESERVE_TIME expires, it releases all its reserved job slots or memory so that other pending jobs can run. After the reservation time expires, the job cannot reserve memory or slots for one scheduling session, so other jobs have a chance to be dispatched. After one scheduling session, the job can reserve available memory and job slots again for another period specified by MAX_RESERVE_TIME.

If BACKFILL is configured in a queue, and a run limit is specified with -W on bsub or with RUNLIMIT in the queue, backfill jobs can use the accumulated memory reserved by the other jobs in the queue, as long as the backfill job can finish before the predicted start time of the jobs with the reservation.

Unlike slot reservation, which only applies to parallel jobs, memory reservation and backfill on memory apply to sequential and parallel jobs.

Example
RESOURCE_RESERVE=MAX_RESERVE_TIME[5]
This example specifies that jobs have up to 5 dispatch turns to reserve sufficient job slots or memory (equal to 5 minutes, by default).

Default
Not defined. No job slots or memory are reserved.

RES_REQ

Syntax
RES_REQ=res_req

Description
Resource requirements used to determine eligible hosts. Specify a resource requirement string as usual. The resource requirement string lets you specify conditions in a more flexible manner than using the load thresholds.
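For illustration, a queue-level resource requirement might combine a select section and an rusage section as follows. The queue name and the thresholds are hypothetical; mem and swp are standard LSF load indices reported in MB:

```
Begin Queue
QUEUE_NAME = bigmem
RES_REQ    = select[mem > 1000 && swp > 500] rusage[mem=500]
End Queue
```

In this sketch, only hosts currently reporting more than 1000 MB of available memory and 500 MB of available swap are eligible, and each job dispatched from the queue reserves 500 MB of memory for scheduling purposes.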
Resource requirement strings can be simple (applying to the entire job), compound (applying to the specified number of slots), or can contain alternative resources (alternatives between 2 or more simple and/or compound). For alternative resources, if the first resource requirement cannot be satisfied, the next resource requirement is tried, and so on until the requirement is satisfied.

Compound and alternative resource requirements follow the same set of rules for determining how resource requirements are merged between job, application, and queue level. For more detail on merge rules, see Administering IBM Platform LSF.

When a compound or alternative resource requirement is set for a queue, it is ignored unless it is the only resource requirement specified (no resource requirements are set at the job level or application level).

When a simple resource requirement is set for a queue and a compound resource requirement is set at the job level or application level, the queue-level requirements merge as they do for simple resource requirements. However, any job-based resources defined in the queue only apply to the first term of the merged compound resource requirements.

When LSF_STRICT_RESREQ=Y is configured in lsf.conf, resource requirement strings in select sections must conform to a more strict syntax. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, cu, or affinity). When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

For simple resource requirements, the select sections from all levels must be satisfied and the same sections from all levels are combined. cu, order, and span sections at the job level overwrite those at the application level, which overwrite
those at the queue level. Multiple rusage definitions are merged, with the job-level rusage taking precedence over the application level, and the application level taking precedence over the queue level.

The simple resource requirement rusage section can specify additional requests. To do this, use the OR (||) operator to separate additional rusage strings. Multiple -R options cannot be used with multi-phase rusage resource requirements.

For simple resource requirements, the job-level affinity section overrides the application level, and the application-level affinity section overrides the queue level.

Note: Compound and alternative resource requirements do not support use of the || operator within rusage sections or the cu section.

The RES_REQ consumable resource requirements must satisfy any limits set by the parameter RESRSV_LIMIT in lsb.queues, or the RES_REQ is ignored. When both RES_REQ and RESRSV_LIMIT are set in lsb.queues for a consumable resource, the queue-level RES_REQ no longer acts as a hard limit for the merged RES_REQ rusage values from the job and application levels. In this case only the limits set by RESRSV_LIMIT must be satisfied, and the queue-level RES_REQ acts as a default value.

For example:

Queue-level RES_REQ:
RES_REQ=rusage[mem=200:lic=1] ...

For the job submission:
bsub -R'rusage[mem=100]' ...

the resulting requirement for the job is rusage[mem=100:lic=1], where mem=100 specified by the job overrides mem=200 specified by the queue. However, lic=1 from the queue is kept, since the job does not specify it.

Queue-level RES_REQ threshold:
RES_REQ=rusage[bwidth=2:threshold=5] ...

For the job submission:
bsub -R "rusage[bwidth=1:threshold=6]" ...

the resulting requirement for the job is rusage[bwidth=1:threshold=6]

Queue-level RES_REQ with decay and duration defined:
RES_REQ=rusage[mem=200:duration=20:decay=1] ...

For a job submission with no decay or duration:
bsub -R'rusage[mem=100]' ...
the resulting requirement for the job is:
rusage[mem=100:duration=20:decay=1]

Queue-level duration and decay are merged with the job-level specification, and mem=100 for the job overrides mem=200 specified by the queue. However, duration=20 and decay=1 from the queue are kept, since the job does not specify them.

Queue-level RES_REQ with multi-phase job-level rusage:
RES_REQ=rusage[mem=200:duration=20:decay=1] ...

For a job submission with no decay or duration:
bsub -R'rusage[mem=(300 200 100):duration=(10 10 10)]' ...

the resulting requirement for the job is:
rusage[mem=(300 200 100):duration=(10 10 10)]

Multi-phase rusage values in the job submission override the single phase specified by the queue.
v If RESRSV_LIMIT is defined in lsb.queues and has a maximum memory limit of 300 MB or greater, this job is accepted.
v If RESRSV_LIMIT is defined in lsb.queues and has a maximum memory limit of less than 300 MB, this job is rejected.
v If RESRSV_LIMIT is not defined in lsb.queues, the queue-level RES_REQ value of 200 MB acts as a ceiling, and this job is rejected.

Queue-level multi-phase rusage RES_REQ:
RES_REQ=rusage[mem=(350 200):duration=(20):decay=(1)] ...

For a single-phase job submission with no decay or duration:
bsub -q q_name -R'rusage[mem=100:swap=150]' ...

the resulting requirement for the job is:
rusage[mem=100:swap=150]

The job-level rusage string overrides the queue-level multi-phase rusage string.

The order section defined at the job level overwrites any resource requirements specified at the application level or queue level. The order section defined at the application level overwrites any resource requirements specified at the queue level. The default order string is r15s:pg.

If RES_REQ is defined at the queue level and there are no load thresholds defined, the pending reasons for each individual load index are not displayed by bjobs.
The span section defined at the queue level is ignored if the span section is also defined at the job level or in an application profile.

Note: Define span[hosts=-1] in the application profile or in the bsub -R resource requirement string to override the span section setting in the queue.

Default

select[type==local] order[r15s:pg]. If this parameter is defined and a host model or Boolean resource is specified, the default type is any.

RESRSV_LIMIT

Syntax

RESRSV_LIMIT=[res1={min1,} max1] [res2={min2,} max2]...

where res is a consumable resource name, min is an optional minimum value, and max is the maximum allowed value. Both max and min must be float numbers between 0 and 2147483647, and min cannot be greater than max.

Description

Sets a range of allowed values for RES_REQ resources.

Queue-level RES_REQ rusage values (set in lsb.queues) must be in the range set by RESRSV_LIMIT, or the queue-level RES_REQ is ignored. Merged RES_REQ rusage values from the job and application levels must be in the range of RESRSV_LIMIT, or the job is rejected.

Changes made to the rusage values of running jobs using bmod -R cannot exceed the maximum values of RESRSV_LIMIT, but can be lower than the minimum values.

When both RES_REQ and RESRSV_LIMIT are set in lsb.queues for a consumable resource, the queue-level RES_REQ no longer acts as a hard limit for the merged RES_REQ rusage values from the job and application levels. In this case only the limits set by RESRSV_LIMIT must be satisfied, and the queue-level RES_REQ acts as a default value.

For MultiCluster, jobs must satisfy the RESRSV_LIMIT range set for the send-jobs queue in the submission cluster. After the job is forwarded, the resource requirements are also checked against the RESRSV_LIMIT range set for the receive-jobs queue in the execution cluster.

Note: Only consumable resource limits can be set in RESRSV_LIMIT. Other resources are ignored.

Default

Not defined.
If max is defined and the optional min is not, the default for min is 0.

RESUME_COND

Syntax

RESUME_COND=res_req

Use the select section of the resource requirement string to specify load thresholds. All other sections are ignored.

Description

LSF automatically resumes a suspended (SSUSP) job in this queue if the load on the host satisfies the specified conditions. If RESUME_COND is not defined, the loadSched thresholds are used to control resuming of jobs. The loadSched thresholds are ignored when resuming jobs if RESUME_COND is defined.

Default

Not defined. The loadSched thresholds are used to control resuming of jobs.

RUN_JOB_FACTOR

Syntax

RUN_JOB_FACTOR=number

Description

Used only with fairshare scheduling. Job slots weighting factor.

In the calculation of a user’s dynamic share priority, this factor determines the relative importance of the number of job slots reserved and in use by a user.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Default

Not defined.

RUN_TIME_DECAY

Syntax

RUN_TIME_DECAY=Y | y | N | n

Description

Used only with fairshare scheduling. Enables decay for run time at the same rate as the decay set by HIST_HOURS for cumulative CPU time and historical run time.

In the calculation of a user’s dynamic share priority, this factor determines whether run time is decayed.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Restrictions

Running badmin reconfig or restarting mbatchd during a job's run time results in the decayed run time being recalculated.

When a suspended job using run time decay is resumed, the decay time is based on the elapsed time.

Default

Not defined.

RUN_TIME_FACTOR

Syntax

RUN_TIME_FACTOR=number

Description

Used only with fairshare scheduling. Run time weighting factor.
In the calculation of a user’s dynamic share priority, this factor determines the relative importance of the total run time of a user’s running jobs.

If undefined, the cluster-wide value from the lsb.params parameter of the same name is used.

Default

Not defined.

RUN_WINDOW

Syntax

RUN_WINDOW=time_window ...

Description

Time periods during which jobs in the queue are allowed to run.

When the window closes, LSF suspends jobs running in the queue and stops dispatching jobs from the queue. When the window reopens, LSF resumes the suspended jobs and begins dispatching additional jobs.

Default

Not defined. Queue is always active.

RUNLIMIT

Syntax

RUNLIMIT=[default_limit] maximum_limit

where default_limit and maximum_limit are:
[hour:]minute[/host_name | /host_model]

Description

The maximum run limit and optionally the default run limit. The name of a host or host model specifies the runtime normalization host to use.

By default, jobs that are in the RUN state for longer than the specified maximum run limit are killed by LSF. You can optionally provide your own termination job action to override this default.

Jobs submitted with a job-level run limit (bsub -W) that is less than the maximum run limit are killed when their job-level run limit is reached. Jobs submitted with a run limit greater than the maximum run limit are rejected by the queue. If a default run limit is specified, jobs submitted to the queue without a job-level run limit are killed when the default run limit is reached. The default run limit is used with backfill scheduling of parallel jobs.

Note: If you want to provide an estimated run time for scheduling purposes without killing jobs that exceed the estimate, define the RUNTIME parameter in an application profile instead of a run limit (see lsb.applications for details).

If you specify only one limit, it is the maximum, or hard, run limit.
If you specify two limits, the first one is the default, or soft, run limit, and the second one is the maximum run limit.

The run limit is in the form [hour:]minute. The minutes can be specified as a number greater than 59. For example, three and a half hours can be specified either as 3:30 or as 210.

The run limit you specify is the normalized run time. This is done so that the job does approximately the same amount of processing, even if it is sent to a host with a faster or slower CPU. Whenever a normalized run time is given, the actual time on the execution host is the specified time multiplied by the CPU factor of the normalization host, then divided by the CPU factor of the execution host.

If ABS_RUNLIMIT=Y is defined in lsb.params, the runtime limit is not normalized by the host CPU factor. Absolute wall-clock run time is used for all jobs submitted to a queue with a run limit configured.

Optionally, you can supply a host name or a host model name defined in LSF. You must insert '/' between the run limit and the host name or model name. (See lsinfo(1) to get host model information.)

If no host or host model is given, LSF uses the default runtime normalization host defined at the queue level (DEFAULT_HOST_SPEC in lsb.queues) if it has been configured; otherwise, LSF uses the default CPU time normalization host defined at the cluster level (DEFAULT_HOST_SPEC in lsb.params) if it has been configured; otherwise, the host with the largest CPU factor (the fastest host in the cluster).

For MultiCluster jobs, if no other CPU time normalization host is defined and information about the submission host is not available, LSF uses the host with the largest CPU factor (the fastest host in the cluster).

Jobs submitted to a chunk job queue are not chunked if RUNLIMIT is greater than 30 minutes.
RUNLIMIT is required for queues configured with INTERRUPTIBLE_BACKFILL.

Default

Unlimited

SLA_GUARANTEES_IGNORE

Syntax

SLA_GUARANTEES_IGNORE=Y | y | N | n

Description

Applies to SLA guarantees only. SLA_GUARANTEES_IGNORE=Y allows jobs in the queue access to all guaranteed resources. As a result, some guarantees might not be honored.

If a queue does not have this parameter set, jobs in this queue cannot trigger preemption of an SLA job. If an SLA job is suspended (for example, by bstop), jobs in queues without the parameter set can still make use of the slots released by the suspended job.

Note: Using SLA_GUARANTEES_IGNORE=Y defeats the purpose of guaranteeing resources. It should be used sparingly, for low-traffic queues only.

Default

Not defined (N). The queue must honor resource guarantees when dispatching jobs.

SLOT_POOL

Syntax

SLOT_POOL=pool_name

Description

Name of the pool of job slots the queue belongs to for queue-based fairshare. A queue can only belong to one pool. All queues in the pool must share the same set of hosts.

Valid values

Specify any ASCII string up to 60 characters long. You can use letters, digits, underscores (_) or dashes (-). You cannot use blank spaces.

Default

Not defined. No job slots are reserved.

SLOT_RESERVE

Syntax

SLOT_RESERVE=MAX_RESERVE_TIME[integer]

Description

Enables processor reservation for the queue and specifies the reservation time. Specify the keyword MAX_RESERVE_TIME and, in square brackets, the number of MBD_SLEEP_TIME cycles over which a job can reserve job slots. MBD_SLEEP_TIME is defined in lsb.params; the default value is 60 seconds.

If a job has not accumulated enough job slots to start before the reservation expires, it releases all its reserved job slots so that other jobs can run. Then, the job cannot reserve slots for one scheduling session, so other jobs have a chance to be dispatched.
After one scheduling session, the job can reserve job slots again for another period specified by SLOT_RESERVE.

SLOT_RESERVE is overridden by the RESOURCE_RESERVE parameter. If both RESOURCE_RESERVE and SLOT_RESERVE are defined in the same queue, job slot reservation and memory reservation are both enabled and an error is displayed when the cluster is reconfigured. SLOT_RESERVE is ignored.

Job slot reservation for parallel jobs is enabled by RESOURCE_RESERVE if the LSF scheduler plug-in module names for both resource reservation and parallel batch jobs (schmod_parallel and schmod_reserve) are configured in the lsb.modules file. The schmod_parallel name must come before schmod_reserve in lsb.modules.

If BACKFILL is configured in a queue, and a run limit is specified at the job level (bsub -W), application level (RUNLIMIT in lsb.applications), or queue level (RUNLIMIT in lsb.queues), or if an estimated run time is specified at the application level (RUNTIME in lsb.applications), backfill parallel jobs can use job slots reserved by the other jobs, as long as the backfill job can finish before the predicted start time of the jobs with the reservation.

Unlike memory reservation, which applies to both sequential and parallel jobs, slot reservation applies only to parallel jobs.

Example

SLOT_RESERVE=MAX_RESERVE_TIME[5]

This example specifies that parallel jobs have up to 5 cycles of MBD_SLEEP_TIME (5 minutes, by default) to reserve sufficient job slots to start.

Default

Not defined. No job slots are reserved.

SLOT_SHARE

Syntax

SLOT_SHARE=integer

Description

Share of job slots for queue-based fairshare. Represents the percentage of running jobs (job slots) in use from the queue. SLOT_SHARE must be greater than zero (0) and less than or equal to 100.

The sum of SLOT_SHARE for all queues in the pool does not need to be 100%. It can be more or less, depending on your needs.
Default

Not defined.

SNDJOBS_TO

Syntax

SNDJOBS_TO=[queue@]cluster_name[+preference] ...

Description

Defines a MultiCluster send-jobs queue.

Specify remote queue names, in the form queue_name@cluster_name[+preference], separated by a space.

This parameter is ignored if lsb.queues HOSTS specifies remote (borrowed) resources.

Queue preference is defined at the queue level in SNDJOBS_TO (lsb.queues) of the submission cluster for each corresponding execution cluster queue receiving forwarded jobs.

Example

SNDJOBS_TO=queue2@cluster2+1 queue3@cluster2+2

STACKLIMIT

Syntax

STACKLIMIT=integer

Description

The per-process (hard) stack segment size limit (in KB) for all of the processes belonging to a job from this queue (see getrlimit(2)).

Default

Unlimited

STOP_COND

Syntax

STOP_COND=res_req

Use the select section of the resource requirement string to specify load thresholds. All other sections are ignored.

Description

LSF automatically suspends a running job in this queue if the load on the host satisfies the specified conditions.
v LSF does not suspend the only job running on the host if the machine is interactively idle (it > 0).
v LSF does not suspend a forced job (brun -f).
v LSF does not suspend a job because of paging rate if the machine is interactively idle.

If STOP_COND is specified in the queue and there are no load thresholds, the suspending reasons for each individual load index are not displayed by bjobs.

Example

STOP_COND=select[((!cs && it < 5) || (cs && mem < 15 && swp < 50))]

In this example, assume "cs" is a Boolean resource indicating that the host is a computer server. The stop condition for jobs running on computer servers is based on the availability of swap memory. The stop condition for jobs running on other kinds of hosts is based on the idle time.

SUCCESS_EXIT_VALUES

Syntax

SUCCESS_EXIT_VALUES=[exit_code ...]
Description

Use this parameter to specify exit values used by LSF to determine whether the job completed successfully. Application-level success exit values defined with SUCCESS_EXIT_VALUES in lsb.applications override the configuration defined in lsb.queues. Job-level success exit values specified with the LSB_SUCCESS_EXIT_VALUES environment variable override the configuration in lsb.queues and lsb.applications.

Use SUCCESS_EXIT_VALUES for submitting jobs to specific queues that successfully exit with non-zero values so that LSF does not interpret non-zero exit codes as job failure.

If the same exit code is defined in SUCCESS_EXIT_VALUES and REQUEUE_EXIT_VALUES, any job with this exit code is requeued instead of being marked as DONE because sbatchd processes requeue exit values before success exit values.

In MultiCluster job forwarding mode, LSF uses the SUCCESS_EXIT_VALUES from the remote cluster. In a MultiCluster resource leasing environment, LSF uses the SUCCESS_EXIT_VALUES from the consumer cluster.

exit_code should be a value between 0 and 255. Use spaces to separate multiple exit codes.

Any changes you make to SUCCESS_EXIT_VALUES do not affect running jobs. Only pending jobs use the new SUCCESS_EXIT_VALUES definitions, even if you run badmin reconfig and mbatchd restart to apply your changes.

Default

Not defined.

SWAPLIMIT

Syntax

SWAPLIMIT=integer

Description

The total virtual memory limit (in KB) for a job from this queue. This limit applies to the whole job, no matter how many processes the job may contain.

The action taken when a job exceeds its SWAPLIMIT or PROCESSLIMIT is to send SIGQUIT, SIGINT, SIGTERM, and SIGKILL in sequence. For CPULIMIT, SIGXCPU is sent before SIGINT, SIGTERM, and SIGKILL.

Default

Unlimited

TASKLIMIT

Syntax

TASKLIMIT=[minimum_limit [default_limit]] maximum_limit

Description

Note: TASKLIMIT replaces PROCLIMIT as of LSF 9.1.3.
Maximum number of tasks that can be allocated to a job. For parallel jobs, the maximum number of tasks that can be allocated to the job.

Queue-level TASKLIMIT takes precedence over application-level TASKLIMIT, and application-level TASKLIMIT takes precedence over job-level TASKLIMIT. Job-level limits must fall within the maximum and minimum limits of the application profile and the queue.

Note: If you also defined JOB_SIZE_LIST in the same queue where you defined TASKLIMIT, the TASKLIMIT parameter is ignored.

Optionally specifies the minimum and default number of job tasks. All limits must be positive integers greater than or equal to 1 that satisfy the following relationship:

1 <= minimum <= default <= maximum

If RES_REQ in a queue is defined as a compound resource requirement with a block size (span[block=value]), the default value for TASKLIMIT should be a multiple of a block.

For example, this configuration would be accepted:
Queue-level RES_REQ="1*{type==any } + {type==local span[block=4]}"
TASKLIMIT = 5 9 13

This configuration, for example, would not be accepted, and an error message appears when you run badmin reconfig:
Queue-level RES_REQ="1*{type==any } + {type==local span[block=4]}"
TASKLIMIT = 4 10 12

In the MultiCluster job forwarding model, the local cluster considers the receiving queue's TASKLIMIT on remote clusters before forwarding jobs. If the receiving queue's TASKLIMIT definition in the remote cluster cannot satisfy the job's task requirements for a remote queue, the job is not forwarded to that remote queue in the cluster.

Default

Unlimited; the default number of tasks is 1.

TERMINATE_WHEN

Syntax

TERMINATE_WHEN=[LOAD] [PREEMPT] [WINDOW]

Description

Configures the queue to invoke the TERMINATE action instead of the SUSPEND action in the specified circumstance.
v LOAD: kills jobs when the load exceeds the suspending thresholds.
v PREEMPT: kills jobs that are being preempted.
v WINDOW: kills jobs if the run window closes.

If the TERMINATE_WHEN job control action is applied to a chunk job, sbatchd kills the chunk job element that is running and puts the rest of the waiting elements into pending state to be rescheduled later.

Example

Set TERMINATE_WHEN to WINDOW to define a night queue that kills jobs if the run window closes:

Begin Queue
NAME           = night
RUN_WINDOW     = 20:00-08:00
TERMINATE_WHEN = WINDOW
JOB_CONTROLS   = TERMINATE[kill -KILL $LS_JOBPGIDS; mail -s "job $LSB_JOBID killed by queue run window" $USER < /dev/null]
End Queue

THREADLIMIT

Syntax

THREADLIMIT=[default_limit] maximum_limit

Description

Limits the number of concurrent threads that can be part of a job. Exceeding the limit causes the job to terminate. The system sends the following signals in sequence to all processes belonging to the job: SIGINT, SIGTERM, and SIGKILL.

By default, if a default thread limit is specified, jobs submitted to the queue without a job-level thread limit are killed when the default thread limit is reached.

If you specify only one limit, it is the maximum, or hard, thread limit. If you specify two limits, the first one is the default, or soft, thread limit, and the second one is the maximum thread limit.

Both the default and the maximum limits must be positive integers. The default limit must be less than the maximum limit. The default limit is ignored if it is greater than the maximum limit.

Examples

THREADLIMIT=6

No default thread limit is specified. The value 6 is the default and maximum thread limit.

THREADLIMIT=6 8

The first value (6) is the default thread limit. The second value (8) is the maximum thread limit.

Default

Unlimited

UJOB_LIMIT

Syntax

UJOB_LIMIT=integer

Description

Per-user job slot limit for the queue. Maximum number of job slots that each user can use in this queue.
UJOB_LIMIT must be within or greater than the range set by TASKLIMIT or bsub -n (if either is used), or jobs are rejected.

Default

Unlimited

USE_PAM_CREDS

Syntax

USE_PAM_CREDS=y | n

Description

If USE_PAM_CREDS=y, applies PAM limits to a queue when its job is dispatched to a Linux host using PAM. PAM limits are system resource limits defined in limits.conf.

When USE_PAM_CREDS is enabled, PAM limits override others. For example, the PAM limit is used even if the queue-level soft limit is less than the PAM limit. However, it still cannot exceed the queue's hard limit.

If the execution host does not have PAM configured and this parameter is enabled, the job fails.

For parallel jobs, USE_PAM_CREDS only takes effect on the first execution host.

USE_PAM_CREDS only applies on the following platforms:
v linux2.6-glibc2.3-ia64
v linux2.6-glibc2.3-ppc64
v linux2.6-glibc2.3-sn-ipf
v linux2.6-glibc2.3-x86
v linux2.6-glibc2.3-x86_64

Overrides MEMLIMIT_TYPE=Process.

Overridden (for CPU limit only) by LSB_JOB_CPULIMIT=y.

Overridden (for memory limits only) by LSB_JOB_MEMLIMIT=y.

Default

n

USE_PRIORITY_IN_POOL

Syntax

USE_PRIORITY_IN_POOL=y | Y | n | N

Description

Queue-based fairshare only. After job scheduling occurs for each queue, this parameter enables LSF to dispatch jobs to any remaining slots in the pool in first-come first-served order across queues.

Default

N

USERS

Syntax

USERS=all [~user_name ...] [~user_group ...] | [user_name ...] [user_group [~user_group ...] ...]

Description

A space-separated list of user names or user groups that can submit jobs to the queue. LSF cluster administrators are automatically included in the list of users. LSF cluster administrators can submit jobs to this queue, or switch (bswitch) any user’s jobs into this queue.

If user groups are specified, each user in the group can submit jobs to this queue.
If FAIRSHARE is also defined in this queue, only users defined by both parameters can submit jobs, so LSF administrators cannot use the queue if they are not included in the share assignments.

User names must be valid login names. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name).

User group names can be LSF user groups or UNIX and Windows user groups. To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\user_group).

Use the keyword all to specify all users or user groups in a cluster. Use the not operator (~) to exclude users from the all specification or from user groups. This is useful if you have a large number of users but only want to exclude a few users or groups from the queue definition.

The not operator (~) can only be used with the all keyword or to exclude users from user groups.

CAUTION: The not operator (~) does not exclude LSF administrators from the queue definition.

Default

all (all users can submit jobs to the queue)

Examples

v USERS=user1 user2
v USERS=all ~user1 ~user2
v USERS=all ~ugroup1
v USERS=groupA ~user3 ~user4

Automatic time-based configuration

Variable configuration is used to automatically change LSF configuration based on time windows. You define automatic configuration changes in lsb.queues by using if-else constructs and time expressions. After you change the files, reconfigure the cluster with the badmin reconfig command.

The expressions are evaluated by LSF every 10 minutes based on mbatchd start time. When an expression evaluates true, LSF dynamically changes the configuration based on the associated configuration statements. Reconfiguration is done in real time without restarting mbatchd, providing continuous system availability.

Example

Begin Queue
...
#if time(8:30-18:30)
INTERACTIVE = ONLY # interactive only during day shift
#endif
...
End Queue

lsb.resources

The lsb.resources file contains configuration information for resource allocation limits, exports, resource usage limits, and guarantee policies. This file is optional.

The lsb.resources file is stored in the directory LSB_CONFDIR/cluster_name/configdir, where LSB_CONFDIR is defined in lsf.conf.

Changing lsb.resources configuration

After making any changes to lsb.resources, run badmin reconfig to reconfigure mbatchd.

Limit section

The Limit section sets limits for the maximum amount of the specified resources that must be available for different classes of jobs to start, and which resource consumers the limits apply to. Limits are enforced during job resource allocation.

Tip: For limits to be enforced, jobs must specify rusage resource requirements (bsub -R or RES_REQ in lsb.queues).

The blimits command displays current usage of resource allocation limits configured in Limit sections in lsb.resources.

Limit section structure

Each set of limits is defined in a Limit section enclosed by Begin Limit and End Limit. A Limit section has two formats:
v Vertical tabular
v Horizontal

The file can contain sections in both formats. In either format, you must configure a limit for at least one consumer and one resource. The Limit section cannot be empty.

Vertical tabular format

Use the vertical format for simple configuration conditions involving only a few consumers and resource limits.
The first row consists of an optional NAME and the following keywords:
v Resource types:
– SLOTS or SLOTS_PER_PROCESSOR
– MEM (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
– SWP (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
– TMP (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
– JOBS
– RESOURCE
v Consumer types:
– USERS or PER_USER
– QUEUES or PER_QUEUE
– HOSTS or PER_HOST
– PROJECTS or PER_PROJECT
– LIC_PROJECTS or PER_LIC_PROJECT

Each subsequent row describes the configuration information for resource consumers and the limits that apply to them. Each line must contain an entry for each keyword. Use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Tip: Multiple entries must be enclosed in parentheses. For RESOURCE limits, RESOURCE names must be enclosed in parentheses.

Horizontal format

Use the horizontal format to give a name to your limits and to configure more complicated combinations of consumers and resource limits.

The first line of the Limit section gives the name of the limit configuration. Each subsequent line in the Limit section consists of keywords identifying the resource limits:
v Job slots and per-processor job slots
v Memory (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
v Swap space (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
v Tmp space (MB or unit set in LSF_UNIT_FOR_LIMITS in lsf.conf)
v Running and suspended (RUN, SSUSP, USUSP) jobs
v Other shared resources

and the resource consumers to which the limits apply:
v Users and user groups
v Hosts and host groups
v Queues
v Projects

Example: Vertical tabular format

In the following limit configuration:
v Jobs from user1 and user3 are limited to 2 job slots on hostA
v Jobs from user2 on queue normal are limited to 20 MB of memory, or the unit set in LSF_UNIT_FOR_LIMITS in lsf.conf
v The short queue can have at most 200 running and suspended jobs

Begin Limit
NAME    USERS          QUEUES   HOSTS   SLOTS   MEM   SWP   TMP   JOBS
limit1  (user1 user3)  -        hostA   2       -     -     -     -
-       user2          normal   -       -       20    -     -     -
-       -              short    -       -       -     -     -     200
End Limit

Jobs that do not match these limits (that is, all users except user1 and user3 running jobs on hostA, and all users except user2 submitting jobs to queue normal) have no limits.

Example: Horizontal format

All users in user group ugroup1 except user1 using queue1 and queue2 and running jobs on hosts in host group hgroup1 are limited to 2 job slots per processor on each host:

Begin Limit
# ugroup1 except user1 uses queue1 and queue2 with 2 job slots
# on each host in hgroup1
NAME = limit1
# Resources
SLOTS_PER_PROCESSOR = 2
# Consumers
QUEUES = queue1 queue2
USERS = ugroup1 ~user1
PER_HOST = hgroup1
End Limit

Compatibility with lsb.queues, lsb.users, and lsb.hosts

The Limit section of lsb.resources does not support the keywords or format used in lsb.users, lsb.hosts, and lsb.queues. However, your existing job slot limit configuration in these files will continue to apply.

Job slot limits are the only type of limit you can configure in lsb.users, lsb.hosts, and lsb.queues. You cannot configure limits for user groups, host groups, and projects in lsb.users, lsb.hosts, and lsb.queues. You should not configure any new resource allocation limits in lsb.users, lsb.hosts, and lsb.queues. Use lsb.resources to configure all new resource allocation limits, including job slot limits. Limits on running and suspended jobs can only be set in lsb.resources.

Existing limits in lsb.users, lsb.hosts, and lsb.queues with the same scope as a new limit in lsb.resources, but with a different value, are ignored. The value of the new limit in lsb.resources is used. Similar limits with different scope enforce the most restrictive limit.
Parameters

- HOSTS
- JOBS
- MEM
- NAME
- PER_HOST
- PER_PROJECT
- PER_QUEUE
- PER_USER
- PROJECTS
- QUEUES
- RESOURCE
- SLOTS
- SLOTS_PER_PROCESSOR
- SWP
- TMP
- USERS

HOSTS

Syntax

HOSTS=all [~]host_name ... | all [~]host_group ...

HOSTS
( [-] | all [~]host_name ... | all [~]host_group ... )

Description

A space-separated list of hosts or host groups defined in lsb.hosts on which limits are enforced. Limits are enforced on all hosts or host groups listed.

If a group contains a subgroup, the limit also applies to each member in the subgroup recursively.

To specify a per-host limit, use the PER_HOST keyword. Do not configure HOSTS and PER_HOST limits in the same Limit section.

If you specify MEM, TMP, or SWP as a percentage, you must specify PER_HOST and list the hosts that the limit is to be enforced on. You cannot specify HOSTS.

In horizontal format, use only one HOSTS line per Limit section.

Use the keyword all to configure limits that apply to all hosts in a cluster. Use the not operator (~) to exclude hosts from the all specification in the limit. This is useful if you have a large cluster but only want to exclude a few hosts from the limit definition.

In vertical tabular format, multiple host names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

all (limits are enforced on all hosts in the cluster)

Example 1

HOSTS=Group1 ~hostA hostB hostC

Enforces limits on hostB, hostC, and all hosts in Group1 except for hostA.

Example 2

HOSTS=all ~group2 ~hostA

Enforces limits on all hosts in the cluster, except for hostA and the hosts in group2.
Example 3

HOSTS                SWP
(all ~hostK ~hostM)  10

Enforces a 10 MB (or the unit set in LSF_UNIT_FOR_LIMITS in lsf.conf) swap limit on all hosts in the cluster, except for hostK and hostM.

JOBS

Syntax

JOBS=integer

JOBS
- | integer

Description

Maximum number of running or suspended (RUN, SSUSP, USUSP) jobs available to resource consumers. Specify a positive integer greater than or equal to 0. Job limits can be defined in both vertical and horizontal limit formats.

With the MultiCluster resource lease model, this limit applies only to local hosts being used by the local cluster. The job limit for hosts exported to a remote cluster is determined by the host export policy, not by this parameter. The job limit for borrowed hosts is determined by the host export policy of the remote cluster.

If SLOTS are configured in the Limit section, the most restrictive limit is applied.

If HOSTS are configured in the Limit section, JOBS is the number of running and suspended jobs on a host. If preemptive scheduling is used, the suspended jobs are not counted against the job limit. Use this parameter to prevent a host from being overloaded with too many jobs, and to maximize the throughput of a machine.

If only QUEUES are configured in the Limit section, JOBS is the maximum number of jobs that can run in the listed queues.

If only USERS are configured in the Limit section, JOBS is the maximum number of jobs that the users or user groups can run.

If only HOSTS are configured in the Limit section, JOBS is the maximum number of jobs that can run on the listed hosts.

If only PROJECTS are configured in the Limit section, JOBS is the maximum number of jobs that can run under the listed projects.

Use QUEUES or PER_QUEUE, USERS or PER_USER, HOSTS or PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit jobs available to resource consumers.

In horizontal format, use only one JOBS line per Limit section.
In vertical format, use empty parentheses () or a dash (-) to indicate the default value (no limit). Fields cannot be left blank.

Default

No limit

Example

JOBS=20

MEM

Syntax

MEM=integer[%]

MEM
- | integer[%]

Description

Maximum amount of memory available to resource consumers. Specify a value in MB, or the unit set in LSF_UNIT_FOR_LIMITS in lsf.conf, as a positive integer greater than or equal to 0.

The Limit section is ignored if MEM is specified as a percentage:

- Without PER_HOST, or
- With HOSTS

In horizontal format, use only one MEM line per Limit section.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate the default value (no limit). Fields cannot be left blank.

If only QUEUES are configured in the Limit section, MEM must be an integer value. MEM is the maximum amount of memory available to the listed queues.

If only USERS are configured in the Limit section, MEM must be an integer value. MEM is the maximum amount of memory that the users or user groups can use.

If only HOSTS are configured in the Limit section, MEM must be an integer value. It cannot be a percentage. MEM is the maximum amount of memory available to the listed hosts.

If only PROJECTS are configured in the Limit section, MEM must be an integer value. MEM is the maximum amount of memory available to the listed projects.

Use QUEUES or PER_QUEUE, USERS or PER_USER, HOSTS or PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit memory available to resource consumers.

Default

No limit

Example

MEM=20

NAME

Syntax

NAME=limit_name

NAME
- | limit_name

Description

Name of the Limit section.

Specify any ASCII string of 40 characters or less. You can use letters, digits, underscores (_), or dashes (-). You cannot use blank spaces. If duplicate limit names are defined, the Limit section is ignored.
If the value of NAME is not defined in vertical format, or is defined as (-), blimits displays NONAMEnnn.

Default

None. In horizontal format, you must provide a name for the Limit section. NAME is optional in the vertical format.

Example

NAME=short_limits

PER_HOST

Syntax

PER_HOST=all [~]host_name ... | all [~]host_group ...

PER_HOST
( [-] | all [~]host_name ... | all [~]host_group ... )

Description

A space-separated list of hosts or host groups defined in lsb.hosts on which limits are enforced. Limits are enforced individually on each host listed and on each member of each host group listed. If a group contains a subgroup, the limit also applies to each member in the subgroup recursively.

Do not configure PER_HOST and HOSTS limits in the same Limit section.

In horizontal format, use only one PER_HOST line per Limit section.

If you specify MEM, TMP, or SWP as a percentage, you must specify PER_HOST and list the hosts that the limit is to be enforced on. You cannot specify HOSTS.

Use the keyword all to configure limits that apply to each host in a cluster. If host groups are configured, the limit applies to each member of the host group, not the group as a whole. Use the not operator (~) to exclude hosts or host groups from the all specification in the limit. This is useful if you have a large cluster but only want to exclude a few hosts from the limit definition.

In vertical tabular format, multiple host names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

None. If no limit is specified for PER_HOST or HOSTS, no limit is enforced on any host or host group.

Example

PER_HOST=hostA hgroup1 ~hostC

PER_PROJECT

Syntax

PER_PROJECT=all [~]project_name ...

PER_PROJECT
( [-] | all [~]project_name ... )

Description

A space-separated list of project names on which limits are enforced. Limits are enforced on each project listed.
Do not configure PER_PROJECT and PROJECTS limits in the same Limit section.

In horizontal format, use only one PER_PROJECT line per Limit section.

Use the keyword all to configure limits that apply to each project in a cluster. Use the not operator (~) to exclude projects from the all specification in the limit.

In vertical tabular format, multiple project names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

None. If no limit is specified for PER_PROJECT or PROJECTS, no limit is enforced on any project.

Example

PER_PROJECT=proj1 proj2

PER_QUEUE

Syntax

PER_QUEUE=all [~]queue_name ...

PER_QUEUE
( [-] | all [~]queue_name ... )

Description

A space-separated list of queue names on which limits are enforced. Limits are enforced on jobs submitted to each queue listed.

Do not configure PER_QUEUE and QUEUES limits in the same Limit section.

In horizontal format, use only one PER_QUEUE line per Limit section.

Use the keyword all to configure limits that apply to each queue in a cluster. Use the not operator (~) to exclude queues from the all specification in the limit. This is useful if you have a large number of queues but only want to exclude a few queues from the limit definition.

In vertical tabular format, multiple queue names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

None. If no limit is specified for PER_QUEUE or QUEUES, no limit is enforced on any queue.

Example

PER_QUEUE=priority night

PER_USER

Syntax

PER_USER=all [~]user_name ... | all [~]user_group ...

PER_USER
( [-] | all [~]user_name ... | all [~]user_group ... )

Description

A space-separated list of user names or user groups on which limits are enforced.
Limits are enforced individually on each user listed and on each member of each user group listed. If a user group contains a subgroup, the limit also applies to each member in the subgroup recursively.

User names must be valid login names. User group names can be LSF user groups or UNIX and Windows user groups. Note that for LSF and UNIX user groups, the groups must be specified in a UserGroup section in lsb.users first.

Do not configure PER_USER and USERS limits in the same Limit section.

In horizontal format, use only one PER_USER line per Limit section.

Use the keyword all to configure limits that apply to each user in a cluster. If user groups are configured, the limit applies to each member of the user group, not the group as a whole. Use the not operator (~) to exclude users or user groups from the all specification in the limit. This is useful if you have a large number of users but only want to exclude a few users from the limit definition.

In vertical tabular format, multiple user names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

None. If no limit is specified for PER_USER or USERS, no limit is enforced on any user or user group.

Example

PER_USER=user1 user2 ugroup1 ~user3

PROJECTS

Syntax

PROJECTS=all [~]project_name ...

PROJECTS
( [-] | all [~]project_name ... )

Description

A space-separated list of project names on which limits are enforced. Limits are enforced on all projects listed.

To specify a per-project limit, use the PER_PROJECT keyword. Do not configure PROJECTS and PER_PROJECT limits in the same Limit section.

In horizontal format, use only one PROJECTS line per Limit section.

Use the keyword all to configure limits that apply to all projects in a cluster. Use the not operator (~) to exclude projects from the all specification in the limit.
This is useful if you have a large number of projects but only want to exclude a few projects from the limit definition.

In vertical tabular format, multiple project names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

all (limits are enforced on all projects in the cluster)

Example

PROJECTS=projA projB

QUEUES

Syntax

QUEUES=all [~]queue_name ...

QUEUES
( [-] | all [~]queue_name ... )

Description

A space-separated list of queue names on which limits are enforced. Limits are enforced on all queues listed.

The list must contain valid queue names defined in lsb.queues.

To specify a per-queue limit, use the PER_QUEUE keyword. Do not configure QUEUES and PER_QUEUE limits in the same Limit section.

In horizontal format, use only one QUEUES line per Limit section.

Use the keyword all to configure limits that apply to all queues in a cluster. Use the not operator (~) to exclude queues from the all specification in the limit. This is useful if you have a large number of queues but only want to exclude a few queues from the limit definition.

In vertical tabular format, multiple queue names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

all (limits are enforced on all queues in the cluster)

Example

QUEUES=normal night

RESOURCE

Syntax

RESOURCE=[shared_resource,integer] [[shared_resource,integer] ...]

RESOURCE
( [shared_resource,integer] [[shared_resource,integer] ...] )

Description

Maximum amount of any user-defined shared resource available to consumers.

In horizontal format, use only one RESOURCE line per Limit section.

In vertical tabular format, resource names must be enclosed in parentheses.

In vertical tabular format, use empty parentheses () or a dash (-) to indicate an empty field.
Fields cannot be left blank.

Default

None

Examples

RESOURCE=[stat_shared,4]

Begin Limit
RESOURCE                      PER_HOST
([stat_shared,4])             (all ~hostA)
([dyn_rsrc,1] [stat_rsrc,2])  (hostA)
End Limit

SLOTS

Syntax

SLOTS=integer

SLOTS
- | integer

Description

Maximum number of job slots available to resource consumers. Specify a positive integer greater than or equal to 0.

With the MultiCluster resource lease model, this limit applies only to local hosts being used by the local cluster. The job slot limit for hosts exported to a remote cluster is determined by the host export policy, not by this parameter. The job slot limit for borrowed hosts is determined by the host export policy of the remote cluster.

If JOBS are configured in the Limit section, the most restrictive limit is applied.

If HOSTS are configured in the Limit section, SLOTS is the number of running and suspended jobs on a host. If preemptive scheduling is used, the suspended jobs are not counted as using a job slot.

To fully use the CPU resource on multiprocessor hosts, make the number of job slots equal to or greater than the number of processors. Use this parameter to prevent a host from being overloaded with too many jobs, and to maximize the throughput of a machine.

Use "!" to make the number of job slots equal to the number of CPUs on a host. If the number of CPUs in a host changes dynamically, mbatchd adjusts the maximum number of job slots per host accordingly. Allow mbatchd up to 10 minutes to get the number of CPUs for a host. During this period the value of SLOTS is 1.

If only QUEUES are configured in the Limit section, SLOTS is the maximum number of job slots available to the listed queues.

If only USERS are configured in the Limit section, SLOTS is the maximum number of job slots that the users or user groups can use.

If only HOSTS are configured in the Limit section, SLOTS is the maximum number of job slots that are available to the listed hosts.
If only PROJECTS are configured in the Limit section, SLOTS is the maximum number of job slots that are available to the listed projects.

Use QUEUES or PER_QUEUE, USERS or PER_USER, HOSTS or PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit job slots available to resource consumers.

In horizontal format, use only one SLOTS line per Limit section.

In vertical format, use empty parentheses () or a dash (-) to indicate the default value (no limit). Fields cannot be left blank.

Default

No limit

Example

SLOTS=20

SLOTS_PER_PROCESSOR

Syntax

SLOTS_PER_PROCESSOR=number

SLOTS_PER_PROCESSOR
- | number

Description

Per-processor job slot limit, based on the number of processors on each host affected by the limit.

Maximum number of job slots that each resource consumer can use per processor. This job slot limit is configured per processor so that multiprocessor hosts will automatically run more jobs.

You must also specify PER_HOST and list the hosts that the limit is to be enforced on. The Limit section is ignored if SLOTS_PER_PROCESSOR is specified:

- Without PER_HOST, or
- With HOSTS

In vertical format, use empty parentheses () or a dash (-) to indicate the default value (no limit). Fields cannot be left blank.

To fully use the CPU resource on multiprocessor hosts, make the number of job slots equal to or greater than the number of processors. Use this parameter to prevent a host from being overloaded with too many jobs, and to maximize the throughput of a machine.

This number can be a fraction such as 0.5, so that it can also serve as a per-CPU limit on multiprocessor machines. This number is rounded up to the nearest integer equal to or greater than the total job slot limits for a host. For example, if SLOTS_PER_PROCESSOR is 0.5, on a 4-CPU multiprocessor host, users can only use up to 2 job slots at any time.
On a single-processor machine, users can use 1 job slot.

Use "!" to make the number of job slots equal to the number of CPUs on a host. If the number of CPUs in a host changes dynamically, mbatchd adjusts the maximum number of job slots per host accordingly. Allow mbatchd up to 10 minutes to get the number of CPUs for a host. During this period the number of CPUs is 1.

If only QUEUES and PER_HOST are configured in the Limit section, SLOTS_PER_PROCESSOR is the maximum number of job slots per processor available to the listed queues for any hosts, users, or projects.

If only USERS and PER_HOST are configured in the Limit section, SLOTS_PER_PROCESSOR is the maximum number of job slots per processor that the users or user groups can use on any hosts, queues, license projects, or projects.

If only PER_HOST is configured in the Limit section, SLOTS_PER_PROCESSOR is the maximum number of job slots per processor available to the listed hosts for any users, queues, or projects.

If only PROJECTS and PER_HOST are configured in the Limit section, SLOTS_PER_PROCESSOR is the maximum number of job slots per processor available to the listed projects for any users, queues, or hosts.

Use QUEUES or PER_QUEUE, USERS or PER_USER, PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit job slots per processor available to resource consumers.

Default

No limit

Example

SLOTS_PER_PROCESSOR=2

SWP

Syntax

SWP=integer[%]

SWP
- | integer[%]

Description

Maximum amount of swap space available to resource consumers. Specify a value in MB, or the unit set in LSF_UNIT_FOR_LIMITS in lsf.conf, as a positive integer greater than or equal to 0.

The Limit section is ignored if SWP is specified as a percentage:

- Without PER_HOST, or
- With HOSTS

In horizontal format, use only one SWP line per Limit section.

In vertical format, use empty parentheses () or a dash (-) to indicate the default value (no limit).
Fields cannot be left blank.

If only USERS are configured in the Limit section, SWP must be an integer value. SWP is the maximum amount of swap space that the users or user groups can use on any hosts, queues, or projects.

If only HOSTS are configured in the Limit section, SWP must be an integer value. SWP is the maximum amount of swap space available to the listed hosts for any users, queues, or projects.

If only PROJECTS are configured in the Limit section, SWP must be an integer value. SWP is the maximum amount of swap space available to the listed projects for any users, queues, or hosts.

If only LIC_PROJECTS are configured in the Limit section, SWP must be an integer value. SWP is the maximum amount of swap space available to the listed projects for any users, queues, projects, or hosts.

Use QUEUES or PER_QUEUE, USERS or PER_USER, HOSTS or PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit swap space available to resource consumers.

Default

No limit

Example

SWP=60

TMP

Syntax

TMP=integer[%]

TMP
- | integer[%]

Description

Maximum amount of tmp space available to resource consumers. Specify a value in MB, or the unit set in LSF_UNIT_FOR_LIMITS in lsf.conf, as a positive integer greater than or equal to 0.

The Limit section is ignored if TMP is specified as a percentage:

- Without PER_HOST, or
- With HOSTS

In horizontal format, use only one TMP line per Limit section.

In vertical format, use empty parentheses () or a dash (-) to indicate the default value (no limit). Fields cannot be left blank.

If only QUEUES are configured in the Limit section, TMP must be an integer value. TMP is the maximum amount of tmp space available to the listed queues for any hosts, users, or projects.

If only USERS are configured in the Limit section, TMP must be an integer value. TMP is the maximum amount of tmp space that the users or user groups can use on any hosts, queues, or projects.
If only HOSTS are configured in the Limit section, TMP must be an integer value. TMP is the maximum amount of tmp space available to the listed hosts for any users, queues, or projects.

If only PROJECTS are configured in the Limit section, TMP must be an integer value. TMP is the maximum amount of tmp space available to the listed projects for any users, queues, or hosts.

If only LIC_PROJECTS are configured in the Limit section, TMP must be an integer value. TMP is the maximum amount of tmp space available to the listed projects for any users, queues, projects, or hosts.

Use QUEUES or PER_QUEUE, USERS or PER_USER, HOSTS or PER_HOST, LIC_PROJECTS or PER_LIC_PROJECT, and PROJECTS or PER_PROJECT in combination to further limit tmp space available to resource consumers.

Default

No limit

Example

TMP=20%

USERS

Syntax

USERS=all [~]user_name ... | all [~]user_group ...

USERS
( [-] | all [~]user_name ... | all [~]user_group ... )

Description

A space-separated list of user names or user groups on which limits are enforced. Limits are enforced on all users or groups listed. Limits apply to a group as a whole.

If a group contains a subgroup, the limit also applies to each member in the subgroup recursively.

User names must be valid login names. User group names can be LSF user groups or UNIX and Windows user groups. UNIX user groups must be configured in lsb.users.

To specify a per-user limit, use the PER_USER keyword. Do not configure USERS and PER_USER limits in the same Limit section.

In horizontal format, use only one USERS line per Limit section.

Use the keyword all to configure limits that apply to all users or user groups in a cluster. Use the not operator (~) to exclude users or user groups from the all specification in the limit. This is useful if you have a large number of users but only want to exclude a few users or groups from the limit definition.
In vertical format, multiple user names must be enclosed in parentheses.

In vertical format, use empty parentheses () or a dash (-) to indicate an empty field. Fields cannot be left blank.

Default

all (limits are enforced on all users in the cluster)

Example

USERS=user1 user2

GuaranteedResourcePool section

Defines a guarantee policy. A guarantee is a commitment to ensure availability of a number of resources to a service class, where a service class is a container for jobs. Each guarantee pool can provide guarantees to multiple service classes, and each service class can have guarantees in multiple pools.

To use guaranteed resources, configure service classes with GOALS=[GUARANTEE] in the lsb.serviceclasses file.

GuaranteedResourcePool section structure

Each resource pool is defined in a GuaranteedResourcePool section and enclosed by Begin GuaranteedResourcePool and End GuaranteedResourcePool.

You must configure a NAME, TYPE, and DISTRIBUTION for each GuaranteedResourcePool section.

The order of GuaranteedResourcePool sections is important, as the sections are evaluated in the order configured. Each host can only be in one pool of host-based resources (slots, hosts, or package, each of which can have its own GuaranteedResourcePool section); ensure all GuaranteedResourcePool sections (except the last one) define the HOSTS parameter, so they do not contain the default of all hosts.

When LSF starts up, it goes through the hosts and assigns each host to a pool that will accept the host based on the pool's RES_SELECT and HOSTS parameters. If multiple pools will accept the host, the host is assigned to the first pool according to the configuration order of the pools.
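Putting the three required parameters together, a minimal pool needs only NAME, TYPE, and DISTRIBUTION. The following sketch is illustrative; the pool and service class names are placeholders, not part of the reference, and the service class must already exist with GOALS=[GUARANTEE] in lsb.serviceclasses:

```
Begin GuaranteedResourcePool
# Minimal pool: guarantees 10 slots to one service class.
# Because no HOSTS or RES_SELECT is given, this pool takes
# the default of all hosts, so it should be the last pool.
NAME = minimalPool
TYPE = slots
DISTRIBUTION = [svc1, 10]
End GuaranteedResourcePool
```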
Example GuaranteedResourcePool sections

Begin GuaranteedResourcePool
NAME = linuxGuarantee
TYPE = slots
HOSTS = linux_group
DISTRIBUTION = [sc1, 25] [sc2, 30]
LOAN_POLICIES=QUEUES[all] DURATION[15]
DESCRIPTION = This is the resource pool for the hostgroup linux_group, with\
25 slots guaranteed to sc1 and 30 slots guaranteed to sc2. Resources are\
loaned to jobs from any queue with runtimes of up to 15 minutes
End GuaranteedResourcePool

Begin GuaranteedResourcePool
NAME = x86Guarantee
TYPE = slots
HOSTS = linux_x86
DISTRIBUTION = [sc1, 25]
LOAN_POLICIES=QUEUES[short_jobs] DURATION[15]
DESCRIPTION = This is the resource pool for the hostgroup\
linux_x86 using the queue solaris, with 25 slots guaranteed\
to sc1. Resources are loaned to jobs for up to 15 minutes.
End GuaranteedResourcePool

Begin GuaranteedResourcePool
NAME = resource2pool
TYPE = resource[f2]
DISTRIBUTION = [sc1, 25%] [sc2, 25%]
LOAN_POLICIES=QUEUES[all] DURATION[10]
DESCRIPTION = This is the resource pool for all f2 resources managed by IBM\
Platform License Scheduler, with 25% guaranteed to each of sc1 and sc2. \
Resources are loaned to jobs from any queue with runtimes of up to 10 minutes.
End GuaranteedResourcePool

Parameters

- NAME
- TYPE
- HOSTS
- RES_SELECT
- DISTRIBUTION
- LOAN_POLICIES
- DESCRIPTION

NAME

Syntax

NAME=name

Description

The name of the guarantee policy.

Default

None. You must provide a name for the guarantee.

TYPE

Syntax

TYPE = slots | hosts | resource[shared_resource] | package[slots=[slots_per_package][:mem=mem_per_package]]

Description

Defines the type of resources to be guaranteed in this guarantee policy. These can be slots, whole hosts, packages composed of an amount of slots and memory bundled on a single host, or licenses managed by License Scheduler.

Specify resource[license] to guarantee licenses (which must be managed by License Scheduler) to service class guarantee jobs.

The package keyword specifies the combination of memory and slots that defines the packages to be treated as resources reserved by service class guarantee jobs. For example:

TYPE=package[slots=1:mem=1000]

Each unit guaranteed is for one slot and 1000 MB of memory. LSF_UNIT_FOR_LIMITS in lsf.conf determines the units of memory in the package definition. The default value of LSF_UNIT_FOR_LIMITS is MB, therefore the guarantee is for 1000 MB of memory.

A package need not have both slots and memory. Setting TYPE=package[slots=1] is the equivalent of slots. To provide guarantees for parallel jobs that require multiple CPUs on a single host where memory is not an important resource, you can use packages with multiple slots and not specify mem.

Each host can belong to at most one slot/host/package guarantee pool.

Default

None. You must specify the type of guarantee.

HOSTS

Syntax

HOSTS=all | allremote | all@cluster_name ... | [~]host_name | [~]host_group

Description

A space-separated list of hosts or host groups defined in lsb.hosts, on which the guarantee is enforced.

Use the keyword all to include all hosts in a cluster. Use the not operator (~) to exclude hosts from the all specification in the guarantee.

Use host groups for greater flexibility, since host groups have additional configuration options.

Ensure all GuaranteedResourcePool sections (except the last one) define the HOSTS or RES_SELECT parameter, so they do not contain the default of all hosts.

Default

all

RES_SELECT

Syntax

RES_SELECT=res_req

Description

Resource requirement string with which all hosts defined by the HOSTS parameter are further filtered. For example:

RES_SELECT=type==LINUX86

Only static host attributes can be used in RES_SELECT. Do not use consumable resources or dynamic resources.

Default

None. RES_SELECT is optional.

DISTRIBUTION

Syntax

DISTRIBUTION=([service_class_name, amount[%]]...)
Description

Assigns the amount of resources in the pool to the specified service classes, where amount can be an absolute number or a percentage of the resources in the pool. The outer brackets are optional.

When configured as a percentage, the total can exceed 100%, but each assigned percentage cannot exceed 100%. For example:

DISTRIBUTION=[sc1,50%] [sc2,50%] [sc3,50%]

is an acceptable configuration even though the total percentages assigned add up to 150%.

DISTRIBUTION=[sc1,120%]

is not an acceptable configuration, since the percentage for sc1 is greater than 100%.

Each service class must be configured in lsb.serviceclasses, with GOALS=[GUARANTEE].

When configured as a percentage and there are remaining resources to distribute (because the calculated number of slots is rounded down), LSF distributes the remaining resources using round-robin distribution, starting with the first configured service class. Therefore, the service classes that you define first will receive additional resources regardless of the configured percentage.

For example, there are 93 slots in a pool and you configure the following guarantee distribution:

DISTRIBUTION=[sc1,30%] [sc2,10%] [sc3,30%]

The total number of slots assigned to the guarantee policy is:

floor((30% + 10% + 30%)*(93 slots)) = 65 slots

The slots are distributed to the service classes as follows:

- sc1_slots = floor(30%*93) = 27
- sc2_slots = floor(10%*93) = 9
- sc3_slots = floor(30%*93) = 27

As a result of rounding down, the total number of distributed slots is 27+9+27=63 slots, which means there are two remaining slots to distribute. Using round-robin distribution, LSF distributes one slot each to sc1 and sc2 because these service classes are defined first.
Therefore, the final slot distribution to the service classes is as follows:

- sc1_slots = floor(30%*93) + 1 = 28
- sc2_slots = floor(10%*93) + 1 = 10
- sc3_slots = floor(30%*93) = 27

If you configure sc3 before sc2 (DISTRIBUTION=[sc1,30%] [sc3,30%] [sc2,10%]), LSF distributes the two remaining slots to sc1 and sc3. Therefore, the slots are distributed as follows:

- sc1_slots = floor(30%*93) + 1 = 28
- sc3_slots = floor(30%*93) + 1 = 28
- sc2_slots = floor(10%*93) = 9

Default

None. You must provide a distribution for the resource pool.

LOAN_POLICIES

Syntax

LOAN_POLICIES=QUEUES[queue_name ... | all] [CLOSE_ON_DEMAND] [DURATION[minutes]] [RETAIN[amount[%]]]

Description

By default, LSF reserves sufficient resources in each guarantee pool to honor the configured guarantees. To increase utilization, use LOAN_POLICIES to allow any job (with or without guarantees) to use these reserved resources when not needed by jobs with guarantees. When resources are loaned out, jobs with guarantees may have to wait for jobs to finish before they are able to dispatch in the pool.

QUEUES[all | queue_name ...] loans only to jobs from the specified queue or queues. You must specify which queues are permitted to borrow resources reserved for guarantees.

When CLOSE_ON_DEMAND is specified, LSF stops loaning out from a pool whenever there is pending demand from jobs with guarantees in the pool.

DURATION[minutes] only allows jobs to borrow the resources if the job run limit (or estimated run time) is no larger than minutes. Loans limited by job duration make the guaranteed resources available within the time specified by minutes. Jobs running longer than the estimated run time will run to completion regardless of the actual run time.

RETAIN[amount[%]] enables LSF to try to keep idle the amount of resources specified in RETAIN as long as there are unused guarantees. These idle resources can only be used to honor guarantees.
Whenever the number of free resources in the pool drops below the RETAIN amount, LSF stops loaning resources from the pool.

Default

None. LOAN_POLICIES is optional.

DESCRIPTION

Syntax

DESCRIPTION=description

Description

A description of the guarantee policy.

Default

None. DESCRIPTION is optional.

HostExport section

Defines an export policy for a host or a group of related hosts. Defines how much of each host's resources are exported, and how the resources are distributed among the consumers.

Each export policy is defined in a separate HostExport section, so it is normal to have multiple HostExport sections in lsb.resources.

HostExport section structure

Use empty parentheses ( ) or a dash (-) to specify the default value for an entry. Fields cannot be left blank.

Example HostExport section

Begin HostExport
PER_HOST= hostA hostB
SLOTS= 4
DISTRIBUTION= [cluster1, 1] [cluster2, 3]
MEM= 100
SWP= 100
End HostExport

Parameters

v PER_HOST
v RES_SELECT
v NHOSTS
v DISTRIBUTION
v MEM
v SLOTS
v SWAP
v TYPE

PER_HOST

Syntax

PER_HOST=host_name...

Description

Required when exporting special hosts. Determines which hosts to export.

Specify one or more LSF hosts by name. Separate names by space.

RES_SELECT

Syntax

RES_SELECT=res_req

Description

Required when exporting workstations. Determines which hosts to export.

Specify the selection part of the resource requirement string (without quotes or parentheses), and LSF automatically selects hosts that meet the specified criteria. For this parameter, if you do not specify the required host type, the default is type==any.

When LSF_STRICT_RESREQ=Y is configured in lsf.conf, resource requirement strings in select sections must conform to a stricter syntax. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, or cu).
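As an illustrative sketch of exporting workstations with RES_SELECT (the selection string, host count, and cluster names here are hypothetical):

Begin HostExport
RES_SELECT = type==LINUX86 && swp > 100
NHOSTS = 10
DISTRIBUTION = [cluster2, 1] [cluster3, 1]
End HostExport

In this sketch, LSF selects up to 10 hosts matching the selection string and distributes the exported resources equally between the two consumer clusters.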
When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

The selection criteria are evaluated only once, when a host is exported.

NHOSTS

Syntax

NHOSTS=integer

Description

Required when exporting workstations. Maximum number of hosts to export. If there are not this many hosts meeting the selection criteria, LSF exports as many as it can.

DISTRIBUTION

Syntax

DISTRIBUTION=([cluster_name, number_shares]...)

Description

Required. Specifies how the exported resources are distributed among consumer clusters.

The syntax for the distribution list is a series of share assignments. The syntax of each share assignment is the cluster name, a comma, and the number of shares, all enclosed in square brackets, as shown. Use a space to separate multiple share assignments. Enclose the full distribution list in a set of round brackets.

cluster_name

Specify the name of a remote cluster that will be allowed to use the exported resources. If you specify a local cluster, the assignment is ignored.

number_shares

Specify a positive integer representing the number of shares of exported resources assigned to the cluster.

The number of shares assigned to a cluster is only meaningful when you compare it to the number assigned to other clusters, or to the total number. The total number of shares is just the sum of all the shares assigned in each share assignment.

MEM

Syntax

MEM=megabytes

Description

Used when exporting special hosts. Specify the amount of memory to export on each host, in MB or in units set in LSF_UNIT_FOR_LIMITS in lsf.conf.

Default

- (provider and consumer clusters compete for available memory)

SLOTS

Syntax

SLOTS=integer

Description

Required when exporting special hosts. Specify the number of job slots to export on each host.
To avoid overloading a partially exported host, you can reduce the number of job slots in the configuration of the local cluster.

SWAP

Syntax

SWAP=megabytes

Description

Used when exporting special hosts. Specify the amount of swap space to export on each host, in MB or in units set in LSF_UNIT_FOR_LIMITS in lsf.conf.

Default

- (provider and consumer clusters compete for available swap space)

TYPE

Syntax

TYPE=shared

Description

Changes the lease type from exclusive to shared.

If you export special hosts with a shared lease (using PER_HOST), you cannot specify multiple consumer clusters in the distribution policy.

Default

Undefined (the lease type is exclusive; exported resources are never available to the provider cluster)

SharedResourceExport section

Optional. Requires HostExport section.

Defines an export policy for a shared resource. Defines how much of the shared resource is exported, and the distribution among the consumers.

The shared resource must be available on hosts defined in the HostExport sections.

SharedResourceExport section structure

All parameters are required.

Example SharedResourceExport section

Begin SharedResourceExport
NAME= AppRes
NINSTANCES= 10
DISTRIBUTION= ([C1, 30] [C2, 70])
End SharedResourceExport

Parameters

v NAME
v NINSTANCES
v DISTRIBUTION

NAME

Syntax

NAME=shared_resource_name

Description

Shared resource to export. This resource must be available on the hosts that are exported to the specified clusters; you cannot export resources without hosts.

NINSTANCES

Syntax

NINSTANCES=integer

Description

Maximum quantity of shared resource to export. If the total number available is less than the requested amount, LSF exports all that are available.

DISTRIBUTION

Syntax

DISTRIBUTION=([cluster_name, number_shares]...)

Description

Specifies how the exported resources are distributed among consumer clusters.

The syntax for the distribution list is a series of share assignments.
The syntax of each share assignment is the cluster name, a comma, and the number of shares, all enclosed in square brackets, as shown. Use a space to separate multiple share assignments. Enclose the full distribution list in a set of round brackets.

cluster_name

Specify the name of a cluster allowed to use the exported resources.

number_shares

Specify a positive integer representing the number of shares of exported resources assigned to the cluster.

The number of shares assigned to a cluster is only meaningful when you compare it to the number assigned to other clusters, or to the total number. The total number of shares is the sum of all the shares assigned in each share assignment.

ResourceReservation section

By default, only LSF administrators or root can add or delete advance reservations.

The ResourceReservation section defines an advance reservation policy. It specifies:
v Users or user groups that can create reservations
v Hosts that can be used for the reservation
v Time window when reservations can be created

Each advance reservation policy is defined in a separate ResourceReservation section, so it is normal to have multiple ResourceReservation sections in lsb.resources.

Example ResourceReservation section

Only user1 and user2 can make advance reservations on hostA and hostB. The reservation time window is between 8:00 a.m. and 6:00 p.m. every day:

Begin ResourceReservation
NAME = dayPolicy
USERS = user1 user2 # optional
HOSTS = hostA hostB # optional
TIME_WINDOW = 8:00-18:00 # weekly recurring reservation
End ResourceReservation

user1 can add the following reservation for user user2 to use on hostA every Friday between 9:00 a.m. and 11:00 a.m.:

% user1@hostB> brsvadd -m "hostA" -n 1 -u "user2" -t "5:9:0-5:11:0"
Reservation "user2#2" is created

Users can only delete reservations they created themselves. In the example, only user user1 can delete the reservation; user2 cannot.
Administrators can delete any reservations created by users.

Parameters

v HOSTS
v NAME
v TIME_WINDOW
v USERS

HOSTS

Syntax

HOSTS=[~]host_name | [~]host_group | all | allremote | all@cluster_name ...

Description

A space-separated list of hosts or host groups defined in lsb.hosts on which administrators or users specified in the USERS parameter can create advance reservations.

The hosts can be local to the cluster or hosts leased from remote clusters.

If a group contains a subgroup, the reservation configuration applies to each member in the subgroup recursively.

Use the keyword all to configure reservation policies that apply to all local hosts in a cluster not explicitly excluded. This is useful if you have a large cluster but you want to use the not operator (~) to exclude a few hosts from the list of hosts where reservations can be created.

Use the keyword allremote to specify all hosts borrowed from all remote clusters.

Tip: You cannot specify host groups or host partitions that contain the allremote keyword.

Use all@cluster_name to specify the group of all hosts borrowed from one remote cluster. You cannot specify a host group or partition that includes remote resources.

With the MultiCluster resource leasing model, the not operator (~) can be used to exclude local hosts or host groups. You cannot use the not operator (~) with remote hosts.

Examples

HOSTS=hgroup1 ~hostA hostB hostC

Advance reservations can be created on hostB, hostC, and all hosts in hgroup1 except for hostA.

HOSTS=all ~group2 ~hostA

Advance reservations can be created on all hosts in the cluster, except for hostA and the hosts in group2.

Default

all allremote (users can create reservations on all server hosts in the local cluster, and all leased hosts in a remote cluster).

NAME

Syntax

NAME=text

Description

Required. Name of the ResourceReservation section.

Specify any ASCII string 40 characters or less.
You can use letters, digits, underscores (_) or dashes (-). You cannot use blank spaces.

Example

NAME=reservation1

Default

None. You must provide a name for the ResourceReservation section.

TIME_WINDOW

Syntax

TIME_WINDOW=time_window ...

Description

Optional. Time window for users to create advance reservations. The time for reservations that users create must fall within this time window.

Use the same format for time_window as the recurring reservation option (-t) of brsvadd.

To specify a time window, specify two time values separated by a hyphen (-), with no space in between:

time_window = begin_time-end_time

Time format

Times are specified in the format:

[day:]hour[:minute]

where all fields are numbers with the following ranges:
v day of the week: 0-6 (0 is Sunday)
v hour: 0-23
v minute: 0-59

Specify a time window one of the following ways:
v hour-hour
v hour:minute-hour:minute
v day:hour:minute-day:hour:minute

The default value for minute is 0 (on the hour); the default value for day is every day of the week.

You must specify at least the hour. Day of the week and minute are optional. Both the start time and end time values must use the same syntax. If you do not specify a minute, LSF assumes the first minute of the hour (:00). If you do not specify a day, LSF assumes every day of the week. If you do specify the day, you must also specify the minute.

You can specify multiple time windows, but they cannot overlap. For example:

timeWindow(8:00-14:00 18:00-22:00)

is correct, but

timeWindow(8:00-14:00 11:00-15:00)

is not valid.

Example

TIME_WINDOW=8:00-14:00

Users can create advance reservations with begin time (brsvadd -b), end time (brsvadd -e), or time window (brsvadd -t) on any day between 8:00 a.m. and 2:00 p.m.

Default

Undefined (any time)

USERS

Syntax

USERS=[~]user_name | [~]user_group ...
| all

Description

A space-separated list of user names or user groups who are allowed to create advance reservations. Administrators, root, and all users or groups listed can create reservations.

If a group contains a subgroup, the reservation policy applies to each member in the subgroup recursively.

User names must be valid login names. User group names can be LSF user groups or UNIX and Windows user groups.

Use the keyword all to configure reservation policies that apply to all users or user groups in a cluster. This is useful if you have a large number of users but you want to exclude a few users or groups from the reservation policy.

Use the not operator (~) to exclude users or user groups from the list of users who can create reservations.

CAUTION: The not operator does not exclude LSF administrators from the policy.

Example

USERS=user1 user2

Default

all (all users in the cluster can create reservations)

ReservationUsage section

To enable greater flexibility for reserving numeric resources that are reserved by jobs, configure the ReservationUsage section in lsb.resources to reserve resources as PER_JOB, PER_TASK, or PER_HOST. For example:

Example ReservationUsage section

Begin ReservationUsage
RESOURCE    METHOD     RESERVE
resourceX   PER_JOB    Y
resourceY   PER_HOST   N
resourceZ   PER_TASK   N
End ReservationUsage

Parameters

v RESOURCE
v METHOD
v RESERVE

RESOURCE

The name of the resource to be reserved. User-defined numeric resources can be reserved, but only if they are shared (they are not specific to one host).

The following built-in resources can be configured in the ReservationUsage section and reserved:
v mem
v tmp
v swp

Any custom resource can also be reserved if it is shared (defined in the Resource section of lsf.shared) or host based (listed in the Host section of the lsf.cluster file in the resource column).

METHOD

The resource reservation method.
One of:
v PER_JOB
v PER_HOST
v PER_TASK

The cluster-wide RESOURCE_RESERVE_PER_SLOT parameter in lsb.params is obsolete. The RESOURCE_RESERVE_PER_TASK parameter still controls resources not configured in lsb.resources. Resources not reserved in lsb.resources are reserved per job.

PER_HOST reservation means that for a parallel job, LSF reserves one instance of the resource for each host. For example, some application licenses are charged only once no matter how many applications are running, provided those applications are running on the same host under the same user.

Use no method ("-") when setting mem, swp, or tmp as RESERVE=Y.

RESERVE

Reserves the resource for pending jobs that are waiting for another resource to become available.

For example, job A requires resources X, Y, and Z to run, but resource Z is a high demand or scarce resource. This job pends until Z is available. In the meantime, other jobs requiring only X and Y resources run. If X and Y are set as reservable resources (the RESERVE parameter is set to "Y"), as soon as resource Z is available, job A runs. If they are not, job A may never be able to run because all resources are never available at the same time.

Restriction: Only the following built-in resources can be defined as reservable:
v mem
v swp
v tmp

Use no method ("-") when setting mem, swp, or tmp as RESERVE=Y.

When submitting a job, the queue must have RESOURCE_RESERVE defined. Backfill of the reservable resources is also supported when you submit a job with reservable resources to a queue with BACKFILL defined.

Valid values are Y and N. If not specified, resources are not reserved.

Assumptions and limitations

v Per-resource configuration defines resource usage for individual resources, but it does not change any existing resource limit behavior (PER_JOB, PER_TASK).
v In a MultiCluster environment, you should configure resource usage in the scheduling cluster (submission cluster in the lease model or receiving cluster in the job forward model).

Automatic time-based configuration

Variable configuration is used to automatically change LSF configuration based on time windows. You define automatic configuration changes in lsb.resources by using if-else constructs and time expressions. After you change the files, reconfigure the cluster with the badmin reconfig command.

The expressions are evaluated by LSF every 10 minutes based on mbatchd start time. When an expression evaluates true, LSF dynamically changes the configuration based on the associated configuration statements. Reconfiguration is done in real time without restarting mbatchd, providing continuous system availability.

Example

# limit usage of hosts for group and time based configuration
# - 10 jobs can run from normal queue
# - any number can run from short queue between 18:30 and 19:30;
#   all other hours you are limited to 100 slots in the short queue
# - each other queue can run 30 jobs
Begin Limit
PER_QUEUE              HOSTS        SLOTS
normal                 Resource1    10
# if time(18:30-19:30)
short                  Resource1    -
#else
short                  Resource1    100
#endif
(all ~normal ~short)   Resource1    30
End Limit

PowerPolicy section

This section enables and defines a power management policy.

Example PowerPolicy section

Begin PowerPolicy
NAME = policy_night
HOSTS = hostGroup1 host3
TIME_WINDOW = 23:00-8:00
MIN_IDLE_TIME = 1800
CYCLE_TIME = 60
End PowerPolicy

Parameters

v NAME
v HOSTS
v TIME_WINDOW
v MIN_IDLE_TIME
v CYCLE_TIME

NAME

Syntax

NAME=string

Description

Required. Unique name for the power management policy.

Specify any ASCII string 60 characters or less. You can use letters, digits, underscores (_), dashes (-), or periods (.). You cannot use blank spaces.

Example

NAME=policy_night1

Default

None. You must provide a name to define a PowerPolicy.
HOSTS

Syntax

HOSTS=host_list

Description

Required. host_list is a space-separated list of host names, host groups, host partitions, or compute units. The specified hosts should not overlap between power policies.

Example

HOSTS=hostGroup1 host3

Default

If not defined, the default value is all hosts that are not included in another power policy (this does not include the master host and master candidates).

TIME_WINDOW

Syntax

TIME_WINDOW=time_window ...

Description

Required. The time window is the time period to which the power policy applies.

To specify a time window, specify two time values separated by a hyphen (-), with no space in between:

time_window = begin_time-end_time

Time format

Times are specified in the format:

[day:]hour[:minute]

where all fields are numbers with the following ranges:
v day of the week: 0-6 (0 is Sunday)
v hour: 0-23
v minute: 0-59

Specify a time window one of the following ways:
v hour-hour
v hour:minute-hour:minute
v day:hour:minute-day:hour:minute

The default value for minute is 0 (on the hour); the default value for day is every day of the week.

You must specify at least the hour. Day of the week and minute are optional. Both the start time and end time values must use the same syntax. If you do not specify a minute, LSF assumes the first minute of the hour (:00). If you do not specify a day, LSF assumes every day of the week. If you do specify the day, you must also specify the minute.

You can specify multiple time windows, but they cannot overlap. For example:

timeWindow(8:00-14:00 18:00-22:00)

is correct, but

timeWindow(8:00-14:00 11:00-15:00)

is not valid.

Example

TIME_WINDOW=8:00-14:00

Default

Not defined (any time)

MIN_IDLE_TIME

Syntax

MIN_IDLE_TIME=minutes

Description

This parameter takes effect if TIME_WINDOW is configured and is valid. It defines the host idle time before power operations are issued for defined hosts.
Example

MIN_IDLE_TIME=60

Default

0

CYCLE_TIME

Syntax

CYCLE_TIME=minutes

Description

This parameter takes effect if TIME_WINDOW is configured and is valid. It defines the minimum time (in minutes) between changes in power states for defined hosts.

Example

CYCLE_TIME=15

Default

0

lsb.serviceclasses

The lsb.serviceclasses file defines the service-level agreements (SLAs) in an LSF cluster as service classes, which define the properties of the SLA.

This file is optional. You can configure as many service class sections as you need.

Use bsla to display the properties of service classes configured in lsb.serviceclasses and dynamic information about the state of each configured service class.

By default, lsb.serviceclasses is installed in LSB_CONFDIR/cluster_name/configdir.

Changing lsb.serviceclasses configuration

After making any changes to lsb.serviceclasses, run badmin reconfig to reconfigure mbatchd.

lsb.serviceclasses structure

Each service class definition begins with the line Begin ServiceClass and ends with the line End ServiceClass.

Syntax

Begin ServiceClass
NAME = string
PRIORITY = integer
GOALS = [throughput | velocity | deadline] [\ ...]
CONTROL_ACTION = VIOLATION_PERIOD[minutes] CMD [action]
USER_GROUP = all | [user_name] [user_group] ...
DESCRIPTION = text
End ServiceClass

Begin ServiceClass
NAME = string
GOALS = guarantee
ACCESS_CONTROL = [QUEUES[ queue ...]] [USERS[ [user_name] [user_group] ...]] [FAIRSHARE_GROUPS[user_group ...]] [APPS[app_name ...]] [PROJECTS[proj_name...]]
AUTO_ATTACH = Y | y | N | n
DESCRIPTION = text
End ServiceClass

You must specify:
v Service class name
v Goals

Service classes with guarantee goals cannot have PRIORITY, CONTROL_ACTION, or USER_GROUP defined.

To configure EGO-enabled SLA scheduling, you must specify an existing EGO consumer name to allow the SLA to get host allocations from EGO.

All other parameters are optional.
Example

Begin ServiceClass
NAME=Sooke
PRIORITY=20
GOALS=[DEADLINE timeWindow (8:30-16:00)]
DESCRIPTION="working hours"
End ServiceClass

Begin ServiceClass
NAME=Newmarket
GOALS=[GUARANTEE]
ACCESS_CONTROL = QUEUES[batch] FAIRSHARE_GROUPS[team2]
AUTO_ATTACH = Y
DESCRIPTION="guarantee for team2 batch jobs"
End ServiceClass

Parameters

v ACCESS_CONTROL
v AUTO_ATTACH
v CONSUMER
v CONTROL_ACTION
v DESCRIPTION
v EGO_RES_REQ
v EGO_RESOURCE_GROUP
v GOALS
v MAX_HOST_IDLE_TIME
v NAME
v PRIORITY
v USER_GROUP

ACCESS_CONTROL

Syntax

ACCESS_CONTROL=[QUEUES[queue ...]] [USERS[ [user_name] [user_group] ...]] [FAIRSHARE_GROUPS[user_group ...]] [APPS[app_name ...]] [PROJECTS[proj_name...]] [LIC_PROJECTS[lic_proj...]]

Description

Guarantee SLAs (with GOALS=[GUARANTEE]) only. Restricts access to a guarantee SLA. If more than one restriction is configured, all must be satisfied.

v QUEUES restricts access to the queues listed; the queue is specified for jobs at submission using bsub -q.
v USERS restricts access to jobs submitted by the users or user groups specified. User names must be valid login names. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name). User group names can be LSF user groups or UNIX and Windows user groups. To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\user_group).
v FAIRSHARE_GROUPS restricts access to the fairshare groups listed; the fairshare group is specified for jobs at submission using bsub -G.
v APPS restricts access to the application profiles listed; the application profile is specified for jobs at submission using bsub -app.
v PROJECTS restricts access to the projects listed; the project is specified for jobs at submission using bsub -P.
Example

ACCESS_CONTROL = QUEUES[normal short] USERS[ug1]

Jobs submitted to the queues normal or short by users in user group ug1 are the only jobs accepted by the guarantee SLA.

Default

None. Access to the guarantee SLA is not restricted.

AUTO_ATTACH

Syntax

AUTO_ATTACH=Y | y | N | n

Description

Guarantee SLAs (with GOALS=[GUARANTEE]) only. Used with ACCESS_CONTROL.

Enabling AUTO_ATTACH when a guarantee SLA has ACCESS_CONTROL configured results in submitted jobs automatically attaching to the guarantee SLA if they have access. If a job can access multiple guarantee SLAs with AUTO_ATTACH enabled, the job is automatically attached to the first accessible SLA based on configuration order in the lsb.serviceclasses file.

During restart or reconfiguration, automatic attachments to guarantee SLAs are checked and jobs may be attached to a different SLA. During live reconfiguration (using the bconf command) automatic attachments are not checked, and jobs remain attached to the same guarantee SLAs regardless of configuration changes.

Example

Begin ServiceClass
...
NAME = Maple
GOALS = [GUARANTEE]
ACCESS_CONTROL = QUEUES[priority] USERS[ug1]
AUTO_ATTACH = Y
...
End ServiceClass

All jobs submitted to the priority queue by users in user group ug1 and submitted without an SLA specified are automatically attached to the service class Maple.

Default

N

CONSUMER

Syntax

CONSUMER=ego_consumer_name

Description

For EGO-enabled SLA service classes, the name of the EGO consumer from which hosts are allocated to the SLA. This parameter is not mandatory, but must be configured for the SLA to receive hosts from EGO.

Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have CONSUMER set. If defined, it will be ignored.

Important: CONSUMER must specify the name of a valid consumer in EGO. If a default SLA is configured with ENABLE_DEFAULT_EGO_SLA in lsb.params, all service classes configured in lsb.serviceclasses must specify a consumer name.
Default

None

CONTROL_ACTION

Syntax

CONTROL_ACTION=VIOLATION_PERIOD[minutes] CMD [action]

Description

Optional. Configures a control action to be run if the SLA goal is delayed for a specified number of minutes.

If the SLA goal is delayed for longer than VIOLATION_PERIOD, the action specified by CMD is invoked. The violation period is reset and, if the SLA is still active when the violation period expires again, the action runs again. If the SLA has multiple active goals that are in violation, the action is run for each of them.

Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have CONTROL_ACTION set. If defined, it will be ignored.

Example

CONTROL_ACTION=VIOLATION_PERIOD[10] CMD [echo `date`: SLA is in violation >> ! /tmp/sla_violation.log]

Default

None

DESCRIPTION

Syntax

DESCRIPTION=text

Description

Optional. Description of the service class. Use bsla to display the description text.

This description should clearly describe the features of the service class to help users select the proper service class for their jobs.

The text can include any characters, including white space. The text can be extended to multiple lines by ending the preceding line with a backslash (\).

Default

None

EGO_RES_REQ

Syntax

EGO_RES_REQ=res_req

Description

For EGO-enabled SLA service classes, the EGO resource requirement that specifies the characteristics of the hosts that EGO will assign to the SLA.

Must be a valid EGO resource requirement. The EGO resource requirement string supports the select section, but the format is different from LSF resource requirements.

Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have EGO_RES_REQ set. If defined, it will be ignored.

Note: After changing this parameter, running jobs using the allocation may be re-queued.
Example

EGO_RES_REQ=select(linux && maxmem > 100)

Default

None

EGO_RESOURCE_GROUP

Syntax

EGO_RESOURCE_GROUP=ego_resource_group_name ...

Description

For EGO-enabled SLA service classes. A resource group or space-separated list of resource groups from which hosts are allocated to the SLA.

The list must be a subset of, or equal to, the resource groups allocated to the consumer defined by the CONSUMER entry.

Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have EGO_RESOURCE_GROUP set. If defined, it will be ignored.

Note: After changing this parameter, running jobs using the allocation may be re-queued.

Example

EGO_RESOURCE_GROUP=resource_group1 resource_group4 resource_group5

Default

Undefined (vemkd determines which resource groups to allocate slots to LSF)

GOALS

Syntax

GOALS=[throughput | velocity | deadline] [\ [throughput | velocity | deadline] ...]

GOALS=[guarantee]

Description

Required. Defines the service-level goals for the service class. A service class can have more than one goal, each active at different times of the day and days of the week.
velocity - expressed as concurrently running jobs and an optional time window when the goal is active. velocity has the form: GOALS=[VELOCITY num_jobs timeWindow [(time_window)]] If no time window is configured, VELOCITY can be the only goal in the service class. The service class is always active, and bsla displays ACTIVE WINDOW: Always Open. deadline - indicates that all jobs in the service class should complete by the end of the specified time window. The time window is required for a deadline goal. deadline has the form: GOALS=[DEADLINE timeWindow (time_window)] guarantee - indicates the SLA has guaranteed resources defined in lsb.resources and is able to guarantee resources, depending on the scavenging policies configured. Guarantee goals cannot be combined with any other goals, and do not accept time windows. GOALS=[GUARANTEE] Restriction: EGO-enabled SLA service classes only support velocity goals. Deadline, throughput, and guarantee goals are not supported. The configured velocity value for EGO-enabled SLA service classes is considered to be a minimum number of jobs that should be in run state from the SLA Time window format The time window of an SLA goal has the standard form: begin_time-end_time Chapter 1. Configuration Files 333 lsb.serviceclasses Times are specified in the format: [day:]hour[:minute] where all fields are numbers with the following ranges: v day of the week: 0-6 (0 is Sunday) v hour: 0-23 v minute: 0-59 Specify a time window one of the following ways: v hour-hour v hour:minute-hour:minute v day:hour:minute-day:hour:minute The default value for minute is 0 (on the hour); the default value for day is every day of the week. You must specify at least the hour. Day of the week and minute are optional. Both the start time and end time values must use the same syntax. If you do not specify a minute, LSF assumes the first minute of the hour (:00). If you do not specify a day, LSF assumes every day of the week. 
If you do specify the day, you must also specify the minute.

You can specify multiple time windows, but they cannot overlap. For example:

timeWindow(8:00-14:00 18:00-22:00)

is correct, but

timeWindow(8:00-14:00 11:00-15:00)

is not valid.

Tip: To configure a time window that is always open, use the timeWindow keyword with empty parentheses.

Examples

GOALS=[THROUGHPUT 2 timeWindow ()]
GOALS=[THROUGHPUT 10 timeWindow (8:30-16:30)]
GOALS=[VELOCITY 5 timeWindow ()]
GOALS=[DEADLINE timeWindow (16:30-8:30)] [VELOCITY 10 timeWindow (8:30-16:30)]
GOALS=[GUARANTEE]

MAX_HOST_IDLE_TIME

Syntax

MAX_HOST_IDLE_TIME=seconds

Description

For EGO-enabled SLA service classes, the number of seconds that the SLA holds its idle hosts before LSF releases them to EGO. Each SLA can configure a different idle time. Do not set this parameter to a small value, or LSF may release hosts too quickly.

Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have MAX_HOST_IDLE_TIME set. If defined, it will be ignored.

Default

120 seconds

NAME

Syntax

NAME=string

Description

Required. A unique name that identifies the service class.

Specify any ASCII string 60 characters or less. You can use letters, digits, underscores (_) or dashes (-). You cannot use blank spaces.

Important: The name you use cannot be the same as an existing host partition, user group name, or fairshare queue name.

Example

NAME=Tofino

Default

None. You must provide a unique name for the service class.

PRIORITY

Syntax

PRIORITY=integer

Description

Required (time-based SLAs only). The service class priority. A higher value indicates a higher priority, relative to other service classes. Similar to queue priority, service classes access the cluster resources in priority order.

LSF schedules jobs from one service class at a time, starting with the highest-priority service class.
If multiple service classes have the same priority, LSF runs all the jobs from these service classes in first-come, first-served order. Service class priority in LSF is completely independent of the UNIX scheduler's priority system for time-sharing processes. In LSF, the NICE parameter is used to set the UNIX time-sharing priority for batch jobs. Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have PRIORITY set. If defined, it will be ignored. Default None. USER_GROUP Syntax USER_GROUP=all | [user_name] [user_group] ... Description Optional. A space-separated list of user names or user groups who can submit jobs to the service class. Administrators, root, and all users or groups listed can use the service class. Use the reserved word all to specify all LSF users. LSF cluster administrators are automatically included in the list of users, so LSF cluster administrators can submit jobs to any service class, or switch any user's jobs into this service class, even if they are not listed. If user groups are specified in lsb.users, each user in the group can submit jobs to this service class. If a group contains a subgroup, the service class policy applies to each member in the subgroup recursively. If the group can define fairshare among its members, the SLA defined by the service class enforces the fairshare policy among the users of the SLA. User names must be valid login names. User group names can be LSF user groups (in lsb.users) or UNIX and Windows user groups. Guarantee SLAs (with GOALS=[GUARANTEE]) cannot have USER_GROUP set. If defined, it will be ignored. Example USER_GROUP=user1 user2 ugroup1 Default all (all users in the cluster can submit jobs to the service class) Examples v The resource-based service class AccountingSLA guarantees hosts to the user group accountingUG for jobs submitted to the queue longjobs.
Jobs submitted to this queue by this user group without an SLA specified are automatically attached to the SLA. The guaranteed resource pools used by the SLA are configured in lsb.resources.

Begin ServiceClass
NAME=AccountingSLA
GOALS=[GUARANTEE]
DESCRIPTION="Guaranteed hosts for the accounting department"
ACCESS_CONTROL = QUEUES[longjobs] USERS[accountingUG]
AUTO_ATTACH = Y
End ServiceClass

v The service class Sooke defines one deadline goal that is active during working hours between 8:30 AM and 4:00 PM. All jobs in the service class should complete by the end of the specified time window. Outside of this time window, the SLA is inactive and jobs are scheduled without any goal being enforced:

Begin ServiceClass
NAME=Sooke
PRIORITY=20
GOALS=[DEADLINE timeWindow (8:30-16:00)]
DESCRIPTION="working hours"
End ServiceClass

v The service class Nanaimo defines a deadline goal that is active during the weekends and at nights.

Begin ServiceClass
NAME=Nanaimo
PRIORITY=20
GOALS=[DEADLINE timeWindow (5:18:00-1:8:30 20:00-8:30)]
DESCRIPTION="weekend nighttime regression tests"
End ServiceClass

v The service class Sidney defines a throughput goal of 6 jobs per hour that is always active:

Begin ServiceClass
NAME=Sidney
PRIORITY=20
GOALS=[THROUGHPUT 6 timeWindow ()]
DESCRIPTION="constant throughput"
End ServiceClass

v The service class Tofino defines two velocity goals in a 24 hour period. The first goal is to have a maximum of 10 concurrently running jobs during business hours (9:00 a.m. to 5:00 p.m.). The second goal is a maximum of 30 concurrently running jobs during off-hours (5:30 p.m. to 8:30 a.m.):

Begin ServiceClass
NAME=Tofino
PRIORITY=20
GOALS=[VELOCITY 10 timeWindow (9:00-17:00)] [VELOCITY 30 timeWindow (17:30-8:30)]
DESCRIPTION="day and night velocity"
End ServiceClass

v The service class Duncan defines a velocity goal that is active during working hours (9:00 a.m. to 5:30 p.m.)
and a deadline goal that is active during off-hours (5:30 p.m. to 9:00 a.m.). Only users user1 and user2 can submit jobs to this service class.

Begin ServiceClass
NAME=Duncan
PRIORITY=23
USER_GROUP=user1 user2
GOALS=[VELOCITY 8 timeWindow (9:00-17:30)] [DEADLINE timeWindow (17:30-9:00)]
DESCRIPTION="Daytime/Nighttime SLA"
End ServiceClass

v The service class Tevere defines a combination similar to Duncan, but with a deadline goal that takes effect overnight and on weekends. During working hours on weekdays, the velocity goal favors a mix of short and medium jobs.

Begin ServiceClass
NAME=Tevere
PRIORITY=20
GOALS=[VELOCITY 100 timeWindow (9:00-17:00)] [DEADLINE timeWindow (17:30-8:30 5:17:30-1:8:30)]
DESCRIPTION="nine to five"
End ServiceClass

lsb.users The lsb.users file is used to configure user groups, hierarchical fairshare for users and user groups, and job slot limits for users and user groups. It is also used to configure account mappings in a MultiCluster environment. This file is optional. The lsb.users file is stored in the directory LSB_CONFDIR/cluster_name/configdir, where LSB_CONFDIR is defined in lsf.conf. Changing lsb.users configuration After making any changes to lsb.users, run badmin reconfig to reconfigure mbatchd. UserGroup section Optional. Defines user groups. The name of the user group can be used in other user group and queue definitions, as well as on the command line. Specifying the name of a user group in the GROUP_MEMBER section has exactly the same effect as listing the names of all users in the group. The total number of user groups cannot be more than 1024. Structure The first line consists of two mandatory keywords, GROUP_NAME and GROUP_MEMBER. The USER_SHARES and GROUP_ADMIN keywords are optional. Subsequent lines name a group and list its membership and optionally its share assignments and administrator. Each line must contain one entry for each keyword.
Use empty parentheses () or a dash (-) to specify the default value for an entry. Restriction: If specifying a specific user name for a user group, that entry must precede all user groups. Examples of a UserGroup section Example 1:

Begin UserGroup
GROUP_NAME GROUP_MEMBER GROUP_ADMIN
groupA (user1 user2 user3 user4) (user5[full])
groupB (user7 user8 user9) (groupA[usershares])
groupC (groupA user5) (groupA)
groupD (!) ()
End UserGroup

Example 2:

Begin UserGroup
GROUP_NAME GROUP_MEMBER GROUP_ADMIN
groupA (user1 user2 user3 user4) (user5)
groupB (groupA user5) (groupA)
groupC (!) ()
End UserGroup

Example 3:

Begin UserGroup
GROUP_NAME GROUP_MEMBER USER_SHARES
groupB (user1 user2) ()
groupC (user3 user4) ([User3,3] [User4,4])
groupA (GroupB GroupC user5) ([User5,1] [default,10])
End UserGroup

GROUP_NAME An alphanumeric string representing the user group name. You cannot use the reserved name all or a "/" in a group name. GROUP_MEMBER User group members are the users who belong to the group. You can specify both user names and user group names. User and user group names can appear on multiple lines because users can belong to multiple groups. Note: When a user belongs to more than one group, any of the administrators specified for any of the groups the user belongs to can control that user's jobs. Limit administrative control by defining STRICT_UG_CONTROL=Y in lsb.params and submitting jobs with the -G option, specifying which user group the job is submitted with. User groups may be defined recursively but must not create a loop. Syntax (user_name | user_group ...) | (all) | (!) Enclose the entire group member list in parentheses. Use space to separate multiple names. You can combine user names and user group names in the same list. Valid values v all The reserved name all specifies all users in the cluster. v ! An exclamation mark (!)
indicates an externally-defined user group, which the egroup executable retrieves. v user_name User names must be valid login names. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name). v user_group User group names can be LSF user groups defined previously in this section, or UNIX and Windows user groups. If you specify a name that is both a UNIX user group and a UNIX user, append a slash (/) to make sure it is interpreted as a group (user_group/). To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\user_group). GROUP_ADMIN User group administrators can administer the jobs of group members. You can specify both user names and user group names. v If you specify a user group as an administrator for another user group, all members of the first user group become administrators for the second user group. v You can also specify that all users of a group are also administrators of that same group. v Users can be administrators for more than one user group at the same time. Note: When a user belongs to more than one group, any of the administrators specified for any of the groups the user belongs to can control that user's jobs. Define STRICT_UG_CONTROL=Y in lsb.params to limit user group administrator control to the user group specified by -G at job submission. By default a user group administrator has privileges equivalent to those of a job owner, and is allowed to control any job belonging to member users of the group they administer. A user group administrator can also resume jobs stopped by the LSF administrator or queue administrator if the job belongs to a member of their user group. Optionally, you can specify additional user group administrator rights for each user group administrator. User group administrator rights are inherited.
For example, if admin2 has full rights for user group ugA and user group ugB is a member of ugA, admin2 also has full rights for user group ugB. Restriction: Unlike a job owner, a user group administrator cannot run brestart and bread -a data_file. To manage security concerns, you cannot specify user group administrators for any user group containing the keyword all as a member unless STRICT_UG_CONTROL=Y is defined in lsb.params. Syntax (user_name | user_name[admin_rights] | user_group | user_group[admin_rights] ...) Enclose the entire group administrator list in parentheses. If you specify administrator rights for a user or group, enclose them in square brackets. You can combine user names and user group names in the same list. Use space to separate multiple names. Valid values v user_name User names must be valid login names. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name). v user_group User group names can be LSF user groups defined previously in this section, or UNIX and Windows user groups. If you specify a name that is both a UNIX user group and a UNIX user, append a slash (/) to make sure it is interpreted as a group (user_group/). To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\user_group). v admin_rights – If no rights are specified, only default job control rights are given to user group administrators. – usershares: user group administrators with usershares rights can adjust user shares using bconf update. – full: user group administrators with full rights can use bconf to adjust both usershares and group members, delete the user group, and create new user groups. User group administrators with full rights can only add a user group member to the user group if they also have full rights for the member user group.
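The inheritance rule above can be sketched as a small recursive lookup. This is illustrative only (the data model and function name are hypothetical, not an LSF API): members maps each group to the groups it directly contains, and admins maps each group to its configured administrators and rights.

```python
def inherited_rights(group, admins, members):
    """Administrator rights for `group`, including rights inherited from
    every ancestor group that lists it as a member.  Assumes the group
    tree has no loops, which lsb.users requires anyway."""
    rights = dict(admins.get(group, {}))
    for parent, children in members.items():
        if group in children:
            for admin, r in inherited_rights(parent, admins, members).items():
                rights.setdefault(admin, r)   # directly granted rights win
    return rights

# The example from the text: admin2 has full rights for ugA,
# and ugB is a member of ugA.
members = {"ugA": {"ugB"}}
admins = {"ugA": {"admin2": "full"}}
```

With this data, inherited_rights("ugB", admins, members) reports that admin2 has full rights for ugB as well.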
User group administrators adding a new user group with bconf create are automatically added to GROUP_ADMIN with full rights for the new user group. Restrictions v Wildcard and special characters are not supported (for example: *, !, $, #, &, ~) v The reserved keywords others, default, allremote are not supported. v User groups with the keyword all as a member can only have user group administrators configured if STRICT_UG_CONTROL=Y is defined in lsb.params. v User groups with the keyword all as a member cannot be user group administrators. v User group and user group administrator definitions cannot be recursive or create a loop. USER_SHARES Optional. Enables hierarchical fairshare and defines a share tree for users and user groups. By default, when resources are assigned collectively to a group, the group members compete for the resources according to FCFS scheduling. You can use hierarchical fairshare to further divide the shares among the group members. Syntax ([user, number_shares]) Specify the arguments as follows: v Enclose the list in parentheses, even if you do not specify any user share assignments. v Enclose each user share assignment in square brackets, as shown. v Separate the list of share assignments with a space. v user—Specify users or user groups. You can assign the shares to: – A single user (specify user_name). To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name). – Users in a group (specify group_name). To specify a Windows user group, include the domain name in uppercase letters (DOMAIN_NAME\group_name). – Users not included in any other share assignment, individually (specify the keyword default or default@) or collectively (specify the keyword others). Note: By default, when resources are assigned collectively to a group, the group members compete for the resources on a first-come, first-served (FCFS) basis.
You can use hierarchical fairshare to further divide the shares among the group members. When resources are assigned to members of a group individually, the share assignment is recursive. Members of the group and of all subgroups always compete for the resources according to FCFS scheduling, regardless of hierarchical fairshare policies. v number_shares—Specify a positive integer representing the number of shares of the cluster resources assigned to the user. The number of shares assigned to each user is only meaningful when you compare it to the shares assigned to other users or to the total number of shares. The total number of shares is just the sum of all the shares assigned in each share assignment. User section Optional. If this section is not defined, all users and user groups can run an unlimited number of jobs in the cluster. This section defines the maximum number of jobs a user or user group can run concurrently in the cluster. This is to avoid situations in which a user occupies all or most of the system resources while other users' jobs are waiting. Structure One field is mandatory: USER_NAME. MAX_JOBS, JL/P, and MAX_PEND_JOBS are optional. You must specify a dash (-) to indicate the default value (unlimited) if a user or user group is specified. Fields cannot be left blank. Example of a User section

Begin User
USER_NAME MAX_JOBS JL/P MAX_PEND_JOBS
user1 10 - 1000
user2 4 - -
user3 - - -
groupA 10 1 100000
groupA@ - 1 100
groupC - - 500
default 6 1 10
End User

USER_NAME User or user group for which job slot limits are defined. Use the reserved user name default to specify a job slot limit that applies to each user and user group not explicitly named. Since the limit specified with the keyword default applies to user groups also, make sure you select a limit that is high enough, or explicitly define limits for user groups.
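The default-keyword fallback can be sketched as a simple lookup. This is a hypothetical helper, not an LSF API; it resolves the MAX_JOBS limit for a user from a User section, where a dash means unlimited and the reserved name default covers everyone not explicitly listed:

```python
UNLIMITED = float("inf")

def max_jobs(user, limits):
    """Job slot limit for `user` from a User section mapping.

    Falls back to the reserved name `default` when the user is not
    explicitly listed; a dash ('-') means unlimited.
    """
    raw = limits.get(user, limits.get("default", "-"))
    return UNLIMITED if raw == "-" else int(raw)

# Values taken from the example User section above.
limits = {"user1": 10, "user3": "-", "default": 6}
```

Here max_jobs("user1", limits) resolves to 10, user3 is unlimited, and an unlisted user falls back to the default limit of 6.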
User group names can be the LSF user groups defined previously, and/or UNIX and Windows user groups. To specify a Windows user account or user group, include the domain name in uppercase letters (DOMAIN_NAME\user_name or DOMAIN_NAME\user_group). Job slot limits apply to a group as a whole. Append the at sign (@) to a group name to make the job slot limits apply individually to each user in the group. If a group contains a subgroup, the job slot limit also applies to each member in the subgroup recursively. If the group contains the keyword all in the user list, the at sign (@) has no effect. To specify job slot limits for each user in a user group containing all, use the keyword default. MAX_JOBS Per-user or per-group job slot limit for the cluster. Total number of job slots that each user or user group can use in the cluster. Note: If a group contains the keyword all as a member, all users and user groups are included in the group. The per-group job slot limit set for the group applies to the group as a whole, limiting the entire cluster even when ENFORCE_ONE_UG_LIMIT is set in lsb.params. JL/P Per processor job slot limit per user or user group. Total number of job slots that each user or user group can use per processor. This job slot limit is configured per processor so that multiprocessor hosts will automatically run more jobs. This number can be a fraction such as 0.5, so that it can also serve as a per-host limit. This number is rounded up to the nearest integer equal to or greater than the total job slot limits for a host. For example, if JL/P is 0.5, on a 4-CPU multiprocessor host, the user can only use up to 2 job slots at any time. On a uniprocessor machine, the user can use 1 job slot. MAX_PEND_JOBS Per-user or per-group pending job limit. This is the total number of pending job slots that each user or user group can have in the system.
If a user is a member of multiple user groups, the user's pending jobs are counted towards the pending job limits of all groups from which the user has membership. If ENFORCE_ONE_UG_LIMIT is set to Y in lsb.params and you submit a job while specifying a user group, only the limits for that user group (or any parent user group) apply to the job even if there are overlapping user group members. UserMap section Optional. Used only in a MultiCluster environment with a non-uniform user name space. Defines system-level cross-cluster account mapping for users and user groups, which allows users to submit a job from a local host and run the job as a different user on a remote host. Both the local and remote clusters must have corresponding user account mappings configured. Structure The following three fields are all required: v LOCAL v REMOTE v DIRECTION LOCAL A list of users or user groups in the local cluster. To specify a Windows user account or user group, include the domain name in uppercase letters (DOMAIN_NAME\user_name or DOMAIN_NAME\user_group). Separate multiple user names by a space and enclose the list in parentheses ( ): (user4 user6) REMOTE A list of remote users or user groups in the form user_name@cluster_name or user_group@cluster_name. To specify a Windows user account or user group, include the domain name in uppercase letters (DOMAIN_NAME\user_name@cluster_name or DOMAIN_NAME\user_group@cluster_name). Separate multiple user names by a space and enclose the list in parentheses ( ): (user4@cluster2 user6@cluster2) DIRECTION Specifies whether the user account runs jobs locally or remotely. Both directions must be configured on the local and remote clusters. v The export keyword configures local users/groups to run jobs as remote users/groups. v The import keyword configures remote users/groups to run jobs as local users/groups.
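The export direction can be sketched as a table lookup. This is illustrative only (the function and data layout are hypothetical, not an LSF API): given export mappings like those in the UserMap example that follows, it resolves which remote account a local user's job runs under.

```python
def run_as(local_user, remote_cluster, exports):
    """Remote account a local user's job runs under on `remote_cluster`,
    given export mappings such as ("user1", "user2@cluster2").
    With no matching mapping, the job keeps the submitting user's name."""
    for local, remote in exports:
        name, _, cluster = remote.partition("@")
        if local == local_user and cluster == remote_cluster:
            return name
    return local_user

# Mappings from the cluster1 UserMap example below this paragraph.
exports = [("user1", "user2@cluster2"), ("user3", "user6@cluster2")]
```

So a job user1 submits toward cluster2 runs there as user2, while an unmapped user runs under their own name.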
Example of a UserMap section On cluster1:

Begin UserMap
LOCAL REMOTE DIRECTION
user1 user2@cluster2 export
user3 user6@cluster2 export
End UserMap

On cluster2:

Begin UserMap
LOCAL REMOTE DIRECTION
user2 user1@cluster1 import
user6 user3@cluster1 import
End UserMap

Cluster1 configures user1 to run jobs as user2 and user3 to run jobs as user6. Cluster2 accepts jobs from user1 to run as user2 and jobs from user3 to run as user6. Automatic time-based configuration Variable configuration is used to automatically change LSF configuration based on time windows. You define automatic configuration changes in lsb.users by using if-else constructs and time expressions. After you change the files, reconfigure the cluster with the badmin reconfig command. The expressions are evaluated by LSF every 10 minutes based on mbatchd start time. When an expression evaluates true, LSF dynamically changes the configuration based on the associated configuration statements. Reconfiguration is done in real time without restarting mbatchd, providing continuous system availability. Example From 12 - 1 p.m. daily, user smith has 10 job slots, but during other hours, smith has only 5 job slots.

Begin User
USER_NAME MAX_JOBS JL/P
#if time (12-13)
smith 10 -
#else
smith 5 -
#endif
default 1 -
End User

lsf.acct The lsf.acct file is the LSF task log file. The LSF Remote Execution Server, RES (see res(8)), generates a record for each task completion or failure. If the RES task logging is turned on (see lsadmin(8)), it appends the record to the task log file lsf.acct.<host_name>. lsf.acct structure The task log file is an ASCII file with one task record per line. The fields of each record are separated by blanks. The location of the file is determined by the LSF_RES_ACCTDIR variable defined in lsf.conf.
If this variable is not defined, or the RES cannot access the log directory, the log file is created in /tmp instead. Fields The fields in a task record are ordered in the following sequence:

pid (%d) Process ID for the remote task
userName (%s) User name of the submitter
exitStatus (%d) Task exit status
dispTime (%ld) Dispatch time – time at which the task was dispatched for execution
termTime (%ld) Completion time – time when the task completed or failed
fromHost (%s) Submission host name
execHost (%s) Execution host name
cwd (%s) Current working directory
cmdln (%s) Command line of the task
lsfRusage The following fields contain resource usage information for the job (see getrusage(2)). If the value of a field is unavailable (due to job exit or differences among operating systems), -1 is logged. Times are measured in seconds, and sizes are measured in KB.
ru_utime (%f) User time used
ru_stime (%f) System time used
ru_maxrss (%f) Maximum shared text size
ru_ixrss (%f) Integral of the shared text size over time (in KB seconds)
ru_ismrss (%f) Integral of the shared memory size over time (valid only on Ultrix)
ru_idrss (%f) Integral of the unshared data size over time
ru_isrss (%f) Integral of the unshared stack size over time
ru_minflt (%f) Number of page reclaims
ru_majflt (%f) Number of page faults
ru_nswap (%f) Number of times the process was swapped out
ru_inblock (%f) Number of block input operations
ru_oublock (%f) Number of block output operations
ru_ioch (%f) Number of characters read and written (valid only on HP-UX)
ru_msgsnd (%f) Number of System V IPC messages sent
ru_msgrcv (%f) Number of messages received
ru_nsignals (%f) Number of signals received
ru_nvcsw (%f) Number of voluntary context switches
ru_nivcsw (%f) Number of involuntary context switches
ru_exutime (%f) Exact user time used (valid only on ConvexOS)

lsf.cluster Changing lsf.cluster configuration After making any changes to
lsf.cluster.cluster_name, run the following commands: v lsadmin reconfig to reconfigure LIM v badmin mbdrestart to restart mbatchd v lsadmin limrestart to restart LIM (on all changed non-master hosts) Location This file is typically installed in the directory defined by LSF_ENVDIR. Structure The lsf.cluster.cluster_name file contains the following configuration sections: v Parameters section v ClusterAdmins section v Host section v ResourceMap section v RemoteClusters section About lsf.cluster This is the cluster configuration file. There is one for each cluster, called lsf.cluster.cluster_name. The cluster_name suffix is the name of the cluster defined in the Cluster section of lsf.shared. All LSF hosts are listed in this file, along with the list of LSF administrators and the installed LSF features. The lsf.cluster.cluster_name file contains two types of configuration information: v Cluster definition information - affects all LSF applications. Defines cluster administrators, hosts that make up the cluster, attributes of each individual host such as host type or host model, and resources using the names defined in lsf.shared. v LIM policy information - affects applications that rely on LIM job placement policy. Defines load sharing and job placement policies provided by LIM. Parameters section Parameters v ADJUST_DURATION v ELIM_ABORT_VALUE v ELIM_POLL_INTERVAL v ELIMARGS v EXINTERVAL v FLOAT_CLIENTS v FLOAT_CLIENTS_ADDR_RANGE v HOST_INACTIVITY_LIMIT v LSF_ELIM_BLOCKTIME v LSF_ELIM_DEBUG v LSF_ELIM_RESTARTS v LSF_HOST_ADDR_RANGE v MASTER_INACTIVITY_LIMIT v PROBE_TIMEOUT v RETRY_LIMIT ADJUST_DURATION Syntax ADJUST_DURATION=integer Description Integer reflecting a multiple of EXINTERVAL that controls the time period during which load adjustment is in effect. The lsplace(1) and lsloadadj(1) commands artificially raise the load on a selected host.
This increase in load decays linearly to 0 over time. Default 3 ELIM_ABORT_VALUE Syntax ELIM_ABORT_VALUE=integer Description Integer that triggers an abort for an ELIM. Default 97 (triggers abort) ELIM_POLL_INTERVAL Syntax ELIM_POLL_INTERVAL=seconds Description Time interval, in seconds, that the LIM samples external load index information. If your elim executable is programmed to report values more frequently than every 5 seconds, set the ELIM_POLL_INTERVAL so that it samples information at a corresponding rate. Valid values 0.001 to 5 Default 5 seconds ELIMARGS Syntax ELIMARGS=cmd_line_args Description Specifies command-line arguments required by an elim executable on startup. Used only when the external load indices feature is enabled. Default Undefined EXINTERVAL Syntax EXINTERVAL=time_in_seconds Description Time interval, in seconds, at which the LIM daemons exchange load information. On extremely busy hosts or networks, or in clusters with a large number of hosts, load may interfere with the periodic communication between LIM daemons. Setting EXINTERVAL to a longer interval can reduce network load and slightly improve reliability, at the cost of slower reaction to dynamic load changes. Note that if you define the time interval as less than 5 seconds, LSF automatically resets it to 5 seconds. Default 15 seconds FLOAT_CLIENTS Syntax FLOAT_CLIENTS=number_of_floating_clients Description Sets the maximum allowable size for floating clients in a cluster. If FLOAT_CLIENTS is not specified in lsf.cluster.cluster_name, the floating LSF client feature is disabled. CAUTION: When the LSF floating client feature is enabled, any host can submit jobs to the cluster. You can limit which hosts can be LSF floating clients with the parameter FLOAT_CLIENTS_ADDR_RANGE in lsf.cluster.cluster_name. LSF Floating Client Although an LSF Floating Client requires a license, LSF_Float_Client does not need to be added to the PRODUCTS line.
LSF_Float_Client also cannot be added as a resource for specific hosts already defined in lsf.cluster.cluster_name. Should these lines be present, they are ignored by LSF. Default Undefined FLOAT_CLIENTS_ADDR_RANGE Syntax FLOAT_CLIENTS_ADDR_RANGE=IP_address ... Description Optional. IP address or range of addresses of domains from which floating client hosts can submit requests. Multiple ranges can be defined, separated by spaces. The IP address can have either a dotted quad notation (IPv4) or IP Next Generation (IPv6) format. LSF supports both formats; you do not have to map IPv4 addresses to an IPv6 format. Note: To use IPv6 addresses, you must define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf. If the value of FLOAT_CLIENTS_ADDR_RANGE is undefined, there is no security and any hosts can be LSF floating clients. If a value is defined, security is enabled. If there is an error in the configuration of this variable, by default, no hosts will be allowed to be LSF floating clients. When this parameter is defined, client hosts that do not belong to the domain will be denied access. If a requesting host belongs to an IP address that falls in the specified range, the host will be accepted to become a floating client. IP addresses are separated by spaces, and considered "OR" alternatives. If you define FLOAT_CLIENTS_ADDR_RANGE with: v No range specified, all IPv4 and IPv6 clients can submit requests. v Only an IPv4 range specified, only IPv4 clients within the range can submit requests. v Only an IPv6 range specified, only IPv6 clients within the range can submit requests. v Both an IPv6 and IPv4 range specified, IPv6 and IPv4 clients within the ranges can submit requests. The asterisk (*) character indicates any value is allowed. The dash (-) character indicates an explicit range of values. For example, 1-4 indicates 1, 2, 3, and 4 are allowed. Open ranges such as *-30, or 10-*, are allowed.
If a range is specified with fewer fields than an IP address such as 10.161, it is considered as 10.161.*.*. Address ranges are validated at configuration time so they must conform to the required format. If any address range is not in the correct format, no hosts will be accepted as LSF floating clients, and an error message will be logged in the LIM log. This parameter is limited to 2048 characters. For IPv6 addresses, the double colon symbol (::) indicates multiple groups of 16-bits of zeros. You can also use (::) to compress leading and trailing zeros in an address filter, as shown in the following example: FLOAT_CLIENTS_ADDR_RANGE=1080::8:800:20fc:* This definition allows hosts with addresses 1080:0:0:0:8:800:20fc:* (three leading zeros). You cannot use the double colon (::) more than once within an IP address. You cannot use a zero before or after (::). For example, 1080:0::8:800:20fc:* is not a valid address. Notes After you configure FLOAT_CLIENTS_ADDR_RANGE, check the lim.log.host_name file to make sure this parameter is correctly set. If this parameter is not set or is wrong, this will be indicated in the log file. Examples FLOAT_CLIENTS_ADDR_RANGE=100 All IPv4 and IPv6 hosts with a domain address starting with 100 will be allowed access. v To specify only IPv4 hosts, set the value to 100.* v To specify only IPv6 hosts, set the value to 100:* FLOAT_CLIENTS_ADDR_RANGE=100-110.34.1-10.4-56 All client hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. Example: 100.34.9.45, 100.34.1.4, 102.34.3.20, etc. No IPv6 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE=100.172.1.13 100.*.30-54 124.24-*.1.*-34 All client hosts belonging to a domain with the address 100.172.1.13 will be allowed access.
All client hosts belonging to domains starting with 100, then any number, then a range of 30 to 54 will be allowed access. All client hosts belonging to domains starting with 124, then from 24 onward, then 1, then from 0 to 34 will be allowed access. No IPv6 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE=12.23.45.* All client hosts belonging to domains starting with 12.23.45 are allowed. No IPv6 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE=100.*43 The * character can only be used to indicate any value. In this example, an error will be inserted in the LIM log and no hosts will be accepted to become LSF floating clients. No IPv6 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE=100.*43 100.172.1.13 Although one correct address range is specified, because *43 is not correct format, the entire line is considered not valid. An error will be inserted in the LIM log and no hosts will be accepted to become LSF floating clients. No IPv6 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE = 3ffe All client IPv6 hosts with a domain address starting with 3ffe will be allowed access. No IPv4 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE = 3ffe:fffe::88bb:* Expands to 3ffe:fffe:0:0:0:0:88bb:*. All IPv6 client hosts belonging to domains starting with 3ffe:fffe::88bb:* are allowed. No IPv4 hosts are allowed. FLOAT_CLIENTS_ADDR_RANGE = 3ffe-4fff:fffe::88bb:aa-ff 12.23.45.* All IPv6 client hosts belonging to domains starting with 3ffe up to 4fff, then fffe::88bb, and ending with aa up to ff are allowed. All IPv4 client hosts belonging to domains starting with 12.23.45 are allowed. FLOAT_CLIENTS_ADDR_RANGE = 3ffe-*:fffe::88bb:*-ff All IPv6 client hosts belonging to domains starting with 3ffe up to ffff and ending with 0 up to ff are allowed. No IPv4 hosts are allowed. Default Undefined. No security is enabled. Any host in any domain is allowed access to LSF floating clients.
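The IPv4 matching rules described above can be sketched in a few lines. This is illustrative only, not LSF's implementation: each of the four fields in a pattern is a number, an asterisk, or a range such as 1-10, *-30, or 10-*, and a pattern with fewer than four fields is padded with asterisks.

```python
def field_ok(field, pattern):
    """Match one numeric address field against one pattern field."""
    if pattern == "*":
        return True
    if "-" in pattern:                       # explicit or open range
        lo, hi = pattern.split("-")
        return (lo == "*" or int(lo) <= field) and \
               (hi == "*" or field <= int(hi))
    return field == int(pattern)

def addr_in_range(addr, pattern):
    """True if a dotted-quad IPv4 address matches one range pattern."""
    fields = [int(f) for f in addr.split(".")]
    pats = pattern.split(".")
    pats += ["*"] * (4 - len(pats))          # 10.161 means 10.161.*.*
    return all(field_ok(f, p) for f, p in zip(fields, pats))
```

Using the documented examples, addr_in_range("100.34.9.45", "100-110.34.1-10.4-56") matches, while "111.34.9.45" does not.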
See also

LSF_ENABLE_SUPPORT_IPV6

HOST_INACTIVITY_LIMIT

Syntax

HOST_INACTIVITY_LIMIT=integer

Description

An integer that is multiplied by EXINTERVAL, the time period you set for communication between the master and slave LIMs to ensure that all parties are functioning. A slave LIM can send its load information any time from EXINTERVAL to (HOST_INACTIVITY_LIMIT-1)*EXINTERVAL seconds. A master LIM sends a master announce to each host at least every EXINTERVAL*(HOST_INACTIVITY_LIMIT-1) seconds. HOST_INACTIVITY_LIMIT must be greater than or equal to 2.

Increase or decrease the host inactivity limit to match your tolerance for communication delays between the master and slave LIMs. For example, if you have hosts that frequently become inactive, decrease the host inactivity limit. Note that to get the right interval, you may also have to adjust EXINTERVAL.

Default

5

LSF_ELIM_BLOCKTIME

Syntax

LSF_ELIM_BLOCKTIME=seconds

Description

UNIX only; used when the external load indices feature is enabled.

Maximum amount of time the master external load information manager (MELIM) waits for a complete load update string from an elim executable. After the time period specified by LSF_ELIM_BLOCKTIME, the MELIM writes the last string sent by an elim in the LIM log file (lim.log.host_name) and restarts the elim.

Defining LSF_ELIM_BLOCKTIME also triggers the MELIM to restart elim executables if the elim does not write a complete load update string within the time specified by LSF_ELIM_BLOCKTIME.

Valid values

Non-negative integers. For example, if your elim writes name-value pairs with 1-second intervals between them, and your elim reports 12 load indices, allow at least 12 seconds for the elim to finish writing the entire load update string. In this case, define LSF_ELIM_BLOCKTIME as 15 seconds or more.

A value of 0 indicates that the MELIM expects to receive the entire load string all at once.
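To make the sizing advice above concrete, the load update string an elim writes to stdout is the number of indices followed by name/value pairs. The following is a hedged sketch, not shipped LSF code: the index names scratch and licenses are hypothetical site-defined external indices, and a real elim would loop forever with a sleep between updates.

```python
import sys

def load_update(indices):
    """Format one elim load update string:
    number_indices name1 value1 name2 value2 ..., e.g. "2 scratch 47.5 licenses 3"."""
    parts = [str(len(indices))]
    for name, value in indices.items():
        parts.extend([name, str(value)])
    return " ".join(parts)

# A real elim would repeat this at regular intervals, flushing stdout after each
# write so the MELIM sees a complete string within LSF_ELIM_BLOCKTIME.
sys.stdout.write(load_update({"scratch": 47.5, "licenses": 3}) + "\n")
sys.stdout.flush()
```

Writing and flushing the whole string in one step keeps the elim well within even a small LSF_ELIM_BLOCKTIME.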
If you comment out or delete LSF_ELIM_BLOCKTIME, the MELIM waits 2 seconds for a complete load update string.

Default

4 seconds

See also

LSF_ELIM_RESTARTS to limit how many times the ELIM can be restarted.

LSF_ELIM_DEBUG

Syntax

LSF_ELIM_DEBUG=y

Description

UNIX only; used when the external load indices feature is enabled.

When this parameter is set to y, all external load information received by the load information manager (LIM) from the master external load information manager (MELIM) is logged in the LIM log file (lim.log.host_name).

Defining LSF_ELIM_DEBUG also triggers the MELIM to restart elim executables if the elim does not write a complete load update string within the time specified by LSF_ELIM_BLOCKTIME.

Default

Undefined; external load information sent by an elim to the MELIM is not logged.

See also

LSF_ELIM_BLOCKTIME to configure how long the LIM waits before restarting the ELIM. LSF_ELIM_RESTARTS to limit how many times the ELIM can be restarted.

LSF_ELIM_RESTARTS

Syntax

LSF_ELIM_RESTARTS=integer

Description

UNIX only; used when the external load indices feature is enabled.

Maximum number of times the master external load information manager (MELIM) can restart elim executables on a host. Defining this parameter prevents an ongoing restart loop in the case of a faulty elim. The MELIM waits the LSF_ELIM_BLOCKTIME period to receive a complete load update string before restarting the elim. The MELIM does not restart any elim executables that exit with ELIM_ABORT_VALUE.

Important: Either LSF_ELIM_BLOCKTIME or LSF_ELIM_DEBUG must also be defined; defining these parameters triggers the MELIM to restart elim executables.

Valid values

Non-negative integers.

Default

Undefined; the number of elim restarts is unlimited.

See also

LSF_ELIM_BLOCKTIME, LSF_ELIM_DEBUG

LSF_HOST_ADDR_RANGE

Syntax

LSF_HOST_ADDR_RANGE=IP_address ...
Description

Identifies the range of IP addresses that are allowed to be LSF hosts that can be dynamically added to or removed from the cluster.

CAUTION: To enable dynamically added hosts after installation, you must define LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name, and LSF_DYNAMIC_HOST_WAIT_TIME in lsf.conf. If you enable dynamic hosts during installation, you must define an IP address range after installation to enable security.

If a value is defined, security for dynamically adding and removing hosts is enabled, and only hosts with IP addresses within the specified range can be added to or removed from a cluster dynamically.

Specify an IP address or range of addresses, using either dotted quad notation (IPv4) or IP Next Generation (IPv6) format. LSF supports both formats; you do not have to map IPv4 addresses to an IPv6 format. Multiple ranges can be defined, separated by spaces.

Note: To use IPv6 addresses, you must define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf.

If there is an error in the configuration of LSF_HOST_ADDR_RANGE (for example, an address range is not in the correct format), no host will be allowed to join the cluster dynamically and an error message will be logged in the LIM log. Address ranges are validated at startup, reconfiguration, or restart, so they must conform to the required format.

If a requesting host has an IP address that falls in the specified range, the host will be accepted to become a dynamic LSF host. IP addresses are separated by spaces and considered "OR" alternatives.

If you define the parameter LSF_HOST_ADDR_RANGE with:
v No range specified, all IPv4 and IPv6 clients are allowed.
v Only an IPv4 range specified, only IPv4 clients within the range are allowed.
v Only an IPv6 range specified, only IPv6 clients within the range are allowed.
v Both an IPv6 and IPv4 range specified, IPv6 and IPv4 clients within the ranges are allowed.
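The four cases in the list above can be expressed as a small classifier. This is a sketch under the documented rules (the function name is an assumption): a range containing a colon counts as IPv6, one containing a dot as IPv4, and a bare leading number such as 100 admits both families.

```python
def allowed_families(filter_expr):
    """Decide which client address families an LSF_HOST_ADDR_RANGE value admits."""
    if not filter_expr.split():
        return {"ipv4": True, "ipv6": True}   # no range: all clients allowed
    families = {"ipv4": False, "ipv6": False}
    for rng in filter_expr.split():
        if ":" in rng:
            families["ipv6"] = True           # IPv6 range
        elif "." in rng:
            families["ipv4"] = True           # IPv4 range
        else:
            # A bare number such as 100 matches both IPv4 and IPv6 hosts.
            families["ipv4"] = families["ipv6"] = True
    return families
```

For instance, 100.* restricts the filter to IPv4 clients, 100:* to IPv6 clients, and a mixed list such as 3ffe::1 12.23.45.* admits both.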
The asterisk (*) character indicates that any value is allowed. The dash (-) character indicates an explicit range of values. For example, 1-4 indicates that 1, 2, 3, and 4 are allowed. Open ranges such as *-30 or 10-* are allowed.

For IPv6 addresses, the double colon symbol (::) indicates multiple groups of 16 bits of zeros. You can also use (::) to compress leading and trailing zeros in an address filter, as shown in the following example:

LSF_HOST_ADDR_RANGE=1080::8:800:20fc:*

This definition allows hosts with addresses 1080:0:0:0:8:800:20fc:* (three leading zeros).

You cannot use the double colon (::) more than once within an IP address, and you cannot use a zero before or after (::). For example, 1080:0::8:800:20fc:* is not a valid address.

If a range is specified with fewer fields than an IP address, such as 10.161, it is considered as 10.161.*.*.

This parameter is limited to 2048 characters.

Notes

After you configure LSF_HOST_ADDR_RANGE, check the lim.log.host_name file to make sure this parameter is set correctly. If this parameter is not set or is set incorrectly, this is indicated in the log file.

Examples

LSF_HOST_ADDR_RANGE=100

All IPv4 and IPv6 hosts with a domain address starting with 100 will be allowed access.
v To specify only IPv4 hosts, set the value to 100.*
v To specify only IPv6 hosts, set the value to 100:*

LSF_HOST_ADDR_RANGE=100-110.34.1-10.4-56

All hosts belonging to a domain with an address having the first number between 100 and 110, then 34, then a number between 1 and 10, then a number between 4 and 56 will be allowed access. For example: 100.34.9.45, 100.34.1.4, and 102.34.3.20. No IPv6 hosts are allowed.

LSF_HOST_ADDR_RANGE=100.172.1.13 100.*.30-54 124.24-*.1.*-34

The host with the address 100.172.1.13 will be allowed access. All hosts belonging to domains starting with 100, then any number, then a range of 30 to 54 will be allowed access.
All hosts belonging to domains starting with 124, then from 24 onward, then 1, then from 0 to 34 will be allowed access. No IPv6 hosts are allowed.

LSF_HOST_ADDR_RANGE=12.23.45.*

All hosts belonging to domains starting with 12.23.45 are allowed. No IPv6 hosts are allowed.

LSF_HOST_ADDR_RANGE=100.*43

The * character can only be used to indicate any value. The format of this example is not correct: an error will be inserted in the LIM log and no hosts will be able to join the cluster dynamically. No IPv6 hosts are allowed.

LSF_HOST_ADDR_RANGE=100.*43 100.172.1.13

Although one correct address range is specified, because *43 is not in the correct format, the entire line is considered not valid. An error will be inserted in the LIM log and no hosts will be able to join the cluster dynamically. No IPv6 hosts are allowed.

LSF_HOST_ADDR_RANGE = 3ffe

All client IPv6 hosts with a domain address starting with 3ffe will be allowed access. No IPv4 hosts are allowed.

LSF_HOST_ADDR_RANGE = 3ffe:fffe::88bb:*

Expands to 3ffe:fffe:0:0:0:0:88bb:*. All IPv6 client hosts belonging to domains starting with 3ffe:fffe::88bb:* are allowed. No IPv4 hosts are allowed.

LSF_HOST_ADDR_RANGE = 3ffe-4fff:fffe::88bb:aa-ff 12.23.45.*

All IPv6 client hosts belonging to domains starting with 3ffe up to 4fff, then fffe::88bb, and ending with aa up to ff are allowed. All IPv4 client hosts belonging to domains starting with 12.23.45 are allowed.

LSF_HOST_ADDR_RANGE = 3ffe-*:fffe::88bb:*-ff

All IPv6 client hosts belonging to domains starting with 3ffe up to ffff and ending with 0 up to ff are allowed. No IPv4 hosts are allowed.

Default

Undefined (dynamic host feature disabled). If you enable dynamic hosts during installation, no security is enabled and all hosts can join the cluster.

See also

LSF_ENABLE_SUPPORT_IPV6

MASTER_INACTIVITY_LIMIT

Syntax

MASTER_INACTIVITY_LIMIT=integer

Description

An integer reflecting a multiple of EXINTERVAL.
A slave LIM attempts to become the master if it does not hear from the previous master after (HOST_INACTIVITY_LIMIT + host_number*MASTER_INACTIVITY_LIMIT)*EXINTERVAL seconds, where host_number is the position of the host in lsf.cluster.cluster_name. The master host is host_number 0.

Default

2

PROBE_TIMEOUT

Syntax

PROBE_TIMEOUT=time_in_seconds

Description

Specifies the timeout, in seconds, to be used for the connect(2) system call. Before taking over as the master, a slave LIM tries to connect to the last known master via TCP.

Default

2 seconds

RETRY_LIMIT

Syntax

RETRY_LIMIT=integer

Description

An integer reflecting a multiple of EXINTERVAL that controls the number of retries a master or slave LIM makes before assuming that the slave or master is unavailable. If the master does not hear from a slave for HOST_INACTIVITY_LIMIT exchange intervals, it actively polls the slave for RETRY_LIMIT exchange intervals before declaring the slave unavailable. If a slave does not hear from the master for HOST_INACTIVITY_LIMIT exchange intervals, it actively polls the master for RETRY_LIMIT exchange intervals before assuming that the master is down.

Default

2

ClusterAdmins section

(Optional) The ClusterAdmins section defines the LSF administrators for the cluster. The only keyword is ADMINISTRATORS.

If the ClusterAdmins section is not present, the default LSF administrator is root. Using root as the primary LSF administrator is not recommended.

ADMINISTRATORS

Syntax

ADMINISTRATORS=administrator_name ...

Description

Specify UNIX user names. You can also specify UNIX user group names, Windows user names, and Windows user group names. To specify a Windows user account or user group, include the domain name in uppercase letters (DOMAIN_NAME\user_name or DOMAIN_NAME\user_group). The first administrator of the expanded list is considered the primary LSF administrator.
The primary administrator is the owner of the LSF configuration files, as well as the working files under LSB_SHAREDIR/cluster_name. If the primary administrator is changed, make sure the ownership of the configuration files and the files under LSB_SHAREDIR/cluster_name is changed as well.

Administrators other than the primary LSF administrator have the same privileges as the primary LSF administrator, except that they do not have permission to change LSF configuration files. They can perform clusterwide operations on jobs, queues, or hosts in the system.

For flexibility, each cluster may have its own LSF administrators, identified by user name, although the same administrators can be responsible for several clusters. Use the -l option of the lsclusters command to display all of the administrators within a cluster.

Windows domain:
v If the specified user or user group is a domain administrator, a member of the Power Users group, or a group with domain administrative privileges, the specified user or user group must belong to the LSF user domain.
v If the specified user or user group is a user or user group with a lower degree of privileges than outlined in the previous point, the user or user group must belong to the LSF user domain and be part of the Global Admins group.

Windows workgroup:
v If the specified user or user group is not a workgroup administrator, a member of the Power Users group, or a group with administrative privileges on each host, the specified user or user group must belong to the Local Admins group on each host.

Compatibility

For backwards compatibility, ClusterManager and Manager are synonyms for ClusterAdmins and ADMINISTRATORS, respectively. It is possible to have both sections present in the same lsf.cluster.cluster_name file to allow daemons from different LSF versions to share the same file.

Example

The following is an example of a cluster with two LSF administrators. The user listed first, user2, is the primary administrator.
Begin ClusterAdmins
ADMINISTRATORS = user2 user7
End ClusterAdmins

Default

lsfadmin

Host section

The Host section is the last section in lsf.cluster.cluster_name and is the only required section. It lists all the hosts in the cluster and gives configuration information for each host. The order in which the hosts are listed in this section is important, because the first host listed becomes the LSF master host. Since the master LIM makes all placement decisions for the cluster, it should be on a fast machine.

The LIM on the first host listed becomes the master LIM if this host is up; otherwise, the LIM on the second host becomes the master if its host is up, and so on. Also, to avoid the delays involved in switching masters if the first machine goes down, the master should be on a reliable machine. It is desirable to arrange the list such that the first few hosts in the list are always in the same subnet. This avoids a situation where the second host takes over as master when there are communication problems between subnets.

Configuration information is of two types:
v Some fields in a host entry simply describe the machine and its configuration.
v Other fields set thresholds for various resources.

Example Host section

This example Host section contains descriptive and threshold information for several hosts:

Begin Host
HOSTNAME model    type  server r1m pg tmp RESOURCES       RUNWINDOW
hostA    SparcIPC Sparc 1      3.5 15 0   (sunos frame)   ()
hostD    Sparc10  Sparc 1      3.5 15 0   (sunos)         (5:18:30-1:8:30)
hostD    !        !     1      2.0 10 0   ()              ()
hostE    !        !     1      2.0 10 0   (linux !bigmem) ()
End Host

Descriptive fields

The following fields are required in the Host section:
v HOSTNAME
v RESOURCES
v type
v model

The following fields are optional:
v server
v nd
v RUNWINDOW
v REXPRI

HOSTNAME

Description

Official name of the host, as returned by hostname(1). The name must be listed in lsf.shared as belonging to this cluster.
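As a side note on the ordering rule described at the top of the Host section (the LIM on the first listed host that is up becomes the master), the selection can be sketched as a simple first-match scan. The function name and the is_up callback are illustrative assumptions, not LSF internals:

```python
def elect_master(host_list, is_up):
    """Return the first listed host that is up; it becomes the master LIM host."""
    for host in host_list:
        if is_up(host):
            return host
    return None   # no host is up; no master can be elected
```

This is why placing a few reliable hosts on the same subnet at the top of the list reduces unnecessary master switching.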
model

Description

Host model. The name must be defined in the HostModel section of lsf.shared. This determines the CPU speed scaling factor applied in load and placement calculations.

Optionally, the ! keyword in the model or type column indicates that the host model or type is to be automatically detected by the LIM running on the host.

nd

Description

Number of local disks. This corresponds to the ndisks static resource. On most host types, LSF automatically determines the number of disks, and the nd parameter is ignored.

nd should only count local disks with file systems on them. Do not count either disks used only for swapping or disks mounted with NFS.

Default

The number of disks determined by the LIM, or 1 if the LIM cannot determine this.

RESOURCES

Description

The static Boolean resources and static or dynamic numeric and string resources available on this host. The resource names are strings defined in the Resource section of lsf.shared. You may list any number of resources, enclosed in parentheses and separated by blanks or tabs. For example:

(fs frame hpux)

Optionally, you can specify an exclusive resource by prefixing the resource with an exclamation mark (!). For example, resource bigmem is defined in lsf.shared, and is defined as an exclusive resource for hostE:

Begin Host
HOSTNAME model type server r1m pg tmp RESOURCES       RUNWINDOW
...
hostE    !     !    1      2.0 10 0   (linux !bigmem) ()
...
End Host

Square brackets are not valid, and the resource name must be alphanumeric. You must explicitly specify an exclusive resource in the resource requirements of a job to select a host with that exclusive resource. For example:

bsub -R "bigmem" myjob

or

bsub -R "defined(bigmem)" myjob

You can specify static and dynamic numeric and string resources in the RESOURCES column of the Host section. For example:

Begin Host
HOSTNAME model type server r1m mem swp RESOURCES
hostA    !     !    1      3.5 ()  ()  (mg elimres patchrev=3 owner=user1)
hostB    !     !    1      3.5 ()  ()  (specman=5 switch=1 owner=test)
hostC    !     !    1      3.5 ()  ()  (switch=2 rack=rack2_2_3 owner=test)
hostD    !     !    1      3.5 ()  ()  (switch=1 rack=rack2_2_3 owner=test)
End Host

Static resource information is displayed by lshosts, with exclusive resources prefixed by '!'.

REXPRI

Description

UNIX only. Default execution priority for interactive remote jobs run under the RES. The range is from -20 to 20. REXPRI corresponds to the BSD-style nice value used for remote jobs. For hosts with System V-style nice values with the range 0 - 39, a REXPRI of -20 corresponds to a nice value of 0, and +20 corresponds to 39. Higher values of REXPRI correspond to lower execution priority; -20 gives the highest priority, 0 is the default priority for login sessions, and +20 is the lowest priority.

Default

0

RUNWINDOW

Description

Dispatch window for interactive tasks. When the host is not available for remote execution, the host status is lockW (locked by run window). LIM does not schedule interactive tasks on hosts locked by dispatch windows. Run windows only apply to interactive tasks placed by LIM. The LSF batch system uses its own (optional) host dispatch windows to control batch job processing on batch server hosts.

Format

A dispatch window consists of one or more time windows in the format begin_time-end_time. No blanks can separate begin_time and end_time. Time is specified in the form [day:]hour[:minute]. If only one field is specified, LSF assumes it is an hour. Two fields are assumed to be hour:minute. Use blanks to separate time windows.

Default

Always accept remote jobs.

server

Description

Indicates whether the host can receive jobs from other hosts. Specify 1 if the host can receive jobs from other hosts; specify 0 otherwise. Hosts with server set to 0 are LSF clients. Client hosts do not run the LSF daemons.
Client hosts can submit interactive and batch jobs to the cluster, but they cannot execute jobs sent from other hosts.

Default

1

type

Description

Host type as defined in the HostType section of lsf.shared. The strings used for host types are determined by the system administrator: for example, SUNSOL, DEC, or HPPA. The host type is used to identify binary-compatible hosts.

The host type is used as the default resource requirement. That is, if no resource requirement is specified in a placement request, the task is run on a host of the same type as the sending host.

Often one host type can be used for many machine models. For example, the host type name SUNSOL6 might be used for any computer with a SPARC processor running SunOS 6. This would include many Sun models and quite a few from other vendors as well.

Optionally, the ! keyword in the model or type column indicates that the host model or type is to be automatically detected by the LIM running on the host.

Threshold fields

The LIM uses these thresholds in determining whether to place remote jobs on a host. If one or more LSF load indices exceeds the corresponding threshold (too many users, not enough swap space, and so on), then the host is regarded as busy, and LIM will not recommend jobs to that host. The CPU run queue length threshold values (r15s, r1m, and r15m) are taken as effective queue lengths as reported by lsload -E.

All of these fields are optional; you only need to configure thresholds for the load indices that you wish to use for determining whether hosts are busy. Fields that are not configured are not considered when determining host status. The keywords for the threshold fields are not case sensitive.
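The busy test described above can be sketched as follows. This is a simplified illustration, not LIM's actual algorithm: it assumes that for the available-resource indices mem, swp, tmp, and it a host counts as busy when the value falls below the threshold, and for the other indices when it rises above it.

```python
# Indices where a lower value means the host is busier (available resources).
# Assumption: this simplified set covers mem, swp, tmp, and idle time only.
DECREASING_INDICES = {"mem", "swp", "tmp", "it"}

def host_is_busy(load, thresholds):
    """Return True if any configured threshold is exceeded.
    Unconfigured or unreported indices are ignored; keywords are case-insensitive."""
    load = {k.lower(): v for k, v in load.items()}
    for index, limit in thresholds.items():
        index = index.lower()
        value = load.get(index)
        if value is None:
            continue                      # index not reported: not considered
        if index in DECREASING_INDICES:
            if value < limit:             # too little of the resource left
                return True
        elif value > limit:               # load index too high
            return True
    return False
```

For example, with a threshold of r1m=3.5, a reported 1-minute run queue length of 4.0 marks the host busy, while unconfigured indices never affect the decision.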
Thresholds can be set for any of the following:
v The built-in LSF load indices (r15s, r1m, r15m, ut, pg, it, io, ls, swp, mem, tmp)
v External load indices defined in the Resource section of lsf.shared

ResourceMap section

The ResourceMap section defines shared resources in your cluster. This section specifies the mapping between shared resources and their sharing hosts. When you define resources in the Resource section of lsf.shared, there is no distinction between a shared and non-shared resource. By default, all resources are not shared and are local to each host. By defining the ResourceMap section, you can define resources that are shared by all hosts in the cluster or resources that are shared by only some of the hosts in the cluster.

This section must appear after the Host section of lsf.cluster.cluster_name, because it has a dependency on host names defined in the Host section.

ResourceMap section structure

The first line consists of the keywords RESOURCENAME and LOCATION. Subsequent lines describe the hosts that are associated with each configured resource.

Example ResourceMap section

Begin ResourceMap
RESOURCENAME LOCATION
verilog      (5@[all])
local        ([host1 host2] [others])
End ResourceMap

The resource verilog must already be defined in the Resource section of the lsf.shared file. It is a static numeric resource shared by all hosts. The value for verilog is 5. The resource local is a numeric shared resource that contains two instances in the cluster. The first instance is shared by two machines, host1 and host2. The second instance is shared by all other hosts.

Resources defined in the ResourceMap section can be viewed by using the -s option of the lshosts (for static resources) and lsload (for dynamic resources) commands.

LOCATION

Description

Defines the hosts that share the resource. For a static resource, you must define an initial value here as well. Do not define a value for a dynamic resource.
instance is a list of host names that share an instance of the resource. The reserved words all, others, and default can be specified for the instance:

all - Indicates that there is only one instance of the resource in the whole cluster and that this resource is shared by all of the hosts. Use the not operator (~) to exclude hosts from the all specification. For example:

(2@[all ~host3 ~host4])

means that 2 units of the resource are shared by all server hosts in the cluster made up of host1 host2 ... hostn, except for host3 and host4. This is useful if you have a large cluster but only want to exclude a few hosts. The parentheses are required in the specification. The not operator can only be used with the all keyword. It is not valid with the keywords others and default.

others - Indicates that the rest of the server hosts not explicitly listed in the LOCATION field comprise one instance of the resource. For example:

2@[host1] 4@[others]

indicates that there are 2 units of the resource on host1 and 4 units of the resource shared by all other hosts.

default - Indicates an instance of a resource on each host in the cluster. This specifies a special case where the resource is in effect not shared and is local to every host. default means at each host. Normally, you should not need to use default, because by default all resources are local to each host. You might want to use ResourceMap for a non-shared static resource if you need to specify different values for the resource on different hosts.

RESOURCENAME

Description

Name of the resource. This resource name must be defined in the Resource section of lsf.shared. You must specify at least a name and description for the resource, using the keywords RESOURCENAME and DESCRIPTION.

v A resource name cannot begin with a number.
v A resource name cannot contain any of the following characters: : . ( ) [ + - * / !
& | < > @ =
v A resource name cannot be any of the following reserved names: cpu cpuf io logins ls idle maxmem maxswp maxtmp type model status it mem ncpus define_ncpus_cores define_ncpus_procs define_ncpus_threads ndisks pg r15m r15s r1m swap swp tmp ut
v To avoid conflict with the inf and nan keywords in third-party libraries, resource names should not begin with inf or nan (uppercase or lowercase). Resource requirement strings such as -R "infra" or -R "nano" will cause an error. Use -R "defined(infxx)" or -R "defined(nanxx)" to specify these resource names.
v Resource names are case sensitive.
v Resource names can be up to 39 characters in length.

RemoteClusters section

Optional. This section is used only in a MultiCluster environment. By default, the local cluster can obtain information about all other clusters specified in lsf.shared. The RemoteClusters section limits the clusters that the local cluster can obtain information about.

The RemoteClusters section is required if you want to configure cluster equivalency, cache interval, daemon authentication across clusters, or if you want to run parallel jobs across clusters. To maintain compatibility in this case, make sure the list includes all clusters specified in lsf.shared, even if you only configure the default behavior for some of the clusters.

The first line consists of keywords. CLUSTERNAME is mandatory and the other parameters are optional. Subsequent lines configure the remote clusters.

Example RemoteClusters section

Begin RemoteClusters
CLUSTERNAME EQUIV CACHE_INTERVAL RECV_FROM AUTH
cluster1    Y     60             Y         KRB
cluster2    N     60             Y         -
cluster4    N     60             N         PKI
End RemoteClusters

CLUSTERNAME

Description

Remote cluster name. Defines the remote cluster list. Specify the clusters you want the local cluster to recognize. Recognized clusters must also be defined in lsf.shared. Additional clusters listed in lsf.shared but not listed here are ignored by this cluster.
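Looking back at the RESOURCENAME rules in the ResourceMap section above, they can be collected into a quick validator sketch. The function name is hypothetical; the reserved-name list and forbidden characters are copied from the description:

```python
RESERVED = {"cpu", "cpuf", "io", "logins", "ls", "idle", "maxmem", "maxswp",
            "maxtmp", "type", "model", "status", "it", "mem", "ncpus",
            "define_ncpus_cores", "define_ncpus_procs", "define_ncpus_threads",
            "ndisks", "pg", "r15m", "r15s", "r1m", "swap", "swp", "tmp", "ut"}
BAD_CHARS = set(":.()[+-*/!&|<>@=")

def valid_resource_name(name):
    """Apply the documented RESOURCENAME rules (names are case sensitive)."""
    if not name or len(name) > 39:
        return False                      # at most 39 characters
    if name[0].isdigit():
        return False                      # cannot begin with a number
    if any(ch in BAD_CHARS for ch in name):
        return False                      # forbidden characters
    if name in RESERVED:
        return False                      # reserved built-in names
    if name.lower().startswith(("inf", "nan")):
        return False                      # avoid inf/nan conflicts
    return True
```

For instance, bigmem passes, while mem (reserved), 3d (leading digit), and infra (begins with inf) are all rejected.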
EQUIV

Description

Specify 'Y' to make the remote cluster equivalent to the local cluster. Otherwise, specify 'N'. The master LIM considers all equivalent clusters when servicing requests from clients for load, host, or placement information.

EQUIV changes the default behavior of LSF commands and utilities and causes them to automatically return load (lsload(1)), host (lshosts(1)), or placement (lsplace(1)) information about the remote cluster as well as the local cluster, even when you do not specify a cluster name.

CACHE_INTERVAL

Description

Specify the load information cache threshold, in seconds. The host information threshold is twice the value of the load information threshold. To reduce overhead and avoid updating information from remote clusters unnecessarily, LSF displays information from the cache, unless the information in the cache is older than the threshold value.

Default

60 (seconds)

RECV_FROM

Description

Specifies whether the local cluster accepts parallel tasks that originate in a remote cluster. RECV_FROM does not affect regular or interactive batch jobs. Specify Y if you want to run parallel jobs across clusters. Otherwise, specify N.

Default

Y

AUTH

Description

Defines the preferred authentication method for LSF daemons communicating across clusters. Specify the same method name that is used to identify the corresponding eauth program (eauth.method_name). If the remote cluster does not prefer the same method, LSF uses default security between the two clusters.

Default

- (only privileged port (setuid) authentication is used between clusters)

lsf.conf

The lsf.conf file controls the operation of LSF.

About lsf.conf

lsf.conf is created during installation and records all the settings chosen when LSF was installed. The lsf.conf file dictates the location of the specific configuration files and the operation of individual servers and applications.
The lsf.conf file is used by LSF and applications built on top of it. For example, information in lsf.conf is used by LSF daemons and commands to locate other configuration files, executables, and network services. lsf.conf is updated, if necessary, when you upgrade to a new version. This file can also be expanded to include application-specific parameters. Parameters in this file can also be set as environment variables, except for the parameters related to job packs.

Corresponding parameters in ego.conf

When EGO is enabled in LSF Version 7, you can configure some LSF parameters in lsf.conf that have corresponding EGO parameter names in EGO_CONFDIR/ego.conf (LSF_CONFDIR/lsf.conf is a separate file from EGO_CONFDIR/ego.conf). If both the LSF and the EGO parameters are set in their respective files, the definition in ego.conf is used. You must continue to set LSF parameters only in lsf.conf.

When EGO is enabled in the LSF cluster (LSF_ENABLE_EGO=Y), you can also set the following EGO parameters related to LIM, PIM, and ELIM in either lsf.conf or ego.conf:
v EGO_DISABLE_UNRESOLVABLE_HOST (dynamically added hosts only)
v EGO_ENABLE_AUTO_DAEMON_SHUTDOWN
v EGO_DAEMONS_CPUS
v EGO_DEFINE_NCPUS
v EGO_SLAVE_CTRL_REMOTE_HOST
v EGO_WORKDIR
v EGO_PIM_SWAP_REPORT
v EGO_ESLIM_TIMEOUT

If EGO is not enabled, you do not need to set these parameters. See Administering IBM Platform LSF for more information about configuring LSF for EGO.

Change lsf.conf configuration

Depending on which parameters you change, you may need to run one or more of the following commands after changing lsf.conf parameters:
v lsadmin reconfig to reconfigure LIM
v badmin mbdrestart to restart mbatchd
v badmin hrestart to restart sbatchd

If you have installed LSF in a mixed cluster, make sure that lsf.conf parameters set on UNIX and Linux match any corresponding parameters in the local lsf.conf files on your Windows hosts.
Location

The default location of lsf.conf is $LSF_TOP/conf. This default location can be overridden when necessary by the environment variable LSF_ENVDIR.

Format

Each entry in lsf.conf has one of the following forms:

NAME=VALUE
NAME=
NAME="STRING1 STRING2 ..."

The equal sign = must follow each NAME, even if no value follows, and there should be no space beside the equal sign. A value that contains multiple strings separated by spaces must be enclosed in quotation marks. Lines starting with a pound sign (#) are comments and are ignored. Do not use #if, as this is reserved syntax for time-based configuration.

DAEMON_SHUTDOWN_DELAY

Syntax

DAEMON_SHUTDOWN_DELAY=time_in_seconds

Description

Applies when EGO_ENABLE_AUTO_DAEMON_SHUTDOWN=Y. Controls the amount of time the slave LIM waits to communicate with the other local daemons (RES and SBD) before exiting. Used to shorten or lengthen the time interval between a host attempting to join the cluster and, if it was unsuccessful, all of the local daemons shutting down. The value should not be less than the minimum interval of RES and SBD housekeeping. Most administrators should set this value to somewhere between 3 minutes and 60 minutes.

Default

Set to 180 seconds (3 minutes) at time of installation for the DEFAULT configuration template. If otherwise undefined, then 1800 seconds (30 minutes).

EGO_DEFINE_NCPUS

Syntax

EGO_DEFINE_NCPUS=procs | cores | threads

Description

If defined, enables an administrator to define a value other than the number of cores available for ncpus.
The value of ncpus depends on the value of EGO_DEFINE_NCPUS as follows:
v EGO_DEFINE_NCPUS=procs: ncpus=number of processors
v EGO_DEFINE_NCPUS=cores: ncpus=number of processors x number of cores
v EGO_DEFINE_NCPUS=threads: ncpus=number of processors x number of cores x number of threads

Note: When PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the resource requirement string keyword ncpus refers to the number of slots instead of the number of CPUs; however, lshosts output continues to show ncpus as defined by EGO_DEFINE_NCPUS in lsf.conf.

For changes to EGO_DEFINE_NCPUS to take effect, restart the lim, mbatchd, and sbatchd daemons in sequence.

Default

EGO_DEFINE_NCPUS=cores

EGO_ENABLE_AUTO_DAEMON_SHUTDOWN

Syntax

EGO_ENABLE_AUTO_DAEMON_SHUTDOWN="Y" | "N"

Description

For hosts that attempted to join the cluster but failed to communicate within the LSF_DYNAMIC_HOST_WAIT_TIME period, automatically shuts down any running daemons.

This parameter can be useful if an administrator regularly removes machines from the cluster (by editing the lsf.cluster file), or when a host belonging to the cluster is imaged but the new host should not be part of the cluster. An administrator no longer has to go to each host that is not a part of the cluster to shut down any running daemons.

Default

Set to Y at time of installation. If otherwise undefined, then N.

EGO_PARAMETER

EGO_ENABLE_AUTO_DAEMON_SHUTDOWN

EGO_ESLIM_TIMEOUT

Syntax

EGO_ESLIM_TIMEOUT=time_seconds

Description

Controls how long the LIM waits for any external static LIM scripts to run. After the timeout period expires, the LIM stops the scripts.

Use the external static LIM to automatically detect the operating system type and version of hosts. LSF automatically detects the operating system types and versions and displays them when running lshosts -l or lshosts -s. You can then specify those types in any -R resource requirement string. For example, bsub -R "select[ostype=RHEL4.6]".
Default

10 seconds

EGO_PARAMETER

EGO_ESLIM_TIMEOUT

JOB_STARTER_EXTEND

Syntax

JOB_STARTER_EXTEND="preservestarter" | "preservestarter userstarter"

Description

Applies to Windows execution hosts only. Allows you to use a job starter that includes symbols (for example: &&, |, ||). The job starter configured in JOB_STARTER_EXTEND can handle these special characters.

The file $LSF_TOP/9.1/misc/examples/preservestarter.c is the only extended job starter created by default. Users can also develop their own extended job starters based on preservestarter.c.

You must also set JOB_STARTER=preservestarter in lsb.queues.

Default

Not defined.

LS_DUPLICATE_RESOURCE_CHECKING

Syntax

LS_DUPLICATE_RESOURCE_CHECKING=Y|N

Description

When this parameter is enabled, there are strict checks for types of duplicated LS resources. LS resources can only override LSF resources that have the same necessary characteristics (shared, numeric, non-built-in) as the LS resources. Otherwise, the LSF resource takes effect and the LS resource is ignored.

When this parameter is disabled, there are no checks for duplicated LS resources. LS resources can override the resource in LSF without considering the resource's characteristics. In such cases, LSF may schedule jobs incorrectly.

It is not recommended to define duplicated resources in both LS and LSF.

After changing the parameter value, run badmin mbdrestart.

Default

Y

LSB_AFS_BIN_DIR

Syntax

LSB_AFS_BIN_DIR=path to aklog directory

Description

If LSB_AFS_JOB_SUPPORT=Y, LSF needs aklog in AFS to create a new PAG and apply for an AFS token. You can then use LSB_AFS_BIN_DIR to tell LSF the directory in which aklog resides.

If LSB_AFS_BIN_DIR is not defined, LSF searches in the following order: /bin, /usr/bin, /usr/local/bin, /usr/afs/bin. The search stops as soon as an executable aklog is found.

Default

Not defined.
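The fallback search described above is a first-match lookup across a fixed directory list. A minimal shell sketch of that logic (an illustration of the documented search order, not LSF source code; the helper name is our own):

```shell
# find_first_executable: print the path of the first directory in the list
# that contains an executable file with the given name (illustrative helper).
find_first_executable() {
  name=$1; shift
  for d in "$@"; do
    if [ -x "$d/$name" ]; then
      printf '%s\n' "$d/$name"
      return 0
    fi
  done
  return 1
}

# The documented order when LSB_AFS_BIN_DIR is not defined:
# find_first_executable aklog /bin /usr/bin /usr/local/bin /usr/afs/bin
```

Setting LSB_AFS_BIN_DIR simply replaces this fallback list with the configured directory.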
LSB_AFS_JOB_SUPPORT

Syntax

LSB_AFS_JOB_SUPPORT=Y|y|N|n

Description

When this parameter is set to Y|y:
1. LSF assumes the user's job is running in an AFS environment, and calls aklog -setpag to create a new PAG for the user's job if it is a sequential job, or to create a separate PAG for each task res if the job is a blaunch job.
2. LSF runs the erenew script after the TGT is renewed. This script is primarily used to run aklog.
3. LSF assumes that JOB_SPOOL_DIR resides in the AFS volume. It kerberizes the child sbatchd to get the AFS token so the child sbatchd can access JOB_SPOOL_DIR.

If this parameter is changed, restart the root RES for the change to take effect.

Default

N|n. LSF does not perform the three steps mentioned above.

LSB_AFS_LIB_DIR

Syntax

LSB_AFS_LIB_DIR=list of directories

Description

If LSB_AFS_JOB_SUPPORT=Y, LSF needs libkopenafs.so or libkopenafs.so.1 to create a PAG. You can then use LSB_AFS_LIB_DIR to tell LSF the directory in which libkopenafs.so or libkopenafs.so.1 resides. When this parameter is defined as a blank-space- or comma-separated list, LSF tries each item in the list to find and load libkopenafs.so or libkopenafs.so.1 to create a PAG.

If LSB_AFS_LIB_DIR is not defined, or if libkopenafs.so or libkopenafs.so.1 cannot be found at the configured locations, LSF searches in six predefined directories:
v /lib
v /lib64
v /usr/lib
v /usr/lib64
v /usr/local/lib
v /usr/local/lib64

The search stops as soon as libkopenafs.so or libkopenafs.so.1 is found and the PAG is created.

Default

Not defined.

LSB_API_CONNTIMEOUT

Syntax

LSB_API_CONNTIMEOUT=time_seconds

Description

The timeout in seconds when connecting to LSF.

Valid values

Any positive integer or zero

Default

10

See also

LSB_API_RECVTIMEOUT

LSB_API_RECVTIMEOUT

Syntax

LSB_API_RECVTIMEOUT=time_seconds

Description

Timeout in seconds when waiting for a reply from LSF.
Valid values

Any positive integer or zero

Default

10

See also

LSB_API_CONNTIMEOUT

LSB_API_VERBOSE

Syntax

LSB_API_VERBOSE=Y | N

Description

When LSB_API_VERBOSE=Y, LSF batch commands display a retry error message to stderr when LIM is not available:
LSF daemon (LIM) not responding ... still trying

When LSB_API_VERBOSE=N, LSF batch commands do not display a retry error message when LIM is not available.

Default

Y. Retry message is displayed to stderr.

LSB_BJOBS_CONSISTENT_EXIT_CODE

Syntax

LSB_BJOBS_CONSISTENT_EXIT_CODE=Y | N

Description

When LSB_BJOBS_CONSISTENT_EXIT_CODE=Y, the bjobs command exits with 0 only when unfinished jobs are found, and with 255 when no jobs are found or a non-existent job ID is entered.

No jobs are running:
bjobs
No unfinished job found
echo $?
255

Job 123 does not exist:
bjobs 123
Job <123> is not found
echo $?
255

Job 111 is running:
bjobs 111
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
111 user1 RUN normal hostA hostB myjob Oct 22 09:22
echo $?
0

Job 111 is running, and job 123 does not exist:
bjobs 111 123
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
111 user1 RUN normal hostA hostB myjob Oct 22 09:22
Job <123> is not found
echo $?
255

Job 111 is finished:
bjobs 111
No unfinished job found
echo $?
255

When LSB_BJOBS_CONSISTENT_EXIT_CODE=N, the bjobs command exits with 255 only when a non-existent job ID is entered. bjobs returns 0 when no jobs are found, all jobs are finished, or at least one job ID is valid.

No jobs are running:
bjobs
No unfinished job found
echo $?
0

Job 123 does not exist:
bjobs 123
Job <123> is not found
echo $?
0

Job 111 is running:
bjobs 111
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
111 user1 RUN normal hostA hostB myjob Oct 22 09:22
echo $?
0

Job 111 is running, and job 123 does not exist:
bjobs 111 123
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
111 user1 RUN normal hostA hostB myjob Oct 22 09:22
Job <123> is not found
echo $?
255

Job 111 is finished:
bjobs 111
No unfinished job found
echo $?
0

Default

N.

LSB_BJOBS_FORMAT

Syntax

LSB_BJOBS_FORMAT="field_name[:[-][output_width]] ... [delimiter='character']"

Description

Sets the customized output format for the bjobs command.
v Specify which bjobs fields (or aliases instead of the full field names), in which order, and with what width to display.
v Specify only the bjobs field name or alias to set its output to unlimited width and left justification.
v Specify the colon (:) without a width to set the output width to the recommended width for that field.
v Specify the colon (:) with a width to set the maximum number of characters to display for the field. When its value exceeds this width, bjobs truncates the output as follows:
– For the JOB_NAME field, bjobs removes the header characters and replaces them with an asterisk (*)
– For other fields, bjobs truncates the ending characters
v Specify a hyphen (-) to set right justification when displaying the output for the specific field. If not specified, the default is to set left justification when displaying output for a field.
v Use delimiter= to set the delimiting character to display between different headers and fields. This must be a single character. By default, the delimiter is a space.

Output customization only applies to the output for certain bjobs options, as follows:
v LSB_BJOBS_FORMAT and bjobs -o both apply to output for the bjobs command with no options, and for bjobs options with short form output that filter information, including the following: -a, -app, -cname, -d, -g, -G, -J, -Jd, -Lp, -m, -P, -q, -r, -sla, -u, -x, -X.
v LSB_BJOBS_FORMAT does not apply to output for bjobs options that use a modified format and filter information, but you can use bjobs -o to customize the output for these options. These options include the following: -fwd, -N, -p, -s.
v LSB_BJOBS_FORMAT and bjobs -o do not apply to output for bjobs options that use a modified format, including the following: -A, -aff, -aps, -l, -UF, -ss, -sum, -w, -W, -WF, -WL, -WP.

The bjobs -o option overrides the LSB_BJOBS_FORMAT environment variable, which overrides the LSB_BJOBS_FORMAT setting in lsf.conf.

Valid values

The following are the field names used to specify the bjobs fields to display, recommended width, aliases you can use instead of field names, and units of measurement for the displayed field:

Table 1. Output fields for bjobs

Common fields:
v jobid (width 7; alias id)
v stat (width 5)
v user (width 7)
v user_group (width 15; alias ugroup)
v queue (width 10)
v job_name (width 10; alias name)
v job_description (width 17; alias description)
v proj_name (width 11; aliases proj, project)
v application (width 13; alias app)
v service_class (width 13; alias sla)
v job_group (width 10; alias group)
v job_priority (width 12; alias priority)
v dependency (width 15)

Command fields:
v command (width 15; alias cmd)
v pre_exec_command (width 16; alias pre_cmd)
v post_exec_command (width 17; alias post_cmd)
v resize_notification_command (width 27; alias resize_cmd)
v pids (width 20)
v exit_code (width 10)
v exit_reason (width 50)

Host fields:
v from_host (width 11)
v first_host (width 11)
v exec_host (width 11)
v nexec_host (width 10)
Note: If the allocated host group or compute unit is condensed, the nexec_host field does not display the real number of hosts. Use bjobs -X -o to view the real number of hosts in these situations.
v alloc_slot (width 20)
v nalloc_slot (width 10)
v host_file (width 10)

Table 1.
Output fields for bjobs (continued)

Time fields:
v submit_time (width 15)
v start_time (width 15)
v estimated_start_time (width 20; alias estart_time)
v specified_start_time (width 20; alias sstart_time)
v specified_terminate_time (width 24; alias sterminate_time)
v time_left (width 11; unit: seconds)
v finish_time (width 16)
v %complete (width 11)
v warning_action (width 15; alias warn_act)
v action_warning_time (width 19; alias warn_time)

CPU fields:
v cpu_used (width 10)
v run_time (width 15; unit: seconds)
v idle_factor (width 11)
v exception_status (width 16; alias except_stat)
v slots (width 5)
v mem (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)
v max_mem (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)
v avg_mem (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)
v memlimit (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)
v swap (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)
v swaplimit (width 10; unit: LSF_UNIT_FOR_LIMITS in lsf.conf, KB by default)

Resource requirement fields:
v min_req_proc (width 12)
v max_req_proc (width 12)
v effective_resreq (width 17; alias eresreq)
v network_req (width 15)

Resource limit fields:
v filelimit (width 10)
v corelimit (width 10)
v stacklimit (width 10)
v processlimit (width 12)

File fields:
v input_file (width 10)
v output_file (width 11)
v error_file (width 10)
v output_dir (width 15)

Directory fields:
v sub_cwd (width 10)
v exec_home (width 10)
v exec_cwd (width 10)

MultiCluster fields:
v forward_cluster (width 15; alias fwd_cluster)
v forward_time (width 15; alias fwd_time)

Field names and aliases are not case sensitive. Valid values for the output width are any positive integer between 1 and 4096.

If the jobid field is defined with no output width and LSB_JOBID_DISP_LENGTH is defined in lsf.conf, the LSB_JOBID_DISP_LENGTH value is used for the output width. If jobid is defined with a specified output width, the specified output width overrides the LSB_JOBID_DISP_LENGTH value.

Example

LSB_BJOBS_FORMAT="jobid stat: queue:- project:10 application:-6 delimiter='^'"

Running bjobs displays the following fields:
v JOBID with unlimited width and left justified. If LSB_JOBID_DISP_LENGTH is specified, that value is used for the output width instead.
v STAT with a maximum width of five characters (which is the recommended width) and left justified.
v QUEUE with a maximum width of ten characters (which is the recommended width) and right justified.
v PROJECT with a maximum width of ten characters and left justified.
v APPLICATION with a maximum width of six characters and right justified.
v The ^ character is displayed between different headers and fields.

Default

Not defined. The current bjobs output is used.

LSB_BLOCK_JOBINFO_TIMEOUT

Syntax

LSB_BLOCK_JOBINFO_TIMEOUT=time_minutes

Description

Timeout in minutes for job information query commands (for example, bjobs).

Valid values

Any positive integer

Default

Not defined (no timeout)

See also

MAX_JOBINFO_QUERY_PERIOD in lsb.params

LSB_BPEEK_REMOTE_OUTPUT

Syntax

LSB_BPEEK_REMOTE_OUTPUT=y|Y|n|N

Description

If disabled (set to N), the bpeek command attempts to retrieve the job output from the local host first. If that fails, bpeek attempts to retrieve the job output from the remote host instead.

If enabled (set to Y), the order is reversed: the bpeek command attempts to retrieve the job output from the remote host first, then the local host.

When attempting to retrieve the job output from the remote host, bpeek attempts to use RES first, then rsh. If neither is running on the remote host, the bpeek command cannot retrieve the job output.

Best practices

Three directories are related to the bpeek command:
v the user's home directory
v JOB_SPOOL_DIR
v the checkpoint directory

If these directories are on a shared file system, this parameter can be disabled. If any of these directories are not on a shared file system, this parameter should be enabled, and either RES or rsh should be started on the remote job execution host.
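The local-first versus remote-first behavior above is simply an ordering decision. A minimal shell sketch (our own illustration of the documented order, not code from LSF; the helper name is hypothetical):

```shell
# peek_order: print the order in which bpeek would try to retrieve job
# output, given the LSB_BPEEK_REMOTE_OUTPUT setting (illustrative only).
peek_order() {
  case $1 in
    [Yy]) echo "remote local" ;;  # enabled: remote host first
    *)    echo "local remote" ;;  # disabled or unset: local host first
  esac
}
```

For example, peek_order Y prints "remote local", matching the enabled behavior described above.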
Default

N

LSB_BSUB_ERR_RETRY

Syntax

LSB_BSUB_ERR_RETRY=RETRY_CNT[integer] ERR_TYPE[error1 [error2] [...]]

Description

In some cases, jobs can benefit from being automatically retried when they fail with a particular error. When specified, LSB_BSUB_ERR_RETRY automatically retries jobs that exit with a particular reason, up to the number of times specified by RETRY_CNT.

Only the following error types (ERR_TYPE) are supported:
v BAD_XDR: Error during XDR.
v MSG_SYS: Failed to send or receive a message.
v INTERNAL: Internal library error.

The number of retries (RETRY_CNT) can be a minimum of 1 and a maximum of 50.

Considerations when setting this parameter:
v Users may experience what seems like a lag during job submission while the job is retried automatically in the background.
v Users may see a job submitted more than once, with no explanation (no error is communicated to the user; the job keeps getting submitted until it succeeds or reaches its maximum retry count). In this case, the job ID also changes each time the error is retried.

Default

Not defined. If the retry count is not valid, defaults to 5.

LSB_CHUNK_RUSAGE

Syntax

LSB_CHUNK_RUSAGE=y

Description

Applies only to chunk jobs. When set, sbatchd contacts PIM to retrieve resource usage information to enforce resource usage limits on chunk jobs.

By default, resource usage limits are not enforced for chunk jobs because chunk jobs are typically too short to allow LSF to collect resource usage. If LSB_CHUNK_RUSAGE=Y is defined, limits may not be enforced for chunk jobs that take less than a minute to run.

Default

Not defined. No resource usage is collected for chunk jobs.

LSB_CMD_LOG_MASK

Syntax

LSB_CMD_LOG_MASK=log_level

Description

Specifies the logging level of error messages from LSF batch commands. To specify the logging level of error messages for LSF commands, use LSF_CMD_LOG_MASK. To specify the logging level of error messages for LSF daemons, use LSF_LOG_MASK.
LSB_CMD_LOG_MASK sets the log level and is used in combination with LSB_DEBUG_CMD, which sets the log class for LSF batch commands. For example:
LSB_CMD_LOG_MASK=LOG_DEBUG
LSB_DEBUG_CMD="LC_TRACE LC_EXEC"

LSF commands log error messages at different levels so that you can choose to log all messages, or only messages that are deemed critical. The level specified by LSB_CMD_LOG_MASK determines which messages are recorded and which are discarded. All messages logged at the specified level or higher are recorded, while lower level messages are discarded.

For debugging purposes, the level LOG_DEBUG contains the fewest debugging messages and is used for basic debugging. The level LOG_DEBUG3 records all debugging messages, and can cause log files to grow very large; it is not often used. Most debugging is done at the level LOG_DEBUG2.

The commands log to the syslog facility unless LSB_CMD_LOGDIR is set.

Valid values

The log levels from highest to lowest are:
v LOG_EMERG
v LOG_ALERT
v LOG_CRIT
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Default

LOG_WARNING

See also

LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_CMD, LSB_TIME_CMD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR, LSF_TIME_CMD

LSB_CMD_LOGDIR

Syntax

LSB_CMD_LOGDIR=path

Description

Specifies the path to the LSF command log files.

Default

/tmp

See also

LSB_DEBUG, LSB_DEBUG_CMD, LSB_TIME_CMD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR, LSF_TIME_CMD

LSB_CONFDIR

Syntax

LSB_CONFDIR=path

Description

Specifies the path to the directory containing the LSF configuration files. The configuration directories are installed under LSB_CONFDIR. Configuration files for each cluster are stored in a subdirectory of LSB_CONFDIR. This subdirectory contains several files that define user and host lists, operation parameters, and queues.
All files and directories under LSB_CONFDIR must be readable from all hosts in the cluster. LSB_CONFDIR/cluster_name/configdir must be owned by the LSF administrator.

If live reconfiguration through the bconf command is enabled by the parameter LSF_LIVE_CONFDIR, configuration files are written to and read from the directory set by LSF_LIVE_CONFDIR.

Do not change this parameter after LSF has been installed.

Default

LSF_CONFDIR/lsbatch

See also

LSF_CONFDIR, LSF_LIVE_CONFDIR

LSB_CPUSET_BESTCPUS

Syntax

LSB_CPUSET_BESTCPUS=y | Y

Description

If set, enables the best-fit algorithm for SGI cpusets.

Default

Y (best-fit)

LSB_CPUSET_DISPLAY_CPULIST

Syntax

LSB_CPUSET_DISPLAY_CPULIST=Y | N

Description

The bjobs/bhist/bacct -l commands display the CPU IDs in the dynamic CPUset allocated on each host. The CPU IDs are displayed as CPUS=cpu_ID_list after NCPUS=num_cpus for each host. The cpu_ID_list is displayed in condensed format as a range of continuous IDs.

After enabling LSB_CPUSET_DISPLAY_CPULIST in lsf.conf, the LSF administrator must run badmin reconfig to make the change effective. CPU IDs are shown in bjobs/bhist/bacct -l for the cpuset jobs dispatched after badmin reconfig.

Default

N. bjobs/bhist/bacct -l do not display CPU IDs for cpusets allocated on each host.

LSB_DEBUG

Syntax

LSB_DEBUG=1 | 2

Description

Sets the LSF batch system to debug mode. If defined, LSF runs in single user mode:
v No security checking is performed
v Daemons do not run as root

When LSB_DEBUG is defined, LSF does not look in the system services database for port numbers. Instead, it uses the port numbers defined by the parameters LSB_MBD_PORT/LSB_SBD_PORT in lsf.conf. If these parameters are not defined, it uses port number 40000 for mbatchd and port number 40001 for sbatchd.

You should always specify 1 for this parameter unless you are testing LSF.

Can also be defined from the command line.
Valid values

LSB_DEBUG=1
The LSF system runs in the background with no associated control terminal.

LSB_DEBUG=2
The LSF system runs in the foreground and prints error messages to tty.

Default

Not defined

See also

LSB_DEBUG_CMD, LSB_DEBUG_MBD, LSB_DEBUG_NQS, LSB_DEBUG_SBD, LSB_DEBUG_SCH, LSF_DEBUG_LIM, LSF_DEBUG_RES, LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT, LSF_LOGDIR, LSF_LIM_DEBUG, LSF_RES_DEBUG

LSB_DEBUG_CMD

Syntax

LSB_DEBUG_CMD=log_class

Description

Sets the debugging log class for commands and APIs.

Specifies the log class filtering to be applied to LSF batch commands or the API. Only messages belonging to the specified log class are recorded. LSB_DEBUG_CMD sets the log class and is used in combination with LSB_CMD_LOG_MASK, which sets the log level. For example:
LSB_CMD_LOG_MASK=LOG_DEBUG
LSB_DEBUG_CMD="LC_TRACE LC_EXEC"

Debugging is turned on when you define both parameters.

The commands log to the syslog facility unless LSB_CMD_LOGDIR is defined.

To specify multiple log classes, use a space-separated list enclosed by quotation marks. For example:
LSB_DEBUG_CMD="LC_TRACE LC_EXEC"

Can also be defined from the command line.
Valid values

Valid log classes are:
v LC_ADVRSV and LC2_ADVRSV: Log advance reservation modifications
v LC_AFS and LC2_AFS: Log AFS messages
v LC_AUTH and LC2_AUTH: Log authentication messages
v LC_CHKPNT and LC2_CHKPNT: Log checkpointing messages
v LC_COMM and LC2_COMM: Log communication messages
v LC_DCE and LC2_DCE: Log messages pertaining to DCE support
v LC_EEVENTD and LC2_EEVENTD: Log eeventd messages
v LC_ELIM and LC2_ELIM: Log ELIM messages
v LC_EXEC and LC2_EXEC: Log significant steps for job execution
v LC_FAIR and LC2_FAIR: Log fairshare policy messages
v LC_FILE and LC2_FILE: Log file transfer messages
v LC_FLEX and LC2_FLEX: Log messages related to FlexNet
v LC2_GUARANTEE: Log messages related to guarantee SLAs
v LC_HANG and LC2_HANG: Mark where a program might hang
v LC_JARRAY and LC2_JARRAY: Log job array messages
v LC_JLIMIT and LC2_JLIMIT: Log job slot limit messages
v LC2_LIVECONF: Log live reconfiguration messages
v LC_LOADINDX and LC2_LOADINDX: Log load index messages
v LC_M_LOG and LC2_M_LOG: Log multievent logging messages
v LC_MEMORY and LC2_MEMORY: Log messages related to MEMORY allocation
v LC_MPI and LC2_MPI: Log MPI messages
v LC_MULTI and LC2_MULTI: Log messages pertaining to MultiCluster
v LC_PEND and LC2_PEND: Log messages related to job pending reasons
v LC_PERFM and LC2_PERFM: Log performance messages
v LC_PIM and LC2_PIM: Log PIM messages
v LC_PREEMPT and LC2_PREEMPT: Log preemption policy messages
v LC_RESOURCE and LC2_RESOURCE: Log messages related to resource broker
v LC_RESREQ and LC2_RESREQ: Log resource requirement messages
v LC_SCHED and LC2_SCHED: Log messages pertaining to the mbatchd scheduler
v LC_SIGNAL and LC2_SIGNAL: Log messages pertaining to signals
v LC_SYS and LC2_SYS: Log system call messages
v LC_TRACE and LC2_TRACE: Log significant program walk steps
v LC_XDR and LC2_XDR: Log everything transferred by XDR
v LC_XDRVERSION and LC2_XDRVERSION: Log messages for XDR version
v LC2_KRB: Log messages related to Kerberos integration
v LC2_DC: Log messages related to Dynamic Cluster
v LC2_CGROUP: Log messages related to cgroup operation
v LC2_TOPOLOGY: Log messages related to hardware topology
v LC2_AFFINITY: Log messages related to affinity
v LC2_LSF_PE: Log messages related to LSF PE integration
v LC2_DAS: Log messages related to LSF data manager

Default

Not defined

See also

LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_MBD, LSB_DEBUG_NQS, LSB_DEBUG_SBD, LSB_DEBUG_SCH, LSF_DEBUG_LIM, LSF_DEBUG_RES, LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT, LSF_LOGDIR, LSF_LIM_DEBUG, LSF_RES_DEBUG

LSB_DEBUG_MBD

Syntax

LSB_DEBUG_MBD=log_class

Description

Sets the debugging log class for mbatchd.

Specifies the log class filtering to be applied to mbatchd. Only messages belonging to the specified log class are recorded.

LSB_DEBUG_MBD sets the log class and is used in combination with LSF_LOG_MASK, which sets the log level. For example:
LSF_LOG_MASK=LOG_DEBUG
LSB_DEBUG_MBD="LC_TRACE LC_EXEC"

To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LSB_DEBUG_MBD="LC_TRACE LC_EXEC"

You need to restart the daemons after setting LSB_DEBUG_MBD for your changes to take effect. If you use the command badmin mbddebug to temporarily change this parameter without changing lsf.conf, you do not need to restart the daemons.

Valid values

Valid log classes are the same as for LSB_DEBUG_CMD, except for the log class LC_ELIM, which cannot be used with LSB_DEBUG_MBD. See LSB_DEBUG_CMD.
Default

Not defined

See also

LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_NQS, LSB_DEBUG_SBD, LSB_DEBUG_SCH, LSF_DEBUG_LIM, LSF_DEBUG_RES, LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT, LSF_LOGDIR, LSF_LIM_DEBUG, LSF_RES_DEBUG

LSB_DEBUG_SBD

Syntax

LSB_DEBUG_SBD=log_class

Description

Sets the debugging log class for sbatchd.

Specifies the log class filtering to be applied to sbatchd. Only messages belonging to the specified log class are recorded.

LSB_DEBUG_SBD sets the log class and is used in combination with LSF_LOG_MASK, which sets the log level. For example:
LSF_LOG_MASK=LOG_DEBUG
LSB_DEBUG_SBD="LC_TRACE LC_EXEC"

To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LSB_DEBUG_SBD="LC_TRACE LC_EXEC"

You need to restart the daemons after setting LSB_DEBUG_SBD for your changes to take effect. If you use the command badmin sbddebug to temporarily change this parameter without changing lsf.conf, you do not need to restart the daemons.

Valid values

Valid log classes are the same as for LSB_DEBUG_CMD, except for the log class LC_ELIM, which cannot be used with LSB_DEBUG_SBD. See LSB_DEBUG_CMD.

Default

Not defined

See also

LSB_DEBUG_MBD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR, badmin

LSB_DEBUG_SCH

Syntax

LSB_DEBUG_SCH=log_class

Description

Sets the debugging log class for mbschd.

Specifies the log class filtering to be applied to mbschd. Only messages belonging to the specified log class are recorded.

LSB_DEBUG_SCH sets the log class and is used in combination with LSF_LOG_MASK, which sets the log level. For example:
LSF_LOG_MASK=LOG_DEBUG
LSB_DEBUG_SCH="LC_SCHED"

To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LSB_DEBUG_SCH="LC_SCHED LC_TRACE LC_EXEC"

You need to restart the daemons after setting LSB_DEBUG_SCH for your changes to take effect.
Valid values

Valid log classes are the same as for LSB_DEBUG_CMD, except for the log class LC_ELIM, which cannot be used with LSB_DEBUG_SCH, and LC_HPC and LC_SCHED, which are only valid for LSB_DEBUG_SCH. See LSB_DEBUG_CMD.

Default

Not defined

See also

LSB_DEBUG_MBD, LSB_DEBUG_SBD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR, badmin

LSB_DISABLE_LIMLOCK_EXCL

Syntax

LSB_DISABLE_LIMLOCK_EXCL=y | n

Description

If preemptive scheduling is enabled, this parameter enables preemption of, and preemption by, exclusive jobs when PREEMPT_JOBTYPE=EXCLUSIVE in lsb.params.

Changing this parameter requires a restart of all sbatchds in the cluster (badmin hrestart). Do not change this parameter while exclusive jobs are running.

When LSB_DISABLE_LIMLOCK_EXCL=y, for a host running an exclusive job:
v LIM is not locked on the host.
v lsload displays the host status ok.
v bhosts displays the host status closed.
v Users can run tasks on the host using lsrun or lsgrun. To prevent users from running tasks during execution of an exclusive job, the parameter LSF_DISABLE_LSRUN=y must be defined in lsf.conf.

Default

Set to y at time of installation. If otherwise undefined, then n (LSF locks the LIM on a host running an exclusive job and unlocks the LIM when the exclusive job finishes).

LSB_DISABLE_RERUN_POST_EXEC

Syntax

LSB_DISABLE_RERUN_POST_EXEC=y | Y

Description

If set, and the job is rerunnable, the POST_EXEC configured at the job level or the queue level is not executed if the job is rerun.

Running post-execution commands upon restart of a rerunnable job may not always be desirable. For example, if the post-exec removes certain files, or does other cleanup that should only happen if the job finishes successfully, use LSB_DISABLE_RERUN_POST_EXEC to prevent the post-exec from running and allow the successful continuation of the job when it reruns.
The POST_EXEC may still run for a rerun job when the execution host loses contact with the master host due to network problems. In this case mbatchd assumes the job has failed and restarts the job on another host. The original execution host, out of contact with the master host, completes the job and runs the POST_EXEC.

Default

Not defined

LSB_DISPATCH_CHECK_RESUME_INTVL

Syntax

LSB_DISPATCH_CHECK_RESUME_INTVL=y|Y|n|N

Description

When this parameter is enabled, LSF takes the last resume time of a host into account when considering dispatching PEND jobs to that host. LSF does not dispatch PEND jobs to a host if a job was resumed on that host within the mbdSleepTime interval.

After configuring this parameter, run badmin reconfig to have it take effect.

Default

N

LSB_DISPLAY_YEAR

Syntax

LSB_DISPLAY_YEAR=y|Y|n|N

Description

Toggles on and off inclusion of the year in the time string displayed by the commands bjobs -l, bacct -l, and bhist -l|-b|-t.

Default

N

LSB_EAUTH_DATA_REUSE

Syntax

LSB_EAUTH_DATA_REUSE=Y | N

Description

When set to Y, blaunch caches in memory the authentication data returned by eauth -c when connecting to the first remote execution server. blaunch uses this cached data to authenticate subsequent first remote execution servers.

If set to N, blaunch does not cache authentication data. Every time blaunch connects to a different first remote execution server, it calls eauth -c to fetch new authentication data.

Default

Y

LSB_EAUTH_EACH_SUBPACK

Syntax

LSB_EAUTH_EACH_SUBPACK=Y|N

Description

Enable this parameter to have bsub call eauth for each sub-pack. LSB_MAX_PACK_JOBS defines the number of jobs that one sub-pack contains. If LSB_EAUTH_EACH_SUBPACK is not enabled, bsub only calls eauth for the first sub-pack and caches this eauth data. The cached data is reused for each subsequent sub-pack instead of bsub calling eauth again.
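Assuming sub-packs are formed by splitting a pack of jobs into groups of LSB_MAX_PACK_JOBS (our reading of the description above, not LSF source code), the number of eauth invocations bsub makes can be sketched as:

```shell
# eauth_calls: estimated number of eauth invocations for NJOBS submitted
# as a pack, with sub-packs of MAXPACK jobs, depending on whether
# LSB_EAUTH_EACH_SUBPACK is enabled (illustrative helper, not LSF code).
eauth_calls() {  # usage: eauth_calls NJOBS MAXPACK Y|N
  njobs=$1 maxpack=$2 each=$3
  subpacks=$(( (njobs + maxpack - 1) / maxpack ))  # ceiling division
  case $each in
    [Yy]) echo "$subpacks" ;;  # eauth once per sub-pack
    *)    echo 1 ;;            # eauth once; data cached for later sub-packs
  esac
}
```

For example, under this assumption, 250 jobs with LSB_MAX_PACK_JOBS=100 would mean 3 eauth calls when the parameter is enabled, and 1 when it is not.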
Default

N

LSB_ECHKPNT_KEEP_OUTPUT

Syntax

LSB_ECHKPNT_KEEP_OUTPUT=y | Y

Description

Saves the standard output and standard error of custom echkpnt and erestart methods to:
v checkpoint_dir/$LSB_JOBID/echkpnt.out
v checkpoint_dir/$LSB_JOBID/echkpnt.err
v checkpoint_dir/$LSB_JOBID/erestart.out
v checkpoint_dir/$LSB_JOBID/erestart.err

Can also be defined as an environment variable.

Default

Not defined. Standard error and standard output messages from custom echkpnt and erestart programs are directed to /dev/null and discarded by LSF.

See also

LSB_ECHKPNT_METHOD, LSB_ECHKPNT_METHOD_DIR

LSB_ECHKPNT_METHOD

Syntax

LSB_ECHKPNT_METHOD="method_name [method_name] ..."

Description

Name of custom echkpnt and erestart methods.

Can also be defined as an environment variable, or specified through the bsub -k option.

The name you specify here is used for both your custom echkpnt and erestart programs. You must assign your custom echkpnt and erestart programs the names echkpnt.method_name and erestart.method_name. The programs echkpnt.method_name and erestart.method_name must be in LSF_SERVERDIR or in the directory specified by LSB_ECHKPNT_METHOD_DIR.

Do not define LSB_ECHKPNT_METHOD=default, as default is a reserved keyword indicating the default echkpnt and erestart methods of LSF. You can, however, specify bsub -k "my_dir method=default" my_job to indicate that you want to use the default checkpoint and restart methods.

When this parameter is not defined in lsf.conf or as an environment variable, and no custom method is specified at job submission through bsub -k, LSF uses echkpnt.default and erestart.default to checkpoint and restart jobs. When this parameter is defined, LSF uses the custom checkpoint and restart methods specified.

Limitations

The method name and directory (LSB_ECHKPNT_METHOD_DIR) combination must be unique in the cluster.
For example, you may have two echkpnt applications with the same name, such as echkpnt.mymethod; what differentiates them is the different directories defined with LSB_ECHKPNT_METHOD_DIR. It is the cluster administrator’s responsibility to ensure that method name and method directory combinations are unique in the cluster.

Default
Not defined. LSF uses echkpnt.default and erestart.default to checkpoint and restart jobs.

See also
LSB_ECHKPNT_METHOD_DIR, LSB_ECHKPNT_KEEP_OUTPUT

LSB_ECHKPNT_METHOD_DIR

Syntax
LSB_ECHKPNT_METHOD_DIR=path

Description
Absolute path name of the directory in which custom echkpnt and erestart programs are located. The checkpoint method directory should be accessible by all users who need to run the custom echkpnt and erestart programs.

Can also be defined as an environment variable.

Default
Not defined. LSF searches LSF_SERVERDIR for custom echkpnt and erestart programs.

See also
LSB_ECHKPNT_METHOD, LSB_ECHKPNT_KEEP_OUTPUT

LSB_ENABLE_HPC_ALLOCATION

Syntax
LSB_ENABLE_HPC_ALLOCATION=Y|y|N|n

Description
When set to Y|y, this parameter changes the concept of the required number of slots for a job to the required number of tasks for a job. The specified number of tasks is the number of tasks to launch on execution hosts. For an exclusive job, the allocated slots change to all slots on the allocated execution hosts, in order to reflect the actual slot allocation.

This improves job scheduling, job accounting, and resource utilization.
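For example, switching a cluster to task-based allocation needs only the following setting in lsf.conf (the restart requirement is covered in the Important notes for this parameter):

```
# lsf.conf
LSB_ENABLE_HPC_ALLOCATION=Y   # requested "slots" are treated as requested tasks
```

Run badmin mbdrestart afterward so that mbatchd picks up the change.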
When LSB_ENABLE_HPC_ALLOCATION is not set or is set to N|n, the following behavior changes still take effect:
v Event and API changes for the task concept
v Pending reasons in bjobs output keep the task concept
v Changes for PROCLIMIT, PER_SLOT, and IMPT_SLOTBKLG
v The field "alloc_slot nalloc_slot" for the bjobs -o option
v The -alloc option in bqueues, bhosts, busers, and bapp remains
v Command help messages change to the task concept
v Error messages in logs change to the task concept

The following behavior changes do NOT take effect:
v Command output for bjobs, bhist, and bacct
v Exclusive job slot allocation change

Important:
v After changing the LSB_ENABLE_HPC_ALLOCATION setting, the cluster administrator must run badmin mbdrestart to restart mbatchd and make the change take effect in mbatchd.
v Changing LSB_ENABLE_HPC_ALLOCATION takes effect immediately for the affected commands (that is, bjobs, bhist, bacct, bapp, bhosts, bmod, bqueues, bsub, and busers). It is therefore not required to restart mbatchd to enable the change for b* commands.
v For a non-shared LSF installation, b* commands take effect on slave hosts according to the local lsf.conf setting. If LSB_ENABLE_HPC_ALLOCATION is not defined in the local lsf.conf, slave b* commands call the master lim and take the configuration from the master host. To allow a slave to take the latest configuration from the master, the master lim must do a reconfig after configuration changes on the master.

Default
N
For new installations of LSF, LSB_ENABLE_HPC_ALLOCATION is set to Y automatically.

LSB_ENABLE_PERF_METRICS_LOG

Syntax
LSB_ENABLE_PERF_METRICS_LOG=Y | N

Description
Enable this parameter to have LSF log mbatchd performance metrics. In any sample period, data that is not likely to cause performance issues is not logged. The performance metric data is logged periodically, according to the time interval set in LSB_PERF_METRICS_SAMPLE_PERIOD.
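A minimal sketch of enabling metric logging in lsf.conf; the sample-period value shown is an illustrative assumption, so check the LSB_PERF_METRICS_SAMPLE_PERIOD entry for its actual units and default:

```
# lsf.conf -- sample period value is an assumption, for illustration only
LSB_ENABLE_PERF_METRICS_LOG=Y
LSB_PERF_METRICS_SAMPLE_PERIOD=5
```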
Default
N

LSB_ESUB_METHOD

Syntax
LSB_ESUB_METHOD="esub_application [esub_application] ..."

Description
Specifies a mandatory esub that applies to all job submissions. LSB_ESUB_METHOD lists the names of the application-specific esub executables used in addition to any executables specified by the bsub -a option.

For example, LSB_ESUB_METHOD="dce fluent" runs LSF_SERVERDIR/esub.dce and LSF_SERVERDIR/esub.fluent for all jobs submitted to the cluster. These esubs define, respectively, DCE as the mandatory security system and FLUENT as the mandatory application for all jobs.

LSB_ESUB_METHOD can also be defined as an environment variable.

The value of LSB_ESUB_METHOD must correspond to an actual esub file. For example, to use LSB_ESUB_METHOD=fluent, the file esub.fluent must exist in LSF_SERVERDIR.

The name of the esub program must be a valid file name. Valid file names contain only alphanumeric characters, underscore (_), and hyphen (-).

Restriction: The name esub.user is reserved. Do not use the name esub.user for an application-specific esub.

The master esub (mesub) uses the name you specify to invoke the appropriate esub program. The esub and esub.esub_application programs must be located in LSF_SERVERDIR.

LSF does not detect conflicts based on esub names. For example, if LSB_ESUB_METHOD="openmpi" and bsub -a pvm is specified at job submission, the job could fail because these esubs define two different types of parallel job handling.

Default
Not defined. LSF does not apply a mandatory esub to jobs submitted to the cluster.

LSB_EVENTS_FILE_KEEP_OPEN

Syntax
LSB_EVENTS_FILE_KEEP_OPEN=Y|N

Description
Windows only. Specify Y to open the events file once and keep it open always. Specify N to open and close the events file each time a record is written.

Default
Y

LSB_FANOUT_TIMEOUT_PER_LAYER

Syntax
LSB_FANOUT_TIMEOUT_PER_LAYER=time_seconds

Description
Controls how long sbatchd waits for the next sbatchd to reply.
Can also be set as an environment variable, which takes precedence.

Default
20 seconds.

LSB_FORK_JOB_REQUEST

Syntax
LSB_FORK_JOB_REQUEST=Y|N

Description
This parameter is enabled by default to improve mbatchd response time after mbatchd is started or restarted (including parallel restart) and has finished replaying events. A child mbatchd process is forked to sync cluster state information to mbschd after events have been replayed.

Default
Y

LSB_HJOB_PER_SESSION

Syntax
LSB_HJOB_PER_SESSION=max_num

Description
Specifies the maximum number of jobs that can be dispatched in each scheduling cycle to each host.

Valid values
Any positive integer

Default
Not defined

Notes
LSB_HJOB_PER_SESSION is activated only if the JOB_ACCEPT_INTERVAL parameter is set to 0.

See also
JOB_ACCEPT_INTERVAL parameter in lsb.params

LSB_HUGETLB

Syntax
LSB_HUGETLB=Y|N

Description
The information about which virtual memory page maps to which physical memory page is kept in a data structure named the page table. Most architectures use a fast lookup cache named the Translation Lookaside Buffer (TLB). TLB misses bring additional performance costs, so it is important to reduce them. Advanced architectures like x86 and IA64 support huge page sizes (for example, 2 MB and 4 MB). Using huge pages reduces the number of memory pages, which decreases TLB misses and improves performance for processes such as forked children.

To configure huge page memory:

1. Check the support and configuration of huge page size:
cat /proc/meminfo | grep Huge

The output of cat /proc/meminfo includes lines such as:
HugePages_Total: vvv
HugePages_Free: www
HugePages_Rsvd: xxx
HugePages_Surp: yyy
Hugepagesize: zzz kB

Where:
v HugePages_Total is the size of the pool of huge pages.
v HugePages_Free is the number of huge pages in the pool that are not yet allocated.
v HugePages_Rsvd is short for "reserved" and is the number of huge pages for which a commitment to allocate from the pool has been made, though no allocation has yet been made. Reserved huge pages guarantee that an application can allocate a huge page from the pool of huge pages at fault time.
v HugePages_Surp is short for "surplus" and is the number of huge pages in the pool above the value in /proc/sys/vm/nr_hugepages. The maximum number of surplus huge pages is controlled by /proc/sys/vm/nr_overcommit_hugepages.

2. Configure the number of huge size pages:
v To set the number of huge pages using a /proc entry:
# echo 5 > /proc/sys/vm/nr_hugepages
v To set the number of huge pages using sysctl:
# sysctl -w vm.nr_hugepages=5
v To make the change permanent, append the setting to the file /etc/sysctl.conf, which is used during the boot process:
# echo "vm.nr_hugepages=5" >> /etc/sysctl.conf

You must reboot to allocate the number of huge pages needed, because huge pages require large areas of contiguous physical memory. Over time, physical memory may be mapped and allocated to pages, so the physical memory can become fragmented.

Default
N

LSB_INDEX_BY_JOB

Syntax
LSB_INDEX_BY_JOB="JOBNAME"

Description
When set to JOBNAME, creates a job index of job names. Define when using job dependency conditions (bsub -w) with job names to optimize job name searches.

Valid values
JOBNAME

Default
Not defined (job index is not created).

LSB_INTERACT_MSG_ENH

Syntax
LSB_INTERACT_MSG_ENH=y | Y

Description
If set, enables enhanced messaging for interactive batch jobs. To disable interactive batch job messages, set LSB_INTERACT_MSG_ENH to any value other than y or Y; for example, LSB_INTERACT_MSG_ENH=N.

Default
Not defined

See also
LSB_INTERACT_MSG_INTVAL

LSB_INTERACT_MSG_INTVAL

Syntax
LSB_INTERACT_MSG_INTVAL=time_seconds

Description
Specifies the update interval in seconds for interactive batch job messages.
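A sketch combining both interactive-message parameters in lsf.conf; the 30-second interval is an illustrative value, not a documented default:

```
# lsf.conf -- interval value is illustrative
LSB_INTERACT_MSG_ENH=Y       # enable enhanced interactive job messages
LSB_INTERACT_MSG_INTVAL=30   # update messages every 30 seconds
```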
LSB_INTERACT_MSG_INTVAL is ignored if LSB_INTERACT_MSG_ENH is not set.

Job information that LSF uses to get the pending or suspension reason is updated according to the value of PEND_REASON_UPDATE_INTERVAL in lsb.params.

Default
Not defined. If LSB_INTERACT_MSG_INTVAL is set to an incorrect value, the default update interval is 60 seconds.

See also
LSB_INTERACT_MSG_ENH

LSB_JOB_CPULIMIT

Syntax
LSB_JOB_CPULIMIT=y | n

Description
Determines whether the CPU limit is a per-process limit enforced by the OS or a per-job limit enforced by LSF:
v The per-process limit is enforced by the OS when the CPU time of one process of the job exceeds the CPU limit.
v The per-job limit is enforced by LSF when the total CPU time of all processes of the job exceeds the CPU limit.

This parameter applies to CPU limits set when a job is submitted with bsub -c, and to CPU limits set for queues by CPULIMIT in lsb.queues.

v LSF-enforced per-job limit: When the sum of the CPU time of all processes of a job exceeds the CPU limit, LSF sends a SIGXCPU signal (where supported by the operating system) to all processes belonging to the job, followed by SIGINT, SIGTERM, and SIGKILL. The interval between signals is 10 seconds by default. The time interval between SIGXCPU, SIGINT, SIGTERM, and SIGKILL can be configured with the parameter JOB_TERMINATE_INTERVAL in lsb.params.

Restriction: SIGXCPU is not supported by Windows.

v OS-enforced per-process limit: When one process in the job exceeds the CPU limit, the limit is enforced by the operating system. For more details, refer to your operating system documentation for setrlimit().
The setting of LSB_JOB_CPULIMIT has the following effect on how the limit is enforced:

LSB_JOB_CPULIMIT    LSF per-job limit    OS per-process limit
y                   Enabled              Disabled
n                   Disabled             Enabled
Not defined         Enabled              Enabled

Default
Not defined

Notes
To make LSB_JOB_CPULIMIT take effect, use the command badmin hrestart all to restart all sbatchds in the cluster.

Changing the default Terminate job control action: You can define a different terminate action in lsb.queues with the parameter JOB_CONTROLS if you do not want the job to be killed. For more details on job controls, see Administering IBM Platform LSF.

Limitations
If a job is running and the parameter is changed, LSF is not able to reset the type of limit enforcement for running jobs.
v If the parameter is changed from per-process limit enforced by the OS to per-job limit enforced by LSF (LSB_JOB_CPULIMIT=n changed to LSB_JOB_CPULIMIT=y), both the per-process limit and the per-job limit affect the running job. This means that signals may be sent to the job either when an individual process exceeds the CPU limit or when the sum of the CPU time of all processes of the job exceeds the limit. A job that is running may be killed by the OS or by LSF.
v If the parameter is changed from per-job limit enforced by LSF to per-process limit enforced by the OS (LSB_JOB_CPULIMIT=y changed to LSB_JOB_CPULIMIT=n), the job is allowed to run without limits because the per-process limit was previously disabled.

See also
lsb.queues, bsub, JOB_TERMINATE_INTERVAL in lsb.params, LSB_MOD_ALL_JOBS

LSB_JOBID_DISP_LENGTH

Syntax
LSB_JOBID_DISP_LENGTH=integer

Description
By default, the LSF commands bjobs and bhist display job IDs with a maximum length of 7 characters. Job IDs greater than 9999999 are truncated on the left. When LSB_JOBID_DISP_LENGTH=10, the width of the JOBID column in bjobs and bhist increases to 10 characters.
If LSB_BJOBS_FORMAT is defined (in lsf.conf or as a runtime environment variable) or bjobs -o is run to include the JOBID column, and either of these specifies a column width for the JOBID column, that specification overrides the LSB_JOBID_DISP_LENGTH value. If there is no column width specified for the JOBID column, the LSB_JOBID_DISP_LENGTH value applies.

Valid values
Specify an integer between 7 and 10.

Default
Not defined. LSF uses the default 7-character length for job ID display.

LSB_JOB_MEMLIMIT

Syntax
LSB_JOB_MEMLIMIT=y | n

Description
Determines whether the memory limit is a per-process limit enforced by the OS or a per-job limit enforced by LSF.
v The per-process limit is enforced by the OS when the memory allocated to one process of the job exceeds the memory limit.
v The per-job limit is enforced by LSF when the sum of the memory allocated to all processes of the job exceeds the memory limit.

This parameter applies to memory limits set when a job is submitted with bsub -M mem_limit, and to memory limits set for queues with MEMLIMIT in lsb.queues.

The setting of LSB_JOB_MEMLIMIT has the following effect on how the limit is enforced:

LSB_JOB_MEMLIMIT    LSF-enforced per-job limit    OS-enforced per-process limit
y                   Enabled                       Disabled
n or not defined    Disabled                      Enabled

LSF-enforced per-job limit: When the total memory allocated to all processes in the job exceeds the memory limit, LSF sends the following signals to kill the job: SIGINT, SIGTERM, then SIGKILL. The interval between signals is 10 seconds by default. On UNIX, the time interval between SIGINT, SIGTERM, and SIGKILL can be configured with the parameter JOB_TERMINATE_INTERVAL in lsb.params.
OS-enforced per-process limit: When the memory allocated to one process of the job exceeds the memory limit, the operating system enforces the limit. LSF passes the memory limit to the operating system. Some operating systems apply the memory limit to each process, and some do not enforce the memory limit at all. OS memory limit enforcement is only available on systems that support RLIMIT_RSS for setrlimit().

The following operating systems do not support the memory limit at the OS level, and the job is allowed to run without a memory limit:
v Windows
v Sun Solaris 2.x

Default
Not defined. The per-process memory limit enforced by the OS is enabled; the per-job memory limit enforced by LSF is disabled.

Notes
To make LSB_JOB_MEMLIMIT take effect, use the command badmin hrestart all to restart all sbatchds in the cluster.

If LSB_JOB_MEMLIMIT is set, it overrides the setting of the parameter LSB_MEMLIMIT_ENFORCE. The parameter LSB_MEMLIMIT_ENFORCE is ignored.

The difference between LSB_JOB_MEMLIMIT set to y and LSB_MEMLIMIT_ENFORCE set to y is that with LSB_JOB_MEMLIMIT, only the per-job memory limit enforced by LSF is enabled. The per-process memory limit enforced by the OS is disabled. With LSB_MEMLIMIT_ENFORCE set to y, both the per-job memory limit enforced by LSF and the per-process memory limit enforced by the OS are enabled.

Changing the default Terminate job control action: You can define a different Terminate action in lsb.queues with the parameter JOB_CONTROLS if you do not want the job to be killed. For more details on job controls, see Administering IBM Platform LSF.

Limitations
If a job is running and the parameter is changed, LSF is not able to reset the type of limit enforcement for running jobs.
v If the parameter is changed from per-process limit enforced by the OS to per-job limit enforced by LSF (LSB_JOB_MEMLIMIT=n or not defined changed to LSB_JOB_MEMLIMIT=y), both the per-process limit and the per-job limit affect the running job.
This means that signals may be sent to the job either when the memory allocated to an individual process exceeds the memory limit or when the sum of memory allocated to all processes of the job exceeds the limit. A job that is running may be killed by LSF.
v If the parameter is changed from per-job limit enforced by LSF to per-process limit enforced by the OS (LSB_JOB_MEMLIMIT=y changed to LSB_JOB_MEMLIMIT=n or not defined), the job is allowed to run without limits because the per-process limit was previously disabled.

See also
LSB_MEMLIMIT_ENFORCE, LSB_MOD_ALL_JOBS, lsb.queues, bsub, JOB_TERMINATE_INTERVAL in lsb.params

LSB_JOB_OUTPUT_LOGGING

Syntax
LSB_JOB_OUTPUT_LOGGING=Y | N

Description
Determines whether jobs write job notification messages to the logfile.

Default
Not defined (jobs do not write job notification messages to the logfile).

LSB_JOB_REPORT_MAIL

Syntax
LSB_JOB_REPORT_MAIL=Y|N

Description
If you do not want sbatchd to send mail when the job is done, set this parameter to N before submitting the job. This parameter only affects email sent by sbatchd.

When the administrator sets LSB_JOB_REPORT_MAIL in lsf.conf, email notification for all jobs is disabled, and all sbatchds must be restarted on all hosts. However, end users can set the value for LSB_JOB_REPORT_MAIL in the job submission environment to disable email notification for only that particular job and not for all jobs. In this case, there is no need to restart sbatchd.

Default
Not defined.

LSB_JOBINFO_DIR

Syntax
LSB_JOBINFO_DIR=directory

Description
Use this parameter to specify a directory for job information instead of using the default directory. If this parameter is specified, LSF directly accesses this directory to get the job information files.

By default, the job information directory is located in the LSF shared directory, which is in the same file system as the one used for logging events. In large-scale clusters with millions of single jobs, there are very large numbers of job files in the job information directory. The job information directory requires random read/write operations on multiple job files simultaneously, while the event log directory has events appended to a single events file.

The LSB_JOBINFO_DIR directory must be the following:
v Owned by the primary LSF administrator
v Accessible from all hosts that can potentially become the master host
v Accessible from the master host with read and write permission
v Set for 700 permission

If the directory cannot be created, mbatchd will exit.

Note: Using the LSB_JOBINFO_DIR parameter requires draining the whole cluster.

Note: LSB_JOBINFO_DIR should be used for XL clusters. If it is configured for a non-XL cluster, all of the old job information directories must be copied to the new specified location.
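As an illustrative sketch (the path here is an assumption, not a documented value), a large cluster might point job information at a fast shared file system separate from the event log directory:

```
# lsf.conf -- path is an illustrative assumption
LSB_JOBINFO_DIR=/fastfs/lsf/jobinfo
```

The directory must be owned by the primary LSF administrator with 700 permission, and the whole cluster must be drained before switching over.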
Default
If this parameter is not set, LSF uses the following path:
$LSB_SHAREDIR/cluster_name/logdir/info

LSB_KEEP_SYSDEF_RLIMIT

Syntax
LSB_KEEP_SYSDEF_RLIMIT=y | n

Description
If LSB_KEEP_SYSDEF_RLIMIT=y, and no resource limits are configured for a user in the SGI IRIX User Limits Database (ULDB) domain specified in LSF_ULDB_DOMAIN and there is no domain default, the system default is honored.

If LSB_KEEP_SYSDEF_RLIMIT=n, and no resource limits are configured in the domain for the user and there is no domain default, LSF overrides the system default and sets system limits to unlimited.

Default
Not defined. LSF behaves as if LSB_KEEP_SYSDEF_RLIMIT=n: if no resource limits are configured in the domain for the user and there is no domain default, system limits are set to unlimited.

LSB_KRB_CHECK_INTERVAL

Syntax
LSB_KRB_CHECK_INTERVAL=minutes

Description
Sets the time interval that krbrenewd and root sbatchd wait before the next check. If this parameter is changed, restart mbatchd and sbatchd to have it take effect.

Default
15 minutes

LSB_KRB_LIB_PATH

Syntax
LSB_KRB_LIB_PATH=path to krb5 lib

Description
Specify the library path that contains the krb5 libraries:
v libkrb5.so
v libcom_err.so
v libk5crypto.so
v libkrb5support.so

You can configure multiple paths in this parameter, separated by blanks, commas, or semicolons. LSF loads the libraries from these paths in the order you specify. Once a library is loaded successfully, the search stops.

Default
On 32-bit platforms: /lib, /usr/lib, /usr/local/lib
On 64-bit platforms: /lib64, /usr/lib64, /usr/local/lib64

LSB_KRB_RENEW_MARGIN

Syntax
LSB_KRB_RENEW_MARGIN=minutes

Description
Specifies how long krbrenewd and root sbatchd have to renew a Ticket Granting Ticket (TGT) before it expires. If this parameter is changed, restart mbatchd and sbatchd to have it take effect.

Default
60 minutes

LSB_KRB_TGT_FWD

Syntax
LSB_KRB_TGT_FWD=Y|N

Description
Use this parameter to control the user Ticket Granting Ticket (TGT) forwarding feature.
When set to Y, the user TGT is forwarded from the submission host to the execution host. mbatchd and root sbatchd renew it as required along the way.

Default
N. (Do not forward user TGT during job submission.)

LSB_KRB_TGT_DIR

Syntax
LSB_KRB_TGT_DIR=directory

Description
Specifies a directory in which the Ticket Granting Ticket (TGT) for a running job is stored. Each job or task has its environment variable KRB5CCNAME pointing to the TGT file. Note that this parameter controls the TGT location for running jobs, not for pending jobs on the mbatchd side. For pending jobs, TGTs are stored in the mbatchd info directory, along with their job log file.

LSF tries to find the directory to place user TGTs in the following order:
1. LSB_KRB_TGT_DIR
2. /tmp on the execution host

Once LSF finds a valid directory, the search stops. If the chosen directory is not /tmp, LSF creates a symbolic link in /tmp that points to the actual TGT file. The symlink name pattern is lsf_krb5cc_${jid}_${some_suffix}.

Default
Not defined

LSB_LOCALDIR

Syntax
LSB_LOCALDIR=path

Description
Enables duplicate logging. Specify the path to a local directory that exists only on the first LSF master host. LSF puts the primary copies of the event and accounting log files in this directory. LSF puts the duplicates in LSB_SHAREDIR.

Important: Always restart both mbatchd and sbatchd when modifying LSB_LOCALDIR.

Example
LSB_LOCALDIR=/usr/share/lsbatch/loginfo

Default
Not defined

See also
LSB_SHAREDIR, EVENT_UPDATE_INTERVAL in lsb.params

LSB_LOG_MASK_MBD

Syntax
LSB_LOG_MASK_MBD=message_log_level

Description
Specifies the logging level of error messages for LSF mbatchd only. This value overrides LSF_LOG_MASK for mbatchd only.
For example:
LSB_LOG_MASK_MBD=LOG_DEBUG

The valid log levels for this parameter are:
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Run badmin mbdrestart to make changes take effect.

Default
Not defined (logging level is controlled by LSF_LOG_MASK).

LSB_LOG_MASK_SBD

Syntax
LSB_LOG_MASK_SBD=message_log_level

Description
Specifies the logging level of error messages for LSF sbatchd only. This value overrides LSF_LOG_MASK for sbatchd only.

For example:
LSB_LOG_MASK_SBD=LOG_DEBUG

The valid log levels for this parameter are:
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Run badmin hrestart to make changes take effect.

Default
Not defined (logging level is controlled by LSF_LOG_MASK).

LSB_LOG_MASK_SCH

Syntax
LSB_LOG_MASK_SCH=message_log_level

Description
Specifies the logging level of error messages for LSF mbschd only. This value overrides LSF_LOG_MASK for mbschd only.

For example:
LSB_LOG_MASK_SCH=LOG_DEBUG

The valid log levels for this parameter are:
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Run badmin reconfig to make changes take effect.

Default
Not defined (logging level is controlled by LSF_LOG_MASK).

LSB_MAIL_FROM_DOMAIN

Syntax
LSB_MAIL_FROM_DOMAIN=domain_name

Description
Windows only. LSF uses the user name as the from address when sending mail. In some environments, the from address requires domain information. If LSB_MAIL_FROM_DOMAIN is set, the domain name specified in this parameter is added to the from address. For example, if LSB_MAIL_FROM_DOMAIN is not set, the from address is SYSTEM; if LSB_MAIL_FROM_DOMAIN=example.com, the from address is [email protected].

Default
Not defined.

LSB_MAILPROG

Syntax
LSB_MAILPROG=file_name

Description
Path and file name of the mail program used by LSF to send email.
This is the electronic mail program that LSF uses to send system messages to the user. When LSF needs to send email to users, it invokes the program defined by LSB_MAILPROG in lsf.conf. You can write your own custom mail program and set LSB_MAILPROG to the path where this program is stored. LSF administrators can set the parameter as part of cluster reconfiguration. Provide the name of any mail program.

For your convenience, LSF provides the sendmail mail program, which supports the sendmail protocol on UNIX.

In a mixed cluster, you can specify different programs for Windows and UNIX. You can set this parameter during installation on Windows. For your convenience, LSF provides the lsmail.exe mail program, which supports SMTP and Microsoft Exchange Server protocols on Windows. If lsmail is specified, the parameter LSB_MAILSERVER must also be specified.

If you change your mail program, the LSF administrator must restart sbatchd on all hosts to retrieve the new value.

UNIX
By default, LSF uses /usr/lib/sendmail to send email to users. LSF calls LSB_MAILPROG with two arguments: one argument gives the full name of the sender, and the other argument gives the return address for mail.

LSB_MAILPROG must read the body of the mail message from the standard input. The end of the message is marked by end-of-file. Any program or shell script that accepts the arguments and input, and delivers the mail correctly, can be used. LSB_MAILPROG must be executable by any user.

Windows
If LSB_MAILPROG is not defined, no email is sent.

Examples
LSB_MAILPROG=lsmail.exe
LSB_MAILPROG=/serverA/tools/lsf/bin/unixhost.exe

Default
/usr/lib/sendmail (UNIX)
blank (Windows)

See also
LSB_MAILSERVER, LSB_MAILTO

LSB_MAILSENDER

Syntax
LSB_MAILSENDER=user_name

Description
Changes the default mail sender for job finish notifications. By default, the user who submits a job receives an email notification when the job finishes.
The default finish notification mail sender is the LSF administrator. If a job submitter’s mailbox is full, the notification email is returned to the mail sender (that is, the LSF administrator), so the administrator can end up with a mailbox full of returned notification mails. Setting this parameter to a different user name for job finish notifications keeps the administrator’s mailbox clear of returned notification emails. For example, you may set up a dedicated user from which all notification emails are sent; this user then receives all returned notifications.

Default
Not defined.

LSB_MAILSERVER

Syntax
LSB_MAILSERVER=mail_protocol:mail_server

Description
Part of mail configuration on Windows. This parameter only applies when lsmail is used as the mail program (LSB_MAILPROG=lsmail.exe). Otherwise, it is ignored.

Both mail_protocol and mail_server must be indicated. Set this parameter to either the SMTP or Microsoft Exchange protocol (SMTP or EXCHANGE) and specify the name of the host that is the mail server.

This parameter is set during installation of LSF on Windows, or is set or modified by the LSF administrator. If this parameter is modified, the LSF administrator must restart sbatchd on all hosts to retrieve the new value.

Examples
LSB_MAILSERVER=EXCHANGE:[email protected]
LSB_MAILSERVER=SMTP:MailHost

Default
Not defined

See also
LSB_LOCALDIR

LSB_MAILSIZE_LIMIT

Syntax
LSB_MAILSIZE_LIMIT=email_size_KB

Description
Limits the size in KB of the email containing job output information. The system sends job information such as CPU, process, and memory usage, job output, and errors in email to the submitting user account. Some batch jobs can create large amounts of output.
To prevent large job output files from interfering with your mail system, use LSB_MAILSIZE_LIMIT to set the maximum size in KB of the email containing the job information. Specify a positive integer.

If the size of the job output email exceeds LSB_MAILSIZE_LIMIT, the output is saved to a file under JOB_SPOOL_DIR, or to the default job output directory if JOB_SPOOL_DIR is not defined. The email informs users of where the job output is located.

If the -o option of bsub is used, the size of the job output is not checked against LSB_MAILSIZE_LIMIT.

If you use a custom mail program specified by the LSB_MAILPROG parameter that can use the LSB_MAILSIZE environment variable, it is not necessary to configure LSB_MAILSIZE_LIMIT.

Default
By default, LSB_MAILSIZE_LIMIT is not enabled. No limit is set on the size of batch job output email.

See also
LSB_MAILPROG, LSB_MAILTO

LSB_MAILTO

Syntax
LSB_MAILTO=mail_account

Description
LSF sends electronic mail to users when their jobs complete or have errors, and to the LSF administrator in the case of critical errors in the LSF system. The default is to send mail to the user who submitted the job, on the host on which the daemon is running; this assumes that your electronic mail system forwards messages to a central mailbox.

The LSB_MAILTO parameter changes the mailing address used by LSF. LSB_MAILTO is a format string that is used to build the mailing address. Common formats are:
v !U: Mail is sent to the submitting user's account name on the local host. The substring !U, if found, is replaced with the user’s account name.
v !U@company_name.com: Mail is sent to user@company_name.com on the mail server. The mail server is specified by LSB_MAILSERVER.
v !U@!H: Mail is sent to user@submission_hostname. The substring !H is replaced with the name of the submission host. This format is valid on UNIX only. It is not supported on Windows.
All other characters (including any other '!') are copied exactly.

If this parameter is modified, the LSF administrator must restart sbatchd on all hosts to retrieve the new value.

Windows only: When a job exception occurs (for example, a job is overrun or underrun), an email is sent to the primary administrator set in the lsf.cluster.cluster_name file at the domain set in LSB_MAILTO. For example, if the primary administrator is lsfadmin and [email protected], an email is sent to [email protected]. The email must be a valid Windows email account.

Default

!U

See also

LSB_MAILPROG, LSB_MAILSIZE_LIMIT

LSB_MAX_ASKED_HOSTS_NUMBER

Syntax

LSB_MAX_ASKED_HOSTS_NUMBER=integer

Description

Limits the number of hosts a user can specify with the -m (host preference) option of the following commands:

v bsub
v brun
v bmod
v brestart
v brsvadd
v brsvmod
v brsvs

The job is rejected if more hosts are specified than the value of LSB_MAX_ASKED_HOSTS_NUMBER.

CAUTION: If this value is set high, there will be a performance effect if users submit or modify jobs using the -m option and specify a large number of hosts. 512 hosts is the suggested upper limit.

Valid values

Any whole, positive integer.

Default

512

LSB_MAX_FORWARD_PER_SESSION

Syntax

LSB_MAX_FORWARD_PER_SESSION=integer

Description

MultiCluster job forwarding model only. Sets the maximum number of jobs forwarded within a scheduling session. Defined in the submission cluster only.

Default

50

LSB_MAX_JOB_DISPATCH_PER_SESSION

Sets the maximum number of job decisions that mbschd can make during one job scheduling session.

Syntax

LSB_MAX_JOB_DISPATCH_PER_SESSION=integer

Description

The system sets LSB_MAX_JOB_DISPATCH_PER_SESSION automatically during mbatchd startup, but you can adjust it manually. Both mbatchd and sbatchd must be restarted when you manually change the value of this parameter.
Default

LSB_MAX_JOB_DISPATCH_PER_SESSION = Min (MAX(300, Total CPUs), 3000)

See also

MAX_SBD_CONNS in lsb.params and LSF_NON_PRIVILEGED_PORTS in lsf.conf

LSB_MAX_PACK_JOBS

Syntax

LSB_MAX_PACK_JOBS=integer

Description

Applies to job packs only. Enables the job packs feature and specifies the maximum number of job submission requests in one job pack.

If the value is 0, job packs are disabled. If the value is 1, jobs from the file are submitted individually, as if submitted directly using the bsub command.

We recommend 100 as the initial pack size. Tune this parameter based on cluster performance. The larger the pack size, the faster the job submission rate is for all the job requests in the job submission file. However, while mbatchd is processing a pack, mbatchd is blocked from processing other requests, so increasing the pack size can affect mbatchd response time for other job submissions.

If you change the configuration of this parameter, you must restart mbatchd.

Parameters related to job packs are not supported as environment variables.

Valid values

Any positive integer or 0.

Default

Set to 300 at time of installation for the HIGH_THROUGHPUT configuration template. If otherwise undefined, then 0 (disabled).

LSB_MAX_PROBE_SBD

Syntax

LSB_MAX_PROBE_SBD=integer

Description

Specifies the maximum number of sbatchd instances that can be polled by mbatchd in the interval MBD_SLEEP_TIME/10. Use this parameter in large clusters to reduce the time it takes for mbatchd to probe all sbatchds.

The value of LSB_MAX_PROBE_SBD cannot be greater than the number of hosts in the cluster. If it is, mbatchd adjusts the value of LSB_MAX_PROBE_SBD to be the same as the number of hosts.

If LSB_MAX_PROBE_SBD is defined, the value of MAX_SBD_FAIL in lsb.params can be less than 3.
After modifying LSB_MAX_PROBE_SBD, use badmin mbdrestart to restart mbatchd and let the modified value take effect.

Valid values

Any number between 0 and 64.

Default

20

See also

MAX_SBD_FAIL in lsb.params

LSB_MBD_BUSY_MSG

Syntax

LSB_MBD_BUSY_MSG="message_string"

Description

Specifies the message displayed when mbatchd is too busy to accept new connections or respond to client requests. Define this parameter if you want to customize the message.

Valid values

String, either non-empty or empty.

Default

Not defined. By default, LSF displays the message "LSF is processing your request. Please wait..."

Batch commands retry the connection to mbatchd at the intervals specified by the parameters LSB_API_CONNTIMEOUT and LSB_API_RECVTIMEOUT.

LSB_MBD_CONNECT_FAIL_MSG

Syntax

LSB_MBD_CONNECT_FAIL_MSG="message_string"

Description

Specifies the message displayed when internal system connections to mbatchd fail. Define this parameter if you want to customize the message.

Valid values

String, either non-empty or empty.

Default

Not defined. By default, LSF displays the message "Cannot connect to LSF. Please wait..."

Batch commands retry the connection to mbatchd at the intervals specified by the parameters LSB_API_CONNTIMEOUT and LSB_API_RECVTIMEOUT.

LSB_MBD_DOWN_MSG

Syntax

LSB_MBD_DOWN_MSG="message_string"

Description

Specifies the message displayed by the bhosts command when mbatchd is down or there is no process listening at either the LSB_MBD_PORT or the LSB_QUERY_PORT. Define this parameter if you want to customize the message.

Valid values

String, either non-empty or empty.

Default

Not defined. By default, LSF displays the message "LSF is down. Please wait..."

Batch commands retry the connection to mbatchd at the intervals specified by the parameters LSB_API_CONNTIMEOUT and LSB_API_RECVTIMEOUT.
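As an illustration only, the three customizable mbatchd messages described above can be set together in lsf.conf; the message strings below are hypothetical examples, not shipped defaults:

```
# lsf.conf -- example custom mbatchd status messages (hypothetical strings)
LSB_MBD_BUSY_MSG="mbatchd is busy; your request is queued. Please wait..."
LSB_MBD_CONNECT_FAIL_MSG="Cannot reach mbatchd; retrying at the configured intervals..."
LSB_MBD_DOWN_MSG="The batch system is currently down. Contact the LSF administrator."
```

Setting an empty string is also valid, which suppresses the message entirely.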
LSB_MBD_MAX_SIG_COUNT

Syntax

LSB_MBD_MAX_SIG_COUNT=integer

Description

When a host enters an unknown state, mbatchd attempts to retry any pending jobs. This parameter specifies the maximum number of pending signals that mbatchd deals with concurrently, so that mbatchd is not overloaded. A high value for LSB_MBD_MAX_SIG_COUNT can negatively impact the performance of your cluster.

Valid values

Integers between 5 and 100, inclusive.

Default

5

LSB_MBD_PORT

See LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT.

LSB_MC_CHKPNT_RERUN

Syntax

LSB_MC_CHKPNT_RERUN=y | n

Description

For checkpointable MultiCluster jobs, if a restart attempt fails, the job is rerun from the beginning (instead of from the last checkpoint) without administrator or user intervention.

The submission cluster does not need to forward the job again. The execution cluster reports the job's new pending status back to the submission cluster, and the job is dispatched to the same host to restart from the beginning.

Default

n

LSB_MC_DISABLE_HOST_LOOKUP

Syntax

LSB_MC_DISABLE_HOST_LOOKUP=Y

Description

Disables submission host name lookup for remote jobs. When this parameter is set, the job sbatchd does not look up the submission host name when executing or cleaning a remote job. LSF will not be able to do any host-dependent automounting.

Default

N. LSF looks up the submission host name for remote jobs.

LSB_MC_INITFAIL_MAIL

Syntax

LSB_MC_INITFAIL_MAIL=Y | All | Administrator

Description

MultiCluster job forwarding model only.

Specify Y to make LSF email the job owner when a job is suspended after reaching the retry threshold.

Specify Administrator to make LSF email the primary administrator when a job is suspended after reaching the retry threshold.

Specify All to make LSF email both the job owner and the primary administrator when a job is suspended after reaching the retry threshold.
Default

Not defined

LSB_MC_INITFAIL_RETRY

Syntax

LSB_MC_INITFAIL_RETRY=integer

Description

MultiCluster job forwarding model only. Defines the retry threshold and causes LSF to suspend a job that repeatedly fails to start. For example, specify 2 retry attempts to make LSF attempt to start a job 3 times before suspending it.

Default

5

LSB_MEMLIMIT_ENFORCE

Syntax

LSB_MEMLIMIT_ENFORCE=y | n

Description

Specify y to enable LSF memory limit enforcement. If enabled, LSF sends a signal to kill all processes that exceed queue-level memory limits set by MEMLIMIT in lsb.queues or job-level memory limits specified by bsub -M mem_limit.

Otherwise, LSF passes memory limit enforcement to the OS. UNIX operating systems that support RLIMIT_RSS for setrlimit() can apply the memory limit to each process.

The following operating systems do not support memory limit at the OS level:

v Windows
v Sun Solaris 2.x

Default

Not defined. LSF passes memory limit enforcement to the OS.

See also

lsb.queues

LSB_MEMLIMIT_ENF_CONTROL

Syntax

LSB_MEMLIMIT_ENF_CONTROL=<Memory Threshold>:<Swap Threshold>:<Check Interval>:[all]

Description

This parameter further refines the behavior of enforcing a job memory limit. If one or more jobs reach a specified memory limit at execution time (both the host memory and swap utilization have reached a configurable threshold), the worst offending job is killed. A job is selected as the worst offending job on that host if it has the most overuse of memory (actual memory rusage minus the memory limit of the job).

You also have the choice of killing all jobs exceeding the thresholds (not just the worst).

The following describes usage and restrictions on this parameter.

v <Memory Threshold>: (Used memory size/maximum memory size) A threshold indicating the maximum limit for the ratio of used memory size to maximum memory size on the host.
The threshold represents a percentage and must be an integer between 1 and 100.

v <Swap Threshold>: (Used swap size/maximum swap size) A threshold indicating the maximum limit for the ratio of used swap memory size to maximum swap memory size on the host. The threshold represents a percentage and must be an integer between 1 and 100.
v <Check Interval>: The value, in seconds, specifying the length of time between two consecutive checks of host memory and swap usage. The value must be an integer greater than or equal to the value of SBD_SLEEP_TIME.
v The keyword :all can be used to terminate all single-host jobs that exceed the memory limit when the host threshold is reached. If not used, only the worst offending job is killed.
v If the cgroup memory enforcement feature is enabled (LSB_RESOURCE_ENFORCE includes the keyword "memory"), LSB_MEMLIMIT_ENF_CONTROL is ignored.
v The host is considered to have reached the threshold only when both Memory Threshold and Swap Threshold are reached.
v LSB_MEMLIMIT_ENF_CONTROL does not have any effect on jobs running across multiple hosts. They are terminated if they are over the memory limit, regardless of usage on the execution host.
v On some operating systems, when the used memory equals the total memory, the OS may kill some processes. In this case, the job exceeding the memory limit may be killed by the OS, not by an LSF memory enforcement policy, and the exit reason of the job will indicate "killed by external signal".

Default

Not enabled. All jobs exceeding the memory limit are terminated.

LSB_MIG2PEND

Syntax

LSB_MIG2PEND=0 | 1

Description

Applies only to migrating checkpointable or rerunnable jobs.

When defined with a value of 1, LSF requeues migrating jobs instead of restarting or rerunning them on the first available host.
LSF requeues the jobs in the PEND state in order of the original submission time and with the original job priority. If you want to place the migrated jobs at the bottom of the queue without considering submission time, define both LSB_MIG2PEND=1 and LSB_REQUEUE_TO_BOTTOM=1 in lsf.conf.

Ignored in a MultiCluster environment.

Default

Not defined. LSF restarts or reruns migrating jobs on the first available host.

See also

LSB_REQUEUE_TO_BOTTOM

LSB_MIXED_PATH_DELIMITER

Syntax

LSB_MIXED_PATH_DELIMITER="|"

Description

Defines the delimiter between UNIX and Windows paths if LSB_MIXED_PATH_ENABLE=y. For example, /home/tmp/J.out|c:\tmp\J.out.

Default

A pipe "|" is the default delimiter.

See also

LSB_MIXED_PATH_ENABLE

LSB_MIXED_PATH_ENABLE

Syntax

LSB_MIXED_PATH_ENABLE=y | n

Description

Allows you to specify both a UNIX and a Windows path when submitting a job in a mixed cluster (both Windows and UNIX hosts).

The format is always unix_path_cmd|windows_path_cmd.

Applies to the following options of bsub:

v -o, -oo
v -e, -eo
v -i, -is
v -cwd
v -E, -Ep
v CMD
v queue level PRE_EXEC, POST_EXEC
v application level PRE_EXEC, POST_EXEC

For example:

bsub -o "/home/tmp/job%J.out|c:\tmp\job%J.out" -e "/home/tmp/err%J.out|c:\tmp\err%J.out" -E "sleep 9| sleep 8" -Ep "sleep 7| sleep 6" -cwd "/home/tmp|c:\tmp" "sleep 121|sleep 122"

The delimiter is configurable: LSB_MIXED_PATH_DELIMITER.

Note: LSB_MIXED_PATH_ENABLE does not support interactive mode (bsub -I).

Default

Not defined.

See also

LSB_MIXED_PATH_DELIMITER

LSB_MOD_ALL_JOBS

Syntax

LSB_MOD_ALL_JOBS=y | Y

Description

If set, enables bmod to modify resource limits and the location of job output files for running jobs.
After a job has been dispatched, the following modifications can be made:

v CPU limit (-c [hour:]minute[/host_name | /host_model] | -cn)
v Memory limit (-M mem_limit | -Mn)
v Rerunnable jobs (-r | -rn)
v Resource requirements (-R "res_req" except -R "cu[cu_string]")
v Run limit (-W run_limit[/host_name | /host_model] | -Wn)
v Standard output file name (-o output_file | -on)
v Standard error file name (-e error_file | -en)
v Overwrite standard output (stdout) file name up to 4094 characters for UNIX or 255 characters for Windows (-oo output_file)
v Overwrite standard error (stderr) file name up to 4094 characters for UNIX or 255 characters for Windows (-eo error_file)

To modify the CPU limit or the memory limit of running jobs, the parameters LSB_JOB_CPULIMIT=Y and LSB_JOB_MEMLIMIT=Y must be defined in lsf.conf.

Important: Always run badmin mbdrestart after modifying LSB_MOD_ALL_JOBS.

Default

Set to Y at time of installation. If otherwise undefined, then N.

See also

LSB_JOB_CPULIMIT, LSB_JOB_MEMLIMIT

LSB_NCPU_ENFORCE

Description

When set to 1, enables parallel fairshare and considers the number of CPUs when calculating dynamic priority for queue-level user-based fairshare.

LSB_NCPU_ENFORCE does not apply to host-partition user-based fairshare. For host-partition user-based fairshare, the number of CPUs is automatically considered.

Default

Not defined

LSB_NQS_PORT

Syntax

LSB_NQS_PORT=port_number

Description

Required for LSF to work with NQS. TCP service port to use for communication with NQS.

Where defined

This parameter can alternatively be set as an environment variable or in the services database such as /etc/services.

Example

LSB_NQS_PORT=607

Default

Not defined

LSB_NUM_NIOS_CALLBACK_THREADS

Syntax

LSB_NUM_NIOS_CALLBACK_THREADS=integer

Description

Specifies the number of callback threads to use for batch queries.
If your cluster runs a large amount of blocking mode (bsub -K) and interactive jobs (bsub -I), response to batch queries can become very slow. If you run a large number of bsub -I or bsub -K jobs, you can set the number of threads to the number of processors on the master host.

Default

Not defined

LSB_PACK_MESUB

Syntax

LSB_PACK_MESUB=Y|y|N|n

Description

Applies to job packs only.

If LSB_PACK_MESUB=N, mesub is not executed for any jobs in the job submission file, even if there are esubs configured at the application level (-a option of bsub), or using LSB_ESUB_METHOD in lsf.conf, or through a named esub executable under LSF_SERVERDIR.

If LSB_PACK_MESUB=Y, mesub is executed for every job in the job submission file.

Parameters related to job packs are not supported as environment variables.

Default

Y

LSB_PACK_SKIP_ERROR

Syntax

LSB_PACK_SKIP_ERROR=Y|y|N|n

Description

Applies to job packs only.

If LSB_PACK_SKIP_ERROR=Y, all requests in the job submission file are submitted, even if some of the job submissions fail. The job submission process always continues to the end of the file.

If LSB_PACK_SKIP_ERROR=N, job submission stops if one job submission fails. The remaining requests in the job submission file are not submitted.

If you change the configuration of this parameter, you must restart mbatchd.

Parameters related to job packs are not supported as environment variables.

Default

N

LSB_PERF_METRICS_LOGDIR

Syntax

LSB_PERF_METRICS_LOGDIR=/tmp

Description

Sets the directory in which mbatchd performance metric data is logged. The primary owner of this directory is the LSF administrator.

Default

LSF_LOGDIR

LSB_PERF_METRICS_SAMPLE_PERIOD

Syntax

LSB_PERF_METRICS_SAMPLE_PERIOD=minutes

Description

Determines the sampling period for which mbatchd performance metric data is collected. The sampling period should not be too long, such as days.
Default

5 minutes

LSB_POSTEXEC_SEND_MAIL

Syntax

LSB_POSTEXEC_SEND_MAIL=Y|y|N|n

Description

Enable this parameter to have LSF send an email to the user that provides the details of post-execution processing, if any. This includes any applicable output.

Default

N

LSB_QUERY_ENH

Syntax

LSB_QUERY_ENH=Y|N

Description

Extends multithreaded query support to batch query requests (in addition to bjobs query requests). In addition, the mbatchd system query monitoring mechanism starts automatically instead of being triggered by a query request. This ensures a consistent query response time within the system.

Default

N (multithreaded query support for bjobs query requests only)

LSB_QUERY_PORT

Syntax

LSB_QUERY_PORT=port_number

Description

Optional. Applies only to UNIX platforms that support thread programming.

When using MultiCluster, LSB_QUERY_PORT must be defined on all clusters.

This parameter is recommended for busy clusters with many jobs and frequent query requests to increase mbatchd performance when you use the bjobs command. This may indirectly increase overall mbatchd performance.

The port_number is the TCP/IP port number to be used by mbatchd to service only query requests from the LSF system.

mbatchd checks the query port during initialization.

If LSB_QUERY_PORT is not defined:

v mbatchd uses the port specified by LSB_MBD_PORT in lsf.conf, or, if LSB_MBD_PORT is not defined, looks into the system services database for port numbers to communicate with other hosts in the cluster.
v For each query request it receives, mbatchd forks one child mbatchd to service the request. Each child mbatchd processes one request and then exits.

If LSB_QUERY_PORT is defined:

v mbatchd prepares this port for connection. The default behavior of mbatchd changes: a child mbatchd is forked, and the child mbatchd creates threads to process requests.
v mbatchd responds to requests by forking one child mbatchd.
As soon as mbatchd has forked a child mbatchd, the child mbatchd takes over and listens on the port to process more query requests. For each request, the child mbatchd creates a thread to process it. The interval used by mbatchd for forking new child mbatchds is specified by the parameter MBD_REFRESH_TIME in lsb.params.

The child mbatchd continues to listen to the port number specified by LSB_QUERY_PORT and creates threads to service requests until a job changes status, a new job is submitted, or the time specified in MBD_REFRESH_TIME in lsb.params has passed (see MBD_REFRESH_TIME in lsb.params for more details). When any of these happens, the parent mbatchd sends a message to the child mbatchd to exit.

LSB_QUERY_PORT must be defined when NEWJOB_REFRESH=Y in lsb.params to enable a child mbatchd to get up-to-date information about new jobs from the parent mbatchd.

Default

Not defined

See also

MBD_REFRESH_TIME and NEWJOB_REFRESH in lsb.params

LSB_REQUEUE_TO_BOTTOM

Syntax

LSB_REQUEUE_TO_BOTTOM=0 | 1

Description

Specify 1 to put automatically requeued jobs at the bottom of the queue instead of at the top. Also requeues migrating jobs to the bottom of the queue if LSB_MIG2PEND is also defined with a value of 1.

Specify 0 to requeue jobs to the top of the queue.

Ignored in a MultiCluster environment.

Default

0 (LSF requeues jobs to the top of the queue).

See also

LSB_MIG2PEND, REQUEUE_EXIT_VALUES in lsb.queues

LSB_RESOURCE_ENFORCE

Syntax

LSB_RESOURCE_ENFORCE="resource [resource]"

Description

Controls resource enforcement through the Linux cgroup memory and cpuset subsystems on Linux systems with cgroup support. Memory and cpuset enforcement for Linux cgroups is supported on Red Hat Enterprise Linux (RHEL) 6.2 or above and SuSE Linux Enterprise Server 11 SP2 or above.

resource can be either memory or cpu, or both cpu and memory in either order.
LSF can impose strict host-level memory and swap limits on systems that support Linux cgroups. These limits cannot be exceeded. All LSF job processes are controlled by the Linux cgroup system. If job processes on a host use more memory than the defined limit, the job is immediately killed by the Linux cgroup memory subsystem. Memory is enforced on a per-job, per-host basis, not per task. If the host OS is Red Hat Enterprise Linux 6.3 or above, cgroup memory limits are enforced, and LSF is notified to terminate the job. Additional notification is provided to users through specific termination reasons displayed by bhist -l. To enable memory enforcement, configure LSB_RESOURCE_ENFORCE="memory".

Note: If LSB_RESOURCE_ENFORCE="memory" is configured, all existing LSF memory limit related parameters such as LSF_HPC_EXTENSIONS="TASK_MEMLIMIT", LSF_HPC_EXTENSIONS="TASK_SWAPLIMIT", LSB_JOB_MEMLIMIT, and LSB_MEMLIMIT_ENFORCE are ignored.

LSF can also enforce CPU affinity binding on systems that support the Linux cgroup cpuset subsystem. When CPU affinity binding through Linux cgroups is enabled, LSF creates a cpuset to contain job processes if the job has affinity resource requirements, so that the job processes cannot escape from the allocated CPUs. Each affinity job cpuset includes only the CPU and memory nodes that LSF distributes. Linux cgroup cpusets are only created for affinity jobs. To enable CPU enforcement, configure LSB_RESOURCE_ENFORCE="cpu".

If you are enabling memory and CPU enforcement through the Linux cgroup memory and cpuset subsystems after upgrading an existing LSF cluster, make sure that the following parameters are set in lsf.conf:

v LSF_PROCESS_TRACKING=Y
v LSF_LINUX_CGROUP_ACCT=Y

Examples

For a parallel job with 3 tasks and a memory limit of 100 MB, such as the following:

bsub -n 3 -M 100 -R "span[ptile=2]" blaunch ./mem_eater

The application mem_eater keeps increasing its memory usage.
LSF kills the job if it consumes more than 200 MB of total memory on one host. For example, if hosta runs 2 tasks and hostb runs 1 task, the job is killed only if total memory exceeds 200 MB on either hosta or hostb. If one of the tasks consumes more than 100 MB of memory but less than 200 MB, and the other task does not consume any memory, the job is not killed. That is, LSF does not support per-task memory enforcement for cgroups.

For a job with an affinity requirement, such as the following:

bsub -R "affinity[core:membind=localonly]" ./myapp

LSF creates a cpuset that contains one core and attaches the process ID of the application ./myapp to this cpuset. The cpuset serves as a strict container for job processes, so that the application ./myapp cannot bind to other CPUs. LSF adds all memory nodes into the cpuset to make sure the job can access all memory nodes on the host, and makes sure job processes access preferred memory nodes first.

Default

Not defined. Resource enforcement through the Linux cgroup system is not enabled.

LSB_RLA_PORT

Syntax

LSB_RLA_PORT=port_number

Description

TCP port used for communication between the LSF topology adapter (RLA) and the HPC scheduler plugin.

Default

6883

LSB_RLA_UPDATE

Syntax

LSB_RLA_UPDATE=time_seconds

Description

Specifies how often the HPC scheduler refreshes free node information from the LSF topology adapter (RLA).

Default

600 seconds

LSB_RLA_WORKDIR

Syntax

LSB_RLA_WORKDIR=directory

Description

Directory to store the LSF topology adapter (RLA) status file. Allows RLA to recover its original state when it restarts.

When RLA first starts, it creates the directory defined by LSB_RLA_WORKDIR if it does not exist, then creates subdirectories for each host.

You should avoid using /tmp or any other directory that is automatically cleaned up by the system.
Unless your installation has restrictions on the LSB_SHAREDIR directory, you should use the default for LSB_RLA_WORKDIR.

Default

LSB_SHAREDIR/cluster_name/rla_workdir

LSB_SACCT_ONE_UG

Syntax

LSB_SACCT_ONE_UG=y | Y | n | N

Description

When set to Y, minimizes overall memory usage of mbatchd during fairshare accounting at job submission by limiting the number of share account nodes created on mbatchd startup. Most useful when there are many user groups with all members in the fairshare policy.

When a default user group is defined, inactive user share accounts are still defined for the default user group.

When setting this parameter, you must restart mbatchd.

Default

N

LSB_SBD_PORT

See LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT.

LSB_SET_TMPDIR

Syntax

LSB_SET_TMPDIR=y|n|<ENV_VAR_NAME>

Description

If y, LSF sets the TMPDIR environment variable, overwriting the current value with the job-specific temporary directory. For more details on the job-specific temporary directory, refer to LSF_TMPDIR.

If this parameter is set to the name of an environment variable (for example, MY_TMPDIR), LSF sets the value of this environment variable to the job-specific temporary directory. The user application can use this environment variable within the code.

Example

LSB_SET_TMPDIR=MY_TMPDIR

On UNIX, the name of this environment variable is $MY_TMPDIR and its value is the job-specific temporary directory. On Windows, the name of this environment variable is %MY_TMPDIR% and its value is the job-specific temporary directory.

Default

n

LSB_SHAREDIR

Syntax

LSB_SHAREDIR=directory

Description

Directory in which the job history and accounting logs are kept for each cluster. These files are necessary for correct operation of the system. Like the organization under LSB_CONFDIR, there is one subdirectory for each cluster.

The LSB_SHAREDIR directory must be owned by the LSF administrator.
It must be accessible from all hosts that can potentially become the master host, and must allow read and write access from the master host.

The LSB_SHAREDIR directory typically resides on a reliable file server.

Default

LSF_INDEP/work

See also

LSB_LOCALDIR

LSB_SHORT_HOSTLIST

Syntax

LSB_SHORT_HOSTLIST=1

Description

Displays an abbreviated list of hosts in bjobs and bhist for a parallel job where multiple processes of a job are running on a host. Multiple processes are displayed in the following format:

processes*hostA

For example, if a parallel job is running 5 processes on hostA, the information is displayed in the following manner:

5*hostA

Setting this parameter may improve mbatchd restart performance and accelerate event replay.

Default

Set to 1 at time of installation for the HIGH_THROUGHPUT and PARALLEL configuration templates. Otherwise, not defined.

LSB_SIGSTOP

Syntax

LSB_SIGSTOP=signal_name | signal_value

Description

Specifies the signal sent by the SUSPEND action in LSF. You can specify a signal name or a number.

If this parameter is not defined, by default the SUSPEND action in LSF sends the following signals to a job:

v Parallel or interactive jobs: SIGTSTP is sent to allow user programs to catch the signal and clean up. The parallel job launcher also catches the signal and stops the entire job (task by task for parallel jobs). Once LSF sends SIGTSTP, LSF assumes the job is stopped.
v Other jobs: SIGSTOP is sent. SIGSTOP cannot be caught by user programs.

The same set of signals is not supported on all UNIX systems. To display a list of the symbolic names of the signals (without the SIG prefix) supported on your system, use the kill -l command.

Example

LSB_SIGSTOP=SIGKILL

In this example, the SUSPEND action sends SIGKILL instead of the default signal, so a suspended job is killed rather than stopped and cannot be resumed.

Default

Not defined. The default SUSPEND action in LSF is sent.
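Since the set of supported signals varies by UNIX system, a quick check with kill -l, as suggested above, confirms which names are valid before setting LSB_SIGSTOP. A minimal sketch (the echoed messages are purely illustrative):

```shell
# Show the signal names supported on this host. Depending on the shell,
# names may be printed with or without the SIG prefix, and with numbers.
kill -l

# Confirm that the signals used by LSF's default SUSPEND action exist.
kill -l | grep -q 'STOP' && echo "SIGSTOP available"
kill -l | grep -q 'TSTP' && echo "SIGTSTP available"
```

If a signal name is missing from the list, specify a numeric signal value instead of the symbolic name.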
LSB_SSH_XFORWARD_CMD

Syntax

LSB_SSH_XFORWARD_CMD=[/path[/path]]ssh command [ssh options]

Description

Optional when submitting jobs with SSH X11 forwarding. Allows you to specify an SSH command and options when a job is submitted with -XF. Replace the default value with an SSH command (full PATH and options allowed). When running a job with the -XF option, LSF runs the SSH command specified here.

Default

ssh -X -n

LSB_STDOUT_DIRECT

Syntax

LSB_STDOUT_DIRECT=y|Y|n|N

Description

When set, and used with the -o or -e options of bsub, redirects standard output or standard error from the job directly to a file as the job runs.

If LSB_STDOUT_DIRECT is not set and you use the bsub -o option, the standard output of a job is written to a temporary file and copied to the file you specify after the job finishes.

LSB_STDOUT_DIRECT is not supported on Windows.

Default

Not defined

LSB_STOP_IGNORE_IT

Usage

LSB_STOP_IGNORE_IT=Y | y

Description

Allows a solitary job to be stopped regardless of the idle time (IT) of the host that the job is running on. By default, if only one job is running on a host, the host idle time must be zero in order to stop the job.

Default

Not defined

LSB_SUB_COMMANDNAME

Syntax

LSB_SUB_COMMANDNAME=y | Y

Description

If set, enables esub to use the variable LSB_SUB_COMMAND_LINE in the esub job parameter file specified by the LSB_SUB_PARM_FILE environment variable.

The LSB_SUB_COMMAND_LINE variable carries the value of the bsub command argument, and is used when esub runs.

Example

esub contains:

#!/bin/sh
. $LSB_SUB_PARM_FILE
exec 1>&2
if [ "$LSB_SUB_COMMAND_LINE" = "netscape" ]; then
  echo "netscape is not allowed to run in batch mode"
  exit $LSB_SUB_ABORT_VALUE
fi

LSB_SUB_COMMAND_LINE is defined in $LSB_SUB_PARM_FILE as:

LSB_SUB_COMMAND_LINE=netscape

A job submitted with:

bsub netscape ...
causes esub to echo the message:

netscape is not allowed to run in batch mode

Default

Not defined

See also

LSB_SUB_COMMAND_LINE and LSB_SUB_PARM_FILE environment variables

LSB_SUBK_SHOW_EXEC_HOST

Syntax

LSB_SUBK_SHOW_EXEC_HOST=Y | N

Description

When enabled, displays the execution host in the output of the command bsub -K. If the job runs on multiple hosts, only the first execution host is shown.

In a MultiCluster environment, this parameter must be set in both clusters.

Tip: Restart sbatchd on the execution host to make changes take effect.

Default

Set to Y at time of installation. If otherwise undefined, then N.

LSB_TIME_CMD

Syntax

LSB_TIME_CMD=timing_level

Description

The timing level for checking how long batch commands run. Time usage is logged in milliseconds; specify a positive integer.

Example: LSB_TIME_CMD=1

Default

Not defined

See also

LSB_TIME_MBD, LSB_TIME_SBD, LSF_TIME_LIM, LSF_TIME_RES

LSB_TIME_DMD

The timing level for checking how long dmd routines run.

Syntax

LSB_TIME_DMD=timing_level

Description

Specify a positive integer. Time usage is logged in milliseconds.

Example: LSB_TIME_DMD=1

You must run bdata admin reconfig to reconfigure the LSF data manager daemon (dmd) for this parameter to take effect.

Default

0

See also

LSB_TIME_CMD, LSB_TIME_MBD, LSB_TIME_SBD, LSF_TIME_LIM, LSF_TIME_RES

LSB_TIME_MBD

Syntax

LSB_TIME_MBD=timing_level

Description

The timing level for checking how long mbatchd routines run. Time usage is logged in milliseconds; specify a positive integer.

Example: LSB_TIME_MBD=1

Default

Not defined

See also

LSB_TIME_CMD, LSB_TIME_SBD, LSF_TIME_LIM, LSF_TIME_RES

LSB_TIME_SCH

Syntax

LSB_TIME_SCH=timing_level

Description

The timing level for checking how long mbschd routines run. Time usage is logged in milliseconds; specify a positive integer.
Example: LSB_TIME_SCH=1

Default
Not defined

LSB_TIME_RESERVE_NUMJOBS

Syntax
LSB_TIME_RESERVE_NUMJOBS=maximum_reservation_jobs

Description
Enables time-based slot reservation. The value must be a positive integer.
LSB_TIME_RESERVE_NUMJOBS controls the maximum number of jobs that use time-based slot reservation. For example, if LSB_TIME_RESERVE_NUMJOBS=4, only the top 4 jobs get their future allocation information.
Use LSB_TIME_RESERVE_NUMJOBS=1 to allow only the highest priority job to get an accurate start time prediction.

Recommended value
3 or 4 is the recommended setting. Larger values are not as useful because after the first pending job starts, the estimated start times of the remaining jobs may change.

Default
Not defined

LSB_TIME_SBD

The timing level for checking how long sbatchd routines run.

Syntax
LSB_TIME_SBD=timing_level

Description
Specify a positive integer. Time usage is logged in milliseconds.
Example: LSB_TIME_SBD=1

Default
Not defined

See also
LSB_TIME_CMD, LSB_TIME_DMD, LSB_TIME_MBD, LSF_TIME_LIM, LSF_TIME_RES

LSB_TSJOBS_HELPER_HOSTS

Syntax
LSB_TSJOBS_HELPER_HOSTS="helper_host_list"
helper_host_list is a space-separated list of hosts that are Terminal Services job helper hosts.

Description
Lists the Terminal Services job helper hosts. Helper hosts must be LSF servers in the LSF cluster. Configure a maximum of 256 hosts in the list.
The local helper service selects one host from the list and sends requests to that helper to create a user session. If the host fails to create the user session, the next helper host in the list is tried. The local helper service does not select itself as a helper host.
For stability, you should configure one helper host for every 40 execution hosts.
Example: LSB_TSJOBS_HELPER_HOSTS="host1 host2 host3"
To make the modified parameter take effect, restart the LIM (by running lsadmin limrestart all) and restart the TSJobHelper Windows service on the execution hosts.
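For reference, a complete Terminal Services helper setup in lsf.conf combines this parameter with the related port and timeout parameters. The host names and values in the following sketch are illustrative only, not recommendations:

```shell
# Illustrative lsf.conf fragment for Terminal Services job helpers.
# Host names and the timeout value are examples only.
LSB_TSJOBS_HELPER_HOSTS="host1 host2 host3"
LSB_TSJOBS_HELPER_PORT=6889
LSB_TSJOBS_HELPER_TIMEOUT=60
```

After editing lsf.conf, restart the LIM and the TSJobHelper Windows service as described above for the changes to take effect.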
Default
None

LSB_TSJOBS_HELPER_PORT

Syntax
LSB_TSJOBS_HELPER_PORT=port_number

Description
Specify the service port to use for communication with TSJobHelper.
For example: LSB_TSJOBS_HELPER_PORT=6889
TSJobHelper uses this port to communicate with the helper hosts.
To make the modified parameter take effect, restart the LIM (by running lsadmin limrestart all) and restart the TSJobHelper Windows service on the execution hosts and helper hosts.

Default
6889

LSB_TSJOBS_HELPER_TIMEOUT

Syntax
LSB_TSJOBS_HELPER_TIMEOUT=seconds

Description
Specify the maximum time, in seconds, that the local TSJobHelper service waits for a helper host to reply. After the timeout, the local service tries the next helper host in the LSB_TSJOBS_HELPER_HOSTS list.
Example: LSB_TSJOBS_HELPER_TIMEOUT=60
To make the modified parameter take effect, restart the LIM (by running lsadmin limrestart all) and restart the TSJobHelper Windows service on the execution hosts.

Default
60 seconds

LSB_USER_REQUEUE_TO_BOTTOM

Syntax
LSB_USER_REQUEUE_TO_BOTTOM=1 | 0

Description
Determines whether jobs are requeued to the top or bottom of the queue. When defined with a value of 1, LSF requeues jobs to the top of the queue. When defined with a value of 0, LSF requeues jobs to the bottom of the queue.
If you want to place migrated jobs at the bottom of the queue without considering submission time, define both LSB_MIG2PEND=1 and LSB_REQUEUE_TO_BOTTOM=1 in lsf.conf.
Ignored in a MultiCluster environment.

Default
Not defined. LSF requeues jobs in order of original submission time and job priority.

See also
LSB_MIG2PEND

LSB_UTMP

Syntax
LSB_UTMP=y | Y

Description
If set, enables registration of user and account information for interactive batch jobs submitted with bsub -Ip or bsub -Is. To disable utmp file registration, set LSB_UTMP to any value other than y or Y; for example, LSB_UTMP=N.
LSF registers an interactive batch job by adding entries for it to the utmp file on the execution host when the job starts. After the job finishes, LSF removes the entries for the job from the utmp file.

Limitations
Registration of utmp file entries is supported on the following platforms:
v Solaris (all versions)
v HP-UX (all versions)
v Linux (all versions)
utmp file registration is not supported in a MultiCluster environment.
Because interactive batch jobs submitted with bsub -I are not associated with a pseudo-terminal, utmp file registration is not supported for these jobs.

Default
Not defined

LSF_AM_OPTIONS

Syntax
LSF_AM_OPTIONS=AMFIRST | AMNEVER

Description
Determines the order of file path resolution when setting the user's home directory.
This variable is rarely used, but sometimes LSF does not properly change directory to the user's home directory when the home directory is automounted. Setting LSF_AM_OPTIONS forces LSF to change directory to $HOME before attempting to automount the user's home.
When this parameter is not defined or is set to AMFIRST, LSF sets the user's $HOME directory from the automount path. If it cannot do so, LSF sets the user's $HOME directory from the passwd file.
When this parameter is set to AMNEVER, LSF never uses automount to set the path to the user's home. LSF sets the user's $HOME directory directly from the passwd file.

Valid values
The two values are AMFIRST and AMNEVER.

Default
Same as AMFIRST

LSF_API_CONNTIMEOUT

Syntax
LSF_API_CONNTIMEOUT=time_seconds

Description
Timeout when connecting to LIM.

EGO parameter
EGO_LIM_CONNTIMEOUT

Default
5

See also
LSF_API_RECVTIMEOUT

LSF_API_RECVTIMEOUT

Syntax
LSF_API_RECVTIMEOUT=time_seconds

Description
Timeout when receiving a reply from LIM.
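For example, both LIM API timeouts can be raised together in lsf.conf for clusters on slow or congested networks. The values in this sketch are illustrative only:

```shell
# Illustrative lsf.conf fragment: LIM API timeouts (values are examples,
# not tuning recommendations).
LSF_API_CONNTIMEOUT=10
LSF_API_RECVTIMEOUT=60
```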
EGO parameter
EGO_LIM_RECVTIMEOUT

Default
20

See also
LSF_API_CONNTIMEOUT

LSF_ASPLUGIN

Syntax
LSF_ASPLUGIN=path

Description
Points to the SGI Array Services library libarray.so. The parameter only takes effect on 64-bit x86 Linux 2.6, glibc 2.3.

Default
/usr/lib64/libarray.so

LSF_AUTH

Syntax
LSF_AUTH=eauth | ident

Description
Enables either external authentication or authentication by means of identification daemons. This parameter is required for any cluster that contains Windows hosts, and is optional for UNIX-only clusters.
After defining or changing the value of LSF_AUTH, you must shut down and restart the LSF daemons on all server hosts to apply the new authentication method.
eauth
For site-specific customized external authentication. Provides the highest level of security of all LSF authentication methods.
ident
For authentication using the RFC 931/1413/1414 protocol to verify the identity of the remote client. If you want to use ident authentication, you must download and install the ident protocol, available from the public domain, and register ident as required by your operating system.
For UNIX-only clusters, privileged ports authentication (setuid) can be configured by commenting out or deleting the LSF_AUTH parameter. If you choose privileged ports authentication, LSF commands must be installed as setuid programs owned by root. If the commands are installed in an NFS-mounted shared file system, the file system must be mounted with setuid execution allowed, that is, without the nosuid option.
Restriction: To enable privileged ports authentication, LSF_AUTH must not be defined; setuid is not a valid value for LSF_AUTH.

Default
eauth
During LSF installation, a default eauth executable is installed in the directory specified by the environment variable LSF_SERVERDIR. The default executable provides an example of how the eauth protocol works.
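As an illustration of the protocol shape only, a skeletal site eauth could be structured as follows. The credential format here is invented, and the sketch assumes the usual eauth calling convention ("eauth -c host" on the client writes credentials to stdout; "eauth -s" on the server reads them from stdin and exits 0 to accept, non-zero to reject). A real eauth must implement your site's actual security scheme:

```shell
#!/bin/sh
# Sketch only: the credential format below is invented and offers no real
# security. The -c/-s calling convention is assumed from the eauth contract.
demo=/tmp/eauth_demo
mkdir -p "$demo"
cat > "$demo/eauth" <<'EOF'
#!/bin/sh
case "$1" in
-c) printf 'user=%s host=%s\n' "$(id -un)" "$2" ;;  # client: emit a credential
-s) read -r cred                                    # server: verify a credential
    case "$cred" in user=*) exit 0 ;; *) exit 1 ;; esac ;;
esac
EOF
chmod +x "$demo/eauth"
"$demo/eauth" -c hostA | "$demo/eauth" -s && echo verified   # prints "verified"
```

In a real deployment the executable would be named eauth, placed in LSF_SERVERDIR, and would sign and verify credentials cryptographically rather than echoing plain text.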
You should write your own eauth executable to meet the security requirements of your cluster.

LSF_AUTH_DAEMONS

Syntax
LSF_AUTH_DAEMONS=y | Y

Description
Enables LSF daemon authentication when external authentication is enabled (LSF_AUTH=eauth in the file lsf.conf). Daemons invoke eauth to authenticate each other as specified by the eauth executable.

Default
Not defined.

LSF_BIND_JOB

Specifies the processor binding policy for sequential and parallel job processes that run on a single host.

Syntax
LSF_BIND_JOB=NONE | BALANCE | PACK | ANY | USER | USER_CPU_LIST

Description
Note: LSF_BIND_JOB is deprecated in LSF Standard Edition and LSF Advanced Edition. You should enable LSF CPU and memory affinity scheduling with the AFFINITY parameter in lsb.hosts. If both LSF_BIND_JOB and affinity scheduling are enabled, affinity scheduling takes effect, and LSF_BIND_JOB is disabled. LSF_BIND_JOB and BIND_JOB are the only affinity options available in LSF Express Edition.
On Linux execution hosts that support this feature, job processes are hard bound to selected processors. If the processor binding feature is not configured with the BIND_JOB parameter in an application profile in lsb.applications, the LSF_BIND_JOB configuration setting in lsf.conf takes effect. The application profile configuration for processor binding overrides the lsf.conf configuration.
For backwards compatibility:
v LSF_BIND_JOB=Y is interpreted as LSF_BIND_JOB=BALANCE
v LSF_BIND_JOB=N is interpreted as LSF_BIND_JOB=NONE

Supported platforms
Linux with kernel version 2.6 or higher

Default
Not defined. Processor binding is disabled.

LSF_BINDIR

Syntax
LSF_BINDIR=directory

Description
Directory in which all LSF user commands are installed.

Default
LSF_MACHDEP/bin

LSF_BMPLUGIN

Syntax
LSF_BMPLUGIN=path

Description
Points to the bitmask library libbitmask.so.
The parameter only takes effect on 64-bit x86 Linux 2.6, glibc 2.3.

Default
/usr/lib64/libbitmask.so

LSF_CMD_LOG_MASK

Syntax
LSF_CMD_LOG_MASK=log_level

Description
Specifies the logging level of error messages from LSF commands.
For example: LSF_CMD_LOG_MASK=LOG_DEBUG
To specify the logging level of error messages for LSF batch commands, use LSB_CMD_LOG_MASK. To specify the logging level of error messages for LSF daemons, use LSF_LOG_MASK.
LSF commands log error messages at different levels so that you can choose to log all messages, or only log messages that are deemed critical. The level specified by LSF_CMD_LOG_MASK determines which messages are recorded and which are discarded. All messages logged at the specified level or higher are recorded, while lower level messages are discarded.
For debugging purposes, the level LOG_DEBUG contains the fewest debugging messages and is used for basic debugging. The level LOG_DEBUG3 records all debugging messages, and can cause log files to grow very large; it is not often used. Most debugging is done at the level LOG_DEBUG2.
The commands log to the syslog facility unless LSF_CMD_LOGDIR is set.

Valid values
The log levels from highest to lowest are:
v LOG_EMERG
v LOG_ALERT
v LOG_CRIT
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Default
LOG_WARNING

See also
LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_CMD, LSB_TIME_CMD, LSF_LOG_MASK, LSF_LOGDIR, LSF_TIME_CMD

LSF_CMD_LOGDIR

Syntax
LSF_CMD_LOGDIR=path

Description
The path to the log files used for debugging LSF commands. This parameter can also be set from the command line.
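For example, command debug output can be routed to a scratch directory for a single shell session by exporting these variables rather than editing lsf.conf. The directory name in this sketch is illustrative:

```shell
# Illustrative debugging session setup; the directory path is an example.
export LSF_CMD_LOG_MASK=LOG_DEBUG
export LSF_CMD_LOGDIR=/tmp/lsf_cmd_debug
mkdir -p "$LSF_CMD_LOGDIR"
# LSF commands run in this shell now record debug messages under
# $LSF_CMD_LOGDIR instead of the syslog facility.
```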
Default
/tmp

See also
LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_CMD, LSB_TIME_CMD, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR, LSF_TIME_CMD

LSF_COLLECT_ENERGY_USAGE

Syntax
LSF_COLLECT_ENERGY_USAGE=N | Y

Description
Determines whether the collection of job and node energy usage is enabled on the LSF cluster. This is used for CPU frequency management and energy usage reporting.

Default
N

LSF_CONF_RETRY_INT

Syntax
LSF_CONF_RETRY_INT=time_seconds

Description
The number of seconds to wait between unsuccessful attempts at opening a configuration file (only valid for LIM). This allows LIM to tolerate temporary access failures.

EGO parameter
EGO_CONF_RETRY_INT

Default
30

See also
LSF_CONF_RETRY_MAX

LSF_CONF_RETRY_MAX

Syntax
LSF_CONF_RETRY_MAX=integer

Description
The maximum number of retry attempts by LIM to open a configuration file. This allows LIM to tolerate temporary access failures. For example, to allow one more attempt after the first attempt has failed, specify a value of 1.

EGO parameter
EGO_CONF_RETRY_MAX

Default
0

See also
LSF_CONF_RETRY_INT

LSF_CONFDIR

Syntax
LSF_CONFDIR=directory

Description
Directory in which all LSF configuration files are installed. These files are shared throughout the system and should be readable from any host. This directory can contain configuration files for more than one cluster.
The files in the LSF_CONFDIR directory must be owned by the primary LSF administrator, and readable by all LSF server hosts.
If live reconfiguration through the bconf command is enabled by the parameter LSF_LIVE_CONFDIR, configuration files are written to and read from the directory set by LSF_LIVE_CONFDIR.

Default
LSF_INDEP/conf

See also
LSB_CONFDIR, LSF_LIVE_CONFDIR

LSF_CPUSETLIB

Syntax
LSF_CPUSETLIB=path

Description
Points to the SGI cpuset library libcpuset.so. The parameter only takes effect on 64-bit x86 Linux 2.6, glibc 2.3.
Default
/usr/lib64/libcpuset.so

LSF_CRASH_LOG

Syntax
LSF_CRASH_LOG=Y | N

Description
On Linux hosts only, enables logging when or if a daemon crashes. Relies on the Linux debugger (gdb). Two log files are created: one for the root daemons (res, lim, sbd, and mbatchd) in /tmp/lsf_root_daemons_crash.log, and one for administrative daemons (mbschd) in /tmp/lsf_admin_daemons_crash.log. File permissions for both files are 600.
If you enable this parameter, you must restart the daemons for the change to take effect.

Default
N (no log files are created for daemon crashes)

LSF_DATA_HOSTS

Specifies a list of hosts where the LSF data manager daemon (dmd) can run, and where clients can contact the LSF data manager that is associated with the cluster. The dmd daemon can run only on the listed hosts.

Syntax
LSF_DATA_HOSTS=host_list

Description
All LSF data manager clients, including mbatchd, use this parameter to contact dmd. This parameter must be defined in every cluster that talks to the LSF data manager. Defining this parameter acts as a switch that enables the overall LSF data management features.
The order of host names in the list decides which host is the LSF data manager master host, and the order of failover. All host names must be listed in the same order for all LSF data managers.
To have LIM start the LSF data manager daemon automatically, and keep monitoring it, the hosts that are listed in LSF_DATA_HOSTS must be LSF server hosts.
When the LSF data manager starts, it verifies that its current host name is on this list, and that it is an LSF server.
To change LSF_DATA_HOSTS, complete these steps:
1. Run lsadmin limshutdown to shut down LIM on the LSF data manager candidate hosts.
2. Run bdata admin shutdown to shut down dmd on the LSF data manager candidate hosts.
3. Change the parameter.
4. Run lsadmin limstartup to start LIM on the LSF data manager candidate hosts.
5.
Run badmin mbdrestart to restart all LSF master candidate hosts.

Default
Not defined

LSF_DATA_PORT

Port number of the LSF data manager daemon (dmd) associated with the cluster.

Syntax
LSF_DATA_PORT=integer

Description
After you change LSF_DATA_PORT, complete these steps:
1. Run lsadmin limshutdown to shut down LIM on the LSF data manager candidate hosts.
2. Run bdata admin shutdown to shut down dmd on the LSF data manager candidate hosts.
3. Change the parameter.
4. Run lsadmin limstartup to start LIM on the LSF data manager candidate hosts.
5. Run badmin mbdrestart to restart all LSF master candidate hosts.

Default
1729

LSF_DAEMON_WRAP

Syntax
LSF_DAEMON_WRAP=y | Y

Description
This parameter only applies to Kerberos integrations with versions of LSF older than 9.1.2 and should not be used with newer versions.
When this parameter is set to y or Y, mbatchd, sbatchd, and RES run the executable daemons.wrap located in LSF_SERVERDIR.

Default
Not defined. LSF does not run the daemons.wrap executable.

LSF_DAEMONS_CPUS

Syntax
LSF_DAEMONS_CPUS="mbatchd_cpu_list:mbschd_cpu_list"
mbatchd_cpu_list
Defines the list of master host CPUs where the mbatchd daemon processes can run (hard CPU affinity). Format the list as a white-space delimited list of CPU numbers.
mbschd_cpu_list
Defines the list of master host CPUs where the mbschd daemon processes can run. Format the list as a white-space delimited list of CPU numbers.

Description
By default, mbatchd and mbschd can run on any CPUs. If LSF_DAEMONS_CPUS is set, they only run on the specified list of CPUs. An empty list means LSF daemons can run on any CPUs. Use spaces to separate multiple CPUs.
The operating system can assign other processes to run on the same CPU if utilization of the bound CPU is lower than utilization of the unbound CPUs.
Related parameters
To improve scheduling and dispatch performance of all LSF daemons, you should use LSF_DAEMONS_CPUS together with EGO_DAEMONS_CPUS (in ego.conf or lsf.conf), which controls LIM CPU allocation, and MBD_QUERY_CPUS, which binds mbatchd query processes to specific CPUs so that higher priority daemon processes can run more efficiently. For best performance, each of the four daemons should be assigned its own CPU. For example, on a 4-CPU SMP host, the following configuration gives the best performance:
EGO_DAEMONS_CPUS=0 LSF_DAEMONS_CPUS=1:2 MBD_QUERY_CPUS=3

Examples
If you specify
LSF_DAEMONS_CPUS="1:2"
the mbatchd processes run only on CPU number 1 on the master host, and mbschd runs only on CPU number 2.
If you specify
LSF_DAEMONS_CPUS="1 2:1 2"
both mbatchd and mbschd run on CPU 1 and CPU 2.

Important
You can specify CPU affinity only for master hosts that use one of the following operating systems:
v Linux 2.6 or higher
v Solaris 8 or higher

EGO parameter
LSF_DAEMONS_CPUS=lim_cpu_list: run the EGO LIM daemon on the specified CPUs.

Default
Not defined

See also
MBD_QUERY_CPUS in lsb.params

LSF_DEBUG_CMD

Syntax
LSF_DEBUG_CMD=log_class

Description
Sets the debugging log class for LSF commands and APIs.
Specifies the log class filtering to be applied to LSF commands or the API. Only messages belonging to the specified log class are recorded. LSF_DEBUG_CMD sets the log class and is used in combination with LSF_CMD_LOG_MASK, which sets the log level. For example:
LSF_CMD_LOG_MASK=LOG_DEBUG
LSF_DEBUG_CMD="LC_TRACE LC_EXEC"
Debugging is turned on when you define both parameters.
The commands log to the syslog facility unless LSF_CMD_LOGDIR is defined.
To specify multiple log classes, use a space-separated list enclosed by quotation marks. For example:
LSF_DEBUG_CMD="LC_TRACE LC_EXEC"
This parameter can also be defined from the command line.
Valid values
Valid log classes are:
v LC_AFS and LC2_AFS: Log AFS messages
v LC_AUTH and LC2_AUTH: Log authentication messages
v LC_CHKPNT and LC2_CHKPNT: Log checkpointing messages
v LC_COMM and LC2_COMM: Log communication messages
v LC_DCE and LC2_DCE: Log messages pertaining to DCE support
v LC_EEVENTD and LC2_EEVENTD: Log eeventd messages
v LC_ELIM and LC2_ELIM: Log ELIM messages
v LC_EXEC and LC2_EXEC: Log significant steps for job execution
v LC_FAIR: Log fairshare policy messages
v LC_FILE and LC2_FILE: Log file transfer messages
v LC_HANG and LC2_HANG: Mark where a program might hang
v LC_JARRAY and LC2_JARRAY: Log job array messages
v LC_JLIMIT and LC2_JLIMIT: Log job slot limit messages
v LC_LICENSE and LC2_LICENSE: Log license management messages (LC_LICENCE is also supported for backward compatibility)
v LC_LOADINDX and LC2_LOADINDX: Log load index messages
v LC_M_LOG and LC2_M_LOG: Log multievent logging messages
v LC_MPI and LC2_MPI: Log MPI messages
v LC_MULTI and LC2_MULTI: Log messages pertaining to MultiCluster
v LC_PEND and LC2_PEND: Log messages related to job pending reasons
v LC_PERFM and LC2_PERFM: Log performance messages
v LC_PIM and LC2_PIM: Log PIM messages
v LC_PREEMPT and LC2_PREEMPT: Log preemption policy messages
v LC_RESREQ and LC2_RESREQ: Log resource requirement messages
v LC_SIGNAL and LC2_SIGNAL: Log messages pertaining to signals
v LC_SYS and LC2_SYS: Log system call messages
v LC_TRACE and LC2_TRACE: Log significant program walk steps
v LC_XDR and LC2_XDR: Log everything transferred by XDR
v LC2_KRB: Log messages related to Kerberos integration
v LC2_DC: Log messages related to Dynamic Cluster
v LC2_CGROUP: Log messages related to cgroup operation
v LC2_TOPOLOGY: Log messages related to hardware topology
v LC2_AFFINITY: Log messages related to affinity
v LC2_LSF_PE: Log messages related to LSF PE integration

Default
Not defined

See also
LSF_CMD_LOG_MASK, LSF_CMD_LOGDIR,
LSF_DEBUG_LIM, LSF_DEBUG_RES, LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT, LSF_LOGDIR, LSF_LIM_DEBUG, LSF_RES_DEBUG

LSF_DEBUG_LIM

Syntax
LSF_DEBUG_LIM=log_class

Description
Sets the log class for debugging LIM.
Specifies the log class filtering to be applied to LIM. Only messages belonging to the specified log class are recorded. LSF_DEBUG_LIM sets the log class and is used in combination with EGO_LOG_MASK in ego.conf, which sets the log level. For example, in ego.conf:
EGO_LOG_MASK=LOG_DEBUG
and in lsf.conf:
LSF_DEBUG_LIM=LC_TRACE
Important: If EGO is enabled, LSF_LOG_MASK no longer specifies the LIM logging level. Use EGO_LOG_MASK in ego.conf to control message logging for LIM. The default value for EGO_LOG_MASK is LOG_WARNING.
You need to restart the daemons after setting LSF_DEBUG_LIM for your changes to take effect. If you use the command lsadmin limdebug to temporarily change this parameter without changing lsf.conf, you do not need to restart the daemons.
To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LSF_DEBUG_LIM="LC_TRACE LC_EXEC"
This parameter can also be defined from the command line.
Valid values
Valid log classes are:
v LC_AFS and LC2_AFS: Log AFS messages
v LC_AUTH and LC2_AUTH: Log authentication messages
v LC_CHKPNT: Log checkpointing messages
v LC_COMM and LC2_COMM: Log communication messages
v LC_DCE and LC2_DCE: Log messages pertaining to DCE support
v LC_EXEC and LC2_EXEC: Log significant steps for job execution
v LC_FILE and LC2_FILE: Log file transfer messages
v LC_HANG and LC2_HANG: Mark where a program might hang
v LC_JGRP: Log job group messages
v LC_LICENSE and LC2_LICENSE: Log license management messages (LC_LICENCE is also supported for backward compatibility)
v LC_LICSCHED: Log License Scheduler messages
v LC_MEMORY: Log memory limit messages
v LC_MULTI and LC2_MULTI: Log messages pertaining to MultiCluster
v LC_PIM and LC2_PIM: Log PIM messages
v LC_RESOURCE: Log resource broker messages
v LC_SIGNAL and LC2_SIGNAL: Log messages pertaining to signals
v LC_TRACE and LC2_TRACE: Log significant program walk steps
v LC_XDR and LC2_XDR: Log everything transferred by XDR
v LC2_TOPOLOGY: Debug the hardware topology detection during runtime

EGO parameter
EGO_DEBUG_LIM

Default
Not defined

See also
LSF_DEBUG_RES, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR

LSF_DEBUG_RES

Syntax
LSF_DEBUG_RES=log_class

Description
Sets the log class for debugging RES.
Specifies the log class filtering to be applied to RES. Only messages belonging to the specified log class are recorded. LSF_DEBUG_RES sets the log class and is used in combination with LSF_LOG_MASK, which sets the log level. For example:
LSF_LOG_MASK=LOG_DEBUG
LSF_DEBUG_RES=LC_TRACE
To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LSF_DEBUG_RES="LC_TRACE LC_EXEC"
You need to restart the daemons after setting LSF_DEBUG_RES for your changes to take effect.
If you use the command lsadmin resdebug to temporarily change this parameter without changing lsf.conf, you do not need to restart the daemons.

Valid values
For a list of valid log classes, see LSF_DEBUG_LIM.

Default
Not defined

See also
LSF_DEBUG_LIM, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR

LSF_DEFAULT_FREQUENCY

Syntax
LSF_DEFAULT_FREQUENCY=[float_number][unit]

Description
Sets the default CPU frequency for compute nodes when a node starts and when a node has finished a job that used a different CPU frequency. The value is a positive float number with units (GHz or MHz). If no unit is set, the default is GHz.
If nothing is set for this parameter, the nominal CPU frequency of the host is used.

Default
Not defined (the nominal CPU frequency is used)

LSF_DHCP_ENV

Syntax
LSF_DHCP_ENV=y

Description
If defined, enables dynamic IP addressing for all LSF client hosts in the cluster.
Dynamic IP addressing is not supported across clusters in a MultiCluster environment.
If you set LSF_DHCP_ENV, you must also specify LSF_DYNAMIC_HOST_WAIT_TIME in order for hosts to rejoin a cluster after their IP address changes.
Tip: After defining or changing this parameter, you must run lsadmin reconfig and badmin mbdrestart to restart all LSF daemons.

EGO parameter
EGO_DHCP_ENV

Default
Not defined

See also
LSF_DYNAMIC_HOST_WAIT_TIME

LSF_DISABLE_LSRUN

Syntax
LSF_DISABLE_LSRUN=y | Y

Description
When defined, RES refuses remote connections from lsrun and lsgrun unless the user is either an LSF administrator or root. However, the local host from which lsrun is executed is exempted from this limitation as long as the local host is also the target host for lsrun or lsgrun.
For remote execution by root, LSF_ROOT_REX must be defined.
Note: Defining LSF_ROOT_REX allows root to run lsgrun, even if LSF_DISABLE_LSRUN=y is defined.
Other remote execution commands, such as ch and lsmake, are not affected.
Default
Set to Y at time of installation. Otherwise, not defined.

See also
LSF_ROOT_REX

LSF_DISPATCHER_LOGDIR

Syntax
LSF_DISPATCHER_LOGDIR=path

Description
Specifies the path to the log files for slot allocation decisions for queue-based fairshare.
If defined, LSF writes the results of its queue-based fairshare slot calculation to the specified directory. Each line in the file consists of a timestamp for the slot allocation and the number of slots allocated to each queue under its control. LSF logs in this file every minute. The format of this file is suitable for plotting with gnuplot.

Example
# clients managed by LSF
# Roma
# Verona
# Genova
# Pisa
# Venezia
# Bologna
15/3 19:4:50 0 0 0 0 0 0
15/3 19:5:51 8 5 2 5 2 0
15/3 19:6:51 8 5 2 5 5 1
15/3 19:7:53 8 5 2 5 5 5
15/3 19:8:54 8 5 2 5 5 0
15/3 19:9:55 8 5 0 5 4 2
The queue names are in the header lines of the file. The columns correspond to the allocations per queue.

Default
Not defined

LSF_DJOB_TASK_REG_WAIT_TIME

Syntax
LSF_DJOB_TASK_REG_WAIT_TIME=time_seconds

Description
Allows users and administrators to define a fixed timeout value that overrides the internal timeout set by LSF, in order to avoid task registration timeouts for large parallel jobs.
Can be configured in lsf.conf or set as an environment variable of bsub. The environment variable overrides the lsf.conf configuration. If neither is present, LSF uses the default value. When the value is specified by the environment variable or configured in lsf.conf, it is used directly by LSF without any adjustment.

Default
300 seconds

LSF_DUALSTACK_PREFER_IPV6

Syntax
LSF_DUALSTACK_PREFER_IPV6=Y | y

Description
Define this parameter when you want to ensure that clients and servers on dual-stack hosts use IPv6 addresses only.
Setting this parameter configures LSF to sort the dynamically created address lookup list in order of AF_INET6 (IPv6) elements first, followed by AF_INET (IPv4) elements, and then others.
Restriction: IPv4-only and IPv6-only hosts cannot belong to the same cluster. In a MultiCluster environment, you cannot mix IPv4-only and IPv6-only clusters.
Follow these guidelines for using IPv6 addresses within your cluster:
v Define this parameter only if your cluster
– Includes only dual-stack hosts, or a mix of dual-stack and IPv6-only hosts, and
– Does not include IPv4-only hosts or IPv4 servers running on dual-stack hosts (servers prior to LSF version 7)
Important: Do not define this parameter for any other cluster configuration.
v Within a MultiCluster environment, do not define this parameter if any cluster contains IPv4-only hosts or IPv4 servers (prior to LSF version 7) running on dual-stack hosts.
v Applications must be engineered to work with the cluster IP configuration.
v If you use IPv6 addresses within your cluster, ensure that you have configured the dual-stack hosts correctly. For more detailed information, see Administering IBM Platform LSF.
v Define the parameter LSF_ENABLE_SUPPORT_IPV6 in lsf.conf.

Default
Not defined. LSF sorts the dynamically created address lookup list in order of AF_INET (IPv4) elements first, followed by AF_INET6 (IPv6) elements, and then others. Clients and servers on dual-stack hosts use the first address lookup structure in the list (IPv4).

See also
LSF_ENABLE_SUPPORT_IPV6

LSF_DYNAMIC_HOST_TIMEOUT

Syntax
LSF_DYNAMIC_HOST_TIMEOUT=time_hours
LSF_DYNAMIC_HOST_TIMEOUT=time_minutesm|M

Description
Enables automatic removal of dynamic hosts from the cluster and specifies the timeout value (minimum 10 minutes). To improve performance in very large clusters, you should disable this feature and remove unwanted hosts from the hostcache file manually.
Specifies the length of time a dynamic host is unavailable before the master host removes it from the cluster. Each time LSF removes a dynamic host, mbatchd automatically reconfigures itself.

Valid value
The timeout value must be greater than or equal to 10 minutes. Values below 10 minutes are set to the minimum allowed value of 10 minutes; values above 100 hours are set to the maximum allowed value of 100 hours.

Example
LSF_DYNAMIC_HOST_TIMEOUT=60
A dynamic host is removed from the cluster when it is unavailable for 60 hours.
LSF_DYNAMIC_HOST_TIMEOUT=60m
A dynamic host is removed from the cluster when it is unavailable for 60 minutes.

EGO parameter
EGO_DYNAMIC_HOST_TIMEOUT

Default
Not defined. Unavailable hosts are never removed from the cluster.

LSF_DYNAMIC_HOST_WAIT_TIME

Syntax
LSF_DYNAMIC_HOST_WAIT_TIME=time_seconds

Description
Defines the length of time, in seconds, that a dynamic host waits after startup before communicating with the master LIM, either to add the host to the cluster or to shut down any running daemons if the host is not added successfully.
Note: To enable dynamically added hosts, the following parameters must be defined:
v LSF_DYNAMIC_HOST_WAIT_TIME in lsf.conf
v LSF_HOST_ADDR_RANGE in lsf.cluster.cluster_name

Recommended value
An integer greater than zero, up to 60 seconds for every 1000 hosts in the cluster, for a maximum of 15 minutes. Selecting a smaller value results in a quicker response time for hosts at the expense of an increased load on the master LIM.

Example
LSF_DYNAMIC_HOST_WAIT_TIME=60
A host waits 60 seconds from startup to send a request for the master LIM to add it to the cluster or to shut down any daemons if it is not added to the cluster.

Default
Not defined. Dynamic hosts cannot join the cluster.

LSF_EGO_DAEMON_CONTROL

Syntax
LSF_EGO_DAEMON_CONTROL="Y" | "N"

Description
Enables EGO Service Controller to control LSF res and sbatchd startup.
Set the value to "Y" if you want EGO Service Controller to start res and sbatchd, and restart them if they fail. To configure this parameter at installation, set EGO_DAEMON_CONTROL in install.config so that res and sbatchd start automatically as EGO services.

If LSF_ENABLE_EGO="N", this parameter is ignored and EGO Service Controller is not started.

If you manually set EGO_DAEMON_CONTROL=Y after installation, you must configure LSF res and sbatchd startup to AUTOMATIC in the EGO configuration files res.xml and sbatchd.xml under EGO_ESRVDIR/esc/conf/services.

To avoid conflicts with existing LSF startup scripts, do not set this parameter to "Y" if you use a script (for example in /etc/rc or /etc/inittab) to start LSF daemons. If this parameter is not defined in the install.config file, it takes the default value of "N".

Important: After installation, LSF_EGO_DAEMON_CONTROL alone does not change the start type for the sbatchd and res EGO services to AUTOMATIC in res.xml and sbatchd.xml under EGO_ESRVDIR/esc/conf/services. You must edit these files and set the <sc:StartType> parameter to AUTOMATIC.

Example
LSF_EGO_DAEMON_CONTROL="N"

Default
N (res and sbatchd are started manually or through the operating system rc facility)

LSF_EGO_ENVDIR

Syntax
LSF_EGO_ENVDIR=directory

Description
Directory where all EGO configuration files are installed. These files are shared throughout the system and should be readable from any host. If LSF_ENABLE_EGO="N", this parameter is ignored and ego.conf is not loaded.

Default
LSF_CONFDIR/ego/cluster_name/kernel. If not defined, or commented out, /etc is assumed.
Set the value to "N" if you do not want to take advantage of the following LSF features that depend on EGO: v LSF daemon control by EGO Service Controller v EGO-enabled SLA scheduling Important: After changing the value of LSF_ENABLE_EGO, you must shut down and restart the cluster. Default Y (EGO is enabled in the LSF cluster) LSF_ENABLE_EXTSCHEDULER Syntax LSF_ENABLE_EXTSCHEDULER=y | Y Description If set, enables mbatchd external scheduling for LSF HPC features. Chapter 1. Configuration Files 459 lsf.conf Default Set to Y at time of installation for the PARALLEL configuration template. Otherwise, not defined. LSF_ENABLE_SUPPORT_IPV6 Syntax LSF_ENABLE_SUPPORT_IPV6=y | Y Description If set, enables the use of IPv6 addresses in addition to IPv4. Default Not defined See also LSF_DUALSTACK_PREFER_IPV6 LSF_ENVDIR Syntax LSF_ENVDIR=directory Description Directory containing the lsf.conf file. By default, lsf.conf is installed by creating a shared copy in LSF_CONFDIR and adding a symbolic link from /etc/lsf.conf to the shared copy. If LSF_ENVDIR is set, the symbolic link is installed in LSF_ENVDIR/lsf.conf. The lsf.conf file is a global environment configuration file for all LSF services and applications. The LSF default installation places the file in LSF_CONFDIR. Default /etc LSF_EVENT_PROGRAM Syntax LSF_EVENT_PROGRAM=event_program_name Description Specifies the name of the LSF event program to use. If a full path name is not provided, the default location of this program is LSF_SERVERDIR. 460 Platform LSF Configuration Reference lsf.conf If a program that does not exist is specified, event generation does not work. If this parameter is not defined, the default name is genevent on UNIX, and genevent.exe on Windows. Default Not defined LSF_EVENT_RECEIVER Syntax LSF_EVENT_RECEIVER=event_receiver_program_name Description Specifies the LSF event receiver and enables event generation. 
Any string may be used as the LSF event receiver; this information is not used by LSF to enable the feature but is only passed as an argument to the event program. If LSF_EVENT_PROGRAM specifies a program that does not exist, event generation does not work. Default Not defined. Event generation is disabled LSF_GET_CONF Syntax LSF_GET_CONF=lim Description Synchronizes a local host's cluster configuration with the master host's cluster configuration. Specifies that a slave host must request cluster configuration details from the LIM of a host on the SERVER_HOST list. Use when a slave host does not share a filesystem with master hosts, and therefore cannot access cluster configuration. Default Not defined. LSF_HOST_CACHE_NTTL Syntax LSF_HOST_CACHE_NTTL=time_seconds Chapter 1. Configuration Files 461 lsf.conf Description Negative-time-to-live value in seconds. Specifies the length of time the system caches a failed DNS lookup result. If you set this value to zero (0), LSF does not cache the result. Note: Setting this parameter does not affect the positive-time-to-live value set by the parameter LSF_HOST_CACHE_PTTL. Valid values Positive integer. Recommended value less than or equal to 60 seconds (1 minute). Default 20 seconds See also LSF_HOST_CACHE_PTTL LSF_HOST_CACHE_PTTL Syntax LSF_HOST_CACHE_PTTL=time_seconds Description Positive-time-to-live value in seconds. Specifies the length of time the system caches a successful DNS lookup result. If you set this value to zero (0), LSF does not cache the result. Note: Setting this parameter does not affect the negative-time-to-live value set by the parameter LSF_HOST_CACHE_NTTL. Valid values Positive integer. Recommended value equal to or greater than 3600 seconds (1 hour). Default 86400 seconds (24 hours) See also LSF_HOST_CACHE_NTTL 462 Platform LSF Configuration Reference lsf.conf LSF_HPC_EXTENSIONS Syntax LSF_HPC_EXTENSIONS="extension_name ..." Description Enables LSF HPC extensions. 
After adding or changing LSF_HPC_EXTENSIONS, use badmin mbdrestart and badmin hrestart to reconfigure your cluster.

Valid values
The following extension names are supported:

CUMULATIVE_RUSAGE: When a parallel job script runs multiple commands, resource usage is collected for jobs in the job script, rather than being overwritten when each command is executed.

DISP_RES_USAGE_LIMITS: bjobs displays resource usage limits configured in the queue as well as job-level limits.

HOST_RUSAGE: For parallel jobs, reports the correct rusage based on each host's usage and the total rusage being charged to the execution host. This host rusage breakdown applies to the blaunch framework, the pam framework, and vendor MPI jobs. For a running job, you will see run time, memory, swap, utime, stime, and pids and pgids on all hosts that a parallel job spans. For finished jobs, you will see memory, swap, utime, and stime on all hosts that a parallel job spans. The host-based rusage is reported in the JOB_FINISH record of lsb.acct and lsb.stream, and the JOB_STATUS record of lsb.events if the job status is done or exit. Also for finished jobs, bjobs -l shows CPU time, bhist -l shows CPU time, and bacct -l shows utime, stime, memory, and swap. In the MultiCluster lease model, the parallel job must run on hosts that are all in the same cluster. If you use the jobFinishLog API, all external tools must use jobFinishLog built with LSF 9.1 or later, or host-based rusage will not work. If you add or remove this extension, you must restart mbatchd, sbatchd, and res on all hosts. Prior to LSF 9.1, this behavior was controlled by the HOST_RUSAGE extension.

NO_HOST_RUSAGE: Turn on this parameter if you do not want to see host-based job resource usage details. However, mbatchd continues to report job rusage with bjobs for all running jobs even if you configure NO_HOST_RUSAGE and restart all the daemons.
LSB_HCLOSE_BY_RES: If res is down, the host is closed with the message "Host is closed because RES is not available." The status of the closed host is closed_Adm. No new jobs are dispatched to this host, but currently running jobs are not suspended.

RESERVE_BY_STARTTIME: LSF selects the reservation that gives the job the earliest predicted start time. By default, if multiple host groups are available for reservation, LSF chooses the largest possible reservation based on number of slots.

SHORT_EVENTFILE: Compresses long host name lists when event records are written to lsb.events and lsb.acct for large parallel jobs. The short host string has the format:
number_of_hosts*real_host_name
Tip: When SHORT_EVENTFILE is enabled, older daemons and commands (pre-LSF Version 7) cannot recognize the lsb.acct and lsb.events file format.
For example, if the original host list record is
6 "hostA" "hostA" "hostA" "hostA" "hostB" "hostC"
redundant host names are removed and the short host list record becomes
3 "4*hostA" "hostB" "hostC"
When LSF_HPC_EXTENSIONS="SHORT_EVENTFILE" is set, and LSF reads the host list from lsb.events or lsb.acct, the compressed host list is expanded into a normal host list. SHORT_EVENTFILE affects the following events and fields:
- JOB_START in lsb.events when a normal job is dispatched
  – numExHosts (%d)
  – execHosts (%s)
- JOB_CHUNK in lsb.events when a job is inserted into a job chunk
  – numExHosts (%d)
  – execHosts (%s)
- JOB_FORWARD in lsb.events when a job is forwarded to a MultiCluster leased host
  – numReserHosts (%d)
  – reserHosts (%s)
- JOB_FINISH record in lsb.acct
  – numExHosts (%d)
  – execHosts (%s)

SHORT_PIDLIST: Shortens the output from bjobs to omit all but the first process ID (PID) for a job. bjobs displays only the first ID and a count of the process group IDs (PGIDs) and process IDs for the job. Without SHORT_PIDLIST, bjobs -l displays all the PGIDs and PIDs for the job.
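Returning to SHORT_EVENTFILE, the host-list compression and expansion shown earlier can be sketched as a pair of helper functions. This is an illustration of the record format, not LSF code:

```python
from itertools import groupby

def compress_hosts(hosts):
    """Collapse adjacent repeated host names into SHORT_EVENTFILE form.

    ["hostA"]*4 + ["hostB", "hostC"] -> ["4*hostA", "hostB", "hostC"]
    A host that appears once keeps its plain name.
    """
    out = []
    for name, run in groupby(hosts):
        count = len(list(run))
        out.append(f"{count}*{name}" if count > 1 else name)
    return out

def expand_hosts(entries):
    """Expand "4*hostA"-style entries back into a normal host list."""
    out = []
    for entry in entries:
        if "*" in entry:
            count, name = entry.split("*", 1)
            out.extend([name] * int(count))
        else:
            out.append(entry)
    return out
```

The round trip reproduces the documented example: a 6-entry list of hostA/hostB/hostC compresses to 3 entries and expands back to the original.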
With SHORT_PIDLIST set, bjobs -l displays a count of the PGIDS and PIDs. TASK_MEMLIMIT: Enables enforcement of a memory limit (bsub -M, bmod -M, or MEMLIMIT in lsb.queues) for individual tasks in a parallel job. If any parallel task exceeds the memory limit, LSF terminates the entire job. TASK_SWAPLIMIT: Enables enforcement of a virtual memory (swap) limit (bsub -v, bmod -v, or SWAPLIMIT in lsb.queues) for individual tasks in a parallel job. If any parallel task exceeds the swap limit, LSF terminates the entire job. 464 Platform LSF Configuration Reference lsf.conf Example JOB_START events in lsb.events: For a job submitted with bsub -n 64 -R "span[ptile=32]" blaunch sleep 100 Without SHORT_EVENTFILE, a JOB_START event like the following is logged in lsb.events: "JOB_START" "9.12" 1389121640 602 4 0 0 60.0 64 "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB "" "" 0 "" 0 "" 2147483647 4 "select[type == local] order[r15s:pg] span[ptile=32] " "" -1 "" -1 0 "" -1 0 With SHORT_EVENTFILE, a JOB_START event would be logged in lsb.events with the number of execution hosts (numExHosts field) changed from 64 to 2 and the execution host list (execHosts field) shortened to "32*hostA" and "32*hostB": "JOB_START" "9.12" 1389304047 703 4 0 0 60.0 2 "32* HostA " "32* HostB " "" "" 0 "" 0 "" 2147483647 4 "select[type == local] order[r15s:pg] " "" -1 "" -1 0 "" -1 0 Example JOB_FINISH records in lsb.acct: For a job submitted with bsub -n 64 -R "span[ptile=32]" blaunch sleep 100 Without SHORT_EVENTFILE, a JOB_FINISH event like the 
following is logged in lsb.acct: "JOB_FINISH" "9.12" 1389121646 602 33793 33816578 64 1389121640 0 0 1389121640 "user1" "normal" "span[ptile=32]" "" "" "HostB" "/scratch/user1/logdir" "" "" "" "1389121640.602" 0 64 "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostA" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" "HostB" 64 60.0 "" "blaunch sleep 100" 0.073956 0.252925 93560 0 -1 0 0 46526 15 0 3328 0 -1 0 0 0 6773 720 -1 "" "default" 0 64 "" "" 0 0 0 "" "" "" "" 0 "" 0 "" -1 "/user1" "" "" "" -1 "" "" 4112 "" 1389121640 "" "" 0 2 HostA 0 0 0 0 0 HostB 0 0 0 0 0 -1 0 0 "select[type == local] order[r15s:pg] span[ptile=32] " "" -1 "" -1 0 "" 0 0 "" 6 "/scratch/user1/logdir" 0 "" 0.000000 With SHORT_EVENTFILE, a JOB_FINISH event like the following would be logged in lsb.acct with the number of execution hosts (numExHosts field) changed from 64 to 2 and the execution host list (execHosts field) shortened to "32*hostA" and "32*hostB": "JOB_FINISH" "9.12" 1389304053 703 33793 33554434 64 1389304041 0 0 1389304047 "user1" "normal" "" "" "" "HostB" "/scratch/user1/LSF/conf" "" "" "" "1389304041.703" 0 2 "32*HostA" "32*HostB" 64 60.0 "" "blaunch sleep 100" 0.075956 0.292922 93612 0 -1 0 0 46466 0 0 0 0 -1 0 0 0 9224 478 -1 "" "default" 0 64 "" "" 0 0 0 "" "" "" "" 0 "" 0 "" -1 "/user1" "" "" "" -1 "" "" 4112 "" 1389304047 "" "" 0 2 HostB 0 0 0 0 0 HostA 0 0 0 0 0 -1 0 0 "select[type == local] order[r15s:pg] " "" -1 "" -1 0 "" 0 0 "" 6 "/scratch/user1/LSF/conf" 0 "" 0.000000 Chapter 1. 
Example bjobs -l output without SHORT_PIDLIST:
bjobs -l displays all the PGIDs and PIDs for the job:

bjobs -l
Job <109>, User <user3>, Project <default>, Status <RUN>, Queue <normal>, Interactive mode, Command <./myjob.sh>
Mon Jul 21 20:54:44 2009: Submitted from host <hostA>, CWD <$HOME/LSF/jobs>;
 RUNLIMIT
 10.0 min of hostA
 STACKLIMIT  CORELIMIT  MEMLIMIT
 5256 K      10000 K    5000 K
Mon Jul 21 20:54:51 2009: Started on <hostA>;
Mon Jul 21 20:55:03 2009: Resource usage collected.
 MEM: 2 Mbytes; SWAP: 15 Mbytes
 PGID: 256871; PIDs: 256871
 PGID: 257325; PIDs: 257325 257500 257482 257501 257523 257525 257531

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 loadSched   -    -    -    -   -   -   -   -   -    -    -
 loadStop    -    -    -    -   -   -   -   -   -    -    -

           cpuspeed  bandwidth
 loadSched     -         -
 loadStop      -         -

<< Job <109> is done successfully. >>

Example bjobs -l output with SHORT_PIDLIST:
bjobs -l displays a count of the PGIDs and PIDs:

bjobs -l
Job <109>, User <user3>, Project <default>, Status <RUN>, Queue <normal>, Interactive mode, Command <./myjob.sh>
Mon Jul 21 20:54:44 2009: Submitted from host <hostA>, CWD <$HOME/LSF/jobs>;
 RUNLIMIT
 10.0 min of hostA
 STACKLIMIT  CORELIMIT  MEMLIMIT
 5256 K      10000 K    5000 K
Mon Jul 21 20:54:51 2009: Started on <hostA>;
Mon Jul 21 20:55:03 2009: Resource usage collected.
 MEM: 2 Mbytes; SWAP: 15 Mbytes
 PGID(s): 256871:1 PID, 257325:7 PIDs

SCHEDULING PARAMETERS:
           r15s  r1m  r15m  ut  pg  io  ls  it  tmp  swp  mem
 loadSched   -    -    -    -   -   -   -   -   -    -    -
 loadStop    -    -    -    -   -   -   -   -   -    -    -

           cpuspeed  bandwidth
 loadSched     -         -
 loadStop      -         -

Default
Set to "CUMULATIVE_RUSAGE HOST_RUSAGE LSB_HCLOSE_BY_RES SHORT_EVENTFILE" at time of installation for the PARALLEL configuration template. If otherwise undefined, then "HOST_RUSAGE".

LSF_HPC_PJL_LOADENV_TIMEOUT

Syntax
LSF_HPC_PJL_LOADENV_TIMEOUT=time_seconds

Description
Timeout value in seconds for the PJL to load or unload the environment.
For example, set LSF_HPC_PJL_LOADENV_TIMEOUT to the number of seconds needed for IBM POE to load or unload adapter windows. At job startup, the PJL times out if the first task fails to register with PAM within the specified timeout value. At job shutdown, the PJL times out if it fails to exit after the last Taskstarter termination report within the specified timeout value. Default LSF_HPC_PJL_LOADENV_TIMEOUT=300 LSF_ID_PORT Syntax LSF_ID_PORT=port_number Description The network port number used to communicate with the authentication daemon when LSF_AUTH is set to ident. Default Not defined LSF_INCLUDEDIR Syntax LSF_INCLUDEDIR=directory Description Directory under which the LSF API header files lsf.h and lsbatch.h are installed. Default LSF_INDEP/include Chapter 1. Configuration Files 467 lsf.conf See also LSF_INDEP LSF_INDEP Syntax LSF_INDEP=directory Description Specifies the default top-level directory for all machine-independent LSF files. This includes man pages, configuration files, working directories, and examples. For example, defining LSF_INDEP as /usr/share/lsf/mnt places man pages in /usr/share/lsf/mnt/man, configuration files in /usr/share/lsf/mnt/conf, and so on. The files in LSF_INDEP can be shared by all machines in the cluster. As shown in the following list, LSF_INDEP is incorporated into other LSF environment variables. v LSB_SHAREDIR=$LSF_INDEP/work v LSF_CONFDIR=$LSF_INDEP/conf v LSF_INCLUDEDIR=$LSF_INDEP/include v LSF_MANDIR=$LSF_INDEP/man v XLSF_APPDIR=$LSF_INDEP/misc Default /usr/share/lsf/mnt See also LSF_MACHDEP, LSB_SHAREDIR, LSF_CONFDIR, LSF_INCLUDEDIR, LSF_MANDIR, XLSF_APPDIR LSF_INTERACTIVE_STDERR Syntax LSF_INTERACTIVE_STDERR=y | n Description Separates stderr from stdout for interactive tasks and interactive batch jobs. This is useful to redirect output to a file with regular operators instead of the bsub -e err_file and -o out_file options. This parameter can also be enabled or disabled as an environment variable. 
CAUTION: If you enable this parameter globally in lsf.conf, check any custom scripts that manipulate stderr and stdout.

When this parameter is not defined or set to n, the following are written to stdout on the submission host for interactive tasks and interactive batch jobs:
- Job standard output messages
- Job standard error messages
The following are written to stderr on the submission host for interactive tasks and interactive batch jobs:
- LSF messages
- NIOS standard messages
- NIOS debug messages (if LSF_NIOS_DEBUG=1 in lsf.conf)

When this parameter is set to y, the following are written to stdout on the submission host for interactive tasks and interactive batch jobs:
- Job standard output messages
The following are written to stderr on the submission host:
- Job standard error messages
- LSF messages
- NIOS standard messages
- NIOS debug messages (if LSF_NIOS_DEBUG=1 in lsf.conf)

Default
Not defined

Notes
When this parameter is set, the change affects interactive tasks and interactive batch jobs run with the following commands:
- bsub -I
- bsub -Ip
- bsub -Is
- lsrun
- lsgrun
- lsmake (makefile)
- bsub pam (HPC features must be enabled)

Limitations
- Pseudo-terminal: Do not use this parameter if your application depends on stderr as a terminal. This is because LSF must use a non-pseudo-terminal connection to separate stderr from stdout.
- Synchronization: Do not use this parameter if you depend on messages in stderr and stdout being synchronized and jobs in your environment are continuously submitted. A continuous stream of messages causes stderr and stdout to lose synchronization; the effect is more pronounced with parallel jobs. This situation is similar to that of rsh.
- NIOS standard and debug messages: NIOS standard messages, and debug messages (when LSF_NIOS_DEBUG=1 in lsf.conf or as an environment variable) are written to stderr. NIOS standard messages are in the format
<<message>>, which makes it easier to remove them if you wish. To redirect NIOS debug messages to a file, define LSF_CMD_LOGDIR in lsf.conf or as an environment variable.

See also
LSF_NIOS_DEBUG, LSF_CMD_LOGDIR

LSF_LD_SECURITY

Syntax
LSF_LD_SECURITY=y | n

Description
When set, jobs submitted using bsub -Is or bsub -Ip cause the environment variables LD_PRELOAD and LD_LIBRARY_PATH to be removed from the job environment during job initialization to ensure enhanced security against users obtaining root privileges. Two new environment variables (LSF_LD_LIBRARY_PATH and LSF_LD_PRELOAD) are created to allow LD_PRELOAD and LD_LIBRARY_PATH to be put back before the job runs.

Default
N

LSF_LIBDIR

Syntax
LSF_LIBDIR=directory

Description
Specifies the directory in which the LSF libraries are installed. Library files are shared by all hosts of the same type.

Default
LSF_MACHDEP/lib

LSF_LIC_SCHED_HOSTS

Syntax
LSF_LIC_SCHED_HOSTS="candidate_host_list"
candidate_host_list is a space-separated list of hosts that are candidate License Scheduler hosts.

Description
The candidate License Scheduler host list is read by LIM on each host to check whether the host is a candidate License Scheduler master host. If the host is on the list, LIM starts the License Scheduler daemon (bld) on the host.

LSF_LIC_SCHED_PREEMPT_REQUEUE

Syntax
LSF_LIC_SCHED_PREEMPT_REQUEUE=y | n

Description
Set this parameter to requeue a job whose license is preempted by IBM Platform License Scheduler. The job is killed and requeued instead of suspended. If you set LSF_LIC_SCHED_PREEMPT_REQUEUE, do not set LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE. If both these parameters are set, LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE is ignored.
Default N See also LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE, LSF_LIC_SCHED_PREEMPT_STOP LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE Syntax LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE=y | n Description Set this parameter to release the resources of a License Scheduler job that is suspended. These resources are only available to pending License Scheduler jobs that request at least one license that is the same as the suspended job. If the suspended License Scheduler job is an exclusive job (-x), the suspended job's resources are not released and pending License Scheduler jobs cannot use these resources. A pending exclusive License Scheduler job can also be blocked from starting on a host by a suspended License Scheduler job, even when the jobs share licenses or exclusive preemption is enabled. Jobs attached to a service class are not normally preemptable with queue-based preemption, except by jobs in queues with SLA_GUARANTEES_IGNORE=Y specified (in lsb.queues). However, License Scheduler can preempt jobs with service classes for licenses. By default, the job slots are the only resources available after a job is suspended. You can also specify that memory resources are available by enabling preemption for memory resources. To enable memory resource preemption, specify PREEMPTABLE_RESOURCES = mem in lsb.params. By default, the job slots are the only resources available after a job is suspended. You can also specify that memory or affinity resources are available by enabling preemption for these resources: Chapter 1. Configuration Files 471 lsf.conf v To enable memory resource preemption, specify PREEMPTABLE_RESOURCES = mem in lsb.params. v To enable affinity resource preemption, specify PREEMPT_JOBTYPE = AFFINITY in lsb.params. If you set LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE, do not set LSF_LIC_SCHED_PREEMPT_REQUEUE. If both these parameters are set, LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE is ignored. 
Default
Y

See also
LSF_LIC_SCHED_PREEMPT_REQUEUE, LSF_LIC_SCHED_PREEMPT_STOP

LSF_LIC_SCHED_PREEMPT_STOP

Syntax
LSF_LIC_SCHED_PREEMPT_STOP=y | n

Description
Set this parameter to use job controls to stop a job that is preempted. When this parameter is set, a UNIX SIGSTOP signal is sent to suspend a job instead of a UNIX SIGTSTP. To send a SIGSTOP signal instead of SIGTSTP, the following parameter in lsb.queues must also be set:
JOB_CONTROLS=SUSPEND[SIGSTOP]

Default
N

See also
LSF_LIC_SCHED_PREEMPT_SLOT_RELEASE, LSF_LIC_SCHED_PREEMPT_REQUEUE

LSF_LIC_SCHED_STRICT_PROJECT_NAME

Syntax
LSF_LIC_SCHED_STRICT_PROJECT_NAME=y | n

Description
Enforces strict checking of the License Scheduler project name upon job submission or job modification (bsub or bmod). If the project name is misspelled (case sensitivity applies), the job is rejected. If this parameter is not set, or is set to n, and there is an error in the project name, the default project is used.

Default
N

LSF_LIM_API_NTRIES

Syntax
LSF_LIM_API_NTRIES=integer

Description
Defines the number of times LSF commands retry communicating with the LIM API when LIM is not available. LSF_LIM_API_NTRIES is ignored by LSF and EGO daemons and by EGO commands. The LSF_LIM_API_NTRIES environment variable overrides the value of LSF_LIM_API_NTRIES in lsf.conf.

Valid values
1 to 65535

Default
1. The LIM API exits without retrying.

LSF_LIM_DEBUG

Syntax
LSF_LIM_DEBUG=1 | 2

Description
Sets LSF to debug mode. If LSF_LIM_DEBUG is defined, LIM operates in single user mode. No security checking is performed, so LIM should not run as root. LIM does not look in the services database for the LIM service port number. Instead, it uses port number 36000 unless LSF_LIM_PORT has been defined. Specify 1 for this parameter unless you are testing LSF.

Valid values
LSF_LIM_DEBUG=1
LIM runs in the background with no associated control terminal.
LSF_LIM_DEBUG=2
LIM runs in the foreground and prints error messages to the terminal.

EGO parameter
EGO_LIM_DEBUG

Default
Not defined

See also
LSF_RES_DEBUG, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR

LSF_LIM_IGNORE_CHECKSUM

Syntax
LSF_LIM_IGNORE_CHECKSUM=y | Y

Description
Configure LSF_LIM_IGNORE_CHECKSUM=Y to ignore warning messages logged to lim log files on non-master hosts.

When LSF_MASTER_LIST is set, lsadmin reconfig only restarts master candidate hosts (for example, after adding or removing hosts from the cluster). This can cause superfluous warning messages like the following to be logged in the lim log files for non-master hosts, because lim on these hosts is not restarted after a configuration change:
Aug 26 13:47:35 2006 9746 4 9.1.3 xdr_loadvector: Sender <10.225.36.46:9999> has a different configuration

Default
Not defined.

See also
LSF_MASTER_LIST

LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT

Syntax
LSF_LIM_PORT=port_number

Description
TCP service ports to use for communication with the LSF daemons. If the port parameters are not defined, LSF obtains the port numbers by looking up the LSF service names in the /etc/services file or in NIS (UNIX). If it is not possible to modify the services database, you can define these port parameters to set the port numbers.

EGO parameter
EGO_LIM_PORT

Default
On UNIX, the default is to get port numbers from the services database. On Windows, these parameters are mandatory. Default port number values are:
- LSF_LIM_PORT=7869
- LSF_RES_PORT=6878
- LSB_MBD_PORT=6881
- LSB_SBD_PORT=6882

LSF_LINUX_CGROUP_ACCT

Syntax
LSF_LINUX_CGROUP_ACCT=Y|y|N|n

Description
Enable this parameter to track processes based on CPU and memory accounting for Linux systems that support the cgroup memory and cpuacct subsystems. Once enabled, this parameter takes effect for new jobs.
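Returning to the daemon port parameters, the documented lookup order (an explicitly configured port, then the services database, then the installation defaults) can be sketched as follows. The service-name strings used here are illustrative assumptions, not the names LSF registers:

```python
import socket

# Documented installation defaults, used in this sketch when neither an
# explicit setting nor a services-database entry is available.
DEFAULT_PORTS = {
    "lim": 7869,      # LSF_LIM_PORT
    "res": 6878,      # LSF_RES_PORT
    "mbatchd": 6881,  # LSB_MBD_PORT
    "sbatchd": 6882,  # LSB_SBD_PORT
}

def lookup_port(service, configured=None):
    """Resolve a daemon port: explicit lsf.conf-style setting first,
    then the /etc/services database, then the default table."""
    if configured is not None:
        return configured
    try:
        return socket.getservbyname(service, "tcp")
    except OSError:
        return DEFAULT_PORTS[service]
```

An explicitly configured value always wins, mirroring how a defined port parameter bypasses the services-database lookup.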
When this parameter and LSF_PROCESS_TRACKING are enabled, they take precedence over parameters LSF_PIM_LINUX_ENHANCE and EGO_PIM_SWAP_REPORT. Default Set to Y at time of installation. If otherwise undefined, then N. LSF_LIVE_CONFDIR Syntax LSF_LIVE_CONFDIR=directory Description Enables and disables live reconfiguration (bconf command) and sets the directory where configuration files changed by live reconfiguration are saved. bconf requests will be rejected if the directory does not exist and cannot be created, or is specified using a relative path. When LSF_LIVE_CONFDIR is defined and contains configuration files, all LSF restart and reconfiguration reads these configuration files instead of the files in LSF_CONFDIR. After adding or changing LSF_LIVE_CONFDIR in lsf.conf, use badmin mbdrestart and lsadmin reconfig to reconfigure your cluster. Important: Remove LSF_LIVE_CONFDIR configuration files or merge files into LSF_CONFDIR before disabling bconf, upgrading LSF, applying patches to LSF, or adding server hosts. Chapter 1. Configuration Files 475 lsf.conf See bconf in the LSF Command Reference or bconf man page for bconf (live reconfiguration) details. Default During installation, LSF_LIVE_CONFDIR is set to LSB_SHAREDIR/cluster_name/ live_confdir where cluster_name is the name of the LSF cluster, as returned by lsid. See also LSF_CONFDIR, LSB_CONFDIR LSF_LOAD_USER_PROFILE Syntax LSF_LOAD_USER_PROFILE=local | roaming Description When running jobs on Windows hosts, you can specify whether a user profile should be loaded. Use this parameter if you have jobs that need to access user-specific resources associated with a user profile. Local and roaming user profiles are Windows features. For more information about them, check Microsoft documentation. v Local: LSF loads the Windows user profile from the local execution machine (the host on which the job runs). Note: If the user has logged onto the machine before, the profile of that user is used. 
If not, the profile for the default user is used.
- Roaming: LSF loads a roaming user profile if one has been set up. If not, the local user profile is loaded instead.

Default
Not defined. No user profiles are loaded when jobs run on Windows hosts.

LSF_LOCAL_RESOURCES

Syntax
LSF_LOCAL_RESOURCES="resource ..."

Description
Defines instances of local resources residing on the slave host.
- For numeric resources, define name-value pairs:
"[resourcemap value*resource_name]"
- For Boolean resources, the value is the resource name in the form:
"[resource resource_name]"

When the slave host calls the master host to add itself, it also reports its local resources. The local resources to be added must be defined in lsf.shared.

default indicates an instance of a resource on each host in the cluster. This specifies a special case where the resource is in effect not shared and is local to every host. default means at each host. Normally, you should not need to use default, because by default all resources are local to each host. You might want to use ResourceMap for a non-shared static resource if you need to specify different values for the resource on different hosts.

If the same resource is already defined in lsf.shared as default or all, it cannot be added as a local resource. The shared resource overrides the local one.

Tip: LSF_LOCAL_RESOURCES is usually set in the slave.config file during installation. If LSF_LOCAL_RESOURCES is already defined in a local lsf.conf on the slave host, lsfinstall does not add the resources you define in LSF_LOCAL_RESOURCES in slave.config. You should not have duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.

Important: Resources must already be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name. If the ResourceMap section does not exist, local resources are not added.
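The two bracketed forms of LSF_LOCAL_RESOURCES can be parsed mechanically. A minimal sketch of the documented syntax (illustrative only; it does not handle the exclusive-resource prefix or validate names against lsf.shared):

```python
import re

def parse_local_resources(value):
    """Parse an LSF_LOCAL_RESOURCES string into a dict.

    Sketch of the documented forms:
      [resourcemap value*name] -> numeric resource mapped to its value
      [resource name]          -> Boolean resource (True)
    """
    resources = {}
    for kind, body in re.findall(r"\[(resourcemap|resource)\s+([^\]]+)\]", value):
        if kind == "resourcemap":
            val, name = body.split("*", 1)
            resources[name.strip()] = float(val)
        else:
            resources[body.strip()] = True
    return resources
```

For example, "[resourcemap 1*verilog] [resource linux]" yields one numeric instance of verilog and the Boolean resource linux.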
Example
LSF_LOCAL_RESOURCES="[resourcemap 1*verilog] [resource linux] [resource !bigmem]"
Prefix the resource name with an exclamation mark (!) to indicate that the resource is exclusive to the host.

EGO parameter
EGO_LOCAL_RESOURCES

Default
Not defined

LSF_LOG_MASK

Syntax
LSF_LOG_MASK=message_log_level

Description
Specifies the logging level of error messages for LSF daemons, except LIM, which is controlled by EGO. For mbatchd and mbschd, LSF_LOG_MASK_MBD and LSF_LOG_MASK_SCH override LSF_LOG_MASK. For example:
LSF_LOG_MASK=LOG_DEBUG

If EGO is enabled in the LSF cluster and EGO_LOG_MASK is not defined, LSF uses the value of LSF_LOG_MASK for LIM, PIM, and MELIM. The EGO vemkd and pem components continue to use the EGO default values. If EGO_LOG_MASK is defined and EGO is enabled, the EGO value is used.

To specify the logging level of error messages for LSF commands, use LSF_CMD_LOG_MASK. To specify the logging level of error messages for LSF batch commands, use LSB_CMD_LOG_MASK.

On UNIX, this is similar to syslog. All messages logged at the specified level or higher are recorded; lower level messages are discarded. The LSF_LOG_MASK value can be any log priority symbol that is defined in syslog.h (see syslog). The log levels in order from highest to lowest are:
- LOG_EMERG
- LOG_ALERT
- LOG_CRIT
- LOG_ERR
- LOG_WARNING
- LOG_NOTICE
- LOG_INFO
- LOG_DEBUG
- LOG_DEBUG1
- LOG_DEBUG2
- LOG_DEBUG3

The most important LSF log messages are at the LOG_ERR or LOG_WARNING level. Messages at the LOG_INFO and LOG_DEBUG level are only useful for debugging.

Although message log level implements similar functionality to UNIX syslog, there is no dependency on UNIX syslog. It works even if messages are being logged to files instead of syslog.

LSF logs error messages at different levels so that you can choose to log all messages, or only messages that are deemed critical.
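The recording rule above — messages at the specified level or higher are kept, lower-level messages discarded — can be sketched as a simple threshold check (an illustration of the masking rule, not LSF's logging code):

```python
# Priority order from highest to lowest, as documented for LSF_LOG_MASK.
LEVELS = [
    "LOG_EMERG", "LOG_ALERT", "LOG_CRIT", "LOG_ERR", "LOG_WARNING",
    "LOG_NOTICE", "LOG_INFO", "LOG_DEBUG", "LOG_DEBUG1",
    "LOG_DEBUG2", "LOG_DEBUG3",
]

def is_logged(message_level, mask="LOG_WARNING"):
    """A message is recorded when its priority is at or above the mask.

    Lower list index means higher priority, so the message's index must
    not exceed the mask's index.
    """
    return LEVELS.index(message_level) <= LEVELS.index(mask)
```

With the default mask of LOG_WARNING, a LOG_ERR message is recorded while LOG_INFO and debug messages are discarded.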
The level specified by LSF_LOG_MASK determines which messages are recorded and which are discarded. All messages logged at the specified level or higher are recorded; lower level messages are discarded.

For debugging purposes, the level LOG_DEBUG contains the fewest debugging messages and is used for basic debugging. The level LOG_DEBUG3 records all debugging messages, and can cause log files to grow very large; it is not often used. Most debugging is done at the level LOG_DEBUG2.

In versions earlier than LSF 4.0, you needed to restart the daemons after setting LSF_LOG_MASK in order for your changes to take effect. LSF 4.0 implements dynamic debugging, which means you do not need to restart the daemons after setting a debugging environment variable.

EGO parameter
EGO_LOG_MASK

Default
LOG_WARNING

See also
LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_CMD, LSB_DEBUG_NQS, LSB_TIME_CMD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_DEBUG_LIM, LSB_DEBUG_MBD, LSF_DEBUG_RES, LSB_DEBUG_SBD, LSB_DEBUG_SCH, LSF_LOG_MASK, LSF_LOGDIR, LSF_TIME_CMD

LSF_LOG_MASK_LIM

Syntax
LSF_LOG_MASK_LIM=message_log_level

Description
Specifies the logging level of error messages for the LSF LIM only. This value overrides LSF_LOG_MASK for LIM only. For example:
LSF_LOG_MASK_LIM=LOG_DEBUG

The valid log levels for this parameter are:
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Run lsadmin limrestart to make changes take effect.

Default
Not defined (logging level is controlled by LSF_LOG_MASK).

LSF_LOG_MASK_RES

Syntax
LSF_LOG_MASK_RES=message_log_level

Description
Specifies the logging level of error messages for LSF RES only. This value overrides LSF_LOG_MASK for RES only.
For example:
LSF_LOG_MASK_RES=LOG_DEBUG

The valid log levels for this parameter are:
v LOG_ERR
v LOG_WARNING
v LOG_NOTICE
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

Run lsadmin resrestart to make changes take effect.

Default
Not defined (logging level is controlled by LSF_LOG_MASK).

LSF_LOG_MASK_WIN

Syntax
LSF_LOG_MASK_WIN=message_log_level

Description
Allows you to reduce the information logged to the LSF Windows event log files. Messages of lower severity than the specified level are discarded.

For all LSF files, the types of messages saved depend on LSF_LOG_MASK, so the threshold for the Windows event logs is either LSF_LOG_MASK or LSF_LOG_MASK_WIN, whichever is higher. LSF_LOG_MASK_WIN is ignored if LSF_LOG_MASK is set to a higher level.

The LSF event log files for Windows are:
v lim.log.host_name
v res.log.host_name
v sbatchd.log.host_name
v mbatchd.log.host_name
v pim.log.host_name

The log levels you can specify for this parameter, in order from highest to lowest, are:
v LOG_ERR
v LOG_WARNING
v LOG_INFO
v LOG_NONE (LSF does not log Windows events)

Default
LOG_ERR

See also
LSF_LOG_MASK

LSF_LOGDIR

Syntax
LSF_LOGDIR=directory

Description
Defines the LSF system log file directory. Error messages from all servers are logged into files in this directory.

To effectively use debugging, set LSF_LOGDIR to a directory such as /tmp. This can be done in your own environment from the shell or in lsf.conf.

Windows
LSF_LOGDIR is required on Windows if you wish to enable logging. You must also define LSF_LOGDIR_USE_WIN_REG=n.
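For example, the two settings might be combined in lsf.conf on a Windows host as follows (a sketch; the share path is hypothetical):

```
LSF_LOGDIR=\\fileserver\lsf\logs
LSF_LOGDIR_USE_WIN_REG=n
```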
If you define LSF_LOGDIR without defining LSF_LOGDIR_USE_WIN_REG=n, LSF logs error messages into files in the default local directory specified in one of the following Windows registry keys:
v On Windows 2000, Windows XP, and Windows 2003:
HKEY_LOCAL_MACHINE\SOFTWARE\IBM Platform\LSF\LSF_LOGDIR
v On Windows XP x64 and Windows 2003 x64:
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\IBM Platform\LSF\LSF_LOGDIR

If a server is unable to write in the LSF system log file directory, LSF attempts to write to the following directories in the following order:
v LSF_TMPDIR if defined
v %TMP% if defined
v %TEMP% if defined
v System directory, for example, c:\winnt

UNIX
If a server is unable to write in this directory, the error logs are created in /tmp on UNIX.

If LSF_LOGDIR is not defined, syslog is used to log everything to the system log using the LOG_DAEMON facility. The syslog facility is available by default on most UNIX systems. The /etc/syslog.conf file controls the way messages are logged and the files they are logged to. See the man pages for the syslogd daemon and the syslog function for more information.

Default
Not defined. On UNIX, log messages go to syslog. On Windows, no logging is performed.

See also
LSB_CMD_LOG_MASK, LSB_CMD_LOGDIR, LSB_DEBUG, LSB_DEBUG_CMD, LSB_TIME_CMD, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR_USE_WIN_REG, LSF_TIME_CMD

Files
v lim.log.host_name
v res.log.host_name
v sbatchd.log.host_name
v sbatchdc.log.host_name (when LSF_DAEMON_WRAP=Y)
v mbatchd.log.host_name
v eeventd.log.host_name
v pim.log.host_name

LSF_LOGDIR_USE_WIN_REG

Syntax
LSF_LOGDIR_USE_WIN_REG=n | N

Description
Windows only. If set, LSF logs error messages into files in the directory specified by LSF_LOGDIR in lsf.conf.

Use this parameter to enable LSF to save log files in a different location from the default local directory specified in the Windows registry.
If not set, or if set to any value other than N or n, LSF logs error messages into files in the default local directory specified in one of the following Windows registry keys:
v On Windows 2000, Windows XP, and Windows 2003:
HKEY_LOCAL_MACHINE\SOFTWARE\IBM Platform\LSF\LSF_LOGDIR
v On Windows XP x64 and Windows 2003 x64:
HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node\IBM Platform\LSF\LSF_LOGDIR

Default
Not set. LSF uses the default local directory specified in the Windows registry.

See also
LSF_LOGDIR

LSF_LOGFILE_OWNER

Syntax
LSF_LOGFILE_OWNER="user_name"

Description
Specifies an owner for the LSF log files other than the default, the owner of lsf.conf. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name).

Default
Not set. The LSF administrator with root privileges is the owner of LSF log files.

LSF_LSLOGIN_SSH

Syntax
LSF_LSLOGIN_SSH=Y | y

Description
Enables SSH to secure communication between hosts and during job submission. SSH is used when running any of the following:
v Remote login to a lightly loaded host (lslogin)
v An interactive job (bsub -IS | -ISp | -ISs)
v An X-window job (bsub -IX)
v An externally submitted job that is interactive or an X-window job (esub)

Default
Not set. LSF uses rlogin to authenticate users.

LSF_MACHDEP

Syntax
LSF_MACHDEP=directory

Description
Specifies the directory in which machine-dependent files are installed. These files cannot be shared across different types of machines. In clusters with a single host type, LSF_MACHDEP is usually the same as LSF_INDEP. The machine-dependent files are the user commands, daemons, and libraries. You should not need to modify this parameter.

As shown in the following list, LSF_MACHDEP is incorporated into other LSF variables.
v LSF_BINDIR=$LSF_MACHDEP/bin
v LSF_LIBDIR=$LSF_MACHDEP/lib
v LSF_SERVERDIR=$LSF_MACHDEP/etc
v XLSF_UIDDIR=$LSF_MACHDEP/lib/uid

Default
/usr/share/lsf
See also
LSF_INDEP

LSF_MANAGE_FREQUENCY

Syntax
LSF_MANAGE_FREQUENCY=N | CORE | HOST

Description
Uses a keyword value (N, CORE, or HOST) to set whether the CPU frequency is set for the core (CPU) or by host (node). If the value CORE is set, jobs will require affinity resource requirements. The default value for this parameter is N (not set).

Default
N

LSF_MANDIR

Syntax
LSF_MANDIR=directory

Description
Directory under which all man pages are installed. The man pages are placed in the man1, man3, man5, and man8 subdirectories of the LSF_MANDIR directory. This is created by the LSF installation process, and you should not need to modify this parameter. Man pages are installed in a format suitable for BSD-style man commands.

For most versions of UNIX and Linux, you should add the directory LSF_MANDIR to your MANPATH environment variable. If your system has a man command that does not understand MANPATH, you should either install the man pages in the /usr/man directory or get one of the freely available man programs.

Default
LSF_INDEP/man

LSF_MASTER_LIST

Syntax
LSF_MASTER_LIST="host_name ..."

Description
Required. Defines a list of hosts that are candidates to become the master host for the cluster. Listed hosts must be defined in lsf.cluster.cluster_name. Host names are separated by spaces.

Tip: On UNIX and Linux, master host candidates should share LSF configuration and binaries. On Windows, configuration files are shared, but not binaries.

Starting in LSF 7, LSF_MASTER_LIST must be defined in lsf.conf. If EGO is enabled, LSF_MASTER_LIST can only be defined in lsf.conf, and EGO_MASTER_LIST can only be defined in ego.conf. EGO_MASTER_LIST cannot be defined in lsf.conf, and LSF_MASTER_LIST cannot be defined in ego.conf. LIM reads EGO_MASTER_LIST wherever it is defined. If both LSF_MASTER_LIST and EGO_MASTER_LIST are defined, the value of EGO_MASTER_LIST in ego.conf is used.
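The precedence just described can be sketched as a small shell function (illustrative only; effective_master_list is a hypothetical name, not an LSF command):

```shell
#!/bin/sh
# Sketch of LSF_MASTER_LIST vs. EGO_MASTER_LIST precedence.
# $1 = EGO enabled (y/n), $2 = LSF_MASTER_LIST from lsf.conf,
# $3 = EGO_MASTER_LIST from ego.conf (empty string if not defined).
effective_master_list() {
  ego_enabled=$1; lsf_list=$2; ego_list=$3
  if [ "$ego_enabled" = y ] && [ -n "$ego_list" ]; then
    echo "$ego_list"    # EGO enabled and EGO_MASTER_LIST defined: ego.conf wins
  else
    echo "$lsf_list"    # otherwise the lsf.conf value is used
  fi
}

effective_master_list y "hostA hostB" "hostB hostA"   # -> hostB hostA
effective_master_list n "hostA hostB" "hostB hostA"   # -> hostA hostB
```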
To avoid errors, you should make sure that the value of LSF_MASTER_LIST matches the value of EGO_MASTER_LIST, or define only EGO_MASTER_LIST. If EGO is disabled, ego.conf is not loaded and the value of LSF_MASTER_LIST defined in lsf.conf is used.

When you run lsadmin reconfig to reconfigure the cluster, only the master LIM candidates read lsf.shared and lsf.cluster.cluster_name to get updated information. The elected master LIM sends configuration information to slave LIMs.

If you have a large number of non-master hosts, you should configure LSF_LIM_IGNORE_CHECKSUM=Y to ignore warning messages like the following logged to lim log files on non-master hosts:
Feb 26 13:47:35 2013 9746 4 9.1.3 xdr_loadvector: Sender <10.225.36.46:9999> has a different configuration

Interaction with LSF_SERVER_HOSTS
You can use the same list of hosts, or a subset of the master host list defined in LSF_MASTER_LIST, in LSF_SERVER_HOSTS. If you include the primary master host in LSF_SERVER_HOSTS, you should define it as the last host in the list.

If LSF_ADD_CLIENTS is defined in install.config at installation, lsfinstall automatically appends the hosts in LSF_MASTER_LIST to the list of hosts in LSF_SERVER_HOSTS so that the primary master host is last. For example:
LSF_MASTER_LIST="lsfmaster hostE"
LSF_SERVER_HOSTS="hostB hostC hostD hostE lsfmaster"

The value of LSF_SERVER_HOSTS is not changed during upgrade.

EGO parameter
EGO_MASTER_LIST

Default
Defined at installation

See also
LSF_LIM_IGNORE_CHECKSUM

LSF_MASTER_NSLOOKUP_TIMEOUT

Syntax
LSF_MASTER_NSLOOKUP_TIMEOUT=time_milliseconds

Description
Timeout in milliseconds that the master LIM waits for DNS host name lookup. If LIM spends a lot of time calling DNS to look up a host name, LIM appears to hang. This parameter is used by the master LIM only. Only the master LIM detects this parameter and enables the DNS lookup timeout.

Default
Not defined.
No timeout for DNS lookup.

See also
LSF_LIM_IGNORE_CHECKSUM

LSF_MAX_TRY_ADD_HOST

Syntax
LSF_MAX_TRY_ADD_HOST=integer

Description
When a slave LIM on a dynamically added host sends an add host request to the master LIM, but the master LIM cannot add the host for some reason, the slave LIM tries again. LSF_MAX_TRY_ADD_HOST specifies how many times the slave LIM retries the add host request before giving up.

Default
20

LSF_MC_NON_PRIVILEGED_PORTS

Syntax
LSF_MC_NON_PRIVILEGED_PORTS=y | Y

Description
MultiCluster only. If this parameter is enabled in one cluster, it must be enabled in all clusters. Specify Y to make LSF daemons use non-privileged ports for communication across clusters.

Compatibility
This disables privileged port daemon authentication, which is a security feature. If security is a concern, you should use eauth for LSF daemon authentication (see LSF_AUTH_DAEMONS in lsf.conf).

Default
Not defined. LSF daemons use privileged port authentication.

LSF_MISC

Syntax
LSF_MISC=directory

Description
Directory in which miscellaneous machine-independent files, such as example source programs and scripts, are installed.

Default
LSF_CONFDIR/misc

LSF_NIOS_DEBUG

Syntax
LSF_NIOS_DEBUG=1

Description
Enables NIOS debugging for interactive jobs (if LSF_NIOS_DEBUG=1). NIOS debug messages are written to standard error. This parameter can also be defined as an environment variable.

When LSF_NIOS_DEBUG and LSF_CMD_LOGDIR are defined, NIOS debug messages are logged in nios.log.host_name in the location specified by LSF_CMD_LOGDIR. If LSF_NIOS_DEBUG is defined, and the directory defined by LSF_CMD_LOGDIR is inaccessible, NIOS debug messages are logged to /tmp/nios.log.host_name instead of stderr. On Windows, NIOS debug messages are also logged to the temporary directory.

Default
Not defined
See also
LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR

LSF_NIOS_ERR_LOGDIR

Syntax
LSF_NIOS_ERR_LOGDIR=directory

Description
Applies to Windows only. If LSF_NIOS_ERR_LOGDIR is specified, logs NIOS errors to directory/nios.error.log.hostname.txt. If the attempt fails, LSF tries to write to another directory instead. The order is:
1. The specified log directory
2. LSF_TMPDIR
3. %TMP%
4. %TEMP%
5. The system directory, for example, C:\winnt

If LSF_NIOS_DEBUG is also specified, NIOS debugging overrides the LSF_NIOS_ERR_LOGDIR setting. LSF_NIOS_ERR_LOGDIR is an alternative to using the NIOS debug functionality. This parameter can also be defined as an environment variable.

Default
Not defined

See also
LSF_NIOS_DEBUG, LSF_CMD_LOGDIR

LSF_NIOS_JOBSTATUS_INTERVAL

Syntax
LSF_NIOS_JOBSTATUS_INTERVAL=time_minutes

Description
Time interval at which NIOS polls mbatchd to check if a job is still running. Applies to interactive batch jobs and blocking jobs. Use this parameter if you have scripts that depend on an exit code being returned. If this parameter is not defined and a network connection is lost, mbatchd cannot communicate with NIOS and the return code of a job is not retrieved.

When LSF_NIOS_JOBSTATUS_INTERVAL is defined, NIOS polls mbatchd on the defined interval to check if a job is still running (or pending). NIOS continues to poll mbatchd until it receives an exit code or mbatchd responds that the job does not exist (if the job has already been cleaned from memory, for example).

For interactive jobs, NIOS polls mbatchd to retrieve a job's exit status when this parameter is enabled and:
v The connection between NIOS and the job RES is broken. For example, a network failure between the submission host and execution host occurs.
v The job RES runs abnormally. For example, it is out of memory.
v The job is waiting for dispatch.
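The polling loop that this parameter enables has roughly this shape (an illustrative sketch, not the actual NIOS code; check_job is a hypothetical stand-in for the mbatchd status query):

```shell
#!/bin/sh
# Sketch of interval-based polling for a job's exit status.
check_job() {
  # Hypothetical stand-in: a real implementation would query mbatchd
  # (for example, by parsing bjobs output) for job "$1".
  # This stub reports an immediate exit code of 0.
  echo 0
}

# poll_exit_status INTERVAL JOBID: ask for the job status on each interval
# until an exit code (anything other than RUN or PEND) is returned.
poll_exit_status() {
  interval=$1; jobid=$2
  while :; do
    status=$(check_job "$jobid")
    case "$status" in
      RUN|PEND) sleep "$interval" ;;        # still running or pending: poll again
      *)        echo "$status"; return 0 ;; # exit code retrieved (or job gone)
    esac
  done
}

poll_exit_status 1 42   # with the stub above, prints 0 immediately
```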
For blocking jobs, NIOS always polls mbatchd to retrieve a job's exit status when this parameter is enabled. If an exit code cannot be retrieved, NIOS generates an error message and the code -11.

Valid values
Any integer greater than zero.

Default
Not defined

Notes
Set this parameter to large intervals such as 15 minutes or more so that performance is not negatively affected if interactive jobs are pending for too long. NIOS always calls mbatchd on the defined interval to confirm that a job is still pending, and this may add load to mbatchd.

See also
Environment variable LSF_NIOS_PEND_TIMEOUT

LSF_NIOS_MAX_TASKS

Syntax
LSF_NIOS_MAX_TASKS=integer

Description
Specifies the maximum number of NIOS tasks.

Default
Not defined

LSF_NIOS_RES_HEARTBEAT

Syntax
LSF_NIOS_RES_HEARTBEAT=time_minutes

Description
Applies only to interactive non-parallel batch jobs. Defines how long NIOS waits before sending a message to RES to determine if the connection is still open. Use this parameter to ensure NIOS exits when a network failure occurs instead of waiting indefinitely for notification that a job has been completed. When a network connection is lost, RES cannot communicate with NIOS and as a result, NIOS does not exit.

When this parameter is defined, if there has been no communication between RES and NIOS for the defined period of time, NIOS sends a message to RES to see if the connection is still open. If the connection is no longer available, NIOS exits.

Valid values
Any integer greater than zero

Default
Not defined

Notes
The value you set this parameter to depends on how long you want to allow NIOS to wait before exiting. Typically, it can be a number of hours or days. Too low a number may add load to the system.

LSF_NON_PRIVILEGED_PORTS

Syntax
LSF_NON_PRIVILEGED_PORTS=Y | N

Description
By default, LSF uses non-privileged ports for communication.
If LSF_NON_PRIVILEGED_PORTS=N, LSF clients (LSF commands and daemons) do not use privileged ports to communicate with daemons, and LSF daemons do not check privileged ports for incoming requests to do authentication.

To use privileged port communication, change the value of LSF_NON_PRIVILEGED_PORTS to N, then do the following:
1. Shut down the cluster.
2. Change the parameter value.
3. Restart the cluster so the new value takes effect.

For migration and compatibility for each cluster:
v If all hosts were upgraded to LSF 9.1 or above, then they work with non-privileged ports.
v If all hosts were upgraded to LSF 9.1 or above, but you want to use privileged ports for communication, then set LSF_NON_PRIVILEGED_PORTS to N and make sure that the value for LSB_MAX_JOB_DISPATCH_PER_SESSION is less than 300.
v If the master host is upgraded to LSF 9.1 or above, but some server hosts are still running older versions, and the value defined for LSB_MAX_JOB_DISPATCH_PER_SESSION is above 300, then no changes are required. If the value is less than 300, then you need to set LSF_NON_PRIVILEGED_PORTS=Y. This tells the old sbatchd to use non-privileged ports for communication.

Default
The default value is Y, which means LSF uses non-privileged port communication.

LSF_PAM_APPL_CHKPNT

Syntax
LSF_PAM_APPL_CHKPNT=Y | N

Description
When set to Y, allows PAM to function together with application checkpointing support.

Default
Y

LSF_PAM_CLEAN_JOB_DELAY

Syntax
LSF_PAM_CLEAN_JOB_DELAY=time_seconds

Description
The number of seconds LSF waits before killing a parallel job with failed tasks. Specifying LSF_PAM_CLEAN_JOB_DELAY implies that if any parallel tasks fail, the entire job should exit without running the other tasks in the job. The job is killed if any task exits with a non-zero exit code. Specify a value greater than or equal to zero (0).

Applies only to PAM jobs.
Default
Undefined: LSF kills the job immediately

LSF_PAM_HOSTLIST_USE

Syntax
LSF_PAM_HOSTLIST_USE=unique

Description
Used to start applications that use both OpenMP and MPI.

Valid values
unique

Default
Not defined

Notes
At job submission, LSF reserves the correct number of processors and PAM starts only one process per host. For example, to reserve 32 processors and run 4 processes per host, resulting in the use of 8 hosts:
bsub -n 32 -R "span[ptile=4]" pam yourOpenMPJob

Where defined
This parameter can alternatively be set as an environment variable. For example:
setenv LSF_PAM_HOSTLIST_USE unique

LSF_PAM_PLUGINDIR

Syntax
LSF_PAM_PLUGINDIR=path

Description
The path to libpamvcl.so. Used with LSF HPC features.

Default
Path to LSF_LIBDIR

LSF_PAM_USE_ASH

Syntax
LSF_PAM_USE_ASH=y | Y

Description
Enables LSF to use SGI IRIX Array Session Handles (ASH) to propagate signals to parallel jobs. See the IRIX system documentation and the array_session(5) man page for more information about array sessions.

Default
Not defined

LSF_PASSWD_DIR

Syntax
LSF_PASSWD_DIR=file_path
Default Not defined (that is, passwd.lsfuser is located in $LSF_CONFDIR) LSF_PE_NETWORK_NUM Syntax LSF_PE_NETWORK_NUM=num_networks Description For LSF IBM Parallel Environment (PE) integration. Specify a value between 0 and 8 to set the number of InfiniBand networks on the host. If the number is changed, run lsadmin reconfig and badmin mbdrestart to make the change take effect LSF_PE_NETWORK_NUM must be defined with a non-zero value in lsf.conf for LSF to collect network information to run PE jobs. Note: If LSF_PE_NETWORK_NUM is configured with a valid value, MAX_JOBID in lsb.params should not be configured with a value larger than 4194303 and MAX_JOB_ARRAY_SIZE in lsb.params should not be configured with a value larger than 1023. Otherwise, the jobID may not be represented correctly in PE and jobs do not run as expected. Example For example, hostA has two networks: 18338657685884897280 and 18338657685884897536. Each network has 256 windows. Set LSF_PE_NETWORK_NUM=2. Maximum 8 Chapter 1. Configuration Files 493 lsf.conf Default 0 LSF_PE_NETWORK_UPDATE_INTERVAL Syntax LSF_PE_NETWORK_UPDATE_INTERVAL=seconds Description For LSF IBM Parallel Environment (PE) integration. When LSF collects network information for PE jobs, LSF_PE_NETWORK_UPDATE_INTERVAL specifies the interval for updating network information. Default 15 seconds LSF_PIM_INFODIR Syntax LSF_PIM_INFODIR=path Description The path to where PIM writes the pim.info.host_name file. Specifies the path to where the process information is stored. The process information resides in the file pim.info.host_name. The PIM also reads this file when it starts so that it can accumulate the resource usage of dead processes for existing process groups. EGO parameter EGO_PIM_INFODIR Default Not defined. The system uses /tmp. LSF_PIM_LINUX_ENHANCE Syntax LSF_PIM_LINUX_ENHANCE=Y | N Description When enabled, the PIM daemon reports proportional memory utilization for each process attached to a shared memory segment. 
The sum total of memory utilization of all processes on the host is now accurately reflected in the total memory used. (The Linux kernel must be version 2.6.14 or newer.) 494 Platform LSF Configuration Reference lsf.conf When EGO_PIM_SWAP_REPORT is set, the swap amount is correctly reported. The swap amount is the virtual memory minus the value of the rss value in the static Linux file. Applies only to Linux operating systems and Red Hat Enterprise Linux 4.7.5.0. Default Set to Y at time of installation. If otherwise undefined, then N. LSF_PIM_SLEEPTIME Syntax LSF_PIM_SLEEPTIME=time_seconds Description The reporting period for PIM. PIM updates the process information every 15 minutes unless an application queries this information. If an application requests the information, PIM updates the process information every LSF_PIM_SLEEPTIME seconds. If the information is not queried by any application for more than 5 minutes, the PIM reverts back to the 15 minute update period. EGO parameter EGO_PIM_SLEEPTIME Default 30 seconds LSF_PIM_SLEEPTIME_UPDATE Syntax LSF_PIM_SLEEPTIME_UPDATE=y | n Description UNIX only. Use this parameter to improve job throughput and reduce a job’s start time if there are many jobs running simultaneously on a host. This parameter reduces communication traffic between sbatchd and PIM on the same host. When this parameter is not defined or set to n, sbatchd queries PIM as needed for job process information. When this parameter is defined, sbatchd does not query PIM immediately as it needs information; sbatchd only queries PIM every LSF_PIM_SLEEPTIME seconds. Limitations When this parameter is defined: Chapter 1. Configuration Files 495 lsf.conf v sbatchd may be intermittently unable to retrieve process information for jobs whose run time is smaller than LSF_PIM_SLEEPTIME. v It may take longer to view resource usage with bjobs -l. EGO parameter EGO_PIM_SLEEPTIME_UPDATE Default Set to y at time of installation. Otherwise, not defined. 
LSF_POE_TIMEOUT_BIND

Syntax
LSF_POE_TIMEOUT_BIND=time_seconds

Description
This parameter applies to the PAM Taskstarter. It specifies the time in seconds for the poe_w wrapper to keep trying to set up a server socket to listen on. poe_w is the wrapper for the IBM poe driver program. LSF_POE_TIMEOUT_BIND can also be set as an environment variable for poe_w to read.

Default
120 seconds

LSF_POE_TIMEOUT_SELECT

Syntax
LSF_POE_TIMEOUT_SELECT=time_seconds

Description
This parameter applies to the PAM Taskstarter. It specifies the time in seconds for the poe_w wrapper to wait for connections from the pmd_w wrapper. pmd_w is the wrapper for pmd (the IBM PE Partition Manager Daemon). LSF_POE_TIMEOUT_SELECT can also be set as an environment variable for poe_w to read.

Default
160 seconds

LSF_PROCESS_TRACKING

Syntax
LSF_PROCESS_TRACKING=Y|y|N|n

Description
Enable this parameter to track processes based on job control functions such as termination, suspension, resume, and other signaling, on Linux systems that support the cgroups freezer subsystem. Once enabled, this parameter takes effect for new jobs. Disable this parameter if you want LSF to depend on PIM's updates for tracking the relationship between jobs and processes.

Default
Set to Y at time of installation. If otherwise undefined, then N.

LSF_REMOTE_COPY_CMD

Syntax
LSF_REMOTE_COPY_CMD="copy_command"

Description
UNIX only. Specifies the shell command or script to use with the following LSF commands if RES fails to copy the file between hosts:
v lsrcp
v bsub -i, -f, -is, -Zs
v bmod -Zs

By default, rcp is used for these commands. There is no need to restart any daemons when this parameter changes.

For example, to use scp instead of rcp for remote file copying, specify:
LSF_REMOTE_COPY_CMD="scp -B -o 'StrictHostKeyChecking no'"

You can also configure ssh options such as BatchMode and StrictHostKeyChecking in the global SSH_ETC/ssh_config file or $HOME/.ssh/config.
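The copy command can also be a script. RES appends the source path and the destination path as the final two arguments, so a minimal wrapper might look like this (a sketch; remote_copy is a hypothetical name, and plain cp stands in for scp or another remote copy tool so the sketch stays runnable):

```shell
#!/bin/sh
# Sketch of a custom copy command body for LSF_REMOTE_COPY_CMD.
# RES invokes the configured command with two arguments appended:
# the source file path and the destination file path.
remote_copy() {
  src=$1; dst=$2
  # A real wrapper would call scp (or similar) here; cp keeps the
  # sketch self-contained.
  if ! cp -- "$src" "$dst"; then
    # LSF sets LSF_LSRCP_ERRNO when the RES copy failed; logging it
    # helps diagnose why the fallback command was invoked.
    echo "copy failed (LSF_LSRCP_ERRNO=${LSF_LSRCP_ERRNO:-unset})" >&2
    return 1
  fi
}
```

Installed as an executable script, it would be configured as, for example, LSF_REMOTE_COPY_CMD="/shared/lsf/bin/remote_copy.sh" (path hypothetical).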
When remote copy of a file via RES fails, the environment variable LSF_LSRCP_ERRNO is set to the system-defined errno. You can use this variable in a self-defined shell script executed by lsrcp. The script can do the appropriate cleanup, recopy, or retry, or it can just exit without invoking any other copy command.

LSF automatically appends two parameters before executing the command:
v The first parameter is the source file path.
v The second parameter is the destination file path.

Valid values
Values are passed directly through. Any valid scp, rcp, or custom copy commands and options are supported, except for compound multi-commands. For example, set LSF_REMOTE_COPY_CMD="scp -B -o 'StrictHostKeyChecking no'".

To avoid a recursive loop, the value of LSF_REMOTE_COPY_CMD must not be lsrcp or a shell script executing lsrcp.

Default
Not defined.

LSF_RES_ACCT

Syntax
LSF_RES_ACCT=time_milliseconds | 0

Description
If this parameter is defined, RES logs information for completed and failed tasks by default (see lsf.acct). The value for LSF_RES_ACCT is specified in terms of consumed CPU time (milliseconds). Only tasks that have consumed more than the specified CPU time are logged. If this parameter is defined as LSF_RES_ACCT=0, then all tasks are logged.

For those tasks that consume the specified amount of CPU time, RES generates a record and appends the record to the task log file lsf.acct.host_name. This file is located in the LSF_RES_ACCTDIR directory.

If this parameter is not defined, the LSF administrator must use the lsadmin command (see lsadmin) to turn task logging on after RES has started.

Default
Not defined

See also
LSF_RES_ACCTDIR

LSF_RES_ACCTDIR

Syntax
LSF_RES_ACCTDIR=directory

Description
The directory in which the RES task log file lsf.acct.host_name is stored. If LSF_RES_ACCTDIR is not defined, the log file is stored in the /tmp directory.
Default
(UNIX) /tmp
(Windows) C:\temp

See also
LSF_RES_ACCT

LSF_RES_ACTIVE_TIME

Syntax
LSF_RES_ACTIVE_TIME=time_seconds

Description
Time in seconds before LIM reports that RES is down.

Minimum value
10 seconds

Default
90 seconds

LSF_RES_ALIVE_TIMEOUT

Syntax
LSF_RES_ALIVE_TIMEOUT=time_seconds

Description
Controls how long the task res on non-first execution hosts waits (in seconds) before cleaning up the job. If set to 0, this parameter is disabled. Restart all res in the cluster after setting or changing this parameter.

Default
60 seconds

LSF_RES_CLIENT_TIMEOUT

Syntax
LSF_RES_CLIENT_TIMEOUT=time_minutes

Description
Specifies in minutes how long an application RES waits for a new task before exiting.

CAUTION: If you use the LSF API to run remote tasks and you define this parameter with a timeout, the remote execution of the new task fails (for example, ls_rtask()).

Default
The parameter is not set; the application RES waits indefinitely for a new task until the client tells it to quit.

LSF_RES_CONNECT_RETRY

Syntax
LSF_RES_CONNECT_RETRY=integer | 0

Description
The number of attempts by RES to reconnect to NIOS. If LSF_RES_CONNECT_RETRY is not defined, the default value is used.

Default
0

See also
LSF_NIOS_RES_HEARTBEAT

LSF_RES_DEBUG

Syntax
LSF_RES_DEBUG=1 | 2

Description
Sets RES to debug mode. If LSF_RES_DEBUG is defined, the Remote Execution Server (RES) operates in single-user mode. No security checking is performed, so RES should not run as root. RES does not look in the services database for the RES service port number. Instead, it uses port number 36002 unless LSF_RES_PORT has been defined. Specify 1 for this parameter unless you are testing RES.

Valid values
LSF_RES_DEBUG=1
RES runs in the background with no associated control terminal.
LSF_RES_DEBUG=2
RES runs in the foreground and prints error messages to tty.
Default
Not defined

See also
LSF_LIM_DEBUG, LSF_CMD_LOGDIR, LSF_CMD_LOG_MASK, LSF_LOG_MASK, LSF_LOGDIR

LSF_RES_PORT
See LSF_LIM_PORT, LSF_RES_PORT, LSB_MBD_PORT, LSB_SBD_PORT.

LSF_RES_RLIMIT_UNLIM

Syntax
LSF_RES_RLIMIT_UNLIM=cpu | fsize | data | stack | core | vmem

Description
By default, RES sets the hard limits for a remote task to be the same as the hard limits of the local process. This parameter specifies those hard limits that are to be set to unlimited instead of inheriting those of the local process. Valid values are cpu, fsize, data, stack, core, and vmem, for CPU, file size, data size, stack, core size, and virtual memory limits, respectively.

Example
The following example sets the CPU, core size, and stack hard limits to be unlimited for all remote tasks:
LSF_RES_RLIMIT_UNLIM="cpu core stack"

Default
Not defined

LSF_RES_TIMEOUT

Syntax
LSF_RES_TIMEOUT=time_seconds

Description
Timeout when communicating with RES.

Default
15

LSF_ROOT_REX

Syntax
LSF_ROOT_REX=all | text

Description
UNIX only. Specifies the root execution privileges for jobs from local and remote hosts.

If defined as any value in the local cluster, allows root remote execution privileges (subject to identification checking) for jobs from local hosts in the same cluster, for both interactive and batch jobs. Causes RES to accept requests from root on other local hosts in the same cluster, subject to identification checking.

If defined as all in Platform MultiCluster, allows root remote execution privileges (subject to identification checking) for jobs from remote and local cluster hosts, for both interactive and batch jobs. Causes RES to accept requests from the superuser (root) on local and remote hosts, subject to identification checking.

If LSF_ROOT_REX is not defined, remote execution requests from root are refused.

Default
Not defined. Root execution is not allowed.
See also
LSF_AUTH, LSF_DISABLE_LSRUN

LSF_RSH

Syntax
LSF_RSH=command [command_options]

Description
Specifies shell commands to use when the following LSF commands require remote execution:
v badmin hstartup
v bpeek
v lsadmin limstartup
v lsadmin resstartup
v lsfrestart
v lsfshutdown
v lsfstartup
v lsrcp

By default, rsh is used for these commands. Use LSF_RSH to enable support for ssh.

EGO parameter
EGO_RSH

Default
Not defined

Example
To use an ssh command before trying rsh for LSF commands, specify:
LSF_RSH="ssh -o 'PasswordAuthentication no' -o 'StrictHostKeyChecking no'"

ssh options such as PasswordAuthentication and StrictHostKeyChecking can also be configured in the global SSH_ETC/ssh_config file or $HOME/.ssh/config.

See also
ssh, ssh_config

LSF_SECUREDIR

Syntax
LSF_SECUREDIR=path

Description
Windows only; mandatory if using lsf.sudoers. Path to the directory that contains the file lsf.sudoers (shared on an NTFS file system).

LSF_SERVER_HOSTS

Syntax
LSF_SERVER_HOSTS="host_name ..."

Description
Defines one or more server hosts that the client should contact to find a Load Information Manager (LIM). LSF server hosts are hosts that run LSF daemons and provide load-sharing services. Client hosts are hosts that only run LSF commands or applications but do not provide services to any hosts.

Important: LSF_SERVER_HOSTS is required for non-shared slave hosts.

Use this parameter to ensure that commands execute successfully when no LIM is running on the local host, or when the local LIM has just started. The client contacts the LIM on one of the LSF_SERVER_HOSTS and executes the command, provided that at least one of the hosts defined in the list has a LIM that is up and running. If LSF_SERVER_HOSTS is not defined, the client tries to contact the LIM on the local host.

The host names in LSF_SERVER_HOSTS must be enclosed in quotes and separated by white space.
For example:
LSF_SERVER_HOSTS="hostA hostD hostB"

The parameter string can include up to 4094 characters for UNIX or 255 characters for Windows.

Interaction with LSF_MASTER_LIST
Starting in LSF 7, LSF_MASTER_LIST must be defined in lsf.conf. You can use the same list of hosts, or a subset of the master host list, in LSF_SERVER_HOSTS. If you include the primary master host in LSF_SERVER_HOSTS, you should define it as the last host of the list.

If LSF_ADD_CLIENTS is defined in install.config at installation, lsfinstall automatically appends the hosts in LSF_MASTER_LIST to the list of hosts in LSF_SERVER_HOSTS so that the primary master host is last. For example:
LSF_MASTER_LIST="lsfmaster hostE"
LSF_SERVER_HOSTS="hostB hostC hostD hostE lsfmaster"
LSF_ADD_CLIENTS="clientHostA"

The value of LSF_SERVER_HOSTS is not changed during upgrade.

Default
Not defined

See also
LSF_MASTER_LIST

LSF_SHELL_AT_USERS

Syntax
LSF_SHELL_AT_USERS="user_name user_name ..."

Description
Applies to lstcsh only. Specifies users who are allowed to use @ for host redirection. Users not specified with this parameter cannot use host redirection in lstcsh. To specify a Windows user account, include the domain name in uppercase letters (DOMAIN_NAME\user_name).

If this parameter is not defined, all users are allowed to use @ for host redirection in lstcsh.

Default
Not defined

LSF_SHIFT_JIS_INPUT

Syntax
LSF_SHIFT_JIS_INPUT=y | n

Description
Enables LSF to accept Shift-JIS character encoding for job information (for example, user names, queue names, job names, job group names, project names, commands and arguments, esub parameters, external messages, etc.)
Default
n

LSF_STRICT_CHECKING

Syntax
LSF_STRICT_CHECKING=Y

Description
If set, enables more strict checking of communications between LSF daemons and between LSF commands and daemons when LSF is used in an untrusted environment, such as a public network like the Internet.

If you enable this parameter, you must enable it in the entire cluster, as it affects all communications within LSF. If it is used in a MultiCluster environment, it must be enabled in all clusters, or none. Ensure that all binaries and libraries are upgraded to LSF Version 7, including LSF_BINDIR, LSF_SERVERDIR and LSF_LIBDIR directories, if you enable this parameter.

If your site uses any programs that use the LSF base and batch APIs, or LSF MPI (Message Passing Interface), they need to be recompiled using the LSF Version 7 APIs before they can work properly with this option enabled.

Important: You must shut down the entire cluster before enabling or disabling this parameter.

If LSF_STRICT_CHECKING is defined, and your cluster has slave hosts that are dynamically added, LSF_STRICT_CHECKING must be configured in the local lsf.conf on all slave hosts.

Valid value
Set to Y to enable this feature.

Default
Not defined. LSF is secure in trusted environments.

LSF_STRICT_RESREQ

Syntax
LSF_STRICT_RESREQ=Y | N

Description
When LSF_STRICT_RESREQ=Y, the resource requirement selection string must conform to the stricter resource requirement syntax described in Administering IBM Platform LSF. The strict resource requirement syntax only applies to the select section. It does not apply to the other resource requirement sections (order, rusage, same, span, or cu).

When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

When LSF_STRICT_RESREQ=N, the default resource requirement selection string evaluation is performed.
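As a toy illustration of the rusage restriction just described, the following sketch rejects a resource requirement string whose rusage section names a non-consumable resource. This is not LSF code, and `bigmem` is a hypothetical boolean (therefore non-consumable) resource; the real validation is performed by LSF itself at submission time.

```python
import re

# Hypothetical non-consumable (boolean) resource for illustration only.
NON_CONSUMABLE = {"bigmem"}

def rusage_ok(resreq):
    """Return False if the rusage section names a non-consumable resource."""
    match = re.search(r"rusage\[([^\]]*)\]", resreq)
    if not match:
        return True  # no rusage section, nothing to reject
    names = {item.split("=")[0].strip() for item in match.group(1).split(",")}
    return not (names & NON_CONSUMABLE)

print(rusage_ok("select[bigmem] rusage[mem=4000]"))  # True: bigmem is in select
print(rusage_ok("rusage[bigmem=1]"))                 # False: non-consumable in rusage
```

Under strict syntax, the boolean resource belongs in the select section; only consumable resources such as mem appear in rusage.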
Default
Set to Y at time of installation. If otherwise undefined, then N.

LSF_STRIP_DOMAIN

Syntax
LSF_STRIP_DOMAIN=domain_suffix[:domain_suffix ...]

Description
(Optional) If all of the hosts in your cluster can be reached using short host names, you can configure LSF to use the short host names by specifying the portion of the domain name to remove. If your hosts are in more than one domain or have more than one domain name, you can specify more than one domain suffix to remove, separated by a colon (:).

Example:
LSF_STRIP_DOMAIN=.example.com:.generic.com

In the above example, LSF accepts hostA, hostA.example.com, and hostA.generic.com as names for hostA, and uses the name hostA in all output. The leading period '.' is required.

Setting this parameter only affects host names displayed through LSF; it does not affect DNS host lookup.

After adding or changing LSF_STRIP_DOMAIN, use lsadmin reconfig and badmin mbdrestart to reconfigure your cluster.

EGO parameter
EGO_STRIP_DOMAIN

Default
Not defined

LSF_TIME_CMD

Syntax
LSF_TIME_CMD=timing_level

Description
The timing level for checking how long LSF commands run. Time usage is logged in milliseconds. Specify a positive integer.

Default
Not defined

See also
LSB_TIME_MBD, LSB_TIME_SBD, LSB_TIME_CMD, LSF_TIME_LIM, LSF_TIME_RES

LSF_TIME_LIM

Syntax
LSF_TIME_LIM=timing_level

Description
The timing level for checking how long LIM routines run. Time usage is logged in milliseconds. Specify a positive integer.

EGO parameter
EGO_TIME_LIM

Default
Not defined

See also
LSB_TIME_CMD, LSB_TIME_MBD, LSB_TIME_SBD, LSF_TIME_RES

LSF_TIME_RES

Syntax
LSF_TIME_RES=timing_level

Description
The timing level for checking how long RES routines run. Time usage is logged in milliseconds. Specify a positive integer.

LSF_TIME_RES is not supported on Windows.

Default
Not defined

See also
LSB_TIME_CMD, LSB_TIME_MBD, LSB_TIME_SBD, LSF_TIME_LIM
LSF_TMPDIR

Syntax
LSF_TMPDIR=directory

Description
Specifies the path and directory for temporary job output.

When LSF_TMPDIR is defined in lsf.conf, LSF uses the directory specified by LSF_TMPDIR on the execution host when a job is started and creates a job-specific temporary directory as a subdirectory of the LSF_TMPDIR directory.

The name of the job-specific temporary directory has the following format:
v For regular jobs:
– Unix: $LSF_TMPDIR/jobID.tmpdir
– Windows: %LSF_TMPDIR%\jobID.tmpdir
v For array jobs:
– Unix: $LSF_TMPDIR/arrayID_arrayIndex.tmpdir
– Windows: %LSF_TMPDIR%\arrayID_arrayIndex.tmpdir

On UNIX, the directory has the permission 0700 and is owned by the execution user.

After adding LSF_TMPDIR to lsf.conf, use badmin hrestart all to reconfigure your cluster.

If LSB_SET_TMPDIR=Y, the environment variable TMPDIR will be set to the job-specific temporary directory.

Valid values
Specify any valid path up to a maximum length of 256 characters. The 256 character maximum path length includes the temporary directories and files that the system creates as jobs run. The path that you specify for LSF_TMPDIR should be as short as possible to avoid exceeding this limit.

UNIX
Specify an absolute path. For example:
LSF_TMPDIR=/usr/share/lsf_tmp

Windows
Specify a UNC path or a path with a drive letter. For example:
LSF_TMPDIR=\\HostA\temp\lsf_tmp
LSF_TMPDIR=D:\temp\lsf_tmp

Temporary directory for tasks launched by blaunch
By default, LSF creates the job-specific temporary directory only on the first execution host.
To create a job-specific temporary directory on each execution host, set LSB_SET_TMPDIR=Y so that the path of the job-specific temporary directory is available through the TMPDIR environment variable, or set LSB_SET_TMPDIR to a user-defined environment variable so that the path of the job-specific temporary directory is available through the user-defined environment variable.

Tasks launched through the blaunch distributed application framework make use of the job-specific temporary directory:
v When the job-specific temporary directory environment variable is set on the first execution host, the blaunch framework propagates this environment variable to all execution hosts when launching remote tasks
v The job RES or the task RES creates the job-specific temporary directory if it does not already exist before starting the job
v The directory created by the job RES or task RES has permission 0700 and is owned by the execution user
v If the job-specific temporary directory was created by the task RES, LSF deletes the directory and its contents when the task is complete
v If the job-specific temporary directory was created by the job RES, LSF deletes the directory and its contents when the job is done
v If the job-specific temporary directory is on a shared file system, it is assumed to be shared by all the hosts allocated to the blaunch job, so LSF does not remove the directories created by the job RES or task RES

Default
By default, LSF_TMPDIR is not enabled.
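The job-specific directory names documented for LSF_TMPDIR follow a simple pattern; the sketch below builds the UNIX form of those names. It is purely illustrative — LSF itself, not this code, creates the directories (with permission 0700).

```python
import posixpath

# Illustrative only: build the job-specific temporary directory names
# documented for LSF_TMPDIR (UNIX path form).
def job_tmpdir(lsf_tmpdir, job_id, array_index=None):
    if array_index is None:
        name = "%s.tmpdir" % job_id                     # regular job: jobID.tmpdir
    else:
        name = "%s_%s.tmpdir" % (job_id, array_index)   # array job: arrayID_arrayIndex.tmpdir
    return posixpath.join(lsf_tmpdir, name)

print(job_tmpdir("/usr/share/lsf_tmp", 1234))     # /usr/share/lsf_tmp/1234.tmpdir
print(job_tmpdir("/usr/share/lsf_tmp", 1234, 7))  # /usr/share/lsf_tmp/1234_7.tmpdir
```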
If LSF_TMPDIR is not specified in lsf.conf, this parameter is defined as follows:
v On UNIX: $TMPDIR or /tmp
v On Windows: %TMP%, %TEMP%, or %SystemRoot%

LSF_UNIT_FOR_LIMITS

Syntax
LSF_UNIT_FOR_LIMITS=unit

Description
Enables scaling of large units in the resource usage limits:
v core
v memory
v stack
v swap

When set, LSF_UNIT_FOR_LIMITS applies cluster-wide to these limits at the job level (bsub), queue level (lsb.queues), and application level (lsb.applications). This parameter alters the meaning of all numeric values in lsb.resources to match the unit set, whether in gpool, limits, hostexport, or other sections. It also controls the resource rusage attached to the job and the memory amount that defines the size of a package in GSLA.

The limit unit specified by LSF_UNIT_FOR_LIMITS also applies to these limits when modified with bmod, and the display of these resource usage limits in query commands (bacct, bapp, bhist, bhosts, bjobs, bqueues, lsload, and lshosts).

Important: Before changing the units of your resource usage limits, you should completely drain the cluster of all workload. There should be no running, pending, or finished jobs in the system.

In a MultiCluster environment, you should configure the same unit for all clusters.

Note: Other limits (such as the file limit) are not affected by setting the parameter LSF_UNIT_FOR_LIMITS.

Example
A job is submitted with bsub -M 100 and LSF_UNIT_FOR_LIMITS=MB; the memory limit for the job is 100 MB rather than the default 100 KB.

Valid values
unit indicates the unit for the resource usage limit, one of:
v KB (kilobytes)
v MB (megabytes)
v GB (gigabytes)
v TB (terabytes)
v PB (petabytes)
v EB (exabytes)

Default
Set to MB at time of installation. If LSF_UNIT_FOR_LIMITS is not defined in lsf.conf, then the default setting is KB, and for RUSAGE it is MB.
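The scaling behavior can be pictured with a small sketch. This is not LSF code; it assumes each unit in the list above is a factor of 1024 larger than the previous one, starting from KB (the unit used when the parameter is undefined).

```python
# Sketch of how a numeric limit such as `bsub -M 100` is interpreted under
# different LSF_UNIT_FOR_LIMITS settings. Assumes 1024-based units from KB up.
UNITS = ["KB", "MB", "GB", "TB", "PB", "EB"]

def limit_in_kb(value, unit_for_limits="KB"):
    """Normalize a resource usage limit to kilobytes."""
    if unit_for_limits.upper() not in UNITS:
        raise ValueError("invalid LSF_UNIT_FOR_LIMITS: %s" % unit_for_limits)
    return value * (1024 ** UNITS.index(unit_for_limits.upper()))

# With LSF_UNIT_FOR_LIMITS=MB, `bsub -M 100` means 100 MB, i.e. 102400 KB.
print(limit_in_kb(100, "MB"))  # 102400
```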
LSF_USE_HOSTEQUIV

Syntax
LSF_USE_HOSTEQUIV=y | Y

Description
(UNIX only; optional) If LSF_USE_HOSTEQUIV is defined, RES and mbatchd call the ruserok() function to decide if a user is allowed to run remote jobs. The ruserok() function checks in the /etc/hosts.equiv file and the user's $HOME/.rhosts file to decide if the user has permission to execute remote jobs.

If LSF_USE_HOSTEQUIV is not defined, all normal users in the cluster can execute remote jobs on any host.

If LSF_ROOT_REX is set, root can also execute remote jobs with the same permission test as for normal users.

Default
Not defined

See also
LSF_ROOT_REX

LSF_USER_DOMAIN

Syntax
LSF_USER_DOMAIN=domain_name:domain_name:domain_name...

Description
Enables the UNIX/Windows user account mapping feature, which allows cross-platform job submission and execution in a mixed UNIX/Windows environment. LSF_USER_DOMAIN specifies one or more Windows domains that LSF either strips from the user account name when a job runs on a UNIX host, or adds to the user account name when a job runs on a Windows host.

Important: Configure LSF_USER_DOMAIN immediately after you install LSF; changing this parameter in an existing cluster requires that you verify and possibly reconfigure service accounts, user group memberships, and user passwords.

Specify one or more Windows domains, separated by a colon (:). You can enter an unlimited number of Windows domains. A period (.) specifies a local account, not a domain.

Examples
LSF_USER_DOMAIN=BUSINESS
LSF_USER_DOMAIN=BUSINESS:ENGINEERING:SUPPORT

Default
The default depends on your LSF installation:
v If you upgrade a cluster to LSF version 7, the default is the existing value of LSF_USER_DOMAIN, if defined
v For a new cluster, this parameter is not defined, and UNIX/Windows user account mapping is not enabled

LSF_VPLUGIN

Syntax
LSF_VPLUGIN=path
Description
The full path to the vendor MPI library libxmpi.so. Used with LSF HPC features.

Examples
v IBM Platform MPI: LSF_VPLUGIN=/opt/mpi/lib/pa1.1/libmpirm.sl
v MPI: LSF_VPLUGIN=/usr/lib32/libxmpi.so
v Linux (64-bit x86 Linux 2.6, glibc 2.3): LSF_VPLUGIN=/usr/lib32/libxmpi.so:/usr/lib/libxmpi.so:/usr/lib64/libxmpi.so

Default
Not defined

LSF_WINDOWS_HOST_TYPES

Syntax
LSF_WINDOWS_HOST_TYPES="HostType1 HostType2 HostType3 ..."

Description
Use this parameter to set the Windows host type in mixed cluster environments, with a UNIX master host and Windows clients. Set LSF_WINDOWS_HOST_TYPES in lsf.conf to configure Windows host types. If it is not defined and the Windows clients are of types other than the defaults (NTX86, NTX64, and NTIA64), running lspasswd on the Windows server returns an error message.

Except for NTX86, NTX64, and NTIA64, all Windows host types defined in LSF_WINDOWS_HOST_TYPES must be defined in the HostType section of the lsf.shared file. If a type is not defined in HostType, LSF issues a warning message (except for the three default types) when starting or restarting lim.

After changing LSF_WINDOWS_HOST_TYPES, run lsadmin limrestart for changes to take effect.

Examples
Suppose LSF_WINDOWS_HOST_TYPES="NTX86 NTX64 aaa" but "aaa" is not defined in lsf.shared. After lim startup, the log file shows:
Feb 27 05:09:10 2013 15150 3 1.2.7 dotypelist(): /.../conf/lsf.shared: The host type defined by LSF_WINDOWS_HOST_TYPES aaa should also be defined in lsf.shared.

Default
LSF_WINDOWS_HOST_TYPES="NTX86 NTX64 NTIA64"

XLSF_APPDIR

Syntax
XLSF_APPDIR=directory

Description
(UNIX only; optional) Directory in which X application default files for LSF products are installed. The LSF commands that use X look in this directory to find the application defaults. Users do not need to set environment variables to use the LSF X applications. The application default files are platform-independent.
Default
LSF_INDEP/misc

XLSF_UIDDIR

Syntax
XLSF_UIDDIR=directory

Description
(UNIX only) Directory in which Motif User Interface Definition files are stored. These files are platform-specific.

Default
LSF_LIBDIR/uid

MC_PLUGIN_REMOTE_RESOURCE

Syntax
MC_PLUGIN_REMOTE_RESOURCE=y

Description
MultiCluster job forwarding model only. By default, the submission cluster does not consider remote resources. Define MC_PLUGIN_REMOTE_RESOURCE=y in the submission cluster to allow consideration of remote resources.

Note: When MC_PLUGIN_REMOTE_RESOURCE is defined, only the following resource requirements (boolean only) are supported: -R "type==type_name", -R "same[type]" and -R "defined(resource_name)"

Note: When MC_PLUGIN_SCHEDULE_ENHANCE in lsb.params is defined, remote resources are considered as if MC_PLUGIN_REMOTE_RESOURCE=Y regardless of the actual value. In addition, details of the remote cluster workload are considered by the submission cluster scheduler.

Default
Not defined. The submission cluster does not consider remote resources.

See also
MC_PLUGIN_SCHEDULE_ENHANCE in lsb.params

lsf.datamanager

The lsf.datamanager file controls the operation of IBM Platform Data Manager for LSF features. There is one LSF data management configuration file for each cluster, called lsf.datamanager.cluster_name. The cluster_name suffix is the name of the cluster that is defined in the Cluster section of lsf.shared. The file is read by the LSF data management daemon dmd. Since one LSF data manager can serve multiple LSF clusters, the contents of this file must be identical on each cluster that shares LSF data manager.

Changing lsf.datamanager configuration
After you change lsf.datamanager.cluster_name, run bdata admin reconfig to reconfigure LSF data manager on all candidate hosts.

Location
The lsf.datamanager file is located in the directory that is defined by LSF_ENVDIR.
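Like other LSF configuration files, lsf.datamanager is organized into Begin <Section> ... End <Section> blocks. A minimal parser sketch illustrates that layout (illustrative only — this is not the parser that dmd actually uses):

```python
# Illustrative sketch: split an LSF-style configuration file into its
# "Begin <Name> ... End <Name>" sections. Not the parser used by dmd.
def parse_sections(text):
    sections, current, body = {}, None, []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Begin "):
            current, body = line.split(None, 1)[1], []
        elif line.startswith("End ") and current is not None:
            sections[current] = body
            current = None
        elif current is not None and line:
            body.append(line)
    return sections

conf = """Begin Parameters
ADMINS=lsfadmin
STAGING_AREA=/vol/dmd_cache
End Parameters"""
print(parse_sections(conf))  # {'Parameters': ['ADMINS=lsfadmin', 'STAGING_AREA=/vol/dmd_cache']}
```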
Structure
The lsf.datamanager.cluster_name file contains two configuration sections:

Parameters section
The Parameters section of lsf.datamanager configures LSF data manager administrators, the default file transfer command, the location of the LSF data manager staging area, file permissions on the LSF data manager cache, and the grace period for LSF data manager cache cleanup and other LSF data manager operations.

RemoteDataManagers section
Optional. The RemoteDataManagers section tells a local LSF data manager how to communicate with remote LSF data managers in MultiCluster forwarding clusters. Only the cluster that is sending jobs needs to configure the RemoteDataManagers section.

Parameters section
The Parameters section of lsf.datamanager configures LSF data manager administrators, the default file transfer command, the location of the LSF data manager staging area, file permissions on the LSF data manager cache, and the grace period for LSF data manager cache cleanup and other LSF data manager operations.

Parameters
v ADMINS
v CACHE_INPUT_GRACE_PERIOD
v CACHE_OUTPUT_GRACE_PERIOD
v CACHE_PERMISSIONS
v FILE_TRANSFER_CMD
v QUERY_NTHREADS
v STAGING_AREA
v REMOTE_CACHE_REFRESH_INTERVAL

ADMINS

Syntax
ADMINS=user_name [user_name ... ]

Description
The specified users can reconfigure and shut down the LSF data manager daemon (dmd), and use the bdata tags subcommand to list or clean intermediate files that are associated with a tag for users. LSF data manager administrators must be administrators in each of the connected LSF clusters.

You cannot configure the root user as an LSF data manager administrator, but root has the same privileges as an LSF data manager administrator.

The parameter takes effect after you restart or reconfigure LSF data manager.

Valid values
Any LSF administrator user account, except root.
ADMINS is required, and lists the LSF data manager administrator users.

Example
ADMINS=lsfadmin user1 user2

Default
None. Required parameter.

CACHE_INPUT_GRACE_PERIOD

Minimum time in minutes that an input file is kept in the LSF data manager cache after no jobs reference it.

Syntax
CACHE_INPUT_GRACE_PERIOD=minutes

Description
After the specified number of minutes, the job that is associated with the input file can no longer be queried through bdata and the files that it requested to be staged in are physically deleted from the cache.

The grace period for the input file begins when no data jobs request the file and no transfer jobs are transferring it. The input file grace period does not apply to transfers with ERROR status.

The parameter takes effect after you restart or reconfigure LSF data manager.

Valid values
1 - 2147483647 minutes

Default
1440 minutes (1 day)

CACHE_OUTPUT_GRACE_PERIOD

Syntax
CACHE_OUTPUT_GRACE_PERIOD=minutes

Description
After the specified number of minutes, the job that is associated with the output file can no longer be queried through bdata and the files that it requested to be staged out are physically deleted from the cache.

The grace period for the output file begins when all the output file records associated to the same job reach TRANSFERRED or ERROR status. However, the files and job records are not cleaned up until the grace periods expire for all stage-out requirements associated with the same job. Output files can be queried until the grace period expires for all output file records associated with the job.

The grace period on the output file begins whether the file transfer completes successfully or unsuccessfully. Files are cleaned up only after the grace periods of all stage-out requirements for the job expire.
After the grace period expires, the stage-out records are removed, and you cannot search the stage-out history from LSF data manager.

The grace period does not apply to files uploaded to the cache with the bstage out -tag command. You must use bdata tags clean to clean up these files manually.

The parameter takes effect after you restart or reconfigure LSF data manager.

CACHE_OUTPUT_GRACE_PERIOD is the minimum time in minutes that an output file is kept in the LSF data manager cache after its transfer completes, either successfully or unsuccessfully.

Valid values
1 - 2147483647 minutes

Default
180 minutes (3 hours)

CACHE_PERMISSIONS

Sets file permissions and ownership of the LSF data manager staging area subdirectories.

Syntax
CACHE_PERMISSIONS=user | group | all

Description
CACHE_PERMISSIONS=user
By default, files are stored in a user-specific subdirectory in the staging area. All files are owned by that user with 700 file permission mode. If two users ask for the same file to be pre-staged with bsub -data, LSF transfers a separate
If two users ask for the same file to be pre-staged, only one copy of the file is pre-staged, owned by the first requesting user. Users can query any file in the cache with bdata cache. | | | | Note: The value of CACHE_PERMISSIONS affects the directory structure of the staging area cache, which LSF data manager depends on for recovery. To avoid losing the cache, do not change the value of CACHE_PERMISSIONS between LSF data manager restarts. | | | | If you must change the value of CACHE_PERMISSIONS after the cache is already in use, files in the cache subdirectory $STAGING_AREA/stgin/$CACHE_PERMISSIONS corresponding to the old value are not cleaned up by LSF data manager. You must manually delete them as root. Be careful that no running jobs are using them. | The parameter takes effect after you restart or reconfigure LSF data manager. | Default | user | | FILE_TRANSFER_CMD | Syntax | FILE_TRANSFER_CMD=command | Description | | | The specified command must take two arguments of the form [host_name:]abs_file_path. The first argument is an absolute path to the location of the source file and the second is absolute path to the destination of the transfer. | | | The command must be able to accept path descriptors with or without host names for each of its two arguments. For example, the default scp command satisfies both requirements. The cp command is not valid because it can't accept a host name. | | | | The command that you specify must block until the transfer is successfully completed or an error occurs. It must return 0 if successful and a non-zero value if an error occurs. Provide a full path to the command so that it can be accessed from the hosts that the data transfer queue points to. The command that is used by LSF data manager to transfer data files. Chapter 1. 
If the command returns successfully, LSF data manager assumes that the transfer was completed without error.

The parameter takes effect after you restart or reconfigure LSF data manager.

Note: If you change FILE_TRANSFER_CMD, transfer jobs that are submitted before the change continue to use the old value. You must kill these jobs and any dependent data jobs, and submit new data jobs for the files to be transferred with the new command.

Default
/usr/bin/scp

QUERY_NTHREADS

Number of threads in the LSF data manager client query thread pool.

Syntax
QUERY_NTHREADS=integer

Description
Increase the value of QUERY_NTHREADS to improve the responsiveness of the local LSF data manager to requests by remote data managers and bdata clients.

The parameter takes effect after you restart or reconfigure LSF data manager.

Valid values
1 - 2147483647

Default
4

REMOTE_CACHE_REFRESH_INTERVAL

Number of seconds that information about remote LSF data manager file availability is considered to be fresh.

Syntax
REMOTE_CACHE_REFRESH_INTERVAL=seconds

Description
REMOTE_CACHE_REFRESH_INTERVAL affects how often LSF data manager queries LSF data managers in other clusters. After the specified number of seconds has elapsed, the job assumes that the remote cluster information about this file is stale, and queries the cluster for availability again.

The parameter takes effect after you restart or reconfigure LSF data manager.

Valid values
1 - 2147483647 seconds
| | | | If no host name is specified, LSF data manager assumes that all hosts mount the staging area at the same path. If a host name is specified, the specified host name and path must be a subdirectory of an exported file system that appears in the first column of the /etc/mtab file on all hosts that mount the staging area. | The resolved directory must exist. | | You must restart the LSF data manager daemon (dmd) for this parameter to take effect. | | | Note: If you change the STAGING_AREA parameter, files that are stored in the previous cache are not recovered, and therefore are never deleted by the LSF data manager. | Data transfer hosts | | | Each cluster using LSF data management must have a set of hosts that act as data transfer nodes (also referred to as I/O nodes). These hosts must have the following properties: v They must have write access to the staging area. v Each host must be able to reach the source location of any files specified in the job data requirements, and the destination of any stage-out request. v They must statically mount the staging area at the same location as the LSF data manager. v They must be configured as the target hosts of the queue defining the DATA_TRANSFER parameter. v Every staging area must map to a single instance of the data management component. | | | | | | | | | | | | | | | | Absolute path to the top of the data management staging area as it is accessed from the LSF data manager hosts. v The data manager must be able to access the staging area as root. v Any compute node in the target data cluster which does not directly mount the staging area, must have passwordless SSH access to the configured staging area file servers. v If multiple LSF data manager hosts are configured for failover (LSF_DATA_HOSTS in lsf.conf), all data manager candidates must mount the staging area at the same local path. Chapter 1. 
Example
For the following STAGING_AREA:
STAGING_AREA=hostA.company.com:/vol/dmd_cache

The following /etc/mtab entries are valid:
v hostA.company.com:/vol/ /mnt/vol1 0 0 - The staging area is accessed locally at /mnt/vol1/dmd_cache
v hostA:/vol/dmd_cache /mnt/staging 0 0 - The staging area is accessed locally at /mnt/staging

The following /etc/mtab entry is not valid:
hostA.company.com:/vol/vol1 /mnt/staging 0 0

In this case, the export location is not an ancestor of the location pointed to by the STAGING_AREA parameter.

Default
None. Required parameter.

RemoteDataManagers section
Optional. The RemoteDataManagers section tells a local LSF data manager how to communicate with remote LSF data managers in MultiCluster forwarding clusters. Only the cluster that is sending jobs needs to configure the RemoteDataManagers section.

Before a job with data requirements is forwarded to another cluster, the local LSF data manager contacts remote LSF data managers, which serve as candidate clusters for job forwarding. The local dmd collects the information that is needed for the scheduler forwarding decision. If the dmd of a candidate forwarding cluster is not configured in the section, LSF excludes that cluster from being scheduled for job forwarding.

Configure the RemoteDataManagers section only in MultiCluster submission clusters. When the RemoteDataManagers section is configured, LSF maintains a connection between the master LSF data manager in each cluster, and uses this connection to query the availability of remote files.

By default, the local cluster can obtain information about all other clusters that are specified in lsf.shared. The RemoteDataManagers section limits the clusters that the local cluster can obtain information about.

Every cluster in lsf.shared that has LSF data management features enabled must appear in the list of clusters.
You can configure only one RemoteDataManagers section in lsf.datamanager. If more than one RemoteDataManagers section exists, only the first section takes effect. Sections beyond the first one are ignored. Duplicate cluster names are ignored, and the first cluster name in the list is used.

The bdata connections command reports the connection status of each remote LSF data manager. If there is a configuration error, the corresponding manager shows as disconnected (disc) in the output of this command.

The first line consists of the following required keywords:
v CLUSTERNAME
v SERVERS
v PORT

Subsequent lines specify the cluster names, server hosts, and ports for the remote LSF data managers.

Example RemoteDataManagers section format
Begin RemoteDataManagers
CLUSTERNAME SERVERS                     PORT
cluster1    (host11 host12 ... host1n)  1729
cluster2    (host21 host22 ... host2n)  4104
clusterM    (hostM1 hostM2 ... hostMn)  13832
End RemoteDataManagers

RemoteDataManagers parameters
CLUSTERNAME
Specify the clusters that you want the local cluster to recognize as LSF data managers.

SERVERS
The list of servers must correspond to the value of the LSF_DATA_HOSTS parameter in lsf.conf of the appropriate cluster.

PORT
The port number for the cluster must correspond to the value of the LSF_DATA_PORT parameter in lsf.conf of the appropriate cluster.

lsf.licensescheduler

The lsf.licensescheduler file contains License Scheduler configuration information. All sections except ProjectGroup are required. In cluster mode, the Project section is also not required.

Changing lsf.licensescheduler configuration
After making any changes to lsf.licensescheduler, run the following commands:
v bladmin reconfig to reconfigure bld
v If you made the following changes to this file, you may need to restart mbatchd:
– Deleted any feature.
– Deleted projects in the DISTRIBUTION parameter of the Feature section.
In these cases a message is written to the log file prompting the restart. If you have added, changed, or deleted any Feature or Projects sections, you may also need to restart mbatchd; in this case a message is written to the log file prompting the restart. If required, run badmin mbdrestart to restart each LSF cluster.

Parameters section

Description

Required. Defines License Scheduler configuration parameters.

Parameters section structure

The Parameters section begins and ends with the lines Begin Parameters and End Parameters. Each subsequent line describes one configuration parameter. Mandatory parameters are as follows:

Begin Parameters
ADMIN=lsadmin
HOSTS=hostA hostB hostC
LMSTAT_PATH=/etc/flexlm/bin
RLMSTAT_PATH=/etc/rlm/bin
LM_STAT_INTERVAL=30
PORT=9581
End Parameters

Parameters

v ADMIN
v AUTH
v BLC_HEARTBEAT_FACTOR
v CHECKOUT_FROM_FIRST_HOST_ONLY
v CLUSTER_MODE
v DEMAND_LIMIT
v DISTRIBUTION_POLICY_VIOLATION_ACTION
v ENABLE_INTERACTIVE
v FAST_DISPATCH
v HEARTBEAT_INTERVAL
v HEARTBEAT_TIMEOUT
v HIST_HOURS
v HOSTS
v INUSE_FROM_RUSAGE
v LIB_CONNTIMEOUT
v LIB_RECVTIMEOUT
v LM_REMOVE_INTERVAL
v LM_REMOVE_SUSP_JOBS
v LM_REMOVE_SUSP_JOBS_INTERVAL
v LM_STAT_INTERVAL
v LM_STAT_TIMEOUT
v LM_TYPE
v LMREMOVE_SUSP_JOBS
v LMREMOVE_SUSP_JOBS_INTERVAL
v LMSTAT_PATH
v LOG_EVENT
v LOG_INTERVAL
v LS_DEBUG_BLC
v LS_DEBUG_BLD
v LS_ENABLE_MAX_PREEMPT
v LS_LOG_MASK
v LS_MAX_STREAM_FILE_NUMBER
v LS_MAX_STREAM_SIZE
v LS_MAX_TASKMAN_PREEMPT
v LS_MAX_TASKMAN_SESSIONS
v LS_STREAM_FILE
v LS_PREEMPT_PEER
v MBD_HEARTBEAT_INTERVAL
v MBD_REFRESH_INTERVAL
v MERGE_BY_SERVICE_DOMAIN
v PEAK_INUSE_PERIOD
v PORT
v PREEMPT_ACTION
v PROJECT_GROUP_PATH
v REMOTE_LMSTAT_PROTOCOL
v RLMSTAT_PATH
v STANDBY_CONNTIMEOUT

ADMIN

Syntax

ADMIN=user_name ...

Description

Defines the License Scheduler administrator using a valid UNIX user account. You can specify multiple accounts.
Used for both project mode and cluster mode.

AUTH

Syntax

AUTH=Y

Description

Enables License Scheduler user authentication for projects for taskman jobs.

Used for both project mode and cluster mode.

BLC_HEARTBEAT_FACTOR

Syntax

BLC_HEARTBEAT_FACTOR=integer

Description

Enables bld to detect blcollect failure. Defines the number of times that bld receives no response from a license collector daemon (blcollect) before bld resets the values for that collector to zero. Each license usage report sent to bld by the collector is treated as a heartbeat.

Used for both project mode and cluster mode.

Default

3

CHECKOUT_FROM_FIRST_HOST_ONLY

Syntax

CHECKOUT_FROM_FIRST_HOST_ONLY=Y

Description

If enabled, License Scheduler considers only the user@host information for the first execution host of a parallel job when merging the license usage data. A setting in an individual Feature section overrides the global setting in the Parameters section.

If disabled, License Scheduler attempts to check out user@host keys for the parallel job constructed using the user name and all execution host names, and merges the corresponding checkout information on the service domain if found. In addition, if MERGE_BY_SERVICE_DOMAIN=Y is defined, License Scheduler merges multiple user@host data for parallel jobs across different service domains.

Default

Undefined (N). License Scheduler attempts to check out user@host keys for the parallel job constructed using the user name and all execution host names, and merges the corresponding checkout information on the service domain if found.

CLUSTER_MODE

Syntax

CLUSTER_MODE=Y

Description

Enables cluster mode (instead of project mode) in License Scheduler. A setting in an individual Feature section overrides the global setting in the Parameters section.

Cluster mode emphasizes high utilization of license tokens over other considerations such as ownership.
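Building on the mandatory Parameters example shown earlier, the following sketch globally enables cluster mode; all other values are the placeholders from that example.

```
Begin Parameters
ADMIN=lsadmin
HOSTS=hostA hostB hostC
LMSTAT_PATH=/etc/flexlm/bin
LM_STAT_INTERVAL=30
PORT=9581
CLUSTER_MODE=Y
End Parameters
```

Because a Feature section setting overrides the global setting, individual license features can still be configured differently from this global default.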
License ownership and sharing can still be configured, but within each cluster instead of across multiple clusters. Preemption of jobs (and licenses) also occurs within each cluster instead of across clusters.

Cluster mode was introduced in License Scheduler 8.0. Before cluster mode was introduced, project mode was the only choice available.

Default

Not defined (N). License Scheduler runs in project mode.

DEMAND_LIMIT

Syntax

DEMAND_LIMIT=integer

Description

Sets a limit on the demand from each project in each cluster that License Scheduler considers when allocating licenses. A setting in the Feature section overrides the global setting in the Parameters section.

Used for fast dispatch project mode only.

When enabled, the demand limit helps prevent License Scheduler from allocating more licenses to a project than can actually be used, which reduces license waste by limiting the demand that License Scheduler considers. This is useful in cases when, because other resource limits are reached, License Scheduler would otherwise allocate more tokens than Platform LSF can actually use while jobs are still pending due to lack of other resources.

When disabled (that is, DEMAND_LIMIT=0 is set), License Scheduler takes into account all the demand reported by each cluster when scheduling.

DEMAND_LIMIT does not affect the DEMAND that blstat displays. Instead, blstat displays the entire demand sent for a project from all clusters. For example, with the default DEMAND_LIMIT of 5, one cluster reports a demand of 15 for a project and another cluster reports a demand of 20 for the same project. When License Scheduler allocates licenses, it takes into account a demand of five from each cluster for the project, but the DEMAND that blstat displays is 35.

Periodically, each cluster sends a demand for each project. A cluster calculates the demand for a project by summing up the rusage of all jobs of the project that are pending due to lack of licenses.
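As a sketch, a Feature section can raise the limit for one feature while the global default of 5 continues to apply to all others. The feature name, service domain, and project names below are hypothetical.

```
Begin Feature
NAME=featureB
FAST_DISPATCH=Y
# Consider up to 20 tokens of demand per project per cluster
# for this feature only (hypothetical value):
DEMAND_LIMIT=20
DISTRIBUTION=LanServer(projA 1 projB 1)
End Feature
```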
Whether to count a job's rusage in the demand depends on the job's pending reason. In general, the demand reported by a cluster represents only a potential demand from the project. It does not take into account other resources that are required to start a job. For example, a demand for 100 licenses is reported for a project. However, if License Scheduler allocates 100 licenses to the project, the project does not necessarily use all 100 licenses, due to slot availability, limits, or other scheduling constraints.

In project mode and fast dispatch project mode, mbatchd in each cluster sends a demand for licenses from each project. In project mode, License Scheduler assumes that each project can actually use the demand that is sent to it. In fast dispatch project mode, DEMAND_LIMIT limits the amount of demand from each project in each cluster that is considered when scheduling.

Default

5

DISTRIBUTION_POLICY_VIOLATION_ACTION

Syntax

DISTRIBUTION_POLICY_VIOLATION_ACTION=(PERIOD reporting_period CMD reporting_command)

reporting_period
    Specify the keyword PERIOD with a positive integer representing the interval (a multiple of LM_STAT_INTERVAL periods) at which License Scheduler checks for distribution policy violations.

reporting_command
    Specify the keyword CMD with the directory path and command that License Scheduler runs when reporting a violation.

Description

Optional. Defines how License Scheduler handles distribution policy violations. Distribution policy violations are caused by non-LSF workloads; License Scheduler explicitly follows its distribution policies.

License Scheduler reports a distribution policy violation when the total number of licenses given to the LSF workload, both free and in use, is less than the LSF workload distribution specified in WORKLOAD_DISTRIBUTION.
If License Scheduler finds a distribution policy violation, it creates or overwrites the LSF_LOGDIR/bld.violation.service_domain_name.log file and runs the user command specified by the CMD keyword.

Used for project mode only.

Example

The LicenseServer1 service domain has a total of 80 licenses, and its workload distribution and enforcement is configured as follows:

Begin Parameters
...
DISTRIBUTION_POLICY_VIOLATION_ACTION=(PERIOD 5 CMD /bin/mycmd)
...
End Parameters

Begin Feature
NAME=ApplicationX
DISTRIBUTION=LicenseServer1(Lp1 1 Lp2 2)
WORKLOAD_DISTRIBUTION=LicenseServer1(LSF 8 NON_LSF 2)
End Feature

According to this configuration, 80% of the available licenses, or 64 licenses, are available to the LSF workload. License Scheduler checks the service domain for a violation every five scheduling cycles, and runs the /bin/mycmd command if it finds a violation.

If the current LSF workload license usage is 50 and the number of free licenses is 10, the total number of licenses assigned to the LSF workload is 60. This is a violation of the workload distribution policy because it is less than the specified LSF workload distribution of 64 licenses.

ENABLE_INTERACTIVE

Syntax

ENABLE_INTERACTIVE=Y

Description

Optional. Globally enables one share of the licenses for interactive tasks.

Tip: By default, ENABLE_INTERACTIVE is not set. License Scheduler allocates licenses equally to each cluster and does not distribute licenses for interactive tasks.

Used for project mode only.

FAST_DISPATCH

Syntax

FAST_DISPATCH=Y

Description

Enables fast dispatch project mode for the license feature, which increases license utilization for project licenses. A setting in the Feature section overrides the global setting in the Parameters section.

Used for project mode only.

When enabled, License Scheduler does not have to run lmutil, lmstat, rlmutil, or rlmstat to verify that a license is free before each job dispatch.
As soon as a job finishes, the cluster can reuse its licenses for another job of the same project, which keeps gaps between jobs small. However, because License Scheduler does not run lmutil, lmstat, rlmutil, or rlmstat to verify that the license is free, there is an increased chance of a license checkout failure for jobs if the license is already in use by a job in another project.

The fast dispatch project mode supports the following parameters in the Feature section:
v ALLOCATION
v DEMAND_LIMIT
v DISTRIBUTION
v GROUP_DISTRIBUTION
v LM_LICENSE_NAME
v LS_FEATURE_PERCENTAGE
v NAME
v NON_SHARED_DISTRIBUTION
v SERVICE_DOMAINS
v WORKLOAD_DISTRIBUTION

The fast dispatch project mode also supports the MBD_HEARTBEAT_INTERVAL parameter in the Parameters section. Other parameters are not supported, including some that project mode supports, such as the following parameters:
v ACCINUSE_INCLUDES_OWNERSHIP
v DYNAMIC
v GROUP
v LOCAL_TO
v LS_ACTIVE_PERCENTAGE

Default

Not defined (N). License Scheduler runs in project mode without fast dispatch.

HEARTBEAT_INTERVAL

Syntax

HEARTBEAT_INTERVAL=seconds

Description

The time interval between bld heartbeats indicating that the bld is still running.

Default

60 seconds

HEARTBEAT_TIMEOUT

Syntax

HEARTBEAT_TIMEOUT=seconds

Description

The time a slave bld waits to hear from the master bld before assuming that it has died.

Default

120 seconds

HIST_HOURS

Syntax

HIST_HOURS=hours

Description

Determines the rate of decay of the accumulated use value that is used in fairshare and preemption decisions. When HIST_HOURS=0, accumulated use is not decayed. Accumulated use is displayed by the blstat command under the heading ACUM_USE.

Used for project mode only.

Default

5 hours. Accumulated use decays to 1/10 of the original value over 5 hours.

HOSTS

Syntax

HOSTS=host_name.domain_name ...

Description

Defines License Scheduler hosts, including License Scheduler candidate hosts.
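A minimal Feature section that enables fast dispatch project mode for a single feature might look like the following sketch; the feature name, service domain, and project names are hypothetical, and only the parameters listed above are valid in such a section.

```
Begin Feature
NAME=featureC
FAST_DISPATCH=Y
DISTRIBUTION=LanServer(projA 1 projB 2)
DEMAND_LIMIT=10
End Feature
```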
Specify a fully qualified host name such as hostX.mycompany.com. You can omit the domain name if all your License Scheduler clients run in the same DNS domain.

Used for both project mode and cluster mode.

INUSE_FROM_RUSAGE

Syntax

INUSE_FROM_RUSAGE=Y|N

Description

When not defined or set to N, the INUSE value (as reported by blstat) uses rusage from bsub job submissions merged with the license checkout data reported by blcollect.

When INUSE_FROM_RUSAGE=Y, the INUSE value uses the rusage from bsub job submissions instead of waiting for the blcollect update. This can result in faster reallocation of tokens when using dynamic allocation (when ALLOC_BUFFER is set).

When defined for individual license features, the Feature section setting overrides the global Parameters section setting.

Used for cluster mode only.

Default

N

LIB_CONNTIMEOUT

Syntax

LIB_CONNTIMEOUT=seconds

Description

Specifies a timeout value in seconds for communication between License Scheduler and LSF APIs. LIB_CONNTIMEOUT=0 indicates no timeout.

Used for both project mode and cluster mode.

Default

5 seconds

LIB_RECVTIMEOUT

Syntax

LIB_RECVTIMEOUT=seconds

Description

Specifies a timeout value in seconds for communication between License Scheduler and LSF.

Used for both project mode and cluster mode.

Default

5 seconds

LM_REMOVE_INTERVAL

Syntax

LM_REMOVE_INTERVAL=seconds

Description

Specifies the minimum time a job must have a license checked out before lmremove or rlmremove can remove the license (using preemption). lmremove or rlmremove causes the license manager daemon and vendor daemons to close the TCP connection with the application.

License Scheduler only considers preempting a job after this interval has elapsed. LM_REMOVE_INTERVAL overrides the LS_WAIT_TO_PREEMPT value if LM_REMOVE_INTERVAL is larger.
When using lmremove or rlmremove as part of the preemption action (LM_REMOVE_SUSP_JOBS), define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove or rlmremove to release licenses from the job.

Used for both project mode and cluster mode.

Default

180 seconds

LM_REMOVE_SUSP_JOBS

Syntax

LM_REMOVE_SUSP_JOBS=seconds

Description

Enables License Scheduler to use lmremove (for FlexNet) or rlmremove (for Reprise License Manager) to remove license features from each recently suspended job. After enabling this parameter, the preemption action is to suspend the job's processes and use lmremove or rlmremove to remove licenses from the application.

License Scheduler continues to try removing the license feature for the specified number of seconds after the job is first suspended. When setting this parameter for an application, specify a value greater than the period following a license checkout during which lmremove or rlmremove will fail for the application. This ensures that when a job suspends, its licenses are released. This period depends on the application.

When using lmremove or rlmremove as part of the preemption action, define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove or rlmremove to release licenses from the job.

This parameter applies to all features in fast dispatch project mode.

Used for fast dispatch project mode only.

Default

Undefined. The default preemption action is to send a TSTP signal to the job.
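Putting the two parameters together, a Parameters fragment for immediate preemption with license removal might look like the following sketch. The 60-second retry window is an illustrative value; the appropriate value depends on the application, as described above.

```
Begin Parameters
...
# Allow preemption immediately after license checkout:
LM_REMOVE_INTERVAL=0
# Keep trying lmremove/rlmremove for 60 seconds after suspension
# (illustrative value, application-dependent):
LM_REMOVE_SUSP_JOBS=60
End Parameters
```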
LM_REMOVE_SUSP_JOBS_INTERVAL

Syntax

LM_REMOVE_SUSP_JOBS_INTERVAL=seconds

Description

Specifies the minimum length of time between subsequent child processes that License Scheduler forks to run lmremove or rlmremove every time it receives an update from a license collector daemon (blcollect).

Use this parameter when using lmremove or rlmremove as part of the preemption action (LM_REMOVE_SUSP_JOBS).

Used for fast dispatch project mode only.

Default

0. Uses the value of LM_STAT_INTERVAL instead.

LM_STAT_INTERVAL

Syntax

LM_STAT_INTERVAL=seconds

Description

Defines a time interval between calls that License Scheduler makes to collect license usage information from the license manager.

Default

60 seconds

LM_STAT_TIMEOUT

Syntax

LM_STAT_TIMEOUT=seconds

Description

Sets the timeout value passed to the lmutil lmstat, lmstat, rlmutil rlmstat, or rlmstat command. The Parameters section setting is overwritten by the ServiceDomain section setting, which is overwritten by the command line setting (blcollect -t timeout).

Used for both project mode and cluster mode.

Default

180 seconds

LM_TYPE

Syntax

LM_TYPE=FLEXLM | RLM

Description

Defines the license manager system that is used by the license servers. This determines how License Scheduler communicates with the license servers.

Define LM_TYPE=FLEXLM if the license servers are using FlexNet Manager as the license manager system.

Define LM_TYPE=RLM if the license servers are using Reprise License Manager as the license manager system.

Default

FLEXLM

LMREMOVE_SUSP_JOBS

Syntax

LMREMOVE_SUSP_JOBS=seconds

Description

Deprecated. Use LM_REMOVE_SUSP_JOBS instead. This parameter is maintained only for backwards compatibility.

Enables License Scheduler to use lmremove to remove license features from each recently suspended job.
After enabling this parameter, the preemption action is to suspend the job's processes and use lmremove to remove licenses from the application. lmremove causes the license manager daemon and vendor daemons to close the TCP connection with the application.

License Scheduler continues to try removing the license feature for the specified number of seconds after the job is first suspended. When setting this parameter for an application, specify a value greater than the period following a license checkout during which lmremove will fail for the application. This ensures that when a job suspends, its licenses are released. This period depends on the application.

When using lmremove as part of the preemption action, define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove to release licenses from the job.

This parameter applies to all features in fast dispatch project mode.

Used for fast dispatch project mode only.

Default

Undefined. The default preemption action is to send a TSTP signal to the job.

LMREMOVE_SUSP_JOBS_INTERVAL

Syntax

LMREMOVE_SUSP_JOBS_INTERVAL=seconds

Description

Deprecated. Use LM_REMOVE_SUSP_JOBS_INTERVAL instead. LMREMOVE_SUSP_JOBS_INTERVAL is maintained only for backwards compatibility.

Specifies the minimum length of time between subsequent child processes that License Scheduler forks to run lmremove every time it receives an update from a license collector daemon (blcollect).

Use this parameter when using lmremove as part of the preemption action (LMREMOVE_SUSP_JOBS).

Used for fast dispatch project mode only.

Default

0

LMSTAT_PATH

Syntax

LMSTAT_PATH=path

Description

Defines the full path to the location of the FlexNet command lmutil (or lmstat).

Used for project mode, fast dispatch project mode, and cluster mode.
LOG_EVENT

Syntax

LOG_EVENT=Y

Description

Enables logging of License Scheduler events in the bld.stream file.

Default

Not defined. Information is not logged.

LOG_INTERVAL

Syntax

LOG_INTERVAL=seconds

Description

The interval between token allocation data logs in the data directory.

Default

60 seconds

LS_DEBUG_BLC

Syntax

LS_DEBUG_BLC=log_class

Description

Sets the debugging log class for the License Scheduler blcollect daemon.

Used for both project mode and cluster mode.

Specifies the log class filtering to be applied to blcollect. Only messages belonging to the specified log class are recorded.

LS_DEBUG_BLC sets the log class and is used in combination with LS_LOG_MASK, which sets the log level. For example:
LS_LOG_MASK=LOG_DEBUG
LS_DEBUG_BLC="LC_TRACE"

To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LS_DEBUG_BLC="LC_TRACE"

You need to restart the blcollect daemons after setting LS_DEBUG_BLC for your changes to take effect.

Valid values

Valid log classes are:
v LC_AUTH and LC2_AUTH: Log authentication messages
v LC_COMM and LC2_COMM: Log communication messages
v LC_FLEX: Log everything related to FLEX_STAT or FLEX_EXEC Flexera APIs
v LC_PERFM and LC2_PERFM: Log performance messages
v LC_PREEMPT: Log license preemption policy messages
v LC_RESREQ and LC2_RESREQ: Log resource requirement messages
v LC_SYS and LC2_SYS: Log system call messages
v LC_TRACE and LC2_TRACE: Log significant program walk steps
v LC_XDR and LC2_XDR: Log everything transferred by XDR

Default

Not defined.

LS_DEBUG_BLD

Syntax

LS_DEBUG_BLD=log_class

Description

Sets the debugging log class for the License Scheduler bld daemon.

Used for both project mode and cluster mode.

Specifies the log class filtering to be applied to bld. Only messages belonging to the specified log class are recorded.
Not all debug messages are controlled by log class.

LS_DEBUG_BLD sets the log class and is used in combination with LS_LOG_MASK, which sets the log level. For example:
LS_LOG_MASK=LOG_DEBUG
LS_DEBUG_BLD="LC_TRACE"

To specify multiple log classes, use a space-separated list enclosed in quotation marks. For example:
LS_DEBUG_BLD="LC_TRACE"

You need to restart the bld daemon after setting LS_DEBUG_BLD for your changes to take effect. If you use the command bladmin blddebug to temporarily change this parameter without changing lsf.licensescheduler, you do not need to restart the daemons.

Valid values

Valid log classes are:
v LC_AUTH and LC2_AUTH: Log authentication messages
v LC_COMM and LC2_COMM: Log communication messages
v LC_FLEX: Log everything related to FLEX_STAT or FLEX_EXEC Flexera APIs
v LC_MEMORY: Log memory use messages
v LC_PREEMPT: Log license preemption policy messages
v LC_RESREQ and LC2_RESREQ: Log resource requirement messages
v LC_TRACE and LC2_TRACE: Log significant program walk steps
v LC_XDR and LC2_XDR: Log everything transferred by XDR

Default

Not defined.

LS_ENABLE_MAX_PREEMPT

Syntax

LS_ENABLE_MAX_PREEMPT=Y

Description

Enables maximum preemption time checking for LSF and taskman jobs.

When LS_ENABLE_MAX_PREEMPT is disabled, preemption times for taskman jobs are not checked regardless of the value of the LS_MAX_TASKMAN_PREEMPT parameter in lsf.licensescheduler and the MAX_JOB_PREEMPT parameter in lsb.queues, lsb.applications, or lsb.params.

Used for project mode only.

Default

N

LS_LOG_MASK

Syntax

LS_LOG_MASK=message_log_level

Description

Specifies the logging level of error messages for License Scheduler daemons. If LS_LOG_MASK is not defined in lsf.licensescheduler, the value of LSF_LOG_MASK in lsf.conf is used. If neither LS_LOG_MASK nor LSF_LOG_MASK is defined, the default is LOG_WARNING.

Used for both project mode and cluster mode.
For example:
LS_LOG_MASK=LOG_DEBUG

The log levels in order from highest to lowest are:
v LOG_ERR
v LOG_WARNING
v LOG_INFO
v LOG_DEBUG
v LOG_DEBUG1
v LOG_DEBUG2
v LOG_DEBUG3

The most important License Scheduler log messages are at the LOG_WARNING level. Messages at the LOG_DEBUG level are only useful for debugging.

Although message log level implements similar functionality to UNIX syslog, there is no dependency on UNIX syslog. It works even if messages are being logged to files instead of syslog.

License Scheduler logs error messages at different levels so that you can choose to log all messages, or only log messages that are deemed critical. The level specified by LS_LOG_MASK determines which messages are recorded and which are discarded. All messages logged at the specified level or higher are recorded, while lower level messages are discarded.

For debugging purposes, the level LOG_DEBUG contains the fewest debugging messages and is used for basic debugging. The level LOG_DEBUG3 records all debugging messages, and can cause log files to grow very large; it is not often used. Most debugging is done at the level LOG_DEBUG2.

Default

LOG_WARNING

LS_MAX_STREAM_FILE_NUMBER

Syntax

LS_MAX_STREAM_FILE_NUMBER=integer

Description

Sets the number of saved bld.stream.timestamp log files. When LS_MAX_STREAM_FILE_NUMBER=2, for example, the two most recent files are kept along with the current bld.stream file.

Used for both project mode and cluster mode.

Default

0 (the old bld.stream file is not saved)

LS_MAX_STREAM_SIZE

Syntax

LS_MAX_STREAM_SIZE=integer

Description

Defines the maximum size of the bld.stream file in MB. Once this size is reached, an EVENT_END_OF_STREAM is logged, a new bld.stream file is created, and the old bld.stream file is renamed bld.stream.timestamp.

Used for both project mode and cluster mode.
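For example, the following sketch rotates the event log at 512 MB and keeps the two most recent archives; both values are illustrative.

```
Begin Parameters
...
# Rotate bld.stream when it reaches 512 MB (illustrative value):
LS_MAX_STREAM_SIZE=512
# Keep the two most recent bld.stream.timestamp files:
LS_MAX_STREAM_FILE_NUMBER=2
End Parameters
```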
Default

1024

LS_MAX_TASKMAN_PREEMPT

Syntax

LS_MAX_TASKMAN_PREEMPT=integer

Description

Defines the maximum number of times taskman jobs can be preempted. Maximum preemption time checking for all jobs is enabled by LS_ENABLE_MAX_PREEMPT.

Used for project mode only.

Default

Unlimited

LS_MAX_TASKMAN_SESSIONS

Syntax

LS_MAX_TASKMAN_SESSIONS=integer

Description

Defines the maximum number of taskman jobs that run simultaneously. This prevents the system-wide performance issues that occur if a large number of taskman jobs are running in License Scheduler.

The number of taskman sessions must be a positive integer.

The actual maximum number of taskman jobs is affected by the operating system file descriptor limit. Make sure the operating system file descriptor limit and the maximum number of concurrent connections are large enough to support all taskman tasks, License Scheduler (bl*) commands, and connections between License Scheduler and LSF.

Used for both project mode and cluster mode.

LS_STREAM_FILE

Syntax

LS_STREAM_FILE=path

Description

Defines the full path and file name of the bld event log file, bld.stream by default.

Note: In License Scheduler 8.0, the bld.events log file was replaced by the bld.stream log file.

Used for both project mode and cluster mode.

Default

LSF_TOP/work/db/bld.stream

LS_PREEMPT_PEER

Syntax

LS_PREEMPT_PEER=Y

Description

Enables bottom-up license token preemption in hierarchical project group configurations. License Scheduler attempts to preempt tokens from the closest projects in the hierarchy first. This balances token ownership from the bottom up.

Used for project mode only.

Default

Not defined. Token preemption in hierarchical project groups is top down.

MBD_HEARTBEAT_INTERVAL

Syntax

MBD_HEARTBEAT_INTERVAL=seconds

Description

Sets the length of time the cluster license allocation remains unchanged after a cluster has disconnected from bld.
After MBD_HEARTBEAT_INTERVAL has passed, the allocation is set to zero and licenses are redistributed to other clusters.

Used for cluster mode and fast dispatch project mode only.

Default

900 seconds

MBD_REFRESH_INTERVAL

Syntax

MBD_REFRESH_INTERVAL=seconds

Description

This parameter allows the administrator to independently control the minimum interval between load updates from bld and the minimum interval between load updates from LIM. The parameter controls the frequency of scheduling interactive (taskman) jobs.

The parameter is read by mbatchd on startup. When MBD_REFRESH_INTERVAL is set or changed, you must restart bld, and restart mbatchd in each cluster.

Used for both project mode and cluster mode.

Default

15 seconds

MERGE_BY_SERVICE_DOMAIN

Syntax

MERGE_BY_SERVICE_DOMAIN=Y | N

Description

If enabled, correlates job license checkout with the lmutil lmstat, lmstat, rlmutil rlmstat, or rlmstat output across all service domains first before reserving licenses.

In project mode (but not fast dispatch project mode), this parameter supports the case where the application's checkout license number is less than or equal to the job's rusage. If the checked out licenses are greater than the job's rusage, the ENABLE_DYNAMIC_RUSAGE parameter is still required.

Default

N (Does not correlate job license checkout with the lmutil, lmstat, rlmutil, or rlmstat output across all service domains before reserving licenses)

PEAK_INUSE_PERIOD

Syntax

PEAK_INUSE_PERIOD=seconds

Description

Defines the interval over which a peak INUSE value is determined for dynamic license allocation in cluster mode, for all license features over all service domains. When defining the interval for LSF AE submission clusters, the interval is determined for the entire LSF AE mega-cluster (the submission cluster and its execution clusters).

Used for cluster mode only.
When defined in both the Parameters section and the Feature section, the Feature section definition is used for that license feature.

Default

300 seconds

PORT

Syntax

PORT=integer

Description

Defines the TCP listening port used by License Scheduler hosts, including candidate License Scheduler hosts. Specify any non-privileged port number.

Used for both project mode and cluster mode.

PREEMPT_ACTION

Syntax

PREEMPT_ACTION=action

Description

Specifies the action used for taskman job preemption. By default, if PREEMPT_ACTION is not configured, bld sends a TSTP signal to preempt taskman jobs.

You can specify a script using this parameter. For example, PREEMPT_ACTION=/home/user1/preempt.s runs preempt.s when preempting a taskman job.

Used for project mode only.

Default

Not defined. A TSTP signal is used to preempt taskman jobs.

PROJECT_GROUP_PATH

Syntax

PROJECT_GROUP_PATH=Y

Description

Enables hierarchical project group paths for fast dispatch project mode, which enables the following:
v Features can use hierarchical project groups with project and project group names that are not unique, as long as the projects or project groups do not have the same parent. That is, you can define projects and project groups in more than one hierarchical project group.
v When specifying -Lp license_project, you can use paths to describe the project hierarchy without specifying the root group. For example, if you have root as your root group, which has a child project group named groupA with a project named proj1, you can use -Lp /groupA/proj1 to specify this project.
v Hierarchical project groups have a default project named others with a default share value of 0. Any projects that do not match the defined projects in a project group are assigned to the others project. If there is already a project named others, the preexisting others project specification overrides the default project.
If LSF_LIC_SCHED_STRICT_PROJECT_NAME (in lsf.conf) and PROJECT_GROUP_PATH are both defined, PROJECT_GROUP_PATH takes precedence and overrides the LSF_LIC_SCHED_STRICT_PROJECT_NAME behavior for fast dispatch project mode.

Note: To use PROJECT_GROUP_PATH, you need LSF, Version 9.1.1, or later.

Used for fast dispatch project mode only.

Default

Not defined (N).

REMOTE_LMSTAT_PROTOCOL

Syntax

REMOTE_LMSTAT_PROTOCOL=ssh [ssh_command_options] | rsh [rsh_command_options] | lsrun [lsrun_command_options]

Description

Specifies the method that License Scheduler uses to connect to a remote agent host when there are remote license servers that need a remote agent host to collect license information. License Scheduler uses the specified command (and optional command options) to connect to the agent host. License Scheduler automatically appends the name of the remote agent host to the command, so there is no need to specify the host with the command.

Note: License Scheduler does not validate the specified command, so you must ensure that you specify the command correctly. The blcollect log file notes that the command failed, but does not give any details on the connection error. To determine specific connection errors, manually run the command to connect to the remote server before specifying it in REMOTE_LMSTAT_PROTOCOL.

If using lsrun as the connection method, the remote agent host must be a server host in the LSF cluster and RES must be started on this host. If using ssh or rsh as the connection method, the agent host does not have to be a server host in the LSF cluster.

REMOTE_LMSTAT_PROTOCOL works with REMOTE_LMSTAT_SERVERS, which defines the remote license servers and remote agent hosts. If you do not define REMOTE_LMSTAT_SERVERS, REMOTE_LMSTAT_PROTOCOL is not used.

Used for both project mode and cluster mode.
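For example, to connect to the remote agent host over ssh with a shorter connection timeout, a Parameters fragment might look like the following sketch; the ConnectTimeout option is a standard ssh option, used here illustratively, and License Scheduler appends the agent host name to this command automatically.

```
Begin Parameters
...
REMOTE_LMSTAT_PROTOCOL=ssh -o ConnectTimeout=5
End Parameters
```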
Default
ssh

RLMSTAT_PATH

Syntax
RLMSTAT_PATH=path

Description
Defines the full path to the location of the Reprise License Manager commands.

Used for both project mode and cluster mode.

Default
If not defined, this is set to LMSTAT_PATH.

STANDBY_CONNTIMEOUT

Syntax
STANDBY_CONNTIMEOUT=seconds

Description
Sets the connection timeout that the standby bld waits when trying to contact each host before assuming the host is unavailable.

Used for both project mode and cluster mode.

Default
5 seconds

Clusters section

Description
Required. Lists the clusters that can use License Scheduler. When configuring clusters for a WAN, the Clusters section of the master cluster must define its slave clusters.

The Clusters section is the same for both project mode and cluster mode.

Clusters section structure
The Clusters section begins and ends with the lines Begin Clusters and End Clusters. The second line is the column heading, CLUSTERS. Subsequent lines list participating clusters, one name per line:
Begin Clusters
CLUSTERS
cluster1
cluster2
End Clusters

CLUSTERS
Defines the name of each participating LSF cluster. Specify one name per line.

ServiceDomain section

Description
Required. Defines License Scheduler service domains as groups of physical license server hosts that serve a specific network.

The ServiceDomain section is the same for both project mode and cluster mode.

ServiceDomain section structure
Define a section for each License Scheduler service domain. This example shows the structure of the section:
Begin ServiceDomain
NAME=DesignCenterB
LIC_SERVERS=((1888@hostD)(1888@hostE))
LIC_COLLECTOR=CenterB
End ServiceDomain

Parameters
v LIC_SERVERS
v LIC_COLLECTOR
v LM_STAT_INTERVAL
v LM_STAT_TIMEOUT
v LM_TYPE
v NAME
v REMOTE_LMSTAT_SERVERS
LIC_SERVERS

Syntax
When using FlexNet as the license manager (LM_TYPE=FLEXLM):
LIC_SERVERS=([(host_name | port_number@host_name | (port_number@host_name port_number@host_name port_number@host_name))] ...)
When using Reprise License Manager as the license manager (LM_TYPE=RLM):
LIC_SERVERS=([(port_number@host_name | (port_number@host_name port_number@host_name port_number@host_name))] ...)

Description
Defines the license server hosts that make up the License Scheduler service domain.

Specify one or more license server hosts, and for each license server host, specify the number of the port that the license manager uses, then the at symbol (@), then the name of the host. Put one set of parentheses around the list, and one more set of parentheses around each host, unless you have redundant servers (three hosts sharing one license file). If you have redundant servers, the parentheses enclose all three hosts.

If License Scheduler is using FlexNet as the license manager (that is, LM_TYPE=FLEXLM), and FlexNet uses the default port on a host, you can specify the host name without the port number.

If License Scheduler is using Reprise License Manager as the license manager (that is, LM_TYPE=RLM), you must specify a port number for every license server host.

Used for both project mode and cluster mode.

Examples
v One FlexNet license server host:
LIC_SERVERS=((1700@hostA))
v Multiple license server hosts with unique license.dat files:
LIC_SERVERS=((1700@hostA)(1700@hostB)(1700@hostC))
v Redundant license server hosts sharing the same license.dat file:
LIC_SERVERS=((1700@hostD 1700@hostE 1700@hostF))

LIC_COLLECTOR

Syntax
LIC_COLLECTOR=license_collector_name

Description
Optional. Defines a name for the license collector daemon (blcollect) to use in each service domain. blcollect collects license usage information from FlexNet and passes it to the License Scheduler daemon (bld).
It improves performance by allowing you to distribute license information queries across multiple hosts. You can specify only one collector per service domain, but one collector can serve multiple service domains. Each time you run blcollect, you must specify the name of the collector for the service domain. You can use any name you want.

Used for both project mode and cluster mode.

Default
Undefined. The License Scheduler daemon uses one license collector daemon for the entire cluster.

LM_STAT_INTERVAL

Syntax
LM_STAT_INTERVAL=seconds

Description
Defines the time interval between calls that License Scheduler makes to collect license usage information from the license manager.

The value specified for a service domain overrides the global value defined in the Parameters section. Each service domain definition can specify a different value for this parameter.

Used for both project mode and cluster mode.

Default
License Scheduler applies the global value defined in the Parameters section.

LM_STAT_TIMEOUT

Syntax
LM_STAT_TIMEOUT=seconds

Description
Sets the timeout value passed to the lmutil lmstat, lmstat, rlmutil rlmstat, or rlmstat command. The Parameters section setting is overridden by the ServiceDomain section setting, which is overridden by the command line setting (blcollect -t timeout).

When using Reprise License Manager as the license manager (LM_TYPE=RLM), this parameter is ignored.

Used for both project mode and cluster mode.

Default
180 seconds

LM_TYPE

Syntax
LM_TYPE=FLEXLM | RLM

Description
Defines the license manager system that is used by the license servers. This determines how License Scheduler communicates with the license servers that are defined by the LIC_SERVERS parameter.

Define LM_TYPE=FLEXLM if the license servers are using FlexNet Manager as the license manager system.
Define LM_TYPE=RLM if the license servers are using Reprise License Manager as the license manager system. When LM_TYPE=RLM is defined, LIC_SERVERS must define port_number@host_name (that is, LIC_SERVERS must define a port number). Defining just the host name (or @host_name) without the port number is not allowed.

Default
FLEXLM

NAME
Defines the name of the service domain.

Used for both project mode and cluster mode.

REMOTE_LMSTAT_SERVERS

Syntax
REMOTE_LMSTAT_SERVERS=host_name[(host_name ...)] [host_name[(host_name ...)] ...]

Description
Defines the remote license servers and, optionally, the remote agent hosts that serve these remote license servers.

A remote license server is a license server that does not run on the same domain as the license collector. A remote agent host serves remote license servers within the same domain, allowing the license collector to get license information on the remote license servers with a single remote connection. Defining remote agent hosts is useful when there are both local and remote license servers, because it is slower for the license collector to connect to multiple remote license servers to get license information than it is to connect to local license servers.

The license collector connects to the remote agent host (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and calls lmutil, lmstat, rlmutil, or rlmstat to collect license information from the license servers that the agent host serves. This allows the license collector to connect to one remote agent host to get license information from all the remote license servers on the same domain as the remote agent host. These license servers should be in the same subnet as the agent host to improve access.

Remote license servers must also be license servers defined in LIC_SERVERS. Any remote license servers defined in REMOTE_LMSTAT_SERVERS that are not also defined in LIC_SERVERS are ignored.
Remote agent hosts that serve other license servers do not need to be defined in LIC_SERVERS. Remote agent hosts that are not defined in LIC_SERVERS function only as remote agents and not as license servers.

If you specify a remote agent host without additional servers (that is, the remote agent host does not serve any license servers), the remote agent host is considered to be a remote license server with itself as the remote agent host. That is, the license collector connects to the remote agent host and only gets license information on the remote agent host. Because these hosts are remote license servers, these remote agent hosts must also be defined as license servers in LIC_SERVERS, or they will be ignored.

Used for both project mode and cluster mode.

Examples
v One local license server (hostA) and one remote license server (hostB):
LIC_SERVERS=((1700@hostA)(1700@hostB))
REMOTE_LMSTAT_SERVERS=hostB
– The license collector runs lmutil, lmstat, rlmutil, or rlmstat directly on hostA to get license information on hostA.
– Because hostB is defined without additional license servers, hostB is a remote agent host that only serves itself. The license collector connects to hostB (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and runs lmutil, lmstat, rlmutil, or rlmstat to get license information on 1700@hostB.
v One local license server (hostA), one remote agent host (hostB) that serves one remote license server (hostC), and one remote agent host (hostD) that serves two remote license servers (hostE and hostF):
LIC_SERVERS=((1700@hostA)(1700@hostB)(1700@hostC)(1700@hostD)(1700@hostE)(1700@hostF))
REMOTE_LMSTAT_SERVERS=hostB(hostC) hostD(hostE hostF)
– The license collector runs lmutil, lmstat, rlmutil, or rlmstat directly to get license information from 1700@hostA, 1700@hostB, and 1700@hostD.
– The license collector connects to hostB (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and runs lmutil, lmstat, rlmutil, or rlmstat to get license information on 1700@hostC. hostB and hostC should be in the same subnet to improve access.
– The license collector connects to hostD (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and runs lmutil, lmstat, rlmutil, or rlmstat to get license information on 1700@hostE and 1700@hostF. hostD, hostE, and hostF should be in the same subnet to improve access.
v One local license server (hostA), one remote license server (hostB), and one remote agent host (hostC) that serves two remote license servers (hostD and hostE):
LIC_SERVERS=((1700@hostA)(1700@hostB)(1700@hostC)(1700@hostD)(1700@hostE))
REMOTE_LMSTAT_SERVERS=hostB hostC(hostD hostE)
– The license collector runs lmutil, lmstat, rlmutil, or rlmstat directly to get license information on 1700@hostA and 1700@hostC.
– The license collector connects to hostB (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and runs lmutil, lmstat, rlmutil, or rlmstat to get license information on 1700@hostB.
– The license collector connects to hostC (using the command specified by the REMOTE_LMSTAT_PROTOCOL parameter) and runs lmutil, lmstat, rlmutil, or rlmstat to get license information on 1700@hostD and 1700@hostE. hostC, hostD, and hostE should be in the same subnet to improve access.

Feature section

Description
Required. Defines license distribution policies.

Feature section structure
Define a section for each feature managed by License Scheduler.
Begin Feature
NAME=vcs
LM_LICENSE_NAME=vcs
... Distribution policy parameters ...
End Feature

Parameters
v ACCINUSE_INCLUDES_OWNERSHIP
v ALLOC_BUFFER
v ALLOCATION
v CHECKOUT_FROM_FIRST_HOST_ONLY
v CLUSTER_DISTRIBUTION
v CLUSTER_MODE
v DEMAND_LIMIT
v DISTRIBUTION
v DYNAMIC
v ENABLE_DYNAMIC_RUSAGE
v ENABLE_MINJOB_PREEMPTION
v FAST_DISPATCH
v FLEX_NAME
v GROUP
v GROUP_DISTRIBUTION
v INUSE_FROM_RUSAGE
v LM_LICENSE_NAME
v LM_REMOVE_INTERVAL
v LM_REMOVE_SUSP_JOBS
v LOCAL_TO
v LS_ACTIVE_PERCENTAGE
v LS_FEATURE_PERCENTAGE
v LS_WAIT_TO_PREEMPT
v NAME
v NON_SHARED_DISTRIBUTION
v PEAK_INUSE_PERIOD
v PREEMPT_ORDER
v PREEMPT_RESERVE
v RETENTION_FACTOR
v SERVICE_DOMAINS
v WORKLOAD_DISTRIBUTION

ACCINUSE_INCLUDES_OWNERSHIP

Syntax
ACCINUSE_INCLUDES_OWNERSHIP=Y

Description
When not defined, accumulated use is incremented each scheduling cycle by (tokens in use) + (tokens reserved) if this exceeds the number of tokens owned. When defined, accumulated use is incremented each scheduling cycle by (tokens in use) + (tokens reserved) regardless of the number of tokens owned.

This is useful for projects that have a very high ownership set when considered against the total number of tokens available for LSF workload. Projects can be starved for tokens when the ownership is set too high and this parameter is not set.

Accumulated use is displayed by the blstat command under the heading ACUM_USE.

Used for project mode only. Cluster mode and fast dispatch project mode do not track accumulated use.

Default
N, not enabled.

ALLOC_BUFFER

Syntax
ALLOC_BUFFER = buffer | cluster_name buffer ... default buffer

Description
Enables dynamic distribution of licenses across clusters in cluster mode.

Cluster names must be the names of clusters defined in the Clusters section of lsf.licensescheduler.

Used for cluster mode only.

ALLOC_BUFFER=buffer sets one buffer size for all clusters, while ALLOC_BUFFER=cluster_name buffer ... sets a different buffer size for each cluster.
The buffer size is used during dynamic redistribution of licenses. Increases in allocation are determined by the PEAK value, and increased by DEMAND for license tokens to a maximum increase of BUFFER, the buffer size configured by ALLOC_BUFFER. The license allocation can increase in steps as large as the buffer size, but no larger.

Allocation buffers help determine the maximum rate at which tokens can be transferred to a cluster as demand increases in the cluster. The maximum rate of transfer to a cluster is given by the allocation buffer divided by MBD_REFRESH_INTERVAL. Be careful not to set the allocation buffer too large, so that licenses are not wasted by being allocated to a cluster that cannot use them.

Decreases in license allocation can be larger than the buffer size, but the allocation must remain at PEAK+BUFFER licenses.

The license allocation includes up to the buffer size of extra licenses, in case demand increases. Increasing the buffer size allows the license allocation to grow faster, but also increases the number of licenses that may go unused at any given time. The buffer value must be tuned for each license feature and cluster to balance these two objectives.

When defining the buffer size for LSF AE submission clusters, the license allocation for the entire LSF AE mega-cluster (the submission cluster and its execution clusters) can increase in steps as large as the buffer size, but no larger.

Detailed license distribution information is shown in the blstat output.

Use the keyword default to apply a buffer size to all remaining clusters. For example:
Begin Feature
NAME = f1
CLUSTER_DISTRIBUTION = WanServers(banff 1 berlin 1 boston 1)
ALLOC_BUFFER = banff 10 default 5
End Feature
In this example, dynamic distribution is enabled. The cluster banff has a buffer size of 10, and all remaining clusters have a buffer size of 5.
To allow a cluster to use licenses only when another cluster does not need them, you can set the cluster distribution for the cluster to 0, and specify an allocation buffer for the number of tokens that the cluster can request. For example:
Begin Feature
CLUSTER_DISTRIBUTION=Wan(CL1 0 CL2 1)
ALLOC_BUFFER=5
End Feature
When no jobs are running, the token allocation for CL1 is 5. CL1 can get more than 5 tokens if CL2 does not require them.

Default
Not defined. Static distribution of licenses is used in cluster mode.

ALLOCATION

Syntax
ALLOCATION=project_name (cluster_name [number_shares] ... ) ...

cluster_name
Specify LSF cluster names or interactive tasks that licenses are to be allocated to.

project_name
Specify a License Scheduler project (described in the Projects section or as default) that is allowed to use the licenses.

number_shares
Specify a positive integer representing the number of shares assigned to the cluster. The number of shares assigned to a cluster is only meaningful when you compare it to the number assigned to other clusters. The total number of shares is the sum of the shares assigned to each cluster.

Description
Defines the allocation of license features across clusters and interactive tasks.

Used for project mode and fast dispatch project mode only.

ALLOCATION ignores the global setting of the ENABLE_INTERACTIVE parameter because ALLOCATION is configured for the license feature.

You can configure the allocation of license shares to:
v Change the share number between clusters for a feature
v Limit the scope of license usage, and change the share number between LSF jobs and interactive tasks for a feature

When defining the allocation of license features for LSF AE submission clusters, the allocation is for the entire LSF AE mega-cluster (the submission cluster and its execution clusters).
Tip: To manage interactive tasks in License Scheduler projects, use the LSF Task Manager, taskman. The Task Manager utility is supported by License Scheduler.

Default
If ENABLE_INTERACTIVE is not set, each cluster receives an equal share, and interactive tasks receive no shares.

Examples
Each example contains two clusters and 12 licenses of a specific feature.

Example 1
ALLOCATION is not configured. The ENABLE_INTERACTIVE parameter is not set.
Begin Parameters
...
ENABLE_INTERACTIVE=n
...
End Parameters

Begin Feature
NAME=ApplicationX
DISTRIBUTION=LicenseServer1 (Lp1 1)
End Feature
Six licenses are allocated to each cluster. No licenses are allocated to interactive tasks.

Example 2
ALLOCATION is not configured. The ENABLE_INTERACTIVE parameter is set.
Begin Parameters
...
ENABLE_INTERACTIVE=y
...
End Parameters

Begin Feature
NAME=ApplicationX
DISTRIBUTION=LicenseServer1 (Lp1 1)
End Feature
Four licenses are allocated to each cluster. Four licenses are allocated to interactive tasks.

Example 3
In the following example, the ENABLE_INTERACTIVE parameter does not affect the ALLOCATION configuration of the feature. ALLOCATION is configured. The ENABLE_INTERACTIVE parameter is set.
Begin Parameters
...
ENABLE_INTERACTIVE=y
...
End Parameters

Begin Feature
NAME=ApplicationY
DISTRIBUTION=LicenseServer1 (Lp2 1)
ALLOCATION=Lp2(cluster1 1 cluster2 0 interactive 1)
End Feature
The ENABLE_INTERACTIVE setting is ignored. Licenses are shared equally between cluster1 and interactive tasks. Six licenses of ApplicationY are allocated to cluster1. Six licenses are allocated to interactive tasks.

Example 4
In the following example, the ENABLE_INTERACTIVE parameter does not affect the ALLOCATION configuration of the feature. ALLOCATION is configured. The ENABLE_INTERACTIVE parameter is not set.
Begin Parameters
...
ENABLE_INTERACTIVE=n
...
End Parameters

Begin Feature
NAME=ApplicationZ
DISTRIBUTION=LicenseServer1 (Lp1 1)
ALLOCATION=Lp1(cluster1 0 cluster2 1 interactive 2)
End Feature
The ENABLE_INTERACTIVE setting is ignored. Four licenses of ApplicationZ are allocated to cluster2. Eight licenses are allocated to interactive tasks.

CHECKOUT_FROM_FIRST_HOST_ONLY

Syntax
CHECKOUT_FROM_FIRST_HOST_ONLY=Y

Description
If enabled, License Scheduler only considers user@host information for the first execution host of a parallel job when merging the license usage data. A setting in an individual Feature section overrides the global setting in the Parameters section.

If a feature has multiple Feature sections (using LOCAL_TO), each section must have the same setting for CHECKOUT_FROM_FIRST_HOST_ONLY.

If disabled, License Scheduler attempts to check out user@host keys in the parallel job constructed using the user name and all execution host names, and merges the corresponding checkout information on the service domain if found. If MERGE_BY_SERVICE_DOMAIN=Y is defined, License Scheduler also merges multiple user@host data for parallel jobs across different service domains.

Default
Undefined (N). License Scheduler attempts to check out user@host keys in the parallel job constructed using the user name and all execution host names, and merges the corresponding checkout information on the service domain if found.

CLUSTER_DISTRIBUTION

Syntax
CLUSTER_DISTRIBUTION=service_domain(cluster shares/min/max ... ) ...

service_domain
Specify a License Scheduler WAN service domain (described in the ServiceDomain section) that distributes licenses to multiple clusters, and the share for each cluster. Specify a License Scheduler LAN service domain for a single cluster.

cluster
Specify each LSF cluster that accesses licenses from this service domain.
shares
For each cluster specified for a WAN service domain, specify a positive integer representing the number of shares assigned to the cluster. (Not required for a LAN service domain.)
The number of shares assigned to a cluster is only meaningful when you compare it to the number assigned to other clusters, or to the total number assigned by the service domain. The total number of shares is the sum of the shares assigned to each cluster.

min
Optionally, specify a positive integer representing the minimum number of license tokens allocated to the cluster when dynamic allocation is enabled for a WAN service domain (when ALLOC_BUFFER is defined for the feature).
The minimum allocation is allocated exclusively to the cluster, and is similar to the non-shared allocation in project mode. Cluster shares take precedence over configured minimum allocations. If the minimum allocation exceeds the cluster's share of the total tokens, a cluster's allocation as given by bld may be less than the configured minimum allocation.

max
Optionally, specify a positive integer representing the maximum number of license tokens allocated to the cluster when dynamic allocation is enabled for a WAN service domain (when ALLOC_BUFFER is defined for the feature).

Description
CLUSTER_DISTRIBUTION must be defined when using cluster mode.

Defines the cross-cluster distribution policies for the license. The name of each service domain is followed by its distribution policy, in parentheses. The distribution policy determines how the licenses available in each service domain are distributed among the clients.

The distribution policy is a space-separated list with each cluster name followed by its share assignment. The share assignment determines what fraction of available licenses is assigned to each cluster, in the event of competition between clusters.
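A sketch of the shares/min/max form described above (the service domain and cluster names are illustrative, and dynamic allocation is assumed to be enabled with ALLOC_BUFFER):

Begin Feature
NAME=f2
CLUSTER_DISTRIBUTION=SD1(clusterA 1/5/20 clusterB 1)
ALLOC_BUFFER=5
End Feature

Here clusterA has one share, a minimum allocation of 5 tokens, and a maximum allocation of 20 tokens; clusterB has one share with no minimum or maximum.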
Examples
CLUSTER_DISTRIBUTION=wanserver(Cl1 1 Cl2 1 Cl3 1 Cl4 1)
CLUSTER_DISTRIBUTION = SD(C1 1 C2 1) SD1(C3 1 C4 1) SD2(C1 1) SD3(C2 1)
In these examples, wanserver, SD, and SD1 are WAN service domains, while SD2 and SD3 are LAN service domains serving a single cluster.

CLUSTER_MODE

Syntax
CLUSTER_MODE=Y

Description
Enables cluster mode (instead of project mode) for the license feature. A setting in the Feature section overrides the global setting in the Parameters section.

Cluster mode emphasizes high utilization of license tokens above other considerations such as ownership. License ownership and sharing can still be configured, but within each cluster instead of across multiple clusters. Preemption of jobs (and licenses) also occurs within each cluster instead of across clusters.

Cluster mode was introduced in License Scheduler 8.0. Before cluster mode was introduced, project mode was the only choice available.

Default
Undefined (N). License Scheduler runs in project mode.

DEMAND_LIMIT

Syntax
DEMAND_LIMIT=integer

Description
Sets a limit on the demand from each project in each cluster that License Scheduler considers when allocating licenses. A setting in the Feature section overrides the global setting in the Parameters section.

Used for fast dispatch project mode only.

When enabled, the demand limit helps prevent License Scheduler from allocating more licenses to a project than can actually be used, which reduces license waste by limiting the demand that License Scheduler considers. This is useful in cases where, because other resource limits are reached, License Scheduler would otherwise allocate more tokens than Platform LSF can actually use while jobs are still pending due to lack of other resources.

When disabled (that is, DEMAND_LIMIT=0 is set), License Scheduler takes into account all the demand reported by each cluster when scheduling.

DEMAND_LIMIT does not affect the DEMAND that blstat displays.
Instead, blstat displays the entire demand sent for a project from all clusters. For example, one cluster reports a demand of 15 for a project, and another cluster reports a demand of 20 for the same project. With the default DEMAND_LIMIT of 5, License Scheduler takes into account a demand of five from each cluster for the project when allocating licenses, but the DEMAND that blstat displays is 35.

Periodically, each cluster sends a demand for each project. A cluster calculates the demand for a project by summing the rusage of all jobs in the project that are pending due to lack of licenses. Whether a job's rusage is counted in the demand depends on the job's pending reason. In general, the demand reported by a cluster only represents a potential demand from the project. It does not take into account other resources that are required to start a job. For example, a demand for 100 licenses is reported for a project. However, if License Scheduler allocates 100 licenses to the project, the project does not necessarily use all 100 licenses due to slot availability, limits, or other scheduling constraints.

In project mode and fast dispatch project mode, mbatchd in each cluster sends a demand for licenses from each project. In project mode, License Scheduler assumes that each project can actually use the demand that is sent to it. In fast dispatch project mode, DEMAND_LIMIT limits the amount of demand from each project in each cluster that is considered when scheduling.

Default
5

DISTRIBUTION

Syntax
DISTRIBUTION=[service_domain_name([project_name number_shares[/number_licenses_owned]] ... [default] )] ...

service_domain_name
Specify a License Scheduler service domain (described in the ServiceDomain section) that distributes the licenses.

project_name
Specify a License Scheduler project (described in the Projects section) that is allowed to use the licenses.

number_shares
Specify a positive integer representing the number of shares assigned to the project.
The number of shares assigned to a project is only meaningful when you compare it to the number assigned to other projects, or to the total number assigned by the service domain. The total number of shares is the sum of the shares assigned to each project.

number_licenses_owned
Optional. Specify a slash (/) and a positive integer representing the number of licenses that the project owns. When configured, preemption is enabled, and owned licenses are reclaimed using preemption when there is unmet demand.

default
A reserved keyword that represents the default project if the job submission does not specify a project (bsub -Lp), or the specified project is not configured in the Projects section of lsf.licensescheduler. Jobs that belong to a project do not get a share of the tokens when the project is not explicitly defined in DISTRIBUTION.

Description
Used for project mode and fast dispatch project mode only. One of DISTRIBUTION or GROUP_DISTRIBUTION must be defined when using project mode.

GROUP_DISTRIBUTION and DISTRIBUTION are mutually exclusive. If both are defined in the same feature, the License Scheduler daemon returns an error and ignores this feature.

Defines the distribution policies for the license. The name of each service domain is followed by its distribution policy, in parentheses. The distribution policy determines how the licenses available in each service domain are distributed among the clients. When in fast dispatch project mode, you can only specify one service domain.

The distribution policy is a space-separated list with each project name followed by its share assignment. The share assignment determines what fraction of available licenses is assigned to each project, in the event of competition between projects. Optionally, the share assignment is followed by a slash and the number of licenses owned by that project. License ownership enables a preemption policy (in the event of competition between projects, projects that own licenses preempt jobs.
Licenses are returned to the owner immediately).

Examples
DISTRIBUTION=wanserver (Lp1 1 Lp2 1 Lp3 1 Lp4 1)
In this example, the service domain named wanserver shares licenses equally among four projects. If all projects are competing for a total of eight licenses, each project is entitled to two licenses at all times. If all projects are competing for only two licenses in total, each project is entitled to a license half the time.
DISTRIBUTION=lanserver1 (Lp1 1 Lp2 2/6)
In this example, the service domain named lanserver1 allows Lp1 to use one third of the available licenses and Lp2 to use two thirds of the licenses. However, Lp2 is always entitled to six licenses, and can preempt another project to get the licenses immediately if they are needed. If the projects are competing for a total of 12 licenses, Lp2 is entitled to eight licenses (six on demand, and two more as soon as they are free). If the projects are competing for only six licenses in total, Lp2 is entitled to all of them, and Lp1 can only use licenses when Lp2 does not need them.

DYNAMIC

Syntax
DYNAMIC=Y

Description
If you specify DYNAMIC=Y, you must specify a duration in an rusage resource requirement for the feature. This enables License Scheduler to treat the license as a dynamic resource and prevents License Scheduler from scheduling tokens for the feature when they are not available, or from reserving license tokens when they should actually be free.

Used for project mode only. Cluster mode and fast dispatch project mode do not support rusage duration.

ENABLE_DYNAMIC_RUSAGE

Syntax
ENABLE_DYNAMIC_RUSAGE=Y

Description
Enforces license distribution policies for features where the job checks out licenses in excess of rusage.

When set, ENABLE_DYNAMIC_RUSAGE causes all license checkouts for features where the job checks out licenses in excess of rusage to be considered managed checkouts, instead of unmanaged (or OTHERS).
Used for project mode only. Cluster mode and fast dispatch project mode do not support this parameter.

ENABLE_MINJOB_PREEMPTION

Syntax
ENABLE_MINJOB_PREEMPTION=Y

Description
Minimizes the overall number of preempted jobs by enabling job list optimization. For example, for a job that requires 10 licenses, License Scheduler preempts one job that uses 10 or more licenses rather than 10 jobs that each use one license.

Used for project mode only.

Default
Undefined. License Scheduler does not optimize the job list when selecting jobs to preempt.

FAST_DISPATCH

Syntax
FAST_DISPATCH=Y

Description
Enables fast dispatch project mode for the license feature, which increases license utilization for project licenses. A setting in the Feature section overrides the global setting in the Parameters section.

Used for project mode only.

When enabled, License Scheduler does not have to run lmutil, lmstat, rlmutil, or rlmstat to verify that a license is free before each job dispatch. As soon as a job finishes, the cluster can reuse its licenses for another job of the same project, which keeps gaps between jobs small. However, because License Scheduler does not run lmutil, lmstat, rlmutil, or rlmstat to verify that the license is free, there is an increased chance of a license checkout failure for jobs if the license is already in use by a job in another project.

The fast dispatch project mode supports the following parameters in the Feature section:
v ALLOCATION
v DEMAND_LIMIT
v DISTRIBUTION
v GROUP_DISTRIBUTION
v LM_LICENSE_NAME
v LS_FEATURE_PERCENTAGE
v NAME
v NON_SHARED_DISTRIBUTION
v SERVICE_DOMAINS
v WORKLOAD_DISTRIBUTION

The fast dispatch project mode also supports the MBD_HEARTBEAT_INTERVAL parameter in the Parameters section.
Other parameters are not supported, including those that project mode supports, such as the following parameters:
v ACCINUSE_INCLUDES_OWNERSHIP
v DYNAMIC
v GROUP
v LOCAL_TO
v LS_ACTIVE_PERCENTAGE

Default

Undefined (N). License Scheduler runs in project mode without fast dispatch.

FLEX_NAME

Syntax

FLEX_NAME=feature_name1 [feature_name2 ...]

Description

Replace FLEX_NAME with LM_LICENSE_NAME. FLEX_NAME is maintained only for backwards compatibility.

Optional. Defines the feature name: the name used by FlexNet to identify the type of license. You only need to specify this parameter if the License Scheduler token name is not identical to the FlexNet feature name.

FLEX_NAME allows the NAME parameter to be an alias of the FlexNet feature name. For feature names that start with a number or contain a dash (-), you must set both NAME and FLEX_NAME, where FLEX_NAME is the actual FlexNet Licensing feature name and NAME is an arbitrary license token name you choose.

Specify a space-delimited list of feature names in FLEX_NAME to combine multiple FlexNet features into the one feature name specified under the NAME parameter. This allows you to use the same feature name for multiple FlexNet features that are interchangeable for applications. LSF recognizes the alias of the combined feature (specified in NAME) as a feature name instead of the individual FlexNet feature names specified in FLEX_NAME. When submitting a job to LSF, users specify the combined feature name in the bsub rusage string, which allows the job to use any token from any of the features specified in FLEX_NAME.
Example

To specify AppZ201 as an alias for the FlexNet feature named 201-AppZ:

Begin Feature
FLEX_NAME=201-AppZ
NAME=AppZ201
DISTRIBUTION=LanServer1(Lp1 1 Lp2 1)
End Feature

To combine two FlexNet features (201-AppZ and 202-AppZ) into a feature named AppZ201:

Begin Feature
FLEX_NAME=201-AppZ 202-AppZ
NAME=AppZ201
DISTRIBUTION=LanServer1(Lp1 1 Lp2 1)
End Feature

AppZ201 is a combined feature that uses both 201-AppZ and 202-AppZ tokens. Submitting a job with AppZ201 in the rusage string (for example, bsub -Lp Lp1 -R "rusage[AppZ201=2]" myjob) means that the job checks out tokens for either 201-AppZ or 202-AppZ.

GROUP

Syntax

GROUP=[group_name(project_name ...)] ...

group_name

Specify a name for a group of projects. This is different from a ProjectGroup section; groups of projects are not hierarchical.

project_name

Specify a License Scheduler project (described in the Projects section) that is allowed to use the licenses. The project must appear in the DISTRIBUTION and can belong to only one group.

Description

Optional. Defines groups of projects and specifies the name of each group. The groups defined here are used for group preemption. The number of licenses owned by the group is the total number of licenses owned by member projects.

Used for project mode only. Cluster mode and fast dispatch project mode do not support this parameter.

This parameter is ignored if GROUP_DISTRIBUTION is also defined.

Example

For example, without the GROUP configuration shown, proj1 owns 4 license tokens and can reclaim them using preemption. After adding the GROUP configuration, proj1 and proj2 together own 8 license tokens. If proj2 is idle, proj1 is able to reclaim all 8 license tokens using preemption.
Begin Feature
NAME = AppY
DISTRIBUTION = LanServer1(proj1 1/4 proj2 1/4 proj3 2)
GROUP = GroupA(proj1 proj2)
End Feature

GROUP_DISTRIBUTION

Syntax

GROUP_DISTRIBUTION=top_level_hierarchy_name

top_level_hierarchy_name

Specify the name of the top-level hierarchical group.

Description

Defines the name of the hierarchical group containing the distribution policy attached to this feature, where the hierarchical distribution policy is defined in a ProjectGroup section.

One of DISTRIBUTION or GROUP_DISTRIBUTION must be defined when using project mode. GROUP_DISTRIBUTION and DISTRIBUTION are mutually exclusive; if both are defined in the same feature, the License Scheduler daemon returns an error and ignores this feature.

If GROUP is also defined, it is ignored in favor of GROUP_DISTRIBUTION.

Example

The following example shows the GROUP_DISTRIBUTION parameter configured for hierarchical scheduling with the top-level hierarchical group named groups. The SERVICE_DOMAINS parameter defines a list of service domains that provide tokens for the group.

Begin Feature
NAME = myjob2
GROUP_DISTRIBUTION = groups
SERVICE_DOMAINS = LanServer wanServer
End Feature

INUSE_FROM_RUSAGE

Syntax

INUSE_FROM_RUSAGE=Y|N

Description

When not defined or set to N, the INUSE value (as reported by blstat) uses rusage from bsub job submissions merged with license checkout data reported by blcollect.

When INUSE_FROM_RUSAGE=Y, the INUSE value uses the rusage from bsub job submissions instead of waiting for the blcollect update. This can result in faster reallocation of tokens when using dynamic allocation (that is, when ALLOC_BUFFER is set).

When set for individual license features, the Feature section setting overrides the global Parameters section setting.

Used for cluster mode only.

Default

N

LM_LICENSE_NAME

Syntax

LM_LICENSE_NAME=feature_name1 [feature_name2 ...]

Description

Optional. Defines the feature name: the name used by the license manager to identify the type of license.
You only need to specify this parameter if the License Scheduler token name is not identical to the license manager feature name.

LM_LICENSE_NAME allows the NAME parameter to be an alias of the license manager feature name. For feature names that start with a number or contain a dash (-), you must set both NAME and LM_LICENSE_NAME, where LM_LICENSE_NAME is the actual license manager feature name and NAME is an arbitrary license token name you choose.

Specify a space-delimited list of feature names in LM_LICENSE_NAME to combine multiple license manager features into the one feature name specified under the NAME parameter. This allows you to use the same feature name for multiple license manager features that are interchangeable for applications. LSF recognizes the alias of the combined feature (specified in NAME) as a feature name instead of the individual license manager feature names specified in LM_LICENSE_NAME. When submitting a job to LSF, users specify the combined feature name in the bsub rusage string, which allows the job to use any token from any of the features specified in LM_LICENSE_NAME.

Example

To specify AppZ201 as an alias for the license manager feature named 201-AppZ:

Begin Feature
LM_LICENSE_NAME=201-AppZ
NAME=AppZ201
DISTRIBUTION=LanServer1(Lp1 1 Lp2 1)
End Feature

To combine two license manager features (201-AppZ and 202-AppZ) into a feature named AppZ201:

Begin Feature
LM_LICENSE_NAME=201-AppZ 202-AppZ
NAME=AppZ201
DISTRIBUTION=LanServer1(Lp1 1 Lp2 1)
End Feature

AppZ201 is a combined feature that uses both 201-AppZ and 202-AppZ tokens. Submitting a job with AppZ201 in the rusage string (for example, bsub -Lp Lp1 -R "rusage[AppZ201=2]" myjob) means that the job checks out tokens for either 201-AppZ or 202-AppZ.
LM_REMOVE_INTERVAL

Syntax

LM_REMOVE_INTERVAL=seconds

Description

Specifies the minimum time a job must have a license checked out before lmremove or rlmremove can remove the license. lmremove or rlmremove causes the license manager daemon and vendor daemons to close the TCP connection with the application. They can then retry the license checkout.

When using lmremove or rlmremove as part of the preemption action (LM_REMOVE_SUSP_JOBS), define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove or rlmremove to release licenses from the job.

Used for both project mode and cluster mode.

The value specified for a feature overrides the global value defined in the Parameters section. Each feature definition can specify a different value for this parameter.

Default

Undefined: License Scheduler applies the global value.

LM_REMOVE_SUSP_JOBS

Syntax

LM_REMOVE_SUSP_JOBS=seconds

Description

Enables License Scheduler to use lmremove or rlmremove to remove license features from each recently suspended job. After enabling this parameter, the preemption action is to suspend the job's processes and use lmremove or rlmremove to remove licenses from the application. lmremove or rlmremove causes the license manager daemon and vendor daemons to close the TCP connection with the application.

License Scheduler continues to try removing the license feature for the specified number of seconds after the job is first suspended. When setting this parameter for an application, specify a value greater than the period following a license checkout during which lmremove or rlmremove fails for the application. This ensures that when a job is suspended, its licenses are released. This period depends on the application.
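The following sketch (the feature, project, and service domain names are hypothetical) combines LM_REMOVE_SUSP_JOBS with LM_REMOVE_INTERVAL=0 in fast dispatch project mode:

```
Begin Feature
NAME = AppD
FAST_DISPATCH = Y
DISTRIBUTION = LanServer(Lp1 1 Lp2 1)
LM_REMOVE_SUSP_JOBS = 60
LM_REMOVE_INTERVAL = 0
End Feature
```

With this configuration, License Scheduler keeps retrying lmremove or rlmremove for up to 60 seconds after a job is suspended, and the zero removal interval allows licenses to be removed even immediately after checkout.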
When using lmremove or rlmremove as part of the preemption action, define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove or rlmremove to release licenses from the job.

Used for fast dispatch project mode only.

The value specified for a feature overrides the global value defined in the Parameters section. Each feature definition can specify a different value for this parameter.

Default

Undefined. The default preemption action is to send a TSTP signal to the job.

LMREMOVE_SUSP_JOBS

Syntax

LMREMOVE_SUSP_JOBS=seconds

Description

Replace LMREMOVE_SUSP_JOBS with LM_REMOVE_SUSP_JOBS. LMREMOVE_SUSP_JOBS is maintained only for backwards compatibility.

Enables License Scheduler to use lmremove to remove license features from each recently suspended job. After enabling this parameter, the preemption action is to suspend the job's processes and use lmremove to remove licenses from the application. lmremove causes the license manager daemon and vendor daemons to close the TCP connection with the application.

License Scheduler continues to try removing the license feature for the specified number of seconds after the job is first suspended. When setting this parameter for an application, specify a value greater than the period following a license checkout during which lmremove fails for the application. This ensures that when a job is suspended, its licenses are released. This period depends on the application.

When using lmremove as part of the preemption action, define LM_REMOVE_INTERVAL=0 to ensure that License Scheduler can preempt a job immediately after checkout. After suspending the job, License Scheduler then uses lmremove to release licenses from the job.

Used for fast dispatch project mode only.
The value specified for a feature overrides the global value defined in the Parameters section. Each feature definition can specify a different value for this parameter.

Default

Undefined. The default preemption action is to send a TSTP signal to the job.

LOCAL_TO

Syntax

LOCAL_TO=cluster_name | location_name(cluster_name [cluster_name ...])

Description

Used for project mode only. Cluster mode and fast dispatch project mode do not support this parameter.

Configures token locality for the license feature. You must configure different Feature sections for the same feature based on locality. By default, if LOCAL_TO is not defined, the feature is available to all clients and is not restricted by geographical location.

When LOCAL_TO is configured for a feature, License Scheduler treats license features served to different locations as different token names, and distributes the tokens to projects according to the distribution and allocation policies for the feature.

LOCAL_TO cannot be defined for LSF AE submission clusters.

LOCAL_TO allows you to limit features from different service domains to specific clusters, so License Scheduler only grants tokens of a feature to jobs from clusters that are entitled to them. For example, if your license servers restrict the serving of license tokens to specific geographical locations, use LOCAL_TO to specify the locality of a license token if any feature cannot be shared across all the locations. This avoids having to define different distribution and allocation policies for different service domains, and allows hierarchical group configurations.

License Scheduler manages features with different localities as different resources. Use blinfo and blstat to see the different resource information for the features depending on their cluster locality. License features with different localities must be defined in different Feature sections.
The same service domain can appear only once in the configuration for a given license feature.

A configuration like LOCAL_TO=Site1(clusterA clusterB) configures the feature for more than one cluster when using project mode. A configuration like LOCAL_TO=clusterA configures locality for only one cluster; this is the same as LOCAL_TO=clusterA(clusterA).

Cluster names must be the names of clusters defined in the Clusters section of lsf.licensescheduler.

Examples

Begin Feature
NAME = hspice
DISTRIBUTION = SD1 (Lp1 1 Lp2 1)
LOCAL_TO = siteUS(clusterA clusterB)
End Feature

Begin Feature
NAME = hspice
DISTRIBUTION = SD2 (Lp1 1 Lp2 1)
LOCAL_TO = clusterA
End Feature

Begin Feature
NAME = hspice
DISTRIBUTION = SD3 (Lp1 1 Lp2 1) SD4 (Lp1 1 Lp2 1)
End Feature

Or use the hierarchical group configuration (GROUP_DISTRIBUTION):

Begin Feature
NAME = hspice
GROUP_DISTRIBUTION = group1
SERVICE_DOMAINS = SD1
LOCAL_TO = clusterA
End Feature

Begin Feature
NAME = hspice
GROUP_DISTRIBUTION = group1
SERVICE_DOMAINS = SD2
LOCAL_TO = clusterB
End Feature

Begin Feature
NAME = hspice
GROUP_DISTRIBUTION = group1
SERVICE_DOMAINS = SD3 SD4
End Feature

Default

Not defined. The feature is available to all clusters and taskman jobs, and is not restricted by cluster.

LS_ACTIVE_PERCENTAGE

Syntax

LS_ACTIVE_PERCENTAGE=Y | N

Description

Configures license ownership in percentages instead of absolute numbers, and adjusts ownership for inactive projects. Sets LS_FEATURE_PERCENTAGE=Y automatically.

Setting LS_ACTIVE_PERCENTAGE=Y dynamically adjusts ownership based on project activity, setting ownership to zero for inactive projects and restoring the configured ownership setting when projects become active. If the total ownership for the license feature is greater than 100%, each ownership value is scaled appropriately for a total ownership of 100%.

Used for project mode only. Cluster mode and fast dispatch project mode do not support this parameter.
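As an illustrative sketch (the feature, project, and service domain names are hypothetical), ownership can be expressed as percentages that adjust with project activity:

```
Begin Feature
NAME = AppB
LS_ACTIVE_PERCENTAGE = Y
DISTRIBUTION = LanServer(Lp1 1/60 Lp2 1/60)
End Feature
```

While both projects are active, the configured ownership totals 120%, so each value is scaled down to 50%. If Lp2 becomes inactive, its ownership drops to zero and Lp1 is restored to its configured 60%.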
Default

N (Ownership values are not changed based on project activity.)

LS_FEATURE_PERCENTAGE

Syntax

LS_FEATURE_PERCENTAGE=Y | N

Description

Configures license ownership in percentages instead of absolute numbers. When not combined with hierarchical projects, affects only the owned values in DISTRIBUTION and the NON_SHARED_DISTRIBUTION values. When using hierarchical projects, the percentage is applied to the OWNERSHIP, LIMITS, and NON_SHARED values.

Used for project mode and fast dispatch project mode only.

Example 1

Begin Feature
LS_FEATURE_PERCENTAGE = Y
DISTRIBUTION = LanServer (p1 1 p2 1 p3 1/20)
...
End Feature

The service domain LanServer shares licenses equally among three License Scheduler projects. p3 is always entitled to 20% of the total licenses, and can preempt another project to get the licenses immediately if they are needed.

Example 2

With LS_FEATURE_PERCENTAGE=Y in the Feature section and using hierarchical project groups:

Begin ProjectGroup
GROUP        SHARES  OWNERSHIP  LIMITS  NON_SHARED
(R (A p4))   (1 1)   ()         ()      ()
(A (B p3))   (1 1)   (- 10)     (- 20)  ()
(B (p1 p2))  (1 1)   (30 -)     ()      (- 5)
End ProjectGroup

Project p1 owns 30% of the total licenses, and project p3 owns 10% of the total licenses. p3's LIMITS value is 20% of the total licenses, and p2's NON_SHARED value is 5%.

Default

N (Ownership is not configured with percentages, but with absolute numbers.)

LS_WAIT_TO_PREEMPT

Syntax

LS_WAIT_TO_PREEMPT=seconds

Description

Defines the number of seconds that a job must wait after it is dispatched before it can be preempted. Applies to LSF and taskman jobs.

Used for project mode only.
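A minimal sketch (the feature, project, and service domain names are hypothetical) that protects newly dispatched jobs for five minutes:

```
Begin Feature
NAME = AppC
DISTRIBUTION = LanServer(Lp1 1 Lp2 1)
LS_WAIT_TO_PREEMPT = 300
End Feature
```

Jobs using this feature cannot be preempted during their first 300 seconds after dispatch.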
When LM_REMOVE_INTERVAL is also defined, the LM_REMOVE_INTERVAL value overrides the LS_WAIT_TO_PREEMPT value.

Default

0. The job can be preempted even if it was just dispatched.

NAME

Required. Defines the token name: the name used by License Scheduler and LSF to identify the license feature.

Normally, license token names should be the same as the FlexNet Licensing feature names, as they represent the same license. However, LSF does not support names that start with a number, or names containing a dash or hyphen character (-), which may be used in FlexNet Licensing feature names.

NON_SHARED_DISTRIBUTION

Syntax

NON_SHARED_DISTRIBUTION=service_domain_name ([project_name number_non_shared_licenses] ...) ...

service_domain_name

Specify a License Scheduler service domain (described in the ServiceDomain section) that distributes the licenses.

project_name

Specify a License Scheduler project (described in the Projects section) that is allowed to use the licenses.

number_non_shared_licenses

Specify a positive integer representing the number of non-shared licenses that the project owns.

Description

Optional. Defines non-shared licenses. Non-shared licenses are privately owned and are not shared with other license projects; they are available to only one project.

Used for project mode and fast dispatch project mode only.

Use blinfo -a to display NON_SHARED_DISTRIBUTION information.

For projects defined with NON_SHARED_DISTRIBUTION, you must assign the project an OWNERSHIP value equal to or greater than the number of non-shared licenses. If the number of owned licenses is less than the number of non-shared licenses, OWNERSHIP is set to the number of non-shared licenses.

Examples

v If the number of tokens normally given to a project (to satisfy the DISTRIBUTION share ratio) is larger than its NON_SHARED_DISTRIBUTION value, the DISTRIBUTION share ratio takes effect first.
Begin Feature
NAME=f1 # total 15 on LanServer
LM_LICENSE_NAME=VCS-RUNTIME
DISTRIBUTION=LanServer(Lp1 4/10 Lp2 1)
NON_SHARED_DISTRIBUTION=LanServer(Lp1 10)
End Feature

In this example, 10 non-shared licenses are defined for the Lp1 project on LanServer. The DISTRIBUTION share ratio for Lp1:Lp2 is 4:1. If there are 15 licenses, Lp1 will normally get 12 licenses, which is larger than its NON_SHARED_DISTRIBUTION value of 10. Therefore, the DISTRIBUTION share ratio takes effect, so Lp1 gets 12 licenses and Lp2 gets 3 licenses for the 4:1 share ratio.

v If the number of tokens normally given to a project (to satisfy the DISTRIBUTION share ratio) is smaller than its NON_SHARED_DISTRIBUTION value, the project first gets the number of tokens equal to NON_SHARED_DISTRIBUTION, then the DISTRIBUTION share ratio for the other projects takes effect for the remaining licenses.

– For one project with non-shared licenses and one project with no non-shared licenses, the project with no non-shared licenses is given all the remaining licenses, since it would normally be given more according to the DISTRIBUTION share ratio:

Begin Feature
NAME=f1 # total 15 on LanServer
LM_LICENSE_NAME=VCS-RUNTIME
DISTRIBUTION=LanServer(Lp1 1/10 Lp2 4)
NON_SHARED_DISTRIBUTION=LanServer(Lp1 10)
End Feature

In this example, 10 non-shared licenses are defined for the Lp1 project on LanServer. The DISTRIBUTION share ratio for Lp1:Lp2 is 1:4. If there are 15 licenses, Lp1 will normally get three licenses, which is smaller than its NON_SHARED_DISTRIBUTION value of 10. Therefore, Lp1 gets the first 10 licenses, and Lp2 gets the remaining five licenses (since it would normally get more according to the share ratio).
– For one project with non-shared licenses and two or more projects with no non-shared licenses, the projects with no non-shared licenses are assigned the remaining licenses according to the DISTRIBUTION share ratio with each other, ignoring the share ratio for the project with non-shared licenses.

Begin Feature
NAME=f1 # total 15 on LanServer
LM_LICENSE_NAME=VCS-RUNTIME
DISTRIBUTION=LanServer(Lp1 1/10 Lp2 4 Lp3 2)
NON_SHARED_DISTRIBUTION=LanServer(Lp1 10)
End Feature

In this example, 10 non-shared licenses are defined for the Lp1 project on LanServer. The DISTRIBUTION share ratio for Lp1:Lp2:Lp3 is 1:4:2. If there are 15 licenses, Lp1 will normally get two licenses, which is smaller than its NON_SHARED_DISTRIBUTION value of 10. Therefore, Lp1 gets the first 10 licenses. The remaining licenses are given to Lp2 and Lp3 in a ratio of 4:2, so Lp2 gets three licenses and Lp3 gets two licenses.

– For two projects with non-shared licenses and one project with no non-shared licenses, the project with no non-shared licenses is given the remaining licenses after the two projects are given their non-shared licenses:

Begin Feature
NAME=f1 # total 15 on LanServer
LM_LICENSE_NAME=VCS-RUNTIME
DISTRIBUTION=LanServer(Lp1 1/10 Lp2 4 Lp3 2/5)
NON_SHARED_DISTRIBUTION=LanServer(Lp1 10 Lp3 5)
End Feature

In this example, 10 non-shared licenses are defined for the Lp1 project and five non-shared licenses are defined for the Lp3 project on LanServer. The DISTRIBUTION share ratio for Lp1:Lp2:Lp3 is 1:4:2. If there are 15 licenses, Lp1 will normally get two licenses and Lp3 will normally get four licenses, both smaller than their corresponding NON_SHARED_DISTRIBUTION values. Therefore, Lp1 gets 10 licenses and Lp3 gets five licenses. Lp2 gets no licenses, even though it normally has the largest share, because Lp1 and Lp3 have non-shared licenses.

PEAK_INUSE_PERIOD

Syntax

PEAK_INUSE_PERIOD=seconds | cluster seconds ...
Description

Defines the interval over which a peak INUSE value is determined for dynamic license allocation in cluster mode for this license feature and service domain.

Use the keyword default to set the interval for all clusters not otherwise specified, and the keyword interactive (in place of a cluster name) to set the interval for taskman jobs. For example:

PEAK_INUSE_PERIOD = cluster1 1000 cluster2 700 default 300

When defining the interval for LSF AE submission clusters, the interval is determined for the entire LSF AE mega-cluster (the submission cluster and its execution clusters).

Used for cluster mode only. When defined in both the Parameters section and the Feature section, the Feature section definition is used for that license feature.

Default

300 seconds

PREEMPT_ORDER

Syntax

PREEMPT_ORDER=BY_OWNERSHIP

Description

Optional. Sets the preemption order based on configured OWNERSHIP.

Used for project mode only.

Default

Not defined.

PREEMPT_RESERVE

Syntax

PREEMPT_RESERVE=Y | N

Description

Optional. If PREEMPT_RESERVE=Y, License Scheduler can preempt licenses that are either reserved or already in use by other projects. The number of jobs must be greater than the number of licenses owned. If PREEMPT_RESERVE=N, License Scheduler does not preempt reserved licenses.

Used for project mode only.

Default

Y. Reserved licenses are preemptable.

RETENTION_FACTOR

Syntax

RETENTION_FACTOR=integer%

Description

Ensures that when tokens are reclaimed from an overfed cluster, the overfed cluster still gets to dispatch additional jobs, but at a reduced rate. Specify the retention factor as a percentage of tokens to be retained by the overfed cluster.
For example:

Begin Feature
NAME = f1
CLUSTER_MODE = Y
CLUSTER_DISTRIBUTION = LanServer(LAN1 1 LAN2 1)
ALLOC_BUFFER = 20
RETENTION_FACTOR = 25%
End Feature

With RETENTION_FACTOR set, as jobs finish in the overfed cluster and free up tokens, at least 25% of the tokens can be reused by the cluster to dispatch additional jobs. Tokens not held by the cluster are redistributed to other clusters. In general, a higher value means that the process of reclaiming tokens from an overfed cluster takes longer, and an overfed cluster gets to dispatch more jobs while tokens are being reclaimed from it.

When the entire LSF AE mega-cluster (the submission cluster and its execution clusters) is overfed, the number of retained tokens is taken from the entire LSF AE mega-cluster.

Used for cluster mode only.

Default

Not defined

SERVICE_DOMAINS

Syntax

SERVICE_DOMAINS=service_domain_name ...

service_domain_name

Specify the name of the service domain.

Description

Required if GROUP_DISTRIBUTION is defined. Specifies the service domains that provide tokens for this feature.

Only a single service domain can be specified when using cluster mode or fast dispatch project mode.

WORKLOAD_DISTRIBUTION

Syntax

WORKLOAD_DISTRIBUTION=[service_domain_name(LSF lsf_distribution NON_LSF non_lsf_distribution)] ...

service_domain_name

Specify a License Scheduler service domain (described in the ServiceDomain section) that distributes the licenses.

lsf_distribution

Specify the share of licenses dedicated to LSF workloads. The share of licenses dedicated to LSF workloads is the ratio of lsf_distribution:non_lsf_distribution.

non_lsf_distribution

Specify the share of licenses dedicated to non-LSF workloads. The share of licenses dedicated to non-LSF workloads is the ratio of non_lsf_distribution:lsf_distribution.

Description

Optional. Defines the distribution given to each LSF and non-LSF workload within the specified service domain.
When running in cluster mode, WORKLOAD_DISTRIBUTION can only be specified for WAN service domains; if defined for a LAN feature, it is ignored.

Use blinfo -a to display the WORKLOAD_DISTRIBUTION configuration.

Example

Begin Feature
NAME=ApplicationX
DISTRIBUTION=LicenseServer1(Lp1 1 Lp2 2)
WORKLOAD_DISTRIBUTION=LicenseServer1(LSF 8 NON_LSF 2)
End Feature

On the LicenseServer1 domain, the available licenses are dedicated in a ratio of 8:2 for LSF and non-LSF workloads. This means that 80% of the available licenses are dedicated to the LSF workload, and 20% are dedicated to the non-LSF workload. If LicenseServer1 has a total of 80 licenses, this configuration indicates that 64 licenses are dedicated to the LSF workload and 16 licenses are dedicated to the non-LSF workload.

FeatureGroup section

Description

Optional. Collects license features into groups. Put FeatureGroup sections after Feature sections in lsf.licensescheduler. The FeatureGroup section is supported in both project mode and cluster mode.

FeatureGroup section structure

The FeatureGroup section begins and ends with the lines Begin FeatureGroup and End FeatureGroup. A feature group definition consists of a unique name and a list of features contained in the feature group.

Example

Begin FeatureGroup
NAME = Synposys
FEATURE_LIST = ASTRO VCS_Runtime_Net Hsim Hspice
End FeatureGroup

Begin FeatureGroup
NAME = Cadence
FEATURE_LIST = Encounter NCSim NCVerilog
End FeatureGroup

Parameters

v NAME
v FEATURE_LIST

NAME

Required. Defines the name of the feature group. The name must be unique.

FEATURE_LIST

Required. Lists the license features contained in the feature group. The feature names in FEATURE_LIST must already be defined in Feature sections. Feature names cannot be repeated in the FEATURE_LIST of one feature group. The FEATURE_LIST cannot be empty. Different feature groups can have the same features in their FEATURE_LIST.
ProjectGroup section

Description

Optional. Defines the hierarchical relationships of projects.

Used for project mode only. When running in cluster mode, any ProjectGroup sections are ignored.

The hierarchical groups can have multiple levels of grouping. You can configure a tree-like scheduling policy, with the leaves being the license projects that jobs can belong to. Each project group in the tree has a set of values, including shares, limits, ownership, and non-shared (or exclusive) licenses.

Use blstat -G to view the hierarchical dynamic license information. Use blinfo -G to view the hierarchical configuration.

ProjectGroup section structure

Define a section for each hierarchical group managed by License Scheduler. The keywords GROUP, SHARES, OWNERSHIP, LIMITS, and NON_SHARED are required. The keywords PRIORITY and DESCRIPTION are optional. Empty brackets are allowed only for OWNERSHIP, LIMITS, and PRIORITY. SHARES must be specified.

Begin ProjectGroup
GROUP           SHARES   OWNERSHIP  LIMITS  NON_SHARED  PRIORITY
(root(A B C))   (1 1 1)  ()         ()      ()          (3 2 -)
(A (P1 D))      (1 1)    ()         ()      ()          (3 5)
(B (P4 P5))     (1 1)    ()         ()      ()          ()
(C (P6 P7 P8))  (1 1 1)  ()         ()      ()          (8 3 0)
(D (P2 P3))     (1 1)    ()         ()      ()          (2 1)
End ProjectGroup

If desired, ProjectGroup sections can be completely independent, without any overlapping projects.

Begin ProjectGroup
GROUP                        SHARES   OWNERSHIP  LIMITS  NON_SHARED
(digital_sim (sim sim_reg))  (40 60)  (100 0)    ()      ()
End ProjectGroup

Begin ProjectGroup
GROUP                                    SHARES      OWNERSHIP  LIMITS    NON_SHARED
(analog_sim (app1 multitoken app1_reg))  (50 10 40)  (65 25 0)  (- 50 -)  ()
End ProjectGroup

Parameters

v DESCRIPTION
v GROUP
v LIMITS
v NON_SHARED
v OWNERSHIP
v PRIORITY
v SHARES

DESCRIPTION

Optional. Description of the project group.

The text can include any characters, including white space. The text can be extended to multiple lines by ending the preceding line with a backslash (\). The maximum length for the text is 64 characters.
When the DESCRIPTION column is not empty, it should contain one entry for each project group member. For example:

GROUP        SHARES  OWNERSHIP  LIMITS  NON_SHARED  DESCRIPTION
(R (A B))    (1 1)   ()         ()      (10 10)     ()
(A (p1 p2))  (1 1)   (40 60)    ()      ()          ("p1 desc." "")
(B (p3 p4))  (1 1)   ()         ()      ()          ("p3 desc." "p4 desc.")

Use blinfo -G to view hierarchical project group descriptions.

GROUP

Defines the project names in the hierarchical grouping and its relationships. Each entry specifies the name of the hierarchical group and its members. For better readability, specify the projects in order from the root to the leaves, as in the example. Specify each entry as follows:

(group (member ...))

LIMITS

Defines the maximum number of licenses that can be used at any one time by the hierarchical group member projects. Specify the maximum number of licenses for each member, separated by spaces, in the same order as listed in the GROUP column.

A dash (-) is equivalent to INFINIT_INT, which means there is no maximum limit and the project group can use as many licenses as possible. You can leave the parentheses empty () if desired.

NON_SHARED

Defines the number of licenses that the hierarchical group member projects use exclusively. Specify the number of licenses for each group or project, separated by spaces, in the same order as listed in the GROUP column. A dash (-) is equivalent to a zero, which means there are no licenses that the hierarchical group member projects use exclusively.

For hierarchical project groups in fast dispatch project mode, License Scheduler ignores the NON_SHARED value configured for project groups and uses only the NON_SHARED values for the child projects. The project group's NON_SHARED value is the sum of the NON_SHARED values of its child projects.

Normally, the total number of non-shared licenses should be less than the total number of license tokens available.
License tokens may not be available to project groups if the total non-shared licenses for all groups is greater than the number of shared tokens available. For example, feature p4_4 is configured as follows, with a total of 4 tokens:

Begin Feature
NAME=p4_4  # total token value is 4
GROUP_DISTRIBUTION=final
SERVICE_DOMAINS=LanServer
End Feature

The correct configuration is:

GROUP            SHARES  OWNERSHIP  LIMITS  NON_SHARED
(final (G2 G1))  (1 1)   ()         ()      (2 0)
(G1 (AP2 AP1))   (1 1)   ()         ()      (1 1)

Valid values

Any positive integer up to the LIMITS value defined for the specified hierarchical group. If defined as greater than LIMITS, NON_SHARED is set to LIMITS.

OWNERSHIP

Defines the level of ownership of the hierarchical group member projects. Specify the ownership for each member, separated by spaces, in the same order as listed in the GROUP column. You can only define OWNERSHIP for hierarchical group member projects, not hierarchical groups. Do not define OWNERSHIP for the top level (root) project group. Ownership of a given internal node is the sum of the ownership of all child nodes it directly governs. A dash (-) is equivalent to a zero, which means there are no owners of the projects. You can leave the parentheses empty () if desired.

Valid values

A positive integer between the NON_SHARED and LIMITS values defined for the specified hierarchical group.
v If defined as less than NON_SHARED, OWNERSHIP is set to NON_SHARED.
v If defined as greater than LIMITS, OWNERSHIP is set to LIMITS.

PRIORITY

Optional. Defines the priority assigned to the hierarchical group member projects. Specify the priority for each member, separated by spaces, in the same order as listed in the GROUP column. "0" is the lowest priority, and a higher number specifies a higher priority. This column overrides the default behavior.
Instead of preempting based on the accumulated inuse usage of each project, the projects are preempted according to the specified priority from lowest to highest.

By default, priorities are evaluated top down in the project group hierarchy. The priority of a given node is first decided by the priority of the parent groups. When two nodes have the same priority, priority is determined by the accumulated inuse usage of each project at the time the priorities are evaluated. Specify LS_PREEMPT_PEER=Y in the Parameters section to enable bottom-up license token preemption in hierarchical project group configuration.

A dash (-) is equivalent to a zero, which means there is no priority for the project. You can leave the parentheses empty () if desired.

Use blinfo -G to view hierarchical project group priority information.

Priority of default project

If not explicitly configured, the default project has the priority of 0. You can override this value by explicitly configuring the default project in the Projects section with the chosen priority value.

SHARES

Required. Defines the shares assigned to the hierarchical group member projects. Specify the share for each member, separated by spaces, in the same order as listed in the GROUP column.

Projects section

Description

Required for project mode only. Ignored in cluster mode. Lists the License Scheduler projects.

Projects section structure

The Projects section begins and ends with the lines Begin Projects and End Projects. The second line consists of the required column heading PROJECTS and the optional column heading PRIORITY. Subsequent lines list participating projects, one name per line.

Examples

The following example lists the projects without defining the priority:

Begin Projects
PROJECTS
Lp1
Lp2
Lp3
Lp4
...
End Projects

The following example lists the projects and defines the priority of each project:

Begin Projects
PROJECTS  PRIORITY
Lp1       3
Lp2       4
Lp3       2
Lp4       1
default   0
...
End Projects

Parameters
v DESCRIPTION
v PRIORITY
v PROJECTS

DESCRIPTION

Optional. Description of the project. The text can include any characters, including white space. The text can be extended to multiple lines by ending the preceding line with a backslash (\). The maximum length for the text is 64 characters.

Use blinfo -Lp to view the project description.

PRIORITY

Optional. Defines the priority for each project, where "0" is the lowest priority and a higher number specifies a higher priority. This column overrides the default behavior. Instead of preempting in the order the projects are listed under PROJECTS based on the accumulated inuse usage of each project, the projects are preempted according to the specified priority from lowest to highest. Used for project mode only.

When two projects are configured with the same priority number, the first project listed has higher priority, like LSF queues.

Use blinfo -Lp to view project priority information.

Priority of default project

If not explicitly configured, the default project has the priority of 0. You can override this value by explicitly configuring the default project in the Projects section with the chosen priority value.

PROJECTS

Defines the name of each participating project. Specify using one name per line.

Automatic time-based configuration

Variable configuration is used to automatically change License Scheduler license token distribution policy configuration based on time windows. You define automatic configuration changes in lsf.licensescheduler by using if-else constructs and time expressions in the Feature section. After you change the file, check the configuration with the bladmin ckconfig command, and restart License Scheduler in the cluster with the bladmin reconfig command.

Used for both project mode and cluster mode.

The expressions are evaluated by License Scheduler every 10 minutes based on the bld start time.
When an expression evaluates true, License Scheduler dynamically changes the configuration based on the associated configuration statements, restarting bld automatically.

When LSF determines a feature has been added, removed, or changed, mbatchd no longer restarts automatically. Instead, a message indicates that a change has been detected, prompting the user to restart manually with badmin mbdrestart. This affects automatic time-based configuration in the Feature section of lsf.licensescheduler. When mbatchd detects a change in the Feature configuration, you must restart mbatchd for the change to take effect.

Example

Begin Feature
NAME = f1
#if time(5:16:30-1:8:30 20:00-8:30)
DISTRIBUTION=Lan(P1 2/5 P2 1)
#elif time(3:8:30-3:18:30)
DISTRIBUTION=Lan(P3 1)
#else
DISTRIBUTION=Lan(P1 1 P2 2/5)
#endif
End Feature

lsf.shared

The lsf.shared file contains common definitions that are shared by all load sharing clusters defined by lsf.cluster.cluster_name files. This includes lists of cluster names, host types, host models, the special resources available, and external load indices, including indices required to submit jobs using JSDL files.

This file is installed by default in the directory defined by LSF_CONFDIR.

Changing lsf.shared configuration

After making any changes to lsf.shared, run the following commands:
v lsadmin reconfig to reconfigure LIM
v badmin mbdrestart to restart mbatchd

Cluster section

(Required) Lists the cluster names recognized by the LSF system.

Cluster section structure

The first line must contain the mandatory keyword ClusterName; in a MultiCluster environment, it must also contain the keyword Servers. The other keyword is optional. Each subsequent line defines one cluster.

Example Cluster section

Begin Cluster
ClusterName  Servers
cluster1     hostA
cluster2     hostB
End Cluster

ClusterName

Defines all cluster names recognized by the LSF system.
All cluster names referenced anywhere in the LSF system must be defined here. The file names of cluster-specific configuration files must end with the associated cluster name. By default, if MultiCluster is installed, all clusters listed in this section participate in the same MultiCluster environment. However, individual clusters can restrict their MultiCluster participation by specifying a subset of clusters at the cluster level (lsf.cluster.cluster_name RemoteClusters section).

Servers

MultiCluster only. List of hosts in this cluster that LIMs in remote clusters can connect to and obtain information from. For other clusters to work with this cluster, one of these hosts must be running mbatchd.

MultiCluster shared configuration

A MultiCluster environment allows common configurations to be shared by all clusters. Use #INCLUDE to centralize the configuration work for groups of clusters when they all need to share a common configuration. Using #INCLUDE lets you avoid having to manually merge these common configurations into each local cluster's configuration files.

Local administrators for each cluster open their local configuration files (lsf.shared and lsb.applications) and add the #include "path_to_file" syntax to them. All #include lines must be inserted at the beginning of the local configuration file. For example:

#INCLUDE "/Shared/lsf.shared.common.a"
#INCLUDE "/Shared/lsf.shared.common.c"
Begin Cluster
ClusterName  Servers
...

To make the new configuration active in lsf.shared, restart LSF with the lsfrestart command. Both common resources and local resources will take effect on the local cluster. Once LSF is running again, use the lsinfo command to check whether the configuration is active.

HostType section

(Required) Lists the valid host types in the cluster. All hosts that can run the same binary executable are in the same host type.
CAUTION: If you remove NTX86, NTX64, or NTIA64 from the HostType section, the functionality of lspasswd.exe is affected. The lspasswd command registers a password for a Windows user account.

HostType section structure

The first line consists of the mandatory keyword TYPENAME. Subsequent lines name valid host types.

Example HostType section

Begin HostType
TYPENAME
SOL64
SOLSPARC
LINUX86
LINUXPPC
LINUX64
NTX86
NTX64
NTIA64
End HostType

TYPENAME

Host type names are usually based on a combination of the hardware name and operating system. If your site already has a system for naming host types, you can use the same names for LSF.

HostModel section

(Required) Lists models of machines and gives the relative CPU scaling factor for each model. All hosts of the same relative speed are assigned the same host model.

LSF uses the relative CPU scaling factor to normalize the CPU load indices so that jobs are more likely to be sent to faster hosts. The CPU factor affects the calculation of job execution time limits and accounting. Using large or inaccurate values for the CPU factor can cause confusing results when CPU time limits or accounting are used.

HostModel section structure

The first line consists of the mandatory keywords MODELNAME, CPUFACTOR, and ARCHITECTURE. Subsequent lines define a model and its CPU factor.

Example HostModel section

Begin HostModel
MODELNAME  CPUFACTOR  ARCHITECTURE
PC400      13.0       (i86pc_400 i686_400)
PC450      13.2       (i86pc_450 i686_450)
Sparc5F    3.0        (SUNWSPARCstation5_170_sparc)
Sparc20    4.7        (SUNWSPARCstation20_151_sparc)
Ultra5S    10.3       (SUNWUltra5_270_sparcv9 SUNWUltra510_270_sparcv9)
End HostModel

ARCHITECTURE

(Reserved for system use only) Indicates automatically detected host models that correspond to the model names.
CPUFACTOR

Though it is not required, you would typically assign a CPU factor of 1.0 to the slowest machine model in your system and higher numbers for the others. For example, for a machine model that executes at twice the speed of your slowest model, a factor of 2.0 should be assigned.

MODELNAME

Generally, you need to identify the distinct host types in your system first, such as MIPS and SPARC, and then the machine models within each, such as SparcIPC, Sparc1, Sparc2, and Sparc10.

About automatically detected host models and types

When you first install LSF, you do not necessarily need to assign models and types to hosts in lsf.cluster.cluster_name. If you do not assign models and types to hosts in lsf.cluster.cluster_name, LIM automatically detects the model and type for the host.

If you have versions earlier than LSF 4.0, you may have host models and types already assigned to hosts. You can take advantage of automatic detection of host model and type also.

Automatic detection of host model and type is useful because you no longer need to make changes in the configuration files when you upgrade the operating system or hardware of a host and reconfigure the cluster. LSF will automatically detect the change.

Mapping to CPU factors

Automatically detected models are mapped to the short model names in lsf.shared in the ARCHITECTURE column. Model strings in the ARCHITECTURE column are only used for mapping to the short model names.

Example lsf.shared file:

Begin HostModel
MODELNAME  CPUFACTOR  ARCHITECTURE
SparcU5    5.0        (SUNWUltra510_270_sparcv9)
PC486      2.0        (i486_33 i486_66)
PowerPC    3.0        (PowerPC12 PowerPC16 PowerPC31)
End HostModel

If an automatically detected host model cannot be matched with the short model name, it is matched to the best partial match and a warning message is generated. If a host model cannot be detected or is not supported, it is assigned the DEFAULT model name and an error message is generated.
Naming convention

Models that are automatically detected are named according to the following convention:

hardware_platform[_processor_speed[_processor_type]]

where:
v hardware_platform is the only mandatory component
v processor_speed is the optional clock speed and is used to differentiate computers within a single platform
v processor_type is the optional processor manufacturer used to differentiate processors with the same speed
v Underscores (_) between hardware_platform, processor_speed, and processor_type are mandatory.

Resource section

Optional. Defines resources (must be done by the LSF administrator).

Resource section structure

The first line consists of the keywords. RESOURCENAME and DESCRIPTION are mandatory; the other keywords are optional. Subsequent lines define resources.

Example Resource section

Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  CONSUMABLE  DESCRIPTION                      # Keywords
patchrev      Numeric  ()        Y           ()          (Patch revision)
specman       Numeric  ()        N           ()          (Specman)
switch        Numeric  ()        Y           N           (Network Switch)
rack          String   ()        ()          ()          (Server room rack)
owner         String   ()        ()          ()          (Owner of the host)
elimres       Numeric  10        Y           ()          (elim generated index)
ostype        String   ()        ()          ()          (Operating system and version)
lmhostid      String   ()        ()          ()          (FlexLM's lmhostid)
limversion    String   ()        ()          ()          (Version of LIM binary)
End Resource

RESOURCENAME

The name you assign to the new resource. An arbitrary character string.
v A resource name cannot begin with a number.
v A resource name cannot contain any of the following characters: : . ( ) [ + - * / ! & | < > @ =
v A resource name cannot be any of the following reserved names: cpu cpuf io logins ls idle maxmem maxswp maxtmp type model status it mem ncpus define_ncpus_cores define_ncpus_procs define_ncpus_threads ndisks pg r15m r15s r1m swap swp tmp ut
v To avoid conflict with inf and nan keywords in 3rd-party libraries, resource names should not begin with inf or nan (upper case or lower case).
Resource requirement strings such as -R "infra" or -R "nano" will cause an error. Use -R "defined(infxx)" or -R "defined(nanxx)" to specify these resource names.
v Resource names are case sensitive.
v Resource names can be up to 39 characters in length.
v For Solaris machines, the keyword int is reserved and cannot be used.

TYPE

The type of resource:
v Boolean—Resources that have a value of 1 on hosts that have the resource and 0 otherwise.
v Numeric—Resources that take numerical values, such as all the load indices, number of processors on a host, or host CPU factor.
v String—Resources that take string values, such as host type, host model, host status.

Default

If TYPE is not given, the default type is Boolean.

INTERVAL

Optional. Applies to dynamic resources only. Defines the time interval (in seconds) at which the resource is sampled by the ELIM. If INTERVAL is defined for a numeric resource, it becomes an external load index.

Default

If INTERVAL is not given, the resource is considered static.

INCREASING

Applies to numeric resources only. If a larger value means greater load, INCREASING should be defined as Y. If a smaller value means greater load, INCREASING should be defined as N.

CONSUMABLE

Explicitly controls whether a resource is consumable. Applies to static or dynamic numeric resources. Static and dynamic numeric resources can be specified as consumable. CONSUMABLE is optional. The defaults for the consumable attribute are:
v Built-in indices:
– The following are consumable: r15s, r1m, r15m, ut, pg, io, ls, it, tmp, swp, mem.
– All other built-in static resources are not consumable (for example, ncpus, ndisks, maxmem, maxswp, maxtmp, cpuf, type, model, status, rexpri, server, hname).
v External shared resources:
– All numeric resources are consumable.
– String and boolean resources are not consumable.

You should only specify consumable resources in the rusage section of a resource requirement string.
Non-consumable resources are ignored in rusage sections. A non-consumable resource should not be releasable. Non-consumable numeric resources can be used in the order, select, and same sections of a resource requirement string.

When LSF_STRICT_RESREQ=Y in lsf.conf, LSF rejects resource requirement strings where an rusage section contains a non-consumable resource.

DESCRIPTION

Brief description of the resource. The information defined here will be returned by the ls_info() API call or printed out by the lsinfo command as an explanation of the meaning of the resource.

RELEASE

Applies to numeric shared resources only. Controls whether LSF releases the resource when a job using the resource is suspended. When a job using a shared resource is suspended, the resource is held or released by the job depending on the configuration of this parameter. Specify N to hold the resource, or specify Y to release the resource.

Default

N

lsf.sudoers

About lsf.sudoers

The lsf.sudoers file is an optional file to configure security mechanisms. It is not installed by default.

You use lsf.sudoers to set the parameter LSF_EAUTH_KEY to configure a key for eauth to encrypt and decrypt user authentication data.

On UNIX, you also use lsf.sudoers to grant permission to users other than root to perform certain operations as root in LSF, or as a specified user. These operations include:
v LSF daemon startup/shutdown
v User ID for LSF authentication
v User ID for LSF pre- and post-execution commands
v User ID for external LSF executables

If lsf.sudoers does not exist, only root can perform these operations in LSF on UNIX.

On UNIX, this file is located in /etc. There is one lsf.sudoers file per host.

On Windows, this file is located in the directory specified by the parameter LSF_SECUREDIR in lsf.conf.
Changing lsf.sudoers configuration

After making any changes to lsf.sudoers, run badmin reconfig to reload the configuration files.

lsf.sudoers on UNIX

In LSF, certain operations such as daemon startup can only be performed by root. The lsf.sudoers file grants root privileges to specific users or user groups to perform these operations.

Location

lsf.sudoers must be located in /etc on each host.

Permissions

lsf.sudoers must have permission 600 and be readable and writable only by root.

lsf.sudoers on Windows

The lsf.sudoers file is shared over an NTFS network, not duplicated on every Windows host. By default, LSF installs lsf.sudoers in the %SYSTEMROOT% directory. The location of lsf.sudoers on Windows must be specified by LSF_SECUREDIR in lsf.conf. You must configure the LSF_SECUREDIR parameter in lsf.conf if using lsf.sudoers on Windows.

Windows permissions

Restriction: The owner of lsf.sudoers on Windows must be Administrators. If not, eauth may not work.

The permissions on lsf.sudoers for Windows are:

Workgroup Environment
v Local Admins (W)
v Everyone (R)

Domain Environment
v Domain Admins (W)
v Everyone (R)

File format

The format of lsf.sudoers is very similar to that of lsf.conf. Each entry can have one of the following forms:
v NAME=VALUE
v NAME=
v NAME="STRING1 STRING2 ..."

The equal sign = must follow each NAME even if no value follows, and there should be no space beside the equal sign. NAME describes an authorized operation. VALUE is a single string or multiple strings separated by spaces and enclosed in quotation marks.

Lines starting with a pound sign (#) are comments and are ignored. Do not use #if as this is reserved syntax for time-based configuration.
Example lsf.sudoers file

LSB_PRE_POST_EXEC_USER=user100
LSF_STARTUP_PATH=/usr/share/lsf/etc
LSF_STARTUP_USERS="user1 user10 user55"

Creating and modifying lsf.sudoers

You can create and modify lsf.sudoers with a text editor. After you modify lsf.sudoers, you must run badmin hrestart all to restart all sbatchds in the cluster with the updated configuration.

Parameters
v LSB_PRE_POST_EXEC_USER
v LSF_EAUTH_KEY
v LSF_EAUTH_USER
v LSF_EEXEC_USER
v LSF_EGO_ADMIN_PASSWD
v LSF_EGO_ADMIN_USER
v LSF_LOAD_PLUGINS
v LSF_STARTUP_PATH
v LSF_STARTUP_USERS

LSB_PRE_POST_EXEC_USER

Syntax

LSB_PRE_POST_EXEC_USER=user_name

Description

Specifies the UNIX user account under which pre- and post-execution commands run. This parameter affects host-based pre- and post-execution processing defined at the first level. You can specify only one user account.

If the pre-execution or post-execution commands perform privileged operations that require root permissions on UNIX hosts, specify a value of root.

If you configure this parameter as root, the LD_PRELOAD and LD_LIBRARY_PATH variables are removed from the pre-execution, post-execution, and eexec environments for security purposes.

Default

Not defined. Pre-execution and post-execution commands run under the user account of the user who submits the job.

LSF_EAUTH_KEY

Syntax

LSF_EAUTH_KEY=key

Description

Applies to UNIX, Windows, and mixed UNIX/Windows clusters.

Specifies the key that eauth uses to encrypt and decrypt user authentication data. Defining this parameter enables increased security at your site. The key must contain at least six characters and must use only printable characters.

For UNIX, you must edit the lsf.sudoers file on all hosts within the cluster and specify the same encryption key. For Windows, you must edit the shared lsf.sudoers file.

Default

Not defined. The eauth executable encrypts and decrypts authentication data using an internal key.
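For example, an LSF_EAUTH_KEY entry in lsf.sudoers might look like the following sketch. The key value shown is a hypothetical placeholder; substitute your own string of at least six printable characters, and use the identical value on every host:

# hypothetical key value; must be the same in lsf.sudoers on all hosts
LSF_EAUTH_KEY=Zp8kq2Vw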
LSF_EAUTH_USER Syntax LSF_EAUTH_USER=user_name Description UNIX only. Specifies the UNIX user account under which the external authentication executable eauth runs. Default Not defined. The eauth executable runs under the account of the primary LSF administrator. LSF_EEXEC_USER Syntax LSF_EEXEC_USER=user_name Description UNIX only. Specifies the UNIX user account under which the external executable eexec runs. Chapter 1. Configuration Files 587 lsf.sudoers Default Not defined. The eexec executable runs under root or the account of the user who submitted the job. LSF_EGO_ADMIN_PASSWD Syntax LSF_EGO_ADMIN_PASSWD=password Description When the EGO Service Controller (EGOSC) is configured to control LSF daemons, enables UNIX and Windows users to bypass the additional login required to start res and sbatchd. Bypassing the EGO administrator login enables the use of scripts to automate system startup. Specify the Admin EGO cluster administrator password as clear text. You must also define the LSF_EGO_ADMIN_USER parameter. Default Not defined. With EGOSC daemon control enabled, the lsadmin and badmin startup subcommands invoke the egosh user logon command to prompt for the Admin EGO cluster administrator credentials. LSF_EGO_ADMIN_USER Syntax LSF_EGO_ADMIN_USER=Admin Description When the EGO Service Controller (EGOSC) is configured to control LSF daemons, enables UNIX and Windows users to bypass the additional login required to start res and sbatchd. Bypassing the EGO administrator login enables the use of scripts to automate system startup. Specify the Admin EGO cluster administrator account. You must also define the LSF_EGO_ADMIN_PASSWD parameter. Default Not defined. With EGOSC daemon control enabled, the lsadmin and badmin startup subcommands invoke the egosh user logon command to prompt for the Admin EGO cluster administrator credentials. 
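Because LSF_EGO_ADMIN_USER and LSF_EGO_ADMIN_PASSWD must be defined together, an lsf.sudoers entry would pair them as in the following sketch. The account name follows the Syntax line above; the password value is a hypothetical placeholder:

# both parameters are required together; password is stored as clear text
LSF_EGO_ADMIN_USER=Admin
LSF_EGO_ADMIN_PASSWD=my_ego_passwd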
LSF_LOAD_PLUGINS

Syntax

LSF_LOAD_PLUGINS=y | Y

Description

If defined, LSF loads plugins from LSB_LSBDIR. Used for Kerberos authentication and to enable the LSF cpuset plugin for SGI.

Default

Not defined. LSF does not load plugins.

LSF_STARTUP_PATH

Syntax

LSF_STARTUP_PATH=path

Description

UNIX only. Enables the LSF daemon startup control feature when LSF_STARTUP_USERS is also defined. Define both parameters when you want to allow users other than root to start LSF daemons.

Specifies the absolute path name of the directory in which the LSF daemon binary files (lim, res, sbatchd, and mbatchd) are installed. LSF daemons are usually installed in the path specified by LSF_SERVERDIR defined in the cshrc.lsf, profile.lsf, or lsf.conf files.

Important: For security reasons, you should move the LSF daemon binary files to a directory other than LSF_SERVERDIR or LSF_BINDIR. The user accounts specified by LSF_STARTUP_USERS can start any binary in the LSF_STARTUP_PATH.

Default

Not defined. Only the root user account can start LSF daemons.

LSF_STARTUP_USERS

Syntax

LSF_STARTUP_USERS=all_admins | "user_name..."

Description

UNIX only. Enables the LSF daemon startup control feature when LSF_STARTUP_PATH is also defined. Define both parameters when you want to allow users other than root to start LSF daemons. On Windows, the services admin group is equivalent to LSF_STARTUP_USERS.

On UNIX hosts, by default only root can start LSF daemons. To manually start LSF daemons, a user runs the commands lsadmin and badmin, which have been installed as setuid root. LSF_STARTUP_USERS specifies a list of user accounts that can successfully run the commands lsadmin and badmin to start LSF daemons.

all_admins
v Allows all UNIX users defined as LSF administrators in the file lsf.cluster.cluster_name to start LSF daemons as root by running the lsadmin and badmin commands.
v Not recommended due to the security risk of a non-root LSF administrator adding to the list of administrators in the lsf.cluster.cluster_name file.
v Not required for Windows hosts because all users with membership in the services admin group can start LSF daemons.

"user_name..."
v Allows the specified user accounts to start LSF daemons by running the lsadmin and badmin commands.
v Separate multiple user names with a space.
v For a single user, do not use quotation marks.

Default

Not defined. Only the root user account can start LSF daemons.

See also

LSF_STARTUP_PATH

lsf.task

Users should not have to specify a resource requirement each time they submit a job. LSF supports the concept of a task list. This chapter describes the files used to configure task lists: lsf.task, lsf.task.cluster_name, and .lsftask.

Changing task list configuration

After making any changes to the task list files, run the following commands:
v lsadmin reconfig to reconfigure LIM
v badmin reconfig to reload the configuration files

About task lists

A task list is a list in LSF that keeps track of the default resource requirements for different applications and task eligibility for remote execution. The term task refers to an application name. With a task list defined, LSF automatically supplies the resource requirement of the job whenever users submit a job, unless one is explicitly specified at job submission. LSF takes the job's command name as the task name and uses that name to find the matching resource requirement for the job from the task list. If a task does not have an entry in the task list, LSF assumes the default resource requirement; that is, a host that has the same host type as the submission host will be chosen to run the job.

An application listed in a task file is considered for load sharing by its placement in either the local tasks or remote tasks list.
v A local task is typically an application or command that does not make sense to run remotely, such as ls.
v A remote task is an application or command that can be run on another machine in the LSF cluster. The compress command is an example of a remote task.

Some applications require resources other than the default. LSF can store resource requirements for specific applications in remote task list files, so that LSF automatically chooses candidate hosts that have the correct resources available.

For frequently used commands and software packages, the LSF administrator can set up cluster-wide resource requirements that apply to all users in the cluster. Users can modify and add to these requirements by setting up additional resource requirements that apply only to their own jobs.

Cluster-wide resource requirements

The resource requirements of applications are stored in the remote task list file. LSF automatically picks up a job's default resource requirement string from the remote task list files, unless you explicitly override the default by specifying the resource requirement string on the command line.

User-level resource requirements

You may have applications that you need to control yourself. Perhaps your administrator did not set them up for load sharing for all users, or you need a non-standard setup. You can use LSF commands to find out resource names available in your system, and tell LSF about the needs of your applications. LSF stores the resource requirements for you from then on. You can specify resource requirements when tasks are added to the user's remote task list. If the task to be added is already in the list, its resource requirements are replaced. For example:

lsrtasks + "myjob/swap>=100 && cpu"

This adds myjob to the remote tasks list with its resource requirements.
Task files

There are 3 task list files that can affect a job:
v lsf.task - system-wide defaults apply to all LSF users, even across multiple clusters if MultiCluster is installed
v lsf.task.cluster_name - cluster-wide defaults apply to all users in the cluster
v $HOME/.lsftask - user-level defaults apply to a single user. This file lists applications to be added to or removed from the default system lists for your jobs. Resource requirements specified in this file override those in the system lists.

The clusterwide task file is used to augment the systemwide file. The user's task file is used to augment the systemwide and clusterwide task files. LSF combines the systemwide, clusterwide, and user-specific task lists for each user's view of the task list. In cases of conflicts, such as different resource requirements specified for the same task in different lists, the clusterwide list overrides the systemwide list, and the user-specific list overrides both.

LSF_CONFDIR/lsf.task

Systemwide task list applies to all clusters and all users. This file is used in a MultiCluster environment.

LSF_CONFDIR/lsf.task.cluster_name

Clusterwide task list applies to all users in the same cluster.

$HOME/.lsftask

User task list, one per user, applies only to the specific user. This file is automatically created in the user's home directory whenever a user first updates his task lists using the lsrtasks or lsltasks commands.

For details about task eligibility lists, see the ls_task(3) API reference man page.

Permissions

Only the LSF administrator can modify the systemwide task list (lsf.task) and the clusterwide task list (lsf.task.cluster_name). A user can modify his own task list (.lsftask) with the lsrtasks and lsltasks commands.

Format of task files

Each file consists of two sections, LocalTasks and RemoteTasks.
For example:

Begin LocalTasks
ps
hostname
uname
crontab
End LocalTasks

Begin RemoteTasks
+ "newjob/mem>25"
+ "verilog/select[type==any && swp>100]"
make/cpu
nroff/-
End RemoteTasks

Tasks are listed one per line. Each line in a section consists of a task name and, for the RemoteTasks section, an optional resource requirement string separated by a slash (/). A plus sign (+) or a minus sign (-) can optionally precede each entry. If no + or - is specified, + is assumed.

A + before a task name means adding a new entry (if one does not exist) or replacing an entry (if one already exists) in the task list. A - before a task name means removing an entry from the application's task lists if it was already created by reading higher-level task files.

LocalTasks section

The section starts with Begin LocalTasks and ends with End LocalTasks. This section lists tasks that are not eligible for remote execution, either because they are trivial tasks or because they need resources on the local host.

RemoteTasks section

The section starts with Begin RemoteTasks and ends with End RemoteTasks. This section lists tasks that are eligible for remote execution. You can associate resource requirements with each task name. See Administering IBM Platform LSF for information about resource requirement strings. If the resource requirement string is not specified for a remote task, the default is "select[type==local] order[r15s:pg]".

setup.config

About setup.config

The setup.config file contains options for License Scheduler installation and configuration for systems without LSF. You only need to edit this file if you are installing License Scheduler as a standalone product without LSF.

Template location

A template setup.config is included in the License Scheduler installation script tar file and is located in the directory created when you uncompress and extract the installation script tar file.
Edit the file and uncomment the options you want in the template file. Replace the example values with your own settings to specify the options for your new License Scheduler installation.

Important: The sample values in the setup.config template file are examples only. They are not default installation values.

After the License Scheduler installation, the setup.config containing the options you specified is located in LS_TOP/9.1/install/.

Format

Each entry in setup.config has the form:

NAME="STRING1 STRING2 ..."

The equal sign = must follow each NAME, even if no value follows, and there should be no spaces around the equal sign. A value that contains multiple strings separated by spaces must be enclosed in quotation marks. Blank lines and lines starting with a pound sign (#) are ignored.

Parameters

v LS_ADMIN
v LS_HOSTS
v LS_LMSTAT_PATH
v LS_TOP

LS_ADMIN

Syntax

LS_ADMIN="user_name [user_name ... ]"

Description

Lists the License Scheduler administrators. The first user account name in the list is the primary License Scheduler administrator. The primary License Scheduler administrator account is typically named lsadmin.

CAUTION: You should not configure the root account as the primary License Scheduler administrator.

Valid Values

User accounts for License Scheduler administrators must exist on all hosts using License Scheduler prior to installation.

Example

LS_ADMIN="lsadmin user1 user2"

Default

The user running the License Scheduler installation script.

LS_HOSTS

Syntax

LS_HOSTS="host_name [host_name ... ]"

Description

Defines a list of hosts that are candidates to become License Scheduler master hosts. Provide at least one host on which the License Scheduler daemon will run.

Valid Values

Any valid License Scheduler host name.

Example

LS_HOSTS="host_name1 host_name2"

Default

The local host on which the License Scheduler installation script is running.
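For a concrete picture, a hypothetical completed setup.config covering the four parameters listed above might look like the following. All values are illustrative only, not defaults:

```
# setup.config -- hypothetical standalone License Scheduler installation
LS_ADMIN="lsadmin"
LS_HOSTS="hostm1 hostm2"
LS_LMSTAT_PATH="/usr/bin"
LS_TOP="/usr/share/ls"
```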
LS_LMSTAT_PATH

Syntax

LS_LMSTAT_PATH="/path"

Description

Defines the full path to the directory containing the lmstat program. License Scheduler uses lmstat to gather the FlexNet license information for scheduling. This path does not include the name of the lmstat program itself.

Example

LS_LMSTAT_PATH="/usr/bin"

Default

The installation script attempts to find a working copy of lmstat on the current system. If it is unsuccessful, the path is set as blank ("").

LS_TOP

Syntax

LS_TOP="/path"

Description

Defines the full path to the top-level License Scheduler installation directory.

Valid Values

Must be an absolute path to a shared directory that is accessible to all hosts using License Scheduler. Cannot be the root directory (/).

Recommended Value

The file system containing LS_TOP must have enough disk space for all host types (approximately 1.5 GB per host type).

Example

LS_TOP="/usr/share/ls"

Default

None (required variable)

slave.config

About slave.config

Dynamically added LSF hosts that will not be master candidates are slave hosts. Each dynamic slave host has its own LSF binaries and local lsf.conf and shell environment scripts (cshrc.lsf and profile.lsf). You must install LSF on each slave host.

The slave.config file contains options for installing and configuring a slave host that can be dynamically added or removed. Use lsfinstall -s -f slave.config to install LSF using the options specified in slave.config.

Template location

A template slave.config is located in the installation script directory created when you extract the installer script package. Edit the file and uncomment the options you want in the template file. Replace the example values with your own settings to specify the options for your new LSF installation.

Important: The sample values in the slave.config template file are examples only. They are not default installation values.
Format

Each entry in slave.config has the form:

NAME="STRING1 STRING2 ..."

The equal sign = must follow each NAME, even if no value follows, and there should be no spaces around the equal sign. A value that contains multiple strings separated by spaces must be enclosed in quotation marks. Blank lines and lines starting with a pound sign (#) are ignored.

Parameters

v EGO_DAEMON_CONTROL
v ENABLE_EGO
v EP_BACKUP
v LSF_ADMINS
v LSF_ENTITLEMENT_FILE
v LSF_LIM_PORT
v LSF_SERVER_HOSTS
v LSF_TARDIR
v LSF_LOCAL_RESOURCES
v LSF_TOP
v SILENT_INSTALL
v LSF_SILENT_INSTALL_TARLIST

EGO_DAEMON_CONTROL

Syntax

EGO_DAEMON_CONTROL="Y" | "N"

Description

Enables EGO to control LSF res and sbatchd. Set the value to "Y" if you want EGO Service Controller to start res and sbatchd, and restart them if they fail.

All hosts in the cluster must use the same value for this parameter (the value of EGO_DAEMON_CONTROL in this file must be the same as the specification for EGO_DAEMON_CONTROL in install.config).

To avoid conflicts, leave this parameter undefined if you use a script to start up LSF daemons.

Note: If you specify ENABLE_EGO="N", this parameter is ignored.

Example

EGO_DAEMON_CONTROL="N"

Default

N (res and sbatchd are started manually)

ENABLE_EGO

Syntax

ENABLE_EGO="Y" | "N"

Description

Enables EGO functionality in the LSF cluster.

ENABLE_EGO="Y" causes lsfinstall to uncomment LSF_EGO_ENVDIR and set LSF_ENABLE_EGO="Y" in lsf.conf.

ENABLE_EGO="N" causes lsfinstall to comment out LSF_EGO_ENVDIR and set LSF_ENABLE_EGO="N" in lsf.conf.

Set the value to "Y" if you want to take advantage of the following LSF features that depend on EGO:

v LSF daemon control by EGO Service Controller
v EGO-enabled SLA scheduling

Default

N (EGO is disabled in the LSF cluster)

EP_BACKUP

Syntax

EP_BACKUP="Y" | "N"

Description

Enables backup and rollback for enhancement packs.
Set the value to "N" to disable backups when installing enhancement packs (you will not be able to roll back to the previous patch level after installing an EP, but you will still be able to roll back any fixes installed on the new EP). You may disable backups to speed up install time, to save disk space, or because you have your own methods to back up the cluster.

Default

Y (backup and rollback are fully enabled)

LSF_ADMINS

Syntax

LSF_ADMINS="user_name [ user_name ... ]"

Description

Required. List of LSF administrators.

The first user account name in the list is the primary LSF administrator. It cannot be the root user account. Typically this account is named lsfadmin. It owns the LSF configuration files and log files for job events. It also has permission to reconfigure LSF and to control batch jobs submitted by other users. It typically does not have authority to start LSF daemons. Usually, only root has permission to start LSF daemons.

All the LSF administrator accounts must exist on all hosts in the cluster before you install LSF. Secondary LSF administrators are optional.

Valid Values

Existing user accounts

Example

LSF_ADMINS="lsfadmin user1 user2"

Default

None (required variable)

LSF_ENTITLEMENT_FILE

Syntax

LSF_ENTITLEMENT_FILE=path

Description

Full path to the LSF entitlement file. LSF uses the entitlement file to determine which features to enable or disable, based on the edition of the product.

The entitlement file for LSF Standard Edition is platform_lsf_std_entitlement.dat. For LSF Express Edition, the file is platform_lsf_exp_entitlement.dat. The entitlement file is installed as <LSF_TOP>/conf/lsf.entitlement.

You must download the entitlement file for the edition of the product you are running, and set LSF_ENTITLEMENT_FILE to the full path to the entitlement file you downloaded.

Once LSF is installed and running, run the lsid command to see which edition of LSF is enabled.
Example

LSF_ENTITLEMENT_FILE=/usr/share/lsf_distrib/lsf.entitlement

Default

None (required variable)

LSF_LIM_PORT

Syntax

LSF_LIM_PORT="port_number"

Description

TCP service port for the slave host. Use the same port number as LSF_LIM_PORT in lsf.conf on the master host.

Default

7869

LSF_SERVER_HOSTS

Syntax

LSF_SERVER_HOSTS="host_name [ host_name ...]"

Description

Required for non-shared slave host installation. This parameter defines a list of LSF server hosts in the cluster that can provide host and load information to client hosts. If you do not define this parameter, clients contact the master LIM for host and load information.

Recommended for large clusters to decrease the load on the master LIM. Do not specify the master host in the list. Client commands query the LIMs on the LSF_SERVER_HOSTS, which off-loads traffic from the master LIM. Define this parameter to ensure that commands execute successfully when no LIM is running on the local host, or when the local LIM has just started.

You should include the list of hosts defined in LSF_MASTER_LIST in lsf.conf; specify the primary master host last. For example:

LSF_MASTER_LIST="lsfmaster hostE"
LSF_SERVER_HOSTS="hostB hostC hostD hostE lsfmaster"

Specify a list of host names in one of two ways:

v Host names separated by spaces
v Name of a file containing a list of host names, one host per line

Valid Values

Any valid LSF host name

Examples

List of host names:

LSF_SERVER_HOSTS="hosta hostb hostc hostd"

Host list file:

LSF_SERVER_HOSTS=:lsf_server_hosts

The file lsf_server_hosts contains a list of hosts:

hosta
hostb
hostc
hostd

Default

None

LSF_TARDIR

Syntax

LSF_TARDIR="/path"

Description

Full path to the directory containing the LSF distribution tar files.

Example

LSF_TARDIR="/usr/local/lsf_distrib"

Default

The parent directory of the current working directory.
For example, if lsfinstall is running under /usr/share/lsf_distrib/lsf_lsfinstall, the LSF_TARDIR default value is /usr/share/lsf_distrib.

LSF_LOCAL_RESOURCES

Syntax

LSF_LOCAL_RESOURCES="resource ..."

Description

Defines instances of local resources residing on the slave host.

v For numeric resources, define name-value pairs: "[resourcemap value*resource_name]"
v For Boolean resources, define the resource name in the form: "[resource resource_name]"

When the slave host calls the master host to add itself, it also reports its local resources. The local resources to be added must be defined in lsf.shared. If the same resource is already defined in lsf.shared as default or all, it cannot be added as a local resource. The shared resource overrides the local one.

Tip: LSF_LOCAL_RESOURCES is usually set in the slave.config file during installation. If LSF_LOCAL_RESOURCES is already defined in a local lsf.conf on the slave host, lsfinstall does not add the resources you define in LSF_LOCAL_RESOURCES in slave.config. You should not have duplicate LSF_LOCAL_RESOURCES entries in lsf.conf. If local resources are defined more than once, only the last definition is valid.

Important: Resources must already be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name. If the ResourceMap section does not exist, local resources are not added.

Example

LSF_LOCAL_RESOURCES="[resourcemap 1*verilog] [resource linux]"

Default

None

LSF_TOP

Syntax

LSF_TOP="/path"

Description

Required. Full path to the top-level LSF installation directory.

Important: You must use the same path for every slave host you install.

Valid value

The path to LSF_TOP cannot be the root directory (/).

Example

LSF_TOP="/usr/local/lsf"

Default

None (required variable)
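Pulling the required and commonly used parameters together, a hypothetical slave.config for one dynamic slave host might look like the following. All host names, paths, and resource names are illustrative only, not defaults:

```
# slave.config -- hypothetical dynamic slave host installation
LSF_TOP="/usr/local/lsf"
LSF_ADMINS="lsfadmin"
LSF_ENTITLEMENT_FILE="/usr/share/lsf_distrib/lsf.entitlement"
LSF_SERVER_HOSTS="hostb hostc"
LSF_LIM_PORT="7869"
LSF_TARDIR="/usr/local/lsf_distrib"
LSF_LOCAL_RESOURCES="[resource linux]"
```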
SILENT_INSTALL

Syntax

SILENT_INSTALL="Y" | "N"

Description

Setting this parameter to Y enables the silent installation, which means that you accept the license agreement and the installation proceeds without prompting.

Default

N

LSF_SILENT_INSTALL_TARLIST

Syntax

LSF_SILENT_INSTALL_TARLIST="ALL" | "Package_Name ..."

Description

A string that contains all the LSF package names to be installed. This name list only applies to the silent install mode. The keywords "all", "ALL", and "All" are supported, and install all packages in LSF_TARDIR.

Example

LSF_SILENT_INSTALL_TARLIST="ALL" | "lsf9.1.3_linux2.6-glibc2.3-x86_64.tar.Z"

Default

None

Chapter 2. Environment Variables

Environment variables set for job execution

LSF transfers most environment variables between submission and execution hosts. Environment variables related to file names and job spooling directories support paths that contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows. Environment variables related to command names and job names can contain up to 4094 characters for UNIX and Linux, or up to 255 characters for Windows.

In addition to environment variables inherited from the user environment, LSF also sets several other environment variables for batch jobs:

v LSB_ERRORFILE: Name of the error file specified with bsub -e.
v LSB_JOBID: Job ID assigned by LSF.
v LSB_JOBINDEX: Index of the job that belongs to a job array.
v LSB_CHKPNT_DIR: This variable is set each time a checkpointed job is submitted. The value of the variable is chkpnt_dir/job_Id, a subdirectory of the checkpoint directory that is specified when the job is submitted. The subdirectory is identified by the job ID of the submitted job.
v LSB_HOSTS: The list of hosts that are used to run the batch job. For sequential jobs, this is only one host name. For parallel jobs, this includes multiple host names.
v LSB_RESIZABLE: Indicates that a job is resizable or auto-resizable.
v LSB_QUEUE: The name of the queue the job is dispatched from.
v LSB_JOBNAME: Name of the job.
v LSB_RESTART: Set to 'Y' if the job is a restarted job or if the job has been migrated. Otherwise this variable is not defined.
v LSB_EXIT_PRE_ABORT: Set to an integer value representing an exit status. A pre-execution command should exit with this value if it wants the job to be aborted instead of requeued or executed.
v LSB_EXIT_REQUEUE: Set to the REQUEUE_EXIT_VALUES parameter of the queue. This variable is not defined if REQUEUE_EXIT_VALUES is not configured for the queue.
v LSB_INTERACTIVE: Set to 'Y' if the job is submitted with the -I option. Otherwise, it is not defined.
v LS_EXECCWD: Sets the current working directory for job execution.
v LS_JOBPID: Set to the process ID of the job.
v LS_SUBCWD: The directory on the submission host where the job was submitted. This is different from PWD only if the directory is not shared across machines, or when the execution account is different from the submission account as a result of account mapping.
v LSB_BIND_JOB: Set to the value of the binding option. When the binding option is USER, LSB_BIND_JOB is set to the real binding decision of the end user. Note: If the binding option is Y, LSB_BIND_JOB is set to BALANCE. If the binding option is N, LSB_BIND_JOB is set to NONE.
v LSB_BIND_CPU_LIST: Set to the actual CPU list used when the job is a sequential job or a single-host parallel job. If the job is a multi-host parallel job, LSB_BIND_CPU_LIST is set to the value of the submission environment variable $LSB_USER_BIND_CPU_LIST. If there is no such variable in the user's environment, LSB_BIND_CPU_LIST is set to an empty string.

The following environment variables are set only in the post-job environment:

v LSB_ACCUMULATED_CPUTIME: Job accumulated CPU time.
For migrated jobs, the CPU time can be accumulated across migration runs. Job CPU time is shown to two decimal places.

v LSB_MAX_MEM_RUSAGE: Maximum memory rusage of the job processes, not including post_exec. Always in KB.
v LSB_MAX_SWAP_RUSAGE: Maximum swap rusage of the job processes, not including post_exec. Always in KB.
v LSB_MAX_PROCESSES_RUSAGE: Number of processes for the job, not including post_exec.
v LSB_MAX_THREADS_RUSAGE: Number of threads for the job, not including post_exec.
v LSB_JOB_SUBMIT_TIME: The time that the job was submitted.
v LSB_JOB_START_TIME: The time that the job was started. For requeued or migrated jobs, the start time is the time the job started after it was requeued or migrated. For chunk job members, it is the time the member actually starts, not the start time of the chunk.
v LSB_JOB_END_TIME: The time that the job ended, not including post_exec.
v LSB_JOB_PEND_TIME: Pend time for the job, in seconds, which is calculated from the submit time and start time (start time - submit time). For a requeued or migrated job, the pend time may be longer than the real time spent in the PEND state, because it includes the time for the previous run. In those cases, the pend time is the time from job submission to the time of the last job start.
v LSB_DJOB_NUMPROC: The number of processors on which the job starts. For a job that has been resized, the value is the size of the job at its finish point.
v LSB_MAX_NUM_PROCESSORS: The maximum number of processors requested when the job is submitted. For example, for a job submitted with -n 2,4, the maximum number of processors requested is 4.
v LSB_JOB_STATUS: Job status value as defined in lsbatch.h. LSB_JOB_STATUS is set to 32 for job exit and 64 for job done.
v LSB_SUB_USER: User name for the job submission.
v LSB_SUB_RES_REQ: Job-level resource requirement for the job submission.
If the job-level resource requirement was changed by bmod -R for a running job, the changed resource requirement is not available through LSB_SUB_RES_REQ.

v LSB_EFFECTIVE_RSRCREQ: Job effective resource requirement. If the job-level resource requirement was changed for a running job by bmod -R, the changed effective resource requirement is not available through LSB_EFFECTIVE_RSRCREQ.

Environment variables for resize notification command

All environment variables that are set for a job are also set when a job is resized. The following (additional) environment variables apply only to the resize notification command environment (when using resizable jobs):

v LSB_RESIZE_NOTIFY_OK: A notification command should exit with this value if the allocation resize notification command succeeds. LSF updates the job allocation to reflect the new allocation.
v LSB_RESIZE_NOTIFY_FAIL: A notification command should exit with this value if the allocation resize notification command fails. For an allocation grow event, LSF schedules the pending allocation request. For an allocation shrink event, LSF fails the release request.
v LSB_RESIZE_EVENT = grow | shrink: Indicates why the notification command was called. Grow means add more resources to an existing allocation. Shrink means remove some resources from an existing allocation.
v LSB_RESIZE_HOSTS = hostA numA hostB numB ... hostZ numZ: Lists the additional slots for a grow event, or the released slots for a shrink event.

Environment variables for session scheduler (ssched)

By default, all environment variables that are set as part of the session are available in each task's execution environment.
Variables for the execution host of each task

The following environment variables are reset according to the execution host of each task:

v EGO_SERVERDIR
v LSB_TRAPSIGS
v LSF_SERVERDIR
v HOSTTYPE
v LSB_HOSTS
v LSF_BINDIR
v EGO_BINDIR
v PWD
v HOME
v LSB_ERRORFILE
v LSB_OUTPUTFILE
v TMPDIR
v LSF_LIBDIR
v EGO_LIBDIR
v LSB_MCPU_HOSTS
v PATH (prepend LSF_BINDIR)
v LD_LIBRARY_PATH (prepend LSF_LIBDIR and EGO_LIBDIR)

Environment variables NOT available in the task environment

v LSB_JOBRES_PID
v LSB_EEXEC_REAL_UID
v LS_EXEC_T
v LSB_INTERACTIVE
v LSB_CHKFILENAME
v SPOOLDIR
v LSB_ACCT_FILE
v LSB_EEXEC_REAL_GID
v LSB_CHKPNT_DIR
v LSB_CHKPNT_PERIOD
v LSB_JOB_STARTER
v LSB_EXIT_REQUEUE
v LSB_DJOB_RU_INTERVAL
v LSB_DJOB_HB_INTERVAL
v LSB_DJOB_HOSTFILE
v LSB_JOBEXIT_INFO
v LSB_JOBPEND
v LSB_EXECHOSTS

Environment variables corresponding to the session job

v LSB_JOBID
v LSB_JOBINDEX
v LSB_JOBINDEX_STEP
v LSB_JOBINDEX_END
v LSB_JOBPID
v LSB_JOBNAME
v LSB_JOBFILENAME

Environment variables set individually for each task

v LSB_TASKID - The current task ID
v LSB_TASKINDEX - The current task index

Environment variable reference

BSUB_BLOCK

Description

If set, tells NIOS that it is running in batch mode.

Default

Not defined

Notes

If you submit a job with the -K option of bsub, which is synchronous execution, then BSUB_BLOCK is set. Synchronous execution means you have to wait for the job to finish before you can continue.

Where defined

Set internally

See also

The -K option of bsub

BSUB_CHK_RESREQ

Syntax

BSUB_CHK_RESREQ=any_value

Description

When BSUB_CHK_RESREQ is set, bsub checks the syntax of the resource requirement selection string without actually submitting the job for scheduling and dispatch.
Use BSUB_CHK_RESREQ to check the compatibility of your existing resource requirement select strings against the stricter syntax enabled by LSF_STRICT_RESREQ=y in lsf.conf. LSF_STRICT_RESREQ does not need to be set to check the resource requirement selection string syntax.

bsub only checks the select section of the resource requirement. Other sections in the resource requirement string are not checked.

Default

Not defined

Where defined

From the command line

Example

BSUB_CHK_RESREQ=1

BSUB_QUIET

Syntax

BSUB_QUIET=any_value

Description

Controls the printing of information about job submissions. If set, bsub does not print any information about job submission. For example, it does not print <Job is submitted to default queue <normal>>, nor <Waiting for dispatch>.

Default

Not defined

Where defined

From the command line

Example

BSUB_QUIET=1

BSUB_QUIET2

Syntax

BSUB_QUIET2=any_value

Description

Suppresses the printing of information about job completion when a job is submitted with the bsub -K option. If set, bsub does not print information about job completion to stdout. For example, when this variable is set, the message <<Job is finished>> is not written to stdout.

If BSUB_QUIET and BSUB_QUIET2 are both set, no job messages are printed to stdout.

Default

Not defined

Where defined

From the command line

Example

BSUB_QUIET2=1

BSUB_STDERR

Syntax

BSUB_STDERR=y

Description

Redirects LSF messages for bsub to stderr. By default, when this parameter is not set, LSF messages for bsub are printed to stdout. When this parameter is set, LSF messages for bsub are redirected to stderr.

Default

Not defined

Where defined

From the command line on UNIX. For example, in csh:

setenv BSUB_STDERR Y

From the Control Panel on Windows, as an environment variable

CLEARCASE_DRIVE

Syntax

CLEARCASE_DRIVE=drive_letter:

Description

Optional, Windows only.
Defines the virtual drive letter for a Rational ClearCase view. This is useful if you wish to map a Rational ClearCase view to a virtual drive as an alias.

If this letter is unavailable, Windows attempts to map to another drive. Therefore, CLEARCASE_DRIVE only defines the default drive letter to which the Rational ClearCase view is mapped, not the final selected drive letter. However, the PATH value is automatically updated to the final drive letter if it is different from CLEARCASE_DRIVE.

Notes: CLEARCASE_DRIVE is not case sensitive.

Where defined

From the command line

Example

CLEARCASE_DRIVE=F:
CLEARCASE_DRIVE=f:

See also

CLEARCASE_MOUNTDIR, CLEARCASE_ROOT

CLEARCASE_MOUNTDIR

Syntax

CLEARCASE_MOUNTDIR=path

Description

Optional. Defines the Rational ClearCase mounting directory.

Default

/vobs

Notes: CLEARCASE_MOUNTDIR is used if any of the following conditions apply:

v A job is submitted from a UNIX environment but run on a Windows host.
v The Rational ClearCase mounting directory is not the default /vobs.

Where defined

From the command line

Example

CLEARCASE_MOUNTDIR=/myvobs

See also

CLEARCASE_DRIVE, CLEARCASE_ROOT

CLEARCASE_ROOT

Syntax

CLEARCASE_ROOT=path

Description

The path to the Rational ClearCase view. In Windows, this path must define an absolute path starting with the default ClearCase drive and ending with the view name, without an ending backslash (\).

Notes

CLEARCASE_ROOT must be defined if you want to submit a batch job from a ClearCase view. For interactive jobs, use bsub -I to submit the job.
Where defined

In the job starter, or from the command line

Example

In UNIX: CLEARCASE_ROOT=/view/myview
In Windows: CLEARCASE_ROOT=F:\myview

See also

CLEARCASE_DRIVE, CLEARCASE_MOUNTDIR, LSF_JOB_STARTER

ELIM_ABORT_VALUE

Syntax

ELIM_ABORT_VALUE

Description

Used when writing an elim executable to test whether the elim should run on a particular host. If the host does not have or share any of the resources listed in the environment variable LSF_RESOURCES, your elim should exit with $ELIM_ABORT_VALUE.

When the MELIM finds an elim that exited with ELIM_ABORT_VALUE, the MELIM marks the elim and does not restart it on that host.

Where defined

Set by the master elim (MELIM) on the host when the MELIM invokes the elim executable

LS_EXEC_T

Syntax

LS_EXEC_T=START | END | CHKPNT | JOB_CONTROLS

Description

Indicates the execution type for a job. LS_EXEC_T is set to:

v START or END for a job when the job begins executing or when it completes execution
v CHKPNT when the job is checkpointed
v JOB_CONTROLS when a control action is initiated

Where defined

Set by sbatchd during job execution

LS_JOBPID

Description

The process ID of the job.

Where defined

During job execution, sbatchd sets LS_JOBPID to be the same as the process ID assigned by the operating system.

LS_SUBCWD

Description

The current working directory (cwd) of the submission host where the remote task command was executed. The current working directory can be up to 4094 characters long for UNIX and Linux, or up to 255 characters for Windows.

How set

1. LSF looks for the PWD environment variable. If it finds it, LSF sets LS_SUBCWD to PWD.
2. If the PWD environment variable does not exist, LSF looks for the CWD environment variable. If it finds CWD, LSF sets LS_SUBCWD to CWD.
3.
If the CWD environment variable does not exist, LSF calls the getwd() system function to retrieve the current working directory path name. LSF sets LS_SUBCWD to the value that is returned.

Where defined

Set by sbatchd

LSB_AFFINITY_HOSTFILE

Syntax

LSB_AFFINITY_HOSTFILE=file_path

Description

Path to the NUMA CPU and memory affinity binding decision file. On the first execution host, LSF sbatchd creates a binding decision file per job under the same location as $LSB_DJOB_HOSTFILE.

The binding decision file has a format similar to the job host file, one task per line. Each line includes: host_name cpu_id_list NUMA_node_id_list memory_policy. For the memory policy, 1 means localonly and 2 means localprefer, as specified in the affinity resource requirement membind parameter. A comma (,) is the only supported delimiter for the list of CPU IDs and the list of NUMA node IDs.

The following host file represents a job with 6 tasks:

v Host1 and Host2 each have two tasks, bound to CPUs {0,1,2,3} and {4,5,6,7} and NUMA nodes 0 and 1 respectively, with a membind=localprefer policy.
v Host3 and Host4 each have one task, bound to CPUs {0,1,2,3} and NUMA node 0, again with a membind=localprefer policy.

Host1 0,1,2,3 0 2
Host1 4,5,6,7 1 2
Host2 0,1,2,3 0 2
Host2 4,5,6,7 1 2
Host3 0,1,2,3 0 2
Host4 0,1,2,3 0 2

Default

Not defined

Where defined

Set during job execution.

LSB_BIND_CPU_LIST

Syntax

LSB_BIND_CPU_LIST=cpu_list

Description

LSB_BIND_CPU_LIST contains the allocated CPUs on each host. LSF combines all of the allocated CPUs for all the tasks that ended up on the host (for the given job), and sets this environment variable to the entire list before launching tasks.
The following example corresponds to the tasks specified in the example for LSB_AFFINITY_HOSTFILE:

LSB_BIND_CPU_LIST on Host1: 0,1,2,3,4,5,6,7
LSB_BIND_CPU_LIST on Host2: 0,1,2,3,4,5,6,7
LSB_BIND_CPU_LIST on Host3: 0,1,2,3
LSB_BIND_CPU_LIST on Host4: 0,1,2,3

Default

Not defined

Where defined

Set during job execution.

LSB_BIND_MEM_LIST

Syntax

LSB_BIND_MEM_LIST=memory_node_list

Description

LSB_BIND_MEM_LIST contains the allocated memory nodes on each host. If the job is submitted with a memory affinity requirement, LSF combines all of the allocated NUMA node IDs for all the tasks that ended up on the host (for the given job), and sets this environment variable to the entire list before launching tasks.

The following example corresponds to the tasks specified in the example for LSB_AFFINITY_HOSTFILE:

LSB_BIND_MEM_LIST on Host1: 0,1
LSB_BIND_MEM_LIST on Host2: 0,1
LSB_BIND_MEM_LIST on Host3: 0
LSB_BIND_MEM_LIST on Host4: 0

Default

Not defined

Where defined

Set during job execution.

LSB_BIND_MEM_POLICY

Syntax

LSB_BIND_MEM_POLICY=localprefer | localonly

Description

For jobs submitted with a NUMA memory affinity resource requirement, LSF sets the memory binding policy in the LSB_BIND_MEM_POLICY environment variable, as specified in the membind affinity resource requirement parameter: either localprefer or localonly.

Default

Not defined

Where defined

Set during job execution.

LSB_BJOBS_FORMAT

This parameter can be set from the command line or from lsf.conf. See LSB_BJOBS_FORMAT in lsf.conf.

LSB_BSUB_ERR_RETRY

Syntax

LSB_BSUB_ERR_RETRY=RETRY_CNT[number] ERR_TYPE[error1 error2 ...]

Description

In some cases, jobs can benefit from being automatically retried if they fail with a particular error. When specified, LSB_BSUB_ERR_RETRY automatically retries jobs that exit with a particular reason, up to the number of times specified by RETRY_CNT.
Only the following error types (ERR_TYPE) are supported:
v BAD_XDR: Error during XDR.
v MSG_SYS: Failed to send or receive a message.
v INTERNAL: Internal library error.

The number of retries (RETRY_CNT) can be a minimum of 1 to a maximum of 50.

Considerations when setting this parameter:
v Users may experience what seems like a lag during job submission while the job is retried automatically in the background.
v Users may see a job submitted more than once, with no explanation (no error is communicated to the user; the job keeps getting submitted until it succeeds or reaches its maximum retry count). In this case, the job ID also changes each time the error is retried.

Default
N

LSB_CHKPNT_DIR
Syntax
LSB_CHKPNT_DIR=checkpoint_dir/job_ID
Description
The directory containing files related to the submitted checkpointable job.
Valid values
The value of checkpoint_dir is the directory you specified through the -k option of bsub when submitting the checkpointable job. The value of job_ID is the job ID of the checkpointable job.
Where defined
Set by LSF, based on the directory you specified when submitting a checkpointable job with the -k option of bsub.

LSB_DATA_CACHE_TOP
Contains a string defining the location of the data management staging area relative to the compute node. The value of this variable is equivalent to STAGING_AREA in lsf.datamanager.
See also
lsf.datamanager

LSB_DATA_META_FILE
For jobs submitted with data requirements, this variable contains a string defining the location of the job's metadata file relative to the compute node. The value of this variable is equivalent to the value of STAGING_AREA/work/cluster_name/jobID/stgin.meta.
See also
LSB_OUTDIR, the bsub -data option

LSB_DEBUG
This parameter can be set from the command line or from lsf.conf. See LSB_DEBUG in lsf.conf.

LSB_DEBUG_CMD
This parameter can be set from the command line or from lsf.conf.
See LSB_DEBUG_CMD in lsf.conf.

LSB_DEBUG_MBD
This parameter can be set from the command line with badmin mbddebug or from lsf.conf. See LSB_DEBUG_MBD in lsf.conf.

LSB_DEBUG_SBD
This parameter can be set from the command line with badmin sbddebug or from lsf.conf. See LSB_DEBUG_SBD in lsf.conf.

LSB_DEBUG_SCH
This parameter can be set from the command line or from lsf.conf. See LSB_DEBUG_SCH in lsf.conf.

LSB_DEFAULT_JOBGROUP
Syntax
LSB_DEFAULT_JOBGROUP=job_group_name
Description
The name of the default job group. When you submit a job to LSF without explicitly specifying a job group, LSF associates the job with the specified job group. LSB_DEFAULT_JOBGROUP overrides the setting of DEFAULT_JOBGROUP in lsb.params. The bsub -g job_group_name option overrides both LSB_DEFAULT_JOBGROUP and DEFAULT_JOBGROUP.

If you submit a job without the -g option of bsub, but you defined LSB_DEFAULT_JOBGROUP, then the job belongs to the job group specified in LSB_DEFAULT_JOBGROUP.

Job group names must follow this format:
v Job group names must start with a slash character (/). For example, LSB_DEFAULT_JOBGROUP=/A/B/C is correct, but LSB_DEFAULT_JOBGROUP=A/B/C is not correct.
v Job group names cannot end with a slash character (/). For example, LSB_DEFAULT_JOBGROUP=/A/ is not correct.
v Job group names cannot contain more than one slash character (/) in a row. For example, job group names like LSB_DEFAULT_JOBGROUP=/A//B or LSB_DEFAULT_JOBGROUP=A////B are not correct.
v Job group names cannot contain spaces. For example, LSB_DEFAULT_JOBGROUP=/A/B C/D is not correct.
v Project names and user names used for macro substitution with %p and %u cannot start or end with a slash character (/).
v Project names and user names used for macro substitution with %p and %u cannot contain spaces or more than one slash character (/) in a row.
v Project names or user names containing a slash character (/) will create separate job groups. For example, if the project name is canada/projects, LSB_DEFAULT_JOBGROUP=/%p results in a job group hierarchy /canada/projects.

Where defined
From the command line
Example
LSB_DEFAULT_JOBGROUP=/canada/projects
Default
Not defined
See also
DEFAULT_JOBGROUP in lsb.params, the -g option of bsub

LSB_DEFAULTPROJECT
Syntax
LSB_DEFAULTPROJECT=project_name
Description
The name of the project to which resources consumed by a job will be charged.
Default
Not defined
Notes
Project names can be up to 59 characters long. If the LSF administrator defines a default project in the lsb.params configuration file, the system uses this as the default project. You can change the default project by setting LSB_DEFAULTPROJECT or by specifying a project name with the -P option of bsub.

If you submit a job without the -P option of bsub, but you defined LSB_DEFAULTPROJECT, then the job belongs to the project specified in LSB_DEFAULTPROJECT. If you submit a job with the -P option of bsub, the job belongs to the project specified through the -P option.
Where defined
From the command line, or through the -P option of bsub
Example
LSB_DEFAULTPROJECT=engineering
See also
DEFAULT_PROJECT in lsb.params, the -P option of bsub

LSB_DEFAULTQUEUE
Syntax
LSB_DEFAULTQUEUE=queue_name
Description
Defines the default LSF queue.
Default
mbatchd decides which is the default queue. You can override the default by defining LSB_DEFAULTQUEUE.
Notes
If the LSF administrator defines a default queue in the lsb.params configuration file, then the system uses this as the default queue. Provided you have permission, you can change the default queue by setting LSB_DEFAULTQUEUE to a valid queue (see bqueues for a list of valid queues).
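For example, to make all submissions in a session default to a queue named normal (a hypothetical queue name; run bqueues to see the queues actually configured in your cluster):

```shell
# "normal" is a hypothetical queue name used for illustration.
export LSB_DEFAULTQUEUE=normal
# bsub myjob   # now goes to queue "normal" unless -q overrides it
echo "default queue: $LSB_DEFAULTQUEUE"
```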
Where defined
From the command line
See also
DEFAULT_QUEUE in lsb.params

LSB_DJOB_COMMFAIL_ACTION
Syntax
LSB_DJOB_COMMFAIL_ACTION="KILL_TASKS"
Description
Defines the action LSF should take if it detects a communication failure with one or more remote parallel or distributed tasks. If defined, LSF tries to kill all the current tasks of a parallel or distributed job associated with the communication failure. If not defined, the job RES notifies the task RES to terminate all tasks, and shuts down the entire job.
Default
Terminate all tasks, and shut down the entire job
Valid values
KILL_TASKS
Where defined
Set by the system based on the value of the parameter DJOB_COMMFAIL_ACTION in lsb.applications when running bsub -app for the specified application
See also
DJOB_COMMFAIL_ACTION in lsb.applications

LSB_DJOB_ENV_SCRIPT
Syntax
LSB_DJOB_ENV_SCRIPT=script_name
Description
Defines the name of a user-defined script for setting up and cleaning up the parallel or distributed job environment. This script is executed by LSF with the argument setup before launching a parallel or distributed job, and with the argument cleanup after the parallel job is finished. The script runs as the user, and is part of the job. If a full path is specified, LSF uses that path name for the execution. Otherwise, LSF looks for the executable in $LSF_BINDIR.
Where defined
Set by the system to the value of the parameter DJOB_ENV_SCRIPT in lsb.applications when running bsub -app for the specified application
See also
DJOB_ENV_SCRIPT in lsb.applications

LSB_DJOB_HB_INTERVAL
Syntax
LSB_DJOB_HB_INTERVAL=seconds
Description
Defines the time interval between heartbeat messages sent by the remote execution tasks to the head node. If the head node does not receive a heartbeat message from a task within two intervals, LSF takes action according to how LSB_DJOB_COMMFAIL_ACTION is specified.
Heartbeat message sending cannot be disabled. LSB_DJOB_HB_INTERVAL can be set as an environment variable of bsub. If defined, it overrides the DJOB_HB_INTERVAL configuration in the application profile. If neither parameter is defined, LSF uses the default value to report heartbeat messages. The default value is calculated with the following formula:
MAX(60, number_of_execution_hosts * 0.12)
Valid values must be positive integers.

LSB_DJOB_RU_INTERVAL
Syntax
LSB_DJOB_RU_INTERVAL=seconds
Description
Defines the time interval at which LSF reports parallel job rusage on each execution host. A value of 0 seconds disables the resource update. LSB_DJOB_RU_INTERVAL can be set as an environment variable of bsub. If defined, it overrides the DJOB_RU_INTERVAL configuration in the application profile. If neither LSB_DJOB_RU_INTERVAL nor DJOB_RU_INTERVAL is defined, LSF uses the default value to report resource usage. The default value is calculated with the following formula:
MAX(60, number_of_execution_hosts * 0.3)
Valid values are non-negative integers.

LSB_DJOB_PE_NETWORK
Description
Network resource information for IBM Parallel Environment (PE) jobs submitted with the bsub -network option, or to a queue (defined in lsb.queues) or an application profile (defined in lsb.applications) with the NETWORK_REQ parameter defined.
Where defined
Set by sbatchd before a job is dispatched.

LSB_DJOB_NUMPROC
Syntax
LSB_DJOB_NUMPROC=num
Description
The number of processors (slots) allocated to the job.
Default
Not defined
Where defined
Set by sbatchd before starting a job on the execution host.
See also
LSB_MCPU_HOSTS

LSB_DJOB_RANKFILE
Syntax
LSB_DJOB_RANKFILE=file_path
Description
When a job is submitted (bsub -hostfile) or modified (bmod -hostfile) with a user-specified host file, the LSB_DJOB_RANKFILE environment variable is generated from the user-specified host file. If a job is not submitted with a user-specified host file, then LSB_DJOB_RANKFILE points to the same file as LSB_DJOB_HOSTFILE.

Duplicate host names are combined, along with the total number of slots for a host name, and the results are used for scheduling (LSB_DJOB_HOSTFILE groups the hosts together) and for LSB_MCPU_HOSTS. LSB_MCPU_HOSTS represents the job allocation.

The esub parameter LSB_SUB4_HOST_FILE reads and modifies the value of the -hostfile option.

A host name repeated sequentially will be combined into one host name with a summary of slots, but host name order will be maintained.

For example, if the user-specified host file contains:
host01
host01 2
host02
host01
host02
host02
host03

then the bjobs and bhist commands will show the following allocation summary for the job:
USER-SPECIFIED HOST FILE:
HOST SLOTS
host01 3
host02 1
host01 1
host02 2
host03 1

Default
By default, points to LSB_DJOB_HOSTFILE.

LSB_DJOB_TASK_BIND
Syntax
LSB_DJOB_TASK_BIND=Y | y | N | n
Description
For CPU and memory affinity scheduling jobs launched with the blaunch distributed application framework.

If LSB_DJOB_TASK_BIND=Y in the submission environment before submitting the job, you must use blaunch to start tasks to enable LSF to bind each task to the proper CPUs or NUMA nodes. Only the CPU and memory bindings allocated to the task itself will be set in each task's environment. If LSB_DJOB_TASK_BIND=N, or it is not set, each task will have the same CPU or NUMA node binding on one host.
If you do not use blaunch to start tasks, and use another MPI mechanism such as IBM Platform MPI or IBM Parallel Environment, you should not set DJOB_TASK_BIND, or should set it to N.
Default
N
Where defined
Set by sbatchd before a job is dispatched.

LSB_ECHKPNT_METHOD
This parameter can be set as an environment variable and/or in lsf.conf. See LSB_ECHKPNT_METHOD in lsf.conf.

LSB_ECHKPNT_METHOD_DIR
This parameter can be set as an environment variable and/or in lsf.conf. See LSB_ECHKPNT_METHOD_DIR in lsf.conf.

LSB_ECHKPNT_KEEP_OUTPUT
This parameter can be set as an environment variable and/or in lsf.conf. See LSB_ECHKPNT_KEEP_OUTPUT in lsf.conf.

LSB_ERESTART_USRCMD
Syntax
LSB_ERESTART_USRCMD=command
Description
The original command used to start the job. This environment variable is set by erestart to pass the job's original start command to a custom erestart method erestart.method_name. The value of this variable is extracted from the job file of the checkpointed job.

If a job starter is defined for the queue to which the job was submitted, the job starter is also included in LSB_ERESTART_USRCMD. For example, if the job starter is /bin/sh -c "%USRCMD" in lsb.queues, and the job name is myapp -d, LSB_ERESTART_USRCMD is set to:
/bin/sh -c "myapp -d"
Where defined
Set by erestart as an environment variable before a job is restarted
See also
LSB_ECHKPNT_METHOD, erestart, echkpnt

LSB_EXEC_RUSAGE
Syntax
LSB_EXEC_RUSAGE="resource_name1 resource_value1 resource_name2 resource_value2..."
Description
Indicates which rusage string is satisfied to permit the job to run. This environment variable is necessary because the OR (||) operator specifies alternative rusage strings for running jobs.
Valid values
resource_value1, resource_value2,... refer to the resource values of resource_name1, resource_name2,... respectively.
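Because the value is a flat list of alternating resource names and values, a job script can walk it pairwise. The following minimal sketch hard-codes a hypothetical value of the documented form in place of what LSF would set:

```shell
# Hypothetical value of the documented form "name1 value1 name2 value2 ...";
# at run time LSF sets LSB_EXEC_RUSAGE after reserving resources for the job.
LSB_EXEC_RUSAGE="mem 100 swp 50"

set -- $LSB_EXEC_RUSAGE          # split into positional parameters
while [ "$#" -ge 2 ]; do
  echo "satisfied rusage: $1=$2"
  shift 2
done
```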
Default
Not defined
Where defined
Set by LSF after reserving a resource for the job.

LSB_EXECHOSTS
Description
A list of hosts on which a batch job will run.
Where defined
Set by sbatchd
Product
MultiCluster

LSB_EXIT_IF_CWD_NOTEXIST
Syntax
LSB_EXIT_IF_CWD_NOTEXIST=Y | y | N | n
Description
Indicates that the job will exit if the current working directory specified by bsub -cwd or bmod -cwd is not accessible on the execution host.
Default
Not defined
Where defined
From the command line

LSB_EXIT_PRE_ABORT
Description
The queue-level or job-level pre_exec_command can exit with this value if the job is to be aborted instead of being requeued or executed.
Where defined
Set by sbatchd
See also
PRE_EXEC in lsb.queues, or the -E option of bsub

LSB_EXIT_REQUEUE
Syntax
LSB_EXIT_REQUEUE="exit_value1 exit_value2..."
Description
Contains a list of exit values found in the queue's REQUEUE_EXIT_VALUES parameter defined in lsb.queues.
Valid values
Any positive integers
Default
Not defined
Notes
If LSB_EXIT_REQUEUE is defined, a job will be requeued if it exits with one of the specified values. LSB_EXIT_REQUEUE is not defined if the parameter REQUEUE_EXIT_VALUES is not defined.
Where defined
Set by the system based on the value of the parameter REQUEUE_EXIT_VALUES in lsb.queues
Example
LSB_EXIT_REQUEUE="7 31"
See also
REQUEUE_EXIT_VALUES in lsb.queues

LSB_FRAMES
Syntax
LSB_FRAMES=start_number,end_number,step
Description
Determines the number of frames to be processed by a frame job.
Valid values
The values of start_number, end_number, and step are positive integers. Use commas to separate the values.
Default
Not defined
Notes
When the job is running, LSB_FRAMES will be set to the relative frames with the format LSB_FRAMES=start_number,end_number,step.
From the start_number, end_number, and step, the frame job can determine how many frames it will process.
Where defined
Set by sbatchd
Example
LSB_FRAMES=10,20,1

LSB_HOSTS
Syntax
LSB_HOSTS="host_name..."
Description
A list of hosts selected by LSF to run the job.
Notes
If a job is run on a single processor, the system sets LSB_HOSTS to the name of the host used. For parallel jobs, the system sets LSB_HOSTS to the names of all the hosts used.
Where defined
Set by sbatchd when the job is executed. LSB_HOSTS is set only when the list of host names is less than 4096 bytes.
See also
LSB_MCPU_HOSTS

LSB_INTERACTIVE
Syntax
LSB_INTERACTIVE=Y
Description
Indicates an interactive job. When you submit an interactive job using bsub -I, the system sets LSB_INTERACTIVE to Y.
Valid values
LSB_INTERACTIVE=Y (if the job is interactive)
Default
Not defined (if the job is not interactive)
Where defined
Set by sbatchd

LSB_JOB_INCLUDE_POSTPROC
Syntax
LSB_JOB_INCLUDE_POSTPROC=Y | y | N | n
Description
Enables the post-execution processing of the job to be included as part of the job. LSB_JOB_INCLUDE_POSTPROC in the user environment overrides the value of JOB_INCLUDE_POSTPROC in lsb.params and lsb.applications.
Default
Not defined
Where defined
From the command line

LSB_JOBEXIT_INFO
Syntax
LSB_JOBEXIT_INFO="SIGNAL signal_value signal_name"
Description
Contains information about the signal that caused a job to exit. Applies to post-execution commands. Post-execution commands are set with POST_EXEC in lsb.queues. When the post-execution command is run, the environment variable LSB_JOBEXIT_INFO is set if the job is signalled internally. If the job ends successfully, or the job is killed or signalled externally, LSB_JOBEXIT_INFO is not set.
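A POST_EXEC script can split the variable into its three fields using the shell's positional parameters. This sketch hard-codes one of the documented example values in place of what sbatchd would set:

```shell
# Hard-coded for illustration; sbatchd sets this only when the job
# is signalled internally.
LSB_JOBEXIT_INFO="SIGNAL -1 SIG_CHKPNT"

if [ -n "$LSB_JOBEXIT_INFO" ]; then
  set -- $LSB_JOBEXIT_INFO       # $1=SIGNAL, $2=signal_value, $3=signal_name
  echo "job exited on signal $2 ($3)"
fi
```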
Examples
LSB_JOBEXIT_INFO="SIGNAL -1 SIG_CHKPNT"
LSB_JOBEXIT_INFO="SIGNAL -14 SIG_TERM_USER"
LSB_JOBEXIT_INFO="SIGNAL -23 SIG_KILL_REQUEUE"
Default
Not defined
Where defined
Set by sbatchd

LSB_JOBEXIT_STAT
Syntax
LSB_JOBEXIT_STAT=exit_status
Description
Indicates a job's exit status. Applies to post-execution commands. Post-execution commands are set with POST_EXEC in lsb.queues. When the post-execution command is run, the environment variable LSB_JOBEXIT_STAT is set to the exit status of the job. Refer to the man page for the wait(2) system call for the format of this exit status.

The post-execution command is also run if a job is requeued because the job's execution environment fails to be set up, or if the job exits with one of the queue's REQUEUE_EXIT_VALUES. The LSB_JOBPEND environment variable is set if the job is requeued. If the job's execution environment could not be set up, LSB_JOBEXIT_STAT is set to 0.
Valid values
Any positive integer
Where defined
Set by sbatchd

LSB_JOBFILENAME
Syntax
LSB_JOBFILENAME=file_name
Description
The path to the batch executable job file that invokes the batch job. The batch executable job file is a /bin/sh script on UNIX systems or a .BAT command script on Windows systems.

LSB_JOBGROUP
Syntax
LSB_JOBGROUP=job_group_name
Description
The name of the job group associated with the job. When a job is dispatched, if it belongs to a job group, the runtime variable LSB_JOBGROUP is defined as its group. For example, if a dispatched job belongs to job group /X, LSB_JOBGROUP=/X.
Where defined
Set during job execution based on bsub options or the default job group defined in DEFAULT_JOBGROUP in lsb.params and the LSB_DEFAULT_JOBGROUP environment variable.
Default
Not defined

LSB_JOBID
Syntax
LSB_JOBID=job_ID
Description
The job ID assigned by sbatchd. This is the ID of the job assigned by LSF, as shown by bjobs.
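A common use of LSB_JOBID is to build per-job file names inside a job script. In this sketch the job ID is a stand-in value; at run time sbatchd sets the variable for you:

```shell
# Stand-in value; sbatchd sets LSB_JOBID when the job actually runs.
LSB_JOBID=1234

logfile="myjob.${LSB_JOBID}.log"
echo "writing output to $logfile"
```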
Valid values
Any positive integer
Where defined
Set by sbatchd, defined by mbatchd
See also
LSB_REMOTEJID

LSB_JOBINDEX
Syntax
LSB_JOBINDEX=index
Description
Contains the job array index.
Valid values
Any integer greater than zero but less than the maximum job array size.
Notes
LSB_JOBINDEX is set when each job array element is dispatched. Its value corresponds to the job array index. LSB_JOBINDEX is set for all jobs. For non-array jobs, LSB_JOBINDEX is set to zero (0).
Where defined
Set during job execution based on bsub options.
Example
You can use LSB_JOBINDEX in a shell script to select the job command to be performed based on the job array index. For example:
if [ $LSB_JOBINDEX -eq 1 ]; then
  cmd1
fi
if [ $LSB_JOBINDEX -eq 2 ]; then
  cmd2
fi
See also
LSB_JOBINDEX_STEP, LSB_REMOTEINDEX

LSB_JOBINDEX_STEP
Syntax
LSB_JOBINDEX_STEP=step
Description
Step at which single elements of the job array are defined.
Valid values
Any integer greater than zero but less than the maximum job array size
Default
1
Notes
LSB_JOBINDEX_STEP is set when a job array is dispatched. Its value corresponds to the step of the job array index. This variable is set only for job arrays.
Where defined
Set during job execution based on bsub options.
Example
The following is an example of an array where a step of 2 is used:
array[1-10:2]
elements: 1 3 5 7 9
If this job array is dispatched, then LSB_JOBINDEX_STEP=2
See also
LSB_JOBINDEX

LSB_JOBNAME
Syntax
LSB_JOBNAME=job_name
Description
The name of the job defined by the user at submission time.
Default
The job's command line
Notes
The name of a job can be specified explicitly when you submit a job. The name does not have to be unique. If you do not specify a job name, the job name defaults to the actual batch command as specified on the bsub command line.
The job name can be up to 4094 characters long for UNIX and Linux, or up to 255 characters for Windows.
Where defined
Set by sbatchd
Example
When you submit a job using the -J option of bsub, for example:
% bsub -J "myjob" job
sbatchd sets LSB_JOBNAME to the job name that you specified:
LSB_JOBNAME=myjob

LSB_JOBPEND
Description
Set if the job is requeued.
Where defined
Set by sbatchd for POST_EXEC only
See also
LSB_JOBEXIT_STAT, REQUEUE_EXIT_VALUES, POST_EXEC

LSB_JOBPGIDS
Description
A list of the current process group IDs of the job.
Where defined
The process group IDs are assigned by the operating system, and LSB_JOBPGIDS is set by sbatchd.
See also
LSB_JOBPIDS

LSB_JOBPIDS
Description
A list of the current process IDs of the job.
Where defined
The process IDs are assigned by the operating system, and LSB_JOBPIDS is set by sbatchd.
See also
LSB_JOBPGIDS

LSB_MAILSIZE
Syntax
LSB_MAILSIZE=value
Description
Gives an estimate of the size of the batch job output when the output is sent by email. It is not necessary to configure LSB_MAILSIZE_LIMIT. LSF sets LSB_MAILSIZE to the size in KB of the job output, allowing the custom mail program to intercept output that is larger than desired.

LSB_MAILSIZE is not recognized by the LSF default mail program. To prevent large job output files from interfering with your mail system, use LSB_MAILSIZE_LIMIT to explicitly set the maximum size in KB of the email containing the job information.
Valid values
A positive integer
If the output is being sent by email, LSB_MAILSIZE is set to the estimated mail size in kilobytes.
-1
If the output fails or cannot be read, LSB_MAILSIZE is set to -1 and the output is sent by email using LSB_MAILPROG, if specified in lsf.conf.
Not defined
If you use the -o or -e options of bsub, the output is redirected to an output file.
Because the output is not sent by email in this case, LSB_MAILSIZE is not used and LSB_MAILPROG is not called. If the -N option is used with the -o option of bsub, LSB_MAILSIZE is not set.
Where defined
Set by sbatchd when the custom mail program specified by LSB_MAILPROG in lsf.conf is called.

LSB_MCPU_HOSTS
Syntax
LSB_MCPU_HOSTS="host_nameA num_processors1 host_nameB num_processors2..."
Description
Contains a list of the hosts and the number of CPUs used to run a job.
Valid values
num_processors1, num_processors2,... refer to the number of CPUs used on host_nameA, host_nameB,..., respectively
Default
Not defined
Notes
The environment variables LSB_HOSTS and LSB_MCPU_HOSTS both contain the same information, but the information is presented in different formats. LSB_MCPU_HOSTS uses a shorter format than LSB_HOSTS. As a general rule, sbatchd sets both of these variables. However, for some parallel jobs, LSB_HOSTS is not set.

For parallel jobs, several CPUs are used, and the length of LSB_HOSTS can become very long; sbatchd needs to spend a lot of time parsing the string. If the size of LSB_HOSTS exceeds 4096 bytes, LSB_HOSTS is ignored, and sbatchd sets only LSB_MCPU_HOSTS.

To verify the hosts and CPUs used for your dispatched job, check the value of LSB_HOSTS for single-CPU jobs, and check the value of LSB_MCPU_HOSTS for parallel jobs.
Where defined
Set by sbatchd before starting a job on the execution host
Example
When you submit a job with the -m and -n options of bsub, for example:
% bsub -m "hostA hostB" -n 6 job
sbatchd sets the environment variables LSB_HOSTS and LSB_MCPU_HOSTS as follows:
LSB_HOSTS="hostA hostA hostA hostB hostB hostB"
LSB_MCPU_HOSTS="hostA 3 hostB 3"
Both variables are set in order to maintain compatibility with earlier versions.
See also
LSB_HOSTS

LSB_NQS_PORT
This parameter can be defined in lsf.conf or in the services database such as /etc/services. See LSB_NQS_PORT in lsf.conf for more details.

LSB_NTRIES
Syntax
LSB_NTRIES=integer
Description
The number of times that LSF libraries attempt to contact mbatchd or perform a concurrent jobs query. For example, if this parameter is not defined, when you type bjobs, LSF keeps displaying "batch system not responding" if mbatchd cannot be contacted or if the number of pending jobs exceeds MAX_PEND_JOBS specified in lsb.params or lsb.users.

If this parameter is set to a value, LSF only attempts to contact mbatchd the defined number of times and then quits. LSF waits for a period of time equal to SUB_TRY_INTERVAL specified in lsb.params before attempting to contact mbatchd again.
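For example, to make client commands such as bjobs give up after five attempts instead of retrying indefinitely (five is an illustrative value):

```shell
# Illustrative value: contact mbatchd at most 5 times, then quit.
export LSB_NTRIES=5
# bjobs   # now reports failure after 5 attempts rather than retrying forever
```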
See LSB_NUM_NIOS_CALLBACK_THREADS in lsf.conf for more details. LSB_NTRIES Syntax LSB_NTRIES=integer Description The number of times that LSF libraries attempt to contact mbatchd or perform a concurrent jobs query. For example, if this parameter is not defined, when you type bjobs, LSF keeps displaying "batch system not responding" if mbatchd cannot be contacted or if the number of pending jobs exceeds MAX_PEND_JOBS specified in lsb.params or lsb.users. If this parameter is set to a value, LSF only attempts to contact mbatchd the defined number of times and then quits. LSF will wait for a period of time equal to SUB_TRY_INTERVAL specified in lsb.params before attempting to contact mbatchd again. Chapter 2. Environment Variables 633 Environment variable reference Valid values Any positive integer Default INFINIT_INT (The default is to continue the attempts to contact mbatchd) LSB_OLD_JOBID Syntax LSB_OLD_JOBID=job_ID Description The job ID of a job at the time it was checkpointed. When a job is restarted, it is assigned a new job ID and LSB_JOBID is replaced with the new job ID. LSB_OLD_JOBID identifies