BDS Config file

The config.bds file allows customizing bds's behavior.

The config file is usually located in $HOME/.bds/bds.config. Running bds without any arguments shows the config file's default location. You can provide an alternative path using the command line option -c.

The config file is roughly divided into sections. Parameters are not required to appear in any specific section; the grouping exists only for readability. The parameters in each section are explained below.

Default parameters

This section defines default parameters used when running tasks (such as system type, number of CPUs, memory, etc.). Most of the time you'd leave these options unspecified, but it can be convenient, for instance, to set system = local on your laptop and system = cluster on your production cluster.

  • mem: Default memory in bytes (a negative number means 'unspecified').
  • node: Default execution node (empty means 'unspecified').
  • queue: Default queue name (empty means 'unspecified').
  • retry: Default number of retries when a task fails (0 means no retry). Upon failure, a task is re-executed up to 'retry' times; i.e. a task is considered failed only after failing 'retry + 1' times.
  • system: Default system type. If unspecified, the default system is 'local' (run tasks on the local computer).
  • timeout: Task timeout in seconds (default is one day).
  • walltimeout: Task's wall-timeout in seconds (default is one day). Wall timeout includes all the time the task spends waiting to be executed, i.e. the total amount of time we are willing to wait for the task to finish. For example, if walltimeout is one day and a task is queued by the cluster system for one day (and never executed), it times out, even though it was never run.
  • taskShell: Shell used to run a task (default /bin/bash -eu\nset -o pipefail\n). WARNING: Make sure you use "-e" or some command line option that stops execution when an error is found.
  • sysShell: Shell used to run a sys command (default /bin/bash -euo pipefail -c). WARNING: Make sure you use "-e" or some command line option that stops execution when an error is found. WARNING: Make sure you use "-c" or some command line option that allows providing a script.
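
For instance, a 'Default parameters' block in bds.config might look like the sketch below (all values are illustrative; adjust them to your environment):

```
# Default parameters (illustrative values)
system = local        # run tasks on the local machine
retry = 2             # re-run a failed task up to 2 more times
timeout = 86400       # task timeout: one day, in seconds
walltimeout = 86400   # wall timeout: one day, in seconds
mem = -1              # negative number: memory unspecified
```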

Cluster options

This section defines parameters to customize bds to run tasks on your cluster.

  • pidRegex: Regex used to extract the PID from the cluster command's output (e.g. qsub). When bds dispatches a task to the cluster management system (e.g. by running the 'qsub' command), it expects the cluster system to report the jobID. Typically cluster systems show the jobID in the first output line; this regex is used to match that jobID. By default, the whole line is used. Note: Some clusters append the domain name to the ID and then never use it again, while others add a message (e.g. 'Your job ...'). Examples: pidRegex = "(.+).domain.com" and pidRegex = "Your job (\\S+)".
  • clusterRunAdditionalArgs: Command line arguments added to every cluster 'run' command (e.g. 'qsub'). The string is split on spaces (regex: '\s+') and appended to the cluster's run command. E.g. clusterRunAdditionalArgs = -A accountID -M user@gmail.com adds the four arguments { '-A', 'accountID', '-M', 'user@gmail.com' } immediately after the 'qsub' (or similar) command used to run tasks on the cluster.
  • clusterKillAdditionalArgs: Command line arguments added to every cluster 'kill' command (e.g. 'qdel'). The same rules as for 'clusterRunAdditionalArgs' apply.
  • clusterStatAdditionalArgs: Command line arguments added to every cluster 'stat' command (e.g. 'qstat'). The same rules as for 'clusterRunAdditionalArgs' apply.
  • clusterPostMortemInfoAdditionalArgs: Command line arguments added to every cluster 'post mortem info' command (e.g. 'qstat -f'). The same rules as for 'clusterRunAdditionalArgs' apply.
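
As a sketch, these options might be set as follows in bds.config (the account ID and email address are hypothetical placeholders):

```
# Extract the jobID from lines like "Your job 12345 ..."
pidRegex = "Your job (\\S+)"

# Extra arguments added to every 'qsub' invocation (hypothetical account/email)
clusterRunAdditionalArgs = -A accountID -M user@gmail.com
```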

SGE Cluster options

This section defines parameters to customize bds to run tasks on a Sun Grid Engine cluster.

IMPORTANT: In SGE clusters, make sure to add ENABLE_ADDGRP_KILL=true to the execd_params parameter of qconf -sconf. If this option is not enabled, SGE might not be able to kill bds subprocesses running on slave nodes: after a task is killed, either via Ctrl-C on bds or via a direct qdel command, the cluster reports the job as finished, but the processes may still be running on the slave node.

  • sge.pe: Parallel environment used in SGE (e.g. 'qsub -pe mpi 4').
  • sge.mem: Parameter used to request memory in qsub (e.g. qsub -l mem 4G).
  • sge.timeout: Parameter used to request a timeout in qsub (e.g. qsub -l h_rt 24:00:00).
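
A minimal SGE section might look like this sketch (the parallel environment and resource names vary between SGE installations; check yours with qconf before copying these values):

```
sge.pe = orte          # parallel environment: qsub -pe orte <cpus>
sge.mem = mem_free     # memory request:       qsub -l mem_free=<mem>
sge.timeout = h_rt     # timeout request:      qsub -l h_rt=<timeout>
```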

Note on SGE's parallel environment ('-pe'):

The defaults were chosen to be compatible with StarCluster. A parallel environment defines how 'slots' (the number of CPUs requested) are allocated. By default, StarCluster sets up a parallel environment called "orte", configured for OpenMPI integration within SGE, with a number of slots equal to the total number of processors in the cluster. See the details with qconf -sp orte:

pe_name            orte
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

Notice the allocation_rule = $round_robin line. This defines how slots are assigned to a job. By default, StarCluster configures round-robin allocation: if a job requests 8 slots, for example, SGE goes to the first machine and grabs a single slot if available, moves to the next machine and grabs a single slot if available, and so on, wrapping around the cluster again if necessary, until 8 slots have been allocated to the job.

You can also configure the parallel environment to localize slots as much as possible, using the "fill_up" allocation rule and setting job_is_first_task to TRUE.

To configure: qconf -mp orte

Generic Cluster options

The 'generic' cluster type invokes user-defined scripts to manipulate tasks. This lets you write custom scripts for particular cluster environments (e.g. environments not currently supported by bds).

Note: Either provide each script's full path, or make sure the scripts are in your PATH.

Note: These scripts "communicate" with bds by printing information on STDOUT. The information has to be printed in a very specific format. Failing to adhere to the format will cause bds to fail in unexpected ways.

Note: A command path starting with '~' refers to your $HOME directory, and one starting with '.' is relative to the config file's directory.

  • clusterGenericRun: Script executed when a task is submitted to the cluster.
  • clusterGenericKill: Script executed to kill a task.
  • clusterGenericStat: Script executed to show the jobIDs of all jobs currently scheduled in the cluster.
  • clusterGenericPostMortemInfo: Script executed to get information about a recently finished jobId. This information is typically used for debugging and is added to bds's output.
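
For example, the corresponding bds.config entries might point at a set of custom scripts (the paths below are hypothetical; note that '~' refers to $HOME and '.' to the config file's directory):

```
clusterGenericRun            = ~/clusterScripts/run.sh
clusterGenericKill           = ~/clusterScripts/kill.sh
clusterGenericStat           = ~/clusterScripts/stat.sh
clusterGenericPostMortemInfo = ~/clusterScripts/postMortemInfo.sh
```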

Generic cluster: clusterGenericRun

Script's expected output: The script MUST print the cluster's jobID AS THE FIRST LINE. Make sure to flush STDOUT to prevent other lines from being printed out of order.

Command line arguments:

  • Task's timeout in seconds. Negative number means 'unlimited' (i.e. let the cluster system decide)
  • Task's required CPUs: number of cores within the same node.
  • Task's required memory in bytes. Negative means 'unspecified' (i.e. let the cluster system decide)
  • Cluster's queue name. Empty means "use cluster's default"
  • Cluster's STDOUT redirect file. This is where the cluster should redirect STDOUT.
  • Cluster's STDERR redirect file. This is where the cluster should redirect STDERR
  • Cluster command and arguments to be executed (typically a "bds -exec ..." command).

Examples: For examples of how to build this script, take a look at the config/clusterGeneric* directory in the source code.
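
The sketch below illustrates the argument handling and the jobID-first-line contract. It is NOT a working submitter: the 'submit' function is a stub standing in for the real cluster submit command (e.g. qsub), and the default argument values exist only so the sketch runs standalone; replace both for your scheduler.

```shell
#!/bin/bash -eu
# Sketch of a 'clusterGenericRun' script. The 'submit' stub stands in for
# the real cluster submit command (e.g. qsub); replace it for your scheduler.
submit() {
    # A real scheduler would print a line like this
    echo 'Your job 12345 ("task") has been submitted'
}

# Arguments, in the order bds passes them (defaults let the sketch run standalone)
timeout="${1:--1}"             # task timeout in seconds (-1: unlimited)
cpus="${2:-1}"                 # required CPUs (cores within one node)
mem="${3:--1}"                 # required memory in bytes (-1: unspecified)
queue="${4:-}"                 # queue name (empty: cluster default)
stdoutFile="${5:-/dev/null}"   # where the cluster should redirect STDOUT
stderrFile="${6:-/dev/null}"   # where the cluster should redirect STDERR
shift $(( $# > 6 ? 6 : $# ))   # remaining arguments: the command to execute

# Submit the task and extract the jobID from the scheduler's first output line
firstLine=$(submit "$@" | head -n 1)
jobId=$(echo "$firstLine" | sed -E 's/^Your job ([0-9]+).*/\1/')

# bds REQUIRES the jobID as the FIRST line printed on STDOUT
echo "$jobId"
```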

Generic cluster: clusterGenericKill

Script's expected output: None

Command line arguments: jobId: the jobID returned on the first line by the 'clusterGenericRun' script (i.e. the jobID provided by the cluster management system).

Examples: For examples of how to build this script, take a look at the config/clusterGeneric* directory in the source code.

Generic cluster: clusterGenericStat

This script is executed to show the jobIDs of all jobs currently scheduled in the cluster.

Script's expected output: This script is expected to print all jobs currently scheduled or running in the cluster (e.g. qstat), one per line. The FIRST column should be the jobID (columns are space or tab separated). Other columns may exist (but are currently ignored).

Command line arguments: None

Examples: For examples of how to build this script, take a look at the config/clusterGeneric* directory in the source code.

Generic cluster: clusterGenericPostMortemInfo

Script's expected output: The output is not parsed; it is stored and later shown in bds's report. It should contain information relevant to the job's execution (e.g. qstat -f $jobId or checkjob -v $jobId).

Command line arguments: jobId: the jobID returned on the first line by the 'clusterGenericRun' script (i.e. the jobID provided by the cluster management system).

Examples: For examples of how to build this script, take a look at the config/clusterGeneric* directory in the source code.

SSH Cluster options

The 'ssh' cluster type creates a virtual cluster using several nodes accessed via ssh.

  • ssh.nodes: Defines the user names and nodes to be accessed via ssh.

Examples:

A trivial 'ssh' cluster composed only of the localhost accessed via ssh (useful for debugging)

ssh.nodes = user@localhost

Some company's servers used as an ssh cluster

ssh.nodes = user@lab1-1company.com, user@lab1-2company.com, user@lab1-3company.com, user@lab1-4company.com, user@lab1-5company.com

A StarCluster run on Amazon AWS

ssh.nodes = sgeadmin@node001, sgeadmin@node002, sgeadmin@node003, sgeadmin@node004, sgeadmin@node005, sgeadmin@node006