Download & Install

Please take a look at the download page.

Introduction

BigDataScript is intended as a scripting language for big data pipelines.

What?

BigDataScript is a cross-system scripting language for working with big data pipelines in computer systems of different sizes and capabilities.

Why?

Working with heavyweight computation and big data pipelines involves using several specialized programs. Those specialized routines need to be scheduled, called and coordinated; their progress needs to be tracked and their results logged. That is the job of another script or program, and this is where BigDataScript becomes extremely handy.

Developing traditional shell scripts or small programs to coordinate data pipelines presents a fundamental dilemma: they are not cross-platform. A script simply does not work on all environments, or it needs adaptation and re-work for the same thing to run on a laptop, server, server farm, cluster and cloud; often it is simply not possible. Because of that, porting big data pipelines to a different environment is time consuming, and the behaviour on the target environment cannot be assumed to be an exact extrapolation of the results obtained on the development environment. This is not only a waste of time, money and energy, it is also a reliable source of frustration.

BigDataScript is the solution to the problem.

With BigDataScript, creating jobs for big data is as easy as creating a shell script and it runs seamlessly on any computer system, no matter how small or big it is. If you normally use specialized programs to perform heavyweight computations, then BigDataScript is the glue to those commands you need to create a reliable pipeline.

How?

Benefits of BigDataScript

  • Reduced development time

    Spend less time debugging your work on big systems with huge data volumes. Now you can debug the same jobs using a smaller sample on your own computer. Get immediate feedback, debug, fix and deploy when it's done. Shorter development cycles mean better software.

  • System independent

    Cross-system, seamless execution: the same program runs on a laptop, server, server farm, cluster or cloud. No changes to the program required. Do the work once.

  • Easy to learn

    The syntax is intuitive and resembles that of the most commonly used programming languages. Reading the code is easy as pi.

  • Automatic Checkpointing

    If any task fails to execute, BigDataScript creates a checkpoint file, serializing all the information from the program. Want to restart where it stopped? No problem, just resume the execution from the checkpoint.

  • Automatic logging

    Everything is logged (-log command line option), no explicit actions required. Every time you execute a system command or a task, BigDataScript logs the executed commands, stdout & stderr and exit codes.

  • Clean stop with no mess behind

    You have a BigDataScript program running in a terminal and suddenly you realize something is wrong... Just hit Ctrl-C. All scheduled tasks and running jobs will be terminated, removed from the queue and deallocated from the cluster. A clean stop allows you to focus on the problem at hand without having to worry about restoring a clean state.

  • Task dependencies

    In complex pipelines, tasks usually depend on each other. BigDataScript provides ways to easily manage task dependencies.

  • Avoid re-work

    Executing the pipeline over and over should not re-do jobs that already completed successfully, especially when they are time consuming. Timestamp-based task dependency is a built-in feature, making it easy to avoid starting from scratch every time.

  • Built in debugger

    Debugging is an integral part of programming, so it is part of the bds language. The breakpoint and debug statements make debugging part of the language, instead of requiring platform specific tools.

  • Built in test cases facility

    Code testing is part of everyday programming, so testing is built into bds.

Paper & Citations

If you are using BigDataScript in an academic environment, please cite our paper:

BigDataScript: A scripting language for data pipelines 
P. Cingolani; R. Sladek; M. Blanchette
Bioinformatics 2014;
doi: 10.1093/bioinformatics/btu595

A word about performance

BigDataScript is meant to be used in the context of heavyweight computations, so potential delays incurred by BigDataScript should not affect the overall execution time.
Think about it this way: If you are invoking a set of programs to perform big data computations, these programs usually take hours or days to run. The fact that BigDataScript takes a few milliseconds more to invoke those programs, really doesn't make any difference.

Why is it called "BigDataScript"

Because that's the lamest name I could find.

Disclaimer

BigDataScript is experimental and under heavy development. Use at your own risk. Known side effects include: computer explosions, instant decapitation, spontaneous human combustion, and dead kittens.

Hello world

As we all know, showing that we can print "Hello world" is more important than showing that the language is Turing complete.

  • Create a simple program and execute it
    File test_01.bds
    #!/usr/bin/env bds
    
    print "Hello world\n"
    
    $ ./test_01.bds 
    Hello world
    
  • This time we do it by running a system command ( echo ), using bds' sys expression. A sys expression executes the command immediately on the local computer and waits until the command finishes. Everything after sys until the end of the line is interpreted as an OS command.
    File test_02.bds
    #!/usr/bin/env bds
    
    sys echo Hello world
    
    $ ./test_02.bds 
    Hello world
    
  • Now let's run the same command as a 'task'. Tasks schedule the system command for execution (either locally, on a cluster, etc.)

    File test_03.bds
    #!/usr/bin/env bds
    
    task echo Hello world
    
    Just run the script to execute tasks locally
    $ ./test_03.bds
    Hello world
    
    You can also execute on a cluster, for instance, if you are on a cluster's head node, just run:
    $ bds -s cluster ./test_03.bds
    Hello world
    
    Note that in order to execute on another architecture (cluster), we did not change the bds program, we just added a command line option. Programs can be executed on different computer systems of different sizes without changing the code.


Language

Learning the BigDataScript language (bds) is almost trivial; all the statements, expressions and data types do what you expect.

BigDataScript is really simple and you should be able to code within a few minutes. This section is intended as a reference, so just glance through it.

  • Comments: The usual statements are available
    // Single line comment
    
    # Another single line comment
    
    /*
       Multi-line comment
    */
    
  • Statements can be terminated either by semicolon or by a new line.
    # Two statements
    print "Hi\n"; print "Bye\n";
    
    # Two statements, same as before but using lines instead of semicolon
    print "Hi\n" 
    print "Bye\n"
    
  • break : Breaks from current loop
    for( int i=0 ; i < 10 ; i++ ) {
        if( i == 5 ) break;	   // Finish when we reach 5
    }
    
  • breakpoint : Inserts a debugging breakpoint. I.e. when the statement is executed, bds switches execution to debug mode (STEP)
    breakpoint "Program execution will switch do debug mode here!\n"
    
  • continue : Continue with the next iteration of the current loop
    for( int i=0 ; i < 10 ; i++ ) {
        if( i == 5 ) continue;	// Skip value 5
    }
    
  • debug : Show a debug message on STDERR only if bds is running in 'debug' mode (otherwise the statement is ignored).
    debug "Show this message only if we are in debug mode!\n"
    
  • error Show an error message and exit the program
    if( num <= 0 ) error "Number MUST be positive\n"
    
  • exit : Exit program, optional expression calculates an exit value.
    exit 1
    
  • for C or Java like for statement
    for( int i=0 ; i < 10 ; i++ ) print("$i\n")
    
    or
    for( int i=0 ; i < 10 ; i++ ) {
        print("$i\n")
    }
    
  • for Java like for iterator on lists
    string[] mylist
    
    // ... some code to populate the list
    
    for( string s : mylist ) print("$s\n")
    
  • if / else It does exactly what you expect
    if( i < 10 )	print("Less than ten\n")
    
    or
    if( i < 10 ) {
        print("Less than ten\n")
    } else if( i <= 20 ) {
        print("Between ten and twenty\n")
    } else {
        print("More than twenty\n")
    }
    
  • include Include source code from another file
    include "mymodule"
    
    // ... use functions from 'mymodule.bds'
    
  • kill Kill a task
    kill taskId
    
  • print / println Print to stdout
    print "Show this message without a new line at the end."
    println "This one gets a new line at the end."
    
  • return Return from a function. Optional expression is a return value.
    // Define a function
    int twice(int n) {
    	return( 2 * n )
    }
    
  • switch Switch statements are similar to multiple if / else if statements
    in := 'x'
    out := 1
    
    switch( in ) {
        case 'a': 
            out *= 3
            break
    
        case 'z'+'x':   # Note that the 'case' expressions are evaluated at run time (you can even call functions here)
            out *= 5    # Note that this falls through to "case 'b'"
    
        case 'b':
            out *= 7
            break
    
        default:        # You can define 'default' anywhere (no need to do it after 'case')
            out *= 100
    }
    
  • warning Show a warning message
    if( num <= 0 )	warning "Number should be positive\n"
    
  • while typical while iterator
    while( i < 10 ) i++
    
  • type varName Declare variable 'varName' as type 'type'
    int i      # 'i' is a 64 bit int variable
    real r     # 'r' is a double-precision floating-point number
    string s   # 's' is a string
    
  • type varName = expr Declare variable 'var' as type 'type', evaluate expression and assign result to initialize 'var'.
    int i = 42
    real r = 3.1415927
    string s = "Hello!"
    
  • varName := expr Declare variable 'var', use type inference, evaluate expression 'expr' and assign result to initialize 'var'
    i := 42
    r := 3.1415927
    s := "Hello!"
    
  • var = expr Evaluate expression 'expr' and assign result to 'var'
    i = j + 1
    s = "Hello " + world
    
  • ( var1, var2, ..., varN ) = expr Evaluate expression 'expr' (which must return a list) and assign results to 'var1', 'var2', etc. If the list size is less than the number of variables, variables are assigned default values (e.g. '0' for int). If the list has more values, they are ignored.
    (name, value) = line.split('\t')
    
  • Ternary operator expr ? exprTrue : exprFalse Evaluate 'expr'; if true, evaluate and return 'exprTrue', otherwise evaluate and return 'exprFalse'
    sign = ( i >= 0 ? 1 : -1 )
    


Example: A simple, and useless, example:
// Define a function
int sumPositive(int n) {
    if( n <= 0 )	return 0

    int sum = 0
    for( int i=0 ; i <= n ; i++ ) sum = sum + i
    return sum
}

// Function definition in one line
int twice(int n)    return( 2 * n )

// Main
n := 5
print("The sum is : " + sumPositive( twice(n) ) + "\n" )
Obviously, if you run it
$ bds z.bds 
The sum is : 55

These are statements, operators, and expressions that are unique to bds. We just enumerate them here, but we explain details on what they mean and how they work in the following sections. This list is intended as a reference for people that are already familiar with these concepts, so don't despair if you don't understand what they mean.

  • <- Dependency operator. Return true if any left-hand side file needs to be updated with respect to any right-hand side file
    # Evaluate dependency 
    if( 'out.txt' <- 'in.txt' )    print("File out.txt needs to be updated\n")
    
  • checkpoint Create a checkpoint. Optional expression is a file name
    # Wait for all tasks to finish
    checkpoint "program.chp"
    
  • dep Define a dependency task. Dependency tasks are not scheduled for execution until a goal decides which dependencies must be executed to satisfy an output.
    # Declare a bwa command (create an index of the human genome) as a dependency
    dep bwa index hg19.fasta
    
  • par Execute code in parallel
    par {
        for( int i=0 ; i < 10 ; i++ ) {
            print("This is executed in parallel: $i\n")
        }
    }
    
    or just call a function in parallel
    par doSomething(arg1, arg2)
    
  • sys Execute an OS command. Execution is immediate and local (i.e. in the same computer as bds is running). It's intended for executing fast OS commands (not heavyweight processing).
    # Execute an "ls" command
    sys ls -al
    
  • task Schedule an OS command for execution. Depending on the value of system, the execution can be local, in a cluster, remote server, etc.
    # Execute bwa command (create an index of the human genome)
    task bwa index hg19.fasta
    
  • wait Wait for task(s) to finish. It can be one task, a list of tasks or all tasks (if no expression)
    # Wait for all tasks to finish
    wait
    
    # Wait for one task to finish
    wait taskId
    
    # Wait for several tasks to finish
    wait listOfTaskIDs
    

BDS is a statically typed language with simple data types.

The language is statically typed; the intention is to catch errors before the program runs rather than at runtime.

There are only a few basic types and, for the moment, bds doesn't offer extensible data types (structs or classes), but this might change soon.
Type Meaning
string A string (same as Java's String)
int A 64 bit integer number (same as Java's long)
real A 64 bit IEEE 754 number (same as Java's double)
bool A boolean value, can be 'true' or 'false' (same as Java's boolean)
Arrays, Lists, Stacks These are all the same, just different names for a list of elements
Maps Maps are hashes (a.k.a. dictionaries) that have string keys
There is no "null" value. Again, the idea is to minimize points of failure.
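
For instance, a minimal sketch that touches each basic type (the variable names and values are just illustrative):
string name = "BigDataScript"      # A string
int count   = 42                   # A 64 bit integer
real ratio  = 0.75                 # A 64 bit IEEE 754 number
bool ok     = true                 # A boolean
string[] words = ["hello", "bye"]  # A list of strings
string{} dict                      # A map with string keys
dict{"greeting"} = "hello"
print("$name $count $ratio $ok\n")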

Strings

There are several basic methods defined for strings:
Return type Method / Operator Meaning
string s = s1 + s2 Concatenate strings.
string s += s2 Append to string.
bool string.endsWith(string str) True if string ends with str
bool string.isEmpty() True if the string is empty
int string.indexOf(string str) Index of the first occurrence of str in string
int string.lastIndexOf(string str) Index of the last occurrence of str in string
int string.length() String's length
string string.replace(string str1,string str2) A new string replacing 'str1' with 'str2'
bool string.parseBool() Parse a bool
int string.parseInt() Parse an int number
real string.parseReal() Parse a real number
string[] string.split(string regex) Split using a regular expression
bool string.startsWith(string str) True if string starts with str
string string.substr(int start) Substring from start to end of string
string string.substr(int start,int end) Substring from start to end
string string.toLower() Return a lower case version of the string
string string.toUpper() Return an upper case version of the string
string string.trim() Trim spaces at the beginning and at the end
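
As a quick illustration, here is a minimal sketch using a few of the methods above (the variable and values are just examples):
s := "BigDataScript"
print("Upper case : " + s.toUpper() + "\n")
print("Length     : " + s.length() + "\n")
print("Replaced   : " + s.replace("Big", "Small") + "\n")
print("Substring  : " + s.substr(0, 3) + "\n")      # "Big"
if( s.startsWith("Big") ) print("It starts with 'Big'\n")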

Strings as file names

Strings can be used in several different ways. For instance, it is common for a string to represent a file name in a script, so you can use file related methods on a string. E.g.:
string f = "in.txt"
if( f.canRead() ) {
    print (" Can read file $f\n" )
}
Here f is a string, but it has a method canRead() which returns true if f is a file and it can be read.

More file related methods:
Return type Method Meaning
string string.baseName() File's base name
string string.baseName(string ext) File's base name, removing extension 'ext'
string string.download() Download data from URL (string). Returns local file name (empty string if failed)
bool string.download(string file) Download data from URL to 'file'. Returns true if succeeded.
bool string.canRead() True if file has read permission
bool string.canWrite() True if file has write permission
bool string.canExec() True if file has execution permission
void string.chdir() Change current directory
bool string.delete() Delete file
string[] string.dir() List files in a directory ('ls')
string[] string.dir(string regex) List files matching a 'glob' (regular expression for files)
string string.dirName() File's directory name
string[] string.dirPath() List files using canonical paths
string[] string.dirPath(string regex) List files, matching a 'glob' (regular expression for files), using canonical paths
string string.extName() File's extension
bool string.exists() True if file exists
bool string.isDir() True if it's a directory
bool string.isFile() True if it's a file
bool string.mkdir() Create dir ('mkdir -p')
string string.path() Canonical path to file
string string.pathName() Canonical dir to file
string string.read() Read the whole file into a string
string[] string.readLines() Read the whole file and split the lines
string string.removeExt() Remove file extension
string string.removeExt(string ext) Remove file extension, only if it matches the provided one
bool string.rm() Delete a file
bool string.rmOnExit() Remove the file when the current thread finishes execution.
int string.size() File size in bytes
string string.swapExt(string newExt) Swap file extension
string string.swapExt(string oldExt,string newExt) Swap file extension, only if extension matches the provided 'oldExt'
bool string.upload() Upload data to URL (string). Returns true if succeeded.
bool string.upload(string file) Upload data from 'file' to URL. Returns true if succeeded.
string string.write(string file) Write string to 'file'
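
For example, a small sketch using some of these methods (it assumes a file 'data.txt' exists in the current directory):
f := "data.txt"
if( f.exists() ) {
    print("Size       : " + f.size() + " bytes\n")
    print("Base name  : " + f.baseName() + "\n")
    print("Extension  : " + f.extName() + "\n")
    print("New name   : " + f.swapExt("csv") + "\n")   # Only builds the name 'data.csv', it does not rename the file
    lines := f.readLines()
    print("Line count : " + lines.size() + "\n")
}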

Strings as task IDs

Strings can also be used to refer to tasks. When a task is created, the task expression returns a task ID, which is a string. This task ID can be used for task operations, for instance:
tid := task echo Hello
wait tid
Here the wait statement will wait until the task "echo Hello" finishes executing.

More task related methods:
Return type Method Meaning
bool string.isDone() True if the task finished
bool string.isDoneOk() True if the task finished without errors
string string.stdout() A string with all the STDOUT generated from this task
string string.stderr() A string with all the STDERR generated from this task
int string.exitCode() Exit code
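
For example, a minimal sketch combining a task expression with the methods above:
tid := task echo Hello from a task
wait tid                                # Make sure the task has finished
if( tid.isDoneOk() ) {
    print("Task STDOUT was : " + tid.stdout() )
    print("Exit code       : " + tid.exitCode() + "\n")
} else {
    print("Task failed with exit code " + tid.exitCode() + "\n")
}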

Arrays, List, Stacks

Arrays, lists and stacks are all the same thing. You can create a list of strings simply by declaring:
string[] arrayEmpty
string[] array = ["one", "two", "three"]
Similarly, a list of ints is just
int[] listIntEmpty
int[] primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

Methods
Returns Method Meaning
+= Append element(s) at the end of the list
Element added add(X) Add X to the end of the list
Element added add(int idx,X) Add X to position idx in the list
Same list delete() Delete all files in the list (assumes list elements are file names). Same as list.rm()
int count(X) Count number of occurrences of X in the list
bool has(X) Does the list contain X?
First element head() Get first element
int indexOf(X) Position of element X in the list
bool isEmpty() 'true' if the list is empty
string join() A string joining all elements of the list (separator ' ')
string join(string sep) A string joining all elements of the list (separator 'sep')
Last element pop() Get the last element and remove it from the list
Element pushed push(X) Add X at the end of the list
Removed element remove(X) Remove element X from the list
Removed element removeIdx(int idx) Remove the element at position idx from the list
New reversed list reverse() Create a new, reversed copy of the list
Same list rm() Delete all files (assumes list elements are file names)
Same list rmOnExit() Delete all files when current thread finishes execution (assumes list elements are file names)
int size() Return the number of elements in the list
New sorted list sort() Create a new list sorting the elements of this list
List tail() Create a new list with all but the first element
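
For example, a minimal sketch using some of these methods (it assumes list methods convert the int elements to strings when joining or printing):
int[] nums = [5, 3, 11, 7]
nums.add(13)                                   # nums is now [5, 3, 11, 7, 13]
sorted := nums.sort()                          # New list: [3, 5, 7, 11, 13]
rest := sorted.tail()                          # All but the first element
print("Sorted : " + sorted.join(", ") + "\n")
print("First  : " + sorted.head() + "\n")
print("Rest   : " + rest.join(", ") + "\n")
if( nums.has(7) ) print("The list contains 7\n")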

Iterating on an array/list
You can iterate on an array simply by doing
$ cat z.bds 
string[] array = ["one", "two", "three"]

for( string val : array ) { print("Value: $val\n") }

$ bds z.bds
Value: one
Value: two
Value: three

Maps

Maps are hashes that use strings as keys. You can create a map simply by declaring:
string{} mstr	# This maps string keys to string values

mstr{"Hello"} = "Bye"
mstr{"Bonjour"} = "Au revoir"
mstr{"Hola"} = "Adios"
or a map of real numbers
real{} mre   # This maps string keys to real values
mre{"one"}   = 1.0
mre{"two"}   = 2.0
mre{"e"}     = 2.7182818
mre{"three"} = 3.0
mre{"pi"}    = 3.1415927

Methods
Returns Method Meaning
bool hasKey(string key) True if the key is in the map
bool hasValue(value) True if 'value' is in the map
list keys() A sorted list of all keys in the map
bool remove(key) Remove key from this map
int size() Number of elements in this map
list values() A sorted list of all values in the map
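
For example, a small sketch using these methods (the map contents are just illustrative):
string{} capital
capital{"France"} = "Paris"
capital{"Japan"}  = "Tokyo"

if( capital.hasKey("Japan") ) print("We know Japan's capital\n")
print("Number of entries : " + capital.size() + "\n")
ks := capital.keys()
print("All keys          : " + ks.join(", ") + "\n")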

Iterating on a map
You can iterate over all values in a map, simply by doing
$ cat z.bds 
string{} mstr = { "Hello" => "Bye", "Bonjour" => "Au revoir", "Hola" => "Adios" }

for(string v : mstr ) { 
		print("Values : $v\n") 
}

$ bds z.bds
Values : Adios
Values : Au revoir
Values : Bye

If you want to iterate on keys instead of values, you can do this:
$ cat z.bds 
string{} mstr = { "Hello" => "Bye", "Bonjour" => "Au revoir", "Hola" => "Adios" }

for(string k : mstr.keys() ) { 
		print("Key : $k\tValue : " + mstr{k} + "\n") 
}

$ bds z.bds
Key : Bonjour	Value : Au revoir
Key : Hello	Value : Bye
Key : Hola	Value : Adios

BigDataScript provides some predefined functions.


Function Meaning
int abs(int x) Absolute value of a number
real abs(real x) Absolute value of a number
real acos(real x) The trigonometric arc-cosine of a number
real asin(real x) The trigonometric arc-sine of a number
real atan(real x) The trigonometric arc-tangent of a number
real atan2(real y, real x) Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
assert(bool expr) Used for testing: Throw an error if expr is false
assert(string msg, bool expr) Used for testing: Throw error message msg if expr is false
assert(string msg, bool expected, bool result) Used for testing: Throw error message msg if result value is not equal to expected (compare bool)
assert(int expected, int result) Used for testing: Throw an error if result value is not equal to expected (compare int)
assert(string msg, int expected, int result) Used for testing: Throw error message if result value is not equal to expected (compare int)
assert(string expected, string result) Used for testing: Throw an error if result value is not equal to expected (compare string)
assert(string msg, string expected, string result) Used for testing: Throw error message msg if result value is not equal to expected (compare string)
real cbrt(real x) The cube root of a number
real ceil(real x) The ceiling of a number
string{} config(string fileName) Read and parse 'fileName', return <name,value> pairs in a map.
Parsing: Lines starting with '#' are ignored, so are blank lines.
Name/Value delimiters can be any of ':', '=' or '\t' (the first one found in each line will be used).
The following are valid and equivalent:
name : value
name = value
name \t value
string{} config(string fileName, string{} defaults) Same as config(string fileName), but using 'defaults' as default values (for entries not found in 'fileName')
real copySign(real x, real y) Returns the first floating-point argument with the sign of the second floating-point argument
real cos(real x) The trigonometric cosine of an angle
real cosh(real x) The hyperbolic cosine of an angle
real exp(real x) Return e^x
real expm1(real x) Return e^x - 1
real floor(real x) The floor of a number
int getExponent(real x) exponent used in the representation of a real
real hypot(real x, real y) Returns sqrt(x^2 + y^2) without intermediate overflow or underflow.
real IEEEremainder(real x, real y) Computes the remainder operation on two arguments as prescribed by the IEEE 754 standard.
log(string msg) Log 'msg' (i.e. show to stderr)
real log(real x) Natural logarithm of a number
real log10(real x) Logarithm (base 10) of a number
real log1p(real x) Natural logarithm of '1+x'
int max(int n1, int n2) Maximum of two numbers
real max(real n1, real n2) Maximum of two numbers
int min(int n1, int n2) Minimum of two numbers
real min(real n1, real n2) Minimum of two numbers
real nextAfter(real x, real y) Returns the number adjacent to the first argument in the direction of the second argument
real nextUp(real x) Returns the floating-point value adjacent to x in the direction of positive infinity.
real pow(real x, real y) Return x^y
print( expr ) Show to stdout (same as 'print' statement)
printErr( expr ) Show to stderr
printHelp() Print automatically generated help message (see 'help' statement)
real rand() Random number [0, 1] interval
int randInt() Random number (64 bits)
int randInt(int range) Random number [0, range] interval
void randSeed(int seed) Set random seed (for current thread)
int[] range(min, max) A list of numbers between [min, max] inclusive
int[] range(min, max, step) A list of numbers between [min, min+step, min+2*step, ... ]. Includes max if min+N*step = max
real[] range(min, max, step) A list of numbers between [min, min+step, min+2*step, ... ]. Includes max if min+N*step = max
real rint(real x) Returns the real value that is closest in value to the argument and is equal to a mathematical integer
int round(real x) Rounded number
real scalb(real x, int sf) Return x * 2^sf rounded
real signum(real x) The sign function of a number
real sin(real x) The trigonometric sine of an angle
real sinh(real x) The hyperbolic-sine of an angle
sleep( int seconds ) Sleep for 'seconds'
sleep( real seconds ) Sleep for '1000 * seconds' milliseconds. E.g. sleep(0.5) sleeps for half a second
real sqrt(real x) The square root of a number
real tan(real x) The trigonometric tangent of an angle
real tanh(real x) The hyperbolic tangent of an angle
int time() Return the milliseconds elapsed since epoch
real toDegrees(real x) Convert x radians to degrees
real toRadians(real x) Convert x degrees to radians
int toInt(bool b) Convert boolean to int
int toInt(real r) Convert real to int
real ulp(real r) Returns the size of an ulp of the argument
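
A short sketch exercising a few of these functions (a minimal example; the values in the comments are approximate):
print("sqrt(2.0)    : " + sqrt(2.0) + "\n")       # ~1.414
print("min(3, 7)    : " + min(3, 7) + "\n")       # 3
print("log10(100.0) : " + log10(100.0) + "\n")    # 2.0

# 'range' builds a list of numbers, handy in for loops
for( int i : range(1, 3) ) print("i = $i\n")

# 'assert' is meant for test cases
assert("min should return the smaller value", 3, min(3, 7))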

BigDataScript provides some predefined variables.


Whenever you run a BigDataScript program you have several predefined variables:
Variable name Meaning
allowEmpty, canFail, cpus, mem, node, queue, retry, system, taskName, timeout, walltimeout These are the default values for task. Their meanings are explained in the Task section of this handbook.
string programName The program's name
string programPath The program's path
string[] args Arguments used to invoke the program, i.e. all command line options after program name (bds [options] prog.bds args ...).
string ppwd Canonical (physical) path to directory where the program is being executed.
int cpusLocal Number of cores in the computer running the script
All shell variables All shell variables at the moment of invocation (e.g. HOME, PWD, etc.)
int K, M, G, T, P Kilo, Mega, Giga, Tera, Peta (2^10, 2^20, 2^30, 2^40 and 2^50 respectively)
real E, PI Euler's number e (2.718281...) and Pi (3.1415927...)
int minute, hour, day, week Number of seconds in a minute, hour, day and week respectively
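
For instance, a minimal sketch printing a few of these variables (the exact output depends on where and how you run it):
print("Program name : $programName\n")
print("Program path : $programPath\n")
print("Command args : $args\n")
print("Local CPUs   : $cpusLocal\n")
print("4 Gigabytes  : " + (4 * G) + " bytes\n")
print("Two hours    : " + (2 * hour) + " seconds\n")
print("Shell HOME   : $HOME\n")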

Creating data pipelines

In order to create data pipelines, you need to execute 'tasks' and coordinate execution dependencies. Here we show how to do it in bds

The most basic operation is to execute a task, which is done using a task expression. bds takes care of executing a task on different environments (local computer, server, cluster, etc.), so you don't need to focus on mundane details (such as cluster queue monitoring, or querying remote computers for resources).

In this toy example, we schedule 10 tasks for execution. I'm running this on a computer that only has 8 CPUs, so not all tasks can execute in parallel.
File test_04.bds
#!/usr/bin/env bds

for( int i=0 ; i < 10 ; i++ ) {
	task echo Hi $i ; sleep 1 ; echo Done $i ; sleep 1
}
Note that bds interpolates the variable $i (string interpolation simply means replacing a variable reference inside the string with its value).

$ ./test_04.bds
Hi 1
Hi 6
Hi 2
Hi 4
Hi 3
Hi 7
Hi 0
Hi 5
Done 1
Done 6
Done 2
Done 4
Done 3
Done 7
Done 0
Done 5
Hi 9
Hi 8
Done 9
Done 8
In this case, we are running on an 8-core computer, so the first 8 tasks are executed in parallel. The rest are executed when the first tasks finish. Task execution order is not guaranteed (e.g. a cluster scheduler can decide to run tasks out of order). We'll see how to coordinate tasks later.

Here we show how exactly the same script runs on a cluster; keep in mind that not a single line of code changed. We copy the same script (used in the previous section) to a cluster and execute it, but now the tasks are scheduled using MOAB, Torque, PBS or Grid Engine.

$ bds -s cluster test_04.bds


In a typical pipeline we must be able to control task execution which depends on previously executed tasks. Here we show how.

The simplest form of execution control is to wait until one or more tasks finish before executing the next task(s). This is done using the wait statement.

In this example, we run a set of tasks (echo Hi $i) and after all the tasks finished, we run another set of tasks (echo Bye $i)
File test_05.bds
#!/usr/bin/env bds

for( int i=0 ; i < 10 ; i++ ) {
    task echo Hi $i ; sleep 1 
}

wait
print("After wait\n")

for( int i=0 ; i < 10 ; i++ ) {
    task echo Bye $i ; sleep 1
}
Running the script, we get
$ ./test_05.bds
Hi 7
Hi 0
Hi 6
Hi 5
Hi 1
Hi 3
Hi 4
Hi 2
Hi 8
Hi 9
After wait
Bye 1
Bye 0
Bye 7
Bye 2
Bye 6
Bye 4
Bye 3
Bye 5
Bye 9
Bye 8
  • The wait statement is like a barrier: until all tasks have finished, the program does not continue. If any task fails, a checkpoint file is created, where all program data is serialized. We can correct the problem and restart the pipeline where it left off.
  • The wait statement can wait for all tasks, for single tasks or for a list of tasks.
  • We can then run the same script on a cluster. Again, this is done simply by using the -s cluster command line option. Tasks are scheduled on the cluster, always honoring the wait statement.
  • The exit code of a bds script is 0 if all tasks executed without any problems, and non-zero if any task failed.

The dependency operator provides a simple way to see if a file needs to be updated (i.e. recalculated) with respect to some inputs. For instance, when we have already processed some files and have the corresponding results, we may save some work if the inputs have not changed (just like the 'make' command does).

We introduce the dependency operator <- (pronounced 'dep'), which is a "make" style operator. The expression out <- in is true if the 'out' file needs to be updated. More formally, the expression is true if the file name represented by the variable 'out' does not exist, is empty (zero length) or has a modification time earlier than that of 'in'.
E.g.:

File test_06.bds

#!/usr/bin/env bds

string inFile  = "in.txt"
string outFile = "out.txt"

# Create 'in.txt' if it doesn't exist
if( !inFile.exists() ) {
    task echo Creating $inFile; echo Hello > $inFile
}

wait

# Create 'out.txt' only if it needs to be updated with respect to 'in.txt'
if( outFile <- inFile ) {
    task echo Creating $outFile; cat $inFile > $outFile
}
When executing for the first time, both tasks are executed and both files ('in.txt' and 'out.txt') are created.
$ ./test_06.bds
Creating in.txt
Creating out.txt
If we execute for a second time, since files have not changed, no task is executed.
$ ./test_06.bds
$
If we now change the contents of 'in.txt', and run the script again, the second task will be executed (because 'out.txt' needs to be updated with respect to 'in.txt')
# Update 'in.txt'
$ date > in.txt

# Since we updated the input file, the output must be recalculated
$ ./test_06.bds 
Creating out.txt
  • Summary: If the file 'out.txt' is up to date with respect to 'in.txt', the following condition will be false and the task will not execute
    if( outFile <- inFile ) {
    	task echo Creating $outFile; cat $inFile > $outFile
    }
    
  • This construction is so common that we allow for some syntactic sugar.
    task( outFile <- inFile ) { 
    	sys echo Creating $outFile; cat $inFile > $outFile
    }
    


    Programming task dependencies can be difficult. BDS can help by automatically inferring task dependencies and executing tasks in the correct order.


    In this example, we have two tasks:

    • The first task uses an input file 'in.txt', to create an intermediate file 'inter.txt'
    • The second task uses the intermediate file 'inter.txt' to create the output file 'out.txt'.
    The script below does not have a wait statement. Instead, bds automatically infers that the second task depends on the first one, and does not start it until the first task finishes. Notice that we never explicitly tell bds to wait for the first task (there is no wait statement).

    File test_07.bds
    #!/usr/bin/env bds
    
    # We use ':=' for declaration with type inference
    inFile       := "in.txt"		
    intermediate := "inter.txt"
    outFile      := "out.txt"
    
    task( intermediate <- inFile) {
        sys echo Creating $intermediate; cat $inFile > $intermediate; sleep 1 ; echo Done $intermediate
    }
    
    task( outFile <- intermediate ) {
        sys echo Creating $outFile; cat $intermediate > $outFile; echo Done $outFile
    }
    
    As a side note: We used the := operator to declare variables using type inference. So we can write inFile := "in.txt" instead of string inFile = "in.txt", which not only is shorter to type, but also makes the code look cleaner.

    Now let's run the script
    # Delete old files (if any)
    $ rm *.txt
    
    # Create input file
    $ date > in.txt
    
    # Run
    $ ./test_07.bds 
    Creating inter.txt
    Done inter.txt
    Creating out.txt
    Done out.txt
    
    Note how the second task is executed only after the first one finished.

    The goal statement helps to program complex task scheduling interdependencies.

    In the previous example, we had an input file 'in.txt', an intermediate file 'inter.txt' and an output file 'out.txt'. One problem is that if we delete the intermediate file 'inter.txt' (e.g. because we may want to delete big intermediate result files), then both tasks will be re-executed.
    For convenience, here is the code again. File test_07.bds

    #!/usr/bin/env bds
    
    inFile       := "in.txt"		
    intermediate := "inter.txt"
    outFile      := "out.txt"
    
    task( intermediate <- inFile) {
        sys echo Creating $intermediate; cat $inFile > $intermediate; sleep 1 ; echo Done $intermediate
    }
    
    task( outFile <- intermediate ) {
        sys echo Creating $outFile; cat $intermediate > $outFile; echo Done $outFile
    }
    
    # Remove intermediate file
    $ rm inter.txt 
    
    # Re-execute script 
    $ ./test_07.bds
    Creating inter.txt
    Done inter.txt
    Creating out.txt
    Done out.txt
    
    Why is this happening? The reason is that task statements are evaluated in order. So when bds evaluates the first task expression, the dependency intermediate <- inFile is true (because 'inter.txt' doesn't exist, so it must be updated with respect to 'in.txt'). After that, when the second task expression is evaluated, outFile <- intermediate is also true, since 'inter.txt' is newer than 'out.txt'. As a result, both tasks are re-executed, even though 'out.txt' is up to date with respect to 'in.txt'. This can be a problem, particularly if each task requires several hours of execution.

    There are two ways to solve this; the obvious one is to add a simple 'if' statement surrounding the tasks:

    #!/usr/bin/env bds
    
    inFile       := "in.txt"		
    intermediate := "inter.txt"
    outFile      := "out.txt"
    
    if( outFile <- inFile) {
      task( intermediate <- inFile) {
        sys echo Creating $intermediate; cat $inFile > $intermediate; sleep 1 ; echo Done $intermediate
      }
    
      task( outFile <- intermediate ) {
        sys echo Creating $outFile; cat $intermediate > $outFile; echo Done $outFile
      }
    }
    
    Although it solves the issue, the code is not elegant.

    The alternative is to use dep and goal
    • dep defines a task exactly the same way as a task expression, but it is not evaluated to decide whether the task should be executed (it's just declarative).
    • goal executes all the dependencies necessary to create an output
    Example:

    File test_08.bds
    #!/usr/bin/env bds
    
    inFile       := "in.txt"		
    intermediate := "inter.txt"
    outFile      := "out.txt"
    
    dep( intermediate <- inFile) {
        sys echo Creating $intermediate; cat $inFile > $intermediate; sleep 1 ; echo Done $intermediate
    }
    
    dep( outFile <- intermediate ) {
        sys echo Creating $outFile; cat $intermediate > $outFile; echo Done $outFile
    }
    
    goal outFile
    
    If we execute this script
    # Delete old files (if any)
    $ rm *.txt
    
    # Create input file
    $ date > in.txt
    
    # Run script (both tasks should be executed)
    $ ./test_08.bds
    Creating out.txt
    Done out.txt
    Creating inter.txt
    Done inter.txt
    
    Now we delete 'inter.txt' and re-execute
    # Delete intermediate file
    $ rm inter.txt 
    
    # Run again (out.txt is still up to date with respect to in.txt, so no task should be executed)
    $ ./test_08.bds
    $ 
    
    As you can see, no task is executed the second time, since 'out.txt' is up to date with respect to 'in.txt'. The fact that the intermediate file 'inter.txt' was deleted is ignored, which is what we wanted.
  • Sys

    Local, immediate command execution.

    A few rules about sys expression:

    • Everything after sys until the end of the line is interpreted to be a command
    • If the line ends with a backslash, next line is interpreted as part of the same command (same as in a shell script)
    • Variables are interpolated. E.g. sys echo Hello $person will replace '$person' by the variable's value before sending it to the OS for execution.
    • BDS immediately executes the OS command in the local machine and waits until the command finishes execution.
    • If the command exits with an error condition, then bds creates a checkpoint, and exits with a non-zero exit code.
    • No resource accounting is performed; the command is executed even if all CPUs are busy executing tasks. E.g.
      $ cat z.bds
      print("Before\n")
      sys echo Hello
      print("After\n")
      
      $ bds z.bds
      Before
      Hello
      After
      
    • Everything after a sys keyword until the end of the line is considered part of the command to be executed. So multiple shell commands can be separated by semicolon
      sys echo Hello ; echo Bye
      
    • Multi-line statements are allowed, by using a backslash at the end of the line (same as a shell script)
      sys echo HeLLo \
      	| tr [A-Z] [a-z] \
      	| grep hell
      
    • sys returns the STDOUT of the command:
      File test_12.bds
      #!/usr/bin/env bds
      
      dir := sys ls *.bds | head -n 3
      print("\nVariable dir is:\n$dir\n")
      
      Executing, we get:
      $ ./test_12.bds 
      test_01.bds
      test_02.bds
      test_03.bds
      
      Variable dir is:
      test_01.bds
      test_02.bds
      test_03.bds
      
      Note that (roughly) the first half of the output is printed by the command execution, while the second half is printed by the print statement (i.e. printing the variable dir).
    • Characters are passed literally to the interpreting shell. For example, when you write '\t' it is NOT converted to a tab character before sending it to the shell (no escaping is required):
      $ cat z.bds
      sys echo -e "Hello\tWorld" | awk '{print $1 "\n" $2}'
      
      $ bds z.bds
      Hello
      World
      
      

    Task

    Queued command execution with resource management.

    A task expression, just like a sys expression, also executes a command. The main difference is that a task is "scheduled for execution" instead of executed immediately. Task execution order is not guaranteed, but bds provides a mechanism for creating task dependencies by means of wait statements.

    A task expression either performs basic resource management or delegates resource management to cluster management tools. The idea is that if you schedule a hundred tasks, but you are executing on your laptop which only has 4 CPUs, then bds will only execute 4 tasks at a time (assuming each task is declared to consume 1 CPU). The rest of the tasks are queued for later execution. As executing tasks finish and CPUs become available, the remaining tasks are executed.
    Similarly, if you schedule 10,000 tasks for execution, but your cluster only has 1,000 cores, then only 1,000 tasks will be executed at a given time. Again, other tasks are queued for later execution, but in this case, all the resource management is done by your cluster's workload management system (e.g. GridEngine, PBS, Torque, etc.).

    Most cluster resource management systems do not guarantee that tasks are executed in the same order as they were queued. Even if they do, or if tasks are executed on the same host, a task can start execution and immediately be preempted, so the next task in the queue can effectively start before the previous one.

    There are different ways to execute tasks
    System type Typical usage How it is done
    local Running on a single computer. E.g. programming and debugging on your laptop or running stuff on a server A local queue is created; the total number of CPUs used by all running tasks never exceeds the number of CPU cores available
    ssh A server farm or a bunch of desktops or servers without a workload management system (e.g. computers in a University campus) Basic resource management is performed by logging into all computers in the 'cluster' and monitoring resource usage.
    cluster Running on a cluster (GridEngine, Torque) Tasks are scheduled for execution (using 'qsub' or equivalent command). Resource management is delegated to cluster workload management.
    moab Running on a MOAB/PBS cluster Tasks are scheduled for execution (using 'msub'). Resource management is delegated to cluster workload management.
    pbs Running on a PBS cluster Tasks are scheduled for execution (using 'msub'). Resource management is delegated to cluster workload management.
    sge Running on a SGE cluster Tasks are scheduled for execution (using 'qsub'). Resource management is delegated to cluster workload management.
    generic Enable user defined scripts to run, kill and find information on tasks This 'generic' cluster allows the user to write/customize scripts that send jobs to the cluster system. It can be useful to either add cluster systems not currently supported by bds, or to customize parameters and scheduling options beyond what bds allows to customize in the config file. For details, see bds.config file and examples in the project's source code (directories config/clusterGeneric*).
    mesos Running on a Mesos framework Tasks are scheduled for execution in Mesos framework and resource management is delegated to Mesos.

    Scheduling tasks

    A task is scheduled by means of a task expression. A task expression returns a task ID, a string representing a task. E.g.:
    File test_09.bds
    tid := task echo Hello 
    print("Task is $tid\n")
    
    Running we get:
    $ ./test_09.bds
    Task is test_09.bds.20140730_214947_810/task.line_3.id_1
    Hello
    
    task is non-blocking, which means that bds continues execution immediately without waiting for the task to finish. So, many tasks can be scheduled by simply invoking a task statement many times.
    Once a task is scheduled, execution order depends on the underlying system and there is absolutely no guarantee about execution order (unless you use a wait statement or another dependency mechanism).
    E.g., this example clearly shows that tasks are NOT executed in order, even on a local computer:
    File test_10.bds
    #!/usr/bin/env bds
    
    for( int i=0 ; i < 10 ; i++ ) task echo Hi $i
    
    $ ./test_10.bds
    Hi 0
    Hi 5
    Hi 4
    Hi 3
    Hi 2
    Hi 1
    Hi 7
    Hi 6
    Hi 9
    Hi 8
    

    Resource consumption and task options

    Often a task requires many CPUs or other resources. In such cases, we should inform the resource management system in order to get an efficient allocation of resources (plus, many cluster systems kill tasks that fail to report their resource usage correctly).
    E.g., in this example we allocate 4 CPUs per task and run on an 8-core computer, so obviously only 2 tasks can run at the same time:
    File test_11.bds
    #!/usr/bin/env bds
    
    for( int i=0 ; i < 10 ; i++ ) {
        # Inform resource management that we need 4 cores for each of these tasks
        task ( cpus := 4 ) {
            sys echo Hi $i ; sleep 1; echo Done $i
        }
    }
    
    Executing on my 8-core laptop, you can see that only 2 tasks are executed at a time (each task is declared to require 4 cpus):
    $ ./test_11.bds
    Hi 0
    Hi 1
    Done 0
    Done 1
    Hi 3
    Hi 2
    Done 2
    Done 3
    Hi 4
    Hi 5
    Done 4
    Done 5
    Hi 6
    Hi 7
    Done 6
    Done 7
    Hi 9
    Hi 8
    Done 8
    Done 9
    

    List of resources or task options
    Variable name Default value Resource / Task options
    cpus 1 Number of CPU (cores) used by the process.
    allowEmpty false If true, empty files are allowed in task's outputs. This means that a task producing empty files does not result in program termination and checkpointing.
    canFail false If true, a task is allowed to fail. This means that a failed task execution does not result in program termination and checkpointing.
    timeout 0 Number of seconds that a process is allowed to execute. Ignored if zero or less. If process runs more than timeout seconds, it is killed.
    node If possible this task should be executed on a particular cluster node. This option is only used for cluster systems and ignored on any other systems.
    queue Queue name of preferred execution queue (only for cluster systems).
    retry 0 Number of times a task can be re-executed until it's considered failed.
    taskName Assign a task name. This adds a label to the task as well as the taskId returned by task expression. Task ID is used to create log files related to the task (shell script, STDOUT, STDERR and exitCode files) so those file names are also changed. This makes it easier to find tasks in the final report and log files (it has no effect other than that). Note: If taskName contains non-allowed characters, they are sanitized (replaced by '_').
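
    For example, a hedged sketch combining several of these options in a single task ('run_flaky_step.sh' is a hypothetical command):
    task( canFail := true, retry := 2, timeout := 3600, taskName := "flaky_step" ) {
        # Hypothetical command: allowed to fail, retried up to 2 times, killed if it runs longer than one hour
        sys ./run_flaky_step.sh
    }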

    Conditional execution

    Conditional execution of tasks can, obviously, be achieved using an if statement. Since conditional execution is so common, we allow for some syntactic sugar: task( expression1, expression2, ... ) { ... }, where expression1, expression2, etc. are either boolean expressions or variable declarations. The task is executed only if all boolean expressions are true.
    So the following two programs are equivalent
    shouldExec := true
    if( shouldExec ) {
    	task( cpus := 4 ) {
    		sys echo RUNNING
    	}
    }
    
    Is the same as:
    shouldExec := true
    
    task( shouldExec, cpus := 4 ) {
    	sys echo RUNNING
    }
    
    This feature is particularly useful when combined with the dependency operator <-. For instance, the following task will be executed only if 'out.txt' needs to be updated with respect to 'in.txt'
    in  := 'in.txt'
    out := 'out.txt'
    
    task( out <- in , cpus := 4 ) {
    	sys echo $in > $out
    }
    

    Syntax sugar

    There are many ways to write task expressions, here we show some examples.
    • A simple task
      task echo RUNNING
      
    • The same simple task
      task {
      	sys echo RUNNING
      }
      
    • A simple, multi-line task (a backslash at the end of the line continues in the next line, just like in a shell script)
      task cat file.txt \
      		| grep "^results" \
      		| cut -f 2 \
      		| sort \
      		> out.txt
      
    • A more complex multi-line task (sys commands are just multiple lines in a bash script)
      task {
      	sys cat file.txt | grep "^results" > out.txt
      	sys cat other.txt | grep "^exclude" > words.txt
      	sys grep -v -f words.txt out.txt > excluded.txt
      	sys wc -l excluded.txt
      }
      
    • A task with dependencies
      task ( out <- in ) {
      	sys cat $in | grep "^results" > $out
      	sys cat other.txt | grep "^exclude" > words.txt
      	sys grep -v -f words.txt $out > excluded.txt
      	sys wc -l excluded.txt
      }
      
    • A task with multiple input and output dependencies
      task ( [out1, out2] <- [in1, in2] ) {
      	sys cat $in1 | grep "^results" > $out1
      	sys cat $in1 $in2 | wc -l > $out2
      }
      
    • A task with multiple input and output dependencies, using 4 CPUs and declaring a local variable 'tmp'
      task ( [out1, out2] <- [in1, in2] , cpus := 4 , tmp := "$in1.tmp" ) {
      	sys cat $in1 | grep "^results" > $out1
      	sys cat $in1 $in2 > $tmp
      	sys wc -l $tmp | wc -l > $out2
      }
      
    • A task with a label (taskName) is easier to find in the report
      task ( out <- in, cpus := 4 , taskName := "Filter results" ) {
      	sys cat $in | grep "^results" > $out
      }
      

    Wait

    Task coordination mechanisms rely on waiting for some tasks to finish before starting new ones.

    As we mentioned several times, task execution order is not guaranteed.
    File test_13.bds
    #!/usr/bin/env bds
    
    for( int i=0 ; i < 10 ; i++ ) task echo BEFORE $i
    for( int i=0 ; i < 10 ; i++ ) task echo AFTER $i
    
    $ ./test_13.bds
    BEFORE 0
    BEFORE 4
    BEFORE 3
    BEFORE 2
    BEFORE 1
    BEFORE 5
    BEFORE 7
    BEFORE 6
    BEFORE 8
    AFTER 1
    AFTER 0
    BEFORE 9	<-- !!!
    AFTER 6
    AFTER 5
    AFTER 4
    AFTER 3
    AFTER 2
    AFTER 7
    AFTER 8
    AFTER 9
    
    If a task must be executed after another task finishes, we can introduce a wait statement.
    File test_14.bds
    #!/usr/bin/env bds
    
    for( int i=0 ; i < 10 ; i++ ) task echo BEFORE $i
    
    wait    # Wait until ALL scheduled tasks finish
    print("We are done waiting, continue...\n")
    
    for( int i=0 ; i < 10 ; i++ ) task echo AFTER $i
    
    
    Now we are sure that all the 'AFTER' tasks really run after the 'BEFORE' ones:
    $ ./test_14.bds 
    BEFORE 0
    BEFORE 2
    BEFORE 1
    BEFORE 4
    BEFORE 3
    BEFORE 5
    BEFORE 6
    BEFORE 7
    BEFORE 8
    BEFORE 9
    We are done waiting, continue...
    AFTER 0
    AFTER 1
    AFTER 2
    AFTER 3
    AFTER 4
    AFTER 5
    AFTER 6
    AFTER 7
    AFTER 8
    AFTER 9
    
    We can also wait for a specific task to finish by providing its task ID (wait taskId), e.g.:
    string tid = task echo Hi
    wait tid	# Wait only for one task
    
    Or you can wait for a list of tasks. For instance, in this program, we create a list of two task IDs and wait on the list:
    string[] tids
    
    for( int i=0 ; i < 10 ; i++ ) {
    	# Tasks that wait a random amount of time
    	int sleepTime = randInt( 5 )
    	string tid = task echo BEFORE $i ; sleep $sleepTime ; echo DONE $i
    
    	# We only want to wait for the first two tasks
    	if( i < 2 ) tids.add(tid)
    }
    
    # Wait for all tasks in the lists (only the first two tasks)
    wait tids
    print("End of wait\n")
    
    When we run it, we get:
    $ bds z.bds
    BEFORE 2
    BEFORE 0
    BEFORE 7
    BEFORE 5
    BEFORE 6
    BEFORE 4
    BEFORE 3
    BEFORE 1
    DONE 0
    DONE 3
    DONE 4
    DONE 5
    DONE 6
    DONE 7
    BEFORE 8
    BEFORE 9
    DONE 1
    End of wait		<- Wait finished here
    DONE 2
    DONE 8
    DONE 9
    
    There is an implicit wait statement at the end of the program. So a program does not exit until all tasks have finished running.

    Dependency operator

    You can use a dependency operator <- to decide whether tasks should be run or not, based on file existence and time-stamps.

    The dependency operator is written as out <- in. It is true if out file needs to be created or updated. This means that the operator is true if any of the following is satisfied:
    • out file does not exist
    • out file is empty (has zero length)
    • out latest modification time is earlier than in latest modification time

    File test_15.bds
    in  := "in.txt"
    out := "out.txt"
    if( out <- in ) print("We should update $out\n")
    
    Running the script:
    $ touch in.txt              # Create in.txt
    $ ./test_15.bds
    We should update out.txt
    
    $ touch out.txt             # Create zero length out.txt
    $ ./test_15.bds
    We should update out.txt
    
    $ ls > out.txt              # Create a non-empty out.txt
    $ ./test_15.bds
    $                           # Nothing done
    
    $ echo hi > in.txt          # Update in.txt
    $ ./test_15.bds
    We should update out.txt    # Logically, out needs updating
    

    In the dependency operator, in and out can be lists of files. The same rules apply: the operator is true if any out file is missing or zero length, or if the minimum modification time in out is earlier than the maximum modification time in in.

    This can be also used on lists:
    in1 := "in1.txt"
    in2 := "in2.txt"
    out := "out.txt"
    
    if( out <- [in1, in2] ) print("We should update $out\n")
    
    or even:
    in1 := "in1.txt"
    in2 := "in2.txt"
    out1 := "out1.txt"
    out2 := "out2.txt"
    
    if( [out1, out2] <- [in1, in2] ) print("We should update $out1 and $out2\n")
    

    A typical usage of <- is in conjunction with task. E.g.
    task( out <- in ) {
        sys cat $in > $out
    }
    
    The command is executed only if out needs updating

    Goals

    Complex dependencies can be defined using goal and dep

    goal and dep are used to express dependencies in a declarative manner. As opposed to task expressions, which are evaluated immediately, dep defines a dependency (using the same syntax as task) that is not evaluated until a goal requires it to be triggered.

    E.g.: File test_18.bds
    #!/usr/bin/env bds
    
    in   := 'in.txt'
    mid1 := 'mid1.txt'
    mid2 := 'mid2.txt'
    out  := 'out.txt'
    
    stime := 3
    
    # Dependencies: there is no need to declare them in order
    dep( out <- mid2 )     sys echo $mid2 > $out  ; echo OUT   ; sleep 1
    dep( mid2 <- mid1 )    sys echo $mid1 > $mid2 ; echo MID2  ; sleep 1
    dep( mid1 <- in )      sys echo $in   > $mid1 ; echo MID1  ; sleep 1
    
    goal out
    
    
    Running the code, we get
    # Remove old files (if any)
    $ rm *.txt
    
    # Create input
    $ date > in.txt
    
    $ ./test_18.bds
    MID1
    MID2
    OUT
    
    In this case, bds created a directed acyclic graph of the dependencies needed to satisfy the goal 'out.txt' and then executed the required 'dep' declarations.

    Note: A goal expression returns a list of task Ids to be executed, which can be quite useful for debugging purposes. So in the previous example you could write:
    tids := goal out
    print "Executing tasks: $tids\n"
    

    Intermediate files within a 'goal' can be deleted


    In the previous example, if we delete the intermediate files 'mid1.txt' and/or 'mid2.txt' and re-execute the script, bds will notice that the output 'out.txt' is still valid with respect to the input 'in.txt' and will not execute any task.

    # Remove intermediate files
    $ rm mid?.txt
    
    # Re-execute (out.txt is still valid because in.txt was not changed)
    $ ./test_18.bds
    $
    

    How this works: bds calculates the dependency graph and checks whether the goal is up to date with respect to the inputs (which are the leaves in the dependency graph). If the goal is up to date, then nothing is done. This feature is particularly useful when intermediate files are large and we need to clean them up (since we are working with big data problems, this is often the case).

    Dependencies that do not create files


    What if the final step in your pipeline does not create any files? In this case, you can use a task ID as a goal, for example:

    #!/usr/bin/env bds
    
    tid := dep( taskName := 'hi' ) {
        sys echo Hello
    }
    
    goal tid	# We use task Id instead of a file name
    


    Multiple goals


    Sometimes it is convenient to fire multiple goals at once. You can do this by passing a list, instead of a string, to goal.

    #!/usr/bin/env bds
    
    string[] outs
    for(int i=0; i < 3 ; i++ ) {
        in := "in.$i.txt"
        out := "out.$i.txt"
        outs += out
    
        sys date > $in
        dep( out <- in ) sys cat $in > $out ; echo Hi $i
    }
    
    goal outs	# We use a list of goals, it is interpreted as multiple goal statements (one for each item in the list)
    

    Remote files (Amazon S3, Http, etc.)

    Often applications need to run tasks on remote data files; bds can transparently handle remote data dependencies.

    In many cases data files may reside in non-local file systems, such as HTTP or Amazon's S3 object storage. Fortunately bds can transparently handle remote dependencies, download the input files and upload the results without you having to write a single line of code.

    Example 1: In this example, the remote file index.html is a remote input file to the task. Since index.html is hosted on GitHub's servers, it is not available on the computer where the script is running. Before the command (cat) is executed, the remote file is transparently downloaded by bds.

    in  := 'http://pcingola.github.io/BigDataScript/index.html'
    out := 'tmp.html'
    
    task( out <- in ) sys cat $in > $out
    

    Notice that:
    1. there is no code for downloading the remote file (index.html) in the script;
    2. the file is downloaded on the processing node performing the task, which may differ from the node running the script (e.g. if it is running on a cluster);
    3. task dependencies are verified without downloading data, so the task, as well as the corresponding download / upload operations, are only performed if required;
    4. if the file is required in the future, bds checks if the local (cached) copy is still valid, and uses the cached file if possible (saving bandwidth and time).

    Example 2: The following example is slightly more complicated: the input ('index.html') is processed (cat and echo commands) and the results are stored in an Amazon S3 object. Once more, notice that bds transparently takes care of downloading the file and then uploading the output to Amazon's S3.

    in  := 'http://pcingola.github.io/BigDataScript/index.html'
    out := 's3://pcingola.bds/test_remote_12.txt'
    
    task( out <- in ) {
    	sys cat $in > $out
    	sys echo "This line is appended to the file" >> $out
    }
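
    An S3 object can also be used as a remote input. A minimal sketch (the bucket and object names below are hypothetical, not taken from the original examples); bds downloads the object transparently before the task runs:
    in  := 's3://my-bucket/data/input.txt'
    out := 'local_copy.txt'
    
    task( out <- in ) sys cat $in > $out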
    

    Parallel execution par

    bds can run parallel code as threads in the same program.


    Sometimes multiple branches of an analysis pipeline must be run in parallel. bds provides a simple par expression to run code in parallel. Originally this was called parallel, but then I realized I was too lazy to type all those letters, so I reduced it to par (both of them work if you choose to be more verbose).

    E.g.: File test_16.bds
    #!/usr/bin/env bds
    
    par {
        # This block runs in parallel
        for( int i : range(1, 5) ) {
            print("Parallel $i\n")
            sleep( 0.2 )
        }
    }
    
    for( int i : range(1, 5) ) {
        print("Main $i\n")
        sleep( 0.2 )
    }
    
    If we run this code:
    $ ./test_16.bds
    Parallel 1
    Main 1
    Parallel 2
    Main 2
    Main 3
    Parallel 3
    Main 4
    Parallel 4
    Parallel 5
    Main 5
    
    Perhaps a more elegant way to write the same code would be:
    #!/usr/bin/env bds
    
    void count(string msg) {
        for( int i : range(1, 5) ) {
            print("$msg $i\n")
            sleep( 0.2 )
        }
    }
    
    par count('Parallel')   # Call function in parallel thread
    count('Main')           # Call function in 'main' thread
    
    par also accepts optional expressions, all of which must be 'true' for the block to be evaluated.
    par( out <- in )  {
        # This block runs in parallel if 'out' needs to be updated
        for( int i : range(1, 5) ) {
            tmp := "$in.$i.tmp"
            task head -n $i $in | tail -n 1 > $tmp
        }
        wait
        task cat $in.*.tmp > $out
    }
    


    Wait in 'par' context

    par expressions return a 'parallel ID' string that we can use in wait
    pid := par longRunningFunction()    // This function is executed in parallel 
    
    wait pid                            // Wait for parallel to finish
    
    Here the wait statement waits until the function longRunningFunction() finishes.

    We mentioned before that, by default, a wait statement with no arguments waits for 'all' tasks to finish. More precisely, wait waits for all tasks scheduled by the current thread and for all 'parallels' it started. So a wait statement with no arguments will not resume execution until all threads and tasks triggered by the current thread have finished.
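
    A minimal sketch of these semantics (the function below is just a stand-in for a long computation):
    void doWork() {
        sleep(1)            # stand-in for a long-running computation
    }
    
    pid := par doWork()     # run the function in a parallel thread
    task sleep 1            # a task scheduled by the main thread
    wait                    # no arguments: waits for the parallel thread AND the task
    print("All done\n")
    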

    Calling functions with 'par'

    A function can be called in a parallel thread using a par statement.
    E.g.:
    par someFunction(x, y)
    
    It is important to notice that the return value from a par is a 'parallel ID' (i.e. a thread ID) and not the function's return value. This is because the parallel thread could take a long time to process and we don't want to stop execution in the current thread until the function finishes.

    So, this sample code will show the 'parallel ID' independently of the function's return value:
    pid := par someFunction(x, y)  # 'par' returns a thread ID
    print "Parallel ID: $pid\n"
    

    Important: When calling a function, arguments are evaluated before the new thread is created. The reason for this is to avoid race conditions (see the next section).

    Race conditions in 'par' and how to avoid them

    As is the case when creating threads in any programming language, using par can lead to race conditions.
    As an example, consider this code:
    #!/usr/bin/env bds
    
    for( int i : range(0, 10) ) {
        par {
            print "Number: $i\n"
        }
    }   
    
    The output is (comments added for clarification):
    $ ./z.bds
    Number: 0
    Number: 2		# We missed number 1?
    Number: 3
    Number: 4
    Number: 6		# We missed number 5?
    Number: 6		# Two '6'?
    Number: 8
    Number: 8
    Number: 10		# Three number 10?
    Number: 10
    Number: 10
    
    This is clearly not the result we wanted.
    What happened? Obviously, this code has a race condition. Between the time the thread is created (par) and the time the variable i is evaluated in the print statement (in the parallel thread), the main thread has already changed i's value.

    To avoid this type of race condition, when using par to call a function, arguments are evaluated in the current thread. Then a new thread is created and the function is invoked. See what happens when we refactor the code:
    #!/usr/bin/env bds
    
    void show(int num) {
        print "Number: $num\n"
    }
    
    for( int i : range(0, 10) ) {
        par show(i)
    }   
    

    Now the output is what we expect:
    $ ./z.bds
    Number: 0
    Number: 1
    Number: 2
    Number: 3
    Number: 4
    Number: 5
    Number: 6
    Number: 7
    Number: 8
    Number: 9
    Number: 10
    

    Checkpoints

    BigDataScript can save the full state of a running script to a file and restart execution from that point

    A checkpoint is the full serialization of the state of a program. This is a powerful tool to create robust pipelines and to recover from several failure conditions.

    A checkpoint is created either when a task fails or when an explicit checkpoint command is executed. E.g.: The following program counts from 0 to 9, creating a checkpoint when the counter gets to 5
    File test_19.bds
    for( int i=0 ; i < 10 ; i++ ) {
    	if( i == 5 ) {
    		print("Checkpoint\n")
    		checkpoint "my.chp"
    	}
    	print("Counting $i\n")
    }
    
    If we execute it, we get
    $ bds z.bds
    Counting 0
    Counting 1
    Counting 2
    Counting 3
    Counting 4
    Checkpoint
    Counting 5
    Counting 6
    Counting 7
    Counting 8
    Counting 9
    
    A checkpoint file my.chp is created. We can restart execution from this checkpoint file using the bds -r command line option:
    $ bds -r my.chp		# Restart execution from checkpoint file
    Counting 5
    Counting 6
    Counting 7
    Counting 8
    Counting 9
    
    You can also see information on what was happening when the checkpoint was created:
    $ bds -i my.chp
    Program file: './test_19.bds'
         1 |#!/usr/bin/env bds
         2 |
         3 |for( int i=0 ; i < 10 ; i++ ) {
         4 |	if( i == 5 ) {
         5 |		print("Checkpoint\n")
         6 |		checkpoint
         7 |	}
         8 |	print("Counting $i\n")
         9 |}
    
    Stack trace:
    test_19.bds, line 3 :	for( int i=0 ; i < 10 ; i++ ) {
    test_19.bds, line 4 :		if( i == 5 ) {
    test_19.bds, line 6 :			checkpoint
    
    --- Scope: ./test_19.bds:3 ---
    int i = 5
    ...
    

    You can even copy the checkpoint file(s) to another computer and restart execution there.
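
    Besides the explicit checkpoint statement shown above, a checkpoint is also written automatically when a task fails. A minimal sketch (the failing command is artificial, used only to trigger the failure):
    task {
        sys echo "About to fail"
        sys exit 1      # non-zero exit code: the task fails and bds serializes the program state to a checkpoint, which can later be resumed with 'bds -r'
    }
    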

    Test cases

    Because nobody writes perfect code.

    bds provides a simple unit testing functionality. Simply use the -t command line option and bds will run all functions test*() (that is, functions whose names start with 'test' and take no arguments).
    File test_24.bds
    #!/usr/bin/env bds
    
    int twice(int n)    return 3 * n    // Looks like I don't really know what "twice" means...
    
    void test01() {
        print("Nice test code 01\n")
    }
    
    void test02() {
        i := 1
        i++
        if( i != 2 )    error("I can't add")
    }
    
    void test03() {
        i := twice( 1 )
        if( i != 2 )    error("This is weird")
    }
    
    When we execute the tests, we get
    $ bds -t ./test_24.bds 
    
    Nice test code 01
    00:00:00.002	Test 'test01': OK
    
    00:00:00.003	Test 'test02': OK
    
    00:00:00.004	Error: This is weird
    00:00:00.004	Test 'test03': FAIL
    
    00:00:00.005	Totals
                      OK    : 2
                      ERROR : 1
    
    

    Debugger (built-in)

    bds provides a simple yet powerful built-in debugger using breakpoint and debug statements.

    bds provides a simple built-in debugger that can be activated using the breakpoint statement. When a breakpoint statement is found, bds switches to debug mode and prompts the user on the console.
    File test_25.bds
    #!/usr/bin/env bds
    
    int inc(int x) {
        return x + 1;
    }
    
    debug "This won't be printed because we are not (yet) in debug mode\n" 
    breakpoint "Activate debug mode and insert a breakpoint here!\n"
    
    for( int i=0 ; i < 3 ; i = inc(i) ) {
        print "hi $i\n"
        debug "Variable: $i\n"  # This will be printed
    }
    
    When we run this example, the program runs until the first breakpoint and then bds prompts for debug commands on the console:
    $ bds test_25.bds 
    Breakpoint test_25.bds, line 8: Activate debug mode and insert a breakpoint here!
    DEBUG [STEP]: test_25.bds, line 10: 
    	for( int i = 0 ; i < 3 ; i = inc( i ) ) {
    		print "hi $i\\n"
    		debug "Variable: $i\\n"
    	}
    >
    
    You can type 'h' for help in debug commands:
    > h
    Help:
    	[RETURN]  : step
    	f         : show current Frame (variables within current scope)
    	h         : Help
    	o         : step Over
    	p         : show Program counter
    	r         : Run program (until next breakpoint)
    	s         : Step
    	t         : show stack Trace
    	v varname : show Variable 'varname'
    
    Here is an example of a debug session (comments after '#' added for clarity):
    $ bds test_25.bds 
    Breakpoint test_25.bds, line 8: Activate debug mode and insert a breakpoint here!
    DEBUG [STEP]: test_25.bds, line 10: 
        for( int i = 0 ; i < 3 ; i = inc( i ) ) {
            print "hi $i\\n"
            debug "Variable: $i\\n"
        }
    > 
    DEBUG [STEP]: test_25.bds, line 10: int i = 0 >                      # Pressing Return runs the next step ('int i=0')
    DEBUG [STEP]: test_25.bds, line 10: i = 0 > v i                      # Show variable 'i'
    int : 0
    DEBUG [STEP]: test_25.bds, line 10: i < 3 > 
    DEBUG [STEP]: test_25.bds, line 11: print "hi $i\\n" > 
    hi 0                                                                 # Output to STDOUT from print statement
    DEBUG [STEP]: test_25.bds, line 12: debug "Variable: $i\\n" > 
    Debug test_25.bds, line 12: Variable: 0                              # Since we are in debug mode, 'debug' prints to STDERR
    DEBUG [STEP]: test_25.bds, line 10: i = inc( i ) > 
    DEBUG [STEP]: test_25.bds, line 10: inc( i ) >                       # Step into function 'inc(i)'
    DEBUG [STEP]: test_25.bds, line 4: return x + 1 > t                  # Show stack trace
    test_25.bds, line 10 :    for( int i=0 ; i < 3 ; i = inc(i) ) {
    test_25.bds, line 3 :    int inc(int x) {
    test_25.bds, line 4 :        return x + 1;
    
    DEBUG [STEP]: test_25.bds, line 4: return x + 1 > f                  # Show frames (variables)
    
    ---------- Scope Global ----------
    string _ = "/Users/pcingola/.bds/bds"
    ...                                                                  # Edited for brevity
    int walltimeout = 86400
    int week = 604800
    
    ---------- Scope test_25.bds:10:ForLoop ----------                   
    int i = 0
    
    ---------- Scope test_25.bds:3:FunctionDeclaration ----------
    int x = 0
    
    DEBUG [STEP]: test_25.bds, line 4: return x + 1 >                    # Step, execute 'return' statement
    DEBUG [STEP]: test_25.bds, line 10: i < 3 > 
    DEBUG [STEP]: test_25.bds, line 11: print "hi $i\\n" > 
    hi 1
    DEBUG [STEP]: test_25.bds, line 12: debug "Variable: $i\\n" > 
    Debug test_25.bds, line 12: Variable: 1
    DEBUG [STEP]: test_25.bds, line 10: i = inc( i ) > o
    DEBUG [STEP_OVER]: test_25.bds, line 10: inc( i ) >                  # Step Over: execute 'inc(i)' and stop after function returns
    DEBUG [STEP_OVER]: test_25.bds, line 10: i < 3 > 
    DEBUG [STEP_OVER]: test_25.bds, line 11: print "hi $i\\n" > r        # Run (until another breakpoint). Since there are no more breakpoints, runs until the end of the program
    hi 2
    Debug test_25.bds, line 12: Variable: 2
    

    Automatic command line parsing

    No need to manually parse command line options for your scripts, bds does it for you.

    Automatic command line parsing parses any command line argument that starts with "-" and assigns the value to the corresponding variable.
    File test_20.bds
    #!/usr/bin/env bds
    
    in := "in.txt"
    print("In file is '$in'\n")
    
    If we run this, we get
    $ ./test_20.bds 
    In file is 'in.txt'
    
    Now we pass the command line argument -in another_file.txt, and bds automatically parses that option, replacing the value of the variable 'in':
    $ ./test_20.bds -in another_file.txt
    In file is 'another_file.txt'
    
    This feature also works for other data types (int, real, bool). In the case of bool, if the option is present, the variable is set to 'true'.
    File test_21.bds
    #!/usr/bin/env bds
    
    bool flag
    print("Variable flag is $flag\n")
    
    $ ./test_21.bds
    Variable flag is false
    
    $ ./test_21.bds -flag
    Variable flag is true
    
    Or you can specify the value (true or false), which is useful for setting to false a bool that is true by default:
    File test_21b.bds
    #!/usr/bin/env bds
    
    flagOn  := true
    flagOff := false
    print("flagOn = $flagOn\nflagOff = $flagOff\n")
    
    So in this example we can reverse the defaults by running:
    $ ./test_21b.bds -flagOn false -flagOff true 
    flagOn = false
    flagOff = true
    
    Note that we can use -flagOff instead of -flagOff true (the outcome is the same).
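
    Int and real variables are parsed the same way. A minimal sketch (the script name used in the comments is hypothetical):
    #!/usr/bin/env bds
    
    num  := 10      # int:  override with './myScript.bds -num 42'
    rate := 0.5     # real: override with './myScript.bds -rate 0.25'
    print("num = $num, rate = $rate\n")
    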

    You can also apply this to a list of strings. In this case, all command line arguments following the -listName will be included in the list (up to the next argument starting with '-').
    E.g.: Note how the list 'in' is populated with 'in1.txt in2.txt in3.txt' and 'out' is set to 'zzz.txt'.
    File test_22.bds
    #!/usr/bin/env bds
    
    in  := ["in.txt"]
    out := "out.txt"
    ok	:= false
    
    print("In : $in\n")
    print("Out: $out\n")
    print("OK : $ok\n")
    
    $ ./test_22.bds  -ok -in in1.txt in2.txt in3.txt -out zzz.txt
    In : [in1.txt, in2.txt, in3.txt]
    Out: zzz.txt
    OK : true
    

    Automatic command line help

    A command line 'help' for your scripts can be created automatically by bds.

    When you create variables that are used in command line arguments, you can provide an optional help string that bds will show when the script is run using either: -h, -help or --help command line options.

    For example, if we have the following script:
    File test_26.bds
    #!/usr/bin/env bds
    
    int num = 3		help Number of times 'hi' should be printed
    int min			help Help for argument 'min' should be printed here
    mean := 5		help Help for argument 'mean' should be printed here
    someVeryLongCommandLineArgumentName := true    help This command line argument has a really long name
    
    for( int i=0 ; i < num ; i++ ) {
    	print "hi $i\n"
    }
    
    When the script is run with the -h command line option, a help screen is created and printed automatically (no action is programmed in the script to process the '-h' option). Note that the script's command line options are given AFTER the script name:
    $ bds test_26.bds -h
    Command line options 'test_26.bds' :
    	-num                                 : Number of times 'hi' should be printed
    	-min                                 : Help for argument 'min' should be printed here
    	-mean                                : Help for argument 'mean' should be printed here
    	-someVeryLongCommandLineArgumentName      : This command line argument has a really long name
    

    The same happens if you run the script directly:
    $ ./test_26.bds -h
    Command line options 'test_26.bds' :
    	num                                 : Number of times 'hi' should be printed
    	min                                 : Help for argument 'min' should be printed here
    	mean                                : Help for argument 'mean' should be printed here
    	someVeryLongCommandLineArgumentName : This command line argument has a really long name
    

    Help sort order

    By default, variables are sorted alphabetically when help is shown. This can be overridden by creating a global variable helpUnsorted (regardless of its type and value, since the program may not even be running when the help is shown).
    File test_26b.bds
    #!/usr/bin/env bds
    
    # This variable is used to indicate that help should be shown unsorted 
    # (i.e. in the same order that variables are declared)
    helpUnsorted := true
    
    zzz := 1        help Help for argument 'zzz' should be printed here
    aaa := 1        help Help for argument 'aaa' should be printed here
    
    print "Done\n"
    
    Now when we run the script with -h, the help lines are shown unsorted:
    $ ./test_26b.bds -h
    Command line options 'test_26b.bds' :
    	-zzz   : Help for argument 'zzz' should be printed here
    	-aaa   : Help for argument 'aaa' should be printed here
    

    Showing help on empty command line arguments

    The function printHelp() can be called to show the help message. This can be used, for instance, to show a help message when there are no command line arguments by doing something like this:
    File test_26c.bds
    #!/usr/bin/env bds
    
    zzz := 1        help Help for argument 'zzz' should be printed here
    aaa := 1        help Help for argument 'aaa' should be printed here
    bbb := 1        help Help for argument 'bbb' should be printed here
    
    if( args.isEmpty() ) {
        printHelp()
        exit(1)
    }
    
    print "Done\n"
    
    Now when we run test_26c.bds without any command line arguments, the help message is shown:
    $ ./test_26c.bds 
    Command line options 'test_26c.bds' :
    	-aaa   : Help for argument 'aaa' should be printed here
    	-bbb   : Help for argument 'bbb' should be printed here
    	-zzz   : Help for argument 'zzz' should be printed here
    

    Help sections

    Sometimes it is useful to divide the help message into sections. Sections are marked by help statements as in this example:
    File test_26d.bds
    #!/usr/bin/env bds
    
    help This program does blah
    help Actually, a lot of blah blah
    help     and even more blah
    help     or blah
    
    verbose := false    help Be verbose
    quiet   := false    help Be very quiet
    
    help Options related to database
    dbPort := 5432      help Database port
    dbName := "testDb"  help Database name
    
    print "OK\n"
    

    When run, variables are grouped into two "help sections" (note that variables are sorted within each section):
    $ ./test_26d.bds -h
    This program does blah
    Actually, a lot of blah blah
        and even more blah
        or blah
    	-quiet      : Be very quiet
    	-verbose    : Be verbose
    Options related to database
    	-dbName   : Database name
    	-dbPort      : Database port
    

    Logging

    Logging is mundane and boring, but often necessary. Not many people enjoy adding hundreds of lines of code just to perform logging. That's why bds can log everything for you.

    Both sys and task commands create a shell file, execute it and save STDOUT and STDERR to files. This gives you an automatic log of everything that was executed, as well as the details of the outputs and exit status from each execution.
    For example, let's create a simple program and run it
    string name = "Pablo"
    
    sys echo Hello $name
    
    Now let's run this script; we use the -v (verbose) and -log command line options:
    $ bds -v -log z.bds
    00:00:00.169	Process ID: z.bds.20140328_224825_685
    00:00:00.174	Queuing task 'z.bds.20140328_224825_685/sys.line_4.id_1'
    00:00:00.674	Running task 'z.bds.20140328_224825_685/sys.line_4.id_1'
    00:00:00.689	Finished task 'z.bds.20140328_224825_685/sys.line_4.id_1'
    Hello Pablo
    00:00:00.692	Finished running. Exit value : 0
    
    What happened?
    1. bds parses the sys statement and interpolates "echo Hello $name" to "echo Hello Pablo"
    2. Creates a task and assigns a task ID z.bds.20140328_224825_685/sys.line_4.id_1
    3. It creates a shell script file z.bds.20140328_224825_685/sys.line_4.id_1.sh with the code:
      $ cat z.bds.20140328_224825_685/sys.line_4.id_1.sh 
      #!/bin/sh
      
      echo Hello Pablo
      
    4. Then executes this shell script and saves stdout and stderr to z.bds.20140328_224825_685/sys.line_4.id_1.stdout and z.bds.20140328_224825_685/sys.line_4.id_1.stderr respectively
      $ cat z.bds.20140328_224825_685/sys.line_4.id_1.stdout
      Hello Pablo
      $ cat z.bds.20140328_224825_685/sys.line_4.id_1.stderr
      
      Notice that there was no output on stderr (it's empty)
    5. The command finished without problems, so bds continues with the rest of the program
    So at the end of the run, we have the file with the script, plus the stdout and stderr files. All the information is logged automatically.
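
    For reference, the files produced for this task, as named in steps 3 and 4 above, are:
    z.bds.20140328_224825_685/sys.line_4.id_1.sh        # the generated shell script
    z.bds.20140328_224825_685/sys.line_4.id_1.stdout    # captured standard output
    z.bds.20140328_224825_685/sys.line_4.id_1.stderr    # captured standard error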

    Cleanup

    If a script fails, bds automatically cleans up stale files and kills pending tasks.

    In order to make sure that data pipelines are correctly re-executed after a failure, bds automatically cleans all dependent files from failed tasks. This saves time because the user doesn't need to check for consistency on putative stale files.

    Also, bds ensures that resources are not wasted, by killing all pending tasks. In large pipelines, thousands of tasks can be scheduled for execution in a cluster and it is quite difficult for the user to keep track of them and clean them if the pipeline fails. Fortunately, bds takes care of all these details and issues appropriate commands to kill all pending tasks.
    File test_23.bds

    #!/usr/bin/env bds
    
    for(int i : range(1,10) ) {
        in  := "in_$i.txt"
        sys date > $in
    
        out := "out_$i.txt"
        task( out <- in ) {
            sys echo Task $i | tee $out; sleep $i; echo Done $i
        }
    }
    
    If I run this example and interrupt it (by pressing Ctrl-C) before it ends:
    $ ./test_23.bds
    Task 4
    Task 5
    Task 2
    Task 3
    Task 8
    Task 1
    Task 7
    Task 6
    Done 1
    Task 9
    Done 2
    Task 10
    Done 3
    ^C				# Here I pressed Ctrl-C
    2014/07/31 00:43:46 bds: Received OS signal 'interrupt'
    2014/07/31 00:43:46 bds: Killing PID '32523'
    2014/07/31 00:43:46 bds: Killing PID '32534'
    2014/07/31 00:43:46 bds: Deleting file 'out_6.txt'
    2014/07/31 00:43:46 bds: Deleting file 'out_9.txt'
    2014/07/31 00:43:46 bds: Deleting file 'out_5.txt'
    2014/07/31 00:43:46 bds: Killing PID '32546'
    2014/07/31 00:43:46 bds: Deleting file 'out_4.txt'
    2014/07/31 00:43:46 bds: Deleting file 'out_8.txt'
    2014/07/31 00:43:46 bds: Killing PID '32597'
    2014/07/31 00:43:46 bds: Killing PID '32557'
    2014/07/31 00:43:46 bds: Deleting file 'out_7.txt'
    2014/07/31 00:43:46 bds: Killing PID '32575'
    2014/07/31 00:43:46 bds: Killing PID '32610'
    2014/07/31 00:43:46 bds: Deleting file 'out_10.txt'
    
    You can see how bds cleans up all stale files and kills all processes. When this is executed in a cluster, the appropriate qdel, canceljob or similar command is issued (depending on the type of cluster used).

    Sometimes tasks just disappear from clusters.

    Sometimes clusters fail in ways that the cluster management system is unable to detect, let alone report the error. It can happen that tasks disappear without any trace from the cluster (this is not as rare as you may think, particularly when executing thousands of tasks per pipeline). For this reason, bds performs active monitoring, to ensure that tasks are still alive. If any task "mysteriously disappears", bds reports the problem and considers the task as failed.

    BDS Command line options

    BigDataScript (bds) command line arguments.

    Running the bds command without any arguments shows a help message

    $ bds
    BigDataScript 0.999i (build 2015-03-28), by Pablo Cingolani
    
    Usage: BigDataScript [options] file.bds
    
    Available options: 
      [-c | -config ] bds.config     : Config file. Default : /Users/pcingola/.bds/bds.config
      [-checkPidRegex]               : Check configuration's 'pidRegex' by matching stdin.
      [-d | -debug  ]                : Debug mode.
      -dryRun                        : Do not run any task, just show what would be run. Default: false
      [-extractSource]               : Extract source code files from checkpoint (only valid combined with '-info').
      [-i | -info   ] checkpoint.chp : Show state information in checkpoint file.
      [-l | -log    ]                : Log all tasks (do not delete tmp files). Default: false
      -noReport                      : Do not create any report.
      -noReportHtml                  : Do not create HTML report.
      -noRmOnExit                    : Do not remove files marked for deletion on exit (rmOnExit). Default: false
      [-q | -queue  ] queueName      : Set default queue name.
      -quiet                         : Do not show any messages or tasks outputs on STDOUT. Default: false
      -reportHtml                    : Create HTML report. Default: true
      -reportYaml                    : Create YAML report. Default: false
      [-r | -restore] checkpoint.chp : Restore state from checkpoint file.
      [-s | -system ] type           : Set system type.
      [-t | -test   ]                : Run user test cases (runs all test* functions).
      [-v | -verbose]                : Be verbose.
      -version                       : Show version and exit.
      [-y | -retry  ] num            : Number of times to retry a failing tasks.
      -pid                     : Write local processes PIDs to 'file'
    
    Option short  Option long     Meaning
    -c            -config         Path to bds.config file (most of the time no config file is needed)
                  -checkPidRegex  Check configuration's 'pidRegex' by matching stdin.
    -d            -debug          Debug mode: shows detailed information for debugging scripts or debugging bds itself.
                  -dryRun         Do not run any task, just show what would be run. Use this mode to test your script's logic without executing any tasks.
                  -extractSource  Extract source code files from checkpoint (only valid combined with '-info').
    -i            -info           Show state information in checkpoint file. Prints variables, scopes, etc.
    -l            -log            Log all tasks (do not delete tmp files).
                  -noReport       Do not create any report.
                  -noReportHtml   Do not create HTML report.
                  -noRmOnExit     Do not remove files marked for deletion on exit (rmOnExit).
    -q            -queue          Set default cluster queue name.
                  -quiet          Do not show any messages or task outputs on STDOUT. Only the output from print (and other print-like bds statements) is shown on the console.
                  -reportHtml     Create an HTML report (only if at least one task is executed). Activated by default.
                  -reportYaml     Create a YAML report (only if at least one task is executed).
    -r            -restore        Restore from a checkpoint and continue execution.
    -s            -system         Define the 'system' type the script is running on (e.g. cluster type).
    -t            -test           Perform all tests in a script (i.e. run all functions called "test*").
    -v            -verbose        Be verbose (show more information).
                  -version        Show version number and exit.
    -y            -retry          Number of times to retry a failing task.
                  -pid            Write local process PIDs to 'file'. Under normal circumstances, you should never use this option (unless you are debugging bds itself).
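
    A few typical invocations using the options above (the script and checkpoint file names are hypothetical):
    $ bds -dryRun pipeline.bds      # show what would be executed, without running any task
    $ bds -log -v pipeline.bds      # verbose run, keeping all task files for inspection
    $ bds -i pipeline.chp           # show the state stored in a checkpoint
    $ bds -r pipeline.chp           # resume execution from a checkpoint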

    BDS Config file

    BigDataScript's config file allows customizing bds's behavior.

    BigDataScript's config file is usually located in $HOME/.bds/bds.config. Running bds without any arguments shows the config file's default location. You can provide an alternative path using the command line option -c.

    The config file is roughly divided into sections. Parameters are not required to be in specific sections; we group them just to keep some order. We explain the parameters for each section below.

    Default parameters

    This section defines default parameters used when running tasks (such as system type, number of CPUs, memory, etc.). Most of the time you'd rather keep these options unspecified, but it can be convenient to set system = local on your laptop and system = cluster on your production cluster.
    Parameter    Comments / examples
    mem          Default memory in bytes (a negative number means unspecified)
    node         Default execution node (empty means unspecified)
    queue        Default queue name (empty means unspecified)
    retry        Default number of retries when a task fails (0 means no retry). Upon failure, a task is re-executed up to 'retry' times, i.e. a task is considered failed only after failing 'retry + 1' times.
    system       Default system type. If unspecified, the default system is 'local' (run tasks on the local computer)
    timeout      Task timeout in seconds (default is one day)
    walltimeout  Task's wall-timeout in seconds (default is one day). Wall timeout includes all the time the task is waiting to be executed, i.e. the total amount of time we are willing to wait for a task to finish. For example, if walltimeout is one day and a task is queued by the cluster system for one day (and never executed), it times out, even though it was never run.
    taskShell    Shell used to run a task (default '/bin/sh -e').
                 WARNING: Make sure you use "-e" or some command line option that stops execution when an error is found.
    sysShell     Shell used to run a sys command (default '/bin/sh -e -c').
                 WARNING: Make sure you use "-e" or some command line option that stops execution when an error is found.
                 WARNING: Make sure you use "-c" or some command line option that allows providing a script.
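
    As an illustration, a hypothetical bds.config fragment setting some of the parameters above (the values are examples only; adjust them to your environment):
    # Default parameters (illustrative values):
    # run tasks locally, retry failed tasks twice, time out after one day
    system = local
    retry = 2
    timeout = 86400
    walltimeout = 86400
    taskShell = /bin/sh -e
    sysShell = /bin/sh -e -c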

    Cluster options

    This section defines parameters to customize bds to run tasks on your cluster.
    Parameter Comments / examples
    pidRegex Regex used to extract the PID from the cluster command's output (e.g. qsub).

    When bds dispatches a task to the cluster management system (e.g. running the 'qsub' command), it expects the cluster system to report the jobID. Typically cluster systems show the jobID in the first output line; this regex is used to match that jobID.

    Default: use the whole line.
    Note: Some clusters add the domain name to the ID and then never use it again; other clusters add a message (e.g. 'Your job ...').

    Examples:
    pidRegex = "(.+).domain.com"
    pidRegex = "Your job (\\S+)"

    clusterRunAdditionalArgs These command line arguments are added to every cluster 'run' command (e.g. 'qsub').
    The string is split on spaces (regex: '\s+') and the resulting arguments are added to the cluster's run command.

    For instance, the following configuration:

    clusterRunAdditionalArgs = -A accountID -M user@gmail.com

    will cause four additional arguments { '-A', 'accountID', '-M', 'user@gmail.com' } to be added immediately after the 'qsub' (or similar) command used to run tasks on a cluster.

    clusterKillAdditionalArgs These command line arguments are added to every cluster 'kill' command (e.g. 'qdel').
    Same rules as 'clusterRunAdditionalArgs' apply.

    clusterStatAdditionalArgs These command line arguments are added to every cluster 'stat' command (e.g. 'qstat').
    Same rules as 'clusterRunAdditionalArgs' apply.

    clusterPostMortemInfoAdditionalArgs These command line arguments are added to every cluster 'post mortem info' command (e.g. 'qstat -f').
    Same rules as 'clusterRunAdditionalArgs' apply.

    SGE Cluster options

    This section defines parameters to customize bds to run tasks on a Sun Grid Engine cluster.

    IMPORTANT: On SGE clusters, make sure to add ENABLE_ADDGRP_KILL=true to the execd_params parameter (shown by qconf -sconf). If this option is not enabled, SGE might not be able to kill bds subprocesses running on slave nodes: tasks killed either by hitting Ctrl-C in bds or by a direct qdel command may keep running on the slave nodes (the cluster reports them as finished, but they might still be running).
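
    One common way to set this (assuming administrative access; exact steps may vary, so consult your SGE documentation) is to edit the global configuration and add the flag to execd_params:
    $ qconf -mconf
    # in the editor, add (or extend) the line:
    execd_params   ENABLE_ADDGRP_KILL=TRUE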

    Parameter    Comments / examples
    sge.pe       Parallel environment in SGE (e.g. 'qsub -pe mpi 4')

    Note on SGE's parallel environment ('-pe'):

    The defaults were set to be compatible with StarCluster.
    Parallel environment defines how 'slots' (number of cpus requested) are allocated. StarCluster by default sets up a parallel environment, called “orte”, that has been configured for OpenMPI integration within SGE and has a number of slots equal to the total number of processors in the cluster.
    See the details with qconf -sp orte:
    pe_name            orte
    slots              16
    user_lists         NONE
    xuser_lists        NONE
    start_proc_args    /bin/true
    stop_proc_args     /bin/true
    allocation_rule    $round_robin
    control_slaves     TRUE
    job_is_first_task  FALSE
    urgency_slots      min
    accounting_summary FALSE
    

    Notice the allocation_rule = $round_robin. This defines how to assign slots to a job. By default StarCluster configures round_robin allocation. This means that if a job requests 8 slots for example, it will go to the first machine, grab a single slot if available, move to the next machine and grab a single slot if available, and so on wrapping around the cluster again if necessary to allocate 8 slots to the job.
    You can also configure the parallel environment to localize slots as much as possible by using the "fill_up" allocation rule and setting job_is_first_task to TRUE.
    To configure: qconf -mp orte
    sge.mem      Parameter for requesting the amount of memory in qsub (e.g. 'qsub -l mem 4G')
    sge.timeout  Parameter for the timeout in qsub (e.g. 'qsub -l h_rt 24:00:00')
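
    A hypothetical bds.config fragment for an SGE cluster, using the resource names from the examples above (adjust to your site's configuration):
    # SGE settings (illustrative values)
    sge.pe = orte
    sge.mem = mem
    sge.timeout = h_rt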

    Generic Cluster options

    A 'generic' cluster invokes user-defined scripts for manipulating tasks. This allows the user to customize scripts for particular cluster environments (e.g. environments not currently supported by bds).

    Note: You should either provide each script's full path or make sure the scripts are in your PATH.

    Note: These scripts "communicate" with bds by printing information on STDOUT. The information has to be printed in a very specific format. Failing to adhere to the format will cause bds to fail in unexpected ways.

    Note: You can use a command path starting with '~' to indicate your $HOME directory, or '.' to indicate a path relative to the config file's directory.

    Parameter Comments / examples
    clusterGenericRun The specified script is executed when a task is submitted to the cluster

    Script's expected output: The script MUST print the cluster's jobID AS THE FIRST LINE. Make sure to flush STDOUT to prevent other lines from being printed out of order.

    Command line arguments:
    1. Task's timeout in seconds. Negative number means 'unlimited' (i.e. let the cluster system decide)
    2. Task's required CPUs: number of cores within the same node.
    3. Task's required memory in bytes. Negative means 'unspecified' (i.e. let the cluster system decide)
    4. Cluster's queue name. Empty means "use cluster's default"
    5. Cluster's STDOUT redirect file. This is where the cluster should redirect STDOUT.
    6. Cluster's STDERR redirect file. This is where the cluster should redirect STDERR
    7. Cluster command and arguments to be executed (typically a "bds -exec ..." command).


    Example: For examples on how to build this script, take a look at config/clusterGeneric* directory in the source code.
    clusterGenericKill The specified script is executed in order to kill a task

    Script's expected output: None

    Command line arguments: jobId: This is the jobId returned as the first line in 'clusterGenericRun' script (i.e. the jobID provided by the cluster management system)

    Example: For examples on how to build this script, take a look at config/clusterGeneric* directory in the source code.
    clusterGenericStat The specified script is executed in order to show the jobID of all jobs currently scheduled in the cluster

    Script's expected output: This script is expected to print all jobs currently scheduled or running in the cluster (e.g. qstat), one per line. The FIRST column should be the jobID (columns are space or tab separated). Other columns may exist (but are currently ignored).

    Command line arguments: None

    Example: For examples on how to build this script, take a look at config/clusterGeneric* directory in the source code.
    clusterGenericPostMortemInfo The specified script is executed in order to get information about a recently finished jobId. This information is typically used for debugging and is added to bds's output.

    Script's expected output: The output is not parsed; it is stored and later shown in bds's report. It should contain information relevant to the job's execution (e.g. qstat -f $jobId or checkjob -v $jobId).

    Command line arguments: jobId: This is the jobId returned as the first line in 'clusterGenericRun' script (i.e. the jobID provided by the cluster management system)

    Example: For examples on how to build this script, take a look at config/clusterGeneric* directory in the source code.
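
    A hypothetical bds.config fragment wiring up the four scripts (the paths are examples only; '~' refers to your $HOME directory, '.' to the config file's directory):
    # User-provided cluster scripts (illustrative paths)
    clusterGenericRun = ~/bdsCluster/run.sh
    clusterGenericKill = ~/bdsCluster/kill.sh
    clusterGenericStat = ~/bdsCluster/stat.sh
    clusterGenericPostMortemInfo = ~/bdsCluster/postMortemInfo.sh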

    SSH Cluster options

    An 'ssh' cluster creates a virtual cluster using several nodes accessed via ssh.


    Parameter    Comments / examples
    ssh.nodes    Defines the userName and nodes to be accessed via ssh.

    Examples:
    • A trivial 'ssh' cluster composed only of the localhost accessed via ssh (useful for debugging)
      ssh.nodes = user@localhost
      
    • Some company's servers used as an ssh cluster
      ssh.nodes = user@lab1-1company.com, user@lab1-2company.com, user@lab1-3company.com, user@lab1-4company.com, user@lab1-5company.com
      
    • A StarCluster run on Amazon AWS
      ssh.nodes = sgeadmin@node001, sgeadmin@node002, sgeadmin@node003, sgeadmin@node004, sgeadmin@node005, sgeadmin@node006