Tasks on AWS
You can set bds tasks to run on the Amazon cloud (i.e. on an EC2 instance). This is achieved by using system = 'aws'.
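For example, a minimal program looks like the sketch below (it assumes the AWS resources described in the rest of this section, in particular the taskResources hash shown in the walk-through, are already configured):
task(system := 'aws') {
    sys echo "Hello from an EC2 instance"
}
wait    # Wait until the AWS task (and its instance) finishes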
WARNING
WARNING: Using bds with cloud services can incur costs and unexpected expenses.
You need to thoroughly review the cloud resources used by your bds programs to make sure you are not incurring unwanted or unexpected expenses.
bds does not make any warranties; use at your own risk.
Pre-requisites
You need all the following to run tasks on AWS:
- Amazon AWS access: Obviously you need an AWS account
- Privileges: The AWS security role should have access to the following services
- EC2: Create, run and terminate EC2 instances
- S3: Read, write and delete objects from an S3 bucket
- SQS: Create, send messages and delete SQS queues
- EC2 image (AMI): An image with bds installed (i.e. in order to run bds tasks on an instance, you need an image that is capable of running bds)
Example walk-through
Example walk-through: Program and AWS parameters
The next "toy example" shows how to run a simple tasks on AWS.
Parameters for the EC2 instance are set in taskResources
hash.
In this example, the task will run on an instance in us-east-1
region.
It is also assumed that the EC2 image has been setup and bds
is properly installed in that image.
Obviously, you should replace the AMI number ami-123456abcdef
with your own image ID.
Similarly, all the other parameters in taskResources
should be set properly according to your account.
#!/usr/bin/env bds
# This hash, called 'taskResources', contains the parameters we need to run a task on AWS
# WARNING: You need to replace ALL these parameters with your account's settings
taskResources := { \
'region' => 'us-east-1' \
, 'instanceType' => 't3a.medium' \
, 'imageId' => 'ami-123456abcdef' \
, 'securityGroupIds' => 'sg-987654321abc' \
, 'subnetId' => 'subnet-192837465fed' \
, 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
}
# This task is run on an AWS instance!
task(system := 'aws') {
sys echo HI
sys for i in `seq 10`; do echo "count: \$i"; sleep 1; done
sys echo BYE
}
println "After" # This message is show after the task is scheduled (the instance being requested in the background)
wait # Wait until the task finishes (i.e. the instance finishes running)
println "Done" # This message is shown after the task finished running
Example walk-through: Running the program
OK, let's run this example program and analyze each step of the bds output.
We run using the -v command line option, so we'll see more verbose output, and -log to get the outputs logged to files.
$ bds -log -v z.bds
00:00:00.004 Bds 3.0b (build 2021-01-14 14:34), by Pablo Cingolani
Before
After
These are just the println statements from our example program.
The message "After" is shown after the task statement, so at this point the task was scheduled for execution (but it's not running yet).
00:00:00.484 INFO : Creating AWS SQS queue 'bds_20210121_090423_234750293cafe'
This indicates that an SQS queue was created.
This SQS queue is used to communicate the task's StdOut, StdErr, and exit status to the computer running the bds program (e.g. your laptop).
Since the instance can be inaccessible from our network and vice versa (e.g. we could be running the script on a laptop behind a corporate firewall), SQS is used to communicate between both ends.
00:00:04.229 INFO Cmd 'z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785': Created EC2 instance: 'i-0d5931b8e30fba349', for task 'z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785'
The instance was created and the task will be executed in that instance. Note that the instance will be automatically terminated when the task finishes executing.
00:01:01.835 INFO : Writing report file 'z.bds.20210121_090423_293.report.html'
00:01:05.810 INFO : Tasks [CloudAws[13]] Pending: 0 Running: 1 Done: 0 Failed: 0
| PID | Task state | Task name | Dependencies | Task definition |
| ------------------- | ----------------- | ------------------------------------ | ------------ | ------------------------------------------------------------------------ |
| i-0d5931b8e30fba349 | running (RUNNING) | task.z.line_16.id_1.337e55c1b5c62785 | | echo HI; for i in `seq 10`; do echo "count: $i"; sleep 1; done; echo BYE |
The instance usually takes over a minute to start up.
Since we ran using bds -v ..., in verbose mode bds will show a report of all running tasks roughly every minute.
In this case, there is only one task.
HI
count: 1
count: 2
count: 3
count: 4
count: 5
count: 6
count: 7
count: 8
count: 9
count: 10
BYE
These are the output lines from the task, which is running on the EC2 instance. Note that even though we are not connected to the instance, we can see the task's StdOut because it is sent via the SQS queue. After the task finishes, the instance is terminated. You can check on the AWS management console that the instance changes state first to "Shutting down" and later to "Terminated".
Done
This was the println "Done"
statement from our program, since the statement was after a wait
statement, this is shown after the task finished running.
00:01:36.518 INFO : Writing report file 'z.bds.20210121_090423_293.report.html'
00:01:36.524 INFO : Writing report file 'z.bds.20210121_090423_293.report.yaml'
00:01:36.529 INFO : Deleting AWS SQS queue 'az-ngs-seqauto_20210121_090423_74c7f1be2f3ccca5', url: 'https://sqs.us-east-1.amazonaws.com/671016219382/az-ngs-seqauto_20210121_090423_74c7f1be2f3ccca5'
The last messages show that a report was created and the SQS queue was deleted.
Example walk-through: Looking at the log files
When the bds script is run with the -log command line option, files will be created to log every task.
In the previous example there was only one task; here are the files and details:
# Go to the log directory (created by 'bds -log ...')
$ cd z.bds.20210121_090423_293
# List files (added comments for each file)
$ ls
task.z.line_16.id_1.337e55c1b5c62785.ec2_request_response.i-0d5931b8e30fba349.txt
task.z.line_16.id_1.337e55c1b5c62785.startup_script.sh
task.z.line_16.id_1.337e55c1b5c62785.exitCode
task.z.line_16.id_1.337e55c1b5c62785.stdout
task.z.line_16.id_1.337e55c1b5c62785.stderr
task.z.line_16.id_1.337e55c1b5c62785.sh
Let's review each file:
- Task STDOUT (task.z.line_16.id_1.337e55c1b5c62785.stdout): The standard output from the task is logged in this file. Note that these are messages sent by the EC2 instance via SQS.
$ cat task.z.line_16.id_1.337e55c1b5c62785.stdout
HI
count: 1
count: 2
count: 3
count: 4
count: 5
count: 6
count: 7
count: 8
count: 9
count: 10
BYE
- Task STDERR (task.z.line_16.id_1.337e55c1b5c62785.stderr): The standard error from the task is logged in this file. Note that these are messages sent by the EC2 instance via SQS. IMPORTANT: If there is no STDERR, this file is not created (see the sketch after this list).
# This file was not created because the task did not have any output to STDERR
$ cat task.z.line_16.id_1.337e55c1b5c62785.stderr
cat: task.z.line_16.id_1.337e55c1b5c62785.stderr: No such file or directory
- Task exit code (task.z.line_16.id_1.337e55c1b5c62785.exitCode): This is the exit code from the task.
# Exit code '0' means that the task ran successfully
$ cat task.z.line_16.id_1.337e55c1b5c62785.exitCode
0
- EC2 request file (task.z.line_16.id_1.337e55c1b5c62785.ec2_request_response.i-0d5931b8e30fba349.txt): This file logs the detailed request parameters when requesting the EC2 instance.
$ cat task.z.line_16.id_1.337e55c1b5c62785.ec2_request_response.i-0d5931b8e30fba349.txt
RunInstancesResponse(Groups=[], Instances=[Instance(AmiLaunchIndex=0, ImageId=ami-123456abcdef, InstanceId=i-0d5931b8e30fba349, InstanceType=t3a.medium, LaunchTime=...
- Task script (task.z.line_16.id_1.337e55c1b5c62785.sh): This is the "raw" script defined in the task.
$ cat task.z.line_16.id_1.337e55c1b5c62785.sh
#!/bin/bash -eu
set -o pipefail
cd '/home/myuser/bds/example'
# SYS command. line 17
echo HI
# SYS command. line 18
for i in `seq 10`; do echo "count: $i"; sleep 1; done
# SYS command. line 19
echo BYE
# Checksum: 3a69f4dc
- Startup script (task.z.line_16.id_1.337e55c1b5c62785.startup_script.sh): This is the startup script provided to the instance (the EC2 "user data" parameter). This is how the instance knows how to execute the task; we'll discuss this in the next sub-section.
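For instance (a sketch, reusing the same AWS resources as the example above), a task that writes to STDERR would also produce the corresponding '.stderr' log file:
task(system := 'aws') {
    sys echo "This line goes to standard output"
    sys echo "This line goes to standard error" >&2
}
wait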
Example walk-through: The instance startup script
Whenever bds creates an instance, it must also instruct the instance on what to do (i.e. how to execute the task).
This is achieved using a startup script (a.k.a. "user data" if you are familiar with the aws ec2 run-instances command).
In the example, we saw that there was a "startup script" file created (task.z.line_16.id_1.337e55c1b5c62785.startup_script.sh).
Here is the startup script; we'll analyze each part:
#!/bin/bash -eu
set -o pipefail
function exit_script {
shutdown -h now
}
export HOME='/root'
trap exit_script EXIT
echo "INFO: Starting script '$0'"
mkdir -p '/home/myuser/bds/example/z.bds.20210121_090423_293'
# #!/bin/bash -eu
# set -o pipefail
#
# cd '/home/myuser/bds/example'
#
# # SYS command. line 17
# echo HI
# # SYS command. line 18
# for i in `seq 10`; do echo "count: $i"; sleep 1; done
# # SYS command. line 19
# echo BYE
# # Checksum: 3a69f4dc
grep '^# ' "$0" | cut -c 3- > '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
chmod u+x '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
bds exec -stdout '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.stdout' -stderr '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.stderr' -exit '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.exit' -taskId 'z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785' -awsSqsName 'https://sqs.us-east-1.amazonaws.com/671016219382/az-ngs-seqauto_20210121_090423_74c7f1be2f3ccca5' -timeout '86400' '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
echo "INFO: Finished script '$0'"
The script has the following main sections:
- Capture any error (trap) and make sure the instance is shut down if any error occurs or when the script finishes successfully. Since the instances are run with the "terminate on shutdown" option enabled, this forces the instance to be terminated (so you don't get charged for an instance that has finished or failed to run the task)
- Copy the task lines defined in the bds program into a new file and make it executable
- Execute the file that runs the task using bds ... -awsSqsName ..., which redirects STDOUT, STDERR (and the exit code) to the SQS queue
- Script finishes
Let's see some details of each part of the startup script:
- Capture any error (trap) and make sure the instance is shut down if any error occurs or when the script finishes successfully. Since the instances are run with the "terminate on shutdown" option enabled, this forces the instance to be terminated (so you don't get charged for an instance that has finished or failed to run the task)
#!/bin/bash -eu # This makes sure that bash exits if there are any errors
set -o pipefail # Tell bash to exit even if the errors occur in a piped command
function exit_script { # This function is used to shutdown the instance
shutdown -h now
}
export HOME='/root' # The startup script is executed by the instance at startup time as the 'root' user; the HOME variable is not defined yet, so we define it
trap exit_script EXIT # Trap any EXIT signal and execute the 'exit_script' function (which shuts down the instance)
- Copy the task lines defined in the bds program into a new file and make it executable
mkdir -p '/home/myuser/bds/example/z.bds.20210121_090423_293' # Create a directory (same location as in the original bds program)
The lines below are the script lines defined by the bds program (see "task line").
Note that they are the exact same lines as defined in the task script (task.z.line_16.id_1.337e55c1b5c62785.sh), but with '# ' (hash and space) prepended to each line:
# #!/bin/bash -eu
# set -o pipefail
#
# cd '/home/myuser/bds/example'
#
# # SYS command. line 17
# echo HI
# # SYS command. line 18
# for i in `seq 10`; do echo "count: $i"; sleep 1; done
# # SYS command. line 19
# echo BYE
# # Checksum: 3a69f4dc
Next, there is a grep command to extract all the lines starting with '# ' into a new file.
So this writes all the lines defined in the original task to a new script file and makes it executable:
grep '^# ' "$0" | cut -c 3- > '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
chmod u+x '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
- Execute the file that runs the task using bds ... -awsSqsName ..., which redirects STDOUT, STDERR (and the exit code) to the SQS queue:
bds exec -stdout '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.stdout' -stderr '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.stderr' -exit '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.exit' -taskId 'z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785' -awsSqsName 'https://sqs.us-east-1.amazonaws.com/671016219382/az-ngs-seqauto_20210121_090423_74c7f1be2f3ccca5' -timeout '86400' '/home/myuser/bds/example/z.bds.20210121_090423_293/task.z.line_16.id_1.337e55c1b5c62785.startup_script_instance.sh'
echo "INFO: Finished script '$0'"
- Script finishes: Several things happen whenever the startup script finishes executing:
- After the last line is successfully executed, the script will exit, i.e. issue an EXIT signal.
- This signal is trapped (see the previous trap exit_script EXIT line) and the exit_script function will be invoked.
- The exit_script function performs a shutdown -h now, which shuts down the instance.
- Since the instance was run using the "terminate on shutdown" behaviour (i.e. instance-initiated-shutdown-behavior='terminate'), the instance will be terminated.
AWS tasks
In the next sub-sections we cover some details on AWS task resources, EC2 parameters, task dependencies, etc.
AWS resource cleanup
bds will try to clean up AWS resources when the program ends successfully or when it is interrupted (e.g. you press Ctrl-C in the terminal running the script).
This means that, under normal circumstances, EC2 instances will be terminated, SQS queues deleted, etc.
Unfortunately, there is no way to guarantee that the cleanup will be performed or that it will succeed.
You can easily imagine situations where the bds script is unable to clean up AWS resources, for instance:
- It is running on a laptop and you accidentally close the laptop
- The server where the bds script is running loses its internet connection
- The bds script is killed with kill -9 ..., thus it has no chance to send API messages to AWS to clean up
IMPORTANT: There is no warranty on resource cleanup. bds attempts to clean up resources on AWS, but this is a "best effort" approach.
WARNING: In order to avoid unnecessary AWS costs, you should always monitor the AWS resources used by bds.
StdOut / StdErr
As we've already mentioned, the instance executing a task will send both STDOUT and STDERR to an SQS queue so that the original program can show them on the main terminal.
It is important to understand that if a task generates many output lines, this could potentially generate millions of SQS messages. Not only could this generate costs due to SQS messages, but it can also create high network traffic and delay the task significantly (since the process must wait for all those messages to be sent).
A simple option is to redirect STDOUT/STDERR to a file (or to /dev/null) when the output is not needed, e.g.:
task( system := 'aws' ) {
sys command_with_lots_of_output > /dev/null 2>&1
}
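If you do need the output, another option (a sketch; the S3 path is just a placeholder, and it assumes the AWS CLI is installed in the image and the instance role can write to that bucket) is to capture the output in a file on the instance and copy it to S3 at the end of the task, instead of streaming every line through SQS:
s3log := "s3://my_bds_test_bucket/example/run.log"   # Placeholder, replace with your own bucket/path
task( system := 'aws' ) {
    sys command_with_lots_of_output > run.log 2>&1
    sys aws s3 cp run.log '$s3log'
}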
Improper tasks
AWS tasks can also be "improper", this means that they can execute arbitrary bds
code.
In this case, as always happens in improper tasks, the whole program state will be accessible to the task runningin the instance:
task( system := 'aws' ) {
println "Executing standard bds code in this task"
for(int i=0 ; i < 10 ; i++ ) {
println "Count: $i"
}
println "End of task, the EC2 instance will terminate after this"
}
Dependencies
Tasks executed on AWS can have dependencies, like any other tasks. Usually the dependencies are files on S3.
In this example you can see two dependent tasks:
1. The main bds script creates an "input" file on S3 (in := "s3://my_bds_test_bucket/example/in.txt")
1. The first task (task(out1 <- in) ...) reads the input file, adds some data and writes the result to the out1 output file (out1 := "s3://my_bds_test_bucket/example/out1.txt")
1. The second task (task(out2 <- out1) ...) reads the out1 file, adds some more data and writes the result to the out2 output file (out2 := "s3://my_bds_test_bucket/example/out2.txt")
system = 'aws'
# This hash, called 'taskResources', contains the parameters we need to run a task on AWS
# WARNING: You need to replace ALL these parameters with your account's settings
taskResources := { \
'region' => 'us-east-1' \
, 'instanceType' => 't3a.medium' \
, 'imageId' => 'ami-123456abcdef' \
, 'securityGroupIds' => 'sg-987654321abc' \
, 'subnetId' => 'subnet-192837465fed' \
, 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
}
# Input and output files on S3
in := "s3://my_bds_test_bucket/exmaple/in.txt"
out1 := "s3://my_bds_test_bucket/exmaple/out1.txt"
out2 := "s3://my_bds_test_bucket/exmaple/out2.txt"
# Create the input file on S3 (the content is just example text)
inTxt := "Hello from the main bds script"
in.write(inTxt)
# Task 1
println "Before task1"
task(out1 <- in) {
println "Start: Task1 improper"
inFileTxt := in.read().trim()
println "Input text: '$inFileTxt'"
out1.write("OUT1: '$inFileTxt'")
println "End: Task1 Improper"
}
println "After task1"
# Task 2
println "Before task2"
task(out2 <- out1) {
sys echo "Start: Task2"
sys echo 'OUT2' > '$out2'
sys cat '$out1' >> '$out2'
sys echo 'Input:'
sys cat '$out1'
sys echo
sys echo "End: Task2"
}
println "After task2"
wait
println "Done"
Each of these two tasks is executed in a different EC2 instance.
There is a dependency: the second task needs out1, which is created by the first task.
So the second task will not be executed until the first task finishes successfully, i.e. the second instance will only be created once the first instance has finished executing the first task and the output file out1 has been created.
Detached tasks
A "detached" task is a task that is run independently from bds
.
The original bds
program can finish and the detached task continue running.
In this case it will continue in an AWS EC2 instance.
Here is an example with a "detached" task (i.e. detached := true
)
system = 'aws'
# This hash, called 'taskResources', contains the parameters we need to run a task on AWS
# WARNING: You need to replace ALL these parameters with your account's settings
taskResources := { \
'region' => 'us-east-1' \
, 'instanceType' => 't3a.medium' \
, 'imageId' => 'ami-123456abcdef' \
, 'securityGroupIds' => 'sg-987654321abc' \
, 'subnetId' => 'subnet-192837465fed' \
, 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
}
# This detached AWS task will continue executing after the bds script finishes
task(system := 'aws', detached := true) {
sys echo HI
sys for i in `seq 60`; do echo "count: \$i"; sleep 1; done
sys echo BYE
}
println "After"
wait
println "Done"
The task continues running in the EC2 instance, even though the bds script finishes immediately after the instance has been requested.
dep and goal
Like any other tasks, AWS tasks can also be defined using dep and goal statements.
Here is the example we saw before (two dependent tasks), this time using dep and goal:
system = 'aws'
# This hash, called 'taskResources', contains the parameters we need to run a task on AWS
# WARNING: You need to replace ALL these parameters with your account's settings
taskResources := { \
'region' => 'us-east-1' \
, 'instanceType' => 't3a.medium' \
, 'imageId' => 'ami-123456abcdef' \
, 'securityGroupIds' => 'sg-987654321abc' \
, 'subnetId' => 'subnet-192837465fed' \
, 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
}
# Input and output files on S3
in := "s3://my_bds_test_bucket/exmaple/in.txt"
out1 := "s3://my_bds_test_bucket/exmaple/out1.txt"
out2 := "s3://my_bds_test_bucket/exmaple/out2.txt"
# Create the input file on S3 (the content is just example text)
inTxt := "Hello from the main bds script"
in.write(inTxt)
# Dep 1
println "Before task1"
dep(out1 <- in) {
println "Start: Dep1 improper"
inFileTxt := in.read().trim()
println "Input text: '$inFileTxt'"
out1.write("OUT1: '$inFileTxt'")
println "End: Dep1 Improper"
}
println "After task1"
# Dep 2
println "Before task2"
dep(out2 <- out1) {
sys echo "Start: Dep2"
sys echo 'OUT2' > '$out2'
sys cat '$out1' >> '$out2'
sys echo 'Input:'
sys cat '$out1'
sys echo
sys echo "End: Dep2"
}
println "After task2"
# Goal
println "Goal: '$out2'"
goal out2
wait
println "Done"
AWS task parameters
Any task running on AWS requires many parameters to be set.
Most parameters can be set in the taskResources hash we've seen in the examples.
Here is a list of all the parameters for AWS tasks:
taskResources entry | Meaning |
---|---|
region | AWS region where the EC2 instance should run |
bucket | Bucket name; the bucket can sometimes be used by bds to store data |
instanceType | AWS EC2 instance type (e.g. t3a.medium) |
imageId | Image used to create the instance (AMI ID). The image must have bds installed |
securityGroupIds | A comma separated list of security groups IDs |
subnetId | A comma separated list of subnet IDs |
instanceProfile | Profile (role name) to use for the instance |
s3tmp | S3 temporary path, this is used to store temporary data (e.g. checkpoints for improper tasks) |
keepInstanceAliveAfterFinish | If this is set to true, the instance will NOT be terminated after the task is finished. This is used for logging into the instance and debugging AWS tasks |
Other task parameters can also be used (timeout, taskName, allowEmpty, etc.).
Here are some parameters specific to AWS tasks:
Variable | Meaning |
---|---|
cloudQueueNamePrefix | If non-empty, this name will be used as a prefix for the SQS queue name |
system | Must be set to 'aws' for a task to be executed in an AWS EC2 instance |
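For example, the sketch below (all IDs, names and option values are illustrative placeholders) combines the AWS-specific taskResources entries with generic task options such as taskName and timeout; it also sets keepInstanceAliveAfterFinish (debugging only, remember to terminate the instance yourself) and cloudQueueNamePrefix as listed in the tables above:
taskResources := { \
    'region' => 'us-east-1' \
    , 'instanceType' => 't3a.medium' \
    , 'imageId' => 'ami-123456abcdef' \
    , 'securityGroupIds' => 'sg-987654321abc' \
    , 'subnetId' => 'subnet-192837465fed' \
    , 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
    , 'keepInstanceAliveAfterFinish' => 'true' \
}
# Prefix for the SQS queue name
cloudQueueNamePrefix = 'my_pipeline'
# Generic task options (taskName, timeout in seconds) work as in any other bds task
task(system := 'aws', taskName := 'count_to_ten', timeout := 3600) {
    sys for i in `seq 10`; do echo "count: \$i"; done
}
wait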
Instance request retry
Assuming that all parameters are set correctly, an EC2 instance request can still fail for many different reasons, e.g.:
- AWS doesn't have availability of the specific instance type in the requested region
- There are no more IPs available in the network
- You've reached some limit on an EC2-related resource (total number of instances, total disk, etc.)
In any case, when bds cannot launch an EC2 instance it will retry several times, waiting a random amount of time between attempts.
The retry algorithm is:
START_FAIL_MAX_ATTEMPTS = 50
START_FAIL_SLEEP_RAND_TIME = 60
for i in 1 .. START_FAIL_MAX_ATTEMPTS:
- parse_ec2_instance_parameters # Parse hash taskResources
- randomly_select_subnetId # If more than one is specified as a comma separated list of values
- request_ec2_instance # Request instance to AWS
- if success: return OK # Success, instance created
- wait_random_time # Request failed, wait up to START_FAIL_SLEEP_RAND_TIME seconds
Note: Randomly selecting a subnet from a comma separated list in taskResources{'subnetId'} allows you to create instances in different availability zones within a region.
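For example (a sketch; both subnet IDs below are placeholders), specifying two subnets in different availability zones lets bds pick one at random for each instance request:
taskResources := { \
    'region' => 'us-east-1' \
    , 'instanceType' => 't3a.medium' \
    , 'imageId' => 'ami-123456abcdef' \
    , 'securityGroupIds' => 'sg-987654321abc' \
    , 'subnetId' => 'subnet-192837465fed,subnet-0a1b2c3d4e5f0' \
    , 'instanceProfile' => 'AWS_BDS_INSTANCE_ROLE' \
}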