Running batch jobs
We highly recommend using Jobs for any kind of development and computation in our cluster. This ensures you never lose your work, get your results in the most convenient way, and don't waste resources, since jobs do not require you to babysit processes. Once development is done, you are immediately ready to run at large scale with no changes to the code and minimal changes to the job definition, and your changes are saved in Git.
Warning
Since jobs in Nautilus can run forever, you can only run jobs with a meaningful command field. Running in manual mode (a sleep infinity command followed by a manual start of the computation) is prohibited, and users who do so can be banned.
Use our tutorial for a simple job example.
Pulling code from Git
You can push your code to our GitLab Git repository and pull it from there using the following example. To modify your code between iterations you can use the Web IDE: simply click the Web IDE button on your project's repository page once you are logged in to GitLab. Make sure your repo is not private, or use the private repo example.
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: demo
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp
        command:
        - "python"
        args:
        - "/opt/repo/REPONAME/my_script.py"
        - "arg_job_to_run"
        volumeMounts:
        - name: git-repo
          mountPath: /opt/repo
        resources:
          limits:
            memory: 6Gi
            cpu: "6"
            nvidia.com/gpu: "1"
          requests:
            memory: 4Gi
            cpu: "1"
            nvidia.com/gpu: "1"
      initContainers:
      - name: init-clone-repo
        image: alpine/git
        args:
          - clone
          - --single-branch
          - https://gitlab.nrp-nautilus.io/USERNAME/REPONAME
          - /opt/repo/REPONAME
        volumeMounts:
          - name: git-repo
            mountPath: /opt/repo
      volumes:
      - name: git-repo
        emptyDir: {}
      restartPolicy: Never
  backoffLimit: 5
Two containers, init-clone-repo and demo, share the initially empty storage volume.
This pod will:
- start the init container, which pulls your code from the Git repository and puts it in /opt/repo/REPONAME
- then start your main container and execute the script that was downloaded from the Git repo
- terminate the whole pod when the script finishes.
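If the job should run a specific branch or tag rather than the default branch, the clone arguments can say so. The following is a minimal sketch of the init container alone, assuming a hypothetical branch name BRANCHNAME. The --branch and --depth flags are standard git clone options, and everything else matches the example above.

```yaml
initContainers:
- name: init-clone-repo
  image: alpine/git
  args:
    - clone
    - --single-branch
    - --branch
    - BRANCHNAME        # hypothetical placeholder: a branch or tag name
    - --depth
    - "1"               # shallow clone: fetch only the most recent commit
    - https://gitlab.nrp-nautilus.io/USERNAME/REPONAME
    - /opt/repo/REPONAME
  volumeMounts:
    - name: git-repo
      mountPath: /opt/repo
```

A shallow clone keeps the init step short for repositories with a long history, at the cost of not having older commits available inside the pod.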
Running several bash commands
You can group several commands, and use pipes, like this:
command:
- sh
- -c
- "cd /home/user/my_folder && apt-get install -y wget && wget pull some_file && do something else"
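For longer command sequences, a YAML block scalar can be easier to read than one quoted string. The sketch below assumes the repository layout from the Git example, a hypothetical requirements.txt in the repo, and an image that provides pip. The set -e line makes the shell stop at the first failing command, so the pod exits with a non-zero status instead of continuing.

```yaml
command:
- sh
- -c
- |
  # stop at the first failing command
  set -e
  cd /opt/repo/REPONAME
  # requirements.txt is a hypothetical dependency file
  pip install --user -r requirements.txt
  python my_script.py arg_job_to_run
```

Because the whole script is passed to sh -c as a single string, the args field is not needed in this form.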
Logs
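All stdout and stderr output from the script is preserved and can be read by running the following, where PODNAME is the name of your job's pod:
kubectl logs PODNAME
Output from the init container can be seen with:
kubectl logs PODNAME -c init-clone-repo
To see logs in real time, run:
kubectl logs -f PODNAME
The pod will remain in the Completed state until you delete it or the timeout expires.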
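Kubernetes also lets a Job declare its own cleanup delay through the ttlSecondsAfterFinished field of the Job spec. The sketch below assumes the cluster honors this field. The one-day value is an arbitrary illustration, and this page does not state which timeout Nautilus itself applies.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp
spec:
  # assumed value: remove the Job and its pods 24 hours after it finishes
  ttlSecondsAfterFinished: 86400
  template:
    spec:
      containers:
      - name: demo
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp
        # placeholder command for illustration only
        command: ["python", "-c", "print('done')"]
      restartPolicy: Never
```

Logs are removed together with the pod, so copy anything you need before the delay runs out.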
Retries
The backoffLimit field specifies how many times the Job will retry your pod if the script exits with a non-zero status or the pod is terminated for another reason (for example, a node reboot). It's a good idea to set it higher than 0.
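The two retry-related fields live at different levels of the manifest, which is easy to get wrong when editing by hand. This is a fragment only, showing where each one goes, with the containers and volumes from the full example omitted:

```yaml
spec:
  backoffLimit: 5            # Job spec: retries allowed before the Job is marked failed
  template:
    spec:
      restartPolicy: Never   # pod spec: a failed container is not restarted in place
```

With restartPolicy: Never, each retry runs in a new pod, so the logs of earlier failed attempts stay available in their own pods.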
Fair queueing
There is no fair queue implemented on Nautilus. If you submit 1000 jobs, you block all other users from submitting in the cluster.
To limit your submission to a fair portion of the cluster, refer to this guide. Make sure to use a deployment and persistent storage for the Redis pod. Here's our example.
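One way to cap how much of the cluster a single submission occupies is the parallelism field of the Job spec, combined with workers that pull tasks from a queue. The sketch below is an assumption about how such a setup might look, not a copy of the guide. The worker.py script, the REDIS_HOST variable, and the redis service name are all hypothetical, and the init container and volume from the Git example are omitted for brevity.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp-workers
spec:
  parallelism: 4            # assumed cap: at most 4 worker pods run at once
  backoffLimit: 5
  template:
    spec:
      containers:
      - name: worker
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp
        # hypothetical worker script that takes tasks from Redis until the queue is empty
        command: ["python", "/opt/repo/REPONAME/worker.py"]
        env:
        - name: REDIS_HOST
          value: redis      # hypothetical Service name of the Redis deployment
      restartPolicy: Never
```

With this shape, 1000 tasks become 1000 queue entries processed by a fixed number of pods rather than 1000 separate jobs.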
Private repo
If your repository is private, you should create a GitLab Personal Access Token (see Access Tokens and Access Tokens for command line) with the read_repository scope, and store it in a secret in your namespace to be consumed by the pod:
kubectl create secret generic gitlab-secret --from-literal=user=USERNAME --from-literal=password=TOKEN
The pod YAML file references the secret via the user and password keys (these must match the secret literal definitions in the above command):
apiVersion: batch/v1
kind: Job
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
      - name: demo
        image: gitlab-registry.nrp-nautilus.io/prp/jupyter-stack/prp
        command:
        - "python"
        args:
        - "/opt/repo/REPONAME/my_script.py"
        - "arg_job_to_run"
        volumeMounts:
        - name: git-repo
          mountPath: /opt/repo
        resources:
          limits:
            memory: 6Gi
            cpu: "6"
            nvidia.com/gpu: "1"
          requests:
            memory: 4Gi
            cpu: "1"
            nvidia.com/gpu: "1"
      initContainers:
      - name: init-clone-repo
        image: alpine/git
        env:
        - name: GIT_USERNAME
          valueFrom:
            secretKeyRef:
              name: gitlab-secret
              key: user
        - name: GIT_PASSWORD
          valueFrom:
            secretKeyRef:
              name: gitlab-secret
              key: password
        args:
          - clone
          - --single-branch
          - https://$(GIT_USERNAME):$(GIT_PASSWORD)@gitlab.nrp-nautilus.io/USERNAME/REPONAME
          - /opt/repo/REPONAME
        volumeMounts:
          - name: git-repo
            mountPath: /opt/repo
      volumes:
      - name: git-repo
        emptyDir: {}
      restartPolicy: Never
  backoffLimit: 5
You can use one of several work queue brokers, such as Redis or RabbitMQ, to distribute tasks once you're ready to scale out your computation.