GPU jobs

Running GPU pods

Use this definition to create your own pod and deploy it to Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: gitlab-registry.nautilus.optiputer.net/prp/jupyterlab:latest
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1

This example requests 1 GPU device. You can request up to 8 per node. If your pod requests GPU devices, Kubernetes will automatically schedule it to a node that has them available; there is no need to choose a node manually.
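As a quick check, you can create the pod and verify that the GPU is visible inside the container. This is a minimal sketch, assuming the definition above is saved as gpu-pod-example.yaml:

kubectl create -f gpu-pod-example.yaml
kubectl exec -it gpu-pod-example -- nvidia-smi
# Delete the pod as soon as you are done to release the GPU
kubectl delete pod gpu-pod-example

To request NVIDIA Tesla K40c GPUs, use this example: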

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-example
spec:
  containers:
  - name: gpu-container
    image: tensorflow/tensorflow:latest-gpu
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In # Use NotIn for other types
            values:
            - K40

Always delete your pod when your computation is done so that other users can use the GPUs. Consider using Jobs whenever possible to ensure your pod does not keep holding GPU time after it finishes. If you have never used Kubernetes before, see the tutorial.
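If you have not written a Job before, here is a minimal sketch of a GPU Job; the name, image, and command are placeholders for your own workload. Because the pod terminates when the command exits, the GPU is released automatically:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job-example
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: gpu-container
        image: tensorflow/tensorflow:latest-gpu
        # Replace with your actual workload; when it exits, the pod
        # completes and stops occupying the GPU
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1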

Requesting high-demand GPUs

Certain kinds of GPUs have much higher specifications than the others. To avoid wasting them on regular jobs, your pods will only be scheduled on those nodes if you request the GPU type explicitly (see the affinity snippet after the list).

Currently those include:

  • V100
  • RTX6000
  • RTX8000
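For example, to have a pod scheduled on a V100 node, include an affinity section following the same pattern as the K40 example above, with only the value changed:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - V100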

Choosing GPU type

We have a variety of GPU flavors attached to Nautilus. The table below describes the types of GPUs available for use, but may not be up to date; it is better to query the actual cluster information (e.g. kubectl get nodes -L gpu-type).
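For example, to list all nodes with their gpu-type labels, and to inspect the GPU capacity of a specific node (the node name here is just one entry from the table below):

kubectl get nodes -L gpu-type
kubectl describe node k8s-gpu-1.ucsc.edu | grep nvidia.com/gpu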

NOTE: Not all nodes are available to all users. You can ask about the resources available to you in Rocket.Chat and consult the resources page. Labs connecting their hardware to our cluster have preferential access to all our resources.

Node                                   GPU Type                 Count
capcom.calit2.optiputer.net            RTX8000                  1
clu-fiona2.ucmerced.edu                1080Ti                   8
dtn-gpu2.kreonet.net                   titan-xp                 6
epic001.clemson.edu                    2080Ti                   7
evldtn.evl.uic.edu                     M4000                    1
fiona8-0.calit2.uci.edu                1080Ti                   8
fiona8-1.calit2.uci.edu                1080Ti                   8
fiona8-2.calit2.uci.edu                1080Ti                   8
fiona8-3.calit2.uci.edu                2080Ti                   7
fiona8.ucsc.edu                        1080Ti                   8
hydra.gi.ucsc.edu                      1080Ti                   2
k8s-bafna-01.calit2.optiputer.net      1080Ti                   8
k8s-bharadia-01.sdsc.optiputer.net     1080Ti                   8
k8s-bharadia-02.sdsc.optiputer.net     1080Ti                   8
k8s-bharadia-03.sdsc.optiputer.net     1080Ti                   8
k8s-bharadia-04.sdsc.optiputer.net     1080Ti                   8
k8s-chase-ci-01.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-01.noc.ucsb.edu           1080Ti                   8
k8s-chase-ci-02.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-03.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-04.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-05.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-06.calit2.optiputer.net   1080Ti                   8
k8s-chase-ci-07.calit2.optiputer.net   2080Ti                   8
k8s-chase-ci-08.calit2.optiputer.net   2080Ti                   8
k8s-chase-ci-09.calit2.optiputer.net   2080Ti                   8
k8s-chase-ci-10.calit2.optiputer.net   2080Ti                   8
k8s-gen4-08.calit2.optiputer.net       RTX6000                  1
k8s-gen4-09.calit2.optiputer.net       2080Ti                   1
k8s-gpu-01.calit2.optiputer.net        1080                     8
k8s-gpu-02.calit2.optiputer.net        titan-x                  8
k8s-gpu-03.sdsc.optiputer.net          1080Ti                   8
k8s-gpu-1.ucr.edu                      1080Ti                   8
k8s-gpu-1.ucsc.edu                     1080Ti                   8
k8s-gpu-2.ucsc.edu                     1080Ti                   8
k8s-ravi-01.calit2.optiputer.net       titan-x                  8
k8s-tyan-gpu-01.sdsu.edu               K40                      4
knuron.calit2.optiputer.net            K40                      2
nrp-g1.nysernet.org                    2080Ti                   8
patternlab.calit2.optiputer.net        M40                      2
prp-gpu-1.t2.ucsd.edu                  1080Ti                   8
prp-gpu-2.t2.ucsd.edu                  1080Ti                   8
prp-gpu-3.t2.ucsd.edu                  1080Ti                   8
suncave-*                              mix of 1080 and 1080Ti   2
uicnrp01.evl.uic.edu                   M4000                    1
uicnrp02.evl.uic.edu                   M4000                    1
wave-head.ucmerced.edu                 1080                     2
wave[00-09].ucmerced.edu               1080                     2

To use a specific type of GPU, add the affinity definition to your pod's YAML file. The example below requests a 1080Ti GPU:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values:
            - 1080Ti
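
After the pod starts, you can verify that it was scheduled onto a node with the requested GPU type (gpu-pod-example is the pod name used in the examples above; the NODE column in the output shows the placement, which you can cross-check against kubectl get nodes -L gpu-type):

kubectl get pod gpu-pod-example -o wide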