
Main storage

❗We now have storage located in several geographic regions. Make sure you run on compute nodes in the same region as your storage to get optimal access speed!

❗Installing conda and pip packages on any CephFS (shared) filesystem is strictly prohibited!

Cleaning up

Please purge any data you don't need. This is not archival storage, and we can only store data that is actively used for computations.

POSIX volumes

Persistent data in Kubernetes comes in the form of Persistent Volumes (PV), which can only be seen by cluster admins. To request a PV, you have to create a PersistentVolumeClaim (PVC) of a supported StorageClass in your namespace, which will allocate storage for you.

Currently available storageClasses:

| StorageClass | Filesystem Type | Region | AccessModes | Restrictions | Storage Type | Size |
|---|---|---|---|---|---|---|
| rook-cephfs | CephFS | US West | ReadWriteMany | | Spinning drives with NVMe meta | 2.5 PB |
| rook-cephfs-central | CephFS | US Central | ReadWriteMany | | Spinning drives with NVMe meta | 1 PB |
| rook-cephfs-east | CephFS | US East | ReadWriteMany | | Mixed | 1 PB |
| rook-cephfs-pacific | CephFS | Hawaii+Guam | ReadWriteMany | | Spinning drives with NVMe meta | 384 TB |
| rook-cephfs-haosu | CephFS | US West (local) | ReadWriteMany | Hao Su and Ravi cluster | NVMe | 131 TB |
| beegfs | BeeGFS | US West | ReadWriteMany | | | 2 PB |
| rook-ceph-block (*default*) | RBD | US West | ReadWriteOnce | | Spinning drives with NVMe meta | 2.5 PB |
| rook-ceph-block-east | RBD | US East | ReadWriteOnce | | Mixed | 1 PB |
| rook-ceph-block-pacific | RBD | Hawaii+Guam | ReadWriteOnce | | Spinning drives with NVMe meta | 384 TB |
| rook-ceph-block-central | RBD | US Central | ReadWriteOnce | | Spinning drives with NVMe meta | 1 PB |
| seaweedfs-storage-hdd | SeaweedFS | US West | ReadWriteMany | | Spinning drives | 24 TB |
| seaweedfs-storage-nvme | SeaweedFS | US West | ReadWriteMany | | NVMe | 22 TB |

Ceph shared filesystem (CephFS) is the primary way of storing data in Nautilus and allows the same volume to be mounted from multiple pods in parallel (ReadWriteMany). The same applies to the BeeGFS mounts accessed using NFS.

Ceph block storage allows an RBD (RADOS Block Device) to be attached to a single pod at a time (ReadWriteOnce). It provides the fastest access to the data and is preferred for smaller datasets (below 500 GB) and for any dataset that does not need shared access from multiple pods.
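
As a minimal sketch of the two access patterns described above, the manifests below request one shared CephFS volume and one single-pod RBD volume. The storage class names are taken from the list of storage classes above; the claim names (`shared-data`, `scratch-disk`) and the sizes are hypothetical placeholders, not recommended values.

```yaml
# Shared volume: several pods may mount it in parallel (ReadWriteMany).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data          # hypothetical name
spec:
  storageClassName: rook-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi         # illustrative size
---
# Block volume: attached to a single pod at a time (ReadWriteOnce).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scratch-disk         # hypothetical name
spec:
  storageClassName: rook-ceph-block
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi          # illustrative size
```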

When to choose each type of filesystem

RBD (RADOS Block Device) behaves much like a normal hard drive, since it implements block storage on top of Ceph. It handles many kinds of file I/O, including large numbers of small files, so it can accommodate conda/pip installations and code compilation. It can provide higher IOPS than CephFS, but overall read/write performance tends to be slower than CephFS because access is less parallelized. The best read/write performance is achieved with a program or library that supports librados, or with your own program written against that library. Use RBD for databases and other workloads that need quick response times but not necessarily high read/write rates. It shares its storage pool with CephFS, which is the largest storage pool in Nautilus.

CephFS is a distributed parallel filesystem that stores files as objects. It may not handle large numbers of small files quickly, because it relies on metadata servers to keep track of them. For that reason, conda/pip installations and code compilation should not be performed on CephFS. However, it has much higher read/write performance than RBD. CephFS has the largest storage pool in Nautilus, so it suits workloads that handle larger files than those typically kept on RBD and that need high I/O performance, for example checkpoint files of various tasks. There is a per-file size limit of 16 TB in CephFS.

BeeGFS is also a distributed parallel filesystem, but it stores files somewhat differently from CephFS. It also may not handle large numbers of small files quickly, so conda/pip installations and code compilation should not be performed here either. Because of networking limitations to the storage cluster, read/write performance is very low, so it is mainly suitable for archiving files that are used less frequently than those housed in CephFS and RBD.

SeaweedFS is a new experimental filesystem that addresses many issues found in other filesystems. It can handle both many small files and large files efficiently while offering high read/write performance. IOPS can also be quite high. However, the storage pool dedicated to SeaweedFS is comparatively small next to the CephFS and RBD cluster, so there is a practical limit on the storage space that one user may consume.

Both SeaweedFS and Ceph provide an S3-compatible protocol interface (with a per-file size limit of 5 TiB). This is a native object storage protocol that runs over HTTP rather than POSIX and can deliver the maximum read/write performance, provided your tool supports it rather than only POSIX file I/O. Many data science tools and libraries support the S3-compatible protocol as an alternative file I/O interface, and the protocol is well optimized for such purposes.

Creating and mounting the PVC

Use kubectl to create the PVC:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  storageClassName: <required storage class>
  accessModes:
  - <access mode, e.g. ReadWriteOnce>
  resources:
    requests:
      storage: <volume size, e.g. 20Gi>

After you've created a PVC, you can check its status (kubectl get pvc pvc_name). Once its status is Bound, you can attach it to your pod (claimName should match the name you gave your PVC):

apiVersion: v1
kind: Pod
metadata:
  name: vol-pod
spec:
  containers:
  - name: vol-container
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
  restartPolicy: Never
  volumes:
    - name: examplevol
      persistentVolumeClaim:
        claimName: examplevol
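
If the claim was created with the ReadWriteMany access mode (for example on one of the CephFS classes), further pods can mount it at the same time. The sketch below is a hypothetical second pod, `vol-reader`, that mounts the same `examplevol` claim read-only. The pod and container names are illustrative only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vol-reader              # hypothetical second pod
spec:
  containers:
  - name: reader-container      # hypothetical container name
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
      readOnly: true            # this pod only reads the shared data
  restartPolicy: Never
  volumes:
  - name: examplevol
    persistentVolumeClaim:
      claimName: examplevol     # same claim as vol-pod
```

With a ReadWriteOnce claim this second pod would not be able to attach the volume while the first pod holds it.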

Using the right region for your pod

Latency significantly affects I/O performance. If you want optimal access speed to Ceph, add a region affinity to your pod that matches the region of your storage (us-east or us-west):

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-west

You can list each node's region label using: kubectl get nodes -L topology.kubernetes.io/region
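
Putting the pieces together, a complete pod definition that pins the pod to the region of its storage might look like the sketch below. It assumes a claim named `examplevol-east` (a hypothetical name) that was created beforehand with the `rook-cephfs-east` storage class; the pod name is likewise illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vol-pod-east               # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-east              # same region as the storage class
  containers:
  - name: vol-container
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
  restartPolicy: Never
  volumes:
  - name: examplevol
    persistentVolumeClaim:
      claimName: examplevol-east   # hypothetical claim on rook-cephfs-east
```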

Expanding volumes

All Ceph volumes created since December 2020 can be expanded by simply modifying the storage field of the PVC (either by using kubectl edit pvc ..., or kubectl apply -f updated_pvc_definition.yaml).
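
For instance, assuming the `examplevol` claim was originally created with 20Gi on `rook-ceph-block` (both values are illustrative), an updated definition that grows it to 40Gi might look like the sketch below. Only the storage request changes; the remaining fields should match the existing claim.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  storageClassName: rook-ceph-block   # assumed; must match the existing claim
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi                   # was 20Gi; the size can only be increased
```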

For older volumes, all rook-ceph-block-* and most rook-cephfs-* volumes can be expanded. If yours does not expand, you can ask the cluster admins to expand it manually.