Main storage
We now have storage located in several geographic regions. Make sure you use compute nodes in the same region as your storage to get optimal access speed!
Installing `conda` and `pip` packages on all CephFS (shared) filesystems is strictly prohibited!
Cleaning up
Please purge any data you don't need. We are not archival storage, and can only store data that is actively used for computations.
Posix volumes
Persistent data in Kubernetes comes in the form of Persistent Volumes (PVs), which can only be seen by cluster admins. To request a PV, create a PersistentVolumeClaim (PVC) of a supported StorageClass in your namespace, which will allocate storage for you.
Currently available storageClasses:
StorageClass | Filesystem Type | Region | AccessModes | Restrictions | Storage Type | Size
---|---|---|---|---|---|---
rook-cephfs | CephFS | US West | ReadWriteMany | | Spinning drives with NVME meta | 2.5 PB
rook-cephfs-central | CephFS | US Central | ReadWriteMany | | Spinning drives with NVME meta | 1 PB
rook-cephfs-east | CephFS | US East | ReadWriteMany | | Mixed | 1 PB
rook-cephfs-pacific | CephFS | Hawaii+Guam | ReadWriteMany | | Spinning drives with NVME meta | 384 TB
rook-cephfs-haosu | CephFS | US West (local) | ReadWriteMany | Hao Su and Ravi cluster | NVME | 131 TB
beegfs | BeeGFS | US West | ReadWriteMany | | | 2 PB
rook-ceph-block (*default*) | RBD | US West | ReadWriteOnce | | Spinning drives with NVME meta | 2.5 PB
rook-ceph-block-east | RBD | US East | ReadWriteOnce | | Mixed | 1 PB
rook-ceph-block-pacific | RBD | Hawaii+Guam | ReadWriteOnce | | Spinning drives with NVME meta | 384 TB
rook-ceph-block-central | RBD | US Central | ReadWriteOnce | | Spinning drives with NVME meta | 1 PB
seaweedfs-storage-hdd | SeaweedFS | US West | ReadWriteMany | | Spinning drives | 24 TB
seaweedfs-storage-nvme | SeaweedFS | US West | ReadWriteMany | | NVME | 22 TB
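In case the table above gets out of date, you can always check which storage classes the cluster currently offers with a standard kubectl query (no special permissions needed):

```
# List the StorageClasses available in the cluster; the default one is marked "(default)"
kubectl get storageclass
```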
Ceph shared filesystem (CephFS) is the primary way of storing data in Nautilus and allows mounting the same volume from multiple pods in parallel (ReadWriteMany). The same applies to the BeeGFS mounts, which are accessed over NFS.
Ceph block storage allows RBD (Rados Block Device) volumes to be attached to a single pod at a time (ReadWriteOnce). It provides the fastest access to the data, and is preferred for smaller datasets (below 500 GB) and for all datasets that don't need shared access from multiple pods.
When to choose each type of filesystem
RBD (Rados Block Device) is similar to a normal hard drive: it implements block storage on top of Ceph and can run many kinds of file I/O operations, including small files, so it can accommodate conda/pip installations and code compilation. It can provide higher IOPS than CephFS, but overall read/write throughput tends to be lower than CephFS because it is less parallelized. Optimal read/write performance can be achieved with a program or library that supports librados, or by writing your own program against that library. Use RBD for housing databases or workloads that require quick response but not necessarily high read/write rates. It shares its storage pool with CephFS, and is therefore part of the largest storage pool in Nautilus.
CephFS is a distributed parallel filesystem which stores files as objects. It may not handle lots of small files rapidly, because it has to consult metadata servers for every file; thus, conda/pip installs and code compilation should not be performed over CephFS. However, it has much higher read/write throughput than RBD. CephFS has the largest storage pool in Nautilus, so it is suitable for workloads that deal with larger files than RBD and require high I/O performance, for example checkpoint files of various tasks. There is a per-file size limit of 16 TB in CephFS.
BeeGFS is also a distributed parallel filesystem, but it stores files in a slightly different way than CephFS. It likewise may not handle lots of small files rapidly, so conda/pip installs and code compilation should not be performed here either. Because of networking limitations to the storage cluster, its read/write performance is very low, so it is mainly suitable for archival purposes: files that are used less frequently than those housed in CephFS and RBD.
SeaweedFS is a new experimental filesystem that addresses many issues present in the other filesystems. It can handle both many small files and large files efficiently while providing high read/write performance, and IOPS can also be quite high. However, the storage pool dedicated to SeaweedFS is considerably smaller than the CephFS/RBD cluster, so there is a practical limit on the storage space one user may consume.
Both SeaweedFS and Ceph provide an S3-compatible object storage interface (with a per-file size limit of 5 TiB). This native object storage protocol can supply the maximum read/write performance. It uses HTTP rather than POSIX file I/O, so consider it if your tool supports the S3 protocol instead of requiring POSIX access. Many data science tools and libraries support the S3-compatible protocol as an alternative file I/O interface, and the protocol is well optimized for such purposes.
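As a rough illustration, once you have S3 credentials and an endpoint URL from the cluster admins, any S3-compatible client can talk to the object storage. The endpoint URL and bucket name below are placeholders, not real values:

```
# Store the access key and secret provided by the admins
aws configure

# Copy a local file into a bucket on the S3-compatible endpoint
# (replace the endpoint URL and bucket name with the ones you were given)
aws s3 cp ./dataset.tar.gz s3://my-bucket/dataset.tar.gz --endpoint-url https://s3.example.org
```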
Creating and mounting the PVC
Use kubectl to create the PVC:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  storageClassName: <required storage class>
  accessModes:
  - <access mode, f.e. ReadWriteOnce>
  resources:
    requests:
      storage: <volume size, f.e. 20Gi>
```
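For instance, a filled-in sketch of the template above, requesting a 50Gi ReadWriteMany volume on the rook-cephfs class (the volume name and size here are arbitrary examples):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  # One of the CephFS classes from the table above, which supports ReadWriteMany
  storageClassName: rook-cephfs
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
```

Save it to a file and create it with `kubectl create -f pvc.yaml` (or `kubectl apply -f pvc.yaml`).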
After you've created a PVC, you can see its status (`kubectl get pvc pvc_name`). Once it has the status `Bound`, you can attach it to your pod (claimName should match the name you gave your PVC):
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vol-pod
spec:
  containers:
  - name: vol-container
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
  restartPolicy: Never
  volumes:
  - name: examplevol
    persistentVolumeClaim:
      claimName: examplevol
```
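Once the pod is running, a quick way to confirm the volume is mounted (assuming the pod name and mount path from the example above):

```
# Show the filesystem mounted at /examplevol inside the pod
kubectl exec -it vol-pod -- df -h /examplevol
```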
Using the right region for your pod
Latency significantly affects I/O performance. If you want optimal access speed to Ceph, add a region affinity to your pod for the correct region (`us-east` or `us-west`):
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-west
```
You can list the nodes' region labels using: `kubectl get nodes -L topology.kubernetes.io/region`
Volumes expanding
All Ceph volumes created starting from December 2020 can be expanded by simply modifying the `storage` field of the PVC (either with `kubectl edit pvc ...`, or with `kubectl apply -f updated_pvc_definition.yaml`).
For older ones, all `rook-ceph-block-*` and most `rook-cephfs-*` volumes can be expanded. If yours is not expanding, you can ask the cluster admins to do it manually.
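For example, one way to grow the example PVC from above to 100Gi without opening an editor (a sketch using a standard kubectl patch; the PVC name and new size are illustrative):

```
# Request a larger size; the storage class must support volume expansion
kubectl patch pvc examplevol -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```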