Intro
We now have storage located in several geographic regions. Make sure you use compute nodes in the same region as your storage to get the optimal access speed!
Cleaning up
Please purge any data you don't need. This is not archival storage; we can only store data that is actively used for computations.
POSIX volumes
Most persistent data in Kubernetes comes in the form of Persistent Volumes (PV), which can only be seen by cluster admins. To request a PV, you have to create a PersistentVolumeClaim (PVC) of a supported StorageClass in your namespace, which will allocate storage for you.
Provided filesystems
Credit: Combined filesystems
How to choose the filesystem to use
RBD (RADOS Block Device) is similar to a normal hard drive, as it implements block storage on top of Ceph. It handles many kinds of file I/O operations, including those on small files, so it can accommodate conda/pip installations and code compilation. It can provide higher IOPS than CephFS, but overall read/write performance tends to be slower than CephFS because it is less parallelized. Optimal read/write performance can be achieved by using a program or library that supports librados, or by writing your own program against that library. Use RBD for housing databases or workloads that require quick response but not necessarily high read/write rates. It shares its storage pool with CephFS, which is the largest storage pool in Nautilus.
CephFS is a distributed parallel filesystem which stores files as objects. It may not handle lots of small files rapidly, because it relies on metadata servers to keep track of the files. Thus, conda/pip and code compilation should not be performed over CephFS. However, it has much higher read/write performance than RBD. CephFS has the largest storage pool in Nautilus, so it is suitable for workloads that need high I/O performance on files larger than those typically kept on RBD, for example checkpoint files of various tasks. There is a per-file size limit of 16 TB in CephFS.
BeeGFS is also a distributed parallel filesystem, but it stores files in a slightly different way from CephFS. It also may not handle lots of small files rapidly, so conda/pip and code compilation should not be performed here either. Because of networking limitations to the storage cluster, the read/write performance is very low, so it is mainly suitable for archiving files that are used less frequently than those housed in CephFS and RBD.
CVMFS provides read-only access to data on XROOTD OSG origins via a set of Stashcaches, and can be mapped as a PVC to the pods. It is mostly used for rarely changing collections of large files, such as software packages and large training datasets.
Linstor provides the fastest, lowest-latency block storage, but can't handle large (>10TB) volumes. It can be used for VM images, heavily loaded databases, etc.
SeaweedFS is a new experimental filesystem that addresses many issues that exist in other filesystems. It can handle both many small files and large files efficiently, with high read/write performance and quite high IOPS. However, the storage pool dedicated to SeaweedFS is smaller than the one shared by CephFS and RBD, so there is a practical limit on the storage space that one user may use.
Both SeaweedFS and Ceph provide an S3-compatible protocol interface (with a per-file size limit of 5 TiB). This is a native object storage protocol that can supply the maximum read/write performance. It works over HTTP instead of POSIX, so your tool has to support the S3 protocol rather than only POSIX file I/O. Many data science tools and libraries support the S3-compatible protocol as an alternative file I/O interface, and the protocol is well optimized for such purposes.
S3 storage doesn't use PVC, and can be accessed directly from applications.
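To illustrate access without a PVC, here is a minimal sketch of a pod that hands S3 credentials to an application through environment variables. It assumes you have already stored your keys in a Kubernetes Secret. The pod name, the secret name s3-credentials, its keys, and the S3_ENDPOINT_URL variable are all hypothetical, and the endpoint is left as a placeholder because it depends on the storage and region you use. AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are the variable names that many S3 clients read by default.

```yaml
# Sketch only: pod name, secret name, secret keys and S3_ENDPOINT_URL are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: s3-pod
spec:
  containers:
  - name: s3-container
    image: ubuntu
    args: ["sleep", "36500000"]
    env:
    - name: AWS_ACCESS_KEY_ID          # read by many S3 clients by default
      valueFrom:
        secretKeyRef:
          name: s3-credentials         # hypothetical Secret in your namespace
          key: access-key
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: s3-credentials
          key: secret-key
    - name: S3_ENDPOINT_URL            # hypothetical variable your own tool would read
      value: "<S3 endpoint URL>"
  restartPolicy: Never
  # Note: no volumes or volumeMounts section, since S3 is reached over HTTP.
```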
Creating and mounting the PVC
Use kubectl to create the PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  storageClassName: <required storage class>
  accessModes:
  - <access mode, e.g. ReadWriteOnce>
  resources:
    requests:
      storage: <volume size, e.g. 20Gi>
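As a filled-in sketch of the template above, the claim below requests a 20Gi CephFS volume. The storage class name rook-cephfs and the ReadWriteMany access mode are assumptions made for illustration; check which storage classes are actually offered (for example with kubectl get storageclass) before using them.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol
spec:
  storageClassName: rook-cephfs   # assumption: pick a class offered in your region
  accessModes:
  - ReadWriteMany                 # assumption: shared access from several pods
  resources:
    requests:
      storage: 20Gi
```

If this is saved as examplevol.yaml (a hypothetical file name), it can be created with kubectl create -f examplevol.yaml.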
After you've created a PVC, you can check its status (kubectl get pvc pvc_name). Once it has the status Bound, you can attach it to your pod (claimName should match the name you gave your PVC):
apiVersion: v1
kind: Pod
metadata:
  name: vol-pod
spec:
  containers:
  - name: vol-container
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
  restartPolicy: Never
  volumes:
  - name: examplevol
    persistentVolumeClaim:
      claimName: examplevol
Using the right region for your pod
Latency significantly affects I/O performance. If you want optimal access speed to Ceph, add a region affinity to your pod that matches the region of your storage (us-east, us-west, etc.):
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-west
You can list each node's region label using: kubectl get nodes -L topology.kubernetes.io/region
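Putting the two pieces together, a complete pod manifest might look like the sketch below: it mounts the examplevol claim and pins the pod to nodes in one region. The pod name vol-pod-west is hypothetical, and us-west is only an assumption about where the volume's storage lives; use the region that matches your own storage class.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: vol-pod-west              # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - us-west             # assumption: the volume's storage is in us-west
  containers:
  - name: vol-container
    image: ubuntu
    args: ["sleep", "36500000"]
    volumeMounts:
    - mountPath: /examplevol
      name: examplevol
  restartPolicy: Never
  volumes:
  - name: examplevol
    persistentVolumeClaim:
      claimName: examplevol       # must match the PVC name
```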
Expanding volumes
All volumes created since December 2020 can be expanded by simply increasing the storage field of the PVC (either by using kubectl edit pvc ..., or kubectl apply -f updated_pvc_definition.yaml).
For older ones, all rook-ceph-block-* and most rook-cephfs-* volumes can be expanded. If yours is not expanding, you can ask the cluster admins to expand it manually.
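As an illustration of the expansion described above, an updated PVC definition could look like the following sketch. It assumes the claim was originally created with a 20Gi request and that the storage class and access mode shown here (both assumptions) are unchanged from the original claim; only the storage value differs.

```yaml
# Hypothetical updated_pvc_definition.yaml: same claim, larger request.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: examplevol                # must match the existing PVC name
spec:
  storageClassName: rook-cephfs   # assumption: unchanged from the original claim
  accessModes:
  - ReadWriteMany                 # assumption: unchanged from the original claim
  resources:
    requests:
      storage: 50Gi               # raised from 20Gi; a request can grow but not shrink
```

After the change is submitted, kubectl get pvc examplevol should eventually report the new capacity.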