
Ceph S3

The Nautilus Ceph storage cluster can be accessed via the S3 protocol. It uses our own storage, which is free for our users and is not related to Amazon or any commercial cloud.

Access

Request your credentials (key and secret) in the Matrix chat: let the admins know you'd like to access S3 and which pool works best for you.

West pool (default):

Inside the cluster: http://rook-ceph-rgw-nautiluss3.rook
Outside the cluster: https://s3-west.nrp-nautilus.io

East pool:

Note that the inside endpoint is HTTP (without SSL) and the outside endpoint is HTTPS (with SSL). You can use the outside endpoint from within the Kubernetes cluster, but requests will then go through a load balancer. By using the inside endpoint, multiple parallel requests from one or many machines can hit separate OSDs and therefore achieve very large training-set bandwidth.
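A quick way to check which endpoint you can reach from your current environment is to make a plain request against it, for example with curl; any response at all (even an error code) means the endpoint is reachable. This is just a connectivity check, not an S3 operation:

# From outside the cluster (or from inside, via the load balancer):
curl -sI https://s3-west.nrp-nautilus.io

# From a pod inside the cluster (plain HTTP, no SSL):
curl -sI http://rook-ceph-rgw-nautiluss3.rook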

Using Rclone

The easiest way to access S3 is Rclone.

When running rclone config interactively, use these options:

Storage: Amazon S3 Compliant Storage Providers

S3 provider: Ceph Object Storage

AWS Access Key ID, AWS Secret Access Key: ask in Matrix chat

Endpoint: use one of the endpoints from the Access section above
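If you prefer not to walk through the interactive prompts, the same remote can be created non-interactively with rclone config create, mirroring the GitLab CI example later on this page (the remote name nautilus-s3 and the west endpoint are just examples):

rclone config create nautilus-s3 s3 provider Ceph endpoint https://s3-west.nrp-nautilus.io access_key_id <your_key> secret_access_key <your_secret>

# List the objects visible through the new remote:
rclone ls "nautilus-s3:"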

Using s3cmd

S3cmd is an open-source tool for accessing S3.

To configure it, create the ~/.s3cfg file with the following contents if you're accessing from outside the cluster:

[default]
access_key = <your_key>
host_base = s3-west.nrp-nautilus.io
host_bucket = s3-west.nrp-nautilus.io
secret_key = <your_secret>
use_https = True

or with these contents if you're accessing from inside the cluster:

[default]
access_key = <your_key>
host_base = rook-ceph-rgw-nautiluss3.rook
host_bucket = rook-ceph-rgw-nautiluss3.rook
secret_key = <your_secret>
use_https = False

Run s3cmd ls to see the available buckets.
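If you don't have a bucket yet, you can create one first; the bucket name below is just a placeholder:

# Create a bucket and list its (currently empty) contents:
s3cmd mb s3://<BUCKET>
s3cmd ls s3://<BUCKET>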

Uploading files

Upload files with s3cmd put:

$ s3cmd put <FILE> s3://<BUCKET>/<DIR>

Or, to make the uploaded file public, use the -P (public) flag:

$ s3cmd put -P <FILE> s3://<BUCKET>/<DIR>
Public URL of the object is: http://s3-west.nrp-nautilus.io/...
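Downloading works symmetrically with s3cmd get, and a public object can also be fetched directly over HTTPS; the paths below are placeholders:

# Download an object (works for private objects too):
s3cmd get s3://<BUCKET>/<DIR>/<FILE> <LOCAL_FILE>

# Fetch a public object without credentials:
curl -O https://s3-west.nrp-nautilus.io/<BUCKET>/<DIR>/<FILE>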

Using AWS S3 tool

Credentials

First add your credentials to ~/.aws/credentials.

If you already use the AWS CLI, you can preserve your existing AWS credentials and add an additional profile to ~/.aws/credentials:

[default]
aws_access_key_id=xxxx
aws_secret_access_key=yyyy

[prp]
aws_access_key_id=iiiiii
aws_secret_access_key=jjjjj

If you don't otherwise use AWS, you can just put the credentials under [default] and skip the separate profile.

We recommend using awscli-plugin-endpoint to store the endpoint URL in .aws/config instead of typing it on the command line repeatedly. Install the plugin with:

pip install awscli-plugin-endpoint

The awscli-plugin-endpoint README.md describes a few additional setup steps. If you do not wish to use the plugin, add --endpoint-url https://s3-west.nrp-nautilus.io to all commands below.
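For example, listing your buckets without the plugin would look like this:

aws s3api list-buckets --profile prp --endpoint-url https://s3-west.nrp-nautilus.io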

Your .aws/config file should look like:

[profile prp]
s3api =
    endpoint_url = https://s3-west.nrp-nautilus.io

[plugins]
endpoint = awscli_plugin_endpoint

Using AWS CLI

Since aws s3 doesn't support regionless S3 buckets, use aws s3api instead.

  1. Create a bucket:

    aws s3api create-bucket --bucket my-bucket-name --profile prp 
    
  2. List your buckets:

    aws s3api list-buckets --profile prp 
    
  3. Upload a file:

    aws s3api put-object --bucket my-bucket-name --key hello.txt --body ~/hello.txt --profile prp
    
  4. Upload a file and make it publicly accessible:

    aws s3api put-object --bucket my-bucket-name --key hello.txt --body ~/hello.txt --profile prp --acl public-read
    

    You can now access this file in a browser at https://s3-west.nrp-nautilus.io/my-bucket-name/hello.txt

  5. Download a file:

    aws s3api get-object --bucket my-bucket-name --key hello.txt hello.txt --profile prp
    
  6. Give multiple users full access to the bucket (this does not extend to the objects already in the bucket; follow step 7 in addition to this step to allow shared access to the objects themselves):

    aws s3api put-bucket-acl --profile prp --bucket BUCKETNAME --grant-full-control id=<user1id>,id=<user2id>
    

    NOTE: These IDs need to be the names that the PRP sys admins use when providing you your key and secret. Also note that this operation is not additive: if you first run it with id=user1 and later with id=user2, user1 will no longer have access. Instead, call get-bucket-acl to get the list of existing IDs and include them along with the new ID (see the sketch after this list).

  7. Give multiple users full access to all objects in the bucket (replace BUCKETNAME and create file policy.json):

     # Create a file policy.json with the following contents (replace BUCKETNAME):
     {
        "Statement": [
           {
              "Effect": "Allow",
              "Principal": "*",
              "Action": [
                 "s3:GetObject",
                 "s3:DeleteObject",
                 "s3:PutObject"
              ],
              "Resource": "arn:aws:s3:::BUCKETNAME/*"
           }
        ]
     }

     # Then apply the policy to the bucket:
     aws s3api put-bucket-policy --profile prp --bucket BUCKETNAME --policy file://policy.json
    

    More detailed policy.json examples at: https://docs.aws.amazon.com/cli/latest/reference/s3api/put-bucket-policy.html

  8. Use awscli_plugin_endpoint

    Note that you can skip typing the endpoint every time by installing awscli_plugin_endpoint (assuming your awscli is set up correctly, with the prp profile):

     pip install awscli-plugin-endpoint
     aws configure set plugins.endpoint awscli_plugin_endpoint
     aws configure --profile prp set s3api.endpoint_url https://s3-west.nrp-nautilus.io
    

    If you want, you can set up a shell function to further simplify typing. Add the following to your .bashrc:

     s3prp() {
        args=("$@")
        aws s3 --profile prp "${args[@]}"
     }
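
    As mentioned in step 6, put-bucket-acl replaces the existing ACL, so fetch the current grantees first and re-apply the full list; the user IDs below are placeholders:

     # Look up the grants that are already on the bucket:
     aws s3api get-bucket-acl --profile prp --bucket BUCKETNAME

     # Re-apply the full list of grantees, including the new user:
     aws s3api put-bucket-acl --profile prp --bucket BUCKETNAME --grant-full-control id=<user1id>,id=<user2id>,id=<newuserid>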
    

S3 from TensorFlow

You can stream training data from S3 with the smart_open library:

import smart_open
with smart_open.open('s3://bucket/myfile.mat', 'rb') as f:
    # yield your samples from the f file in your tensorflow dataset as usual
    ...

Note that smart_open supports both local paths and S3 URLs, so the same code works when you test on a local file and when you run it on the cluster against a file stored in S3. When reading from Nautilus S3 rather than AWS, make sure boto3 is pointed at the Nautilus endpoint (for example via smart_open's transport_params).

See this TFRecord presentation for details.

Setting up s3fs (POSIX mount)

To mount an S3 bucket as a filesystem, use s3fs-fuse. Also see the FUSE docs.
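s3fs reads your key and secret from a password file. A minimal sketch of creating one (the path matches the passwd_file option used in the mount commands below; s3fs refuses password files readable by other users):

# Store the credentials as <key>:<secret> and make the file private:
echo '<your_key>:<your_secret>' > ${HOME}/.passwd-s3fs
chmod 600 ${HOME}/.passwd-s3fs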

Example mount commands are as follows.

access from outside the cluster

s3fs bucket /mount/point -o passwd_file=${HOME}/.passwd-s3fs -o url=https://s3-west.nrp-nautilus.io -o use_path_request_style -o umask=0007,uid=$UID

access inside the cluster

s3fs bucket /mount/point -o passwd_file=${HOME}/.passwd-s3fs -o url=http://rook-ceph-rgw-nautiluss3.rook -o use_path_request_style -o umask=0007,uid=$UID

Things to Note

(2 and 3 are from the issue here: https://github.com/s3fs-fuse/s3fs-fuse/issues/673)

  1. -o use_path_request_style is required for non-Amazon S3-compatible storage.

  2. -o umask=0007 sets the access permissions. For POSIX compliance, s3fs defaults to no access for any object.

  3. -o uid=$UID sets the owner of the mounted files. The default is root.

unmount

sudo umount /mount/point

or, for an unprivileged user:

fusermount -u /mount/point

fstab

Add the following line to /etc/fstab:

outside the cluster

s3fs#mybucket /path/to/mountpoint fuse _netdev,allow_other,use_path_request_style,url=https://s3-west.nrp-nautilus.io,passwd_file=/path/to/passwd-file,umask=0007,uid=1001 0 0

inside the cluster

s3fs#mybucket /path/to/mountpoint fuse _netdev,allow_other,use_path_request_style,url=http://rook-ceph-rgw-nautiluss3.rook,passwd_file=/path/to/passwd-file,umask=0007,uid=1001 0 0

You can find your current user ID with:

id

Using S3 in GitLab CI

In your GitLab project, go to Settings->CI/CD, open the Variables tab, and add variables holding your S3 credentials: ACCESS_KEY_ID and SECRET_ACCESS_KEY. Choose "Protect variable" and "Mask variable".

Your .gitlab-ci.yml file can look like:

build:
  image: ubuntu
  before_script:
    - apt-get update && apt-get install -y curl unzip
    - curl https://rclone.org/install.sh | bash
  stage: build
  script:
    - rclone config create nautilus-s3 s3 endpoint https://s3-west.nrp-nautilus.io provider Ceph access_key_id $ACCESS_KEY_ID secret_access_key $SECRET_ACCESS_KEY
    - rclone ls "nautilus-s3:"