LLM as a service

One of the easiest ways to deploy an LLM is to use a model hosted on HuggingFace with the help of the SHALB helm chart.

The helm chart installs a text generation inference container, optionally accompanied by the chat-ui interface for talking to the service.

To deploy the LLM, choose a text generation model without download restrictions and with a modest footprint (e.g. Mistral is a good one) and create a helm values file (huggingface-values.yaml) similar to this one:

model:
  organization: "mistralai"
  name: "Mistral-7B-Instruct-v0.2"

persistence:
  accessModes:
  - ReadWriteOnce
  storageClassName: rook-ceph-block
  storage: 500Gi

updateStrategy:
  type: Recreate

ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: haproxy
  hosts:
  - host: <subdomain>.nrp-nautilus.io
    paths:
      - path: /
        pathType: Prefix
  tls:
  - hosts:
    - <subdomain>.nrp-nautilus.io

resources:
  requests:
    cpu: "3"
    memory: "10Gi"
    nvidia.com/gpu: 2
  limits:
    cpu: "8"
    memory: "25Gi"
    nvidia.com/gpu: 2

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - <desired_gpu_type>

chat:
  enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "5G"
    requests:
      cpu: "500m"
      memory: "512M"

  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: haproxy
    hosts:
    - host: <subdomain>-chat.nrp-nautilus.io
      paths:
      - path: /
        pathType: Prefix
    tls:
    - hosts:
      - <subdomain>-chat.nrp-nautilus.io

mongodb:
  updateStrategy:
    type: Recreate
  resources:
    limits:
      cpu: "10"
      memory: "10G"
    requests:
      cpu: "1"
      memory: "1G"

Replace <subdomain> with a subdomain of your choice. Optionally, keep the affinity block and set <desired_gpu_type> to the GPU type you want, or remove the whole affinity block.
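
If you keep the affinity block, one way to see which GPU types are present on the cluster nodes (assuming you already have kubectl access) is to list the node labels:

kubectl get nodes -L nvidia.com/gpu.product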

Install Helm and deploy the LLM into your namespace:

helm install hug -n <your_namespace> oci://registry-1.docker.io/shalb/huggingface-model -f huggingface-values.yaml
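
You can check that the pods have started (the namespace matches the one used in the install command above):

kubectl get pods -n <your_namespace>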

If you see 3 pods started in your namespace, you're almost done! The model will be downloaded and cached by the init container. Go stretch, make some tea, and give it some time to download into our persistent storage. Once the init container is done and the main one starts, give it some more time to start up, and then you can start chatting with the AI.
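
To follow the progress, you can watch the pods transition from the Init phase to Running (press Ctrl-C to stop watching):

kubectl get pods -n <your_namespace> -w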

Your chat-ui will be available at <subdomain>-chat.nrp-nautilus.io, and the API at <subdomain>.nrp-nautilus.io.
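
As a quick check of the API, you can send a request to the text generation inference /generate endpoint (a minimal sketch assuming the standard text-generation-inference REST API; adjust the prompt and parameters as needed):

curl https://<subdomain>.nrp-nautilus.io/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 50}}'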

Please scale down or purge unused deployments to free up resources for other users of the cluster. Your model will remain cached in our persistent storage, and next time the startup will be much quicker.

kubectl scale deployment -n <your_namespace> hug-mistral-7b-instruct-v0-2 hug-mistral-7b-instruct-v0-2-chat hug-mongodb --replicas=0
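
To scale back up later, run the same command with --replicas=1. To purge the deployment entirely, uninstall the Helm release (whether the cached model volume is kept depends on the chart's persistence settings):

helm uninstall hug -n <your_namespace>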