LLM as a service
One of the easiest ways to deploy an LLM is to use a model from HuggingFace with the help of the SHALB helm chart. The chart installs a text generation inference container, optionally accompanied by the chat-ui interface for talking to the service.

To deploy the LLM, choose a text generation model with no download restrictions and a modest footprint (e.g. Mistral is a good one) and create a helm values file (`huggingface-values.yaml`) similar to this one:
```yaml
model:
  organization: "mistralai"
  name: "Mistral-7B-Instruct-v0.2"
persistence:
  accessModes:
    - ReadWriteOnce
  storageClassName: rook-ceph-block
  storage: 500Gi
updateStrategy:
  type: Recreate
ingress:
  enabled: true
  annotations:
    kubernetes.io/ingress.class: haproxy
  hosts:
    - host: <subdomain>.nrp-nautilus.io
      paths:
        - path: /
          pathType: Prefix
  tls:
    - hosts:
        - <subdomain>.nrp-nautilus.io
resources:
  requests:
    cpu: "3"
    memory: "10Gi"
    nvidia.com/gpu: 2
  limits:
    cpu: "8"
    memory: "25Gi"
    nvidia.com/gpu: 2
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.product
              operator: In
              values:
                - <desired_gpu_type>
chat:
  enabled: true
  resources:
    limits:
      cpu: "2"
      memory: "5G"
    requests:
      cpu: "500m"
      memory: "512M"
  ingress:
    enabled: true
    annotations:
      kubernetes.io/ingress.class: haproxy
    hosts:
      - host: <subdomain>-chat.nrp-nautilus.io
        paths:
          - path: /
            pathType: Prefix
    tls:
      - hosts:
          - <subdomain>-chat.nrp-nautilus.io
mongodb:
  updateStrategy:
    type: Recreate
  resources:
    limits:
      cpu: "10"
      memory: "10G"
    requests:
      cpu: "1"
      memory: "1G"
```
Replace `<subdomain>` with your desired subdomain. Optionally keep and modify the `<desired_gpu_type>` value, or remove the whole `affinity` block.
Install Helm and deploy the LLM into your namespace:

```shell
helm install hug -n <your_namespace> oci://registry-1.docker.io/shalb/huggingface-model -f huggingface-values.yaml
```
If you see 3 pods started in your namespace, you're almost done! The model will be downloaded and cached by the init container. Go stretch, make some tea, and give it some time to download into our persistent storage. Once the init container is done and the main one starts, give it a little more time to start, and you can begin chatting with the AI.
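One way to watch the rollout is with the commands below; the deployment name here assumes the Mistral example above (it will differ if you chose another model), and `<your_namespace>` must be replaced:

```shell
# List the pods in your namespace; three should appear
# (model server, chat-ui, mongodb):
kubectl get pods -n <your_namespace>

# Wait for the model deployment to become fully available:
kubectl rollout status deployment/hug-mistral-7b-instruct-v0-2 -n <your_namespace>
```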
Your chat-ui will be available at `<subdomain>-chat.nrp-nautilus.io`, and the API at `<subdomain>.nrp-nautilus.io`.
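As an illustration, the text-generation-inference server behind the API exposes a `/generate` endpoint that accepts a JSON body; the prompt and parameters below are arbitrary examples, and `<subdomain>` is a placeholder for yours:

```shell
# Example request body for the text-generation-inference /generate endpoint.
# The prompt and max_new_tokens value are illustrative, not required values.
PAYLOAD='{"inputs": "What is Kubernetes?", "parameters": {"max_new_tokens": 50}}'
echo "$PAYLOAD"

# Send it to your deployment (replace <subdomain> first):
# curl https://<subdomain>.nrp-nautilus.io/generate \
#      -X POST -H 'Content-Type: application/json' -d "$PAYLOAD"
```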
Please scale down or purge unused deployments to free up resources for other users of the cluster. Your model will remain cached in our persistent storage, and next time the startup will be much quicker.

```shell
kubectl scale deployment -n <your_namespace> hug-mistral-7b-instruct-v0-2 hug-mistral-7b-instruct-v0-2-chat hug-mongodb --replicas=0
```
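To resume later, the same command with `--replicas=1` should bring the pods back; the deployment names here again assume the Mistral example above:

```shell
# Scale the deployments back up when you need the service again;
# the cached model makes the restart much faster than the first start.
kubectl scale deployment -n <your_namespace> \
  hug-mistral-7b-instruct-v0-2 \
  hug-mistral-7b-instruct-v0-2-chat \
  hug-mongodb --replicas=1
```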