Deploy Ray using Helm

To distribute the computational workload and run heavy jobs such as synthesizing your data, we need to deploy a Ray cluster using Helm for our API to connect to. The chart can be found in the repository here, or it can be supplied as a Helm repository URL and used with Helm directly. Please contact the Syntho team for this repository URL.
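If you received a Helm repository URL, a minimal sketch of adding it looks like the following; the repository name syntho and the <syntho-helm-repo-url> placeholder are assumptions and should be replaced with the details provided by the Syntho team:

# Add the Syntho Helm repository (URL provided by the Syntho team) and refresh the local index
helm repo add syntho <syntho-helm-repo-url>
helm repo update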

This part of the documentation will assume access to the folder helm/ray in the master branch of the aforementioned GitHub repository.

A note for OpenShift users: Ray creates many threads within a pod, especially when scaling the Ray cluster. By default, OpenShift limits how many processes a pod can spawn, due to its use of CRI-O as its Container Runtime Interface (CRI). We recommend updating this limit: see the section Additional changes for OpenShift/CRI-O.

Setting the image

In the values.yaml file in helm/ray, set the following fields to ensure the usage of the correct Docker image:

operatorImage:
  repository: syntho.azurecr.io/syntho-ray-operator
  tag: <tag>
  pullPolicy: IfNotPresent

image:
  repository: syntho.azurecr.io/syntho-ray
  tag: <tag>
  pullPolicy: IfNotPresent

The image tag will be provided by the Syntho Team. In some cases, the latest tag can be used, but we recommend setting a specific tag.

In addition to setting the correct Docker image, reference the Kubernetes Secret that was created earlier under imagePullSecrets:

imagePullSecrets: 
    - name: syntho-cr-secret

This value is set to syntho-cr-secret by default.
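If this secret does not exist yet in the target namespace, it can be created from the registry credentials provided by the Syntho team. A sketch, assuming the namespace syntho and placeholder credentials:

kubectl create secret docker-registry syntho-cr-secret \
  --namespace syntho \
  --docker-server=syntho.azurecr.io \
  --docker-username=<username> \
  --docker-password=<password>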

License key - Ray

The license key can be set under SynthoLicense in the values.yaml file. An example of this would be:

SynthoLicense: <syntho-license-key>

Please use the license key provided by Syntho.
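Alternatively, the license key can be passed at install time instead of being stored in values.yaml. A sketch using Helm's --set-string flag:

helm install ray-cluster ./helm/ray --values values.yaml --namespace syntho \
  --set-string SynthoLicense='<syntho-license-key>'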

Cluster name

The default cluster name is set to ray-cluster. In case this needs to be adjusted, you can do so by changing clustername:

clustername: ray-cluster

Workers and nodes

First of all, the Syntho Team will have some recommendations on what the exact size of the cluster should be given your data requirements. If you haven't received any information about this, please contact the Syntho team first to discuss the optimal setup. The rest of this section will give an example configuration to show what that looks like in the Helm chart.

Depending on the size and number of nodes in the cluster, adjust the number of workers that Ray has available for tasks. Ray needs at least one head instance. To increase performance, additional worker groups can be created as well. Under head we can set the resources for the head node. The head node is mostly used for administrative tasks in Ray, while the worker nodes pick up most of the tasks for the Syntho Application.

For a production environment, we recommend a pool of workers next to the head node. The Syntho Team can indicate what resources should be assigned to the head node and worker nodes. Here is an example configuration of a cluster with a head node and one worker group of 1 replica with 16 CPUs and 64 GB of RAM:

head:
  rayStartParams:
    dashboard-host: '0.0.0.0'
    block: 'true'
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      memory: "64G"
    requests:
      cpu: "16"
      memory: "64G"
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  ports:
  - containerPort: 6379
    name: gcs
  - containerPort: 8265
    name: dashboard
  - containerPort: 10001
    name: client
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []


worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  # disabled: true
  groupName: workergroup
  replicas: 1
  labels: {}
  rayStartParams:
    block: 'true'
  initContainerImage: 'busybox:1.28'
  initContainerSecurityContext: {}
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      memory: "64G"
    requests:
      cpu: "16"
      memory: "64G"
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []

If autoscaling is enabled in Kubernetes, new nodes will be created once Ray's resource requirements exceed the available resources. Please discuss with the Syntho Team which setup best fits your data requirements.

For development or experimental environments, a less advanced setup is usually sufficient. In this case, we recommend starting with only a head node and no workers or additional autoscaling setup. The Syntho Team will again advise on the size of this node, given the data requirements. An example configuration using a node with 16 CPUs and 64 GB of RAM would be:

head:
  rayStartParams:
    dashboard-host: '0.0.0.0'
    block: 'true'
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "64G"  # Depending on data requirements
    requests:
      cpu: "16"
      memory: "64G"  # Depending on data requirements
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  ports:
  - containerPort: 6379
    name: gcs
  - containerPort: 8265
    name: dashboard
  - containerPort: 10001
    name: client
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []

worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  disabled: true

# The map's key is used as the groupName.
# For example, the key smallGroup in the map below
# will be used as the groupName.
additionalWorkerGroups:
  smallGroup:
    # Keep additional worker groups disabled for this minimal setup
    disabled: true

Additionally, nodeSelector, tolerations and affinity can be defined for each type of node to control on which nodes the pods get scheduled. securityContext and annotations can also be set for each type of worker/head node. An example is shown below.
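As a sketch, the following pins the worker pods to a dedicated node pool; the node label and taint used here are hypothetical and should be replaced with the ones used in your cluster:

worker:
  nodeSelector:
    syntho.ai/node-pool: ray-workers   # hypothetical node label
  tolerations:
    - key: "syntho.ai/ray"             # hypothetical taint
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"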

Shared storage of Ray workers

We require an additional Persistent Volume for the Ray workers to share metadata about the currently running tasks. This is included in the Helm chart and uses the ReadWriteMany access mode. In the storage section you can adjust the storageClassName to use for this. Please make sure that you're using a storageClass that supports the ReadWriteMany access mode.

storage:
  storageClassName: default  # Change to correct storageClass
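To check which storage classes are available in the cluster:

kubectl get storageclass

Whether ReadWriteMany is supported depends on the provisioner backing the storage class; file-based provisioners (for example NFS or Azure Files) typically support it, while most block-storage provisioners do not.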

Volume mounts

Optional

If certain volumes need to be mounted, the values volumes and volumeMounts can be adjusted to define those. Keep in mind that when using a Persistent Volume, Ray may schedule multiple pods using that particular volume, so it will need to be accessible from multiple machines.
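A sketch of mounting an additional Persistent Volume Claim on the head node, assuming a hypothetical claim named source-data-pvc that is accessible from multiple nodes:

head:
  volumes:
    - name: log-volume
      emptyDir: {}
    - name: source-data                 # hypothetical additional volume
      persistentVolumeClaim:
        claimName: source-data-pvc
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
    - mountPath: /data                  # hypothetical mount path
      name: source-data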

Additional changes for OpenShift/CRI-O

Certain orchestrators or setups use CRI-O as the container runtime interface (CRI). OpenShift 4.x currently has CRI-O configured with a default limit of 1024 processes per pod. When scaling, this limit can easily be reached using Ray. We recommend increasing this limit to around 8096 processes.

The OpenShift documentation describes the steps to increase this limit here. The following link has more information about the settings that can be used in CRI-O.
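As a sketch, on OpenShift 4.x the PID limit can be raised with a ContainerRuntimeConfig resource targeting the worker machine config pool; the resource name below is an assumption and the pool selector should be adapted to your cluster:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: increase-pids-limit            # hypothetical resource name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 8096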

Deploy

Once the values have been set correctly in values.yaml under helm/ray, we can deploy the application to the cluster using the following command:

helm install ray-cluster ./helm/ray --values values.yaml --namespace syntho 

Once deployed, we can find the service name in Kubernetes for the Ray application. When using the release name ray-cluster, as in the command above, the service name (and the hostname to use as the ray_address variable in the Core API values section) is ray-cluster-ray-head.
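To verify that the head service was created with the expected name:

kubectl get svc ray-cluster-ray-head -n syntho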

Lastly, we can check the ray-operator pod and the resulting Ray head and worker pods. Running kubectl logs deployment/ray-operator -n syntho will show the logs of the operator.
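The head and worker pods can be listed as follows; exact pod names will vary per deployment:

kubectl get pods -n syntho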
