Deploy Ray using Helm

To distribute the computational workload and run heavy jobs such as synthesizing your data, we need to deploy a Ray cluster using Helm for our API to connect to. The chart can be found in the repository here, or it can be supplied as a Helm repository URL and used with Helm directly. Please contact the Syntho team for this repository URL.
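If you received a Helm repository URL, a minimal sketch of adding it looks like the following; the repository name syntho and the <syntho-helm-repo-url> placeholder are assumptions and should be replaced with the details provided by the Syntho team:

# Add the Syntho Helm repository (URL provided by the Syntho team) and refresh the local index
helm repo add syntho <syntho-helm-repo-url>
helm repo update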

This part of the documentation will assume access to the folder helm/ray in the master branch of the aforementioned GitHub repository.

A note for OpenShift users: Ray creates many threads within a pod, especially when scaling the Ray cluster. By default, OpenShift limits how many processes a pod can spawn, due to its use of CRI-O as its Container Runtime Interface (CRI). We recommend updating this limit: see the section Additional changes for OpenShift/CRI-O.

Setting the image

In the values.yaml file in helm/ray, set the following fields to ensure the usage of the correct Docker image:

operatorImage:
  repository: syntho.azurecr.io/syntho-ray-operator
  tag: <tag>
  pullPolicy: IfNotPresent

image:
  repository: syntho.azurecr.io/syntho-ray
  tag: <tag>
  pullPolicy: IfNotPresent

The image tag will be provided by the Syntho Team. In some cases, the latest tag can be used, but we recommend setting a specific tag.

In addition to setting the correct Docker image, reference the Kubernetes Secret that was created earlier under imagePullSecrets:

imagePullSecrets: 
    - name: syntho-cr-secret

This value is set to syntho-cr-secret by default.
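If this secret does not exist yet in the target namespace, it can be created from the registry credentials provided by the Syntho team. A sketch, assuming the namespace syntho and placeholder credentials:

kubectl create secret docker-registry syntho-cr-secret \
  --namespace syntho \
  --docker-server=syntho.azurecr.io \
  --docker-username=<username> \
  --docker-password=<password>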

License key - Ray

The license key can be set under SynthoLicense in the values.yaml file. An example of this would be:

SynthoLicense: <syntho-license-key>

Please use the license key provided by Syntho.
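Alternatively, the license key can be passed at install time instead of being stored in values.yaml. A sketch using Helm's --set-string flag:

helm install ray-cluster ./helm/ray --values values.yaml --namespace syntho \
  --set-string SynthoLicense='<syntho-license-key>'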

Cluster name

The default cluster name is set to ray-cluster. In case this needs to be adjusted, you can do so by changing clustername:

clustername: ray-cluster

Workers and nodes

First of all, the Syntho Team will have some recommendations on what the exact size of the cluster should be given your data requirements. If you haven't received any information about this, please contact the Syntho team first to discuss the optimal setup. The rest of this section will give an example configuration to show what that looks like in the Helm chart.

Depending on the size and number of nodes in the cluster, adjust the number of workers that Ray has available for tasks. Ray needs at least one head instance. To increase performance, additional worker groups can be created as well. Under head we can set the resources for the head node. The head node is mostly used for administrative tasks in Ray, while the worker nodes pick up most of the tasks for the Syntho Application.

For a production environment, we recommend a pool of workers next to the head node. The Syntho Team can indicate what resources should be assigned to the head node and worker nodes. Here is an example configuration of a cluster with a head node and one worker group of 1 replica with 16 CPUs and 64 GB of RAM:

head:
  rayStartParams:
    dashboard-host: '0.0.0.0'
    block: 'true'
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      memory: "64G"
    requests:
      cpu: "16"
      memory: "64G"
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  ports:
  - containerPort: 6379
    name: gcs
  - containerPort: 8265
    name: dashboard
  - containerPort: 10001
    name: client
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []


worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  # disabled: true
  groupName: workergroup
  replicas: 1
  labels: {}
  rayStartParams:
    block: 'true'
  initContainerImage: 'busybox:1.28'
  initContainerSecurityContext: {}
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      memory: "64G"
    requests:
      cpu: "16"
      memory: "64G"
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []

If autoscaling is enabled in Kubernetes, new nodes will be created once Ray's resource requirements exceed the available resources. Please discuss with the Syntho Team which setup best fits your data requirements.

For development or experimental environments, a less advanced setup is usually sufficient. In this case, we recommend starting with only a head node and no workers or additional autoscaling setup. The Syntho Team will again advise on the size of this node, given the data requirements. An example configuration using a node with 16 CPUs and 64 GB of RAM would be:

head:
  rayStartParams:
    dashboard-host: '0.0.0.0'
    block: 'true'
  containerEnv:
  - name: RAY_SCHEDULER_SPREAD_THRESHOLD
    value: "0.0"
  envFrom: []
  resources:
    limits:
      cpu: "16"
      # To avoid out-of-memory issues, never allocate less than 2G memory for the Ray head.
      memory: "64G"  # Depending on data requirements
    requests:
      cpu: "16"
      memory: "64G"  # Depending on data requirements
  annotations: {}
  nodeSelector: {}
  tolerations: []
  affinity: {}
  securityContext: {}
  ports:
  - containerPort: 6379
    name: gcs
  - containerPort: 8265
    name: dashboard
  - containerPort: 10001
    name: client
  volumes:
    - name: log-volume
      emptyDir: {}
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
  sidecarContainers: []

worker:
  # If you want to disable the default workergroup
  # uncomment the line below
  disabled: true

# The map's key is used as the groupName.
# For example, the key smallGroup in the map below
# will be used as the groupName.
additionalWorkerGroups:
  smallGroup:
    # Keep additional worker groups disabled for this minimal setup
    disabled: true

Additionally, nodeSelector, tolerations and affinity can be defined for each type of node to control on which nodes the pods get scheduled. securityContext and annotations can also be set for each type of worker/head node. An example is shown below.
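As a sketch, the following pins the worker pods to a dedicated node pool; the node label and taint used here are hypothetical and should be replaced with the ones used in your cluster:

worker:
  nodeSelector:
    syntho.ai/node-pool: ray-workers   # hypothetical node label
  tolerations:
    - key: "syntho.ai/ray"             # hypothetical taint
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"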

Shared storage of Ray workers

We require an additional Persistent Volume for the Ray workers to share metadata about the currently running tasks. This is included in the Helm chart and uses the ReadWriteMany access mode. In the storage section you can adjust the storageClassName to use for this. Please make sure that you're using a storageClass that supports the ReadWriteMany access mode.

storage:
  storageClassName: default  # Change to correct storageClass
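To check which storage classes are available in the cluster:

kubectl get storageclass

Whether ReadWriteMany is supported depends on the provisioner backing the storage class; file-based provisioners (for example NFS or Azure Files) typically support it, while most block-storage provisioners do not.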

Volume mounts

Optional

If certain volumes need to be mounted, the values volumes and volumeMounts can be adjusted to define those. Keep in mind that when using a Persistent Volume, Ray may schedule multiple pods using that particular volume, so it will need to be accessible from multiple machines.
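A sketch of mounting an additional Persistent Volume Claim on the head node, assuming a hypothetical claim named source-data-pvc that is accessible from multiple nodes:

head:
  volumes:
    - name: log-volume
      emptyDir: {}
    - name: source-data                 # hypothetical additional volume
      persistentVolumeClaim:
        claimName: source-data-pvc
  volumeMounts:
    - mountPath: /tmp/ray
      name: log-volume
    - mountPath: /data                  # hypothetical mount path
      name: source-data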

Additional changes for OpenShift/CRI-O

Certain orchestrators or setups use CRI-O as the container runtime interface (CRI). OpenShift 4.x currently has CRI-O configured with a default limit of 1024 processes per pod. When scaling, this limit can easily be reached using Ray. We recommend increasing this limit to around 8096 processes.

The OpenShift documentation describes the steps to increase this limit here. The following link has more information about the settings that can be used in CRI-O.
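As a sketch, on OpenShift 4.x the PID limit can be raised with a ContainerRuntimeConfig resource targeting the worker machine config pool; the resource name below is an assumption and the pool selector should be adapted to your cluster:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: increase-pids-limit            # hypothetical resource name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 8096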

Deploy

Once the values have been set correctly in values.yaml under helm/ray, we can deploy the application to the cluster using the following command:

helm install ray-cluster ./helm/ray --values values.yaml --namespace syntho 

Once deployed, we can find the service name in Kubernetes for the Ray application. When using the release name ray-cluster, as in the command above, the service name (and the hostname to use as the ray_address variable in the Core API values section) is ray-cluster-ray-head.
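To verify that the head service was created with the expected name:

kubectl get svc ray-cluster-ray-head -n syntho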

Lastly, we can check the ray-operator pod and the resulting Ray head and worker pods. Running kubectl logs deployment/ray-operator -n syntho will show the logs of the operator.
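The head and worker pods can be listed as follows; exact pod names will vary per deployment:

kubectl get pods -n syntho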
