Troubleshooting

Errors when deploying with Argo CD

If you use Argo CD to manage the operator, syncing can fail with an error complaining that the CRDs are too long. A similar issue can be found here: issue. The recommended solution is to split the operator into two Argo CD apps:

  • The first app installs only the CRDs, with Replace=true set directly. Snippet:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator-crds
spec:
  project: default
  source:
    repoURL: <repo_url>
    targetRevision: HEAD
    path: helm/ray/crds
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
    - Replace=true

  • The second app installs the Helm chart with skipCrds=true (a feature introduced in Argo CD 2.3.0). Snippet:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator
spec:
  project: default
  source:
    repoURL: <repo_url>
    targetRevision: HEAD
    path: helm/ray
    helm:
      skipCrds: true
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-operator
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
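
As a possible alternative to splitting the app, Argo CD versions that support server-side apply (2.5 and later) offer a ServerSideApply=true sync option. If the error is the usual annotation-size limit, server-side apply avoids it because it does not write the last-applied-configuration annotation. The sketch below is an assumption rather than part of the recommended solution: it reuses the placeholder repo URL and path from the snippets above and has not been validated against this chart.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ray-operator
spec:
  project: default
  source:
    repoURL: <repo_url>
    targetRevision: HEAD
    path: helm/ray
  destination:
    server: https://kubernetes.default.svc
    namespace: ray-operator
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    # Assumption: requires Argo CD 2.5 or later; resources are applied server-side
    - ServerSideApply=true
```

With this variant the CRDs stay in the same app as the chart, so skipCrds is not set. On older Argo CD versions, the two-app split remains the approach to use.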

Permission issues during Ray pods startup

In some cases, permission issues can arise with one of the mounted volumes (either /tmp/ray-data or /tmp/ray-workflows), preventing the Ray cluster pod(s) from starting up correctly. In this case, enabling the InitContainer in the Ray head section of the Helm chart might resolve the issue (it is enabled by default).

If this does not resolve the issue and you still get an error when the Ray cluster pod(s) start up, we recommend adjusting the securityContext for every pod so that the containers run as root (UID 0, GID 0). The following changes need to be made:

ray-cluster:
  head:
    securityContext:
      runAsUser: 0
      runAsGroup: 0

If any workers are enabled for the cluster, the same setting needs to be added to the worker configuration:

ray-cluster:
  worker:
    securityContext:
      runAsUser: 0
      runAsGroup: 0
