Making Ray.io Enterprise Ready: Secure deployment of Ray on Google Cloud — GKE

Ali Arsanjani
19 min read · Sep 17, 2023

Authors: Ali Arsanjani, Balazs Pinter, Google AI CoE

Context / Background

Vertex AI is a fully managed ML platform that provides all the tools and infrastructure you need to build, deploy, and scale ML models. It offers a wide range of features, including:

  • Pre-trained models: Vertex AI offers a library of pre-trained models that you can use to get started with ML.
  • AutoML: Vertex AI can automatically train and deploy ML models for you, even if you have no prior experience with ML.
  • Model registry: Vertex AI allows you to store and manage your ML models in a central repository.
  • Model serving: Vertex AI can deploy your ML models to production, so that they can be used to make predictions in real time.

If you want to build your own custom ML platform infrastructure, you can use Google Kubernetes Engine (GKE) as the underlying scalable, secure compute cluster. GKE is a managed Kubernetes service that provides a platform for running containerized applications. You can use GKE to build your own ML platform by deploying the necessary tools and infrastructure.

Building your own ML infrastructure may be a good idea if you have the internal expertise and are comfortable managing a substantial amount of infrastructure; the tradeoff is greater flexibility and control over your ML environment.

However, it also requires more time and effort to set up and manage. If you are new to ML, or if you don’t have the time or resources to build your own ML platform, Vertex AI is a good option.

Adopting Open Source

Open source software (OSS) is software that is made available for anyone to use, modify, and distribute. OSS can be integrated with existing code, enabling developers to focus on their core strengths and on building innovative and creative solutions for their enterprise and its customers.

Using OSS has the potential to introduce risks into enterprise software systems, including vulnerabilities and dependency sprawl. According to the research on The State of Dependency Management and Top 10 OSS Risks, 80% of code in modern applications is code your developers didn’t write, but “borrowed” from the internet.

In this blog we discuss best practices for using OSS in an enterprise context, focusing on Ray, an increasingly popular open-source framework for distributed computing that enables scalable and efficient execution of machine learning tasks. Using OSS often introduces security and network / VPC considerations that go beyond trial usage.

Any OSS project, no matter how powerful and flexible, may need to be augmented to meet the stringent constraints of enterprise computing. While Ray offers a lot of flexibility, there are key considerations to address that may restrict its full potential in enterprise environments. In this blog, we delve into these limitations and propose ways to overcome them with a set of best practices that make Ray a more suitable solution for enterprise-level computing.

We will also cover additional security features and ways to improve scalability and reliability on GKE, to help you better understand how Ray can be elevated to meet the demanding needs of enterprise environments. Join us as we explore the ways in which Ray can be enhanced to provide a more robust and scalable solution for enterprise computing.

In a previous blog, “Building a Machine Learning Platform with Kubeflow and Ray on Google Kubernetes Engine”, Google demonstrated how to deploy Ray on GKE. In this article we focus on the infrastructure aspects of the deployment, showing how to structure the resulting environment so that it conforms to more restrictive enterprise-scale use cases with scale, security, and networking constraints; and we explore possible ways to deploy Ray into an enterprise-scale Google Cloud environment.

Google’s Security foundations blueprint presents an opinionated view on creating a secure-by-default Google Cloud landing zone. We will leverage those principles here.

Implementing Open Source Software in an Enterprise Environment

As we implement OSS in a more secure enterprise environment, we need to apply a checklist of best-practices that help us cross check and navigate the more secure networking and access constraints that are a natural part of any enterprise application environment.

Next we consider some important security and networking checklist items, along with the Google Cloud solutions that cover those areas. There are two aspects to security in the public cloud: making the cloud environment secure, and making the open source application running on it secure.

Here are the items to cross-check when adopting OSS in an enterprise environment:

  1. Authentication and Authorization: Ensure that the open source software is integrated with enterprise authentication and authorization mechanisms, such as Active Directory, LDAP, or SAML, to authenticate users and control access to resources.
  2. Access Control Policies: Implement access control policies to ensure that only authorized users have access to sensitive data and resources.
  3. Encryption: Ensure that the open source software uses encryption mechanisms, such as TLS, SSL, or SSH, to protect data in transit and at rest.
  4. Vulnerability Assessment: Perform a vulnerability assessment of the open source software and address any identified vulnerabilities.
  5. Patch Management: Implement a patch management process to ensure that security patches and updates are regularly applied to the open source software.
  6. Network Segmentation: Segment the network to limit the exposure of the open source software to potential attacks and ensure that it is isolated from other critical enterprise systems.
  7. Firewall Rules: Configure firewall rules to restrict traffic to and from the open source software to only necessary ports and protocols.
  8. Monitoring and Logging: Implement monitoring and logging mechanisms to detect and respond to security incidents and ensure compliance with security policies.
  9. User Training: Train enterprise users on security best practices and policies related to the use of the open source software.
  10. Third-Party Software: Ensure that any third-party software or libraries used by the open source software are secure and comply with enterprise security policies.

In the following detailed sections, we will explore some of the Google Cloud services that support each of the above checklist items. Note that you do not necessarily need to apply all the indicated services; rather, exercise judgment as to the appropriate one for addressing each checklist item based on your unique enterprise circumstances.

Authentication, Authorization and Access Control Policies

Ensure that the open source software is integrated with enterprise authentication and authorization mechanisms, such as Active Directory, LDAP, or SAML, to authenticate users and control access to resources.

As Ray doesn’t have fine-grained access control built in, we have to rely on the access controls provided by the infrastructure and make sure only authorized and authenticated individuals are able to manage Ray clusters and interact with the Ray endpoints.

Google Cloud services and features to help here:

  • Cloud Identity and Google Workspace provides the authentication layer for the infrastructure and can be integrated with 3rd party identity providers
  • Private GKE Clusters ensure that the cluster nodes and control plane are not publicly accessible
  • Cloud IAM provides fine grained access control to Cloud resources and can be used to restrict who can manage and access GKE clusters
  • Kubernetes RBAC integrated with Cloud IAM provides fine grained access control over Kubernetes objects
  • Identity Aware Proxy provides zero-trust security and provides endpoint protection for published HTTP/S endpoints like the Ray dashboard.
  • VPC Service Controls provides an extra layer of control with a defense-in-depth approach for multi-tenant services that helps protect service access from both insider and outsider threats. For example VPC SC can limit access to your BigQuery dataset storing sensitive data so only your GKE cluster hosting Ray would be able to access it, and even leaked credentials outside your VPC would be blocked.

You can see an example of exposing the Ray dashboard with authentication (Identity Aware Proxy) in the deployment section.

Encryption

Ensure that the open source software uses encryption mechanisms, such as TLS, SSL, or SSH, to protect data in transit and at rest.

Google Cloud provides encryption at rest and encryption in transit by default.

Google Cloud services and features to help here:

  • If you need more control over the encryption keys you may use Customer Managed Encryption Keys (CMEK) together with our managed services that support it, optionally combined with external keys through External Key Manager (EKM).
  • Both GCE and GKE support CMEK for boot disks and additional data disks.
  • HTTPS Load Balancing provides TLS termination and we also provide managed SSL certificates for your domains.

You can see an example of exposing the Ray dashboard with encryption (HTTPS) with Google Managed certificates in the deployment section.

Additionally Ray supports mTLS.
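As a sketch, Ray’s mTLS support is configured through environment variables on the head and worker processes; the certificate paths below are assumptions for illustration, and in a KubeRay deployment they would typically point at a mounted Kubernetes Secret:

```shell
# Enable Ray's TLS authentication on every Ray process (head and workers).
export RAY_USE_TLS=1
# Per-node certificate and private key (assumed mount path).
export RAY_TLS_SERVER_CERT=/etc/ray/tls/tls.crt
export RAY_TLS_SERVER_KEY=/etc/ray/tls/tls.key
# CA certificate used to verify peers.
export RAY_TLS_CA_CERT=/etc/ray/tls/ca.crt
```

With these set, gRPC traffic between Ray components is mutually authenticated and encrypted.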

Vulnerability Assessment

Perform a vulnerability assessment of the open source software and address any identified vulnerabilities through the use of the following Google Cloud services and features:

  • Container Analysis provides vulnerability information for the container images in Container Registry and Artifact Registry. Container Analysis scans and extracts information about the system packages in new images when they’re uploaded.
  • Binary Authorization is a deploy-time security control that ensures only trusted container images are deployed on Google Kubernetes Engine (GKE). With Binary Authorization, you can require images to be signed by trusted authorities during the development process and then enforce signature validation when deploying.
  • Binary Authorization can block the deployment of container images with known vulnerabilities through the integration with Container Analysis.

Patch Management

Implement a patch management process to ensure that security patches and updates are regularly applied to the open source software.

Google regularly patches GKE for known vulnerabilities, covering both Kubernetes and the underlying operating system. Google provides end-to-end support for Container-Optimized OS (COS), which is based on Chromium OS. Container-Optimized OS implements several security design principles to provide a well-configured platform for running production services.

On the application side, make sure to have CI/CD processes in place to release quickly in case a vulnerability is discovered. This should include building and publishing a new container image version, automated testing and deployment.

Network Segmentation and Firewall Rules

Segment the network to limit the exposure of the open source software to potential attacks and ensure that it is isolated from other critical enterprise systems.

Google Cloud services and features that can address this item are as follows:

  • Google Cloud supports complex network architectures including the hub-and-spoke model. You can have dedicated isolated projects and / or VPCs for specific workloads.
  • Firewall rules can be specified on different levels and enforced across a Cloud Organization.
  • Kubernetes Network Policy provides fine grained access control within the workloads deployed on the GKE clusters.
  • Anthos Config Management enables platform operators to automatically deploy shared environment configurations and enforce approved security policies across Kubernetes clusters.

Monitoring and Logging

Implement monitoring and logging mechanisms to detect and respond to security incidents and ensure compliance with security policies.

As part of Cloud Audit Logging, Google Cloud services write audit logs to help you answer the questions, “Who did what, where, and when?” within your Google Cloud resources. Your Google Cloud projects contain only the audit logs for resources that are directly within the Cloud project.

The following audit logs are available for GKE:

  • Kubernetes: These logs are generated by the Kubernetes API Server component and they contain information about actions performed using the Kubernetes API. For example, any changes you make on a Kubernetes resource by using the kubectl command are recorded by the k8s.io service. Kubernetes audit log entries are useful for investigating suspicious API requests, for collecting statistics, or for creating monitoring alerts for unwanted API calls.
  • Kubernetes Engine: These logs are generated by the GKE Control Plane and contain information about actions performed using the GKE API. For example, any changes you perform on a GKE cluster configuration using a gcloud CLI are recorded by the container.googleapis.com service.
  • Container Security API (GKE Security Posture): These logs are generated by GKE security posture dashboard activity and contain information about data retrieved using the dashboard.

User Training

Train enterprise users on security best practices and policies related to the use of the open source software.

There are many certifications, training, hands-on labs available for Google Cloud. Our Security foundations blueprint is an opinionated and informative whitepaper outlining many aspects of security on Google Cloud.

Third-Party Software

Ensure that any third-party software or libraries used by the open source software are secure and comply with enterprise security policies.

Google Cloud services and features to help here:

  • Software Delivery Shield is a fully-managed, end-to-end software supply chain security solution. It provides a comprehensive and modular set of capabilities and tools across Google Cloud services that developers, DevOps, and security teams can use to improve the security posture of the software supply chain.

How to run Ray on GKE clusters in a policy-constrained Cloud Organization

This section describes typical ways organizations restrict the use of cloud resources.

Enterprise environments are by nature secure and thus policy constrained. Deploying Ray on GKE clusters is somewhat simpler than on GCE, as it doesn’t require SSH tunnels or a one-off setup sequence: pre-built container images and Kubernetes provide many aspects out of the box, such as networking, deployment, and some security features.

Potential Restrictions

Below are aspects to consider for GKE cluster creation, based on the best practices specified in the Security Foundations Architecture document.

Organization Policy Service

The policy constraints applied to VMs are also applied to GKE clusters, as the node pools are essentially groups of VMs.

The following policy constraints restrict the way VMs can be created and / or accessed. Along with the restriction you can see the corresponding flags / settings for GKE:

  • Google Cloud Platform — Resource Location Restriction (constraints/gcp.resourceLocations) — the GKE cluster must be located in one of the allowed locations
  • Compute Engine — Shielded VMs (constraints/compute.requireShieldedVm) — Enable Shielded nodes feature upon node pool creation
  • Compute Engine — Require OS Login (constraints/compute.requireOsLogin) — only private GKE clusters support OS Login
  • Compute Engine — Restrict Non-Confidential Computing (constraints/compute.restrictNonConfidentialComputing) — Enable GKE confidential node pool in case of a restriction
  • Compute Engine — Define allowed external IPs for VM instances (constraints/compute.vmExternalIpAccess) — Create a private GKE cluster to satisfy this constraint
  • Identity and Access Management — Disable Workload Identity Cluster Creation (constraints/iam.disableWorkloadIdentityClusterCreation) — This enforces that all new GKE clusters have Workload Identity disabled at creation time. — The recommendation is to have Workload Identity enabled.
  • Google Cloud Platform — Restrict which services may create resources without CMEK (constraints/gcp.restrictNonCmekServices) — Configure CMEK key for the node pool’s boot disk if the restriction is enforced for the project.
  • Google Cloud Platform — Restrict which projects may supply KMS CryptoKeys for CMEK (constraints/gcp.restrictCmekCryptoKeyProjects) — The configured CMEK key needs to be located in one of the allowlisted projects.

VPC firewall rules and firewall policies

GKE by default allows full-mesh connectivity within the clusters by automatically managing the required firewall rules.

Kubernetes Network Policies

Kubernetes Network Policies allow the fine grained configuration of the network access control between pods within the cluster. They can allow or deny the network connectivity from / to pods.

In GKE, there are two ways of enabling network policies:

  • Enabling the network policy feature (which will use Calico for network policies)
  • Enabling Dataplane V2 (which will use the DPv2 feature built on top of Cilium)

By default connections are not restricted, but there may be rules applied based on the security requirements of the organization.

Specific policies may need to be created to make sure the communication between the Ray pods is working. The Ray documentation has a specific page for explaining the networking aspects.
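For example, a minimal NetworkPolicy allowing the pods of a Ray cluster to talk to each other could look like the following sketch. The `ray.io/cluster` label and the cluster name `raycluster-autoscaler` are assumptions based on the labels KubeRay applies to the pods it manages; a real policy would also need rules for the KubeRay operator and any clients:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ray-intra-cluster
  namespace: default
spec:
  # Select all pods belonging to this Ray cluster.
  podSelector:
    matchLabels:
      ray.io/cluster: raycluster-autoscaler
  policyTypes:
  - Ingress
  ingress:
  # Allow ingress only from pods of the same Ray cluster.
  - from:
    - podSelector:
        matchLabels:
          ray.io/cluster: raycluster-autoscaler
```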

Cloud Identity and Access Management (IAM)

Identity and Access Management (IAM) lets administrators authorize who can take action on specific resources, giving you full control and visibility to manage Google Cloud resources centrally. For enterprises with complex organizational structures, hundreds of workgroups, and many projects, IAM provides a unified view into security policy across your entire organization, with built-in auditing to ease compliance processes.

To manage (create / delete / update) GKE clusters you need to have the Kubernetes Engine Cluster Admin role.

If the cluster is already created you will need the Kubernetes Engine Developer role to interact with the cluster. This role does not grant the permissions required for creating / deleting / updating clusters, it allows you to interact with the Kubernetes APIs.

Deploying the Ray operator additionally requires permissions to create Role, RoleBinding, ClusterRole, and ClusterRoleBinding objects. These permissions are granted by the Kubernetes Engine Admin built-in IAM role. Once the Ray operator is deployed, the Kubernetes Engine Developer role is sufficient to interact with the Kubernetes cluster, including the RayCluster CRDs.
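As an illustrative sketch (the project ID and user emails are hypothetical), these roles can be granted with gcloud:

```shell
# Cluster administrators: create / delete / update GKE clusters
# and deploy the Ray operator (Kubernetes Engine Admin).
gcloud projects add-iam-policy-binding my-project \
  --member "user:platform-admin@my-example-domain.org" \
  --role "roles/container.admin"

# Day-to-day users: interact with the Kubernetes APIs, including
# RayCluster objects (Kubernetes Engine Developer).
gcloud projects add-iam-policy-binding my-project \
  --member "user:ml-engineer@my-example-domain.org" \
  --role "roles/container.developer"
```

These are configuration commands against a live project, so they are shown for orientation rather than as a runnable script.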

Cloud Identity and Access Management (IAM) — Calling Google APIs from the Ray workloads

Recommended mode: Workload Identity

Ray workers running on GKE might need access to certain Google Cloud APIs such as BigQuery API, Cloud Storage API or Machine Learning APIs.

Workload Identity allows a Kubernetes pod in your GKE cluster to act as an IAM service account and automatically authenticate as the IAM service account when accessing Google Cloud APIs. Using Workload Identity allows you to assign distinct, fine-grained identities and authorization for each application in your cluster (instead of granting a potentially wide range of permissions to the node’s service account). This mapping is done by annotating the GKE Service Account and granting IAM roles to the Google Service Account.

Authenticating to Google Cloud services from your code works the same way as authenticating through the Compute Engine metadata server: Workload Identity emulates the instance metadata server and performs the token exchange for you. Existing applications that use Application Default Credentials through the metadata server, such as applications using the Google Cloud client libraries, should work without modification. The only noticeable difference is that the API calls use the service account associated with the Ray worker pod’s Kubernetes service account instead of the service account associated with the GKE node.
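A sketch of the two-way mapping between a Kubernetes service account and a Google service account (all names here are hypothetical):

```shell
# Allow the Kubernetes service account "ray-ksa" in namespace "default"
# to impersonate the Google service account "ray-gsa".
gcloud iam service-accounts add-iam-policy-binding \
  ray-gsa@my-project.iam.gserviceaccount.com \
  --role "roles/iam.workloadIdentityUser" \
  --member "serviceAccount:my-project.svc.id.goog[default/ray-ksa]"

# Annotate the Kubernetes service account so GKE knows which
# Google service account to hand out tokens for.
kubectl annotate serviceaccount ray-ksa --namespace default \
  iam.gke.io/gcp-service-account=ray-gsa@my-project.iam.gserviceaccount.com
```

The Ray pod specs would then reference `ray-ksa` as their `serviceAccountName`.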

You may want to enforce that Ray pods get scheduled only on Workload Identity enabled nodes. To do so, add the nodeSelector to the ray head / worker specifications in the RayCluster resource:

apiVersion: ray.io/v1alpha1
kind: RayCluster
spec:
  headGroupSpec:
    template:
      spec:
        nodeSelector:
          iam.gke.io/gke-metadata-server-enabled: "true"
  workerGroupSpecs:
  - template:
      spec:
        nodeSelector:
          iam.gke.io/gke-metadata-server-enabled: "true"

Alternative / not recommended: Service Account keys

You may download service account keys and provide them to the workers. As service account keys are long-lived credentials, extra care must be taken to secure them, and you should have plans for key rotation in the unfortunate event of a leak.

Alternative: pods identified as the GKE Node’s Service Account

In the traditional setup GKE Node Pools are associated with an IAM service account and all pods running on the same node share this service account. If you need to work in this mode the recommendation is to isolate the ray workers on a dedicated node pool and make sure that the service account of the target node pool has the least required IAM permissions assigned.

If you do not specify a service account during node pool creation, GKE uses the Compute Engine default service account of the project containing the cluster, which may have overly broad permissions. This violates the principle of least privilege and is inappropriate for multi-tenant clusters.
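A sketch of creating such a dedicated node pool with its own least-privilege service account and a taint to keep other workloads off it (the pool, cluster, and service account names are hypothetical):

```shell
gcloud container node-pools create ray-workers \
  --cluster my-gke-cluster --zone us-central1-c \
  # Dedicated, minimally privileged service account for Ray nodes.
  --service-account ray-node-sa@my-project.iam.gserviceaccount.com \
  # Taint + label so only Ray pods (with a matching toleration and
  # nodeSelector) are scheduled onto this pool.
  --node-taints dedicated=ray:NoSchedule \
  --node-labels dedicated=ray
```

The Ray head / worker pod specs would then carry the corresponding toleration and a `dedicated: ray` nodeSelector.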

Architecture

The diagram below shows the high level architecture of Ray deployed on GKE, where external access is granted through Identity Aware Proxy, and internal access is available through an Internal Load Balancer. Services hosted within the VPC of the cluster like Vertex AI Workbench can access the services exposed by Ray using an internal IP address.

[Figure 1] High level Deployment Architecture of Ray on GKE

GKE cluster deployment

When you create a cluster in GKE, you do so by using one of the following modes of operation:

Autopilot: a mode of operation in which Google manages your cluster configuration, including your nodes, scaling, security, and other preconfigured settings. Autopilot clusters are optimized to run most production workloads, and provision compute resources based on your Kubernetes manifests. The streamlined configuration follows GKE best practices and recommendations for cluster and workload setup, scalability, and security.

Standard: Provides advanced configuration flexibility over the cluster’s underlying infrastructure. For clusters created using the Standard mode, you determine the configurations needed for your production workloads. In the Standard mode you need to configure the GKE cluster in advance to accommodate Ray worker pods of the desired size: you need to make sure autoscaling is configured and you have the right node pool sizes to run the ray worker pods.

Create a Standard GKE cluster with Cluster Autoscaler, satisfying the constraints:

gcloud container clusters create "my-gke-cluster" \
  --zone "us-central1-c" \
  --node-locations "us-central1-c" \
  --machine-type "n2d-standard-2" \
  --enable-confidential-nodes \
  --service-account "gke-node-sa@[project_id].iam.gserviceaccount.com" \
  --enable-private-nodes \
  --workload-pool "[project_id].svc.id.goog" \
  --enable-shielded-nodes --shielded-secure-boot \
  --boot-disk-kms-key "projects/[KMS key project]/locations/[keyring location]/keyRings/[keyring]/cryptoKeys/[key]" \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=4 \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM

Get cluster credentials:

$ gcloud container clusters get-credentials --internal-ip my-ray-cluster --zone us-central1-c
Fetching cluster endpoint and auth data.
kubeconfig entry generated for my-ray-cluster.

KubeRay operator and Ray cluster deployment

Check out KubeRay from GitHub:

$ git clone https://github.com/ray-project/kuberay
$ cd kuberay/
$ git checkout release-0.4

Deploy the Kuberay operator (cluster scoped):

$ kubectl create -k ray-operator/config/default
namespace/ray-system created
customresourcedefinition.apiextensions.k8s.io/rayclusters.ray.io created
customresourcedefinition.apiextensions.k8s.io/rayjobs.ray.io created
customresourcedefinition.apiextensions.k8s.io/rayservices.ray.io created
serviceaccount/kuberay-operator created
service/kuberay-operator created
deployment.apps/kuberay-operator created
role.rbac.authorization.k8s.io/kuberay-operator-leader-election created
clusterrole.rbac.authorization.k8s.io/kuberay-operator created
rolebinding.rbac.authorization.k8s.io/kuberay-operator-leader-election created
clusterrolebinding.rbac.authorization.k8s.io/kuberay-operator created

Check operator status:

$ kubectl get pods -n ray-system
NAME                                READY   STATUS    RESTARTS   AGE
kuberay-operator-565d6bd64f-2tsgh   1/1     Running   0          15m

$ kubectl logs kuberay-operator-565d6bd64f-2tsgh -n ray-system
2023-02-06T13:03:32.220Z INFO controller.raycluster-controller Starting workers {"reconciler group": "ray.io", "reconciler kind": "RayCluster", "worker count": 1}

Deploy a cluster and check status:

$ kubectl apply -f ray-operator/config/samples/ray-cluster.autoscaler.yaml
raycluster.ray.io/raycluster-autoscaler configured

$ kubectl get pods
NAME                                             READY   STATUS    RESTARTS   AGE
raycluster-autoscaler-head-clqmp                 2/2     Running   0          6h32m
raycluster-autoscaler-worker-small-group-g7tfd   1/1     Running   0          6h32m

$ kubectl get svc
NAME                             TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                       AGE
raycluster-autoscaler-head-svc   ClusterIP   10.36.8.207   <none>        6379/TCP,8265/TCP,10001/TCP   6h32m

You can access the ray head service using its ClusterIP address from within the GKE cluster. If you want to access the cluster from a VM on the same VPC, you can expose the service through an Internal Load Balancer.

Sample Service definition:

apiVersion: v1
kind: Service
metadata:
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
  name: raycluster-autoscaler-head-ilb-svc
  namespace: default
spec:
  ports:
  - appProtocol: tcp
    name: gcs
    port: 6379
    protocol: TCP
    targetPort: 6379
  - appProtocol: tcp
    name: dashboard
    port: 8265
    protocol: TCP
    targetPort: 8265
  - appProtocol: tcp
    name: client
    port: 10001
    protocol: TCP
    targetPort: 10001
  selector:
    app.kubernetes.io/created-by: kuberay-operator
    app.kubernetes.io/name: kuberay
    ray.io/cluster: raycluster-autoscaler
    ray.io/identifier: raycluster-autoscaler-head
    ray.io/node-type: head
  sessionAffinity: None
  type: LoadBalancer

Exposing the dashboard in a secure way

Identity-Aware Proxy provides zero-trust security and endpoint protection out of the box, and GKE supports enabling Cloud Identity-Aware Proxy for your endpoint via the BackendConfig custom resource. Google-managed certificates provide automatically provisioned SSL certificates for your endpoints. Cloud DNS provides name resolution for your private and public DNS zones, both for externally registered domains and Cloud Domains registrations.

$ gcloud compute addresses create ray-dashboard-ip --global

$ gcloud compute addresses describe ray-dashboard-ip --global --format "get(address)"
a.b.c.d

$ gcloud dns record-sets create "ray.my-example-domain.org" \
  --zone my-example-domain \
  --rrdatas='[a.b.c.d / reserved IP address]' \
  --type A \
  --ttl 300

To obtain the client_id and client_secret you need to create the OAuth credential. For internal applications, where only users from within your organization have access, this credential can be created programmatically. For external applications, additional manual steps are needed.
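For internal applications, a sketch of the programmatic path using gcloud (the project ID, brand ID, application title, and support email are hypothetical):

```shell
# Create the OAuth consent "brand" for the project (one per project).
gcloud iap oauth-brands create \
  --application_title="Ray Dashboard" \
  --support_email=admin@my-example-domain.org

# Create an IAP OAuth client under that brand; the command output
# contains the client_id and client_secret used below.
gcloud iap oauth-clients create \
  projects/my-project/brands/[brand_id] \
  --display_name=ray-iap-client
```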

$ kubectl create secret generic ray-iap-client \
  --from-literal=client_id=[client_id_key] \
  --from-literal=client_secret=[client_secret_key]

ray-backendconfig.yml

apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: ray-backendconfig
  namespace: default
spec:
  iap:
    enabled: true
    oauthclientCredentials:
      secretName: ray-iap-client

ray-managedcertificate.yml

apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata:
  name: ray-managed-cert
spec:
  domains:
  - ray.my-example-domain.org

raycluster-autoscaler-head-iap-svc.yml (instead of modifying the head service created by Ray, make a copy and add the backend-config annotation; this prevents the Ray operator from reconciling away the manual modification)

apiVersion: v1
kind: Service
metadata:
  annotations:
    cloud.google.com/backend-config: '{"ports": {"8265":"ray-backendconfig"}}'
  name: raycluster-autoscaler-head-iap-svc
  namespace: default
spec:
  ports:
  - appProtocol: tcp
    name: dashboard
    port: 8265
    protocol: TCP
    targetPort: 8265
  selector:
    app.kubernetes.io/created-by: kuberay-operator
    app.kubernetes.io/name: kuberay
    ray.io/cluster: raycluster-autoscaler
    ray.io/identifier: raycluster-autoscaler-head
    ray.io/node-type: head
  sessionAffinity: None
  type: ClusterIP

ray-ingress.yml

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-managed-cert-ingress
  annotations:
    kubernetes.io/ingress.global-static-ip-name: ray-dashboard-ip
    networking.gke.io/managed-certificates: ray-managed-cert
    kubernetes.io/ingress.class: "gce"
spec:
  defaultBackend:
    service:
      name: raycluster-autoscaler-head-iap-svc
      port:
        number: 8265

After configuration and certificate provisioning are complete, unauthenticated users will be redirected to a login prompt where they can use their accounts. If the account has the IAP-Secured Web App User role at the project or backend level, it will be able to access the exposed dashboard; otherwise access is denied.

Testing (submit a job)

We need to note the IP address assigned to the Internal Load Balancer representing the Service. Internal Load Balancer IPs are accessible from within the VPC and from the same region where they were created. Global access to the IP can be enabled by setting the required annotation on the Service.
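For example, global access is controlled by an annotation on the internal LoadBalancer Service; the fragment below assumes you are extending the metadata of an ILB Service like the one created earlier:

```yaml
metadata:
  annotations:
    networking.gke.io/load-balancer-type: "Internal"
    # Allow clients from any region of the VPC, not just the
    # region where the internal load balancer was created.
    networking.gke.io/internal-load-balancer-allow-global-access: "true"
```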

$ kubectl get svc
NAME                                 TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)                                         AGE
raycluster-autoscaler-head-ilb-svc   LoadBalancer   10.36.14.82   10.128.15.230   6379:32309/TCP,8265:31530/TCP,10001:30977/TCP   15h
raycluster-autoscaler-head-svc       ClusterIP      10.36.8.207   <none>          6379/TCP,8265/TCP,10001/TCP                     22h

Set the address in the RAY_ADDRESS environment variable. Alternatively this address can be supplied to the ray command via --address.

$ export RAY_ADDRESS=http://10.128.15.230:8265

Submit the job (python code):

The below example has been taken from the Ray example repository.

$ ray job submit -- python -c "$(cat monte-carlo-pi.py)"
Job submission server address: http://10.128.15.230:8265

-------------------------------------------------------
Job 'raysubmit_cWRyyLnC6FmAeHkA' submitted successfully
-------------------------------------------------------

Progress: 100%
Estimated value of π is: 3.14158984

------------------------------------------
Job 'raysubmit_cWRyyLnC6FmAeHkA' succeeded
------------------------------------------
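For reference, a minimal Monte Carlo π estimator along these lines might look like the sketch below. This is not the exact repository file; in the real Ray job the sampling function would be decorated with @ray.remote and the batches dispatched as parallel tasks, as shown in the comment:

```python
import random


def sample_batch(num_samples: int, seed: int) -> int:
    """Count uniform points in the unit square that land inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return inside


def estimate_pi(num_batches: int = 10, batch_size: int = 100_000) -> float:
    # In the Ray version, sample_batch would be a @ray.remote task and the
    # batches would run in parallel on the workers:
    #   results = ray.get([sample_batch.remote(batch_size, i) for i in range(num_batches)])
    results = [sample_batch(batch_size, i) for i in range(num_batches)]
    # Area ratio of quarter circle to unit square is pi/4.
    return 4.0 * sum(results) / (num_batches * batch_size)


if __name__ == "__main__":
    print(f"Estimated value of pi is: {estimate_pi():.5f}")
```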

Conclusion

In this article we have covered the following key points:

  • The importance of adopting open source software (OSS) in enterprise environments.
  • The challenges of implementing OSS in a secure enterprise environment.
  • A checklist of security and networking considerations for implementing OSS in an enterprise environment.
  • A detailed example of how to deploy Ray, a popular open source framework for distributed computing, on Google Kubernetes Engine (GKE) in a secure and scalable way.

We hope this article provides a useful checklist for enterprises that seek to implement distributed computing OSS in a secure and compliant manner.

In addition to the specific security and networking considerations discussed in this article, there are a number of other factors that enterprises should consider when adopting OSS. These include:

  • The maturity and stability of the OSS project.
  • The level of community support for the OSS project.
  • The licensing terms of the OSS project.
  • The potential impact of the OSS project on the enterprise’s existing IT infrastructure.

Carefully weigh these factors, as you make informed decisions about which OSS projects to adopt and how to integrate them into your Enterprise scale IT production environments.

As the use of OSS continues to grow, enterprises that are able to successfully adopt and integrate OSS into their IT environments will be well-positioned to maintain a set of open options and be able to pivot to new capabilities and maintain competitive advantage.


Ali Arsanjani

Director Google, AI/ML & GenAI| EX: WW Tech Leader, Chief Principal AI/ML Solution Architect, AWS | IBM Distinguished Engineer and CTO Analytics & ML