Kubernetes On NVIDIA GPUs


KUBERNETES ON NVIDIA GPUS
DU-09016-001 v1.0 | November 2018
Installation Guide

TABLE OF CONTENTS
Chapter 1. Introduction
Chapter 2. Supported Platforms
Chapter 3. Installing Kubernetes
  3.1. Master Nodes
    3.1.1. Installing and Running Kubernetes
    3.1.2. Checking Cluster Health
    3.1.3. DGX Station
      3.1.3.1. For NetworkManager
      3.1.3.2. For systemd-resolved
  3.2. Worker Nodes
    3.2.1. Installing and Running Kubernetes
    3.2.2. Check Your Cluster State
  3.3. Run GPU Tasks
    3.3.1. Enable GPUs
    3.3.2. Run a GPU Workload
Chapter 4. Cluster Customization
Chapter 5. Using Distribution-Specific Features
  5.1. Exposing GPU Attributes In a Node
  5.2. Scheduling GPUs By Attribute
  5.3. Setting Up Monitoring
  5.4. CRI-O Runtime Preview Feature Support
    5.4.1. Install CRI-O
    5.4.2. Run the CRI-O Service
    5.4.3. Configure the Kubelet to Use CRI-O
    5.4.4. Run a GPU Task
Chapter 6. Troubleshooting
  6.1. Package Installation
  6.2. Cluster Initialization
  6.3. Monitoring Issues
Appendix A. DGX Systems
  A.1. DGX and NGC Images
  A.2. Install NVIDIA Container Runtime for Docker 2.0
    A.2.1. Uninstalling Old Versions
    A.2.2. Setting Up the Repository
    A.2.3. Install NVIDIA Container Runtime
Appendix B. Upgrading the Cluster
  B.1. Upgrading the Cluster from 1.9.7 to 1.9.10
    B.1.1. Upgrading the Control Plane
    B.1.2. Finalizing the Upgrade
  B.2. Upgrading the Cluster from 1.9.10 to 1.10.8
    B.2.1. Upgrading the Control Plane
    B.2.2. Finalizing the Upgrade
Appendix C. Support

Chapter 1. INTRODUCTION

Kubernetes is an open-source platform for automating the deployment, scaling, and management of containerized applications. Kubernetes on NVIDIA GPUs includes support for GPUs and enhancements to Kubernetes so users can easily configure and use GPU resources for accelerating workloads such as deep learning. This document serves as a step-by-step guide to installing Kubernetes and using it with NVIDIA GPUs.

Chapter 2. SUPPORTED PLATFORMS

Releases of Kubernetes up to and including 1.10.8 are supported on the following platforms. Note that there are certain prerequisites that must be satisfied before proceeding to install Kubernetes. These are detailed in the "Before You Begin" section.

On-Premises
‣ DGX-1 Pascal and Volta with OS Server v3.1.6
‣ DGX Station with OS Desktop v3.1.6

Cloud
‣ NVIDIA GPU Cloud virtual machine images available on Amazon EC2 and Google Cloud Platform

Cluster Topology
‣ One master CPU or GPU node
‣ At least one worker GPU node

Before You Begin
‣ Ensure that NVIDIA drivers are loaded.
‣ Ensure that a supported version of Docker is installed.
‣ Ensure that NVIDIA Container Runtime for Docker 2.0 is installed.
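One quick way to confirm these prerequisites on each node is sketched below; the commands are generic checks, and the CUDA image tag is an illustrative assumption.

   $ nvidia-smi                        # confirms the NVIDIA driver is loaded
   $ docker --version                  # confirms Docker is installed
   $ docker info | grep -i runtime     # "nvidia" should appear among the registered runtimes
   $ docker run --rm --runtime=nvidia nvidia/cuda:9.0-base nvidia-smi   # end-to-end GPU container check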

Chapter 3. INSTALLING KUBERNETES

Kubernetes can be deployed through different mechanisms. NVIDIA recommends using kubeadm to deploy Kubernetes.

The master nodes run the control plane components of Kubernetes. These include the API Server (the front end to the kubectl CLI), etcd (which stores the cluster state), and other components. Master nodes need to be set up with the following three components, of which only the kubelet has been customized with changes from NVIDIA:
‣ Kubelet
‣ Kubeadm
‣ Kubectl

We recommend that your master nodes not be equipped with GPUs and that they run only the master components, such as the following:
‣ Scheduler
‣ API server
‣ Controller Manager

Before proceeding to install the components, check that all the Kubernetes prerequisites have been satisfied (a short sketch of the swap and resolver checks appears below):
‣ Check network adapters and required ports.
‣ Disable swap so that the kubelet works correctly.
‣ Install dependencies such as the Docker container runtime. To install Docker on Ubuntu, follow the official instructions provided by Docker.
‣ The worker nodes must be provisioned with the NVIDIA driver.
‣ Ensure that NVIDIA Container Runtime for Docker 2.0 is installed on the machine.
‣ Run ps -ef | grep resolv to determine whether NetworkManager or systemd-resolved is being used.

If you are setting up a single-node GPU cluster for development purposes, or you want to run jobs on the master nodes as well, then you must install the NVIDIA Container Runtime for Docker.
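A minimal sketch of the swap and resolver checks mentioned above, assuming a standard Ubuntu /etc/fstab layout:

   $ sudo swapoff -a                              # disable swap immediately
   $ sudo sed -i '/ swap / s/^/#/' /etc/fstab     # comment out swap entries so swap stays off after reboot
   $ ps -ef | grep resolv                         # shows whether NetworkManager or systemd-resolved is running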

3.1. Master Nodes

Install the required components on your master node.

3.1.1. Installing and Running Kubernetes
1. Add the official GPG keys and the repository list file.
      $ curl -s …gpg | sudo apt-key add -
      $ curl -s -L https://nvidia.github.io/kubernetes/gpgkey | sudo apt-key add -
      $ curl -s -L …nvidia-kubernetes.list | sudo tee …
2. Update the package index.
      $ sudo apt update
3. Install the packages.
      $ VERSION=1.10.8+nvidia
      $ sudo apt install -y kubectl=${VERSION} kubelet=${VERSION} \
        kubeadm=${VERSION} helm=${VERSION}
4. Start your cluster.
      $ sudo kubeadm init --ignore-preflight-errors=all --config /etc/kubeadm/config.yml
   You may choose to save the token and the hash of the CA certificate printed as part of the output of kubeadm init, in order to join worker nodes to the cluster later. This will take a few minutes.
5. For NGC VMIs, issue a chmod command on the newly created file.
      $ sudo chmod 644 …

3.1.2. Checking Cluster Health
Check that all the control plane components are running on the master node:
      $ kubectl get all …
(The output should list the control plane pods, such as kube-scheduler-master, in the Running state.)
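If kubectl is not yet configured for the current user, the usual kubeadm approach is to copy the admin kubeconfig into place before running these checks; a sketch using the standard kubeadm default paths (not taken from this guide):

   $ mkdir -p $HOME/.kube
   $ sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config    # kubeadm writes the admin kubeconfig here by default
   $ sudo chown $(id -u):$(id -g) $HOME/.kube/config
   $ kubectl get nodes                                        # the master should appear, Ready once networking is up
   $ kubectl get pods -n kube-system                          # kube-apiserver, etcd, scheduler, etc. should be Running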

3.1.3. DGX Station
On DGX Station (and other Ubuntu 16.04 desktop systems), there is a known issue with Kubernetes 1.9.7 and Ubuntu 16.04 Desktop where the kube-dns service will not run. To work around this issue, take the following actions, depending on the DNS resolver service you are using. In most cases for Ubuntu 16.04 desktop systems, NetworkManager is the DNS resolver service and the procedure in For NetworkManager applies.
Run ps -ef | grep resolv to determine whether NetworkManager or systemd-resolved is being used.

3.1.3.1. For NetworkManager
1. Find the active interface.
      $ route | grep '^default' | grep -o '[^ ]*$'
   (Alternatively, use ifconfig.)
2. Determine the nameservers. For <interface>, use the active interface listed in the output of the previous command.
      $ nmcli device show <interface> | grep IP4.DNS
   For example:
      $ nmcli device show enp2s0f0 | grep IP4.DNS
      IP4.DNS[3]: 192.0.2.2
3. Create the configuration file and submit it to Kubernetes with the following commands:
      $ cat config.yml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: kube-dns
        namespace: kube-system
      data:
        upstreamNameservers: |    # 3 nameservers maximum
          ["192.0.2.0", "192.0.2.1", "192.0.2.0"]
      $ kubectl create -f config.yml
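To confirm that the workaround took effect, standard kubectl queries can be used; the k8s-app=kube-dns label below is the upstream kube-dns default and an assumption here:

   $ kubectl get configmap kube-dns -n kube-system -o yaml      # upstreamNameservers should appear in the data section
   $ kubectl get pods -n kube-system -l k8s-app=kube-dns        # the kube-dns pods should reach the Running state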

3.1.3.2. For systemd-resolved
1. Add the following line to the kubelet service configuration file:
      Environment="KUBELET_RESOLVER_ARGS=--resolv-conf /run/systemd/resolve/resolv.conf"
2. Start the kubelet.
      $ sudo systemctl start kubelet

3.2. Worker Nodes

The procedures in this section do not need to be completed if the goal is a single-node cluster.

DGX and NGC Images
On DGX systems installed with nvidia-docker version 1.0.1, NVIDIA provides an option to upgrade the existing system environment to NVIDIA Container Runtime. Follow these instructions to upgrade your environment, then skip the following section and proceed to installing Kubernetes on the worker nodes.
If you are using the NGC images on AWS or GCP, then you may skip the following section and proceed to installing Kubernetes on the worker nodes.

3.2.1. Installing and Running Kubernetes
1. Add the official GPG keys and the repository list file.
      $ curl -s …gpg | sudo apt-key add -
      $ curl -s -L https://nvidia.github.io/kubernetes/gpgkey | sudo apt-key add -
      $ curl -s -L …nvidia-kubernetes.list | sudo tee …
2. Update the package index.
      $ sudo apt update
3. Install the packages.
      $ VERSION=1.10.8+nvidia
      $ sudo apt install -y kubectl=${VERSION} kubelet=${VERSION} \
        kubeadm=${VERSION} helm=${VERSION}
4. Before starting your cluster, retrieve the token and CA certificate hash you recorded from when kubeadm init was run on the master node. Alternatively, to retrieve the token, use the following command on the master node.
      $ sudo kubeadm token create --print-join-command
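If the CA certificate hash from kubeadm init was not recorded, it can be recomputed on the master node; this is the standard procedure from the upstream kubeadm documentation, assuming the default certificate path:

   $ openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | \
       openssl rsa -pubin -outform der 2>/dev/null | \
       openssl dgst -sha256 -hex | sed 's/^.* //'        # prints the value to use after sha256: in the join command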

5. Join the worker node to the cluster with a command similar to the following. (The command below is an example that will not work for your installation.)
      $ sudo kubeadm join --token <token> <master-ip>:<master-port> \
        --discovery-token-ca-cert-hash sha256:<hash> --ignore-preflight-errors=all

3.2.2. Check Your Cluster State
Run the following command on the master node and make sure your GPU worker nodes appear and their state transitions to Healthy. It may take a few minutes for the status to change.
      $ kubectl describe nodes

3.3. Run GPU Tasks

3.3.1. Enable GPUs
As part of installing Kubernetes on the worker nodes, GPUs are enabled by default. It may take up to a minute or two for GPUs to be enabled on your cluster (that is, for Kubernetes to download and run containers).
Once Kubernetes has downloaded and run containers, run the following command to see GPUs listed in the resource section:
      $ kubectl describe nodes | grep -B 3 gpu
      Capacity:
       cpu:            8
       memory:         32879772Ki
       nvidia.com/gpu: 2

3.3.2. Run a GPU Workload
Make sure that GPU support has been properly set up by running a simple CUDA container. One is provided in the artifacts you downloaded (you will need a GPU with at least 8 GB of memory). There are also other examples available in the examples directory. A sketch of a comparable pod specification follows the sample output below.
1. Start the CUDA sample workload.
      $ kubectl create -f /etc/kubeadm/examples/pod.yml
2. When the pod is running, you can execute the nvidia-smi command inside the container.
      $ kubectl exec -it gpu-pod nvidia-smi

      +-----------------------------------------------------------------------------+
      | NVIDIA-SMI 384.125                Driver Version: 384.125                   |
      |-------------------------------+----------------------+----------------------+
      | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
      | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
      |===============================+======================+======================|
      |   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
      | N/A   34C    P0    20W / 300W |     10MiB / 16152MiB |      0%      Default |
      +-------------------------------+----------------------+----------------------+

      +-----------------------------------------------------------------------------+
      | Processes:                                                       GPU Memory |
      |  GPU       PID   Type   Process name                             Usage      |
      |=============================================================================|
      |  No running processes found                                                 |
      +-----------------------------------------------------------------------------+
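The pod specification shipped at /etc/kubeadm/examples/pod.yml is not reproduced here; the following is a minimal sketch of a comparable GPU pod, with the pod name, image tag, and resource request chosen for illustration only:

   $ cat <<EOF | kubectl create -f -
   apiVersion: v1
   kind: Pod
   metadata:
     name: gpu-pod-example              # illustrative name; the shipped example uses gpu-pod
   spec:
     restartPolicy: Never
     containers:
     - name: cuda-container
       image: nvidia/cuda:9.0-base      # any CUDA-capable image works
       command: ["sleep", "100000"]     # keep the pod alive so nvidia-smi can be run inside it
       resources:
         limits:
           nvidia.com/gpu: 1            # request one GPU from the NVIDIA device plugin
   EOF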

Chapter 4. CLUSTER CUSTOMIZATION

After starting up your cluster, it is set up with a few basic utilities:
‣ The Flannel network plugin
‣ The NVIDIA Device Plugin, which allows you to enable GPU support
‣ Helm, the Kubernetes package manager
‣ The NVIDIA Monitoring Stack

Edit /etc/kubeadm/post-install.sh to change some or all of these auto-deployed utilities.

The Flannel Network Plugin
Kubernetes clusters need a pod network add-on installed. Flannel is deployed by default for multiple reasons:
‣ Recommended by Kubernetes
‣ Used in production clusters
‣ Integrates well with the CRI-O runtime
For more information and other networking options, refer to: https://kubernetes.io/podnetwork.

Helm
Helm is automatically installed and deployed with Kubernetes on NVIDIA GPUs. Helm helps you manage Kubernetes applications: through Helm charts it allows you to define, install, and upgrade even the most complex Kubernetes application. Charts are packages of pre-configured Kubernetes resources.
For more information on Helm, refer to: https://github.com/helm/helm.

Monitoring Stack
An integrated monitoring stack is deployed by default in your cluster to monitor health and get metrics from GPUs in Kubernetes.
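A few illustrative Helm 2 commands, shown only to give a feel for the workflow; the chart and release names are assumptions, not charts shipped with this distribution:

   $ helm list                                        # show releases currently deployed in the cluster
   $ helm search prometheus                           # search the configured chart repositories
   $ helm install stable/prometheus --name my-prom    # install a chart as a named release
   $ helm upgrade my-prom stable/prometheus           # upgrade the release to a newer chart version
   $ helm delete my-prom --purge                      # remove the release and its history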

The stack is deployed using Helm (the charts can be found at /etc/kubeadm/monitoring) and uses the NVIDIA Data Center GPU Manager (DCGM), Prometheus (via the Prometheus Operator), and Grafana for visualizing the various metrics.
You can change some of the configuration in the values file of the chart: …ts/exporter-node/values.yml

Tainting the Master Node (Optional)
If you are setting up a multi-node cluster and you do not want jobs to run on the master node (to avoid impacting control plane performance), set the master Kubernetes node to deny pods that cannot run on the master node, as in the sketch below:
      $ kubectl taint nodes <MasterNodeName> …
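A sketch of the conventional kubeadm master taint, assuming the default node-role key is the one intended here:

   $ kubectl taint nodes <MasterNodeName> node-role.kubernetes.io/master=:NoSchedule
   # To allow pods on the master again later, remove the taint:
   $ kubectl taint nodes <MasterNodeName> node-role.kubernetes.io/master:NoSchedule-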

Chapter 5. USING DISTRIBUTION-SPECIFIC FEATURES

Kubernetes on NVIDIA GPUs includes the following features that are not yet available in the upstream release of Kubernetes:
‣ GPU attributes exposed in a node
‣ Scheduling improvements
‣ GPU monitoring
‣ Support for the CRI-O runtime preview feature

5.1. Exposing GPU Attributes In a Node
Nodes now expose the attributes of your GPUs. These can be inspected by querying the Kubernetes API at the node endpoint. The GPU attributes currently advertised are:
‣ GPU memory
‣ GPU ECC
‣ GPU compute capabilities
Inspect GPU attributes in a node with the following commands:
      $ kubectl proxy --port 8000 &
      $ curl -s http://localhost:8000/api/v1/nodes | grep -B 7 -A 3 gpu-memory

5.2. Scheduling GPUs By Attribute
Pods can now specify device selectors based on the attributes that are advertised on the node. These can be specified at the container level. For example:

      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-pod
      spec:
        containers:
        - name: cuda-container
          image: nvidia/cuda:9.0-base

          command: ["sleep"]
          args: ["100000"]
          computeResourceRequests: ["nvidia-gpu"]
        computeResources:
        - name: "nvidia-gpu"
          resources:
            limits:
              nvidia.com/gpu: 1
          affinity:
            required:
            - key: "nvidia.com/gpu-memory"
              operator: "Gt"
              values: ["8000"]    # change value to the appropriate memory for your GPU

1. Create the pod and check its status.
      $ kubectl create -f /etc/kubeadm/examples/pod.yml
2. List the pods running in the cluster.
      $ kubectl get pods
3. Run the nvidia-smi command inside the container.
      $ kubectl exec -it gpu-pod nvidia-smi

5.3. Setting Up Monitoring
To set up monitoring, follow these steps.
1. Label the GPU nodes.
      $ kubectl label nodes <gpu-node-name> hardware-type=NVIDIAGPU
2. Ensure that the label has been applied.
      $ kubectl get nodes --show-labels
3. Install the monitoring charts. (This step is performed automatically at the end of installation.)
      $ helm install ….tgz --name prometheus-operator --namespace monitoring
      $ helm install /etc/kubeadm/monitoring/kube-prometheus-0.0.43.tgz --name kube-prometheus --namespace monitoring
4. Check the status of the pods. It may take a few minutes for the components to initialize and start running.
      $ kubectl get pods -n monitoring
   (The output should list the prometheus-operator and kube-prometheus pods in the Running state.)

5. Forward the port fo…
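A sketch of one way to forward a monitoring UI port with standard kubectl; Grafana and its default port 3000 are assumptions here, and the pod name is a placeholder:

   $ kubectl get pods -n monitoring | grep grafana                    # find the Grafana pod name
   $ kubectl port-forward -n monitoring <grafana-pod-name> 3000:3000  # then browse to http://localhost:3000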
