Achieving True Reliability & Disaster Recovery for Mission Critical Apps


Achieving True Reliability & Disaster Recovery for Mission Critical Apps
Oleg Chunikhin, CTO

Introductions
Oleg Chunikhin, CTO, Kublr
- 20 years in software architecture & development
- Working w/ Kubernetes since its release in 2015
- CTO at Kublr, an enterprise-ready container management platform
Twitter: @olgch; @kublr
Like what you hear? Tweet at us!

[Slide: platform architecture diagram, Kubernetes and infrastructure at the core, surrounded by platform concerns: observability, API, security, usage reporting, container registry, container runtime, CI/CD, RBAC, IAM, air gap, TLS certificate rotation, audit, and app management]

Building a Reliable System with Kubernetes
Day 1: trivial, K8S will restart my pod!
Day 30: this is not so simple.
- What Kubernetes does, can, and doesn't do
- Full stack reliability: tools and approaches

K8S Architecture Refresher: Components
[Diagram: a Kubernetes cluster with master components (API server, controller manager, scheduler), etcd and its data, node agents (kubelet) with a container runtime, the overlay network, cluster DNS, and the kubectl client]
The master, agent, etcd, API, overlay network, and DNS

K8S Architecture Refresher: API Objects
[Diagram: a cluster with Node 1 and Node 2 running pods (Pod A-1, Pod B-1, ...), services SrvA (10.7.0.1) and SrvB (10.7.0.3), and a PersistentVolume]
Nodes, pods, services, and persistent volumes

K8S Reliability Tools: Probes & Controllers
Pod probes
- Liveness and readiness checks: TCP, HTTP(S), exec
Controllers
- ReplicaSet: maintain a specific number of identical replicas
- Deployment: ReplicaSet + update strategy, rolling update
- StatefulSet: Deployment + replica identity, persistent volume stability
- DaemonSet: maintain identical replicas on each node (of a specified set)
- Operators
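The probes and controllers above combine naturally in a single manifest. A minimal sketch (the names, image, and port are illustrative, not from the talk): a Deployment keeps three replicas alive, the liveness probe restarts a hung container, and the readiness probe keeps unready pods out of Service endpoints.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                     # ReplicaSet maintains 3 identical pods
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: nginx:1.25       # any HTTP-serving image works here
          ports:
            - containerPort: 80
          livenessProbe:          # failed check -> container is restarted
            httpGet: { path: /, port: 80 }
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:         # failed check -> pod removed from endpoints
            tcpSocket: { port: 80 }
            periodSeconds: 5
```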

K8S Reliability Tools: Resources & Scheduling
- Resource framework
  - Standard: CPU, memory, disk
  - Custom: GPU, FPGA, etc.
- Requests and limits
- Kube and system reservations; no swap
- Pod eviction and disruption budget (resource starving)
- Pod priority and preemption (critical pods)
- Affinity, anti-affinity, node selectors & matchers
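Requests, limits, and disruption budgets can be sketched as follows (names and values are illustrative): requests are what the scheduler reserves, limits are hard runtime caps, and a PodDisruptionBudget bounds voluntary disruptions such as node drains.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: worker
spec:
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sleep", "infinity"]
      resources:
        requests:          # reserved by the scheduler when placing the pod
          cpu: "250m"
          memory: "128Mi"
        limits:            # hard caps enforced at runtime
          cpu: "500m"
          memory: "256Mi"
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2          # keep at least 2 matching pods during voluntary disruptions
  selector:
    matchLabels: { app: web }
```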

K8S Reliability Tools: Autoscaling
- Horizontal pod autoscaler (HPA)
- Vertical pod autoscaler (VPA)
  - In-place updates: WIP (issue #5774)
- Cluster Autoscaler
  - Depends on the infrastructure provider; uses node groups (AWS ASG, Azure Scale Sets, etc.)
  - Supports AWS, Azure, GCE, GKE, OpenStack, Alibaba Cloud
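An HPA targeting a Deployment can be sketched like this (the target name and threshold are illustrative; the `autoscaling/v2` API shown here replaced the `v2beta2` versions current when this talk was given):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:            # the workload the HPA scales
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when mean CPU use exceeds 70% of requests
```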

Full Stack Reliability: the Layers
- Operations: monitoring, log collection, alerting, etc.
- Lifecycle: CI/CD, SCM, binary repo, etc.
- Container management: registry, scanning, governance, etc.
- Container persistence: cloud native storage, DB, messaging
- Container orchestrator: Kubernetes
  - "Essentials": overlay network, DNS, autoscaler, etc.
  - Core: K8S etcd, master, worker components
  - Container engine: Docker, CRI-O, etc.
- OS: kernel, network, devices, services, etc.
- Infrastructure: "raw" compute, network, storage

Architecture 101
- Layers are separate and independent
- Disposable/"restartable" components
- Re-attachable dependencies (including data)
- Persistent state is separate from disposable processes
- Pets vs. cattle: only data is allowed to be pets (ideally)

Infrastructure and OS
If a node has a problem:
- Try to fix it (pet)
- Replace or reset it (cattle)
Tools
- In-cluster: node-problem-detector, Weaveworks kured, etc.
  - Detect hardware, kernel, service, and container runtime issues; reboot
- Infrastructure provider automation: AWS ASG, Azure Scale Set, etc.
  - External node auto-recovery logic
- Custom infrastructure provider API
- Cluster management solution
- (Future) Cluster API

Kubernetes Components: Auto-Recovery
Components
- etcd
- Master: API server, controller manager, scheduler
- Worker: kubelet, kube-proxy
- Container runtime: Docker, CRI-O
Already 12-factor:
- Monitor liveness, automate restart
- Run as services
- Run as static pods
Dependencies to care about
- etcd data
- K8S keys and certificates
- Configuration
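The "run as static pods" option can be sketched with a trimmed scheduler manifest (the image tag and flags are illustrative; real kubeadm-generated manifests carry many more flags). A file dropped into the kubelet's static pod directory is started and restarted by the kubelet itself, with no API server required, which is exactly why control-plane components can self-recover this way.

```yaml
# Placed in the kubelet's static pod manifest directory
# (/etc/kubernetes/manifests by default). The kubelet watches
# this directory and keeps the pod running on its own.
apiVersion: v1
kind: Pod
metadata:
  name: kube-scheduler
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
    - name: kube-scheduler
      image: registry.k8s.io/kube-scheduler:v1.29.0   # illustrative version
      command:
        - kube-scheduler
        - --kubeconfig=/etc/kubernetes/scheduler.conf
      livenessProbe:                 # kubelet restarts the container on failure
        httpGet:
          path: /healthz
          port: 10259
          scheme: HTTPS
      volumeMounts:
        - name: kubeconfig
          mountPath: /etc/kubernetes/scheduler.conf
          readOnly: true
  volumes:
    - name: kubeconfig
      hostPath:
        path: /etc/kubernetes/scheduler.conf
        type: File
```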

Kubernetes Components: Multi-Master
K8S multi-master
- Pros: HA, scaling
- Cons: need LB (server- or client-side)
etcd cluster
- Pros: HA, data replication
- Cons: latency, ops complexity
[Diagram: several masters, each running an API server, controller manager, and scheduler, in front of a replicated etcd cluster; kubelets connect through the load balancer]
etcd data options:
- Local ephemeral
- Local persistent (survives node failure)
- Remote persistent (survives node replacement)

Container Persistence
- Persistent volumes
- Volume provisioning
- Storage categories
  - Native block storage: AWS EBS, Azure Disk, vSphere volume, attached block device, etc.
  - HostPath
  - Managed network storage: NFS, iSCSI, NAS/SAN, AWS EFS, etc.
- Some of the idiosyncrasies
  - Topology sensitivity (e.g. AZ-local, host-local)
  - Cloud provider limitations (e.g. number of attached disks)
  - Kubernetes integration (e.g. provisioning and snapshots)
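The persistent volume workflow above can be sketched as a claim plus a consuming pod (names, sizes, and the `standard` storage class are illustrative and assume a dynamic provisioner is installed):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: standard     # assumes a provisioner-backed class exists
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  containers:
    - name: db
      image: postgres:16         # illustrative stateful workload
      env:
        - name: POSTGRES_PASSWORD
          value: "example"       # demo only; use a Secret in practice
      volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumes:
    - name: data
      persistentVolumeClaim:     # re-attachable dependency: data outlives the pod
        claimName: data
```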

Cloud Native Storage
- Integrates with Kubernetes
  - CSI, FlexVolume, or native
  - Volume provisioners
  - Snapshot support
  - Runs in cluster or externally
- Approach
  - Flexible storage on top of backing storage
  - Augmenting and extending backing storage
  - Backing storage: local, managed, Kubernetes PV
- Examples: Rook/Ceph, Portworx, Nuvoloso, GlusterFS, Linstor, OpenEBS, etc.
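The CSI integration typically surfaces as a StorageClass pointing at the driver's provisioner. A heavily trimmed sketch for a Rook/Ceph RBD class (the provisioner name follows the Rook operator's namespace, and real classes need additional secret and filesystem parameters omitted here):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cns-block
provisioner: rook-ceph.rbd.csi.ceph.com   # CSI driver name; varies by install
parameters:
  clusterID: rook-ceph                    # illustrative Rook cluster namespace
  pool: replicapool                       # illustrative Ceph block pool
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer   # defer binding until pod scheduling (topology-aware)
```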

Cloud Native Storage: Rook/Ceph Example
[Diagram: a Rook-managed Ceph cluster in Kubernetes. Rook components: operator, provisioners, CSI plugins, and CephCluster/CephFilesystem custom resources. Ceph components: OSDs over raw data devices, block pool, object store and object store user, rbd-mirror. Client interfaces: S3/Swift object storage, filesystem, NFS, and block]

Middleware
- Operations: monitoring, log collection, alerting, etc.
- Lifecycle: CI/CD, SCM, binary repo, etc.
- Container management: registry, scanning, governance, etc.
Deployment options:
- Managed service
- In Kubernetes
- Deploy separately

Something Missing? Multi-Site
- Region to region; cloud to cloud; cloud to on-prem (hybrid)
- One cluster vs. cluster per location
Tasks
- Physical network connectivity: VPN, direct connection
- Overlay network connectivity: Calico BGP peering, native routing, etc.
- Cross-cluster DNS: CoreDNS
- Cross-cluster deployment: K8S federation
- Cross-cluster ingress, load balancing: K8S federation, DNS, CDN
- Cross-cluster data replication
  - Native: e.g. AWS EBS snapshot inter-region transfer
  - CNS level: e.g. Ceph geo-replication
  - Database level: e.g. Yugabyte geo-replication, sharding, etc.
  - Application level
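The cross-cluster DNS task above can be sketched with CoreDNS's `forward` plugin: lookups in the remote cluster's zone are forwarded to that cluster's DNS service. The zone name `cluster2.local` and the VIP `10.96.0.10` are illustrative assumptions and presume the remote DNS service is routable from this cluster.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    # Default zone for the local cluster (abbreviated).
    .:53 {
        errors
        kubernetes cluster.local in-addr.arpa ip6.arpa
        forward . /etc/resolv.conf
        cache 30
    }
    # Forward the remote cluster's zone to its DNS service
    # (10.96.0.10 is an illustrative, cross-site-routable VIP).
    cluster2.local:53 {
        errors
        forward . 10.96.0.10
        cache 30
    }
```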

To Recap
- Kubernetes provides robust tools for application reliability
- Recovery of the underlying infrastructure and Kubernetes components is the responsibility of the cluster operator
- Kubernetes is just one of the layers: remember Architecture 101 and assess all layers accordingly
- Middleware, and even CNS, can run in Kubernetes and be treated as regular applications to benefit from K8S capabilities
- Multi-site HA, balancing, and failover are much easier with K8S and the cloud native ecosystem, but still require careful planning!

Q&A

Thank you!
Take Kublr for a test drive! kublr.com/deploy (free non-production license)
Oleg Chunikhin, CTO, Kublr
oleg@kublr.com

