IP/MPLS High Availability Technologies

IP/MPLS High Availability Technologies
Pranav Dharwadkar, CRBU Product Mgr (pranavd@cisco.com)
Originator/Patent Holder of Prefix Independent Convergence
December 4th 2008
2007 Cisco Systems, Inc. All rights reserved. Cisco Confidential

Agenda
- High Availability – What is it?
- What is Cisco's view on High Availability
- IP/MPLS High Availability Technologies
- High Availability Infrastructure Tools

Quantification
Percent Availability   N-Nines   Downtime (Minutes/Year)
99%                    2-Nines   5,000 Min/Yr
99.9%                  3-Nines   500 Min/Yr
99.99%                 4-Nines   50 Min/Yr
99.999%                5-Nines   5 Min/Yr
99.9999%               6-Nines   .5 Min/Yr
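The downtime column above is rounded. As a minimal sketch, assuming a 365-day year (the function name and constant below are illustrative, not from the slides), the exact figures can be derived from the percent availability; they come out near 5,256, 526, 53, 5.3 and 0.5 minutes per year.

```python
# Minutes in a non-leap year (assumption: leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime implied by a percent availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Print the exact value for each row of the N-nines table.
    for nines, pct in enumerate([99, 99.9, 99.99, 99.999, 99.9999], start=2):
        minutes = downtime_minutes_per_year(pct)
        print(f"{nines}-Nines ({pct}%): {minutes:.2f} min/yr")
```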

Root Cause Analysis: Number of faults
From 101 failures in a large IP Network (April – June 2002)
- Other: 30%
- LC HW: 23%
- Parity - LCs: 17%
- LC SW: 10%
- Parity - RPs: 10%
- RP HW: 6%
- RP SW: 2%
- Chassis: 2%
Other category includes: Line problems, operator error, config errors, cables, etc.

Causes of Unscheduled Downtime (% respondents)
- Network operations failures: 87%
- Physical link failures: 87%
- Network hardware failures: 79%
- Network software failures: 67%
- Customer premises equipment failure: 67%
- Physical environment failures: 62%
- Congestion/overload: 44%
- Unknown: 37%
- Acts of nature: 37%
- Malicious damage: 25%
SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB

HA – 4 Factors
- Network Factor
  - Link Failures – Fiber cuts
  - Link Failures – Forwarding logic failure
  - Node Failures – HW
  - Link Failures - SW
  - Congestion
  - Security attacks
- System Factor
  - HW Failure
  - SW Failure
- Operations Factor
  - Network Operations Failure
  - Out of resource conditions
  - Sparing & Support
  - Training
- Environment Factor
  - Physical Environment Failure
  - Malicious Damage

Recipe for High Availability - Increase MTBF, Reduce MTTR
[Diagram: layered stack, top to bottom]
- Continuous Systems Operation (CSO): 99.999% Service Availability
- ISSU: In Service Software Upgrade
- Network High Availability: Fast Convergence (PIC)
- No single points of failure for unplanned or planned events: Hitless RP Switchover; Non-stop Forwarding (NSF), Non-stop Routing (NSR); Distributed Route Processor, Service Separation Architecture
- IOS XR Basics: microkernel based architecture, granular process restart, protected memory, highly modular, separation of control, data, management planes, fault tolerance, packaging model
- Hardware Design: redundancy (fabric, power, thermal, route processor, line card), high MTBF, distributed forwarding, Online Insertion Removal (OIR), parity or error correcting memory, fault insertion testing
Pranav Dharwadkar. Under Cisco NDA – Extremely Confidential. Do not distribute
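The slide title ties availability to MTBF and MTTR. A minimal sketch of the standard steady-state relation, availability = MTBF / (MTBF + MTTR), follows; the hour values are hypothetical and chosen only to show the effect of each lever.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
baseline = availability(mtbf_hours=10_000, mttr_hours=4)
longer_mtbf = availability(mtbf_hours=20_000, mttr_hours=4)   # fail half as often
faster_repair = availability(mtbf_hours=10_000, mttr_hours=2)  # recover twice as fast

if __name__ == "__main__":
    print(f"baseline:      {baseline:.6f}")
    print(f"longer MTBF:   {longer_mtbf:.6f}")
    print(f"faster repair: {faster_repair:.6f}")
```

Under this formula only the ratio MTTR/MTBF matters, so doubling MTBF and halving MTTR yield the same gain.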

HA: Hardware Design
- Redundancy built into every piece of hardware
  - No single point of failure
  - Failure of fabric, power, thermal, route processor results in immediate switchover to redundant hardware
- Hardware Memory Error Detection & Correction
  - Error correction is seamless; only traffic hitting faulty memory is affected during error correction

Hitless Expansion: Multi-chassis
- Goal: Allow incremental capacity expansion with no service loss
- Implementation: Same switch fabric architecture for standalone and multi-chassis; uses load shared redundancy
- Methodology: Take one plane out of service at a time in a standalone config, upgrade that plane to multi-chassis configuration and bring it back online
[Figures: CRS 8-plane switch fabric; single plane of 8-plane switch fabric]

Hardware High Availability
- Full Hardware Redundancy
- No Single Point of Failure
- Online Insertion & Removal (OIR) for all cards
- HW assist arbitration
- No outage on Fabric Upgrade or Failure
- ECC protected memory and R-S FEC for optical links
- Redundant Out-of-Band System Control Network

System Control Network
[Diagram: RP, LC, SC and S2 cards in LC chassis and fabric chassis, connected over FE and GE links to Gig Ether switches, with optional 10G]

No single points of Failure – Switch Fabric
[Diagram: line cards at 40 Gbps connected across fabric planes 1 of 8 through 8 of 8; software layers: Fabric Drivers & Applications, Micro Kernel and Infrastructure]
- Supports 1:N redundancy
- Allows fabric upgrade one card at a time

No Single points of Failure - RP
- HW errors detected on active RP card
- Control plane lockup on active card
- Routing Protocol Crashes on Active RP
- Costly to recover on the same node
[Diagram: Active and Standby RP]

Control Plane – Data Plane Separation
- CRS and 12K have dedicated packet forwarding hardware (SPP / ISE)
- Packet forwarding on LCs can function autonomously during control plane outages
- Packet forwarding is unaffected by:
  - ISIS, OSPF, BGP, MPLS mcast process restart
  - Infrastructure process restarts
  - RP failover
[Diagram: two RPs and two LCs linked by RP-LC communication]

Hitless Disk Replacement: Disk Mirroring
- Goal:
  - Handle disk failures without causing a RP switchover
  - Allow replacement of faulty disks
- Implementation:
  - 2 flashdisks on CRS RP extend persistent storage.
  - Disks configured as redundant pairs
  - All important data (image and config) are replicated on the 2 disks
  - Any disk outage (software or hardware issues) can be handled locally using the built-in redundancy
[Figure: CRS Route Processor with dual flashdisks]

HA – System Factor

Avoid Single Points of Failure
- Support redundant RP, and minimize or eliminate:
  - Outages caused by switchover or upgrade
  - Time spent troubleshooting the problem
  - The probability that on-site spares are missing or do not work
- Support link protection using links on different line cards, so the LC is not a single point of failure
  - e.g., ECMP
  - e.g., VRFs hosted on multiple Line Cards, Multiple Interfaces to CE
  - e.g., Link Bundles
- Support redundancy in the Switching Fabric, so the Switching Fabric is not a single point of failure
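A rough way to see why removing single points of failure helps: under the simplifying assumptions that components fail independently and that switchover is instantaneous and always succeeds, redundant (parallel) elements are unavailable only when all of them are down, while elements in series must all be up. The sketch below uses hypothetical availability figures; the function names are illustrative.

```python
from math import prod
from typing import Iterable


def parallel(availabilities: Iterable[float]) -> float:
    """Availability of redundant elements: down only if every element is down."""
    return 1 - prod(1 - a for a in availabilities)


def series(availabilities: Iterable[float]) -> float:
    """Availability of a chain of elements: up only if every element is up."""
    return prod(availabilities)


if __name__ == "__main__":
    single_rp = 0.999                          # hypothetical: one RP at three nines
    rp_pair = parallel([single_rp, single_rp])  # active + standby RP
    print(f"single RP: {single_rp:.6f}")
    print(f"RP pair:   {rp_pair:.6f}")
    # A non-redundant line card in the path caps the end-to-end figure.
    print(f"RP pair + one LC at 0.999: {series([rp_pair, 0.999]):.6f}")
```

Real systems fall short of these numbers because failures are correlated and switchover takes time, which is why the deck also stresses hitless switchover.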

Distributed OS for Next Generation Networks
[Diagram: layered architecture, top to bottom]
- Distributed Subsystems/Processes, grouped into Control Plane, Data Plane and Management Plane. Labels: BGP, BRIB, ISIS, OSPF, RSVP, PIM, IGMP, RIB, L2 Drivers, ACL, FIB, QoS, LPTS, Host Service, PFI, SSH, SNMP, XML, Multicast, Checkpoint DB, System DB
- Distributed Infrastructure: Scheduler, Synch. Services, IPC Mech, Memory Mgmt
- Lightweight Micro Kernel: Kernel System Services

IOS XR - Micro Kernel Architecture
- TRUE Microkernel (Mach, QNX)
  - MMU with full protection
  - Applications, drivers, and protocols are protected
  - [Diagram: faults in Application, Filesystem and Driver are each Contained (Restartable); Process Manager shown alongside]
- Monolithic Kernel (BSD/Linux, NT)
  - MMU with partial protection
  - Applications are protected
  - [Diagram: Application fault is Contained (Restartable); faults in Kernel, Filesystem or Network Driver cause System Wide Corruption]

High Availability
[Diagram: contained processes over shared layers. Labels: IP Stack, CLI, XML, Alarm, File System, SSH, OSPF, PIM, IGMP, ACL, PFI, L2 Drivers, Netflow, SNMP, Inter Process Communication, Distributed Middleware, OS]
- Granular process restart allows for fast recovery from failures
- Leverage hardware redundancy like link and RP redundancy
- Graceful restart mechanisms in routing protocol

[Diagram, title partly lost ("Modular st..."): software packages. Labels: OSPF, Multicast, GMPLS, BGP, RPL, Forwarding, Line, Mand, Base, Admin, OS, SC]
- Upgrade specific packages/Composites
  - Across entire system: useful once a feature is qualified and you want to roll it out without a lot of commands
  - Targeted install to specific cards: useful while a feature is being qualified – reduces churn in the system to card boundary
- Point Fix for software faults

Software Upgrade 101 – Bug Fixes
What is available from 1st Release?
SMU (Software Maintenance Unit, point fix)
1. Hitless SMU: 64% of all bug fixes posted have NO traffic impact
2. Traffic impact SMU: 11% of all bug fixes have traffic impact
3. Reload SMU: 25% of bug fixes require a reload
Next step
1. Complete the ISSU building blocks

SMU Installation Impact

Distributed Route Processor and IOS XR Service Separation Architecture
- Scaling control plane beyond basic RP with DRP
  - Available since 2005
  - Competitors had to react with announcement in 2008!
- Service Domain Routers (SDR)
  - Independent/isolated physical routing instances
  - Contain a subset of (D)RPs and LCs within a common (multi-)chassis.
  - Solution available since 2005
  - Lead customers: BT & Comcast
[Diagram: Service Separation Architecture with Routing Instances A, B and C, each with its own RP/DRP and fans; complete H/W fault isolation]

Distributed Control
[Diagram: system process distribution across RPs up to RPn]
- Routing protocols and signaling protocols can run in one or more (D)RP
- Each (D)RP can have redundancy support with standby (D)RP
- Out of resources handling for proactive planning

Applications: BGP Multi-speakers
[Diagram: BGP instances on DRP#1, DRP#2 and DRP#3, each fed from IOS XR Internal Transport (LPTS) with Adj RIB IN and Adj RIB OUT; OSPF or ISIS instances (multiple) with LSDB and IGP RIB; static routes; BGP Placement Manager; BGP RIB; Global RIB (active) and Global RIB (standby) on the RP; IOS XR FIB Distribution]
- Distributed BGP speakers to multiple RP and DRPs
- Single unified BGP RIB to external peers
- Achieve BGP peering scalability

High Availability Infrastructure
[Diagram: ACTIVE card and STANDBY CARD, each running contained processes (Process A, B, C) with Hot, Warm and Cold standby states and a Check Point Server. Labels: BGP, IS-IS, RIB, QoS, FIB, IP Stack, OSPF, IGMP, PIM, ACL, PFI, File System, Netflow, SNMP, SSH, Inter Process Communication, Distributed Middleware, OS]
- Distribution improves fault tolerance and recovery time by localizing the database and system management functionality to each node
- Granular process restart allows for fast recovery from failures
- IOS XR is designed to optimize the switch over between redundant hardware elements (RP, SC, PS, Fan C.)
  - IOS XR is designed to route around fabric failure
  - Line cards are protected by link bundling, APS, IPS, ECMP etc.

Continue to Route Despite Failure: Non Stop Routing (NSR)
- Goal: Maintain routing sessions during primary Route Processor failure
- Implementation:
  - Currently, upon RP failover, routing sessions terminate. Protocols on standby RP reestablish sessions.
  - With NSR, sessions are migrated from active to standby RP without notifying peers.
    - No software upgrade on all routers
    - No manual tuning of timers
    - No additional load on peering routers

What is NSR?
- NSR is a self-contained solution to maintain the routing service (& hence the forwarding service) during:
  - RP/DRP fail-over
  - Process restart
  - In Service Software Upgrade
  - Rack OIR in the case of Multi-chassis
- No disruption to the routing protocol interaction with other routers.

What's the behavior today?
- During RP failover:
  - Routing sessions terminate.
  - Once standby RP becomes active, protocols reestablish the sessions, relearn the routes, and populate forwarding.
  - NSF is designed to preserve the forwarding state while this is happening.
  - But other routers in the network detect the failure and try to route around it, causing huge network churn that could lead to forwarding loops and/or traffic loss.
- Protocol Extensions – Graceful Restart (GR)
  - Neighboring routers detect the failure, but do not propagate the failure in the network.
  - They also assist the router in coming back up.

GR vs. NSR
- Graceful Restart
  - Requires the software on all routers to be upgraded.
  - Requires manual tuning of timers – If not correctly done, GR won't help.
  - Adds load on the peering routers which could cause instability.
  - Introduces a window in which forwarding loops and traffic loss can happen.
  - Not all vendors have implemented GR
- With Non-stop Routing:
  - Sessions don't terminate during failover.
  - Routing interaction continues on the newly active RP without peers being aware.

Evolution
[Diagram: routers A, B and C, each with a Control Plane and Data Plane, shown for four cases: NONE, NSF, GR and NSR; failure propagation is marked for the NONE and NSF cases]

NSF, GR, NSR
                                  NSF   GR    NSR
Forwarding plane kept intact      Yes   Yes   Yes
Session Failure                   Yes   Yes   No
Failure propagation in network    Yes   No    No
Handling topology changes         No    No    Yes
Protocol extensions needed        No    Yes   No
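For readers who want to query the comparison programmatically, the sketch below encodes the slide's Yes/No values as a small lookup table. The structure and the helper name `mechanisms_where` are illustrative assumptions; only the values come from the slide.

```python
# Each property maps mechanism name -> the slide's Yes (True) / No (False) value.
COMPARISON = {
    "Forwarding plane kept intact":   {"NSF": True,  "GR": True,  "NSR": True},
    "Session Failure":                {"NSF": True,  "GR": True,  "NSR": False},
    "Failure propagation in network": {"NSF": True,  "GR": False, "NSR": False},
    "Handling topology changes":      {"NSF": False, "GR": False, "NSR": True},
    "Protocol extensions needed":     {"NSF": False, "GR": True,  "NSR": False},
}


def mechanisms_where(property_name: str, value: bool) -> list[str]:
    """Return the mechanisms whose entry for the given property equals value."""
    return [name for name, flag in COMPARISON[property_name].items() if flag == value]


if __name__ == "__main__":
    print("Sessions survive failover:", mechanisms_where("Session Failure", False))
    print("Need protocol extensions: ", mechanisms_where("Protocol extensions needed", True))
    print("Contain the failure:      ", mechanisms_where("Failure propagation in network", False))
```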

Complementarity
- NSF is the building block – needed for any HA solution
- GR is not required for local control plane failure
  - Sessions stay up with NSR
- GR is required for control plane failures on remote routers that don't support NSR
  - Session will go down
  - GR helper role has to get triggered
- GR also helps as a fallback option

NSR & ISSU
- NSR is the building block for ISSU
  - After the standby RP is upgraded to a newer version, it needs to get to NSR ready state before the RP failover step.
[Diagram: ISSU steps across Management Nodes and Managed Nodes (LC/FC), with labels Load, Run, Step - 2, version X and version Y; remainder truncated in source]
