

EXPANDING ROOT CAUSE ANALYSIS TO INCLUDE ORGANIZATIONAL FACTORS AND WORK PROCESSES

by R.W. Tuli, J-S. Wu, G.E. Apostolakis*

School of Engineering and Applied Science
38-137 Engineering IV
University of California
Los Angeles, CA 90095-1597 USA
tel: (310) 825-1300 fax: (310) 206-2302
apostola@seas.ucla.edu

Presented at the American Nuclear Society International Topical Meeting on Safety Culture in Nuclear Installations, Vienna, 24-28 April 1995

* To whom correspondence should be addressed


R.N. 940608

ABSTRACT

All nuclear power plants incorporate root cause analysis to help identify and isolate key factors judged significant following an incident. Identifying the principal deficiencies can become very difficult when the incident involves not only human and machine interaction but possibly the underlying safety culture of the organization. The current state of root cause analysis in many plants is to stop after identifying human or hardware failures. In this work, root cause analysis is taken one step further by examining work processes and organizational factors, especially when management deficiency or human failure contributes to the incident. Root cause analysis is best designed when the organization, as a whole, wishes to improve the overall operation of the plant by preventing similar incidents from occurring again in the future. By focusing on the possible solutions, as well as the fault, the organization can begin to address problems hidden deeply within the work processes that operate, maintain, and support the plant.

1. Introduction

Root cause analysis is a methodology used by nuclear power plants to help identify and isolate key contributing factors judged to be significant leading up to and during an incident. When the occurrence involves many factors, including human performance and/or management decisions, identifying the root cause of the event may become very difficult and, in many cases, involves the underlying safety culture of the organization. Traditional methods of root cause analysis focus primarily on material deficiency and human error but stop short of looking deeper into the many work processes and organizational factors affecting everyday operation and support of the plant. In this work, a methodology is suggested to systematically expand on traditional approaches to root cause analysis to incorporate the evaluation of organizational factors and work processes, thus probing deeper into the event and allowing corrective actions to focus not only on the cause but also on improving the safety culture of the organization.

2. Overview of Current Root Cause Analysis

The methodologies suggested by organizations such as the International Atomic Energy Agency (IAEA), which runs the Assessment of Safety Significant Event Team (ASSET), and the Nuclear Regulatory Commission (NRC), which developed the Human Performance Investigation Process (HPIP) [US Nuclear Regulatory Commission, 1994], are designed to address latent weaknesses in the Nuclear Power Plant (NPP) which have resulted in an incident or accident. Root cause analysis, in most cases, investigates why these weaknesses were not eliminated in a timely manner. To expand root cause analysis to include work processes and organizational factors, this work uses as a case study the application of the ASSET methodology to a selected incident.
ASSET analyzes significant events by preparing a descriptive narrative, establishing a chronological sequence of events, and preparing the logic tree of occurrences which lead to the event. Significant occurrences in the logic tree are then investigated in detail and summarized in an Event Root Cause Analysis Form (ERCAF) [Reisch, 1994]. This work demonstrates that by expanding the ERCAF to include one additional section, i.e. one specifically addressing possible latent weaknesses in the key work process(es), we can greatly improve the corrective actions designed by the organization so that they include not only preventive measures but also improvements to the overall safety culture of the organization through improvement of that work process.

The ERCAF is divided into three sections (Table 1). The first section describes the incident by stating specifically what failed to perform as expected, including the nature of the occurrence, i.e. an equipment, personnel or procedure failure. The second section addresses the direct cause of the incident by focusing on why the event occurred. This is done by looking into possible latent weaknesses of the failed component. The third section is directed toward the root cause by examining why the event was not prevented or, more specifically, the failure to eliminate any of the contributing latent weaknesses in a timely manner.

3. Work Processes

Since the NPP is organizationally best described as a machine bureaucracy, i.e. operated primarily by the standardization of work [Mintzberg, 1979], we focus our attention on the many work processes that operate, maintain and support the NPP. A work process is defined as a standardized sequence of tasks designed with the objective of achieving a specific goal within the operational environment of an organization. Most of the work processes at NPPs are described and controlled by written procedures. All procedures include an elaborate step-by-step set of instructions that are carefully documented to guide plant operators and maintenance crews through predictable job-related situations. The work processes in an NPP are designed to affect, either directly or indirectly, the performance of plant personnel and hardware [Davoudian, Wu and Apostolakis, 1994a].

The total number of work processes at an NPP may be very large; however, because we begin the work process evaluation with a specific incident in mind, we are usually limited to one or two. In this work, we will look at the corrective maintenance work process. To evaluate the work process using WPAM, we look at the specific tasks that make up the corrective maintenance work process. The first task in the corrective maintenance work process is prioritization. When a plant component has failed or is found to be in a degraded state, a work request is initiated. The request must be prioritized with respect to all other outstanding or incoming requests.
The defenses or barriers for each task are designed by the organization to prevent failure. For prioritization, these include multiple reviews. Once the different corrective maintenance work request orders have been prioritized, the next step involves planning and assembling the work package to carry out the evolution. The defenses or barriers for this task include Work Control Center (WCC), Engineering and departmental reviews. The third task involves scheduling/coordinating the planned corrective maintenance between the many departments playing a part in the evolution. The barriers include interdepartmental meetings and reviews. Once the maintenance evolution has been planned and coordinated, it is carried out as per the work order request. The barriers include self verification, quality control, and post maintenance testing. When the maintenance has been completed and tested, the equipment is returned to service. The defenses or barriers include self and independent verification. The last task in the corrective maintenance work process is always documentation.

Considerable research has been done on how organizational factors affect the everyday operation of nuclear power plants. Using the Work Process Analysis Model (WPAM) [Davoudian, Wu and Apostolakis, 1994a and b], we begin to bridge the gap between the organization and NPP safety. The organizational factors are defined as the dimensions by which each task in the work process is affected by the organization. Taking the qualitative aspects of WPAM one step further, it can be shown that root cause analysis can be expanded in specific incidents where management deficiency and/or human performance factors are determined to be the underlying cause of an incident.

4. Loss of off-site power, Oconee, 1992

As a case study, we look at an incident resulting in a loss of off-site power. In 1989, a station report was initiated which requested replacement and upgrade of the existing batteries in the switchyard at Oconee Nuclear Station. In December 1990, the associated Nuclear Station Modification (NSM) was initiated. In May 1992, the utility submitted a request for a revision to the Technical Specifications in order to extend a Limiting Condition for Operation (LCO) from 24 hours to 7 days. This would allow one battery or associated DC distribution system panel to be out of service long enough to replace the batteries in accordance with the NSM.

As part of the modification package, two implementation procedures were developed, one for each battery. During the development of these two procedures, it was decided that the preferred configuration of the two DC buses would be to maintain separation of the buses, and to use the associated battery charger as the only source of power for each bus as its battery was replaced. During this decision making process, personnel in Engineering and Operations were consulted and concurred. After review, procedure TN/5/N2863/oo/AL2, "Replace 230KV SWYD Batteries SY-2", was approved on October 15, 1992 [LER 05000-270, 1992].

On October 19, 1992, while performing this maintenance, Oconee Unit 2 experienced a loss of off-site power, a generator load rejection, and a trip from 100% full power. A battery charger was placed in service without a connected battery.
It produced excessive voltages which caused a series of spurious breaker failure relay actuations, locking out both buses in the 230 KV switchyard. Also, during recovery actions, shutdown of one emergency generator, after the emergency start signal had been reset, resulted in the unanticipated trip of the operating emergency generator, leading to a second loss of power on Oconee Unit 2. The root cause of the event was determined to be management deficiency due to a less than adequate corrective action program.

The Licensee Event Report (LER) concluded that three specific factors combined to produce the event. First, the breaker failure relay zener diodes would pass a spurious signal when subjected to greater than 200 VDC for two milliseconds or longer. Second, the 230 KV switchyard DC power system was being operated with the battery isolated from the bus, with the battery charger acting as the only source of voltage. Third, the battery charger, when operated in this configuration, produced an output voltage which varied from approximately 70 to over 200 VDC.
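The first factor named in the LER amounts to a threshold-and-duration condition: a spurious signal is passed when the relay sees more than 200 VDC for two milliseconds or longer. As a minimal sketch, and not part of the utility's or the authors' analysis, the check below scans a sampled voltage trace for such an excursion. The function name, the sampling interval and the sample trace are hypothetical; only the 200 VDC and 2 ms figures come from the LER as quoted above.

```python
def exceeds_relay_limit(samples_vdc, dt_ms, limit_vdc=200.0, min_duration_ms=2.0):
    """Return True if the trace stays above limit_vdc for min_duration_ms or longer.

    samples_vdc: voltage samples taken every dt_ms milliseconds (hypothetical data).
    Each sample is assumed to represent dt_ms of elapsed time.
    """
    run_ms = 0.0  # length of the current uninterrupted excursion above the limit
    for v in samples_vdc:
        if v > limit_vdc:
            run_ms += dt_ms
            if run_ms >= min_duration_ms:
                return True
        else:
            run_ms = 0.0  # excursion ended, start counting again
    return False


# Hypothetical charger output sampled every 0.5 ms (illustrative values only).
trace = [125, 130, 210, 215, 220, 212, 140]
print(exceeds_relay_limit(trace, dt_ms=0.5))
```

A check of this kind only restates the vendor's stated limit; it says nothing about why the charger output varied, which is the question the rest of the analysis addresses.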

5. Conventional root cause analysis

The analysis done by the utility identifies the root cause of the incident to be management deficiency stemming from less than adequate corrective action, specifically poor planning and execution of maintenance. Using the ASSET format (Table 1) to address the root cause for the first occurrence, i.e. loss of off-site power, we see that the direct cause, as stated in the LER, was zener diodes in the breaker failure relays passing spurious voltage signals, causing the breakers to trip open. Contributors to this weakness included a power supply whose DC source voltage varied from approximately 70 to 200 VDC. The root cause section addresses the failure to eliminate this problem in a timely manner. By studying the LER, it is found that there was inadequate detection of possible problems concerning operation of the electric plant or battery charger in this configuration. The deficiency stemmed from a less-than-adequate corrective action to remove and replace the station batteries.

6. Including work processes and organizational factors

By modifying the ERCAF to address possible latent weaknesses in the organization, it is possible to link the organization directly or indirectly to the specific incident by including work processes and organizational factors in the analysis. It is possible to identify the key work process(es) and associated organizational factors playing significant roles in this incident. Both preventive and corrective maintenance are included in the maintenance program at all NPPs. Preventive maintenance is usually scheduled periodically to ensure plant components meet technical specifications and/or surveillance requirements, while corrective maintenance refers to the repair and/or restoration of equipment or components which have failed or have been found, as a result of periodic testing, to be in a degraded state.
The batteries were being replaced as part of a modification package, with a work package put into place to carry out this upgrade. By examining the standardized sequence of tasks designed within the operational environment of the NPP to achieve this modification, we see that the corrective maintenance work process (section 3) is the best choice. If the incident had taken place during testing, start-up, shutdown or other special evolutions, we would have to direct our search to work processes that contain those specific sequences of tasks.

Using the task flow chart developed by WPAM, we can now construct an organizational factors matrix. Its purpose is to show the organizational factors that may impact the safe performance of each task. Using this matrix, we can then determine which organizational factors are involved in each task and its associated barrier. By taking this matrix one step further, it is possible to prioritize and rank each organizational factor within the associated task, leading to a weighted organizational factors matrix (Table 2). We can now easily identify the most significant organizational factors affecting each task in the corrective maintenance work process. For example, from the organizational factors matrix we see that the task of planning is affected by thirteen organizational factors. When we prioritize and rank these factors, we see that coordination of work, technical knowledge, time urgency, problem identification and organizational learning appear to be the most significant. We now have a way to see the direct and/or indirect impact of the organization on each task in the work process. Looking at every single organizational factor, regardless of its relative importance, would presumably require greater cost and effort. The weighted organizational factors matrix is suggested as a tool to focus directly on the more salient dimensions.

The root cause of the loss of power at Oconee was determined to be management deficiency. By expanding this analysis, suggestions can be made as to what the deficiencies were and how to improve the corrective maintenance work process in these areas. The loss of off-site power occurred during the "execution" step of this work process, but we can learn even more by starting with the first task and working our way up to the incident, thereby seeing how certain factors may have been compounded from task to task. Using the definitions of each organizational factor [Jacobs and Haber, 1994], we can direct the analysis by asking specific questions about the events surrounding each task.

To demonstrate this approach, we begin with the first task, prioritization, and look for possible latent weaknesses by looking at the more significant organizational factors. For example, addressing goal prioritization, we could look into instances where plant personnel may not have understood, accepted or agreed with the purpose and relevance of plant goals. From the LER we learn that, in 1980, the vendor of the breaker failure relays had sent out "Product Reliability Letters" stating that these relays actuate spuriously if exposed to a 200 VDC differential for greater than 2 milliseconds. The letter also contained directions for a field change to correct the problem. Although utility personnel reviewing the letters recommended making the changes, the relays were never modified, thus suggesting that this action was judged to be not of "high priority". This may suggest possible weaknesses in the way the organization prioritizes when economic considerations are also weighed.

Looking specifically at the second task, planning, we see that the organizational factor technical knowledge is the most important. We also note that technical knowledge is one of the most important organizational factors for the tasks of prioritization, scheduling/coordination, execution and returning to normal line-up. Clearly, technical knowledge plays a very important role throughout the corrective maintenance work process.

Technical knowledge refers to the depth and breadth of requisite understanding plant personnel have regarding plant design and systems, and of the phenomena and events that bear on plant safety. When we study this incident looking for specifics where a lack of technical knowledge could have contributed in some way in the planning stages, we learn from the LER that the vendor manual for the battery charger provided some specifications for current and voltage stability while connected to a battery, but no data for operation without a battery. There was no specific statement prohibiting operation without a battery, but setup instructions called for connecting a battery, and all wording in the vendor documentation assumed that a battery was always connected. When the charger vendor was consulted, he stated that the chargers were not intended for use without a battery in the circuit.

Looking at another significant factor in the planning stage, organizational learning, we look for whether or not plant personnel and the organization used knowledge gained from past experiences to improve performance. Again from the LER, we find that a similar event had occurred at Vermont Yankee (VY) on April 23, 1991 (approximately 18 months prior to the Oconee incident). The VY event had also involved operation with one switchyard DC bus powered by a battery charger (isolated from its associated battery), inadequate voltage control by the charger partially due to failed components, and activation of breaker failure relays due to voltage surges associated with establishing the battery configuration. This event was evaluated as per the utility Operating Experience Program (OEP) and it was concluded that the equivalent portion of the circuit would not fail the same way. The OEP did not discover that a different circuit was subject to the same failure mode, with the same result: actuation of the relay.

Another important factor for this task is problem identification.
We focus here on how the organization encouraged plant personnel to draw upon their knowledge, experience, and current information to identify possible problems in the work package. Many departments reviewed the work procedure, with none objecting to the switchyard line-up or the power supply configuration.

Similar analysis can be done on the remaining tasks, suggesting other possible latent weaknesses in the organization. For this paper, we only looked at the first two tasks. By expanding the analysis to include work process evaluation, we have identified several organizational factors that possibly led to poor decisions by management. In particular, it is suggested that during the planning stages of this maintenance evolution, it was lack of technical knowledge, reluctance to use organizational learning and failure to foresee possible problems in the procedure that led to the loss of off-site power. With this expanded analysis, we have pinpointed areas within the organization that, when improved, not only improve the operation of the plant but may also improve the overall safety culture of the organization by improving the work process.

By assessing its safety culture, an organization can determine where efforts need to be focused to improve the overall plant [Ostrom, Wilhelmsen and Kaplan, 1993]. The benefit of expanding root cause analysis to look additionally at work processes and organizational factors is that we assess and address safety culture when we look for latent weaknesses. Solutions to prevent future occurrences can now include improvements in the overall work process. As an example, from the LER, we learn that as a corrective measure, the utility revised the OEP to improve periodic assessments and effectiveness. From our expanded analysis, we can go beyond this improvement by fully appreciating the value of organizational learning in the planning of any standard maintenance operation. Organizational improvements and allocation of resources to improve organizational learning would improve not only the OEP but also the numerous other NPP work processes that utilize this specific organizational dimension.

REFERENCES

Davoudian, K., Wu, J.S., and Apostolakis, G., 1994a, "Incorporating Organizational Factors into Risk Assessment Through the Analysis of Work Processes," Reliability Engineering and System Safety, 45, 85-105.

Davoudian, K., Wu, J.S., and Apostolakis, G., 1994b, "The Work Process Analysis Model," Reliability Engineering and System Safety, 45, 107-125.

Jacobs, R., and Haber, S., 1994, "Organizational Processes and Nuclear Power Plant Safety," Reliability Engineering and System Safety, 45, 75-83.

LER 05000-270, 1992, Loss of Off-site Power and Unit Trip Due to Management Deficiency, Less Than Adequate Corrective Action Program.

Mintzberg, H., 1979, The Structure of Organizations, Prentice-Hall Inc., Englewood Cliffs, New Jersey.

Ostrom, L., Wilhelmsen, C., and Kaplan, B., 1993, "Assessing Safety Culture," Nuclear Safety, 34, 163-171.

Reisch, F., 1994, "The IAEA-ASSET Approach to Avoiding Accidents is to Recognize the Precursors to Prevent Incidents," Nuclear Safety, 35, 25-35.

US Nuclear Regulatory Commission, 1993, Development of the NRC's Human Performance Investigation Process (HPIP), NUREG/CR-5455.

Event title:
Occurrence title:

Occurrence: What failed to perform as expected?
    Breaker failure relays failed to withstand excessive voltages
    Nature: Equipment failure

Direct cause: Why did it happen?
    Latent weakness: Zener diodes in BF relays passed spurious voltage signal, causing breakers to trip open.
        Corrective action: Breaker relays modified per vendor instructions.
    Contributor to the existence of the latent weakness: DC power system was being operated with the battery isolated from the bus, with the battery charger acting as the only source of voltage.
        Corrective action: Modification procedure revised to maintain busses tied together.

Root cause: Why was it not prevented?
    Deficiency to timely eliminate the latent weakness: Inadequate detection of possible problems when operating battery charger without battery in circuit.
        Corrective action: Other Oconee procedures were revised and precautions added where appropriate.
    Contributor to the existence of the deficiency: Management deficiency stemming from less than adequate corrective action to perform required maintenance.
        Corrective action: OEP revised for enhancements to improve both [illegible] and the periodic assessments of program effectiveness.

Expanded root cause: Which work process(es) and organizational factors played significant roles in the incident?
    Latent weaknesses: Various deficiencies in the corrective maintenance work process in the organization.
        Corrective action: Assess key organizational factors within each task leading to the incident.
    Contributors to the existence of the deficiency: 1) lack of technical knowledge and organizational learning within the task of planning; 2) lack of problem identification in various Department reviews prior to issuing work order.
        Corrective action: Expand OEP to include improvements in organizational learning. Implement schemes to assess and upgrade plant tech. knowledge, e.g. use of behavioral checklists.

Table 1. Expanded root cause analysis form

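The form in Table 1 can be read as a nested record: each section answers one question and pairs its weaknesses or contributors with corrective actions, and the expanded form adds a fourth section for work processes and organizational factors. The sketch below shows one possible way to hold such a record. The class and field names are hypothetical and do not come from ASSET or from the paper, and only a few of the table entries are reproduced, in shortened form.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Finding:
    """A latent weakness or contributor, paired with its corrective actions."""
    description: str
    corrective_actions: List[str] = field(default_factory=list)


@dataclass
class FormSection:
    """One section of the form: the question it answers and its findings."""
    question: str
    findings: List[Finding] = field(default_factory=list)


@dataclass
class ExpandedERCAF:
    """Hypothetical container for the expanded form; names are illustrative."""
    occurrence: str
    nature: str
    direct_cause: FormSection
    root_cause: FormSection
    expanded_root_cause: FormSection  # the additional section proposed in the paper

    def corrective_actions(self) -> List[str]:
        """Collect corrective actions from all sections, in form order."""
        actions: List[str] = []
        for section in (self.direct_cause, self.root_cause, self.expanded_root_cause):
            for finding in section.findings:
                actions.extend(finding.corrective_actions)
        return actions


# Entries abridged from Table 1.
form = ExpandedERCAF(
    occurrence="Breaker failure relays failed to withstand excessive voltages",
    nature="Equipment failure",
    direct_cause=FormSection(
        question="Why did it happen?",
        findings=[
            Finding(
                "Zener diodes in breaker failure relays passed spurious voltage signal",
                ["Breaker relays modified per vendor instructions"],
            ),
        ],
    ),
    root_cause=FormSection(
        question="Why was it not prevented?",
        findings=[
            Finding(
                "Inadequate detection of possible problems when operating "
                "battery charger without battery in circuit",
                ["Procedures revised and precautions added where appropriate"],
            ),
        ],
    ),
    expanded_root_cause=FormSection(
        question="Which work process(es) and organizational factors played "
                 "significant roles in the incident?",
        findings=[
            Finding(
                "Lack of technical knowledge and organizational learning "
                "within the task of planning",
                [
                    "Expand OEP to include improvements in organizational learning",
                    "Implement schemes to assess and upgrade plant technical knowledge",
                ],
            ),
        ],
    ),
)

for action in form.corrective_actions():
    print("-", action)
```

Keeping the expanded section as a separate field, rather than folding it into the root cause section, mirrors the paper's point that organizational corrective actions are additional to the conventional ones.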
Table 2. Weighted organizational factors matrix for the corrective maintenance work process [matrix entries not recoverable from the source scan]
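The use made of the weighted organizational factors matrix in the text, picking out the most significant factors for each task and noticing factors that recur across tasks, can be sketched as follows. This is an illustration under stated assumptions, not the WPAM procedure itself: the weights are hypothetical and are not the entries of Table 2, the function names are invented for this sketch, and only the task and factor names are taken from the paper.

```python
from typing import Dict, List

# Hypothetical weights in percent, for illustration only.
# They are NOT the entries of Table 2; only the task and factor names come from the text.
WEIGHTS: Dict[str, Dict[str, float]] = {
    "prioritization": {
        "goal prioritization": 18.0,
        "technical knowledge": 16.0,
        "time urgency": 9.0,
    },
    "planning": {
        "technical knowledge": 20.0,
        "coordination of work": 15.0,
        "time urgency": 12.0,
        "problem identification": 11.0,
        "organizational learning": 10.0,
        "goal prioritization": 4.0,
    },
}


def top_factors(matrix: Dict[str, Dict[str, float]], task: str, k: int = 3) -> List[str]:
    """Return the k highest-weighted factor names for a task, highest first."""
    ranked = sorted(matrix[task].items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def recurring_factors(matrix: Dict[str, Dict[str, float]], k: int = 3) -> List[str]:
    """Return, in alphabetical order, factors ranked in the top k for more than one task."""
    counts: Dict[str, int] = {}
    for task in matrix:
        for name in top_factors(matrix, task, k):
            counts[name] = counts.get(name, 0) + 1
    return sorted(name for name, n in counts.items() if n > 1)


print(top_factors(WEIGHTS, "planning"))
print(recurring_factors(WEIGHTS))
```

Restricting attention to the top few factors per task reflects the trade-off noted in the text: examining every factor regardless of weight would cost more effort for little additional insight.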
