Overview
Industry has become very good at determining how to fix equipment and keep it up and running. And yet, how often do we look at a failed piece of equipment, or equipment that has never worked quite right, and:
- blame the manufacturer for the problems
- tell people it has always worked that way (problem? what problem?)
- assume that someone else will fix it (operations or maintenance)
- ask Mr. Machinery (the person who has been around for 30 years) to fix it.
As we all know, equipment can't talk back, so we may not take the time to conduct a thorough troubleshooting and failure analysis effort that determines the actual path to failure and identifies the human performance issues and root causes behind our equipment problems.
By knowing how to determine the:
- Failure Mode
- Failure Agent
- Failure Classification
the professional troubleshooter can better collect the information needed for root cause failure analysis and then identify the real human performance root causes that often underlie equipment problems. When the real reasons for the equipment problems are known, the professional troubleshooter can recommend and implement corrective actions that prevent similar problems and increase mean time between failures.
An example will be provided to show common shortcomings of equipment troubleshooting followed by an example using Equifactor® and human performance root cause analysis to find the real reasons and root causes for equipment and component downtime and failures.
Machinery Troubleshooting Objectives and Traps
Most people recognize that the primary objective of conducting machinery troubleshooting is to prevent repeat incidents. Although this may sound obvious, I would suggest that we might not be as effective at accomplishing this as desired. One reason may be that in our troubleshooting, we don't always take the time to document our efforts. As a result, when we have a failure and we review the machinery history, the entry looks something like the following:
- 3-25-01 Compressor failed
- 3-26-01 Compressor online.
Unfortunately, this does not indicate what was done during the troubleshooting process or whether the actions taken were effective. A process that makes it easier to document troubleshooting as it is conducted is therefore a desirable goal. Additionally, if a decision is made to purchase a new piece of equipment or upgrade an existing piece, it is beneficial to have previous problems documented so that the bid specification can be better written and the new or upgraded equipment is better suited to the desired service.
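As a rough illustration of what a more useful history entry might capture, the sketch below records the symptom, the actions taken, and whether they were verified as effective. The class and field names are assumptions made for this example, not part of any particular maintenance system, and the two-digit year in "3-25-01" is read here as 2001.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class HistoryEntry:
    """One machinery-history record (all names here are illustrative)."""
    when: date
    equipment: str
    symptom: str
    actions_taken: List[str] = field(default_factory=list)
    effective: Optional[bool] = None  # None means effectiveness was never verified

    def is_useful(self) -> bool:
        """True only if actions were recorded and their effect was verified."""
        return bool(self.actions_taken) and self.effective is not None

    def summary(self) -> str:
        if self.effective is None:
            outcome = "effectiveness not verified"
        elif self.effective:
            outcome = "effective"
        else:
            outcome = "ineffective"
        actions = "; ".join(self.actions_taken) or "no actions recorded"
        return (f"{self.when.isoformat()} {self.equipment}: "
                f"{self.symptom} | {actions} | {outcome}")


# The terse log entry from the text, expressed in this structure.
terse = HistoryEntry(date(2001, 3, 25), "Compressor", "failed")
print(terse.summary())
print(terse.is_useful())
```

An entry like the one in the text fails the is_useful() check, which is the point: the record exists, but it tells the next troubleshooter nothing about what was tried.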
Too often, some common troubleshooting traps may prevent us from fully realizing our troubleshooting potential:
- Everyone's job and no one's job: This is where Operations decides that it is too difficult for them to fix and Maintenance decides that it is too minor for them to get involved. The result is often replacing the item or addressing the symptoms only.
- Mr. Machinery: In this case, a facility may rely on a certain individual who has been at the facility for a long time. This is not necessarily bad, but it can be a problem if the individual has not kept up with current technology.
- Telephone troubleshooting: This occurs when the troubleshooter attempts to solve the problem by interviewing people over the phone and not taking the opportunity to look at the failed equipment or machinery in person.
- Familiarity: In this situation, a person becomes very familiar with a piece of equipment or machinery and begins to assume that the current problem is the same as last time, whether or not in fact it is.
Systematic Troubleshooting Process
In order to troubleshoot effectively and avoid the traps described above, one should follow a logical process: determine "what" happened, determine "why" it happened, and then develop effective fixes that address the "why."
What Happened: The "what" can be portrayed effectively by creating a sequence-of-events chart. This provides a graphical presentation of what happened for those reviewing the incident. A common technique is to put the incident or problem in a circle, the actions or events in boxes, and amplifying information, often called conditions, in ovals.
Let's take a look at an incident involving a cooling pump motor. This pump has been in operation for over four years. Four months ago it was taken out of service and refurbished. Six weeks ago it was greased per the manufacturer's recommendations. On the day of the incident, the pump motor began to smoke and eventually caught fire. After disassembly, we find that the inboard bearing is burned and melted. The outboard bearing appears to be in good condition. We can put this into a sequence-of-events chart and then begin to troubleshoot using a combination of brainstorming and cause-and-effect techniques.
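The cooling pump narrative above can be captured as a simple text version of a sequence-of-events chart. This is only a sketch: the class names and the text rendering are assumptions made for illustration, and the events are taken directly from the narrative, with conditions shown in parentheses where the chart would use ovals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """An action or event (a box on the chart) with its conditions (ovals)."""
    when: str
    what: str
    conditions: List[str] = field(default_factory=list)


@dataclass
class SequenceOfEvents:
    incident: str  # the circle on the chart
    events: List[Event] = field(default_factory=list)

    def render(self) -> str:
        """Return a plain-text outline: incident first, then events in order."""
        lines = [f"INCIDENT: {self.incident}"]
        for ev in self.events:
            lines.append(f"[{ev.when}] {ev.what}")
            for cond in ev.conditions:
                lines.append(f"    ({cond})")
        return "\n".join(lines)


chart = SequenceOfEvents(
    incident="Cooling pump motor fire",
    events=[
        Event("over 4 years ago", "Pump placed in operation"),
        Event("4 months ago", "Pump taken out of service and refurbished"),
        Event("6 weeks ago", "Pump greased",
              ["per manufacturer's recommendations"]),
        Event("day of incident", "Pump motor began to smoke"),
        Event("day of incident", "Pump motor caught fire"),
        Event("after disassembly", "Bearings inspected",
              ["inboard bearing burned and melted",
               "outboard bearing appears to be in good condition"]),
    ],
)
print(chart.render())
```

Writing the events down in order tends to expose gaps, for example that nothing is recorded between the greasing six weeks ago and the day of the incident.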
Why It Happened, Option 1: If we are able to gather the right people, we should be able to develop a reasonable list of possibilities for the burned-up bearing. Some of these may include:
- lubrication problems
- misalignment problems
- friction problems
- loading problems
- clearance problems.
Below each of these general categories we could also postulate more detailed possibilities. Under the lubrication category, for example, we could list insufficient lubrication, overlubrication, incorrect lubrication, and so on. We could continue to delve deeper under each of the general categories as the group we have gathered continues to brainstorm, drawing on their collective knowledge and experience. This should give us several possibilities to pursue in the root cause failure analysis of the burned-up inboard journal bearing.
Why It Happened, Option 2: Another approach to the troubleshooting process also starts with gathering the right people, but in addition, the group uses predeveloped or existing checklists that contain the most common symptoms of bearing problems along with the possible causes for each listed symptom. An advantage of well-developed checklists is that they counter the tendency of people to forget one or two symptoms or possible causes. Also, if the right people can't be gathered, you may have to rely on people with less experience and knowledge. A checklist helps to overcome these problems. Using checklists ensures that every troubleshooter relies on the best information that was available when the checklists were created.
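A checklist of this kind is essentially a lookup from symptom to possible causes. The sketch below shows the idea in minimal form. The entries are only the possibilities named in the brainstorming discussion above, not a real or published checklist, and the symptom wording and function name are assumptions made for the example.

```python
from typing import Dict, List, Set

# Illustrative checklist only: causes are the possibilities named in the text,
# not a vendor-supplied or published bearing checklist.
BEARING_CHECKLIST: Dict[str, List[str]] = {
    "burned or melted bearing": [
        "insufficient lubrication",
        "overlubrication",
        "incorrect lubrication",
        "misalignment",
        "friction",
        "loading",
        "clearance",
    ],
}


def remaining_causes(symptom: str, ruled_out: Set[str]) -> List[str]:
    """Return the checklist causes for a symptom that are not yet ruled out."""
    causes = BEARING_CHECKLIST.get(symptom, [])
    return [cause for cause in causes if cause not in ruled_out]


# Example: two causes have been ruled out; the rest must still be examined.
still_open = remaining_causes("burned or melted bearing",
                              {"misalignment", "loading"})
print(still_open)
```

The value of the lookup is that every listed cause stays on the table until someone explicitly rules it out, regardless of who is in the room.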
Path To Failure: If checklists aren't available, one can still use a process to determine the "path to failure" to better understand the manner and conditions that existed to create the failure or poorly performing equipment or machinery.
The first step is to determine the Failure Mode. This is the appearance, manner, or form in which a machinery component or unit failure manifests itself. The general categories of Failure Modes include: Deformation, Fracture, Surface/Material Changes, and Displacement. Below each of the general categories we can list more specific forms of each.
After the Failure Mode, we proceed to the Failure Agents. The Failure Agent is the catalyst that allowed the Failure Mode to occur. Failure Agents consist of force, reactive environment, time, and temperature. One of these will be primary, and secondary and tertiary Failure Agents are often present as well. The process is similar to asking "why" as in Option 1 above, only now we have a more structured and systematic process that anyone can use and document.
Following the Failure Agent, we proceed to determine the Failure Classification. This is where we determine whether the issue is strictly an Equipment Difficulty or whether there may be a Human Performance Difficulty associated with it. Too often, troubleshooters stop at this point and miss a valuable opportunity to take their troubleshooting to another level. This is where the investigation turns to the people involved, who, unlike the equipment, can "talk back."
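The three steps of the path to failure lend themselves to a small, documentable record. The sketch below uses the general Failure Mode categories, the four Failure Agents (force, reactive environment, time, and temperature), and the two classifications described above. The class names are assumptions, and the values chosen for the inboard bearing are hypothetical, picked only to show how the record is filled in, not conclusions drawn from the incident.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FailureMode(Enum):
    DEFORMATION = "Deformation"
    FRACTURE = "Fracture"
    SURFACE_MATERIAL_CHANGES = "Surface/Material Changes"
    DISPLACEMENT = "Displacement"


class FailureAgent(Enum):
    FORCE = "force"
    REACTIVE_ENVIRONMENT = "reactive environment"
    TIME = "time"
    TEMPERATURE = "temperature"


class FailureClassification(Enum):
    EQUIPMENT_DIFFICULTY = "Equipment Difficulty"
    HUMAN_PERFORMANCE_DIFFICULTY = "Human Performance Difficulty"


@dataclass
class PathToFailure:
    component: str
    mode: FailureMode
    agents: List[FailureAgent]  # ordered: primary first, then secondary, tertiary
    classification: FailureClassification

    def __post_init__(self) -> None:
        if not self.agents:
            raise ValueError("at least a primary Failure Agent is required")

    @property
    def primary_agent(self) -> FailureAgent:
        return self.agents[0]

    def needs_interviews(self) -> bool:
        """Human performance involvement means the analysis moves on to people."""
        return (self.classification
                is FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY)


# Hypothetical values for illustration only, not findings from the incident.
bearing = PathToFailure(
    component="inboard bearing",
    mode=FailureMode.SURFACE_MATERIAL_CHANGES,
    agents=[FailureAgent.TEMPERATURE, FailureAgent.FORCE],
    classification=FailureClassification.HUMAN_PERFORMANCE_DIFFICULTY,
)
print(bearing.primary_agent.value, bearing.needs_interviews())
```

Requiring at least one agent and an explicit classification keeps the record from collapsing back into a one-line "bearing failed" entry.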
Human Performance Difficulty: If human performance was involved in the equipment issue, the troubleshooter will then need to step out of the equipment analysis role and begin to set up interviews with people who have interacted with the equipment or machinery in question.
To begin the process of finding the human performance root causes, we can use the Root Cause Tree®. This allows us to get at the human performance "why" part of the investigation. The process starts by answering the 15 questions on the front of the Root Cause Tree®, which help the troubleshooter determine which of the basic cause categories apply to the equipment issue being investigated and which do not.
Once the 15 questions have been answered Yes or No, the troubleshooter turns to the back of the Root Cause Tree® and analyzes the basic cause categories identified by those questions. Under each of the basic cause categories are root causes that the troubleshooter should evaluate to determine whether they apply to the equipment issue. An example might be a person performing maintenance with a procedure that was not specific enough. The maintenance person had to interpret the intent of the procedure, interpreted it incorrectly, and as a result the equipment failed or did not work properly.
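The screening step can be thought of as a filter: every question must be answered, and only the categories linked to a Yes answer are carried forward for analysis. The sketch below shows that logic only. The question numbers and the "category A/B/C" labels are placeholders invented for this example. They do not reproduce the actual questions or categories on the Root Cause Tree®.

```python
from typing import Dict, List, Set

NUM_QUESTIONS = 15  # the number of screening questions stated in the text


def categories_to_analyze(
    answers: Dict[int, bool],
    question_to_categories: Dict[int, List[str]],
) -> Set[str]:
    """Return the categories linked to every question answered Yes.

    Raises ValueError if any of the 15 questions was left unanswered.
    """
    missing = [q for q in range(1, NUM_QUESTIONS + 1) if q not in answers]
    if missing:
        raise ValueError(f"unanswered questions: {missing}")
    selected: Set[str] = set()
    for question, answered_yes in answers.items():
        if answered_yes:
            selected.update(question_to_categories.get(question, []))
    return selected


# Placeholder mapping: hypothetical labels, not the real tree.
mapping = {
    1: ["category A"],
    2: ["category A", "category B"],
    3: ["category C"],
}

# All questions answered No except question 2.
answers = {q: False for q in range(1, NUM_QUESTIONS + 1)}
answers[2] = True
print(sorted(categories_to_analyze(answers, mapping)))
```

Insisting on an explicit Yes or No for every question is what keeps a category from being skipped simply because nobody thought to ask.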
Of course, to adequately answer the 15 questions and then identify root causes from the Root Cause Tree®, the troubleshooter will need to conduct interviews with the appropriate people. The 15 questions and the root causes on the Root Cause Tree® are designed to minimize, if not eliminate, the "blame" syndrome found in many companies. This does not mean that we shouldn't hold people accountable for their actions, only that we need to look at our systems first to ensure that they give people the tools they need to be successful in their jobs. Once we are confident that our systems are in order, we are better justified in applying whatever discipline the situation warrants.