
HTM ComDoc 1

From HTMcommunityDB.org


Start here: PM Basics, key concepts and terminology

(This document was last revised on 12-8-18)

1.1 Device failures and measures of reliability

A device or equipment system is considered to have failed when:

  • it no longer performs the function or functions that the user wants it to perform (these are called overt failures), or
  • it functions as it should, but in an unsafe or otherwise unsatisfactory manner (these are called hidden failures).

It is a truism, similar to the impossibility embedded in the concept of perpetual motion, that there is no such thing as an infallible device. All devices fail in one way or another, at some time or other. The simplest measure of a device’s reliability is its failure rate: the number of times that it failed to perform, or failed to perform satisfactorily, during a particular time period. Since failures are predominantly random, a device's failure performance (its reliability) is usually expressed as an average number of failures over a particular time period. However, a more intuitive way of expressing device reliability is in the form of the device's mean time between failures, or MTBF, which is the inverse of its failure rate over a particular period of time. For example, a device that has demonstrated an average failure rate of one failure every 75 years is demonstrating a mean time between failures of 75 years.

1.1.1 Expressing reliability as a failure rate or as a mean time between failures (MTBF)

Mean time between failures (MTBF) is the inverse of the failure rate. For example, a device that has failed twice in nine years is demonstrating a failure rate of 0.22 failures per year and an MTBF of 4.5 years. Average failure rates can also be derived by dividing the total number of device failures occurring during the observation period by the number of device-years making up the total device experience. For example, if a batch of 10 devices experiences two failures during nine years, the total experience is 90 device-years, so the failure rate is 0.022 failures per device per year and the MTBF is 45 years. The larger the experience base (in device-years), i.e. the greater the number of devices in the sample and the longer the observation period, the closer the observed failure rate will be to the device’s true failure rate.
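The arithmetic in the two examples above can be sketched in a few lines of Python. This is only an illustrative sketch; the function names are our own and are not part of any Task Force tool.

```python
def failure_rate(failures: int, devices: int, years: float) -> float:
    """Average number of failures per device-year over the observation period."""
    device_years = devices * years
    if device_years <= 0:
        raise ValueError("the observation must cover a positive number of device-years")
    return failures / device_years


def mtbf_years(failures: int, devices: int, years: float) -> float:
    """Mean time between failures in years (the inverse of the failure rate)."""
    if failures == 0:
        # No failures observed yet, so the data put no upper bound on the MTBF.
        return float("inf")
    return (devices * years) / failures


# One device that failed twice in nine years.
print(f"{failure_rate(2, 1, 9):.2f} failures/year, MTBF {mtbf_years(2, 1, 9):.1f} years")

# A batch of 10 devices with two failures in nine years (90 device-years).
print(f"{failure_rate(2, 10, 9):.3f} failures/year, MTBF {mtbf_years(2, 10, 9):.0f} years")
```

Counting the experience base in device-years lets the same two functions serve a single device or a whole inventory of similar devices.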

It is generally easier for lay persons to relate to an MTBF because it is expressed in units of time, such as 3 years or 30 years - a simple, easily comprehended metric. For example, most people will have little difficulty in considering a device with an MTBF of just one month to have a relatively poor level of reliability and, conversely, considering a device with an MTBF of 50 years to be quite reliable. But when the same two cases are expressed as the equivalent failure rates (12 failures per year for the MTBF of 1 month versus 0.02 failures per year for the MTBF of 50 years), the contrast between the two levels of reliability does not seem quite so striking.

Since, ideally, we would like to separate the various kinds of devices into neat compartmentalized categories such as “safe” and “hazardous”, we have to confront the difficulty of setting boundaries and the consequent gray areas around those boundaries. For example, setting a threshold of, say, 75 years for the MTBF that should be considered safe creates the hard-to-answer question of how much less reliable (and thus less safe) a device with an MTBF of 74 years is than one with an MTBF of 75 years. There is, of course, no simple answer to that question. There are gray areas. It is all relative.

This discussion is made a little more complicated by the fact that there are several different reasons why devices fail. Lumping the failures from all of these different causes into one overall failure rate, or corresponding MTBF, might well raise the objection that this total failure rate does not fairly describe what we think of as either the reliability of the device itself or the effectiveness of the way we maintain it. Section 1.4 below addresses the nature of these different causes of failure and how they can be categorized and used to develop a helpful and meaningful analysis.

1.2 What is maintenance?

There are several adequate dictionary definitions of maintenance but, in the context of maintaining equipment, it is best defined as "the process of keeping the equipment in proper working order, in good physical condition and acceptably safe". The definition used in the highly respected RCM approach to equipment maintenance is “keeping the equipment available for use”. For more about RCM, see HTM ComDoc 14. "An introduction to Reliability-centered Maintenance (RCM): The modern approach to Planned Maintenance".

A traditional equipment maintenance program has three parts:

  1. Corrective maintenance or, as it is more commonly called, repair, is the process of returning a device that is in a failed state (i.e. that is no longer doing what the user wants it to do) to a safe condition and proper working order. This includes correcting any significant hidden failures even though they do not usually affect the primary functions of the device.
  2. Cosmetic repair is the process of restoring a device that is damaged to a safe and cosmetically like-new condition. Cosmetic repairs are generally considered a lower priority because the device may still be functioning within the manufacturer’s functional specifications. However, the device may be damaged in such a way that it is unsafe. For example, a damaged cover may be presenting a sharp edge that could be hazardous to either the patient or a user.
  3. Preventive maintenance. This third component is very important because from the very beginning, with the earliest machines developed during the time of the industrial revolution, it was widely believed that restoring the device's non-durable parts, as needed, before the end of the device's anticipated lifetime would be beneficial because it would reduce the number of unexpected machine breakdowns. In return for these scheduled PM interventions to restore the device's non-durable parts, the device users expect a lower level of disruption and loss of productivity, as well as some reduction in overall maintenance costs, because the device should experience fewer breakdowns.

Non-durable parts (NDPs), which are sometimes loosely called disposables or disposable parts, are components of the device that are subject to progressive wear or deterioration. They typically include moving parts, such as bearings, drive belts, pulleys, mechanical fasteners and cables, which require periodic cleaning and lubrication, as well as certain non-moving parts such as electrical batteries, gaskets, flexible tubing and various kinds of filters, which may need to be cleaned, adjusted, refurbished or replaced sometime during the useful lifetime of the device. The parts that the device manufacturer considers to be non-durables can be identified from the presence of corresponding device restoration tasks in the manufacturer's recommended PM procedure.

As we describe more fully in HTM ComDoc 14. "An introduction to Reliability-centered Maintenance (RCM): The modern approach to Planned Maintenance" ............

Belief in this traditional device restoration approach to improving machine reliability continues to this day, particularly in certain relatively small industry sectors, even though the findings that started the revolutionary RCM approach to maintenance in the 1970s have caused a considerable amount of rethinking about whether or not intrusive maintenance interventions really do improve the device's overall reliability. Certainly there are still quite a number of medical devices such as ventilators, spirometers and traction machines that are more mechanical than electronic, where the manufacturers still recommend that certain parts be given some kind of periodic restoration (cleaning, refurbishment or replacement). However, we don’t yet have good, independent evidence as to whether or not these manufacturer-recommended PMs, particularly those involving the more intrusive overhauls, are truly beneficial or cost-effective. We have not yet gathered the data on the impact of these recommended interventions on the reliability of these more mechanical devices. That investigation is one of the goals that the Maintenance Practices Task Force (MPTF) has set for itself. We discuss this data gathering challenge in more detail in HTM ComDoc 4.

1.3 What exactly does the term "PM" mean in the context of medical equipment maintenance?

In the special case of maintaining medical equipment, there is a second very important reason besides device restoration for making periodic scheduled interventions. And that is testing the device to detect critical degradation in the functional performance of the device or in its condition with respect to safety. These deteriorations can be quite subtle, and in RCM jargon these degradations are called hidden failures. The term is appropriate because these subtle changes do not completely disable the device's primary functions and so they will usually go unnoticed by the device users.

It is important to detect these subtle deteriorations (hidden failures) because there are certain kinds of medical devices that can cause a patient injury if their performance becomes significantly substandard or their level of safety falls below the relevant requirements. Elsewhere (see HTM ComDoc 3) we characterize the types of devices that have a theoretical potential to injure a patient if they deteriorate in this way as hidden failure-critical or HF-critical devices. These devices need to be subjected to periodic safety verification tasks. Appropriate safety verification tasks for checking out each particular type of device are typically included as a part of the device manufacturer's recommended PM procedure.

Similarly, we can characterize devices that have a theoretical potential to injure a patient if they simply stop working as life support devices (see Section 1.8.1 below). As the descriptor (life support) implies, it is important to minimize failures of these devices. If these devices have manufacturer-designated non-durable parts (NDPs), they are vulnerable to what the Task Force calls wear-out type failures, and they need to be subjected to appropriate device restoration (DR) tasks to prevent the device from failing. This will eliminate one (but only one) source of device failures. The test for this vulnerability is whether or not the device manufacturer's recommended PM procedure includes any device restoration tasks.

One of the recurring obstacles in our discussions of PM over the years has been the use of a number of imprecise and inconsistent terms. Unfortunately there is still no general consensus. So, in an attempt to establish a standardized and more consistent PM terminology, we are proposing (below) some new terms.

We believe that it would be quite difficult to get the entire population of engineers and technicians practicing in the medical equipment maintenance field to change from using the long-established traditional abbreviation “PM”. To accommodate this practical issue we are proposing to introduce another term with the same abbreviation. The newer term, “planned maintenance”, will be used to define the combination of the traditional device restoration tasks (what we have traditionally called “preventive maintenance”) and the performance and safety testing tasks that are more or less unique to the medical field. In this new formulation we are proposing to use the term “device restoration tasks” as a short label for the restoration of the device's non-durable parts. It is a simple and appropriately descriptive term.

We are suggesting this new terminology in full recognition of the fact that there are a number of other competing terms that have evolved over time. For example the term “scheduled maintenance” has been proposed as an alternative to “preventive maintenance” but it is not a very good fit semantically because it implies that the device restoration tasks are always performed according to some kind of clock; either by conventional timing (e.g. every 6 or 12 months) or by a time-of-use clock (e.g. every 1000 hours of use). There is, however, a more modern practice in which the deteriorating part is restored on a more efficient “just-in-time” basis by monitoring the actual condition of the part. In some cases the monitoring is performed by some kind of sensor but more commonly in the medical equipment sector it is simply done by conducting periodic visual inspections. In the RCM approach this “just-in-time” restoration is called predictive maintenance. In addition to this, what we are proposing to call safety verification (SV) tasks have been given the collective name “inspections” by ECRI Institute and others. We prefer the more descriptive term “safety verification” tasks.

So, in summary, in the context of medical equipment maintenance, the contraction “PM” should be understood to mean “planned maintenance”, which is defined as a combination of two different types of tasks: one (device restoration tasks) aimed at preventing wear-out failures, and the other (safety verification tasks) aimed at detecting and then correcting hidden failures; i.e.

Planned maintenance (PM) procedure = Device restoration (DR) tasks + Safety verification (SV) tasks
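As a minimal sketch, the definition above can be represented as a simple record holding the two task lists, together with the Section 1.3 test for wear-out vulnerability (does the procedure include any device restoration tasks?). The class name, field names and example tasks are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PMProcedure:
    """A planned maintenance (PM) procedure = DR tasks + SV tasks."""
    device_type: str
    dr_tasks: List[str] = field(default_factory=list)  # device restoration tasks
    sv_tasks: List[str] = field(default_factory=list)  # safety verification tasks

    def all_tasks(self) -> List[str]:
        """The complete PM procedure: DR tasks followed by SV tasks."""
        return self.dr_tasks + self.sv_tasks

    def has_wear_out_exposure(self) -> bool:
        """True if the procedure includes any device restoration tasks."""
        return len(self.dr_tasks) > 0


# Hypothetical example tasks, for illustration only.
ventilator_pm = PMProcedure(
    device_type="critical care ventilator",
    dr_tasks=["replace inlet filter", "replace backup battery"],
    sv_tasks=["verify alarm operation", "check ground connection"],
)
print(len(ventilator_pm.all_tasks()), ventilator_pm.has_wear_out_exposure())
```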

1.4 Classifying and coding the causes of (overt) medical device failures

There are a number of different reasons (causes) why equipment systems fail and it is particularly important to recognize that not all of these failures can be prevented by some kind of planned maintenance. Consider, for example, the following list of possible causes of device failure:

  • The first set of causes can be classified as inherent reliability-related failures (IRFs) that are attributable to the design and construction of the device itself, including the inherent reliability of the components used in the device. They typically represent 45 - 55% of the repair calls. This type of failure can be reduced (but not to zero) only by redesigning the device or changing the way it was constructed.

Category IR1 Random failure. A device failure caused by the random failure or malfunction of a component part of the device; a result of the device’s inherent unreliability. IR1 calls typically represent 40-55% of all repair calls.

Category IR2 Poor construction. A device failure attributable to poor fabrication or assembly of the device itself.

Category IR3 Poor design. A device failure attributable to poor design of the hardware or processes required to operate the device.

  • The second set of causes can be classified as process-related failures (PRFs). They typically represent 40 - 50% of the repair calls. Reducing or eliminating these types of failure typically requires some kind of redesign of the system’s processes - for example, by using better methods to train the equipment users to operate the equipment (as intended by the manufacturer) or to train them to treat the equipment more carefully. They are not failures that can be prevented by any kind of maintenance activity.

Category PR1 Use error. A device failure attributable to incorrect set-up or operation of the device by the user. The user has not set the device up correctly or does not know how to operate it. PR1 calls typically represent 13-20% of all repair calls. (Note that although this type of “failure” does not represent a complete loss of function, it can have the same effect. For example, an incorrectly set defibrillator can result in a failure to resuscitate the patient.)

Category PR2 Physical damage. A device failure caused by subjecting the device to physical stress outside its design tolerances. PR2 calls typically represent 6-25% of all repair calls.

Category PR3 Discharged battery. A device failure attributable to a failure to recharge a rechargeable battery. PR3 calls typically represent 7-8% of all repair calls.

Category PR4 Accessory problem. A device failure caused by the use of a wrong or defective accessory. PR4 calls typically represent 3-9% of all repair calls.

Category PR5 Environmental stress. A device failure caused by exposing the device to environmental stress outside its design tolerances. PR5 calls typically represent 1-7% of all repair calls.

Category PR6 Tampering. A device failure caused by human interference with an internal control. PR6 calls typically represent less than 1% of all repair calls.

Category PR7 Network problem. A device system failure caused by an issue within a data network connected to the device’s output.

  • The third set of causes can be classified as maintenance-related failures (MRFs). They typically represent 2 - 4% of the repair calls. These types of failure can be prevented through some kind of maintenance strategy incorporated into the facility’s maintenance program.

Category MR1 PM-preventable failure. A device failure that could have been prevented by more timely restoration or replacement of a manufacturer-designated non-durable part, e.g. a battery failure, a clogged filter, or a build-up of dust. Failures due to trapped cables should not be coded this way. MR1 calls typically represent 1-3% of all repair calls.

Category MR2 Poor set up. A device failure caused by poor or incomplete initial installation or set-up of the device. MR2 calls typically represent 1-3% of all repair calls.

Category MR3 Needed recalibration. A device failure attributable to improper periodic calibration. MR3 calls typically represent less than 1% of all repair calls.

Category MR4 Re-repair. A device failure attributable to a poor quality previous repair of the device. MR4 calls typically represent less than 1% of all repair calls.

Category MR5 Intrusive PM. A device failure attributable to earlier intrusive maintenance. MR5 calls typically represent much less than 1% of all repair calls.

1.4.1 Coding repair work orders

The Task Force recommends very strongly that all repair work orders be provided with a field for coding what is judged to be the primary reason (cause) why the device failed. As will be described later, the statistics obtained from this coding are very useful for managing the various failure prevention measures. The recommended format for this coding follows the classification arrangement described immediately above in Section 1.4. For example, a failure that is judged to have been caused by the device having been dropped would be coded as a PR2 failure, and a failure that is judged to have no obvious cause would be coded as an IR1 failure.
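A minimal sketch of how such a coding field might be validated and summarized is shown below. The code table follows Section 1.4; the function name, the data layout and the sample work orders are hypothetical and are not part of any existing maintenance management system.

```python
from collections import Counter

# Cause codes from Section 1.4, with their short labels.
CAUSE_CODES = {
    "IR1": "Random failure", "IR2": "Poor construction", "IR3": "Poor design",
    "PR1": "Use error", "PR2": "Physical damage", "PR3": "Discharged battery",
    "PR4": "Accessory problem", "PR5": "Environmental stress",
    "PR6": "Tampering", "PR7": "Network problem",
    "MR1": "PM-preventable failure", "MR2": "Poor set up",
    "MR3": "Needed recalibration", "MR4": "Re-repair", "MR5": "Intrusive PM",
}

# The first two letters of a code identify its class of causes.
FAMILIES = {"IR": "IRF", "PR": "PRF", "MR": "MRF"}


def family_shares(work_order_codes):
    """Return each class's percentage share of the coded repair calls."""
    for code in work_order_codes:
        if code not in CAUSE_CODES:
            raise ValueError(f"unknown cause code: {code}")
    total = len(work_order_codes)
    if total == 0:
        return {family: 0.0 for family in FAMILIES.values()}
    counts = Counter(FAMILIES[code[:2]] for code in work_order_codes)
    return {family: 100.0 * counts.get(family, 0) / total
            for family in FAMILIES.values()}


# Made-up sample of ten coded repair work orders, for illustration only.
sample_codes = ["IR1"] * 5 + ["PR2", "PR2", "PR1", "PR1", "MR1"]
shares = family_shares(sample_codes)
print(shares)
```

Rejecting unknown codes at entry keeps the aggregated statistics from being diluted by free-text or mistyped values.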

1.4.2 Different measures of device reliability

While the device’s overall reliability (which corresponds directly to the total number of the repair calls - irrespective of what caused them) determines the device's effective reliability, it is the numbers of maintenance-related failures (MRFs) and inherent reliability-related failures (IRFs) that are of greatest interest to us, as maintainers, at this time. The level of MRFs provides a good measure of the effectiveness of the facility’s maintenance program, and the level of IRFs provides an equally good measure of the basic or inherent reliability of the devices in question.

1.5 Which kinds of medical device failures can be hazardous?

There are four ways in which medical equipment failures can be hazardous. However, not all of those failures are PM-preventable failures.

  • If the device is damaged in such a way that it is presenting some kind of direct physical threat to the safety of patients or staff, such as an exposed sharp edge.

For example, the case or enclosure of a piece of equipment might be damaged, say as a result of the item being dropped, in such a way that the damaged casing poses a risk of injury to the patient or user, even though the item still works. Or the protective outer layer of the device's electrical cord might be damaged so that it exposes a live conductor posing the risk of an electric shock. These could be hazardous to the patient, to the device user and possibly others. It is to be expected that damage such as this would be noticed and repaired at the time of its periodic maintenance - so, to the extent that this kind of damage occurs and goes unreported, periodic PM contributes to the levels of overall safety. These are not considered to be PM-preventable failures but periodic PM may shorten the time that individuals are exposed to these potentially hazardous outcomes. Situations such as this appear to be encountered quite rarely.

  • If the failure is a sudden, total failure.

There are a number of devices that are life-supporting in the sense that a sudden, total failure while they are in use could put the patient’s life at risk. Examples include critical care ventilators, anesthesia units, heart lung machines, intra-aortic balloon pumps, external pacemakers, defibrillators, AEDs, cardiac resuscitators, infant incubators, neonatal monitors, apnea monitors - and in some circumstances - patient monitors, oxygen monitors and pressure cycled ventilators. In addition to spontaneous random failures it is possible that a device could suddenly stop working if a part that is recommended for periodic restoration fails prematurely. This could also occur if the maintenance interval has been set too long. The failure of any device that is attributable to the failure of a critical part that requires timely restoration is considered to be a PM-preventable failure. However, situations such as this appear to be encountered quite rarely.

  • If the device develops some kind of hidden failure.

There are some devices that have the potential to cause a patient injury if their functional performance falls below a certain critical point in such a way that the deterioration is not obvious to the user. Examples include a defibrillator whose delivered output energy is significantly lower than the level set by the user, or an infusion device that delivers medication at a significantly lower or higher rate than that set by the user. Similarly, there are some devices that have the potential to cause a patient injury if their compliance with a relevant safety specification falls below an acceptable point and this deterioration is not obvious to the user. Examples include an open ground connection in a device that has exposed metal that could conceivably become "live", and a malfunction in devices that have critical alarms. While, strictly speaking, these failures are not totally prevented by periodic PM, the time that patients are exposed to these potentially hazardous outcomes is reduced. For more on this, see Section 4.7 of HTM ComDoc 4. Elsewhere (see Sections 6.3 and 6.4 of HTM ComDoc 6) we have shown that the exposure of the patient to this possible hazard is reduced from 100% (as it would be with no PM) to a lesser percentage determined by the ratio of the frequency with which the PM testing is performed to the frequency with which the hidden failure occurs. With typical PM intervals in the range of 6 months to 5 years and mean times between failures of these random hidden failures in the range of 50 to 250 years, the exposure of the patient will be reduced by 95 - 99%. Hazardous hidden failures appear to be encountered quite infrequently.
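The following rough sketch illustrates the kind of ratio described above. It assumes, purely for illustration, that the residual exposure can be approximated by the PM interval divided by the MTBF of the hidden failure; the actual derivation is in HTM ComDoc 6 and may differ in detail. The function names are our own.

```python
def residual_exposure(pm_interval_years: float, hidden_failure_mtbf_years: float) -> float:
    """Approximate fraction of time a patient could be exposed to an undetected
    hidden failure. ASSUMPTION: exposure ~ PM interval / hidden-failure MTBF."""
    if pm_interval_years <= 0 or hidden_failure_mtbf_years <= 0:
        raise ValueError("interval and MTBF must both be positive")
    # With no effective PM the exposure is 100%, so cap the ratio at 1.0.
    return min(1.0, pm_interval_years / hidden_failure_mtbf_years)


def exposure_reduction_percent(pm_interval_years: float, hidden_failure_mtbf_years: float) -> float:
    """Percentage reduction in exposure relative to the 100% no-PM case."""
    return 100.0 * (1.0 - residual_exposure(pm_interval_years, hidden_failure_mtbf_years))


# Example: annual PM testing and a hidden-failure MTBF of 50 years.
print(f"{exposure_reduction_percent(1, 50):.0f}% reduction in exposure")
```

Under this simplified assumption, shortening the PM interval or having a longer hidden-failure MTBF both reduce the residual exposure in direct proportion.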

  • If the device is used improperly.

Almost all medical devices have the potential to injure patients if they are used improperly. However, failures of this type cannot be prevented or mitigated by conventional planned maintenance, and they are not considered to be PM-preventable equipment failures. Accident statistics show that misuse of medical devices represents the most common reason for device-related patient injuries.

For more on this subject see HTM ComDoc 8 "Maximizing medical equipment-related reliability and safety"

1.6 Hidden failures

A hidden failure (HF) is said to have occurred when either:

  • the device is performing in a way that is significantly out of specification, but sufficiently similar to the performance that the user wants, that the failure is not immediately obvious to the user, or
  • the device is no longer in compliance with one or more of the safety specifications applicable to the device in question, but this deterioration is also not obvious to the user.

These hidden failures are usually the result of imperceptible random failures in the device's components or subsystems. They are detected through performance or safety tests specified in the manufacturer's recommended maintenance procedure and made during the periodic PMs.

When this more subtle type of failure introduces a significant performance or safety degradation that can be detected only by some kind of performance or safety test, it can constitute a serious safety threat. For example, a heart rate alarm that has malfunctioned so that it no longer goes off at the set limit will remain as a hidden but potentially hazardous failure until the alarm function is checked and the potentially dangerous degradation discovered. The potential level of severity of the outcome of hidden failures will depend on the nature of the failure and on how far the performance or safety flaw is out of specification. For example, a significant reduction in the output of a defibrillator has to be considered life-threatening, but a small excess in the electrical leakage current of a laboratory centrifuge, while it should be noted in the test report, is unlikely to constitute a significant hazard or be considered an imminent threat.

Hidden failures are discovered when the performance verification and safety testing tasks are performed during the PM. When they are found, they should be described in a note on the PM work order or the PM report. It would be helpful if the description of the findings provided enough information to enable a judgment to be made as to the worst case potential level of severity (LOS 3, LOS 2, LOS 1 or LOS 0 - see Section 1.8 below) of the adverse outcome that would have resulted if the hidden failure had not been discovered.

A particularly important type of hidden failure is one that disables the proper operation of an automatic protection mechanism (APM) that is included as a component of the device. An APM is usually incorporated in the device to provide protection against another possible hidden failure that is itself considered to be capable of a serious, potentially life-threatening outcome.

1.7 Classifying and coding PM Findings

In the recommended format of each HTMC generic PM procedure (see column 7 of Table 4) a reporting section is added at the bottom of the procedure. It asks the service person to indicate, by circling one of three letters (A, B or F), whether or not performing the safety verification (SV) tasks to evaluate the performance and safety of the device revealed any significant degradation (latent MR1 failures) or any hidden failures.

PM Code A = nominal. The letter A should be circled when the results of all of the performance and safety tests were in compliance with the relevant specifications, and any other functions tested were within expectations.
PM Code B = minor OOS condition(s) found. The letter B should be circled when one or more conditions were found that were slightly out-of-spec (OOS) or slightly outside expectations. The purpose of this B rating is to create a watch list to monitor for future adverse trends in particular performance or safety features, even though the discrepancy is not considered to be significant at this time. An example of this would be an electrical leakage reading of 310 microamps which is within 5% of the 300 microamp limit. A B rating should be considered a passing grade.
PM Code F = serious OOS condition(s) found. The letter F should be circled when one or more performance or safety features is found to be significantly out-of-spec. (OOS). This is a failing grade and, if it is a high-risk device, it should be removed from service immediately.

The service person is also asked to indicate by circling one of four numbers (1, 5, 9 or 0) the physical condition in which the device parts that were rejuvenated by the traditional PM tasks were found. The numerical ratings should be circled to indicate one of the following findings.

PM Code 1 = better than expected. There was very little or no deterioration; i.e. the physical condition of the restored part was found to be still good.
PM Code 5 = nominal. There was some minor deterioration but no apparent adverse effect on the device’s function; i.e. the physical condition of the restored part was found to be about as expected.
PM Code 9 = serious physical deterioration. The restored part was already worn out and probably having an adverse effect on device function; i.e. the physical condition was found to be considerably worse than expected.
PM Code 0 = no physical restoration required. The device has no parts needing any kind of physical restoration.

If the PM findings are systematically documented each time a PM is performed, then aggregated into a PM Findings database, it will be possible to:

  • get a measure of the PM Yield (the ratio/percentage of problems found during PM to the number of PMs performed), and
  • get an indication of the mean time between failures (MTBF) of any hidden failures, and
  • get an indication of how well the PM interval matches the optimum, which would be when the part being restored has deteriorated, but only to the point just before the deterioration begins affecting the functioning of the device:
    • a preponderance of PM Code 1 findings would indicate that the interval is too short;
    • a preponderance of PM Code 5 findings would indicate that the PM interval is just about right; and
    • a preponderance of PM Code 9 findings would indicate that the interval is too long.
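A PM Findings database of this kind could be summarized along the following lines. This is a sketch under stated assumptions: each finding is stored as a (letter code, number code) pair, a PM is counted as having found a problem when it was rated F or 9, and "preponderance" is taken to mean the most frequently recorded of codes 1, 5 and 9. None of these choices is prescribed by the Task Force, and the sample data are made up.

```python
from collections import Counter


def pm_yield(findings):
    """Percentage of PMs that found a problem.
    ASSUMPTION: a 'problem' is a letter code F or a number code 9."""
    if not findings:
        return 0.0
    problems = sum(1 for letter, number in findings if letter == "F" or number == 9)
    return 100.0 * problems / len(findings)


def interval_assessment(findings):
    """Judge the PM interval from the most frequent of number codes 1, 5 and 9.
    Code 0 (no restoration required) carries no interval information."""
    counts = Counter(number for _, number in findings if number in (1, 5, 9))
    ranked = counts.most_common(2)
    if not ranked or (len(ranked) == 2 and ranked[0][1] == ranked[1][1]):
        return "inconclusive"  # no data, or no single code predominates
    verdicts = {1: "interval too short", 5: "interval about right", 9: "interval too long"}
    return verdicts[ranked[0][0]]


# Made-up findings from eight PMs on one device type, for illustration only.
sample_findings = [("A", 1), ("A", 1), ("A", 1), ("B", 5),
                   ("F", 9), ("A", 0), ("A", 1), ("A", 5)]
print(pm_yield(sample_findings), interval_assessment(sample_findings))
```

In practice a facility would probably want a minimum number of findings, and a clearer margin between codes, before changing a PM interval on the strength of such a tally.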

1.8 Possible adverse outcomes of medical device failures

There is a wide range of possible adverse outcomes from device failures. Some create potential physical harm to the patient (or to the device user). Others can result in additional direct or indirect costs to the facility and thus create an economic or business risk to the organization. We address these economic/business risks in greater detail in HTM ComDoc 9. "Medical devices that may benefit from PM from a business/ economics viewpoint"

When there is a need to conduct some kind of risk analysis or risk assessment, it is helpful to define a hierarchy of levels of severity (LOS) of possible physical harm to the patient (three levels, plus a zero level for no discernible harm) or, in the case of economic harm to the facility, a corresponding hierarchy of levels of economic harm to the business.

Outcomes resulting in possible physical harm

  • LOS 3 = Serious, life-threatening injury - The patient (or the user) may lose his or her life.
  • LOS 2 = Less serious, non life-threatening injury - The patient (or the user) may sustain a direct or indirect injury ranging from minor to serious.
  • LOS 1 = No injury, but possible disruption of care - The incident may cause a temporary disruption of care, such as requiring one or more patients to be rescheduled, delaying treatment or delaying the acquisition of diagnostic information.
  • LOS 0 = No discernible injury or possible disruption of care.

Outcomes resulting in possible economic harm

  • Level 3 = Major economic impact - on the facility’s cost of doing business
  • Level 2 = Significant economic impact - on the facility’s cost of doing business
  • Level 1 = Relatively minor economic impact - on the facility’s cost of doing business
  • Level 0 = No discernible impact - on the facility’s cost of doing business

1.8.1 Life support devices

There are some devices, such as critical care ventilators and defibrillators, on which the patient's continued well-being may be totally dependent. These are sometimes called life support devices. Any type of failure that causes such a device to stop working completely or to stop working properly has the potential to result in an adverse outcome at the highest severity (LOS 3) level. If the device also happens to have one or more non-durable parts that need timely and competent periodic restoration, it becomes critically vulnerable to a wear-out failure and should therefore be given a high priority for PM. The same is true if the device has a hidden failure that could cause a high severity outcome.

1.9 Which kinds of medical equipment failures are PM-related failures?

Of the many ways in which devices can fail (their possible failure modes) listed in Section 1.4 above, there are two kinds that can be described as PM-related, though only one kind (MR1 failures) is PM-preventable:

1. Category MR1 (wear out) failures that have caused the device to stop working completely. These are failures that are caused by a non-durable part not receiving timely, competent restoration.
2. PM Code F (hidden) failures resulting from imperceptible failures of components within the device that do not cause the device to stop working completely but which have reduced the device's performance or safety below a critical level. These failures are discovered when safety verification (SV) tasks are performed during PMs. Although this testing does not totally preclude the possibility that a patient will be exposed to the device while it is in a defective state, the discovery and correction of these hidden failures does shorten the period during which patients are exposed to this potentially hazardous condition. This benefit is addressed more completely in Sections 6.3 and 6.4 in HTM ComDoc 6.

1.10 The five basic questions at the heart of the great PM debate

The foregoing analysis puts us in a position to answer the first of the five basic questions about PM, some of which have been addressed previously in HTM ComDoc 15.

1.10.1 Question 1. To what extent does performing PM on medical equipment improve patient safety?

Generally speaking, PM does improve patient safety, but only to the extent that it detects then corrects the two kinds of PM-preventable failures that were identified just above in Section 1.9 (wear-out failures and hidden failures). And the extent of the improvement in patient safety varies for different devices according to the "level of risk" that the device would have presented if those potential failures had not been detected, and then eliminated. According to the modern theories of risk management, the level of risk takes into account both the level of the severity of the adverse outcome of the event and the likelihood that the event will actually occur.

In this case we are specifically concerned about the level of risk posed by PM-preventable failures, so the extent of the improvement in patient safety is determined by a combination of the potential severity of the outcome of the failure (with the higher levels of outcome severity - such as LOS 3 - being more serious than LOS 2, etc), and the likelihood of the failure occurring. The proper measure of this likelihood of the failure occurring is what the Task Force calls the device's PM-related reliability. We discuss this "likelihood of failing from a PM-preventable cause" more in HTM ComDoc 4 "Consideration of the device's PM-related reliability".

The Task Force has investigated both of these factors. Table 4 provides a ranking of the various device types according to the severity of each device's potential PM-preventable failures. For more on this investigation, see HTM ComDoc 3 "Risk assessment: Determining which medical devices can be made safer (but only a little safer) by PM". The device types at the top of the listing in Table 4 (rows 1 through 7) are judged to have potential PM-preventable failures with life-threatening outcomes. The PM-related reliability of each of the top twenty highest severity device types in Table 4 is currently being investigated, and as the results become available they will be posted to columns C8 and C9 of Table 13. For more on this investigation, see HTM ComDoc 4 "Consideration of the device's PM-related reliability".

The Task Force has set tentative thresholds for what should be considered an acceptable (safe) level of PM-related reliability for the devices in each of the top three categories of potential PM-related risk (namely those labeled high, moderate and low in column C10 of Table 13). From this table, once it is completed, professionals in charge of medical equipment maintenance programs will be able to identify which devices (by manufacturer and model) should continue to be maintained strictly according to their manufacturer's recommendations and, for the others, what level of PM-related reliability (which corresponds to PM-related safety when the category of severity is taken into account) is typically achieved when the indicated PM interval and procedure are used. The Task Force has also suggested a way in which the level of PM-related patient safety can be monitored on a continuous basis (see Section 3.10 of HTM ComDoc 3).

As can be seen from the summary below there are several other benefits from performing regular PM besides improving patient safety.

  • Improving patient safety. … Some devices - but only some - can be made safer (but only a little safer) by performing appropriate PM. Not all failures have the potential to cause a serious injury, and not all failures are PM-preventable.
  • Regulatory compliance. … As we explain more fully in HTM ComDoc 11 the CMS regulation addressing PM for medical devices has traditionally been that all medical devices must be maintained strictly according to the device manufacturers' recommendations. Even after the regulations were changed in 2013 there is still a requirement that certain devices be subjected to periodic PM. (For more on this see HTM ComDoc 16).
  • Better business economics. … As we explain more fully in HTM ComDoc 9, some devices - but only some - are made less costly to maintain by performing appropriate PM.
  • Customer courtesy and/or customer reassurance. … We may choose to perform PM on some devices because a user has asked us to do so, or because we believe that periodically inspecting and cleaning equipment used for patient care creates a reassuring "cared for" appearance that the user staff appreciates. While this is a qualitative rather than a quantitative benefit it should not be underestimated. These periodic inspections may also be useful by leading to the discovery of unreported broken equipment. The Task Force has issued a cautionary note about the possibility of undervaluing this last factor (see Section 16.11 of HTM ComDoc 16).


1.10.2 Question 2. What kind of PM program is required by the current CMS regulation?

The original Medicare legislation in 1965 stated that: "... There must be a regular periodical maintenance and testing program for medical devices and equipment. A qualified individual such as a clinical or biomedical engineer, or other qualified maintenance person must monitor, test, calibrate and maintain the equipment periodically in accordance with the manufacturer's recommendations and Federal and State laws and regulations. ..." But beginning in 1989 and as recently as 2011 the corresponding standards of the Joint Commission allowed equipment that was not considered to present a significant physical risk to be excluded from any specific maintenance requirements stating only that PM frequencies should be based on "criteria such as manufacturer's recommendations, risk levels, or current hospital experience," and they, in effect, endorsed the original Fennigkoh-Smith risk-based methodology.

This changed in 2011 when CMS issued revised regulations that narrowed the still official CMS requirement to use the manufacturer's maintenance recommendations from all equipment to just "All equipment critical to patient health and safety and any new equipment until a sufficient amount of maintenance history has been acquired." The "risk-based" option that TJC had been allowing was effectively rescinded. The revised CMS requirement specifically stated that for what they were now calling equipment critical to patient health and safety "Alternative equipment maintenance (AEM) methods are not permitted." However, there was no clear indication of which particular devices they intended to target with this definition of "critical." They seemed to be placing the responsibility for this onto the facility by stating that the "... hospital may adjust its maintenance, inspection, and testing frequency and activities for facility and medical equipment from what is recommended by the manufacturer, based on a risk‐based assessment by qualified personnel".

Faced with some push-back from members of the HTM community, CMS issued a "clarification" memo in 2013 (HTM ComRef 28) in which they tried to address the uncertainty about the precise meaning of the phrase "equipment critical to patient health and safety". The key language in the 2013 memo is quoted in Section 11.3 of HTM ComDoc 11. Suffice it to say that this new language does not clarify sufficiently what the agency intends by the term "critical"; the Task Force's interpretation of their intention is described in Section 11.4 of HTM ComDoc 11. The new regulatory language does, however, introduce a major concession by allowing devices that are not considered to be "critical" to be included in an Alternative Equipment Management (AEM) program where they can be maintained other than as the manufacturer recommends. As reported also in HTM ComDoc 11, the Task Force summarizes its conclusions about the agency's intention in the form of the following two recommended AEM program inclusion criteria.

Recommended AEM Program Inclusion Criteria

After a careful analysis of the CMS memo the Task Force believes that, except for four specific categories of devices, the agency intends to allow to be included in an AEM program only those devices that meet one, or both, of the following criteria:

  • The device is highly unlikely to cause a serious injury or death to a patient or staff person if it should fail in a way that could have been prevented by the device having been subjected to appropriate PM
  • The device is highly unlikely to fail from a PM-preventable cause

Identification of the four specific categories of devices that cannot currently be included can be found by consulting HTM ComRef 33.

The Task Force's suggestions for implementing an efficient risk-based AEM program that will be compliant with these two criteria are contained in a recently published two-part article in AAMI's BI&T journal (HTM ComRef 35 and HTM ComRef 36). Much of that material is also contained in HTM ComDoc 16 "Implementing a simple CMS-compliant Alternate Equipment Management (AEM) program."


1.10.3 Question 3. How to maximize the efficiency of a planned maintenance (PM) program

HTM ComDoc 10 "Alternate Maintenance Strategies and Maintenance Program Optimization" identifies the following four maintenance strategies that are relevant to maintaining medical devices.

  1. Traditional fixed interval preventive maintenance (often combined with #3, periodic safety verification)
  2. Predictive maintenance
  3. Periodic safety verification
  4. Light maintenance (also known as run-to-failure maintenance)

The least efficient maintenance strategy in terms of using up scarce technical manpower is (#1) the traditional fixed interval preventive maintenance strategy. Predictive maintenance (#2) is the next least efficient. It differs from strategy #1 primarily in effectively extending the interval between restorations or replacement of the device's non-durable parts by substituting a visual inspection for the original restoration task. The most efficient strategy is, of course, the light maintenance strategy (#4). The periodic safety verification strategy is neutral with respect to efficiency because it must be performed on all devices that have a potential high severity (LOS 3) outcome to a hidden failure. It may also be considered prudent to perform periodic safety verification on all devices that are projected to have a less severe potential (LOS 2) outcome to a hidden failure.

Starting with the least efficient situation - a program in which PM is currently being performed on all of the facility's equipment according to the manufacturer's recommendations - implement the following steps:

  • Step 1 Identify which devices can be classified as non-critical devices (see Section 3.8.1 in HTM ComDoc 3), and change these immediately to a run-to-failure maintenance method (i.e. perform no scheduled PM).
  • Step 3 Look over the recommendations below that are taken from Section 4.10 of HTM ComDoc 4 and HTM ComRef 36. Then make the changes that you feel comfortable with (see also .... and HTM ComRef 35).

Recommendations for improving the efficiency of a medical equipment maintenance program

These are potentially hazardous devices with either overt or hidden PM-preventable failures that could cause a life-threatening injury and that are demonstrating PM-related failure rates greater than the currently acceptable level (not more than one failure every 75 years). For these devices, it would be prudent to continue to follow the manufacturer-recommended PM procedure (for both the interval and the scope of the tasks) and to routinely monitor the levels of patient safety being achieved, as described in Section 3.10 of HTM ComDoc 3 and HTM ComRef 35. This should be continued until acceptable evidence exists in the national database (Table 13) that some other procedure with more efficient tasks and/or a longer interval is found to demonstrate the same or better level of PM-related reliability or a comparable level of patient safety.

These are potentially hazardous devices with hidden PM-preventable failures capable of causing a life-threatening injury that are demonstrating PM-related failure rates greater than the currently acceptable level (not more than one failure every 75 years). For these devices, for which the only “maintenance” that the manufacturer recommends is periodic safety verification, it would be prudent to continue to follow the manufacturer-recommended safety verification testing schedule and routinely monitor the levels of patient safety being achieved, as described in Section 3.10 of HTM ComDoc 3 and HTM ComRef 35, until evidence exists in the national database (Table 13) that testing at a longer interval results in the same or better level of PM-related reliability or a comparable level of patient safety.

When testing for possible hidden failures with potential high-severity outcomes, there is no optimum interval; shorter is always better. However, it has been shown (see Section 6.3 in HTM ComDoc 6) that for safety verification-related (hidden) failures with MTBF values greater than about 50 years, the increase in the time that the patient would be exposed to potentially hazardous hidden failures if the testing interval were increased from six months to as long as five years is very small.
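
The relationship between testing interval, MTBF and exposure time can be illustrated with a short calculation. The following is only a sketch and is not the method used in HTM ComDoc 6: it assumes that hidden failures occur at random at a constant rate of 1/MTBF, and that each test detects and corrects any hidden failure. The function name and the model are our own illustrative choices.

```python
import math

def hidden_failure_exposure(interval_years, mtbf_years):
    """Average fraction of time a device spends in an undetected failed state.

    Assumes (illustrative model, not taken from HTM ComDoc 6) a constant
    failure rate of 1/MTBF and a test every `interval_years` that detects
    and corrects any hidden failure.
    """
    x = interval_years / mtbf_years
    # Expected working time within one interval is MTBF * (1 - exp(-x));
    # the remainder of the interval is spent in the hidden failed state.
    return 1.0 - (1.0 - math.exp(-x)) / x

for interval in (0.5, 1.0, 5.0):
    frac = hidden_failure_exposure(interval, mtbf_years=50)
    print(f"interval {interval} y -> exposed about {frac:.2%} of the time")
```

Under these assumptions, a device with an MTBF of 50 years would be in a hidden failed state about 0.5% of the time with six-month testing and about 4.8% of the time with five-year testing. Longer MTBF values give roughly proportionally smaller figures.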

These lower PM-risk devices qualify for inclusion in an AEM program either because of the lower level of severity of the outcomes of potential failures or because they have demonstrated an acceptable level of PM-related reliability. Therefore, they can be maintained using a maintenance procedure or strategy other than that recommended by the manufacturer. They can be transitioned immediately to less stringent PM strategies, such as the cost-efficient light maintenance (run-to-failure) strategy - which is mentioned in Appendix A of the CMS memo (HTM ComRef 28). At the very least, the manufacturer-recommended procedures can be modified (such as by omitting electrical safety checks that the facility has found to be nonproductive), or by extending the testing interval to make it coincide with a more convenient or more efficient routine.

The logical rule here is to explore the national database (Table 13) for evidence of more efficient maintenance procedures. It would be prudent to monitor the levels of patient safety (as described in Section 3.10 of HTM ComDoc 3 and HTM ComRef 35) being achieved by the current procedure (or any of the more efficient procedures, if chosen) for devices categorized as PM priority 2 (moderate PM-risk) devices. Monitoring those in the lower risk categories is much less important but can be undertaken if the facility chooses.

If these devices should fail, there is a negligible or zero additional risk to patient safety. Therefore, in the absence of other regulatory mandates, unless there is a convincing case that periodic PM can be justified through lower maintenance costs, these devices are excellent candidates for the very efficient light maintenance (run-to-failure) strategy. It was by adopting this run-to-failure maintenance strategy in the early 1960s that the civil aviation industry was able to reduce its maintenance costs by 50% while, unexpectedly, also improving the reliability and safety statistics for civilian aircraft by a factor of 200.


1.10.4 Question 4. How to maximize equipment-related reliability and safety (by using an Enhanced Risk Management Program)

The opening paragraph from HTM ComDoc 8 "Maximizing medical equipment-related reliability and safety" reads as follows:

"To the best of our knowledge, all of the studies reported to date have shown that only a very small percentage of injuries resulting from failures of medical devices are attributable to poor maintenance. (See, for example, reference HTM ComRef 12.) And, as we describe in Section 1.4 of HTM ComDoc 1, ...the great majority of medical device failures can be attributed to one or other of a fairly wide range of other causes.... However, if the cause of each device failure is routinely documented in the manner suggested in that same section of HTM ComDoc 1, this information (on which of those causes is currently contributing the most to device failures in a particular facility) can be very helpful in managing device failure prevention activities other than PM, and in monitoring the effectiveness of those efforts."

So, if our overall goal is to reduce the number of medical device failures, it makes sense to investigate ways in which these other causes can be reduced or eliminated. In HTM ComDoc 8 we point out that, based on the general statistics on causes of device failures, the most effective strategy for reducing failures of the critical life support device types is to:

  1. Give preference during device acquisition to those devices that are reported to have the highest level of inherent reliability. The possible impact of this strategy is unknown at this time but current statistics indicate that the inherent unreliability of the devices themselves accounts for 45-55% of all failures.
  2. Implement additional measures to reduce failures from the list of causes presented immediately below. They are listed in descending order of anticipated effectiveness.
     • 13-20% - User-related issues such as controls or switches that have been set incorrectly. Although this type of failure may not always lead to a complete loss of function, it can have the same effect as actual failure. For example, an incorrectly set defibrillator can jeopardize patient resuscitation. (These Category PR1 calls typically represent 13-20% of all of the repair calls.)
     • 7-8% - Problems related to a poor rechargeable battery management program. (These Category PR3 calls typically represent 7-8% of all of the repair calls.)
     • 6-25% - Physical damage usually caused by a combination of poor design and user carelessness, such as dropping the device. (These Category PR2 calls typically represent 6-25% of all of the repair calls.)
     • 3-9% - Problems with an accessory, such as patient cables and electrodes. (These Category PR4 calls typically represent 3-9% of all of the repair calls.)
     • 1-7% - Problems resulting from an out-of-specification environmental condition, such as poor control of the ambient temperature. (These Category PR5 calls typically represent 1-7% of all of the repair calls.)
     • 1-4% - Lack of timely PM, i.e. failing to restore (replace or refurbish) a part of the device that requires periodic attention. (These Category MR1 calls typically represent 1-4% of all of the repair calls.)
     • 1-3% - Poor installation or poor initial set-up of the device. (These Category MR2 calls typically represent 1-3% of all of the repair calls.)
     • <1% - Tampering with internal switches or other controls that are not intended to be user-accessible. (These Category PR6 calls typically represent <1% of all of the repair calls.)
     • <1% - Problems due to an issue with a data transmission network connected to the device’s output. (Category PR7 calls.)

We also note in HTM ComDoc 8 that the best way to reduce potentially critical hidden failures in those device types that are most vulnerable to those kinds of failures (i.e. the device types listed in the first 11 rows of Table 2) is to:

  1. First, select versions of the device that have built-in self testing to verify that the device is functioning safely,
  2. Second, be diligent about following the manufacturer's recommendations for periodic safety verification testing, and
  3. Third, consider implementing pre-use inspections or testing to verify that the device is functioning safely immediately prior to use.

*Enhanced Risk Management Program. A very beneficial use for some or all of the resources made available by improving the efficiency of the facility's maintenance program would be to implement an enhanced Risk Management Program incorporating some or all of the additional measures described above.


1.10.5 Question 5. What changes to current PM work practices would be most beneficial?

As we state in Section 15.3 of HTM ComDoc 15, there is absolutely no question that the most beneficial change would be for us to standardize the way we perform and report our maintenance activities.

There are three extremely important benefits that can be realized if the managers of the HTM community's maintenance programs can be persuaded to standardize on a common format for their maintenance reporting.

  1. Maintenance findings could be aggregated into a single, community-wide database which would then produce very helpful safety statistics on at least the more popular medical devices very quickly.
  2. A comprehensive coding system for documenting the way devices fail would provide the data we need to optimize the effectiveness of the facility's maintenance and non-maintenance equipment safety strategies.
  3. By analyzing the findings of the PMs we perform on critical equipment we could select the right intervals to use for critical PMs.

Helpful safety statistics from the community-wide maintenance database

As was noted in Section 4.5 of HTM ComDoc 4, collecting the amount of data needed for an evidence-based approach to an equipment maintenance strategy will be problematical for most individual healthcare facilities. In many cases they will be unable to collect sufficient data in a reasonable period of time to make their failure rate statistics credible. However, data collected in a consistent, common format can be aggregated into a single database.

This statistical complication arises from three factors.

  • First, because they are designed and constructed differently, different manufacturer-model versions of a given device type, such as defibrillators, can be expected to show different levels of reliability. So each different manufacturer-model combination has to be analyzed and characterized separately.
  • Second, most individual healthcare facilities will probably have only a small number of the individual device types that are PM-critical at the highest severity level. (See rows 1-20 of Table 4).
  • Third, devices that are potentially PM-critical are likely to be carefully designed and fabricated to have a relatively low failure rate.

The result of these complicating factors is that individual facilities will probably not generate enough data to get a good indication of each device’s true PM-related reliability and PM-related level of safety. To get accurate estimates of the reliability of high reliability devices it will be necessary to pool maintenance statistics for each manufacturer-model version of each device from a number of institutions.

For example, suppose a facility has only three similar (same manufacturer, same model) heart-lung units and only three years of maintenance history for each unit. Since the facility has a total of only 9 device-years of experience, it is unlikely, if the actual MTBF of the units is, say, 50 years or more, that the facility will have experienced even a single failure during the 3-year testing period. In this case they would have to report their finding with respect to the devices’ indicated reliability (zero failures over 9 device-years) as undetermined. If, however, they did experience one or more failures of one of these devices during this relatively short period, then the indicated MTBF will appear to be unacceptably short for a critical device. In this situation it would be prudent for the facility to consult the findings on the reliability of these specific types of device in the national database to see whether their particular experience was indeed typical (and this type of device is, in fact, not sufficiently reliable) or atypical. The Task Force has set a tentative level of 50 device-years as the minimum level for reasonable credibility (see Section 4.5 of HTM ComDoc 4).
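
The arithmetic behind this example can be sketched as follows. This is a minimal illustration, not a Task Force tool: the function names are hypothetical, and the probability calculation assumes random failures at a constant rate (a Poisson model), which is our assumption rather than something stated in HTM ComDoc 4.

```python
import math

MIN_CREDIBLE_DEVICE_YEARS = 50.0  # tentative Task Force level cited in the text

def observed_mtbf(units, years_each, failures):
    """Return (device_years, mtbf_or_None, credible) for one device model.

    mtbf is None ("undetermined") when no failures were observed.
    """
    device_years = units * years_each
    mtbf = device_years / failures if failures > 0 else None
    return device_years, mtbf, device_years >= MIN_CREDIBLE_DEVICE_YEARS

def prob_no_failures(device_years, true_mtbf):
    """Chance of seeing zero failures, assuming random failures at a
    constant rate of 1/true_mtbf (a Poisson model; our assumption)."""
    return math.exp(-device_years / true_mtbf)

print(observed_mtbf(3, 3, 0))             # three units, three years, no failures
print(observed_mtbf(3, 3, 1))             # same experience base with one failure
print(round(prob_no_failures(9, 50), 2))  # chance of no failures if true MTBF is 50 years
```

With 9 device-years and a true MTBF of 50 years, this model gives roughly an 84% chance of observing no failures at all, which is why such a small sample cannot distinguish a reliable device from an unreliable one. A single failure in the same period would indicate an MTBF of only 9 years, on an experience base well below the 50 device-year credibility level.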

Also, as was noted in Section 4.9 of HTM ComDoc 4, the summary proof tables (Table 5) are the most valuable part of the community database. In Section 1.9.4, above, we described how the statistics in Table 5 can be used to identify the most common causes of equipment failures.

The hard evidence showing which PM intervals are optimum for different kinds of PM-critical equipment and what levels of PM-related reliability are achieved at those intervals

As we describe in Section 4.7 of HTM ComDoc 4, adopting a coding system for PM findings similar to that described in that section and systematically documenting these findings each time a PM is performed, then aggregating that data, will make it possible to obtain two very important pieces of information:

1) An indication of how well the PM interval matches the optimum. The optimum PM interval is when the parts being restored have deteriorated but not to the point where the deterioration has started to affect the functioning of the device. The indicators for how close the interval is to this optimum are as follows. A preponderance of:
  • PM code 1 findings (still very good) is an indicator that the interval is too short.
  • PM code 5 findings (about as expected) is an indicator that the interval is about right.
  • PM code 9 findings (already worn out) is an indicator that the interval is too long.
2) A numerical MTBF indicating the device’s level of PM-related reliability. This indicator is the lesser of the following MTBF values (representing the lower level of PM-related reliability):
  • The MTBF based on the total of any overt failures caused by inadequate device restoration (MR1 calls from the repair cause coding) and any PM code 9 findings (which are immediate precursors of the overt failures caused by inadequate restoration).
  • The MTBF based on the total of any hidden performance and safety degradations detected by the safety verification tasks (PM code F findings).
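
The two indicators above can be sketched in code as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "preponderance" is interpreted as the single most frequent code (the text gives no numerical cut-off), and each MTBF is taken as device-years divided by the relevant count of findings.

```python
def interval_indicator(code1, code5, code9):
    """Classify the PM interval from counts of PM code 1, 5 and 9 findings.

    "Preponderance" is interpreted here as the single most frequent code
    (an assumption; the text does not define a numerical cut-off).
    """
    counts = {"too short": code1, "about right": code5, "too long": code9}
    top = max(counts.values())
    leaders = [label for label, n in counts.items() if n == top]
    if top == 0 or len(leaders) > 1:
        return "no clear preponderance"
    return leaders[0]

def pm_related_mtbf(device_years, mr1_calls, code9, code_f):
    """Lesser of the wear-out MTBF and the hidden-failure MTBF, in years.

    Returns None when neither kind of finding has been recorded.
    """
    candidates = []
    if mr1_calls + code9 > 0:
        # Overt wear-out failures plus their immediate precursors.
        candidates.append(device_years / (mr1_calls + code9))
    if code_f > 0:
        # Hidden failures found by safety verification tasks.
        candidates.append(device_years / code_f)
    return min(candidates) if candidates else None

# Hypothetical pooled data for one manufacturer-model combination.
print(interval_indicator(code1=4, code5=30, code9=2))
print(pm_related_mtbf(device_years=120, mr1_calls=1, code9=2, code_f=1))
```

In this made-up example the interval would be judged about right, and the PM-related MTBF would be 40 years, because the wear-out figure (120 device-years over 3 findings) is lower than the hidden-failure figure (120 device-years over 1 finding).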

This page was last modified 21:26, 8 January 2019.