
HTM ComDoc 4.

From HTMcommunityDB.org



PM-related reliability: Determining the likelihood of PM-preventable failures (under revision)

(This document was last revised on 2-22-17)

1.5 Measuring PM-related device reliability (from HTM ComDoc 1.)

A device or equipment system is considered to have failed when (a) it no longer performs the function or functions that the user wants it to perform, or (b) it functions as it should, but in an unsafe manner. It is a truism, similar to the impossibility embedded in the concept of perpetual motion, that there can be no such thing as an infallible device - one that cannot fail. All devices fail in some way or other, at some time or other, so there can be no absolutes in a scale of reliability, only relative measures. The simplest measure of a device’s reliability is its failure rate. However, a more intuitive way of expressing reliability is the inverse of the failure rate, which is the device’s mean time between failures, or MTBF.

It is generally easier for lay persons to relate to an MTBF because it is in the form of a period of time - a simple, easily comprehended variable. Most people will have little difficulty in considering a device with an MTBF of just one month to have a relatively poor level of reliability and, conversely, considering a device with an MTBF of 50 years to be quite reliable. Since, ideally, we would like to separate different kinds of devices into neat, compartmentalized categories such as “safe” and “hazardous”, we have to confront the difficulty of setting boundaries and the consequent gray areas around those boundaries. For example, setting a threshold of, say, 75 years for the MTBF that should be considered safe creates the hard-to-answer question of how much less reliable (and therefore potentially less safe) a device with an MTBF of 74 years is than one with an MTBF of 75 years. There is, of course, no simple black and white answer to that question. It is all relative.

This discussion is made a little more complicated by the fact that there are a number of different reasons why devices fail, and lumping all of these failures for these different reasons into one overall failure rate, or corresponding MTBF, might well prompt the objection that this total failure rate does not fairly describe what we think of as either the reliability of the device itself or the effectiveness of the way we maintain it. The next section addresses the nature of these different kinds of causes of failure and how they can be categorized and used to develop a helpful and meaningful analysis.

4.1 Introduction

In HTM ComDoc 3 we used a systematic and logical, risk-based process - together with relatively conservative assumptions - to identify a large number of device types that can, without the need for any additional supporting evidence, be considered to be not PM-critical or, more simply, non-critical.

These non-critical device types are judged to be incapable of creating any potentially hazardous outcome, whether they fail completely or develop any kind of hidden failure. This initial risk analysis separates out seventy-one (71) device types (see Table 4.) that will require the generation of appropriate evidence of their PM-related reliability to prove that they are, in fact, highly unlikely to fail or degrade in a way that could cause harm or injure a patient. The most convincing evidence that the level of patient safety with respect to the periodic maintenance being conducted on potentially PM-critical devices in the nation’s hospitals is acceptable would be documentation of the actual levels of PM-related reliability with respect to both total failures (in the case of device restoration-critical devices) and critical hidden failures (in the case of safety testing-critical devices). This will require us to gather actual, real-world data from the repair records and the findings of the PM currently being performed on these devices.

4.2 Quantifying actual levels of PM-related reliability and safety for different kinds of medical devices

As noted in Section 3.3.2 Maintenance-related safety: Critical devices vs. high-risk devices of HTM ComDoc 3, “risk” is defined as a combination of two factors:

  1. The severity of the outcome of the event – in this case a device failure - which is characterized in the tables by an attributed level of severity (LOS 1, 2 or 3); and
  2. The probability that the event (i.e. the device failure or the critical degradation) will actually occur (conveniently expressed as a mean time between failures or MTBF).

Therefore, if total failure or critical degradation of the device is highly unlikely, the level of risk associated with using the device is correspondingly small. Devices that are classified in the tables as having potentially life-threatening severity (LOS 3) outcomes from either a total failure or from some kind of critical degradation should more properly be called “potentially hazardous” or “potentially high-risk” devices. The actual level of risk at each of the three levels of severity is determined by the probability that the device will fail, either totally or by developing some kind of significant degradation. To illustrate this concept, think about two different modes of air travel. Because they don’t crash very often, commercial airplanes are generally considered to be safe, even though they have the same worst-case high-severity outcome (i.e. crashing) as experimental aircraft. It is because they have a much less reassuring track record with respect to their likelihood of crashing that experimental aircraft are considered to be a higher risk (i.e. not as safe as commercial aircraft).

4.3 Using average PM-related device failure rates as a measure of PM-related device safety

In the case of device types that are considered to be device restoration-critical (see Table 2.), the likelihood that the device will fail from a PM-related cause is exactly the same as the average rate at which this type of device has been found to stop working from a PM-related cause while it has been in use at either one particular hospital, or at a group of hospitals with similar operating conditions. Similarly, for safety verification-critical devices (see Table 3.), the likelihood of a particular high-consequence degradation occurring is exactly the same as the average rate at which that same degradation failure has been found to occur during periodic performance verification and safety testing of either this device or similar (same manufacturer-same model) devices.

4.4 Determining the levels of PM-related safety associated with device restoration-critical devices

Not all device restoration-critical devices will necessarily be high-risk devices. Different manufacturer-model versions of each device type will very probably have different average PM-related reliability (failure rates). Some manufacturer-model versions may be quite reliable with respect to PM-related failures and therefore relatively safe from a PM point of view. Other manufacturer-model versions may be less reliable with respect to PM-related failures and therefore less safe from a PM point of view. The average rate at which a particular version (same manufacturer-same model) of a particular device type fails from a PM-related failure can be determined from the aggregated maintenance histories of a collection of similar (same manufacturer-same model) devices. These PM-related failure rate statistics provide us with the all-important evidence we need to determine the actual level of PM-related risk present when these particular devices are in use. If the average PM-related failure rate of the device is very low, then the level of risk associated with complete failure of the device from a PM-related failure is also very small - even for devices whose failure can be life-threatening.

As we described in Section 1.5 Measuring device reliability of HTM ComDoc 1, it is more intuitive and probably easier for lay persons to consider a device's reliability in terms of its MTBF (mean time between failures), which is the inverse of the device’s failure rate. For example, a collection of similar devices that experienced 2 failures during a total experience period of 50 device-years has an average failure rate of (2/50 =) 0.04 failures per year and an MTBF of (50/2 =) 25 years. This means that this particular kind of device (same manufacturer-same model) can be expected to fail about once every twenty-five years. An MTBF of 25 years may not, however, be considered an acceptable level of risk for a device whose failure might be life-threatening (or even for a device whose failure has the potential to cause a patient injury). We have temporarily set the threshold of acceptability at 50 years (see Section 7.6 in HTM ComDoc 7.).
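The failure-rate and MTBF arithmetic above can be sketched in a few lines of Python. The numbers mirror the worked example in the text (2 failures in 50 device-years); the function names are illustrative, not part of any HTM ComDoc tool.

```python
def failure_rate(failures: int, device_years: float) -> float:
    """Average PM-related failures per device-year."""
    return failures / device_years

def mtbf(failures: int, device_years: float) -> float:
    """Mean time between failures, in years (the inverse of the failure rate)."""
    return device_years / failures

# The example from the text: 2 failures over 50 device-years
print(failure_rate(2, 50))  # 0.04 failures per year
print(mtbf(2, 50))          # 25.0-year MTBF
```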

We hope that devices that have been classified as device restoration-critical at the life-threatening level - such as defibrillators, critical care ventilators and heart-lung bypass units - will prove to have a high level of PM-related reliability with MTBFs in the range of 75 to 100 years, or more. However, most facilities will have only small numbers of these critical devices and this will make it difficult to gather meaningful MTBF statistics from the maintenance records of just this one facility. For example, suppose that the facility has only three similar (same manufacturer-same model) heart-lung units and only three years of maintenance history for each unit. Since the facility has only (3x3=) 9 device-years of experience, it is unlikely – if the actual MTBF of the units is, say, 50 years - that the facility will have experienced even one single failure. In this case, their finding with respect to indicated reliability (zero failures over 9 device-years) will have to be reported as indefinite. To get reasonably accurate estimates of the reliability of high-reliability devices it will be necessary to pool maintenance statistics for each particular device (same manufacturer-same model) from a number of institutions.
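The "indefinite" finding above can be quantified. If we assume failures arrive as a Poisson process (a common reliability-modelling assumption, not stated in the ComDocs), the chance of a small facility seeing zero failures is high even when the true MTBF is only 50 years:

```python
import math

def prob_zero_failures(device_years: float, true_mtbf: float) -> float:
    """P(no failures observed) under an assumed Poisson failure model."""
    expected_failures = device_years / true_mtbf
    return math.exp(-expected_failures)

# The heart-lung example from the text: 9 device-years, true MTBF of 50 years
p = prob_zero_failures(9, 50)
print(round(p, 2))  # 0.84: zero observed failures is the most likely outcome
```

This is why pooling maintenance statistics across institutions is necessary: a single facility's zero-failure record says little about whether the true MTBF is 50 years or 500.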

This brings us to the need to consider another important factor. When pooling any kind of data it is necessary to ensure that the data sets are reasonably comparable. Since one of our goals here is to determine the intrinsic reliability of one particular manufacturer–model of each device type, it will be necessary to separate out of each set of failure rate statistics, for each device, those recorded failures that were caused by something other than the intrinsic reliability of the device itself. As we described in Section 1.6 What causes equipment systems to fail of HTM ComDoc 1, there are a number of reasons why devices fail, or appear to fail, besides their inherent unreliability. The device’s inherent failure rate is determined by the nature and quality of the device’s design, the quality of the components selected and the quality of construction. Several other causes of failure usually add to the inherent failure rate to make up the device’s total failure rate. HTM ComRef 8. describes a method of classifying and coding the facility’s repair calls according to the kind of situation that prompted the call, and the authors of that report and others (HTM ComRef 9.) have reported on the typical fraction of the calls that originate from these other situations (These results are reproduced in paragraph 4.5 below). This method of coding and the subsequent analysis is valuable in revealing what other issues need to be addressed in order to reduce the failure rates of the facility’s critical devices to the absolute minimum. We discuss this further in HTM ComDoc 8.

4.5 Repair Call Cause Coding

(From HTM ComDoc 1. Section 1.6 "What causes equipment systems to fail?")

There are a number of reasons why equipment systems fail and it is particularly important to recognize that not all of these failures can be pre-empted by some kind of periodic planned maintenance. Consider, for example, the following list of possible causes of device failure:

  • The first set of causes can be classified as inherent reliability-related failures (IRFs) that are attributable to the design and construction of the device itself, including the inherent reliability of the components used in the device. They typically represent 45 - 55% of the repair calls. This type of failure can be reduced (but not to zero) only by redesigning the device or changing the way it was constructed.

Category IR1. Random failure or malfunction of a component part of the device. A random failure; a result of the device’s inherent unreliability. These typically represent between 46-52% of all repair calls.

Category IR2. Poor fabrication or assembly of the device itself.

Category IR3. Poor design of the hardware or processes required to operate the device.

  • The second set of causes can be classified as process-related failures (PRFs). They typically represent 40 - 50% of the repair calls. Reducing or eliminating these types of failure typically requires some kind of redesign of the system’s processes - for example, by using better methods to train the equipment users to operate the equipment (as intended by the manufacturer) or to train them to treat the equipment more carefully. They are not failures that can be prevented by any kind of maintenance activities.

Category PR1. Incorrect set-up or operation of the device by the user. User has not set the device up correctly or does not know how to operate it. Typically these represent between 13-20% of all repair calls. (Note that although this type of “failure” does not represent a complete loss of function, it can have the same effect. For example, an incorrectly set defibrillator can result in a failure to resuscitate the patient).

Category PR2. Subjecting the device to physical stress outside its design tolerances. Device has been subjected to physical stress beyond its design tolerances. These typically represent between 6-25% of all calls.

Category PR3. Problem resulting from failure to recharge a rechargeable battery. These typically represent between 7-8% of all calls.

Category PR4. Using a wrong or defective accessory. Problem due to the device being used with a defective accessory. These typically represent between 3-9% of all calls.

Category PR5. Exposing the device to environmental stress outside its design tolerances. Device has been exposed to environmental conditions beyond its design tolerances. These typically represent between 1-7% of all calls.

Category PR6. Human interference with the device. A result of someone tampering with an internal control. These typically represent <1% of all calls.

Category PR7. Problem due to an issue within a data network connected to the device’s output.

  • The third set of causes can be classified as maintenance-related failures (MRFs). They typically represent 2 - 4% of the repair calls. These types of failure can be prevented through some kind of maintenance strategy incorporated into the facility’s maintenance program.

Category MR1. Problem due to inadequate restoration of a manufacturer-designated non-durable part (inadequate PM). These calls typically represent between 1-3% of all calls.

Category MR2. Poor or incomplete initial installation or set-up of the device. Problem arising from a poor or incomplete initial installation. These typically represent between 1-3% of all calls.

Category MR3. Problem attributable to insufficient or improper periodic testing and/or calibration (inadequate PM).

Category MR4. Problem attributable to a poor quality previous repair of the device.

Category MR5. Problem attributable to earlier intrusive maintenance.

While the device’s effective reliability corresponds directly to the total number of repair calls - irrespective of what caused them - it is the numbers of maintenance-related failures (MRFs) and inherent reliability-related failures (IRFs) that are of greatest interest to us, as maintainers, at this time. The level of MRFs provides a good measure of the effectiveness of the facility’s maintenance program, and the level of IRFs provides an equally good measure of the basic or inherent reliability of the devices in question.
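The cause-coding scheme above lends itself to a simple tally. A minimal sketch, assuming each repair call has already been coded: the code families (IR1-IR3, PR1-PR7, MR1-MR5) come from the text, but the record format and helper names are hypothetical.

```python
from collections import Counter

# Cause-code families from Section 4.5
IRF_CODES = {"IR1", "IR2", "IR3"}
PRF_CODES = {"PR1", "PR2", "PR3", "PR4", "PR5", "PR6", "PR7"}
MRF_CODES = {"MR1", "MR2", "MR3", "MR4", "MR5"}

def cause_fractions(call_codes):
    """Fraction of repair calls falling in each cause family."""
    counts = Counter(call_codes)
    total = sum(counts.values())
    tally = lambda codes: sum(counts[c] for c in codes) / total
    return {"IRF": tally(IRF_CODES),
            "PRF": tally(PRF_CODES),
            "MRF": tally(MRF_CODES)}

# Hypothetical log of coded repair calls for one device type
calls = ["IR1", "PR1", "IR1", "PR2", "MR1", "IR2", "PR1", "IR1", "PR4", "IR3"]
print(cause_fractions(calls))  # {'IRF': 0.5, 'PRF': 0.4, 'MRF': 0.1}
```

The MRF fraction is the measure of maintenance-program effectiveness described above; the IRF fraction isolates the device's inherent reliability.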

4.6 Determining the levels of PM-related safety associated with safety testing-critical devices

For safety testing-critical device types (see Table 3.), the likelihood of a particular hidden failure occurring is the same as the average rate at which the various hidden failures have been found to occur during periodic performance verification and safety testing conducted on a collection of similar versions (same manufacturer-same model) of each particular device type.

As we note in HTM ComDoc 5. The most useful format for generic PM procedures, it is important that the maintenance procedure include a complete list of safety testing tasks that will confirm (or not) that all of the device’s potentially critical failure modes (those that would lead to worst-case adverse outcomes) have been checked or tested. For example, in the case of an apnea monitor, which has been classified as safety testing-critical at the life-threatening level, the sensor not detecting a reduced air flow and the alarm component failing, have been identified as two critical failure modes. Therefore the maintenance procedure should include separately numbered line-item tasks to confirm that the breathing detector is operating within its specified performance specifications and that the alarm or notification device is performing properly. The maintenance documentation should note the results of testing for the proper functioning of these two components separately so that subsequent analysis of the maintenance records will allow separate computation of the failure rates of each of these two potentially critical hidden failures.
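The per-failure-mode record-keeping described above is what makes separate failure-rate computation possible. A minimal sketch, assuming a hypothetical PM-record format in which each check notes, per line-item task, whether a hidden failure was found:

```python
from collections import defaultdict

def hidden_failure_mtbf(pm_records, years_per_check):
    """MTBF in years for each critical failure mode, from PM findings.

    pm_records: list of dicts mapping failure-mode name -> bool
                (True = hidden failure found at that PM check).
    """
    failures = defaultdict(int)
    for record in pm_records:
        for mode, failed in record.items():
            failures[mode] += int(failed)
    exposure = len(pm_records) * years_per_check  # total device-years observed
    return {mode: (exposure / n if n else float("inf"))
            for mode, n in failures.items()}

# Hypothetical findings from 100 annual PM checks on apnea monitors:
# one flow-sensor failure and one alarm failure found.
records = ([{"flow_sensor": False, "alarm": False}] * 98
           + [{"flow_sensor": True, "alarm": False}]
           + [{"flow_sensor": False, "alarm": True}])
print(hidden_failure_mtbf(records, 1.0))  # 100-year MTBF for each mode
```

Recording the two findings under one combined "pass/fail" line item would make this separate computation impossible, which is the point of the numbered line-item tasks.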

4.7 Acceptable levels of PM-related device reliability and PM-related device safety

The Task Force has chosen to use the typical PM-related failure rates (or their equivalent mean times between failures) found when various devices have been routinely subjected to periodic planned maintenance in strict accordance with their manufacturer's recommendations as a reasonable benchmark for what should be considered an acceptable level of PM-related reliability. While, at this time (February 2017), we have not collected sufficient reliable data to establish what this benchmark should be, early indications from just one type of device (Table 5.4 Defibrillator/monitors) are that this type of device is showing an average PM-related reliability of one PM-related failure every 125 years, with a widely scattered range between 8 and 388 years.

With this as a very broad indicator the Task Force has set the more conservative figure of 75 years as a tentative placeholder for the acceptable level of reliability for devices that have worst case adverse outcomes at the highest level of severity (LOS 3). We have set slightly lower placeholder thresholds for devices with worst case adverse outcomes at the less severe LOS 2 and LOS 1 levels (Table 12.). As can be seen from the table, the Task Force has also set tentative placeholder values for two additional higher levels of reliability for each of the three levels of outcome severity. This results in a spread of five levels of PM Priority, from PM Priority 1 (for the devices that are most critical from a need-for-PM perspective) to PM Priority 5 for the least critical.
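A simple acceptability check against these placeholder thresholds might look like the following sketch. Only the 75-year LOS 3 figure is stated in the text; the LOS 2 and LOS 1 values below are made-up illustrations, NOT the actual Table 12 values.

```python
# Tentative placeholder thresholds, in years of MTBF, by level of severity.
# Only the LOS 3 value (75 years) is stated in the text; the LOS 2 and
# LOS 1 values are hypothetical stand-ins for the real Table 12 entries.
PLACEHOLDER_MTBF_THRESHOLDS = {3: 75, 2: 50, 1: 25}

def meets_reliability_threshold(observed_mtbf: float, severity: int) -> bool:
    """True if the observed PM-related MTBF meets the tentative
    acceptable level for the device's worst-case outcome severity."""
    return observed_mtbf >= PLACEHOLDER_MTBF_THRESHOLDS[severity]

print(meets_reliability_threshold(125, 3))  # True: like the defibrillator average
print(meets_reliability_threshold(60, 3))   # False: below the 75-year placeholder
```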

This page was last modified 20:44, 1 October 2018.