Availability Management is often seen as the 'elephant in the room'... something that we know we need to discuss or implement, but often left for others or another day...
We recently blogged about 'Getting Started With Availability Management'; hopefully you read it. Now we have produced a checklist to give you another prod... an elephant prod perhaps...
Whether implementing, operating or seeking to improve Availability Management, you should consider the items on the following checklist:
1. Availability Management Is Carried On Two Interconnected Levels – Service Availability And Component Availability.
The key issue here is the “interconnectedness”. It will be impossible to predict and manage service availability if you do not understand which components combine (and in which manner they combine) to form services. Equally it will probably be a waste of time trying to manage the component availability of every component, especially if many of those components are only used by less important services.
It will also be very important to understand the “interconnectedness” of Enabling Services and Core Services. Core Services are those that the users directly associate with achieving their outcomes. Enabling Services have to be present for the Core Services to work. Customers and users are not particularly aware of the Enabling Services – unless they fail and in turn cause a failure of the Core Service. Improving the availability of an Enabling Service may improve the availability of many Core Services.
Enhancing Services may also be present. Often they are offered “free” and are not covered by an SLA. However if they are covered by an agreement, their “interconnectedness” with components, Core and Enabling services will also need to be understood.
There is an important implication here – you need to have an up-to-date Service Portfolio to understand the importance of each service.
2. The Importance Of A Service Asset & Configuration Management (SACM) Process.
As mentioned in Point 1 (above) understanding the “interconnectedness” of the components and services is critical. If you do not have a mature SACM process in place, then you will be forced to map components to services on an individual service-by-service basis. This is clearly very inefficient, with lots of duplicated work and data, as soon as you perform this for more than one service.
Therefore it is important to develop the SACM process along with the availability process. Creating the SACM process (and the Configuration Management System) can be a significant investment of time, effort and money - but the benefits will be felt throughout the service providing organisation (and by extension, the customers will also see benefits). Consider developing the SACM process as you develop Availability Management.
On the other hand, don’t allow the absence of SACM to stop you from doing any availability tasks at all!
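Even without a full Configuration Management System, a single shared mapping avoids repeating the work service by service. The following is only a minimal sketch: the service and component names are invented for illustration, and a plain in-memory dictionary stands in for whatever your SACM process actually records. It shows how one mapping can answer both "what does this service depend on?" and "which services does this component affect?".

```python
from collections import defaultdict

# Hypothetical service-to-component map; all names are illustrative only.
SERVICE_COMPONENTS = {
    "email": ["mail-server", "directory", "core-switch"],
    "payroll": ["payroll-app", "database", "directory", "core-switch"],
}

def services_by_component(service_components):
    """Invert the map so each component lists the services that rely on it."""
    inverted = defaultdict(set)
    for service, components in service_components.items():
        for component in components:
            inverted[component].add(service)
    return dict(inverted)

impact = services_by_component(SERVICE_COMPONENTS)
print(sorted(impact["directory"]))  # ['email', 'payroll']
```

Components that appear against many services (or against the most important ones) are the obvious candidates for component-level availability management.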
3. Agree With Your Stakeholders When, Where And How You Will Measure Service Availability.
In an ideal world the availability of any given service will be constantly measured at the point of delivery to the service users. This has the following advantages:
i) Any single-point-of-failure unavailability anywhere within the service will show up as unavailability at the point of delivery, so you can be sure that you have correctly measured the overall service availability.
ii) What you subsequently report to the customers and users will match their actual experience of the service.
However there are a number of practical difficulties around this. If there are a very large number of user end points of the service, the act of measuring and reporting on each transaction may produce an unacceptable additional load on the processors and networks – in other words the measurement of availability may begin to adversely affect the availability.
It may be that user end points are not under the control of the service provider, for example mobile devices. It therefore may be agreed that only elements of the service directly under the control of the service provider will be used to measure the availability of the service. This also has implications for elements of the service that are under control of suppliers – should they be included as part of the measurement?
4. If You Are Not Measuring Service Availability At The User End Point, You Will Have To Calculate Availability Based Upon The Availability Of Service Components.
Assuming you are able to identify all components (or at least all of the critical components) of the service you will have to have an agreed, consistent and repeatable way of measuring the service availability based on the component availability.
See the points below for methods of doing this.
5. How Can You Predict Service Availability Based On The Availability Of Components?
If you know the likely availability of each component of a service (expressed as a percentage) you can predict the future availability of the service using some mathematics.
If the components are in series the maths is very simple. For example if the likely availability of Component 1 (C1) is 98%, C2 is 97% and C3 is 99%, then the calculation is a straightforward multiplication:
C1 * C2 * C3 = 98% * 97% * 99% = 94.1%
Note that this gives you the likely future availability of this system over some time period. The actual availability may be greater or less than this figure.
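The series calculation above can be sketched in a few lines of Python. This is an illustration only: the function name is our own, availabilities are expressed as fractions rather than percentages, and the multiplication assumes that the components fail independently of one another.

```python
import math

def series_availability(availabilities):
    """Predicted availability of components in series.

    Every component must be up for the service to be up, so the
    individual availabilities (as fractions) are multiplied together.
    Assumes component failures are independent.
    """
    return math.prod(availabilities)

predicted = series_availability([0.98, 0.97, 0.99])
print(f"{predicted:.1%}")  # 94.1%
```

Note how the result is lower than the availability of even the weakest component: every extra component in series can only reduce the predicted figure.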
If the components include some resilience then the calculation becomes more complex. If we assume that component C1 is actually made up of two resilient components C1a and C1b, then before calculating the availability of the whole system we would need to calculate that of C1. This is done using the formula:
1 - ((1 - C1a availability) * (1 - C1b availability))
If the availability of C1a is predicted to be 99% and C1b is predicted to be 94%, the calculation for the whole system is:
(1 - ((1 - 0.99) * (1 - 0.94))) * 97% * 99% = 99.94% * 97% * 99% = 95.97%
This has some important implications: a) increasing the resilience at the component level is likely to increase the availability of the overall service, and b) calculating exactly how much availability we can expect is much more complicated for systems which include resilience in their design!
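As a rough sketch of the resilience calculation, assuming independent failures and using function names of our own choosing, the resilient pair is evaluated first and the result is then treated as a single component in the series:

```python
import math

def parallel_availability(availabilities):
    """Resilient group: unavailable only if every member is down at once."""
    return 1 - math.prod(1 - a for a in availabilities)

def series_availability(availabilities):
    """Series chain: available only if every member is up."""
    return math.prod(availabilities)

# C1 is a resilient pair (C1a, C1b); C2 and C3 remain single components.
c1 = parallel_availability([0.99, 0.94])
system = series_availability([c1, 0.97, 0.99])

print(f"{c1:.2%}")      # 99.94%
print(f"{system:.2%}")  # 95.97%
```

More complicated designs can be handled the same way, by collapsing each resilient group into one figure before multiplying along the series, although real systems rarely fail as independently as this arithmetic assumes.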
6. How Will You Calculate Actual Service Availability If You Are Not Going To Measure It At The End Point?
Once again the importance of understanding the component availability is paramount here. Assuming you have measured the component availability for each component over some time period, the system availability will be that of the weakest component less any unavailability of other components that occurred when the weakest component was up and running.
For example, let us imagine a simple system with components in series where C1 was available 98% of the time, C2 was available 97% of the time and C3 99% of the time. Some of the downtime on different components may overlap, and overlapping downtime should only be counted once. Let us say that C3 was down for 1% of the total time (that didn’t overlap downtime on C1 or C2), and similarly C1 was down for 2% of the total time (that didn’t overlap downtime on C2 or C3).
The calculation of actual availability would then be 97% - 1% - 2% = 94%, so 94% availability was actually achieved.
It is worth re-iterating that this figure will not be 100% accurate because you have not included the availability of the user end points.
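If you record outage start and end times for each component, the same result can be reached by merging the outage intervals so that overlapping downtime is only counted once. The sketch below is one possible approach, not a prescribed method: the outage log, the 1000-hour reporting period and the function names are all hypothetical, and it assumes components in series (any component outage is a service outage).

```python
def merged_downtime(outages):
    """Total time covered by outage intervals, counting overlaps once."""
    total = 0
    current_start = current_end = None
    for start, end in sorted(outages):
        if current_end is None or start > current_end:
            # A new, separate outage begins: bank the previous one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlaps the outage in progress: extend it if needed.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total

def actual_availability(outages, period):
    """Fraction of the period during which no component was down."""
    return 1 - merged_downtime(outages) / period

# Hypothetical outage log as (start_hour, end_hour) over a 1000-hour period.
outages = [
    (100, 120),  # C1 outage
    (110, 130),  # C2 outage, overlapping the C1 outage by 10 hours
    (500, 510),  # C3 outage
]
print(f"{actual_availability(outages, 1000):.1%}")  # 96.0%
```

Simply adding up the three outages would give 50 hours of downtime (95.0%), which understates the availability actually achieved because the 10 overlapping hours would be counted twice.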
7. Having Agreed How You Will Measure Predicted And Actual Availability, How Will You Report On It?
With actual availability in particular, there may be the need to identify several different audiences for the reporting and several different media may need to be employed. For example, it may be appropriate to have a detailed dashboard of live availability information for your operational teams to see and perhaps a simpler dashboard for your customer and users to view.
Each audience may also want reports at regular intervals showing availability achieved over some agreed time period. It is likely that you will need to agree these reports (and their audiences, frequency etc.) with service level management (SLM).
8. Consider Non-Technical Aspects Of Availability.
Most Availability Management staff tend to concentrate on monitoring and managing availability in live operations, with an emphasis on hardware and software. However services are made up of many components beyond hardware and software. For example, staff in operations also have a direct impact on availability, through their involvement in strategy, design and transition of services in addition to their operational roles.
If we look at the operational side of staff involvement with availability, it becomes clear that operational staff can increase availability by more than one means. For example, the correct use of maintenance slots on systems will ensure the systems have higher availability.
Reactively, operational staff can contribute to availability by shortening outages. In other words, one way to maximise up-time is to minimise any down-time. In terms of staff we can help them to minimise down-time by ensuring that they are on-site when needed, that there are enough of them to cover normal workload plus the extra workload that an incident might bring, that they are motivated and that they have had appropriate training.
Access to a Known Error Database (KEDB) will also shorten the length of outages. In many smaller organisations where it may not be cost effective to create and manage a KEDB, you should ensure that operational staff have access to the Internet so that they may access the publicly available KEDBs of your various suppliers.
Another non-technical area where availability may be increased is via the service users. Giving them appropriate training on how to use service end points in a safe and productive way can increase availability; reinforcing that training at appropriate intervals will ensure no improvements are lost. A simple example is ensuring users are aware of the dangers of spilling fluids onto their electronic equipment.
User training will be especially important in environments where some of the service infrastructure has to be maintained by the users, for example in a remote office a “super user” may have to ensure that communications equipment is suitably maintained.
9. Availability Of Local Spares.
Another practical measure that can help shorten unavailability is the local availability of critical spare equipment. It will not be cost effective to keep a spare of every single service component, but it may be to keep spares of those which have high rates of failure and also have a big impact on system availability.
Availability Management should be considered within the strategy and design of the service.
10. Availability Is Made Up Of Reliability And Maintainability.
It is important to understand that availability is achieved through a combination of reliability (freedom from failure) and maintainability (the service is easy to fix when it does fail).
Usually when designing systems that support services we place more emphasis on reliability than maintainability. However consider that if one or more of the systems is likely to break down, then we should give equal importance to maintainability. For example, if the system is likely to be used in an environment with extreme temperatures or is perhaps very dusty then we should expect more failures.
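One common way to put numbers on this, which is not taken from the text above, is to express reliability as the mean time between failures (MTBF) and maintainability as the mean time to restore service (MTTR), giving availability = MTBF / (MTBF + MTTR). The figures below are invented purely to illustrate the trade-off.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Availability from mean time between failures and mean time to restore."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical figures: fails every 500 hours, takes 10 hours to restore.
baseline = steady_state_availability(500, 10)
more_reliable = steady_state_availability(1000, 10)    # failures half as often
more_maintainable = steady_state_availability(500, 5)  # fixed twice as fast

print(f"{baseline:.2%}")           # 98.04%
print(f"{more_reliable:.2%}")      # 99.01%
print(f"{more_maintainable:.2%}")  # 99.01%
```

In this sketch, halving the time to restore service improves availability exactly as much as doubling the time between failures, which is why maintainability deserves equal attention when failures are expected.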
Maintainability can be improved through availability of spares, trained staff etc as mentioned in earlier points.
One aspect of maintainability that is often overlooked, however, is the quality and availability of documentation. If operations staff cannot fix an incident based on their own knowledge or what they find in the Known Error Database (KEDB), then they will need to consult the manuals (or any other documentation) that were produced at design time.
There is often pressure during the design phase of systems to hit a certain date for release of the system, even if that means cutting down on training and documentation. This pressure should be resisted because of the potential impact these things will have on availability. If the pressure cannot be resisted then the customers and any other relevant stakeholders must sign off that they have accepted the risks.
11. The Maturity Of Other Processes Will Have A Big Impact On Availability.
The availability process cannot act in a vacuum. The maturity of other key processes such as Event Management, Supplier Management, SACM, Change Management, Release & Deployment Management, Incident Management, Problem Management and Information Security Management will have a very large impact on availability.
If you do not currently have these processes formally defined and working, then consider it as part of your longer-term goals to get them in place. Don’t try to do too much too quickly though – remember the way to eat an elephant is one small piece at a time! Although I would never advocate eating an elephant under any circumstances as they are very intelligent animals…
Speaking of intelligence, what are your thoughts? Let us know in the comments.