In many organisations Problem Management is typically one of those processes that was implemented alongside the Incident and Change processes, but it’s either not working as you had hoped, or is generally being ignored.
A good indication of this is that your colleagues will be regularly engaged in firefighting incidents rather than investigating the root cause of incidents and preventing them from recurring or occurring in the first place. You may be suffering repeat incidents and may also have a backlog of problems which are not being progressed or problems which remain unassigned.
Here are some words of wisdom to help get your Problem Management process back on track and working as you had hoped.
1. If you haven’t already got one, develop a Problem Management policy document (a set of rules) that clearly documents what is within scope of the process and clearly defines roles and responsibilities (a RACI Matrix would help with this). If you have a policy document that hasn’t been reviewed in the last 12 months, review it to ensure it’s still appropriate.
2. Within the Problem Management policy document define who can raise problem records and under what circumstances. An example of what would and would not be appropriate may aid communication and understanding. If Service Desk and Second Line Support staff misunderstand this, the result can be duplication of effort, or Problem Management being flooded with Problem records that have been opened unnecessarily.
3. Have reactive and proactive triggers been clearly defined within a single Problem Management process? You don’t want two separate processes. Examples of reactive and proactive activities could include:
- Support for Incidents
- Identification of Problems
- Diagnosis of Problems
- Monitoring Change progress in relation to the elimination of known errors (KEs)
- Escalation of Problems
- Identifying workarounds
- Trend Analysis
- Initiation of Change to prevent Incidents from recurring
- Initiation of Change to prevent Incidents from occurring in the first place
- Preventing Problems from affecting other areas and systems (minimising the effect of the Incident)
4. Ensure that the Service Desk and 2nd, 3rd or nth line support groups generate good quality data in the integrated ITSM toolset they use, to aid the Problem Management process. Poor quality data will impact on Problem Management’s ability to generate trends and perform problem analysis.
Typically, this has been a major issue over many years for many companies. Words like ‘sorted’, ‘done’ and ‘fixed’ have been the only input into the work logs of Incident records, often requiring the Problem Manager to go back as far as the Customer to request information about the symptoms being experienced and what led up to them. Not only is this time-consuming, but it hardly portrays IT in a good or professional light!
Examples of key inputs into Incident Records (which can then be copied into Problem Records) include:
- User details
- Service details
- Equipment details
- Date/time initially logged
- Categorisation and prioritisation details
- Incident description (with as much detail as possible)
- Details of all diagnostic or attempted recovery actions taken
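As a rough illustration of how the key inputs above could be checked at logging time, the sketch below flags Incident records that are missing fields or whose work log contains only one of the one-word entries mentioned earlier. The field names, the `LOW_VALUE_ENTRIES` set and the `quality_issues` function are hypothetical and are not taken from any particular ITSM toolset; treat it as a starting point rather than a finished check.

```python
from dataclasses import dataclass, field, fields

# Hypothetical one-word work log entries that carry no diagnostic value.
LOW_VALUE_ENTRIES = {"sorted", "done", "fixed"}


@dataclass
class IncidentRecord:
    # Field names are illustrative; map them to whatever your toolset uses.
    user: str = ""
    service: str = ""
    equipment: str = ""
    logged_at: str = ""
    category: str = ""
    priority: str = ""
    description: str = ""
    actions_taken: list[str] = field(default_factory=list)


def quality_issues(record: IncidentRecord) -> list[str]:
    """Return human-readable data quality issues found in one record."""
    issues = []
    # Any empty field is reported as missing.
    for f in fields(record):
        if not getattr(record, f.name):
            issues.append(f"missing {f.name}")
    # Work log entries such as "Fixed." tell the Problem Manager nothing.
    for entry in record.actions_taken:
        if entry.strip().lower().rstrip(".!") in LOW_VALUE_ENTRIES:
            issues.append(f"uninformative work log entry: {entry!r}")
    return issues
```

A report built from `quality_issues` could be shared with the Service Desk and support groups so that poor records are corrected while the details are still fresh.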
5. Ensure that all stakeholders clearly understand the difference between Incident Management and Problem Management, especially around goals and objectives, inputs, outputs and process triggers. A one-hour presentation can be easily generated and delivered in a way that shows them ‘What’s in it for them’.
6. Clearly define the interfaces between Problem Management and other processes, not just Incident Management, although this is where most of your reactive problem records will be initiated. Your other key interfaces are:
- Change Management – Problem Management ensures that all resolutions or workarounds that require a change are submitted through Change Management via an RFC. Change Management will monitor the progress of these changes and keep Problem Management advised. Problem Management is also involved in rectifying the situation caused by failed changes, and has a major role to play as a Change Advisory Board (CAB) member.
- Service Asset and Configuration Management – A major source of knowledge for Problem Management. Configuration records are linked to previous incidents, problems, changes, releases and known errors associated with CIs, which could speed up the resolution of something being experienced today. Configuration Management also identifies relationships between CIs, which allows a full understanding of the impact of the Problem on the organisation.
- Release and Deployment Management – Responsible for the build, test and deployment of fixes into the live environment. Release and Deployment Management also assists in ensuring that the associated known errors are transferred from the development Known Error Database (KEDB) into the live KEDB. Problem Management will assist in resolving problems caused by faults during the release process.
- Availability Management – Concerned with decreasing downtime and increasing uptime. There is a great deal of interfacing between Problem and Availability Management and a close relationship must exist. Many techniques (both reactive and proactive) associated with Availability Management are also shared by Problem Management.
- Capacity Management – Some problems will require investigation by Capacity Management teams and techniques, e.g. performance issues. Capacity Management will also assist in assessing proactive measures. Problem Management provides management information relative to the quality of decisions during the capacity planning process.
- IT Service Continuity Management – Problem Management acts as an entry point into ITSCM where a significant problem is not resolved before it starts to have a major impact on the business, which may require part (or all) of the ITSCM plan to be invoked.
- Service Level Management – The occurrence of incidents and problems affects the level of service delivery measured by SLM and experienced by their Customers. Problem Management contributes to improvements in service levels and its management information is used as the basis of some of the SLA review components. SLM also provides parameters within which Problem Management works, such as impact information and the effect on services of proposed resolutions and proactive measures.
- Financial Management – Assists in assessing the impact of proposed resolutions or workarounds. Problem Management provides management information about the cost of resolving and preventing problems, which in turn provides an input into the budgeting and accounting systems and total cost of ownership calculations.
7. The purpose of the Known Error Database (KEDB) should also be clearly communicated, and its use regularly monitored. If no one is using it, find out why and address the issues. Problem Management is said within ITIL to be the ‘Gatekeeper’ of the KEDB, ensuring the structure is sound, the knowledge is presented in a way that makes it easy for people to find what they’re looking for, and the currency of that knowledge is always maintained. Known errors take two forms:
- Root Causes
8. Work with the Service Desk to develop useful scripting for logging Incidents, as it will aid effective analysis. Train Service Desk staff in the use of the KEDB and the CMS and in quality data logging, to help with efficient Incident and Problem resolution. Problem Management should ensure both initial and ongoing training is in place for Service Desk (and 2nd line) support staff; the level and depth of training depends very much on the first-time fix rate the Service Desk is required to achieve without escalation to more technically able people. This not only allows more incidents to be dealt with earlier and more quickly (thus raising the profile of IT to the business), but raising the skill level of 1st and 2nd line support staff also makes the transition of those individuals into more technically challenging roles easier, benefitting both the individuals and the organisation.
9. As previously mentioned, there is a great deal of overlap between the Problem and Availability Management processes. If the Problem Management process is suitably mature and the Availability Management process has not yet been developed or implemented, there is an opportunity to address the challenges that its absence causes and/or develop the process itself, as many of the skills and techniques are interchangeable from Problem to Availability Management.
10. Ensure that all problem resolver groups have adequate skills in problem analysis and resolution techniques and ensure that they don’t hold on to problems when a more collaborative approach may yield a quicker resolution. This will require both initial and ongoing training, and suitable plans should be put in place to ensure this happens. Some of the most commonly used techniques are as follows:
- Chronological Analysis – A timeline of events leading to the incident, in chronological order. This may allow the ‘trigger’ for the incident to be identified.
- Pain Value Analysis – Taking a broader view of the impact of an incident or problem. Instead of just analysing the number of incidents or problems of a particular type in a particular period, a more in-depth analysis is done to determine exactly what level of pain has been caused to the organisation by these incidents or problems. A formula can be devised to calculate this pain level. Typically, this might include taking into account the number of people affected, the duration of the downtime caused and the cost to the business.
- Kepner/Tregoe Analysis – A five-stage analysis which aids Problem Management when deeper-rooted problems are experienced. The five stages are:
  - Define the Problem – In what way are we not achieving our service level targets? Write the reasons down.
  - Describe the Problem – This breaks down into four headings:
    - Size – What’s the size of the Problem? How many parts are affected?
    - Time – When does it occur? How frequently does it happen?
    - Location – Where does it happen? How many sites are affected?
    - Identity – What part(s) are not working well? What is the problem? Also, consider what aspects are working as expected, as this may assist in identification of the problem area.
  - Establish possible causes – Use the breakdown from ‘describe the problem’ to do this.
  - Test the most probable cause – If there are a number of ‘probable’ causes, they may have to be prioritised.
  - Verify the true cause.
- Brainstorming – Get the relevant people into a room and get them to ‘throw’ ideas for discussion as to root cause. Just remember – no idea is to be met with scorn or derision, however improbable it may seem!
- Ishikawa Diagrams – The ‘fishbone’ or ‘Cause and Effect’ Diagram. The reason for the failure is represented by the fish ‘backbone’. The bones of the fish are then populated with the main reasons this could have happened, and problem solvers are then invited to break these down to a more granular level of why the group believe that could have happened.
- Pareto Analysis – A technique for separating important potential causes from more trivial issues.
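To make Pain Value Analysis and Pareto Analysis a little more concrete, here is a minimal sketch that combines the two: it scores each incident with a weighted sum of people affected, downtime and cost, then ranks incident categories and returns the few that account for most of the total pain. The weights, the weighted-sum formula, the function names and the 80% threshold are all assumptions chosen for illustration; each organisation would need to agree its own.

```python
from collections import defaultdict

# Hypothetical weights; agree your own with the business.
WEIGHTS = {"people": 1.0, "minutes": 0.5, "cost": 0.01}


def pain_value(people_affected, downtime_minutes, cost):
    """Weighted sum of the three factors named above (an assumed formula)."""
    return (WEIGHTS["people"] * people_affected
            + WEIGHTS["minutes"] * downtime_minutes
            + WEIGHTS["cost"] * cost)


def pareto(incidents, threshold=0.8):
    """Return the categories that together reach `threshold` of total pain.

    `incidents` is an iterable of (category, people, minutes, cost) tuples.
    """
    totals = defaultdict(float)
    for category, people, minutes, cost in incidents:
        totals[category] += pain_value(people, minutes, cost)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []

    vital_few = []
    cumulative = 0.0
    # Walk categories from most to least painful until the threshold is met.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    for category, pain in ranked:
        vital_few.append(category)
        cumulative += pain
        if cumulative / grand_total >= threshold:
            break
    return vital_few
```

The categories returned by `pareto` are candidates for problem investigation first; the remaining categories are the ‘more trivial issues’ that can wait.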
And of course a bonus eleventh tip…
11. If Problem Management is already working at a mature level, why not help to initiate and develop a Continual Service Improvement (CSI) program? Otherwise it may never get developed. As a Problem Manager you will have many of the skills and much of the knowledge required to establish CSI within your organisation.