Major Incidents - Best Practice Advice
Creating a Major Incident Procedure is often overlooked in many organisations, or left to IT Service Continuity Management (ITSCM) to create. It’s worthwhile considering if you have an appropriate procedure in place. If not then here is the basic information you will need to get started. If you already have a procedure then you could use this as a checklist. Either way we hope that you find this article of use…
Firstly – What Are Major Incidents?
Major Incidents are events which require a different approach from ‘normal’ day to day incidents.
People sometimes use loose terminology and confuse a major incident with a problem. In reality, an incident remains an incident for ever – it may grow in impact or priority to become a major incident, but an incident never ‘becomes’ a problem. A problem is the underlying cause of one or more incidents and remains a separate entity always!
“An incident never ‘becomes’ a problem”
A definition of what constitutes a major incident must be agreed and ideally mapped on to the overall incident prioritisation system.
Where necessary, the major incident procedure should include the dynamic establishment of a separate Major Incident Team, under the direct leadership of the Incident Manager.
The Major Incident Team is tasked to concentrate on this incident alone, and to ensure that adequate resources and focus are provided for finding a fast and effective resolution.
Major incidents are typically those for which the degree of impact on the business/organisation is extreme.
Incidents for which the timescale of disruption – to even a relatively small percentage of users – becomes excessive should also be regarded as major incidents.
“Are all priority 1 incidents, Major Incidents?”
It is possible to define some of these major incidents, but most will be prioritised as they happen based on impact and urgency. Usually Priority 1 is set aside for these types of incident. A separate procedure, with shorter timescales and greater urgency, must be used for major incidents.
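As a rough illustration of prioritising on impact and urgency, the sketch below maps the two onto a priority and treats Priority 1 as a major incident. The three levels, the matrix values and the function names are assumptions made for this example only; each organisation should agree its own definitions.

```python
# Illustrative impact/urgency matrix. The levels and the resulting
# priorities are hypothetical, not values prescribed by this article.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}


def priority(impact, urgency):
    """Look up the priority (1 is the most urgent) for an impact/urgency pair."""
    return PRIORITY_MATRIX[(impact, urgency)]


def is_major_incident(impact, urgency, predefined=False):
    """Assume Priority 1 is reserved for major incidents.

    `predefined` covers incident types that the procedure names in
    advance as major, whatever their calculated priority.
    """
    return predefined or priority(impact, urgency) == 1


if __name__ == "__main__":
    print(priority("high", "medium"))
    print(is_major_incident("high", "high"))
```

Keeping the matrix as a simple lookup table makes it easy to review with the business and to change when the agreed definition changes.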
Typically the same major incident doesn’t recur. If it does then someone has seriously failed in their duties to prevent a recurrence. But more of that later…
What If We Don’t Have A Dedicated Incident Manager?
If the Service Desk Manager is also fulfilling the role of Incident Manager, which can be the situation in some smaller businesses and organisations, then a separate person may need to be designated to lead the major incident investigation team. This will avoid conflict of time or priorities. Whoever is appointed should ultimately report back to the Incident Manager/Service Desk Manager.
The Incident Manager is the process owner for the Incident Management process and as such needs to work closely with other process owners and practitioners.
If the cause of the incident needs to be investigated at the same time, then the Problem Manager will be involved as well, but the Incident Manager must ensure that service restoration and underlying cause investigation are kept separate.
“Beware of the ‘conflict of interest’ between Incident and Problem Management”
Throughout the major incident the service desk will ensure that all activities are recorded and users are kept fully informed of progress. Communication is a hugely important activity in handling major incidents and should not be underestimated.
The Major Incident Manager (or Problem Manager if covering the role) should arrange a formal meeting with interested parties (or regular meetings if necessary). These should be attended by all key in-house support staff, vendor support staff and IT services management, with the purpose of reviewing progress and determining the best course of action. The service desk representative should attend these meetings and ensure that a record of actions and decisions is maintained, ideally as part of the overall incident record. Major incidents are still logged in the same way as all other incidents; it is only the priority and management of the incident that differ.
As a side note: If no Problem Manager or Problem Process Owner is currently in place, an Incident Management Executive and Major Incident Team could take on the root cause analysis activities.
“If root cause is not determined and addressed then there is a high risk that the major incident will reoccur”
What Is The Major Incident Procedure?
A separate procedure, with shorter timescales and greater urgency, must be used for ‘major’ incidents. A definition of what constitutes a major incident must be agreed and ideally mapped onto the overall incident prioritisation scheme, so that such incidents are dealt with through this separate procedure.
The Major Incident Procedure
A procedure should be in place to manage all aspects of a major incident, including resources and communication.
It should describe how the business/organisation handles major incidents from receiving notification of a potential major incident, through the investigation process itself and to the delivery of a final report.
A related procedure describing the process of reviewing the major incident policy and procedure also needs to be in place.
The Major Incident procedure should be reviewed on a regular basis (at least annually), before any major change, and following the occurrence of a major incident.
Some of the areas to be covered in the major incident policy and procedure are: Purpose, Scope, Activity Definition, Policy, and Roles and Responsibilities.
Purpose Of The Major Incident Procedure
Describe the purpose of the major incident policy and procedure. For example: “This procedure and related policies have been put in place to document the business/organisation’s requirements and arrangements for responding to and investigating major incidents.”
Scope Of The Major Incident Procedure
Document the exact scope of the major incident procedure and policy. For example: “This procedure and related policies apply to all Incidents that, due to their status of impact or urgency to the business/organisation, have been prioritised as a major incident.”
Definition
A major incident is defined as an event which has significant impact or urgency for the business/organisation and which demands a response beyond the routine incident management process.
A major incident will be an incident that is either defined in the major incident procedure or that causes, or has the potential to cause, impact on business-critical services or systems (which should be named in the major incident procedure).
A major incident can also be an incident that has significant impact on business reputation, legal compliance, regulation or security of the business/organisation.
Major Incident Procedure Policy
A policy defines the scope of the process or procedure. It effectively gives you a boundary to prevent ‘scope creep’.
The business/organisation’s policy is to have an effective and efficient system for responding to major incidents, which is appropriate to the individual circumstances.
The key requirements of the policy are:
- To provide an effective communication system across the business/organisation during a major incident
- To ensure that an appropriate Incident Manager/Major Incident Team/Management Group are in place to manage a major incident
- To ensure that appropriate arrangements are in place so that major incidents are notified promptly to appropriate management and technical groups, and the appropriate resources are made available
- To conduct major incident investigations and to contribute to the business/organisation’s knowledge of the causes of incidents.
- To provide timely information about the causes of incidents and any relevant findings from investigations
- To conduct a review of each major incident once service has been restored and, in line with problem management, to look at root cause and options for a permanent solution to prevent the same major incident happening again
- To conduct reviews of major incident investigation policy and procedure, independent of the major incident investigation, and to report on them (any lessons to be learned from the policy and procedure review will be considered, and appropriate action taken to ensure any improvements to existing arrangements are implemented within a specified timescale)
Roles And Responsibilities In The Major Incident Procedure
The following roles and responsibilities need to be defined for managing major incidents:
- The Incident Manager
- The Problem Manager or if no Problem Manager exists, then the role of a Root Cause Analyst (effectively a technical expert trained in RCA techniques – possibly 3rd line support)
- Major Incident Investigation Board
- Investigation Team/investigation resources (technical staff)
- The service desk
- Service level managers/IT account managers
- Business relationship managers who may take part in the management of major incidents to conduct communication with key customers.
- Any other relevant groups who will act as part of the Major Incident Team
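One way to record who does what is a RACI matrix (Responsible, Accountable, Consulted, Informed). The sketch below is illustrative only: the activities and the allocation of codes to roles are assumptions, and the check applies the common RACI convention of exactly one Accountable role and at least one Responsible role per activity.

```python
# Hypothetical RACI allocation for a few major incident activities.
# Roles come from the list above; the codes are examples only.
RACI = {
    "Restore service": {
        "Incident Manager": "A",
        "Investigation Team": "R",
        "Problem Manager": "C",
        "Service desk": "I",
    },
    "Keep users informed": {
        "Incident Manager": "A",
        "Service desk": "R",
        "Business relationship managers": "C",
    },
    "Investigate root cause": {
        "Problem Manager": "A",
        "Investigation Team": "R",
        "Incident Manager": "I",
    },
}


def activities_with_gaps(raci):
    """Return activities lacking exactly one 'A' or at least one 'R'."""
    gaps = []
    for activity, roles in raci.items():
        codes = list(roles.values())
        if codes.count("A") != 1 or "R" not in codes:
            gaps.append(activity)
    return gaps


if __name__ == "__main__":
    print(activities_with_gaps(RACI))
```

Running a check like this whenever roles change helps to spot activities that nobody owns before a major incident exposes the gap.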
Major Incident Procedure Reviews
It is important to conduct Major Incident Reviews where the review determines:
- How well did we manage the Major Incident?
- Could it have been prevented?
- Could we do things better next time?
- How do we stop a recurrence?
A review is a very important aspect of the Major Incident process and should be carefully planned and managed. Typically it will be chaired by the Major Incident Manager or by a senior member of the management team.
- All relevant parties involved in the Major Incident should attend the review.
- Supporting documentation such as the Incident Record is shared in advance if possible.
- A walk-through of the incident, together with the actions taken, is conducted.
- Attendees are asked what went well and what did not, what actions will be taken to prevent re-occurrences and/or assist with the resolution should it happen again in the future.
- In addition the Major Incident process as a whole is reviewed and improvements are again identified.
- Meeting minutes should be produced detailing attendees, actions identified, who has been assigned the action and expected completion date.
- It is the responsibility of the Major Incident Manager to ensure the actions are completed accordingly within agreed timescales.
Take care not to confuse a Major Incident Review with a Major Problem Review, which examines:
- Those things that were done correctly
- Those things that were done wrong
- What could be done better in the future
- How to prevent recurrence
- Whether there has been any third-party responsibility and whether follow-up actions are needed.
The knowledge gained from both reviews should be incorporated into a service review meeting with the business customer to ensure the customer is aware of the actions taken and the plans to prevent future major incidents from occurring. This helps to improve customer satisfaction and assure the business that service operation is handling major incidents and problems responsibly and actively working to prevent their future recurrence.
Communication In Major Incident Procedure And Related To Emergencies
Although ITIL® specifies how to deal with urgent, high-impact situations such as disasters (IT Service Continuity Management) and major incidents (Incident Management), managers in the service operation stage will find themselves dealing with various types and scales of emergency not covered in these processes. It is important to note that this is not a separate process; rather it is a view of several processes and situations from a communication perspective.
Communication during emergencies is similar in purpose and content to communication during exceptions. The main differences are in the level of urgency and impact of the exception.
Emergency communications are usually initiated by the incident manager or by a senior IT manager who has been designated as the escalation point for all such emergencies.
In the case where an IT service continuity plan is invoked, this will include a detailed communication plan to be executed by the appropriate authority.
The incident manager or designated manager will often form a ‘response team’, and the communication is initiated and coordinated by this team.
If the Major Incident is evident in the public domain then careful communication needs to be delivered by a senior member of the management team, maybe in the form of a carefully drafted press release.
The following also need to be considered for managing major incidents:
- Knowledge transfer – The Major Incident Manager should not become a single point of failure. Consider training a number of senior personnel in this role if possible and establish a way of keeping all stakeholders informed of changes to plans, procedures etc.
- Changes to appropriate documentation – Consider how changes to documents are managed. A history log at the beginning of each document is the simplest way, but you may also consider formal version and copy control, or use of document sharing technology.
- Changes to appropriate processes – When processes change ensure all stakeholders are informed and trained as appropriate.
Planning For Major Incidents
Despite our best efforts, a major incident will no doubt occur at some stage. Therefore, as well as developing a Major Incident policy and procedure, we should undertake a number of other activities in preparation…
- Define Major Incident Process - Defining and documenting the Major Incident process, including a high-level flow diagram, is invaluable. The process documentation will then assist with defining the associated procedures to be used by all parties.
- Define Roles and Responsibilities - Clearly define in generic terms the roles and responsibilities of each party, both internal and external to the organisation, engaged in the Major Incident process. Creating a RACI matrix is often the easiest way to do this, determining who is Responsible, Accountable, Consulted and Informed for each activity. Ensure that all involved understand their role and responsibility and are appropriately trained.
- Review Service Level Agreements and Service Catalogues - Working with the Business Relationship Manager and Business representatives determine the mission critical services and components. Business relationship managers may assist in gathering detailed requirements during the service design stage of the lifecycle for new services.
- Liaise with Information Security Management - In the current ‘Cyber-attack’ climate consider the implications of security breaches, Phishing attacks, Malware, Ransomware and other software virus attacks as part of your planning.
- Identify IT Service Continuity Management (ITSCM) Interfaces and Involvement - Determine who in the IT Service Continuity Management team needs to be contacted, and when, if a Major Incident occurs. Also agree and capture the triggers or circumstances that would invoke the ITSCM plan.
- Define Incident Priorities - As part of the ‘normal’ incident management process it’s important to establish a simple, clearly defined incident priority hierarchy covering low priority through to high or critical priority incidents (Major Incidents). This would normally be based on Business Impact and Business Urgency, but could incorporate other factors such as ‘Technical Severity’. The incident priority should be reflected in the generic "IT Service Support Model" if one exists. It’s imperative to ensure there is no confusion regarding priorities, especially regarding what constitutes a "Major Incident", and that the priorities can be applied across the IT department and its third party suppliers as well as the organisation's Business community.
- Define Incident Escalations - A Major Incident has the potential to have a significant impact upon an organisation, for example from a reputational, legal, trading and in some cases life-and-death perspective. Speed is of the essence and any delays can be very costly. By establishing an Escalation Hierarchy within the organisation and associated third party suppliers, appropriate authorisation, focus and resource can be committed to the Major Incident in a timely manner, to resolve it and re-establish the service(s) in question. Note: Both hierarchic and functional escalations need to be considered.
- Review Underpinning Contract(s) and Operating Level Agreement(s) - Examine Contracts with existing third party suppliers and Operating Level Agreements (OLAs) with internal support teams to determine whether they align with the Major Incident process. Where UCs and OLAs do not align with SLAs they will need to be renegotiated.
- Establish need for any ‘out of hours’ (OOH) arrangements - It is possible that some internal support teams will be required to have staff available "out of hours" to assist with Major Incidents that may occur. Some form of compensation may need to be agreed with staff. The organisation's Human Resources department would usually undertake this requirement.
- Create a key contact list - Capture the names, job titles, telephone numbers (both landline and mobile), preferred methods of communication of the various individual team members and third party suppliers involved in the Major Incident process.
- Establish Communication Plan(s) - It is important to communicate with the Business community and other relevant staff (e.g. Business Relationship Manager(s)) in a timely manner when a Major Incident occurs, followed by progress update(s) and finally notification of the restoration of service. From experience, it is more effective and efficient to target the communications at those who are affected by the Major Incident. Identify in advance who is to be contacted, and the method and frequency of the communication. The Business Relationship Manager(s) should be of significant value in identifying and agreeing the contacts. Email is often used, and setting up and maintaining distribution lists makes such communications easier. Consider setting up Communication templates for:
- Major Incident notification
- Progress Updates
- Service Restoration
- Create an Escalation Plan - Establish the hierarchy of names, job titles and telephone numbers (both landline and mobile), together with the time period following the occurrence of the Major Incident after which each individual will be contacted should the incident not be resolved. The further up the hierarchy, the more influential the individual is expected to be within the organisation, with the capability to expedite resources and the availability of key individuals. The scope of the Escalation Plan should include internal teams and third party suppliers.
- Checklist(s) - Checklists save time, reduce stress and ensure all aspects of a Major Incident are considered. Establish checklists for:
- Meeting Agendas
- Staff Rotation (shifts)
- Staff Facilities.
- Command and Control Centre - Where possible identify a dedicated location, including meeting room equipped with conference call facilities, whiteboards, flipcharts and pens. Ensure out of hours facilities such as Security, parking, heating, toilets, food and water are available and maintained. The meeting room may well be used outside of Major Incidents but on the understanding that if a Major Incident occurs then the room will be commandeered and existing occupants expected to leave immediately.
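The Escalation Plan described above can be held as simple structured data, which makes it easy to see who should have been contacted at any point in an unresolved Major Incident. The following is a minimal sketch: the job titles, time thresholds and placeholder phone numbers are hypothetical and should be replaced with the organisation's own agreed hierarchy.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationLevel:
    after: timedelta   # time since the incident began, if still unresolved
    job_title: str
    phone: str         # landline and/or mobile


# Hypothetical hierarchy, ordered from first contact to most senior.
ESCALATION_PLAN = [
    EscalationLevel(timedelta(minutes=0), "Incident Manager", "<landline/mobile>"),
    EscalationLevel(timedelta(minutes=30), "IT Service Manager", "<landline/mobile>"),
    EscalationLevel(timedelta(hours=2), "IT Director", "<landline/mobile>"),
]


def contacts_due(elapsed, plan=ESCALATION_PLAN):
    """Return every level whose escalation threshold has been reached."""
    return [level for level in plan if elapsed >= level.after]


if __name__ == "__main__":
    # Who should have been contacted 45 minutes into an unresolved incident?
    for level in contacts_due(timedelta(minutes=45)):
        print(level.job_title)
```

The same structure can be extended with entries for third party suppliers, so that internal and external escalations sit in one plan.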
How Do We Establish A Culture Of Continual Improvement?
It’s necessary to maintain and continually improve the major incident procedure and plans, and to encourage all stakeholders to keep up to date with their knowledge of Major Incident Management.
- Policy and Process Improvement - As part of the "Post Major Incident Review" the Major Incident process is reviewed. In addition the process should be periodically reviewed with the stakeholders. Any improvements would be raised as a Request for Change (RFC) and follow the Change process. All policies and processes should be reviewed at least annually.
- Change Management – Consideration needs to be given to all changes to evaluate if a change may impact on the existing Major Incident procedure or plans. This is equally important for ITSCM.
- Current and Accurate Contact Data - As defined in the roles and responsibilities supporting the Major Incident process everyone is required to provide any updates such as contact names, job titles, telephone numbers, email addresses and methods of communication to the Major Incident Process Owner. The Process Owner will update the relevant checklists and communication documents and communicate out accordingly.
- Lessons Learnt - As previously mentioned time is of the essence when dealing with a Major Incident and for those new to the process receiving education and training in advance can only be beneficial. Obviously understanding the process and procedures is important, but also consider using recent Major Incidents as training scenarios, including the lessons learnt from the post Major Incident Review.
- On-going education and awareness - All organisations to some extent experience a turnover in staff. Therefore it would be beneficial to establish an education and awareness plan incorporating scheduled sessions for both key internal staff and those of third party suppliers involved in major incident management.
The information contained in this document has been collected and collated from a number of sources and researched and modified by the author to bring it up to date. It therefore represents what we would currently consider Best Practice for Major Incident Management.