An Introduction to ITIL® Monitoring and Event Management
The purpose of the monitoring and event management practice is to systematically observe services and service components, and record and report selected changes of state identified as events. This practice identifies and prioritises infrastructure, services, business processes, and information security events, and establishes the appropriate response to those events, including responding to conditions that could lead to potential faults or incidents.
In this blog we will dive deep into monitoring and event management in ITIL, covering its scope, definitions, examples, metrics, and best practices. If you want to learn more, we have a one-day ITIL® 4 Practitioner: Monitoring & Event Management training course.
What Is Monitoring and Event Management in ITIL 4?
Monitoring and event management is used to manage events throughout their lifecycle to understand and optimize their impact on the organisation and its services. Monitoring and event management includes identification and categorisation, or analysis, of events related to all levels of infrastructure and to service interactions between the organisation and its service consumers. Monitoring and event management ensures appropriate and timely response to those events.
The monitoring part of the practice focuses on services and configuration items (CIs) to detect conditions of potential significance, track and record the state of services and CIs, and provide this information to relevant parties.
The event management part of the practice focuses on those monitored changes of state defined by the organisation as an event, determining their significance, and identifying and initiating the correct response to them. Information about events is also recorded, stored and provided to relevant parties.
How Can We Help?
At Purple Griffon we offer an ITIL® 4 Practice: Monitoring & Event Management course. This course is structured and aligned around the ITIL® Framework. The 1-day course is ideal for anyone requiring practical knowledge on monitoring and event management.
The ITIL 4 monitoring and event management course is designed to provide participants with the knowledge and skills needed to effectively monitor and manage IT services using the ITIL framework.
The course can be taken on its own as a 1-day course, or part of a bundle called ITIL® 4 Practices: Monitor, Support & Fulfil which includes:
• ITIL® 4 Practice: Service Desk
• ITIL® 4 Practice: Incident Management
• ITIL® 4 Practice: Problem Management
• ITIL® 4 Practice: Service Request Management
• ITIL® 4 Practice: Monitoring & Event Management
What is the Scope of Monitoring and Event Management?
Monitoring and Event Management has an important role in ensuring the effective operation of IT services.
Event Management can be applied to any aspect of Service Management that can be controlled and can be automated.
Events (warnings and exceptions) can be used to automate many routine activities.
Events also provide a mechanism for the early detection of incidents.
Some types of automated activities can be monitored by exception, reducing downtime.
The scope of the monitoring and event management practice includes:
• identifying and optimising the scope of monitoring
• implementing and maintaining continuous monitoring
• establishing and maintaining event identification, categorisation, and processing rules
• implementing processes and automation tools to operationalize the defined event management rules.
• ongoing processing of events according to the agreed and implemented rules and processes
• providing information about the current and historical state of the monitored services and resources to relevant stakeholders in an agreed form.
There are several activities and areas of responsibility that are not included in the monitoring and event management practice, although they are still closely related to monitoring and event management.
What is an Event in ITIL 4?
ITIL 4 defines an event as “any change of state that has significance for the management of a service or other configuration item (CI)”.
An event can be defined as any detectable occurrence that has significance for the management of IT services or the IT infrastructure. An event can be generated by hardware, software, applications, or by human activity, and it can be either normal or abnormal. There are three types of events in ITIL: informational, warning, and exceptional.
What is an Informational Event?
An informational event in ITIL 4 is an event that is detected by a monitoring tool or system and does not require any action or intervention. It is simply recorded for informational purposes.
Informational events are used to capture and store data about the performance and availability of IT services, infrastructure, and applications. These events can include system start-ups and shutdowns, changes in system status, user login and logout activities, and other automated system events.
The purpose of collecting and analysing informational events is to gain insight into the overall health and performance of the IT environment. This data can be used to identify trends and patterns, forecast future demand, and support capacity planning and optimization efforts. Additionally, informational events can be used to support troubleshooting and problem resolution activities, as they can provide valuable contextual information to support incident and problem investigations.
What is a Warning Event?
A warning event in ITIL 4 is an event that indicates that a potential problem or issue has been detected that requires attention or intervention.
A warning event may be an indication of a problem that is currently affecting the performance or availability of IT services, infrastructure, or applications, or it may be a sign of an impending problem that needs to be addressed proactively to prevent service disruptions.
Examples of warning events include CPU utilisation reaching a critical threshold, a hard disk running out of space, a network link becoming saturated, or a server becoming unresponsive. These events are typically detected by monitoring tools or systems and are forwarded to the IT operations team for further investigation and action.
The purpose of warning events is to provide early warning of potential problems or issues, enabling IT operations teams to take proactive measures to prevent service disruptions and minimise the impact of incidents. Warning events can trigger automated responses, such as the creation of incident tickets, or they can be used to inform IT operations staff of the need for immediate attention and action.
What is an Exceptional Event?
An exceptional event in ITIL 4 is an event that requires immediate attention and intervention from IT operations staff.
Exceptional events are typically indicators of significant problems or issues that are affecting the performance or availability of IT services, infrastructure, or applications, and require urgent action to restore normal service operations.
Examples of exceptional events include server crashes, network outages, security breaches, data corruption, and other critical incidents that require immediate attention from IT operations staff.
The purpose of exceptional events is to provide a mechanism for IT operations staff to prioritise and respond to critical incidents in a timely and effective manner. Exceptional events trigger the initiation of the incident management process, which includes activities such as incident detection, logging, categorisation, prioritisation, and resolution. Exceptional events also trigger the activation of the service continuity management process, which focuses on restoring normal service operations as quickly as possible and minimising the impact of incidents on business operations.
It is important for organisations to have well-defined processes and procedures in place for handling exceptional events to ensure that incidents are managed effectively and efficiently, and to minimise the impact on customers and the organisation.
Examples of Monitoring and Event Management in ITIL 4
In ITIL 4, monitoring and event management is a service management practice rather than part of a lifecycle stage (in ITIL v3, event management sat within the service operation stage of the service lifecycle). Here are some examples of each type of event.
Examples of informational events:
• System start-up and shutdown
• User login and logout activities
• File transfers
• Successful backups
• Routine maintenance activities
Examples of warning events:
• High CPU utilisation
• Low disk space
• High memory usage
• Network congestion
• Slow response time
Examples of exceptional events:
• Server crash
• Network outage
• Security breach
• Data corruption
• Application failure
It's worth noting that some events may fall into different categories depending on the context and severity of the event. For example, a high CPU utilisation event may be a warning event in some cases, but it may also become an exceptional event if it persists for an extended period and begins to affect service availability.
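The escalation described above can be sketched in a few lines of Python. This is only an illustration, not an ITIL-prescribed method: the 85% threshold, the 10-minute persistence limit, and the names `CpuSample` and `classify` are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"
    WARNING = "warning"
    EXCEPTIONAL = "exceptional"


@dataclass
class CpuSample:
    minute: int          # minutes since monitoring started
    utilisation: float   # CPU utilisation as a percentage (0-100)


# Hypothetical values chosen for illustration only.
WARNING_THRESHOLD = 85.0   # percent
PERSIST_MINUTES = 10       # a breach lasting this long is escalated


def classify(samples: list[CpuSample]) -> EventType:
    """Classify the latest sample, escalating a sustained breach."""
    if not samples or samples[-1].utilisation < WARNING_THRESHOLD:
        return EventType.INFORMATIONAL
    # Walk backwards to find when the current breach started.
    breach_start = samples[-1].minute
    for sample in reversed(samples):
        if sample.utilisation < WARNING_THRESHOLD:
            break
        breach_start = sample.minute
    duration = samples[-1].minute - breach_start
    if duration >= PERSIST_MINUTES:
        return EventType.EXCEPTIONAL
    return EventType.WARNING


brief_spike = [CpuSample(0, 40.0), CpuSample(5, 90.0)]
sustained = [CpuSample(0, 90.0), CpuSample(5, 92.0), CpuSample(10, 95.0)]
print(classify(brief_spike).value)
print(classify(sustained).value)
```

In practice the threshold and persistence values would be agreed per service rather than hard-coded.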
Metrics for Tracking Monitoring and Event Management
In ITIL 4, metrics are used to measure and evaluate the effectiveness and efficiency of Monitoring and Event Management processes. Some of the metrics below also feed into the key performance indicators of other processes, typically incident management. Here are some examples of metrics that can be used to track Monitoring and Event Management in ITIL 4:
Number of events and alerts
This metric measures the total number of events and alerts generated by monitoring tools and systems, including informational, warning, and exceptional events. This metric can be used to track the overall volume of events and to identify trends and patterns over time.
Time to detect and diagnose incidents
This metric measures the time it takes to detect and diagnose incidents, from the initial event detection to the creation of an incident ticket. This metric can be used to evaluate the efficiency of event management processes and to identify areas for improvement.
Time to resolution
This metric measures the time it takes to resolve incidents and restore normal service operations. This metric can be used to evaluate the effectiveness of incident management processes and to identify opportunities for reducing downtime and improving service availability.
Mean time between failures (MTBF)
This metric measures the average time between failures of IT services, infrastructure, or applications. This metric can be used to evaluate the reliability of IT systems and to identify areas for improvement.
Mean time to repair (MTTR)
This metric measures the average time it takes to repair IT systems or resolve incidents. This metric can be used to evaluate the efficiency of incident management processes and to identify areas for improvement.
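As a rough sketch of how MTBF and MTTR might be calculated from outage records, the Python below uses one common convention: MTTR is total repair time divided by the number of failures, and MTBF is total uptime in the observation window divided by the number of failures. The outage data and function names are made up for the example, and organisations may define these metrics differently.

```python
from datetime import datetime, timedelta

# Each outage is a (failed_at, restored_at) pair. Sample data is invented.
Outage = tuple[datetime, datetime]


def total_downtime(outages: list[Outage]) -> timedelta:
    """Sum the duration of every outage."""
    return sum((end - start for start, end in outages), timedelta())


def mttr(outages: list[Outage]) -> timedelta:
    """Mean time to repair: total downtime / number of failures."""
    return total_downtime(outages) / len(outages)


def mtbf(outages: list[Outage], window_start: datetime,
         window_end: datetime) -> timedelta:
    """Mean time between failures: total uptime / number of failures."""
    uptime = (window_end - window_start) - total_downtime(outages)
    return uptime / len(outages)


window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 11)  # a 10-day (240-hour) window
outages = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 11, 0)),   # 2 hours
    (datetime(2024, 1, 7, 14, 0), datetime(2024, 1, 7, 18, 0)),  # 4 hours
]

print(mttr(outages))                            # 3:00:00
print(mtbf(outages, window_start, window_end))  # 4 days, 21:00:00
```

Both functions assume at least one outage in the window; with no failures the averages are undefined.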
False positives and false negatives
This metric measures the number of false positives and false negatives generated by monitoring tools and systems. This metric can be used to evaluate the accuracy of monitoring and event management processes and to identify opportunities for improving the quality of event data.
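One possible way to turn these counts into comparable ratios is sketched below. The definitions used here (false alerts as a share of all alerts raised, and missed issues as a share of all real issues) are an assumption for illustration, and the figures are invented.

```python
def false_alert_ratio(true_positives: int, false_positives: int) -> float:
    """Share of raised alerts that did not correspond to a real issue."""
    raised = true_positives + false_positives
    return false_positives / raised if raised else 0.0


def missed_issue_ratio(true_positives: int, false_negatives: int) -> float:
    """Share of real issues that monitoring failed to alert on."""
    real = true_positives + false_negatives
    return false_negatives / real if real else 0.0


# Invented figures: 200 alerts raised, 150 of them genuine,
# plus 10 real issues that monitoring never flagged.
print(false_alert_ratio(true_positives=150, false_positives=50))   # 0.25
print(missed_issue_ratio(true_positives=150, false_negatives=10))  # 0.0625
```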
Event noise reduction
This metric measures the percentage of events and alerts that are filtered out or discarded as noise. This metric can be used to evaluate the effectiveness of noise reduction strategies and to identify opportunities for further optimisation.
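The percentage itself is simple arithmetic. A minimal sketch, with a hypothetical function name and invented figures:

```python
def noise_reduction_pct(total_events: int, filtered_events: int) -> float:
    """Percentage of all events that were filtered out as noise."""
    if total_events == 0:
        return 0.0
    return 100.0 * filtered_events / total_events


# Invented example: 1,000 events received, 850 discarded as noise.
print(noise_reduction_pct(1000, 850))  # 85.0
```

A high figure is only good news if the false negative count stays low, so the two metrics are best read together.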
By tracking and analysing these metrics, IT organisations can gain valuable insights into the performance and effectiveness of their Monitoring and Event Management processes, and make data-driven decisions to improve their IT operations.
What Are the Best Practices for Monitoring and Event Management?
Best practices for monitoring and event management are proven and repeatable methods or processes that are recognised as the most effective and efficient ways to achieve a particular objective or goal. These best practices have been developed through years of experience and experimentation, and are widely accepted and adopted by practitioners and organisations. Here are some best practices for the Monitoring and Event Management practice in ITIL 4:
Define clear objectives
Establish clear objectives for Monitoring and Event Management that align with the business goals and objectives of the organisation. This will help ensure that monitoring activities are focused on the most critical services and systems.
Select appropriate tools and systems
Select monitoring tools and systems that are appropriate for the needs of the organisation, and that provide the required level of visibility and control. Consider factors such as scalability, flexibility, ease of use, and integration with other ITSM processes.
Define event thresholds and alerts
Define clear event thresholds and alerts for warning and exceptional events and ensure that they are aligned with the needs of the organisation. This will help ensure that IT staff are notified of critical events in a timely manner and can take appropriate action to prevent or minimise the impact of service disruptions.
Implement automated event management processes
Implement automated event management processes to minimise manual intervention and reduce the risk of human error. This can include automated event correlation, root cause analysis, and incident ticket creation.
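To make the idea concrete, here is a minimal Python sketch of two of these automations: a very simple form of event correlation (collapsing repeats of the same check on the same CI within a time window) and automatic ticket creation for exceptional events. The `Event` fields, the 300-second window, and the ticket text are assumptions for the example and do not reflect any particular monitoring or ITSM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int   # seconds since an arbitrary start
    source: str      # name of the configuration item
    check: str       # what was measured, e.g. "cpu" or "ping"
    severity: str    # "informational", "warning" or "exceptional"


def correlate(events: list[Event], window_seconds: int = 300) -> list[Event]:
    """Keep one event per (source, check) burst.

    A repeat is suppressed when it arrives within window_seconds of the
    previous event with the same key, so a continuous stream of repeats
    stays collapsed into the first event.
    """
    last_seen: dict[tuple[str, str], int] = {}
    kept: list[Event] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.check)
        previous = last_seen.get(key)
        if previous is None or event.timestamp - previous > window_seconds:
            kept.append(event)
        last_seen[key] = event.timestamp
    return kept


def tickets_for(events: list[Event]) -> list[str]:
    """Create a (hypothetical) incident ticket title per exceptional event."""
    return [f"INC: {e.check} on {e.source}"
            for e in events if e.severity == "exceptional"]


raw = [
    Event(0, "web01", "cpu", "warning"),
    Event(60, "web01", "cpu", "warning"),      # repeat, suppressed
    Event(120, "db01", "ping", "exceptional"),
    Event(1000, "web01", "cpu", "warning"),    # new burst, kept
]
unique = correlate(raw)
print(len(unique))
print(tickets_for(unique))
```

Real correlation engines typically also relate events across different CIs, for example by using topology data, which this sketch does not attempt.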
Use data analytics and reporting
Use data analytics and reporting to identify trends, patterns, and anomalies in event data, and to support decision-making and continuous improvement. This can include dashboards, reports, and data visualisation tools.
Continually review and improve processes
Continually review and improve Monitoring and Event Management processes to ensure that they are effective and efficient in supporting organisational operations. This can include regular process reviews, stakeholder feedback, and benchmarking against industry best practices.
Align with other ITSM processes
Ensure that Monitoring and Event Management processes are aligned with other ITSM processes, such as Incident Management, Problem Management, and Change Management. This will help ensure that events are managed in a coordinated and effective manner across the IT organisation.
By following these best practices, organisations can establish effective and efficient Monitoring and Event Management processes that support the delivery of high-quality IT services to the organisation.
Why is Monitoring and Event Management So Important?
Monitoring and Event Management is important for several reasons. It improves service availability, reduces Mean Time to Repair, enhances customer satisfaction, increases operational efficiency, supports proactive problem management, enables better decision-making, and facilitates compliance.
Improves service availability
Effective monitoring and event management can help to detect and resolve issues before they cause service disruptions, improving service availability and minimising downtime.
Reduces Mean Time to Repair (MTTR)
By quickly identifying and resolving incidents, monitoring and event management can help to reduce the Mean Time to Repair (MTTR), which is the average time it takes to restore a service after an incident. Note: In some organisations the ‘R’ may stand for Restore or Recover, which, if you think about it, are slightly different metrics.
Enhances customer satisfaction
By ensuring that services are available and performing as expected, monitoring and event management can help to enhance customer satisfaction and confidence in IT services.
Increases operational efficiency
By automating event management processes and providing actionable insights, monitoring and event management can help to increase operational efficiency and reduce manual effort.
Supports proactive problem management
Monitoring and event management can provide valuable insights into recurring issues and potential problems, supporting proactive problem management and helping to prevent future incidents.
Enables better decision-making
By providing real-time and historical data on service performance, monitoring and event management can help IT teams and management to make better-informed decisions and optimise service delivery.
Facilitates compliance
Monitoring and event management can help organisations to meet various compliance requirements by providing visibility into service performance and availability.
Overall, effective ITIL 4 Monitoring and Event Management is essential for ensuring that IT services are delivered efficiently, effectively, and in a way that meets the needs of the organisation and its customers.
Events, Incidents, and Problems - What’s the Difference?
In ITIL 4, events, incidents, and problems are all related to the management of IT services, but they represent different stages in the lifecycle of an IT service and require different types of management.
ITIL 4 defines an event as “any change of state that has significance for the management of a service or other configuration item”.
ITIL 4 defines an Incident as “an unplanned interruption to a service or reduction in the quality of a service”.
ITIL 4 defines a problem as “a cause, or potential cause, of one or more incidents”.
So what’s the difference?
Let’s expand the definitions to explain the differences…
An event is any change of state that has significance for the management of an IT service or infrastructure. Events can be informational, warning, or exceptional, and can be raised by various sources such as software applications, hardware devices, or users. Events are typically logged in an event management system, which enables the IT team to monitor, analyse, and respond to them.
An incident is an unplanned interruption or reduction in the quality of an IT service. Incidents can be caused by a wide range of factors, such as hardware failure, software bugs, or user error. Incident management is the process of detecting, logging, categorising, prioritising, and resolving incidents as quickly as possible to minimize the impact on the organisation and its customers. The goal of incident management is to restore normal service operation as quickly as possible.
A problem is the underlying cause, or potential cause, of one or more incidents. Problem management is the process of identifying, analysing, and resolving the root cause of problems to prevent recurrence of incidents. Problem management can work proactively, before incidents occur, whereas incident management is reactive.
In summary, events are changes of state that may or may not require action, incidents are unplanned interruptions or reductions in the quality of IT services that require a rapid response, and problems are the underlying causes of incidents that require analysis and resolution to prevent recurrence. Effective management of events, incidents, and problems is essential for ensuring the quality, availability, and reliability of IT services.
Final Notes On ITIL® 4 Monitoring and Event Management
At the risk of repeating myself and just to make sure that we’ve got it nailed…
ITIL 4 Monitoring and Event Management is a practice that focuses on detecting, analysing, and managing events in the IT infrastructure to ensure the effective operation of IT services. It involves monitoring and collecting data from various sources such as applications, hardware, software, and networks, and using that data to identify and prioritise events.
Overall, ITIL 4 Monitoring and Event Management is a critical but often overlooked component of IT service management, helping organisations to maintain the performance, availability, and reliability of their IT services and to meet the needs of their customers and stakeholders.
If you want to learn more about Monitoring and Event Management, we offer a one-day ITIL® 4 Practitioner: Monitoring & Event Management training course.