July 30, 2025

Major Incident Manager Interview Questions and Answers Part-2

What is the escalation matrix at your company? (Hierarchical and functional)
The escalation matrix in my company defines how and to whom we escalate issues when extra attention or decisions are needed. We have both functional escalation (bringing in higher expertise or different teams) and hierarchical escalation (notifying higher-level management). Here’s how it works:

  • Functional Escalation: If an incident is not getting resolved with the current resources or expertise, we escalate functionally by involving additional teams or specialists. For example, if a database issue is beyond the on-call DBA’s expertise, we escalate to the database architecture team or even the vendor’s support. Similarly, if it’s a complex network issue, we might pull in our network SME who wasn’t originally on call. Functional escalation is about getting the right people involved.

  • Hierarchical Escalation: This is about informing or involving management levels as the situation's severity increases or if an SLA is likely to be breached. In our matrix:

    • For a P1 incident, the Incident Manager (me) will notify the IT Duty Manager or Incident Management Lead within, say, 15 minutes of declaration. If resolution is not found in an hour, we escalate to the Head of IT Operations.

    • Ultimately, for very severe or prolonged incidents, we escalate up to the CIO and relevant business executives (like the account manager or business owner of the service). We have criteria like: if a major incident exceeds 2 hours, inform CIO; if it’s causing significant client impact, inform Account Managers to handle customer comms.

  • Matrix Structure: We literally have a document/spreadsheet that lists: Level 1: Incident Manager on duty, Level 2: Incident Management Lead (or Service Delivery Manager), Level 3: Director of IT Operations, etc., with their contact info. Similarly, on the technical side, each support team has an escalation ladder: e.g., if the on-call engineer is stuck, call the team lead; if the team lead isn't available or also stuck, call the department manager; then maybe the head of technology. This ensures accountability at each level (a minimal sketch of how such a ladder can be encoded follows this list).

  • Example: Suppose a critical banking app is down and the initial team cannot solve it in X time. According to the matrix, I call the Senior Manager of Applications (functional escalation to more expertise) and also ping the Incident Process Owner to notify them (hierarchical). If things continue, next I might involve the CIO (hierarchical) to make major decisions like switching operations to disaster recovery site or to communicate with client’s leadership.

  • Why it’s important: Everyone knows whom to call next if things aren’t progressing. It prevents delays where people might be hesitant, and it provides authority – when I escalate to a higher-up, they can allocate more resources or make high-level decisions (like approving to shut down a system or communicate externally).

  • Business Escalation: Part of our matrix is also notifying the business side. For instance, if an incident affects a major client or revenue stream, there's an escalation to the account team or business continuity manager to handle non-IT aspects (customer management, regulatory notifications, etc.).

  • Periodic Review: We update the matrix regularly (people change roles, phone numbers update, etc.). We also occasionally simulate escalations to ensure contacts respond.
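To make that concrete, here is a minimal sketch (in Python, purely illustrative – the roles and timings mirror the P1 thresholds described above, not our actual contact document) of how such a hierarchical ladder can be encoded and consulted:

```python
from datetime import timedelta

# Hypothetical hierarchical escalation ladder for a P1 incident, using the
# timings described above (15 minutes, 1 hour, 2 hours). Contacts and
# thresholds are illustrative, not a real escalation document.
P1_HIERARCHICAL_LADDER = [
    (timedelta(minutes=15), "IT Duty Manager / Incident Management Lead"),
    (timedelta(hours=1),    "Head of IT Operations"),
    (timedelta(hours=2),    "CIO and relevant business executives"),
]

def roles_to_notify(elapsed: timedelta) -> list:
    """Return every role that should already have been notified for a P1
    that has been open for `elapsed` time."""
    return [role for threshold, role in P1_HIERARCHICAL_LADDER if elapsed >= threshold]

# Example: a P1 incident that has been running for 75 minutes.
print(roles_to_notify(timedelta(minutes=75)))
# -> ['IT Duty Manager / Incident Management Lead', 'Head of IT Operations']
```

The same idea extends to the functional side: each support team's ladder (on-call engineer, team lead, department manager) can be stored the same way, so there is never ambiguity about who gets called next.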

In summary, the escalation matrix is a pre-defined chain of command and expertise. Hierarchical escalation brings higher management attention as needed, and functional escalation brings in deeper technical expertise or additional teams. By following this matrix, we ensure that when an incident is beyond the current team's ability or threatens to breach SLAs, the right people are pulled in quickly and decision-makers are aware. This structured approach to escalation is a backbone of our major incident process.

 

What will you do if the technical team is not responding or not doing their job?
If a technical team or engineer is unresponsive during a critical incident, I have to take prompt action to get things back on track:

  • Multiple Contact Attempts: First, I'd try all forms of contact. If they aren't responding on Teams or email, I will call them on the phone. Perhaps they missed the initial alert – a direct phone call or even an SMS can grab attention. If one particular engineer is MIA and was critical, I'd reach out to others on that team or to their manager.

  • Escalate to Team Lead/Manager: If the on-call person isn't responding within, say, a few minutes on a P1, I escalate to their team lead or manager. For example, if the database on-call isn't joining, I'll call the database team lead to either find a backup or join in themselves. This is where having an up-to-date on-call roster is important.

  • Inform Incident Leadership: I'd also inform my Incident Management Lead or duty manager that "Team X is not responding; I have escalated to their manager." This ensures the situation is known at a higher level and they can assist if needed (e.g., call that team's director if necessary).

  • Workaround with What We Have: In parallel, I'd see if other teams can cover or if we can progress without them. For instance, if the network team isn't responding and we suspect a network issue, I might ask a system admin with some network knowledge to do basic checks (ping tests, etc.) while we keep trying the network folks, or leverage monitoring tools to gather the data that team would normally provide.

  • Document the Lack of Response: I keep a note in the incident timeline that “at 10:05 PM, paged network on-call, no response; 10:15 PM escalated to Network Manager, awaiting update.” This provides a clear record and also covers accountability later.

  • Replace or Bypass If Needed: In a severe scenario, if a particular person just isn't responding and time is ticking, once I have their manager on the line I'll request a replacement resource. Good organizations have backup on-call or secondary contacts. I'll say to the manager, "I need someone from your team now – if person A isn't reachable, can you get person B or C?" The manager may even jump in themselves if capable.

  • Post-Incident Follow-up: After the dust settles, I would address this formally. Not to point fingers, but reliability of on-call response is crucial. I'd work with that team's leadership to understand what happened – was the person unreachable due to some emergency, or was our contact info outdated? Or did they negligently ignore the call? Depending on the answer, we'd take action: maybe update the contact list, improve the paging system, or, if it's a performance issue, the manager handles it with that employee (training or disciplinary action if warranted). The incident management process might treat a non-response as a breach of an OLA, and it should be discussed in the post-incident review so it doesn't recur.

  • Present but Not Doing Their Job: If the team is engaged but dragging their feet or not taking action, I'll assertively guide them. Sometimes I encounter analysis-paralysis or reluctance. I'd say, "We need to try something now. Let's reboot that server or fail over – are there any objections?" If a team is hesitating, I might escalate to a higher technical authority to authorize an action. Essentially, I won't let inaction persist; I'll make decisions or find someone who can. Additionally, if I feel they're not giving it due attention (maybe treating a P1 too casually), I'd remind them of the impact ("This is affecting all customers; we need full focus"). If needed, I involve their manager to jolt them into urgency.

In summary, if a technical team isn’t responding, I escalate quickly up their chain and try alternate contacts, while mobilizing any interim solutions. The Major Incident Manager has to be the squeaky wheel in such cases – time lost equals more impact, so I’ll use every means to get the right engagement. Afterward, we ensure accountability so that such a lapse doesn’t happen again, whether that means process change or personnel change.

 

What types of incidents do you handle?
I handle a wide range of incidents across the entire IT infrastructure and applications spectrum. Essentially, any high priority incident (P1 or P2), regardless of technology domain, comes to the Major Incident Management process. Some types include:

  • Infrastructure Incidents: These involve servers, storage, operating systems, or data center issues. For example, a major server crash, VM host down, storage network outage, power failures in the data center, etc.

  • Network Incidents: Such as WAN link failures, router/switch outages, firewall misconfigurations locking out connectivity, or DDoS attacks impacting network availability. These are often widespread because the network is core – e.g., a company-wide network outage.

  • Application Incidents: Critical business applications going down or severely malfunctioning. For instance, our e-commerce website being unavailable, a core banking system error, ERP system issues, or even severe bugs from a new release causing outages. This can also include incidents where integrations between applications fail.

  • Database Incidents: Like a database server going offline, database corruption, or performance issues on the DB that cascade to app slowdowns. Any incident where the DB is the bottleneck and affecting services is in scope.

  • Security Incidents (major ones): While we have a separate security team, if there’s a major security breach (like ransomware spreading, or a critical vulnerability exploitation requiring emergency response), I would be involved or at least coordinate with the cybersecurity incident response. Often, major security incidents are run by the security incident lead, but I support with communication and coordination if it’s impacting services (for example, if we have to shut down systems to contain an attack, that’s both a security and availability incident).

  • Service Outages: This broad category includes email service down, VPN down for all remote users, file server inaccessible, etc. These could be due to infra or software issues but they manifest as a service outage.

  • Major Incident in Cloud Services: e.g., our cloud provider has an outage in a region affecting our applications. I handle coordinating with the cloud vendor and mitigating impact (like failover to another region if possible).

  • IT Facilities: In some cases, incidents like a data center cooling failure or fire alarm could become IT incidents (needing server shutdown or failover to DR). I would coordinate technical response in those scenarios as well.

  • Telephony/Communications: If the phone system or MS Teams is down company-wide, that’s a major incident I’d handle.

  • Critical Batch Job Failures / Data Incidents: For example, end-of-day processing in a bank fails or a major data pipeline breaks, missing an SLA to a client – those also come to my plate if the impact is high.

Essentially, “all IT infrastructure and applications” as the question hints. So I cover incidents in infrastructure, application, network, database – basically all IT domains as needed. I’m not the deep expert in each, but I coordinate the experts in each.

I’d add that handling all these types means I need a broad understanding of IT systems. One day I might be dealing with a network outage, the next day a database lock issue. The commonality is these incidents significantly impact the business and require urgent, coordinated response. So I’m versatile and able to shift between different technical realms (with the help of specific SMEs for each).

 

Where do you see yourself in 5 years?
In five years, I see myself growing into a senior leadership role in IT Service Management or IT Operations. Having honed my skills as a Major Incident Manager, I’d like to progress to roles such as Incident/Problem Management Lead or IT Operations Manager, where I can drive strategic improvements across the entire incident lifecycle. I also envision deepening my expertise in related areas – for example, becoming an expert in Service Reliability or DevOps processes, which complement incident management.

I’m passionate about the Major Incident function, so in five years I could be a Major Incident Process Owner globally, establishing best practices and training teams across the organization or multiple clients. I might also pursue further ITIL advanced certifications or even get into Site Reliability Engineering (SRE) practices to enhance how we prevent incidents.

Ultimately, I see myself as a leader who not only handles incidents reactively but also works proactively to improve service resilience. Perhaps I’ll be heading a Service Excellence team that encompasses Incident, Problem, and Change Management, using my frontline experience to create a more robust IT environment. I’m also interested in people management, so I could be managing a team of incident managers by then, mentoring them with the knowledge I’ve gained.

In summary, five years from now I aim to take on greater responsibility, possibly at a large enterprise or in an even more challenging domain, continuing to ensure that IT delivers reliable service to the business. And I certainly hope to grow with the company I join, so if I were to join your company, I’d love to see myself contributing at higher and broader capacities, aligning with the company’s evolution over that time.

 

Do you know what our company does?
Yes, I’ve researched your company thoroughly. Your company, [Company Name], is a leading IT services and consulting provider (for example, if it’s Infosys/TCS/Capgemini, I’d tailor accordingly: “a leading global IT consulting and outsourcing firm, serving clients across various industries with technology solutions”). I know that you specialize in delivering solutions such as [mention major services/products – e.g., digital transformation, cloud services, application development, managed infrastructure services, etc.].

For instance, I noted that your company has a strong presence in the Banking/Financial sector and also works in domains like retail and healthcare (assuming that fits the company). One of your flagship services is around enterprise cloud and digital solutions – you help clients modernize their IT. Also, your company’s revenue was around $X billion last year, and it has a global workforce of over N thousand employees, which indicates a huge scale of operations.

I’m aware of some recent news: you have been investing in AI and automation in IT Service Delivery (I recall reading a press release about a new AI Ops platform or a partnership you did). Your company’s motto/mission revolves around innovation and customer-centric service (I’d use the actual slogan if I found it, like “Building a bold tomorrow” or such).

I also took note that [Company Name] prides itself on its strong ITIL-based processes and service quality – which is directly relevant to the Major Incident Manager role. In summary, your company is a powerhouse in the IT industry, providing end-to-end IT solutions and services to clients worldwide. I wanted to ensure I understand your business so that I, as a potential Major Incident Manager, align my approach to the types of services and clients you handle. This knowledge will help me tailor my incident management strategies to your business context from day one.

 

Why do you want to join our company?
I am excited about the prospect of joining [Company Name] for several reasons:

  • Leadership in Industry: Your company is a well-respected leader in the IT services industry, known for its innovation and large-scale operations. As a Major Incident Manager, I thrive in environments that are complex and dynamic. Joining a top-tier firm like yours means I’ll be dealing with major clients, cutting-edge technologies, and challenging incidents – all of which will allow me to leverage my skills fully and also continue learning.

  • Culture and Values: From what I’ve researched, your company emphasizes values like customer focus, excellence, and teamwork. These resonate with me. Major incident management is all about teamwork and keeping the customer in mind, so I feel my mindset aligns well with your culture. I’ve also seen that employee development is important to you – many employees mention the good training programs and growth opportunities. I’m attracted to a company where I can grow my career long-term.

  • ITIL and Process Maturity: I know your organization is quite mature in ITIL processes and Service Management. For someone like me who is ITIL-certified and process-driven, that’s a great fit. I want to contribute to and learn from an environment that follows best practices. Also, I’ve read that [Company Name] is adopting the latest ITSM tools (possibly ServiceNow upgrades or AI-driven monitoring). That tells me I’ll get to work with modern tools and methodologies, which is exciting.

  • Global Exposure: Your company’s clientele spans multiple industries and countries. I look forward to the global exposure – managing incidents for different clients and technologies around the world. That diversity of experience is something I value, and it will make me a stronger professional.

  • Impact and Responsibility: The role at your company likely comes with significant responsibility (given the scale, a major incident could affect thousands or millions of end-users). I want that challenge. Knowing that my role will directly help maintain the reputation of [Company Name] by swiftly resolving crises is a big motivator. I take pride in such impactful work.

  • Personal Recommendation/Research: (If applicable) I’ve spoken to colleagues or read employee testimonials about working here – people talk about a collaborative environment and respect for the Incident Management function. It’s important for me to work at a place that recognizes the importance of a Major Incident Manager’s role, and I sense that here.

In summary, I want to join [Company Name] because I see it as a place where my skills will contribute significantly to the organization’s success, and where I will also grow professionally. I’m enthusiastic about the possibility of being part of your team and helping uphold the high service standards your company is known for.

 

Why should we hire you?
You should hire me because I bring a strong combination of experience, skills, and passion for Major Incident Management that aligns perfectly with what this role requires:

  • Relevant Experience: I have over X years’ experience managing high-severity incidents in a fast-paced IT environment. I’ve been the point person for countless P1 incidents – from infrastructure outages to application failures – and have a track record of driving them to resolution quickly and efficiently. I understand the ITIL process deeply and have implemented it under pressure. This means I can hit the ground running and require minimal training to start adding value.

  • Proven Communication & Leadership: Major Incident Managers must communicate clearly with technical teams and leadership. I pride myself on my communication skills – in my current role, I’ve been commended for timely and transparent updates during crises. I also lead bridge calls with confidence and calm, keeping teams focused. You’ll get a person who can coordinate cross-functional teams (network, server, application, vendors) and ensure everyone’s on the same page. I essentially act as a leader during emergencies, and I’m comfortable making decisions and escalations. These leadership qualities are essential for the role and I have demonstrated them consistently.

  • Tool Proficiency (ServiceNow, etc.): I am well-versed in ServiceNow – creating incident tickets, mass communications, using the CMDB, and generating incident reports. If your environment is on ServiceNow (or similar tools), I’ll be able to leverage it fully. I also have exposure to monitoring tools and can quickly grasp dashboards (which helps in incident validation and tracking).

  • Process Improvement Mindset: I don’t just resolve incidents – I also improve the process around them. For example, in my last job, I reduced major incident recurrence by implementing a better problem management linkage. I will continuously seek ways to reduce incident impact and frequency for your organization, whether through better monitoring, runbooks, or streamlining the comms process. This adds long-term value beyond day-to-day firefighting.

  • Calm Under Pressure: Perhaps one of the most important traits – I stay calm and organized when things are chaotic. I’ve been through outages at 3 AM, systems failing on Black Friday sale, etc., and colleagues know me for maintaining composure. This attitude helps teams stay focused and also inspires confidence in management that the situation is under control.

  • Alignment with Company: As I discussed, I know what your company does and I’m genuinely excited about it. I fit culturally – I work well in a team, I’m customer-centric, and I have a strong work ethic. I’m also willing to go the extra mile (nights, weekends) whenever incidents demand – which is inherent in this job.

  • ITIL Certified and Continuous Learner: I’ve got ITIL certification and keep myself updated with ITIL v4 practices. I’m also familiar with Agile/DevOps concepts which increasingly tie into incident management (like post-incident reviews feeding into continuous improvement). So I bring not just static knowledge, but a mindset of evolving and learning, which is something every organization needs as technology and best practices change.

In short, you’d be hiring someone who is battle-tested in incident management, brings structured process along with practical know-how, and is enthusiastic about protecting and improving service reliability for the business. I’m confident that I can not only fulfill the requirements of this role but also drive positive outcomes – like improving your incident KPIs, increasing customer satisfaction, and strengthening the overall incident management practice at your company.

 

What are your salary expectations?
(Note: In an interview I would approach this diplomatically.) I am open to a competitive offer that reflects the responsibilities of the Major Incident Manager role and my experience. My current understanding is that a role like this in a major IT company typically ranges around [provide a range if pressed, based on market research – e.g., “XYZ to ABC currency per annum”]. Considering my X years of experience and skill set, I would expect a salary in the ___ range (for example, “in the mid-₹X0 lakhs per annum in India” or “around $$X in the US market”). However, I’m flexible and, for me, the opportunity to work at [Company Name] and the potential for growth and contributions is a big factor. I’m sure if I’m the right fit, we can come to a mutually agreeable number.

(If I have a specific figure because they insist, I would give a number within a reasonable range. Otherwise, I emphasize I’m negotiable but looking for market fair compensation.)

 

How many incidents do you typically handle on a weekly or monthly basis?
On average, I handle a high volume of incidents, though not all are major. In terms of Major (P1) incidents, I'd say roughly 3-4 per day when on shift. That translates to about 15 P1s in a week, and perhaps 50-60 P1 incidents in a month (assuming around 200 P1s per month across a 24/7 team split into shifts). This is in a large enterprise setting. For P2 incidents, the volume is higher – probably about double the P1 count. So maybe around 6-8 P2 incidents a day in my queue, which is ~100-120 P2s a month for me personally, but company-wide that could be 400-500 P2s per month as a whole team.

Including lower priorities (P3, P4), the Service Desk handles many of those without needing my involvement unless they risk breaching SLAs or get escalated. My primary focus is on P1 and high P2 incidents. If we include all priorities that I touch or oversee, weekly it could be dozens of incidents that I have some hand in (either direct management or oversight). But strictly as lead Major Incident Manager, maybe ~200 P1s occur in the organization monthly, and since we have multiple shifts, I end up managing a portion of those – likely ~40-50 P1s a month personally, depending on shift distribution.

The key point is I'm very accustomed to handling multiple incidents daily and juggling priorities. Our environment is quite busy, so incident management is a constant activity. That said, those numbers can fluctuate – some weeks are quieter, and in other weeks, if there's a big problem (like a widespread virus or a big change gone wrong), we might handle far more.

 

How many total incidents occur in a week or month?
Across all priorities, our IT support handles a large number of incidents. Let me break it down as we usually measure:

  • P1 (Critical) incidents: We see about 7-8 P1 incidents per day on average across the operation, which comes to roughly ~200 P1 incidents per month in total (7 per day * 30 days). These are the major ones that get full attention.

  • P2 (High) incidents: Typically, the volume of P2s is about double that of P1s. So we might have around 14-15 P2 incidents per day across the organization, totaling maybe 400-500 P2 incidents per month overall.

  • P3 and P4 incidents: These are much more numerous, but mostly handled by the service desk and support teams without needing major incident process. They could be in the hundreds per week. For instance, P3 might be a few thousand a month depending on user base size, and P4 even more, but many of those are minor and resolved quickly.

Summing up, if we talk all incidents (P1-P4), our company might handle several thousand incidents per month. But focusing on the critical ones: around 200 P1s and 500 P2s per month are typical in my experience. Per week, that’s about 50 P1s and 100+ P2s.

Within my shift (since we run 3 shifts to cover 24/7), I personally handle a subset. Usually, I manage 3-4 P1 incidents per day when on duty (a sizeable share of those 7-8 daily P1s, because colleagues on the other shifts handle the rest) and maybe 5-10 P2s per day.
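As a quick sanity check, the rough arithmetic behind those figures (purely illustrative, using the approximate daily rates above) looks like this:

```python
# Back-of-the-envelope rollup of the approximate daily rates quoted above.
p1_per_day = 7           # ~7-8 org-wide P1s per day
p2_per_day = 14          # P2 volume is roughly double the P1 volume

p1_per_week,  p2_per_week  = p1_per_day * 7,  p2_per_day * 7    # ~49 and ~98
p1_per_month, p2_per_month = p1_per_day * 30, p2_per_day * 30   # ~210 and ~420

# An even three-way shift split would give ~70 P1s per manager per month;
# in practice one manager's share is lower (the ~40-50 mentioned above),
# since the load is never split perfectly evenly across shifts.
print(p1_per_week, p2_per_week, p1_per_month, p2_per_month)
```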

These numbers indicate a high-activity environment. They underscore why having a structured incident management process is crucial – with that many incidents, you need clear prioritization (only ~10-15% of those are truly major, others can be delegated). It also shows my experience is not from a small environment; I’m used to dealing with incident queues at scale.

 

How do you differentiate between a P1 and P2 incident?
The distinction between a P1 and P2 incident primarily comes down to impact and urgency – basically how severe the issue is and how quickly it needs resolution:

  • P1 (Priority 1) – This is a Critical incident. It usually means highest impact: a full outage or total failure of a mission-critical service, affecting a large number of users (or a whole site, or all customers). And it’s high urgency: there’s no workaround and immediate attention is required. For example, “Online banking system is completely down for all customers” or “Corporate network is offline company-wide” would be P1. P1 implies the business is significantly hampered – maybe financial loss, safety issue, or SLA breach is imminent. We trigger our major incident process for P1s.

  • P2 (Priority 2) – This is High priority but one step down. Typically, significant impact but not total. It might affect a subset of users or a secondary service, or the main service is degraded but somewhat operational. Urgency is high but perhaps a workaround exists or it’s happening in a non-peak time. For example, “Email is working but extremely slow for one region” or “One of two redundant internet links is down (capacity reduced but service up)” could be P2. Business is impacted, perhaps inconvenienced, but not completely stopped. P2 still needs prompt attention, but maybe not an all-hands-on-deck like P1.

Concretely, to differentiate, I ask:

    • Scope of impact: All users vs many vs some. P1 is often global or enterprise-wide; P2 might be multiple departments or a critical group of users but not everyone.

    • Criticality of service: Is the affected service a top critical service? If yes and it’s down, that’s P1. If it’s important but one tier lower, maybe P2.

    • Workaround: If users have no alternative way to do their work, leans toward P1. If a workaround exists (even if inconvenient), it might be P2.

    • Urgency: If we can tolerate a few hours without the service (and it’s after hours, for example), maybe P2. If every minute of downtime costs money or reputation, that’s P1.

In ITIL terms, for example, P1 = High Impact + High Urgency (often Extensive/Widespread impact with Critical urgency). P2 is typically one grade down on either axis – high impact but lower urgency, or vice versa (Significant impact with Critical urgency could still be P1 depending on the matrix). Many companies define something like: P1 means the whole service is down; P2 means the service is degraded or has significant issues but not total failure.
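To make that mapping concrete, here is a minimal sketch (in Python, purely illustrative – every organization tunes its own matrix) of an impact × urgency lookup:

```python
# Illustrative ITIL-style impact x urgency priority matrix; real matrices are
# organization-specific, this just encodes the logic described above.
PRIORITY_MATRIX = {
    ("high",   "high"):   "P1",  # widespread outage, no workaround, every minute counts
    ("high",   "medium"): "P2",  # widespread but a workaround exists or it is off-peak
    ("medium", "high"):   "P2",  # limited scope but time-critical
    ("medium", "medium"): "P3",
    ("low",    "low"):    "P4",
}

def priority(impact: str, urgency: str) -> str:
    """Look up the priority; default to P3 for combinations not explicitly listed."""
    return PRIORITY_MATRIX.get((impact, urgency), "P3")

print(priority("high", "high"))    # P1 - e.g. online banking down for all customers
print(priority("high", "medium"))  # P2 - e.g. one of two redundant links down
```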

In practice, if there’s ever doubt, we might initially treat it as higher priority and later downgrade if appropriate. It’s safer to start with P1 response if unsure. But experience and the priority matrix help guide that decision.

So, to summarize: P1 = “We’re on fire” (immediate, major impact), P2 = “This is serious but not a five-alarm fire.” I apply the formal criteria our organization has, which align with that logic, to classify incidents correctly.

 

How many incidents do you handle on a weekly or monthly basis?

I typically handle a considerable number of incidents. On a weekly basis, I actively manage roughly 15-20 major incidents (P1/P2). Breaking that down, perhaps 5-7 might be P1s and the rest P2s in a week. On a monthly basis, that scales to around 60-80 major incidents that I’m directly involved in. These figures can vary based on what’s happening (some months have more outages due to seasonal load or big changes).

If we include all incidents of any priority that I oversee or touch indirectly, the numbers are much higher – our entire support organization might handle hundreds per week. But specifically for what I personally handle as a Major Incident Manager:

    • P1s: ~40-50 per month (as mentioned earlier about ~200 P1s org-wide, split across a team and shifts, I’d handle a portion of those).

    • P2s: Perhaps ~80-100 per month that I oversee (again, shared among MIMs).

    • Lower priority incidents are usually handled by support teams without my intervention unless they escalate.

Another perspective: Each shift/day I might deal with 1-3 P1s and a few P2s. So over 20 workdays in a month, that math holds – e.g., 2 P1s a day * 20 days = ~40 P1s per month, which aligns with earlier data.

These numbers illustrate that I’m very accustomed to a high volume incident environment. It requires good time management and prioritization on my part. For instance, on a busy day I might be coordinating a major outage in the morning and another in the afternoon, while also keeping an eye on a few P2s in between.

I'd like to add that while quantity is one aspect, I ensure the quality of handling each incident is maintained – no matter how many incidents are going on. That's where teamwork (other MIMs, support teams) comes in too. But yes, dozens weekly and on the order of a hundred monthly is the scale I work with, which has kept me sharp and efficient in the role.

 

What is the difference between an event and an incident?
In ITIL (and general IT operations) terminology, the terms “event” and “incident” have distinct meanings:

  • Event: An event is any detectable or discernible occurrence that has significance for the management of the IT infrastructure or delivery of services. Not all events are bad – an event could be normal or expected. It’s basically a change of state that is noteworthy. For example, a server CPU crossing 80% utilization generates an event in a monitoring tool, a user logging in could generate a security event in logs, or a backup job completion is an event. Many events are routine and do not require action. They’re often handled by monitoring systems and might just be informational or warnings. Events can be categorized (in ITIL v3 terms) as informational, warning, or exception. Only when events indicate something abnormal or that something might be wrong do they potentially lead to an incident.

  • Incident: An incident, specifically, is an unplanned interruption to an IT service or a reduction in its quality. It usually means something is broken or not working as it should, impacting users or a business process. Every incident manifests as something gone wrong – downtime, an error, performance degradation, etc. Importantly, as a rule: "Not all events are incidents, but all incidents are events." In other words, incidents are the subset of events that have a negative impact. For example, if that server CPU event at 80% crosses a threshold and the server becomes unresponsive, it becomes an incident because service is affected.

To illustrate: Event vs Incident – If monitoring shows a memory spike on a server (event), but it auto-resolves or doesn’t impact anything, it remains just an event and perhaps an entry in a log. However, if that memory spike causes the server to crash and a service goes down, now we have an incident (service interruption). An incident typically triggers the incident management process (with ticket creation, support engagement, etc.), whereas many events are handled automatically by event management or do not need intervention at all.

Another way to put it:

    • We manage events to filter and detect conditions. Event Management might create an incident if an event is serious.

    • We manage incidents to restore service when something has gone wrong.

ITIL 4 also emphasizes that an incident is an unplanned interruption or reduction in service quality, whereas events are just occurrences that are significant. A key part of operations is having good event monitoring to catch issues early – ideally resolving or informing before they become user-visible incidents.
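As a minimal sketch of that event-to-incident flow (the metric, thresholds, and categories are illustrative, not a specific monitoring product's API):

```python
# Hypothetical event classification: informational / warning / exception.
# Only an exception (actual service impact) is promoted to an incident.
def classify_cpu_event(cpu_pct: float, service_responding: bool) -> str:
    if not service_responding:
        return "exception"       # service interruption -> raise an incident
    if cpu_pct >= 80:
        return "warning"         # significant, act proactively, but no incident yet
    return "informational"       # routine event, log it and move on

for cpu, up in [(65, True), (85, True), (99, False)]:
    kind = classify_cpu_event(cpu, up)
    action = "create incident ticket" if kind == "exception" else f"record {kind} event"
    print(f"CPU {cpu}%, service {'up' if up else 'down'} -> {action}")
```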

In summary: events are signals or alerts (which could be benign or abnormal), and incidents are when those signals indicate an actual service disruption or issue requiring response. As a Major Incident Manager, I primarily deal with incidents; however, our monitoring team deals with tons of events and only escalates to us when an event implies an incident (like an outage) that needs action.

 

What is the difference between OLA and SLA?

An SLA (Service Level Agreement) is an agreement between an IT service provider and an external customer that defines the expected level of service. It sets targets like uptime, response/resolution times, etc., and is often part of a contract. For example, an SLA might state “99.9% uptime for the website” or “Priority 1 incidents will be resolved within 4 hours.” It’s customer-facing and focuses on the end-to-end service performance metrics.

An OLA (Operational Level Agreement), on the other hand, is an internal agreement between different internal support teams or departments within the same organization. It outlines how those teams will work together to meet the SLAs. For instance, if the SLA to the customer is resolution in 4 hours, an OLA between, say, the Application Support team and the Database team might commit the DB team to provide a fix or analysis within 2 hours when escalated, so that the overall 4-hour SLA can be met. The OLA details each group's responsibilities, timelines, and the support they provide to each other.
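To illustrate how the targets nest, here is a minimal sketch using the example above; the 4-hour SLA and 2-hour database OLA come from that example, while the 1-hour triage OLA is an added illustrative figure:

```python
from datetime import timedelta

# The example above: a customer-facing SLA underpinned by internal OLAs.
SLA_P1_RESOLUTION = timedelta(hours=4)                   # the promise to the customer
OLAS = {
    "Application Support triage":   timedelta(hours=1),  # internal commitments that
    "Database team fix / analysis": timedelta(hours=2),  # together underpin the SLA
}

# The internal commitments must leave headroom inside the external promise.
total_ola_time = sum(OLAS.values(), timedelta())
assert total_ola_time <= SLA_P1_RESOLUTION, "OLAs add up to more than the SLA allows"
print(f"SLA target: {SLA_P1_RESOLUTION}, sum of OLA targets: {total_ola_time}")
```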

Key differences:

    • Audience: SLA is external (provider ↔ customer), OLA is internal (between support groups).

    • Scope: SLA covers the entire service delivery to the customer. OLA covers a component or underpinning service and usually does not directly involve the customer; it underpins the SLA.

    • Enforcement: SLAs can have penalties or credits if violated because they’re often contractual. OLAs are typically less formal (not legal contracts, but rather commitments to ensure smooth internal operations).

    • Example: Think of an SLA as “the promise to the customer,” while OLAs are “the promises we make to each other inside to keep the external promise.” So if the SLA is a chain, OLAs are the links inside that chain between internal teams and maybe underpinning contracts with vendors (UCs).

In ITIL terms, both SLA and OLA are part of Service Level Management. OLAs are not a substitute for SLAs, but they are important to achieving SLAs. If an SLA is failing, often we look at whether an underpinning OLA wasn’t met by an internal team. For instance, maybe the network team had an OLA to respond in 15 minutes to P1s and they didn’t – that can cause the SLA breach.

To conclude, SLA = external service commitment to a client; OLA = internal support commitment between departments to enable meeting those SLAs. Both are documented agreements, but at different levels.

 

Please share an example of a time when you had to multitask and make sound judgments in a fast-paced, high-stress environment, while keeping people informed.
One example that comes to mind is when I had to handle a data center power outage that caused multiple systems to fail simultaneously, during a weekday afternoon. It was a high-stress scenario with several critical services down at once – email, an internal ERP, and a client-facing portal were all affected (because they shared that data center).

Multitasking and Judgment: I effectively had multiple incidents in one and had to multitask across them:

    • First, I immediately declared a major incident and initiated the bridge call. However, very soon the magnitude required splitting focus: I had the infrastructure team working on power restoration, the server team planning failovers for key services, and the application teams dealing with recovery of their specific applications once power returned.

    • I had to prioritize on the fly: The client-facing portal was the most time-sensitive (SLA with clients), so I directed resources to get that up via our DR site. Meanwhile, I trusted the IT infrastructure folks to concentrate on restoring power and not micro-manage them, beyond getting updates.

    • There was also a judgment call about failing over to DR (Disaster Recovery) for each service. You can't do that casually because it might involve data sync issues. Under pressure, I conferred quickly with the senior engineers and made the call: for the portal, yes, fail over to DR now (to minimize client impact); for the internal ERP, wait 15 more minutes, as power was expected back and switching that to DR could cause more complexity. These were tough calls with incomplete information, but I weighed business impact vs. risk and decided accordingly.

    • Simultaneously, I had to keep an eye on dependencies – for example, even if apps fail over, network needed to reroute. I made sure those teams were engaged and prepared.

Keeping People Informed: Throughout this, I maintained clear and constant communication:

    • I provided updates every 15 minutes on the bridge about what each thread was doing (“Portal failing over to cloud DR, ETA 10 minutes to live,” “Power vendor on site, rebooting UPS,” etc.). This kept all technical folks aware of overall progress.

    • I had a separate communication stream to leadership and affected users. I sent an initial notification within 10 minutes: “We have a data center outage affecting X, Y, Z systems, teams are responding.” Then every 30 minutes I sent email updates to the wider stakeholders about which services were back and which were pending. For instance, “Portal is now running from DR site as of 3:45pm, users may access read-only data; ERP still unavailable, next update at 4:00pm.”

    • I also had to hop between communication channels – I was on the phone with the data center facilities manager (as that’s somewhat outside normal IT), on the bridge coordinating IT teams, and on email/IM updating management. It truly was multitasking under pressure.

    • At one point, a C-level executive joined the call unexpectedly for an update. I paused the tech discussion for a minute to concisely brief them on the situation and expected timelines (keeping my cool despite the pressure of upper management presence, which was noted later as a positive).

Outcome: Within about an hour, power was stable again. We restored all services – the portal was up via DR (later failed back to production), email came back, ERP came back with minimal data loss. Throughout, because I kept everyone informed, there was surprisingly no panic from users or management; they felt updated and knew we had a plan. After, leadership praised the incident handling – especially the communication frequency and clarity, and the fact that I juggled multiple workstreams effectively.

This situation demonstrates my ability to stay calm, multitask across parallel issues, make key decisions with limited time, and continuously communicate to all stakeholders in a high-stress, fast-paced incident. It was like being an air-traffic controller for IT services during a storm, and I successfully landed all the planes safely, so to speak.

 

Can you walk me through your experience in implementing preventative measures to reduce the frequency and severity of IT incidents?
Certainly. In my role, I don’t just react to incidents; I also focus on preventing them (or at least reducing their impact). Here are some preventative measures I’ve implemented and my experience with them:

  • Post-Incident Reviews and Problem Management: After each major incident, I lead a blameless post-mortem. For example, we had recurring outages with a particular application whenever usage spiked. Through post-incident analysis, we identified a pattern – the root cause was a memory leak in the app not caught in testing. I raised a Problem record and worked with the development team to get a patch (thus preventing that incident from happening again). In another case, frequent database lockups were causing incidents; the problem management process led us to do a schema optimization and index tuning, which prevented those lockups going forward. My experience is that diligent root cause analysis and ensuring permanent fixes (or at least mitigations) are applied has a huge effect on reducing repeat incidents.

  • Trend Analysis for Proactive Fixes: I’ve analyzed incident trends over time (e.g., noticing that Monday mornings had many VPN issues). By spotting those trends, I coordinated preventive actions – in the VPN case, we found the authentication server had a memory issue that always cropped up after weekend backup jobs. We then scheduled a preventive reboot of that server early Sunday, and the Monday incident spike disappeared. Essentially, I used historical incident data to predict and address underlying issues.

  • Monitoring and Alert Improvements (AIOps): I spearheaded projects to enhance monitoring so we catch potential failures early (proactive incident management). For instance, after a major storage incident, we implemented additional sensors and alerts on storage array performance. This paid off – once, an alert warned of rising I/O latency and we intervened before it escalated to an incident. I also introduced an APM (Application Performance Management) tool for our critical customer app, which started alerting us about slowdowns before users called in. Overall, by investing time in better monitoring and even AI-based predictive alerting, we prevented incidents or at least fixed them at the event stage before they became full-blown incidents (a minimal sketch of this kind of threshold alerting follows this list).

  • Capacity Planning: One preventative measure was establishing a formal capacity review for key systems. For example, we noticed incidents around end of quarter on our reporting database (due to heavy load). I worked with infrastructure to implement capacity planning – upgrading resources or archiving old data proactively. This reduced those high-load failures. Essentially, ensuring our systems have headroom prevented a lot of incidents that come from overload.

  • Resilience and Redundancy Initiatives: I have been involved in improving the architecture resilience. After some network-related major incidents, I pushed for and helped justify adding a second network provider link (redundant ISP) for our data center. Since implementation, if one link goes down, the other picks up – we haven’t had a major site-wide network outage since. Similarly, after a major incident due to a single point of failure in an app’s design, I advocated with development to create an active-active cluster. We simulated failure and proved the new design would avoid downtime. Building redundancy is a key preventive strategy I’ve driven.

  • Runbooks and Training (Human Factor): Some incidents happen due to operator error or slow response. I created operational runbooks and drills for critical scenarios. For example, we made a runbook for “App hung – how to safely recycle services without data loss.” We practiced it in test environments. This meant when that scenario re-occurred at 2 AM, the on-call had clear steps, reducing both severity and duration of the incident. I also conducted workshops with support teams to share knowledge from past incidents, so they’re less likely to make mistakes or they recognize early warning signs.

  • Change Management Tightening: A lot of incidents originate from changes. I worked with the Change Manager to identify changes that frequently led to incidents and implement more stringent testing or approval for such changes. In one case, a particular integration deployment caused two incidents; we then required that any future integration changes have a performance test and a rollback plan reviewed by the architecture team. This drastically reduced change-induced incidents.
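As referenced above under Monitoring and Alert Improvements, here is a minimal sketch of that kind of threshold-based early warning (the metric and thresholds are hypothetical, not a specific monitoring product):

```python
# Hypothetical early-warning check on storage I/O latency: alert at a warning
# level well before the point where users would feel an outage.
WARN_LATENCY_MS = 20    # investigate proactively
CRIT_LATENCY_MS = 50    # users likely impacted -> incident territory

def check_storage_latency(latency_ms: float) -> str:
    """Map a latency sample to an alert level (thresholds are illustrative)."""
    if latency_ms >= CRIT_LATENCY_MS:
        return "CRITICAL: raise an incident and engage the storage team now"
    if latency_ms >= WARN_LATENCY_MS:
        return "WARNING: investigate before this becomes a user-visible incident"
    return "OK"

for sample in (8, 27, 63):
    print(f"{sample} ms -> {check_storage_latency(sample)}")
```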

Through these experiences, I learned that proactive measures can drastically reduce incident frequency and impact. As a result of these initiatives, we saw measurable improvements: e.g., a ~20% drop in P1 incidents year-over-year, and those that did happen were resolved faster (since we had better tools and plans). Preventative work is an ongoing effort, and I continuously collaborate with Problem Management and SRE/engineering teams to harden the environment. It’s rewarding because every prevented incident is essentially an invisible win – no downtime that day, which means business as usual for everyone!