September 25, 2025

Top 50 Problem Management Interview Questions and Answers - Part 3

  1. A user reports a minor issue that doesn’t impact many people, but you suspect it might indicate a larger underlying problem. How do you handle it?
    Answer: “Even minor symptoms can be early warning signs of bigger issues. Here’s how I’d approach it:

    For example, a single user’s complaint about slow search on our app led me to notice in the logs that a certain database query occasionally ran very slowly. Digging further, we found an indexing issue – it wasn’t affecting all searches, but under heavier load it could have caused system-wide slowness. We fixed the index proactively, turning a minor report into a major improvement.

    In summary, I treat minor issues with a detective’s mindset – they might be clues. By investigating and monitoring proactively, I either prevent a larger incident or at least confirm it’s truly minor and keep it on the radar.”

    • Acknowledge and Investigate: I wouldn’t dismiss it just because it’s minor. I’d thank the user for reporting and gather as much detail as possible about the issue. Minor issues often come with sparse data, so I’d ask: how often does it happen? What exactly happens? Any pattern noticed? Getting clarity on the symptom is step one.

    • Check for Related Incidents: I’d search our incident/problem database to see if this has been reported elsewhere or if there are similar issues. Sometimes a “minor” glitch reported by one user is actually happening to others who haven’t spoken up. If I find related incidents or past problems, that gives context – perhaps it’s part of a known error or a recurring theme.

    • Assess Impact if it Escalated: I consider what the worst-case scenario is. Could this minor issue be a precursor to a major outage? For example, a small error in a log might hint at a memory leak that could eventually crash a system. I mentally map out, or with the team, discuss how this small symptom could relate to overall system health. This risk assessment justifies spending time on it even if it’s not impacting many users yet.

    • Proactive Problem Logging: If it appears non-isolated and technically significant, I’d log a Problem record proactively. Even if I’m not 100% sure it’s a big problem, having a formal investigation ticket means it won’t be forgotten. In the description, I’d note why we suspect a deeper issue (e.g., “Minor data discrepancy observed – could indicate sync issues between databases”).

    • Investigate in Background: I allocate some time (maybe off-peak or assign an analyst) to investigate the underlying cause. This might involve looking into system logs around that time, reviewing recent changes in that area of code, or replicating the scenario in a test environment to see if it triggers anything else. I often use the principle of “find the cause when the impact is low, to prevent high impact.” For example, a single user’s minor issue might reveal an error that, if conditions worsened, would affect everyone.

    • Monitor More Closely: I might set up extra monitoring or logging temporarily to see if this minor issue is happening quietly elsewhere. For instance, turn on verbose logging for that module, or set an alert if the minor error condition occurs again or for other users (see the sketch after this list). Proactive detection is key – if it starts to spike or spread, we catch it early.

    • Keep User Updated: I’d tell the user that we are looking into it even if it’s minor. This manages their expectations and encourages a culture of reporting anomalies. Users are often the canaries in the coal mine.

    • Escalate if Needed: If my investigation does find a bigger problem (say the minor glitch is due to a hidden data corruption issue), I’d immediately scale up the response – involve appropriate engineers, prioritize it like any other problem, and communicate to management that a potentially serious issue was uncovered. Then follow normal problem resolution steps.

    • If It Truly Is Minor: Sometimes a minor issue is just that – minor and isolated. If after analysis I conclude it’s low risk and low impact, I’d still decide what to do: maybe it’s worth fixing as part of continuous improvement, or if it’s not cost-justified, we might document it as a known minor bug with a decision not to fix now. But importantly, that decision would be conscious and documented (and revisited if circumstances change).
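
    To make the “Monitor More Closely” step concrete, here is a minimal, illustrative Python sketch of the kind of temporary watcher I might stand up. The log path, error pattern, and thresholds are hypothetical placeholders; in a real environment this would normally live in the existing monitoring stack (Splunk, Prometheus alerts, etc.) rather than a one-off script.

```python
import re
import time
from collections import deque
from pathlib import Path

# Hypothetical values - adjust to the module and symptom under observation.
LOG_FILE = Path("/var/log/app/search-service.log")   # assumed log location
ERROR_PATTERN = re.compile(r"SlowQueryWarning")       # assumed error signature
WINDOW_SECONDS = 15 * 60                              # look-back window
ALERT_THRESHOLD = 5                                   # occurrences before we flag it

def tail_for_symptom() -> None:
    """Follow the log and raise a simple alert if the 'minor' error
    starts recurring - a sign the problem may be spreading."""
    recent_hits: deque[float] = deque()
    with LOG_FILE.open() as handle:
        handle.seek(0, 2)                 # start at end of file, like `tail -f`
        while True:
            line = handle.readline()
            if not line:
                time.sleep(1)
                continue
            if ERROR_PATTERN.search(line):
                now = time.time()
                recent_hits.append(now)
                # drop hits that fell out of the look-back window
                while recent_hits and now - recent_hits[0] > WINDOW_SECONDS:
                    recent_hits.popleft()
                if len(recent_hits) >= ALERT_THRESHOLD:
                    # in practice this would page on-call or open a problem record
                    print(f"ALERT: {len(recent_hits)} occurrences in "
                          f"{WINDOW_SECONDS // 60} min - escalate for investigation")

if __name__ == "__main__":
    tail_for_symptom()
```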

 

  1. How would you explain a complex technical root cause to a non-technical stakeholder or executive?
    Answer: “Translating tech-speak to plain language is something I’ve had to do often. My approach:

    For example, explaining a “memory leak due to an unreleased file handle” to a non-tech manager, I said: “Our application wasn’t cleaning up a certain kind of task properly – kind of like leaving too many tabs open in your browser – eventually it overloaded and crashed. We found that ‘leak’ and fixed it, and now the app is running normally without piling up those tasks.” The manager understood immediately.

    Ultimately, I aim to tell the story of the root cause in simple, relatable terms, focusing on what happened and what we did about it, so that even a non-technical person walks away knowing the issue is understood and handled.”

    • Start with the Bottom Line: I open with the conclusion and impact, not the technical details. For example, instead of starting with “Thread contention in the JVM caused deadlocks,” I’d say, “The outage was caused by a software error that made the system get stuck.” This gives them the gist in simple terms.

    • Use Analogies or Metaphors: Analogies can be powerful. If the root cause is complex, I find a real-world parallel. Suppose the root cause is a race condition (timing issue in code) – I might analogize it to miscommunication: “It’s like two people trying to go through a door at the same time and getting stuck; our system had two processes colliding because they weren’t coordinated.” If it’s database deadlocks, maybe: “Think of it as two people each holding a key the other needs – both waiting for the other to release it.” These images convey the essence without jargon.

    • Avoid Acronyms and Jargon: I consciously strip out or quickly define technical terms. Instead of “API gateway threw 503 due to SSL handshake failure,” I’d say, “Our system’s front door was not able to talk securely with a key component, so it shut the door as a safety measure – that’s why users couldn’t get in.” If I must mention a term, I’ll briefly explain it (“The database ‘deadlock’ – which means two operations blocked each other – caused the slow-down”).

    • Focus on Cause and Resolution: I ensure I cover three main things executives care about: what happened, why it happened, and what we’re doing about it (or did about it). For the why part (root cause), after the plain description, I might add a bit of technical color only if it helps understanding. But I quickly move to how we fixed it or will prevent it. E.g., “We found a flaw in the booking software that only showed up under heavy load. We’ve now patched that flaw, and additionally implemented an alert so if anything like that starts happening, we catch it early.” This emphasizes that the problem is understood and handled.

    • Relate to Business Impact: I tie the explanation to something they value. For instance: “This root cause meant our checkout process could fail when two customers tried to buy at once, which obviously could hurt sales – that’s why it was critical to fix.” This way, I connect the tech cause to business terms like revenue, downtime, customer trust. It answers the unspoken question executives often have: “So what?”

    • Use Visual Aids if Helpful: If it’s a meeting or a report, sometimes a simple diagram can help. I might draw a one-box-fails diagram or a timeline showing where the breakdown happened. Executives often grasp visuals faster than a paragraph of text. In a written RCA report, I include a non-technical summary at the top for this audience.

    • Check Understanding: When speaking, I watch their body language or ask if it makes sense. If someone still looks puzzled, I’ll try another angle. Maybe I’ll simplify further or give a quick example. I avoid condescension; I frame it as “This tech stuff can be confusing, but essentially… [simplified cause].”

    • Emphasize Prevention: To wrap up, I highlight what’s been done to ensure it won’t happen again. Executives want confidence. So I might conclude: “In short, a rare combination of events caused the system to lock up. We’ve implemented a fix in the code and added an automatic restart feature as a safety net. We’re confident it’s resolved, and we’ll keep a close eye on it.” This gives them assurance in language they trust.

 

  1. If a major incident occurs outside of business hours, how do you handle problem management activities for it?
    Answer: “Major incidents don’t respect clocks! If something big happens at, say, midnight, here’s what I do:

    For example, a 2 AM datacenter cooling failure once took out multiple servers. We restored power and cooling by 4 AM (incident resolved), but the root cause (why the cooling failed) was investigated the next day with the facilities and engineering teams. We scheduled that review in daylight, and by afternoon we had recommendations to prevent recurrence (a redundant cooling system, alarms). Handling it in this two-phase approach – stabilize at night, analyze by day – worked well.

    In short, even if a major incident happens off-hours, I make sure the problem management process kicks in promptly – capturing information during the firefight and formally investigating as soon as feasible – to find and fix the underlying cause.”

    • Immediate Response (Incident Management): First, the priority is to get the service restored – that’s incident management. If I’m the on-call Problem Manager or Incident Manager, I’d join the conference bridge or incident call as needed, even after hours. The focus initially is containment and resolution of the outage. I collaborate with the on-call technical teams to apply workarounds or fixes to get things up and running.

    • Capture Clues in Real-Time: Even while it’s being resolved, I wear a “problem investigator” hat to an extent. I advise the team to preserve evidence – don’t overwrite logs, take snapshots of error states, etc. (see the evidence-capture sketch after this list). I might ask someone to start note-taking timeline events (“server X rebooted at 12:40, fix applied at 12:50,” etc.). After hours, adrenaline is high and documentation can slip, so I try to capture key details that will be useful later. If I’m not needed hands-on in the fix, I quietly start gathering data for later analysis.

    • Flag for Problem Management: The next morning (or once the incident is stable), I ensure a Problem record is formally created if the nature of the incident warrants one (which major incidents usually do). Many companies have a practice of automatically kicking off problem management after a Severity-1 incident. I’d either create the problem ticket myself or confirm that it’s in the queue for review. I link all relevant incident tickets to it. This ensures we follow up despite the odd hour occurrence.

    • Post-Incident Review Scheduling: I’d coordinate to have a post-incident review meeting as soon as practical (often the next business day). This includes all the folks who worked the incident (even if they were bleary-eyed at 2 AM when it ended). We’ll recap what happened with a fresher mind, and then pivot to root cause analysis discussion as we would for any incident. The timing is important – not so soon that people haven’t had rest, but soon enough that details are fresh.

    • Communication to Stakeholders: If the incident was major, executives and other stakeholders will want to know what went wrong. I might send a preliminary incident report (that night or first thing in the morning) saying, “We had X outage, service was restored by Y, root cause investigation is underway and we will follow up with findings.” This buys time to do a proper RCA during normal hours. It also signals that problem management is on it.

    • Overtime & Handoffs: Recognizing that the team might be exhausted, I ensure that if the root cause analysis can wait until business hours, it does. I don’t want people making mistakes because they’re tired. If the service is stable, the deep dive can happen the next day. If however the fix was temporary and we risk another outage before morning, I might rally a secondary on-call team (who might be fresher) to work the problem fix through the night. For instance, “We applied a temp fix at 1 AM, but need to implement a permanent database patch before morning business – Team B will handle that at 3 AM.” Planning hand-offs is key.

    • Follow Problem Management Procedure: Then it’s business as usual for problem management – collecting logs, analyzing root causes, involving vendors if needed, etc., during normal hours. We treat the after-hours incident with the same rigor: identify root cause, document it, fix permanently. The only difference is some steps get queued to working hours.

    • Self-Care and Team Care: As a side note, I also look out for my team. If someone pulled an all-nighter fixing an outage, I’d likely excuse them from the morning RCA meeting and catch them up later, or ensure they rest while someone else continues the investigation initially. Burnt-out engineers can’t effectively solve problems, so balance is important.
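
    As a companion to the “Capture Clues in Real-Time” point above, here is a small, hypothetical evidence-capture helper. The paths, commands, and ticket number are placeholders, and most teams would adapt this to their own tooling; the point is simply to snapshot logs and context before they are overwritten.

```python
import shutil
import subprocess
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical sources of evidence - replace with the systems actually involved.
LOG_SOURCES = [Path("/var/log/syslog"), Path("/var/log/app/service.log")]
SNAPSHOT_ROOT = Path("/srv/incident-evidence")

def snapshot_evidence(incident_id: str, note: str) -> Path:
    """Copy key logs into a timestamped folder and append a timeline entry,
    so nothing is lost before the next-day root cause analysis."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = SNAPSHOT_ROOT / f"{incident_id}-{stamp}"
    target.mkdir(parents=True, exist_ok=True)

    for source in LOG_SOURCES:
        if source.exists():
            shutil.copy2(source, target / source.name)   # preserve file timestamps

    # Capture a quick view of running processes as additional context.
    ps_output = subprocess.run(["ps", "aux"], capture_output=True, text=True)
    (target / "processes.txt").write_text(ps_output.stdout)

    # Append a human-readable timeline entry for the post-incident review.
    with (SNAPSHOT_ROOT / f"{incident_id}-timeline.log").open("a") as timeline:
        timeline.write(f"{stamp}  {note}\n")
    return target

# Example usage during the incident bridge (hypothetical ticket number):
# snapshot_evidence("INC0012345", "server X rebooted, temp fix applied")
```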

 

Technical Interview Questions

  1. What is the difference between incident management and problem management in ITIL (IT Service Management)?
    Answer: “Incident Management and Problem Management are two closely related but distinct processes in ITIL. The key difference lies in their goals and timing:

    • Incident Management is about restoring normal service operation as quickly as possible when something goes wrong. An “incident” is an unplanned interruption or reduction in quality of an IT service (for example, a server outage or application error). The focus for incidents is on quickly fixing the symptom – getting the user or business back up and running. This might involve workarounds or rebooting a system, etc. Incident management typically has a shorter timeline and urgency to resolve immediately, often through a predefined process to minimize downtime. Think of it as firefighting – put out the fire and restore service fast.

    • Problem Management, on the other hand, is about finding and addressing the root causes of incidents. A “problem” is essentially the underlying cause of one or more incidents. Problem management may take longer because it involves analysis, diagnosis, and permanent resolution (which could be a code fix, design change, etc.). The goal is to prevent incidents from happening again or to reduce their impact. In problem management, we don’t just ask “how do we get the system back?” but “why did this incident happen, and how do we eliminate that cause?” It’s a bit more complex and can be a longer-term process than incident management.

    To illustrate: If a website goes down, incident management might get it back up by restarting the server (quick fix), whereas problem management would investigate why the website went down – e.g., was it a software bug, resource exhaustion, etc. – and then work on a solution to fix that underlying issue (like patching the bug or adding capacity).

    Another way to put it: incident management is reactive and focused on immediate recovery, while problem management focuses on thorough analysis and prevention – it can be reactive (triggered by incidents that already occurred) or proactive (finding potential problems before incidents occur). They overlap in that incident data often feeds into problem analysis. But an incident is considered resolved when service is restored, whereas a problem is resolved when the root cause is addressed and future incidents are prevented.

    In ITIL v4 terms, incident management is a practice ensuring quick resolution of service interruptions, and problem management is a practice that reduces both the likelihood and impact of incidents by finding causes and managing known errors.

    So, in summary: Incident management = fix the issue fast and get things running (short-term solution), Problem management = find out why it happened and fix that cause (long-term solution to prevent recurrence). Both are crucial; incident management minimizes downtime now, problem management minimizes downtime in the future by preventing repeat incidents.”
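
    To make the distinction tangible, here is a small illustrative sketch of how the two record types typically relate in an ITSM data model. The field names are generic and not tied to any particular tool.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Incident:
    """An unplanned interruption - closed as soon as service is restored."""
    incident_id: str
    description: str
    resolved: bool = False            # True once the symptom is fixed (e.g. server restarted)
    problem_id: Optional[str] = None  # link to the underlying problem, if one is raised

@dataclass
class Problem:
    """The underlying cause of one or more incidents - closed only when the
    root cause is addressed and recurrence is prevented."""
    problem_id: str
    description: str
    root_cause: Optional[str] = None
    workaround: Optional[str] = None
    permanent_fix_deployed: bool = False
    related_incidents: list = field(default_factory=list)

# One problem (the cause) can explain many incidents (the symptoms):
outage_1 = Incident("INC001", "Website down Monday", resolved=True, problem_id="PRB042")
outage_2 = Incident("INC002", "Website down Thursday", resolved=True, problem_id="PRB042")
bug = Problem("PRB042", "Web server crashes under load",
              root_cause="memory leak in session handler",
              workaround="nightly service restart",
              related_incidents=["INC001", "INC002"])
```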

 

  1. Explain the key steps in the ITIL problem management process.
    Answer: “ITIL outlines a structured approach for problem management. The key steps in the process are:

    • Problem Identification (Detection): Recognize that a problem exists. This can be through trend analysis of incidents (proactively finding recurring issues), from a major incident that triggers a problem record, or via technical observation of something that’s not right. Essentially, it’s detecting underlying issues, sometimes even before they cause incidents. For example, noticing that a particular error has occurred in multiple incidents might lead to identifying a problem.

    • Logging & Categorization: Once identified, log the problem in the ITSM system. You record details like description, affected services, priority, etc. Categorize the problem (by type, like software, hardware, etc.) and prioritize it based on impact and urgency. Prioritization ensures the most serious problems are addressed first. Logging provides a record and unique ID to track the problem’s lifecycle.

    • Investigation & Diagnosis (Root Cause Analysis): This is the core phase where the problem team analyzes the problem to find the root cause. They gather data (logs, error messages, timelines) and apply root cause analysis techniques – could be 5 Whys, Ishikawa diagrams, check past changes, etc. The goal is to identify what is actually causing the incidents or issues. Diagnosis may require multiple iterations or tests. ITIL acknowledges this can take time and expertise. For example, during this step you might discover that a memory leak in an application is the root cause of frequent crashes.

    • Workaround Identification (if needed): While the root cause is being sought, or if it will take time to fix, the team finds a workaround to mitigate the impact. A workaround is a temporary solution that allows service to function (perhaps in a reduced way) until a permanent fix. For instance, if a service keeps crashing, a workaround might be to schedule automatic restarts every hour to prevent buildup of issues (a minimal sketch of such a workaround follows this list). This step often overlaps with incident management – known errors with workarounds are documented so that the Service Desk can apply them to recurring incidents.

    • Known Error Record Creation: Once the root cause is found (or even if not yet, but a workaround is known), ITIL suggests recording a Known Error. A known error record documents the problem, its root cause, and the workaround. Essentially, as soon as you know the root cause (or at least have a good handle on it) and/or have a workaround, you log it as a known error so others can reference it. This is stored in the Known Error Database (KEDB). For example, “Problem: email system crashes. Root cause: memory leak in module X. Workaround: restart service weekly.”

    • Solution Identification: Find a permanent resolution to eliminate the problem. This often involves change management because it might be a change to the IT environment – e.g., applying a patch, changing a configuration, upgrading hardware. The problem team will identify possible solutions or recommend a change required. They may have to evaluate multiple options (repair vs replace component, etc.) and possibly do a cost-benefit analysis for major problems.

    • Change Implementation (Error Control): Implement the fix through the Change Enablement / Change Management process. “Error Control” is ITIL’s term for resolving a problem by deploying a change to fix the error. This includes submitting a change request, getting approvals, and deploying the fix in production. Example: if the root cause is a software bug, the error control phase would be getting the development team to code a fix and deploying that patch via change management. ITIL v4 mentions applying solutions or deciding on long-term handling if a permanent fix isn’t viable immediately.

    • Verification (Post-Resolution): After implementing the solution, verify that it indeed resolved the problem. Monitor the system to ensure the incidents don’t recur. Perhaps run tests, or see that no new incidents linked to this problem occur over some period. This step is about ensuring the problem is fully resolved, and there are no unexpected side effects. ITIL suggests taking time to review the resolution and confirm the problem is eliminated.

    • Closure: Finally, formally close the problem record. Before closure, ensure all documentation is updated: the problem record should have the root cause, the fix implemented, and any lessons learned. Also, check that any related incidents can be closed or are linked to the problem resolution (sometimes service desk will notify affected users that a permanent fix has been applied). Closing includes verifying that all associated change records are closed and the Knowledge Base (KEDB) is updated with the final solution information. At closure, we might also do a brief review: did the problem management process work well? Are there improvements for next time (this feeds continual improvement).
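
    Here is a minimal sketch of the scheduled-restart workaround mentioned above, assuming a hypothetical systemd-managed service. In practice this would usually be a cron job, systemd timer, or the platform’s own scheduler; the point is that it buys time while the permanent fix goes through change management.

```python
import subprocess
import time

SERVICE_NAME = "email-service"   # hypothetical service affected by the known error
RESTART_INTERVAL = 60 * 60       # restart hourly until the permanent fix is deployed

def restart_service() -> None:
    """Temporary workaround only: bounce the service to clear leaked resources.
    The problem record stays open until the real fix is deployed via change management."""
    subprocess.run(["systemctl", "restart", SERVICE_NAME], check=True)
    print(f"Workaround applied: {SERVICE_NAME} restarted")

if __name__ == "__main__":
    while True:
        time.sleep(RESTART_INTERVAL)
        restart_service()
```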

    In summary, the problem management process flow is: detect -> log -> investigate -> (provide workaround) -> identify root cause -> propose fix -> implement fix -> verify -> close, with documentation and known error records created along the way. The outcomes of a successful problem management process include reduced incidents and improved system stability after solutions are implemented.
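
    As an informal illustration of that flow (not an official ITIL artifact), here is a compact sketch of the lifecycle as a simple state machine. Tools such as ServiceNow use their own state names, so treat these as examples.

```python
from enum import Enum, auto

class ProblemState(Enum):
    DETECTED = auto()
    LOGGED = auto()
    UNDER_INVESTIGATION = auto()
    KNOWN_ERROR = auto()        # root cause and/or workaround documented in the KEDB
    FIX_IN_PROGRESS = auto()    # change raised to implement the permanent solution
    RESOLVED = auto()           # fix deployed and verified
    CLOSED = auto()             # documentation complete, lessons captured

# Allowed forward transitions in this simplified model.
TRANSITIONS = {
    ProblemState.DETECTED: [ProblemState.LOGGED],
    ProblemState.LOGGED: [ProblemState.UNDER_INVESTIGATION],
    ProblemState.UNDER_INVESTIGATION: [ProblemState.KNOWN_ERROR, ProblemState.FIX_IN_PROGRESS],
    ProblemState.KNOWN_ERROR: [ProblemState.FIX_IN_PROGRESS],
    ProblemState.FIX_IN_PROGRESS: [ProblemState.RESOLVED],
    ProblemState.RESOLVED: [ProblemState.CLOSED],
    ProblemState.CLOSED: [],
}

def advance(current: ProblemState, target: ProblemState) -> ProblemState:
    """Move a problem record forward, rejecting transitions the process doesn't allow."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move from {current.name} to {target.name}")
    return target

# Example: a problem where a workaround is documented before the permanent fix.
state = ProblemState.DETECTED
for nxt in (ProblemState.LOGGED, ProblemState.UNDER_INVESTIGATION,
            ProblemState.KNOWN_ERROR, ProblemState.FIX_IN_PROGRESS,
            ProblemState.RESOLVED, ProblemState.CLOSED):
    state = advance(state, nxt)
print(state.name)   # -> CLOSED
```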

    ITIL emphasizes documenting each step, communicating known errors, and using change control for fixes, so it’s a controlled, learning-oriented process rather than ad-hoc fixing.”

 

  1. How do you differentiate between reactive and proactive problem management? Can you give examples of each?
    Answer: “Reactive vs Proactive Problem Management are two approaches within the problem management practice:

    • Reactive Problem Management is when you respond to problems after incidents have occurred. It’s essentially the “firefighting” mode: an incident (or multiple incidents) happens, and then you initiate problem management to find the root cause and fix it so it doesn’t happen again. For example, if a server crashes three times in a week (incidents), reactive problem management would kick in to investigate the crashes, find the root cause (say a faulty power supply or a bug in an update), and then implement a resolution (replace the hardware or patch the software). It’s reactive because it’s triggered by something going wrong. Most classic problem management work – root cause analysis after an outage – is reactive. In short, reactive = solving problems triggered by incidents.

      Example: A database outage occurs due to an unknown issue. After restoring service (incident resolved), the problem team conducts RCA and discovers a transaction log filling up as the root cause. They then implement better log rotation to prevent future outages. This is reactive problem management; it happened because the incident already impacted us.

    • Proactive Problem Management involves identifying and resolving problems before incidents occur. Here, you’re on the lookout for weaknesses or error trends in the environment that could lead to incidents if not addressed. It’s more preventive. Techniques include trend analysis of incident records, monitoring data for early warning signs, or routine reviews/audits of infrastructure for single points of failure. For instance, if you notice through monitoring that a server’s memory usage has been climbing steadily and will likely hit 100% in a month, you treat this as a problem to be solved proactively (maybe by adding more memory or fixing a memory leak) before it actually crashes and causes an incident – a small projection sketch follows this list. Proactive problem management is often about using data and experience to foresee issues and eliminate them. Proactive = preventing future incidents by addressing potential problems in advance.

      Example: Let’s say your service desk notices that a particular software’s error logs are showing a non-critical error repeatedly, although users haven’t complained yet. Through proactive problem management, you investigate that error message, find a misconfiguration that could lead to a failure if load increases, and fix it ahead of time. Another example: analyzing past incidents might reveal a trend that every Friday the network gets slow – before it turns into an outage, you investigate proactively and discover a bandwidth bottleneck, then upgrade capacity or reconfigure traffic, thereby avoiding a major incident.
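
    Building on the memory-usage example above, here is a small illustrative sketch of the kind of trend projection that turns monitoring data into a proactive problem record. The numbers are invented, and it requires Python 3.10+ for statistics.linear_regression.

```python
from statistics import linear_regression   # Python 3.10+

# Hypothetical daily memory utilisation samples (percent) from monitoring.
days = list(range(14))
memory_pct = [52, 53, 55, 56, 58, 59, 61, 63, 64, 66, 68, 69, 71, 73]

slope, intercept = linear_regression(days, memory_pct)

# Project when utilisation crosses 100% if the trend continues unchanged.
days_until_full = (100 - intercept) / slope - days[-1]
print(f"Growing ~{slope:.1f}% per day; roughly {days_until_full:.0f} days until exhaustion.")
print("Raise a proactive problem record now, before this becomes an incident.")
```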

    ITIL encourages proactive problem management because it can save the organization from incidents that never happen (which is hard to quantify but very valuable). It’s like maintenance on a car – fix the worn tire before it blows out on the highway.

    To summarize: Reactive problem management is after the fact – an incident has happened, and we don’t want it again, so we find and fix the root cause. Proactive problem management is before the fact – looking for the warning signs or known risks and addressing them so that incidents don’t occur in the first place. Both use similar analysis techniques, but proactive requires a mindset of continuous improvement and often data analysis (like trending incident reports, monitoring system health) to spot issues early.”

 

  1. What tools and techniques do you use for root cause analysis in problem management?
    Answer: “For root cause analysis (RCA), I use a variety of tools and techniques, choosing the one(s) best suited to the problem’s nature. Some common ones I rely on include:

    [Figure: An example of a Fishbone (Ishikawa) diagram – a tool used to systematically identify potential root causes by category.]
    One key technique is the Ishikawa (Fishbone) Diagram, which helps in brainstorming and categorizing potential causes of a problem. I draw a fishbone chart with the problem at the head, and bones for categories like People, Process, Technology, Environment, etc. Then the team and I list possible causes under each. This ensures we consider all angles – for instance, if a server is failing, we’d consider causes in hardware (machine issues), software (bugs), human factors (misconfiguration), and so on. It’s great for visualizing and discussing cause-effect relationships and not missing a branch of inquiry.

    Another staple is the 5 Whys technique. This is a straightforward but powerful method: we keep asking “Why?” up to five times (or as many as needed) until we drill down from the symptom to a fundamental cause. For example, an incident is “Server outage.” Why? – Power supply failed. Why? – Overloaded circuit. Why? – Data center power distribution not adequate. Why? – No upgrade done when new servers added. Why? – Lack of procedure to review power usage. By the fifth why, we often reach a process or systemic cause, not just the immediate technical glitch. 5 Whys helps verify if the supposed root cause is truly the last link in the chain, not just a surface cause.

    I also use Pareto Analysis in some cases, which isn’t an RCA method per se, but helps prioritize which problems or causes to tackle first (80/20 rule) – for instance, if multiple issues are contributing to downtime, fix the one causing 80% of it first. It’s useful when data shows many small issues vs. few big ones.
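
    As a quick illustration of how Pareto analysis guides that prioritization, here is a short sketch over invented downtime figures.

```python
# Hypothetical downtime (minutes last quarter) attributed to each open problem.
downtime_by_problem = {
    "PRB-DB deadlocks": 640,
    "PRB-batch job failures": 210,
    "PRB-VPN drops": 95,
    "PRB-printer queue": 30,
    "PRB-UI glitch": 10,
}

total = sum(downtime_by_problem.values())
cumulative = 0
print("Problem                      Share   Cumulative")
for name, minutes in sorted(downtime_by_problem.items(), key=lambda kv: kv[1], reverse=True):
    cumulative += minutes
    print(f"{name:<28} {minutes / total:6.0%}   {cumulative / total:6.0%}")
# The output shows the top one or two problems account for the bulk of downtime -
# those are the ones to attack first.
```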

    For complex, multi-faceted problems, Kepner-Tregoe (KT) Problem Analysis is valuable. KT gives a logical step-by-step approach: define the problem (what it is and is not – in terms of identity, location, timing, and magnitude), then identify possible causes, evaluate them against the problem definition, and test the most likely cause. I’ve used KT especially when the cause isn’t obvious and needs a more structured investigation to avoid bias. It forces you to describe the problem in detail (“is/is not” analysis) and systematically eliminate unlikely causes, which is helpful in thorny scenarios.

    There are other techniques too: Log Analysis and Correlation – using tools like Splunk to sift through logs and correlate events around the time of an incident. This is less of a “formal method” and more a practice, but it’s core to RCA in IT (e.g., correlating a spike in memory with a specific job running helps root cause a performance issue). Modern AIOps tools can assist here by identifying patterns in large data sets.
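
    In the same spirit, here is a minimal, tool-agnostic sketch that lines up error timestamps against batch-job start times. In reality this correlation usually happens inside Splunk or a similar aggregator; the timestamps below are invented.

```python
from datetime import datetime, timedelta

# Hypothetical extracted events: when CPU-spike errors appeared vs. when a batch job started.
error_times = [datetime(2025, 9, 22, 2, 5), datetime(2025, 9, 23, 2, 7), datetime(2025, 9, 24, 2, 4)]
job_starts = [datetime(2025, 9, 22, 2, 0), datetime(2025, 9, 23, 2, 0), datetime(2025, 9, 24, 2, 0)]

WINDOW = timedelta(minutes=15)

# Count how many errors fall shortly after a job start - a simple correlation signal.
correlated = sum(
    1 for err in error_times
    if any(timedelta(0) <= err - start <= WINDOW for start in job_starts)
)
print(f"{correlated}/{len(error_times)} error events occurred within "
      f"{WINDOW.seconds // 60} minutes of the batch job starting.")
# A 3/3 hit rate strongly suggests the job is involved; the fishbone / 5 Whys work
# then focuses on why that job drives the spike.
```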

    Fault Tree Analysis (FTA) is another formal method: it’s a top-down approach where you start with the problem and map out all possible causes in a tree diagram (with logical AND/OR nodes). It’s somewhat similar to fishbone but more Boolean logic oriented; I use it in high availability system failures, where multiple things might have to go wrong together.

    In terms of software tools:
    – I often use our ITSM tool (like ServiceNow) to document and track RCA steps and to store things like fishbone diagrams (some ITSM suites have RCA templates).
    – For drawing fishbone or other diagrams, I might use Visio or an online collaboration board (Miro, etc.) especially when doing team brainstorming.
    – Splunk or other log aggregators are indispensable tools for drilling into technical data to support the RCA (seeing error patterns, etc.).
    – We sometimes use specialized RCA software or templates especially in post-incident reports.

    And of course, Brainstorming in general with subject matter experts is a technique in itself – often, I’ll gather the team involved and we’ll use one or many of the above tools collaboratively to root out the cause.

    In practice, I might combine techniques: start with gathering evidence (logs, metrics), use 5 Whys to get an initial cause hypothesis, then do a fishbone with the team to ensure we’re not missing anything, and finally use the data to confirm which cause is real. For example, on a recent problem, we saw a spike in CPU leading to a crash. Through 5 Whys we hypothesized a specific job was causing it. Then via log analysis, we confirmed that job started at the times of the spike. We then did a fishbone to see why that job caused high CPU (was it code inefficiency, too much data, etc.), leading us to the root cause in code.

    So, to summarize, common RCA tools/techniques I use are: Fishbone diagrams for structured brainstorming of causes, the 5 Whys for drilling down into cause-effect chains, log/monitoring analysis for data-driven insights, Pareto for prioritizing multiple causes, and sometimes formal methods like Kepner-Tregoe for complex issues. These help ensure we identify the true root cause and not just treat symptoms.”

 

  1. What is a Known Error in ITIL, and how is a Known Error Database (KEDB) used in problem management?
    Answer: “In ITIL terminology, a Known Error is a problem that has been analyzed and has a documented root cause and a workaround (or permanent solution) identified. Essentially, it’s when you know what’s causing the issue (the error) and how to deal with it temporarily, even if a final fix isn’t yet implemented. The status “Known Error” is often used when the problem is not fully resolved but we’ve pinned down the cause and perhaps have a way to mitigate it.

    A simple way to put it: once a problem’s root cause is known, it becomes a Known Error if the fix is not yet implemented or available. For example, if we discover that a bug in software is causing outages, and the vendor will release a patch next month, we mark the problem as a known error and note that bug as the root cause, along with any workaround we use in the meantime.

    The Known Error Database (KEDB) is a repository or database where all known error records are stored and managed. It’s part of the knowledge management in ITSM. The KEDB is accessible typically to the service desk and support teams so that when an incident comes in, they can quickly check if it matches a known error, and if so, apply the documented workaround or resolution steps.

    Here’s how the KEDB is used and why it’s useful:

    • Faster Incident Resolution: When an incident occurs, support teams search the KEDB to see if there’s a known error that matches the symptoms. If yes, the KEDB entry will tell them the workaround or quick fix to restore service, which can greatly reduce downtime (a small lookup sketch follows this list). For example, if there’s a known error “Email server occasionally hangs – workaround: restart service,” when the help desk gets a call about email being down, they can check the KEDB, find this, and immediately guide the fix (restart) without needing to escalate or troubleshoot from scratch. So it’s a big time-saver.

    • Knowledge Sharing: The KEDB essentially is a subset of the knowledge base focused on problems/known errors. It ensures that lessons learned in problem analysis are preserved. Today’s known error might help solve tomorrow’s incident quicker. It prevents siloed knowledge; the resolution info isn’t just in the problem manager’s head but in the database for all to use.

    • Avoiding Duplication: If an issue recurs or affects multiple users, having it in the KEDB prevents each support person from treating it as a new unknown incident. They can see “Ah, this is a known error. Don’t need to raise a new problem ticket; just link the incident to the existing known error and apply the workaround.” It streamlines the process and avoids multiple teams unknowingly working on the same root cause separately.

    • Tracking and Closure: The KEDB entries are updated through the problem lifecycle. Initially, a known error entry might list the workaround. Later, when a permanent fix is implemented (say a patch applied), the known error is updated or flagged as resolved (and eventually archived). This also helps in tracking which known errors are still outstanding (i.e., problems waiting for a permanent fix) and which have been fixed.
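
    To illustrate the “Faster Incident Resolution” point above, here is a tiny, deliberately naive lookup sketch. Real KEDBs live inside the ITSM tool and use proper search; the records shown are invented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KnownError:
    ke_id: str
    symptoms: str
    root_cause: str
    workaround: str

# A toy KEDB with invented records.
KEDB = [
    KnownError("KE001", "email server hangs, users cannot send mail",
               "memory leak in transport module", "restart the mail transport service"),
    KnownError("KE002", "Monday batch job fails with stale cache error",
               "bug in cache-invalidation script", "clear the cache manually before the run"),
]

def match_known_error(incident_description: str) -> Optional[KnownError]:
    """Very naive keyword match - real tools use search/knowledge features."""
    words = set(incident_description.lower().split())
    for record in KEDB:
        if words & set(record.symptoms.lower().split()):
            return record
    return None

hit = match_known_error("Users report email server hangs again")
if hit:
    print(f"Known error {hit.ke_id}: apply workaround -> {hit.workaround}")
```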

    In ITIL, when a problem record is created, it remains a “Problem” until root cause is found. Once root cause is identified, and especially if a workaround is identified, a Known Error record is generated (often automatically in tools like ServiceNow). This then can be published to the knowledge base for support teams.

    So to boil it down: A Known Error = Problem with known root cause (and typically a workaround). KEDB = the library of those known error records that support and problem teams use to quickly resolve incidents and manage problems. It’s an important link between Problem Management and Incident Management, enabling incident teams to “deflect” or handle incidents with known solutions readily.

    Real-world example: we had an issue where a certain scheduled job would fail every Monday. We investigated and found the root cause (bug in script), but developers needed time to rewrite it. In the meantime, our workaround was to manually clear a cache before Monday’s run. We recorded this as a Known Error in the KEDB. When the job failure incident happened, our Level 1 support saw the Known Error article, applied the cache-clearing workaround, and service was restored in minutes rather than hours. Later, when the permanent fix was deployed, we updated the known error as resolved.

    In summary: a Known Error is a problem that we understand (cause identified, workaround known) even if not fully fixed yet, and the KEDB is the centralized repository of all such known errors, used to expedite incident resolution and maintain institutional knowledge in IT support.”

 

  1. Which metrics are important to track in problem management, and why?
    Answer: “In problem management, we track metrics to gauge how effectively we are finding and eliminating root causes and preventing incidents. Some important metrics include:

    • Number of Problems (and Trend): The total count of open problems at any time. We often track new problems logged vs. problems closed each month. If open problems keep rising, it might indicate we’re not keeping up with underlying issues. We also monitor the problem backlog size. A high number of unresolved problems could mean resource constraints or bottlenecks in the process.

    • Average Time to Resolve a Problem: This measures how long, on average, it takes from problem identification to implementing a permanent fix. It’s akin to Mean Time to Resolve (MTTR) but for problems (not just incidents). This is usually much longer than incident MTTR because RCA and changes take time. However, we want to see this trend down over time or be within targets. If this is too high, it could mean delayed RCAs or slow change implementations. Tracking this helps in continuous improvement – e.g., after improving our RCA process, did the average resolution time decrease?

    • Average Age of Open Problems: Similar to above but specifically looking at how old the unresolved problems are on average. If problems are sitting open for too long (say, many over 6 months), that’s a red flag. Many organizations set targets like “no problem should remain open without action beyond X days.” By tracking age, we can catch stagnation.

    • Percentage of Problems with Root Cause Identified: This shows how many of our logged problems we have actually diagnosed fully. A high percentage is good – it means we’re succeeding in RCA. If a lot of problems have unknown root cause for long, that might indicate skills or information gaps.

    • Percentage of Problems with Workarounds (or Known Errors): This indicates how many problems have a workaround documented vs. the total. A high percentage means we’re good at finding interim solutions to keep things running. This ties into the Known Error Database usage – ideally, for most problems that are not immediately fixable, we have a workaround to reduce incident impact.

    • Incident Related Metrics (to show problem management effectiveness):

      • Incident Repeat Rate: How often incidents recur for the same underlying cause. If problem management is effective, this should go down (because once we fix a root cause, those incidents stop).

      • Reduction in Major Incidents: We can measure percentage decrease in major incidents over time. Effective problem management, especially proactive, should result in fewer major outages. Sometimes we specifically look at incidents linked to known problems – pre and post fix counts.

      • Incidents per Problem: Roughly, how many incidents on average are triggered by one problem before it’s resolved. Lower is better, meaning we’re addressing problems before they pile up incidents.

    • Problem Resolution Productivity: e.g., number of problems resolved per month. Along with number of new problems, this gives a sense if we’re keeping pace. Also potentially “problems resolved as a percentage of problems identified” in a period.

    • SLA compliance for Problem Management: If the organization sets targets like “Root cause should be identified within 10 business days for high-priority problems,” then compliance to that is a metric. It’s less common to have strict SLAs here than in incidents, but some places do.

    • Known Error to Problem Ratio: This one is interesting – if we have a high number of known errors relative to total open problems, it means we have documented a lot of workarounds (which is good for continuity). ManageEngine suggests that if the ratio between problems logged and known errors is low, that’s not great – a good sign is when a healthy portion of problems have known error records.

    • Customer/Stakeholder Satisfaction: If we survey or get feedback from stakeholders (business or IT teams) on problem management, that’s a qualitative metric. For instance, do application owners feel that underlying issues are being addressed? It’s not a typical KPI, but can be considered.

    • Impact Reduction Metrics: For specific problems resolved, we might track the impact reduction: e.g., “Problem X resolved – it eliminated 20 incidents per month, saving Y hours of downtime.” These are case-by-case but great for demonstrating value of problem management.

    To illustrate why these are important: Let’s take Average Resolution Time of Problems. If this metric was, say, 60 days last quarter and now it’s 40 days, that’s a positive trend – we’re resolving issues faster, likely preventing incidents sooner. Or Total Number of Known Errors: if that’s increasing, it might mean we’re doing a good job capturing and documenting problems (though we also want to ultimately reduce known errors by permanently fixing them). We also look at major incident reduction; perhaps problem management efforts have led to a 30% drop in repeat major incidents quarter-over-quarter – a clear win to show to management.
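
    To show how a couple of these metrics fall out of the raw records, here is a brief sketch over invented problem data.

```python
from datetime import date

# Hypothetical problem records: (opened, resolved or None, linked incident count)
problems = [
    (date(2025, 6, 1), date(2025, 7, 10), 6),
    (date(2025, 6, 15), date(2025, 8, 1), 3),
    (date(2025, 7, 5), None, 4),              # still open
    (date(2025, 8, 20), date(2025, 9, 5), 1),
]

resolved = [(opened, closed, n) for opened, closed, n in problems if closed]
avg_resolution_days = sum((closed - opened).days for opened, closed, _ in resolved) / len(resolved)
open_backlog = sum(1 for _, closed, _ in problems if closed is None)
incidents_per_problem = sum(n for *_, n in problems) / len(problems)

print(f"Average time to resolve: {avg_resolution_days:.0f} days")
print(f"Open problem backlog:    {open_backlog}")
print(f"Incidents per problem:   {incidents_per_problem:.1f}")
```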

    Ultimately, these metrics help ensure problem management is delivering on its purpose: reducing the number and impact of incidents over time. They highlight areas to improve (for example, if problems are taking too long to resolve, maybe we allocate more resources or streamline our change management). They also show the value of problem management – e.g., fewer incidents, improved uptime, etc., which we can correlate with cost savings or user satisfaction improvements.”

 

  1. How do you prioritize problems for resolution in a busy IT environment?
    Answer: “Prioritizing problems is crucial when there are many competing issues. I prioritize by assessing impact and urgency, similar to incident prioritization but with a forward-looking twist. Here’s my approach:

    • Business Impact: I ask, if this problem remains unresolved, what is the potential impact on the business? Problems that cause frequent or severe incidents affecting critical services get top priority. For example, a problem that could bring down our customer website is higher priority than one causing a minor glitch in an internal report. Impact considers factors like how many users or customers are affected by the related incidents, financial/revenue impact, regulatory or safety implications, etc. Essentially, problems tied to high-impact incidents (or future risks) bubble to the top.

    • Frequency/Trend: How often are incidents occurring due to this problem? A problem causing daily incidents (even minor ones) might be more urgent than one that caused one big incident last year but hasn’t appeared since. Recurring issues accumulate pain and support cost. So I prioritize problems contributing to high incident counts or MTTR collectively. We might use incident trend data here – e.g., “Problem A caused 5 outages this month, Problem B caused 1 minor incident.” Problem A gets higher priority.

    • Urgency/Risk: This is about how pressing the problem is to address right now. For instance, if we know Problem X could cause an outage at any time (a ticking time bomb scenario, maybe a capacity issue nearing its threshold), that’s very urgent. Compare that with a problem that will eventually need fixing but has safeguards or a long lead time (like a bug in a rarely used application that’s being decommissioned). If a workaround is in place and working well, urgency might be lower compared to a problem with no workaround and constant pain. In ITIL terms, impact + urgency drive priority.

    • Alignment with Business Cycles: If a problem relates to a system that’s critical for an upcoming business event (say, an e-commerce system before Black Friday), I’d give that priority due to timing. Similarly, if a known problem could jeopardize an upcoming audit or product launch, it’s prioritized.

    • Resource Availability & Quick Wins: Sometimes, if multiple problems have similar priority, I might also consider which can be resolved more quickly or with available resources. Quick wins (fast to fix problems) might be tackled sooner to reduce noise, as long as they’re not displacing a more urgent big issue. But generally, I’m careful not to let ease of fix override business impact – it’s just a secondary factor.

    • Regulatory/Compliance: Problems that, if not resolved, could lead to compliance breaches or security incidents are high priority regardless of immediate incident impact. For example, a problem that’s causing backups to fail (risking data loss) might not have caused a visible incident yet but has huge compliance risk – I’d prioritize that.

    We often formalize this by assigning a Priority level (P1, P2, etc.) to problems, using a matrix of impact vs urgency. For example:

    • P1 (Critical): High impact on business, high urgency – e.g., causing major incidents or likely to soon.

    • P2 (High): High impact but perhaps lower urgency (workaround exists), or moderate impact but urgent.

    • P3 (Medium): Moderate impact, moderate urgency.

    • P4 (Low): Minor impact and not urgent (perhaps cosmetic issues or very isolated cases).

    In practice, say we have these problems:

    1. Database memory leak causing weekly crashes (impact: high, urgency: high since crashes continue).

    2. Software bug that caused one data corruption last month but we have a solid workaround (impact high, but urgency lower with workaround).

    3. Annoying UI glitch affecting a few users (impact low).

    4. Potential security vulnerability identified in a component (impact potentially high security-wise, urgency high if actively exploitable).

    I’d prioritize #1 and #4 at top (one for stability, one for security), then #2 next (still important, but contained by workaround), and #3 last.
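
    That impact/urgency reasoning maps naturally onto a small priority matrix. Here is an illustrative sketch; the scores, thresholds, and labels are examples rather than a standard.

```python
# Simple impact x urgency priority matrix, each scored 1 (low) to 3 (high).
PRIORITY_MATRIX = {
    (3, 3): "P1", (3, 2): "P2", (3, 1): "P2",
    (2, 3): "P2", (2, 2): "P3", (2, 1): "P3",
    (1, 3): "P3", (1, 2): "P4", (1, 1): "P4",
}

def prioritize(impact: int, urgency: int) -> str:
    """Map an impact/urgency pair onto a priority level."""
    return PRIORITY_MATRIX[(impact, urgency)]

# The four example problems from above, with illustrative scores:
candidates = {
    "DB memory leak causing weekly crashes":  (3, 3),
    "Data-corruption bug (solid workaround)": (3, 1),
    "UI glitch affecting a few users":        (1, 1),
    "Potential security vulnerability":       (3, 3),
}
for name, (impact, urgency) in candidates.items():
    print(f"{prioritize(impact, urgency)}  {name}")
# -> P1, P2, P4, P1 - matching the prioritization described above.
```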

    Also, ITIL suggests aligning prioritization with business goals. So I’ll also consult with business stakeholders if needed – to them, which problems are most painful? That feedback can adjust priorities.

    Once prioritized, I focus resources on the highest priority problems first. We communicate this in our problem review meetings so everyone knows why we’re working on Problem X before Y.

    In summary, I prioritize problems by evaluating their potential or actual impact on the business, how urgent it is to prevent future incidents, and considering any mitigating factors like workarounds or upcoming needs. This ensures we tackle the issues that pose the greatest risk or cost to the organization first.”

 

  1. What is the relationship between problem management and change management?
    Answer: “Problem management and change management are closely linked in the ITIL framework, because implementing the solution to a problem often requires going through change management. Here’s the relationship:

    • Implementing Problem Resolutions via Change: When problem management finds a root cause and identifies a permanent fix, that fix frequently involves making a change to the IT environment. It could be a code patch, a configuration change, infrastructure replacement, etc. Such fixes must be done carefully to avoid causing new incidents. That’s where Change Management (or Change Enablement in ITIL4) comes in – it provides a controlled process to plan, approve, and deploy changes. Essentially, problem management hands off a “request for change” (RFC) to change management to execute the solution. For example, if the problem solution is “apply security patch to database,” a change request is raised, approved by CAB, and scheduled for deployment.

    • Analyzing Failed Changes: Conversely, if a change (perhaps poorly implemented) causes an incident, that’s often treated as a problem to analyze. ITIL explicitly notes that a change causing disruption is analyzed in problem management. So if a change leads to an outage, problem management investigates why – was it a planning flaw, a testing gap, etc. Then problem management might suggest process improvements for change management to prevent similar failures (like better testing or backout procedures).

    • Coordinating Timing: Problem fixes may require downtime or risky modifications. Change management helps schedule these at the right time to minimize business impact. As a Problem Manager, I coordinate with the Change Manager to ensure the fix is deployed in a maintenance window, approvals are in place, etc. For instance, a root cause fix might be urgent, but we still go through emergency change procedures if it’s outside normal schedule, to maintain control.

    • Advisory and CAB input: Often I, or someone in problem management, might present at CAB (Change Advisory Board) meetings to explain the context of a change that’s to fix a known problem. This gives CAB members confidence that the change is necessary and carefully derived. Conversely, CAB might ask if a change has been reviewed under problem management (for risky changes, did we analyze thoroughly?).

    • Known Errors and Change Planning: The Known Error records from problem management can inform change management. For example, if we have a known error workaround in place, we might plan a change to remove the workaround once the final fix is ready. Or change management keeps track that “Change X is to resolve Known Error Y” which helps in tracking value of changes (like seeing reduction in incidents after the change).

    • Continuous Improvement: Results from problem management (like lessons learned) can feed into improving the change process. Maybe a problem analysis finds that many incidents come from unauthorized changes – that insight goes to Change Management to enforce policy better. On the flip side, change records often feed problem management data: if a problem fix requires multiple changes (maybe an iterative fix), problem management monitors those change outcomes.

    In practice, think of it like this: problem management finds the cure; change management administers it safely. One scenario: we find a root cause bug and develop a patch – before deploying, we raise a change, test in staging, get approvals, schedule downtime, etc. After deployment, change management helps ensure we verify success and close the change. Problem management then closes the problem once the change is confirmed successful.
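
    As a light illustration of that hand-off (not any particular tool’s API), here is a sketch of a problem record spawning a change request and being closed only after the change is implemented and verified.

```python
from dataclasses import dataclass, field

@dataclass
class ChangeRequest:
    change_id: str
    description: str
    status: str = "submitted"     # submitted -> approved -> implemented

@dataclass
class ProblemRecord:
    problem_id: str
    root_cause: str
    status: str = "known_error"
    changes: list = field(default_factory=list)

    def raise_fix_change(self, change_id: str, description: str) -> ChangeRequest:
        """Problem management proposes the fix; change management owns its deployment."""
        rfc = ChangeRequest(change_id, description)
        self.changes.append(rfc)
        return rfc

    def close_if_fixed(self) -> None:
        """Close the problem only once every linked change is implemented and verified."""
        if self.changes and all(c.status == "implemented" for c in self.changes):
            self.status = "closed"

problem = ProblemRecord("PRB100", root_cause="unpatched database bug causing crashes")
rfc = problem.raise_fix_change("CHG555", "Apply vendor patch to production database")
rfc.status = "approved"       # CAB approval via change management
rfc.status = "implemented"    # deployed in the agreed maintenance window and verified
problem.close_if_fixed()
print(problem.status)         # -> closed
```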

    Another scenario: an unplanned change (someone did an improper config change) caused a major incident. Problem management will investigate why that happened – maybe inadequate access controls. The solution might be a change management action: implement stricter change control (like require approvals for that device configuration). So problem results in a procedural change.

    To summarize the relationship: Problem management identifies what needs to change to remove root causes; Change management ensures those changes are carried out in a controlled, low-risk manner. They work hand-in-hand – effective problem resolution almost always goes through change management to put fixes into production safely. Conversely, change management benefits from problem management by understanding the reasons behind changes (resolving problems) and by getting analysis when changes themselves fail or cause issues.”