What Is Problem Management?
Problem Management is the ITIL 4 practice of reducing the likelihood and impact of incidents by identifying actual and potential causes and managing workarounds and known errors. It bridges the gap between reactive incident handling and proactive service improvement.
Many organizations treat symptoms rather than causes. They resolve the same incidents repeatedly without ever addressing why they keep happening. Problem management breaks this cycle by systematically investigating patterns and implementing permanent fixes.
Griffin IT Group applies both reactive problem management (investigating recurring incidents) and proactive problem management (identifying risks before they cause outages) to continuously improve your IT environment.
Key Capabilities
Root Cause Analysis (RCA)
Formal RCA methodologies including 5 Whys, Ishikawa diagrams, and fault tree analysis to identify the true source of recurring issues.
Trend Analysis
Statistical analysis of incident data to identify patterns, correlations, and emerging problems before they escalate.
Known Error Database (KEDB)
Documented known errors with proven workarounds so incidents can be resolved faster while permanent fixes are developed.
Proactive Problem Management
Threat modeling and risk assessments to identify potential problems before they manifest as incidents.
Problem Prioritization
Problems are ranked by business impact, frequency, and cost to ensure resources focus on the highest-value improvements.
Permanent Fix Tracking
Every identified problem is tracked through to permanent resolution with clear ownership and timelines.
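The ranking described under Problem Prioritization can be illustrated with a small scoring sketch. The weights, the 1-to-5 scales, and the record fields below are illustrative assumptions, not a description of Griffin IT Group's actual scoring model.

```python
from dataclasses import dataclass

# Hypothetical weights; in practice these would be agreed with the business.
WEIGHTS = {"business_impact": 0.5, "frequency": 0.3, "cost": 0.2}


@dataclass
class Problem:
    record_id: str
    business_impact: int  # assumed scale: 1 (low) to 5 (critical)
    frequency: int        # assumed scale: 1 (rare) to 5 (daily)
    cost: int             # assumed scale: 1 (cheap to live with) to 5 (very costly)


def priority_score(p: Problem) -> float:
    """Weighted sum of the three ranking factors: impact, frequency, cost."""
    return (WEIGHTS["business_impact"] * p.business_impact
            + WEIGHTS["frequency"] * p.frequency
            + WEIGHTS["cost"] * p.cost)


def rank(problems: list[Problem]) -> list[Problem]:
    """Return problems ordered from highest to lowest priority score."""
    return sorted(problems, key=priority_score, reverse=True)


backlog = [
    Problem("PRB-001", business_impact=2, frequency=5, cost=2),
    Problem("PRB-002", business_impact=5, frequency=3, cost=4),
    Problem("PRB-003", business_impact=3, frequency=2, cost=1),
]
for p in rank(backlog):
    print(p.record_id, round(priority_score(p), 2))
```

In this sample backlog, PRB-002 ranks first even though PRB-001 recurs more often, because business impact carries the largest assumed weight.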
How We Deliver
- Problem Identification: We analyze incident trends, monitoring data, and user reports to identify recurring patterns that indicate underlying problems.
- Root Cause Investigation: Using structured RCA methodologies, we investigate the true root cause — not just the immediate trigger.
- Workaround Development: While permanent fixes are developed, we document and publish workarounds to reduce incident impact immediately.
- Permanent Fix Implementation: We develop, test, and deploy permanent fixes through the change management process to eliminate the root cause.
- Validation & Closure: After implementation, we monitor for recurrence and formally close the problem record once effectiveness is confirmed.
Understanding Problem Management in Depth
Problem management operates on two parallel tracks: reactive and proactive. Reactive problem management responds to recurring incidents — when the same issue appears three or more times, it triggers a formal problem investigation. Proactive problem management uses trend analysis, infrastructure assessments, and vendor advisories to identify risks before they cause outages.
The Kepner-Tregoe method, 5 Whys analysis, Ishikawa (fishbone) diagrams, and fault tree analysis are the primary RCA methodologies used in mature problem management. Each has strengths suited to different problem types: 5 Whys excels at straightforward causal chains, while Ishikawa diagrams handle multi-factor problems with contributing causes across people, process, technology, and environment dimensions.
The Known Error Database (KEDB) is one of the most undervalued assets in IT service management. With a well-maintained KEDB, L1 analysts can resolve in minutes incidents that previously required L3 investigation, because the root cause is already documented along with a proven workaround. Organizations with mature KEDBs consistently achieve first-contact resolution rates above 75%.
A critical success factor is the handoff between incident and problem management. Without clear trigger criteria (e.g., "three or more incidents with the same CI within 30 days"), problem investigations never start. Without feedback loops from problem management back to incident management (via updated knowledge articles and workarounds), the same investigations happen repeatedly.
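A trigger rule like the example above ("three or more incidents with the same CI within 30 days") can be checked mechanically. The following is a minimal sketch: the function name, the `(ci, opened_at)` tuple layout, and the sample data are assumptions made for illustration, and a real ITSM platform would supply its own export format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cis_needing_investigation(incidents, threshold=3, window_days=30):
    """Return the CIs that had `threshold` or more incidents within any
    span of `window_days` days. `incidents` is an iterable of
    (ci, opened_at) tuples."""
    by_ci = defaultdict(list)
    for ci, opened_at in incidents:
        by_ci[ci].append(opened_at)

    window = timedelta(days=window_days)
    flagged = set()
    for ci, times in by_ci.items():
        times.sort()
        # Compare the first and last of every run of `threshold`
        # consecutive incidents; if they fit in the window, flag the CI.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                flagged.add(ci)
                break
    return sorted(flagged)


# Hypothetical incident export.
incidents = [
    ("mail-server-01", datetime(2024, 3, 1)),
    ("mail-server-01", datetime(2024, 3, 12)),
    ("mail-server-01", datetime(2024, 3, 25)),
    ("vpn-gateway", datetime(2024, 1, 5)),
    ("vpn-gateway", datetime(2024, 2, 20)),
    ("vpn-gateway", datetime(2024, 4, 2)),
]
print(cis_needing_investigation(incidents))  # ['mail-server-01']
```

Here the VPN gateway also has three incidents, but they are spread over roughly three months, so only the mail server meets the trigger.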
Problem management maturity is measured by metrics including problem backlog age, percentage of incidents linked to known errors, mean time between failures (MTBF) improvements, and the ratio of proactive to reactive problem records. World-class organizations target a 60/40 proactive-to-reactive ratio.
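Several of these metrics reduce to simple arithmetic over problem and incident records. The sketch below assumes a hypothetical record layout of `(record_id, source, opened_on, closed_on)` and made-up sample values; it only shows how the figures might be derived.

```python
from datetime import date

# Hypothetical problem records: (record_id, source, opened_on, closed_on or None)
problems = [
    ("PRB-101", "proactive", date(2024, 1, 10), date(2024, 2, 1)),
    ("PRB-102", "reactive", date(2024, 2, 5), None),
    ("PRB-103", "proactive", date(2024, 3, 1), None),
    ("PRB-104", "reactive", date(2024, 3, 15), date(2024, 4, 1)),
    ("PRB-105", "proactive", date(2024, 4, 2), None),
]


def proactive_share(records):
    """Percentage of problem records that were raised proactively."""
    proactive = sum(1 for r in records if r[1] == "proactive")
    return 100 * proactive / len(records)


def mean_backlog_age_days(records, today):
    """Average age in days of the problem records that are still open."""
    ages = [(today - opened).days
            for _, _, opened, closed in records if closed is None]
    return sum(ages) / len(ages) if ages else 0.0


def known_error_link_rate(linked_incidents, total_incidents):
    """Percentage of incidents that were linked to a known error."""
    return 100 * linked_incidents / total_incidents


print(proactive_share(problems))                          # 60.0
print(mean_backlog_age_days(problems, date(2024, 5, 1)))
print(known_error_link_rate(45, 120))                     # 37.5
```

With three of the five sample records raised proactively, the proactive share is 60%, which corresponds to the 60/40 ratio mentioned above.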
How Griffin IT Group Implements Problem Management
Griffin IT Group's problem management practice operates as a dedicated function within our ETOC, staffed by senior engineers who combine deep technical expertise with formal ITIL training. Unlike incident analysts who focus on rapid restoration, our problem analysts are measured on permanent fix implementation and incident reduction rates.
We integrate problem management directly with our monitoring, incident, change, and knowledge management practices. When our NOC detects a pattern — such as a server rebooting at the same time each week — it automatically creates a problem record and assigns it to an analyst. This tight integration means problems are identified and investigated before clients even notice a trend.
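A pattern such as a server rebooting at the same time each week can be spotted by bucketing events by weekday and hour. This is a simplified sketch rather than a description of the NOC's tooling: the function name, the event layout, and the three-week threshold are assumptions.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def weekly_recurrences(events, min_weeks=3):
    """Find (host, weekday, hour) slots in which an event recurred in at
    least `min_weeks` distinct ISO weeks. `events` is an iterable of
    (host, timestamp) tuples."""
    weeks_seen = defaultdict(set)
    for host, ts in events:
        iso_year, iso_week, _ = ts.isocalendar()
        slot = (host, DAYS[ts.weekday()], ts.hour)
        weeks_seen[slot].add((iso_year, iso_week))
    return sorted(slot for slot, weeks in weeks_seen.items()
                  if len(weeks) >= min_weeks)


# Hypothetical reboot events taken from monitoring data.
reboots = [
    ("app-server-02", datetime(2024, 5, 5, 2, 14)),   # Sunday
    ("app-server-02", datetime(2024, 5, 12, 2, 9)),   # Sunday
    ("app-server-02", datetime(2024, 5, 19, 2, 31)),  # Sunday
    ("db-server-01", datetime(2024, 5, 8, 15, 40)),   # Wednesday, one-off
]
print(weekly_recurrences(reboots))  # [('app-server-02', 'Sunday', 2)]
```

Bucketing by whole hours keeps the sketch short, but events that straddle an hour boundary (01:58 one week, 02:03 the next) would land in different slots, so a production rule would likely use a tolerance window instead.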
Our quarterly problem management reviews with each client present a clear picture: problems identified, root causes confirmed, permanent fixes implemented, incidents prevented, and cost savings realized. This data-driven approach demonstrates tangible ROI and builds confidence in proactive IT investment.
- Automated Problem Detection: Correlation engines analyze incident data in real-time, flagging recurring patterns and creating problem records automatically.
- Structured RCA Workshops: For complex problems, we facilitate cross-functional RCA workshops with client stakeholders, vendors, and our engineering team.
- Known Error Library: Every confirmed root cause is documented with workarounds, a permanent fix plan, and an ETA for that fix, and is accessible to all support tiers.
- Change Integration: Permanent fixes are implemented through formal change enablement, ensuring solutions don't introduce new problems.
- Measurable Outcomes: We track and report incident reduction percentages directly attributable to problem management activities.
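An incident reduction percentage of the kind mentioned under Measurable Outcomes can be computed by comparing incident counts before and after a permanent fix. The counts and record IDs below are invented for illustration, and the before/after quarter comparison is an assumed method, not necessarily the one used in client reporting.

```python
def incident_reduction_pct(baseline_count, current_count):
    """Percentage drop in incidents relative to the baseline period."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100 * (baseline_count - current_count) / baseline_count


# Hypothetical incident counts linked to problems that received a
# permanent fix: (quarter before fix, quarter after fix).
linked_incidents = {
    "PRB-201": (40, 22),
    "PRB-202": (25, 17),
}
before = sum(b for b, _ in linked_incidents.values())
after = sum(a for _, a in linked_incidents.values())
print(incident_reduction_pct(before, after))  # 40.0
```

Restricting the counts to incidents linked to specific problem records is what makes the reduction attributable to problem management rather than to unrelated changes in ticket volume.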
Value-Added Benefits of Structured Problem Management
- Reduced Incident Volume: Eliminating root causes typically reduces recurring incident volumes by 30-50% within the first two quarters of engagement.
- Lower Support Costs: Fewer incidents mean fewer tickets, fewer escalations, and more L1 capacity, which directly reduces your IT support expenditure.
- Improved Service Stability: Permanent fixes increase mean time between failures (MTBF), delivering measurably more reliable IT services.
- Faster Incident Resolution: A well-maintained KEDB enables rapid workaround application, cutting resolution times even for issues awaiting permanent fixes.
- Risk Reduction: Proactive problem management identifies vulnerabilities before they cause outages — shifting from reactive firefighting to preventive operations.
- Strategic IT Investment: Problem data reveals where infrastructure investment delivers the highest ROI, enabling evidence-based technology planning.
Stop Fixing the Same Issues Over and Over
Griffin IT Group's problem management eliminates the root causes draining your IT budget.
Frequently Asked Questions
- How does problem management differ from incident management?
- Incident management restores service quickly. Problem management investigates why the incident happened and implements permanent fixes to prevent recurrence. They work together — incidents trigger problem investigations.
- What root cause analysis methods do you use?
- We use 5 Whys analysis, Ishikawa (fishbone) diagrams, fault tree analysis, and timeline analysis depending on the complexity. For major problems, we conduct formal post-mortem reviews with all stakeholders.
- How long does a problem investigation take?
- Simple problems may be resolved in days. Complex, multi-system problems can take weeks of investigation. We provide regular status updates and interim workarounds to minimize impact during the investigation.
- Do you track the ROI of problem management?
- Yes. We measure avoided incidents, reduced ticket volumes, decreased downtime, and lower support costs. Most clients see a measurable reduction in recurring incidents within the first quarter.
- Can you work with our existing incident data?
- Absolutely. We can analyze historical incident data from any ITSM platform to identify problem patterns and prioritize investigations based on business impact.