Colton Wirth | Technical Program Manager

Project 01

AI Operations Platform

Oracle SaaS • Ongoing • 50+ Engineers

Principal TPM

Challenge

Enterprise SaaS operations relied on reactive, manual incident response: engineers were paged after customers were already impacted. Root cause analysis was slow across tightly coupled systems, remediation was inconsistent, and operations headcount was scaling linearly with platform growth. The organization needed a fundamental shift from reactive firefighting to autonomous, predictive operations.

What I Delivered

🧠

Multi-Agent Architecture

Designed an autonomous system combining deep learning, graph-based root cause analysis, and custom LLMs for multi-step remediation.

🔗

Custom LLM Pipeline

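Engineered retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and continued pre-training (CPT) capabilities, integrated with Model Context Protocol (MCP) servers, for context-aware remediation decisions.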

⚡

Auto-Remediation Engine

Built autonomous workflows that detect, triage, and resolve production issues without human intervention.

📊

Executive Dashboard & Rollout

Cross-team rollout plan with real-time metrics for leadership visibility into platform performance and adoption.

Agent Architecture

1

Detect Agent

Observe & Diagnose

Continuously monitors system health by aggregating signals from logs, metrics, alerts, and topology graphs. Uses deep learning and graph-based root cause analysis to pinpoint anomaly origins across coupled components.

Anomaly Detection • Graph-Based RCA • Telemetry Aggregation

2

Reason Agent

Analyze & Plan

Receives the diagnosed state and determines the optimal remediation. Powered by custom LLMs adapted through SFT and CPT and grounded at inference time with RAG, this agent evaluates risk, blast radius, and historical outcomes to select the safest action plan.

Custom LLM (RAG/SFT/CPT) • MCP Integration • Risk Assessment

3

Action Agent

Execute & Verify

Executes remediation through automated runbooks—restarting services, rerouting traffic, scaling resources. Validates recovery and closes the loop back to the Detect Agent for continuous monitoring.

Automated Runbooks • Self-Healing • Closed-Loop Feedback
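The Detect, Reason, and Action stages above can be illustrated with a minimal sketch. Everything in it is hypothetical: the service names, the dependency graph, the candidate plans, and their risk scores are assumptions for illustration, not the platform's actual code. Simple rules stand in for the deep learning detector and the LLM planner.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable

# Hypothetical topology: each service maps to the services it depends on.
DEPENDS_ON: dict[str, list[str]] = {
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
    "db": [],
}


def find_root_causes(anomalous: set[str]) -> set[str]:
    """Detect: an anomalous service whose dependencies are all healthy is
    treated as a likely origin; the others are assumed to be symptoms."""
    return {
        svc
        for svc in anomalous
        if not any(dep in anomalous for dep in DEPENDS_ON.get(svc, []))
    }


@dataclass(frozen=True)
class Plan:
    action: str
    risk: float        # assumed score in [0, 1]; lower is safer
    blast_radius: int  # number of services the action touches


def choose_plan(plans: list[Plan], max_risk: float = 0.5) -> Plan | None:
    """Reason: drop plans above the risk ceiling, then prefer the smallest
    blast radius and, among ties, the lowest risk."""
    safe = [p for p in plans if p.risk <= max_risk]
    return min(safe, key=lambda p: (p.blast_radius, p.risk), default=None)


def remediate(
    service: str,
    plan: Plan | None,
    execute: Callable[[str, str], None],
    is_healthy: Callable[[str], bool],
) -> str:
    """Act: run the runbook step, then verify recovery before closing the loop."""
    if plan is None:
        return "escalate"  # no acceptable plan, hand off to a human
    execute(service, plan.action)
    return "resolved" if is_healthy(service) else "escalate"
```

With `checkout`, `payments`, and `db` all flagged as anomalous, `find_root_causes` returns only `db`, because it is the one flagged service with no flagged dependency. Escalating whenever no plan passes the risk ceiling, or recovery cannot be verified, keeps a human in the loop for the cases automation should not handle alone.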

Impact

50%

Production Issues Auto-Remediated

Half of all detected production issues are now resolved autonomously, without human intervention.

50%

Operations Headcount Reduction

The autonomous platform halved the operations staffing required, redirecting capacity to product innovation.

Project 02

ML Infrastructure Platform

Oracle AI/ML • 12+ months • Cross-Functional

Senior TPM

Challenge

Data scientists and ML engineers were bottlenecked by fragmented infrastructure—no shared datalake, no feature store, and ad-hoc model delivery pipelines that took 3 months from prototype to production. Every new deep learning model required custom infrastructure work, blocking the AI roadmap and creating unsustainable overhead as the organization scaled its ML ambitions.

What I Delivered

🏗️

Platform Architecture & Roadmap

End-to-end ML platform design spanning datalake, feature store, training pipelines, and model serving infrastructure.

📦

Feature Store Design

Centralized feature repository enabling reuse across models, reducing duplicate data engineering and improving consistency.

🔄

CI/CD for Model Deployment

Automated pipelines for model training, validation, and production deployment—replacing manual handoffs with repeatable workflows.

🤝

Cross-Team Alignment

Stakeholder alignment across data science, ML engineering, and infrastructure teams to drive unified platform adoption.
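The validation stage of the automated model deployment pipeline described above can be pictured as a promotion gate. The sketch below is an assumption about how such a gate might look: the metric names (`accuracy`, `latency_ms`) and the thresholds are hypothetical and not taken from the actual platform.

```python
def should_promote(
    candidate: dict,
    production: dict,
    min_accuracy: float = 0.90,     # hypothetical quality floor
    max_latency_ms: float = 200.0,  # hypothetical serving budget
) -> tuple:
    """Return (ok, reasons). The candidate is promoted only when it clears
    both absolute thresholds and does not regress against production."""
    reasons = []
    if candidate["accuracy"] < min_accuracy:
        reasons.append("accuracy below floor")
    if candidate["latency_ms"] > max_latency_ms:
        reasons.append("latency above budget")
    if candidate["accuracy"] < production["accuracy"]:
        reasons.append("accuracy regression vs production")
    return (not reasons, reasons)
```

Returning the list of reasons alongside the decision lets the pipeline report why a model was blocked, which is what makes the workflow repeatable rather than a manual handoff.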

Impact

93%

Faster Model Delivery

Model delivery cycle compressed from 3 months to 1 week — unblocking the AI roadmap and accelerating every downstream initiative.

3mo→1wk

Prototype to Production

Data scientists can now take a model from experimentation to production deployment in a single sprint.

Project 03

Data Center Automation Toolset

Oracle Cloud Infrastructure • 12-week launch • Engineering Team

TPM

Challenge

Global cloud data centers lacked centralized visibility into physical infrastructure health. Temperature spikes and power anomalies went undetected until they caused hardware failures and customer-facing outages—with individual incidents risking millions in revenue loss and SLA penalties. GPU rack density was intensifying thermal and power challenges with no automated tooling to respond.

What I Delivered

🌡️

Monitoring Service

Launched temperature and power monitoring across global data centers in 12 weeks from concept to production.

🤖

ML Anomaly Detection

Machine learning-based detection with heatmap visualization and proximity alarming for early intervention.

⚡

Virtual Power Down & GPU Capping

Automated load reduction during thermal events and power capping for GPU racks to prevent cascading failures.

📋

Workload Priority Visibility

Customer workload priority dashboard enabling intelligent remediation decisions that protect high-value workloads first.
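Two of the ideas above, proximity alarming and priority-aware load reduction, can be sketched in a few lines. This is an illustration under stated assumptions: the temperature limit, the warning margin, the workload names, and the priority scale are all hypothetical.

```python
from dataclasses import dataclass


def proximity_alarm(reading_c: float, limit_c: float, margin_c: float = 5.0) -> str:
    """Raise a warning while the reading is still approaching the limit,
    so operators can intervene before the limit is breached."""
    if reading_c >= limit_c:
        return "critical"
    if reading_c >= limit_c - margin_c:
        return "warning"
    return "ok"


@dataclass(frozen=True)
class Workload:
    name: str
    priority: int    # assumed scale: higher means more important
    power_kw: float  # current power draw


def plan_power_down(workloads: list, shed_kw: float) -> list:
    """Pick workloads to power down, lowest priority first, until the
    requested amount of power has been shed."""
    selected = []
    remaining = shed_kw
    for w in sorted(workloads, key=lambda w: w.priority):
        if remaining <= 0:
            break
        selected.append(w.name)
        remaining -= w.power_kw
    return selected
```

For example, with a 32 °C limit and the default 5 °C margin, a 28 °C reading returns `"warning"`. Sorting by priority before shedding is what protects high-value workloads: they are touched only if the lower tiers cannot cover the required reduction.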

Impact

75%

Reduction in DC Downtime

Proactive detection and automated remediation cut unplanned infrastructure downtime by three-quarters.

$M+

Saved Per Incident Prevented

Each prevented incident avoids millions in potential revenue loss, SLA penalties, and recovery costs.

12 wks

Concept to Production

Took the monitoring platform from initial concept to production launch in under three months.

Project 04

Chaos Engineering Program

Oracle SaaS & OCI • 12+ months • Cross-Functional

Senior TPM

Challenge

SaaS applications lacked systematic resilience testing. High-severity incidents were increasingly caused by failure modes that only surfaced in production—dependencies failing silently, cascading timeouts, and resource exhaustion under load. Without a structured way to proactively find these weaknesses, the organization was always one step behind the next outage.

What I Delivered

💥

Chaos Testing Framework

End-to-end chaos engineering program with structured experiment playbooks and safety guardrails.

🏗️

Dedicated Test Regions (2)

Secured multi-million-dollar investment for two dedicated chaos testing regions, eliminating risk to production.

📋

Failure Injection Playbooks

Standardized playbooks for dependency failures, cascading timeouts, resource exhaustion, and network partitions.

🛡️

Investment Proposal & Strategy

Built the business case and multi-region infrastructure proposal that secured executive-level funding approval.
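A failure injection playbook with a safety guardrail, as described above, might look roughly like the following sketch. It is not the program's actual tooling: the function names, the abort threshold, and the minimum sample size are hypothetical. The experiment wraps a dependency call so that it fails at a chosen rate, then stops early if the user-visible error rate exceeds a budget.

```python
import random


class DependencyFailure(Exception):
    """Raised in place of a real dependency error during an experiment."""


def inject_faults(call, failure_rate: float, rng: random.Random):
    """Wrap a dependency call so that it fails with the given probability."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyFailure("injected fault")
        return call(*args, **kwargs)
    return wrapped


def with_fallback(primary, fallback):
    """System under test: serve a fallback result when the dependency fails."""
    try:
        return primary()
    except DependencyFailure:
        return fallback()


def run_experiment(handler, requests: int, abort_error_rate: float,
                   min_samples: int = 20) -> dict:
    """Drive traffic through the handler and abort as soon as the
    user-visible error rate exceeds the guardrail."""
    errors = 0
    for i in range(1, requests + 1):
        try:
            handler()
        except Exception:
            errors += 1
        if i >= min_samples and errors / i > abort_error_rate:
            return {"status": "aborted", "requests": i, "errors": errors}
    return {"status": "completed", "requests": requests, "errors": errors}
```

A handler that uses `with_fallback` should complete the experiment with no user-visible errors even at a 100% injected failure rate, while a handler that calls the faulty dependency directly trips the guardrail after `min_samples` requests. That contrast is the weakness a playbook is meant to surface before production does.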

Impact

$M+

Multi-Region Investment Secured

Championed and received multi-million-dollar funding for dedicated chaos infrastructure spanning two regions.

99.99%

System Availability Achieved

Combined chaos testing, automated recovery, and rules-based automation drove best-in-class availability.

25%

Fewer High-Severity Incidents

Proactive resilience testing and automated failure handling cut the frequency of high-severity incidents by a quarter.
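To put the 99.99% availability figure in perspective, the short calculation below converts an availability percentage into a downtime budget. It assumes availability is measured over calendar time; the measurement window is an assumption, and the helper name is hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in the period at the given availability."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_days * 24 * 60


yearly = downtime_budget_minutes(99.99, 365)   # about 52.56 minutes
monthly = downtime_budget_minutes(99.99, 30)   # about 4.32 minutes
```

At four nines, the entire yearly budget is under an hour, which is why automated recovery matters: a single incident handled manually can consume most of it.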

Shipping AI programs
from zero to production

Programs That Moved the Needle

Not Your Typical TPM

I Build AI Systems

I Ship Fast at Scale

I Go Deep Technically

I Make Reliability a Feature

Let's build what's next
