Technical Program Manager

Shipping AI programs
from zero to production

Principal TPM who builds AI systems — not just manages them. 8+ years leading cross-functional teams of 50+ engineers to deliver ML platforms, agentic architectures, and infrastructure programs at enterprise scale. Looking for high-velocity teams where AI ships fast and reliability isn't optional.

8+
Years of Experience
50+
Engineers Led
99.99%
System Availability

Programs That Moved the Needle

Strategic initiatives spanning AI, ML infrastructure, and reliability—each with measurable business impact.

Project 01

AI Operations Platform

Oracle SaaS • Ongoing • 50+ Engineers
Principal TPM

Challenge

Enterprise SaaS operations relied on reactive, manual incident response—engineers paged after customers were already impacted. Root cause analysis was slow across tightly coupled systems, remediation was inconsistent, and operations headcount was scaling linearly with platform growth. The organization needed a fundamental shift from reactive firefighting to autonomous, predictive operations.

What I Delivered

🧠
Multi-Agent Architecture
Designed an autonomous system combining deep learning, graph-based root cause analysis, and custom LLMs for multi-step remediation.
🔗
Custom LLM Pipeline
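Engineered retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and continued pre-training (CPT) capabilities, integrated with Model Context Protocol (MCP) servers for context-aware remediation decisions.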
Auto-Remediation Engine
Built autonomous workflows that detect, triage, and resolve production issues without human intervention.
📊
Executive Dashboard & Rollout
Cross-team rollout plan with real-time metrics for leadership visibility into platform performance and adoption.

Agent Architecture

1
Detect Agent
Observe & Diagnose
Continuously monitors system health by aggregating signals from logs, metrics, alerts, and topology graphs. Uses deep learning and graph-based root cause analysis to pinpoint anomaly origins across coupled components.
Anomaly Detection • Graph-Based RCA • Telemetry Aggregation
2
Reason Agent
Analyze & Plan
Receives the diagnosed state and determines the optimal remediation. Powered by custom LLMs adapted through SFT and CPT and grounded at inference time with RAG, this agent evaluates risk, blast radius, and historical outcomes to select the safest action plan.
Custom LLM (RAG/SFT/CPT) • MCP Integration • Risk Assessment
3
Action Agent
Execute & Verify
Executes remediation through automated runbooks—restarting services, rerouting traffic, scaling resources. Validates recovery and closes the loop back to the Detect Agent for continuous monitoring.
Automated Runbooks • Self-Healing • Closed-Loop Feedback
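
The three-agent loop above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the production system: every name here (`Diagnosis`, `Plan`, `detect`, `reason`, `act`, `run_once`) is hypothetical, a simple threshold stands in for the deep learning and graph-based RCA, and a lookup table stands in for the LLM-backed planner.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Diagnosis:
    component: str
    symptom: str


@dataclass
class Plan:
    action: str
    risk: float  # hypothetical scale: 0.0 (safe) to 1.0 (dangerous)


def detect(metrics: dict[str, float], threshold: float = 0.9) -> Diagnosis | None:
    """Detect step: flag the most loaded component if it exceeds the threshold."""
    if not metrics:
        return None
    component, load = max(metrics.items(), key=lambda kv: kv[1])
    return Diagnosis(component, "saturation") if load > threshold else None


def reason(diagnosis: Diagnosis) -> Plan:
    """Reason step: a static playbook stands in for the LLM planner."""
    playbook = {"saturation": Plan("scale_out", risk=0.2)}
    return playbook.get(diagnosis.symptom, Plan("page_human", risk=0.0))


def act(diagnosis: Diagnosis, plan: Plan, metrics: dict[str, float],
        max_risk: float = 0.5) -> str:
    """Action step: run low-risk plans, escalate everything else."""
    if plan.action == "page_human" or plan.risk > max_risk:
        return "escalated"
    if plan.action == "scale_out":
        metrics[diagnosis.component] /= 2  # simulate added capacity
    return "remediated"


def run_once(metrics: dict[str, float]) -> str:
    """One pass of detect -> reason -> act, then re-detect to verify recovery."""
    diagnosis = detect(metrics)
    if diagnosis is None:
        return "healthy"
    outcome = act(diagnosis, reason(diagnosis), metrics)
    if outcome == "remediated" and detect(metrics) is not None:
        return "escalated"  # the fix did not hold, so hand off to a human
    return outcome
```

Calling `run_once({"api": 0.95, "db": 0.40})` walks the full loop once. The risk ceiling (`max_risk`) is the piece worth noticing: anything the planner rates above it is escalated rather than executed.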

Impact

50%
Production Issues Auto-Remediated
Half of all detected production issues are now resolved autonomously, without human intervention.
50%
Operations Headcount Reduction
Autonomous platform cut the operations staffing needed in half, redirecting capacity to product innovation.

Project 02

ML Infrastructure Platform

Oracle AI/ML • 12+ months • Cross-Functional
Senior TPM

Challenge

Data scientists and ML engineers were bottlenecked by fragmented infrastructure—no shared datalake, no feature store, and ad-hoc model delivery pipelines that took 3 months from prototype to production. Every new deep learning model required custom infrastructure work, blocking the AI roadmap and creating unsustainable overhead as the organization scaled its ML ambitions.

What I Delivered

🏗️
Platform Architecture & Roadmap
End-to-end ML platform design spanning datalake, feature store, training pipelines, and model serving infrastructure.
📦
Feature Store Design
Centralized feature repository enabling reuse across models, reducing duplicate data engineering and improving consistency.
🔄
CI/CD for Model Deployment
Automated pipelines for model training, validation, and production deployment—replacing manual handoffs with repeatable workflows.
🤝
Cross-Team Alignment
Stakeholder alignment across data science, ML engineering, and infrastructure teams to drive unified platform adoption.

Impact

93%
Faster Model Delivery
Model delivery cycle compressed from 3 months to 1 week — unblocking the AI roadmap and accelerating every downstream initiative.
3mo→1wk
Prototype to Production
Data scientists can now take a model from experimentation to production deployment in a single sprint.

Project 03

Data Center Automation Toolset

Oracle Cloud Infrastructure • 12-week launch • Engineering Team
TPM

Challenge

Global cloud data centers lacked centralized visibility into physical infrastructure health. Temperature spikes and power anomalies went undetected until they caused hardware failures and customer-facing outages—with individual incidents risking millions in revenue loss and SLA penalties. GPU rack density was intensifying thermal and power challenges with no automated tooling to respond.

What I Delivered

🌡️
Monitoring Service
Launched temperature and power monitoring across global data centers in 12 weeks from concept to production.
🤖
ML Anomaly Detection
Machine learning-based detection with heatmap visualization and proximity alarming for early intervention.
Virtual Power Down & GPU Capping
Automated load reduction during thermal events and power capping for GPU racks to prevent cascading failures.
📋
Workload Priority Visibility
Customer workload priority dashboard enabling intelligent remediation decisions that protect high-value workloads first.
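
One way to picture how workload priority can steer power capping during a thermal event is the sketch below. It is an assumption-laden illustration rather than the shipped tooling: `Rack`, `plan_power_caps`, the priority scale, and the per-rack power floor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    priority: int    # hypothetical scale: higher means a more important workload
    draw_kw: float   # current power draw
    floor_kw: float  # lowest cap the rack can tolerate


def plan_power_caps(racks: list[Rack], shed_kw: float) -> dict[str, float]:
    """Return {rack name: new cap in kW}, capping lowest-priority racks first."""
    caps: dict[str, float] = {}
    remaining = shed_kw
    for rack in sorted(racks, key=lambda r: r.priority):
        if remaining <= 0:
            break
        headroom = rack.draw_kw - rack.floor_kw
        if headroom <= 0:
            continue  # already at its floor, nothing to shed here
        cut = min(headroom, remaining)
        caps[rack.name] = rack.draw_kw - cut
        remaining -= cut
    return caps
```

Racks absent from the returned mapping are left uncapped, which is how the highest-priority workloads stay untouched whenever lower-priority racks can absorb the required reduction.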

Impact

75%
Reduction in DC Downtime
Proactive detection and automated remediation cut unplanned infrastructure downtime by three-quarters.
$M+
Saved Per Incident Prevented
Each prevented incident avoids millions in potential revenue loss, SLA penalties, and recovery costs.
12 wks
Concept to Production
Took the monitoring platform from initial concept to production launch in under three months.

Project 04

Chaos Engineering Program

Oracle SaaS & OCI • 12+ months • Cross-Functional
Senior TPM

Challenge

SaaS applications lacked systematic resilience testing. High-severity incidents were increasingly caused by failure modes that only surfaced in production—dependencies failing silently, cascading timeouts, and resource exhaustion under load. Without a structured way to proactively find these weaknesses, the organization was always one step behind the next outage.

What I Delivered

💥
Chaos Testing Framework
End-to-end chaos engineering program with structured experiment playbooks and safety guardrails.
🏗️
Dedicated Test Regions (2)
Secured multi-million-dollar investment for two dedicated chaos testing regions, eliminating risk to production.
📋
Failure Injection Playbooks
Standardized playbooks for dependency failures, cascading timeouts, resource exhaustion, and network partitions.
🛡️
Investment Proposal & Strategy
Built the business case and multi-region infrastructure proposal that secured executive-level funding approval.
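
To make the idea of a failure-injection playbook with a safety guardrail concrete, here is a minimal sketch. It does not reproduce the actual framework: `FaultInjector`, `InjectedFault`, the `"chaos-test"` environment name, and the `fetch_profile` dependency are hypothetical stand-ins.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


class FaultInjector:
    def __init__(self, failure_rate, allowed_envs=("chaos-test",), seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.allowed_envs = set(allowed_envs)
        self._rng = random.Random(seed)  # seedable so experiments are repeatable

    def call(self, fn, *args, env, **kwargs):
        # Guardrail: faults are only ever injected in allow-listed environments.
        if env in self.allowed_envs and self._rng.random() < self.failure_rate:
            raise InjectedFault(f"injected failure before {fn.__name__}")
        return fn(*args, **kwargs)


def fetch_profile(user_id):
    """Stand-in for a downstream dependency."""
    return {"id": user_id}


def fetch_with_fallback(injector, user_id, env):
    """The behavior under test: degrade gracefully when the dependency fails."""
    try:
        return injector.call(fetch_profile, user_id, env=env)
    except InjectedFault:
        return {"id": user_id, "degraded": True}
```

With `failure_rate=1.0`, a call tagged `env="chaos-test"` always takes the fallback path, while the same call tagged with any other environment passes straight through. Dedicated test regions enforce the same separation at the infrastructure level.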

Impact

$M+
Multi-Region Investment Secured
Championed and received multi-million-dollar funding for dedicated chaos infrastructure spanning two regions.
99.99%
System Availability Achieved
Chaos testing, automated recovery, and rules-based automation together drove availability to 99.99%.
25%
Fewer High-Severity Incidents
Proactive resilience testing and automated failure handling cut the frequency of critical incidents by a quarter.
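
For a sense of what a 99.99% target allows, the downtime budget is simple arithmetic. The helper below is illustrative only; the function name and the 365-day year are assumptions.

```python
def downtime_budget_minutes(availability: float, period_days: float = 365.0) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_days * 24 * 60


yearly = downtime_budget_minutes(0.9999)        # about 52.6 minutes per year
monthly = downtime_budget_minutes(0.9999, 30)   # about 4.3 minutes per 30 days
```

At four nines, one slow manual recovery can spend the whole month's budget, which is why automated recovery carries so much of the load.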

Not Your Typical TPM

🧠

I Build AI Systems

Designed multi-agent architectures, custom LLM pipelines (RAG/SFT/CPT), and ML infrastructure from datalake to production serving. I speak the language of the engineers building the models.

Multi-Agent Systems • Deep Learning • RAG / SFT / CPT • Anomaly Detection • Feature Stores • NLP
🚀

I Ship Fast at Scale

12-week launches, 93% faster model delivery, multi-million-dollar investment proposals approved. I run programs that move at startup speed inside enterprise environments.

Cross-Functional Leadership • Roadmap Strategy • Executive Communication • Agile/Scrum • OKRs
💻

I Go Deep Technically

Python, SQL, system design reviews, CI/CD pipelines, cloud infrastructure. I can read the code, review the architecture, and debug the pipeline — not just track the Jira tickets.

Python • SQL • System Design • Cloud Infrastructure • Apache Kafka • CI/CD
🛠️

I Make Reliability a Feature

Chaos engineering programs, automated incident remediation, 99.99% availability. I treat production reliability as a first-class product — not an afterthought.

Chaos Engineering • Incident Management • SRE Practices • Observability • Auto-Remediation

Let's build what's next

I'm exploring opportunities where AI and ML programs need to move fast and ship reliably. If you're looking for a TPM who can take complex AI initiatives from strategy through production, let's connect.