Infrastructure & Platform Engineering
Kubernetes (400+ clusters, 200K+ pods, 150K+ cores at scale); Container Orchestration; Helm Charts; ArgoCD; Argo Rollouts; Infrastructure as Code: Terraform, Puppet, Packer; CI/CD Pipeline Design: GitHub Actions, GitOps workflows
Observability & Monitoring
Large-scale metrics platforms: Cortex, Grafana Mimir (4M+ metrics/sec, 3+ PB data); Monitoring stack: Datadog, Prometheus, Grafana, OpenTelemetry; SLI/SLO implementation, alerting strategy, incident response
Cloud & Distributed Systems
Multi-cloud expertise: AWS, Google Cloud Platform; Hybrid cloud architecture (on-premises + cloud); GPU workloads and ML infrastructure: NVIDIA/AMD operators; Cost optimization strategies (up to 90% savings via spot instances)
Reliability Engineering
Production incident response and root cause analysis; Performance optimization for high-traffic applications; Chaos engineering and failure mode analysis; Developer enablement and best-practices evangelism