SCOTT CABRINHA
Lake Havasu City, AZ | (949) 943-7317 | [email protected]
LinkedIn: linkedin.com/in/scott-cabrinha
STAFF SITE RELIABILITY ENGINEER
Professional Summary
Accomplished Site Reliability Engineer with nearly a decade of experience architecting and scaling mission-critical infrastructure at industry-leading technology companies. Proven expertise in managing hyperscale Kubernetes environments, implementing observability solutions at petabyte scale, and driving reliability engineering practices that reduce operational toil while improving system performance. Recognized for translating complex technical challenges into scalable solutions that support millions of users and devices globally.
Core Competencies
Infrastructure & Platform Engineering
Kubernetes (400+ clusters, 200K+ pods, 150K+ cores at scale); Container Orchestration: Helm Charts, ArgoCD, Argo Rollouts; Infrastructure as Code: Terraform, Puppet, Packer; CI/CD Pipeline Design: GitHub Actions, GitOps workflows
Observability & Monitoring
Large-scale metrics platforms: Cortex, Grafana Mimir (4M+ metrics/sec, 3+ PB data); Monitoring stack: Datadog, Prometheus, Grafana, OpenTelemetry; SLI/SLO implementation, alerting strategy, incident response
Cloud & Distributed Systems
Multi-cloud expertise: AWS, Google Cloud Platform; Hybrid cloud architecture (on-premises + cloud); GPU workloads and ML infrastructure: NVIDIA/AMD operators; Cost optimization strategies (up to 90% savings via spot instances)
Reliability Engineering
Production incident response and root cause analysis, Performance optimization for high-traffic applications, Chaos engineering and failure mode analysis, Developer enablement and best practices evangelism
Professional Experience
Staff Site Reliability Engineer | Tesla | Remote
MAY 2020 - PRESENT
Platform Engineering & Scale
DevOps & Application Delivery
Incident Response & Performance Engineering
Innovation & Emerging Technologies
Senior Site Reliability Engineer | GoPro | San Mateo, CA
DECEMBER 2018 - MAY 2020
Cloud Migration & Modernization
Observability & Automation
Site Reliability Engineer | Google (Nest) | Mountain View, CA
JANUARY 2016 - JUNE 2018
Infrastructure & Service Reliability
Automation & Developer Enablement
Technical Achievements
Hyperscale Infrastructure Management: Successfully operated 400+ Kubernetes clusters with 200K+ pods globally
Observability at Scale: Deployed monitoring infrastructure processing 4M+ metrics/second with 3+ petabytes of storage
Zero-Downtime Migrations: Led multiple critical infrastructure migrations without service interruption
Cost Optimization: Achieved up to 90% cost reduction through cloud resource optimization, including spot instances
GPU/ML Infrastructure: Pioneered Kubernetes-based GPU workload orchestration for machine learning applications
References available upon request