Scott Cabrinha - Site Reliability Engineer Resume

Core Competencies

Infrastructure & Platform Engineering

Kubernetes (400+ clusters, 200K+ pods, 150K+ cores at scale), Container Orchestration, Helm Charts, ArgoCD, Argo Rollouts
Infrastructure as Code: Terraform, Puppet, Packer
CI/CD Pipeline Design: GitHub Actions, GitOps workflows

Observability & Monitoring

Large-scale metrics platforms: Cortex, Grafana Mimir (4M+ metrics/sec, 3+ PB of data)
Monitoring stack: Datadog, Prometheus, Grafana, OpenTelemetry
SLI/SLO implementation, alerting strategy, incident response

Cloud & Distributed Systems

Multi-cloud expertise: AWS, Google Cloud Platform
Hybrid cloud architecture (on-premises + cloud)
GPU workloads and ML infrastructure: NVIDIA/AMD operators
Cost optimization strategies (up to 90% savings via spot instances)

Reliability Engineering

Production incident response and root cause analysis
Performance optimization for high-traffic applications
Chaos engineering and failure mode analysis
Developer enablement and best practices evangelism

Professional Experience

Staff Site Reliability Engineer | Tesla | Remote

MAY 2020 - PRESENT

Platform Engineering & Scale

Architected and maintained Tesla's global Kubernetes platform serving 400+ clusters across multiple continents, managing 200K+ pods with 150K+ cores of compute capacity
Led the design and implementation of a global observability infrastructure processing 4+ million metrics per second and storing 3+ petabytes of time-series data
Migrated from Cortex to Grafana Mimir at that scale with zero downtime, improving query performance and operational efficiency

DevOps & Application Delivery

Rebuilt application deployment processes around ArgoCD and Argo Rollouts, introducing canary deployment strategies and standardizing Helm charts across infrastructure services and developer applications
Reduced deployment risk and time-to-recovery through GitOps workflows and automated rollback mechanisms
Established infrastructure standards and best practices that improved developer velocity while maintaining operational excellence

Incident Response & Performance Engineering

Served as primary Kubernetes subject matter expert during high-severity incidents, consistently achieving swift resolution of production issues affecting critical Tesla services
Implemented real-time debugging processes for high-traffic applications, resulting in measurable improvements in system reliability and performance
Developed comprehensive monitoring and alerting strategies that reduced mean time to detection (MTTD) and mean time to resolution (MTTR)

Innovation & Emerging Technologies

Pioneered GPU workload orchestration on Kubernetes using NVIDIA and AMD operators, enabling large-scale machine learning and inference workloads
Collaborated cross-functionally to design and implement solutions that increased operational leverage while reducing manual toil
Mentored development teams on engineering best practices, balancing scalability requirements with cost optimization

Senior Site Reliability Engineer | GoPro | San Mateo, CA

DECEMBER 2018 - MAY 2020

Cloud Migration & Modernization

Led the migration from Amazon ECS to EKS, improving developer velocity through stronger continuous delivery capabilities and auto-scaling
Designed and implemented modern CI/CD pipelines optimized for Kubernetes environments, emphasizing Infrastructure as Code principles
Achieved up to 90% cost reduction in compute expenses through strategic implementation of AWS spot instances and optimized resource allocation

Observability & Automation

Enhanced monitoring and alerting infrastructure using Datadog, establishing the foundation for comprehensive SLO/SLA frameworks
Architected and deployed fully automated configuration management on Puppet 6 and 7, improving deployment consistency
Developed custom AMI build pipelines using Packer, reducing server startup times and improving deployment reliability

Site Reliability Engineer | Google (Nest) | Mountain View, CA

JANUARY 2016 - JUNE 2018

Infrastructure & Service Reliability

Maintained high availability and performance for cloud infrastructure supporting over 1 million Nest hardware devices across AWS and GCP environments
Executed a zero-downtime migration of virtual machines from AWS to Kubernetes on GCP with no customer impact
Designed and implemented comprehensive monitoring solutions using industry-standard open-source frameworks

Automation & Developer Enablement

Developed automation tools using Python and Go to streamline API operations and database management, significantly reducing manual operational overhead
Built a self-service monitoring platform integrating Sensu, InfluxDB, and PagerDuty, giving development teams automated observability tools
Established monitoring best practices and trained development teams on implementing comprehensive application metrics and alerting strategies