Senior SRE · DevOps · Cloud Infrastructure · AI Systems
📍 Denver, CO — Remote & On-site — Colorado
Deep expertise across the full infrastructure stack — from bare metal to cloud-native, from on-call incident response to long-term platform architecture.
SLO/SLA design, on-call runbooks, incident management, observability platforms, and blameless postmortem culture. Built SRE orgs from scratch at Crusoe, AppOmni, and Andromeda Cluster.
Multi-cloud architecture across AWS, GCP, and Azure. VPC design, cost optimization, on-premise to cloud migrations, and FinOps. Comfortable taking existing infrastructure and making it scale.
CI/CD pipelines, GitOps (Argo CD, FluxCD), Kubernetes, Helm, Terraform, and Ansible. Infrastructure as Code from day one, and deployment automation that teams actually want to use.
GPU cluster operations, SLURM, DCGM health monitoring, NVIDIA NGC, InfiniBand networking, distributed ML training pipelines, and heterogeneous fleet observability.
Python, Go, Java, JavaScript/Node, Scala, and more. REST and gRPC APIs, frontend frameworks, embedded firmware, and PCB-level hardware integration for industrial IoT.
FedRAMP and SOC2 compliance lifecycle management, SSPM/SIEM operations, secrets management with HashiCorp Vault, IAM design, and unified authentication infrastructure.
A breadth of tools built over 26+ years — from kernel-level Linux to managed cloud services.
A career built at companies pushing the edges of scale, reliability, and infrastructure complexity.
GPU cluster operations, aggregated observability for heterogeneous SLURM and Kubernetes fleets, DCGM health monitoring, Argo CD release management.
Grew the SRE team from 1 to 5. FedRAMP & SOC2 compliance. FinOps ownership. Blameless incident culture, FireHydrant implementation, cloud cost optimization.
First SRE hire. Built and led a 6-person team. Ground-to-cloud GPU cloud platform: PXE, MAAS, Rocky Linux, Kubernetes, Tailscale. Scaled from 1 to 5 mobile data centers.
GCP migration, AWS re-architecture, containerized deployments, bare-metal Kubernetes monitoring, Puppet/Ansible fleet management at 6k-host scale.
300B events/day across 6k services. China mobile market expansion. Terraform, GCP, Drone CI, HashiCorp Vault, multi-region monitoring and fraud detection.
13-year oilfield telemetry engagement, industrial IoT edge computing, PCB design, microcontroller firmware, cloud architecture, full-stack development across Python, JavaScript, and Java.
Beyond software, I take on skilled trade and general contracting work in the Denver metro area on a 1099 basis.
Available for contract and 1099 engagements. Remote-friendly, Colorado-based.