[Remote] Senior Cluster Site Reliability Engineer

Remote Full-time
Note: This is a remote position open to candidates in the USA.

The Voleon Group is a technology company that applies state-of-the-art machine learning techniques to real-world problems in finance. As a Senior Cluster Site Reliability Engineer, you will be responsible for scaling the research compute cluster, ensuring high uptime and reliability, and collaborating with engineering teams to improve monitoring and operational frameworks.

Responsibilities

- Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
- Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
- Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
- Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
- Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for those policies
- Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Skills

- 5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
- Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
- Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
- Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
- Experience with cloud infrastructure (AWS or GCP)
- Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
- Experience with distributed storage technologies (Lustre, Ceph, S3)
- Embodies a 'system engineer' rather than 'system administrator' mindset, thinking systematically and leveraging automation
- Bachelor's degree in computer science
- Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
- Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
- Familiarity with hybrid/on-prem environments
- Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
- Experience with HPC networking (InfiniBand, RDMA)
- Solid security/IAM foundations (identity management systems, AWS/GCP IAM, Zero Trust)

Benefits

If you have a great candidate in mind for this role and would like to have the potential to earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group, please use this form to submit your referral.

Company Overview

The Voleon Group is a family of companies committed to the development and deployment of cutting-edge technologies in investment management. It was founded in 2008 and is headquartered in Berkeley, California, USA, with a workforce of 201-500 employees. Its website is http://voleon.com/.

Company H1B Sponsorship

The Voleon Group has a track record of offering H1B sponsorships, with 2 in 2025, 2 in 2024, 3 in 2023, 4 in 2022, 1 in 2021, and 1 in 2020. Please note that this does not guarantee sponsorship for this specific role.
