Senior Site Reliability Engineer

Remote, Full-time
Gorgias is the conversational AI platform for ecommerce that drives sales and resolves support inquiries. Trusted by over 15,000 ecommerce brands, Gorgias supports everyone from growing independent shops to globally recognizable brands. Built for Shopify and powered by advanced ecommerce integrations, Gorgias's conversational AI understands your brand, tools, policies, and customers to drive personalized, 1-to-1 conversations, from editing orders and initiating returns to making product recommendations. Gorgias, where every customer interaction feels personal, support becomes sales, and conversations shape success.

About The SRE Team

We are seeking a highly skilled and experienced Senior Site Reliability Engineer (SRE) to join our team. As an SRE at Gorgias, you will play a crucial role in ensuring the reliability, scalability, and performance of our systems, enabling the seamless delivery of our products and services.

The SRE team at Gorgias maintains the core infrastructure and services that make up the heart of our product. We have the privilege of working with high-throughput systems and TB-scale data stores serving billions of queries per day, most with sub-millisecond response times.

We also design and maintain the software delivery stack, offering features such as metrics-based canary rollout strategies to all internal development teams.

We currently have a team of 9 Senior and Staff SREs operating together globally, with the aim of growing to 12 in the near term.
We focus on scalable methods to provide the largest impact across the organization.

Some achievements we’re proud of:
- Partitioned multi-TB tables in Postgres to reduce vacuum time by 5x. We studied the problem and the partitioning strategy, analyzed all queries to avoid bad surprises, and used Debezium and Kafka to do a live copy, completing the migration with a maintenance window of less than 20 minutes and no data loss.
- Split the PostgreSQL connection proxy into multiple pools to guarantee quotas per service of our product, so that sub-systems that hit the database heavily are contained and do not create a large incident blast radius. We had to go deeper into the backend to propose solutions, coded part of the fix in the backend, provided the path, and helped teams migrate to the new methodology, in the end eliminating incidents caused by DB connection starvation.
- Worked with all product-engineering teams to achieve SOC 2 certification, ran a HackerOne program, refactored our whole incident management with Rootly for better visibility and resolution time, and improved our overall security posture.
- To keep the lights on, the team is constantly upgrading our self-hosted Postgres and RabbitMQ, alongside other critical infrastructure components, with minimal downtime and high accuracy.

What You Will Do:
- Manage multi-TB PostgreSQL clusters in the public cloud; optimize parameters, storage settings, and data structure.
- Operate RabbitMQ and Redis with tens of thousands of operations per second.
- Manage 10+ full-featured GKE clusters worldwide, with 10k+ tenants.
- Adopt a new stack of Kafka, Debezium, and Apache Flink.
- Facilitate rollout strategies at scale with GitLab CI and ArgoCD.
- Roll out best practices around Kubernetes/Helm/Operators, SLIs/SLOs, Incident Management, Observability, Security, and Disaster Recovery to all Product-Engineering teams and drive their adoption.
- Automate complex infrastructure pieces for our worldwide footprint with best-practice IaC in Terraform and strong scripting in Python/Golang.

What You Should Have:
- Experience with cloud-native web systems at scale.
- Bachelor's degree in Computer Science or equivalent work experience.
- 5+ years of experience as a Site Reliability Engineer or in a similar role, with a focus on maintaining high-performance, scalable, and reliable high-throughput web systems.
- Proficiency in using Kubernetes for container orchestration and management.
- 5+ years of experience with Cloud Providers (AWS, GCP) and a deep understanding of cloud services and architectures. (We use GCP.)
- Proficiency in scripting and programming languages such as Python, Bash, Go, or NodeJS.
- Comfort and confidence with Linux systems and the command line.
- Solid understanding of infrastructure as code (IaC) principles and experience with tools like Terraform.
- Experience with continuous integration and deployment (CI/CD) pipelines.
- Excellent problem-solving and troubleshooting skills.
- Strong communication and collaboration skills, with the ability to work effectively in a team environment.

Bonus Points If You Have:
- Certification in Kubernetes (e.g., Certified Kubernetes Administrator - CKA).
- Certification in a Cloud Provider platform (e.g., AWS Certified Solutions Architect, Google Cloud Professional Cloud Architect).
- Experience in managing and optimizing PostgreSQL databases.

Company Benefits and Perks
- 5-week vacation
- Paid sick leave
- Paid parental leave
- MacBook Pro
- Private health insurance
- Monthly lunch stipend of $300 gross added to your salary
- Up to €700 (gross) to set up your workstation at home (added to your first paycheck as an onboarding bonus)
- Up to €1,500 of learning budget and a FitPass yearly membership. Take advantage of these resources to grow in your role and prioritize your personal development and wellness.
- Every quarter, we organize an online company-wide summit to discuss where we’re going and strengthen social bonds.
- Once per year, we organize offsite team retreats and company retreats!

Diversity & Inclusion at Gorgias

We celebrate diversity and are committed to creating an inclusive environment for all employees. We welcome applicants of all backgrounds, experiences, and perspectives. At Gorgias, we believe that diverse teams drive innovation and better decision-making. We do not discriminate based on race, color, religion, gender identity, sexual orientation, disability, age, or any other protected status.

If you need accommodations to participate in the application or interview process, perform essential job functions, or access other employment benefits, please contact us at [email protected]. Let’s grow together!
