Corporate Vice President - Lead Site Reliability Engineer

Remote Full-time
Location Designation: Hybrid - 3 days per quarter

We are seeking a highly skilled site reliability engineer (SRE) to join our IT Operations team. The SRE is responsible for enabling innovation and velocity of change while ensuring system reliability, focusing on the critical features and functionality within products and platforms. The SRE collaborates with business or product owners to prioritize operational requirements by defining service-level indicators (SLIs) and service-level objectives (SLOs) to monitor and optimize the customer's journey and experience.

Our goal is to improve the stability of existing platforms and, in parallel, to design and operate scalable, resilient systems using modern software engineering principles. In this role you will analyze service management data from incident management, problem management, change management, and release management to identify persistent problems. You will then improve monitoring and observability and implement corrective actions. You are also encouraged to recommend changes to our architecture to increase performance and stability.

Successful reliability outcomes are likely to build on and extend DevOps and Agile ways of working and their associated automation approaches. These are underpinned by the site reliability engineer's solid understanding of systems, production environments, operational insights, incident management, and on-premises, cloud, and hybrid environments. The nature of the work means that the site reliability engineer will engage directly with customer teams but will also work on reliability initiatives that span multiple teams.

The site reliability engineer collaborates closely with product owners and teams, architects, IT service management, software developers, security and network engineers, as well as other subject matter experts and roles, particularly in infrastructure and operations. Being an approachable team player and a good communicator is therefore crucial for success, and a willingness to lead initiatives is important.

The site reliability engineer leads root cause analysis in areas such as deployment activities, event management, incident and problem management, availability, capacity and service-level management, as well as service continuity and scalability.

What You'll Do:
• Define and mature SRE practices, including SLO/SLI frameworks and error-budget governance.
• Design and implement automation solutions using Java, JavaScript, APIs, SQL, and Terraform.
• Investigate and resolve application performance bottlenecks by analyzing code, queries, APIs, and data flows.
• Optimize data-processing pipelines, ETL components, and backend services for improved throughput and latency.
• Deliver application-level fixes and enhancements through disciplined software engineering.
• Focus on key reliability and performance indicators: uptime, system throughput, system output, and download rate/application load speed.
• Partner with the NYL Platform Engineering Team to shift from non-standard application platforms to standard software artifacts (Terraform modules, secure base images, YAML templates, Java libraries) integrated into CI/CD pipelines, creating reusable patterns and reducing repetitive configuration and coding tasks.
• Support and Troubleshooting: Provide expert support and troubleshooting across network and enterprise service issues, ensuring minimal disruption to business operations.
• Platform Support: Support various platforms, including Windows, Linux, macOS, and cloud environments (e.g., AWS, Azure).
• Incident Response: Respond to and resolve incidents in a timely manner, providing clear communication to stakeholders throughout the process.
• Monitoring and Optimization: Build monitoring, observability dashboards, and alerting systems. Monitor network and platform performance, proactively identifying and addressing potential issues, and close gaps identified during troubleshooting efforts.
• Documentation: Maintain detailed records of issues, actions taken, and outcomes to support continuous improvement efforts.
• Collaboration: Work closely with other IT teams and external vendors to resolve complex issues and implement solutions.
• Process Improvement: Identify opportunities to improve support processes and implement best practices to enhance overall efficiency.
• Training: Provide training and guidance to IT staff on network and platform support techniques and best practices.

What You'll Bring:
• Education: Bachelor's degree in Information Technology, Computer Science, or a related field
• Experience: 3+ years in software engineering, DevOps, SRE, or related disciplines
• Essential: AWS certification; experience supporting Salesforce and Salesforce integrations; strong programming skills (Java, JavaScript, SQL, API development); experience with Terraform and infrastructure-as-code; knowledge of SLIs/SLOs, observability, and performance metrics
• Skills:
• Strong technical expertise in network infrastructure and platform support.
• Excellent analytical and problem-solving abilities.
• Proven ability to manage high-pressure situations and resolve complex issues.
• In-depth knowledge of network protocols and services (TCP/IP, DNS, DHCP, VPN).
• Proficiency in using network monitoring and troubleshooting tools (e.g., Wireshark, SolarWinds, Nagios).
• Experience with various server operating systems (Windows, Linux, AMI) and cloud platforms (AWS, Azure).
• Strong communication and interpersonal skills.
• Ability to work independently and as part of a team.
• Commitment to continuous learning and improvement.

Pay Transparency

Salary Range: $133,000-$190,000

Overtime eligible: Exempt

Discretionary bonus eligible: Yes

Sales bonus eligible: No

Actual base salary will be determined based on several factors, including but not limited to the individual's experience, skills, qualifications, and job location. Additionally, employees are eligible for an annual discretionary bonus. In addition to base salary, employees may also be eligible to participate in an incentive program.

Job Requisition ID: 93708
