AI Test Engineer - Senior Manager

Remote, Full-time
About Vialto Labs (VLabs)



Vialto Labs (VLabs) is responsible for redesigning how work is delivered in the tax and immigration service lines, as well as driving operational efficiency across Vialto’s functional areas using AI. The team builds and deploys novel AI-enabled solutions that directly improve productivity and increase delivery quality for our clients. VLabs is accountable for rapidly turning innovative experiments into production-ready deliverables at scale and embedding them into day-to-day operations. This team focuses on the highest-impact workflows, creating standardized, repeatable capabilities that can be deployed globally. Operating with a mandate for speed and measurable outcomes, VLabs works alongside service line, product, and platform leaders.



About the Role



The Senior Manager, AI Test Engineering is a hands-on role within VLabs Quality Engineering, responsible for validating the performance, reliability, and integrity of AI-enabled solutions in production environments. This role operates at the intersection of AI engineering and quality assurance, ensuring that LLMs, OCR pipelines, document classification models, and agentic workflows perform as expected at scale and that their outputs meet defined business performance thresholds. Working closely with the Programme Test Manager and partnering with engineering, product, and delivery teams, this role translates AI testing strategy into executable frameworks, evaluation pipelines, and reusable assets embedded into the delivery lifecycle.



Success requires independent execution, strong technical depth, and the ability to proactively identify risks, patterns, and performance gaps while enabling rapid, production-grade deployment of AI capabilities.



Key Responsibilities



AI Evaluation & Test Design

Translate AI testing strategy into executable test scenarios across LLM outputs, document classification, extraction accuracy, agent workflows, and edge cases

Design adversarial and boundary test inputs to expose hallucination, misclassification, and failure modes

Validate AI outputs for structure, consistency, accuracy, and production readiness against defined performance thresholds



Evaluation Engineering & Automation



Build reusable Python-based evaluation frameworks, including output validation, hallucination detection, and scoring mechanisms

Develop parameterized test scripts reusable across features, models, and releases

Implement AI-as-Judge frameworks, including prompt design, scoring logic, and calibration of evaluation reliability

Embed evaluation frameworks into CI/CD pipelines to support continuous testing and deployment



Drift Detection & Quality Monitoring

Design and operate drift detection frameworks using fixed baseline datasets and scheduled re-evaluation

Establish thresholds to distinguish acceptable variation from performance degradation

Enable release gating by identifying regressions prior to production deployment



Ground Truth & Data Quality

Build and maintain ground truth datasets in partnership with subject matter experts

Define standards for classification, extraction accuracy, and acceptable output characteristics

Continuously update datasets to reflect evolving business requirements and use cases



Workflow & Integration Testing

Test end-to-end agentic workflows, validating data integrity, error propagation, and fallback behavior

Perform API-level testing of AI pipeline endpoints using Python and Postman/Newman

Validate data persistence and integrity across system layers using SQL

Partner with engineering teams to ensure testability, observability, and system reliability



Standardization & Scaling

Define and scale standardized AI evaluation patterns and reusable quality frameworks across VLabs

Contribute to enterprise AI quality standards and reference architectures



Governance & Responsible AI

Ensure adherence to Responsible AI, data privacy, and governance requirements

Support auditability, traceability, and transparency of AI outputs and evaluation processes



Stakeholder Enablement

Translate evaluation results into actionable insights for engineering, product, and business stakeholders

Support decision-making on model readiness, release risk, and performance trade-offs

Proactively identify risks, patterns, and systemic issues and escalate appropriately



Qualifications & Experience



Professional Experience

7+ years in software testing, including 2–3 years focused on AI/ML-enabled systems in production environments

Proven experience designing and executing AI evaluation frameworks and quality strategies

Strong track record building ground truth datasets, drift detection systems, and scalable evaluation pipelines

Experience testing multi-step agentic workflows and AI-driven automation systems

Experience operating in fast-paced, iterative delivery environments

Background in regulated or compliance-driven environments preferred



Technical Expertise

Advanced Python programming for evaluation frameworks, batch processing, and data analysis

Experience with LLM evaluation tools such as deepeval, RAGAS, promptfoo, or similar

Strong capabilities in:

AI output validation, hallucination detection, and grounding checks

Drift detection frameworks and statistical evaluation methods

OCR, VLM, and document AI testing (classification, extraction, edge cases)

API testing using Python (requests/httpx) and Postman/Newman

SQL for data validation and pipeline integrity checks

Familiarity with LangChain, LlamaIndex, or similar frameworks

Experience with cloud AI platforms such as Azure AI Foundry or AWS Bedrock preferred



Operating Capabilities

Ability to operate independently in fast-moving, ambiguous environments

Strong analytical mindset with attention to detail and quality rigor

Ability to balance speed and rigor in AI evaluation and delivery cycles

Proactive communicator who identifies risks and drives resolution

Ability to translate technical findings into business-relevant insights



Education

Bachelor’s degree required; advanced degree in Computer Science, Data Science, or a related field preferred

We are an equal opportunity employer that does not discriminate on the basis of any legally protected status.
Please note, AI is used as part of the application process.