[Remote] QA Engineer, AI Products
Note: This is a remote position open to candidates in the USA.

MDCalc is a leading medical reference widely used by clinicians to improve patient outcomes. They are seeking a QA Engineer to ensure the quality and reliability of their AI-powered features, focusing on testing LLM-based systems and collaborating with cross-functional teams.

Responsibilities
- Design and execute test strategies for LLM-powered features, including prompt regression testing, output evaluation, and hallucination detection
- Build and maintain automated evaluation pipelines (eval sets, golden datasets, LLM-as-judge frameworks) to catch quality regressions in non-deterministic outputs
- Perform black-box and exploratory testing of MDCalc's AI features across web and mobile, with particular attention to clinical accuracy, safety, and edge cases
- Define quality metrics for AI outputs (accuracy, faithfulness, relevance, safety, latency, cost) and establish thresholds for release readiness
- Collaborate cross-functionally with engineers, product managers, ML/AI engineers, and clinical reviewers to define what 'good' looks like for AI responses
- Investigate and triage AI failure modes, distinguishing model issues, prompt issues, retrieval issues, and integration bugs
- Participate in team discussions, offering feedback on testability, risks, prompt design, and guardrails
- Help develop QA strategies to expand future testing capacity, automation, and evaluation coverage as the AI product surface grows

Skills
- 5+ years of experience in software QA, with at least 1 year of hands-on testing of LLM-based or AI/ML-powered features
- Strong understanding of QA principles, test case creation/documentation, and best practices for both deterministic and non-deterministic systems
- Hands-on experience with LLM tooling and concepts: prompt engineering, RAG systems, evaluation frameworks (e.g., Promptfoo, Braintrust, LangSmith, DeepEval, Ragas, OpenAI Evals), and LLM APIs (OpenAI, Anthropic, etc.)
- Experience designing automated qualitative evaluation approaches, including LLM-as-judge, rubric-based scoring, semantic similarity checks, and golden dataset regression testing
- Proficiency with test automation tools, with a focus on Playwright
- Strong SQL skills for data validation, test data creation, and verifying data integrity across systems
- Familiarity with token usage, latency profiling, and cost monitoring as quality signals
- Eagerness to learn quickly and a positive, solutions-oriented attitude
- Clear and concise communicator, able to surface issues, blockers, and risks effectively, including when the failures are ambiguous or probabilistic
- Self-motivated, proactive, and able to manage time and priorities independently

Benefits
- Medical, dental, and vision coverage, with the option to extend to your dependents
- Company-sponsored short-term insurance
- Fully paid 8-week parental leave after 6 months of employment
- Company-sponsored 401k after 3 months of employment
- Unlimited vacation for salaried roles: we trust you to take the time you need
- Bi-annual company offsites to connect, reflect, and plan together
- Monthly work-from-home stipend
- A culture of fun and motivated team members who believe in a greater mission here at MDCalc

Company Overview
MDCalc is used by over two-thirds of US physicians and provides free access to 800+ medical scores, calculations, and algorithms. It was founded in 2005 and is headquartered in New York, New York, USA, with a workforce of 11-50 employees. Its website is https://www.mdcalc.com.