AI Model Evaluator (LLM & Agent Systems)

Remote Full-time
Job Title: AI Model Evaluator (LLM & Agent Systems)

Job Type: Contract (Minimum 2 weeks, with potential extension)

Location: Remote

Job Summary:

Join our customer's team as an AI Model Evaluator (LLM & Agent Systems) and play a pivotal role in shaping the future of generative AI and autonomous agents. You'll help benchmark, analyze, and assess cutting-edge AI systems in real-world scenarios, providing structured insights that drive improvements. This position is ideal for analytical professionals passionate about AI quality and real-world impact.

Key Responsibilities:
• Evaluate outputs from large language models (LLMs) and autonomous agent systems against defined guidelines and rubrics
• Review multi-step agent actions, including screenshots and reasoning traces, to determine accuracy and quality
• Consistently apply evaluation standards, flagging edge cases and identifying recurring patterns or failure modes
• Provide detailed, structured feedback to inform benchmarking, product evolution, and model refinement
• Participate in calibration and alignment sessions to ensure consistent application of evaluation criteria
• Work collaboratively to adapt to evolving scenarios and ambiguous evaluation situations
• Document findings and communicate insights clearly, both in writing and verbally, to relevant stakeholders

Required Skills and Qualifications:
• Demonstrated experience with LLM evaluation, AI output analysis, QA/testing, UX research, or similar analytical roles
• Strong background in AI model evaluation, benchmarking, and applying rubric-based scoring frameworks
• Exceptional attention to detail and sound judgement in ambiguous or edge-case scenarios
• Proficiency in English (B2+ or equivalent) with excellent written and verbal communication skills
• Ability to adapt quickly to evolving guidelines and work independently
• Comfort with remote work and a commitment of at least 20 hours per week for the initial term
• Analytical mindset with a focus on actionable, qualitative feedback

Preferred Qualifications:
• Experience with RLHF, annotation workflows, or AI benchmarking frameworks
• Familiarity with autonomous agent systems or workflow automation tools
• Background in mobile apps or digital product evaluation processes

Required Skills
• LLMs
• Generative AI
• AI Model Evaluation
• AI Benchmarking
• AI Quality Assessment
• Model Performance Evaluation
• Prompt Response Evaluation
• AI Output Analysis
• Rubric-Based Scoring
