Hudson Manpower
Overview:
We are looking for an AI Evaluation Engineer with deep expertise in LLM benchmarking and evaluation frameworks. The candidate will be responsible for designing, automating, and executing structured evaluations that assess model quality, safety, performance, and cost. This role is critical to delivering reliable, scalable, enterprise-ready Generative AI solutions in a fully remote environment.
Location: Bellevue, WA (Remote)
Duration: 6+ Months
Work Authorization: USC, GC, GC EAD, H4 EAD, TN
Interview Mode: Video Interview
Job Summary
We are seeking an experienced AI Evaluation Engineer to design, automate, and execute large language model (LLM) evaluation and benchmarking frameworks for Generative AI systems. This role focuses on assessing model quality, safety, performance, latency, and cost across Azure OpenAI and other GenAI platforms. The ideal candidate has strong hands-on experience with evaluation metrics, prompt testing, and Python-based automation, ensuring enterprise-grade and reliable AI outputs.
Key Responsibilities
Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost
Perform hands-on benchmarking and comparative analysis of Generative AI models
Build and maintain automated evaluation pipelines using Python (a minimal sketch follows this list)
Create and manage datasets, benchmarks, and ground-truth references
Conduct structured prompt testing using Azure OpenAI and OpenAI APIs
Analyze hallucinations, bias, safety, and security risks in LLM outputs
Establish baselines and compare multiple models and prompt strategies
Ensure reproducibility and consistency of evaluation results
Document evaluation methodologies, metrics, and findings
Collaborate with AI/ML engineers, product teams, and stakeholders
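Because several of these responsibilities center on Python-based evaluation automation, a minimal sketch of such a pipeline is shown below. It assumes a JSONL ground-truth file (`ground_truth.jsonl` with `prompt`/`expected` fields), a placeholder Azure OpenAI deployment name and API version, and exact-match scoring as a stand-in for the fuller metric suite; those specifics are illustrative assumptions, not part of the role's actual tooling.

```python
# Minimal sketch of a batch evaluation loop (illustrative only).
# Assumptions: a JSONL ground-truth file with "prompt" and "expected" fields,
# a placeholder Azure OpenAI deployment name, and exact-match scoring as a
# stand-in for the real metric suite.
import json
import os
import time

from openai import AzureOpenAI  # openai>=1.x SDK

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",  # placeholder API version
)

DEPLOYMENT = "gpt-4o-mini"  # placeholder deployment name


def run_eval(dataset_path: str) -> dict:
    """Run every ground-truth example through the model and score it."""
    results = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            start = time.perf_counter()
            response = client.chat.completions.create(
                model=DEPLOYMENT,
                messages=[{"role": "user", "content": example["prompt"]}],
                temperature=0,  # deterministic settings aid reproducibility
            )
            latency = time.perf_counter() - start
            answer = response.choices[0].message.content.strip()
            results.append({
                "prompt": example["prompt"],
                "expected": example["expected"],
                "answer": answer,
                "correct": answer == example["expected"],  # exact match
                "latency_s": latency,
                "total_tokens": response.usage.total_tokens,  # cost proxy
            })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}


if __name__ == "__main__":
    summary = run_eval("ground_truth.jsonl")
    print(f"accuracy={summary['accuracy']:.2%}")
```

In practice the exact-match check would be replaced by the metrics listed under Required Skills, and the per-example records would feed baseline comparisons and reproducibility reports.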
Key Skills
AI Evaluation
LLM Benchmarking
Azure OpenAI
OpenAI Evals
Prompt Engineering
Prompt Testing
Evaluation Metrics
Hallucination Analysis
Python Automation
Generative AI Testing
Must-Have Hands-On Experience (Critical)
LLM evaluation and benchmarking for Generative AI models
Designing and executing Eval test suites
Automated evaluation pipeline development using Python
Working with Azure OpenAI and structured prompt testing
Creating datasets, benchmarks, and ground-truth references (see the dataset sketch after this list)
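As a small illustration of the dataset and ground-truth work named in this list, the sketch below writes a toy benchmark as JSONL. The field names and examples are hypothetical, chosen only to pair with the evaluation-loop sketch earlier in this posting.

```python
# Illustrative sketch: build a small ground-truth benchmark as JSONL.
# Field names ("prompt", "expected", "category") are hypothetical choices,
# picked to pair with the evaluation loop sketched above.
import json

ground_truth = [
    {
        "prompt": "What is the capital of France?",
        "expected": "Paris",
        "category": "factual_recall",
    },
    {
        "prompt": "Summarize: 'The meeting was moved from 2 PM to 4 PM.'",
        "expected": "The meeting now starts at 4 PM.",
        "category": "summarization",
    },
]

with open("ground_truth.jsonl", "w", encoding="utf-8") as f:
    for example in ground_truth:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```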
Required Skills
Technical Skills
Azure OpenAI / OpenAI APIs
LLM evaluation and benchmarking frameworks
Evaluation metrics: Precision, Recall, F1, BLEU, ROUGE, hallucination rate, latency, cost (a simple metric sketch follows this list)
Prompt engineering: zero-shot, few-shot, and system prompts
Python for automation, batch evaluation execution, and data analysis
Evaluation tools and frameworks:
OpenAI Evals
Hugging Face Evaluate
Promptfoo
RAGAS
DeepEval
LM Evaluation Harness
AI safety evaluation, bias testing, and security assessment
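To make the metric names above concrete, here is a small, self-contained sketch of token-overlap precision, recall, and F1 between a model answer and a reference, in the SQuAD style. It is a simplified illustration only; production evaluations would typically use the frameworks listed above for BLEU/ROUGE and add hallucination, latency, and cost tracking.

```python
# Simplified token-overlap precision/recall/F1 between a model answer and a
# reference (SQuAD-style). A real evaluation suite would add text
# normalization and use dedicated libraries for BLEU/ROUGE.
from collections import Counter


def token_f1(prediction: str, reference: str) -> dict:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(overlap.values())
    if num_same == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = token_f1(
        "The meeting now starts at 4 PM",
        "The meeting was moved to 4 PM",
    )
    print(scores)  # precision, recall, and F1 are each 4/7 ≈ 0.57 for this pair
```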
Functional Skills
Test design and test automation
Reproducible evaluation pipeline design
Model comparison and baseline creation
Strong analytical and problem-solving skills
Clear technical documentation and reporting
Cross-functional collaboration with AI/ML and product teams
| Published | about 16 hours ago |
| Expires | in 30 days |
| Contract type | Employment contract |