Hudson Manpower
Overview:
We are looking for an AI Evaluation Engineer with deep expertise in LLM benchmarking and evaluation frameworks. The candidate will be responsible for designing, automating, and executing structured evaluations that assess model quality, safety, performance, and cost. This role is critical to delivering reliable, scalable, enterprise-ready Generative AI solutions in a fully remote environment.
Location: Bellevue, WA (Remote)
Duration: 6+ Months
Work Authorization: USC, GC, GC EAD, H4 EAD, TN
Interview Mode: Video Interview
Job Summary
We are seeking an experienced AI Evaluation Engineer to design, automate, and execute large language model (LLM) evaluation and benchmarking frameworks for Generative AI systems. This role focuses on assessing model quality, safety, performance, latency, and cost across Azure OpenAI and other GenAI platforms. The ideal candidate has strong hands-on experience with evaluation metrics, prompt testing, and Python-based automation, ensuring enterprise-grade and reliable AI outputs.
Key Responsibilities
Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost
Perform hands-on benchmarking and comparative analysis of Generative AI models
Build and maintain automated evaluation pipelines using Python (a minimal sketch follows this list)
Create and manage datasets, benchmarks, and ground-truth references
Conduct structured prompt testing using Azure OpenAI and OpenAI APIs
Analyze hallucinations, bias, safety, and security risks in LLM outputs
Establish baselines and compare multiple models and prompt strategies
Ensure reproducibility and consistency of evaluation results
Document evaluation methodologies, metrics, and findings
Collaborate with AI/ML engineers, product teams, and stakeholders
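Because several of these responsibilities center on Python-based evaluation automation, a minimal sketch of such a pipeline is shown below. It assumes a JSONL ground-truth file (`ground_truth.jsonl` with `prompt`/`expected` fields), a placeholder Azure OpenAI deployment name and API version, and exact-match scoring as a stand-in for the fuller metric suite; those specifics are illustrative assumptions, not part of the role's actual tooling.

```python
# Minimal sketch of a batch evaluation loop (illustrative only).
# Assumptions: a JSONL ground-truth file with "prompt" and "expected" fields,
# a placeholder Azure OpenAI deployment name, and exact-match scoring as a
# stand-in for the real metric suite.
import json
import os
import time

from openai import AzureOpenAI  # openai>=1.x SDK

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",  # placeholder API version
)

DEPLOYMENT = "gpt-4o-mini"  # placeholder deployment name


def run_eval(dataset_path: str) -> dict:
    """Run every ground-truth example through the model and score it."""
    results = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            start = time.perf_counter()
            response = client.chat.completions.create(
                model=DEPLOYMENT,
                messages=[{"role": "user", "content": example["prompt"]}],
                temperature=0,  # deterministic settings aid reproducibility
            )
            latency = time.perf_counter() - start
            answer = response.choices[0].message.content.strip()
            results.append({
                "prompt": example["prompt"],
                "expected": example["expected"],
                "answer": answer,
                "correct": answer == example["expected"],  # exact match
                "latency_s": latency,
                "total_tokens": response.usage.total_tokens,  # cost proxy
            })
    accuracy = sum(r["correct"] for r in results) / len(results)
    return {"accuracy": accuracy, "results": results}


if __name__ == "__main__":
    summary = run_eval("ground_truth.jsonl")
    print(f"accuracy={summary['accuracy']:.2%}")
```

In practice the exact-match check would be replaced by the metrics listed under Required Skills, and the per-example records would feed baseline comparisons and reproducibility reports.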
Key Skills
AI Evaluation
LLM Benchmarking
Azure OpenAI
OpenAI Evals
Prompt Engineering
Prompt Testing
Evaluation Metrics
Hallucination Analysis
Python Automation
Generative AI Testing
Must-Have Hands-On Experience (Critical)
LLM evaluation and benchmarking for Generative AI models
Designing and executing Eval test suites
Automated evaluation pipeline development using Python
Working with Azure OpenAI and structured prompt testing
Creating datasets, benchmarks, and ground-truth references (see the dataset sketch after this list)
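As a small illustration of the dataset and ground-truth work named in this list, the sketch below writes a toy benchmark as JSONL. The field names and examples are hypothetical, chosen only to pair with the evaluation-loop sketch earlier in this posting.

```python
# Illustrative sketch: build a small ground-truth benchmark as JSONL.
# Field names ("prompt", "expected", "category") are hypothetical choices,
# picked to pair with the evaluation loop sketched above.
import json

ground_truth = [
    {
        "prompt": "What is the capital of France?",
        "expected": "Paris",
        "category": "factual_recall",
    },
    {
        "prompt": "Summarize: 'The meeting was moved from 2 PM to 4 PM.'",
        "expected": "The meeting now starts at 4 PM.",
        "category": "summarization",
    },
]

with open("ground_truth.jsonl", "w", encoding="utf-8") as f:
    for example in ground_truth:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```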
Required Skills
Technical Skills
Azure OpenAI / OpenAI APIs
LLM evaluation and benchmarking frameworks
Evaluation metrics: Precision, Recall, F1, BLEU, ROUGE, hallucination rate, latency, cost (a simple metric sketch follows this list)
Prompt engineering: zero-shot, few-shot, and system prompts
Python for automation, batch evaluation execution, and data analysis
Evaluation tools and frameworks:
OpenAI Evals
Hugging Face Evaluate
Promptfoo
RAGAS
DeepEval
LM Evaluation Harness
AI safety evaluation, bias testing, and security assessment
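To make the metric names above concrete, here is a small, self-contained sketch of token-overlap precision, recall, and F1 between a model answer and a reference, in the SQuAD style. It is a simplified illustration only; production evaluations would typically use the frameworks listed above for BLEU/ROUGE and add hallucination, latency, and cost tracking.

```python
# Simplified token-overlap precision/recall/F1 between a model answer and a
# reference (SQuAD-style). A real evaluation suite would add text
# normalization and use dedicated libraries for BLEU/ROUGE.
from collections import Counter


def token_f1(prediction: str, reference: str) -> dict:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = Counter(pred_tokens) & Counter(ref_tokens)  # multiset intersection
    num_same = sum(overlap.values())
    if num_same == 0:
        return {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = token_f1(
        "The meeting now starts at 4 PM",
        "The meeting was moved to 4 PM",
    )
    print(scores)  # precision, recall, and F1 are each 4/7 ≈ 0.57 for this pair
```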
Functional Skills
Test design and test automation
Reproducible evaluation pipeline design
Model comparison and baseline creation
Strong analytical and problem-solving skills
Clear technical documentation and reporting
Cross-functional collaboration with AI/ML and product teams
| Published | about 16 hours ago |
| Expires | in 30 days |
| Contract type | Employment contract |