
AI Evaluation Engineer (Remote)

Hudson Manpower

within USA
30-50 USD / hour
Employment contract
☁️ azure
🐍 python
security
testing

Overview:

We are looking for an AI Evaluation Engineer with deep expertise in LLM benchmarking and evaluation frameworks. The candidate will be responsible for designing, automating, and executing structured evaluations that assess model quality, safety, performance, and cost. This role plays a critical part in delivering reliable, scalable, and enterprise-ready Generative AI solutions in a fully remote environment.

Location: Bellevue, WA (Remote)

Duration: 6+ Months

Work Authorization: USC, GC, GC EAD, H4 EAD, TN

Interview Mode: Video Interview

Job Summary

We are seeking an experienced AI Evaluation Engineer to design, automate, and execute large language model (LLM) evaluation and benchmarking frameworks for Generative AI systems. This role focuses on assessing model quality, safety, performance, latency, and cost across Azure OpenAI and other GenAI platforms. The ideal candidate has strong hands-on experience with evaluation metrics, prompt testing, and Python-based automation, ensuring enterprise-grade and reliable AI outputs.

Key Responsibilities

  • Design and execute structured LLM evaluation (Eval) test suites to measure accuracy, relevance, safety, latency, and cost

  • Perform hands-on benchmarking and comparative analysis of Generative AI models

  • Build and maintain automated evaluation pipelines using Python (a minimal sketch follows this list)

  • Create and manage datasets, benchmarks, and ground-truth references

  • Conduct structured prompt testing using Azure OpenAI and OpenAI APIs

  • Analyze hallucinations, bias, safety, and security risks in LLM outputs

  • Establish baselines and compare multiple models and prompt strategies

  • Ensure reproducibility and consistency of evaluation results

  • Document evaluation methodologies, metrics, and findings

  • Collaborate with AI/ML engineers, product teams, and stakeholders
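
To make the pipeline and prompt-testing bullets above more concrete, here is a minimal Python sketch of a structured evaluation loop against Azure OpenAI. It is illustrative only: the deployment name "gpt-4o-eval", the environment variables, the toy dataset, and the exact-match scoring are assumptions for this sketch, not details from this posting; a real suite would use richer datasets and graders.

    # Minimal sketch of an automated LLM evaluation loop (illustrative only).
    # Assumes: openai >= 1.x, an Azure OpenAI deployment named "gpt-4o-eval"
    # (hypothetical), and endpoint/key supplied via environment variables.
    import os
    import time

    from openai import AzureOpenAI

    client = AzureOpenAI(
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version="2024-02-01",
    )

    # Toy ground-truth dataset: (question, expected answer) pairs.
    DATASET = [
        ("What is the capital of France?", "Paris"),
        ("How many days are in a leap year?", "366"),
    ]

    SYSTEM_PROMPT = "Answer with a single word or number, nothing else."

    def evaluate(deployment: str = "gpt-4o-eval") -> dict:
        """Run the toy test suite once; return accuracy and mean latency."""
        correct, latencies = 0, []
        for question, expected in DATASET:
            start = time.perf_counter()
            response = client.chat.completions.create(
                model=deployment,  # Azure deployment name, not a raw model name
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": question},
                ],
                temperature=0.0,  # keep outputs as deterministic as possible
            )
            latencies.append(time.perf_counter() - start)
            answer = (response.choices[0].message.content or "").strip()
            # Naive exact-match scoring; production suites use richer graders.
            correct += int(answer.lower() == expected.lower())
        return {
            "accuracy": correct / len(DATASET),
            "mean_latency_s": sum(latencies) / len(latencies),
        }

    if __name__ == "__main__":
        print(evaluate())

Running the same function against two deployment names is one simple way to produce the comparative baselines described above.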

Key Skills

  • AI Evaluation

  • LLM Benchmarking

  • Azure OpenAI

  • OpenAI Evals

  • Prompt Engineering

  • Prompt Testing

  • Evaluation Metrics

  • Hallucination Analysis

  • Python Automation

  • Generative AI Testing

Must-Have Hands-On Experience (Critical)

  • LLM evaluation and benchmarking for Generative AI models

  • Designing and executing Eval test suites

  • Automated evaluation pipeline development using Python

  • Working with Azure OpenAI and structured prompt testing

  • Creating datasets, benchmarks, and ground-truth references

Required Skills

Technical Skills

  • Azure OpenAI / OpenAI APIs

  • LLM evaluation and benchmarking frameworks

  • Evaluation metrics: Precision, Recall, F1, BLEU, ROUGE, hallucination rate, latency, cost (a toy worked example follows this list)

  • Prompt engineering: zero-shot, few-shot, and system prompts

  • Python for automation, batch evaluation execution, and data analysis

  • Evaluation tools and frameworks:

    • OpenAI Evals

    • HuggingFace Evals

    • Promptfoo

    • RAGAS

    • DeepEval

    • LM Evaluation Harness

  • AI safety evaluation, bias testing, and security assessment
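
As the worked example promised above, the snippet below computes precision, recall, and F1 with scikit-learn and one simple hallucination-rate definition. All labels and flags are made up for illustration; they are not from this posting.

    # Toy worked example of common evaluation metrics (illustrative only).
    # Assumes scikit-learn is installed; all labels below are made up.
    from sklearn.metrics import precision_score, recall_score, f1_score

    # Binary labels for a toy classification-style eval, e.g. the model deciding
    # whether a claim is supported by a source document (1 = supported).
    ground_truth = [1, 1, 0, 1, 0, 1]   # reference labels
    predictions  = [1, 0, 0, 1, 1, 1]   # model outputs

    precision = precision_score(ground_truth, predictions)  # TP/(TP+FP) = 3/4
    recall    = recall_score(ground_truth, predictions)     # TP/(TP+FN) = 3/4
    f1        = f1_score(ground_truth, predictions)         # harmonic mean = 0.75

    # One simple hallucination-rate definition: share of outputs flagged as
    # unsupported by the reference material (flags here are made up).
    hallucination_flags = [0, 1, 0, 0, 1, 0]
    hallucination_rate = sum(hallucination_flags) / len(hallucination_flags)  # ~0.33

    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} "
          f"hallucination_rate={hallucination_rate:.2f}")

BLEU and ROUGE follow the same pattern with reference texts instead of binary labels, typically computed via a library such as the Hugging Face evaluate package.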

Functional Skills

  • Test design and test automation

  • Reproducible evaluation pipeline design (a brief sketch follows this list)

  • Model comparison and baseline creation

  • Strong analytical and problem-solving skills

  • Clear technical documentation and reporting

  • Cross-functional collaboration with AI/ML and product teams
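
As a brief sketch of the reproducibility and baseline-comparison points above, the snippet below records each evaluation run (config, config hash, timestamp, metrics) as JSON and diffs a candidate run against a stored baseline. File names, fields, and the "golden_set_v1.jsonl" dataset name are hypothetical, not taken from this posting.

    # Sketch of persisting evaluation runs for reproducibility and baseline
    # comparison (all file names and fields are illustrative).
    import hashlib
    import json
    from datetime import datetime, timezone
    from pathlib import Path

    def record_run(model: str, prompt_version: str, metrics: dict,
                   results_dir: str = "eval_runs") -> Path:
        """Persist config + metrics with a config hash and UTC timestamp."""
        config = {"model": model, "prompt_version": prompt_version,
                  "temperature": 0.0, "dataset": "golden_set_v1.jsonl"}
        run = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "config": config,
            # The hash lets later runs verify they used identical settings.
            "config_hash": hashlib.sha256(
                json.dumps(config, sort_keys=True).encode()).hexdigest()[:12],
            "metrics": metrics,
        }
        out_dir = Path(results_dir)
        out_dir.mkdir(exist_ok=True)
        out_path = out_dir / f"{model}_{run['config_hash']}.json"
        out_path.write_text(json.dumps(run, indent=2))
        return out_path

    def compare_to_baseline(run_path: Path, baseline_path: Path) -> dict:
        """Return metric deltas (candidate minus baseline) for shared metrics."""
        run = json.loads(run_path.read_text())
        baseline = json.loads(baseline_path.read_text())
        return {key: run["metrics"][key] - baseline["metrics"][key]
                for key in run["metrics"] if key in baseline["metrics"]}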

Published: about 16 hours ago
Expires: in 30 days