
LLM Agent Evaluation Engineer / OpenClaw AI Task Specialist
£363 (approx. $487)
- Posted:
- Proposals: 35
- Remote
- #4503440
- OPPORTUNITY
- Open for Proposals
Description
Experience Level: Expert
Role Overview
We are looking for a technically strong AI Agent Evaluation Engineer to design, execute, and evaluate complex multi-step agent workflows using OpenClaw. The ideal candidate should understand how LLM agents coordinate across tools, handle real-world constraints, reason through messy information, and produce verifiable final outputs. This role combines prompt/task design, AI evaluation, QA testing, Python-based verification, and detailed technical analysis.
Key Responsibilities
- Design realistic OpenClaw agent tasks that require multi-stage coordination, decision logic, tool usage, and final artifact generation.
- Build task prompts that test agent capabilities across data acquisition, reasoning, processing, and output generation.
- Run the same task across multiple AI models while maintaining fair starting conditions and comparable environments.
- Analyze full model trajectories to identify reasoning mistakes, tool-use failures, hallucinations, safety issues, and instruction-following gaps.
- Create task-specific rubrics to evaluate agent architecture, tool coordination, reasoning quality, output quality, and final task completion.
- Write and maintain pytest unit tests in verifiers.py to validate whether the final system state meets the task requirements.
- Inspect JSON/state files, workspace outputs, generated files, model traces, and final artifacts to confirm task success or failure.
- Create corrected “silver” trajectories by guiding the best model output until it satisfies all task requirements.
- Document Model A / Model B failures clearly with evidence-based justifications.
- Package and organize trajectory files, workspace files, test files, and final outputs for submission.
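As a rough illustration of the verifiers.py responsibility above, the following is a minimal sketch of a pytest-style check on a final state file. The state layout (`status`, `report_path`, `items`) and the helper names are assumptions made for this example only; they are not taken from OpenClaw or from the client's actual task format.

```python
import json
from pathlib import Path

# Hypothetical requirements for an illustrative task (not an OpenClaw schema).
REQUIRED_KEYS = {"status", "report_path", "items"}


def load_state(path):
    """Read the JSON state file the agent left behind."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def find_violations(state):
    """Return human-readable requirement violations; an empty list means pass."""
    violations = []
    missing = REQUIRED_KEYS - set(state)
    if missing:
        violations.append(f"missing keys: {sorted(missing)}")
    if state.get("status") != "completed":
        violations.append(f"status is {state.get('status')!r}, expected 'completed'")
    if not state.get("items"):
        violations.append("items list is empty")
    return violations


def test_final_state_meets_requirements(tmp_path):
    # tmp_path is pytest's built-in temporary-directory fixture.
    state_file = tmp_path / "state.json"
    state_file.write_text(
        json.dumps({"status": "completed", "report_path": "out/report.csv", "items": [1, 2]}),
        encoding="utf-8",
    )
    assert find_violations(load_state(state_file)) == []
```

Returning a list of violations, rather than asserting on the first problem, keeps the failure evidence readable when documenting why a given model run did not pass.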
Required Technical Skills
- Strong understanding of LLM agents, agent workflows, tool use, and multi-step reasoning.
- Experience working with AI models such as GPT, Claude, Gemini, or similar LLM systems.
- Strong prompt engineering and task design skills, especially for complex multi-step workflows.
- Python experience, especially for writing validation scripts and unit tests.
- Hands-on experience with pytest or similar testing frameworks.
- Ability to read and validate JSON, CSV, workspace files, logs, and structured output files.
- Familiarity with API-based workflows, browser tools, file systems, email/calendar tools, search tools, or other external integrations.
- Strong QA/testing mindset, including edge-case analysis, failure detection, and reproducible evaluation.
- Ability to design objective rubrics with clear pass/fail criteria.
- Experience analyzing model outputs for hallucinations, incomplete reasoning, wrong tool use, and missing requirements.
- Strong technical writing skills for documenting failures, evaluation decisions, and final model rankings.
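One possible way to express a rubric with objective pass/fail criteria is sketched below. The criterion names and the fields of the run summary (`artifact_path`, `failed_tool_calls`, `missing_fields`) are hypothetical and chosen only to illustrate the idea; a real rubric would be task-specific.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[dict], bool]  # returns True when the criterion passes


# Hypothetical criteria over a hypothetical run summary dictionary.
RUBRIC = [
    Criterion("final artifact exists", lambda run: bool(run.get("artifact_path"))),
    Criterion("no failed tool calls", lambda run: run.get("failed_tool_calls", 0) == 0),
    Criterion("all required fields present", lambda run: not run.get("missing_fields")),
]


def grade(run):
    """Evaluate every criterion and return a name -> pass/fail mapping."""
    return {criterion.name: criterion.check(run) for criterion in RUBRIC}


def passed(run):
    """A run passes only if every criterion passes."""
    return all(grade(run).values())
```

Because each criterion is a named boolean check, the same rubric can be applied unchanged to Model A and Model B, and the per-criterion results double as evidence in the written justification.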
Preferred Skills
- Experience with AI agent frameworks, workflow automation, or orchestration tools.
- Experience with web scraping, browser automation, Playwright, Selenium, or data extraction workflows.
- Familiarity with data normalization, scoring logic, ranking systems, and rule-based decision engines.
- Understanding of AI safety, privacy boundaries, permission handling, and tool-use risk.
- Experience comparing model performance or working on RLHF, LLM evaluation, data annotation, or AI benchmarking projects.
- Basic familiarity with Git, virtual environments, package management, and reproducible test setup.
- Ability to work with live test accounts and maintain clean, equivalent environments across model runs.
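For the point about clean, equivalent environments, one simple approach (an assumption for illustration, not something the posting prescribes) is to fingerprint the starting workspace before each model run and compare the digests. The function name `workspace_fingerprint` is hypothetical.

```python
import hashlib
from pathlib import Path


def workspace_fingerprint(root):
    """Hash relative paths and file contents in a stable, sorted order."""
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Include the relative path so renamed files change the fingerprint.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()
```

If two workspaces yield the same fingerprint, their files were identical at the moment of hashing; this says nothing about external state such as live test accounts, which would need separate checks.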
Ideal Candidate Profile
The ideal candidate is not just a prompt writer. They should be comfortable acting like a QA engineer, automation tester, AI evaluator, and technical analyst at the same time. They need to understand how agents should break down a complex task, use tools correctly, recover from friction, produce structured outputs, and leave behind evidence that can be tested.
Example Backgrounds That Fit
- AI Evaluation Engineer
- QA Automation Engineer with LLM experience
- Python Test Engineer
- AI Prompt Engineer with strong technical testing skills
- Data Annotation / RLHF Specialist with coding ability
- Full Stack or Backend Developer interested in AI agent evaluation
- Technical Product Analyst with Python and AI workflow experience
Marc D.
0% (0)
Projects Completed: -
Freelancers worked with: -
Projects awarded: 0%
Last project: 18 Jun 2026
United States
Clarification Board

Hi Marc,
Before quoting a fixed cost or timeline, I'd like to understand the following:
1. What version of the OpenClaw framework are you currently using?
2. Are evaluation datasets already defined or should I design them from scratch?
3. What is the expected volume of tasks per week/month?
4. Do you already have a baseline rubric, or should I build the initial evaluation schema?
5. Will this be integrated into CI/CD or manual evaluation pipelines?
Looking forward to your response.