
Build Research-Grade Coding Tasks for AI Evaluation
$100
- Posted:
- Proposals: 14
- Remote
- #4498055
- Open for Proposals
Description
Experience Level: Entry
Estimated project duration: Ongoing
I need help building realistic, terminal-based STEM research tasks used to evaluate frontier AI models (GPT, Gemini, etc.).
What you'll build: A self-contained coding task that looks like real research work (analyzing datasets, running simulations, validating hypotheses, comparing methods). Not a textbook problem.
Each submission must include:
- instruction.md (workflow, inputs, outputs, success criteria)
- Reproducible Docker environment with data
- Oracle solution (solve.sh) that fully solves the task
- Deterministic tests for verification
- task.toml metadata
- All packaged into one zip
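As a rough illustration of the packaging requirement, the sketch below checks a zip for the three file names the posting mentions. The posting does not say how the archive should be laid out, so matching by base name at any depth is an assumption, and `missing_files` is a hypothetical helper, not part of any official tooling. The Docker environment and tests are not checked because their file names are not given.

```python
import io
import zipfile

# File names taken from the posting; the archive layout is an assumption.
REQUIRED_FILES = ("instruction.md", "solve.sh", "task.toml")


def missing_files(zip_source):
    """Return the required file names absent from the archive.

    Files are matched by base name, so they may sit in any subdirectory.
    """
    with zipfile.ZipFile(zip_source) as archive:
        base_names = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in REQUIRED_FILES if name not in base_names]


if __name__ == "__main__":
    # Build a small in-memory archive that lacks solve.sh.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("task/instruction.md", "# Task\n")
        archive.writestr("task/task.toml", "")
    buffer.seek(0)
    print(missing_files(buffer))  # prints ['solve.sh']
```

Running a check like this before uploading would catch an incomplete package early. It says nothing about whether the contents are correct.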
Quality bar:
- Multi-step, research-grade workflow
- Hard enough that frontier models fail more than 80% of the time
- Oracle passes local tests 3 out of 3 times
- Objectively verifiable outputs
- No LLM-generated content allowed
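The two numeric thresholds in the quality bar can be written as a small check. This is a sketch under assumptions: the posting does not say how many model attempts are measured or how results are recorded, so the boolean lists and the `meets_quality_bar` helper are hypothetical.

```python
def meets_quality_bar(oracle_runs, model_runs):
    """Check the two numeric criteria from the posting.

    oracle_runs: list of booleans, one per local oracle run (True = tests passed).
    model_runs:  list of booleans, one per frontier-model attempt (True = tests passed).
    The number of model attempts is not specified in the posting.
    """
    # Oracle must pass 3 out of 3 local runs.
    oracle_ok = len(oracle_runs) == 3 and all(oracle_runs)
    if not model_runs:
        return False
    failures = model_runs.count(False)
    # "More than 80%" as a strict inequality, in integers to avoid float rounding.
    hard_enough = failures * 5 > len(model_runs) * 4
    return oracle_ok and hard_enough


if __name__ == "__main__":
    oracle = [True, True, True]
    nine_of_ten_fail = [False] * 9 + [True]
    eight_of_ten_fail = [False] * 8 + [True] * 2
    print(meets_quality_bar(oracle, nine_of_ten_fail))   # prints True
    print(meets_quality_bar(oracle, eight_of_ten_fail))  # prints False
```

The strict inequality means exactly 80% failures does not qualify, which follows the "more than 80%" wording.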
Who's a fit: STEM background (biology, chemistry, physics, ML, data science, etc.) with strong Python and Docker skills.
Payout: $100 per accepted submission.
Please share your research background and a code sample when applying.
Kamil ..
0% (0)
Projects Completed: -
Freelancers worked with: -
Projects awarded: 0%
Last project: 26 May 2026
United States
