Build Research-Grade Coding Tasks for AI Evaluation

Ended at: 25/06/2026

Fixed Price

$100

Posted: 2 months ago
Proposals: 18
Remote
#4498055
Expired

Description

Experience Level: Entry

Estimated project duration: Ongoing

I need help building realistic, terminal-based STEM research tasks used to evaluate frontier AI models (GPT, Gemini, etc.).

What you'll build: A self-contained coding task that looks like real research work (analyzing datasets, running simulations, validating hypotheses, comparing methods). Not a textbook problem.

Each submission must include:

instruction.md (workflow, inputs, outputs, success criteria)

Reproducible Docker environment with data

Oracle solution (solve.sh) that fully solves the task

Deterministic tests for verification

task.toml metadata

All packaged into one zip
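
As a rough illustration of the packaging requirement above, the sketch below checks that a zip contains the three files the posting names explicitly. The posting does not say how the Docker files, data, or tests should be named or laid out, so those are left out. The helper names (missing_files, make_zip) are hypothetical and only serve this example.

```python
import io
import zipfile

# File names taken from the posting. The Docker environment and the tests are
# not checked because the posting does not name those files.
REQUIRED_FILES = ("instruction.md", "solve.sh", "task.toml")


def missing_files(zip_bytes: bytes) -> list[str]:
    """Return the required file names that appear nowhere in the archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        # Compare on base names so a top-level folder inside the zip is allowed
        # (an assumption; the posting does not specify the layout).
        basenames = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in REQUIRED_FILES if name not in basenames]


def make_zip(names: list[str]) -> bytes:
    """Build an in-memory zip of empty files, used here only for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "")
    return buffer.getvalue()


if __name__ == "__main__":
    incomplete = make_zip(["my_task/instruction.md", "my_task/solve.sh"])
    print(missing_files(incomplete))
```

Running a check like this before uploading is optional; it only catches missing files, not whether their contents meet the requirements.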

Quality bar:

Multi-step, research-grade workflow

Hard enough that frontier models fail more than 80% of the time

Oracle passes local tests 3 out of 3 times

Objectively verifiable outputs

No LLM-generated content allowed
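
The two numeric criteria in the quality bar can be expressed as a small check. This is a minimal sketch under stated assumptions: the posting does not say how many model attempts are used to estimate the failure rate, so the function accepts any non-empty list of pass/fail results. The function name is hypothetical.

```python
def meets_quality_bar(oracle_runs: list[bool], model_runs: list[bool]) -> bool:
    """Check the oracle and difficulty criteria from recorded pass/fail results.

    oracle_runs: one entry per local oracle run, True if the tests passed.
    model_runs: one entry per frontier-model attempt, True if the tests passed.
    """
    # "Oracle passes local tests 3 out of 3 times"
    oracle_ok = len(oracle_runs) == 3 and all(oracle_runs)

    # "Frontier models fail more than 80% of the time"; the number of
    # attempts is an assumption left to the caller.
    if not model_runs:
        return False
    failure_rate = model_runs.count(False) / len(model_runs)

    return oracle_ok and failure_rate > 0.8


if __name__ == "__main__":
    # Nine failures in ten attempts is a 90% failure rate, above the threshold.
    print(meets_quality_bar([True, True, True], [False] * 9 + [True]))
```

A failure rate of exactly 80% does not satisfy the check, since the posting asks for more than 80%.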

Who's a fit: STEM background (biology, chemistry, physics, ML, data science, etc.) with strong Python and Docker skills.

Payout: $100 per accepted submission.

Please share your research background and a code sample when applying.

Description

Kamil ..
