
Audio/Video Processing Prototype
€25/hr (approx. $29/hr)
- Posted:
- Proposals: 15
- Remote
- #4493334
- Open for Proposals
Description
Experience Level: Intermediate
Estimated project duration: 1 - 6 months
We are building a prototype that processes audio and video data and generates structured outputs (e.g. speaking time, activity levels, simple lesson analysis).
The focus is on practical implementation, not research.
You will build:
- processing of audio/video input (files or streams)
- integration of existing tools:
+ speech-to-text (e.g. Whisper)
+ speaker diarization
- calculation of simple metrics:
+ speaking time per speaker
+ silence / overlap
- generation of structured output (JSON / API)
- simple backend (FastAPI)
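As an illustration of the metrics step, here is a minimal sketch, assuming the diarization tool returns segments as `(speaker, start, end)` tuples in seconds. The function names, the input shape, and the output keys are hypothetical and not part of the project specification.

```python
import json
from collections import defaultdict


def merge_intervals(intervals):
    """Merge overlapping or touching (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def compute_metrics(segments, total_duration):
    """Compute speaking time per speaker, silence, and overlap.

    segments: iterable of (speaker, start, end) in seconds (assumed format).
    total_duration: length of the recording in seconds.
    """
    per_speaker = defaultdict(list)
    for speaker, start, end in segments:
        if end > start:  # ignore empty or inverted segments
            per_speaker[speaker].append((start, end))

    # Merge each speaker's own segments so repeated segments are not double counted.
    merged = {s: merge_intervals(iv) for s, iv in per_speaker.items()}
    speaking_time = {s: sum(e - b for b, e in iv) for s, iv in merged.items()}

    # Sweep over start/end events to measure time with >= 1 and >= 2 active speakers.
    events = []
    for intervals in merged.values():
        for begin, end in intervals:
            events.append((begin, 1))
            events.append((end, -1))
    events.sort()  # at equal times, ends sort before starts

    active, prev, speech, overlap = 0, 0.0, 0.0, 0.0
    for time, delta in events:
        span = time - prev
        if active >= 1:
            speech += span
        if active >= 2:
            overlap += span
        active += delta
        prev = time

    return {
        "total_duration_s": round(total_duration, 3),
        "speaking_time_s": {s: round(t, 3) for s, t in speaking_time.items()},
        "overlap_s": round(overlap, 3),
        "silence_s": round(max(0.0, total_duration - speech), 3),
    }


if __name__ == "__main__":
    example = [
        ("teacher", 0.0, 10.0),
        ("student", 8.0, 12.0),
        ("teacher", 15.0, 20.0),
    ]
    print(json.dumps(compute_metrics(example, total_duration=25.0), indent=2))
```

For the example segments, the teacher speaks for 15 s and the student for 4 s, with 2 s of overlap and 8 s of silence in a 25 s recording. The resulting dictionary could be returned directly from a FastAPI endpoint as JSON.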
Tech stack (indicative)
- Python
- FastAPI
- FFmpeg
- Whisper (or similar)
- PostgreSQL (optional)
- Docker (nice to have)
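To show how FFmpeg might fit into this stack, here is a minimal sketch that builds a command for extracting a mono 16 kHz WAV track from a video file, a format speech-to-text tools commonly accept. The helper name `build_extract_audio_cmd` and the file names are hypothetical, and the sample rate is an assumption rather than a project requirement.

```python
import subprocess


def build_extract_audio_cmd(src, dst, sample_rate=16000):
    """Build an FFmpeg argument list that writes a mono PCM WAV file."""
    return [
        "ffmpeg",
        "-y",                    # overwrite the output file if it exists
        "-i", str(src),          # input audio or video file
        "-vn",                   # drop any video stream
        "-ac", "1",              # downmix to one channel
        "-ar", str(sample_rate), # resample to the target rate
        "-c:a", "pcm_s16le",     # 16-bit little-endian PCM
        str(dst),
    ]


def extract_audio(src, dst, sample_rate=16000):
    """Run FFmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_extract_audio_cmd(src, dst, sample_rate), check=True)
```

Calling `extract_audio("lesson.mp4", "lesson.wav")` requires the `ffmpeg` binary on the PATH. Passing the arguments as a list avoids shell quoting issues with file names.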
Profile
- 2–5 years of experience with Python
- experience with backend/API development
- experience with audio/video processing is a plus
- able to work independently from clear specifications
- pragmatic and solution-oriented
Practical
- freelance / part-time or full-time
- remote
- start ASAP
- duration: 4–8 weeks (initial phase)
This is not an AI research role. You will use existing tools and focus on building a working system.
To apply, please include:
- relevant projects
- experience with Python/APIs
- short explanation of how you would approach this technically
- availability
Kris V.
0% (0)
- Projects completed: -
- Freelancers worked with: -
- Projects awarded: 0%
- Last project: 6 May 2026
Belgium
Clarification Board
1/ What type of audio/video inputs are we primarily dealing with? Are these pre-recorded files, live streams, or both?
2/ How accurate does the speaker diarization need to be for your use case? Is approximate speaker separation acceptable or do you need high precision?
3/ What specific metrics are critical beyond speaking time and silence? Do you want advanced insights later like engagement scoring or sentiment?