Local Council Chatbot utilising Llama2 and a dataset of PDF docs.
- Budget: £363 (approx. $456)
- Proposals: 16
- Remote
- #4174974
- OPPORTUNITY
- Expired
Description
Experience Level: Expert
Required: a full-stack developer with relevant experience in AWS services and LLM deployment.
Develop an MVP that provides a chat interface allowing users to query a dataset of local council documents, variously including minutes and policy documents: a dataset containing all information relating to the purpose, policies, news, and decision-making of that council.
The dataset would contain approximately 100 PDF documents, and the chatbot would return meaningful and coherent answers to user prompts, while providing reference links to the documents from which information in each response is taken.
The client acknowledges the current limitations of LLMs in returning responses to queries across multiple documents, especially given current token limits and processing cost restrictions. A developer is sought who can leverage techniques to embed metadata in the text, allowing approaches such as retrieval-augmented generation (RAG) to extract snippets from multiple documents relating to the query and collate them into a response to the user, while adhering to token limits.
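To make the token-limit constraint concrete, the collation step described above can be sketched as a greedy packer: rank snippets drawn from multiple documents, then admit the best ones until a token budget is exhausted. The snippet scores, field names, and the rough 4-characters-per-token estimate below are all illustrative assumptions, not part of the specification.

```python
# Hypothetical sketch: collate ranked snippets from multiple documents into a
# single LLM context without exceeding a token budget.

def estimate_tokens(text: str) -> int:
    """Rough token estimate (~4 characters per token for English text)."""
    return max(1, len(text) // 4)

def collate_snippets(snippets: list[dict], token_budget: int) -> list[dict]:
    """Greedily pack the highest-scoring snippets under the token budget."""
    chosen = []
    used = 0
    for snip in sorted(snippets, key=lambda s: s["score"], reverse=True):
        cost = estimate_tokens(snip["text"])
        if used + cost <= token_budget:
            chosen.append(snip)
            used += cost
    return chosen

snippets = [
    {"doc": "minutes_2023_06.pdf", "score": 0.91, "text": "The council approved the parking policy."},
    {"doc": "policy_parking.pdf", "score": 0.87, "text": "Residents may apply for one permit per household."},
    {"doc": "news_2023_07.pdf", "score": 0.40, "text": "The summer fair returns in July."},
]
context = collate_snippets(snippets, token_budget=25)
# The two highest-scoring snippets fit; the third would exceed the budget.
```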
Objective
Develop an automated semantic text analysis pipeline that processes and analyses textual data extracted from documents using Llama2. This pipeline enriches text with metadata for deeper insights and enables semantic search capabilities through a user-friendly interface. This stage of the project is for an MVP system, leveraging AWS services such as Textract for text extraction, a text-categorising stage with a simple-to-use GUI, all-mpnet-base-v2 for embedding, and Postgres with a vector extension.
This job posting is for the MVP stage only, but we must be mindful of the stage-two development and facilitate rapid and straightforward scalability in any stage-one MVP processes.
System Overview
The solution encompasses AWS services for storage and processing, a custom interface for metadata enrichment, all-mpnet-base-v2 for generating text embeddings, Postgres with a vector extension for efficient storage and retrieval of vectors, and a custom-built web interface for user interaction. RAG will be implemented so as to give the model as broad a context as possible across a large document set.
Phase 1: MVP Stage
1. Document Storage and Processing Trigger
Tool: Amazon S3.
Process: Upload documents (PDFs initially) to designated S3 buckets. Documents will be renamed in accordance with a set naming convention, and details of each document entered into the database. The upload triggers the subsequent text extraction process. For test purposes the uploads will be made manually; at a later stage a web scraper will be added that automatically places PDF documents into the relevant S3 buckets.
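As a sketch only, the renaming-and-upload step might look like the following. The council/doc-type/date/slug naming convention is an assumption for illustration; the client's actual convention would replace `make_object_key()`, and an S3 event notification on the bucket would then trigger extraction.

```python
# Hypothetical upload step: rename the PDF per a set convention, then upload.
import re
from datetime import date

def make_object_key(council: str, doc_type: str, published: date, title: str) -> str:
    """Build a deterministic S3 object key from document details (assumed convention)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{council}/{doc_type}/{published:%Y/%m}/{published:%Y-%m-%d}-{slug}.pdf"

def upload_document(bucket: str, local_path: str, key: str) -> None:
    """Upload the renamed PDF; an S3 event notification on the bucket then
    triggers the Textract extraction step (e.g. via Lambda)."""
    import boto3  # assumed available in the deployment environment
    boto3.client("s3").upload_file(local_path, bucket, key)

key = make_object_key("anytown", "minutes", date(2023, 6, 14), "Planning Committee Minutes")
# → "anytown/minutes/2023/06/2023-06-14-planning-committee-minutes.pdf"
```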
2. Text Extraction
- Tool: AWS Textract.
- Process: Text is extracted from uploaded PDF documents and temporarily stored in S3 buckets to facilitate further processing.
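For PDFs in S3, Textract runs asynchronously: start a text-detection job, then collect LINE blocks from the result. The sketch below is simplified (it polls rather than using an SNS completion notification, and omits `NextToken` pagination for multi-page results); bucket and key names are placeholders.

```python
# Hedged sketch of asynchronous Textract extraction for a PDF stored in S3.
import time

def lines_from_response(response: dict) -> list[str]:
    """Pull LINE blocks out of a Textract response."""
    return [b["Text"] for b in response.get("Blocks", []) if b.get("BlockType") == "LINE"]

def extract_pdf_text(bucket: str, key: str) -> list[str]:
    import boto3  # assumed available in the deployment environment
    textract = boto3.client("textract")
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # Simplified polling loop; production code would subscribe to an SNS topic
    # and would also follow NextToken to fetch all result pages.
    while True:
        result = textract.get_document_text_detection(JobId=job["JobId"])
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)
    return lines_from_response(result)
```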
3. Text Enrichment
Developer to advise on the best method of adding labels/categories to the text via an easy-to-use interface. Labels are to be added at a granular level so that text snippets can be returned from within the chunks of data together with relevant metadata. The purpose of this is to provide context to the LLM when formulating responses from a broad range of documents without exceeding the token limit.
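One possible enrichment shape, offered as an assumption for discussion rather than a fixed design: split the extracted text into paragraph-level chunks, each carrying granular metadata that a reviewer can edit in the GUI before the chunk is embedded.

```python
# Hypothetical chunk schema for the enrichment GUI; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str        # database id of the source PDF
    text: str          # the snippet itself
    labels: list[str] = field(default_factory=list)  # e.g. ["planning", "budget"]
    section: str = ""  # e.g. "Agenda item 4"

def chunk_document(doc_id: str, text: str) -> list[Chunk]:
    """Split on blank lines; the GUI then attaches labels per chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [Chunk(doc_id=doc_id, text=p) for p in paragraphs]

chunks = chunk_document("doc-42", "Agenda item 1.\n\nThe budget was approved.")
chunks[1].labels.append("budget")  # a reviewer tagging the second chunk
```

Keeping labels at chunk level rather than document level is what lets retrieval return snippets with their own context, as the step above requires.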
4. Text Vectorization
• Embedding tool: all-mpnet-base-v2
• LLM: Amazon SageMaker (using LLaMA 2).
Process: The text is processed with all-mpnet-base-v2 to generate vector embeddings, capturing semantic information for advanced analysis and search functionalities; LLaMA 2, hosted on SageMaker, formulates the conversational responses from the retrieved snippets.
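The embedding step can be sketched with the sentence-transformers library, which hosts all-mpnet-base-v2 (a model producing 768-dimensional vectors). The `embed` helper is an assumption about packaging, not a mandated interface; the cosine helper shows how such embeddings are compared at query time.

```python
# Hedged sketch of embedding with all-mpnet-base-v2 via sentence-transformers.
import math

def embed(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer  # assumed installed
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(texts, normalize_embeddings=True).tolist()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Similarity between two embeddings; used to rank snippets at query time."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm
```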
5. Vector Storage
Tool: Postgres with a vector extension
Process: Text vectors are stored in the database, allowing for efficient management and retrieval of vectorized data for semantic searches.
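With the pgvector extension, storage and retrieval might look like the following sketch. Table and column names are placeholders, and psycopg is assumed for database access; `<=>` is pgvector's cosine-distance operator.

```python
# Hedged sketch of vector storage and nearest-neighbour search with pgvector.

DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        serial PRIMARY KEY,
    doc_id    text NOT NULL,
    snippet   text NOT NULL,
    labels    text[] DEFAULT '{}',
    embedding vector(768)          -- all-mpnet-base-v2 output size
);
"""

# Rank stored chunks by cosine distance to the query embedding.
QUERY = """
SELECT doc_id, snippet
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT 5;
"""

def to_vector_literal(embedding: list[float]) -> str:
    """Format a Python list as a pgvector input literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:g}" for x in embedding) + "]"
```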
6. Front-end Web Application and Search Functionality
Front-end Technology:
• React.js.
Key Features:
• Semantic search input and results display.
• Email input field for collecting contact information for marketing purposes, forwarded to the client's email address.
• Homepage containing descriptive marketing text.
Three pages total: home page, interaction page, and contact page, plus a pop-up with GDPR info. Graphics provided as template guidance.
Back-end Technology:
• Python with FastAPI.
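A minimal back-end sketch under these choices: the retrieval-and-answer logic is kept as a pure function so it can be tested without a server, with FastAPI wiring around it. `retrieve()` and `generate()` stand in for the pgvector search and the SageMaker LLaMA 2 endpoint respectively; both are assumptions at this stage.

```python
# Hypothetical FastAPI back-end skeleton for the semantic search endpoint.

def answer_query(question: str, retrieve, generate) -> dict:
    """Retrieve snippets, generate an answer, and return it with source links."""
    snippets = retrieve(question)
    answer = generate(question, snippets)
    return {"answer": answer, "sources": sorted({s["doc"] for s in snippets})}

def create_app(retrieve, generate):
    from fastapi import FastAPI  # assumed installed
    app = FastAPI()

    @app.get("/search")
    def search(q: str):
        return answer_query(q, retrieve, generate)

    return app
```

Returning the `sources` list alongside the answer is what supports the requirement above for reference links to the documents behind each response.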
7. Fine Tuning
Allow for fine-tuning based on a series of questions and responses to be provided by the client, until coherent responses to queries are achieved.
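One way to operationalise "until coherent responses are achieved" is an acceptance loop: score the chatbot's answers against the client's reference Q&A pairs and iterate (prompt tuning, re-chunking, or LLaMA 2 fine-tuning) until the score clears an agreed threshold. Keyword overlap below is a deliberately simple stand-in metric, an assumption rather than a proposal.

```python
# Hedged sketch of the acceptance loop for the client-supplied Q&A pairs.

def overlap_score(reference: str, answer: str) -> float:
    """Fraction of reference keywords present in the answer (stand-in metric)."""
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ref & ans) / len(ref) if ref else 0.0

def evaluate(qa_pairs: list[tuple[str, str]], ask) -> float:
    """Average score of ask(question) against each reference answer."""
    scores = [overlap_score(ref, ask(q)) for q, ref in qa_pairs]
    return sum(scores) / len(scores)
```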
Phase 2: Full Automation and Scaling
Beyond the scope of this job.
Notes:
The developer is to provide feedback on the capabilities of the technologies and is free to offer their own guidance and suggestions. However, the functionality of the system in providing coherent responses based on text snippets drawn from a large dataset is both the challenge and the absolute requirement.
Please only bid with your full and final price. Placeholders will not be accepted.
Completion within approximately two weeks.
Please respond by explaining how you would handle the text enrichment.
Client insights
- Projects completed: 25 (100%)
- Freelancers worked with: 18
- Projects awarded: 15 (40%)
- Last project: 8 Feb 2024
- United Kingdom