Automated Semantic Text Analysis Pipeline
- Budget: £363 (approx. $454)
- Proposals: 11
- Remote
- Project ID: #4170699
- Status: Expired
Description
Experience Level: Expert
Comprehensive Use Case Specification: Automated Semantic Text Analysis Pipeline
Objective
Develop an automated semantic text analysis pipeline that processes and analyses textual data extracted from documents. The pipeline enriches text with metadata for deeper insights and enables semantic search through a user-friendly interface. This stage of the project covers an MVP; the system should leverage AWS services such as Textract for text extraction, a text categorising stage with an easy-to-use GUI, all-mpnet-base-v2 for embedding, and Postgres with a vector extension.
This job posting is for the MVP stage only, but we must be mindful of the stage two development and ensure that all stage one MVP processes can scale rapidly and straightforwardly.
System Overview
The solution encompasses AWS services for storage and processing, a custom interface for metadata enrichment, all-mpnet-base-v2 for generating text embeddings, Postgres with a vector extension for efficient storage and retrieval of vectors, and a custom-built web interface for user interaction. RAG will be implemented with as broad a context as possible supplied to the model across a large document set.
Phase 1: MVP Stage
1. Document Storage and Processing Trigger
- Tool: Amazon S3.
- Process: Upload documents (PDFs initially) to designated S3 buckets. Documents will be renamed in accordance with a set naming convention, and key metadata relating to each document will be entered into the database for future reference. The upload triggers the subsequent text extraction process. For test purposes the uploads will be made manually; at a later stage a web scraper will be added that automatically places PDF documents into the relevant S3 buckets.
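As a sketch of step 1, the upload might look like the following. The bucket name, metadata fields, and the naming convention itself are placeholders to be agreed; only the overall shape (deterministic key, metadata recorded alongside the object) reflects the brief.

```python
import re
from datetime import date


def object_key(source: str, title: str, when: date) -> str:
    """Build an S3 key following a hypothetical naming convention:
    <source>/<YYYY>/<MM>/<slug>.pdf (the real convention is to be agreed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{source}/{when:%Y/%m}/{slug}.pdf"


if __name__ == "__main__":
    # Requires boto3 and configured AWS credentials; bucket name is illustrative.
    import boto3

    s3 = boto3.client("s3")
    key = object_key("manual-upload", "Quarterly Report 2024", date.today())
    s3.upload_file(
        "report.pdf", "my-ingest-bucket", key,
        ExtraArgs={"Metadata": {"source": "manual-upload"}},
    )
```

An S3 event notification on the bucket can then invoke the extraction step whenever a new object lands under the agreed prefix.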
2. Text Extraction
- Tool: AWS Textract.
- Process: Text is extracted from uploaded PDF documents and temporarily stored in S3 buckets to facilitate further processing.
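A minimal sketch of the Textract step, assuming configured AWS credentials and an illustrative bucket/key. Note that PDFs go through Textract's asynchronous text-detection API (the synchronous one accepts images only); the helper that collects LINE blocks works on the standard response shape.

```python
def lines_from_textract(response: dict) -> list:
    """Collect the text of LINE blocks from a Textract text-detection response."""
    return [b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"]


if __name__ == "__main__":
    import time
    import boto3  # AWS credentials and region are assumed to be configured

    textract = boto3.client("textract")
    # PDFs require the asynchronous API; bucket and key are placeholders.
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "my-ingest-bucket",
                                       "Name": "docs/example.pdf"}}
    )
    while True:
        resp = textract.get_document_text_detection(JobId=job["JobId"])
        if resp["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(2)
    print("\n".join(lines_from_textract(resp)))
```

In production the polling loop would be replaced by Textract's SNS completion notification, and the extracted text written back to a temporary S3 prefix as the brief describes.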
3. Text Enrichment
The developer is to advise on the best method of adding labels/categories to the text via an easy-to-use interface. Labels are to be applied at a granular level so that text snippets can be returned, giving the LLM context for formulating its responses from a broad range of documents without exceeding the token limit.
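One possible data model for the enrichment step, purely illustrative: each labelled snippet keeps its source document, position, and a list of categories applied through the GUI, so that context for the LLM can be assembled by label rather than by whole document. Field names are assumptions, not part of the brief.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snippet:
    """A granular, labelled unit of extracted text (field names are illustrative)."""
    doc_key: str                                      # S3 key of the source document
    position: int                                     # order of the snippet within the document
    body: str                                         # the snippet text itself
    labels: List[str] = field(default_factory=list)   # categories applied via the GUI


def filter_by_label(snippets: List[Snippet], label: str) -> List[Snippet]:
    """Return only snippets carrying a given label, e.g. to build a token-bounded LLM context."""
    return [s for s in snippets if label in s.labels]
```

Selecting context by label rather than whole document is what lets the pipeline draw on many documents at once without exceeding the model's token limit.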
4. Text Vectorization
- Embedding tool: all-mpnet-base-v2
- LLM: LLaMA 2, hosted on Amazon SageMaker.
- Process: The text is embedded with all-mpnet-base-v2 to generate vector embeddings, capturing semantic information for advanced analysis and search functionalities; LLaMA 2 serves as the LLM for response generation.
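A sketch of the vectorisation step. Because snippets must stay within the model's input limit, text is chunked before embedding; the word-based chunker below (window and overlap sizes are arbitrary choices, not specified in the brief) stands in for whatever granularity the enrichment stage settles on.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows so each chunk fits the embedder's input."""
    words = text.split()
    if not words:
        return []
    chunks, start = [], 0
    step = max_words - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += step
    return chunks


if __name__ == "__main__":
    # Requires: pip install sentence-transformers (assumed available where this runs).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    chunks = chunk_text(open("sample.txt").read())  # sample.txt is a placeholder
    embeddings = model.encode(chunks)  # shape: (n_chunks, 768)
```

all-mpnet-base-v2 produces 768-dimensional embeddings, which fixes the vector column width in the storage step that follows.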
5. Vector Storage
- Tool: Postgres with a vector extension
- Process: Text vectors are stored in the database, allowing for efficient management and retrieval of vectorized data for semantic searches.
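Assuming the vector extension is pgvector, the storage schema might look like the following; table and column names are illustrative. The `<=>` operator is pgvector's cosine-distance operator, and the vector width of 768 matches all-mpnet-base-v2's output.

```python
# Illustrative DDL for a pgvector-backed snippet store.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS snippets (
    id        serial PRIMARY KEY,
    doc_key   text,
    label     text,
    body      text,
    embedding vector(768)   -- all-mpnet-base-v2 produces 768-dimensional vectors
);
"""

# Nearest-neighbour query: <=> is pgvector's cosine-distance operator.
SEARCH_SQL = "SELECT body FROM snippets ORDER BY embedding <=> %s::vector LIMIT %s;"


def to_pgvector_literal(vec) -> str:
    """Format a list of floats as a pgvector input literal, e.g. '[0.100000,0.200000]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"
```

A query embedding formatted with `to_pgvector_literal` can then be passed as the parameter of `SEARCH_SQL` through any Postgres driver (e.g. psycopg).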
6. Front-end Web Application and Search Functionality
- Front-end Technology: React.js.
- Key Features:
- Semantic search input and results display.
- Email input field for collecting contact information for marketing purposes, forwarded to the client's email address.
- Homepage containing descriptive marketing text.
- 3 pages total: home page, interaction page, contact page, plus a pop-up with GDPR info. Graphics provided as template guidance.
- Back-end Technology: Python with FastAPI.
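At its core, the semantic search endpoint reduces to ranking stored snippets by similarity to the query embedding. A dependency-free sketch of that ranking step (in the MVP, pgvector performs it inside Postgres; this pure-Python version just shows the logic the FastAPI route would wrap):

```python
import math
from typing import Iterable, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: List[float],
          rows: Iterable[Tuple[str, List[float]]],
          k: int = 5) -> List[Tuple[str, float]]:
    """rows: (snippet_text, embedding) pairs. Returns the k best matches by similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


# In the FastAPI app, a route like the following (sketch) would wrap this step;
# `embed` and `fetch_rows` are hypothetical helpers, not defined here:
#
#   @app.get("/search")
#   def search(q: str, k: int = 5):
#       return top_k(embed(q), fetch_rows(), k)
```

Pushing the ranking down to pgvector's distance operator avoids loading all embeddings into the application process, which matters once Phase 2 scales the document set.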
Phase 2: Full Automation and Scaling
1. Automated Document Ingestion
- Process: A web scraping tool is implemented to automatically identify and upload new documents to the S3 bucket, facilitating a continuous flow of data into the pipeline without manual intervention.
2. Scalable Architecture
- Deployment: The application components are containerized using Docker and managed with Kubernetes (Amazon EKS), ensuring the system can scale efficiently to accommodate increased data volumes and user queries.
3. Enhanced Processing Capabilities
- Improvements: Integrate additional NLP and ML models for broader and more nuanced text analysis. Consider fine-tuning custom models for specific domain applications.
4. User registration and user management system integration.
Please note the attached contract agreement, which will be deemed agreed to upon acceptance of the project. Your price given on PPH will be deemed to be your full and final price, and you will be deemed to have fully understood the scope, brief, and specification.
To provide context, the project business plan has been uploaded. This is for context only and does not form part of the brief.
Insights
- Projects completed: 100% (25)
- Freelancers worked with: 18
- Projects awarded: 40% (15)
- Last project: 8 Feb 2024
- Location: United Kingdom