Automated Semantic Text Analysis Pipeline
- Budget: £363 (approx. $454)
- Proposals: 11
- Remote
- Project ID: #4170699
- Status: Expired
Description
Experience Level: Expert
Comprehensive Use Case Specification: Automated Semantic Text Analysis Pipeline
Objective
Develop an automated semantic text analysis pipeline that processes and analyses textual data extracted from documents. The pipeline enriches text with metadata for deeper insights and enables semantic search through a user-friendly interface. This stage of the project covers an MVP; the system should leverage AWS services such as Textract for text extraction, a text categorising stage with an easy-to-use GUI, all-mpnet-base-v2 for embedding, and Postgres with a vector extension.
This job posting is for the MVP stage only, but we must be mindful of the stage two development and ensure that all stage one MVP processes can scale rapidly and straightforwardly.
System Overview
The solution encompasses AWS services for storage and processing, a custom interface for metadata enrichment, all-mpnet-base-v2 for generating text embeddings, Postgres with a vector extension for efficient storage and retrieval of vectors, and a custom-built web interface for user interaction. RAG will be implemented with as broad a context as possible supplied to the model across a large document set.
Phase 1: MVP Stage
1. Document Storage and Processing Trigger
- Tool: Amazon S3.
- Process: Upload documents (PDFs initially) to designated S3 buckets. Documents will be renamed in accordance with a set naming convention, and key metadata relating to each document will be entered into the database for future reference. The upload triggers the subsequent text extraction process. For test purposes the uploads will be made manually; at a later stage a web scraper will be added that automatically places PDF documents into the relevant S3 buckets.
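As a sketch of step 1, the upload might look like the following. The bucket name, metadata fields, and the naming convention itself are placeholders to be agreed; only the overall shape (deterministic key, metadata recorded alongside the object) reflects the brief.

```python
import re
from datetime import date


def object_key(source: str, title: str, when: date) -> str:
    """Build an S3 key following a hypothetical naming convention:
    <source>/<YYYY>/<MM>/<slug>.pdf (the real convention is to be agreed)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{source}/{when:%Y/%m}/{slug}.pdf"


if __name__ == "__main__":
    # Requires boto3 and configured AWS credentials; bucket name is illustrative.
    import boto3

    s3 = boto3.client("s3")
    key = object_key("manual-upload", "Quarterly Report 2024", date.today())
    s3.upload_file(
        "report.pdf", "my-ingest-bucket", key,
        ExtraArgs={"Metadata": {"source": "manual-upload"}},
    )
```

An S3 event notification on the bucket can then invoke the extraction step whenever a new object lands under the agreed prefix.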
2. Text Extraction
- Tool: AWS Textract.
- Process: Text is extracted from uploaded PDF documents and temporarily stored in S3 buckets to facilitate further processing.
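A minimal sketch of the Textract step, assuming configured AWS credentials and an illustrative bucket/key. Note that PDFs go through Textract's asynchronous text-detection API (the synchronous one accepts images only); the helper that collects LINE blocks works on the standard response shape.

```python
def lines_from_textract(response: dict) -> list:
    """Collect the text of LINE blocks from a Textract text-detection response."""
    return [b["Text"] for b in response.get("Blocks", []) if b["BlockType"] == "LINE"]


if __name__ == "__main__":
    import time
    import boto3  # AWS credentials and region are assumed to be configured

    textract = boto3.client("textract")
    # PDFs require the asynchronous API; bucket and key are placeholders.
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": "my-ingest-bucket",
                                       "Name": "docs/example.pdf"}}
    )
    while True:
        resp = textract.get_document_text_detection(JobId=job["JobId"])
        if resp["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(2)
    print("\n".join(lines_from_textract(resp)))
```

In production the polling loop would be replaced by Textract's SNS completion notification, and the extracted text written back to a temporary S3 prefix as the brief describes.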
3. Text Enrichment
The developer is to advise on the best method of adding labels/categories to the text via an easy-to-use interface. Labels are to be applied at a granular level so that text snippets can be returned, giving the LLM context for formulating its responses from a broad range of documents without exceeding the token limit.
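One possible data model for the enrichment step, purely illustrative: each labelled snippet keeps its source document, position, and a list of categories applied through the GUI, so that context for the LLM can be assembled by label rather than by whole document. Field names are assumptions, not part of the brief.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snippet:
    """A granular, labelled unit of extracted text (field names are illustrative)."""
    doc_key: str                                      # S3 key of the source document
    position: int                                     # order of the snippet within the document
    body: str                                         # the snippet text itself
    labels: List[str] = field(default_factory=list)   # categories applied via the GUI


def filter_by_label(snippets: List[Snippet], label: str) -> List[Snippet]:
    """Return only snippets carrying a given label, e.g. to build a token-bounded LLM context."""
    return [s for s in snippets if label in s.labels]
```

Selecting context by label rather than whole document is what lets the pipeline draw on many documents at once without exceeding the model's token limit.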
4. Text Vectorization
- Embedding tool: all-mpnet-base-v2
- LLM: LLaMA 2, hosted on Amazon SageMaker.
- Process: The text is embedded with all-mpnet-base-v2 to generate vector embeddings, capturing semantic information for advanced analysis and search functionalities; LLaMA 2 serves as the LLM for response generation.
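A sketch of the vectorisation step. Because snippets must stay within the model's input limit, text is chunked before embedding; the word-based chunker below (window and overlap sizes are arbitrary choices, not specified in the brief) stands in for whatever granularity the enrichment stage settles on.

```python
from typing import List


def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> List[str]:
    """Split text into overlapping word windows so each chunk fits the embedder's input."""
    words = text.split()
    if not words:
        return []
    chunks, start = [], 0
    step = max_words - overlap
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_words]))
        start += step
    return chunks


if __name__ == "__main__":
    # Requires: pip install sentence-transformers (assumed available where this runs).
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    chunks = chunk_text(open("sample.txt").read())  # sample.txt is a placeholder
    embeddings = model.encode(chunks)  # shape: (n_chunks, 768)
```

all-mpnet-base-v2 produces 768-dimensional embeddings, which fixes the vector column width in the storage step that follows.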
5. Vector Storage
- Tool: Postgres with a vector extension
- Process: Text vectors are stored in the database, allowing for efficient management and retrieval of vectorized data for semantic searches.
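Assuming the vector extension is pgvector, the storage schema might look like the following; table and column names are illustrative. The `<=>` operator is pgvector's cosine-distance operator, and the vector width of 768 matches all-mpnet-base-v2's output.

```python
# Illustrative DDL for a pgvector-backed snippet store.
DDL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS snippets (
    id        serial PRIMARY KEY,
    doc_key   text,
    label     text,
    body      text,
    embedding vector(768)   -- all-mpnet-base-v2 produces 768-dimensional vectors
);
"""

# Nearest-neighbour query: <=> is pgvector's cosine-distance operator.
SEARCH_SQL = "SELECT body FROM snippets ORDER BY embedding <=> %s::vector LIMIT %s;"


def to_pgvector_literal(vec) -> str:
    """Format a list of floats as a pgvector input literal, e.g. '[0.100000,0.200000]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"
```

A query embedding formatted with `to_pgvector_literal` can then be passed as the parameter of `SEARCH_SQL` through any Postgres driver (e.g. psycopg).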
6. Front-end Web Application and Search Functionality
- Front-end Technology: React.js.
- Key Features:
- Semantic search input and results display.
- Email input field for collecting contact information for marketing purposes, forwarded to the client's email address.
- Homepage containing descriptive marketing text.
- 3 pages total: home page, interaction page, contact page, plus a pop-up with GDPR info. Graphics provided as template guidance.
- Back-end Technology: Python with FastAPI.
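At its core, the semantic search endpoint reduces to ranking stored snippets by similarity to the query embedding. A dependency-free sketch of that ranking step (in the MVP, pgvector performs it inside Postgres; this pure-Python version just shows the logic the FastAPI route would wrap):

```python
import math
from typing import Iterable, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query_vec: List[float],
          rows: Iterable[Tuple[str, List[float]]],
          k: int = 5) -> List[Tuple[str, float]]:
    """rows: (snippet_text, embedding) pairs. Returns the k best matches by similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in rows]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]


# In the FastAPI app, a route like the following (sketch) would wrap this step;
# `embed` and `fetch_rows` are hypothetical helpers, not defined here:
#
#   @app.get("/search")
#   def search(q: str, k: int = 5):
#       return top_k(embed(q), fetch_rows(), k)
```

Pushing the ranking down to pgvector's distance operator avoids loading all embeddings into the application process, which matters once Phase 2 scales the document set.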
Phase 2: Full Automation and Scaling
1. Automated Document Ingestion
- Process: A web scraping tool is implemented to automatically identify and upload new documents to the S3 bucket, facilitating a continuous flow of data into the pipeline without manual intervention.
2. Scalable Architecture
- Deployment: The application components are containerized using Docker and managed with Kubernetes (Amazon EKS), ensuring the system can scale efficiently to accommodate increased data volumes and user queries.
3. Enhanced Processing Capabilities
- Improvements: Integrate additional NLP and ML models for broader and more nuanced text analysis. Consider fine-tuning custom models for specific domain applications.
4. User registration and user management system integration.
Please note the attached contract agreement, which will be deemed agreed to upon acceptance of the project. Your price given on PPH will be deemed to be your full and final price, and you will be deemed to have fully understood the scope, brief, and specification.
To provide context, the project business plan has been uploaded. This is for context only and does not form part of the brief.
Insights
- Projects completed: 100% (25)
- Freelancers worked with: 18
- Projects awarded: 40% (15)
- Last project: 8 Feb 2024
- Location: United Kingdom