ChromaQuery is an AI-powered knowledge retrieval system that integrates retrieval-augmented generation (RAG), web scraping, and ChromaDB for accurate and real-time responses. The system uses OpenAI embeddings and vector search to retrieve relevant articles and generate contextual answers.
The project consists of three parts:
- RAG (Retrieval-Augmented Generation): Combines document retrieval and language generation to enhance response accuracy using OpenAI embeddings and vector-based search.
- Web Scraping: Extracts and updates content from online sources using BeautifulSoup and requests.
- ChromaDB: Stores and indexes documents and their embeddings for efficient retrieval using semantic search.
The ChromaQuery system follows a modular design that integrates several key components for efficient knowledge retrieval and response generation. Below is an overview of the main architecture, followed by illustrative code sketches for each stage:
- User Input: The system starts by accepting a query from the user (e.g., "Tell me something related to crypto").
- Query Embedding: The input query is converted into an embedding using OpenAI's embedding model, which transforms the text into a numerical vector.
- Article Fetching: The system fetches technology-related articles from a predefined source using web scraping (via BeautifulSoup and requests).
- Metadata & Embedding Storage: For each article, the system stores the embedding (semantic vector) in ChromaDB, with the title and link saved as metadata for easy retrieval.
- Persistence: ChromaDB provides persistent storage to keep document embeddings and metadata across sessions.
- Query Matching: Once the user query is embedded, it is matched against the embeddings stored in ChromaDB using vector-based search to find the most relevant documents.
- Top Results: The system retrieves the top documents based on semantic similarity to the query embedding.
- Full Text Retrieval: After retrieving relevant article links, the system scrapes the content of these articles (handling multiple sources) to extract detailed information.
- Text Chunking: The article content is split into smaller chunks for more granular search and indexing.
- Chunk Storage: The system stores these article chunks (along with their embeddings) in a new ChromaDB collection, ensuring fast access and retrieval.
- Answer Synthesis: The system then queries the ChromaDB collection to retrieve the most relevant chunks and generates a contextual response using the OpenAI model.
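
The article-ingestion stage (fetching, embedding, and metadata storage) might look like the minimal sketch below. It is illustrative rather than the project's exact code: the listing URL, CSS selector, collection name, and embedding model (`text-embedding-3-small`) are all assumptions, and the article title is used here as the text that gets embedded.

```python
import chromadb
import requests
from bs4 import BeautifulSoup
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed(text: str) -> list[float]:
    """Convert text into a numerical vector with an OpenAI embedding model (model name assumed)."""
    resp = openai_client.embeddings.create(model="text-embedding-3-small", input=text)
    return resp.data[0].embedding


def fetch_articles(listing_url: str) -> list[dict]:
    """Scrape title/link pairs from a hypothetical article listing page."""
    html = requests.get(listing_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.get_text(strip=True), "link": a["href"]}
        for a in soup.select("article h2 a")  # selector depends on the actual source
    ]


# PersistentClient keeps embeddings and metadata on disk across sessions.
chroma = chromadb.PersistentClient(path="./chroma_store")
articles_col = chroma.get_or_create_collection(name="tech_articles")

articles = fetch_articles("https://example.com/technology")  # placeholder source URL
articles_col.add(
    ids=[art["link"] for art in articles],                 # the link doubles as a unique id
    embeddings=[embed(art["title"]) for art in articles],  # semantic vector per article
    metadatas=[{"title": art["title"], "link": art["link"]} for art in articles],
    documents=[art["title"] for art in articles],
)
```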
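
Query matching is then a vector search against the stored embeddings: ChromaDB returns the documents whose vectors are closest to the query vector. This sketch reuses `embed` and `articles_col` from the ingestion example, and `n_results=3` is an arbitrary choice for the number of top results.

```python
query = "Tell me something related to crypto"
query_vector = embed(query)

# Vector-based search: find the stored articles whose embeddings are most
# similar to the query embedding.
results = articles_col.query(
    query_embeddings=[query_vector],
    n_results=3,
    include=["metadatas", "distances"],
)
top_links = [meta["link"] for meta in results["metadatas"][0]]
```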
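
Full-text retrieval, chunking, and chunk storage could follow the pattern below, again reusing names from the previous sketches. The fixed-size character chunking, the overlap, and the `article_chunks` collection name are assumptions; the real chunking strategy may differ.

```python
def scrape_full_text(url: str) -> str:
    """Fetch an article page and join its paragraph text into one string."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return " ".join(p.get_text(strip=True) for p in soup.find_all("p"))


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character chunks for more granular indexing."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


# A second collection holds the article chunks and their embeddings.
chunks_col = chroma.get_or_create_collection(name="article_chunks")

for link in top_links:
    chunks = chunk_text(scrape_full_text(link))
    if not chunks:
        continue
    chunks_col.add(
        ids=[f"{link}#{i}" for i in range(len(chunks))],
        embeddings=[embed(c) for c in chunks],
        metadatas=[{"source": link, "chunk": i} for i in range(len(chunks))],
        documents=chunks,
    )
```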
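
Finally, answer synthesis queries the chunk collection for the most relevant chunks and passes them to an OpenAI chat model as context. The chat model name and prompt wording here are assumptions; only the overall retrieve-then-generate flow is taken from the description above.

```python
chunk_hits = chunks_col.query(
    query_embeddings=[query_vector],
    n_results=5,
    include=["documents", "metadatas"],
)
context = "\n\n".join(chunk_hits["documents"][0])

completion = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name; the description only says "the OpenAI model"
    messages=[
        {"role": "system", "content": "Answer the user's question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ],
)
print(completion.choices[0].message.content)
```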
graph LR
    A[User Input] --> B[Generate Query Embedding]
    B --> C[ChromaDB - Store Embeddings and Metadata]
    C --> D[Web Scraping of Articles]
    D --> E[Store Article Chunks in ChromaDB]
    E --> F[Retrieve Relevant Chunks]
    F --> G[Generate AI Response]