Visual RAG Implementation

Learn how to implement Apolo's enterprise-ready multimodal AI for Visual RAG on complex PDFs. This guide walks through the pipeline step by step, combining text and visual data processing so you can extract accurate insights from intricate documents at scale.

1. Setting Up Apolo

The Apolo platform is the backbone of this workflow, providing:
- Compute Resources: GPUs for running ML models.
- Storage: To manage raw data, embeddings, and processed outputs.
- Job Management: To orchestrate the pipeline.
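
If you are starting from a fresh environment, authenticate the CLI and create the storage layout used throughout this guide. This is a minimal sketch: the directory paths are just this guide's conventions, and CLI installation itself is covered in Apolo's documentation.

apolo login
apolo mkdir -p storage:visual_rag/raw-data storage:visual_rag/lancedb-data storage:visual_rag/cache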

2. Data Preparation

Upload your sample data to Apolo:
apolo cp -r ./sample_data/ storage:visual_rag/raw-data/
The uploaded PDFs will be used to extract text and images for embedding.

3. Data Ingestion

Run the ingestion job to process PDFs and store embeddings in LanceDB:
apolo run --detach \
         --no-http-auth \
         --preset H100x1 \
         --name ingest-data \
         --http-port 80 \
         --volume storage:visual_rag/cache:/root/.cache/huggingface:rw \
         --volume storage:visual_rag/raw-data/:/raw-data:rw \
         --volume storage:visual_rag/lancedb-data/:/lancedb-data:rw \
         -e HF_TOKEN=$HF_TOKEN \
         ghcr.io/kyryl-opens-ml/apolo_visual_rag:latest -- python main.py ingest-data /raw-data --table-name=demo --db-path=/lancedb-data/datastore

The ingestion process involves:
- Extracting images and text from each page of a PDF.
- Generating embeddings for these components using ColPali.
- Storing the embeddings in LanceDB.

The processed data, including embeddings and metadata, is stored in LanceDB, a vector database optimized for high-speed search and retrieval.
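
Under the hood, the ingestion logic boils down to something like the sketch below. This is not the exact main.py shipped in the image: it assumes the pdf2image, colpali-engine, and lancedb packages, and vidore/colpali-v1.2 is an illustrative checkpoint choice.

import lancedb
import torch
from pdf2image import convert_from_path  # requires poppler installed
from colpali_engine.models import ColPali, ColPaliProcessor

MODEL_NAME = "vidore/colpali-v1.2"  # assumed checkpoint, not confirmed by the source
model = ColPali.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16, device_map="cuda").eval()
processor = ColPaliProcessor.from_pretrained(MODEL_NAME)

def ingest_pdf(pdf_path, table_name="demo", db_path="/lancedb-data/datastore"):
    # Render each PDF page to an image; ColPali embeds whole pages,
    # capturing text and visuals together.
    pages = convert_from_path(pdf_path)
    batch = processor.process_images(pages).to(model.device)
    with torch.no_grad():
        embeddings = model(**batch)  # one multi-vector embedding per page
    rows = [
        {"pdf": pdf_path, "page": i, "vector": emb.float().cpu().numpy()}
        for i, emb in enumerate(embeddings)
    ]
    # Assumes LanceDB's multivector (late-interaction) support; a production
    # script would pin an explicit table schema instead of inferring it.
    db = lancedb.connect(db_path)
    if table_name in db.table_names():
        db.open_table(table_name).add(rows)
    else:
        db.create_table(table_name, data=rows)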

4. Deploy the Generative LLM

Once the data is ingested and stored in LanceDB, deploy the generative LLM server for processing multimodal queries. This server runs the Llama 3.2 Vision-Instruct model, enabling responses based on both text and visual data.
apolo run --detach \
         --no-http-auth \
         --preset H100x1 \
         --name generation-inference \
         --http-port 80 \
         --volume storage:visual_rag:/models:rw \
         -e HF_TOKEN=$HF_TOKEN \
         ghcr.io/huggingface/text-generation-inference:2.4.0 -- --model-id meta-llama/Llama-3.2-11B-Vision-Instruct
What Happens in This Step:
- Deploying the Server: The command sets up the generative LLM server within Apolo’s infrastructure, running the meta-llama/Llama-3.2-11B-Vision-Instruct model.
- Secure Storage Integration: The model weights are accessed securely via the mounted storage:visual_rag directory.
- Multimodal Inference: The server is configured to handle multimodal queries, such as combining text and images for processing.


With this setup, your generative LLM is ready to serve multimodal queries, providing the backbone for the Visual RAG pipeline. The system can now combine the embeddings retrieved from LanceDB with the user queries, using the model to generate comprehensive and accurate responses.
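
To sanity-check the deployment, you can call the server directly. Text Generation Inference exposes an OpenAI-compatible chat endpoint, so a hedged example looks like the following; the base_url is a placeholder for whatever HTTP URL Apolo prints for your generation-inference job.

from openai import OpenAI

client = OpenAI(
    base_url="https://<your-generation-inference-job-url>/v1",  # placeholder URL
    api_key="-",  # no auth is configured in this example deployment
)

def run_vision_inference(prompt, image_data_urls, max_tokens=256):
    # Send the text prompt plus base64-encoded page images to the vision model.
    content = [{"type": "image_url", "image_url": {"url": u}} for u in image_data_urls]
    content.append({"type": "text", "text": prompt})
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct",
        messages=[{"role": "user", "content": content}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content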

5. Querying the System

With the ingestion pipeline and LLM server running, you can query the system using the ask_data function.

Here’s how it works:
1. Query Embedding: The user query is embedded using ColPali in get_query_embedding.
2. Database Search: search_db retrieves the most relevant images based on embeddings.
3. Response Generation: A vision-enabled LLM (e.g., Llama 3.2) processes the prompt and images via run_vision_inference.
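
Putting the three steps together, ask_data might look like the sketch below. It reuses the ColPali model and processor from the ingestion sketch and the run_vision_inference helper from step 4; render_page and to_data_url are hypothetical helpers, and the actual repository code may differ.

import base64
import io

def get_query_embedding(query):
    # Embed the text query with ColPali (same model/processor as ingestion).
    batch = processor.process_queries([query]).to(model.device)
    with torch.no_grad():
        return model(**batch)[0].float().cpu().numpy()

def search_db(query_embedding, table_name="demo", db_path="/lancedb-data/datastore", k=3):
    # Late-interaction (MaxSim) search over the multi-vector page embeddings.
    tbl = lancedb.connect(db_path).open_table(table_name)
    return tbl.search(query_embedding).limit(k).to_list()

def render_page(hit):
    # Re-render the matched page as a PIL image (pages were stored 0-based).
    return convert_from_path(hit["pdf"], first_page=hit["page"] + 1, last_page=hit["page"] + 1)[0]

def to_data_url(image):
    # Base64-encode a PIL image for the chat endpoint.
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def ask_data(query):
    hits = search_db(get_query_embedding(query))
    images = [render_page(h) for h in hits]
    return run_vision_inference(query, [to_data_url(img) for img in images])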

6. Visualizing the Results

To enhance usability, integrate a Streamlit-based dashboard for querying and visualizing responses. The dashboard includes:
- PDF Viewer: Displays available documents for context.
- Search Input: Allows users to submit natural language queries.
- Results Panel: Shows the retrieved images and the LLM-generated responses.

For example, querying “What is the market share by region?” retrieves visuals related to market share and generates a concise, context-aware response.
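
A minimal Streamlit sketch wiring these pieces together, assuming the helpers from steps 4 and 5 are importable; the widget layout is illustrative.

import streamlit as st

st.title("Visual RAG on Complex PDFs")
query = st.text_input("Ask a question about your documents")  # search input
if query:
    hits = search_db(get_query_embedding(query))
    images = [render_page(h) for h in hits]
    for h, img in zip(hits, images):
        st.image(img, caption=f'{h["pdf"]} - page {h["page"] + 1}')  # results panel
    st.write(run_vision_inference(query, [to_data_url(img) for img in images]))  # LLM answer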
