[Bug]: Enhance PDF Processing with OCR for Unreadable Documents #16

@MridulTi

Describe the issue

Currently, our PDF processing pipeline extracts text from documents, but it has no fallback when text extraction fails. We need it to run OCR in that case and retrieve the content, so that users can still query a document even if it is an image-based PDF or has unreadable text.

Steps to Reproduce

  1. Upload a scanned PDF (or an image-based PDF).
  2. Attempt to extract text from the document.
  3. Observe that the system fails to retrieve any content and does not try OCR.

Expected Behavior

  • If text extraction fails, the system should automatically attempt OCR.
  • OCR-extracted text should be processed for search queries.
  • Errors should be logged if OCR also fails.
  • Users should receive a meaningful message if the document is entirely unreadable.
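
The expected behavior above could be sketched roughly as follows. This is only an illustration, not the project's actual code: `extract_with_ocr_fallback`, `ExtractionResult`, the `min_chars` threshold, and the user-facing message are all hypothetical names and values, and the two callables stand in for whatever text extractor and OCR engine the pipeline ends up using.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("pdf_pipeline")  # hypothetical logger name

# Hypothetical message shown when neither method yields usable text.
UNREADABLE_MESSAGE = (
    "We could not read any text from this document. "
    "Please upload a clearer scan or a text-based PDF."
)


@dataclass
class ExtractionResult:
    text: str
    source: str  # "text_layer", "ocr", or "none"
    message: Optional[str] = None  # set only when the document is unreadable


def extract_with_ocr_fallback(
    pdf_path: str,
    extract_text: Callable[[str], str],
    run_ocr: Callable[[str], str],
    min_chars: int = 20,  # assumed cutoff for "extraction produced something"
) -> ExtractionResult:
    """Try normal text extraction first, then fall back to OCR."""
    try:
        text = extract_text(pdf_path)
    except Exception:
        logger.exception("Text extraction failed for %s", pdf_path)
        text = ""

    if len(text.strip()) >= min_chars:
        return ExtractionResult(text=text, source="text_layer")

    logger.info("No usable text layer in %s, attempting OCR", pdf_path)
    try:
        ocr_text = run_ocr(pdf_path)
    except Exception:
        # Expected behavior: log the error if OCR also fails.
        logger.exception("OCR failed for %s", pdf_path)
        ocr_text = ""

    if len(ocr_text.strip()) >= min_chars:
        return ExtractionResult(text=ocr_text, source="ocr")

    logger.error("Document %s is unreadable by text extraction and OCR", pdf_path)
    return ExtractionResult(text="", source="none", message=UNREADABLE_MESSAGE)
```

Passing the extractor and the OCR step in as callables keeps the fallback logic independent of any particular library and makes it easy to test with stubs. Treating very short output as a failure (rather than only exceptions) is an assumption meant to cover scanned PDFs, where extraction typically succeeds but returns an empty string.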

Relevant Logs/Error Messages

NA

Priority

Medium
