PurrfectKit

PurrfectKit is a whimsical Python library that combines feline charm with natural language processing (NLP), optical character recognition (OCR), and document processing. Each module in the purrfectmeow package is named after a Thai cat breed and handles one part of the pipeline: text extraction, chunking, embedding, semantic search, or supporting utilities.

🐾 Overview

PurrfectKit blends NLP, OCR, and document processing with a playful nod to Thai cat breeds. Its core modules are:

  • Kornja: Segments text into manageable chunks (Content Chunking).

  • WichienMaat: Interprets query intent for precise results (Semantic Search).

  • KhaoManee: Converts text to vectors and manages storage (Embedding & Storage).

  • Malet: Extracts data from PDFs, images, spreadsheets, and Markdown (Text Extraction).

  • Suphalaks: Handles file operations and model/tokenizer loading (Utility & Infrastructure).

Note

All modules are prefixed with purrfectmeow for namespace clarity and reflect Thai cat breed names.
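To make the chunking role concrete, here is a minimal sketch of sliding-window chunking with overlap. It is an illustration only, not Kornja's implementation: the function name `chunk_tokens` is hypothetical, whitespace splitting stands in for a real model tokenizer, and the parameter names simply echo the `chunk_size` and `chunk_overlap` arguments used in the Quick Start.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap=0):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace splitting is a stand-in for a model tokenizer (assumption).
tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With an overlap of zero, the windows are disjoint. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help preserve context across chunk boundaries.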

🌟 Why PurrfectKit?

  • Playful yet Powerful: Combines robust NLP and OCR capabilities with a delightful, cat-inspired interface.

  • Multilingual Mastery: Built-in support for Thai via pythainlp, with extensibility for other languages.

  • Developer-Friendly: Intuitive APIs and comprehensive documentation make integration a breeze.

  • Versatile Applications: Perfect for semantic search, document processing, and retrieval-augmented generation (RAG).

🛠️ System-level Dependencies

PurrfectKit relies on the following system-level dependencies for OCR and document processing. Install them based on your operating system:

  • tesseract-ocr: Core OCR engine for text extraction from images.

  • tesseract-ocr-tha: Thai language data for Tesseract.

  • poppler: PDF rendering library for processing PDF documents.

  • ffmpeg: Multimedia framework for handling image and video inputs.

  • libmagic1: File type identification library for robust file handling.
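To see which of these tools are already available before installing anything, a short standard-library check along the following lines may help. The executable names are assumptions: `pdftoppm` is one of the Poppler command-line tools, and libmagic is a shared library rather than a command, so it is not probed here. The helper `find_missing_tools` is hypothetical and not part of PurrfectKit.

```python
import shutil

# Assumed executable names; libmagic is a shared library, so it is skipped.
REQUIRED_TOOLS = ("tesseract", "pdftoppm", "ffmpeg")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing from PATH:", ", ".join(missing))
else:
    print("All checked tools were found on PATH.")
```

This only confirms that the executables are discoverable. It does not verify versions or whether the Thai language data for Tesseract is installed.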

Installation on Ubuntu/Debian:

sudo apt-get update
sudo apt-get install tesseract-ocr tesseract-ocr-tha poppler-utils ffmpeg libmagic1

Installation on macOS (using Homebrew):

brew install tesseract tesseract-lang poppler ffmpeg libmagic

Installation on Windows:

🚀 Installation

PurrfectKit requires Python 3.10 to 3.12.4. We recommend using uv for faster dependency management, but pip is also supported.
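If you are unsure whether your interpreter falls in that range, a quick check such as the sketch below can be run first. It assumes both bounds are inclusive, which the text above does not state explicitly, and `is_supported` is a hypothetical helper rather than part of the package.

```python
import sys

MIN_VERSION = (3, 10, 0)
MAX_VERSION = (3, 12, 4)  # assumed to be inclusive


def is_supported(version=None):
    """Return True if (major, minor, micro) lies within the stated range."""
    if version is None:
        version = sys.version_info[:3]
    return MIN_VERSION <= tuple(version) <= MAX_VERSION


print("Supported interpreter:", is_supported())
```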

Using uv (Recommended):

git clone https://github.com/SUWALUTIONS/PurrfectKit.git
cd PurrfectKit
uv pip install -e .
uv sync --extra-index-url https://download.pytorch.org/whl/cpu

Using pip:

git clone https://github.com/SUWALUTIONS/PurrfectKit.git
cd PurrfectKit
pip install . --extra-index-url https://download.pytorch.org/whl/cpu

For development dependencies (e.g., pytest):

uv sync --extra-index-url https://download.pytorch.org/whl/cpu --extra dev
# or
pip install ".[dev]" --extra-index-url https://download.pytorch.org/whl/cpu

🐱 Quick Start

from purrfectmeow import Suphalaks, Malet, Kornja, KhaoManee, WichienMaat

# Load and process a PDF
file_path = "example/meowdy.pdf"
metadata = Suphalaks.get_file_metadata(file_path)

with open(file_path, "rb") as f:
    text = Malet.loader(f, file_path, loader="MARKITDOWN")

# Chunk the text
chunks = Kornja.chunking(
    text,
    splitter="token",
    model_name="intfloat/multilingual-e5-large-instruct",
    chunk_size=50,
    chunk_overlap=0
)

# Create document templates and embeddings
docs = Suphalaks.document_template(chunks, metadata)
embeddings = KhaoManee.get_embeddings(docs, model_name="intfloat/multilingual-e5-large-instruct")
query_embeddings = KhaoManee.get_query_embeddings(query="howdy", model_name="intfloat/multilingual-e5-large-instruct")

# Perform semantic search
results = WichienMaat.get_search(query_embeddings, embeddings, docs, top_k=2)

# Print results
for result in results:
    print(f"Score: {result['score']:.4f}")
    print(f"Content: {result['document'].page_content}\n")

Expected Output:

[
  {
    "score": 0.8146,
    "document": {
      "metadata": {
        "chunk_info": {
          "chunk_number": 1,
          "chunk_id": "fc690110e8a2407db6b65e7129331ec7",
          "chunk_hash": "4b7ffc7f57494fba188f7bc55d348a7c",
          "previous_chunk_hash": null,
          "next_chunk_hash": "49473745424e819315a4ad8cb2c25fa8",
          "chunk_size": 168
        },
        "source_info": {
          "file_name": "meowdy.pdf",
          "file_size": 3981724,
          "file_created_date": "2025-05-23 09:46:17",
          "file_modified_date": "2025-05-23 09:46:17",
          "file_extension": ".pdf",
          "file_type": "application/pdf",
          "description": "PDF document, version 1.7, 1 pages",
          "total_pages": 1,
          "file_md5": "bf4db19df52cb3a3e4e3854c9edbdc73"
        }
      },
      "page_content": "Meowdy, marvelous makers of machine magic! PurrfectKit. Whether you're chunking, searching, embedding, extracting, or orchestrating, I've got a cat for that"
    }
  },
  ...
]
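For readers curious about what a top-k semantic search step does conceptually, the sketch below ranks documents by cosine similarity between a query vector and each document vector. This is an assumption-laden illustration: the source does not say which similarity measure WichienMaat uses, the functions `cosine_similarity` and `top_k_search` are hypothetical, and the two-dimensional vectors are toy values rather than real embeddings. The result shape mirrors the `score` and `document` keys shown above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def top_k_search(query_vec, doc_vecs, docs, top_k=2):
    """Score every document against the query and keep the best top_k."""
    scored = [
        {"score": cosine_similarity(query_vec, vec), "document": doc}
        for vec, doc in zip(doc_vecs, docs)
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:top_k]


# Toy vectors stand in for model embeddings (assumption).
docs = ["cats purr", "dogs bark", "kittens meow"]
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
results = top_k_search([1.0, 0.0], doc_vecs, docs, top_k=2)
for item in results:
    print(f"Score: {item['score']:.4f}  Content: {item['document']}")
```

A brute-force scan like this is fine for small collections. Larger collections usually call for a vector index instead.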

✨ Features

  • NLP: Tokenization, semantic analysis and more with spacy, pythainlp, and transformers.

  • OCR: Extract text from images and PDFs using surya-ocr, easyocr, and pytesseract.

  • Document Processing: Handle PDFs, images, and Markdown with pymupdf, docling and markitdown.

  • Multilingual Support: Thai language processing via pythainlp, extensible for other languages.

  • AI & LLMs: Leverage torch and langchain-core for embeddings and RAG workflows.

  • Whimsical Design: Thai cat breed-inspired module names for a delightful developer experience.

📚 Documentation

🤝 Contributing

We welcome contributions! To get started:

  1. Fork the repository.

  2. Create a branch: git checkout -b feature/your-feature.

  3. Commit changes: git commit -m "Add your feature".

  4. Push and open a pull request.

See CONTRIBUTING for detailed guidelines.

📄 License

PurrfectKit is released under the MIT License.