Meow Package

🐾 Natural Language Processing with a Purr-sonal Touch 🐾

The purrfectmeow package delivers a curated suite of NLP-focused classes, each thoughtfully named after a Thai cat breed. This thematic naming reflects the unique role of each class in a modular, user-friendly approach to solving various NLP challenges.

Whether you’re pre-processing text, generating embeddings, managing file I/O, or building model utilities, Meow provides streamlined components to support robust and scalable natural language pipelines.

See Usage Guide for details.
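The classes below are designed to compose into a retrieval pipeline. The following is a minimal sketch of how they might fit together, assuming a local example.pdf and the default models; the exact array shapes returned by the embedding helpers are not pinned down here, so treat it as a composition outline rather than verified end-to-end code.

>>> from purrfectmeow import Malet, Kornja, Suphalaks, KhaoManee, WichienMaat
>>> with open("example.pdf", "rb") as f:
...     text = Malet.loader(f, "example.pdf", loader="PYMUPDF")
>>> chunks = Kornja.chunking(text, splitter="token",
...                          model_name="text-embedding-ada-002", chunk_size=256)
>>> metadata = Suphalaks.get_file_metadata("example.pdf")
>>> docs = [Suphalaks.document_template([c], metadata) for c in chunks]
>>> embs = [KhaoManee.get_embeddings(d) for d in docs]   # may need reshaping to a flat List[ndarray]
>>> query_emb = KhaoManee.get_query_embeddings("What is this document about?")
>>> top_docs = WichienMaat.get_search(query_emb, embs, docs, top_k=3)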

API Reference

class purrfectmeow.Kornja[source]

Bases: object

A flexible interface for text segmentation based on tokenization or custom separators.

This class provides a unified API for splitting large text inputs into smaller, manageable chunks using either token-based or separator-based strategies. It supports configuration of chunk size, overlap, and model-specific parameters for token-aware splitting, as well as simple string-based segmentation for more structured inputs.

static chunking(text: str, splitter: Literal['token', 'separator'] | None = 'token', **kwargs) → List[str][source]

Handles text chunking with token or separator-based splitting.

Parameters:
  • text (str) – The input text to be chunked.

  • splitter (str, optional) – The type of splitter to use for chunking. Must be either ‘token’ or ‘separator’.

  • **kwargs (dict) –

    Additional parameters for the splitter:

    For ‘token’ splitter:

    model_name : str, optional
        Name of the model for token-based splitting.

    chunk_size : int, optional
        Maximum size of each chunk in tokens.

    chunk_overlap : int, optional
        Number of overlapping tokens between chunks.

    For ‘separator’ splitter:

    separator : str, optional
        String used to split the text.

Returns:

A list of text chunks generated by the specified splitter.

Return type:

List[str]

Raises:

ValueError – If splitter is not ‘token’ or ‘separator’, or if required parameters (model_name for ‘token’, separator for ‘separator’) are invalid or empty.

Examples

>>> text = "This is a sample text.\n\nAnother paragraph."
>>> Kornja.chunking(text, splitter="separator")
['This is a sample text.', 'Another paragraph.']
>>> Kornja.chunking(text, splitter="token", model_name="text-embedding-ada-002", chunk_size=10)
['This is a', 'sample text.', 'Another', 'paragraph.']
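An additional call showing the overlap parameter; the exact boundaries depend on the tokenizer for the chosen model, so the resulting chunk count below is illustrative only:

>>> long_text = "This is a sample sentence. " * 40
>>> chunks = Kornja.chunking(long_text, splitter="token",
...                          model_name="text-embedding-ada-002",
...                          chunk_size=32, chunk_overlap=8)
>>> len(chunks) > 1
True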
class purrfectmeow.WichienMaat[source]

Bases: object

A lightweight and efficient vector-based semantic search utility for document retrieval.

This class provides methods for performing similarity searches between a query embedding and a set of document embeddings. Ideal for small- to medium-scale semantic search applications, it abstracts the search logic while maintaining flexibility and interpretability of results.

static get_search(query_embedding: ndarray, embeddings: List[ndarray], documents: List[Document], top_k: int = 5)[source]

Performs similarity searches using embeddings and documents.

Parameters:
  • query_embedding (numpy.ndarray) – The embedding vector for the search query.

  • embeddings (List[numpy.ndarray]) – A list of embedding vectors to search against.

  • documents (List[Document]) – A list of Document objects corresponding to the embeddings.

  • top_k (int, optional) – The number of top similar documents to return. Defaults to 5.

Returns:

A list of the top_k most similar documents based on the query embedding.

Return type:

List[Document]

Examples

>>> import numpy as np
>>> from langchain_core.documents import Document
>>> query_emb = np.array([0.1, 0.2, 0.3])
>>> embeddings = [np.array([0.1, 0.2, 0.4]), np.array([0.4, 0.5, 0.6])]
>>> docs = [Document(page_content="Doc 1"), Document(page_content="Doc 2")]
>>> WichienMaat.get_search(query_emb, embeddings, docs, top_k=2)
[Document(page_content="Doc 1"), Document(page_content="Doc 2")]
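For intuition, the kind of ranking such a search performs can be reproduced with plain NumPy. The snippet below, which continues the example above, is an independent cosine-similarity sketch and not the package's internal implementation:

>>> def cosine(a, b):
...     return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
>>> scores = [cosine(query_emb, emb) for emb in embeddings]
>>> [docs[i].page_content for i in np.argsort(scores)[::-1][:2]]
['Doc 1', 'Doc 2']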
class purrfectmeow.KhaoManee[source]

Bases: object

A class that abstracts the underlying complexity of the embedding and tokenization processes.

This class consolidates methods from SimpleHFEmbedder and SimpleTokenization to encode documents and query strings into dense vectors using pre-trained transformer models, as well as to tokenize text through various supported engines.

static get_embeddings(documents: Document, model_name: str | None = 'intfloat/multilingual-e5-large-instruct') → ndarray[source]

Generates embeddings for the given document(s) using the specified model.

Parameters:
  • documents (Document) – The document(s) to generate embeddings for.

  • model_name (str, optional) – The name of the model to use for embedding generation.

Returns:

An array of embeddings for the input documents.

Return type:

numpy.ndarray

Examples

>>> from langchain_core.documents import Document
>>> doc = Document(page_content="This is a test document.")
>>> KhaoManee.get_embeddings(doc)
array([[0.1, 0.2, ...], ...])
static get_query_embeddings(query: str | None = 'meow~', model_name: str | None = 'intfloat/multilingual-e5-large-instruct') → ndarray[source]

Generates embeddings for a query string using a specified model.

Parameters:
  • query (str, optional) – The query string to generate embeddings for. Defaults to ‘meow~’.

  • model_name (str, optional) – The name of the model to use for embedding generation. Defaults to ‘intfloat/multilingual-e5-large-instruct’.

Returns:

An array of embeddings for the input query.

Return type:

numpy.ndarray

Examples

>>> KhaoManee.get_query_embeddings(query="What is this?")
array([0.1, 0.2, ...])
static get_tokens(text: str, engine: Literal['spacy', 'pythainlp', 'huggingface'] | None = 'pythainlp') → List[str][source]

Tokenizes input text using a specified tokenization engine.

Parameters:
  • text (str) – The input text to tokenize.

  • engine (str, optional) – The tokenization engine to use. Must be one of ‘spacy’, ‘pythainlp’, or ‘huggingface’. Defaults to ‘pythainlp’.

Returns:

A list of tokens extracted from the input text.

Return type:

List[str]

Raises:

ValueError – If the specified engine is not one of ‘spacy’, ‘pythainlp’, or ‘huggingface’.

Examples

>>> KhaoManee.get_tokens("Hello world", engine="pythainlp")
['Hello', 'world']
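Because the default ‘pythainlp’ engine targets Thai, a Thai-language call is more representative; the tokens shown are illustrative and may vary with the engine version:

>>> KhaoManee.get_tokens("สวัสดีครับ", engine="pythainlp")
['สวัสดี', 'ครับ']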
class purrfectmeow.Malet[source]

Bases: object

A class that provides a static interface for loading and converting files into text.

This class consolidates methods from Markdown, OCR, and Simple to extract text from various file formats using a range of loader backends, such as Markdown converters, OCR engines, and simple data parsers.

static loader(file: BinaryIO, file_name: str, loader: str = 'PYMUPDF', **kwargs: Any) → str[source]

Load and convert a binary file using the specified loader backend.

Parameters:
  • file (BinaryIO) – The binary file to be converted.

  • file_name (str) – The name to use for the file.

  • loader (str) – The loader backend to use. Defaults to ‘PYMUPDF’.

  • **kwargs – Any arguments needed for the loader.

  Supported loaders:

  • MARKITDOWN (Callable) – Converts documents to Markdown using the MarkItDown engine.

  • DOCLING (Callable) – Converts documents to Markdown using the Docling engine.

  • PYTESSERACT (Callable) – Extracts text from images or PDFs using Tesseract OCR.

  • EASYOCR (Callable) – Extracts text using the EasyOCR engine.

  • SURYAOCR (Callable) – Extracts text using the SuryaOCR engine.

  • DOCTR (Callable) – Extracts text using the docTR engine.

  • PYMUPDF (Callable) – Parses PDF and text file content using PyMuPDF.

  • PANDAS (Callable) – Reads spreadsheet and CSV files using pandas.

  • ENCODING (Callable) – Reads files as UTF-8 encoded text.

Returns:

The extracted text.

Return type:

str

Examples

>>> with open("example.pdf", "rb") as f:
...     text = Malet.loader(f, "example.pdf", loader="PYMUPDF")
>>> print(text)
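A second, hedged example using the PANDAS backend for tabular input (the file name here is hypothetical, and any loader-specific options would be passed through **kwargs):

>>> with open("contacts.csv", "rb") as f:
...     table_text = Malet.loader(f, "contacts.csv", loader="PANDAS")
>>> print(table_text)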
class purrfectmeow.Suphalaks[source]

Bases: object

A class for handling files, loading models, and creating document templates.

This class consolidates methods from LoadingModel, DocTemplate and MetadataFile to perform common operations such as saving/removing files, retrieving models and tokenizers, extracting file metadata, and creating structured LangChain document templates.

static document_template(chunks: List[str], metadata: Dict[str, Any]) → Document[source]

Create a structured LangChain Document object from chunks and metadata.

Parameters:
  • chunks (List[str]) – A list of text chunks.

  • metadata (Dict[str, Any]) – A dictionary containing metadata associated with the document.

Returns:

A structured LangChain Document object.

Return type:

Document

Examples

>>> chunks = ["This is the first chunk.", "This is the second chunk."]
>>> metadata = {"source": "example.txt", "author": "John Doe"}
>>> document = Suphalaks.document_template(chunks, metadata)
>>> print(document.page_content, document.metadata)
static get_file_metadata(file_path: str) → Dict[source]

Extract metadata from a file including size, timestamps, and type.

Parameters:

file_path (str) – The path to the file.

Returns:

A dictionary containing metadata such as size, creation date, modification date, and file type.

Return type:

Dict

Examples

>>> metadata = Suphalaks.get_file_metadata('tmp_dir/example.txt')
static get_model_hf(model_name: str = None) → PreTrainedModel[source]

Retrieve a Hugging Face model by model name.

Parameters:

model_name (str, optional) – The name of the model. If None, a default is used.

Returns:

The loaded Hugging Face model.

Return type:

PreTrainedModel

Examples

>>> model = Suphalaks.get_model_hf('bert-base-uncased')
static get_model_st(model_name: str = None) → SentenceTransformer[source]

Retrieve a SentenceTransformer model by model name.

Parameters:

model_name (str, optional) – The name of the model. If None, a default is used.

Returns:

The loaded SentenceTransformer model.

Return type:

SentenceTransformer

Examples

>>> st_model = Suphalaks.get_model_st('all-MiniLM-L6-v2')
static get_tokenizer(model_name: str = None) → PreTrainedTokenizerBase[source]

Retrieve a Hugging Face tokenizer by model name.

Parameters:

model_name (str, optional) – The name of the model. If None, a default is used.

Returns:

The tokenizer corresponding to the specified model.

Return type:

PreTrainedTokenizerBase

Examples

>>> tokenizer = Suphalaks.get_tokenizer('bert-base-uncased')
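To illustrate how the tokenizer and model pair up, the sketch below mean-pools the last hidden state into a single sentence vector. It assumes the returned objects behave like standard Hugging Face transformers tokenizers and models (as the return types suggest) and is not a helper provided by the package; the resulting dimension depends on the chosen model (768 for bert-base-uncased).

>>> import torch
>>> tokenizer = Suphalaks.get_tokenizer('bert-base-uncased')
>>> model = Suphalaks.get_model_hf('bert-base-uncased')
>>> inputs = tokenizer("Meow makes NLP pipelines purr.", return_tensors="pt")
>>> with torch.no_grad():
...     hidden = model(**inputs).last_hidden_state   # shape: (1, seq_len, hidden_dim)
>>> sentence_vec = hidden.mean(dim=1).squeeze(0)     # mean pooling over tokens
>>> sentence_vec.shape
torch.Size([768])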