Meow Package
🐾 Natural Language Processing with a Purr-sonal Touch 🐾
The purrfectmeow package delivers a curated suite of NLP-focused classes, each thoughtfully named after a Thai cat breed. This thematic naming reflects the unique role of each class in a modular, user-friendly approach to solving various NLP challenges.
Whether you’re pre-processing text, generating embeddings, managing file I/O, or building model utilities, Meow provides streamlined components to support robust and scalable natural language pipelines.
See Usage Guide for details.
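The classes are designed to compose into a retrieval pipeline. The snippet below is a minimal, illustrative sketch of that flow; it assumes a local sample.pdf, relies on the default models documented under each class, and the array shapes returned by the embedding helpers may need flattening before they are passed to the search step, so treat it as an outline rather than a verified recipe.
>>> from purrfectmeow import Malet, Kornja, Suphalaks, KhaoManee, WichienMaat
>>> # Extract raw text from a (hypothetical) local PDF
>>> with open("sample.pdf", "rb") as f:
...     text = Malet.loader(f, "sample.pdf", loader="PYMUPDF")
>>> # Split the text into paragraph-sized chunks
>>> chunks = Kornja.chunking(text, splitter="separator", separator="\n\n")
>>> # Wrap each chunk in a LangChain Document carrying the file's metadata
>>> # (one Document per chunk is an illustrative choice, not a requirement)
>>> meta = Suphalaks.get_file_metadata("sample.pdf")
>>> docs = [Suphalaks.document_template([chunk], meta) for chunk in chunks]
>>> # Embed the documents and a query, then retrieve the closest matches
>>> embeddings = [KhaoManee.get_embeddings(doc) for doc in docs]
>>> query_emb = KhaoManee.get_query_embeddings("What is this file about?")
>>> results = WichienMaat.get_search(query_emb, embeddings, docs, top_k=3)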
API Reference
- class purrfectmeow.Kornja[source]
Bases: object
A flexible interface for text segmentation based on tokenization or custom separators.
This class provides a unified API for splitting large text inputs into smaller, manageable chunks using either token-based or separator-based strategies. It supports configuration of chunk size, overlap, and model-specific parameters for token-aware splitting, as well as simple string-based segmentation for more structured inputs.
- static chunking(text: str, splitter: Literal['token', 'separator'] | None = 'token', **kwargs) → List[str] [source]
Handles text chunking with token or separator-based splitting.
- Parameters:
text (str) – The input text to be chunked.
splitter (str, optional) – The type of splitter to use for chunking. Must be either ‘token’ or ‘separator’.
**kwargs (dict) – Additional parameters for the splitter:
- For the ‘token’ splitter:
  - model_name (str, optional) – Name of the model for token-based splitting.
  - chunk_size (int, optional) – Maximum size of each chunk in tokens.
  - chunk_overlap (int, optional) – Number of overlapping tokens between chunks.
- For the ‘separator’ splitter:
  - separator (str, optional) – String used to split the text.
- Returns:
A list of text chunks generated by the specified splitter.
- Return type:
List[str]
- Raises:
ValueError – If splitter is not ‘token’ or ‘separator’, or if required parameters (model_name for ‘token’, separator for ‘separator’) are invalid or empty.
Examples
>>> text = "This is a sample text.\n\nAnother paragraph."
>>> Kornja.chunking(text, splitter="separator")
['This is a sample text.', 'Another paragraph.']
>>> Kornja.chunking(text, splitter="token", model_name="text-embedding-ada-002", chunk_size=10)
['This is a', 'sample text.', 'Another', 'paragraph.']
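chunk_overlap can be combined with chunk_size so that neighboring chunks share trailing tokens; the call below is a sketch only, since the resulting chunk boundaries depend on the chosen model's tokenizer:
>>> long_text = "Meow " * 200
>>> overlapping_chunks = Kornja.chunking(
...     long_text,
...     splitter="token",
...     model_name="text-embedding-ada-002",
...     chunk_size=50,
...     chunk_overlap=10,
... )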
- class purrfectmeow.WichienMaat[source]
Bases: object
A lightweight and efficient vector-based semantic search utility for document retrieval.
This class performs similarity searches between a query embedding and a set of document embeddings. Ideal for small- to medium-scale semantic search applications, it abstracts the search logic while maintaining flexibility and interpretability of results.
- static get_search(query_embedding: ndarray, embeddings: List[ndarray], documents: List[Document], top_k: int = 5)[source]
Performs similarity searches using embeddings and documents.
- Parameters:
query_embedding (numpy.ndarray) – The embedding vector for the search query.
embeddings (List[numpy.ndarray]) – A list of embedding vectors to search against.
documents (List[Document]) – A list of Document objects corresponding to the embeddings.
top_k (int, optional) – The number of top similar documents to return. Defaults to 5.
- Returns:
A list of the top_k most similar documents based on the query embedding.
- Return type:
List[Document]
Examples
>>> import numpy as np
>>> from langchain_core.documents import Document
>>> query_emb = np.array([0.1, 0.2, 0.3])
>>> embeddings = [np.array([0.1, 0.2, 0.4]), np.array([0.4, 0.5, 0.6])]
>>> docs = [Document(page_content="Doc 1"), Document(page_content="Doc 2")]
>>> WichienMaat.get_search(query_emb, embeddings, docs, top_k=2)
[Document(page_content="Doc 1"), Document(page_content="Doc 2")]
- class purrfectmeow.KhaoManee[source]
Bases: object
A class that abstracts the underlying complexity of embedding and tokenization processes.
This class consolidates methods from SimpleHFEmbedder and SimpleTokenization to encode documents and query strings into dense vectors using pre-trained transformer models, as well as to tokenize text through various supported engines.
- static get_embeddings(documents: Document, model_name: str | None = 'intfloat/multilingual-e5-large-instruct') → ndarray [source]
Generates embeddings for the given document(s) using a pre-trained transformer model.
- Parameters:
documents (Document) – The document(s) to generate embeddings for.
model_name (str, optional) – The name of the model to use for embedding generation.
- Returns:
An array of embeddings for the input documents.
- Return type:
numpy.ndarray
Examples
>>> from langchain_core.documents import Document
>>> doc = Document(page_content="This is a test document.")
>>> KhaoManee.get_embeddings(doc)
array([[0.1, 0.2, ...], ...])
- static get_query_embeddings(query: str | None = 'meow~', model_name: str | None = 'intfloat/multilingual-e5-large-instruct') → ndarray [source]
Generates embeddings for a query string using a specified model.
- Parameters:
query (str, optional) – The query string to embed. Defaults to 'meow~'.
model_name (str, optional) – The name of the model to use for embedding generation.
- Returns:
An array of embeddings for the input query.
- Return type:
numpy.ndarray
Examples
>>> KhaoManee.get_query_embeddings(query="What is this?")
array([0.1, 0.2, ...])
- static get_tokens(text: str, engine: Literal['spacy', 'pythainlp', 'huggingface'] | None = 'pythainlp') → List[str] [source]
Tokenizes input text using a specified tokenization engine.
- Parameters:
text (str) – The input text to tokenize.
engine (str, optional) – The tokenization engine to use. Must be one of ‘spacy’, ‘pythainlp’, or ‘huggingface’. Defaults to ‘pythainlp’.
- Returns:
A list of tokens extracted from the input text.
- Return type:
List[str]
- Raises:
ValueError – If the specified engine is not one of ‘spacy’, ‘pythainlp’, or ‘huggingface’.
Examples
>>> KhaoManee.get_tokens("Hello world", engine="pythainlp")
['Hello', 'world']
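Because the default engine is pythainlp, Thai input works without extra configuration, while ‘spacy’ or ‘huggingface’ may suit other languages (both assume the corresponding engine and its language models are installed). The sketch below assigns the results rather than showing output, since the exact tokens depend on the installed engine versions:
>>> thai_tokens = KhaoManee.get_tokens("แมวไทยน่ารัก")  # defaults to the pythainlp engine
>>> english_tokens = KhaoManee.get_tokens("Thai cats are cute", engine="spacy")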
- class purrfectmeow.Malet[source]
Bases: object
A class that provides a static interface for loading and converting files into text.
This class consolidates methods from Markdown, OCR, and Simple to extract text from various file formats using a range of loader backends, such as Markdown converters, OCR engines, and simple data parsers.
- static loader(file: BinaryIO, file_name: str, loader: str = 'PYMUPDF', **kwargs: Any) → str [source]
Load and convert a binary file using the specified loader backend.
- Parameters:
file (BinaryIO) – The binary file to be converted.
file_name (str) – The name to use for the file.
loader (str, optional) – The loader backend to use. Defaults to 'PYMUPDF'.
**kwargs – Any additional arguments needed by the loader.
Supported loaders:
MARKITDOWN (Callable) – Converts files to Markdown using the MarkItDown engine.
DOCLING (Callable) – Converts files to Markdown using the Docling engine.
PYTESSERACT (Callable) – Extracts text from images or PDFs using Tesseract OCR.
EASYOCR (Callable) – Extracts text using the EasyOCR engine.
SURYAOCR (Callable) – Extracts text using the SuryaOCR engine.
DOCTR (Callable) – Extracts text using the docTR engine.
PYMUPDF (Callable) – Parses PDF and plain-text file content using PyMuPDF.
PANDAS (Callable) – Reads spreadsheet and CSV files using pandas.
ENCODING (Callable) – Reads files as UTF-8-encoded text.
- Returns:
The extracted text.
- Return type:
str
Examples
>>> with open("example.pdf", "rb") as f:
...     text = Malet.loader(f, "example.pdf", loader="PYMUPDF")
>>> print(text)
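The same call shape covers the other backends listed above by changing the loader name; for instance, a spreadsheet or a plain-text file could be read as follows (a sketch assuming local data.csv and notes.txt files):
>>> with open("data.csv", "rb") as f:
...     table_text = Malet.loader(f, "data.csv", loader="PANDAS")
>>> with open("notes.txt", "rb") as f:
...     plain_text = Malet.loader(f, "notes.txt", loader="ENCODING")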
- class purrfectmeow.Suphalaks[source]
Bases: object
A class for handling files, loading models, and creating document templates.
This class consolidates methods from LoadingModel, DocTemplate and MetadataFile to perform common operations such as saving/removing files, retrieving models and tokenizers, extracting file metadata, and creating structured LangChain document templates.
- static document_template(chunks: List[str], metadata: Dict[str, Any]) → Document [source]
Create a structured LangChain Document object from chunks and metadata.
- Parameters:
chunks (List[str]) – The text chunks whose contents make up the document.
metadata (Dict[str, Any]) – The metadata to attach to the document.
- Returns:
A structured LangChain Document object.
- Return type:
Document
Examples
>>> chunks = ["This is the first chunk.", "This is the second chunk."]
>>> metadata = {"source": "example.txt", "author": "John Doe"}
>>> document = Suphalaks.document_template(chunks, metadata)
>>> print(document.page_content, document.metadata)
- static get_file_metadata(file_path: str) → Dict [source]
Extract metadata from a file including size, timestamps, and type.
- Parameters:
file_path (str) – The path to the file.
- Returns:
A dictionary containing metadata such as size, creation date, modification date, and file type.
- Return type:
Dict
Examples
>>> metadata = Suphalaks.get_file_metadata('tmp_dir/example.txt')
- static get_model_hf(model_name: str = None) → PreTrainedModel [source]
Retrieve a Hugging Face model by model name.
- Parameters:
model_name (str, optional) – The name of the model. If None, a default is used.
- Returns:
The loaded Hugging Face model.
- Return type:
PreTrainedModel
Examples
>>> model = Suphalaks.get_model_hf('bert-base-uncased')
- static get_model_st(model_name: str = None) → SentenceTransformer [source]
Retrieve a SentenceTransformer model by model name.
- Parameters:
model_name (str, optional) – The name of the model. If None, a default is used.
- Returns:
The loaded SentenceTransformer model.
- Return type:
SentenceTransformer
Examples
>>> st_model = Suphalaks.get_model_st('all-MiniLM-L6-v2')
- static get_tokenizer(model_name: str = None) → PreTrainedTokenizerBase [source]
Retrieve a Hugging Face tokenizer by model name.
- Parameters:
model_name (str, optional) – The name of the model. If None, a default is used.
- Returns:
The tokenizer corresponding to the specified model.
- Return type:
PreTrainedTokenizerBase
Examples
>>> tokenizer = Suphalaks.get_tokenizer('bert-base-uncased')
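Since get_tokenizer and get_model_hf both resolve Hugging Face checkpoint names, a matching tokenizer/model pair can be loaded together. The snippet below is a sketch using a public checkpoint and the standard transformers calling convention, not a prescribed workflow:
>>> tokenizer = Suphalaks.get_tokenizer('bert-base-uncased')
>>> model = Suphalaks.get_model_hf('bert-base-uncased')
>>> inputs = tokenizer("meow", return_tensors="pt")   # standard Hugging Face tokenizer call
>>> outputs = model(**inputs)                         # forward pass through the loaded model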