These components form the core utility and NLP processing layers of the KAZU system: managing spaCy and Stanza NLP pipelines, building and testing model packs, detecting abbreviations, generating SapBert embeddings, downloading external contexts, and performing string normalization and general data manipulation. Together they provide the foundational capabilities for KAZU's natural language processing and knowledge extraction workflows, covering data handling, model deployment, and text standardization.
Components
Shared Utilities
A collection of reusable utility functions and helper classes supporting various KAZU functionalities, including string normalization, caching, and abbreviation detection.
SpaCy Pipeline Management
This component is responsible for managing spaCy models within the KAZU system. It handles adding models from various sources, retrieving them for processing, and reloading them when necessary to keep NLP processing efficient and up to date. It supports both batch and single-document processing.
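The register/retrieve/reload lifecycle described above can be sketched with a minimal lazy-loading manager. This is an illustrative design, not KAZU's actual API: the class and method names (`ModelManager`, `add_model`, `get_model`, `reload_model`) are assumptions, and the loader callable stands in for however a spaCy model is actually constructed.

```python
from typing import Any, Callable, Dict


class ModelManager:
    """Lazily loads and caches models by name (illustrative sketch)."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}

    def add_model(self, name: str, loader: Callable[[], Any]) -> None:
        # Register a loader; the model itself is not built until first use.
        self._loaders[name] = loader

    def get_model(self, name: str) -> Any:
        # Load on first access, then serve from the in-memory cache.
        if name not in self._models:
            self._models[name] = self._loaders[name]()
        return self._models[name]

    def reload_model(self, name: str) -> Any:
        # Drop the cached instance and rebuild it from its loader,
        # e.g. after the underlying model files have been updated.
        self._models.pop(name, None)
        return self.get_model(name)
```

Lazy loading keeps startup cheap when many models are registered but only a few are used, while the explicit reload hook supports refreshing models without restarting the process.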
Model Pack Building & Testing
This component orchestrates the entire lifecycle of KAZU model packs, from loading build configurations to applying merge strategies, clearing cached resources, building caches, running sanity checks, and executing acceptance tests. Its primary purpose is to ensure the robust creation and validation of deployable model artifacts.
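The stepwise lifecycle above (load config, merge, clear caches, build caches, sanity checks, acceptance tests) can be modelled as an ordered pipeline that stops at the first failure. This is a minimal sketch of that orchestration pattern; `BuildStep` and `build_model_pack` are hypothetical names, not KAZU's real build entry points.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BuildStep:
    """One stage of the model pack build, e.g. 'build_caches'."""
    name: str
    run: Callable[[], bool]  # returns True on success


def build_model_pack(steps: List[BuildStep]) -> List[str]:
    """Run build steps in order, failing fast on the first broken step."""
    completed: List[str] = []
    for step in steps:
        if not step.run():
            raise RuntimeError(f"model pack step failed: {step.name}")
        completed.append(step.name)
    return completed
```

Failing fast means a broken sanity check prevents acceptance tests from running against an invalid artifact, which matches the validation-before-deployment goal described above.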
Abbreviation Detection
This component is dedicated to identifying and processing abbreviations within text. It includes functionalities to find candidate abbreviations, build matchers, and override existing entities with detected abbreviations, contributing to the normalization and enrichment of textual data.
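Candidate abbreviation detection of this kind is classically done in the style of the Schwartz-Hearst algorithm: given a pattern like "long form (SF)", the short form's characters are aligned against the long form from right to left. The sketch below is a simplified illustration of that matching idea, not KAZU's implementation.

```python
from typing import Optional


def find_abbreviation(long_form: str, short_form: str) -> Optional[str]:
    """Simplified Schwartz-Hearst-style matching: align short-form
    characters against the long form right-to-left. Returns the matched
    span of the long form, or None if no alignment exists."""
    l = len(long_form) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # skip punctuation in the short form
        # Scan left for a matching character; the first short-form
        # character must additionally start a word in the long form.
        while l >= 0 and (
            long_form[l].lower() != ch
            or (s == 0 and l > 0 and long_form[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    return long_form[l + 1:]
```

When a definition such as "hepatocellular carcinoma (HCC)" is matched, later standalone occurrences of "HCC" can be relabelled with the long form's entity class, which is the "override existing entities" behaviour described above.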
SapBert Embedding Utility
This component provides utilities for generating and retrieving embeddings using the SapBert model. It supports obtaining embeddings from data loaders or directly from strings, facilitating tasks that require semantic representations of text, such as similarity comparisons or information retrieval.
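SapBert itself is a transformer model, so generating the embeddings requires the model weights; the downstream use named above (similarity comparison and retrieval over semantic representations) can be illustrated with plain cosine similarity over toy vectors. The vectors and labels below are made up for illustration only.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float],
                 candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate label whose embedding is closest to the query,
    as in nearest-neighbour entity linking over embedded synonyms."""
    return max(candidates, key=lambda label: cosine_similarity(query, candidates[label]))
```

In practice the query vector would come from embedding a mention string with SapBert, and the candidate vectors from embedding knowledge base synonyms ahead of time.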
Gilda Context Downloader
This component is responsible for downloading and processing contextual information, primarily for Gilda, from external knowledge bases like Wikipedia and Wikidata. It includes functionalities to retry requests, retrieve URLs and content, and extract specific data such as Open Targets information.
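The request-retry behaviour mentioned above is typically implemented with exponential backoff, so transient failures from external services like Wikipedia or Wikidata do not abort a long download run. The helper below is a generic sketch of that pattern, independent of any particular HTTP client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying with exponential backoff on any exception.

    The delay doubles after each failed attempt; the final failure
    is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A real downloader would wrap each HTTP request in `retry`, and might additionally narrow the caught exception types to network errors so that programming bugs still fail immediately.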
String Normalization
This component provides a comprehensive set of tools for standardizing and cleaning text strings. It includes a default normalizer with methods for replacing substrings, handling numbers and Greek characters, and depluralizing terms, along with specialized normalizers for diseases, anatomy, genes, and companies, ensuring consistent text representation across the system.
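Two of the normalization steps named above, Greek character substitution and depluralization, can be sketched as follows. The substitution table and the depluralization heuristic here are deliberately minimal illustrations, not KAZU's actual rules.

```python
# Minimal illustrative substitution table (a real one covers the full alphabet).
GREEK_SUBS = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def replace_greek(text: str) -> str:
    """Spell out Greek letters so that e.g. 'TNF-α' matches 'TNF-alpha'."""
    for greek, spelled in GREEK_SUBS.items():
        text = text.replace(greek, spelled)
    return text


def depluralize(token: str) -> str:
    """Naive depluralization: strip a trailing 's' from longer tokens,
    leaving short tokens and '-ss' endings (e.g. 'class') untouched."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token
```

Normalizations like these matter for string matching against ontologies, where 'β-blockers' and 'beta blocker' should resolve to the same concept.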
General Utilities
This component provides foundational helper functions for common data processing tasks across the KAZU project. It includes utilities for mapping documents to section encodings and creating various types of n-grams (character and word), serving as a support layer for other components.
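The n-gram helpers mentioned above are small enough to sketch directly; these are generic implementations of character and word n-grams, with names chosen for illustration.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> List[str]:
    """All contiguous word n-grams, splitting on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Character n-grams are a common basis for fuzzy string similarity (e.g. TF-IDF over trigrams), while word n-grams support phrase-level candidate generation.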
Caching Utility
Provides caching mechanisms for various KAZU functionalities, improving performance by storing and retrieving frequently accessed data.
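For pure functions of hashable arguments, this kind of caching can be had from the standard library's `functools.lru_cache`; the function body below is a stand-in for any expensive, repeatedly-invoked computation.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def expensive_normalize(text: str) -> str:
    # Stand-in for a costly normalization pipeline; repeated inputs
    # are served from the cache instead of being recomputed.
    return text.strip().lower()
```

`maxsize` bounds memory use by evicting the least recently used entries, and `cache_info()` exposes hit/miss counts for tuning.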
Link Index Management
Manages the indexing and lookup of entities for linking purposes, facilitating efficient retrieval of related information.
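At its simplest, an entity-linking index is an inverted map from normalized synonyms to identifiers. The toy class and the identifiers below are hypothetical, purely to illustrate the lookup structure.

```python
from collections import defaultdict
from typing import Set


class LinkIndex:
    """Toy synonym-to-identifier index for entity linking (sketch)."""

    def __init__(self) -> None:
        self._index: defaultdict = defaultdict(set)

    def add(self, synonym: str, identifier: str) -> None:
        # Case-fold at index time so lookups are case-insensitive.
        self._index[synonym.lower()].add(identifier)

    def lookup(self, mention: str) -> Set[str]:
        # A mention may map to several identifiers (ambiguity is
        # resolved downstream by disambiguation logic).
        return set(self._index.get(mention.lower(), set()))
```

A production index would layer fuzzy matching and embedding-based retrieval on top of this exact-match core.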
SpaCy Object Mapper
Provides utilities for mapping KAZU data structures to spaCy objects and vice versa, enabling seamless integration with spaCy's NLP capabilities.
Constants
Defines various constants used throughout the KAZU project, ensuring consistency and easy configuration of system parameters.
Grouping Utilities
Provides utility functions for grouping and organizing data, often used in processing and managing collections of entities or documents.
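A typical helper of this kind sorts by a key and then groups, since `itertools.groupby` only merges adjacent items and therefore requires sorted input. The sketch below illustrates that generic pattern; the name `sort_then_group` is used for illustration.

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K")


def sort_then_group(
    items: Iterable[T], key: Callable[[T], K]
) -> Iterator[Tuple[K, List[T]]]:
    """Yield (key, group) pairs. Sorting first is required because
    groupby only merges runs of adjacent equal-keyed items."""
    for k, group in groupby(sorted(items, key=key), key=key):
        yield k, list(group)
```

Forgetting the sort is a classic `groupby` bug: unsorted input silently produces multiple fragments per key instead of one group.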
Stanza Pipeline
Manages the integration and usage of Stanza NLP models within KAZU, providing functionalities for text processing with Stanza.