These components form the core utility and NLP processing layers of the KAZU system: managing spaCy and Stanza NLP pipelines, building and testing model packs, detecting abbreviations, generating SapBert embeddings, downloading external contexts, and performing string normalization and general data manipulation. Together they provide the foundational capabilities for KAZU's natural language processing and knowledge extraction workflows, covering data handling, model deployment, and text standardization.
Components
Shared Utilities
A collection of reusable utility functions and helper classes supporting various KAZU functionalities, including string normalization, caching, and abbreviation detection.
SpaCy Pipeline Management
This component is responsible for managing spaCy models within the KAZU system. It handles adding models from various sources, retrieving them for processing, and reloading them when necessary to keep NLP processing efficient and up to date. It supports both batch and single-document processing.
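The register/retrieve/reload lifecycle described above can be sketched with a minimal lazy-loading manager. This is an illustrative design, not KAZU's actual API: the class and method names (`ModelManager`, `add_model`, `get_model`, `reload_model`) are assumptions, and the loader callable stands in for however a spaCy model is actually constructed.

```python
from typing import Any, Callable, Dict


class ModelManager:
    """Lazily loads and caches models by name (illustrative sketch)."""

    def __init__(self) -> None:
        self._loaders: Dict[str, Callable[[], Any]] = {}
        self._models: Dict[str, Any] = {}

    def add_model(self, name: str, loader: Callable[[], Any]) -> None:
        # Register a loader; the model itself is not built until first use.
        self._loaders[name] = loader

    def get_model(self, name: str) -> Any:
        # Load on first access, then serve from the in-memory cache.
        if name not in self._models:
            self._models[name] = self._loaders[name]()
        return self._models[name]

    def reload_model(self, name: str) -> Any:
        # Drop the cached instance and rebuild it from its loader,
        # e.g. after the underlying model files have been updated.
        self._models.pop(name, None)
        return self.get_model(name)
```

Lazy loading keeps startup cheap when many models are registered but only a few are used, while the explicit reload hook supports refreshing models without restarting the process.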
Model Pack Building & Testing
This component orchestrates the entire lifecycle of KAZU model packs, from loading build configurations to applying merge strategies, clearing cached resources, building caches, running sanity checks, and executing acceptance tests. Its primary purpose is to ensure the robust creation and validation of deployable model artifacts.
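The stepwise lifecycle above (load config, merge, clear caches, build caches, sanity checks, acceptance tests) can be modelled as an ordered pipeline that stops at the first failure. This is a minimal sketch of that orchestration pattern; `BuildStep` and `build_model_pack` are hypothetical names, not KAZU's real build entry points.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BuildStep:
    """One stage of the model pack build, e.g. 'build_caches'."""
    name: str
    run: Callable[[], bool]  # returns True on success


def build_model_pack(steps: List[BuildStep]) -> List[str]:
    """Run build steps in order, failing fast on the first broken step."""
    completed: List[str] = []
    for step in steps:
        if not step.run():
            raise RuntimeError(f"model pack step failed: {step.name}")
        completed.append(step.name)
    return completed
```

Failing fast means a broken sanity check prevents acceptance tests from running against an invalid artifact, which matches the validation-before-deployment goal described above.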
Abbreviation Detection
This component is dedicated to identifying and processing abbreviations within text. It includes functionalities to find candidate abbreviations, build matchers, and override existing entities with detected abbreviations, contributing to the normalization and enrichment of textual data.
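Candidate abbreviation detection of this kind is classically done in the style of the Schwartz-Hearst algorithm: given a pattern like "long form (SF)", the short form's characters are aligned against the long form from right to left. The sketch below is a simplified illustration of that matching idea, not KAZU's implementation.

```python
from typing import Optional


def find_abbreviation(long_form: str, short_form: str) -> Optional[str]:
    """Simplified Schwartz-Hearst-style matching: align short-form
    characters against the long form right-to-left. Returns the matched
    span of the long form, or None if no alignment exists."""
    l = len(long_form) - 1
    for s in range(len(short_form) - 1, -1, -1):
        ch = short_form[s].lower()
        if not ch.isalnum():
            continue  # skip punctuation in the short form
        # Scan left for a matching character; the first short-form
        # character must additionally start a word in the long form.
        while l >= 0 and (
            long_form[l].lower() != ch
            or (s == 0 and l > 0 and long_form[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
    return long_form[l + 1:]
```

When a definition such as "hepatocellular carcinoma (HCC)" is matched, later standalone occurrences of "HCC" can be relabelled with the long form's entity class, which is the "override existing entities" behaviour described above.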
SapBert Embedding Utility
This component provides utilities for generating and retrieving embeddings using the SapBert model. It supports obtaining embeddings from data loaders or directly from strings, facilitating tasks that require semantic representations of text, such as similarity comparisons or information retrieval.
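SapBert itself is a transformer model, so generating the embeddings requires the model weights; the downstream use named above (similarity comparison and retrieval over semantic representations) can be illustrated with plain cosine similarity over toy vectors. The vectors and labels below are made up for illustration only.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def most_similar(query: Sequence[float],
                 candidates: Dict[str, Sequence[float]]) -> str:
    """Return the candidate label whose embedding is closest to the query,
    as in nearest-neighbour entity linking over embedded synonyms."""
    return max(candidates, key=lambda label: cosine_similarity(query, candidates[label]))
```

In practice the query vector would come from embedding a mention string with SapBert, and the candidate vectors from embedding knowledge base synonyms ahead of time.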
Gilda Context Downloader
This component is responsible for downloading and processing contextual information, primarily for Gilda, from external knowledge bases like Wikipedia and Wikidata. It includes functionalities to retry requests, retrieve URLs and content, and extract specific data such as Open Targets information.
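The request-retry behaviour mentioned above is typically implemented with exponential backoff, so transient failures from external services like Wikipedia or Wikidata do not abort a long download run. The helper below is a generic sketch of that pattern, independent of any particular HTTP client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call fn, retrying with exponential backoff on any exception.

    The delay doubles after each failed attempt; the final failure
    is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A real downloader would wrap each HTTP request in `retry`, and might additionally narrow the caught exception types to network errors so that programming bugs still fail immediately.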
String Normalization
This component provides a comprehensive set of tools for standardizing and cleaning text strings. It includes a default normalizer with methods for replacing substrings, handling numbers and Greek characters, and depluralizing terms, along with specialized normalizers for diseases, anatomy, genes, and companies, ensuring consistent text representation across the system.
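Two of the normalization steps named above, Greek character substitution and depluralization, can be sketched as follows. The substitution table and the depluralization heuristic here are deliberately minimal illustrations, not KAZU's actual rules.

```python
# Minimal illustrative substitution table (a real one covers the full alphabet).
GREEK_SUBS = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}


def replace_greek(text: str) -> str:
    """Spell out Greek letters so that e.g. 'TNF-α' matches 'TNF-alpha'."""
    for greek, spelled in GREEK_SUBS.items():
        text = text.replace(greek, spelled)
    return text


def depluralize(token: str) -> str:
    """Naive depluralization: strip a trailing 's' from longer tokens,
    leaving short tokens and '-ss' endings (e.g. 'class') untouched."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token
```

Normalizations like these matter for string matching against ontologies, where 'β-blockers' and 'beta blocker' should resolve to the same concept.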
General Utilities
This component provides foundational helper functions for common data processing tasks across the KAZU project. It includes utilities for mapping documents to section encodings and creating various types of n-grams (character and word), serving as a support layer for other components.
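The n-gram helpers mentioned above are small enough to sketch directly; these are generic implementations of character and word n-grams, with names chosen for illustration.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> List[str]:
    """All contiguous word n-grams, splitting on whitespace."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Character n-grams are a common basis for fuzzy string similarity (e.g. TF-IDF over trigrams), while word n-grams support phrase-level candidate generation.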
Caching Utility
Provides caching mechanisms for various KAZU functionalities, improving performance by storing and retrieving frequently accessed data.
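For pure functions of hashable arguments, this kind of caching can be had from the standard library's `functools.lru_cache`; the function body below is a stand-in for any expensive, repeatedly-invoked computation.

```python
from functools import lru_cache


@lru_cache(maxsize=128)
def expensive_normalize(text: str) -> str:
    # Stand-in for a costly normalization pipeline; repeated inputs
    # are served from the cache instead of being recomputed.
    return text.strip().lower()
```

`maxsize` bounds memory use by evicting the least recently used entries, and `cache_info()` exposes hit/miss counts for tuning.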
Link Index Management
Manages the indexing and lookup of entities for linking purposes, facilitating efficient retrieval of related information.
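At its simplest, an entity-linking index is an inverted map from normalized synonyms to identifiers. The toy class and the identifiers below are hypothetical, purely to illustrate the lookup structure.

```python
from collections import defaultdict
from typing import Set


class LinkIndex:
    """Toy synonym-to-identifier index for entity linking (sketch)."""

    def __init__(self) -> None:
        self._index: defaultdict = defaultdict(set)

    def add(self, synonym: str, identifier: str) -> None:
        # Case-fold at index time so lookups are case-insensitive.
        self._index[synonym.lower()].add(identifier)

    def lookup(self, mention: str) -> Set[str]:
        # A mention may map to several identifiers (ambiguity is
        # resolved downstream by disambiguation logic).
        return set(self._index.get(mention.lower(), set()))
```

A production index would layer fuzzy matching and embedding-based retrieval on top of this exact-match core.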
SpaCy Object Mapper
Provides utilities for mapping KAZU data structures to spaCy objects and vice versa, enabling seamless integration with spaCy's NLP capabilities.
Constants
Defines various constants used throughout the KAZU project, ensuring consistency and easy configuration of system parameters.
Grouping Utilities
Provides utility functions for grouping and organizing data, often used in processing and managing collections of entities or documents.
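A typical helper of this kind sorts by a key and then groups, since `itertools.groupby` only merges adjacent items and therefore requires sorted input. The sketch below illustrates that generic pattern; the name `sort_then_group` is used for illustration.

```python
from itertools import groupby
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")
K = TypeVar("K")


def sort_then_group(
    items: Iterable[T], key: Callable[[T], K]
) -> Iterator[Tuple[K, List[T]]]:
    """Yield (key, group) pairs. Sorting first is required because
    groupby only merges runs of adjacent equal-keyed items."""
    for k, group in groupby(sorted(items, key=key), key=key):
        yield k, list(group)
```

Forgetting the sort is a classic `groupby` bug: unsorted input silently produces multiple fragments per key instead of one group.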
Stanza Pipeline
Manages the integration and usage of Stanza NLP models within KAZU, providing functionalities for text processing with Stanza.