The NLP Pipeline Management subsystem in KAZU orchestrates the execution of Natural Language Processing (NLP) steps on documents and manages the spaCy models used within these pipelines. It covers the core pipeline execution flow, including error handling and performance monitoring, and provides utilities for loading, reloading, and processing documents with spaCy models. The subsystem integrates the individual NLP processing steps, uses an in-memory database for synonym lookups, and interacts with the model training and evaluation components and with ontology preprocessing, which enriches the knowledge base. It also supports pipeline testing and management within the Kazu Resource Tool (KRT) environment.
Components
Pipeline Orchestration
Manages the execution flow of documents through a series of NLP steps, handling pre-filtering, error management, and performance profiling. It is the central component for processing documents within KAZU.
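The three responsibilities named above (pre-filtering, error management, profiling) can be illustrated with a small standard-library sketch. This is not KAZU's implementation: `Doc`, `MiniPipeline`, and the two example steps are hypothetical names chosen for illustration, and the real pipeline class may behave differently.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a document passing through the pipeline."""
    text: str
    annotations: List[str] = field(default_factory=list)


Step = Callable[[Doc], None]


class MiniPipeline:
    """Runs named steps in order, recording failures and timings per step."""

    def __init__(self, steps: Dict[str, Step], prefilter: Callable[[Doc], bool]):
        self.steps = steps
        self.prefilter = prefilter
        self.failures: Dict[str, List[Doc]] = {name: [] for name in steps}
        self.timings: Dict[str, float] = {name: 0.0 for name in steps}

    def __call__(self, docs: List[Doc]) -> List[Doc]:
        # Pre-filtering: only documents accepted by the filter reach the steps.
        active = [d for d in docs if self.prefilter(d)]
        for name, step in self.steps.items():
            start = time.perf_counter()
            for doc in active:
                try:
                    step(doc)
                except Exception:
                    # Error management: record the failure and keep going.
                    self.failures[name].append(doc)
            # Profiling: accumulate wall-clock time spent in each step.
            self.timings[name] += time.perf_counter() - start
        return docs


def tag_length(doc: Doc) -> None:
    doc.annotations.append(f"len={len(doc.text)}")


def fail_on_bang(doc: Doc) -> None:
    if "!" in doc.text:
        raise ValueError("unsupported character")
    doc.annotations.append("clean")


pipeline = MiniPipeline(
    steps={"tag_length": tag_length, "fail_on_bang": fail_on_bang},
    prefilter=lambda d: bool(d.text.strip()),
)
docs = pipeline([Doc("EGFR mutation"), Doc("   "), Doc("oops!")])
```

In this sketch a failing step does not abort the batch: the whitespace-only document is skipped by the pre-filter, and the document that raises is recorded under the step that failed while the remaining documents continue.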
SpaCy Pipeline Management
A foundational utility for managing and providing access to spaCy language models and custom pipeline components, including mechanisms for adding, retrieving, and reloading models.
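A minimal sketch of the add, retrieve, and reload mechanism described above, assuming models are cached by name. `ModelRegistry` and `fake_loader` are hypothetical names and do not reflect KAZU's actual class or method signatures; the loader is injected so the example runs without spaCy installed.

```python
from typing import Any, Callable, Dict


class ModelRegistry:
    """Caches loaded models by name (illustrative, not KAZU's class)."""

    def __init__(self, loader: Callable[[str], Any]):
        self._loader = loader
        self._paths: Dict[str, str] = {}
        self._models: Dict[str, Any] = {}

    def add(self, name: str, path: str) -> None:
        """Register a model path and load it, unless the name is already known."""
        if name not in self._models:
            self._paths[name] = path
            self._models[name] = self._loader(path)

    def get(self, name: str) -> Any:
        """Return the cached model; raises KeyError for unknown names."""
        return self._models[name]

    def reload(self, name: str) -> Any:
        """Load the model again from its registered path and replace the cache entry."""
        self._models[name] = self._loader(self._paths[name])
        return self._models[name]


load_calls = []


def fake_loader(path: str) -> dict:
    # Stand-in for a real model loader; records each call for inspection.
    load_calls.append(path)
    return {"path": path, "load_number": len(load_calls)}


registry = ModelRegistry(loader=fake_loader)
registry.add("default", "/models/example")
registry.add("default", "/models/example")  # already cached, no second load
first = registry.get("default")
second = registry.reload("default")
```

With spaCy installed, `spacy.load` could be passed as the loader in place of `fake_loader`. Caching by name means each model is loaded once and shared by every step that asks for it, while `reload` gives a way to discard a cached instance.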
NLP Processing Steps
A collection of individual processing steps for Named Entity Recognition (NER) and entity linking, encompassing various model-based (e.g., LLM, HuggingFace, spaCy) and rule-based approaches.
General Utilities
Provides common utility functions used across the KAZU system, such as path handling, simple document creation, and specialized NLP utilities like abbreviation detection and spaCy object mapping.
In-Memory Database
Manages an in-memory database primarily used for storing and retrieving synonym data, enabling efficient lookups and matching during NLP processing steps.
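The lookup pattern can be sketched as a nested dictionary keyed by source and normalised synonym. This is an assumption about the general shape of such a store, not KAZU's data model: `SynonymStore`, the `"genes"` source name, and the identifier `GENE:0001` are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


class SynonymStore:
    """In-memory index: source name -> normalised synonym -> entity ids."""

    def __init__(self) -> None:
        self._index: Dict[str, Dict[str, Set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    @staticmethod
    def normalise(term: str) -> str:
        # Lower-case and collapse whitespace so lookups ignore surface noise.
        return " ".join(term.lower().split())

    def add(self, source: str, synonym: str, entity_id: str) -> None:
        self._index[source][self.normalise(synonym)].add(entity_id)

    def lookup(self, source: str, term: str) -> Set[str]:
        # .get() avoids creating empty entries for unknown sources or terms.
        by_synonym = self._index.get(source, {})
        return set(by_synonym.get(self.normalise(term), set()))


store = SynonymStore()
store.add("genes", "EGFR", "GENE:0001")  # illustrative identifier
store.add("genes", "Epidermal growth factor receptor", "GENE:0001")

hits = store.lookup("genes", "  egfr ")
misses = store.lookup("genes", "unknown term")
```

Because both insertion and lookup go through the same normalisation, each query is a pair of dictionary accesses, which is what makes this kind of store fast enough to call from inside a processing step.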
Model Training & Evaluation
Provides functionality for training machine learning models, running predictions with them, and evaluating the results, particularly for Named Entity Recognition (NER). It includes utilities for data handling, model wrapping, and metric calculation.
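As an illustration of the metric calculation mentioned above, the following sketch computes entity-level precision, recall, and F1 under exact matching of span boundaries and labels. Whether KAZU's evaluation code uses this exact matching criterion is an assumption; `entity_prf` and the example spans are hypothetical.

```python
from typing import Dict, Set, Tuple

# (start offset, end offset, entity class)
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over sets of entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {(0, 4, "gene"), (10, 18, "disease"), (25, 30, "drug")}
predicted = {(0, 4, "gene"), (10, 18, "drug")}  # second span has the wrong class

scores = entity_prf(gold, predicted)
# One of two predictions is correct and one of three gold spans is found:
# precision = 0.5, recall = 1/3, F1 = 0.4
```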
Ontology Preprocessing
Responsible for generating and expanding synonyms and variants from ontological data, often leveraging spaCy pipelines for linguistic analysis to enrich the knowledge base.
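To make the idea of variant generation concrete, here is a deliberately simple sketch that works on strings alone. It is not how KAZU generates synonyms: the description above says spaCy pipelines are often used for linguistic analysis, whereas this hypothetical `generate_variants` function only swaps hyphens and spaces and adds lower-cased forms.

```python
from typing import Set


def generate_variants(synonym: str) -> Set[str]:
    """Return the original string plus simple surface-form variants."""
    variants = {synonym}
    if "-" in synonym:
        # "IL-6" is commonly also written as "IL 6" or "IL6".
        variants.add(synonym.replace("-", " "))
        variants.add(synonym.replace("-", ""))
    if " " in synonym:
        # "breast cancer" may also appear hyphenated.
        variants.add(synonym.replace(" ", "-"))
    # Add a lower-cased copy of every variant collected so far.
    variants |= {v.lower() for v in variants}
    return variants


interleukin_variants = generate_variants("IL-6")
disease_variants = generate_variants("breast cancer")
```

Expanding each ontology synonym into a set of variants ahead of time moves the cost of handling surface variation out of the per-document lookup path.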
KRT Pipeline Testing
Supports testing and management of KAZU pipelines within the Kazu Resource Tool (KRT) environment, interacting with resource managers to load and test pipeline configurations.