The KAZU system is designed for processing and analyzing biomedical text, with a primary focus on Named Entity Recognition (NER) and entity linking. Built on shared data models and preprocessed ontology data, it orchestrates a pipeline of processing steps that apply a range of NER and linking strategies. The system also provides functionality for training machine learning models, general-purpose utilities, and web interfaces for annotation and interaction. Its core purpose is to extract biomedical entities and link them to knowledge bases, facilitating downstream analysis.
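At a high level, usage follows a simple pattern: construct a Pipeline from a list of steps, wrap the input text in a Document, and run the pipeline over it. The sketch below shows this flow with an empty step list for brevity; real deployments load configured NER and linking steps (typically via the shipped Hydra configs), and exact import paths vary between kazu versions, so treat it as illustrative rather than definitive.

```python
# Illustrative only: import paths and constructor details vary across kazu
# versions (e.g. kazu.data vs. kazu.data.data), so check your installed docs.
from kazu.data import Document
from kazu.pipeline import Pipeline

# In a real deployment, configured NER and linking step objects go here,
# usually instantiated from the shipped Hydra configs.
steps = []
pipeline = Pipeline(steps=steps)

doc = Document.create_simple_document(
    "Imatinib is used to treat chronic myeloid leukaemia."
)
pipeline([doc])  # steps mutate the documents in place

for section in doc.sections:
    for entity in section.entities:
        print(entity.match, entity.entity_class, entity.mappings)
```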
Components
Core Data Models
Defines fundamental data structures (documents, entities, sections, mappings, ontology resources) used across the KAZU system.
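As a rough mental model, the structures nest as follows. This is a deliberately simplified mirror using plain dataclasses; the names approximate kazu's real classes, but the actual data model carries many more fields.

```python
from dataclasses import dataclass, field

# Simplified mirror of the structures described above; illustrative only,
# not kazu's actual classes.

@dataclass
class Mapping:
    """Links an entity mention to a knowledge-base identifier."""
    source: str  # e.g. the ontology name
    idx: str     # the knowledge-base ID

@dataclass
class Entity:
    match: str         # surface form as it appears in the text
    entity_class: str  # e.g. "gene", "disease", "drug"
    start: int
    end: int
    mappings: list[Mapping] = field(default_factory=list)

@dataclass
class Section:
    text: str
    entities: list[Entity] = field(default_factory=list)

@dataclass
class Document:
    sections: list[Section]
```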
Pipeline Management
This component is responsible for orchestrating the overall document processing workflow within KAZU. It manages the execution of various steps, handles document pre-filtering, updates failed documents, and provides profiling capabilities for performance monitoring. It acts as the central control flow for processing documents through the KAZU system.
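The control flow can be pictured as a loop over steps with failure tracking and per-step timing. The sketch below is a hypothetical simplification, not kazu's actual Pipeline class; it assumes a step contract where each step returns a (processed_docs, failed_docs) pair.

```python
import time
from collections import defaultdict

# Hypothetical sketch of the orchestration flow described above: each step
# processes the batch, failed documents are set aside, and per-step timings
# are recorded for profiling.

class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps
        self.timings = defaultdict(float)
        self.failed_docs = []

    def __call__(self, docs):
        for step in self.steps:
            started = time.perf_counter()
            # Assumed step contract: return (processed_docs, failed_docs).
            docs, failures = step(docs)
            self.timings[type(step).__name__] += time.perf_counter() - started
            self.failed_docs.extend(failures)
        return docs
```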
NER Steps
This component encompasses the Named Entity Recognition (NER) steps, each implementing a different strategy for identifying and extracting entities from text: rule-based approaches (SETH, OPSIN), machine learning models (spaCy, Transformers, GLiNER, LLM-based NER), and utilities for processing tokenized words and post-processing entities.
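All of these strategies can be viewed through a common step interface. The hypothetical rule-based step below follows the (processed, failed) contract from the pipeline sketch above and emits instances of the simplified Entity dataclass from the data-model sketch; real kazu steps produce kazu's own Entity objects.

```python
import re

GENE_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]{2,}\b")  # crude gene-symbol heuristic

# Hypothetical rule-based NER step, illustrative only.
class RegexGeneNerStep:
    def __call__(self, docs):
        failed = []
        for doc in docs:
            try:
                for section in doc.sections:
                    for m in GENE_PATTERN.finditer(section.text):
                        section.entities.append(
                            Entity(
                                match=m.group(),
                                entity_class="gene",
                                start=m.start(),
                                end=m.end(),
                            )
                        )
            except Exception:
                failed.append(doc)
        return docs, failed
```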
Linking & Disambiguation Steps
This component focuses on linking identified entities to knowledge base identifiers and resolving ambiguities. It includes strategies for mapping entities, disambiguating candidates based on various criteria (e.g., document context, default labels, embedding similarity), and managing cross-references.
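A minimal sketch of this flow, assuming a toy synonym index with placeholder knowledge-base IDs: candidates are retrieved by normalized string lookup, then disambiguated by keyword overlap with the document context. Real kazu strategies are considerably richer (embedding similarity, default labels, cross-references).

```python
# Hypothetical dictionary linker; IDs and keyword sets are placeholders.

def normalize(term: str) -> str:
    return " ".join(term.lower().split())

# normalized synonym -> [(knowledge-base ID, context keywords)]
SYNONYM_INDEX = {
    "egfr": [
        ("KB:GENE:0001", {"gene", "kinase", "mutation"}),
        ("KB:LABTEST:0001", {"kidney", "renal", "filtration"}),
    ],
}

def link(mention: str, doc_text: str) -> str | None:
    candidates = SYNONYM_INDEX.get(normalize(mention), [])
    context = set(doc_text.lower().split())
    # Pick the candidate whose keywords overlap the document context most.
    scored = [(len(keywords & context), idx) for idx, keywords in candidates]
    return max(scored)[1] if scored else None

print(link("EGFR", "an activating EGFR kinase mutation"))  # -> KB:GENE:0001
```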
Ontology Preprocessing
This component is responsible for preparing and managing ontology data. It includes utilities for loading ontology resources, handling global parser actions, resolving linking candidates, scoring and grouping IDs, and generating synonyms. It ensures that ontology data is in a usable format for the NER and linking steps.
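For instance, synonym generation and index building might conceptually look like the hypothetical sketch below, which expands each ontology label into simple variants and builds a normalized-term-to-ID-set index for the linking steps; kazu's actual synonym generators and scoring logic are more sophisticated.

```python
# Hypothetical synonym-generation pass, illustrative only.

def variants(label: str) -> set[str]:
    base = label.strip()
    return {base, base.lower(), base.upper(), base.replace("-", " ")}

def build_synonym_index(ontology: dict[str, str]) -> dict[str, set[str]]:
    index: dict[str, set[str]] = {}
    for idx, label in ontology.items():
        for syn in variants(label):
            index.setdefault(syn.lower(), set()).add(idx)
    return index

index = build_synonym_index({"KB:0001": "non-small cell lung cancer"})
# {'non-small cell lung cancer': {'KB:0001'},
#  'non small cell lung cancer': {'KB:0001'}}
```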
Training & Modelling
This component provides functionalities for training and evaluating machine learning models within KAZU. It includes utilities for getting label lists, creating model wrappers, managing training data, and saving trained models. It supports the development and deployment of custom NER and linking models.
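As one concrete illustration, "getting label lists" for a token-classification NER model typically means deriving a BIO tag set from the entity classes in the training data. The helper below is a hypothetical sketch of that idea, not kazu's exact function.

```python
# Hypothetical helper: derive the BIO tagging scheme from entity classes.

def get_label_list(entity_classes: list[str]) -> list[str]:
    labels = ["O"]  # outside any entity
    for cls in sorted(entity_classes):
        labels.extend([f"B-{cls}", f"I-{cls}"])
    return labels

labels = get_label_list(["gene", "disease", "drug"])
label_to_id = {label: i for i, label in enumerate(labels)}
# ['O', 'B-disease', 'I-disease', 'B-drug', 'I-drug', 'B-gene', 'I-gene']
```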
Utility Functions
This component provides a collection of general-purpose utility functions used across various parts of the KAZU system. This includes string normalization, abbreviation detection, grouping and sorting, file path handling, and managing link indices. These utilities support common operations and improve code reusability.
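As an example, abbreviation detection in biomedical text is commonly done in the style of Schwartz & Hearst: find "long form (SF)" patterns and check that the short form's characters occur in order within the long form. The sketch below is a heavily simplified, hypothetical version of that idea.

```python
import re

# Hypothetical, heavily simplified Schwartz & Hearst-style detector.

def matches_long_form(short: str, long: str) -> bool:
    # True if the short form's characters appear in order in the long form.
    it = iter(long.lower())
    return all(ch in it for ch in short.lower())

def find_abbreviations(text: str) -> dict[str, str]:
    pairs = {}
    for m in re.finditer(r"([\w][\w\s-]+?)\s*\(([A-Za-z]{2,10})\)", text):
        long_form, short_form = m.group(1), m.group(2)
        if matches_long_form(short_form, long_form):
            pairs[short_form] = long_form.strip()
    return pairs

print(find_abbreviations("epidermal growth factor receptor (EGFR) signaling"))
# {'EGFR': 'epidermal growth factor receptor'}
```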
Annotation & Web Interfaces
This component handles the conversion of annotation data from external tools like Label Studio into KAZU's internal document format. It also provides functionalities for web-based document conversion, enabling interaction with the KAZU system through a web interface.
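A hypothetical converter for one direction of this flow is sketched below: it walks a Label Studio task's span annotations and produces instances of the simplified Entity dataclass from the data-model sketch above. The JSON layout follows Label Studio's span export format ("annotations" -> "result" -> "value"), but field names should be verified against your actual export.

```python
# Hypothetical converter; reuses the simplified Entity dataclass above.

def labelstudio_task_to_entities(task: dict) -> list["Entity"]:
    entities = []
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            value = result.get("value", {})
            for label in value.get("labels", []):
                entities.append(
                    Entity(
                        match=value["text"],
                        entity_class=label,
                        start=value["start"],
                        end=value["end"],
                    )
                )
    return entities
```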
Knowledge Resource Tools (KRT)
This component provides tools and functionalities for managing and interacting with knowledge resources. It includes capabilities for creating placeholder resources, extracting associated ID sets from dataframes, and processing text for pipeline testing within the KRT environment.
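"Extracting associated ID sets from dataframes" can be pictured as a groupby-and-collect operation over a (synonym, ID) table, as in the hypothetical pandas sketch below.

```python
import pandas as pd

# Hypothetical version of the described operation: group a (synonym, id)
# table and collect the set of IDs per synonym.
frame = pd.DataFrame(
    {
        "synonym": ["aspirin", "aspirin", "paracetamol"],
        "idx": ["KB:0001", "KB:0002", "KB:0003"],
    }
)

id_sets = frame.groupby("synonym")["idx"].agg(set).to_dict()
# {'aspirin': {'KB:0001', 'KB:0002'}, 'paracetamol': {'KB:0003'}}
```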
Other Steps
This component includes miscellaneous processing steps that do not fit into the core NER or linking categories, such as Stanza for linguistic processing (e.g. sentence segmentation) and cleanup actions for refining document data.
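As an illustration of a cleanup action, the hypothetical function below drops entities that no linking step managed to map to a knowledge-base identifier, reusing the simplified dataclasses from the data-model sketch above.

```python
# Hypothetical cleanup action, not one of kazu's real cleanup actions.

def drop_unmapped_entities(doc: Document) -> None:
    for section in doc.sections:
        section.entities = [e for e in section.entities if e.mappings]
```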