KAZU is a Natural Language Processing (NLP) framework for biomedical text analysis. Documents pass through a configurable NLP pipeline whose steps perform Named Entity Recognition (NER) and Entity Linking & Disambiguation. A set of core data models is shared by every operation, and ontology management supplies the knowledge base that entities are linked against. The system also includes tooling for resource curation, model training and evaluation, and a web API for external integration, supported by shared utilities and quality assurance tooling.
Components
Core Data Models
Defines fundamental data structures (documents, entities, sections, mappings, ontology resources) used across the KAZU system.
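To make the relationship between these structures concrete, the following is a minimal sketch using plain dataclasses. The class and field names are simplified, hypothetical stand-ins chosen for illustration and should not be read as KAZU's actual class definitions.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical, simplified stand-ins for illustration only.
@dataclass
class Mapping:
    source: str        # name of the ontology the identifier comes from
    idx: str           # identifier within that ontology
    confidence: float  # how strongly the linker believes in this mapping


@dataclass
class Entity:
    match: str         # the text span as it appears in the section
    entity_class: str  # e.g. "gene" or "disease"
    start: int         # character offset where the span begins
    end: int           # character offset just past the end of the span
    mappings: List[Mapping] = field(default_factory=list)


@dataclass
class Section:
    name: str
    text: str
    entities: List[Entity] = field(default_factory=list)


@dataclass
class Document:
    idx: str
    sections: List[Section] = field(default_factory=list)

    def all_entities(self) -> List[Entity]:
        """Flatten the entities of every section into one list."""
        return [e for s in self.sections for e in s.entities]


text = "EGFR mutations are common in lung cancer."
section = Section(name="abstract", text=text)
section.entities.append(Entity(match="EGFR", entity_class="gene", start=0, end=4))
doc = Document(idx="doc-1", sections=[section])
```

In this sketch a document owns sections, a section owns the entities found in its text, and an entity accumulates mappings once linking has run, so later steps only need to read and extend the same object graph.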
NLP Pipeline Management
Orchestrates the execution of NLP processing steps on documents and manages spaCy models within the pipeline.
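The orchestration idea can be sketched as a list of steps applied in order to a batch of documents. This sketch assumes, purely for illustration, that a document is a plain dict and that each step returns a pair of (processed, failed) documents; `SimplePipeline`, `tokenize`, and `count_tokens` are hypothetical names, and spaCy model management is left out.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a document is a plain dict, and a step returns (processed, failed).
Doc = Dict[str, object]
Step = Callable[[List[Doc]], Tuple[List[Doc], List[Doc]]]


class SimplePipeline:
    """Runs steps in order; documents that failed a step skip later steps."""

    def __init__(self, steps: List[Step]):
        self.steps = steps

    def __call__(self, docs: List[Doc]) -> List[Doc]:
        failed_ids = set()
        for step in self.steps:
            # Only pass on documents that no earlier step has failed on.
            batch = [d for d in docs if id(d) not in failed_ids]
            _, failures = step(batch)
            failed_ids.update(id(d) for d in failures)
        return docs


def tokenize(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    for d in docs:
        d["tokens"] = str(d["text"]).split()
    return docs, []


def count_tokens(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    failed = []
    for d in docs:
        if not d["tokens"]:
            failed.append(d)  # nothing to count: report as failed
            continue
        d["n_tokens"] = len(d["tokens"])
    return docs, failed


pipeline = SimplePipeline([tokenize, count_tokens])
docs = pipeline([{"text": "BRCA1 is a gene"}, {"text": ""}])
```

Steps mutate documents in place, so the pipeline itself only has to decide which documents each step sees.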
Named Entity Recognition (NER)
Identifies and extracts named entities from text using transformer models, rule-based approaches, and post-processing.
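As an illustration of the rule-based and post-processing parts only (the transformer models are out of scope here), the sketch below matches a small lexicon against text and then resolves overlapping spans by keeping the longest. The lexicon, the function names, and the longest-span policy are assumptions made for this example.

```python
import re
from typing import Dict, List, Tuple

Span = Tuple[int, int, str, str]  # (start, end, matched text, label)


def dictionary_ner(text: str, lexicon: Dict[str, str]) -> List[Span]:
    """Find every case-insensitive, whole-word occurrence of each lexicon term."""
    spans = []
    for term, label in lexicon.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), m.group(), label))
    return spans


def keep_longest(spans: List[Span]) -> List[Span]:
    """Post-processing: when spans overlap, keep the longest one."""
    kept: List[Span] = []
    # Visit longer spans first so they win over the shorter spans they cover.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)  # back to document order


# Hypothetical lexicon for the example.
lexicon = {
    "lung cancer": "disease",
    "non-small cell lung cancer": "disease",
    "EGFR": "gene",
}
text = "Patients with non-small cell lung cancer carried EGFR mutations."
entities = keep_longest(dictionary_ner(text, lexicon))
```

Here the nested match "lung cancer" is dropped in favour of the longer "non-small cell lung cancer", which is one common way to tidy raw matches before linking.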
Entity Linking & Disambiguation
Links identified entities to external knowledge bases and disambiguates between potential links using dictionary, rule-based, and context-scoring strategies.
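A dictionary lookup followed by context scoring can be sketched as below. The knowledge base, its identifiers, and the word-overlap score are all invented for illustration; a real scorer would likely use something richer than counting shared words.

```python
import re
from typing import Dict, List, Optional

# Hypothetical mini knowledge base: one ambiguous mention with two candidates.
# The identifiers are made up for this example.
KB: Dict[str, List[dict]] = {
    "cold": [
        {"id": "EXAMPLE:0001", "context": {"virus", "infection", "cough"}},
        {"id": "EXAMPLE:0002", "context": {"temperature", "exposure", "hypothermia"}},
    ],
}


def link(mention: str, sentence: str, kb: Dict[str, List[dict]]) -> Optional[str]:
    """Dictionary lookup, then pick the candidate sharing most context words."""
    candidates = kb.get(mention.lower(), [])
    if not candidates:
        return None  # mention is not in the dictionary
    words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    best = max(candidates, key=lambda c: len(c["context"] & words))
    return best["id"]


first = link("cold", "The patient had a cold with cough and a throat infection.", KB)
second = link("cold", "Prolonged exposure to cold lowered body temperature.", KB)
```

The same surface form resolves to different identifiers depending on the surrounding words, which is the essence of the disambiguation step.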
Ontology Management
Manages the parsing, curation, and generation of synonyms for various ontologies, supporting external knowledge integration.
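Synonym generation can be illustrated with a few string rewrites applied to an ontology label. The rules below (hyphen variants and Greek letter substitution) and the name `generate_synonyms` are assumptions chosen for the example, not a description of the generators KAZU ships.

```python
from typing import Set

# Spelled-out Greek letters and their symbols (a small illustrative subset).
GREEK = {"alpha": "α", "beta": "β", "gamma": "γ"}


def generate_synonyms(term: str) -> Set[str]:
    """Return the term plus simple spelling variants of it."""
    # Hyphen variants: keep as is, replace with a space, or remove.
    variants = {term, term.replace("-", " "), term.replace("-", "")}
    # Greek variants: naive substring replacement on every variant so far.
    for name, symbol in GREEK.items():
        variants |= {v.replace(name, symbol) for v in variants}
    return variants


synonyms = generate_synonyms("TNF-alpha")
```

Because the Greek substitution is a plain substring replacement, it would also rewrite words such as "alphabet", so a production rule set would need word-boundary checks and curation of the generated variants.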
Model Training & Evaluation
Provides functionalities for training, predicting, and evaluating machine learning models, particularly for multi-label NER, including data handling and metric calculation.
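For the metric calculation, one plausible choice in a multi-label setting is micro-averaged precision, recall, and F1 over per-token label sets. The sketch below assumes that representation (one set of labels per token); the function name and the choice of micro-averaging are illustrative assumptions.

```python
from typing import Dict, List, Set


def micro_prf(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Micro-averaged precision, recall and F1 over per-token label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # labels predicted and correct
    fp = sum(len(p - g) for g, p in zip(gold, pred))  # labels predicted but wrong
    fn = sum(len(g - p) for g, p in zip(gold, pred))  # labels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Four tokens; the second token carries two gold labels at once.
gold = [{"gene"}, {"gene", "drug"}, set(), {"disease"}]
pred = [{"gene"}, {"gene"}, {"disease"}, {"disease"}]
scores = micro_prf(gold, pred)
```

In this example there are three correct labels, one spurious label, and one missed label, so precision, recall, and F1 all come out at 0.75.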
Resource Management Tools (KRT)
Offers interactive tools for managing and curating KAZU resources, including resource editing, conflict resolution, and ontology updates.
Web API Interface
Provides RESTful API endpoints for external applications to interact with the KAZU system, enabling NER and entity linking operations.
Shared Utilities
A collection of reusable utility functions and helper classes supporting various KAZU functionalities, including string normalization, caching, and abbreviation detection.
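String normalization and caching combine naturally, since the same surface forms recur many times across a corpus. The following sketch uses only the standard library; the normalization rules and the function name `normalize` are assumptions for illustration.

```python
import re
import unicodedata
from functools import lru_cache


@lru_cache(maxsize=10_000)
def normalize(term: str) -> str:
    """Fold Unicode variants, replace punctuation with spaces, and uppercase."""
    text = unicodedata.normalize("NFKC", term)
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation becomes a space
    text = re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace
    return text.upper()


a = normalize("  il-6 ")
b = normalize("IL\u20136")  # same term written with an en dash
```

Both spellings reduce to the same key, so they can share one dictionary entry, and repeated calls with the same input are served from the cache rather than recomputed.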
Annotation & Quality Assurance
Provides tools for converting KAZU data for annotation and performing acceptance tests to ensure the quality and consistency of annotations and pipeline results.