The Model Training & Evaluation subsystem in KAZU provides a comprehensive suite of functionality for developing, training, and assessing machine learning models, tailored to multi-label Named Entity Recognition (NER). Its main flow involves preparing raw data, training transformer-based models, persisting them, and then using them for prediction and evaluation. The subsystem also integrates with external tools such as Label Studio for annotation visualization, and leverages KAZU's internal pipeline steps for document processing, yielding a cohesive machine learning workflow.
Components
Training Orchestrator
Manages the entire lifecycle of model training, including initialization, data loading, optimization, evaluation, and model saving. It coordinates interactions with datasets, model architectures, and persistence mechanisms.
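The lifecycle described above can be sketched as a small orchestrator class. All names here (`Trainer`, `train_epoch`, `run`) are illustrative stand-ins, not KAZU's actual API; the point is the ordering of the phases it coordinates.

```python
# Minimal sketch of a training-orchestrator lifecycle. All names are
# hypothetical; a real orchestrator would run forward/backward passes,
# track optimizer state, and write checkpoints.

class Trainer:
    def __init__(self, model, train_data, eval_data, epochs=2):
        self.model = model
        self.train_data = train_data
        self.eval_data = eval_data
        self.epochs = epochs
        self.history = []  # records the phases, in order, for illustration

    def train_epoch(self):
        self.history.append("train")

    def evaluate(self):
        self.history.append("eval")
        return {"f1": 0.0}  # placeholder metric

    def save(self, path):
        self.history.append(f"save:{path}")

    def run(self, save_path):
        for _ in range(self.epochs):
            self.train_epoch()
            self.evaluate()  # evaluate after every epoch
        self.save(save_path)
        return self.history

history = Trainer(model=None, train_data=[], eval_data=[]).run("out/model")
```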
Data Preparation
Responsible for transforming raw documents and annotations into a format suitable for model training and evaluation. This includes tokenization, alignment of entities, and creation of multi-hot encoded labels, along with utilities for yielding documents from various sources.
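The entity-alignment and multi-hot encoding steps can be illustrated with a toy example. The whitespace tokenizer and the `gene`/`disease` label set below are assumptions for demonstration only; KAZU's own data preparation uses real tokenizers and its configured label classes.

```python
# Sketch of turning character-level entity annotations into per-token
# multi-hot label vectors. Tokenizer and class names are illustrative.

CLASSES = ["gene", "disease"]  # hypothetical label set

def tokenize_with_offsets(text):
    """Whitespace tokenizer that records (start, end) character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append((word, start, start + len(word)))
        pos = start + len(word)
    return tokens

def multi_hot_labels(text, entities):
    """entities: list of (start, end, class_name) character spans.

    A token receives a 1 for every class whose span overlaps it, so
    overlapping entities naturally yield multi-label targets.
    """
    tokens = tokenize_with_offsets(text)
    labels = []
    for _, tok_start, tok_end in tokens:
        row = [0] * len(CLASSES)
        for ent_start, ent_end, cls in entities:
            if tok_start < ent_end and ent_start < tok_end:  # spans overlap
                row[CLASSES.index(cls)] = 1
        labels.append(row)
    return [t[0] for t in tokens], labels

tokens, labels = multi_hot_labels(
    "BRCA1 causes cancer",
    [(0, 5, "gene"), (13, 19, "disease"), (0, 5, "disease")],
)
```

Note how "BRCA1" ends up with two active labels, which is exactly the case multi-hot encoding exists to represent.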
Model Architectures
Defines the specific neural network models used for multi-label token classification. These models typically extend pre-trained Hugging Face Transformers models and are adapted to handle multi-label outputs.
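The key adaptation for multi-label output is scoring each class with an independent sigmoid rather than a softmax over classes (training then typically uses a per-class binary cross-entropy loss, e.g. PyTorch's BCEWithLogitsLoss). A dependency-free sketch of the prediction side, with made-up logits:

```python
import math

# Sketch of the multi-label prediction head: an independent sigmoid per
# class instead of a softmax, so one token can carry several classes.
# The logits and threshold below are illustrative.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_multi_label(token_logits, threshold=0.5):
    """token_logits: per-token list of per-class raw scores."""
    return [
        [sigmoid(logit) >= threshold for logit in logits]
        for logits in token_logits
    ]

# Two tokens, two classes: the first token activates both classes at once.
preds = predict_multi_label([[2.0, 1.5], [-3.0, 0.2]])
```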
Model Persistence
Handles the saving and loading of trained models, their configurations, and associated artifacts, ensuring that trained models can be reused for prediction or further evaluation.
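A common pattern for this kind of persistence, mirrored loosely on the Hugging Face model-directory layout, is a config file saved next to the weights. The file names and the JSON stand-in for weights below are assumptions, not KAZU's exact on-disk format:

```python
import json
import os
import tempfile

# Sketch of a save/load round trip: config as JSON alongside a weights
# file. Real code would use torch.save or safetensors for the weights.

def save_model(path, config, weights):
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "config.json"), "w") as f:
        json.dump(config, f)
    with open(os.path.join(path, "weights.json"), "w") as f:
        json.dump(weights, f)

def load_model(path):
    with open(os.path.join(path, "config.json")) as f:
        config = json.load(f)
    with open(os.path.join(path, "weights.json")) as f:
        weights = json.load(f)
    return config, weights

with tempfile.TemporaryDirectory() as d:
    save_model(d, {"num_labels": 4}, {"bias": [0.0] * 4})
    config, weights = load_model(d)
```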
Prediction & Evaluation Execution
Manages the execution of prediction and evaluation tasks. It orchestrates the loading of trained models, processing of documents through the KAZU pipeline, generation of predictions, and calculation of performance metrics.
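The metric-calculation part can be sketched as span-level exact-match precision/recall/F1 over predicted versus gold (start, end, class) triples. This mirrors the standard NER evaluation scheme; KAZU's own metric computation may differ in detail:

```python
# Sketch of span-level evaluation: exact-match precision/recall/F1 over
# (start, end, class) spans. Example spans are made up.

def span_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # spans that match exactly, class included
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

metrics = span_f1(
    gold=[(0, 5, "gene"), (13, 19, "disease")],
    predicted=[(0, 5, "gene"), (6, 12, "disease")],
)
```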
Label Studio Integration
Provides functionality for interacting with Label Studio, an annotation tool. It supports creating and updating annotation views, so that model predictions can be visualized alongside gold annotations for human review.
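Label Studio consumes tasks as JSON with `data`, `annotations` (gold), and `predictions` sections, which is what makes the side-by-side review view possible. A sketch of building such a task; the `from_name`/`to_name` values must match the project's labeling config, and the example spans are made up:

```python
# Sketch of a Label Studio task carrying gold annotations and model
# predictions for the same text. Field values are illustrative.

def span_result(start, end, labels, from_name="ner", to_name="text"):
    return {
        "from_name": from_name,
        "to_name": to_name,
        "type": "labels",
        "value": {"start": start, "end": end, "labels": labels},
    }

def build_task(text, gold_spans, predicted_spans):
    return {
        "data": {"text": text},
        "annotations": [{"result": [span_result(*s) for s in gold_spans]}],
        "predictions": [{"result": [span_result(*s) for s in predicted_spans]}],
    }

task = build_task(
    "BRCA1 causes cancer",
    gold_spans=[(0, 5, ["gene"])],
    predicted_spans=[(0, 5, ["gene"]), (13, 19, ["disease"])],
)
```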
KAZU Pipeline Steps
Encompasses the KAZU pipeline steps designed for NER with Hugging Face Transformers models. These steps integrate the transformer models into the KAZU processing pipeline and convert their token-level outputs into structured KAZU Entity objects.
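The token-to-entity conversion is essentially tag-sequence decoding. A sketch using BIO tags, a common scheme for this step; the plain-dict "entity" here is illustrative, not KAZU's Entity class:

```python
# Sketch of decoding token-level BIO tags into entity spans, the kind of
# post-processing a transformer NER step performs before emitting
# structured entity objects.

def bio_to_entities(tokens, tags):
    """tokens: list of (text, start, end); tags: BIO labels like 'B-gene'."""
    entities, current = [], None
    for (text, start, end), tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"cls": tag[2:], "start": start, "end": end}
        elif tag.startswith("I-") and current and current["cls"] == tag[2:]:
            current["end"] = end  # extend the currently open entity
        else:  # 'O' tag or an I- tag with no matching open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities

tokens = [("BRCA1", 0, 5), ("gene", 6, 10), ("causes", 11, 17), ("cancer", 18, 24)]
entities = bio_to_entities(tokens, ["B-gene", "I-gene", "O", "B-disease"])
```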
Configuration
Manages configuration settings for model training and evaluation, such as model paths, training parameters, and data directories.
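A schematic of such a configuration object as a dataclass. The field names and defaults are typical of training configs but are assumptions here, not KAZU's real schema (KAZU's configuration is generally driven by YAML files, e.g. via Hydra):

```python
from dataclasses import dataclass, field

# Illustrative training configuration; field names and defaults are
# hypothetical, shown only to make the kinds of settings concrete.

@dataclass
class TrainingConfig:
    model_path: str = "bert-base-cased"   # base model to fine-tune
    data_dir: str = "data/ner"            # prepared training/eval data
    output_dir: str = "outputs"           # where trained models are saved
    epochs: int = 3
    learning_rate: float = 3e-5
    label_classes: list = field(default_factory=lambda: ["gene", "disease"])

config = TrainingConfig(epochs=5)
```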