Final Component Overview for Data Management & Preparation within the chemicalx project
Components
Data Loaders
This component is the primary interface for acquiring raw chemical and biological datasets. It hides the details of fetching data from remote or local sources, including specific public datasets, and populates the initial `DrugFeatureSet`, `ContextFeatureSet`, and `LabeledTriples` objects. It follows the "Data Loaders" pattern, giving diverse data sources a unified API.
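To make the "unified API" idea concrete, here is a minimal sketch using only the standard library. It is not the chemicalx implementation: `ToyLoader`, `ToyLocalLoader`, the three file names, and the CSV column names are all assumptions chosen for illustration.

```python
import csv
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class ToyLoader(ABC):
    """Hypothetical unified interface: every source exposes the same getters."""

    @abstractmethod
    def get_drug_features(self) -> dict: ...

    @abstractmethod
    def get_context_features(self) -> dict: ...

    @abstractmethod
    def get_labeled_triples(self) -> list: ...


class ToyLocalLoader(ToyLoader):
    """Reads a dataset from a local directory (file names are assumed)."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def _read_json(self, name: str) -> dict:
        with (self.directory / name).open(encoding="utf-8") as handle:
            return json.load(handle)

    def get_drug_features(self) -> dict:
        return self._read_json("drug_set.json")

    def get_context_features(self) -> dict:
        return self._read_json("context_set.json")

    def get_labeled_triples(self) -> list:
        path = self.directory / "labeled_triples.csv"
        with path.open(newline="", encoding="utf-8") as handle:
            # Each CSV row becomes a (drug_1, drug_2, context, label) tuple.
            return [
                (row["drug_1"], row["drug_2"], row["context"], float(row["label"]))
                for row in csv.DictReader(handle)
            ]


# Usage: write a tiny dataset to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "drug_set.json").write_text(
        json.dumps({"D1": [0.1, 0.9], "D2": [0.4, 0.6]}), encoding="utf-8"
    )
    (root / "context_set.json").write_text(json.dumps({"C1": [1.0]}), encoding="utf-8")
    (root / "labeled_triples.csv").write_text(
        "drug_1,drug_2,context,label\nD1,D2,C1,1.0\n", encoding="utf-8"
    )
    loader = ToyLocalLoader(root)
    drugs = loader.get_drug_features()
    contexts = loader.get_context_features()
    triples = loader.get_labeled_triples()
```

A remote loader would implement the same three getters and differ only in how the files are fetched, so downstream code never needs to know where the data came from.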
Feature Management
This component defines and manages the structured representation of chemical features for drugs (`DrugFeatureSet`) and contextual information (`ContextFeatureSet`). These classes encapsulate the feature data, ensuring consistency and providing methods for accessing and manipulating them. They serve as the organized repositories for all input features required by the models. This aligns with "Feature Processors" and "Datasets" patterns.
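As a rough illustration of a feature container that enforces consistency, the sketch below maps identifiers to fixed-length vectors. `ToyFeatureSet` and its method names are hypothetical and use plain lists; the real `DrugFeatureSet` and `ContextFeatureSet` classes may store and expose features differently.

```python
from collections import UserDict
from typing import List, Sequence


class ToyFeatureSet(UserDict):
    """Hypothetical stand-in: maps an identifier to a fixed-length feature vector."""

    def __setitem__(self, key: str, features: Sequence[float]) -> None:
        # Enforce consistency: every vector must match the length already stored.
        if self.data and len(features) != self.get_feature_count():
            raise ValueError(
                f"{key!r} has {len(features)} features, "
                f"expected {self.get_feature_count()}"
            )
        self.data[key] = list(features)

    def get_feature_count(self) -> int:
        """Length of the feature vectors held in this set."""
        return len(next(iter(self.data.values())))

    def get_feature_matrix(self, keys: Sequence[str]) -> List[List[float]]:
        """Stack the vectors for the given identifiers, in the order requested."""
        return [self.data[key] for key in keys]


# Usage: build a set, query it, and show that a malformed vector is rejected.
drug_features = ToyFeatureSet({"D1": [0.1, 0.9], "D2": [0.4, 0.6]})
matrix = drug_features.get_feature_matrix(["D2", "D1"])
try:
    drug_features["D3"] = [1.0]  # wrong length
    rejected = False
except ValueError:
    rejected = True
```

Validating at insertion time means a malformed vector fails early, rather than surfacing later as a shape error inside a model.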
Interaction Data (Labeled Triples)
This component represents the core interaction data, typically as rows of the form (drug_1, drug_2, context, label): a drug pair, the context in which the pair was observed, and the label. It provides functionality for managing these rows, such as splitting the data into training and test sets and counting positive and negative interactions. It is the ground-truth dataset for model training and evaluation, and a fundamental "Dataset" component.
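The splitting and counting operations described above can be sketched as follows. This is an illustrative toy, not the `LabeledTriples` class itself: `ToyLabeledTriples`, its method names, and the assumed row layout (drug_1, drug_2, context, label) are stand-ins, and rows are kept in a plain list.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Assumed row layout: (drug_1, drug_2, context, label), with label 1.0 or 0.0.
Triple = Tuple[str, str, str, float]


@dataclass
class ToyLabeledTriples:
    rows: List[Triple]

    def positive_count(self) -> int:
        return sum(1 for row in self.rows if row[3] == 1.0)

    def negative_count(self) -> int:
        return len(self.rows) - self.positive_count()

    def train_test_split(self, train_size: float = 0.8, seed: int = 42):
        """Shuffle a copy of the rows with a fixed seed, then cut it in two."""
        shuffled = list(self.rows)
        random.Random(seed).shuffle(shuffled)
        cut = round(train_size * len(shuffled))
        return ToyLabeledTriples(shuffled[:cut]), ToyLabeledTriples(shuffled[cut:])


# Usage: ten rows, six positive and four negative, split 80/20.
all_triples = ToyLabeledTriples(
    [("D1", "D2", "C1", 1.0)] * 6 + [("D1", "D3", "C1", 0.0)] * 4
)
train, test = all_triples.train_test_split(train_size=0.8, seed=0)
```

Passing an explicit seed keeps the split reproducible, which matters when comparing models trained on the same data.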
Batching & Data Preparation
This component transforms the raw feature sets and labeled triples into mini-batches (`DrugPairBatch`) that deep learning models can consume efficiently. It iterates over the interaction data, looks up the corresponding features, and constructs the batch objects. It is a critical part of the "Training Pipeline" and "Data Preprocessing Scripts" patterns.
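The iterate, look up, and assemble loop can be sketched with a plain generator. The function name `iter_batches`, the dictionary keys, and the assumed row layout (drug_1, drug_2, context, label) are illustrative only; a real generator would typically emit tensors rather than nested lists.

```python
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str, float]  # assumed layout: (drug_1, drug_2, context, label)
Features = Dict[str, List[float]]


def iter_batches(
    triples: List[Triple],
    drug_features: Features,
    context_features: Features,
    batch_size: int,
) -> Iterator[dict]:
    """Yield one dict per mini-batch, with features looked up by identifier."""
    for start in range(0, len(triples), batch_size):
        chunk = triples[start:start + batch_size]
        yield {
            "drug_features_left": [drug_features[row[0]] for row in chunk],
            "drug_features_right": [drug_features[row[1]] for row in chunk],
            "context_features": [context_features[row[2]] for row in chunk],
            "labels": [row[3] for row in chunk],
        }


# Usage: five rows with a batch size of two give batches of 2, 2, and 1 rows.
drug_features = {"D1": [0.1, 0.9], "D2": [0.4, 0.6], "D3": [0.7, 0.3]}
context_features = {"C1": [1.0, 0.0], "C2": [0.0, 1.0]}
triples = [
    ("D1", "D2", "C1", 1.0),
    ("D1", "D3", "C2", 0.0),
    ("D2", "D3", "C1", 1.0),
    ("D3", "D1", "C2", 0.0),
    ("D2", "D1", "C2", 1.0),
]
batches = list(iter_batches(triples, drug_features, context_features, batch_size=2))
```

Because the generator is lazy, only one batch of features is materialized at a time, which keeps memory use bounded for large interaction sets.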
DrugPairBatch
This is a specialized data structure designed to hold a single mini-batch of drug pair interaction data, including drug features, context features, and labels, in a format optimized for PyTorch models. It acts as the standardized input for the deep learning models, ensuring efficient data flow during training and inference. This is a key "Data Structure" component.
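A batch container of this kind might look roughly like the dataclass below. The class name `ToyDrugPairBatch` and its fields are assumptions made for illustration, and nested lists stand in for the PyTorch tensors a real batch would hold.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyDrugPairBatch:
    """Hypothetical batch container; every field holds one entry per drug pair."""

    drug_features_left: List[List[float]]
    drug_features_right: List[List[float]]
    context_features: List[List[float]]
    labels: List[float]

    def __post_init__(self) -> None:
        # All fields must describe the same number of drug pairs.
        sizes = {
            len(self.drug_features_left),
            len(self.drug_features_right),
            len(self.context_features),
            len(self.labels),
        }
        if len(sizes) != 1:
            raise ValueError(f"inconsistent batch sizes: {sorted(sizes)}")

    def __len__(self) -> int:
        return len(self.labels)


# Usage: a well-formed batch of two pairs, then a malformed one that is rejected.
batch = ToyDrugPairBatch(
    drug_features_left=[[0.1, 0.9], [0.4, 0.6]],
    drug_features_right=[[0.4, 0.6], [0.7, 0.3]],
    context_features=[[1.0, 0.0], [0.0, 1.0]],
    labels=[1.0, 0.0],
)
try:
    ToyDrugPairBatch([[0.1]], [[0.2]], [[1.0]], labels=[1.0, 0.0])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Fixing the field names in one structure gives every model the same input contract, whichever dataset the batch came from.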
Data Utilities
This component provides a collection of general helper functions for common data-related tasks, such as writing data to JSON files or extracting features. These utilities support the overall data preparation and management processes, promoting code reusability and simplifying various data handling operations across the subsystem.
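A JSON-writing helper of the kind mentioned above could be as small as the sketch below. The function name `write_json` and its signature are hypothetical, not the project's actual utility.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def write_json(data: Any, path) -> None:
    """Serialize `data` to `path` as indented UTF-8 JSON with sorted keys."""
    with Path(path).open("w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2, sort_keys=True)


# Usage: write a small feature mapping, then read it back to check the round trip.
features = {"D2": [0.4, 0.6], "D1": [0.1, 0.9]}
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "drug_set.json"
    write_json(features, target)
    raw_text = target.read_text(encoding="utf-8")
    restored = json.loads(raw_text)
```

Sorting keys keeps the output stable across runs, which makes the generated files easier to diff and version.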