Table 3 Publicly available lung cancer datasets, with their descriptions and challenges
From: Progress and challenges of artificial intelligence in lung cancer clinical translation
Dataset | Details on Dataset | Cons | Pros | Restrictions in Data Sharing | Access Link |
---|---|---|---|---|---|
NSCLC-Radiomics | Radiomic analysis dataset for NSCLC from CT and PET scans, consisting of 422 cases. | - Preprocessing variability - Missing clinical data - Limited diversity of cases | - Enables radiomics studies - Well-annotated dataset for imaging research | - Requires institutional agreements - Redistribution without permission prohibited | https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics |
NSCLC Radiogenomics | A dataset integrating radiomic features from CT imaging and genomic data for NSCLC, with 211 cases. | - High dimensionality of genomic and radiomic features - Need for robust feature selection methods - Limited sample size for comprehensive radiogenomic analysis - Requires expertise in multi-modal data integration | - Facilitates multi-modal studies integrating radiomics and genomics - High-quality imaging data | - Requires agreement to specific licensing terms - Redistribution is restricted and subject to approval | https://www.cancerimagingarchive.net/collection/nsclc-radiogenomics/ |
LIDC-IDRI | Largest publicly available lung cancer imaging dataset, containing 1,018 CT scans from 1,010 patients with nodule annotations. | - Annotation inconsistencies - Inter-reader variability in nodule classification - Limited PET/CT inclusion - No clinical outcome data | - Large-scale dataset with diverse cases - Comprehensive nodule annotations | - Free for research purposes - Redistribution or commercial use prohibited | |
TCGA-LUAD | Cancer Genome Atlas Lung Adenocarcinoma dataset, including genomic and imaging data. | - Complex preprocessing required - High dimensionality of genomic data - Limited imaging modalities included - Expertise needed for genomic-imaging integration | - Combines genomic and imaging data - Enables survival and outcome studies | - Requires agreement to TCGA’s data-use policy - Certain portions require dbGaP application | |
NLST | National Lung Screening Trial dataset with CT imaging and demographic information for screening participants. | - Limited metadata - Variability in screening protocols - No PET/CT imaging data - Restricted access to patient-level details | - Large-scale dataset for screening research - Longitudinal data available | - Requires formal data-use agreement - Access limited to qualified researchers | |
AutoPET | Dataset for segmentation in PET/CT imaging, featuring annotated lung cancer cases. | - High annotation variability - Preprocessing needed to standardize image formats - Focused only on segmentation tasks - Limited patient numbers | - High-quality segmentation annotations - Facilitates PET/CT segmentation research | - Open access via registration - Cannot use for commercial applications without permission | |
RIDER Lung CT | Quantitative imaging biomarker dataset, including serial CT scans of 32 lung cancer patients. | - Temporal scan alignment issues - Lack of outcome metadata - Limited patient diversity - No PET/CT data included | - Supports temporal analysis studies - Enables biomarker development research | - Freely available under RIDER project license - Redistribution not permitted | https://www.cancerimagingarchive.net/collection/rider-lung-ct/ |
MIDRC | MIDRC is a collaborative initiative initially focused on COVID-19 chest imaging and, as of 2023, has begun expanding to include cancer imaging, including lung cancer. | - Limited lung cancer-specific data | - Centralized repository for imaging data - Supports multi-institutional studies | - Requires registration for access - Redistribution without permission prohibited | |