Table 3 Publicly available lung cancer datasets and their description and challenges

From: Progress and challenges of artificial intelligence in lung cancer clinical translation

Columns: Dataset, Details on Dataset, Cons, Pros, Restrictions in Data Sharing, Access Link

NSCLC-Radiomics
Details on Dataset: Radiomic analysis dataset for NSCLC from CT and PET scans, consisting of 422 cases.
Cons:
- Preprocessing variability
- Missing clinical data
- Limited diversity of cases
Pros:
- Enables radiomics studies
- Well-annotated dataset for imaging research
Restrictions in Data Sharing:
- Requires institutional agreements
- Redistribution without permission prohibited
Access Link: https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics

NSCLC Radiogenomics
Details on Dataset: A dataset integrating radiomic features from CT imaging and genomic data for NSCLC, with 211 cases.
Cons:
- High dimensionality of genomic and radiomic features
- Need for robust feature selection methods
- Limited sample size for comprehensive radiogenomic analysis
- Requires expertise in multi-modal data integration
Pros:
- Facilitates multi-modal studies integrating radiomics and genomics
- High-quality imaging data
Restrictions in Data Sharing:
- Requires agreement to specific licensing terms
- Redistribution is restricted and subject to approval
Access Link: https://www.cancerimagingarchive.net/collection/nsclc-radiogenomics/

LIDC-IDRI
Details on Dataset: Largest publicly available lung cancer imaging dataset, containing 1,018 CT scans from 1,010 patients, with nodule annotations.
Cons:
- Annotation inconsistencies
- Inter-reader variability in nodule classification
- Limited PET/CT inclusion
- No clinical outcome data
Pros:
- Large-scale dataset with diverse cases
- Comprehensive nodule annotations
Restrictions in Data Sharing:
- Free for research purposes
- Redistribution or commercial use prohibited
Access Link: https://www.cancerimagingarchive.net/collection/lidc-idri/

TCGA-LUAD
Details on Dataset: The Cancer Genome Atlas Lung Adenocarcinoma dataset, including genomic and imaging data.
Cons:
- Complex preprocessing required
- High dimensionality of genomic data
- Limited imaging modalities included
- Expertise needed for genomic-imaging integration
Pros:
- Combines genomic and imaging data
- Enables survival and outcome studies
Restrictions in Data Sharing:
- Requires agreement to TCGA’s data-use policy
- Certain portions require dbGaP application
Access Link: https://portal.gdc.cancer.gov/projects/TCGA-LUAD

NLST
Details on Dataset: National Lung Screening Trial dataset with CT imaging and demographic information for screening participants.
Cons:
- Limited metadata
- Variability in screening protocols
- No PET/CT imaging data
- Restricted access to patient-level details
Pros:
- Large-scale dataset for screening research
- Longitudinal data available
Restrictions in Data Sharing:
- Requires formal data-use agreement
- Access limited to qualified researchers
Access Link: https://cdas.cancer.gov/datasets/nlst/

AutoPET
Details on Dataset: Dataset for segmentation in PET/CT imaging, featuring annotated lung cancer cases.
Cons:
- High annotation variability
- Preprocessing needed to standardize image formats
- Focused only on segmentation tasks
- Limited patient numbers
Pros:
- High-quality segmentation annotations
- Facilitates PET/CT segmentation research
Restrictions in Data Sharing:
- Open access via registration
- Commercial use not allowed without permission
Access Link: https://autopet.grand-challenge.org/Dataset/

RIDER Lung CT
Details on Dataset: Quantitative imaging biomarker dataset, including serial CT scans of 32 lung cancer patients.
Cons:
- Temporal scan alignment issues
- Lack of outcome metadata
- Limited patient diversity
- No PET/CT data included
Pros:
- Supports temporal analysis studies
- Enables biomarker development research
Restrictions in Data Sharing:
- Freely available under RIDER project license
- Redistribution not permitted
Access Link: https://www.cancerimagingarchive.net/collection/rider-lung-ct/

MIDRC
Details on Dataset: MIDRC is a collaborative initiative that initially focused on COVID-19 chest imaging and, as of 2023, has begun expanding to cancer imaging, including lung cancer.
Cons:
- Limited lung cancer-specific data
Pros:
- Centralized repository for imaging data
- Supports multi-institutional studies
Restrictions in Data Sharing:
- Requires registration for access
- Redistribution without permission prohibited
Access Link: https://www.midrc.org/
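
For readers who want to query the table programmatically, a few rows can be encoded as records and filtered. The following is only a minimal sketch: the names `DatasetEntry`, `CATALOG`, and `small_cohorts`, and the threshold of 250 cases, are hypothetical and do not come from the source. Case counts are filled in only where the table states them explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DatasetEntry:
    """One row of the table (hypothetical structure, for illustration only)."""
    name: str
    stated_cases: Optional[int]  # None where the table gives no explicit case count
    access_link: str


# Rows transcribed from the table; counts appear only where the table states them.
CATALOG = [
    DatasetEntry("NSCLC-Radiomics", 422,
                 "https://wiki.cancerimagingarchive.net/display/Public/NSCLC-Radiomics"),
    DatasetEntry("NSCLC Radiogenomics", 211,
                 "https://www.cancerimagingarchive.net/collection/nsclc-radiogenomics/"),
    DatasetEntry("RIDER Lung CT", 32,
                 "https://www.cancerimagingarchive.net/collection/rider-lung-ct/"),
    DatasetEntry("NLST", None, "https://cdas.cancer.gov/datasets/nlst/"),
]


def small_cohorts(catalog: List[DatasetEntry], max_cases: int) -> List[str]:
    """Return names of entries whose stated case count is at most max_cases."""
    return [
        entry.name
        for entry in catalog
        if entry.stated_cases is not None and entry.stated_cases <= max_cases
    ]


# Assumed, arbitrary threshold: list datasets with at most 250 stated cases.
print(small_cohorts(CATALOG, 250))
```

Entries without a stated count are skipped rather than guessed, so the filter reflects only what the table reports.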