Background & Summary

Effective lung cancer screening and diagnosis have emerged as critical factors in prolonging patient survival. Pulmonary diseases constitute a significant portion of fatal diseases1,2,3. Within this category, lung cancer is the most prominent, and its incidence and mortality rates are growing the fastest. This makes it one of the most formidable malignancies and a severe threat to the population’s health and well-being4,5. Early diagnosis and prompt treatment have been shown to substantially improve the survival rate of lung cancer patients6,7,8. Since pulmonary nodules are the initial clinical manifestation of lung cancer, their accurate diagnosis and thorough screening play pivotal roles in the overall treatment of lung cancer9,10,11. Medical imaging modalities such as X-ray12, CT13,14, and Positron Emission Tomography (PET) serve a critical function in lung cancer screening. Among these techniques, CT scanning is regarded as a routine and indispensable tool in clinical practice, capable of non-invasively capturing the intricate heterogeneity of lung tumours. Consequently, it has become the primary modality for lung cancer detection in contemporary healthcare settings.

The conventional diagnostic approach primarily involves clinicians manually examining CT images slice by slice and relying on their clinical expertise to make a diagnosis. However, this approach has several limitations15. Firstly, radiologists review hundreds of lung CT scans daily, which is a time-consuming and labour-intensive task. Secondly, diagnostic decisions made by clinicians can be highly subjective and may be influenced by their individual levels of expertise and clinical experience. Moreover, the heterogeneity of malignant lung nodules is manifested not only in the size and morphological characteristics of the lesions, but also in subtle variations in local pixel values within CT images. These subtle differences are often difficult for the human eye to discern. As a result, accurately determining the degree of malignancy of nodules poses a formidable challenge for physicians16.

The implementation of computer-aided diagnosis (CAD) systems for lung nodule diagnosis effectively addresses the aforementioned challenges17,18. Traditional methods include those based on morphological features, voxel-based clustering, and threshold-based techniques19, all of which rely heavily on hand-crafted feature representation and extraction. In recent years, the advent of deep learning-based algorithms applied to medical images16,20 has led researchers to progressively propose 2D convolutional neural networks21,22,23 and 3D convolutional neural networks24,25,26,27,28 for diagnosing lung nodules. These advancements in lung cancer CAD systems have resulted in breakthroughs and remarkable developments, yielding more accurate results than the traditional methodologies.

At present, most research on lung nodule detection and classification relies on the LIDC/IDRI dataset29 or its extended subsets30. These datasets are annotated by four experienced chest radiologists. However, the labels lack verification from pathological reports, resulting in a degree of uncertainty regarding their accuracy. Recently, several new datasets have been proposed. The PN9 dataset includes nine categories based on combinations of nodule size and density31. The later LIDP dataset features annotations derived from pathology reports, providing more accurate labels32. Additionally, most existing CAD systems only address the detection of pulmonary nodules or their benign/malignant classification, lacking more advanced functionalities. This limitation may be directly linked to the types of public datasets currently available. To address this issue, a new lung nodule dataset with histopathology-based cancer type annotation has been constructed. This dataset offers precise cancer type annotations, small challenging nodule detection, and a rich variety of cancer classifications, enabling more advanced functionalities for deep learning models33. However, these existing datasets share a common limitation: they cover only clinical imaging data at a single time point, thus overlooking the crucial properties, trends, and patterns associated with the gradual and dynamic evolution of tumors across spatio-temporal periods.

The morphology of nodules continuously evolves and transforms over time. Notably, patients often visit different healthcare facilities and undergo multiple CT scans at various stages of their illness. However, challenges in the interoperability and seamless exchange of medical information hinder the effective sharing of data between hospitals over different periods. In clinical practice, data from various times and sources is crucial for comprehensive assessment, yet physicians face difficulties in accessing and monitoring CT scan information across different time periods. This isolation of information can compromise the consistency of diagnostic analysis and treatment plans, potentially impacting patient care. To address this critical bottleneck, using CAD systems to dynamically observe cross-temporal sequence data of patients and identify and analyze patterns based on disease progression can significantly enhance the accuracy of lung nodule detection and diagnosis. This approach provides physicians with dynamic insights into disease development. Therefore, there is an urgent need for a dataset that incorporates diverse temporal and spatial dimensions.

In this study, we introduce a novel cross spatio-temporal lung nodule dataset based on pathological information, which effectively integrates rich multimodal information within the spatio-temporal dimension. The constructed dataset comprises 317 CT scans from 109 patients, including 1,569 malignant nodules and 726 benign nodules. Moreover, the dataset includes 201 cross spatio-temporal CT scans acquired at different times. During the processing of the dataset, we accurately locate and annotate the nodules in the CT scans according to the pathological information. This critical step ensures the precise tracking and quantitative analysis of nodules, thereby furnishing reliable clinical data (e.g., ground truth with detailed labels) for disease research and medical imaging analysis. To the best of our knowledge, this is the first pulmonary nodule dataset that embraces the concept of cross spatio-temporal integration across various periods, providing a fresh dynamic perspective and opportunity for future medical imaging investigations.

A comparison of our dataset with existing publicly available datasets is presented in Table 1. The table compares these datasets in three key aspects: labeling methods, categories, and whether they include multiple time series for the same patient. As shown in Table 1, our dataset offers finer categorization and more accurate pathology-based labels. A clear advantage of our dataset is the inclusion of multiple time series for the same patient, enabling a more comprehensive analysis of tumor progression across different times.

Table 1 The comparison of our dataset with the existing publicly available datasets.

Methods

To address more effectively the clinical need for mining richer information related to lung disease, discovering potential imaging features, and predicting the evolutionary tendency of the disease, we partner with The Second Hospital of Shandong University to construct a pathology-driven cross spatio-temporal lung CT dataset drawn from actual clinical cases. This dataset is accurately annotated based on pathology information. The data samples involved in this study have been reviewed and approved by the research ethics committee of The Second Hospital of Shandong University (approval number: KYLL-2023LW089). Since the dataset is collected retrospectively and the data undergo irreversible desensitization, the data contain no sensitive information such as personal information, private details, or biological samples of the subjects. Meanwhile, this study complies with the ethical principles of the Helsinki Declaration and does not pose any risk or harm to the subjects, so the Ethics Committee has exempted the researchers from the requirement to obtain informed consent from the patients and has allowed the public release of this dataset. The following sections provide a detailed overview of the dataset, covering the collection principles, data collection and annotation, dataset structure, and dataset properties.

Collection principles

To ensure the accuracy of the collected subjects and to eliminate interference from irrelevant factors, we implement the following principles in our case selection process:

Inclusion criteria

In accordance with the pathology-driven criterion, only pulmonary nodules with a definitive pathological diagnosis are included. These nodules must be uniquely identifiable in the corresponding CT images based on the descriptions of nodule location and size provided in the diagnosis. This strict criterion ensures that each target nodule aligns precisely with its pathological diagnosis.

Exclusion criteria

Cases with pathological diagnoses unrelated to pulmonary nodules are excluded. This step effectively eliminates any interference from other non-nodule lesions in our dataset.

Pre-treatment imaging

All CT scans are performed before any relevant treatments are administered. This inclusion criterion ensures that the image characteristics of the nodules are not influenced by prior treatments.

Quality control

We meticulously review and discard CT images with missing layers or incorrect layering to guarantee the integrity and continuity of the lung nodule image dataset.

Data collection and annotation

Our data collection and annotation efforts are conducted at The Second Hospital of Shandong University. The data collection process encompasses four key stages, as illustrated in Fig. 1:

Fig. 1

The workflow for generating the dataset. Firstly, the Electronic Medical Record System (EMRS) is used to identify cases diagnosed with pulmonary occupying lesions within the past six years. Subsequently, these cases are filtered using the Pathology Information System (P.I.S), retaining only those with available pathology information. Finally, the single- or multi-time-point CT sequences of each patient are exported from the Picture Archiving and Communication System (PACS). After data extraction, the dataset is organized into two categories: classification and detection. Under the guidance of expert physicians, annotators extract the coordinate information of the nodules from the CT sequences, based on the corresponding pathological information, and record it in a CSV file. In addition, post-segmentation lung data is also provided.

Case collection

We collect cases from the EMRS spanning from January 2016 to January 2023, specifically selecting those with diagnoses referencing nodular lung lesions and accompanying surgery records.

Pathological diagnosis recording

On the basis of the organized case records, we retrieve and document the pathological diagnosis corresponding to the lung nodules in the P.I.S.

Patient imaging data retrieval

In the PACS system, we search for the CT scan times of the patients, view the patients’ multiple diagnostic imaging reports and CT images, and subsequently employ the provided image descriptions in the diagnostic imaging reports along with the corresponding pathological diagnostic results to precisely locate the target pulmonary nodule.

Data export and archive preparation

Following the identification of the target pulmonary nodules, we proceed to export the corresponding DICOM file sequences. To facilitate subsequent labelling operations and ensure data preservation, we create archive disks containing the imaging data.

Under the guidance of expert physicians, we perform the labelling of nodule location and contour within the exported CT sequences. Physicians first verify the lesion locations by referencing the imaging manifestations and the pathological information contained in the dataset. Afterwards, they instruct the annotators to complete the labelling task within the slices containing the lung nodules. This rigorous process is crucial in guaranteeing the precision and reliability of the annotations, forming a robust foundation for subsequent data analysis. In total, we label 2,295 nodules across 317 CT scans in this dataset.

Dataset structure

Temporal dimension CT series

We collect CT scans from patients at various time points, denoted as T1, T2, …, Tm, where m is the number of time points at which CT scans were captured for a given patient. Unlike studies based on slices from a single point in time, longitudinal studies yield dynamic data that tend to reveal richer information about disease variation. Figure 2 shows an example of CT images taken at different times for the same patient.

Fig. 2

An example of the CT sequences along the temporal dimension.

Spatial dimension CT series

In the spatial dimension, we annotate the CT scan sequences (referred to as P1, P2, …, Pk, …, Pn) containing pulmonary nodules. Within these sequences, we identify the specific slices where the nodules are most prominently visible and designate them as keyframes, represented by Pk. Figure 3 provides an example of the CT sequences for a patient along the spatial dimension.

Fig. 3

An example of the CT sequences in terms of spatial dimension.

Cross spatio-temporal dimension CT series

Our dataset encompasses both longitudinal inspection data spanning multiple time points and spatially oriented horizontal data. This comprehensive dataset enables researchers to observe nodule changes from the temporal and spatial perspectives simultaneously, facilitating the training of accurate lung nodule diagnosis models and providing a rich, multi-dimensional perspective for in-depth exploration of disease progression. A holistic overview of the dataset is presented in Fig. 4.

Fig. 4

A global overview of the dataset structure.
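The cross spatio-temporal organization described above can be pictured as a small nested data structure. The following is only an illustrative sketch, not the layout of the released files: the class names `Scan` and `Patient`, their fields, and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Scan:
    """One CT sequence of a patient (hypothetical container)."""
    time_index: int        # position on the temporal axis: 1 for T1, 2 for T2, ...
    slice_ids: list[str]   # spatial axis: identifiers of slices P1..Pn
    keyframe: int          # index k into slice_ids of the most prominent slice Pk


@dataclass
class Patient:
    """All CT sequences that belong to one patient (hypothetical container)."""
    patient_id: str
    scans: list[Scan] = field(default_factory=list)

    def timeline(self) -> list[Scan]:
        """Return the scans ordered along the temporal dimension."""
        return sorted(self.scans, key=lambda scan: scan.time_index)

    def is_longitudinal(self) -> bool:
        """True when the patient has scans at more than one time point."""
        return len({scan.time_index for scan in self.scans}) > 1


patient = Patient("case_001")
patient.scans.append(Scan(time_index=2, slice_ids=["P1", "P2", "P3"], keyframe=1))
patient.scans.append(Scan(time_index=1, slice_ids=["P1", "P2"], keyframe=0))
```

Ordering by `time_index` gives the temporal view of a patient, while `slice_ids` and `keyframe` give the spatial view within each scan.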

Dataset properties

Pathological type

Since the CT data in this dataset originate from patients who have undergone surgery or medical treatment, there are more malignant cases than benign cases. To address this imbalance while considering clinical relevance, we apply a like-for-like aggregation approach to classify the dataset into six distinct pathology types. These types include invasive adenocarcinoma (A), minimally invasive adenocarcinoma (B), adenocarcinoma in situ (C), other malignant subtypes (D), cases involving inflammation (E), and benign subtypes (F), as illustrated in Fig. 5(a).

Fig. 5

Statistics of dataset categories and nodule sizes. (a) The types and percentages of pathology in the dataset; (b) The types and percentages of nodule size; (c) The correlation between nodule size and pathological type.

From the graph, it is evident that there are more malignant nodules than benign nodules, with 1,569 malignant nodules (68.37%) and 726 benign nodules (31.63%).

Nodule size

Pulmonary nodules are categorized based on size into four distinct groups: 0–10 mm, 11–20 mm, 21–30 mm, and >30 mm. Among these categories, nodules sized 11–20 mm are the most prevalent, comprising 42.79% of all nodules. Nodules sized 21–30 mm and >30 mm account for 27.67% and 26.49%, respectively, as shown in Fig. 5(b).

Figure 5(c) illustrates the correlation between nodule size and pathological type. It is notable that invasive adenocarcinomas are predominant among nodules sized between 21 and 30 mm. Additionally, inflammation is observed in nodules sized between 11 and 20 mm.
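The four size groups described above can be expressed as a simple binning function. This is a minimal sketch: the text does not state how diameters between two groups (for example 10.4 mm) are assigned, so treating each upper bound as inclusive is an assumption, and `size_group` is a hypothetical helper name.

```python
def size_group(diameter_mm: float) -> str:
    """Map a nodule diameter in millimetres to one of the four size groups."""
    if diameter_mm < 0:
        raise ValueError("diameter must be non-negative")
    # Assumption: upper bounds are inclusive, so 10.4 mm falls into "11-20 mm".
    if diameter_mm <= 10:
        return "0-10 mm"
    if diameter_mm <= 20:
        return "11-20 mm"
    if diameter_mm <= 30:
        return "21-30 mm"
    return ">30 mm"
```

Applying the function to the diameter column of an annotation file and counting the results would give a size distribution of the kind shown in Fig. 5(b).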

Nodule location

Figure 6(a) illustrates the distribution of nodule locations. It shows that the superior lobe of the right lung has the highest number of nodules, followed by the inferior lobe of the left lung.

Fig. 6

Statistics on nodule location and nodule count. (a) Nodule location; (b) Number of nodules per sequence.

Nodule count

We quantify the number of nodules in each sequence, as shown in Fig. 6(b). The sequences with 1–5 nodules are the most numerous, totaling 168, followed by sequences with 6–10 nodules, 81 in total.

Patient demographics

We compile the demographic information of the patients. The gender distribution, as depicted in Fig. 7(a), indicates that there are more female patients than male patients.

Fig. 7

Patient demographics. (a) Gender distribution of patients; (b) Age distribution of sequences.

Since our dataset includes multiple CT sequences for the same patient over different times, the patient’s age may vary across sequences. Therefore, we record the age corresponding to each CT sequence, as illustrated in Fig. 7(b). The data shows that the majority of patients fall within the age ranges of 51–60 and 61–70.

Thickness

We concurrently collect CT sequences with slice thicknesses of 1.25 mm and 5 mm to address diverse requirements in actual clinical diagnosis and treatment. Some patients have both 1.25 mm and 5 mm CT scans. There are 164 CT series with a slice thickness of 1.25 mm and 153 series with a slice thickness of 5 mm.

Data preprocessing

Initially, Gaussian filtering is applied to each image slice to achieve two objectives: image smoothing for noise reduction and enhancement of the contrast between the lungs and surrounding tissues. Subsequently, a clustering technique is employed to determine thresholds that effectively distinguish lung regions from non-lung regions within the medical images. The ultimate aim of this preprocessing step is to isolate and highlight the lung region, so that the structure or region of interest can be more easily identified in subsequent segmentation and analysis steps. The processed lung region is then stored in a designated folder, labeled “lung_mask”. This is included in the “1.25 mm_3D_detection_mhd_2.zip” dataset described in the “Data Records” section for 1.25 mm slice thickness detection34.
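The two preprocessing steps can be sketched as follows. This is an illustrative approximation rather than the released pipeline: the text does not name the clustering method or the filter width, so the two-cluster (k-means style) threshold, the value `sigma = 1.0`, and the function names `gaussian_smooth`, `two_cluster_threshold`, and `dark_region_mask` are all assumptions. The example runs on a synthetic slice, not on dataset files.

```python
import numpy as np


def gaussian_smooth(image: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur with edge padding; output keeps the input shape."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image.astype(np.float64), radius, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")


def two_cluster_threshold(values: np.ndarray, iterations: int = 20) -> float:
    """Midpoint between two intensity cluster centres (1-D k-means with k = 2)."""
    low, high = float(values.min()), float(values.max())
    for _ in range(iterations):
        midpoint = (low + high) / 2.0
        lower, upper = values[values <= midpoint], values[values > midpoint]
        if lower.size == 0 or upper.size == 0:
            break
        low, high = float(lower.mean()), float(upper.mean())
    return (low + high) / 2.0


def dark_region_mask(slice_hu: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Smooth one slice, then mark pixels darker than the clustered threshold."""
    smoothed = gaussian_smooth(slice_hu, sigma)
    return smoothed < two_cluster_threshold(smoothed.ravel())


# Synthetic slice: soft tissue (about 40 HU) around an air-filled square (about -800 HU).
rng = np.random.default_rng(0)
demo = np.full((64, 64), 40.0) + rng.normal(0.0, 10.0, (64, 64))
demo[16:48, 16:48] = -800.0 + rng.normal(0.0, 10.0, (32, 32))
mask = dark_region_mask(demo)
```

On a real CT slice the air outside the body is also dark, so a practical lung mask would additionally need to discard regions connected to the image border; that step is omitted here.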

Data Records

We have publicly released the data in the Zenodo repository without any password protection34,35. Since our data is large and Zenodo has an upload capacity limitation, we split the dataset and store it in two different websites (Database Part A34 and Database Part B35). To facilitate downloads, we have partitioned large-sized files into multiple zip archives. In case the main link doesn’t work, we also provide an alternate link36. The uploaded dataset files are organized according to different slice thicknesses (1.25 mm and 5 mm), which are described as follows.

For the slice thickness of 1.25 mm, we provide two types of datasets, for classification and detection. The detailed description of the file formats is given below:

1.25 mm slice thickness classification dataset

The classification dataset, comprising BMP-format images, is compressed into a single archive named “class_dataset.zip” and uploaded to Database Part A34. Within this archive, the dataset is divided into two segments: the training set and the testing set, in a 5:1 ratio. These segments are stored in the “train” and “validate” folders, respectively. Each folder contains six subfolders categorized as labels 1, 2, 3, 4, 5 and 6, which correspond to invasive adenocarcinoma, minimally invasive adenocarcinoma, adenocarcinoma in situ, other malignant subtypes, inflammation, and other benign subtypes, respectively. Additionally, the “READ_ME.txt” file provides detailed information, including the number of samples in the training and testing sets for each category.
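After extracting the archive, the per-category sample counts can be checked against the “READ_ME.txt” file with a short script. This is a sketch under assumptions: it presumes the six subfolders are literally named “1” to “6” and that image files carry a “.bmp” extension; `count_images` is a hypothetical helper.

```python
from pathlib import Path

# Label-to-category mapping as described in the text.
LABELS = {
    "1": "invasive adenocarcinoma",
    "2": "minimally invasive adenocarcinoma",
    "3": "adenocarcinoma in situ",
    "4": "other malignant subtypes",
    "5": "inflammation",
    "6": "other benign subtypes",
}


def count_images(split_dir: Path) -> dict[str, int]:
    """Count BMP files in each label subfolder of a split such as 'train'."""
    counts = {}
    for label, name in LABELS.items():
        folder = split_dir / label
        # Assumption: a missing subfolder simply means zero samples.
        counts[name] = sum(1 for _ in folder.glob("*.bmp")) if folder.is_dir() else 0
    return counts
```

For example, `count_images(Path("class_dataset/train"))` would return a dictionary keyed by category name, assuming the archive was extracted into a folder called `class_dataset`.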

1.25 mm slice thickness detection dataset

For the 1.25 mm slice thickness, we develop a dataset for lung nodule detection and evaluation, encompassing both three-dimensional (3D) and two-dimensional (2D) data formats.

  1) 3D Data: We offer two formats: MHD and BMP.

    The MHD dataset includes four compressed files, uploaded to Database Part A34. Specifically: “1.25 mm_3D_detection_mhd_0.zip” and “1.25mm_3D_detection_mhd_1.zip” contain the original MHD-format images of the patients. “1.25mm_3D_detection_mhd_2.zip” includes MHD-format images that have undergone lung segmentation named “lung_mask”. “5mm_3D_detection_mhd_3.zip” contains the “annotations.csv” file, which holds information on the actual nodule positions and diameters. These data points are referenced within a global coordinate system, serving as a reliable benchmark for assessing the performance of the testing data.

    The detection dataset in BMP format contains data augmented through flipping and rotation, and is stored in Database Part B35. We compress the images into eleven sub-archives for easy upload, labeled “1.25mm_3D_detection_bmp i (i = 0–10).zip”. Researchers should extract and consolidate these files for use. In addition, both the labels and the training and test set information are placed in “1.25mm_3D_detection_bmp11.zip”. This archive contains the files “image_annotation_augment_diameter.csv”, “train_anno.csv”, “val_anno.csv”, “train.txt”, and “val.txt”. The “image_annotation_augment_diameter.csv” file provides detailed information on the real nodule positions and diameters in the voxel coordinate system. This data is divided into training and testing sets in a 5:1 ratio, saved as “train_anno.csv” and “val_anno.csv”, respectively. The case numbers for the training and testing sets are listed in “train.txt” and “val.txt”.

  2) 2D Data: We provide the “1.25mm_2D_detection.zip” folder in Database Part A34. The “JPEGImages” folder contains the dataset images, the “Annotations” folder includes the XML format annotations, and the “labels” folder holds the bounding box center coordinates along with the width and height. “VOC_CT.xlsx” lists the image filenames and their corresponding nodule coordinates (xmin, ymin, xmax, ymax). To facilitate model training and performance evaluation, we partition the dataset into training and testing sets using a 4:1 ratio. The image names of these sets are recorded in the “train.txt” and “val.txt” files located in the “ImageSets/Main” directory, while the image paths are documented in the “train.txt” and “val.txt” files within the “images” folder. Furthermore, the “train.txt” and “val.txt” files inside the “1.25mm_2D_detection.zip” include the image paths and corresponding nodule coordinates.
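The 2D annotations are given both as corner coordinates (xmin, ymin, xmax, ymax) and as box centers with width and height. The relation between the two can be sketched as below. Whether the values in the “labels” folder are normalised by the image size is not stated in the text; the normalisation used here follows a common convention and is an assumption, as is the function name `corners_to_center`.

```python
def corners_to_center(xmin, ymin, xmax, ymax, image_width, image_height):
    """Convert a corner-style box to (cx, cy, w, h), normalised to [0, 1]."""
    if xmax <= xmin or ymax <= ymin:
        raise ValueError("box must have positive width and height")
    # Assumption: coordinates are divided by the image size (normalised format).
    cx = (xmin + xmax) / 2.0 / image_width
    cy = (ymin + ymax) / 2.0 / image_height
    w = (xmax - xmin) / image_width
    h = (ymax - ymin) / image_height
    return cx, cy, w, h
```

For a 512 × 512 slice, a box with corners (100, 200, 140, 260) has its center at pixel (120, 230), so the function returns (0.234375, 0.44921875, 0.078125, 0.1171875).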

5 mm slice thickness detection dataset

We provide a detection dataset in MHD format within Database Part A34. This dataset is divided into four compressed files. The first three files, “5mm_3D_detection_mhd i (i = 0–2).zip”, contain data in MHD format. The final file, “5mm_3D_detection_mhd3.zip”, includes detailed information on the actual nodule locations and diameters for each case in the global coordinate system.
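Annotations given in the global (physical) coordinate system usually need to be mapped to voxel indices before they can be overlaid on an MHD volume. A minimal sketch of that mapping is shown below. It assumes an axis-aligned volume (identity direction matrix) and that the origin and voxel spacing are read from the `Offset` and `ElementSpacing` fields of the `.mhd` header; `world_to_voxel` and the numbers in the usage note are hypothetical.

```python
def world_to_voxel(world_xyz, origin_xyz, spacing_xyz):
    """Convert a physical coordinate in mm to fractional voxel indices.

    Assumes the volume axes are aligned with the world axes, so each
    component is simply (world - origin) / spacing.
    """
    return tuple(
        (w - o) / s for w, o, s in zip(world_xyz, origin_xyz, spacing_xyz)
    )
```

For instance, with origin (-100, -100, -300) mm and spacing (0.5, 0.5, 1.25) mm, the world point (-50, 0, -250) mm maps to voxel (100.0, 200.0, 40.0). The results are fractional in general and can be rounded to obtain the nearest slice and pixel.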

Technical Validation

Classification experiments

To appraise the efficacy of our pulmonary nodule classification dataset, we conduct comprehensive assessments using eight distinct and typical image classification networks, namely ResNet37, ConvNext38, ResNext39, Res2Net40, SE-ResNet41, CABNet42, InceptionV443, and EfficientNet44. These networks exhibit diverse architectures and excellent performance characteristics, enabling a thorough examination of the dataset’s suitability for pulmonary nodule classification tasks. We carry out both training and testing for these networks using identical data and configurations to ensure the fairness and comparability of our experiments. The dataset is divided into training and testing sets in a 5:1 ratio. During the experiment, each network is trained for a total of 150 epochs.

For the quantitative assessment of classification performance, we employ two key metrics, namely Extreme Accuracy (ACC) and Quadratic Weighted Kappa (QWK), for objective evaluation. These metrics are selected to provide a thorough assessment of our classification results. The outcomes of these experiments are presented in Table 2.
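For reference, the two metrics can be computed as in the sketch below. It uses the standard definition of quadratic weighted kappa, κ = 1 − Σ w_ij O_ij / Σ w_ij E_ij with w_ij = (i − j)² / (N − 1)², where O is the observed confusion matrix and E the matrix expected from the label marginals. The code assumes labels are 0-indexed integers (the dataset’s labels 1 to 6 would need 1 subtracted) and is not the evaluation script used for Table 2.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa for integer labels in range(num_classes)."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(row[j] for row in observed) for j in range(num_classes)]
    numerator = denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if true and predicted labels were independent.
            denominator += weight * true_hist[i] * pred_hist[j] / n
    # Note: the denominator is zero when all labels fall in a single class.
    return 1.0 - numerator / denominator
```

Perfect agreement gives a kappa of 1, while predictions that are statistically independent of the labels give a value near 0.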

Table 2 Performance assessment based on eight representative classification models.

As evident from the table, our dataset yields reliable results across various classification networks. This consistency underscores the accuracy of the dataset’s labels and features, as well as its robustness. The multi-model applicability indicates that the dataset encompasses diverse data samples and that data quality and diversity have been adequately considered. While all eight networks achieve accuracy and QWK scores exceeding 0.5, it is notable that none of their metrics attain exceptionally high values. This observation highlights the challenges and research potential inherent in our dataset.

To assess the per-category classification quality of the eight networks, we present the confusion matrices in Fig. 8. The values along the main diagonal of each confusion matrix are relatively high, signifying that the network classification results generally align with our annotations. This alignment highlights the effectiveness of our dataset for classification tasks. The fact that different networks exhibit varying sensitivities to distinct data classes further reflects the dataset’s complexity and diversity.

Fig. 8

The confusion matrix between the real labels and the predicted results. (a) ResNet37; (b) ConvNext38; (c) ResNext39; (d) Res2Net40; (e) SE-ResNet41; (f) CABNet42; (g) InceptionV443; (h) EfficientNet44.
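The statement that the main diagonal dominates can be quantified by dividing each diagonal entry by its row total, which gives the per-class recall. The helper below is an illustrative sketch with a made-up matrix in the usage note; it assumes rows index the true class and columns the predicted class, which the figure does not state explicitly.

```python
def per_class_recall(confusion):
    """Return diagonal / row-sum for each class of a square confusion matrix.

    Assumes confusion[i][j] counts samples of true class i predicted as class j.
    """
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls
```

For example, `per_class_recall([[8, 2], [1, 9]])` returns `[0.8, 0.9]`.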

Detection experiments

To validate the effectiveness of the lung nodule detection dataset, we test five different object detection networks: Faster R-CNN45, Yolov346, MobileNet47, SSD48, and RetinaNet49. Each network is trained for 50 epochs using the same dataset and training configuration. To estimate model performance, we employ the object detection evaluation metrics typically used for the COCO dataset, including Average Precision (AP), AP50, AP75, APS, APM, and APL46. These metrics are widely recognized for assessing the accuracy of detection box localization. Detailed test results are presented in Table 3.

Table 3 Evaluation results in terms of different detection models.

Specifically, APS, APM, and APL gauge the detection performance for target objects of different sizes. Given that our dataset contains over twice as many malignant nodules as benign nodules, the substantial number of malignant nodules enhances the model’s sensitivity in detecting malignancies. Consequently, the APL metrics for the results obtained with the five popular networks are higher.
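The COCO-style metrics mentioned above rest on two simple quantities: the intersection over union (IoU) between a predicted and a ground-truth box, and the object area used to separate small, medium, and large targets. The sketch below shows both. The thresholds 32² and 96² pixels are the standard COCO size boundaries; using the box area instead of a segmentation area is an assumption for a box-only dataset, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0


def coco_size_bucket(box):
    """Classify a box as small, medium, or large using the COCO area limits."""
    xmin, ymin, xmax, ymax = box
    area = (xmax - xmin) * (ymax - ymin)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

A prediction counts as correct for AP50 when its IoU with a ground-truth box is at least 0.5 and for AP75 when it is at least 0.75, while AP averages over IoU thresholds from 0.5 to 0.95.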