Background & Summary

The subjective evaluation of histological slides for GC remains the gold standard for its diagnosis and guides its clinical treatment. The corresponding histological slide images are routinely available and offer a wealth of information about the TME, which plays a critical role in disease progression and is a key determinant of treatment response in GC patients1,2,3. Thus, there is a high clinical need to comprehensively evaluate the TME patterns in the slide images of individual GC patients4,5,6,7.

Extensive research into TME patterns has provided new insights into the diagnosis and prognosis of various cancers8,9,10,11,12,13,14,15,16. Based on the quantification of various tissue components in the TME, several histological image biomarkers have been developed. For instance, Kather et al. quantified different tissue components within the TME of colorectal cancer (CRC) slides, developing a deep stroma score that significantly predicts CRC patient survival17. Liang et al. employed a deep learning-guided framework to analyze TME patterns in liver cancer histological images, discovering a novel tissue spatial biomarker strongly linked to patient prognosis18. Similarly, Wang et al. introduced the DeepGrade model, which utilizes TME features from whole-slide histopathology images of breast cancer to generate a risk score, serving as a prognostic biomarker and providing clinically relevant information beyond traditional histological grading19. Rong et al. developed a deep learning method for TME quantification within breast cancer histological images, identifying TME features as independent biomarkers for survival prediction20. Additionally, Wang et al. used a graph convolutional network to study the cell spatial organization in the TME derived from lung and oral cancer histological images, obtaining a spatial feature that excelled in prognostic and predictive tasks21. Another study by Wang et al. created a deep learning tool to explore the TME in lung cancer histological images, finding a cell spatial image biomarker predictive of patient survival and associated with gene expression22. Fremond et al. analyzed TME patterns in endometrial cancer histological images, identifying morpho-molecular correlates that could serve as a prognostic factor for risk stratification23.

Given these findings, it is evident that TME patterns in histological images play a crucial role in the diagnosis and prognosis of various cancers. Specifically, the TME patterns of GC have been shown to significantly impact its progression and therapeutic responses24. For example, Liu et al. found that different components of the GC TME play their own roles in inducing immune tolerance and thereby promoting GC progression25. However, the extraction and analysis of TME features in GC histological images are hindered by a scarcity of images with detailed TME annotations. To this end, we have built, to our knowledge, the largest histological image dataset of GC with detailed annotation of TME components, aiming to provide a platform for exploring TME patterns in histological images, potentially discovering new biomarkers and guiding treatment strategies for gastric cancer. In addition, this dataset can serve as a foundation for developing a pre-trained feature extractor, valuable for transfer learning in the diagnosis and prognosis of other cancer types. Furthermore, compared to the existing large dataset GasHisSDB, our dataset offers advantages in two key aspects. While GasHisSDB provides only binary classification (normal/abnormal), our dataset features eight distinct tissue classes within the TME, enabling more comprehensive gastric cancer diagnosis analysis. Notably, our dataset integrates comprehensive clinical information alongside the histological images, whereas GasHisSDB is limited to image data.

Methods

Slides preparation and digitization

Formalin-fixed, paraffin-embedded tissue slides were provided by the Cancer Hospital of Harbin Medical University and covered the years 2013 to 2015. A total of 300 H&E-stained slides of gastric cancer patients were collected from the hospital archives according to the pathology reports. The study was approved by the Research Ethics Committee of Harbin Medical University (ID: KY2024-16). The ethics committee issued a waiver of consent and approved the open publication of the dataset. All tissue and clinical information were collected from patients from whom written informed consent had been obtained for research purposes. All histological slides were scanned with an Aperio AT2 scanner (Leica Biosystems, Germany) equipped with a 20× objective lens, and the resulting histological images were saved in .svs file format.

Annotation process

The slides were evaluated by two junior pathologists (Huiying Li, Kexin Chen), each with more than 5 years of experience in the field, and a senior pathologist (Yang Jiang) with more than 10 years of experience, in order to confirm the diagnosis of each case and to annotate each slide image. For quality control, a three-step annotation process was performed: an initial labeling step, a verification step, and a final check step. We refer to these pathologists as A (senior pathologist), B (junior pathologist), and C (junior pathologist). Each image was first randomly assigned to reader B or C. Once the labeling was finished, the image and annotation were passed on to the other reader for review. Finally, the annotations were checked by reader A. Specifically, eight tissue classes related to the tumor microenvironment (TME) were annotated in each slide image: adipose tissue (ADI), debris (DEB), mucus (MUC), muscle (MUS), lymphocyte aggregates (LYM), stroma (STR), normal mucosa (NOR), and tumor epithelium (TUM) (see Fig. 1). These annotated slide images in .svs file format were then tiled into 224 × 224 patches and saved in .png format. Each patch was labeled with the same tissue label as its corresponding tissue region. In total, nearly 31 K patches were extracted from the 300 slide images. The detailed data distribution of these slide images and patches is illustrated in Fig. 2.
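The tiling step can be reproduced along the following lines. This is a minimal sketch rather than the authors' original pipeline: the annotation labels are assumed to be accessible through a hypothetical get_label_for_region helper, and the file paths and background filter are illustrative only.

```python
# Minimal tiling sketch (not the authors' original pipeline): reads a .svs slide
# with OpenSlide and saves non-overlapping 224 x 224 patches as .png files, one
# folder per TME class. The paths, the `get_label_for_region` helper, and the
# simple brightness-based background filter are assumptions.
import os

import numpy as np
from openslide import OpenSlide

PATCH_SIZE = 224


def tile_slide(svs_path, out_dir, get_label_for_region):
    slide = OpenSlide(svs_path)
    width, height = slide.dimensions  # level-0 (20x) dimensions
    for x in range(0, width - PATCH_SIZE + 1, PATCH_SIZE):
        for y in range(0, height - PATCH_SIZE + 1, PATCH_SIZE):
            # hypothetical lookup of the annotated tissue class, e.g. "TUM" or None
            label = get_label_for_region(x, y, PATCH_SIZE)
            if label is None:
                continue  # patch falls outside the annotated TME regions
            patch = slide.read_region((x, y), 0, (PATCH_SIZE, PATCH_SIZE)).convert("RGB")
            if np.asarray(patch).mean() > 230:
                continue  # skip mostly-background (very bright) patches
            label_dir = os.path.join(out_dir, label)
            os.makedirs(label_dir, exist_ok=True)
            patch.save(os.path.join(label_dir, f"{x}_{y}.png"))
```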

Fig. 1
figure 1

Histological slide image of gastric cancer and the labeled patches; (A) The whole slide image; (B) Extracted patches labeled with the eight TME tissue classes.

Fig. 2
figure 2

Size distributions of the slide image widths and heights, and the distribution of the number of patches per slide.

Clinical data acquisition

The clinical data of these GC patients were obtained from the records in the information system of the Cancer Hospital of Harbin Medical University. The corresponding variables of the data are displayed in Table 1. Personal identifiers were removed to prevent the identification of individual patients.

Table 1 Overall clinical variables related to the provided histological slide images dataset.

Data Records

The complete dataset, named HMU-GC-HE-30K, is publicly available on Figshare (https://doi.org/10.6084/m9.figshare.25954813)26. It consists of two components: a file containing the annotated image patches from the corresponding histological slide images, and a spreadsheet named “HMU-GC-Clinical.csv”, which includes the clinical data tabulated according to Table 1. The dataset file structure is shown in Fig. 3. Histological slide images in .svs format, patch images in .png format, and clinical information data are provided. The labeled patch images of the slide images are stored in folders named according to the corresponding tumor microenvironment (TME) tissue components, such as ADI, DEB, MUC, etc. These images and the related clinical information can be used to extract histological TME features for various downstream tasks, such as prediction and prognosis.
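Because the patches are organized into one folder per tissue class, they can be loaded with standard tooling. The sketch below uses torchvision's ImageFolder for the patches and pandas for the clinical spreadsheet; the local root path is an assumption.

```python
# Sketch of loading the released patches and clinical table for downstream use.
# Folder and file names follow the structure described above; the local root
# path "HMU-GC-HE-30K" is an assumption.
import pandas as pd
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder maps each class folder (ADI, DEB, MUC, MUS, LYM, STR, NOR, TUM) to a label
patch_dataset = datasets.ImageFolder("HMU-GC-HE-30K/patches", transform=transform)
print(patch_dataset.classes)

# Clinical variables tabulated according to Table 1
clinical = pd.read_csv("HMU-GC-HE-30K/HMU-GC-Clinical.csv")
print(clinical.head())
```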

Fig. 3
figure 3

Overview of the dataset structure.

Technical Validation

The slide evaluation and annotation workflow included quality control by three pathologists: two junior pathologists with over 5 years of experience and one senior pathologist with over 10 years of experience. Images were first randomly assigned to one of the junior pathologists for initial labeling. The labeled images were then cross-reviewed by the other junior pathologist, and finally, all annotations underwent a final check by the senior pathologist to ensure accurate diagnosis and annotation of each slide image.

The HMU-GC-HE-30K dataset comprises 3,887 images for each of the eight tumor microenvironment (TME) components. To validate the dataset proposed in this study, we used two previously published models, the Vision Transformer (ViT)27 and the CNN-based EfficientNet28, for the classification of the eight TME tissue components mentioned above (see Fig. 4). The models were trained using a 10-fold cross-validation strategy: 20% of the dataset was first allocated as an independent test set, and the remaining data were used for 10-fold cross-validation. To ensure proportional representation of each component in both the training and test sets, a stratified sampling method was employed.
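The split described above can be sketched as follows. The per-patch labels are read from the class folders; the random seed and variable names are assumptions, not part of the released dataset or the authors' code.

```python
# Illustrative split matching the description: a stratified 20% hold-out test set,
# then stratified 10-fold cross-validation on the remaining 80%. The root path and
# the random seed are assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from torchvision import datasets

# per-patch class labels taken from the folder structure
patch_dataset = datasets.ImageFolder("HMU-GC-HE-30K/patches")
labels = np.array(patch_dataset.targets)
indices = np.arange(len(labels))

# stratified 20% hold-out test set
trainval_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

# stratified 10-fold cross-validation on the remaining data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(skf.split(trainval_idx, labels[trainval_idx])):
    fold_train_idx = trainval_idx[tr]
    fold_val_idx = trainval_idx[va]
    # train one model per fold on fold_train_idx, select on fold_val_idx,
    # then evaluate the resulting models on the fixed test_idx
```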

Fig. 4
figure 4

Workflow of data preprocessing and model architecture. (A) The histological slide image, i.e., the whole slide image (WSI), is digitized, segmented, and tessellated into 224 × 224 patches. (B) ViT model pipeline: the patch image is split, flattened, and linearly projected into patch embeddings, followed by feature extraction via a transformer layer with multi-head attention. Predictions for the various tissue classes are made using a multi-layer perceptron (MLP). (C) EfficientNet model pipeline: the input image undergoes initial feature extraction via a convolution layer, followed by deep feature extraction using MBConv blocks. The extracted features are then processed through global average pooling, a flatten layer, a dropout layer, and a fully connected (FC) layer for the prediction of the various tissue classes.

For the classification task using ViT, we utilized pretrained parameters downloaded from Hugging Face (https://huggingface.co/). The ViT model architecture included a dropout layer with a rate of 0.3. We trained the model using the Adam optimizer with a learning rate of 0.001 and a weight decay of 1e-4. Early stopping was implemented with a patience of 5 epochs based on improvements in validation loss. As shown in Fig. 5, the models trained with 10-fold cross-validation were applied to the independent test set, achieving an average AUC of 0.94. Similarly, for the classification task using EfficientNet, we developed a custom model based on the EfficientNet architecture available in the efficientnet_pytorch library. To expedite model training, we used the pretrained weights of EfficientNet-B0. The optimizer, learning rate, and early stopping strategy were the same as those used for the ViT model. This model achieved an average AUC of 0.96 on the independent test set.
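The two classifier setups can be sketched as below. Only the reported hyperparameters (dropout 0.3, Adam, learning rate 0.001, weight decay 1e-4, early-stopping patience of 5 epochs) come from the text; the specific checkpoint names, the dropout placement, and the train/validate helpers are assumptions.

```python
# Hedged sketch of the ViT and EfficientNet-B0 setups; checkpoint names, dropout
# placement, and the train/validate helpers are assumptions, not the authors' code.
import torch
from efficientnet_pytorch import EfficientNet
from transformers import ViTForImageClassification

NUM_CLASSES = 8  # ADI, DEB, MUC, MUS, LYM, STR, NOR, TUM

# ViT initialized from a pretrained Hugging Face checkpoint (checkpoint name is an
# assumption), with the dropout rate set to 0.3 as reported in the text
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=NUM_CLASSES,
    hidden_dropout_prob=0.3,
)

# EfficientNet-B0 with pretrained ImageNet weights from efficientnet_pytorch
effnet = EfficientNet.from_pretrained("efficientnet-b0", num_classes=NUM_CLASSES)


def fit(model, train_one_epoch, validate, max_epochs=100, patience=5):
    """Train with Adam (lr 0.001, weight decay 1e-4) and early stopping on validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_val_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer, criterion)  # user-supplied training step
        val_loss = validate(model, criterion)         # user-supplied validation step
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # stop after 5 epochs without improvement
                break
    return best_val_loss
```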

Fig. 5
figure 5

ROC curves for the ViT and EfficientNet models. The AUC values are averaged over the 10-fold cross-validation.
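The text does not state how the eight-class AUC is aggregated; one common choice, shown below purely as an assumption, is the macro-averaged one-vs-rest ROC AUC from scikit-learn, computed per fold model and then averaged across folds.

```python
# Hedged sketch: macro-averaged one-vs-rest multi-class AUC per fold model, then
# averaged across folds. The aggregation scheme is an assumption; the paper only
# reports the final averaged AUC values.
import numpy as np
from sklearn.metrics import roc_auc_score


def multiclass_auc(y_true, y_prob):
    """y_true: (n,) integer labels; y_prob: (n, 8) predicted class probabilities."""
    return roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")

# fold_aucs = [multiclass_auc(y_test, predict_proba(model_k, X_test)) for model_k in fold_models]
# reported_auc = float(np.mean(fold_aucs))
```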