Background & Summary

The subjective evaluation of histological slides for GC remains the gold standard for its diagnosis and guides its clinical treatment. The corresponding histological slide images are routinely available and offer a wealth of information about the TME, which plays a critical role in disease progression and is a key determinant of treatment response in GC patients1,2,3. Thus, there is a high clinical need to comprehensively evaluate the TME patterns in the slide images of individual GC patients4,5,6,7.

Extensive research into TME patterns has provided new insights into the diagnosis and prognosis of various cancers8,9,10,11,12,13,14,15,16. Based on the quantification of various tissue components in the TME, several histological image biomarkers have been developed. For instance, Kather et al. quantified different tissue components within the TME of colorectal cancer (CRC) slides, developing a deep stroma score that significantly predicts CRC patient survival17. Liang et al. employed a deep learning-guided framework to analyze TME patterns in liver cancer histological images, discovering a novel tissue spatial biomarker strongly linked to patient prognosis18. Similarly, Wang et al. introduced the DeepGrade model, which utilizes TME features from whole-slide histopathology images of breast cancer to generate a risk score, serving as a prognostic biomarker and providing clinically relevant information beyond traditional histological grading19. Rong et al. developed a deep learning method for TME quantification within breast cancer histological images, identifying TME features as independent biomarkers for survival prediction20. Additionally, Wang et al. used a graph convolutional network to study the cell spatial organization in the TME derived from lung and oral cancer histological images, obtaining a spatial feature that excelled in prognostic and predictive tasks21. Another study by Wang et al. created a deep learning tool to explore the TME in lung cancer histological images, finding a cell spatial image biomarker predictive of patient survival and associated with gene expression22. Fremond et al. analyzed TME patterns in endometrial cancer histological images, identifying morpho-molecular correlates that could serve as a prognostic factor for risk stratification23.

Given these findings, it is evident that TME patterns in histological images play a crucial role in the diagnosis and prognosis of various cancers. Specifically, the TME patterns of GC have been shown to significantly impact its progression and therapeutic responses24. For example, Liu et al. found that different components of the GC TME play their own roles in inducing immune tolerance and thereby promoting GC progression25. However, the extraction and analysis of TME features in GC histological images are hindered by a scarcity of images with detailed TME annotations. To this end, we have built, to our knowledge, the largest histological image dataset of GC with detailed annotation of TME components, aiming to provide a platform for exploring TME patterns in histological images, potentially discovering new biomarkers and guiding treatment strategies for gastric cancer. In addition, this dataset can serve as a foundation for developing a pre-trained feature extractor, valuable for transfer learning in the diagnosis and prognosis of other cancer types. Furthermore, compared to the existing large dataset GasHisSDB, our dataset offers advantages in two key aspects. While GasHisSDB provides only binary classification (normal/abnormal), our dataset features eight distinct tissue classes within the TME, enabling more comprehensive gastric cancer diagnosis analysis. Notably, our dataset integrates comprehensive clinical information alongside the histological images, whereas GasHisSDB is limited to image data.

Methods

Slides preparation and digitization

Formalin-fixed, paraffin-embedded tissue slides were provided by the Cancer Hospital of Harbin Medical University and covered the years 2013 to 2015. A total of 300 H&E-stained slides of gastric cancer patients were collected from the hospital archives according to the pathology reports. The study was approved by the Research Ethics Committee of Harbin Medical University (ID: KY2024-16). The ethics committee issued a waiver of consent and approved the open publication of the dataset. All tissue and clinical information were collected from patients from whom written informed consent had been obtained for research purposes. All histological slides were scanned with an Aperio AT2 scanner (Leica Biosystems, Germany) equipped with a 20× objective lens, and the resulting histological images were saved in .svs file format.

Annotation process

The slides were evaluated by two junior pathologists (Huiying Li, Kexin Chen), each with more than 5 years of experience in the field, and a senior pathologist (Yang Jiang) with more than 10 years of experience, in order to confirm the diagnosis of each case and to annotate each slide image. For quality control, a three-step annotation process was performed: an initial labeling step, a verification step, and a final check step. We refer to these pathologists as A (senior pathologist), B (junior pathologist), and C (junior pathologist). Each image was first randomly assigned to reader B or C. Once the labeling was finished, the image and annotation were passed on to the other reader for review. Finally, the annotations were checked by reader A. Specifically, eight tissue classes related to the tumor microenvironment (TME) were annotated in each slide image: adipose tissue (ADI), debris (DEB), mucus (MUC), muscle (MUS), lymphocyte aggregates (LYM), stroma (STR), normal mucosa (NOR), and tumor epithelium (TUM) (see Fig. 1). These annotated slide images in .svs file format were then tiled into 224 × 224 patches and saved in .png format. Each patch was labeled with the same tissue label as its corresponding tissue region. In total, nearly 31 K patches were extracted from the 300 slide images. The detailed data distribution of these slide images and patches is illustrated in Fig. 2.
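The tiling step can be reproduced along the following lines. This is a minimal sketch rather than the authors' original pipeline: the annotation labels are assumed to be accessible through a hypothetical get_label_for_region helper, and the file paths and background filter are illustrative only.

```python
# Minimal tiling sketch (not the authors' original pipeline): reads a .svs slide
# with OpenSlide and saves non-overlapping 224 x 224 patches as .png files, one
# folder per TME class. The paths, the `get_label_for_region` helper, and the
# simple brightness-based background filter are assumptions.
import os

import numpy as np
from openslide import OpenSlide

PATCH_SIZE = 224


def tile_slide(svs_path, out_dir, get_label_for_region):
    slide = OpenSlide(svs_path)
    width, height = slide.dimensions  # level-0 (20x) dimensions
    for x in range(0, width - PATCH_SIZE + 1, PATCH_SIZE):
        for y in range(0, height - PATCH_SIZE + 1, PATCH_SIZE):
            # hypothetical lookup of the annotated tissue class, e.g. "TUM" or None
            label = get_label_for_region(x, y, PATCH_SIZE)
            if label is None:
                continue  # patch falls outside the annotated TME regions
            patch = slide.read_region((x, y), 0, (PATCH_SIZE, PATCH_SIZE)).convert("RGB")
            if np.asarray(patch).mean() > 230:
                continue  # skip mostly-background (very bright) patches
            label_dir = os.path.join(out_dir, label)
            os.makedirs(label_dir, exist_ok=True)
            patch.save(os.path.join(label_dir, f"{x}_{y}.png"))
```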

Fig. 1
figure 1

Histological slide image of gastric cancer and the labeled patches; (A) The whole slide image; (B) Extracted patches labeled with the eight TME tissue classes.

Fig. 2
figure 2

Size distributions of the slide image widths and heights, and the distribution of the number of patches per slide.

Clinical data acquisition

The clinical data of these GC patients were obtained from the records in the information system of the Cancer Hospital of Harbin Medical University. The corresponding variables of the data are displayed in Table 1. Personal identifiers were removed to prevent the identification of individual patients.

Table 1 Overall clinical variables related to the provided histological slide images dataset.

Data Records

The complete dataset, named HMU-GC-HE-30K, is publicly available on Figshare (https://doi.org/10.6084/m9.figshare.25954813)26. It consists of two components: a file containing the annotated image patches from the corresponding histological slide images, and a spreadsheet named “HMU-GC-Clinical.csv”, which includes the clinical data tabulated according to Table 1. The dataset file structure is shown in Fig. 3. Histological slide images in .svs format, patch images in .png format, and clinical information data are provided. The labeled patch images of the slide images are stored in folders named according to the corresponding tumor microenvironment (TME) tissue components, such as ADI, DEB, MUC, etc. These images and the related clinical information can be used to extract histological TME features for various downstream tasks, such as prediction and prognosis.
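Because the patches are organized into one folder per tissue class, they can be loaded with standard tooling. The sketch below uses torchvision's ImageFolder for the patches and pandas for the clinical spreadsheet; the local root path is an assumption.

```python
# Sketch of loading the released patches and clinical table for downstream use.
# Folder and file names follow the structure described above; the local root
# path "HMU-GC-HE-30K" is an assumption.
import pandas as pd
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder maps each class folder (ADI, DEB, MUC, MUS, LYM, STR, NOR, TUM) to a label
patch_dataset = datasets.ImageFolder("HMU-GC-HE-30K/patches", transform=transform)
print(patch_dataset.classes)

# Clinical variables tabulated according to Table 1
clinical = pd.read_csv("HMU-GC-HE-30K/HMU-GC-Clinical.csv")
print(clinical.head())
```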

Fig. 3
figure 3

Overview of the dataset structure.

Technical Validation

The slide evaluation and annotation workflow included quality control by three pathologists: two junior pathologists with over 5 years of experience and one senior pathologist with over 10 years of experience. Images were first randomly assigned to one of the junior pathologists for initial labeling. The labeled images were then cross-reviewed by the other junior pathologist, and finally, all annotations underwent a final check by the senior pathologist to ensure accurate diagnosis and annotation of each slide image.

The HMU-GC-HE-30K dataset comprises 3,887 images for each of the eight tumor microenvironment (TME) components. To validate the dataset proposed in this study, we used two previously published models, the Vision Transformer (ViT)27 and the CNN-based EfficientNet28, for the classification of the eight TME tissue components mentioned above (see Fig. 4). The models were trained using a 10-fold cross-validation strategy: 20% of the dataset was first allocated as an independent test set, and the remaining data were used for 10-fold cross-validation. To ensure proportional representation of each component in both the training and test sets, a stratified sampling method was employed.
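The split described above can be sketched as follows. The per-patch labels are read from the class folders; the random seed and variable names are assumptions, not part of the released dataset or the authors' code.

```python
# Illustrative split matching the description: a stratified 20% hold-out test set,
# then stratified 10-fold cross-validation on the remaining 80%. The root path and
# the random seed are assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from torchvision import datasets

# per-patch class labels taken from the folder structure
patch_dataset = datasets.ImageFolder("HMU-GC-HE-30K/patches")
labels = np.array(patch_dataset.targets)
indices = np.arange(len(labels))

# stratified 20% hold-out test set
trainval_idx, test_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=42
)

# stratified 10-fold cross-validation on the remaining data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr, va) in enumerate(skf.split(trainval_idx, labels[trainval_idx])):
    fold_train_idx = trainval_idx[tr]
    fold_val_idx = trainval_idx[va]
    # train one model per fold on fold_train_idx, select on fold_val_idx,
    # then evaluate the resulting models on the fixed test_idx
```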

Fig. 4
figure 4

Workflow of data preprocessing and model architecture. (A) The histological slide image, i.e., the whole slide image (WSI), is digitized, segmented, and tessellated into 224 × 224 patches. (B) ViT model pipeline: the patch image is split, flattened, and linearly projected into patch embeddings, followed by feature extraction via a transformer layer with multi-head attention. Predictions for the various tissue classes are made using a multi-layer perceptron (MLP). (C) EfficientNet model pipeline: the input image undergoes initial feature extraction via a convolution layer, followed by deep feature extraction using MBConv blocks. The extracted features are then processed through global average pooling, a flatten layer, a dropout layer, and a fully connected (FC) layer for the prediction of the various tissue classes.

For the classification task using ViT, we utilized pretrained parameters downloaded from Hugging Face (https://huggingface.co/). The ViT model architecture included a dropout layer with a rate of 0.3. We trained the model using the Adam optimizer with a learning rate of 0.001 and a weight decay of 1e-4. Early stopping was implemented with a patience of 5 epochs based on improvements in validation loss. As shown in Fig. 5, the models trained with 10-fold cross-validation were applied to the independent test set, achieving an average AUC of 0.94. Similarly, for the classification task using EfficientNet, we developed a custom model based on the EfficientNet architecture available in the efficientnet_pytorch library. To expedite model training, we used the pretrained weights of EfficientNet-B0. The optimizer, learning rate, and early stopping strategy were the same as those used for the ViT model. This model achieved an average AUC of 0.96 on the independent test set.
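The two classifier setups can be sketched as below. Only the reported hyperparameters (dropout 0.3, Adam, learning rate 0.001, weight decay 1e-4, early-stopping patience of 5 epochs) come from the text; the specific checkpoint names, the dropout placement, and the train/validate helpers are assumptions.

```python
# Hedged sketch of the ViT and EfficientNet-B0 setups; checkpoint names, dropout
# placement, and the train/validate helpers are assumptions, not the authors' code.
import torch
from efficientnet_pytorch import EfficientNet
from transformers import ViTForImageClassification

NUM_CLASSES = 8  # ADI, DEB, MUC, MUS, LYM, STR, NOR, TUM

# ViT initialized from a pretrained Hugging Face checkpoint (checkpoint name is an
# assumption), with the dropout rate set to 0.3 as reported in the text
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=NUM_CLASSES,
    hidden_dropout_prob=0.3,
)

# EfficientNet-B0 with pretrained ImageNet weights from efficientnet_pytorch
effnet = EfficientNet.from_pretrained("efficientnet-b0", num_classes=NUM_CLASSES)


def fit(model, train_one_epoch, validate, max_epochs=100, patience=5):
    """Train with Adam (lr 0.001, weight decay 1e-4) and early stopping on validation loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
    criterion = torch.nn.CrossEntropyLoss()
    best_val_loss, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch(model, optimizer, criterion)  # user-supplied training step
        val_loss = validate(model, criterion)         # user-supplied validation step
        if val_loss < best_val_loss:
            best_val_loss, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # stop after 5 epochs without improvement
                break
    return best_val_loss
```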

Fig. 5
figure 5

ROC curves for the ViT and EfficientNet models. The AUC values are averaged over the 10-fold cross-validation.
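The text does not state how the eight-class AUC is aggregated; one common choice, shown below purely as an assumption, is the macro-averaged one-vs-rest ROC AUC from scikit-learn, computed per fold model and then averaged across folds.

```python
# Hedged sketch: macro-averaged one-vs-rest multi-class AUC per fold model, then
# averaged across folds. The aggregation scheme is an assumption; the paper only
# reports the final averaged AUC values.
import numpy as np
from sklearn.metrics import roc_auc_score


def multiclass_auc(y_true, y_prob):
    """y_true: (n,) integer labels; y_prob: (n, 8) predicted class probabilities."""
    return roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")

# fold_aucs = [multiclass_auc(y_test, predict_proba(model_k, X_test)) for model_k in fold_models]
# reported_auc = float(np.mean(fold_aucs))
```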