Introduction

Hepatocellular carcinoma (HCC) is one of the most prevalent primary cancers and a leading cause of cancer-related mortality worldwide. It is estimated that HCC accounts for 80% of liver cancer (with an estimated incidence exceeding 900,000 cases and approximately 830,000 deaths annually) diagnoses worldwide1,2. This aggressive disease accounts for nearly 90% of liver cancer deaths, underscoring its significant impact on public health3,4. Although surgical treatment remains the most effective option, many patients miss this opportunity because of the insidious and rapidly progressing nature of the disease, frequently resulting in delayed diagnosis. Furthermore, HCC is characterized by high rates of chemotherapy resistance, metastasis, and recurrence, which contribute to its poor prognosis4,5. Therefore, understanding the molecular mechanisms underlying HCC is essential for identifying new diagnostic and therapeutic targets.

The cytoskeleton, composed of actin filaments, microtubules, and intermediate filaments, plays a critical role in maintaining cellular integrity and facilitating essential processes such as migration, invasion, and cell division6. This dynamic framework not only provides structural support but also orchestrates intracellular transport and signal transduction pathways, which are vital for various cellular functions7. Dysregulation of cytoskeletal components has been implicated in enhancing tumor invasion and metastasis8,9. Cytoskeletal dynamics are crucial in cancer progression, with proteins such as tubulin significantly influencing microtubule stability and dynamics10. Drugs that directly target tubulin and affect its polymerization and depolymerization, such as paclitaxel, have been widely utilized in various cancers, including breast and pancreatic cancer11. Additionally, emerging concepts, such as post-translational acetylation modifications and their corresponding therapeutic approaches (microtubule acetylation inhibitors), have been experimentally validated to significantly inhibit tumor cytoskeletal dynamics, demonstrating the potential to suppress tumor metastasis and progression12,13. Furthermore, cytoskeletal dynamics rely on streamlined molecular pathways, including RhoA, PI3K/AKT, and TGF-β10.

In this study, we compared two machine learning methods: Least Absolute Shrinkage and Selection Operator (LASSO)-stepwise Cox regression and Random Forest (RF). Utilizing one training set and two testing sets, we ultimately identified a robust 5-gene OS prediction model demonstrating satisfactory predictive performance. We further conducted a comprehensive analysis of this model and its 5 core genes across multiple aspects, including expression levels, immune infiltration, transcriptional regulation, mutation status, and drug response, as well as various transcriptomic dimensions such as bulk RNA sequencing, scRNA-seq, and spatial transcriptomics. Our research provides new theoretical references and potential targets for advancing the current study of the cytoskeleton in HCC.

Materials and methods

All animal experiments were conducted under the guidelines and the institutional animal care protocol approved by the laboratory animal care and ethics committee of Guangdong Laidi Biomedical Research Institute Co.,LTD(LDSW-SOR-AAP-05–023(02)). The study was conducted in accordance with the Declaration of Helsinki, and ethical review and approval were waived for the human data analysis portion due to exclusive use of publicly available databases that had previously received ethical approval at their original institutions.

Data collection and preprocessing

Transcriptomic data and clinical information of HCC patients and normal tissue donors were obtained from the TCGA-LIHC dataset (https://portal.gdc.cancer.gov/), the International Cancer Genome Consortium (ICGC) (LIRI-JP) dataset (https://xenabrowser.net/datapages/), and the Chinese HCC patients with HBV infection (CHCC-HBV) dataset14. All data were standardized and converted to FPKM for subsequent analyses.

Identification of differentially expressed genes (DEGs) and functional clustering

A total of 367 cytoskeleton-related genes were obtained from the MSigDB database. DEGs between high and low OS groups (divided by median OS) in the training set (TCGA-LIHC) were identified using the “limma” R package15, with p < 0.05 as the cutoff. R package “heatmap” was used to generate Heatmaps showing differences in gene expression between high and low OS groups. After performing ID conversion on the input molecular list, functional enrichment analyses using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)16,17,18 were performed using the “clusterProfiler” R package.

Gene correlation analysis and heatmap visualization

Heatmaps illustrating the correlations between genes were created using“corrplot”,“heatmap”, and“ggpubr”R packages. Correlation p-values were displayed in the figures as follows: * for < 0.05, ** for < 0.01, and *** for < 0.001.

Protein–protein interaction (PPI) network analysis

Online PPI analysis and clustering were performed on the STRING website (https://string-db.org/).

Prognostic heatmap visualization

Prognostic heatmaps were generated using “pheatmap” and “ggplot2” R packages. Scatter plots of OS-Status distribution and bar graphs of risk factor distribution were overlaid above the gene expression heatmap.

Construction of prognostic prediction model using COX regression and machine learning methods

A prognostic prediction model for HCC was constructed using COX regression and machine learning techniques. COX regression analysis was performed using the“survival”R package to identify key genes associated with OS. LASSO stepwise COX regression analysis was performed using the“glmnet”R package, with tenfold cross-validation to obtain the optimal model. The“gg_minimal_vimp”function in the“randomForestSRC”package was used to evaluate variable importance in the model, which was visualized alongside minimum depth information to identify the most critical features for model prediction. Subsequently, the“obliqueRSF”package was employed to calculate error rates (1—C-index) from predictions, recording error rates for tree numbers from 10 to 1000 in increments of 10. Based on the lowest error rate, a random forest (RF) model was established with 110 trees. Risk scores for each sample were calculated based on predictions and incorporated into the training and testing datasets for further analysis.

Evaluation of predictive performance

ROC curves were generated, and the area under the curve (AUC) was calculated using the“timeROC”and“pROC”R packages to evaluate predictive performance. The difference in predictive performance between the two models was assessed using the DeLong test, with a p-value of less than 0.05 considered statistically significant. To compare survival outcomes between different risk groups, the R package“maxstat”was used to determine the optimal cutoff value for risk score, and the“survival”R package was used to plot Kaplan–Meier (K-M) curves. (Subsequent survival analysis was conducted using the same method.) C-index values at corresponding time points were calculated using the “survival” package, and the results were visualized using the “ggplot2”.

Multivariate COX regression and nomogram construction

Multivariate COX regression was performed using the“survival”R package, integrating survival time, survival status, and other data to evaluate the prognostic significance of risk score and clinical features. The“forestploter”R package was used to create a forest plot that displays the outcomes of the multivariate COX regression analysis. A nomogram-related model was constructed and visualized using the“rms”R package. To validate the multivariate COX analysis and nomogram model, calibration curve analysis and visualization were performed using the“rms”R package. Additionally, to provide a more intuitive representation of the predictive value of independent risk factors, the prognostic model was fitted using the“survival”package, and DCA analysis was conducted using the“stdca”R package.

Consensus clustering analysis and sankey diagram visualization

Cluster analysis was conducted with ConsensusClusterPlus19, utilizing agglomerative PAM clustering based on 1-Pearson correlation distances, and resampling 80% of the samples over 10 repetitions. The optimal cluster count was identified using the empirical cumulative distribution function (ECDF) plot. The R package “networkD3” was used to analyze the flow relationships among risk score, Consensus Clusters, Status, and OS, and a Sankey diagram was created.

t-SNE analysis of gene expression

t-SNE analyses were conducted using“Rtsne”R packages. Expression profiles were first z-score normalized, and two groups, defined by high and low risk scores (with its median as the cutoff threshold), were displayed to show group distribution.

Differential expression and immune infiltration of 5 core genes

We analyzed the expression difference of 5 core genes between HCC tissues and normal liver tissues from HCC patients in the TIMER2.0 website (http://timer.cistrome.org/). The relationship between the expression level of 5 genes and the infiltration of four immune cell types: CD4 + T cells, Regulatory T cells (Tregs), myeloid-derived suppressor cells (MDSCs), and cancer-associated fibroblasts (CAFs), was examined via Spearman correlation analysis.

Mutation gene analysis

Genomic data for TCGA-LIHC, encompassing somatic mutations and DNA copy number variations, were retrieved from the Genomic Data Commons (GDC) website (https://portal.gdc.cancer.gov/). Differential mutation gene analysis was performed using R packages “maftools”, “base”, and “graphics” to identify mutation genes associated with risk scores and explore their expression correlations.

Transcription factor analysis and network venn diagram visualization

Transcription factors associated with the 5 core genes in the liver were retrieved from the “hTFtarget” database. R packages “igraph”, “ggraph”, and “ggplot2” were used to create a network Venn diagram. Common transcription factors shared by the 5 genes were annotated in the diagram.

Methylation analysis of HCC patients

Methylation levels and differential analysis data for the 5 core genes were obtained from the UALCAN website (https://ualcan.path.uab.edu/).

Single-cell RNA sequencing analysis

Single-cell RNA sequencing (scRNA-seq) analysis was performed using the GSE149614 dataset, which includes 21 samples (8 non-tumor liver tissues (N), 10 primary HCC tumor tissues (T), 1 metastatic lymph node tissue (L), and 2 portal vein tumor thrombus tissues (P)). The analysis utilized the“Seurat”R package (version 4.0) for data processing and analysis. Cells were filtered based on criteria of 200 < nFeature_RNA < 8000 genes per cell and mitochondrial gene expression proportions < 10%. For samples with ≥ 10 cells, batch correction was implemented via Seurat’s integration pipeline, involving“SplitObject”,“FindIntegrationAnchors”, and“IntegrateData”based on canonical correlation analysis. Data normalization and standardization were performed using the"NormalizeData + FindVariableFeatures + ScaleData"workflow, focusing on 2000 highly variable genes. Principal component analysis (PCA) was conducted with“RunPCA”, selecting principal components explaining 90% of variance (default 20 components). Clustering was performed using“FindNeighbors”and“FindClusters”, with cell types annotated via the“SingleR”function against the HumanPrimaryCellAtlasData reference. Dimensionality reduction and visualization were achieved using UMAP. Key analyses included expression profiling of 5 risk genes (ARPC1A, CCNB2, CKAP5, DCTN2, TTK), calculation of risk scores (weighted by gene-specific coefficients), and investigation of immune checkpoint ligand-receptor pairs (PD-L1/PD-1, CD80/CD86/CTLA4, NECTIN2/PVR/TIGIT, LGALS9/HAVCR2, CD47/SIRPA). Pearson correlation analysis assessed relationships between risk genes, immune cell infiltration ratios, and immune escape genes at the sample level. Visualizations, including UMAP plots, violin plots, and correlation heatmaps, were generated using“ggplot2”, with statistical reliability ensured by requiring ≥ 3 cells per group and significance denoted by * (p < 0.05).

Spatial transcriptomics analysis

Spatial transcriptomics analysis was performed using the HCCDB website (http://lifeome.net:809/#/home)20, specifically using four slice samples from the “HCC2” group. The spatial distribution of the 5 core genes (ARPC1A, CCNB2, CKAP5, DCTN2, TTK) was visualized.

Drug prediction and drug sensitivity analysis

Drug prediction was carried out using the DSigDB database from Enrichr (https://maayanlab.cloud/Enrichr/). Drug sensitivity analysis was performed using the Liver Cancer Organoid Biobank (LICOB) dataset21, which contains expression profiles and half-maximal inhibitory concentration (IC50) data for HCC organoids. Additionally, two datasets,“Expression_Public_24Q2_subsetted”and"PRISM_Repurposing_Public_24Q2_subsetted (IC50)", were downloaded from the DepMap database (https://depmap.org/portal/) and used as raw data for drug resistance analysis in cell lines.

Molecular docking

Protein structures were retrieved from the PDB database (https://www.pdb.org/), and drug structures were obtained from the PubChem database (https://pubchem.ncbi.nlm.nih.gov/). After removing water molecules and non-essential compounds from the molecules using MOE (Chemical Computing Group, Montreal, QC, Canada), molecular docking was performed following the principles of maximum binding affinity, maximum interaction sites, and minimum binding energy to select the most suitable docking scheme. Additionally, two-dimensional and three-dimensional docking diagrams were generated using MOE.

Cell culture

PLC/PRF/5 and Hep3B cell lines were purchased from the American Type Culture Collection (ATCC, Manassas, VA, USA). HCCLM3, Huh-7, and SNU449 cell lines were purchased from Cellcook Biotech (Guangzhou, China). Cells were cultured at 37 °C with 5% CO₂ in Dulbecco’s Modified Eagle Medium (DMEM) supplemented with 10% fetal bovine serum (FBS) and 1 × penicillin–streptomycin (PS). Cells were passaged when they reached 80% confluence. Sorafenib (5 μM) and/or Irinotecan (10 μM) were used as indicated.

Reverse transcription quantitative polymerase chain reaction (RT-qPCR)

Vazyme TRIzol reagent (Nanjing, China) was used to extract total RNA from HCC cells and then Vazyme HisScript® III RT SuperMix kit (Nanjing, China) was used to reverse-transcribe RNA into cDNA. RT-qPCR was performed using the Vazyme ChamQ Universal SYBR qPCR Kit (Nanjing, China) on a Roche LightCycler® 96 instrument. GAPDH was used as an internal reference to normalize the expression of target genes. Each sample was analyzed in triplicate. The reverse transcription and RT-qPCR procedures followed the manufacturer’s instructions for reagents. Primer sequences used in this study include:

GAPDH: Forward Sequence: GTCTCCTCTGACTTCAACAGCG; Reverse Sequence: ACCACCCTGTTGCTGTAGCCAA.

ARPC1A: Forward Sequence: AGAGTTCCTGCCGCTCCTAAGT; Reverse Sequence: ACTTGGAGACGAAGGTCAGGCA.

CCNB2: Forward Sequence: CAACCAGAGCAGCACAAGTAGC; Reverse Sequence: GGAGCCAACTTTTCCATCTGTAC.

CKAP5: Forward Sequence: TCATCACAGTCCTTGGAGACAGC; Reverse Sequence: CAGCCATTCCTTCATGCCAGTC.

DCTN2: Forward Sequence: CCCAGATAGCAGCCTTGTCACT; Reverse Sequence: TCCAGCTCTGTCAGGCGCTTTT.

TTK: Forward Sequence: CCGAGATTTGGTTGTGCCTGGA; Reverse Sequence: CCGAGATTTGGTTGTGCCTGGA.

Cell proliferation assay

PLC/PRF/5 cells were seeded into 96-well plates at a density of 1000 cells per well. Cell proliferation was measured at 24, 48, 72, 96, and 120 h using a Cell Counting Kit-8 (CCK8, RARbio, China). After 24 h, media were refreshed, and cells were treated with the corresponding drug concentrations according to the experimental groups. At each time point, 100 μL of complete medium containing 10 μL of CCK-8 reagent was added to each well, mixed thoroughly, and incubated for 2 h. Absorbance was measured at 450 nm.

Colony formation assay

1000 cells per well were plated in 6-well plates and incubated at 37 °C with 5% CO₂ for 14 days. Cells were treated with the corresponding drug concentrations during the culture period. Colonies were fixed with methanol and stained with 0.05% crystal violet for 30 min. Colonies containing more than 50 cells were counted under a microscope.

Xenograft formation in nude mice

All animal experiments were conducted under the guidelines and the institutional animal care protocol approved by the laboratory animal care and ethics committee of Guangdong Laidi Biomedical Research Institute Co.,LTD(LDSW-SOR-AAP-05–023(02)).

Female BALB/c nude mice (4–6 weeks old) purchased from GemPharmatech LLC (Jiangsu, China), were utilized to establish subcutaneous xenograft tumor models. Transfected PLC/PRF/5 cells were collected, washed, and resuspended in PBS. Subsequently, 2 × 10^6 cells (suspended in 100 μL PBS) were injected subcutaneously into the axilla of each mouse. Once the tumor volume in the nude mice reached 200 mm3, the mice were grouped for different treatments. The tumor volume and body weight of each mouse were recorded every three days. Mice were considered to have reached the observation endpoint if the total treatment duration reached 28 days, tumor volume reached 1500 mm3, or significant weight loss occurred due to tumor growth or treatment. According to the protocol, mice were euthanized upon exhibiting any of these criteria: (1) signs of uncontrollable pain or decreased mobility; (2) weight loss exceeding 20% of initial body mass; (3) lethargy, diminished response to external stimuli, dull or ruffled fur, hunched posture, reduced food and water intake, decreased urine and fecal output; (4) behavioral and clinical manifestations including palpable abdominal masses, persistent somnolence, sustained temperature abnormalities, respiratory distress, or other indicators of severe health deterioration; (5) reaching the defined observation endpoint. For conditions (1) through (4), specimens were collected through euthanasia prior to symptom onset, based on anticipated tumor development timing or experimental design requirements. Such mice were euthanized by cervical dislocation and disposed of humanely.

Statistical analysis

All statistical analyses were performed using R software (version 4.2.0) and GraphPad Prism (version 9.0). Data are presented as mean ± standard deviation (SD). Comparisons between two groups were performed using the Student’s t-test, while comparisons among multiple groups were conducted using one-way ANOVA followed by Tukey’s post hoc test. Survival analysis was performed with the Kaplan–Meier method, and the log-rank test was used to evaluate differences between the survival curves. Correlation analysis was performed using Spearman’s or Pearson’s correlation coefficients as appropriate. A p-value < 0.05 was considered statistically significant. We define statistical significance as usual: **** P ≤ 0.0001; *** P ≤ 0.001; ** P ≤ 0.01; * P ≤ 0.05; n.s., not significant.

The flowchart of this study is shown in Fig. 1.

Fig. 1
figure 1

Flowchart of this study.

Results

Identification of core cytoskeleton-related DEGs in hepatocellular carcinoma

First, we classified 365 HCC samples from the TCGA-LIHC database into two groups based on overall survival (OS). Differential expression analysis of 367 cytoskeleton-related genes between these two groups revealed 110 differentially expressed genes (DEGs) (Fig. 2A). Subsequently, we conducted interaction analysis (Fig. 2B) and clustering analysis (Fig. 2C) of the DEGs using STRING, identifying 2 major clusters containing 80 cytoskeleton-related DEGs (green and red gene clusters in Fig. 2C). Correlation analysis among the DEGs indicated that the relationships among the 80 genes within these two major clusters were more consistent than those among all 110 DEGs, with most genes exhibiting positive correlations with each other in expression levels (Figs. 2D-E). Finally, GO and KEGG analyses of the clustered 80 DEGs revealed associations primarily with cytoskeletal structure, intercellular junctions, spindle formation, and cellular invasion activities (Fig. 2F).

Fig. 2
figure 2

Identification of core cytoskeleton-related DEGs in hepatocellular carcinoma. (A) Screening for DEGs associated with OS and the cytoskeleton. (B) Interaction analysis and (C) clustering of the DEGs. (D) Correlation analysis among all 110 DEGs, and (E) correlation analysis of 80 genes within the two major clusters (green and red gene clusters). (F) GO and KEGG analysis of the two major DEG clusters.

Construction of prognostic models for hepatocellular carcinoma associated with cytoskeleton

To establish a prognostic predictive model for HCC, we performed univariate COX analysis on the 80 identified genes, screening out 13 key genes (Fig. 3A). To develop a more robust model, we utilized the TCGA-LIHC dataset as the training cohort. We first applied LASSO regression analysis, determining an optimal λ value of 0.01, which led to the selection of 9 candidate genes (Figs. 3B and 3 C). Subsequently, we constructed a five-gene model comprising ARPC1A, CCNB2, CKAP5, DCTN2, and TTK through stepwise COX regression from the 9 genes identified by LASSO (Fig. 3D). The heatmap of gene expression associated with OS indicated a consistent distribution, with higher expression levels observed in individuals with lower OS (Fig. 3E). The prognostic ROC curve demonstrated strong predictive capability for the model, with AUC values of 0.77 (95% CI: 0.70–0.83), 0.74 (95% CI: 0.67–0.81), and 0.73 (95% CI: 0.65–0.81) for 1-year, 3-year, and 5-year survival, respectively (Fig. 3F). Based on the risk score from the five-gene model, the K-M curve indicated that HCC patients in the high-risk score group experienced significantly shorter survival than those in the low-risk score group (Fig. 3G). The C-index was calculated to be 0.695 (95% CI: 0.671–0.719), indicating good predictive performance of the model (Fig. 3H).

Fig. 3
figure 3

Construction of prognostic models for hepatocellular carcinoma associated with cytoskeleton. Analysis results from TCGA-LIHC database: (A) Forest plot of 13 key genes from univariate COX analysis of DEGs. (B) Optimal λ value and cross-validation plot from LASSO regression analysis, (C) LASSO model coefficients plot. (D) Forest plot of 5 core genes from stepwise COX regression of 9 LASSO-selected genes, (E) OS-associated gene expression heatmap. (F) Prognostic ROC curve, (G) K-M survival curve (OS), and (H) C-index plot for 5-gene model risk score from LASSO-stepwise COX analysis. (I) Identification of 5 core genes using VIMP + minimum depth method in Random Forest model. (J) Random Forest tree number vs. error rate analysis, (K) prognostic ROC curve, (L) K-M survival curve, and (M) C-index plot for optimal 5-gene model from Random Forest analysis.

Furthermore, within the TCGA-LIHC training cohort, we constructed an alternative prognostic risk model using the RF method. Employing the variable importance measure (VIMP) combined with minimal depth, we identified 5 core genes (ARPC5, CAPZA1, CKAP5, WASF2, and TTK) (Fig. 3I). Analysis of the optimal five-gene model from the random forest revealed the lowest error rate with 110 trees (Fig. 3J). The prognostic ROC curve for the random forest model indicated AUC values of 0.91 (95% CI: 0.88–0.95), 0.89 (95% CI: 0.84–0.93), and 0.90 (95% CI: 0.86–0.95) for 1-year, 3-year, and 5-year survival, respectively (Fig. 3K), demonstrating stronger prognostic capability compared with LASSO-stepwise COX model (DeLong test, p < 0.05). K-M analysis based on the risk score from the five-gene model similarly indicated that HCC patients in the high-risk score group had considerably shorter survival than their counterparts (Fig. 3L). The C-index for the RF model was calculated at 0.836 (95% CI: 0.809–0.872), suggesting that the model’s predictive capability was superior to that of the LASSO-stepwise COX regression model in the TCGA-LIHC dataset (Fig. 3M).

Validation of the prognostic models

After establishing the two models, we validated them using 2 test datasets: ICGC (LIRI-JP) and CHCC-HBV. We depicted prognostic ROC curves, K-M curves for OS, and C-index values based on 2 models derived from the LASSO-stepwise COX method and the RF method.

The results indicated that AUC for the prognostic ROC curves of the LASSO-stepwise COX model consistently remained at 0.7 or above in both test datasets. In contrast, the RF model exhibited poor predictive capability in both test datasets, with AUC values below 0.7. Notably, in the CHCC-HBV dataset, the AUC for the prognostic ROC curve was particularly low, dropping to 0.51 (95% CI: 0.41–0.62) at the 3-year mark. DeLong test indicated that the AUC values of ROC curves for corresponding years of the LASSO-stepwise COX model in each dataset were significantly greater than those of the corresponding RF model (p < 0.05). (Figs. 4A-D).

Fig. 4
figure 4

Validation of the prognostic models. (A-D) ROC curves, (E–H) K-M survival curves, and (I-L) C-index plots from LASSO-Stepwise COX and Random Forest models using ICGC (LIRI-JP) and CHCC-HBV datasets. (M–O) Forest plot, nomogram, and calibration curve from multivariate COX analysis of 5-gene model (TCGA-LIHC).

Moreover, the K-M curves indicated that in the ICGC (LIRI-JP) (n = 231) and CHCC-HBV test datasets (n = 159), the p-values generated by the RF model were significantly higher than those produced by the corresponding LASSO-stepwise COX model, further suggesting that the predictive capability of the former may not be as robust as that of the latter (Figs. 4E-H).

The C-index values for the LASSO-stepwise COX model were 0.748 (95% CI: 0.703–0.794) in the ICGC (LIRI-JP) dataset and 0.677 (95% CI: 0.643–0.710) in the CHCC-HBV dataset, while the RF model yielded C-index values of 0.583 (95% CI: 0.52–0.65) and 0.584 (95% CI: 0.534–0.637), respectively (Figs. 4I-L). Overall, combining the results from the two testing methods, we found that the RF model had limited generalizability to other datasets, whereas the LASSO-stepwise COX model maintained robust predictive performance. Consequently, we proceeded with the five-gene model derived from the LASSO-stepwise COX method for further analyses.

For the model constructed by the LASSO-stepwise COX method, we incorporated conventional clinical parameters and conducted multivariate COX regression analysis in the TCGA-LIHC dataset (n = 353). This analysis revealed that both risk score and age were independent risk factors (Fig. 4M). We subsequently constructed a prognostic nomogram based on this multivariate model, which exhibited a C-index of 0.721 (95% CI: 0.687–0.754) (Fig. 4N). The nomogram was further validated using calibration curves (Fig. 4O) and decision curve analysis (DCA) (Figure S1A), both indicating good predictive value. In the two test datasets, ICGC (LIRI-JP) (n = 231) and CHCC-HBV (n = 159), the heatmap of gene expression related to OS showed a similar distribution of the 5 genes, with higher expression levels observed in individuals with low OS, consistent with the training dataset TCGA-LIHC (Figures S1B and S1D). Similarly, multivariate COX regression analysis incorporating clinical factors indicated that the risk score continued to be an independent risk factor in both test datasets (Figures S1C and S1E). To evaluate the performance of our risk model, we compared it with existing models constructed around TCGA-LIHC from other studies. Based on a previous study that summarized C-index values of 10 different aspects of hepatocellular carcinoma machine learning models, including HCC prognosis-related gene signatures (HPRGs), anoikis-related genes signatures (ARGs), lactic acid metabolism-related gene signatures (LMRGs), cuproptosis-related gene signatures (CRGs), epithelial-mesenchymal transition-related gene signatures (EMTRGs), fatty acid metabolism gene signatures (FMRGs), cancer-associated fibroblast signatures (CFGs), necroptosis-related gene signatures (NRGs), inflammatory response-related gene signatures (RRGs), and immunotherapy-related gene signatures (IRGs), the C-index ranges were 0.496–0.709 in the training set TCGA-LIHC (with HPRGs being optimal) and 0.502–0.703 in the testing set LIRI-JP (with HPRGs being optimal)22. Our current risk model demonstrated C-index values of 0.695 (0.671–0.719) in TCGA-LIHC and 0.748 (0.703–0.794) in LIRI-JP, which were non-inferior to these studies and even more robust than these existing models. These findings further support the clinical value of our 5-gene model in hepatocellular carcinoma prognosis prediction.

Additionally, we identified the intersection of genes involved in the LASSO-stepwise COX and RF analyses, yielding two common genes: CKAP5 and TTK (Figure S2A). Due to the limited number of genes, we further analyzed CKAP5 and TTK using COX regression to construct a new model. However, the COX regression results indicated that only TTK exhibited statistically significant differences (Figure S2B). This finding suggests that TTK may serve as a core gene significantly influencing the prognosis of HCC, providing an important lead for subsequent drug screening analyses and experiments. Therefore, we further assessed the predictive capability of TTK. The prognostic ROC curves derived from TCGA-LIHC, ICGC (LIRI-JP), and CHCC-HBV datasets demonstrated that TTK also had good prognostic judgment capability (Figure S2C). K-M analysis revealed that elevated TTK expression was linked to worse prognosis (Figure S2D).

Representational role of the five-gene model for two clusters based on the expression of eighty core DEGs in hepatocellular carcinoma

To assess whether the 5 genes derived from the LASSO-stepwise COX method accurately represent the 80 DEGs obtained from clustering, we performed dimensionality reduction analyses using the risk score derived from the LASSO-stepwise COX model as the grouping criterion (Low/High), with its median as the cutoff threshold, and the expression of the 80 DEGs as the expression profile. The t-SNE analysis (Fig. 5A), preliminarily indicated that the risk score effectively segregated the expression profiles of all 80 genes into two subgroups. Additionally, we performed consensus clustering on the expression levels of the 80 DEGs using ConsensusClusterPlus, determining the optimal cluster number to be k = 2 (Fig. 5B). The ECDF plot illustrated the consensus clustering for each k (Fig. 5C). We used the diagnostic ROC curve to evaluate the ability of the risk score from the LASSO-stepwise COX model to predict the clustering results of the 80 DEGs, yielding an AUC of 0.76 (95% CI: 0.71–0.81) (Fig. 5D). Finally, the Sankey diagram illustrated the relationships among clustering, the five-gene model risk score, survival status, and OS, suggesting a strong consistency between the risk score and the clustering results of the 80 DEGs. Most samples in cluster C1 exhibited high-risk score, while cluster C2 displayed the opposite trend (Fig. 5E). In summary, we believe that the 5 core genes and the model they constitute can represent the 80 genes.

Fig. 5
figure 5

Representational Role of the Five-Gene Model for Two Clusters Based on the Expression of Eighty Core DEGs in hepatocellular carcinoma. (A) t-SNE analysis of 80 DEGs by risk score (Low/High) from LASSO-Stepwise COX model. (B) Consensus clustering of 80 DEGs (optimal k = 2). (C) ECDF plots of consensus distributions. (D) ROC curve of 5-gene model diagnostic value for clusters. (E) Sankey diagram of clustering, risk score, survival status, and OS relationships.

The expression of 5 genes in hepatocellular carcinoma and their relationships with prognosis, immune suppressor cells, and genetic mutations

Analysis of the TCGA-LIHC dataset revealed remarkable differences in the expression level of the 5 genes between HCC and normal liver tissues, with all 5 genes showing markedly elevated expression in HCC (Fig. 6A). Furthermore, K-M curves indicated the prognostic value of these genes, as high expression levels of all 5 genes were related to poor OS in HCC patients (Fig. 6B). After further analysis with the TIMER2.0 database, the expression levels of 5 genes exhibited significant positive correlations with the infiltration levels of CD4 + T cells, Tregs, MDSCs, and CAFs (Fig. 6C). This suggests that high expression of these genes may indicate an immunosuppressive or immune evasion state in HCC23,24,25. Finally, we analyzed the mutation status of the 5 genes in high/low-risk score groups from the TCGA-LIHC dataset. No statistically significant differences in mutation rates were observed between the two groups. However, further analysis of differential mutations revealed that the mutations (mainly missense mutations) of TP53 exhibited the most significant differences, notably increasing in the high-risk score group. The waterfall plot illustrates the top 20 mutated genes distinguished by the risk score of the five-gene model (Fig. 6D). Moreover, differential mutation analysis of the 5 core genes based on high and low OS grouping criteria revealed limited mutated individuals for each gene, with no mutations found for CCNB2 in the TCGA-LIHC dataset. Consequently, the analysis did not yield statistically significant differences, leading us to conclude that mutations in these 5 genes have no significant impact on the prognosis of HCC (Figure S2E).

Fig. 6
figure 6

The expression of 5 genes in hepatocellular carcinoma and their relationships with prognosis, immune suppressor cells, and genetic mutations. (A) 5-gene expression in HCC vs. normal liver tissue (TCGA). (B) K-M survival curves of 5 genes (TCGA). (C) Correlation of 5-gene expression with immune cell infiltration (CD4 + T cells, Tregs, MDSCs, CAFs) using TIMER2.0. (D) Waterfall plot of 5-gene risk score and mutations (TCGA, excluding mutation-free samples).

Transcriptional regulation analysis of 5 genes

To investigate the factors influencing the transcriptional differences of the five identified genes, an analysis of their transcriptional regulation was performed. Utilizing the hTFtarget database, we identified transcription factors associated with each gene in liver tissue and determined the intersection of these transcription factors, constructing a Venn diagram. The Venn diagram revealed three common transcription factors shared among the 5 genes: CREB1, SIN3A, and YY1 (Fig. 7A), suggesting their potential key roles in regulating 5 genes’ expression. Additionally, we assessed the methylation levels of the 5 genes using the TCGA-LIHC dataset. Our analysis indicated that the methylation levels of CCNB2, CKAP5, and DCTN2 were significantly lower in HCC compared to normal liver tissues. Although ARPC1A and TTK exhibited no statistically significant differences in methylation levels between HCC and normal liver tissues, both genes showed reduced methylation levels in HCC relative to normal tissues (Fig. 7B). These findings reflect the transcriptional activation status of the 5 genes in hepatocellular carcinoma.

Fig. 7
figure 7

Transcriptional regulation analysis of 5 genes. (A) Network Venn diagram of predicted transcription factors for 5 genes in liver (hTFtarget database). (B) Methylation levels of 5 genes (TCGA database).

Single-cell sequencing analysis of 5 genes and risk score

We analyzed the scRNA-seq data from GSE149614 and utilized UMAP plots to illustrate the distribution of quality control-passed cells according to sample (Fig. 8A) and cell type (Fig. 8B), along with the counts of various cell types. Among these, (benign/malignant) hepatocytes constituted the highest proportion (Fig. 8C). To further investigate the impact of the 5 risk score-related genes on immune infiltration, we performed a Pearson correlation analysis between their expression levels and the infiltration levels of immune cells. The results indicated that the tumor infiltration levels of major cytotoxic immune cells, including T cells and Natural Killer (NK) cells, correlated negatively with the expression levels of the 5 genes, particularly for NK cells (p < 0.05) (Fig. 8D).

Fig. 8
figure 8

Single-cell sequencing analysis of 5 genes and risk score. (A-B) UMAP plots of samples and cell types. (C) scRNA-seq cell type counts. (D) Violin plot of risk score distribution by cell types (excluding < 3 cells with risk score > 0). (E) Heatmap of correlation between 5 genes and immune molecules (p < 0.05). (F) Violin plot of hepatocyte risk scores across tissue types (excluding < 3 cells with risk score > 0). (G) UMAP plot of 5-gene expression and risk score distribution in hepatocytes.

Moreover, we analyzed the expression correlation between risk scores and immune-related molecules (Fig. 8E). The findings revealed that the expression of PVR was significantly positively correlated with the expression levels of the 5 genes (p < 0.05), suggesting that PVR may play an active role in tumor immune evasion. Furthermore, several genes predominantly expressed by immune cells, such as PDCD1, TIGIT, and LAG3, exhibited negative correlations with the expression levels of the 5 genes. We considered that this observation might be influenced by the negative correlation between risk score-related genes and the infiltration levels of immune cells like T cells. Nonetheless, CTLA4 still displayed a positive correlation trend, reflecting its potential role as a key target of the 5 genes affecting immune cells.

To better define the risk score situation in (benign/malignant) hepatocytes and accurately understand the relationship between risk scores and HCC occurrence and development, we categorized the samples based on their tissue types. The analysis demonstrated that risk scores were elevated in metastatic lymph nodes (L), hepatocellular carcinoma tissues (T), and portal vein tumor thrombus (P) samples, with the highest scores observed in metastatic lymph nodes, while the lowest scores were recorded in normal liver tissues (N) (Fig. 8F and S3A). Additionally, the expression levels of the 5 genes and their corresponding risk scores in hepatocytes were depicted using UMAP plots, as shown in Fig. 8G.

Spatial transcriptomics analysis of 5 genes’ expression

To further validate the predictive role of the 5 genes, we conducted spatial transcriptomics analysis using the HCCDB database (http://lifeome.net:809/#/home/). The tissue specimens from the HCC2 dataset were found to be consistent with the tissue types identified in the GSE149614 scRNA-seq, including normal liver tissue (N), hepatocellular carcinoma tissue (T), lymph node metastases (L), and portal vein tumor thrombus (P). Consequently, we selected the HCC2 samples for this exploratory analysis.

Initially, hematoxylin and eosin (H&E) staining of the tissue sections revealed the overall morphology of the samples. Based on this morphological assessment, we delineated the distribution of different tissue and cell types (Fig. 9A). Subsequently, UMAP plots were employed to further illustrate the distribution of (Fig. 9B) tissue types and (Fig. 9C) cell types, along with their interrelationships through clustering and dimensionality reduction.

Fig. 9
figure 9

Spatial transcriptomics analysis of 5 genes’ expression. (A) Hematoxylin and eosin (HE) staining of tissue sections from spatial transcriptomics, showing tissue types’ and cell types’ distribution. UMAP plot of (B) tissue types and (C) cell types. Distribution of expression levels of the 5 genes (D) in spatial transcriptomics tissue sections and (E) their corresponding UMAP plot.

Finally, we mapped the distribution of the 5 genes across the overall morphology of the tissue sections. All 5 genes showed significantly higher expression levels in malignant tissues, specifically hepatocellular carcinoma (T), lymph node metastases (L), and portal vein tumor thrombus (P) (Fig. 9D). Additionally, incorporating the previous UMAP analysis of cell types, through the UMAP plots of the 5 genes, we also found their expression levels were elevated in both tumor and stromal cell types (Fig. 9E). These findings corroborate the results obtained from the previous scRNA-seq analysis.

Drug screening and molecular docking of 5 genes

We utilized the DSigDB database (https://maayanlab.cloud/Enrichr/) to predict protein interactions corresponding to the 5 core genes, yielding 168 results. Among these, the top ten candidate compounds with the lowest p-values are presented (Fig. 10A). Among the 87 statistically significant results (p < 0.05), which included 83 compounds, irinotecan exhibited the lowest p-value, and both irinotecan and sorafenib were identified with the highest frequency (three occurrences each).

Fig. 10
figure 10

Drug screening and molecular docking of 5 genes. (A) Top 10 candidate compounds with lowest p-values predicted by DSigDB database. (B) Top 10 positive and negative correlations from Pearson correlation analysis between risk scores and drug IC50 values in LICOB dataset (p < 0.05). (C) Venn diagrams showing intersection of DSigDB-predicted compounds with significantly correlated drugs from both datasets. (D) STITCH chemical interaction network of LICOB-identified compounds, with sorafenib in central position (redder nodes indicate stronger associations). (E) 2D/3D binding conformations of molecular docking: sorafenib-TTK and irinotecan-CCNB2/DCTN2/TTK.

Subsequently, we performed Pearson correlation analysis on risk scores and drug IC50 values for all HCC cell lines in the DepMap PRISM dataset and all HCC organoids in the LICOB dataset, respectively, to obtain all drugs with significant correlations (p < 0.05) (Fig. 10B and S3B show the top 20 positive and negative correlations for each dataset). Venn diagrams illustrated the intersection of compounds predicted by DSigDB with drugs significantly correlated with risk scores in both the DepMap PRISM and LICOB datasets (Fig. 10C). However, the two compounds identified in the intersection of DSigDB and the DepMap PRISM dataset, chlorpromazine and ciclopirox, had combined scores (158.8 and 309.2) and p-values (0.028 and 0.017) in the DSigDB ranking that were lower than those for sorafenib obtained from the intersection with the LICOB dataset (highest combined score of 387.4, minimum p-value of 0.014). Notably, chlorpromazine and ciclopirox are not approved for tumor treatment, whereas sorafenib is a first-line therapy for HCC. Furthermore, the LICOB dataset results stem from organoid analyses derived from tumor specimens of liver cancer patients, providing greater reliability than the cell line data from the DepMap PRISM dataset. Additionally, the frequent occurrence of sorafenib in the DSigDB analysis suggests stronger evidence supporting its relevance.

Finally, chemical interaction network analysis conducted via the STITCH database (http://stitch.embl.de/) further indicated sorafenib’s central role among HCC treatment compounds in the LICOB dataset (Fig. 10D). In contrast, chlorpromazine and ciclopirox did not occupy central positions in the chemical interaction networks of compounds in the DepMap PRISM dataset (Figures S3C and S3D).

In summary, we ultimately proceeded with further analysis of the two drugs, irinotecan and sorafenib. To elucidate the interaction relationships between the drugs and their targets, we performed molecular docking of compounds with key drug targets (sorafenib with TTK; irinotecan with CCNB2, DCTN2, and TTK) according to the drug targets indicated by DSigDB, selecting the optimal scheme as the docking approach (maximum binding affinity, maximum interaction sites, and minimum binding energy), and used 2D and 3D diagrams to display the best docking results (Fig. 10E).

Validation of drug-target docking results by in vitro and in vivo experiments

Irinotecan is a commonly used chemotherapeutic agent for malignant tumors, including colorectal cancer, and can be included in combination therapy regimens for HCC. Previous studies have reported that in vitro, irinotecan and its metabolites enhance the inhibitory effect of sorafenib on HCC cell lines26,27. Furthermore, our earlier investigations identified TTK as a core gene influencing the prognosis of HCC among cytoskeleton-related genes, with both irinotecan and sorafenib targeting TTK. This suggests that the combination of these two drugs may enhance therapeutic efficacy.

To further validate these predictions, we first employed RT-qPCR to assess the relative expression levels of the 5 genes in HCC cell lines and calculated the risk score. The PLC/PRF/5 cell line, which exhibited the highest risk score, was selected for subsequent experiments (Fig. 11A). The results of the CCK-8 assay following drug treatment indicated that the proliferation rate of the PLC/PRF/5 cell line was significantly reduced following combined treatment with irinotecan and sorafenib compared to either drug alone or the control group (Fig. 11B). Additionally, the results of the corresponding colony formation assays corroborated these findings (Fig. 11C).

Fig. 11
figure 11

Validation of drug-target docking results by in vitro and in vivo experiments. (A) RT-qPCR relative expression of 5 genes in hepatocellular carcinoma cell lines and risk score calculation. (B) CCK8 proliferation assay of PLC/PRF/5 cells treated with Irinotecan, Sorafenib, and combination. (C) Colony formation assay images and counts of drug-treated PLC/PRF/5 cells. (D-E) Tumor images and (F) body weight/tumor volume curves from nude mouse xenograft experiment with different treatment groups (Control, Irinotecan 40 mg/kg/day, Sorafenib 30 mg/kg/day, combination).

Finally, in the subcutaneous tumor formation experiments in nude mice, the tumor volume in the combination group (849.03 ± 75.62 mm3 at day 28) was significantly reduced compared to the irinotecan group (1114.39 ± 135.31 mm3), sorafenib group (939.16 ± 5.60 mm3), and control group (1527.63 ± 92.57 mm3), as well as the tumor growth rate (Figs. 11D-E). Moreover, combination therapy did not significantly affect the body weight of the nude mice, indicating that this regimen is safe and feasible (Fig. 11F). These results suggest that the clinical combination of these two drugs may improve the prognosis of patients with HCC.

Discussion

The prognosis of HCC remains poor, particularly for patients with intermediate to advanced stages. The cytoskeleton not only provides structural support for cells but also plays crucial roles in cell division, migration, and intracellular signal transduction28,29. Dysregulation of cytoskeleton-related genes has been confirmed to participate in cancer progression10, but comprehensive studies integrating these genes into prognostic models remain limited. Advances in high-throughput sequencing technology have provided new opportunities to explore the functions of cytoskeleton-related genes in HCC30,31.

In this study, we identified 110 cytoskeleton-related DEGs in HCC. Through clustering analysis and univariate COX regression, we reduced the DEGs to 13 genes. Subsequently, we developed a robust five-gene prognostic model (ARPC1A, CCNB2, CKAP5, DCTN2, and TTK) using LASSO stepwise COX regression. Previous studies have confirmed that these 5 genes play important roles in various tumors. ARPC1A, as a key subunit of the Arp2/3 complex, regulates actin filament branched polymerization and drives the formation of invasive membrane protrusions in cancer cells. It was recently discovered to regulate cytoskeletal dynamics through interaction with EML4, enhancing lamellipodia formation, cell motility, local invasion, and metastasis32. In prostate cancer, ARPC1A is associated with poor prognosis and promotes tumor cell migration, invasion, and cytoskeletal alterations33. CCNB2, as a cell cycle regulator, controls cell division through the G2/M checkpoint. Its overexpression leads to cell cycle dysregulation and malignant tumor proliferation, and has been identified as an independent predictive biomarker for poor prognosis in HCC, breast cancer, and other malignancies34,35. CKAP5 ensures proper chromosome segregation by stabilizing microtubule-kinetochore connections, and its abnormal expression affects spindle assembly and genomic stability. CKAP5 has been found to stabilize CENP-E at kinetochores in various cancer cell lines by regulating microtubule-chromosome connections, ensuring smooth cell cycle progression in cancer cells36. DCTN2, as a component of the dynein complex, mediates intracellular material transport and signal transduction. DCTN2 exerts oncogenic functions through binding with ROCK1 (in HCC) or as a FER substrate (in triple-negative breast cancer/TNBC)37,38. TTK, as a core regulator of the spindle assembly checkpoint, not only controls cell division processes but also regulates cytoskeletal reorganization and EMT processes through phosphorylation of multiple substrate proteins. As a multifunctional protein kinase, TTK is regarded as an oncogene and therapeutic target in various malignancies including HCC, ovarian cancer, and TNBC39,40,41. Notably, our study is among the first to integrate multiple cytoskeleton-related genes into a prognostic model for HCC, enhancing our understanding of cytoskeletal dynamics in cancer progression.

In this study, through analysis of prognostic ROC curves, K-M curves, and C-index, we found that the LASSO-stepwise COX model demonstrated superior stability and generalization capability compared to the RF model. Although the RF model showed significantly better predictive performance in the training set, it performed poorly in the test set, suggesting potential overfitting and emphasizing the necessity of robust validation methods. Combined with previous studies, we found that the poor external generalization ability of RF models is mainly attributed to their tendency to capture noise features in training sets when dealing with high-dimensional genomic data, leading to overfitting phenomena (especially when optimal features are not effectively selected). Additionally, differences in sample feature distributions and clinical heterogeneity between different datasets can also lead to RF model overfitting42. This highlights the importance of conducting multi-data and multi-dimensional comparisons when selecting prognostic models. Compared with 10 previous HCC machine learning models from different aspects (including HPRGs, ARGs, etc.), our model demonstrated non-inferior or even superior performance in both TCGA-LIHC and LIRI-JP datasets, highlighting the value of cytoskeleton-related genes in HCC prognosis prediction.

Furthermore, immune infiltration analysis revealed increased levels of immunosuppressive cells (such as regulatory T cells) in high-risk score patients, potentially contributing to poorer prognosis. Multiple cytoskeleton-related molecules have been identified as biomarkers for immune infiltration/escape through bioinformatics analysis and experimental validation, such as TAGLN243. Additionally, treatment regimens combining cytoskeleton-targeting drugs, represented by microtubule-targeting agents, with immune checkpoint inhibitors (ICIs) have been applied and promoted in various malignancies including TNBC and non-small cell lung cancer (NSCLC), with particularly favorable results observed in“hot”tumors44. This also provides new insights for HCC immunotherapy. For example, previous studies have shown that targeting TTK can inhibit various CTNNB1 (β-catenin) mutated cancers45. TTK inhibitors such as CFI-402257 and BAY1217389 have been proven to be potential therapeutic drugs for HCC or ovarian cancer39,41. Moreover, TTK inhibition promotes anti-PD1 immunotherapy in TNBC40, highlighting the potential impact of cytoskeletal regulation in cancer treatment.

In vitro and in vivo experiments confirmed that the combination treatment of irinotecan and sorafenib was more effective than monotherapy. Irinotecan, as a topoisomerase I inhibitor, and sorafenib, as a multi-kinase inhibitor, have both been FDA-approved for cancer treatment with good clinical safety profiles and substantial clinical evidence supporting their use. Current studies on this combination regimen (including a study based on pediatric hepatocellular carcinoma) have shown acceptable toxicity profiles26,27. We can preliminarily determine that this is a clinically feasible combination regimen, and our in vivo experiments also suggest that this combination is safe. However, in our study and previous studies, the application time of the combination regimen was relatively short, and potential off-target effects and long-term safety still require further evaluation. Additionally, clinical applications indicate that this regimen has a strict therapeutic window, requiring further research to determine optimal dosing and efficacy.

Despite these advances, several limitations warrant consideration. The retrospective nature of our study requires prospective validation to confirm clinical applicability. The biological mechanisms underlying the prognostic significance of the identified genes require further investigation. Furthermore, genetic and environmental heterogeneity of HCC among different populations may limit the generalizability of our findings, highlighting the necessity of validation in diverse cohorts.

In conclusion, we have proposed a novel five-gene prognostic model for HCC and identified potential therapeutic targets, laying the foundation for future research to improve patient outcomes through precision medicine and targeted therapy.