Introduction

Large-scale water conservancy projects involve various building types; the South-to-North Water Diversion Project, for example, includes concrete dams, high-fill channels, aqueducts, culverts, and inverted siphons1. Unlike commercial concrete buildings, hydraulic structures are continuously exposed to atmospheric and aquatic environments. Long-term effects such as hydrochemical corrosion2, water-flow erosion3, and freeze-thaw cycles4,5 may therefore cause unpredictable damage to the appearance of these structures. Apparent damage to hydraulic structures may further develop into safety hazards such as steel-reinforcement corrosion, infiltration, and piping6,7. Timely detection and maintenance of apparent damage to hydraulic structures can therefore avert serious quality problems and preserve the integrity and functionality of hydraulic engineering8,9,10,11.

Methods for inspecting the apparent integrity of hydraulic structures include manual visual inspection and inspection technologies based on intelligent algorithms. Manual visual inspection has several obvious limitations, such as low efficiency, poor safety, and high labor costs12. For instance, the mean elevation of the Xiabandi Water Conservancy Hub in Xinjiang exceeds 3000 m, and the climatic conditions and topographical environment constrain the operational scope of technical personnel13. Unmanned aerial vehicles (UAVs) and artificial intelligence technologies, by contrast, effectively mitigate the challenges of routine reservoir inspection. Intelligent inspection technologies encompass the Internet of Things (IoT) and Artificial Intelligence (AI), wherein AI models execute specific tasks using IoT devices as platforms14. Employing intelligent inspection technology in place of technical personnel for the visual integrity assessment of hydraulic structures not only enhances operational efficiency but also mitigates errors and biases stemming from subjective factors15,16. Consequently, intelligent inspection technologies based on drones and artificial intelligence models have become a research hotspot in the field of smart water conservancy. Currently, intelligent algorithms for detecting apparent damage in buildings can be categorised into two primary approaches: image segmentation17 and image classification18. Image segmentation algorithms delineate damaged areas on concrete surfaces from the background pixel by pixel, whereas image classification algorithms label damaged areas block by block.

Currently, research on surface damage identification algorithms for hydraulic structures has predominantly focused on crack detection19. For instance, Russell and Selvaraj proposed an end-to-end learning methodology that utilises deep convolutional neural networks (CNNs) to learn crack features in parallel channels at multiple scales, thereby adapting to cracks of varying severity20. However, parallel structures may increase model complexity and rely heavily on extensive training datasets. Ali et al. systematically compared the performance of four deep learning frameworks in crack-detection tasks, with particular emphasis on dataset size, heterogeneity, network complexity, and training time; they then proposed a customised CNN model that performed well on relatively small datasets, albeit solely for crack recognition, and required substantial computational resources21. Zhu et al. addressed the issue of imbalanced positive and negative samples using an improved VanillaNet network, in which an optimised loss function and an adaptive frequency-filtering token mixer enhanced feature extraction and the classification accuracy of underwater crack images22. However, these improvements may increase the complexity and resource consumption of the model, which moreover cannot recognise multiple types of damage. Chen et al. proposed using convolutional neural networks to determine the presence of cracks and then employing the U-Net model for pixel-level crack segmentation; their experiments demonstrated high accuracy, recall, and F1 scores in crack detection23. However, this study required a substantial amount of training data to train the model, so the model exhibits strong data dependence and limited scalability. Deng et al. proposed a concrete crack detection method based on Faster R-CNN that automatically localises cracks in raw images and mitigates interference from handwritten annotations24. This approach, however, is contingent on the quality and diversity of the training data (insufficient data may degrade performance) and requires substantial computational resources. These methods have made varying contributions to the field of hydraulic structure health detection, yet all were predicated on a substantial volume of training data for developing deep learning models, resulting in significant data dependency and limited functional scalability.

The limited diversity of apparent-damage datasets for hydraulic structures constrains the capacity of deep-learning models to acquire more complex features and differentiate various types of damage. Consequently, these models cannot provide decision support for comprehensive maintenance measures. Transfer learning has been extensively applied in diverse fields, including text sentiment analysis, speech recognition, pattern recognition, and anomaly detection25,26. In recent years, several researchers have proposed transfer learning techniques to address the challenge of insufficient labelled data in hydraulic engineering27. Hüthwohl et al. proposed a multilevel classifier for bridge defect classification that recognises multiple defect types with an average accuracy of 85%28. However, this approach requires thousands of training samples to train multiple subclassifiers, and the performance of the top-level classifier may be influenced by that of the lower-level classifiers. Li and Bao proposed a real-time, pixel-level automatic segmentation and quantification framework for underwater dam cracks based on lightweight semantic segmentation networks and two-stage hybrid transfer learning; the method is robust and efficient for identifying cracks in underwater structures29. However, its data dependence remains strong, and its ability to detect structural damage in other scenarios is unknown. They further proposed an integrated framework for the health detection of hydraulic tunnel structures that mitigates model training costs and data dependencies through cross-___domain transfer learning strategies30,31. Because this methodology trains the models on underwater structural damage data, its efficacy in identifying diverse types of structural damage in alternative scenarios remains undetermined. Wang et al. collected a dataset of 2000 crack images and fine-tuned a pre-trained ResNet-18 residual network on it to classify transverse, longitudinal, and grid cracks on concrete surfaces32; however, this methodology is limited in identifying holes and spalling. These findings suggest that transfer learning can effectively mitigate reliance on extensive datasets when constructing apparent-damage recognition models. Nevertheless, a certain quantity of training data is still necessary to fine-tune the models, architectural redesign may be needed for novel damage types, their efficacy in other scenarios remains uncertain, and their interpretability is limited.

Certain researchers have proposed building apparent-damage recognition models from the perspective of feature engineering, integrating deep learning models with classical statistical learning models. For instance, Sharma et al. proposed using Convolutional Neural Networks (CNNs) to automatically extract features from RGB images and then employing Support Vector Machines (SVMs) to classify damaged images, achieving an accuracy of 90.76% on a dataset of 550 images33. Khartik et al. evaluated the performance of 12 pre-trained convolutional neural network models on the SDNET, CCIC, and BCD datasets, verifying the efficacy of the features extracted by these models in enhancing classification accuracy34; they further ascertained that integrating deep-learning feature extraction with an SVM classifier can significantly improve crack-detection precision. These studies learn effective damage features from a small number of samples and then classify the feature data with statistical learning methods, and thus offer good interpretability and operability. However, they did not further process the extracted features, so the performance of the statistical learning models remained sensitive to the quality of the feature data. In this study, a set of images of high-fill concrete slope damage was obtained, covering crack, fracture, hole, and normal areas. We fine-tune the top-level parameters of a lightweight pre-trained model to learn high-level features of high-fill slope damage and automatically extract feature data. We then develop a robust discriminative feature-selection model capable of selecting a subset of discriminative features and eliminating redundant ones, yielding a selected feature set with good global characteristics. On the selected discriminative dimensions, we employ an ensemble learning classification algorithm to categorise the damage images, effectively enhancing the model's performance in identifying multiple damage categories. The method exhibits good interpretability and operability; because feature selection filters the data dimensions directly rather than requiring projection operations, it significantly reduces the time consumption of the ensemble learning algorithm and improves damage recognition performance.

The structure of the article is as follows: In the relevant work section, we present common lightweight pre-training models, subsequently develop a robust discriminative feature selection model, and elucidate its design principles. In the research methodology section, the technical approach proposed in this study is delineated, emphasising the correlation between modules. In the experimental analysis section, we conduct a systematic comparison of the performance differences among various models and verify the efficacy of the research methodology through ablation experiments. Finally, we summarise the content of this study and discuss the potential limitations of the research methodology.

Related works

Transfer learning techniques

Transfer learning (TL) has been successfully applied to fields such as image recognition, natural language processing, medical image analysis, and reinforcement learning26,35. Researchers typically employ transfer learning methods to address challenges, such as insufficient labeled data, limited computational resources, and mismatched sample distributions36. The transfer of knowledge or features from source tasks to new tasks facilitates problem-solving without necessitating model retraining, thereby reducing training costs and data dependencies37,38. Pre-trained models were developed and optimized on large-scale datasets featuring complex network structures and numerous parameters. Despite the robust feature-learning capabilities of pre-trained models, it remains necessary to fine-tune these models using a dataset specific to the target task, ensuring that the learned features are applicable to the intended application39,40. Figure 1 illustrates the transfer learning strategy. Currently, the continuous emergence of new convolutional neural network models presents a challenge in selecting the most appropriate model. However, computational resources and model performance remain critical factors for user consideration.

This study evaluated four prominent lightweight backbone networks: ResNet41, MobileNet42, EfficientNet43, and RegNetY44. Owing to variations in parameter size and network architecture, each pre-trained model is categorized into distinct versions. This study determined the most appropriate version based on benchmark test results. All pre-trained models were trained on the comprehensive and diverse ImageNet-1k dataset, which encompasses 1000 distinct categories, including common objects and scenes, such as plants, streets, vehicles, and buildings45.
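As a concrete reference, the snippet below sketches how such ImageNet-1k pre-trained backbones can be loaded and their classification heads resized to the four damage categories. It uses the torchvision model zoo (MobileNet-v4, discussed later, is distributed through timm rather than torchvision), and the versions shown here are illustrative rather than the final selections of this study.

```python
# Illustrative sketch (torchvision >= 0.13 API): load an ImageNet-1k pre-trained
# backbone and resize its classification head to the 4 damage categories.
import torch
import torchvision.models as models

NUM_CLASSES = 4  # crack, fracture, hole, normal

def build_candidate(name: str) -> torch.nn.Module:
    if name == "resnet18":
        net = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        net.fc = torch.nn.Linear(net.fc.in_features, NUM_CLASSES)
    elif name == "mobilenet_v3_large":
        net = models.mobilenet_v3_large(weights=models.MobileNet_V3_Large_Weights.IMAGENET1K_V1)
        net.classifier[-1] = torch.nn.Linear(net.classifier[-1].in_features, NUM_CLASSES)
    elif name == "efficientnet_b0":
        net = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
        net.classifier[-1] = torch.nn.Linear(net.classifier[-1].in_features, NUM_CLASSES)
    elif name == "regnet_y_800mf":
        net = models.regnet_y_800mf(weights=models.RegNet_Y_800MF_Weights.IMAGENET1K_V1)
        net.fc = torch.nn.Linear(net.fc.in_features, NUM_CLASSES)
    else:
        raise ValueError(f"unknown backbone: {name}")
    return net
```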

Fig. 1

Schematic of transfer learning based on model sharing.

Fig. 2

Schematic diagram of the ResNet-18 network structure.

Table 1 Computational complexity (flops) and parameter count of ResNet.
Fig. 3

Schematic diagram of the MobileNet-v3-large network structure.

Table 2 Computational complexity (flops) and parameter count of MobileNet.
(i) ResNet: He et al. proposed ResNet in 2015; by introducing a residual learning framework, it effectively addressed the vanishing- and exploding-gradient problems that hampered the training of earlier deep networks such as AlexNet, VGGNet, and GoogLeNet. The ResNet model comprises multiple stacked residual blocks, each containing two convolutional layers, a normalization layer, and a ReLU activation function. The residual structure significantly enhances the feature-extraction capacity of deep learning models. ResNet is available in multiple versions, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the number denotes the number of layers in the model. More layers typically imply higher model complexity and, in turn, greater demands on computational resources and data volume. The complexity of the ResNet models is given in Table 1, and the network structure is shown in Fig. 2.

(ii) MobileNet: In 2016, the Google team proposed this lightweight deep learning model, which uses depthwise separable convolutions to construct compact deep neural networks. MobileNet is highly flexible and can be readily integrated into mobile applications. As Google has continued to refine the architecture, MobileNet-v1/v2/v3/v4 versions are now available, with both large and small variants of MobileNet-v3. The complexity of MobileNet is shown in Table 2, and its network structure is illustrated in Fig. 3. MobileNet-v1 substitutes depthwise separable convolution for traditional convolution, increasing network depth while keeping the parameter count low. MobileNet-v2 primarily introduces an inverted residual structure. MobileNet-v3 incorporates a bottleneck module with an attention mechanism and hard activation functions (h-swish and h-sigmoid) built from ReLU6. MobileNet-v4 demonstrates significant improvements over its predecessors, including a new attention block that increases inference speed and a novel distillation technique that enhances generalization. MobileNet-v4 offers the compact MNv4-Conv-S model and the high-performance MNv4-Hybrid-L variant, with the latter exhibiting superior performance on the ImageNet-1k dataset.

(iii) EfficientNet: EfficientNet is a deep convolutional neural network proposed by the Google team in 2019 that systematically improves model accuracy and efficiency through compound scaling, which expands the depth, width, and input resolution of the model jointly rather than adjusting each independently. EfficientNet significantly improves computational efficiency while maintaining high accuracy, rendering it well suited to resource-constrained devices. As shown in Table 3, larger version numbers correspond to more parameters and higher computational complexity; EfficientNet is available in versions B0-B8 as well as L2. As Fig. 4 shows, the EfficientNet model comprises a large number of depthwise convolutions and inverted residual (MBConv) blocks, each applying a pointwise expansion convolution, a depthwise convolution, and a pointwise projection convolution in sequence. EfficientNet also employs squeeze-and-excitation modules to enhance feature representation, thereby optimizing resource utilization and improving the performance and efficiency of the model.

(iv) RegNetY: RegNet is a family of convolutional neural network architectures proposed by Facebook AI Research (FAIR), whose primary objective is to discover a range of high-performance network models by systematically designing the network design space rather than individual network instances. A RegNet typically comprises a simple stem, a body composed of multiple stages, and a classification head; each stage consists of multiple structurally identical blocks, which may be standard residual blocks or other variants. RegNetY, a variant of the RegNet series, introduces a Squeeze-and-Excitation (SE) module, which recalibrates feature-channel weights using global average pooling and two fully connected layers, thereby enhancing the network's feature-extraction capability. Table 4 details the RegNetY models, and Fig. 5 illustrates the network structure of RegNetY-800MF.

Fig. 4

Schematic diagram of the EfficientNet-B8 network structure.

Table 3 Computational complexity (flops) and parameter count of EfficientNet.
Fig. 5

Schematic diagram of the RegNetY-800MF network structure.

Table 4 Computational complexity (flops) and parameter count of RegNetY.

It is evident that the parameter counts and computational complexity of ResNet-34 and ResNet-50 are at least twice those of ResNet-18, whereas those of ResNet-101 and ResNet-152 are at least four times greater. The MobileNet-v3 network employs an inverted residual structure, and model performance improves as network layers are added; however, its parameter count is significantly smaller than that of the traditional ResNet series because MobileNet-v3 makes extensive use of depthwise and pointwise convolutions, which substantially reduce computational complexity. From EfficientNet-B0 to EfficientNet-B8, the computational complexity of the model increases nearly tenfold and the parameter count nearly thirteenfold. A larger parameter count implies a stronger feature-extraction capability, but also a larger amount of data required to fine-tune the model. The RegNetY model is offered in multiple specifications, allowing users to choose the version best suited to the target task.
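For readers who wish to reproduce such comparisons, the short sketch below counts parameters for four representative versions with torchvision (FLOPs would additionally require a profiler such as fvcore or ptflops, omitted here); the printed values can be checked against Tables 1, 2, 3, and 4.

```python
# Sketch: compare parameter counts of four representative backbones.
import torchvision.models as models

candidates = {
    "ResNet-18": models.resnet18,
    "MobileNet-v3-large": models.mobilenet_v3_large,
    "EfficientNet-B0": models.efficientnet_b0,
    "RegNetY-800MF": models.regnet_y_800mf,
}
for name, ctor in candidates.items():
    net = ctor(weights=None)  # architecture only; no pre-trained weights needed
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.2f}M parameters")
```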

The results in Table 7 indicate that using pre-trained models to classify the apparent damage of hydraulic structures is effective. However, the performance obtained by using pre-trained models alone is far from sufficient; their effectiveness and stability must be further enhanced through transfer-learning methods such as fine-tuning with limited data. In this study, we fine-tune the pre-trained model with as small a training set as possible to achieve better classification performance.

Discriminant feature selection

In fact, when training data are scarce, pre-trained models alone are insufficient for identifying surface damage in hydraulic concrete. Introducing an ensemble learning classifier can significantly improve damage recognition performance but cannot avoid the associated time cost. This study therefore proposes modeling the extracted feature data to improve the class separability of samples and reduce sample dimensionality. Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction technique that seeks the projection directions along which the differences between categories are greatest46; LDA thus considers not only the variance of the data but also the category information. This study combines the ideas of linear discriminant analysis and sparse learning to design a discriminative feature selection model.

In this context, lowercase italic letters denote scalars, lowercase letters denote vectors, and uppercase letters denote matrices. Table 5 lists the key symbols and their meanings. The traditional linear discriminant criterion has two forms:

Table 5 Some key symbols and their meanings.
$$\begin{aligned} & \mathop {\arg \min }\limits _{\textrm{A} \in R^{d \times l}} - \mathrm{\frac{Tr\left( A^TS_bA \right) }{Tr\left( A^TS_wA \right) }} + \lambda \sum _{j=1}^{d} \Vert \textrm{a}_j \Vert _{\infty }, \end{aligned}$$
(1)
$$\begin{aligned} & \mathop {\arg \min }\limits _{\textrm{A} \in R^{d \times l}} - \mathrm{Tr\left( \frac{A^TS_bA}{A^TS_wA} \right) } + \lambda \sum _{j=1}^{d} \Vert \textrm{a}_j \Vert _{\infty }, \end{aligned}$$
(2)

Problem (1) takes the form of a trace-ratio problem, and problem (2) that of a ratio-trace problem. Either form of the criterion may face singularity problems, particularly when the sample dimensionality greatly exceeds the number of samples47. For small-sample datasets, the intra-class scatter matrix \(\textrm{S}_w\) in the denominator may be singular (non-invertible), which prevents the eigenvalue decomposition from proceeding. In addition, the optimization of the trace-ratio criterion is relatively complex48,49, which has motivated a newer form of discriminant criterion: the trace-difference criterion. Its form resembles the maximum margin criterion (MMC)50 but is more flexible. The optimization objective of the discriminative feature selection model designed in this study is as follows:

$$\begin{aligned} \mathop {\arg \min }\limits _{\mathrm{A^TA=I}} Tr\left[ \textrm{A}^T \left( \textrm{S}_w - \alpha \textrm{S}_b \right) \textrm{A} \right] + \gamma \Vert \textrm{A} \Vert _{2,1}, \end{aligned}$$
(3)

where \(\mathrm A\) represents the row sparse projection matrix, \(\textrm{S}_w\) represents the intra-class scatter matrix, and \(\textrm{S}_b\) represents the inter-class scatter matrix. The \(l_{2,1}\) norm of a matrix \(\mathrm A\) is defined as \(\Vert \textrm{A} \Vert _{2,1} = \sum _{i=1}^{d}\sqrt{\sum _{j=1}^{l}\vert a_{i,j} \vert ^2}\). The first term is the trace difference criterion, and the second term is the sparse regularization term. The hyperparameter \(\alpha\) is used to control the relative importance of intra-class and inter-class scatter, while the hyperparameter \(\gamma\) is used to control the strength of sparse constraints. When we use a larger value of \(\gamma\), the value of \(\Vert \textrm{A} \Vert _{2,1}\) decreases, which means that matrix \(\mathrm A\) will have more rows approaching 0. Because matrix \(\mathrm A\) is orthogonal, it has at least l non-zero rows whose indices correspond to the selected features. The calculation formulas for \(\textrm{S}_w\) and \(\textrm{S}_b\) are as follows:

$$\begin{aligned} & \begin{aligned} \textrm{S}_b&= \frac{1}{2} \sum _{i=1}^{c}\sum _{j=1}^{c} \frac{n_in_j}{n^2} \left( \mu _i - \mu _j \right) \left( \mu _i - \mu _j \right) ^\textrm{T} = \sum _{i,j=1}^{c} \frac{n_in_j}{2n^2} \Vert \left( \mu _i-\mu _j \right) \Vert _2^2 \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} & \textrm{S}_w = \frac{1}{n} \sum _{i=1}^{c} \sum _{j=1}^{n_i} n_i\left( \textrm{x}_j^i - \mu _i \right) \left( \textrm{x}_j^i - \mu _i \right) ^T = \frac{1}{n} \sum _{i=1}^{c} \sum _{\textrm{x}_j \in X^{(i)}} {n_i} \Vert \left( \textrm{x}_j - \mu _i \right) \Vert _2^2, \end{aligned}$$
(5)
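The following numpy sketch shows one way to assemble \(\textrm{S}_w\) and \(\textrm{S}_b\) from Eqs. (4) and (5) given a feature matrix and labels; the function and variable names are illustrative.

```python
# Sketch of Eqs. (4)-(5): intra-class scatter S_w and inter-class scatter S_b
# from a feature matrix X (n samples x d dims) and integer labels y.
import numpy as np

def scatter_matrices(X: np.ndarray, y: np.ndarray):
    n, d = X.shape
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    counts = {k: int((y == k).sum()) for k in classes}
    Sb = np.zeros((d, d))
    for i in classes:
        for j in classes:
            diff = means[i] - means[j]
            Sb += counts[i] * counts[j] / (2.0 * n**2) * np.outer(diff, diff)
    Sw = np.zeros((d, d))
    for k in classes:
        Xk = X[y == k] - means[k]
        Sw += counts[k] / n * (Xk.T @ Xk)  # n_k-weighted, as in Eq. (5)
    return Sw, Sb
```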

The parameters acquired by a pre-trained model on the source dataset may not be fully applicable to the target task, so the features it extracts may not all be relevant to the target task and may contain outliers or noise. Although the attention mechanisms of pre-trained models enhance their feature extraction capabilities, they cannot entirely eliminate the influence of noise or outliers. To improve the robustness of the model against outliers and noise, we optimize problem (3) using an iteratively reweighted method, which mitigates noise interference and expedites model convergence:

$$\begin{aligned} \begin{aligned}&\mathop {\arg \min }\limits _{\textrm{A}, \textrm{h}_k} \sum _{k=1}^{c} \sum _{\textrm{x}_i \in X^{(k)}} {s_{ik}} \Vert \textrm{A}^T\left( \textrm{x}_i - \textrm{h}_k \right) \Vert _2^2 - \alpha \sum _{i,j=1}^{c} \frac{n_in_j}{2n^2} {g_{ij}} \Vert \textrm{A}^T\left( \mu _i-\mu _j \right) \Vert _2^2 + \gamma \Vert \textrm{A} \Vert _{2,1}, \\&s.t. \mathrm{A^TA=I}, s_{ik}=\frac{1}{2\Vert \mathrm{A^T} \left( \textrm{x}_i - \textrm{h}_k \right) \Vert }, g_{ij} = \frac{1}{2\Vert \mathrm{A^T}\left( \mu _i - \mu _j \right) \Vert } \end{aligned} \end{aligned}$$
(6)

\(\textrm{h}_k\) denotes the weighted intra-class mean vector, \(s_{ik}\) the weight of the intra-class scatter, and \(g_{ij}\) the weight of the inter-class scatter. Repeatedly using the weights generated in the previous iteration to correct the residual term accelerates the convergence of the algorithm. Evidently, this approach weakens the impact of the squaring operation and improves the robustness of the model. When class i contains a distant point \(x^i_j\), we assign a smaller weight to the distance between sample \(x^i_j\) and mean \(\mu _i\), which reduces the impact of distant points on the intra-class scatter. Observing the second term, class pairs with larger inter-class distances receive smaller weights, while class pairs with smaller inter-class distances receive larger weights. This balances the learned projection matrix's attention when maximizing the distances between all class pairs.

In the process of minimizing the objective function, the first term guides the projection matrix to enhance the cohesion of class samples, while the second term directs the projection matrix to maximize the distance between all class pairs. During the iteration of the subspace, the sparse regularization term \(\Vert \textrm{A} \Vert _{2,1}\) applies row sparsity to projection matrix \(\mathrm A\). Certain features that contribute significantly to the discriminative subspace are selected, whereas redundant features are disregarded. Discriminative feature selection models can be employed to analyze the damage feature data of hydraulic structures, and specific important dimensions can be selected to reduce the volume of data while maintaining the model performance and decreasing the inference time.

Algorithm

The weighted class mean vector \(\textrm{h}_k\) and row-sparse projection matrix \(\mathrm A\) are the variables to be optimized. We adopt an alternating optimization strategy that iteratively updates \(\textrm{h}_k\) and \(\mathrm A\) until the objective function converges. First, we randomly initialize and fix matrix \(\mathrm A\) and solve for the weighted class mean vector \(\textrm{h}_k\); then, with \(\textrm{h}_k\) fixed, we solve for the projection matrix \(\mathrm A\).

Fixed projection matrix \(\mathrm A\), update weighted class mean \(\textrm{h}_k\):

When fixing the projection matrix A, solving optimization problem (6) is equivalent to solving the following problem:

$$\begin{aligned} \begin{aligned}&\mathop {\arg \min }\limits _{\textrm{h}_k} \sum _{k=1}^{c} \sum _{\textrm{x}_i \in X^{(k)}} {s_{ik}} \Vert \textrm{A}^T\left( \textrm{x}_i - \textrm{h}_k \right) \Vert _2^2, \\&s.t. \mathrm{A^TA=I}, s_{ik}=\frac{1}{2\Vert \mathrm{A^T} \left( \textrm{x}_i - \textrm{h}_k \right) \Vert }. \end{aligned} \end{aligned}$$
(7)

Since problem (7) is convex, we take the derivative with respect to \(\textrm{h}_k\) directly and set it to zero, obtaining:

$$\begin{aligned} \begin{aligned}&\sum _{\textrm{x}_i \in X^{(k)}} s_{ik} \left( \mathrm{AA^Th}_k -\mathrm{AA^Tx}_i \right) = \mathrm{AA^T}\sum _{\textrm{x}_i \in X^{(k)}} s_{ik}\left( \textrm{h}_k - \textrm{x}_i \right) =0. \end{aligned} \end{aligned}$$
(8)

Let \(\sum _{\textrm{x}_i \in X^{(k)}}s_{ik} \left( \textrm{h}_k - \textrm{x}_i \right) = \left( \theta \textrm{A} + \eta \textrm{A}^{\perp } \right)\), where \(\textrm{A}^{\perp }\) denotes the orthogonal complement of \(\textrm{A}\). Then, by setting \(\theta =0\), we can obtain \(\sum _{\textrm{x}_i \in X^{(k)}}s_{ik} \left( \textrm{h}_k - \textrm{x}_i \right) = \eta \textrm{A}^{\perp }\). In this case, \(\textrm{h}_k\) is deduced as follows:

$$\begin{aligned} \textrm{h}_k = \frac{\left( \sum _{\textrm{x}_i \in \textrm{x}^{(k)}}s_{ik}\textrm{x}_i + \eta \textrm{A}^{\perp } \right) }{\sum _{\textrm{x}_i \in \textrm{X}^{(k)}} s_{ik}}. \end{aligned}$$
(9)

Then, we substitute \(\sum _{\textrm{x}_i \in X^{(k)}}s_{ik} \left( \textrm{h}_k - \textrm{x}_i \right) = \eta \textrm{A}^{\perp }\) into Eq. (8) to obtain \(\mathrm{AA^T\eta A^\perp } = 0\), which shows that \(\eta\) can be any value. We set \(\eta = 0\) and obtain the optimal within-class mean as follows:

$$\begin{aligned} \textrm{h}_k^* = \frac{\left( \sum _{\textrm{x}_i \in \textrm{X}^{(k)}}s_{ik}\textrm{x}_i \right) }{\sum _{\textrm{x}_i \in \textrm{X}^{(k)}} s_{ik}}. \end{aligned}$$
(10)

Fix \(\textrm{h}_k\) and solve for \(\textrm{A}\):

When \(\textrm{h}_k\) is fixed, we solve the following problem:

$$\begin{aligned} \begin{aligned}&\mathop {\arg \min }\limits _{\textrm{A}} Tr \left( \mathrm{A^TEA} \right) - \alpha Tr \left( \mathrm{A^TFA} \right) +\gamma Tr \left( \mathrm{A^TDA} \right) . \\&s.t. \mathrm{A^TA=I}, \textrm{E} = \sum _{k=1}^{c} \sum _{\textrm{x}_i \in \textrm{X}^{(k)}} s_{ik} \left( \textrm{x}_i - \textrm{h}_k^* \right) \left( \textrm{x}_i - \textrm{h}_k^* \right) ^\textrm{T}, \\&\textrm{F} = \sum _{i,j=1}^{c} \frac{n_in_j}{2n^2} g_{ij} \left( \mu _i - \mu _j \right) \left( \mu _i - \mu _j \right) ^T, \\&\textrm{D} = \begin{bmatrix} \ddots & & \\ & \frac{1}{2} \Vert \textrm{a}^i \Vert _2^{-1} & \\ & & \ddots \end{bmatrix} \end{aligned} \end{aligned}$$
(11)

where \(s_{ik}=\frac{1}{2\Vert \mathrm{A^T} \left( \textrm{x}_i - \textrm{h}_k^* \right) \Vert _2}, g_{ij} = \frac{1}{2\Vert \mathrm{A^T}\left( \mu _i - \mu _j \right) \Vert _2}\) . We can further combine all terms in Eq. (11) to obtain the following minimization optimization problem:

$$\begin{aligned} \begin{aligned}&\mathop {\arg \min }\limits _{\textrm{A}} Tr \left( \mathrm{A^TMA} \right) , \\&s.t. \mathrm{A^TA=I}, \left( \textrm{E} - \alpha \textrm{F} + \gamma \textrm{D} \right) = \textrm{M}, \alpha> 0, \gamma > 0. \end{aligned} \end{aligned}$$
(12)

It is evident that the minimum of Eq. (12) is obtained by summing the l smallest eigenvalues of \(\textrm{M}\), and the projection matrix A is constructed by arranging the eigenvectors corresponding to the l smallest eigenvalues. The specific steps of the solution algorithm are listed in Table 6.

Table 6 Algorithm process of discriminative feature selection model.
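To make the alternating scheme of Table 6 concrete, the sketch below gives an illustrative numpy re-implementation (not the authors' released code): it alternates the closed-form update of \(\textrm{h}_k\) in Eq. (10) with the eigen-decomposition step of Eq. (12), and finally keeps the feature dimensions whose rows of \(\mathrm A\) have the largest \(l_2\) norms.

```python
# Illustrative numpy sketch of the Table 6 alternating optimization.
import numpy as np

def discriminative_feature_selection(X, y, l, alpha=1.0, gamma=10.0,
                                     n_iter=30, eps=1e-8):
    n, d = X.shape
    classes = np.unique(y)
    counts = {k: int((y == k).sum()) for k in classes}
    mu = {k: X[y == k].mean(axis=0) for k in classes}
    rng = np.random.default_rng(0)
    A, _ = np.linalg.qr(rng.standard_normal((d, l)))  # random column-orthogonal init
    h = {k: mu[k].copy() for k in classes}            # start from the class means
    for _ in range(n_iter):
        # Step 1: update the weighted class means h_k (Eq. 10), A fixed.
        for k in classes:
            Xk = X[y == k]
            s = 1.0 / (2.0 * np.linalg.norm((Xk - h[k]) @ A, axis=1) + eps)
            h[k] = (s[:, None] * Xk).sum(axis=0) / s.sum()
        # Step 2: rebuild E, F, D of Eq. (11), h_k fixed.
        E = np.zeros((d, d))
        F = np.zeros((d, d))
        for k in classes:
            Xk = X[y == k] - h[k]
            s = 1.0 / (2.0 * np.linalg.norm(Xk @ A, axis=1) + eps)
            E += (s[:, None] * Xk).T @ Xk
        for i in classes:
            for j in classes:
                diff = mu[i] - mu[j]
                g = 1.0 / (2.0 * np.linalg.norm(A.T @ diff) + eps)
                F += counts[i] * counts[j] / (2.0 * n**2) * g * np.outer(diff, diff)
        D = np.diag(1.0 / (2.0 * np.linalg.norm(A, axis=1) + eps))
        # Step 3: A <- eigenvectors of M = E - alpha*F + gamma*D for the
        # l smallest eigenvalues (Eq. 12); eigh sorts eigenvalues ascending.
        M = E - alpha * F + gamma * D
        _, V = np.linalg.eigh(M)
        A = V[:, :l]
    # Selected features: indices of the l rows of A with the largest l2 norms.
    return np.argsort(np.linalg.norm(A, axis=1))[::-1][:l]
```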

Research methods

Hydraulic concrete is the most important construction material used in hydraulic structures, and microfractures may develop during the initial phase of cement solidification. Under the influence of external loads or freeze-thaw cycles, these microscopic fissures may be subjected to the combined effects of external pressure and ice expansion pressure for extended periods, potentially resulting in severe structural damage such as rupture or surface exfoliation. Prompt detection of these structural impairments can facilitate early remediation and ensure the continued safe operation of water conservation projects. This study proposes an efficient methodological approach for identifying damage in hydraulic concrete, comprising three core steps: feature extraction, discriminant analysis, and ensemble learning classification, as illustrated in Fig. 6.

Fig. 6

Schematic diagram of technical roadmap for identifying apparent damage in hydraulic structures.

Initially, the fine-tuned model was utilized to extract image features, wherein each image yielded a one-dimensional feature vector. These feature vectors typically originate from the pooling layer of pre-trained models, which integrates feature information from multiple channels. Consequently, the extracted features capture the global and multilevel information of the image while mitigating the influence of local details. The feature extraction process is illustrated in the first part of Fig. 6.
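A minimal sketch of this step is given below, assuming a fine-tuned torchvision backbone named `model` and a `DataLoader` named `loader`: a forward hook on the global average pooling layer collects one pooled vector per image (the module is called `avgpool` in ResNet/RegNetY; other backbones expose an analogous pooling module).

```python
# Sketch: harvest pooled feature vectors via a forward hook ('model' and
# 'loader' are assumed to exist; 'avgpool' is the ResNet/RegNetY module name).
import torch

features = []

def hook(_module, _inputs, output):
    features.append(torch.flatten(output, 1).detach().cpu())

handle = model.avgpool.register_forward_hook(hook)
model.eval()
with torch.no_grad():
    for images, _labels in loader:
        model(images)  # the hook records the pooled features
handle.remove()
X = torch.cat(features).numpy()  # shape: (num_images, feature_dim)
```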

Second, the feature vectors extracted by pre-trained models typically have hundreds or thousands of dimensions, which may degrade the performance of ensemble learning classifiers. It is therefore necessary to conduct discriminant analysis on these dimensions to reduce dimensionality and enhance sample class separability. In the related works section, we developed a discriminative feature selection model that maximizes the inter-class distance and minimizes the intra-class distance through iterative refinement of the discriminative subspace; feature selection is accomplished by incorporating sparse learning into the iteration of the discriminative subspace. In essence, the discriminant analysis of features or dimensions constitutes the training of the discriminative feature selection model. Its outcome is the row-sparse projection matrix \(\mathrm A\), whose non-zero rows correspond to the dimensions to be preserved. The discriminant analysis process is illustrated in the second part of Fig. 6.

Finally, we employ an ensemble learning algorithm to classify the low-dimensional samples; it mitigates the variance of individual classifiers and thus exhibits enhanced stability when addressing noise or outliers. We utilize the bagging algorithm, which generates multiple training subsets through random sampling and uses them to train multiple base learners, such as decision trees, neural networks, or support vector machines. Specifically, we select a decision tree as the base learner. The predictions of the base learners are aggregated through voting or averaging. The classification process is illustrated in the third part of Fig. 6.
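In scikit-learn terms, this stage reduces to fitting a random forest (bagging over decision trees) on the selected dimensions; the sketch below assumes arrays `X_train`, `y_train`, `X_test` and an index array `selected` produced by the feature selection step.

```python
# Sketch: bagging of decision trees via a random forest on selected features.
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train[:, selected], y_train)
y_pred = clf.predict(X_test[:, selected])
```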

Experimentation and analysis

We conducted two sets of experiments to evaluate the efficacy of our research methodology. In the benchmark experiment, we directly modified the number of output units in the classification layer of the pretrained model to adapt it to the target task, then trained the classification-layer parameters on the training set; the results were used to validate the effectiveness of the pretrained model. In the advanced experiments, the parameters of selected layers of the pre-trained model were unfrozen to enable the model to learn the surface damage characteristics of hydraulic structures and enhance its generalization capability. To improve recognition performance, the fine-tuned pre-trained model is used to extract damage-image features, a discriminative feature selection model then eliminates redundant features, and the discriminative feature data are finally input into the random forest algorithm for classification, with the number of base learners set to 100. The ranges of the parameters \(\alpha\) and \(\gamma\) of the discriminative feature selection model are set to \([10^{-1},10^{2}]\) and \([10^{1},10^{5}]\), respectively. The experiments were conducted on a system running Ubuntu 20.04 with an 8-core processor, 25 GB of RAM, CUDA 11.4, and NVIDIA driver 470.57.02 (as reported by nvidia-smi). Figure 7 shows a schematic of the experimental process.

It should be noted that this study used the weighted average of precision, recall, and F1 score as evaluation indicators for the classification results. The dataset of apparent damage to hydraulic structures involves four categories, which means \(c=4\). The calculation formulas for classification accuracy, precision, recall, and F1 score are as follows:

$$\begin{aligned} & Weighted\, Average\, Precision = \frac{\sum _{i=1}^n (support_i \times precision_i)}{\sum _{i=1}^n support_i}. \end{aligned}$$
(13)
$$\begin{aligned} & Weighted\, Average\, Recall = \frac{\sum _{i=1}^n (support_i \times recall_i)}{\sum _{i=1}^n support_i}. \end{aligned}$$
(14)
$$\begin{aligned} & Weighted\, Average\, F1\, Score = \frac{\sum _{i=1}^n (support_i \times f1score_i)}{\sum _{i=1}^n support_i}. \end{aligned}$$
(15)
$$\begin{aligned} & Accuracy = \frac{Number\, of\, Correct\, Predictions}{Total\, Number\, of\, Predictions}. \end{aligned}$$
(16)
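These support-weighted averages coincide with scikit-learn's `average='weighted'` option, which offers a convenient cross-check of Eqs. (13)-(16); `y_true` and `y_pred` below are assumed label arrays.

```python
# Sketch: weighted-average precision/recall/F1 and accuracy with scikit-learn.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")  # support-weighted over the 4 classes
accuracy = accuracy_score(y_true, y_pred)
```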
Fig. 7

Schematic representation of apparent damage identification model architecture for hydraulic structures.

Data collection

In hydraulic engineering, high-fill channels are a common type of structure, as shown in the structural diagram in Fig. 8. We collected apparent images of high-fill channels using camera equipment and, after cropping, obtained 400 patch images of 150 \(\times\) 150 pixels covering crack, fracture, hole, and normal areas. As Fig. 9 shows, there is a pronounced color difference between hole-damaged areas and the background, whereas crack- and fracture-type damage appears as blurry striped patterns. In terms of shape, cracks are mainly single stripes, whereas fractures are composed of multiple combined stripes. Compared with cracks and fractures, hole damage is nonlinear and irregular in shape and size.

To test the performance of the model more realistically, we performed data augmentation on the 400 patch images using techniques such as flipping, translation, distortion, grid masking, perspective transformation, and exposure adjustment, and we made random adjustments to color, contrast, and sharpness. Grid masks simulate occlusion, whereas perspective, distortion, and flipping simulate changes in viewpoint. After applying these operations to each image, 4800 images were obtained. Figure 10 shows example images after data augmentation.
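A torchvision approximation of this augmentation pipeline is sketched below; grid masking has no built-in transform there, so `RandomErasing` serves as an occlusion stand-in, and the chosen magnitudes are illustrative.

```python
# Sketch: an augmentation pipeline approximating the operations described above.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                               # flipping
    T.RandomVerticalFlip(p=0.5),
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),             # translation
    T.RandomPerspective(distortion_scale=0.3, p=0.5),            # perspective/distortion
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3), # exposure/color
    T.RandomAdjustSharpness(sharpness_factor=2, p=0.5),          # sharpness
    T.ToTensor(),
    T.RandomErasing(p=0.5, scale=(0.02, 0.1)),                   # occlusion stand-in
])
```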

Fig. 8

Schematic diagram of high-fill concrete slope structure.

Fig. 9

Surface damage of high fill concrete slope, (a) hole, (b) crack, (c) fracture, (d) normal.

Fig. 10

Example of high-fill concrete slope damage image after data enhancement.

Baseline experiment

The ResNet, MobileNet, EfficientNet, and RegNetY models are available in 5, 5, 10, and 12 versions, respectively. The parameter sizes and network structures of the different versions vary considerably, making it difficult to determine the most appropriate version for a given task. Consequently, we designed benchmark experiments to select the most suitable model and provide a reference for the subsequent advanced experiments. In the benchmark test, we modified only the output layer, reducing the original 1000 output units to four units corresponding to the four categories. We randomly selected 100 images from each class to constitute the training set and another 100 to form the validation set, and used the remaining samples as the test set.

We employed Adaptive Moment Estimation (Adam) as the optimization algorithm for model training. Furthermore, we established a learning rate of \(10^{-3}\) and set the number of training cycles to 20. We subsequently retrained the fully connected layer parameters of the pretrained model on the training set, evaluated them on the validation set, and ultimately assessed their generalization ability on the test set. We conducted five independent experiments and obtained the mean accuracy, precision, recall, and F1 score of the classification results, as well as the training and inference times of the model. The detailed experimental results are presented in Table 7.
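The baseline protocol corresponds to the following sketch, in which only the new four-unit head is trainable (`model` and `train_loader` are assumed; the head is `fc` for ResNet/RegNetY and `classifier[-1]` for the other backbones).

```python
# Sketch: baseline training, i.e. retraining only the classification head.
import torch

for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():  # head name varies by backbone
    p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(20):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```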

Table 7 Classification accuracy, precision, recall, and F1 score of the original pre-trained model on the test set.

The experimental results presented in Table 7 demonstrate that the direct application of pre-trained models for classifying damage images exhibits significant efficacy, suggesting that the features learned by these models on the ImageNet dataset are highly applicable to the damage dataset of hydraulic structures. Furthermore, based on the metrics of classification accuracy, precision, recall, and F1 score, the pre-trained model demonstrates a relatively balanced recognition performance across the four categories of hydraulic structure damage, indicating that the general features learned by the model possess good generalization capability. However, the classification accuracy of certain models exhibits considerable fluctuation, suggesting that the stability of these models requires improvement. On a training set comprising 400 images, the accuracy of all pre-trained models ranges from 73.9% to 86.26%, which is currently insufficient to meet the requirements of practical application scenarios. Consequently, it is necessary to fine-tune the parameters of these models to enhance their performance in identifying apparent damage in hydraulic structures.

Advanced experiment

The results of the benchmark experiment validated the applicability of the pretrained ResNet, MobileNet, EfficientNet, and RegNetY models for identifying apparent damage in hydraulic structures. However, their performance still requires improvement, so we fine-tuned some parameters of these models on the dataset of surface damage images of hydraulic structures. Unlike the benchmark experiment, 400 training images were used to fine-tune each model, and another 400 images were used to evaluate it. Fine-tuning focused on the top-level parameters of the network structure, as indicated by the red-boxed network layers in Figs. 2, 3, 4, and 5. We set the number of training rounds to 20 and the learning rates between \(10^{-1}\) and \(10^{-3}\). After five independent experiments, we evaluated the models using the mean classification accuracy, precision, recall, and F1-score. Table 8 shows the classification performance of the fine-tuned models, and Fig. 11 illustrates the convergence of the ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-800MF models.
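Relative to the baseline loop shown earlier, the only change in this setting is which parameters are unfrozen; the sketch below uses ResNet-18's module names (`layer4`, `fc`) for illustration, with the red-boxed layers of the other backbones unfrozen analogously.

```python
# Sketch: unfreeze the top-level stage plus the head before fine-tuning.
for p in model.parameters():
    p.requires_grad = False
for module in (model.layer4, model.fc):  # ResNet-18 names; backbone-specific
    for p in module.parameters():
        p.requires_grad = True
```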

As shown in Table 8, the ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-32GF models demonstrated superior performance on the diverse dataset of apparent damage to hydraulic structures. Given that the floating-point operations of RegNetY-12GF and RegNetY-32GF significantly exceed those of RegNetY-800MF, the RegNetY-800MF model was selected to represent the RegNetY series. Comparing Tables 7 and 8 reveals that fine-tuning the parameters of the top-level module of a pre-trained model effectively enhances its capacity to extract apparent damage features of hydraulic structures. As illustrated in Fig. 11, the loss values of the ResNet, MobileNet, EfficientNet, and RegNetY models gradually converged and remained stable as training progressed. To mitigate the risk of overfitting during feature extraction, which can harm generalization to novel samples, we employed early stopping, halting training once the validation performance ceased to improve; accordingly, we set the number of training iterations between 5 and 15 during the experiments. The numerical results in Table 8 indicate that the highest recognition accuracies of ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-800MF are 84.58%, 87.74%, 89.52%, and 89.12%, respectively.

Table 9 presents the results of employing the random forest algorithm to classify the extracted feature samples as well as the experimental outcomes of utilizing feature selection for dimensionality reduction. Figure 12 demonstrates that employing the random forest algorithm instead of the neural network classification layer can significantly enhance the model’s performance in identifying apparent damage to hydraulic structures, thereby improving its anti-interference capability. Furthermore, as the number of features increases, the classification accuracy rapidly increases, indicating that the discriminative feature selection model can effectively select a subset of discriminative features while reducing redundant features. The reduction of redundant features in feature data not only decreases the time consumption of ensemble learning algorithms and ensures the capacity of the model to process large batches of images in real time, but also enhances the performance of apparent damage recognition in hydraulic structures to a certain extent. Ultimately, the classification accuracies of the ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-800MF models were enhanced by 4.62%, 1.58%, 1.9%, and 1.7%, respectively, while substantially reducing the inference time required to process 4000 test samples.

Fig. 11

(a) and (b) represent the accuracy and loss value variation curves of the fine-tuned ResNet-18 on the training and validation sets, respectively. (c) and (d) represent the accuracy and loss value variation curves of the fine-tuned MobileNet-v4 on the training and validation sets, respectively. (e) and (f) represent the accuracy and loss value variation curves of the fine-tuned EfficientNet-B0 on the training and validation sets, respectively. (g) and (h) represent the accuracy and loss value variation curves of the fine-tuned RegNetY-800MF on the training and validation sets, respectively.

Table 8 Classification accuracy, precision, recall, and F1 score of the pre-trained model (fine-tuned) on the test set.
Table 9 Numerical results of classification experiments integrating discriminative feature selection and random forest.
Fig. 12

(a)–(d) represent the relationship between the accuracy of surface damage identification in hydraulic structures and the number of discriminative features used, on the feature datasets extracted from ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-800MF.

To further validate the performance of the models in identifying the four types of apparent damage to hydraulic structures, confusion matrices were plotted, and the accuracy, precision, recall, and F1-score of each category were calculated, as shown in Fig. 13 and Table 10. The experimental results demonstrate that although the research method achieves satisfactory overall performance, limitations remain in identifying fracture-type damage. Notably, the method corresponding to Fig. 13c exhibited suboptimal performance in identifying fractures, with a recognition accuracy of 81.03%, whereas the method corresponding to Fig. 13d demonstrated superior performance and high recognition accuracy for fracture-type surface damage. Based on these findings, the RegNetY-800MF model demonstrates a relatively balanced performance in distinguishing the crack, fracture, hole, and normal categories when extracting the apparent damage characteristics of hydraulic structures, rendering it the optimal choice. Example images of the recognition errors of the RegNetY-800MF model are shown in Fig. 14. Evidently, the method's capacity to identify damage types with indistinct boundaries requires improvement, and its ability to detect subtle cracks is inadequate, particularly when the surface of a hydraulic structure deteriorates or is obscured by water stains. Under limited-sample conditions, damage features must therefore be explored more effectively to enhance the accuracy of surface damage identification in hydraulic structures.

Table 10 Detailed classification results of classification experiments integrating discriminative feature selection and random forest classification algorithms.
Fig. 13

(a)–(d) represent classification confusion matrices corresponding to the ResNet-18, MobileNet-v4, EfficientNet-B0, and RegNetY-800MF models, respectively.

Fig. 14

Several examples of misclassification by the apparent damage identification model of hydraulic structures, corresponding to RegNetY-800MF.

Conclusion

This study proposes an efficacious methodology for identifying apparent damage in hydraulic structures, which does not require a substantial volume of training data to effectively identify the four types of damage. Utilizing the principle of transfer learning, we fine-tuned the pre-trained model with a limited number of training samples and subsequently extracted features from the surface damage images of the hydraulic concrete structures. The experimental results corroborate the efficacy of employing transfer learning methods for damage identification. Furthermore, we substituted the classification layer of the neural network with an ensemble learning classifier, thereby substantially enhancing the accuracy of the damage recognition. To augment the model’s capacity to process large volumes of image data, discriminant analysis was conducted on the extracted feature data, significantly reducing the model’s inference time without compromising classification performance. The experimental results indicate that the accuracies of this research method in identifying crack, fracture, hole, and normal categories were 87.65%, 87.82%, 96.99%, and 95.25%, respectively, which demonstrates the potential application value of this method. From both data and model perspectives, the technical approach proposed in this study exhibits considerable scalability.

Although the method proposed in this study achieved good performance on the damage image dataset of high-fill slopes, the interpretability of the pre-trained model in extracting the damage features of hydraulic structures during the feature extraction stage is limited. In addition, the discriminative feature selection model fails to consider the multimodal distribution characteristics of the samples, which may result in the discriminative power of the selected feature subset that still requires improvement. In future work, we will focus on improving the interpretability and performance of this model.