Introduction

Bearings are crucial components in mechanical systems, typically responsible for load-bearing and transmission. Their condition directly influences the reliability of mechanical systems. During operation, alternating stress can lead to the degradation or damage of bearing components, posing safety risks to both equipment and personnel. Consequently, monitoring the health of bearings and diagnosing their faults is essential for preventing major safety incidents in mechanical systems1. Initially, bearing fault diagnosis revolved primarily around physical modeling and signal processing2. This approach required researchers to have a deep understanding of the mechanical system structure, and the diagnostic methods were constrained by prior knowledge, limiting their broad application in industrial settings3,4,5,6,7. With the rapid advancement and deployment of deep learning methods, data-driven deep learning algorithms have emerged as the dominant technology. They can adaptively extract fault characteristics and perform training and learning. Deep learning algorithms can directly establish a multi-dimensional nonlinear mapping between fault signals and their categories, enabling accurate and swift diagnosis of bearing faults8. However, bearings often operate in complex and harsh environments, resulting in vibration signals that contain significant environmental noise. Particularly in the event of bearing failure, fault features may be obscured by noise, substantially diminishing the model's accuracy9,10.

In recent years, the deep learning field has seen rapid advancements, with numerous models achieving breakthroughs in areas such as computer vision, natural language processing, and weather forecasting11,12,13. These developments demonstrate the potential of deep learning to automatically discern data patterns layer by layer from extensive datasets. Currently, mainstream CNN networks and their variants have shown promising results in diagnosing bearing faults on pure fault signal datasets. However, the real-world operating environments of bearings differ significantly from ideal laboratory conditions. Mechanical systems operate amidst a variety of strong noises, which undermines deep learning networks trained on pure signals when they are applied to actual diagnosis and prediction, leading to less satisfactory outcomes. In response to this challenge, some researchers have begun exploring bearing fault diagnosis in single-noise environments using vibration signals. For instance, Zhang et al.14 proposed an enhanced residual network aimed at diagnosing bearing faults under Gaussian noise with SNRs of 0–8 dB. Hou et al.15 added Gaussian-like noise with SNRs of 10–20 dB to the dataset, used the attention mechanism, and filtered the noise through the Fourier transform. Similarly, Yao et al.16 developed an intelligent method using a stacked inverse residual convolution neural network (SIRCNN), which diagnoses bearing faults in environments with a single type of noise at SNRs of −6 to 10 dB. In summary, current research mainly targets vibration signals containing a single, weak noise; studies on bearing fault diagnosis under multiple, strong noises remain scarce. Moreover, data from a single sensor can be limited by factors such as its operational state, potentially causing distortion and errors in the vibration signals17. A model trained on such data may fail to learn distinct fault features, and its reduced robustness diminishes its effectiveness. Given this limitation, it is crucial to use information from sensors at multiple locations, allowing the deep learning-based network to identify more significant and complete fault features during fault diagnosis.

Addressing this challenge, this paper designs a novel Multi-Location Multi-Scale Multi-Level Information Attention Activation Network (MLSCA-CW), which combines soft threshold, self-activation, and self-attention mechanisms and performs excellently in various strong noise environments. Specifically, we considered noise levels with SNRs in the range of −9 to 9 dB, which represent strong noise in real-world conditions. Additionally, our experimental design includes mixtures of multiple noise sources to simulate the complex noise environments of actual equipment. The experimental results show that even in such complex, strong noise environments, our method can effectively extract fault features and perform accurate fault diagnosis. The main contributions of this paper include:

  1.

    A novel feature extraction module (MLVFE) with the ability to adaptively extract and fuse multi-___location sensor information is designed. The module uses multiple one-dimensional convolutions to extract features from multiple angles for each ___location channel and generates dynamic weights for the feature information of different locations via the weight distribution sub-module, realizing efficient extraction and weighted fusion of multi-___location sensor information.

  2.

    An improved self-activation mechanism (MSC-SA). Specifically, unlike previous studies, this paper uses multi-scale convolution to extract multi-scale signal features and efficiently and adaptively refines the main features according to the features at different scales, so that more information is taken into account and feature extraction becomes more effective.

  3.

    The deep residual shrinkage network module (DRSN-CW), with a different threshold per channel, is distributed to multiple key locations of the MLSCA-CW model. The soft threshold mechanism further filters the noise in the vibration features extracted by the multi-___location feature extraction module and in the multi-dimensional signal features extracted by the multiple self-activation mechanisms, improving the robustness of the model.

  4.

    A multi-layer attention module is proposed. Specifically, the multi-layer attention module is added after the self-activation mechanism to automatically learn the weights of shallow and deep features in multiple feature subspaces, adaptively enhancing beneficial features and suppressing detrimental ones.

The rest of this article is organized as follows. “Related work” section reviews related work, “Method” section provides a detailed introduction to the proposed method, “Experiment” section describes the experiment and results, and “Conclusion” section gives the conclusion.

Related work

Deep learning has emerged as a prominent method for bearing fault diagnosis. Unlike traditional machine learning methods18,19,20,21, deep learning models bypass the cumbersome process of feature engineering and act as powerful universal function approximators with strong expressiveness. Especially with ample data, deep learning can autonomously learn complex data patterns. Convolutional Neural Networks (CNN)22, Autoencoders (AE)23,24, Deep Belief Networks (DBN)25,26, and Recurrent Neural Networks (RNN)27 have attracted considerable attention in the field of intelligent fault diagnosis. Since their initial application to rolling bearing fault identification in 201628, CNNs and their variants have achieved satisfactory results in predicting pure bearing fault vibration signals29. Wang et al.30 applied strategies like wavelet decomposition and interpolation to transform vibration signals into grayscale images, enhancing the CNN model's generalization performance. Wen et al.31 utilized a CNN based on LeNet-5 to enhance the learned feature scale, achieving an accuracy of 99.79% on a pure test set. Zhuang et al.32 improved model performance by employing a multi-scale CNN to extract features at various scales. Li et al.33 introduced a novel bearing defect diagnosis algorithm utilizing an updated Dempster-Shafer hypothesis CNN (IDSCNN), which further enhanced network performance. Pan et al.34 developed a new architecture named "LiftingNet," based on CNN, to mitigate the impact of speed variations on prediction outcomes.

However, most current deep learning methods use pure datasets, and only a few scholars add at most two classes of low-intensity noise to the dataset for experiments. For example, Deng et al.35 proposed MgNet, a framework based on multi-granularity information fusion, to explore network performance under weak Laplacian and Gaussian noise. Yan et al.36 proposed an attention mechanism-guided residual convolution variational auto-encoder network (AM-RCVAE), which can diagnose bearing faults under Gaussian white noise and pink noise with SNRs of −6 to 6 dB. Zhang et al.14 proposed a residual learning network that improves network information flow for bearing fault diagnosis under a single noise type with SNRs of 0–9 dB. Yao et al.16 proposed a stacked inverse residual convolution neural network (SIRCNN) for classifying the type and severity of bearing failure in single-noise environments with SNRs of −6 to 10 dB. Robust bearing fault diagnosis under high-intensity, multi-source noise is therefore a prerequisite for applying deep learning-based diagnosis methods in actual practice.

However, existing networks ignore the contribution of shallow features, the extraction of multi-scale information, and the fusion of multi-___location information, relying overly on the deep network to extract effective information at a specific ___location. This results in poor performance in environments with complex noise. Therefore, it is necessary to design a new neural network that performs multi-scale feature extraction and fusion of multi-position vibration signals, multi-level feature adaptive aggregation, and multi-stage noise filtering, so as to realize efficient extraction of fault features and accurate judgment of fault types.

Method

In this paper, the MLSCA-CW model mainly includes the variable-scale multi-___location information extraction and weighted fusion module, the noise filtering module based on soft thresholding, the improved multi-scale self-activation mechanism, and the multi-layer attention mechanism. In the network design, we took into account the coupling information contained in data from multiple ___location sensors and the complementary nature of different scale features within a single sensor. To achieve this, we weighted and aggregated signal features extracted through one-dimensional convolutions with multiple receptive fields at each ___location, and the information from multiple ___location sensors is fused to enhance the feature information. To mitigate the interference of noise in these features, the signals obtained by the multi-___location feature extraction module are input into the soft threshold processing module for adaptive adjustment and suppression of noise, thereby filtering out the majority of noise present in the feature signals. The output feature signals are then subjected to multi-scale convolution to extract features at various scales. Additionally, a self-activation mechanism computed from the features obtained at different scales dynamically rectifies the primary feature, enhancing the clarity of the fault feature signals. The signal passes through multiple self-activated modules with the same structure to extract multi-level features from shallow to deep layers. After this further extraction and separation, the fault feature signals and the admixed noise become easier to distinguish in the high-dimensional space. At the end of each self-activation layer, the soft threshold mechanism further filters the noise, purifying the characteristic signals. The purified deep and shallow features encompass the most crucial information for bearing fault diagnosis. A multi-layer attention module processes the shallow and deep feature signals to acquire pure, highly discriminative signals, which are subsequently fed into a multi-layer perceptron to yield the final fault diagnosis results.
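To make the data flow described above concrete, the following minimal PyTorch sketch traces the same pipeline with heavily simplified stand-in modules: a single convolution in place of the real MLVFE, generic convolutional blocks in place of DRSN-CW and MSC-SA, and a plain multi-head attention in place of the layer attention. All layer sizes and the stand-in modules are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Generic shape-preserving block used as a stand-in for DRSN-CW / MSC-SA.
def block(c):
    return nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.GELU())

class MLSCACWSketch(nn.Module):
    """Schematic forward pass of MLSCA-CW (illustrative only)."""
    def __init__(self, channels=6, num_classes=4):
        super().__init__()
        self.mlvfe = nn.Conv2d(2, channels, 1)                            # stands in for MLVFE fusion
        self.filters = nn.ModuleList(block(channels) for _ in range(4))   # DRSN-CW slots
        self.stages = nn.ModuleList(block(channels) for _ in range(3))    # MSC-SA slots
        self.proj = nn.Linear(channels * 32 * 32, 64)                     # "Linear" layer in Fig. 1
        self.attn = nn.MultiheadAttention(64, num_heads=4, batch_first=True)
        self.mlp = nn.Linear(64, num_classes)

    def forward(self, x1, x2):                              # two 1024-point signals
        f = torch.stack([x1.view(-1, 32, 32), x2.view(-1, 32, 32)], 1)
        f = self.filters[0](self.mlvfe(f))                  # fuse locations, then filter noise
        feats = []
        for stage, filt in zip(self.stages, self.filters[1:]):
            f = filt(stage(f))                              # extract features, then soft-threshold
            feats.append(self.proj(f.flatten(1)))           # uniform shape per feature level
        seq = torch.stack(feats, dim=1)                     # shallow-to-deep token sequence
        fused, _ = self.attn(seq, seq, seq)                 # layer attention over levels
        return self.mlp(fused.mean(dim=1))                  # classification logits

logits = MLSCACWSketch()(torch.randn(8, 1024), torch.randn(8, 1024))
```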

Framework of MLSCA-CW

The MLSCA-CW network comprises a multi-___location feature extraction and fusion module, four soft threshold signal filtering modules, three feature mapping processors, a multi-layer attention module, and a classifier that integrates feature information to produce diagnostic results. The framework of MLSCA-CW is shown in Fig. 1.

Fig. 1

Framework of MLSCA-CW. The inputs are vibration signals from sensors at two locations, and the output is the classification result. MLVFE is the multi-___location vibration signal feature extractor, DRSN-CW is a soft-threshold-based noise filtering module, MSC-SA denotes the multiple self-activated groups based on multi-scale convolution, Linear is a linear layer, Embedding is the feature embedding layer, Multi-head Attention is a multi-head attention mechanism, and MLP is a multi-layer perceptron.

Key modules of MLSCA-CW

Multi-___location vibration feature extractor

Fig. 2

Structure of the Multi-___location Vibration Signal Feature Extractor (MLVFE). The input is the vibration signals of sensors at two locations, and the output is a two-dimensional vibration feature map fusing the two locations. Conv1D n (n = 1, 2, 3, ...) is a series of one-dimensional convolutions with kernel size 2n−1, used to extract vibration signal features at multiple scales. The dotted box is the weight distribution sub-module, which processes the extracted features through two-dimensional convolution, the RELU activation function, and max pooling, and finally applies the Softmax activation function to convert them into weights for the sensor information at different locations, realizing dynamic fusion of multi-sensor information.

In a neural network, the quality of the input signal affects the final performance of the network; high-quality input data is the basis of excellent network performance. Zheng et al.17 found that data collected by a single-___location sensor suffer from problems such as signal distortion, weak anti-interference ability, and incomplete characteristic signals. Furthermore, the sensor captures one-dimensional signals that are constrained by the limited expressive capacity of one-dimensional convolution, leading to incomplete feature extraction from input signals. There is no consensus among scholars on efficient methods for converting these one-dimensional signals into more expressive two-dimensional features. Among various transformation methods, Chen et al.37 incorporated a dedicated feature extractor into the network's front end to enable adaptive and learnable feature extraction while reducing the need for additional data preprocessing, as demonstrated in Cheng's study38. Furthermore, research has demonstrated39 that, when extracting data features, convolutional networks with multiple receptive fields express data characteristics more comprehensively than those with fixed receptive fields; this enhanced representation also contributes to the robustness and accuracy of the network. In summary, we incorporate a self-designed multi-receptive-field multi-___location vibration feature extraction module (MLVFE, Fig. 2) based on one-dimensional convolution at the network's front end to achieve a more comprehensive and holistic nonlinear feature mapping for multiple one-dimensional feature signals.

Specifically, let \({x_1} \in {\mathbb {R}^{batch \times 1024}}\) and \({x_2} \in {\mathbb {R}^{batch \times 1024}}\) represent the signal inputs of sensors at two distinct locations; the output 2D feature map \({f_{map}} \in {\mathbb {R}^{batch \times 32 \times 32 \times 2n}}\) of MLVFE can then be calculated according to Eq. (1), where "\(\otimes\)" is the convolution operation, "\(Conv1{D_i}\)" is a Conv1D with kernel size i, "\(Conv2{D_i}\)" is a Conv2D with kernel size i, and "BN" is batch normalization. Besides, to couple different spatial features, a trainable weight \({w_i}\), \(i \in [1,n]\), is added to each Conv1D in MLVFE. Furthermore, to facilitate the fusion and complementation of vibration signal characteristics across these locations, an adaptive learning parameter W, similar to the learned weights of an attention mechanism, is calculated through the RELU, max pooling, and Softmax operations.

We use multiple unshared one-dimensional convolution sets to process the signals at different locations separately, capturing the multi-scale features contained in the signal at each ___location. Taking the signal input at one ___location as an example, assuming its shape is \({x_1} \in {\mathbb {R}^{batch \times 1024}}\), after three (\(n=3\)) one-dimensional convolutions with kernel sizes 1, 3, and 5, vectors of shape \({x_{n=1}} \in {\mathbb {R}^{batch \times 1024}}\), \({x_{n=2}} \in {\mathbb {R}^{batch \times 1024}}\), and \({x_{n=3}} \in {\mathbb {R}^{batch \times 1024}}\) are generated; they are then concatenated into a vector of shape \({x_{1}^{n=3}} \in {\mathbb {R}^{batch \times 1024 \times 3}}\).

$$\begin{aligned} \left\{ \begin{array}{l} {f_{map}} = {f_{map2n}} \times W\\ W = \text {Softmax}\{ \text {Maxpool}[\text {RELU}(Conv2{D_2} \otimes {f_{map2n}})]\} \\ {f_{map2n}} = \text {Concat}({f_{map1}},{f_{map2}})\\ {f_{map1}} = \text {Reshape}[\text {BN}({f_{signal1}}), 1024 \rightarrow 32 \times 32]\\ {f_{map2}} = \text {Reshape}[\text {BN}({f_{signal2}}), 1024 \rightarrow 32 \times 32]\\ {f_{signal1}} = \text {GELU}[\text {Concat}_{i = 1}^n(Conv1{D_{2i - 1}} \otimes {x_1} \times {w_i})]\\ {f_{signal2}} = \text {GELU}[\text {Concat}_{i = 1}^n(Conv1{D_{2i - 1}} \otimes {x_2} \times {w_i})]\\ \text {GELU}(x) = x \cdot \int _{ - \infty }^x {\frac{{{e^{ - {t^2}/2}}}}{{\sqrt{2\pi } }}} dt \approx 0.5x\{ 1 + \tanh [\sqrt{2/\pi } (x + 0.044715{x^3})]\} \\ \text {RELU}(x) = \max (0,x) \end{array} \right. \end{aligned}$$
(1)
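A possible PyTorch realization of Eq. (1) is sketched below. The choice n = 3, the single-channel convolutions, and the placement of batch normalization after concatenation are simplifying assumptions; the weight-distribution sub-module is reduced to the Conv2D, RELU, Maxpool, Softmax chain named in Eq. (1).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLVFE(nn.Module):
    """Sketch of the multi-___location vibration feature extractor (Eq. 1).
    Layer sizes are illustrative, not the authors' exact settings."""
    def __init__(self, n=3):
        super().__init__()
        # One unshared set of multi-scale Conv1d (kernel sizes 1, 3, 5, ...) per ___location.
        self.convs1 = nn.ModuleList(nn.Conv1d(1, 1, 2*i - 1, padding=i - 1) for i in range(1, n + 1))
        self.convs2 = nn.ModuleList(nn.Conv1d(1, 1, 2*i - 1, padding=i - 1) for i in range(1, n + 1))
        self.w = nn.Parameter(torch.ones(n))                        # trainable scale w_i per branch
        self.bn = nn.BatchNorm2d(2 * n)
        self.weight_conv = nn.Conv2d(2 * n, 2 * n, kernel_size=2)   # weight-distribution sub-module

    def _branch(self, convs, x):                    # x: (batch, 1024)
        x = x.unsqueeze(1)                          # -> (batch, 1, 1024)
        feats = [conv(x) * self.w[i] for i, conv in enumerate(convs)]
        return F.gelu(torch.cat(feats, dim=1))      # -> (batch, n, 1024)

    def forward(self, x1, x2):
        f1 = self._branch(self.convs1, x1)
        f2 = self._branch(self.convs2, x2)
        fmap = torch.cat([f1, f2], dim=1).view(x1.size(0), -1, 32, 32)  # reshape 1024 -> 32x32
        fmap = self.bn(fmap)
        # Dynamic per-channel weights: Conv2D -> RELU -> Maxpool -> Softmax.
        w = F.relu(self.weight_conv(fmap))
        w = F.adaptive_max_pool2d(w, 1)             # (batch, 2n, 1, 1)
        w = torch.softmax(w, dim=1)
        return fmap * w                             # weighted fusion of ___location features

fmap = MLVFE()(torch.randn(4, 1024), torch.randn(4, 1024))  # (4, 6, 32, 32)
```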

Deep residual shrinkage network with channel-wise thresholds

Soft thresholding is a commonly used denoising method in signal processing40,41. The general workflow converts the original input signal, through a series of processing steps, into a feature space in which near-zero values are not significant; a soft threshold then sets these near-zero features to zero. The classic wavelet thresholding method in signal processing designs a task-oriented filter to convert beneficial information into strongly positive or negative features while converting noise and other unbeneficial information into near-zero features, thereby eliminating the influence of noise. This approach is effective, but designing a task-oriented filter requires highly specialized signal processing knowledge and has long been challenging. Thanks to the rapid development of deep learning, whose basic working principle is to use the gradient descent algorithm to adaptively fit data in multiple dimensions, a role that coincides with the feature engineering step in soft thresholding42, the DRSN series of networks combining soft thresholding with deep learning came into being. The structural framework of Deep Residual Shrinkage Networks with Channel-wise Thresholds (DRSN-CW) is shown in Fig. 3.

Fig. 3

Structure of DRSN-CW. Where K is the number of convolution kernels in the convolution layer, M is the number of neurons in the FC network, and C, W, and 1 in C × W × 1 represent the number of channels, width, and height of the feature map, respectively. X, Z, and \(\alpha\) are indicators of the feature map used when determining the threshold.

The basic principle function of soft threshold is shown in Eq. (2):

$$\begin{aligned} y = \left\{ {\begin{array}{*{20}{c}} {x - \tau }& {x > \tau }\\ 0& { - \tau \le x \le \tau }\\ {x + \tau }& {x < - \tau } \end{array}} \right. \end{aligned}$$
(2)

Where x is the input feature signal, y is the output feature signal, and \(\tau\) is the threshold, a positive parameter. The soft threshold, akin to the RELU function but not identical, sets features near zero to zero while preserving both positive and negative features.

Deep learning uses the gradient descent method to gradually approach the optimal point, so the threshold method can be analyzed from the perspective of gradient descent. As shown in Eq. (3), the derivative of the output y with respect to the input x is either 0 or 1. This property effectively mitigates gradient vanishing and exploding during optimization.

$$\begin{aligned} \frac{{\partial y}}{{\partial x}} = \left\{ {\begin{array}{*{20}{c}} 1& {x > \tau }\\ 0& { - \tau \le x \le \tau }\\ 1& {x < - \tau } \end{array}} \right. \end{aligned}$$
(3)

The DRSN-CW module is composed of three "BN" and "RELU" operations, two convolutional layers, two fully connected layers, a global average pooling layer, and a sigmoid activation function. As shown in Fig. 3, the green part is the dedicated module for soft threshold estimation. In this module, the absolute values of the inputs are passed through global average pooling to generate a one-dimensional vector. This vector is then processed by two fully connected layers, each with the same number of neurons as the input feature map has channels, to obtain a scaling parameter vector. The scaling parameter vector is passed through the sigmoid function; using the expression in Eq. (4), it is transformed into a one-dimensional vector whose elements lie in (0, 1).

$$\begin{aligned} {\alpha _c} = \frac{1}{{1 + {e^{ - {z_c}}}}} \end{aligned}$$
(4)

where \({z_c}\) is the feature of the cth neuron and \({\alpha _c}\) is the cth scaling parameter. After that, the thresholds are calculated as in Eq. (5):

$$\begin{aligned} {\tau _c} = {\alpha _c} \cdot {\mathop {\text {average}}\limits _{i,j}}\, \left| {{x_{i,j,c}}} \right| \end{aligned}$$
(5)

where \({\tau _c}\) is the threshold of the cth channel of the feature map, and i, j, and c are the indexes of the feature map's width, height, and channel. The adaptively learned threshold thus remains within a reasonable range. The threshold filter is then applied to the input signal. Additionally, to avoid gradient vanishing and exploding in the network, the classical identity mapping technique from residual networks is employed.
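The channel-wise threshold estimation of Eqs. (2), (4), and (5) can be sketched as follows. This is a minimal illustration: the convolutional trunk and the residual identity path of the full DRSN-CW module (Fig. 3) are omitted, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def soft_threshold(x, tau):
    """Eq. (2): shrink values in [-tau, tau] to zero, keep sign elsewhere."""
    return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)

class ChannelThreshold(nn.Module):
    """Sketch of DRSN-CW's threshold-estimation branch (Eqs. 4-5)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels), nn.BatchNorm1d(channels),
            nn.ReLU(), nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (batch, C, W, H)
        avg_abs = x.abs().mean(dim=(2, 3))     # per-channel mean of |x| (global average pooling)
        alpha = self.fc(avg_abs)               # scaling parameters in (0, 1), Eq. (4)
        tau = (alpha * avg_abs).view(*x.shape[:2], 1, 1)   # channel-wise thresholds, Eq. (5)
        return soft_threshold(x, tau)          # apply channel-wise shrinkage

out = ChannelThreshold(6)(torch.randn(4, 6, 32, 32))
```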

Multi-scale multi-headed self-activation mechanism

The activation function is the key to the adaptive learning of a deep learning network; an appropriate activation function ensures the mapping ability of the neural network in multi-dimensional feature subspaces. Scholars currently divide activation functions into two major categories: the linear rectification function RELU ( \(\text {RELU}\left( x \right) = \text {max}\left( {0,x} \right)\) ) and its variants, and the Swish ( \(\text {Swish} = x \cdot \sigma (x)\) , \(\sigma (x) = \frac{1}{{1 + {e^{ - x}}}}\) ) function, which achieves high accuracy in neural network frameworks43. The activation function \({H_\beta }(x)\) of the self-activation mechanism adopted in this paper is a smooth approximation of the RELU function combined with the accuracy advantages of the Swish function, and \({H_\beta }(x)\) is calculated by Eq. (6).

$$\begin{aligned} \begin{array}{l} {H_\beta }({\eta _a}(x),{\eta _b}(x))\\ = {\eta _a}(x) \cdot \frac{{{e^{\beta {\eta _a}(x)}}}}{{{e^{\beta {\eta _a}(x)}} + {e^{\beta {\eta _b}(x)}}}} + {\eta _b}(x) \cdot \frac{{{e^{\beta {\eta _b}(x)}}}}{{{e^{\beta {\eta _a}(x)}} + {e^{\beta {\eta _b}(x)}}}}\\ = ({\eta _a}(x) - {\eta _b}(x)) \cdot \sigma [\beta ({\eta _a}(x) - {\eta _b}(x))] + {\eta _b}(x)\\ = {H_\beta }(x) \end{array} \end{aligned}$$
(6)

Where \(\beta\) is a learnable parameter, and \({\eta _a}(x)\), \({\eta _b}(x)\) are the signal characteristics after mapping through different receptive fields. When \(\beta \rightarrow \infty\), \({\eta _a}(x) = 0\), and \({\eta _b}(x) = x\), \({H_\beta }(x) = \text {max}(0,x) = \text {RELU}(x)\) holds, while \({H_\beta }(x) = x \cdot \sigma (x) = \text {Swish}(x)\) holds if \({\eta _a}(x) = 0\), \(\beta = 1\), and \({\eta _b}(x) = x\) are satisfied. That is, the self-activation function adopted in this paper can select the activation form suited to a specific task through adaptive parameters; RELU and Swish are both special cases of \({H_\beta }(x)\)43. In addition, \({H_\beta }(x)\) differs from the Swish function, which has fixed upper and lower bounds. If \({\eta _a}(x) = {p_1} \cdot x\) and \({\eta _b}(x) = {p_2} \cdot x\), with \({p_1}, {p_2} \in {\mathbb {R}}\), substituting \({\eta _a}(x)\) and \({\eta _b}(x)\) into Eq. (6) gives \({H_\beta }(x) = ({p_1} - {p_2})x \cdot \sigma (\beta x) + {p_2} \cdot x\). Calculating the first and second derivatives and solving \(\frac{{{d^2}}}{{d{x^2}}}[{H_\beta }(x)] = 0\) for \(\beta > 0\), the maximum and minimum values of the first derivative \(\frac{d}{{dx}}[{H_\beta }(x)]\) can be obtained, as shown in Eq. (7):

$$\begin{aligned} \begin{array}{l} \text {Maxima}(\frac{d}{{dx}}[{H_\beta }(x)]) \approx 1.0998{p_1} - 0.0998{p_2}\\ \text {Minima}(\frac{d}{{dx}}[{H_\beta }(x)]) \approx 1.0998{p_2} - 0.0998{p_1} \end{array} \end{aligned}$$
(7)

As shown in Eq. (7), modifying the values of \({p_1}\) and \({p_2}\) adjusts the gradient range of the activation function in the model, namely the activation degree of the neurons. The value of \(\beta \in [0, + \infty )\) (the larger \(\beta\) is, the smaller the x needed to bring \({H_\beta }(x)\) close to its bounds, and vice versa) changes the speed at which x approaches the upper/lower bounds. Therefore, choosing a reasonable parameter \(\beta\) can improve the overall performance of the model. For example, in44, \(\beta\) is set as a trainable parameter and the performance of the model improves. In our proposed module, \(\beta\) is likewise a learnable parameter determined by network training.
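A minimal sketch of the self-activation function with \(\eta_a(x) = p_1 x\) and \(\eta_b(x) = p_2 x\), following Eq. (6) directly, is given below. Making \(p_1\), \(p_2\), and \(\beta\) per-channel learnable parameters is our assumption for illustration.

```python
import torch
import torch.nn as nn

class SelfActivation(nn.Module):
    """Sketch of H_beta (Eq. 6) with eta_a(x) = p1*x, eta_b(x) = p2*x.
    With p1 = 1, p2 = 0: beta -> infinity recovers RELU, beta = 1 recovers Swish."""
    def __init__(self, channels):
        super().__init__()
        self.p1 = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.p2 = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

    def forward(self, x):                       # x: (batch, C, H, W)
        ea, eb = self.p1 * x, self.p2 * x       # eta_a, eta_b as linear maps
        d = ea - eb
        return d * torch.sigmoid(self.beta * d) + eb   # Eq. (6), third line

y = SelfActivation(6)(torch.randn(4, 6, 32, 32))
```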

The activation degree of the neurons can be adaptively adjusted in the Multi-Scale Multi-headed Self-Activation module (MSC-SA) through three trainable parameters, similar to the inductive bias of the transformer, so as to avoid the limitations of a fixed activation function in feature extraction and gradient calculation. Building on this, to further enhance the model's robustness, we employ multi-scale convolution combined with point convolution to extract channel information. This information serves as a parameter regulating neuron activation levels, creating a new self-activation function, as illustrated in Fig. 4.

Fig. 4

Structure of MSC-SA. \({n_1}\) indicates the number of repetitions of \(\text {Msc-Sa}_{i}\) (\(i = 1,2, \cdots ,{n_1}\)).

To keep the residual connection feature shapes consistent in the network design, the input feature channels are adjusted through down-sampling multi-scale convolution before MSC-SA extracts the input features. Then, leveraging the number of channel heads, we apply operations akin to the multi-head self-attention mechanism in transformers: the self-activation function is computed for each head of the features in parallel.

Multi-headed layer attention mechanism

The MLSCA-CW model uses three multi-head self-activation modules with multi-scale receptive fields to extract features of the signal from shallow to deep. In previous studies, scholars used only the final deep features as the discriminant basis for the classifier, so shallow signal features were discarded outright, even though they may contain information beneficial to prediction. Therefore, this paper proposes an approach that seamlessly integrates shallow and deep feature signals using a multi-head attention mechanism. This enables the model to adaptively leverage both shallow and deep features for comprehensive information extraction, enhancing robustness and improving prediction accuracy. After noise reduction by the DRSN-CW module, both shallow and deep features are input into a linear layer, as shown in Fig. 1, to ensure a uniform feature shape across all outputs. To enhance the attention mechanism's ability to learn the relationship between deep and shallow features, the input is processed through an embedding layer, which captures the complex hidden relationships between them. The process can be expressed as Eq. (8).

$$\begin{aligned} x_{{f_i}}^{emb} = E{x_{{f_i}}} \end{aligned}$$
(8)

Where \({x_{{f_i}}}\) represents the input features, \(x_{{f_i}}^{emb}\) represents the embedding result, and E is a trainable matrix that nonlinearly maps \({x_{{f_i}}}\) to \(x_{{f_i}}^{emb}\). The subscript \({f_i}\) (\(i = 1,2, \cdots\)) denotes the network layer index.

The MLSCA-CW model applies the attention mechanism in multi-head form to capture the coupling between all features. Taking the input \({a_1}\) as an example, the multi-head (head = 4) mechanism produces four outputs \(b_1^{h1},b_1^{h2},b_1^{h3},b_1^{h4}\). To obtain the output \(b_1\) corresponding to \({a_1}\), the multi-head attention mechanism first concatenates the four outputs \(b_1^{h1},b_1^{h2},b_1^{h3},b_1^{h4}\); a single-layer fully connected neural network without a nonlinear activation layer then applies a linear transformation to the concatenation to obtain \(b_1\). The complete mathematical operation of this process is as follows:

Let \(z_{{f_i}}^{\left( {in} \right) }\) denote the input of the multi-layer attention module, which is also the output of the embedding, \(x_{{f_i}}^{emb}\); thus \(z_{{f_i}}^{\left( {in} \right) } = x_{{f_i}}^{emb}\). The q/k/v vectors are calculated by Eq. (9).

$$\begin{aligned} \begin{array}{l} q_{{f_i}}^{\left( a \right) } = W_Q^{\left( a \right) }LN\left( {z_{{f_i}}^{\left( {in} \right) }} \right) \in {\mathbb {R}^{{D_h}}}\\ k_{{f_i}}^{\left( a \right) } = W_K^{\left( a \right) }LN\left( {z_{{f_i}}^{\left( {in} \right) }} \right) \in {\mathbb {R}^{{D_h}}}\\ v_{{f_i}}^{\left( a \right) } = W_V^{\left( a \right) }LN\left( {z_{{f_i}}^{\left( {in} \right) }} \right) \in {\mathbb {R}^{{D_h}}} \end{array} \end{aligned}$$
(9)

Where LN(·) denotes layer normalization, \(a = 1,...,A\) is the index of the attention head, and A is the total number of attention heads. The attention weights \(\alpha _{{f_i}}^{\left( a \right) }\) are calculated through the scaled dot product, as shown in Eq. (10).

$$\begin{aligned} \alpha _{{f_i}}^{\left( a \right) } = \text {SM}\left( \frac{{q_{{f_i}}^{\left( a \right) }}^{T}}{\sqrt{{D_h}}}\left[ {\left\{ {k_{{f_i}'}^{\left( a \right) }} \right\} _{{f_i}' = 1,\ldots ,9}} \right] \right) \end{aligned}$$
(10)

Where SM is the softmax activation function.

In the attention-mechanism encoding process, as shown in Eq. (11), \(s_{{f_i}}^{\left( a \right) }\) is obtained by taking the weighted sum of the previously computed v vectors using the attention coefficients of each attention head.

$$\begin{aligned} s_{{f_i}}^{\left( a \right) } = \sum \limits _{{f_i}' = 1}^N {\alpha _{{f_i},{f_i}'}^{\left( a \right) }} v_{{f_i}'}^{\left( a \right) } \end{aligned}$$
(11)

Then, as shown in Eq. (12), these vectors from all heads are successively projected to obtain the attention representation results of various hierarchical features within the network.

$$\begin{aligned} {z'}_{ {f_i} }^{\left( SA \right) } = {W_o}\left[ {\begin{array}{*{20}{c}} {s_{{f_i}}^{\left( {1} \right) }}\\ \vdots \\ {s_{{f_i}}^{\left( {A} \right) }} \end{array}} \right] \end{aligned}$$
(12)
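Eqs. (8) to (12) can be condensed into the following sketch of the multi-headed layer attention over shallow and deep feature vectors. The embedding dimension D = 64 and A = 4 heads are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Sketch of the multi-headed layer attention (Eqs. 8-12): feature
    vectors from each network level are embedded, then attended over."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Linear(dim, dim, bias=False)   # trainable matrix E, Eq. (8)
        self.ln = nn.LayerNorm(dim)                    # LN() in Eq. (9)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False) # W_Q, W_K, W_V stacked
        self.out = nn.Linear(dim, dim)                 # W_o, Eq. (12)
        self.heads, self.dh = heads, dim // heads

    def forward(self, feats):                          # feats: (batch, L, dim), L feature levels
        b, L, d = feats.shape
        z = self.ln(self.embed(feats))
        q, k, v = self.qkv(z).chunk(3, dim=-1)         # Eq. (9)
        # split into heads: (batch, heads, L, dh)
        q, k, v = (t.view(b, L, self.heads, self.dh).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)  # Eq. (10)
        s = attn @ v                                   # weighted sum, Eq. (11)
        s = s.transpose(1, 2).reshape(b, L, d)         # concatenate heads
        return self.out(s)                             # projection, Eq. (12)

fused = LayerAttention()(torch.randn(8, 3, 64))        # 3 levels: shallow to deep
```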

Experiment

This study develops the deep neural networks on the PyTorch deep learning platform, leveraging autograd and GPUs for accelerated tensor computation, with Windows 10 as the operating system.

Dataset

The data sources of this paper are the Case Western Reserve University (CWRU) Bearing Fault Diagnosis Dataset45, the Intelligent Maintenance Systems (IMS) Dataset46, and the University of Ottawa (UofO) variable-speed dataset48,49. The CWRU dataset contains pure vibration data of bearings under different fault modes and working conditions and is widely used; in this paper, it is applied to the research on bearing fault diagnosis under different kinds of strong noise. The working environment of the bearings in the IMS dataset is closer to actual conditions, so this paper applies it to the study of the comprehensive performance of the model.

a) Case Western Reserve University (CWRU) Dataset

The dataset was collected by the Case Western Reserve University Bearing Data Center, and the bearing failures were man-made failures induced by Electrical Discharge Machining (EDM). Specifically, faults with diameters of 0.007, 0.014, and 0.021 inches were introduced into the inner ring, outer ring, and rolling elements of the bearings. The bearings were then placed in the test motor, and data were recorded using vibration sensors at the motor's drive end (DE) and fan end (FE). Throughout the experiment, the motor load ranged from 0 to 3 hp, the motor speed varied slightly within 1720–1797 RPM, and the vibration signals were sampled at 12 kHz and 48 kHz. In our experiments, the DE and FE data with a fault size of 0.007 inch, a sampling frequency of 12 kHz, and a load of 0 hp were selected45.

b) Intelligent Maintenance Systems (IMS) Dataset

The dataset is life-cycle data for rolling bearings produced by the NSF I/UCR Intelligent Maintenance Systems Center at the University of Cincinnati, USA, with support from Rexnord, Inc., Milwaukee, WI46.

As shown in Fig. 5, the AC motor rotates at a constant speed of 2000 rpm and is connected to the shaft by a friction belt. Four Rexnord ZA-2115 double-row roller bearings were installed on the rotating shaft, a radial load of 6000 pounds was applied to the bearings through a spring mechanism, and all bearings are force-lubricated. The vibration signals of the bearings in the X and Y directions were measured with high-precision quartz accelerometers. All bearings were run beyond their design service life (more than 100 million revolutions) until damage occurred. This paper treats the data collected by the accelerometers in the two directions as bearing signals at two different locations. In our experiment, the health status of the bearing dataset was classified into four categories, namely outer ring failure (OR), inner ring failure (IR), rolling sphere failure (BF), and healthy bearing (H).

Fig. 5

The IMS system47.

c) University of Ottawa (UofO) Dataset

The experiments are conducted using a SpectraQuest machinery fault simulator (MFS-PK5M). The motor drives the shaft, and its rotational speed is controlled by an AC drive. Two ER16K ball bearings support the shaft; the left bearing is healthy, while the right one is replaced with bearings in various health conditions. An ICP accelerometer (Model 623C01) is mounted on the housing of the experimental bearing to collect vibration data, and an incremental encoder (EPC model 775) measures the shaft's rotational speed. The datasets are described in detail in refs.48,49.

Experimental setup

To convert the one-dimensional vibration signals into two-dimensional feature maps, a sample in this paper consists of 1024 vibration sampling points intercepted at the same time at two locations. In the CWRU dataset, we treat the drive end (DE) and fan end (FE) as two different locations. Samples are intercepted by a sliding window with a step size of 128, yielding 4744 samples. The numbers of samples for the health status, rolling element fault, outer ring fault, and inner ring fault are 1186, 1186, 1186, and 1186, respectively.

In the IMS dataset, referring to the analysis of the IMS data by Hai Qiu et al.47, we selected data from bearing 1 of dataset 1 (23/10/2003 12:06:24 to 09/11/2003 13:05:58) as healthy, data from bearing 3 of dataset 1 (18/11/2003 08:22:30 to 24/11/2003 23:57:32) as inner ring failures, data from bearing 4 of dataset 1 (18/11/2003 08:22:30 to 24/11/2003 23:57:32) as rolling element failures, and data from bearing 3 of dataset 3 (12/4/2004 14:51:57 to 18/4/2004 02:42:55) as outer ring failures. The different channel data of bearings 1, 3, and 4 in dataset 1 are regarded as two locations; in dataset 3, the data of bearing 3 are duplicated to serve as the two-___location data. Samples are intercepted by a sliding window with a step size of 1024, yielding 4095 samples. The numbers of samples for the health status, rolling element fault, outer ring fault, and inner ring fault are 1023, 1123, 1123, and 1124, respectively.

In the UofO dataset, the 01 datasets collected under the healthy, inner ring failure, and outer ring failure conditions during bearing speed increase are selected. In each 01 bearing dataset, 'Channel_1' is the vibration data measured by the accelerometer and 'Channel_2' is the speed data measured by the encoder; this paper regards 'Channel_1' and 'Channel_2' as data from two different positions. Using a sliding window with a step size of 1024, 5860 samples are obtained. The numbers of samples for the health status, outer ring fault, and inner ring fault are 1953, 1953, and 1954, respectively.
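The sliding-window segmentation used for all three datasets can be sketched as follows; the recording length and the random data are placeholders for illustration.

```python
import numpy as np

def sliding_window(signal, length=1024, step=128):
    """Cut a 1-D vibration signal into overlapping samples of `length`
    points, advancing by `step` (128 for CWRU; 1024 for IMS and UofO)."""
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step : i * step + length] for i in range(n)])

# Two-___location input: segment both channels with the same windows so each
# sample pairs simultaneous 1024-point slices (dummy stand-in recordings).
x_de, x_fe = np.random.randn(2, 121_000)
samples = np.stack([sliding_window(x_de), sliding_window(x_fe)], axis=1)
print(samples.shape)                        # (num_samples, 2, 1024)
```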

All samples are divided into training, validation, and test sets in the proportion 6:2:2. To ensure the credibility and reproducibility of our experimental results, we fixed the random seed. During training, we set the learning rate to 0.001 and feed up to 64 samples to the model at a time (i.e., batch size = 64). The loss function is cross-entropy loss, and we choose the Adam optimizer32 to optimize the network.
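The training configuration reduces to the following sketch. The seed value, the dummy data, and the placeholder model are assumptions; only the learning rate, batch size, loss function, and optimizer follow the text above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)                          # fixed seed (actual value not stated in the paper)
x1, x2 = torch.randn(2, 256, 1024)            # dummy two-___location samples
y = torch.randint(0, 4, (256,))               # dummy fault labels (4 classes)
loader = DataLoader(TensorDataset(x1, x2, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(2048, 128), nn.GELU(), nn.Linear(128, 4))  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam, lr = 0.001
criterion = nn.CrossEntropyLoss()                           # cross-entropy loss

for a, b, target in loader:                   # one epoch over the training set
    optimizer.zero_grad()
    logits = model(torch.cat([a, b], dim=1))  # placeholder forward pass
    criterion(logits, target).backward()
    optimizer.step()
```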

Three evaluation indicators were selected, namely prediction accuracy (ACC), F1-score, and Recall; the corresponding formulas are as follows:

$$\begin{aligned} ACC= & \frac{1}{C}\sum \limits _{i = 1}^C {\frac{{T{P_i} + T{N_i}}}{{T{P_i} + F{P_i} + T{N_i} + F{N_i}}}} \end{aligned}$$
(13)
$$\begin{aligned} F1= & \frac{1}{C}\sum \limits _{i = 1}^C {\frac{{2T{P_i}}}{{2T{P_i} + F{P_i} + F{N_i}}}} \end{aligned}$$
(14)
$$\begin{aligned} Recall= & \frac{1}{C}\sum \limits _{i = 1}^C {\frac{{T{P_i}}}{{T{P_i} + F{N_i}}}} \end{aligned}$$
(15)

Where \(T{P_i}\), \(T{N_i}\), \(F{P_i}\), and \(F{N_i}\) represent the numbers of true positives, true negatives, false positives, and false negatives for class i, respectively, and C denotes the number of classes.
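Eqs. (13) to (15) can be computed from a confusion matrix as follows; the matrix values in the example are toy numbers.

```python
import numpy as np

def macro_metrics(conf):
    """Macro-averaged ACC, F1, and Recall (Eqs. 13-15) from a confusion
    matrix `conf` where conf[i, j] counts true class i predicted as j."""
    total = conf.sum()
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                 # predicted as class i but wrong
    fn = conf.sum(axis=1) - tp                 # true class i but missed
    tn = total - tp - fp - fn
    acc = np.mean((tp + tn) / (tp + fp + tn + fn))   # Eq. (13)
    f1 = np.mean(2 * tp / (2 * tp + fp + fn))        # Eq. (14)
    recall = np.mean(tp / (tp + fn))                 # Eq. (15)
    return acc, f1, recall

conf = np.array([[50, 2, 0], [3, 45, 2], [0, 1, 49]])   # toy 3-class example
print(macro_metrics(conf))
```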

To discuss the effectiveness of the proposed MLSCA-CW network, we next select six benchmark models for comparison.

  1.

    LR model: the LR model is a logistic regression model, often used for problems where the dependent variable is categorical, and it can handle multi-class problems. It is a classic model in the field of machine learning, with good interpretability, low computational cost, and fast training.

  2.

    MC-CNN model50: this network is widely employed in fault diagnosis because it incorporates multi-scale convolution into the traditional CNN architecture, enhancing the model's robustness. It is a multi-scale method with excellent performance.

  3.

    WDCNN model51: It is a widely used bearing fault diagnosis method. It uses the wide convolution kernel of the first convolution layer to extract features and suppress noise.

  4.

    Multiscale inner product model52: a deep learning method based on multi-scale inner products and locally connected features for fault diagnosis from vibration signals, with a strong ability to collect multi-scale local information. It is a multi-scale method with excellent performance.

  5.

    SANet model53: multi-scale convolution is used for feature extraction. Inspired by the transformer architecture, it incorporates a self-activation mechanism and offers strong overall accuracy and robustness. It is a multi-scale method with excellent performance.

  6.

    The QCNN model54: a convolutional neural network constructed from quadratic neurons, from which an attention mechanism is derived; it has a strong filtering effect on noise.

This study conducts preliminary tests of the model's performance on the CWRU and IMS datasets. Meanwhile, to study the model's resistance to multiple and strong noises, we added Gaussian noise, Laplacian noise, Brownian noise, and Violet noise to the CWRU original signals with SNR ( \(SNR = 10 \times {\log _{10}}\left( {\frac{{{P_{signal}}}}{{{P_{noise}}}}} \right)\) ) ranging from −9 to 9 dB at 3 dB intervals. Subsequently, to test the comprehensive performance of the model, we introduced the IMS dataset for further testing.
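Noise injection at a prescribed SNR follows directly from the SNR definition above. The sketch below scales a noise sequence so that the target SNR holds; the Brownian and violet constructions (integrated and differentiated white noise) and the unit-power normalization of the mixture are our assumptions about how such noises can be generated.

```python
import numpy as np

def add_noise(signal, noise, snr_db):
    """Scale `noise` so that 10*log10(P_signal / P_noise) equals `snr_db`,
    then add it to `signal` (the paper's SNR definition)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 100, 1024))          # toy vibration signal
gauss = rng.normal(size=1024)                  # Gaussian noise
laplace = rng.laplace(size=1024)               # Laplacian noise
brown = np.cumsum(rng.normal(size=1024))       # Brownian noise (integrated white noise)
violet = np.diff(rng.normal(size=1025))        # violet noise (differentiated white noise)

x_noisy = add_noise(x, gauss, snr_db=-9)       # strong Gaussian noise at -9 dB
# Equal-proportion mixture: normalize each noise to unit power, then sum.
mixed = sum(n / np.sqrt(np.mean(n ** 2)) for n in (gauss, laplace, brown, violet))
x_mixed = add_noise(x, mixed, snr_db=-9)
```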

Results and discussion

Ablation experiment

To assess the efficacy of the proposed modules, we conducted comprehensive ablation experiments on the network modules in the presence of violet noise with SNRs ranging from −9 dB to 9 dB. The findings are presented in Tables 1 and 2: Table 1 reports the ablation experiments of the two-locations model, and Table 2 those of the single-___location model (drive end (DE) data). The MLSCA model removes the soft threshold module from the MLSCA-CW model, MLSCA-CW-LSA eliminates the layer attention mechanism at the end of the structure, MLSCA-CW-SCA removes the multi-scale self-activation mechanism, and MLSCA-FFT replaces the soft threshold module with a fast Fourier transform (FFT).

According to the analysis of Table 1, removing the soft threshold noise reduction module from the MLSCA-CW model significantly decreases its noise resistance: at −9 dB SNR, the MLSCA model's fault classification accuracy is only 68.487% and its F1-score only 57.745%, so its classification is strongly affected by noise. In comparison, the proposed MLSCA-CW model achieves a classification accuracy of 92.017% and an F1-score of 87.802% under the same conditions. In addition, in a low-noise environment with an SNR of 9 dB, the classification accuracy of the MLSCA model without the soft-thresholding module is only 87.815%, which further suggests that the soft-thresholding module effectively filters out the majority of the noise while retaining the main fault features. The MLSCA-CW-LSA model removes the layer attention module and therefore cannot effectively utilize low-dimensional features; its classification can refer only to the final high-dimensional features and ignores the important information that shallow features may contain. From the results, after removing layer attention, the high-dimensional features extracted by the model can still support deep nonlinear classification when the SNR is greater than 0 dB, and the model's classification accuracy approaches 100%.

However, as the noise intensity continues to increase, the performance of the MLSCA-CW-LSA model begins to decrease and gradually falls below that of the proposed MLSCA-CW model. Clearly, as the noise strengthens, the performance of the model without the layer attention mechanism decreases more sharply, which further shows that the designed layer attention mechanism integrates high-dimensional and low-dimensional vibration features well, gives the classifier more comprehensive and complete information, and ultimately greatly improves model performance. The MLSCA-CW-SCA model removes the multi-scale self-activation mechanism. From Table 1, compared with the proposed complete model, removing the self-activation mechanism causes a serious performance decline: even under very weak noise (SNR = 9 dB), the model reaches only 92.857% prediction accuracy, and in a strong noise environment with an SNR of −9 dB, the accuracy decreases to 78.992%. This shows that our improved self-activation mechanism has a positive impact on the accuracy and robustness of the MLSCA-CW model. FFT is an effective means of reducing noise in traditional signal processing, and the MLSCA-FFT model uses FFT to replace the soft threshold module in the MLSCA-CW model. The experimental results show that the MLSCA-FFT model loses much of its noise filtering ability for both single-___location and multi-position signals; in the strong noise environment, its performance decreases by more than 30%. This further shows that, compared with the traditional FFT, soft thresholding has great advantages in filtering noise from bearing data, because soft thresholding dynamically adjusts the processing of noise through adaptive channel-specific thresholds, maintaining higher diagnostic accuracy and stability, especially in real-time, dynamically changing noise environments. In contrast, FFT is based on a fixed frequency-___domain filter, and its excessive suppression of specific frequencies in a high-noise environment may lead to the loss of important signals. This further verifies the advanced nature of the methods used in this paper.

Analyzing the single-___location ablation data in Table 2, the pattern is consistent with that of the two-locations model in Table 1. However, each experimental result for the single ___location is lower than the corresponding two-locations result. This shows that the two-locations feature extractor effectively supplements the single-___location features, enriching the fault features extracted by the model and improving the accuracy of bearing fault diagnosis.

Overall, the soft-thresholding module is essential for effectively filtering noise and preserving fault features, particularly in high-noise environments. The layer attention mechanism plays a crucial role in integrating both low-dimensional and high-dimensional features, enabling the model to focus on critical patterns in noisy data. The multi-scale self-activation mechanism enhances the model’s robustness by dynamically activating important features across multiple scales, which is especially beneficial under varying noise conditions. Additionally, utilizing multi-___location data provides richer context and features, improving fault classification accuracy by capturing a more comprehensive set of fault information from multiple sensor locations.

In summary, the soft threshold module, layer attention mechanism, improved self-activation mechanism, and two-___location feature extractors employed by the MLSCA-CW model all contribute significantly to its performance and robustness.

Table 1 Ablation experimental results of models under different intensities of Violet noise (CWRU dataset, two locations).
Table 2 Ablation experimental results of models under different intensities of Violet noise (CWRU dataset, single ___location).

Comparative experiments

Model performance without noise

To evaluate model performance without noise, tests were conducted on the CWRU and IMS datasets. As shown in Table 3, except for the LR model, all models achieved 100% prediction accuracy on the CWRU dataset; on the IMS dataset, the MLSCA-CW, MC-CNN, and QCNN models reached 100% prediction accuracy.

Table 3 Comprehensive performance of each model.

In order to evaluate the performance of the models more intuitively, we further visualized the prediction results on the IMS dataset and obtained the confusion matrices shown in Fig. 6.

Fig. 6

Confusion matrix for all methods in the IMS dataset. IF, OF, BF and H respectively represent inner ring fault, outer ring fault, ball fault and health. (a) Confusion matrix of MLSCA-CW, MC-CNN and QCNN models. (b) Confusion matrix of LR model. (c) Confusion matrix of WDCNN model. (d) Confusion matrix of Multiscale inner product model. (e) Confusion matrix of SANet model.

According to the results in Fig. 6, the MLSCA-CW, MC-CNN, and QCNN models show excellent performance, while the other models perform modestly. Specifically, the LR model is extremely inaccurate in judging inner ring faults; the WDCNN model fails only on part of the outer ring fault judgments; the Multiscale inner product model is partially unclear in judging inner ring faults and healthy bearings; and the SANet model often confuses outer ring fault and healthy bearing signals.

In noise-free environments, the MLSCA-CW, MC-CNN, and QCNN models all achieved 100% prediction accuracy. However, bearings often operate amidst various types of intense noise in real-world conditions. Therefore, we tested the performance of each model under various types and intensities of noise.

Model performance under various and intense noise conditions

a) CWRU dataset performance

Because various types of intense noise are present in the operating environment of bearings, this paper explores model performance in complex environments containing multiple types of background noise. The proposed model and the six comparative benchmark models were tested on the same dataset (CWRU) with the same procedure, and the corresponding results are presented in Tables 4 and 5.

Analyzing the data in Table 4, the MLSCA-CW (Two locations) model has excellent anti-noise ability: at −9 dB SNR and in five different noise environments, its classification accuracy exceeds 90%. Mixed noise is an equal-proportion mixture of Gaussian noise, Laplacian noise, Brownian noise, and Violet noise. Furthermore, under the same noise conditions, the MLSCA-CW model outperforms the other comparison models; notably, in the Gaussian and Laplacian noise environments, its classification accuracy reaches 100% at all tested SNRs. We also found that the classical LR model cannot effectively learn overly complex nonlinear relationships. Except for significant degradation in certain noise environments at SNRs below −3 dB (e.g., SANet performs poorly under Brownian and Violet noise), most models perform well in low-noise scenarios; however, the accuracy of the benchmark models decreases significantly in strong noise environments between −9 dB and −6 dB. MC-CNN combines traditional convolutions to realize multi-scale feature extraction, and features collected over multiple receptive fields are more complete. However, under high noise intensity (e.g., SNR below −6 dB) the original signal is annihilated, and features extracted using only multi-scale convolution contain many strong noise components, so the model's prediction performance decreases sharply.

The WDCNN model uses a wide convolution kernel in the first convolutional layer to extract features and suppress noise, and the results show that it does suppress noise and predicts well under Laplacian noise. However, for other kinds of noise and at high noise intensities, its performance degrades substantially; its classification accuracy under Gaussian noise at −9 dB SNR even drops to 60.504%. The Multiscale inner product model uses multi-scale inner products and locally connected feature extraction, which does filter noise well; its performance exceeds that of the MLSCA-CW model under −9 dB Brownian noise, but gaps remain in the other cases. This shows that multi-scale inner products combined with correlation-based noise filtering are a promising approach, and that our MLSCA-CW design, which combines multi-scale convolution and a layer attention mechanism with corresponding feature enhancement, is well founded and immediately effective. The SANet network mainly uses a depth-separated self-activation mechanism, which often performs better under some strong noises and provides a new idea for improving network robustness; however, the experimental results indicate that this mechanism still needs to be combined with other means to handle the multiple types of strong noise present in the actual working environment of bearings. The QCNN model is essentially an attention-based model, and the test results show that its overall performance is relatively good; nevertheless, there remains considerable room for improvement under Violet noise and the other noise types in highly noisy environments with SNRs of −9 dB and −6 dB.

These experimental results show that proper use of the attention mechanism helps improve model robustness. Based on the above analysis, we have good reason to believe that the MLSCA-CW model, which combines the multi-scale multi-___location feature extractor, soft threshold noise filtering, the multi-scale self-activation mechanism, and the layer attention mechanism, significantly improves prediction accuracy and robustness compared with the benchmark models; the data in Tables 4 and 5 strongly support this conclusion. In mixed noise environments, the MLSCA-CW (Two locations) model outperforms the other comparison models at SNRs from −9 to 9 dB, and in the strong −9 dB noise environment its prediction accuracy reaches 95.378%. Besides, an important feature of the MLSCA-CW model is that it makes fuller use of sensor signals from multiple locations to complete bearing fault diagnosis under complex conditions. Specifically, the extracted multi-___location fault features realize fusion and complementation of the feature signals and enrich the fault feature information, so that the network can better fit the nonlinear mapping between signal features and fault types, which in turn improves prediction accuracy. Tables 4 and 5 give the experimental results of the MLSCA-CW model using single-___location and two-___location sensor signals. By comparison, only under Brownian noise (excluding the −9 dB environment) is the accuracy of MLSCA-CW (Single ___location) slightly higher, by 0.42%, than that of MLSCA-CW (Two locations); for the other noise types and SNRs, MLSCA-CW (Single ___location) never exceeds MLSCA-CW (Two locations). This analysis shows that the multi-___location information aggregation method proposed in this paper enhances the model's ability to extract fault features and improves its prediction performance across different application scenarios.

Table 4 ACC performance comparison of different models in CWRU dataset under different SNRs (%).
Table 5 F1-score performance comparison of different models in CWRU dataset under different SNRs (%).

b) UofO dataset performance

In order to further verify the generalization and robustness of the proposed model, we select the UofO variable-speed bearing diagnosis dataset for further verification. Since this paper focuses on diagnostic performance in multi-type strong noise environments, noise intensities with SNRs of −9, −6, and −3 dB are selected for this comparison experiment. ACC, F1-score, and Recall are used as evaluation indicators, as shown in Table 6.

As shown in Table 6, the proposed MLSCA-CW (Two Locations) maintains optimal performance under the different SNRs and noise conditions. It maintains a prediction accuracy of no less than 95.819% in each single-type −9 dB noise environment and 73.976% under −9 dB mixed noise on the variable-speed bearings; under −3 dB mixed noise, it maintains 86.263%, about 10% higher than the best comparison model. MLSCA-CW (Single Location) lacks the signal information of one ___location, so the fault features obtained by the model are incomplete, which reduces prediction accuracy. The experimental results also verify that, under variable-speed conditions, incomplete fault features lower prediction accuracy, and the loss of one ___location's information intensifies the confusion between fault and noise characteristics. The LR model's nonlinear mapping ability is weak; under complex conditions such as Violet noise, Brownian noise, and mixed noise, its expressive power is low and its prediction accuracy is only about 30%. The MC-CNN model realizes multi-scale feature extraction based on multi-scale convolution, collecting multi-scale fault features through multiple receptive fields to strengthen fault feature extraction and improve model expressiveness. Consistent with the CWRU results, it is expressive and effective for Gaussian and Laplacian noise; for high-intensity Violet, Brownian, and mixed noise, however, multi-scale convolution cannot adequately separate fault features from noise features, reaching the expressive limit of the model, with a worst prediction accuracy of only about 40%.

The first convolutional layer of WDCNN uses a wide kernel to extract features and suppress noise. The experiments show that WDCNN separates fault features well from Laplace-noise, giving it strong expressive ability in that environment. For other types of noise (especially −9 dB strong noise), however, its ability to separate noise drops sharply, and its prediction accuracy stays around 60–70%. The Multiscale inner product model uses multi-scale inner products with locally connected feature extraction, which filters noise effectively, but its performance still falls short of MLSCA-CW (Two Locations): under −9 dB Mixed-noise, its prediction accuracy is only 68%, and even when the SNR is raised by 6 dB, its accuracy improves by only about 5%. These results indicate that multi-scale inner products with locally connected feature extraction are a promising noise-removal approach, and they further demonstrate the superiority of the proposed MLSCA-CW (Two Locations) model, which combines multi-scale convolution with a layer attention mechanism.

The SANet network adopts a Transformer-based depthwise-separable self-activation mechanism; its performance is acceptable when processing some kinds of noisy data (such as the CWRU data) at constant rotation speed. In variable-speed, high-noise environments, however, SANet performs even worse than LR, indicating that its expressive ability under variable speed still has considerable room for improvement. QCNN is essentially an attention-based model with strong filtering ability for strong Gauss-noise and Laplace-noise, while its filtering ability for other noise types and Mixed-noise is greatly reduced. In general, attention-based models such as Multiscale inner product, SANet, and QCNN, as well as models built on multi-scale convolution operators such as MC-CNN and WDCNN, filter some noise types better than the LR model. This further verifies that the proposed MLSCA-CW (Two Locations) model, which combines the multi-scale multi-___location feature extractor, soft-threshold noise filter, multi-scale self-activation mechanism, and layer attention mechanism, can filter out multiple types of strong noise.

In addition, an important feature of the MLSCA-CW model is that it extracts fault features from sensors at multiple positions. According to the results, the prediction accuracy of MLSCA-CW (Single Location) in the variable-speed environment drops significantly compared with MLSCA-CW (Two Locations), which shows that extracting multi-position vibration features indeed helps the model obtain more complete fault features and better separate noise signals. Moreover, compared with the constant-speed CWRU dataset, the performance degradation of MLSCA-CW (Single Location) is more pronounced, indicating that MLSCA-CW (Two Locations) supplements features more comprehensively under variable-speed conditions and is better suited to complex situations.
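As an illustration of the wide-kernel first layer attributed to WDCNN above, the sketch below shows a large-kernel, large-stride 1-D convolution that covers many signal periods at once and tends to smooth out high-frequency noise before deeper, narrow-kernel layers. The specific kernel size, stride, and channel count are assumptions for demonstration rather than WDCNN's published hyperparameters.

```python
import torch
import torch.nn as nn

# A minimal sketch of a WDCNN-style first layer: one wide-kernel,
# large-stride convolution followed by normalization and pooling.
wide_first_layer = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=64, stride=16, padding=24),  # wide kernel
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.MaxPool1d(2),
)

out = wide_first_layer(torch.randn(4, 1, 2048))  # -> (4, 16, 64)
```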

To further analyze each model's performance under strong noise, we visualize the confusion matrices of each model at −9 dB SNR; the results are shown in Fig. 7. As shown in Fig. 7, MLSCA-CW (Two Locations) confuses the signals of healthy bearings and outer-race faults in the Mixed-noise environment; nevertheless, its overall performance remains better than the other benchmark models. For MLSCA-CW (Single Location), the ability to distinguish between healthy bearings and outer-race faults declines further, and more of these signals are confused. Among the other models, the Multiscale inner product model has a recognition rate that is evenly distributed across noise types, but its overall accuracy is lower than MLSCA-CW (Two Locations). MC-CNN is poor at identifying outer-race faults under Brown-noise. WDCNN performs poorly in detecting healthy and outer-race fault conditions under most noise types. Under Laplace-noise, SANet's identification accuracy for outer-race faults is 0, with all outer-race faults misjudged as healthy. QCNN identifies healthy bearings poorly under Violet-noise, Brown-noise, and Mixed-noise.
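Confusion matrices like those in Fig. 7 can be produced with standard tooling; a minimal sketch using scikit-learn follows, where the label arrays are placeholder values rather than actual experimental outputs.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Placeholder labels standing in for one model's predictions
# under one noise type at -9 dB SNR.
y_true = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 2, 1, 1, 2, 0, 2])

# Row-normalized confusion matrix: each row sums to 1 (per-class recall).
cm = confusion_matrix(y_true, y_pred, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=["healthy", "inner", "outer"]).plot()
plt.show()
```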

In summary, the proposed MLSCA-CW (Two Locations) remains optimal on the UofO variable-speed bearing dataset, while the other baseline models show shortcomings of varying degrees across the different noise environments. This validates the approach adopted by the MLSCA-CW model and shows that it generalizes well and is robust.

Although computing costs have fallen with the rapid development of GPU technology, algorithmic efficiency remains an important criterion for evaluating a model. We therefore test the inference time for a single sample while preserving model performance. The results show that the single-sample inference time of the MLSCA-CW model is 0.515 ms, compared with 0.943 ms for MC-CNN, 0.725 ms for Multiscale inner product, 0.725 ms for SANet, and 0.456 ms for QCNN. Overall, the proposed model offers significant performance advantages, and its computational efficiency is better than most of the comparison models, showing promising prospects for near-real-time applications.
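The paper does not specify its timing protocol; below is a minimal sketch of how single-sample inference latency might be measured on CPU (on GPU, torch.cuda.synchronize() calls would also be needed around the timed region to get accurate wall-clock figures).

```python
import time
import torch

@torch.no_grad()
def single_sample_latency_ms(model: torch.nn.Module, sample: torch.Tensor,
                             warmup: int = 10, runs: int = 100) -> float:
    """Average wall-clock inference time for one sample, in milliseconds."""
    model.eval()
    for _ in range(warmup):        # warm-up runs exclude one-time setup costs
        model(sample)
    start = time.perf_counter()
    for _ in range(runs):
        model(sample)
    return (time.perf_counter() - start) / runs * 1000.0
```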

Fig. 7

Confusion matrices of all models at −9 dB SNR on the UofO dataset. The model name for each row is shown on the left. Panels (a)–(e) correspond to Gauss-noise, Laplace-noise, Violet-noise, Brownian-noise, and Mixed-noise, respectively.

Table 6 ACC, F1-score, and Recall comparison of different models in UofO dataset under different SNRs (%).

Conclusion

This paper proposes a new multi-position, multi-scale, multi-level information attention activation network (MLSCA-CW) for bearing fault diagnosis. Ablation experiments show that the soft-threshold noise removal, layer attention mechanism, multi-scale self-activation mechanism, and extraction of multi-scale features from multi-position sensor data used in this paper are highly effective in filtering out various types of high-intensity noise, while improving the model’s expressive power. Furthermore, the soft-threshold noise removal method offers more significant advantages compared to the traditional FFT approach.

Comparison experiments demonstrate that, under both constant- and variable-speed bearing operating conditions, the proposed MLSCA-CW outperforms current advanced multi-scale bearing fault diagnosis models, such as MC-CNN, SANet, and QCNN, in the presence of various types of strong noise. On the CWRU dataset, the MLSCA-CW model maintains a high diagnostic accuracy of 92.02% in noise environments ranging from −9 to 9 dB. On the UofO variable-speed bearing fault diagnosis dataset, the model achieves a diagnostic accuracy of at least 95.819% under the various single-noise conditions at SNRs from −9 to −3 dB. Even under mixed noise, its diagnostic accuracy remains significantly higher than that of multi-scale baseline models such as SANet. Experiments conducted on datasets with different working conditions and mixed noise sources further validate the model's robustness and practicality. Additionally, the inference time for a single sample in the MLSCA-CW model is only 0.515 ms, indicating its potential for near-real-time computation.

In summary, the MLSCA-CW model, which integrates soft-threshold, self-activation, and self-attention mechanisms to effectively extract multi-scale features from multi-position sensor data, demonstrates strong adaptability and robustness under high-intensity and multi-noise real-world conditions, along with significant real-time computing potential. The MLSCA-CW model provides reliable technical support for machinery fault diagnosis. Future research can further optimize the model structure and expand its application to fault diagnosis of more types of mechanical equipment.