Introduction

The brain-computer interface (BCI) is an advanced technology for human beings to interact with the outside world1. Classification of neural signals generated by visual stimuli has been a promising paradigm for brain-computer interface technology2. Electroencephalography (EEG) is a non-invasive method to measure brain activities, which is widely used due to its high temporal resolution and low cost3. Recently, deep-learning models have been studied to classify EEG signals, which has the potential to help people express their thoughts and directly control external devices with neural signals4,5,6.

To develop a robust brain-computer interface, many studies have been devoted to modeling the relationship between EEG signals and visual stimuli with artificial neural networks. Kaneshiro et al.7 used principal component analysis (PCA) and linear discriminant analysis (LDA) to identify objects from EEG signals and classify them into semantic categories (6-class) and individual exemplars (72-class) using only temporal information. Spampinato et al.8 employed convolutional neural networks (CNN) and long short-term memory (LSTM) networks to classify block-design EEG signals. Li et al.9 and Ahmed et al.10 explored the accuracy of various machine-learning methods and deep-learning models in classifying EEG signals under the random-trial design and concluded that classifying raw signals poses certain challenges. To explore more effective neural network architectures, Kalafatovich et al.11 proposed a CNN model suited to modelling temporal EEG signals, and the transformer architecture has also been explored with attention modules constructed from convolutional kernels12. Recently, Luo et al.6 combined time-___domain, frequency-___domain, and spatial features using a graph-convolution and transformer architecture to achieve new state-of-the-art performance on two datasets13. In these works, researchers have utilised ensembled network structures and introduced frequency-___domain features to enhance model performance. However, we suppose that the capabilities of deep-learning models have not been fully exploited, and there is still considerable room for improvement without relying on handcrafted features.

In order to maximize the potential of deep learning for constructing efficient and reliable brain-computer interface devices, there are some remaining issues that need to be addressed:

(1) The architecture of deep-learning models for visual classification of EEG signals has yet to be explored thoroughly. At present, researchers have designed deep-learning models with different structures to effectively model the characteristics of EEG signals. To further improve EEG visual decoding, it is necessary to combine the advantages of different deep-learning models. Some studies utilized only one type of neural network, which limits the unified modeling of information across different scales.

(2) Spatial modeling of EEG signals frequently depends on the spatial coordinates of electrode caps. EEG data contains explicit spatial information, i.e., the three-dimensional spatial coordinates of the EEG cap. Some studies have modeled spatial information based on fixed electrode positions, while others have manually selected electrodes located near the visual cortex to exclude irrelevant information. However, the physiological connections between brain regions vary across subjects, and there is likely to be a mismatch between the fixed electrode positions and the real EEG cap placement. In addition, different regions of the brain work together synergistically to process visual information globally14, which means that each electrode can contribute to visual classification. A model that can adaptively assign electrode weights and combine information between electrodes according to the EEG signals of each subject can better capture EEG characteristics.

(3) High-performance visual classification methods of EEG signals frequently depend on manually designed features in the frequency ___domain. Traditional EEG classification often relies on handcrafted frequency-___domain features15,16. For EEG visual classification, the combination of frequency-___domain features and deep-learning methods has also improved classification performance6. However, the limitations of handcrafted features may hinder the exploration of EEG deep-learning models. Neural networks can extract valid features from raw signals that may be more efficient than handcrafted features.

(4) The validation datasets used for EEG visual classification methods are frequently limited in size. It is worth noting that most of the research on EEG visual classification has been conducted on small-scale datasets in low sampling rate settings, which might not give a complete evaluation of the model’s performance. Besides, deep learning has been found to be more effective on large-scale datasets17.

In order to overcome the above issues, we design a hybrid neural network model that exploits the benefits of various neural networks. We adopt a one-dimensional temporal convolutional neural network to achieve efficient local feature extraction, and integrate temporal global features through a transformer block. We also introduce residual connections to help the model maintain sensitivity to low-level features while extracting high-level features. In high sampling rate settings, we propose a feature fusion module to extract informative features, which further improves performance. To improve the representation of spatial information in EEG data, we propose a reweight module that utilizes the adaptive learning ability of artificial neural networks. This module models a unique spatial distribution based on EEG data from each participant and automatically assigns different attention weights to all brain electrodes. To take full advantage of the ability of deep learning to extract features from raw signals, we trained the model end-to-end using only the raw time series signals, achieving higher accuracies than the state-of-the-art method that utilized frequency-___domain features. Besides, we conducted extensive experiments on five public EEG datasets to verify the effectiveness of our proposed model.

To summarise, the main contributions of this study are as follows: (1) We introduce a hybrid local-global neural network for classifying raw EEG signals, which improves decoding accuracy by effectively extracting local temporal features of EEG signals and integrating global features. (2) We validated the effectiveness of deep learning on five datasets, particularly on three large datasets. Both slow and fast serial visual presentation EEG signals can be classified with promising accuracy, well above chance level. This demonstrates that deep-learning methods are worthy of exploration in neuroscience research. (3) We conducted detailed experiments to compare the impact of preprocessing and hyperparameters, such as sample rate, clamping threshold, learning rate, batch size, and model parameters, and analyzed the parameter sensitivity of deep-learning models used for EEG classification. (4) We visualized the attention maps of our proposed reweight modules on two datasets, demonstrating the adaptive learning capability of our method. Furthermore, we demonstrated the effectiveness of our method by visualizing the most influential features using t-distributed Stochastic Neighbour Embedding (t-SNE).

Methods

This section provides a detailed description of our proposed neural network architecture, introduces five publicly available datasets, and explains the preprocessing steps for EEG signals. Fig. 1 illustrates the overall framework of our approach for visual recognition of EEG signals. The framework performs end-to-end classification of the raw EEG data. The network consists of three parts: the Temporal and Spatial Convolutional Block, the Global-temporal Transformer Block, and Aggregation and Classification.

Fig. 1
figure 1

The overall architecture of our model for EEG classification. First, the input EEG signals are downsampled and epoched into trials. Baseline correction is an optional step, and no further preprocessing steps are applied. Next, raw EEG trials are encoded by the Temporal and Spatial Convolutional Block. Then, the transformer block further extracts global-temporal features. Ultimately, EEG features are aggregated and classified into labels.

Fig. 2
figure 2

Diagram of the Temporal and Spatial Convolutional Block. First, the block takes the EEG signal and electrode positions as input. The channel-wise signal is reweighted and combined with other channels' signals. Then, the reweighted signals are processed by several temporal 1D-convolution blocks with BatchNorm and Dropout layers. A skip connection is adopted, and spatial information is adaptively pooled by a convolutional kernel.

Temporal and spatial convolutional block

Fig. 2 illustrates the detailed architecture of the Temporal and Spatial Convolutional Block, which comprises three modules.

Reweight module

In this module, a matrix W that has dimensions E \(\times\) N will be adaptively learned, where E represents the number of input electrodes and N represents the number of reweighted electrodes. Previously, models directly extracted temporal features from the signal of each electrode, and some works manually selected certain electrodes, especially those in the occipital lobe region, as they were most useful for visual classification. The reweight module is designed to adaptively weigh the signals of each electrode of the original EEG, which helps the model focus on the important electrodes and down-weight the irrelevant ones that are less related to the visual stimulus. Each new reweighted electrode represents a different weighted summation scheme, so a variety of electrode combination features are generated as the output.
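For illustration, the reweight module can be realized as a learnable E \(\times\) N mixing matrix applied across the electrode dimension. The following PyTorch sketch shows one such realization under our own naming and initialization assumptions; it is not the released code.

import torch
import torch.nn as nn

class ReweightModule(nn.Module):
    # Learns an E x N mixing matrix over electrodes; class name, initialization
    # scale, and tensor layout are our assumptions for illustration.
    def __init__(self, n_electrodes: int, n_reweighted: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_electrodes, n_reweighted) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, E, T) raw EEG trials; output: (batch, N, T), where each
        # reweighted electrode is a learned weighted sum of all input electrodes.
        return torch.einsum("bet,en->bnt", x, self.weight)

# Example: 124-channel EEG72 trials with 32 time samples mapped to 64 reweighted electrodes.
trials = torch.randn(8, 124, 32)
reweighted = ReweightModule(124, 64)(trials)   # shape: (8, 64, 32)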

Local-temporal module

The module consists of four layers of one-dimensional temporal convolution with a skip connection. CNNs are known for their ability to extract local features. As the number of layers increases, simple local time series features combine to form more complex signal shapes, allowing for effective extraction of local time series features present in complex and variable EEG signals. The parameters of the first three layers are referenced from TSCNN18. We reconstructed the model based on the description provided in the article. The kernel sizes for all layers are set to 5, with the numbers of features being 16, 32, and 64 respectively. The stride is set to 1, and no padding strategy is adopted. We chose the exponential linear unit (ELU) activation function19 as it shows better performance in previous EEG experiments20,21.
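For reference, a sketch of the three-layer temporal convolution stack in PyTorch is given below. The per-electrode 2-D kernel layout, the layer ordering, and the dropout rate are our assumptions rather than the exact published configuration.

import torch
import torch.nn as nn

class LocalTemporalStack(nn.Module):
    # Three temporal convolutions (kernel 5 along time, stride 1, no padding)
    # with 16, 32 and 64 feature maps, applied per electrode.
    def __init__(self, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 5)), nn.BatchNorm2d(16), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(16, 32, kernel_size=(1, 5)), nn.BatchNorm2d(32), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(32, 64, kernel_size=(1, 5)), nn.BatchNorm2d(64), nn.ELU(), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N_electrodes, T); a feature-map axis is added and the kernels slide
        # along time only, giving (batch, 64, N_electrodes, T - 12) since each unpadded
        # kernel-5 convolution removes 4 time samples.
        return self.net(x.unsqueeze(1))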

Compared to previous work, we have implemented several key improvements. These include the introduction of residual connections and adjustments to the pooling and dropout23 layers. The residual connection structure22 is meant to prevent the model from disregarding important low-level information, ensuring that it can handle complex signals while still maintaining sensitivity to simple curves. The residual connection consists of two layers of temporal convolution. As shown in Fig. 2, suppose the number of input feature maps is \(C_3\), the first layer maps the number of feature maps from \(C_3\) to \(2C_3\), and the second layer maps the number of feature maps from \(2C_3\) back to \(C_3\). For signals with low sampling rates, we suppose that the pooling layer should be used sparingly to avoid information loss. Therefore, we use less pooling than TSCNN and replace max-pooling with mean-pooling. To address the issue of overfitting, we have adjusted the number of dropout layers and the dropout rate to ensure the robustness of the model.
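The residual branch described above can be sketched as follows; the kernel size and the 'same' padding are assumptions made so that the skip addition is shape-compatible.

import torch
import torch.nn as nn

class TemporalResidualBlock(nn.Module):
    # Two temporal convolutions mapping C3 -> 2*C3 -> C3 with a skip connection.
    def __init__(self, c3: int, kernel: int = 5, dropout: float = 0.5):
        super().__init__()
        pad = kernel // 2
        self.body = nn.Sequential(
            nn.Conv2d(c3, 2 * c3, kernel_size=(1, kernel), padding=(0, pad)),
            nn.BatchNorm2d(2 * c3), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(2 * c3, c3, kernel_size=(1, kernel), padding=(0, pad)),
            nn.BatchNorm2d(c3), nn.ELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input is added back to the transformed features so that low-level
        # information is preserved alongside higher-level features.
        return x + self.body(x)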

Spatial-integration module

The module comprises a single layer of one-dimensional spatial convolution, where the kernel size is identical to the number of reweighted electrodes of the local features. Along with the reweighting of features, it includes a dropout layer, a batch normalization layer24, and an activation layer. The dropout layer helps prevent overfitting of the model, while the batch normalization layer normalizes the feature distribution, leading to better learning performance. The activation function in the network is ELU. Unlike other models, we divide the learning of spatial features into two parts, placed before and after the learning of temporal features. In the earlier stage, the reweight module selects the most informative electrodes. In the later stage, the spatial-integration module ensures that the model does not focus too much on the temporal information of a single electrode after passing through the Local-temporal Module. This allows the model to maintain attention to spatial information and integrate the extracted temporal information of each electrode more effectively.
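A minimal sketch of this spatial integration step is shown below; the exact ordering of the convolution, normalization, activation and dropout layers is an assumption.

import torch
import torch.nn as nn

class SpatialIntegration(nn.Module):
    # One spatial convolution whose kernel spans all reweighted electrodes,
    # followed by BatchNorm, ELU and Dropout.
    def __init__(self, n_features: int, n_electrodes: int, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_features, n_features, kernel_size=(n_electrodes, 1)),
            nn.BatchNorm2d(n_features), nn.ELU(), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, N_electrodes, T'); the electrode axis is collapsed
        # to length 1 by the spatial kernel and then squeezed away.
        return self.net(x).squeeze(2)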

Global-temporal transformer block

To improve the model’s ability to perceive macroscopic patterns, we incorporated transformer structures instead of relying solely on convolutional structures. The transformer is a neural network model based on “Scaled Dot-Product Attention” to extract global features effectively. Its performance has surpassed traditional CNNs and recurrent neural networks (RNNs), especially on large-scale datasets. Recent studies have utilized transformers to extract global-temporal features of EEG signals for classification of mental imagery25, prediction of mental workload26, and emotion analysis27. In this article, we only utilized one transformer encoder layer and did not stack transformer layers, because the data size is still relatively small. Research has shown that transformer models require a larger amount of data than CNNs to extract features effectively28.
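A minimal sketch of such a single-layer encoder with torch.nn is shown below; the values of d_model, the number of heads, the feed-forward size, and the dropout rate are illustrative assumptions rather than the reported settings.

import torch
import torch.nn as nn

# Single transformer encoder layer applied to the local-feature sequence.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128, dropout=0.3)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

local_feats = torch.randn(8, 64, 20)              # (batch, features, time) from the convolutional block
seq = local_feats.permute(2, 0, 1)                # (time, batch, features) layout expected by nn.Transformer
global_feats = encoder(seq).permute(1, 2, 0)      # back to (batch, features, time)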

Aggregation and classification

For the low sample rate setting, the temporal and spatial features will be directly aggregated and classified by a feed-forward network. For the high sample rate setting, the temporal features will be further fused before classification.

Feature fusion module

This module is designed to deal with high sampling rates and long signals. In the high sample rate setting, after the Temporal and Spatial Convolutional Block extracts local features, the Global-temporal Transformer Block extracts global features and retains the input length, leading to large output dimensions. To aggregate more discriminative features, this module reduces the temporal dimension of the output. The reduction is fulfilled by a one-dimensional temporal convolution layer that reweights the temporal features in a manner similar to the reweight module. This fusion of features facilitates the learning of the classification module and prevents the performance degradation resulting from high sampling rates.
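One plausible realization of this temporal reweighting is sketched below, where time steps are mixed by a 1-D convolution in analogy with the electrode reweight module; treating time as the channel axis and the reduced length T' are our assumptions, not the authors' exact code.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    # Mixes and reduces the temporal dimension from T to T' with a 1-D convolution.
    def __init__(self, t_in: int, t_out: int):
        super().__init__()
        self.mix = nn.Conv1d(t_in, t_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, T) -> (batch, T, features) -> (batch, T', features) -> (batch, features, T')
        return self.mix(x.transpose(1, 2)).transpose(1, 2)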

Classification module

The module comprises a multilayer perceptron (MLP) that integrates the temporal and spatial features and generates the final discrimination result via a 3-layer fully connected (FC) network. The size of the hidden layer is adjusted to suit the number of classification categories for different datasets and classification tasks.
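For illustration, the classification head can be sketched as follows; the hidden sizes and dropout rate are placeholders that are tuned per dataset and task.

import torch.nn as nn

def make_classifier(in_dim: int, hidden: int, n_classes: int, dropout: float = 0.5) -> nn.Sequential:
    # Three fully connected layers with ELU activations and Dropout.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_dim, hidden), nn.ELU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden // 2), nn.ELU(), nn.Dropout(dropout),
        nn.Linear(hidden // 2, n_classes),
    )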

Table 1 Description of the Five Public EEG Datasets. Exp.: Experimental, low-speed: low-speed serial visual presentation, RSVP: rapid serial visual presentation.

Experiments

Datasets

Table 1 provides the details of five public EEG datasets for visual recognition.

EEG727

The first dataset consists of EEG data collected from 10 subjects. The experiment involved showing colour images from six semantic categories. In total, 72 images were presented on a mid-grey background. Each image was displayed for 500 ms. The EEG data was recorded using 128-channel EGI HCGSN 110 nets. The dataset is also called SU DB in TSCNN18.

EEG20013

The second dataset consists of EEG data collected from 16 subjects. The experiment involved showing colour images from ten semantic categories. In total, 200 images without background were presented. These images can be categorized into two types: animate and inanimate. Each image was displayed for 200 ms. The EEG data was collected using a 64-channel BrainVision actiCHamp system.

ImageNet-EEG10

The third dataset comprises EEG data collected from a single participant. The study involved presenting natural images from 40 classes of the ImageNet database31. A total of 40,000 images were shown. Each image was displayed for 2 seconds. The EEG data was recorded using a BioSemi ActiveTwo recorder with 96 channels.

THINGS-EEG-10Hz29

The fourth dataset consists of EEG data collected from 50 participants. The data was collected using an RSVP paradigm, which involved displaying natural images cropped to a square shape. These images were selected from 1,854 object concepts of the THINGS database32. Each image was displayed for 50 ms, followed by a 50 ms blank screen. The EEG data was recorded using a 64-channel BrainVision actiCHamp system. The dataset is divided into two parts: 1,854 training concepts and 200 test concepts. We only used the 1,854 training concepts of the dataset. Due to the limited number of pictures in each concept, we combined the three criteria of higher-level object categories provided in THINGS and classified all single-category samples into 27 high-level categories for model validation.

THINGS-EEG-5Hz30

The fifth dataset consists of EEG data collected from 10 participants. This dataset used similar experimental paradigms and visual stimuli to THINGS-EEG-10Hz but differed in image presentation timing. In this dataset, each image was displayed for 100 ms, followed by a 100 ms blank screen. The EEG data was collected using a BrainVision actiCHamp amplifier and a 64-channel EASYCAP. The dataset is divided into two parts: 1,654 training concepts and 200 test concepts. We only used the 1,654 training concepts of the dataset. As with THINGS-EEG-10Hz, due to the limited number of pictures in each concept, we merged the 1,654 concepts into 27 high-level categories.

Data preprocessing

Low sample rate

The EEG signals from EEG72 were filtered between 1 and 25 Hz. The signals were then downsampled to 62.5 Hz and epoched into trials of 32 time samples. For the other datasets, the MNE package33 was used for preprocessing. The sampling rates of the data were adjusted to 62.5 Hz to be consistent with the preprocessing of EEG72 and the state-of-the-art model CAW-MASA-STST6. The epoch schemes are shown below:

  • For the ImageNet-EEG dataset, epochs were created ranging from -100 to 2,000 ms relative to stimulus onset. Baseline correction was applied by subtracting the mean of the pre-stimulus interval for each trial.

  • For EEG200, THINGS-EEG-10Hz, and THINGS-EEG-5Hz datasets, epochs were created ranging from 0 to 500 ms relative to stimulus onset. No further preprocessing or artifact correction methods were applied.

Besides, we applied z-score normalization and clamped large voltages.
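As an illustration, the following MNE-based sketch covers the main preprocessing steps described above (downsampling, epoching, z-score normalization, and clamping). The file name, event handling, per-channel normalization, and clamping threshold are our assumptions for this sketch.

import mne
import numpy as np

raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)   # placeholder recording
raw.resample(62.5)                                                   # low sample rate setting

events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id, tmin=0.0, tmax=0.5,
                    baseline=None, preload=True)                     # 0-500 ms epochs, no baseline correction

x = epochs.get_data()                                                # (n_trials, n_channels, n_times)
x = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-8)  # per-channel z-score
x = np.clip(x, -20.0, 20.0)                                          # clamp large voltages (threshold assumed)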

High sample rate

Previous works have concentrated on the EEG72 dataset in a low sample rate setting. However, the highest frequency of interest in EEG is commonly around 100 Hz. To study the influence of sample rate, we utilized two additional sample rate settings: 125 Hz and 250 Hz. In both settings, we applied a 50 Hz notch filter to remove power-line interference, and in the 250 Hz setting, we filtered the signals in the frequency range of 0.1 to 100 Hz. Other preprocessing steps are the same as in the low sample rate setting.
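For the high sample rate settings, the corresponding MNE calls applied to the raw recording (before the downsampling shown in the sketch above) would be along these lines; the order of operations is our assumption.

raw.notch_filter(freqs=50.0)            # 50 Hz power-line notch, applied in both high sample rate settings
raw.filter(l_freq=0.1, h_freq=100.0)    # 0.1-100 Hz band-pass used in the 250 Hz setting
raw.resample(250.0)                     # 250 Hz setting; use raw.resample(125.0) for the 125 Hz setting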

Evaluation metrics

To evaluate the performance of our method, we adopt two widely used evaluation metrics: the classification accuracy and the confusion matrix. Accuracy is the proportion of correctly identified test samples to all test samples. Confusion matrices are used to measure the performance of multiclass classification problems by showing the counts of predicted and actual values for each class.

During the experiments, we applied 10-fold cross-validation to evaluate the models; accuracies over the test folds were averaged and are reported in the following section. To set appropriate hyperparameters for the proposed model, we utilized a grid search strategy to optimize the learning rate, batch size, and other parameters. We employed Adam34 as the optimization algorithm.
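A minimal sketch of the 10-fold evaluation loop is given below; the use of stratified folds and the train_and_predict callback (which stands in for training the model with Adam and predicting the test fold) are our assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

def cross_validate(x, y, train_and_predict, n_splits=10, seed=0):
    # Returns the mean test accuracy and the confusion matrix averaged over folds.
    accs, cms = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(x, y):
        y_pred = train_and_predict(x[train_idx], y[train_idx], x[test_idx])
        accs.append(accuracy_score(y[test_idx], y_pred))
        cms.append(confusion_matrix(y[test_idx], y_pred, labels=np.unique(y)))
    return float(np.mean(accs)), np.mean(cms, axis=0)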

Baseline models

In order to demonstrate the effectiveness of our model, we compared the classification accuracy with six baseline models (linear, CNN, LSTM, transformer, EEGNet and ShallowConvNet) and reported accuracies from other works6,11,12,18,35. Two of these baseline models are adopted from LMDA36. The remaining four baseline models are implemented by ourselves. Brief descriptions of three baseline models are given as follows:

CNN model18

We reproduced the temporal stream structure of TSCNN18. The CNN model consists of three temporal 1D-convolution layers and one spatial 1D-convolution layer. The kernel sizes of three temporal 1D-convolution layers are all set to 5, and the numbers of convolutional kernels are 16, 32, and 64, respectively.

EEGNet20

The EEGNet has a compact CNN architecture that consists of a common convolution layer, a depthwise convolution layer, and a separable convolution layer. The performance of this model has been validated on several BCI paradigms and EEG benchmarks37,38,39. The implementation of EEGNet is adopted from LMDA36.

ShallowConvNet21

The ShallowConvNet consists of two CNN layers and a fully connected layer with softmax activation. This architecture has been successfully applied to EEG signals for the classification of MI paradigms40,41. The implementation of ShallowConvNet is adopted from LMDA36.

We implement the above baseline models and our approach using the publicly available PyTorch 1.6.0 framework.

Table 2 Comparison with the State-of-the-Art Methods for EEG72.
Table 3 Comparison with the Baseline and State-of-the-Art Methods for EEG200. * means the method is reproduced and reported by CAW-MASA-STST6, and # means the method is reproduced by LMDA36 and is tested in our environment.

Results and discussion

Overall performance

In this section, we compared our proposed model with the baseline models on five different datasets to verify its effectiveness. Tables 2, 3, and 4 demonstrate that our model achieves high decoding accuracy on all five datasets.

For the EEG72 dataset, our method achieves 55.93% accuracy in the 6-class task, which is 1.11% higher than the state-of-the-art method. CAW-MASA-STST is an advanced model that combines spatio-temporal and spectral-temporal features, resulting in superior performance. Moreover, for the 72-class task, the classification accuracy is also better than that of other methods, 2.26% higher than CAW-MASA-STST. This improvement in decoding accuracy suggests that our model can extract features from EEG signals more efficiently.

For the EEG200 dataset, our proposed method achieves a 2-class classification accuracy of 69.08%, outperforming all other models. However, in the 10-class task, the proposed method does not achieve the best decoding accuracy. The classification accuracy is 29.39%, which is 1.98% lower than CAW-MASA-STST, but still far better than other methods, namely 4.26% higher than the accuracy of ShallowConvNet and 4.63% higher than the accuracy of LSTM-CNN. It is also worth noting that the accuracy of ShallowConvNet on our preprocessed data is much lower than the reported performance in CAW-MASA-STST, which may be attributed to the difference in data preprocessing or the implementation of ShallowConvNet.

For the ImageNet-EEG dataset, the classification accuracy of the proposed method for the 40-class task is 11.99%, which is better than all baseline models. Furthermore, we experimented with high sampling rate data on the ImageNet-EEG dataset and achieved an accuracy of 12.53% at 125 Hz and 13.01% at 250 Hz (see Table 6). This shows that the model learns and utilizes the information in the high-frequency band to further improve the performance.

For the THINGS-EEG-10Hz and THINGS-EEG-5Hz datasets, the proposed method outperforms all baseline models with 18.74% and 23.87% accuracy in the 27-class task, respectively. The difference in accuracy between the two datasets suggests that the EEG signals evoked by pictures presented at 5 Hz are more distinguishable. This is likely because participants have more time to react to each stimulus, reducing the overlap between adjacent sample signals.

Overall, our proposed method outperforms the compared methods on nearly all tasks across the five datasets, indicating that it can effectively extract more discriminative features from EEG signals.

Table 4 Comparison with the Baseline Methods for ImageNet-EEG, THINGS-EEG-10Hz and THINGS-EEG-5Hz. # means the method is reproduced by LMDA36 and is tested in our environment.
Table 5 Accuracy (%) of Ablation Experiment at 62.5 Hz on EEG72 and ImageNet-EEG. Shortcut denotes the skip connection; Trans. denotes the transformer.
Table 6 Accuracy (%) of Ablation Experiment at 125 and 250 Hz on ImageNet-EEG.

Ablation studies

We have conducted a series of experiments on our proposed method to evaluate the impact of reweighting, skip connection, transformer, and dimension reduction module. We have tested various combinations of these modules and compared their performance. Table 5 displays the classification performance of eight conditions at 62.5 Hz on both the EEG72 and ImageNet-EEG datasets. Additionally, Table 6 presents the classification performance of sixteen conditions at 125 and 250 Hz on the ImageNet-EEG dataset.

Efficacy of reweight module

The reweight module enables the model to focus on the more informative electrodes while filtering out the less relevant ones. The effect of the module is shown in Table 5 and Table 6. Compared with the base model, the reweight module improved the accuracy on the EEG72 dataset for the 6-class and 72-class tasks by 3.15% and 5.40%, respectively. Additionally, the reweight module also improved the accuracy on the ImageNet-EEG dataset by 1.10%, 1.28%, and 0.61% at 62.5 Hz, 125 Hz, and 250 Hz, respectively. The performance with the reweight module was better than that of the other variants, indicating that adaptively learning the EEG channel weights is crucial for the model's high performance.

Efficacy of skip connection

The skip connection (shortcut) is a technique that enables a deep-learning model with a high depth not only to effectively extract high-level features but also to maintain sensitivity to low-level features. From Table 5 and Table 6, it is shown that the skip connection only results in a minor improvement in accuracy. This could be because our method uses a lightweight model, so the benefits of the skip connection may not be significant.

Efficacy of transformer

The transformer layer is a powerful tool for modelling large-scale global features, especially when dealing with large data sets. However, as shown in Table 5 and Table 6, the transformer layer only yields a slight improvement in accuracy. This could be attributed to the limited experimental data available from a single subject, which may constrain the potential of the transformer model.

Efficacy of dimension reduction

Signals with a high sampling rate contain more useful information at high frequencies, making it possible to achieve better results compared to signals at 62.5 Hz. However, we observed that training the same model at 125 Hz and 250 Hz without adjusting the parameters tuned at 62.5 Hz resulted in much lower accuracy. This suggests that the model had difficulty classifying due to the large number of features at a high sampling rate. To overcome this issue, we designed a feature fusion module that enables the model to extract more expressive features by reducing the dimensionality of the features. As shown in Table 6, the dimension reduction led to increases in accuracy of 2.13% and 2.82% on the ImageNet-EEG dataset at 125 Hz and 250 Hz, respectively.

Fig. 3
figure 3

Performance comparison with different hyperparameters on ImageNet-EEG dataset. (a) Learning rate. (b) Weight decay. (c) Batch size. (d) Clamping threshold. (e) Dropout of transformer. (f) Dimension reduction. (g) FC hidden layer size. (h) Dropout of FC.

Parameter sensitivity

Deep-learning models are highly sensitive to hyperparameter settings, which means that optimizing these parameters is crucial to obtain a reliable model. Grid search is a widely used strategy for hyperparameter optimization in deep-learning models, and we adopted it to optimize the hyperparameters of our proposed model, including learning rate, weight decay, and batch size.

From Fig. 3 (a), it can be observed that setting the learning rate to 1e-3 resulted in the highest accuracy. A learning rate of 1e-2 may cause the model to oscillate around the global optimum, resulting in lower accuracy. Meanwhile, a learning rate of 1e-4 may lead the model into a local optimum, also resulting in lower accuracy. From Fig. 3 (b), it can be observed that increasing weight decay harms the model’s performance on the ImageNet-EEG dataset. As weight decay increases, the model tends to underfit, yielding worse results. From Fig. 3 (c), it can be observed that setting the batch size to 128 leads to the highest accuracy. A small batch size of 64 may make the gradient descent process unstable, resulting in lower accuracy, and increasing the batch size beyond 128 does not improve the model’s performance.

We also experimented with other hyperparameters, including the clamping threshold, dropout of the transformer, dimension reduction, FC hidden layer size, and dropout of the FC layers. Fig. 3 (d) shows that clamping reduced the model’s performance on the ImageNet-EEG dataset. Meanwhile, from Fig. 3 (e) (f) (g) (h), it can be seen that optimizing the other hyperparameters improved the model’s performance. As we achieved better results than the baselines with the initial parameters, we did not further adjust these parameters.

Fig. 4
figure 4

Visualization of weight distribution of 124 electrodes on EEG72 dataset. The weight distribution maps are calculated by summing the absolute values of the weights of each electrode in reweight module.

Channel attention of reweight module

We employed heatmaps to visualize the effectiveness of the reweight module on the EEG72 and ImageNet-EEG datasets. Fig. 4 shows the weight distribution maps of the reweight modules for 10 subjects on the EEG72 dataset. The darker red areas on the scalp indicate that the electrode at that position has a greater weight. From the figure, we can observe that only a few electrodes of the multi-channel EEG signal are strongly correlated with the visual target recognition task. The electrodes with large weights are mainly distributed over the occipital lobe of the brain, which corresponds to the visual cortex and is responsible for processing visual information. This result is consistent with known neuroscience findings. However, there are some differences in the EEG channel weights of these subjects due to individual differences. This indicates that the proposed model can learn the EEG channel weights of each subject adaptively without the channel weights being set manually. By assigning small weights to unrelated channels, the model can suppress noise on those channels, thus improving its recognition accuracy.

Fig. 5
figure 5

The confusion matrices on EEG72.

Fig. 6
figure 6

The t-SNE visualization for the high-dimensional features from the fully connected layer of subject 5 and subject 6 in EEG72 dataset.

Comparison between different classes

The confusion matrices were averaged across all folds and subjects. Fig. 5 illustrates the averaged confusion matrices of the four methods on the EEG72 dataset for the 6-class task. The model mostly confuses the FV class with the IO class, while the HF class is the most distinguishable. The decoding accuracy of all four methods for the human faces (HF) category reaches more than 68%, which is higher than for the other categories. This shows that the human visual system is highly sensitive to human faces. Compared with the other three methods, our proposed method’s decoding accuracy is improved in all categories, demonstrating that it can effectively extract more discriminative features for each category.

The t-distributed Stochastic Neighbour Embedding (t-SNE)42 is a statistical technique commonly used for dimension reduction and feature visualization. In our study, we assessed the feature extraction ability of the model by extracting features from the fully connected layer of an individual fold and using them to visualize EEG features in a 2-D embedding space. Fig. 6 showcases the t-SNE maps of the EEG72 dataset for two subjects and six classes. As the results indicate, our approach effectively captures discriminative information and enhances the separability of different classes: samples of the same class cluster together, and the HF class is the most distinguishable.
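The t-SNE projection itself can be reproduced with scikit-learn as sketched below; the feature dimensionality, the number of trials, and the perplexity are placeholders, and in our experiments the features come from the fully connected layer of a single fold.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features standing in for fully connected layer activations of one test fold.
features = np.random.randn(432, 128)          # e.g. 432 trials x 128-dimensional FC features
labels = np.repeat(np.arange(6), 72)          # six semantic categories

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
plt.show()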

Conclusion

In this study, we propose a hybrid neural network to investigate the ability of deep learning to perform visual recognition based on raw EEG signals. The proposed framework first extracts local temporal and spatial features with linear and convolutional models. Then, the local features are further processed to extract global temporal features with the transformer model. Next, the temporal and spatial features are aggregated and classified by a feed-forward network. We validated the effectiveness of our model on five different datasets across different sample rates. The proposed model achieved consistently higher accuracy in the 6-class and 72-class classifications for the EEG72 dataset compared to other state-of-the-art models. In the case of the EEG200 dataset, our model achieved the highest accuracy in 2-class classification. Furthermore, for the ImageNet-EEG, THINGS-EEG-10Hz, and THINGS-EEG-5Hz datasets, our model outperformed all baseline models. This work demonstrates that neural networks can extract robust features from raw EEG signals, leading to good performance in EEG classification during visual recognition tasks. To further facilitate the practical application of brain-computer interfaces, we recommend designing more sophisticated deep-learning models with effective training objectives and pre-training on large datasets to reduce the effects of noise.