Introduction

The brain-computer interface (BCI) is an advanced technology for human beings to interact with the outside world1. Classification of neural signals generated by visual stimuli has been a promising paradigm for brain-computer interface technology2. Electroencephalography (EEG) is a non-invasive method to measure brain activities, which is widely used due to its high temporal resolution and low cost3. Recently, deep-learning models have been studied to classify EEG signals, which has the potential to help people express their thoughts and directly control external devices with neural signals4,5,6.

To develop a robust brain-computer interface, many studies have been devoted to modeling the relationship between EEG signals and visual stimuli with artificial neural networks. Kaneshiro et al.7 used principal component analysis (PCA) and linear discriminant analysis (LDA) to identify objects from EEG signals and classify them into semantic categories (6-class) and individual exemplars (72-class) using only temporal information. Spampinato et al.8 employed convolutional neural networks (CNN) and long short-term memory (LSTM) networks to classify block-design EEG signals. Li et al.9 and Ahmed et al.10 explored the accuracy of various machine-learning methods and deep-learning models in classifying EEG signals under the random-trial design and concluded that classifying raw signals poses certain challenges. To explore more effective neural network architectures, Kalafatovich et al.11 proposed a CNN model suited to modelling temporal EEG signals, and the transformer architecture has also been explored with attention modules constructed from convolutional kernels12. Recently, Luo et al.6 combined time-___domain, frequency-___domain, and spatial features using a graph-convolution and transformer architecture to achieve new state-of-the-art performance on two datasets13. In these works, researchers have utilised ensembled network structures and introduced frequency-___domain features to enhance model performance. However, we suppose that the capabilities of deep-learning models have not been fully exploited, and there is still considerable room for improvement without relying on handcrafted features.

In order to maximize the potential of deep learning for constructing efficient and reliable brain-computer interface devices, there are some remaining issues that need to be addressed:

(1) The architecture of deep-learning models for visual classification of EEG signals has yet to be explored thoroughly. At present, researchers have designed deep-learning models with different structures to effectively model the characteristics of EEG signals. To further improve EEG visual decoding, it is necessary to combine the advantages of different deep-learning models. Some studies utilized only one type of neural network, which limits the unified modeling of information across different scales.

(2) Spatial modeling of EEG signals frequently depends on the spatial coordinates of electrode caps. EEG data contains explicit spatial information, i.e., the three-dimensional spatial coordinates of the EEG cap. Some studies have modeled spatial information based on fixed electrode positions, while others have manually selected electrodes located near the visual cortex to exclude irrelevant information. However, the physiological connections between brain regions vary across subjects, and there is likely to be a mismatch between the fixed electrode positions and the real EEG cap placement. In addition, different regions of the brain work together synergistically to process visual information globally14, which means that each electrode can contribute to visual classification. A model that can adaptively assign electrode weights and combine information between electrodes according to the EEG signals of each subject can better capture EEG characteristics.

(3) High-performance visual classification methods of EEG signals frequently depend on manually designed features in the frequency ___domain. Traditional EEG classification often relies on handcrafted frequency-___domain features15,16. For EEG visual classification, the combination of frequency-___domain features and deep-learning methods has also improved classification performance6. However, the limitations of handcrafted features may hinder the exploration of EEG deep-learning models. Neural networks can extract valid features from raw signals that may be more efficient than handcrafted features.

(4) The validation datasets used for EEG visual classification methods are frequently limited in size. It is worth noting that most of the research on EEG visual classification has been conducted on small-scale datasets in low sampling rate settings, which might not give a complete evaluation of the model’s performance. Besides, deep learning has been found to be more effective on large-scale datasets17.

In order to overcome the above issues, we design a hybrid neural network model that exploits the benefits of various neural networks. We adopt a one-dimensional temporal convolutional neural network to achieve efficient local feature extraction, and integrate temporal global features through a transformer block. We also introduce residual connections to help the model maintain sensitivity to low-level features while extracting high-level features. In high sampling rate settings, we propose a feature fusion module to extract informative features, which further improves performance. To improve the representation of spatial information in EEG data, we propose a reweight module that utilizes the adaptive learning ability of artificial neural networks. This module models a unique spatial distribution based on EEG data from each participant and automatically assigns different attention weights to all brain electrodes. To take full advantage of the ability of deep learning to extract features from raw signals, we trained the model end-to-end using only the raw time series signals, achieving higher accuracies than the state-of-the-art method that utilized frequency-___domain features. Besides, we conducted extensive experiments on five public EEG datasets to verify the effectiveness of our proposed model.

To summarise, the main contributions of this study are as follows: (1) We introduce a hybrid local-global neural network for classifying raw EEG signals, which improves decoding accuracy by effectively extracting local temporal features of EEG signals and integrating global features. (2) We validated the effectiveness of deep learning on five datasets, particularly on three large datasets. Both slow and fast serial visual presentation EEG signals can be classified with promising accuracy, well above chance level. This demonstrates that deep-learning methods are worthy of exploration in neuroscience research. (3) We conducted detailed experiments to compare the impact of preprocessing and hyperparameters, such as sample rate, clamping threshold, learning rate, batch size, and model parameters, and analyzed the parameter sensitivity of deep-learning models used for EEG classification. (4) We visualized the attention maps of our proposed reweight modules on two datasets, demonstrating the adaptive learning capability of our method. Furthermore, we demonstrated the effectiveness of our method by visualizing the most influential features using t-distributed Stochastic Neighbour Embedding (t-SNE).

Methods

This section provides a detailed description of our proposed neural network architecture, introduces five publicly available datasets, and explains the preprocessing steps for EEG signals. Fig. 1 illustrates the overall framework of our approach for visual recognition of EEG signals. The framework performs end-to-end classification of the raw EEG data. The network consists of three parts: the Temporal and Spatial Convolutional Block, the Global-temporal Transformer Block, and Aggregation and Classification.

Fig. 1
figure 1

The overall architecture of our model for EEG classification. First, the input EEG signals are downsampled and epoched into trials. Baseline correction is an optional step, and no further preprocessing steps are applied. Next, raw EEG trials are encoded by the Temporal and Spatial Convolutional Block. Then, the transformer block further extracts global-temporal features. Ultimately, EEG features are aggregated and classified into labels.

Fig. 2
figure 2

Diagram of the Temporal and Spatial Convolutional Block. First, the block takes the EEG signal and electrode positions as input. The channel-wise signal is reweighted and combined with other channels' signals. Then, the reweighted signals are processed by several temporal 1D-convolution blocks with BatchNorm and Dropout layers. A skip connection is adopted, and spatial information is adaptively pooled by a convolutional kernel.

Temporal and spatial convolutional block

Fig. 2 illustrates the detailed architecture of the Temporal and Spatial Convolutional Block, which comprises three modules.

Reweight module

In this module, a matrix W that has dimensions E \(\times\) N will be adaptively learned, where E represents the number of input electrodes and N represents the number of reweighted electrodes. Previously, models directly extracted temporal features from the signal of each electrode, and some works manually selected certain electrodes, especially those in the occipital lobe region, as they were most useful for visual classification. The reweight module is designed to adaptively weigh the signals of each electrode of the original EEG, which helps the model focus on the important electrodes and down-weight the irrelevant ones that are less related to the visual stimulus. Each new reweighted electrode represents a different weighted summation scheme, so a variety of electrode combination features are generated as the output.
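For illustration, the reweight module can be realized as a learnable E \(\times\) N mixing matrix applied across the electrode dimension. The following PyTorch sketch shows one such realization under our own naming and initialization assumptions; it is not the released code.

import torch
import torch.nn as nn

class ReweightModule(nn.Module):
    # Learns an E x N mixing matrix over electrodes; class name, initialization
    # scale, and tensor layout are our assumptions for illustration.
    def __init__(self, n_electrodes: int, n_reweighted: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_electrodes, n_reweighted) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, E, T) raw EEG trials; output: (batch, N, T), where each
        # reweighted electrode is a learned weighted sum of all input electrodes.
        return torch.einsum("bet,en->bnt", x, self.weight)

# Example: 124-channel EEG72 trials with 32 time samples mapped to 64 reweighted electrodes.
trials = torch.randn(8, 124, 32)
reweighted = ReweightModule(124, 64)(trials)   # shape: (8, 64, 32)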

Local-temporal module

The module consists of four layers of one-dimensional temporal convolution with a skip connection. CNNs are known for their ability to extract local features. As the number of layers increases, simple local time series features combine to form more complex signal shapes, allowing for effective extraction of local time series features present in complex and variable EEG signals. The parameters of the first three layers are referenced from TSCNN18. We reconstructed the model based on the description provided in the article. The kernel sizes for all layers are set to 5, with the numbers of features being 16, 32, and 64 respectively. The stride is set to 1, and no padding strategy is adopted. We chose the exponential linear unit (ELU) activation function19 as it shows better performance in previous EEG experiments20,21.
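For reference, a sketch of the three-layer temporal convolution stack in PyTorch is given below. The per-electrode 2-D kernel layout, the layer ordering, and the dropout rate are our assumptions rather than the exact published configuration.

import torch
import torch.nn as nn

class LocalTemporalStack(nn.Module):
    # Three temporal convolutions (kernel 5 along time, stride 1, no padding)
    # with 16, 32 and 64 feature maps, applied per electrode.
    def __init__(self, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=(1, 5)), nn.BatchNorm2d(16), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(16, 32, kernel_size=(1, 5)), nn.BatchNorm2d(32), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(32, 64, kernel_size=(1, 5)), nn.BatchNorm2d(64), nn.ELU(), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N_electrodes, T); a feature-map axis is added and the kernels slide
        # along time only, giving (batch, 64, N_electrodes, T - 12) since each unpadded
        # kernel-5 convolution removes 4 time samples.
        return self.net(x.unsqueeze(1))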

Compared to previous work, we have implemented several key improvements. These include the introduction of residual connections and adjustments to the pooling and dropout23 layers. The residual connection structure22 is meant to prevent the model from disregarding important low-level information, ensuring that it can handle complex signals while still maintaining sensitivity to simple curves. The residual connection consists of two layers of temporal convolution. As shown in Fig. 2, suppose the number of input feature maps is \(C_3\), the first layer maps the number of feature maps from \(C_3\) to \(2C_3\), and the second layer maps the number of feature maps from \(2C_3\) back to \(C_3\). For signals with low sampling rates, we suppose that the pooling layer should be used sparingly to avoid information loss. Therefore, we use less pooling than TSCNN and replace max-pooling with mean-pooling. To address the issue of overfitting, we have adjusted the number of dropout layers and the dropout rate to ensure the robustness of the model.
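The residual branch described above can be sketched as follows; the kernel size and the 'same' padding are assumptions made so that the skip addition is shape-compatible.

import torch
import torch.nn as nn

class TemporalResidualBlock(nn.Module):
    # Two temporal convolutions mapping C3 -> 2*C3 -> C3 with a skip connection.
    def __init__(self, c3: int, kernel: int = 5, dropout: float = 0.5):
        super().__init__()
        pad = kernel // 2
        self.body = nn.Sequential(
            nn.Conv2d(c3, 2 * c3, kernel_size=(1, kernel), padding=(0, pad)),
            nn.BatchNorm2d(2 * c3), nn.ELU(), nn.Dropout(dropout),
            nn.Conv2d(2 * c3, c3, kernel_size=(1, kernel), padding=(0, pad)),
            nn.BatchNorm2d(c3), nn.ELU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The input is added back to the transformed features so that low-level
        # information is preserved alongside higher-level features.
        return x + self.body(x)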

Spatial-integration module

The module comprises a single layer of one-dimensional spatial convolution, where the kernel size is identical to the number of reweighted electrodes of the local features. Along with the reweighting of features, it includes a dropout layer, a batch normalization layer24, and an activation layer. The dropout layer helps prevent overfitting of the model, while the batch normalization layer normalizes the feature distribution, leading to better learning performance. The activation function in the network is ELU. Unlike other models, we divide the learning of spatial features into two parts, placed before and after the learning of temporal features. In the earlier stage, the reweight module selects the most informative electrodes. In the later stage, the spatial-integration module ensures that the model does not focus too much on the temporal information of a single electrode after passing through the Local-temporal Module. This allows the model to maintain attention to spatial information and integrate the extracted temporal information of each electrode more effectively.
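A minimal sketch of this spatial integration step is shown below; the exact ordering of the convolution, normalization, activation and dropout layers is an assumption.

import torch
import torch.nn as nn

class SpatialIntegration(nn.Module):
    # One spatial convolution whose kernel spans all reweighted electrodes,
    # followed by BatchNorm, ELU and Dropout.
    def __init__(self, n_features: int, n_electrodes: int, dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_features, n_features, kernel_size=(n_electrodes, 1)),
            nn.BatchNorm2d(n_features), nn.ELU(), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, N_electrodes, T'); the electrode axis is collapsed
        # to length 1 by the spatial kernel and then squeezed away.
        return self.net(x).squeeze(2)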

Global-temporal transformer block

To improve the model’s ability to perceive macroscopic patterns, we incorporated transformer structures instead of relying solely on convolutional structures. The transformer is a neural network model based on “Scaled Dot-Product Attention” to extract global features effectively. Its performance has surpassed traditional CNNs and recurrent neural networks (RNNs), especially on large-scale datasets. Recent studies have utilized transformers to extract global-temporal features of EEG signals for classification of mental imagery25, prediction of mental workload26, and emotion analysis27. In this article, we only utilized one transformer encoder layer and did not stack transformer layers, because the data size is still relatively small. Research has shown that transformer models require a larger amount of data than CNNs to extract features effectively28.
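A minimal sketch of such a single-layer encoder with torch.nn is shown below; the values of d_model, the number of heads, the feed-forward size, and the dropout rate are illustrative assumptions rather than the reported settings.

import torch
import torch.nn as nn

# Single transformer encoder layer applied to the local-feature sequence.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, dim_feedforward=128, dropout=0.3)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

local_feats = torch.randn(8, 64, 20)              # (batch, features, time) from the convolutional block
seq = local_feats.permute(2, 0, 1)                # (time, batch, features) layout expected by nn.Transformer
global_feats = encoder(seq).permute(1, 2, 0)      # back to (batch, features, time)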

Aggregation and classification

For the low sample rate setting, the temporal and spatial features will be directly aggregated and classified by a feed-forward network. For the high sample rate setting, the temporal features will be further fused before classification.

Feature fusion module

This module is designed to deal with high sampling rates and long signals. In the high sample rate setting, after the Temporal and Spatial Convolutional Block extracts local features, the Global-temporal Transformer Block extracts global features and retains the input length, leading to large output dimensions. To aggregate more discriminative features, this module reduces the temporal dimension of the output. The reduction is fulfilled by a one-dimensional temporal convolution layer that reweights the temporal features in a manner similar to the reweight module. This fusion of features facilitates the learning of the classification module and prevents the performance degradation resulting from high sampling rates.
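One plausible realization of this temporal reweighting is sketched below, where time steps are mixed by a 1-D convolution in analogy with the electrode reweight module; treating time as the channel axis and the reduced length T' are our assumptions, not the authors' exact code.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    # Mixes and reduces the temporal dimension from T to T' with a 1-D convolution.
    def __init__(self, t_in: int, t_out: int):
        super().__init__()
        self.mix = nn.Conv1d(t_in, t_out, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, features, T) -> (batch, T, features) -> (batch, T', features) -> (batch, features, T')
        return self.mix(x.transpose(1, 2)).transpose(1, 2)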

Classification module

The module comprises a multilayer perceptron (MLP) that integrates the temporal and spatial features and generates the final discrimination result via a 3-layer fully connected (FC) network. The size of the hidden layer is adjusted to suit the number of classification categories for different datasets and classification tasks.
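For illustration, the classification head can be sketched as follows; the hidden sizes and dropout rate are placeholders that are tuned per dataset and task.

import torch.nn as nn

def make_classifier(in_dim: int, hidden: int, n_classes: int, dropout: float = 0.5) -> nn.Sequential:
    # Three fully connected layers with ELU activations and Dropout.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(in_dim, hidden), nn.ELU(), nn.Dropout(dropout),
        nn.Linear(hidden, hidden // 2), nn.ELU(), nn.Dropout(dropout),
        nn.Linear(hidden // 2, n_classes),
    )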

Table 1 Description of the Five Public EEG Datasets. Exp.: Experimental, low-speed: low-speed serial visual presentation, RSVP: rapid serial visual presentation.

Experiments

Datasets

Table 1 provides the details of five public EEG datasets for visual recognition.

EEG727

The first dataset consists of EEG data collected from 10 subjects. The experiment involved showing colour images from six semantic categories. In total, 72 images were presented on a mid-grey background. Each image was displayed for 500 ms. The EEG data was recorded using 128-channel EGI HCGSN 110 nets. The dataset is also called SU DB in TSCNN18.

EEG20013

The second dataset consists of EEG data collected from 16 subjects. The experiment involved showing colour images from ten semantic categories. In total, 200 images without background were presented. These images can be categorized into two types: animate and inanimate. Each image was displayed for 200 ms. The EEG data was collected using a 64-channel BrainVision actiCHamp system.

ImageNet-EEG10

The third dataset comprises EEG data collected from a single participant. The study involved presenting natural images from 40 classes of the ImageNet database31. A total of 40,000 images were shown. Each image was displayed for 2 seconds. The EEG data was recorded using a BioSemi ActiveTwo recorder with 96 channels.

THINGS-EEG-10Hz29

The fourth dataset consists of EEG data collected from 50 participants. The data was collected using an RSVP paradigm, which involved displaying natural images cropped to a square shape. These images were selected from 1,854 object concepts of the THINGS database32. Each image was displayed for 50 ms, followed by a 50 ms blank screen. The EEG data was recorded using a 64-channel BrainVision actiCHamp system. The dataset is divided into two parts: 1,854 training concepts and 200 test concepts. We only used the 1,854 training concepts of the dataset. Due to the limited number of pictures in each concept, we combined the three criteria of higher-level object categories provided in THINGS and classified all single-category samples into 27 high-level categories for model validation.

THINGS-EEG-5Hz30

The fifth dataset consists of EEG data collected from 10 participants. This dataset used similar experimental paradigms and visual stimuli to THINGS-EEG-10Hz but differed in image presentation timing. In this dataset, each image was displayed for 100 ms, followed by a 100 ms blank screen. The EEG data was collected using a BrainVision actiCHamp amplifier and a 64-channel EASYCAP. The dataset is divided into two parts: 1,654 training concepts and 200 test concepts. We only used the 1,654 training concepts of the dataset. As with THINGS-EEG-10Hz, due to the limited number of pictures in each concept, we merged the 1,654 concepts into 27 high-level categories.

Data preprocessing

Low sample rate

The EEG signals from EEG72 were filtered between 1 and 25 Hz. The signals were then downsampled to 62.5 Hz and epoched into trials of 32 time samples. For the other datasets, the MNE package33 was used for preprocessing. The sampling rates of the data were adjusted to 62.5 Hz to be consistent with the preprocessing of EEG72 and the state-of-the-art model CAW-MASA-STST6. The epoch schemes are shown below:

  • For the ImageNet-EEG dataset, epochs were created ranging from -100 to 2,000 ms relative to stimulus onset. Baseline correction was applied by subtracting the mean of the pre-stimulus interval for each trial.

  • For EEG200, THINGS-EEG-10Hz, and THINGS-EEG-5Hz datasets, epochs were created ranging from 0 to 500 ms relative to stimulus onset. No further preprocessing or artifact correction methods were applied.

Besides, we applied z-score normalization and clamped large voltages.
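As an illustration, the following MNE-based sketch covers the main preprocessing steps described above (downsampling, epoching, z-score normalization, and clamping). The file name, event handling, per-channel normalization, and clamping threshold are our assumptions for this sketch.

import mne
import numpy as np

raw = mne.io.read_raw_brainvision("subject01.vhdr", preload=True)   # placeholder recording
raw.resample(62.5)                                                   # low sample rate setting

events, event_id = mne.events_from_annotations(raw)
epochs = mne.Epochs(raw, events, event_id, tmin=0.0, tmax=0.5,
                    baseline=None, preload=True)                     # 0-500 ms epochs, no baseline correction

x = epochs.get_data()                                                # (n_trials, n_channels, n_times)
x = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-8)  # per-channel z-score
x = np.clip(x, -20.0, 20.0)                                          # clamp large voltages (threshold assumed)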

High sample rate

Previous works have concentrated on the EEG72 dataset in a low sample rate setting. However, the highest frequency of interest in EEG is commonly around 100 Hz. To study the influence of sample rate, we utilized two additional sample rate settings: 125 Hz and 250 Hz. In both settings, we applied a 50 Hz notch filter to remove power-line interference, and in the 250 Hz setting, we filtered the signals in the frequency range of 0.1 to 100 Hz. Other preprocessing steps are the same as in the low sample rate setting.
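For the high sample rate settings, the corresponding MNE calls applied to the raw recording (before the downsampling shown in the sketch above) would be along these lines; the order of operations is our assumption.

raw.notch_filter(freqs=50.0)            # 50 Hz power-line notch, applied in both high sample rate settings
raw.filter(l_freq=0.1, h_freq=100.0)    # 0.1-100 Hz band-pass used in the 250 Hz setting
raw.resample(250.0)                     # 250 Hz setting; use raw.resample(125.0) for the 125 Hz setting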

Evaluation metrics

To evaluate the performance of our method, we adopt two widely used evaluation metrics: the classification accuracy and the confusion matrix. Accuracy is the proportion of correctly identified test samples to all test samples. Confusion matrices are used to measure the performance of multiclass classification problems by showing the counts of predicted and actual values for each class.

During the experiments, we applied 10-fold cross-validation to evaluate the models; accuracies over the test folds were averaged and are reported in the following section. To set appropriate hyperparameters for the proposed model, we utilized a grid search strategy to optimize the learning rate, batch size, and other parameters. We employed Adam34 as the optimization algorithm.
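A minimal sketch of the 10-fold evaluation loop is given below; the use of stratified folds and the train_and_predict callback (which stands in for training the model with Adam and predicting the test fold) are our assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, confusion_matrix

def cross_validate(x, y, train_and_predict, n_splits=10, seed=0):
    # Returns the mean test accuracy and the confusion matrix averaged over folds.
    accs, cms = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(x, y):
        y_pred = train_and_predict(x[train_idx], y[train_idx], x[test_idx])
        accs.append(accuracy_score(y[test_idx], y_pred))
        cms.append(confusion_matrix(y[test_idx], y_pred, labels=np.unique(y)))
    return float(np.mean(accs)), np.mean(cms, axis=0)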

Baseline models

In order to demonstrate the effectiveness of our model, we compared the classification accuracy with six baseline models (linear, CNN, LSTM, transformer, EEGNet and ShallowConvNet) and reported accuracies from other works6,11,12,18,35. Two of these baseline models are adopted from LMDA36. The remaining four baseline models are implemented by ourselves. Brief descriptions of three baseline models are given as follows:

CNN model18

We reproduced the temporal stream structure of TSCNN18. The CNN model consists of three temporal 1D-convolution layers and one spatial 1D-convolution layer. The kernel sizes of three temporal 1D-convolution layers are all set to 5, and the numbers of convolutional kernels are 16, 32, and 64, respectively.

EEGNet20

The EEGNet has a compact CNN architecture that consists of a common convolution layer, a depthwise convolution layer, and a separable convolution layer. The performance of this model has been validated on several BCI paradigms and EEG benchmarks37,38,39. The implementation of EEGNet is adopted from LMDA36.

ShallowConvNet21

The ShallowConvNet consists of two CNN layers and a fully connected layer with softmax activation. This architecture has been successfully applied to EEG signals for the classification of MI paradigms40,41. The implementation of ShallowConvNet is adopted from LMDA36.

We implement the above baseline models and our approach using the publicly available PyTorch 1.6.0 framework.

Table 2 Comparison with the State-of-the-Art Methods for EEG72.
Table 3 Comparison with the Baseline and State-of-the-Art Methods for EEG200. * means the method is reproduced and reported by CAW-MASA-STST6, and # means the method is reproduced by LMDA36 and is tested in our environment.

Results and discussion

Overall performance

In this section, we compared our proposed model with the baseline models on five different datasets to verify its effectiveness. Tables 2, 3, and 4 demonstrate that our model achieves high decoding accuracy on all five datasets.

For the EEG72 dataset, our method achieves 55.93% accuracy in the 6-class task, which is 1.11% higher than the state-of-the-art method. CAW-MASA-STST is an advanced model that combines spatio-temporal and spectral-temporal features, resulting in superior performance. Moreover, for the 72-class task, the classification accuracy is also better than that of other methods, 2.26% higher than CAW-MASA-STST. This improvement in decoding accuracy suggests that our model can extract features from EEG signals more efficiently.

For the EEG200 dataset, our proposed method achieves a 2-class classification accuracy of 69.08%, outperforming all other models. However, in the 10-class task, the proposed method does not achieve the best decoding accuracy. The classification accuracy is 29.39%, which is 1.98% lower than CAW-MASA-STST, but still far better than other methods, namely 4.26% higher than the accuracy of ShallowConvNet and 4.63% higher than the accuracy of LSTM-CNN. It is also worth noting that the accuracy of ShallowConvNet on our preprocessed data is much lower than the reported performance in CAW-MASA-STST, which may be attributed to the difference in data preprocessing or the implementation of ShallowConvNet.

For the ImageNet-EEG dataset, the classification accuracy of the proposed method for the 40-class task is 11.99%, which is better than all baseline models. Furthermore, we experimented with high sampling rate data on the ImageNet-EEG dataset and achieved an accuracy of 12.53% at 125 Hz and 13.01% at 250 Hz (see Table 6). This shows that the model learns and utilizes the information in the high-frequency band to further improve the performance.

For the THINGS-EEG-10Hz and THINGS-EEG-5Hz datasets, the proposed method outperforms all baseline models with 18.74% and 23.87% accuracy in the 27-class task, respectively. The difference in accuracy between the two datasets suggests that the EEG signals evoked by pictures presented at 5 Hz are more distinguishable. This is likely because participants have more time to react to each stimulus, reducing the overlap between adjacent sample signals.

Overall, our proposed method outperforms the compared methods on nearly all tasks across the five datasets, indicating that it can effectively extract more discriminative features from EEG signals.

Table 4 Comparison with the Baseline Methods for ImageNet-EEG, THINGS-EEG-10Hz and THINGS-EEG-5Hz. # means the method is reproduced by LMDA36 and is tested in our environment.
Table 5 Accuracy (%) of Ablation Experiment at 62.5 Hz on EEG72 and ImageNet-EEG. Shortcut denotes the skip connection; Trans. denotes the transformer.
Table 6 Accuracy (%) of Ablation Experiment at 125 and 250 Hz on ImageNet-EEG.

Ablation studies

We have conducted a series of experiments on our proposed method to evaluate the impact of reweighting, skip connection, transformer, and dimension reduction module. We have tested various combinations of these modules and compared their performance. Table 5 displays the classification performance of eight conditions at 62.5 Hz on both the EEG72 and ImageNet-EEG datasets. Additionally, Table 6 presents the classification performance of sixteen conditions at 125 and 250 Hz on the ImageNet-EEG dataset.

Efficacy of reweight module

The reweight module enables the model to focus on the more informative electrodes while filtering out the less relevant ones. The effect of the module is shown in Table 5 and Table 6. Compared with the base model, the reweight module improved the accuracy on the EEG72 dataset for the 6-class and 72-class tasks by 3.15% and 5.40%, respectively. Additionally, the reweight module also improved the accuracy on the ImageNet-EEG dataset by 1.10%, 1.28%, and 0.61% at 62.5 Hz, 125 Hz, and 250 Hz, respectively. The performance with the reweight module was better than that of the other variants, indicating that adaptively learning the EEG channel weights is crucial for the model's high performance.

Efficacy of skip connection

The skip connection (shortcut) is a technique that enables a deep-learning model with a high depth not only to effectively extract high-level features but also to maintain sensitivity to low-level features. From Table 5 and Table 6, it is shown that the skip connection only results in a minor improvement in accuracy. This could be because our method uses a lightweight model, so the benefits of the skip connection may not be significant.

Efficacy of transformer

The transformer layer is a powerful tool for modelling large-scale global features, especially when dealing with large data sets. However, as shown in Table 5 and Table 6, the transformer layer only yields a slight improvement in accuracy. This could be attributed to the limited experimental data available from a single subject, which may constrain the potential of the transformer model.

Efficacy of dimension reduction

Signals with a high sampling rate contain more useful information at high frequencies, making it possible to achieve better results compared to signals at 62.5 Hz. However, we observed that training the same model at 125 Hz and 250 Hz without adjusting the parameters tuned at 62.5 Hz resulted in much lower accuracy. This suggests that the model had difficulty classifying due to the large number of features at a high sampling rate. To overcome this issue, we designed a feature fusion module that enables the model to extract more expressive features by reducing the dimensionality of the features. As shown in Table 6, the dimension reduction led to increases in accuracy of 2.13% and 2.82% on the ImageNet-EEG dataset at 125 Hz and 250 Hz, respectively.

Fig. 3
figure 3

Performance comparison with different hyperparameters on ImageNet-EEG dataset. (a) Learning rate. (b) Weight decay. (c) Batch size. (d) Clamping threshold. (e) Dropout of transformer. (f) Dimension reduction. (g) FC hidden layer size. (h) Dropout of FC.

Parameter sensitivity

Deep-learning models are highly sensitive to hyperparameter settings, which means that optimizing these parameters is crucial to obtain a reliable model. Grid search is a widely used strategy for hyperparameter optimization in deep-learning models, and we adopted it to optimize the hyperparameters of our proposed model, including learning rate, weight decay, and batch size.

From Fig. 3 (a), it can be observed that setting the learning rate to 1e-3 resulted in the highest accuracy. A learning rate of 1e-2 may cause the model to oscillate around the global optimum, resulting in lower accuracy. Meanwhile, a learning rate of 1e-4 may lead the model into a local optimum, also resulting in lower accuracy. From Fig. 3 (b), it can be observed that increasing weight decay harms the model’s performance on the ImageNet-EEG dataset. As weight decay increases, the model tends to underfit, yielding worse results. From Fig. 3 (c), it can be observed that setting the batch size to 128 leads to the highest accuracy. A small batch size of 64 may make the gradient descent process unstable, resulting in lower accuracy, and increasing the batch size beyond 128 does not improve the model’s performance.

We also experimented with other hyperparameters, including the clamping threshold, dropout of the transformer, dimension reduction, FC hidden layer size, and dropout of the FC layers. Fig. 3 (d) shows that clamping reduced the model’s performance on the ImageNet-EEG dataset. Meanwhile, from Fig. 3 (e) (f) (g) (h), it can be seen that optimizing the other hyperparameters improved the model’s performance. As we achieved better results than the baselines with the initial parameters, we did not further adjust these parameters.

Fig. 4
figure 4

Visualization of weight distribution of 124 electrodes on EEG72 dataset. The weight distribution maps are calculated by summing the absolute values of the weights of each electrode in reweight module.

Channel attention of reweight module

We employed heatmaps to visualize the effectiveness of the reweight module on the EEG72 and ImageNet-EEG datasets. Fig. 4 shows the weight distribution maps of the reweight modules for 10 subjects on the EEG72 dataset. The darker red areas on the scalp indicate that the electrode at that position has a greater weight. From the figure, we can observe that only a few electrodes of the multi-channel EEG signal are strongly correlated with the visual target recognition task. The electrodes with large weights are mainly distributed over the occipital lobe of the brain, which corresponds to the visual cortex and is responsible for processing visual information. This result is consistent with known neuroscience findings. However, there are some differences in the EEG channel weights of these subjects due to individual differences. This indicates that the proposed model can learn the EEG channel weights of each subject adaptively without the channel weights being set manually. By assigning small weights to unrelated channels, the model can suppress noise on those channels, thus improving its recognition accuracy.

Fig. 5
figure 5

The confusion matrices on EEG72.

Fig. 6
figure 6

The t-SNE visualization for the high-dimensional features from the fully connected layer of subject 5 and subject 6 in EEG72 dataset.

Comparison between different classes

The confusion matrices were averaged across all folds and subjects. Fig. 5 illustrates the averaged confusion matrices of the four methods on the EEG72 dataset for the 6-class task. The model mostly confuses the FV class with the IO class, while the HF class is the most distinguishable. The decoding accuracy of all four methods for the human faces (HF) category reaches more than 68%, which is higher than for the other categories. This shows that the human visual system is highly sensitive to human faces. Compared with the other three methods, our proposed method’s decoding accuracy is improved in all categories, demonstrating that it can effectively extract more discriminative features for each category.

The t-distributed Stochastic Neighbour Embedding (t-SNE)42 is a statistical technique commonly used for dimension reduction and feature visualization. In our study, we assessed the feature extraction ability of the model by extracting features from the fully connected layer of an individual fold and using them to visualize EEG features in a 2-D embedding space. Fig. 6 showcases the t-SNE maps of the EEG72 dataset for two subjects and six classes. As the results indicate, our approach effectively captures discriminative information and enhances the separability of different classes: samples of the same class cluster together, and the HF class is the most distinguishable.
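The t-SNE projection itself can be reproduced with scikit-learn as sketched below; the feature dimensionality, the number of trials, and the perplexity are placeholders, and in our experiments the features come from the fully connected layer of a single fold.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder features standing in for fully connected layer activations of one test fold.
features = np.random.randn(432, 128)          # e.g. 432 trials x 128-dimensional FC features
labels = np.repeat(np.arange(6), 72)          # six semantic categories

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)
plt.show()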

Conclusion

In this study, we propose a hybrid neural network to investigate the ability of deep learning to perform visual recognition based on raw EEG signals. The proposed framework first extracts local temporal and spatial features with linear and convolutional models. Then, the local features are further processed to extract global temporal features with the transformer model. Next, the temporal and spatial features are aggregated and classified by a feed-forward network. We validated the effectiveness of our model on five different datasets across different sample rates. The proposed model achieved consistently higher accuracy in the 6-class and 72-class classifications for the EEG72 dataset compared to other state-of-the-art models. In the case of the EEG200 dataset, our model achieved the highest accuracy in 2-class classification. Furthermore, for the ImageNet-EEG, THINGS-EEG-10Hz, and THINGS-EEG-5Hz datasets, our model outperformed all baseline models. This work demonstrates that neural networks can extract robust features from raw EEG signals, leading to good performance in EEG classification during visual recognition tasks. To further facilitate the practical application of brain-computer interfaces, we recommend designing more sophisticated deep-learning models with effective training objectives and pre-training on large datasets to reduce the effects of noise.