Abstract
Intelligent Transportation Systems (ITSs) have become pivotal in urban traffic management by utilizing traffic flow prediction, which aids in alleviating congestion and facilitating route planning. This study introduces the Linear Attention Based Spatial-Temporal Multi-Graph Convolutional Neural Network (LASTGCN), a novel deep learning model tailored for traffic flow prediction. LASTGCN incorporates a Multifactor Fusion Unit (MFF-unit) to dynamically integrate meteorological factors, an advanced multi-graph convolutional network for spatial correlations, and the Receptance Weighted Key Value (RWKV) block, which employs a linear attention mechanism for efficient processing of historical traffic data. By using RWKV, which offers advantages over Transformer-based models in handling large-scale data while capturing complex dependencies, the model achieves a computational efficiency that makes it suitable for mid-term traffic management scenarios and potentially adaptable to real-time applications with further optimization. Experimental results using real-world highway traffic datasets indicate that LASTGCN outperforms several state-of-the-art methods in terms of accuracy and robustness, especially in long-term predictions. Additionally, integrating external factors such as weather conditions was found to significantly enhance the model's predictive accuracy.
Introduction
Driven by the rapid progression of the Internet of Things (IoT) and 5G communication technologies in recent years, an extensive range of traffic information is now readily accessible through various advanced sensors. This influx of data has spurred the growth and development of Intelligent Transportation Systems (ITSs)1. ITSs have become a cornerstone of urban traffic management and the broader scope of smart city infrastructure. One essential component of ITSs2, traffic flow prediction, enables precise forecasting of traffic patterns. This predictive capability serves a multitude of purposes: it aids in reducing traffic congestion, curbing fuel consumption, and mitigating air pollution. Moreover, it helps commuters and city dwellers in planning their routes and travel times more efficiently3,4.
As traffic-related datasets continue to grow and machine learning technology develops, an increasing number of researchers are applying deep learning methodologies to tackle traffic prediction problems5,6. Recent studies predominantly leverage Graph Neural Networks (GNNs) to model the spatial aspects of traffic data and Recurrent Neural Networks (RNNs) for temporal correlations7,8. Despite the effective representation of spatial and temporal connections within the transportation network, and their successful application in numerous spatial-temporal prediction tasks, RNN-based approaches, including AGCRN9 and T-GCN10, frequently suffer from issues like time-consuming iterative propagation and gradient explosion/vanishing. This renders them suboptimal for long-term prediction tasks11. In contrast, Transformer models12 have risen to prominence due to their adept handling of local and long-range dependencies and their amenability to parallelized training13. Noteworthy contemporary models such as ASTGCN14, GMAN15, and STGAT16 integrate the self-attention mechanism with GCN, simultaneously capturing the dynamic spatial-temporal characteristics of traffic data. However, this substantial progress is not without challenges, particularly the inherent quadratic complexity of the self-attention mechanism in Transformers12. This complexity causes inference time and memory utilization to grow quadratically as the length of the historical traffic sequence increases, driving up computational cost and making these models progressively harder to deploy on traditional hardware infrastructure.
Recent endeavors aimed at ameliorating the limitations of the attention mechanism have strived to improve the scaling properties of Transformers17,18,19. A prominent example is the Receptance Weighted Key Value (RWKV) model20, which implements a linear attention mechanism, dispensing with the conventional dot-product token interaction. This approach marries the efficient parallelizable training of Transformers with the efficient inference capabilities of RNNs. However, it is worth noting that most of this research has focused on the field of natural language processing, with little exploration of optimized Transformer architectures within traffic flow prediction. The parallels between traffic flow forecasting and language modeling are striking, especially considering their mutual emphasis on the management of sequential data and the prediction of forthcoming activity: both tasks must decipher the patterns embedded within past data and extrapolate their future implications. Given these similarities, it stands to reason that such methodologies might hold substantial untapped potential in the realm of traffic flow prediction.
Traffic flow prediction is a complex, multifaceted challenge21. It not only grapples with intricate spatial and temporal correlations but must also account for a plethora of external factors22,23. For instance, as weather conditions fluctuate from sunny to heavy rain, the traffic speed generally decreases. Such influences introduce a degree of randomness into traffic states, further complicating accurate traffic forecasting24,25.
To tackle these issues, we propose a novel deep learning model for traffic flow prediction: the Linear Attention Based Spatial-Temporal Multi-Graph Convolutional Neural Network (LASTGCN). This model efficiently captures dynamic spatiotemporal features and the interference of external factors. By utilizing an optimized and efficient linear attention mechanism, it minimizes computational complexity and overhead. These attributes significantly facilitate its deployment, making it convenient for implementation within sensor systems. The key contributions of this paper can be summarized as follows:
- Multifactor Fusion Unit: We introduce a Multifactor Fusion Unit (MFF-unit) that enhances traffic prediction models by dynamically integrating multiple meteorological factors, considering their temporal fluctuations for a more accurate representation of the weather-traffic nexus.
- Multigraph Convolutional Neural Network: We propose an advanced multi-graph convolutional network designed to integrate spatial correlations within the traffic network. By leveraging three spatial correlation graphs—distance, correlation, and interaction—we enhance the granularity of traffic flow understanding, optimizing the model's learning efficiency and prediction performance.
- RWKV Block: We are the first to incorporate the Receptance Weighted Key Value (RWKV) block20 into traffic flow prediction, employing a linear attention mechanism to better handle historical traffic flow. This integral component of the model adeptly discerns the spatial-temporal interdependencies inherent in historical traffic flow data, thereby achieving an optimal trade-off between computational efficiency and holistic performance within our model.
- Superior Performance: The framework we propose has undergone extensive experimentation on the PeMSD4 and PeMSD8 datasets. These experiments demonstrate that the proposed framework is significantly superior to several advanced forecasting methods in terms of accuracy and robustness.
Related work
Deep learning based traffic flow prediction
The realm of traffic flow forecasting has witnessed a significant evolution with the advent of deep learning models. These models, known for their ability to capture nonlinear spatiotemporal effects, have served as the cornerstone for numerous traffic prediction methods.
In the early stage of this evolution, the primary focus was on harnessing temporal correlations. This phase was dominated by Recurrent Neural Networks (RNNs) and their derivatives, Long Short-Term Memory (LSTM)26 and Gated Recurrent Units (GRU)27, due to their proficiency in capturing temporal dependencies. Noteworthy contributions in this area include the works of Van Lint et al.28 and Fu et al.29, who successfully applied these models to extract temporal dependencies from traffic data. A subsequent stride in traffic forecasting involved accounting for the spatial correlation of traffic data. Wu and Tan30 proposed a novel integration of Convolutional Neural Networks (CNN) and LSTM for capturing the spatiotemporal correlation of traffic data. Concurrently, Zhang et al.31 advanced a CNN-based model, DeepST, specifically designed to simulate spatial near-distance dependencies and the inherent cyclical trends of time.
Despite their successes, CNN-based models fell short when dealing with non-Euclidean spatial data, stimulating the development of GCNs32,33. Models such as the STGCN proposed by Yu et al.34, T-GCN model proposed by Zhao et al.10, and the Diffusion Convolutional Recurrent Neural Network (DCRNN) model from Li et al.35, adeptly handled these spatial–temporal relationships.
Recently, the Transformer architecture12, employing attention mechanisms and a position encoding strategy, has been introduced in this ___domain. Novel models such as ASTGCN14 effectively process traffic data directly on the original graph-based traffic network. Moreover, GMAN15 extends the attention mechanism to predict traffic conditions at varying time steps and locations on a road network graph, marking a promising development in the field of traffic flow forecasting.
Optimizing attention mechanism
There are two main strategies to mitigate the memory and computational complexity of transformers, which inherently scale quadratically with sequence length. The first strategy focuses on refining the attention mechanism13. A plethora of transformer derivatives, termed "x-formers", have been developed, employing tactics such as sparse attention36,37,38, full attention matrix approximation19,39,40, and chunked attention with gating41. Notably, recent innovations like FlashAttention17 improve memory efficiency, but time complexity remains an issue, either remaining quadratic or influenced by hidden factors like chunk size. Conversely, RWKV20 exemplifies superior space-time complexity trade-offs during inference by ingeniously casting linear attention as an RNN.
The second strategy involves supplanting the attention mechanism with alternative modules for scalability in long sequences. For instance, MLP-Mixer42,43 advocates replacing attention with Multi-Layer Perceptrons in computer vision. Additionally, the Attention Free Transformer (AFT)44 replaces dot-product self-attention with an efficient variant resembling multi-head attention in its feature dimension correspondence. Concurrently, RNN-based components have been adapted to extend context length, such as the Recurrent Memory Transformer45 and Linear Recurrent Units46.
Preliminaries
Definition 1
In this research, we conceptualize a traffic network as a weighted directed graph, represented by \(\:\text{G}=(\text{V},\text{E})\). Here, \(\:\text{V}\) is a finite set of \(\:\left|\text{V}\:\right|=\text{N}\) nodes indicating \(\:\text{N}\) observed traffic sensors on the traffic network. \(\:\text{E}\) represents all the edges signifying the connectivity among the nodes. The weighted adjacency matrix of graph \(\:\text{G}\) is denoted by \(\:\text{A}\in\:{\mathbb{R}}^{\text{N}\times\:\text{N}}\). Each node within the traffic network \(\:\text{G}\) detects \(\:\text{F}\) traffic conditions of interest (e.g., traffic volume, traffic speed, etc.). Consequently, each node generates a feature vector of length \(\:\text{F}\) at every time step.
Definition 2
We consider a traffic network \(\:\text{G}\) where each node records \(\:\text{F}\) time series; the \(\:\text{f}\)-th series, with \(\:\text{f}\in\:\left\{1,\dots\:,\text{F}\right\}\), represents a sequence of one traffic feature. The value of the \(\:\text{f}\)-th feature of node \(\:\text{i}\) at time \(\:\text{t}\) is denoted by \(\:{\text{x}}_{\text{f},\text{i}}^{\text{t}}\in\:\mathbb{R}\), while \(\:{\text{x}}_{\text{t}}^{\text{i}}\in\:{\mathbb{R}}^{\text{F}}\) represents the values of all features of node \(\:\text{i}\) at time \(\:\text{t}\). The feature values of all nodes at time \(\:\text{t}\) are represented by \(\:{\text{X}}_{\text{t}}={\left({\text{x}}_{\text{t}}^{1},{\text{x}}_{\text{t}}^{2},\dots\:,{\text{x}}_{\text{t}}^{\text{N}}\right)}^{\text{T}}\in\:{\mathbb{R}}^{\text{N}\times\:\text{F}}\). The feature values of all nodes over \(\:{\uptau\:}\) time slices are denoted by \(\:\mathcal{X}={\left({\text{X}}_{1},{\text{X}}_{2},\dots\:,{\text{X}}_{{\uptau\:}}\right)}^{\text{T}}\in\:{\mathbb{R}}^{\text{N}\times\:\text{F}\times\:{\uptau\:}}\).
Definition 3
The external factors influencing traffic conditions can be conceptualized as supplementary characteristics affiliated with the segments of an urban road network. These factors may be organized into an attribute matrix denoted as \(\:\mathcal{K}=\left({\text{K}}_{1},{\text{K}}_{2},\dots\:,{\text{K}}_{\text{l}}\right)\), where \(\:\text{l}\) signifies the total number of categories encompassing the supplementary information. Furthermore, the collection of supplementary information pertaining to category \(\:\text{j}\) can be symbolized as \(\:{\text{K}}_{\text{j}}=\left({\text{j}}_{1},{\text{j}}_{2},\dots\:,{\text{j}}_{\text{t}}\right)\). Within this notation, \(\:{\text{j}}_{\text{t}}^{\text{i}}\) represents the \(\:\text{j}\)-th type of supplementary information corresponding to the \(\:\text{i}\)-th road segment at a specific time instance denoted by \(\:\text{t}\).
Problem
Given \(\:\mathcal{X}\), which represents all kinds of historical traffic conditions of all nodes on the traffic network over the past \(\:{\uptau\:}\) time steps, the goal is to predict future traffic flow sequences \(\:\mathcal{Y}={\left({\text{y}}_{1},{\text{y}}_{2},\dots\:,{\text{y}}_{\text{N}}\right)}^{\text{T}}\in\:{\mathbb{R}}^{\text{N}\times\:{\text{T}}_{\text{p}}}\) of all nodes on the entire traffic network over the next \(\:{\text{T}}_{\text{p}}\) time steps. Here, \(\:{\text{y}}_{\text{i}}=\left({\text{y}}_{{\uptau\:}+1}^{\text{i}},{\text{y}}_{{\uptau\:}+2}^{\text{i}},\dots\:,{\text{y}}_{{\uptau\:}+{\text{T}}_{\text{p}}}^{\text{i}}\right)\in\:{\mathbb{R}}^{{\text{T}}_{\text{p}}}\) denotes the future traffic flow of node \(\:\text{i}\) from time \(\:{\uptau\:}+1\). The traffic prediction problem can be understood as learning the function \(\:\text{f}\), as shown in Eq. (1):
$$\:\begin{array}{c}\mathcal{Y}=\text{f}\left(\mathcal{X};\text{G}\right).\end{array}$$(1)
Methodology
This paper proposes the architecture of the LASTGCN model, with the comprehensive design illustrated in Fig. 1. The model comprises three independent components with the same structure, which are designed to respectively model the recent, daily-periodic and weekly-periodic dependencies of the historical data. Finally, the outputs of the three components are integrated to obtain the final prediction. Each component within our proposed model shares a uniform network structure, being composed of a series of stacked RWKV blocks20, GCNs, and two distinct fusion units, as depicted in Fig. 2a. The GCN demonstrates a substantial competency in managing localized spatial correlations. The RWKV block20 seamlessly blends the parallelizable training efficiency typically found in Transformers with the potent inference capabilities associated with RNNs. This integration renders it highly capable in handling both long-term temporal correlations and global spatial dependencies, thereby enhancing computational efficiency. The impact of external variables on variations in traffic flow has also been taken into account, including considerations such as temperature, visibility, and prevailing weather conditions. A multifactor fusion unit (MFF-unit) has been thoughtfully incorporated into our design to effectively capture these exogenous environmental perturbations. To complete our model, we introduce a temporal fusion unit that is specifically designed to capture global time dependency.
Illustration of the LASTGCN model architecture and RWKV block. (a) Detailed representation of a single component within the LASTGCN model, showcasing a series of stacked RWKV blocks, GCNs, and two distinct fusion units that model historical data dependencies; (b) The internal structure of the RWKV architecture, highlighting the principles of time-mixing and channel-mixing blocks, including the four primary elements: Receptance vector (R), Weight (W), Key (K), and Value (V), and how they interact through recurrent structures.
Three time series segments
We adopted the three time series segment approach, which was first proposed in ASTGCN14. Assuming a sampling frequency of \(\:\text{q}\) samples per day and a current time of \(\:{\text{t}}_{0}\), we consider a prediction window of size \(\:{\text{T}}_{\text{p}}\). Three time series segments of lengths \(\:{\text{T}}_{\text{h}}\), \(\:{\text{T}}_{\text{d}}\), and \(\:{\text{T}}_{\text{w}}\) serve as the input data for the recent, daily-periodic, and weekly-periodic components, where these lengths are all integer multiples of \(\:{\text{T}}_{\text{p}}\). The details of these three time series segments are as follows (a slicing sketch is given after the list):
- The recent segment.
The influence of recent traffic flows on future traffic patterns is inevitable, given the gradual nature of traffic congestion formation and dispersion. This idea is encapsulated in the representation of a segment of historical time series directly adjacent to the predicting period, \(\:{\mathcal{X}}_{\text{h}}\), as defined in Eq. (2):
$$\:\begin{array}{c}{\mathcal{X}}_{h}=({\mathbf{X}}_{{t}_{0}-{T}_{h}+1},{\mathbf{X}}_{{t}_{0}-{T}_{h}+2},\dots\:,{\mathbf{X}}_{{t}_{0}})\in\:{\mathbb{R}}^{N\times\:F\times\:{T}_{h}}.\end{array}$$(2)
- The daily-periodic segment.
In order to capture the daily periodicity of traffic data, which often exhibits recurring patterns due to regular daily routines of individuals such as morning traffic peaks, a segment is considered. This segment, denoted as \(\:{\mathcal{X}}_{\text{d}}\), consists of segments from the past few days at the same time period as the predicting period. It is defined in Eq. (3):
$$\:\begin{array}{c}\begin{array}{cc}{\mathcal{X}}_{d}=(&\:{\mathbf{X}}_{{t}_{0}-\left({T}_{d}/{T}_{p}\right)*q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-\left({T}_{d}/{T}_{p}\right)*q+{T}_{p}},\\\:&\:{\mathbf{X}}_{{t}_{0}-\left({T}_{d}/{T}_{p}-1\right)*q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-\left({T}_{d}/{T}_{p}-1\right)*q+{T}_{p}},\\\:&\:⋮,\\\:&\:{\mathbf{X}}_{{t}_{0}-q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-q+{T}_{p}})\in\:{\mathbb{R}}^{N\times\:F\times\:{T}_{d}}.\end{array}\end{array}$$(3)
- The weekly-periodic segment.
In order to effectively capture the recurring weekly periodic characteristics inherent in traffic data, the weekly-periodic segment is considered. This segment, denoted as \(\:{\mathcal{X}}_{\text{w}}\), is composed of the segments from the last few weeks that have the same week attributes and time intervals as the forecasting period. Conventionally, traffic patterns experienced on a specific Monday, for example, tend to exhibit notable correlations with their historical counterparts on previous Mondays, whilst they may considerably deviate from patterns observed during weekend periods. Therefore, the weekly-periodic segment becomes a pivotal element in modeling traffic patterns. The segment \(\:{\mathcal{X}}_{\text{w}}\) is defined in Eq. (4):
$$\:\begin{array}{c}\begin{array}{cc}{\mathcal{X}}_{w}=(&\:{\mathbf{X}}_{{t}_{0}-7*\left({T}_{w}/{T}_{p}\right)*q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-7*\left({T}_{w}/{T}_{p}\right)*q+{T}_{p}},\\\:&\:{\mathbf{X}}_{{t}_{0}-7*\left({T}_{w}/{T}_{p}-1\right)*q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-7*\left({T}_{w}/{T}_{p}-1\right)*q+{T}_{p}},\\\:&\:⋮,\\\:&\:{\mathbf{X}}_{{t}_{0}-7*q+1},\dots\:,{\mathbf{X}}_{{t}_{0}-7*q+{T}_{p}})\in\:{\mathbb{R}}^{N\times\:F\times\:{T}_{w}}.\end{array}\end{array}$$(4)
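As a concrete illustration of Eqs. (2)-(4), the sketch below slices the three segments out of a historical data tensor. The (nodes × features × time) layout follows the definitions above, while the function name and NumPy realization are our own assumptions for exposition, not part of a released implementation.

```python
import numpy as np

def build_segments(data, t0, q, Tp, Th, Td, Tw):
    """Slice the recent, daily-periodic and weekly-periodic inputs.

    data: array of shape (N, F, T_total); t0: index of the last observed
    step; q: samples per day (288 for 5-min data). Th, Td, Tw must be
    integer multiples of Tp, as required in the text.
    """
    # Recent segment (Eq. 2): the Th steps immediately before the window.
    recent = data[:, :, t0 - Th + 1 : t0 + 1]

    # Daily-periodic segment (Eq. 3): the same Tp-long slot on each of the
    # past Td/Tp days, ordered from the oldest day to the most recent one.
    daily = np.concatenate(
        [data[:, :, t0 - d * q + 1 : t0 - d * q + Tp + 1]
         for d in range(Td // Tp, 0, -1)], axis=2)

    # Weekly-periodic segment (Eq. 4): the same slot on the same weekday
    # of each of the past Tw/Tp weeks.
    weekly = np.concatenate(
        [data[:, :, t0 - 7 * w * q + 1 : t0 - 7 * w * q + Tp + 1]
         for w in range(Tw // Tp, 0, -1)], axis=2)
    return recent, daily, weekly
```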
Multi-factors fusion unit
Meteorological conditions predominantly serve as the crucial external determinants influencing people’s travel decisions, with extreme weather conditions posing significant constraints on travel and, by extension, impacting road traffic. The influence of differing meteorological elements and the severity of weather conditions yield disparate effects on the fluctuations in traffic flow, as expounded in reference47. For instance, during periods of heavy rain, a noticeable decline in traffic volume is frequently observed, compared to that under normal weather conditions. Consequently, integrating meteorological factors into traffic prediction models assists in curbing inaccuracies21,48.
The AST-GCN model49 perceives external factors as both static and dynamic attributes of roadway segments within the traffic network, and develops an attribute-augmented unit to glean additional traffic features. However, their experimentation was limited to one dynamic meteorological factor. Drawing from their approach, we have chosen to incorporate three dynamic meteorological factors: temperature, visibility, and weather conditions.
In pursuit of capturing dynamic meteorological attributes, we conceive an enhanced approach known as the Multifactor Fusion Unit (MFF-unit). This strategy leverages a fusion of two matrices: the traffic feature matrix \(\:\text{X}\), and the external attribute matrix \(\:\text{D}\). \(\:\text{D}\in\:{\mathbb{R}}^{\text{n}\times\:\left(\text{w}\text{*}\text{t}\right)}\) is a collection of \(\:\text{w}\) different dynamic attributes \(\:{\text{D}}_{1},{\text{D}}_{2},\dots\:,{\text{D}}_{\text{w}}\). In this paper, \(\:\text{w}\) is set to 3, representing temperature (T), visibility (V), and weather conditions (C). These three meteorological factors demonstrate temporal fluctuations, thereby categorizing them as dynamic attributes. Acknowledging the cumulative influence of these dynamic elements over a period on traffic conditions, our methodology diverges from the conventional approach of selecting attribute values that correlate solely with a single time instance \(\:\text{t}\). Instead, we enlarge the selection window to size \(\:\text{m}+1\), opting for \(\:{\text{D}}_{\text{t}-\text{m},\text{t}}^{\text{w}}=\left[{\text{D}}_{\text{t}-\text{m}}^{\text{w}},{\text{D}}_{\text{t}-\text{m}+1}^{\text{w}},\dots\:,{\text{D}}_{\text{t}}^{\text{w}}\right]\) for each dynamic attribute submatrix \(\:{\text{D}}_{\text{w}}\). We then fuse the traffic flow data and weather data through the MFF-unit, obtaining the attribute-enhanced matrix \(\:{\text{E}}_{\text{t}}\) at time \(\:\text{t}\), expressed as:
$$\:\begin{array}{c}{\varvec{E}}_{t}=\left[{\mathbf{X}}_{t},{\text{D}}_{t-m,t}^{1},{\text{D}}_{t-m,t}^{2},{\text{D}}_{t-m,t}^{3}\right],\end{array}$$(5)
where \(\:{\varvec{E}}_{t}\in\:{\mathbb{R}}^{n\times\:\left(F+3\times\:\left(m+1\right)\right)}\) and \(\:\left[\cdot\:\right]\) denotes concatenation along the feature dimension.
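For concreteness, a minimal sketch of the fusion in Eq. (5) is given below, assuming the dynamic attributes are stored as a (w, n, T) array; the helper name and layout are our assumptions.

```python
import numpy as np

def mff_unit(X_t, D, t, m):
    """Fuse traffic features with windowed weather attributes (Eq. 5).

    X_t: (n, F) traffic features at time t.
    D:   (w, n, T) dynamic attributes, w = 3 for temperature, visibility
         and weather condition (assumed layout).
    Returns E_t with shape (n, F + w * (m + 1)).
    """
    window = D[:, :, t - m : t + 1]                  # (w, n, m + 1)
    window = window.transpose(1, 0, 2).reshape(X_t.shape[0], -1)
    return np.concatenate([X_t, window], axis=1)     # attribute-enhanced E_t
```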
Multi-graph Convolution network
Spectral graph theory facilitates the extension of convolution operations beyond the confines of grid-based data structures, permitting their application to graph-based structures. The traffic network is a prime example of such a structure, whose inherent characteristics and per-node attributes can be construed as signals on the graph50. To judiciously leverage the topological attributes of the traffic network, graph convolutions, grounded in spectral graph theory, are employed at each discrete time slice. This approach serves not only to directly manipulate the signals but also to harness the correlations among these signals within the spatial ___domain of the traffic network.
To seamlessly integrate spatial correlations within the traffic network utilizing spectral graph theory, we have devised a sophisticated layer termed the Spatial Graph Convolutional Layer, as depicted in Fig. 3. A notable aspect of this layer is its scalability; it can be systematically stacked and expanded to accommodate varying scales and complexities inherent in attribute-enhanced traffic flow. To adeptly capture the interactions amongst nodes, the Spatial Multi-Graph Convolutional Layer is employed, which is adept at performing convolution operations tailored to the intricacies of node interactions. However, it is imperative to recognize that if the graphs generated are not sufficiently representative of the underlying correlations between nodes, this inadequacy could be detrimental to the learning model and may consequently attenuate prediction performance. With this consideration in mind, we have meticulously constructed three spatial correlation graphs: distance graph \(\:{\text{G}}_{\text{D}}\), interaction graph \(\:{\text{G}}_{\text{I}}\), and correlation graph \(\:{\text{G}}_{\text{C}}\). These graphs are aligned with three distinct time series segments and are instrumental in accurately portraying the spatial correlations between nodes within the extracted traffic graph. A more elaborate discourse on the structure and utility of these graphs will follow:
- Distance graph.
Our analysis of traffic graphs led to a noteworthy observation: nodes in close physical proximity tend to have similar navigational patterns. Motivated by this, we constructed a distance graph \(\:{\text{G}}_{\text{D}}\left(\text{V},\text{E},{\mathcal{A}}_{\text{D}}\right)\) to effectively capture the relationships between nodes based on their spatial proximity. The linchpin of \(\:{\text{G}}_{\text{D}}\left(\text{V},\text{E},{\mathcal{A}}_{\text{D}}\right)\) is the weighted adjacency matrix, for which we employed Gaussian similarity, as described by Eq. (6):
$$\:\begin{array}{c}{\mathcal{W}}_{ij}=\text{exp}\left(-{\mathcal{D}}_{\mathcal{i}\mathcal{j}}^{2}/{{\upsigma\:}}^{2}\right),\end{array}$$(6)where \(\:{\mathcal{D}}_{\mathcal{i}\mathcal{j}}\) is the Euclidean distance between monitoring points \(\:{\text{n}}_{\text{i}}\) and \(\:{\text{n}}_{\text{j}}\), and \(\:{\upsigma\:}\) is the standard deviation of the distances, which controls the neighborhood width51. The adjacency matrix \(\:{\mathcal{A}}_{\text{D}}\) of the distance graph \(\:{\text{G}}_{\text{D}}\) is given by:
$$\:\begin{array}{c}{\mathcal{A}}_{D}^{ij}=\left\{\begin{array}{c}{\mathcal{W}}_{ij},\:\:\text{if}\:i\ne\:j,\\\:0,\:\:\text{otherwise}.\end{array}\right.\end{array}$$(7)
- Correlation graph.
In constructing spatial correlation graphs, it is imperative to comprehend the historical usage patterns of each node across different time segments. With this objective, we devised a correlation graph \(\:{\text{G}}_{\text{C}}\left(\text{V},\text{E},{\mathcal{A}}_{\text{C}}\right)\) that encapsulates the relationships between nodes through their historical correlations. The Pearson correlation coefficient is used to quantify these correlations. Let \(\:\text{X}\) and \(\:\text{Y}\) represent the traffic flow volumes at monitoring points \(\:{\text{n}}_{\text{i}}\) and \(\:{\text{n}}_{\text{j}}\) respectively; the Pearson coefficient is defined as:
$$\:\begin{array}{c}{\mathcal{C}}_{ij}=\frac{Cov\left(X,Y\right)}{{\sigma\:}_{X}{\sigma\:}_{Y}},\end{array}$$(8)where \(\:\text{C}\text{o}\text{v}\left(\cdot\:,\cdot\:\right)\) denotes the covariance, and \(\:{{\upsigma\:}}_{\text{X}}\) and \(\:{{\upsigma\:}}_{\text{Y}}\) are the standard deviations of \(\:\text{X}\) and \(\:\text{Y}\) respectively. The adjacency matrix \(\:{\mathcal{A}}_{\text{C}}\) of \(\:{\text{G}}_{\text{C}}\) is then defined as:
$$\:\begin{array}{c}{\mathcal{A}}_{C}^{ij}=\left\{\begin{array}{c}{\mathcal{C}}_{ij},\:\:\text{if}\:i\ne\:j,\\\:0,\:\:\text{otherwise}.\end{array}\right.\end{array}$$(9)
- Interaction graph.
The historical traffic flow between nodes significantly influences the construction of spatial correlation graphs. Guided by this principle, we introduce an interaction graph \(\:{\text{G}}_{\text{I}}\left(\text{V},\text{E},{\mathcal{A}}_{\text{I}}\right)\) which represents the frequency of interaction between nodes, drawing upon historical navigation records; a construction sketch for all three graphs follows the list. The adjacency matrix \(\:{\mathcal{A}}_{\text{I}}\) of \(\:{\text{G}}_{\text{I}}\) is represented as:
$$\:\begin{array}{c}{\mathcal{A}}_{I}^{ij}=\left\{\begin{array}{c}{\mathcal{I}}_{ij},\:\:\text{if}\:i\ne\:j,\\\:0,\:\:\text{otherwise}.\end{array}\right.\end{array}$$(10)where \(\:{\mathcal{I}}_{\mathcal{i}\mathcal{j}}\) represents the number of traffic flow records between points \(\:{\text{n}}_{\text{i}}\) and \(\:{\text{n}}_{\text{j}}\).
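To make the three constructions concrete, the sketch below assembles \(\:{\mathcal{A}}_{\text{D}}\), \(\:{\mathcal{A}}_{\text{C}}\) and \(\:{\mathcal{A}}_{\text{I}}\) from raw sensor data per Eqs. (6)-(10); the input names and layouts are our assumptions.

```python
import numpy as np

def build_graphs(coords, flows, counts, sigma):
    """Build the distance, correlation and interaction adjacency matrices.

    coords: (N, 2) sensor positions; flows: (N, T) historical volumes;
    counts: (N, N) numbers of recorded flows between node pairs.
    """
    # Distance graph (Eqs. 6-7): Gaussian similarity of Euclidean distances.
    diff = coords[:, None, :] - coords[None, :, :]
    A_D = np.exp(-(diff ** 2).sum(-1) / sigma ** 2)

    # Correlation graph (Eqs. 8-9): Pearson coefficients of historical flows.
    A_C = np.corrcoef(flows)

    # Interaction graph (Eq. 10): frequency of interaction between nodes.
    A_I = counts.astype(float)

    for A in (A_D, A_C, A_I):
        np.fill_diagonal(A, 0.0)   # zero the diagonal, per the case i = j
    return A_D, A_C, A_I
```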
The process of spatial multi-graph convolution is shown in Fig. 3. The input to the graph convolutional layer is an attribute-augmented matrix \(\:E\in\:{\mathbb{R}}^{N\times\:{C}_{in}\times\:\tau\:}\), which is subjected to the spatial graph convolution propagation rule defined as:
$$\:\begin{array}{c}S=\varTheta\:{\text{*}}_{G}E\in\:{\mathbb{R}}^{N\times\:{c}_{\text{out}}\times\:\tau\:},\end{array}$$(11)where \(\:\varTheta\:\) represents the spectral kernel of graph convolution. \(\:{\text{*}}_{\text{G}}\) refers to the graph convolution operation, where G defines the graph structure, such as an adjacency matrix or Laplacian matrix, that governs feature propagation and aggregation. E is the input feature matrix containing the features of each node in the graph. Upon processing the three time-series segments through the MFF-unit, we obtain three attribute-augmented traffic flow matrices. These matrices are then propagated through three dedicated GCN modules. Each module works in synchrony with an associated traffic graph (i.e., \(\:{\text{G}}_{\text{D}}\), \(\:{\text{G}}_{\text{I}}\) and \(\:{\text{G}}_{\text{C}}\)). This procedure culminates in the formation of three distinct feature matrices: \(\:{\mathcal{S}}_{\text{D}}\), \(\:{\mathcal{S}}_{\text{I}}\) and \(\:{\mathcal{S}}_{\text{C}}\). To expedite the spectral graph convolution operations, we employ the Chebyshev polynomial50 for kernel estimation.
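A minimal NumPy sketch of the Chebyshev-accelerated convolution of Eq. (11) is shown below; the scaled-Laplacian input and the kernel layout are our assumptions.

```python
import numpy as np

def cheb_conv(E, L_tilde, theta):
    """Chebyshev-polynomial graph convolution (Eq. 11 sketch).

    E:       (N, C_in, tau) attribute-augmented input.
    L_tilde: (N, N) scaled Laplacian, 2L/lambda_max - I.
    theta:   (K, C_in, C_out) spectral kernels, one per polynomial order.
    """
    K, _, C_out = theta.shape
    N, C_in, tau = E.shape
    Tx = [np.eye(N), L_tilde]                      # T_0 = I, T_1 = L~
    for _ in range(2, K):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])   # T_k = 2 L~ T_{k-1} - T_{k-2}
    S = np.zeros((N, C_out, tau))
    for t in range(tau):
        for k in range(K):
            # Aggregate k-hop neighbourhoods, then mix feature channels.
            S[:, :, t] += (Tx[k] @ E[:, :, t]) @ theta[k]
    return S
```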
The receptance weighted key value (RWKV) block
The RWKV architecture20 is built around the principles of time-mixing and channel-mixing blocks, as illustrated in Fig. 2b. The RWKV comprises four primary elements:
- R: Receptance vector serves as the acceptance of past information.
- W: Weight is a positional decay vector and functions as a trainable parameter.
- K: Key is a vector analogous to K in traditional attention.
- V: Value is a vector analogous to V in traditional attention.
Constructed from a series of stacked residual blocks52, the RWKV architecture is defined by time-mixing and channel-mixing sub-blocks, each with recurrent structures. The recurrence comes into play both as a linear interpolation between the current and preceding time step input and as a time-dependent update of the WKV, as illustrated in Eq. (15).
The time-mixing block is given by:
$$\:\begin{array}{c}{r}_{t}={W}_{r}\cdot\:\left({\mu\:}_{r}\odot\:{x}_{t}+\left(1-{\mu\:}_{r}\right)\odot\:{x}_{t-1}\right),\end{array}$$(12)
$$\:\begin{array}{c}{k}_{t}={W}_{k}\cdot\:\left({\mu\:}_{k}\odot\:{x}_{t}+\left(1-{\mu\:}_{k}\right)\odot\:{x}_{t-1}\right),\end{array}$$(13)
$$\:\begin{array}{c}{v}_{t}={W}_{v}\cdot\:\left({\mu\:}_{v}\odot\:{x}_{t}+\left(1-{\mu\:}_{v}\right)\odot\:{x}_{t-1}\right),\end{array}$$(14)
$$\:\begin{array}{c}{wkv}_{t}=\frac{\sum\:_{i=1}^{t-1}{e}^{-\left(t-1-i\right)w+{k}_{i}}{v}_{i}+{e}^{u+{k}_{t}}{v}_{t}}{\sum\:_{i=1}^{t-1}{e}^{-\left(t-1-i\right)w+{k}_{i}}+{e}^{u+{k}_{t}}},\end{array}$$(15)
$$\:\begin{array}{c}{o}_{t}={W}_{o}\cdot\:\left(\sigma\:\left({r}_{t}\right)\odot\:{wkv}_{t}\right),\end{array}$$(16)
where \(\:\odot\:\) denotes element-wise multiplication, \(\:{\mu\:}_{r},{\mu\:}_{k},{\mu\:}_{v}\) are token-shift interpolation parameters, \(\:w\) is the positional decay, and \(\:u\) is a bonus applied to the current token, following the RWKV formulation20.
In the time-mixing block, the \(\:{\text{w}\text{k}\text{v}}_{\text{t}}\) computation is an integral part of the RWKV architecture, filling the role of the attention mechanism in Transformers. Importantly, it does this without incurring a quadratic cost since interactions are scalar. As time progresses, the vector \(\:{\text{o}}_{\text{t}}\) reflects an increasingly extensive historical summary. The model performs a weighted summation in the positional interval of [1, t] and then multiplies the outcome by the receptance \(\:{\upsigma\:}\left(\text{r}\right)\). Thus, multiplicative interactions occur within a given timestep and are accumulated over different timesteps. Further, the channel-mixing block is given by:
$$\:\begin{array}{c}{r}_{t}^{{\prime\:}}={W}_{r}^{{\prime\:}}\cdot\:\left({\mu\:}_{r}^{{\prime\:}}\odot\:{x}_{t}+\left(1-{\mu\:}_{r}^{{\prime\:}}\right)\odot\:{x}_{t-1}\right),\end{array}$$(17)
$$\:\begin{array}{c}{k}_{t}^{{\prime\:}}={W}_{k}^{{\prime\:}}\cdot\:\left({\mu\:}_{k}^{{\prime\:}}\odot\:{x}_{t}+\left(1-{\mu\:}_{k}^{{\prime\:}}\right)\odot\:{x}_{t-1}\right),\end{array}$$(18)
$$\:\begin{array}{c}{o}_{t}^{{\prime\:}}=\sigma\:\left({r}_{t}^{{\prime\:}}\right)\odot\:\left({W}_{v}^{{\prime\:}}\cdot\:\text{max}{\left({k}_{t}^{{\prime\:}},0\right)}^{2}\right).\end{array}$$(19)
Within this block, the model employs a squared ReLU activation function53. In both the time-mixing and channel-mixing blocks, the sigmoid of the receptance is intuitively applied as a "forget gate," eliminating superfluous historical information.
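To ground Eqs. (12)-(19), the sketch below implements both sub-blocks in plain NumPy, with \(\:{\text{w}\text{k}\text{v}}_{\text{t}}\) computed in its recurrent form, which is what makes the attention linear in sequence length. This is an unstabilised illustration under our own parameter naming; the released RWKV kernels additionally rescale exponents for numerical safety.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mixing(x, W_r, W_k, W_v, W_o, mu_r, mu_k, mu_v, w, u):
    """Time-mixing sub-block (Eqs. 12-16); wkv_t of Eq. (15) is computed
    recurrently in O(T*C) instead of materialising a T x T matrix."""
    T, C = x.shape
    x_prev = np.vstack([np.zeros((1, C)), x[:-1]])     # token shift x_{t-1}
    r = (mu_r * x + (1 - mu_r) * x_prev) @ W_r.T
    k = (mu_k * x + (1 - mu_k) * x_prev) @ W_k.T
    v = (mu_v * x + (1 - mu_v) * x_prev) @ W_v.T
    a = np.zeros(C)                                    # decayed sum of e^{k_i} v_i
    b = np.zeros(C)                                    # decayed sum of e^{k_i}
    wkv = np.empty((T, C))
    for t in range(T):
        e_cur = np.exp(u + k[t])                       # bonus for the current token
        wkv[t] = (a + e_cur * v[t]) / (b + e_cur)
        a = np.exp(-w) * a + np.exp(k[t]) * v[t]       # apply the decay w per step
        b = np.exp(-w) * b + np.exp(k[t])
    return (sigmoid(r) * wkv) @ W_o.T                  # receptance as forget gate

def channel_mixing(x, W_r, W_k, W_v, mu_r, mu_k):
    """Channel-mixing sub-block (Eqs. 17-19) with squared-ReLU activation."""
    T, C = x.shape
    x_prev = np.vstack([np.zeros((1, C)), x[:-1]])
    r = (mu_r * x + (1 - mu_r) * x_prev) @ W_r.T
    k = (mu_k * x + (1 - mu_k) * x_prev) @ W_k.T
    return sigmoid(r) * (np.maximum(k, 0) ** 2 @ W_v.T)
```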
Temporal fusion unit
We designed a module named Temporal Fusion Unit with the prime objective of enhancing the capture of temporal dependent relationships within traffic flow. This module, essentially a fully connected attention mechanism, is adept at integrating the outputs from the RWKV blocks at each time step. Specifically, each RWKV block yields an output characterized by a shape of \(\:(\text{N},{\text{F}}_{\text{o}\text{u}\text{t}})\), and collectively they produce a sequence output of shape \(\:(\text{N},\text{t},{\text{F}}_{\text{o}\text{u}\text{t}})\:\), where \(\:\text{N}\) corresponds to the batch size, \(\:\text{t}\) denotes the time step length, and \(\:{\text{F}}_{\text{o}\text{u}\text{t}}\) represents the number of output features.
A salient aspect of the Temporal Fusion Unit is the computation of attention scores for the outputs at each time step. This is particularly instrumental in capturing the dependencies in traffic flow over time, as it allocates more significance to relevant time steps. The attention scores are computed via a fully connected network utilizing a softmax function. This involves employing a weight matrix \(\:{\text{W}}_{\text{a}}\) and a bias vector \(\:{\text{b}}_{\text{a}}\) for calculating the attention score at each time step, according to the formula:
$$\:\begin{array}{c}{\alpha\:}_{t}=\text{softmax}\left({W}_{a}{h}_{t}+{b}_{a}\right),\end{array}$$(20)
where \(\:{\text{h}}_{\text{t}}\) denotes the output at time step \(\:\text{t}\) and the softmax is taken over all time steps. Following this, the weighted sum is computed by multiplying the output at each time step by its respective attention score, as represented by the formula:
$$\:\begin{array}{c}c=\sum\:_{t=1}^{\tau\:}{\alpha\:}_{t}{h}_{t}.\end{array}$$(21)
This culminates in \(\:\text{c}\), the final aggregated result which is the weighted sum of the output at each time step and its corresponding attention score.
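A compact sketch of the unit described by Eqs. (20)-(21) follows; projecting each time step to a scalar score with \(\:{\text{W}}_{\text{a}}\) is our assumed parameterization.

```python
import numpy as np

def temporal_fusion(H, W_a, b_a):
    """Attention-weighted aggregation over time steps (Eqs. 20-21).

    H: (N, t, F_out) stacked RWKV outputs; W_a: (F_out,); b_a: scalar.
    """
    scores = H @ W_a + b_a                         # (N, t) unnormalized scores
    scores -= scores.max(axis=1, keepdims=True)    # stabilize the softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)      # attention over time steps
    return (alpha[:, :, None] * H).sum(axis=1)     # weighted sum c, (N, F_out)
```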
Multi-component fusion
Ultimately, the outputs of the three components are fused. Given that the impact weights of the three components for each node vary, they ought to be learned from the historical data. Therefore, the resultant prediction post-integration is as follows:
$$\:\begin{array}{c}\hat{Y}={W}_{h}\odot\:{\hat{Y}}_{h}+{W}_{d}\odot\:{\hat{Y}}_{d}+{W}_{w}\odot\:{\hat{Y}}_{w},\end{array}$$(22)
where \(\:\hat{Y}\) is the final output of the model's prediction, \(\:\odot\:\) is the Hadamard product, \(\:{W}_{h}\), \(\:{W}_{d}\) and \(\:{W}_{w}\) are the learned weight matrices that control the influence of each component on the final prediction, and \(\:{\hat{Y}}_{h}\), \(\:{\hat{Y}}_{d}\) and \(\:{\hat{Y}}_{w}\) represent the outputs of the three components related to the recent segment, the daily-periodic segment and the weekly-periodic segment respectively.
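Assuming the element-wise weighting of Eq. (22), the fusion reduces to a few lines; treating \(\:{W}_{h}\), \(\:{W}_{d}\), \(\:{W}_{w}\) as learnable arrays broadcast over the prediction tensor is our reading of the formulation.

```python
import numpy as np

def fuse_components(Y_h, Y_d, Y_w, W_h, W_d, W_w):
    """Learned element-wise fusion of the three component outputs (Eq. 22)."""
    return W_h * Y_h + W_d * Y_d + W_w * Y_w   # Hadamard-weighted sum
```

During training the three weight arrays are updated jointly with the rest of the network, so each node learns how strongly its prediction leans on recent, daily and weekly evidence.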
Experiments
Datasets
To validate our model, we utilized two notable traffic datasets, PeMSD453 and PeMSD853, as well as meteorological data from the National Oceanic and Atmospheric Administration (NOAA)54. The PeMSD4 dataset consists of data from the San Francisco Bay Area, gathered from January to February 2018, sourced from 3,848 detectors on 29 roads. In contrast, the PeMSD8 dataset comprises data collected in San Bernardino from July to August 2016, drawn from 1,979 detectors on 8 roads. The NOAA meteorological data coincides with these same periods and locations.
In our analysis, each detector, collecting traffic data every 5 min, provides 288 data points per day. We consider various traffic measurements such as total flow, average speed, and average occupancy. In urban traffic flow monitoring, the minimum distance between detectors is usually set between 3 and 5 miles to avoid data redundancy and maintain sufficient spatial resolution. We chose 3.5 miles as the minimum distance to ensure sufficient spatial coverage under different traffic modes while reducing redundant data. The final detector counts were 307 for PeMSD4 and 170 for PeMSD8. Any missing values were addressed with linear interpolation, and we standardized all data using zero-mean normalization. The same 3.5-mile detector spacing and the same detectors were used for all comparison methods, ensuring fairness in our experiments.
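A sketch of this preprocessing, assuming one (N, T) array of detector readings with missing entries stored as NaN, is given below; the helper is illustrative rather than the exact pipeline code.

```python
import numpy as np

def preprocess(X):
    """Linear interpolation of missing values, then zero-mean normalization."""
    X = X.copy()
    t = np.arange(X.shape[1])
    for i in range(X.shape[0]):          # one time series per detector
        bad = np.isnan(X[i])
        if bad.any():
            X[i, bad] = np.interp(t[bad], t[~bad], X[i, ~bad])
    mean, std = X.mean(), X.std()
    return (X - mean) / std, mean, std   # keep statistics to invert later
```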
To evaluate the model’s generalization capability, the datasets were divided into training, validation, and testing sets in a time-sequential manner. Specifically, the first 70% of the data was used for training, the next 10% for validation, and the remaining 20% for testing. This approach ensures that the testing set only includes future data unseen during the training phase, closely simulating real-world forecasting scenarios.
Moreover, the testing set was designed to encompass diverse traffic conditions, including weekdays, weekends, peak hours, off-peak hours, and varying weather conditions. This ensures that the evaluation results comprehensively reflect the model’s robustness and adaptability across different scenarios. For instance, PeMSD4’s testing set covers two weeks of traffic data, while PeMSD8’s testing set spans one week. This proportion was selected to balance computational feasibility with statistical reliability in performance evaluation.
Experimental settings
Evaluation metrics
We use Mean Absolute Error (MAE) and Root Mean-Squared Error (RMSE) to evaluate the prediction performance of our model.
- Mean Absolute Error (MAE).
MAE describes the average of the sum of the absolute difference between the predicted result and the ground truth. It is mainly used to evaluate the prediction error.
$$\:\begin{array}{c}MAE=\frac{1}{n}\sum\:_{i=1}^{n}\left|{y}_{i}-{\hat{y}}_{i}\right|.\end{array}$$(23)
- Root Mean-Squared Error (RMSE).
The smaller the RMSE value, the smaller the prediction error, and the better the performance of the model.
$$\:\begin{array}{c}\text{RMSE}=\sqrt{\frac{1}{n}\sum\:_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}.\end{array}$$(24)
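Both metrics are straightforward to compute; a reference implementation of Eqs. (23)-(24) is given below.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error, Eq. (23)."""
    return np.abs(y_true - y_pred).mean()

def rmse(y_true, y_pred):
    """Root Mean-Squared Error, Eq. (24)."""
    return np.sqrt(((y_true - y_pred) ** 2).mean())
```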
Baselines
We compare LASTGCN with the following baseline methods:
- Historical Average (HA);
- Auto-Regressive Integrated Moving Average (ARIMA)56;
- Vector Auto-Regressive (VAR)57;
- Long Short-Term Memory network (LSTM)26;
- Gated Recurrent Unit network (GRU)27;
- Adaptive Graph Convolutional Recurrent Network (AGCRN)9, which employs an adaptive graph and integrates GRU with graph convolutions with node-adaptive parameter learning;
- Spatio-temporal graph convolutional network (STGCN)34, which combines graph convolutional layers and a convolutional sequence learning layer;
- Temporal graph convolutional network (T-GCN)10, which combines GCN and GRU for traffic prediction;
- Diffusion convolutional recurrent neural network (DCRNN)35, which integrates diffusion convolution with a sequence-to-sequence architecture;
- Graph WaveNet57, which combines graph convolution with dilated causal convolution;
- Attention based spatial-temporal graph convolutional network (ASTGCN)14, which combines the attention mechanism and spatial-temporal convolution;
- Graph Multi-attention Network (GMAN)15, which employs multiple spatio-temporal attention blocks.
To ensure a fair comparison, we retrained all deep learning models on the same datasets used for evaluating LASTGCN. This includes using the same data splits (training, validation, and testing) and preprocessing techniques (e.g., zero-mean normalization and interpolation for missing values). For traditional statistical methods, we implemented the algorithms according to their original descriptions, ensuring consistency with their reported implementations.
For the retraining of baseline deep learning models, hyperparameters were optimized based on their respective original papers and adjusted for the PeMSD4 and PeMSD8 datasets where necessary. For instance, batch sizes, learning rates, and epoch counts were tuned to achieve the best performance on these datasets. All models were trained on the same hardware to eliminate potential variations caused by differences in computational resources.
Hyperparameters
In the training phase of the model, we judiciously selected hyperparameters to optimize performance. Utilizing the Adam optimization algorithm58,59, we set the learning rate at 0.01 and employed a batch size of 64, with training spanning 3,000 epochs. The architecture includes an aggregation of six RWKV blocks (N = 6) and is configured with 100 hidden units. Moreover, in the specialized GCN configuration within the model, optimal performance was achieved by incorporating 32 units in the hidden layer. We set segment lengths as follows: \(\:{T}_{h}\)=24, \(\:{T}_{d}\)=12, and \(\:{T}_{w}\)=24. The size of the predictive window, \(\:{T}_{p}\), is set at 12, which means the model predicts traffic flow for the next hour.
Hyperparameter selection and grid search
To ensure the optimal combination of hyperparameters, we conducted a systematic grid search experiment. The selection process was guided by empirical testing and prior knowledge from relevant studies.
Grid Search Scope:
- Learning Rate: Tested values of 0.001, 0.01, and 0.1.
- Batch Size: Tested values of 32, 64, and 128.
- Epoch Count: Tested values of 1,000, 2,000, and 3,000.
Experiment Results:
- Learning Rate: A learning rate of 0.01 provided the best trade-off between convergence speed and training stability. Lower values (e.g., 0.001) resulted in slower convergence, while higher values (e.g., 0.1) led to unstable training dynamics.
- Batch Size: A batch size of 64 ensured efficient GPU memory utilization while maintaining model generalization performance. Smaller batch sizes (e.g., 32) increased training time, and larger sizes (e.g., 128) slightly degraded performance.
- Epoch Count: Performance stabilized around 2,000 epochs, with additional improvements observed up to 3,000 epochs.
These selected hyperparameters were validated to perform optimally on the PeMSD4 and PeMSD8 datasets, balancing training efficiency and prediction accuracy.
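The search itself is a plain exhaustive sweep; a sketch is shown below, where train_and_validate is a hypothetical stand-in for one full training run returning the validation RMSE.

```python
from itertools import product
import random

def train_and_validate(lr, batch_size, epochs):
    # Hypothetical hook: train LASTGCN with this configuration and return
    # the validation RMSE. Replaced here by a dummy score for illustration.
    return random.random()

grid = {"lr": [0.001, 0.01, 0.1],
        "batch_size": [32, 64, 128],
        "epochs": [1000, 2000, 3000]}

best_score, best_cfg = float("inf"), None
for lr, bs, ep in product(*grid.values()):
    score = train_and_validate(lr=lr, batch_size=bs, epochs=ep)
    if score < best_score:
        best_score, best_cfg = score, {"lr": lr, "batch_size": bs, "epochs": ep}
print("best configuration:", best_cfg)
```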
Experiment results
Forecasting performance comparison
We embarked on a comparative analysis, pitting our model against 12 baseline methods, utilizing datasets PeMSD4 and PeMSD8. Table 1 elucidates the mean performance results for predictions of traffic flow within the forthcoming hour. A conspicuous observation is the suboptimal prediction accuracy rendered by traditional time series analysis methods, which unveils their inadequacies in grappling with the intricate non-linearities and complexity inherent in traffic data. Conversely, deep learning-based methodologies demonstrate superior predictive prowess. Notably, models such as AGCRN, T-GCN, DCRNN, and Graph WaveNet, which judiciously incorporate both temporal and spatial correlations, exhibit enhanced performance relative to models like LSTM and GRU that focus exclusively on temporal aspects, as well as STGCN, which is solely predicated on GCN. The performance of ASTGCN and GMAN surpasses that of the previously mentioned models, underscoring the efficacy of attention mechanisms in adeptly capturing the dynamism of traffic data. Our proposed model, LASTGCN, which harnesses linear attention in tandem with RWKV blocks and GCN, achieves the best performance with respect to RMSE on PeMSD4 and MAE on PeMSD8. Although it marginally trails GMAN in the two other metrics by 0.18 and 0.26, it still holds its ground, exhibiting performance on par with models that singularly rely on attention mechanisms.
Figures 4 and 5 demonstrate the evolution of predictive performance as the forecast interval increases for various methods. In general terms, the difficulty of prediction tends to rise with an increase in the forecast interval, leading to an inevitable growth in prediction error. The figures indicate that for short-term predictions, methods that only take into account temporal correlations, such as HA, ARIMA, LSTM, VAR, and GRU, can deliver satisfactory results. Despite their initial effectiveness, these methods experience a significant drop in predictive accuracy as the forecast interval expands. In contrast, even though the error rates of deep learning methods do rise with the extension of the forecast interval, they consistently maintain a commendable overall performance. Our LASTGCN model consistently achieves the best predictive performance.
Ablation experiments
The ablation experiments in this study primarily focus on the impact of features, as features directly influence the prediction performance and are one of the key factors in model construction. There were 7 groups of trials: only adding temperature, only adding visibility, only adding weather conditions, temperature + visibility, temperature + weather conditions, visibility + weather conditions, and temperature + visibility + weather conditions. The results are listed in Table 2.
The analytical outcomes delineate that by adroitly integrating external constituents, the performance of our algorithm outstrips that of the model bereft of external factor assimilation. Furthermore, there exists a palpable correlation between the triumvirate of meteorological components and the vicissitudes in traffic flow, with weather conditions and ambient temperature emerging as the most potent influencers. In summation, the integration of meteorological variables into the traffic prediction framework proves instrumental in attenuating inaccuracies, thus contributing significantly to the refinement of the model’s predictive fidelity.
We first conducted feature ablation experiments to validate the contribution of different features to traffic flow prediction. Although ablation experiments of the model’s main components were not performed in this study, future research will supplement these experiments and explore the impact of each module on the final prediction results.
Robust experiments
Aiming to emulate the inherent imperfections and stochastic nature of real-world data acquisition processes, we subjected the LASTGCN model to a gamut of perturbation analyses to gauge its robustness in the face of noise contamination. We imparted two archetypal variants of random noise into the data: the first variant adheres to a Gaussian distribution \(\:N\left(0,{\sigma\:}^{2}\right)\), where \(\:\sigma\:\in\:\left\{0.2,\:0.4,\:0.8,\:1,\:2\right\}\); the second variant conforms to a Poisson distribution \(\:P\left(\lambda\:\right)\), where \(\:\lambda\:\in\:\left\{1,\:2,\:4,\:8,\:16\right\}\). The experimental results are shown in Fig. 6. Figure 6a shows the results before and after adding Gaussian noise to the PeMSD4 dataset. Similarly, Fig. 6b shows the results before and after adding Poisson noise to the PeMSD4 dataset. Upon close examination of the graphical renditions, a couple of revelations emerge. The first insight is the marginal attrition in the LASTGCN model's predictive performance subsequent to the adulteration with noise, indicating a negligible decline. The second insight is the revelation that the model's evaluation metrics remain astonishingly stable, irrespective of the variance in the noise distribution.
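The perturbations can be reproduced with a few lines; centring the Poisson noise on its mean is our assumption, so that both variants leave the average flow level unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_gaussian_noise(X, sigma):
    """Additive N(0, sigma^2) noise, sigma in {0.2, 0.4, 0.8, 1, 2}."""
    return X + rng.normal(0.0, sigma, size=X.shape)

def add_poisson_noise(X, lam):
    """Additive, mean-centred Poisson noise, lambda in {1, 2, 4, 8, 16}."""
    return X + rng.poisson(lam, size=X.shape) - lam
```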
In summation, these observations collectively corroborate the formidable robustness of the LASTGCN model, underscoring its proficiency in countervailing the vicissitudes engendered by noise pollution.
Computation time
We present the training time and inference time of STGCN, DCRNN, Graph WaveNet, GMAN, and LASTGCN on the PeMS dataset in Table 3. During the training phase, LASTGCN demonstrates a commendable temporal efficiency that is nearly on par with STGCN, with a trifling lag of 9.01 s, and compares favorably with the more sluggish training velocities of GMAN and Graph WaveNet. However, DCRNN is markedly hindered by its recursive network architecture, which necessitates a protracted sequence learning process, thus relegating it to the least efficient position among the methodologies examined. While STGCN's efficiency is laudable in the training phase, it is crucial to highlight that this comes at the expense of predictive accuracy, as evidenced in Table 1.
The inference phase paints a different picture in terms of efficiency and predictive capability. In the inference phase, LASTGCN, GMAN, and Graph WaveNet emerge superior, owing to their adeptness at generating forward-looking predictions in a single pass, which significantly curtails the time consumed. Specifically, LASTGCN exhibits an inference alacrity that surpasses GMAN and falls marginally behind Graph WaveNet by an inconsequential 1.33 s. STGCN and DCRNN, conversely, are encumbered by the necessity of iterative computations for prediction generation. This characteristic, inherent in their design, renders them less efficient in the inference stage. LASTGCN's average inference time of 8.01 s breaks down as follows: preprocessing (2.5 s), inference (4.1 s), and post-processing (1.41 s).
Although 8.01 s may not strictly meet the definition of real-time for certain high-frequency applications, such as dynamic signal control, it is sufficient for mid-term traffic management scenarios (e.g., congestion trend forecasting). Furthermore, by leveraging hardware accelerations such as GPUs and optimizing RWKV computations, the inference time could be significantly reduced in future implementations.
Conclusions
In this study, we presented LASTGCN, a novel deep learning model aimed at predicting traffic flow with high precision. Through the combination of linear attention mechanisms with multi-graph convolutional networks, and the incorporation of external factors such as weather conditions, LASTGCN is capable of capturing complex spatial-temporal features of traffic data. Our empirical evaluation using real-world datasets showcased the superiority of LASTGCN over several state-of-the-art methods in terms of accuracy. The integration of meteorological variables significantly improved the model’s performance, highlighting the importance of considering external factors. LASTGCN offers a powerful solution for intelligent traffic systems, and has potential for broader applications in smart city infrastructures.
However, this study has certain limitations. The hyperparameter tuning, though thorough, may not guarantee globally optimal configurations due to computational constraints. Additionally, the datasets used are geographically and temporally specific, which could limit generalizability. While baseline models were retrained to ensure fairness, their performance might still reflect optimization differences.
Future work will focus on validating LASTGCN on more diverse datasets, reducing computational latency for real-time applications, and exploring advanced graph construction techniques. Expanding integration with real-world systems like IoT devices and weather tools could further enhance its practical utility.
Data availability
Data are available from the corresponding author upon request.
References
Hou, Y., Deng, Z. & Cui, H. Short-term traffic flow prediction with weather conditions: based on deep learning algorithms and data fusion. Complex. 2021, 1–14. https://doi.org/10.1155/2021/6662959 (2021).
Yang, B., Sun, S., Li, J., Lin, X. & Tian, Y. Traffic flow prediction using LSTM with feature enhancement. Neurocomputing 332 https://doi.org/10.1016/j.neucom.2018.12.016 (2018).
Zhang, D. & Kabuka, M. Combining weather condition data to predict traffic flow: A GRU based deep learning approach. IET Intel. Transport Syst. 12 https://doi.org/10.1049/iet-its.2017.0313 (2018).
Wang, Z., Su, X. & Ding, Z. Long-term traffic prediction based on LSTM encoder-decoder architecture. IEEE Trans. Intell. Transp. Syst. PP, 1–11. https://doi.org/10.1109/TITS.2020.2995546 (2020).
Li, Y., Bai, F., Lyu, C., Qu, X. & Liu, Y. A systematic review of generative adversarial networks for traffic state prediction: overview, taxonomy, and future prospects. Inform. Fusion. 102915. (2025).
Kuang, S., Liu, Y., Wang, X., Wu, X. & Wei, Y. Harnessing multimodal large language models for traffic knowledge graph generation and decision-making. Commun. Transp. Res. 4, 100146 (2024).
Qu, X., Lin, H. & Liu, Y. Envisioning the future of transportation: inspiration of ChatGPT and large models. Commun. Transp. Res. 3, 100103 (2023).
Liu, Y. et al. Can language models be used for real-world urban-delivery route optimization? Innovation. 4(6). (2023).
Bai, L., Yao, L., Li, C., Wang, X. & Wang, C. Adaptive Graph Convolutional Recurrent Network for Traffic Forecasting (2020).
Zhao, L. et al. T-GCN: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 21, 3848–3858. https://doi.org/10.1109/TITS.2019.2935152 (2020).
Le, P. & Zuidema, W. Quantifying the vanishing gradient and long distance dependency problem in recursive neural net-works and recursive LSTMs (2016).
Vaswani, A. et al. Attention Is All You Need (2017).
Tay, Y., Dehghani, M., Bahri, D. & Metzler, D. Efficient transformers: A survey (2022).
Guo, S., Lin, Y., Feng, N., Song, C. & Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. AAAI 33, 922–929. https://doi.org/10.1609/aaai.v33i01.3301922 (2019).
Zheng, C., Fan, X., Wang, C. & Qi, J. GMAN: A graph multi-attention network for traffic prediction. AAAI 34, 1234–1241. https://doi.org/10.1609/aaai.v34i01.5477 (2020).
Kong, X. et al. Spatial-temporal graph attention networks for traffic flow forecasting. IEEE Access. 8, 134363–134372. https://doi.org/10.1109/ACCESS.2020.3011186 (2020).
Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness (2022).
Zaheer, M. et al. Big bird: Transformers for longer sequences (2021).
Wang, S., Li, B. Z., Khabsa, M., Fang, H. & Ma, H. Linformer: Self-Attention with Linear Complexity (2020).
Peng, B. et al. RWKV: Reinventing RNNs for the Transformer Era (2023).
Koesdwiady, A., Soua, R. & Karray, F. Improving traffic flow prediction with weather information in connected cars: A deep learning approach. IEEE Trans. Veh. Technol. 65, 1–1. https://doi.org/10.1109/TVT.2016.2585575 (2016).
Mirbaha, B. et al. Predicting average vehicle speed in two lane highways considering weather condition and traffic characteristics. IOP Conf. Series: Mater. Sci. Eng. 245, 042024. https://doi.org/10.1088/1757-899X/245/4/042024 (2017).
Hosseini, H., Moshiri, B., Rahimi-Kian, A. & Araabi, B. Traffic flow prediction using MI algorithm and considering noisy and data loss conditions: an application to Minnesota traffic flow prediction. PROMET - Traffic Transp.. 26 https://doi.org/10.7307/ptt.v26i5.1429 (2014).
Xu, Q., Pang, Y. & Liu, Y. Air traffic density prediction using bayesian ensemble graph attention network (BEGAN). Transp. Res. Part. C: Emerg. Technol. 153, 104225 (2023).
Xu, Q. et al. PIGAT: Physics-informed graph attention transformer for air traffic state prediction. IEEE Trans. Intell. Transp. Syst. (2024).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling (2014).
Lint, J. W. C., Hoogendoorn, S. & Zuvlen, H. Freeway travel time prediction with state-space neural networks: modeling state-space dynamics with recurrent neural networks. Transp. Res. Rec. 1811 https://doi.org/10.3141/1811-04 (2002).
Fu, R., Zhang, Z. & Li, L. Using LSTM and GRU neural network methods for traffic flow prediction. In Proceedings of the 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), 324–328. (2016).
Wu, Y. & Tan, H. Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning frame-work (2016).
Zhang, J., Zheng, Y., Qi, D., Li, R. & Yi, X. DNN-based prediction model for spatial-temporal data.
Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks (2017).
Jiang, W. & Luo, J. Graph neural network for traffic forecasting: A survey. Expert Syst. Appl. 207, 117921. https://doi.org/10.1016/j.eswa.2022.117921 (2022).
Yu, B., Yin, H. & Zhu, Z. Spatio-Temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence; International Joint Conferences on Artificial Intelligence Organization: Stockholm, Sweden, July, 3634–3640. (2018).
Li, Y., Yu, R., Shahabi, C. & Liu, Y. Diffusion convolutional recurrent neural network: data-driven traffic forecasting (2018).
Beltagy, I., Peters, M. E. & Cohan, A. Longformer: The long-document transformer (2020).
Kitaev, N., Kaiser, Ł. & Levskaya, A. Reformer: The efficient transformer (2020).
Guo, M. et al. LongT5: Efficient text-to-text transformer for long sequences. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022; Association for Compu-tational Linguistics: Seattle, United States, 724–736. (2022).
Choromanski, K. et al. Rethinking attention with performers (2022).
Ma, X. et al. Luna: Linear Unified Nested Attention (2021).
Ma, X. et al. Mega: Moving Average Equipped Gated At-tention (2023).
Tolstikhin, I. et al. MLP-Mixer: An All-MLP Architecture for Vision (2021).
Liu, H., Dai, Z., So, D. R. & Le, Q. V. Pay Attention to MLPs (2021).
Zhai, S. et al. An Attention Free Transformer (2021).
Bulatov, A., Kuratov, Y. & Burtsev, M. S. Recurrent Memory Transformer (2022).
Orvieto, A. et al. Resurrecting Recurrent Neural Networks for Long Sequences (2023).
Li, H., Wang, Q. & Xiong, W. New model of travel-time prediction considering weather conditions: case study of urban expressway. J. Transp. Eng. Part. A: Syst. 147, 04020161. https://doi.org/10.1061/JTEPBS.0000491 (2021).
Jia, Y., Wu, J. & Xu, M. Traffic flow prediction with rainfall impact using a deep learning method. J. Adv. Transp. 2017, e6575947 https://doi.org/10.1155/2017/6575947 (2017).
Zhu, J. et al. AST-GCN: attribute-augmented Spatiotemporal graph convolutional network for traffic forecasting (2020).
Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A. & Vandergheynst, P. The emerging field of signal processing on graphs: extending high-dimensional data analysis to networks and other irregular domains. IEEE. Signal. Process. Mag. 30, 83–98. https://doi.org/10.1109/MSP.2012.2235192 (2013).
von Luxburg U. A Tutorial on spectral clustering (2007).
He, K., Zhang, X., Ren, S. & Sun J. Deep residual learning for image recognition (2015).
So, D. R. et al. Primer: searching for efficient transformers for language modeling (2022).
Chen, C., Petty, K., Skabardonis, A., Varaiya, P. & Jia, Z. Freeway performance measurement system: mining loop detector data. Transp. Res. Rec. 1748, 96–102. https://doi.org/10.3141/1748-12 (2001).
Qi, X., Mei, G., Tu, J., Xi, N. & Piccialli, F. A deep learning approach for long-term traffic flow prediction with multifactor fusion using spatiotemporal graph convolutional network. IEEE Trans. Intell. Transp. Syst. 1–14. https://doi.org/10.1109/TITS.2022.3201879 (2022).
Williams, B. & Hoel, L. Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: theoretical basis and empirical results. J. Transp. Eng. 129, 664–672 https://doi.org/10.1061/(ASCE)0733-947X(2003)129:6(664) (2003).
Zhang, J. et al. Data-driven intelligent transportation systems: a survey. IEEE Trans. Intell. Transp. Syst. 12, 1624–1639. https://doi.org/10.1109/TITS.2011.2158001 (2011).
Wu, Z., Pan, S., Long, G., Jiang, J. & Zhang, C. Graph WaveNet for deep spatial-temporal graph modeling (2019).
Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization (2017).
Author information
Contributions
methodology, Y.Z. and W.X.; software, B.M.; validation, F.Z., J.Y., H.Y. and Z.D.; investigation, F.Z., J.Y., H.Y. and Z.D.; resources, Y.Z., W.X. and D.Z.; data curation, F.Z., J.Y., H.Y. and Z.D.; writing—original draft preparation, B.M.; writing—review and editing, Y.Z., W.X. and D.Z.; visualization, B.M.; supervision, Y.Z., W.X. and D.Z.; project administration, D.Z.; funding acquisition, Y.Z. All authors reviewed the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, Y., Xu, W., Ma, B. et al. Linear attention based spatiotemporal multi graph GCN for traffic flow prediction. Sci Rep 15, 8249 (2025). https://doi.org/10.1038/s41598-025-93179-y