Introduction

Time series prediction has been applied in widespread domains, such as the prediction of stock prices1, weather forecasting2, the trajectories of hurricanes or typhoons3, traffic flow prediction4, etc. Time series vary in the number of samples and in sampling frequency, which challenges the training of prediction models. In time series prediction, a model acquires the knowledge (features) of the various regular characteristics inherent in the entire dataset. Because the collected series vary, the prediction model has to be trained with different approaches for different datasets. For datasets with long series, the model can be trained either independently on each series or collectively on all of them. However, for datasets with short series (within hundreds of samples), the common practice is to train the model with multiple series in one session. This paper focuses on this kind of ‘multiple-series’ training of prediction models.

Many time series comprise a combination of an overall or long-term trend, distinct long- and short-period regularities, and irregular fluctuations5 arising from nonlinear and non-stationary behavior. This significantly increases the feature complexity compared to stationary series, and consequently limits predictive capacity. Several effective methods exist to separate the different periodic regularities and overall trends of a time series, such as Seasonal and Trend decomposition using LOESS (STL), the wavelet transform, Empirical Mode Decomposition (EMD), etc.5. There are also related improved models such as RobustSTL6, Fast RobustSTL7, and Seasonal-Trend-Dispersion decomposition (STD)5.

However, the aforementioned methods for decomposing periodic regularities and general trends of time series, including the EMD that we will employ, are constrained to decomposing a single long series at a time. Such methods have at least three limitations.

  • The features learned from training with a single short series might not cover all the features of the dataset.

  • The features learned from multiple short series might be inconsistent with each other.

  • It is uneconomical to feed a model with short series when it is capable of accepting longer ones.

Consequently, to apply these decomposition methods (such as EMD8) to multi-series data, it is necessary to concatenate the individual series prior to decomposition. Two challenges should be taken into account during this concatenation:

  • The value ranges differ among series, and when this difference is substantial, the inherent variation characteristics will be significantly diminished after concatenation, consequently degrading the decomposed periodicity. This issue can be resolved by normalizing each series separately9.

  • There is discontinuity at the junctions between concatenated series. The preliminary results show that, if the variation at the junction between concatenated series surpasses the variation within adjacent observations of each respective series, the decomposition will generate an anomalous component at the junction that deviates from the regular patterns and disrupts nearby periodic regularities.

To address the second challenge, this paper introduces a connector insertion method that produces sub-sequences aligned with the original periodic regularities after decomposition. It inserts a connector between each pair of series to be concatenated, ensuring the continuity of the multi-series at the junctions and enhancing the consistency of the decomposed sub-sequences with their original periodic regularity. The contributions of the paper are summarized as follows:

  • It proposes two types of connectors for multiple-series concatenation: the linear interpolation connector, abbreviated as LIP, and the linear interpolation connector superimposed with random vibrations, abbreviated as LRV.

  • It introduces a framework consisting of four modules: data pre-processing, EMD, prediction via the long short-term memory (LSTM10) mechanism, and output. The data pre-processing includes separate normalization and series concatenation. For prediction, it employs three methods: LSTM for all sub-sequences; LSTM with a temporal attention mechanism (LSTM-TA) for all sub-sequences; and LSTM-TA for some sub-sequences combined with LSTM for the others.

  • It presents experiments on multiple multi-series datasets comparing both types of connectors against direct concatenation, and identifies the appropriate scope of each connector. The linear and vibrating connector suits series with obvious periodic characteristics, while the simple linear interpolation connector is more appropriate where such characteristics are absent.

The rest of the paper is organized as follows: section "Preliminaries" introduces the preliminary definitions of EMD and the backbone of the method; section "Methodology" describes the method, covering the general framework, the settings of the two types of connectors, and the insertion details; section "Experiments" presents the experimental results of prediction with connectors; section "Related works" surveys time series prediction methods and series decomposition methods; and section "Conclusion" concludes the paper.

Preliminaries

Empirical mode decomposition

Empirical Mode Decomposition (EMD) is a method for processing nonlinear and non-stationary sequences, separating the different period features of a sequence. EMD decomposes the sequence into a series of Intrinsic Mode Function (IMF) sub-sequences and a residue. A candidate sub-sequence is accepted as an IMF if it satisfies the following conditions.

  1. For the entire dataset, the number of extrema and the number of zero crossings must either be equal or differ by at most one;

  2. At any point, the mean value of the envelopes defined by the local maxima and the local minima must be zero.
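These two conditions can be checked numerically. The following numpy/scipy sketch is only illustrative and not part of the original method; the helper name `is_imf_candidate` and the tolerance are our assumptions, and the envelopes are built with cubic splines as in Step 2 below.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema

def is_imf_candidate(c, tol=0.05):
    """Rough numerical check of the two IMF conditions for a candidate c."""
    c = np.asarray(c, dtype=float)
    # Condition 1: the numbers of extrema and zero crossings differ by at most one.
    zero_crossings = np.sum(np.signbit(c[:-1]) != np.signbit(c[1:]))
    max_idx = argrelextrema(c, np.greater)[0]
    min_idx = argrelextrema(c, np.less)[0]
    if abs((len(max_idx) + len(min_idx)) - zero_crossings) > 1:
        return False
    # Condition 2: the mean of the upper and lower envelopes (cubic splines through
    # the local maxima and minima) should be close to zero everywhere; `tol` is a
    # loose, illustrative tolerance rather than a strict equality check.
    if len(max_idx) < 2 or len(min_idx) < 2:
        return False
    t = np.arange(len(c))
    upper = CubicSpline(max_idx, c[max_idx])(t)
    lower = CubicSpline(min_idx, c[min_idx])(t)
    return np.max(np.abs((upper + lower) / 2.0)) < tol
```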

A classical EMD algorithm consists of 5 steps8:

Step 1:

Given a sequence x(t), it sets the initial residue and IMF index: \(r(t) = x(t)\), \(k = 1\).

Step 2:

It joins all the maxima as the upper envelope \(e_{\max }(t)\), and all the minima as the lower envelope \(e_{\min }(t)\), using cubic spline interpolation.

Step 3:

It computes the mean of the envelopes m(t) and the candidate c(t):

$$\begin{aligned} m(t)=\frac{1}{2}(e_{min}(t)+e_{max}(t)) \end{aligned}$$
(1)
$$\begin{aligned} c(t)=r(t)-m(t) \end{aligned}$$
(2)
Step 4:

If the candidate c(t) meets the conditions above, it sets \(\textrm{IMF}_k(t) = c(t)\) as an IMF sub-sequence, recalculates the residue \(r(t)=x(t)-c(t)\), and proceeds to Step 5. Otherwise, it takes the current candidate c(t) as the input, i.e. \(x(t) = c(t)\), and repeats Steps 2\(\sim\)4.

Step 5:

It terminates if the residue r(t) is a constant or has at most one minimum point and one maximum point each. Then the input is decomposed as the following:

$$\begin{aligned} x(t)=\sum _{i=1}^{k}{{\textrm{IMF}}_i(t)}+r(t) \end{aligned}$$
(3)

Otherwise, it sets \(x(t) = r(t)\), increments \(k = k + 1\), and repeats Steps 2\(\sim\)5.

EEMD11 and CEEMDAN12 are two variants of EMD. CEEMDAN improves EMD with adaptive noise and multiple iterations: it first performs EMD decomposition on the original sequence, and then further decomposes it by adding adaptive noise over multiple iterations to mitigate mode mixing.
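As an off-the-shelf illustration (not necessarily the implementation used in this paper), CEEMDAN decomposition can be run with the PyEMD package (installed as `EMD-signal`); the toy signal below is only for demonstration.

```python
import numpy as np
from PyEMD import CEEMDAN  # pip install EMD-signal

# a toy nonlinear, non-stationary signal
t = np.arange(2000)
x = np.sin(2 * np.pi * t / 400) + 0.1 * np.sin(2 * np.pi * t / 20)

ceemdan = CEEMDAN()             # ensemble EMD with adaptive noise
imfs = ceemdan(x)               # rows are IMF_1, IMF_2, ..., IMF_n
residue = x - imfs.sum(axis=0)  # whatever remains after removing all IMFs

print(f"extracted {imfs.shape[0]} IMFs from a series of length {len(x)}")
```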

EMD with temporal attention

In the classical long short-term memory (LSTM) model, a unit is composed of three types of gates: input, forget, and output. These gates implement selective memory of, and long-term dependence on, the input information.

LSTM aims to address the long-term dependence problem of RNNs; however, ‘long short-term memory’ is distinct from long-term memory. When dealing with relatively long sequences or sequences of uncertain length, the information storage capacity of the hidden and cell states output by LSTM remains limited. The temporal attention mechanism, also known as the global attention mechanism, is one of the methods used to better capture the patterns within long sequences. Temporal attention is described by the following 5 steps:

  Step 1. It feeds the hidden state h and the cell state c into the LSTM unit for one-step prediction, obtaining the updated hidden state \(h_t\) and cell state \(c_t\).

  Step 2. It concatenates \(h_t\) to the tail of X, the output tensor of the LSTM encoder, along the time dimension to form the prediction sequence \(H_0\).

  Step 3. Let Q be the last time step of \(H_0\), and let K and V be the \((T+1)\)th to the 2nd time steps from the bottom of \(H_0\), where T is the time dimension of the input to the LSTM encoder. The attention value is defined by the equation below:

    $$\begin{aligned} Atten=\textrm{diag}\left( \textrm{softmax}\left( \frac{KQ^\textrm{T}}{\sqrt{T}}\right) ^\textrm{T}\right) V \end{aligned}$$
    (4)

    where ‘diag(·)’ is a function to convert the column vector to a diagonal matrix. In this case, it is equivalent to multiplying each dimension of the column vector output by the softmax function with the corresponding row vector in V.

  Step 4. It sums the attention values of Formula (4) over the time dimension to obtain \(h^*\).

  Step 5. It concatenates Q to the tail of \(h^*\), which becomes the prediction result \(h_{atten}\) to output.

Temporal attention can be utilized for multi-step prediction by iteratively executing the aforementioned procedure in the decoder. After each step, the hidden state h and cell state c are updated to \(h_t\) and \(c_t\), respectively, and the prediction results of each step are concatenated along the time dimension in sequence.
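To make the computation concrete, the following PyTorch sketch implements one decoding step of Formula (4) under our reading of the description; the tensor shapes and the choice of feeding h back as the LSTM cell input are assumptions, so this is illustrative rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def temporal_attention_step(X, h, c, lstm_cell):
    """One decoding step of temporal attention (Formula 4).

    X : (T, H) hidden states of the LSTM encoder
    h, c : (H,) current hidden and cell states
    lstm_cell : e.g. torch.nn.LSTMCell(input_size=H, hidden_size=H)
    """
    T, H = X.shape
    # Step 1: one-step prediction; feeding h back as the cell input is an assumption.
    h_t, c_t = lstm_cell(h.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
    h_t, c_t = h_t.squeeze(0), c_t.squeeze(0)
    # Step 2: append h_t to the encoder outputs along the time dimension.
    H0 = torch.cat([X, h_t.unsqueeze(0)], dim=0)        # (T + 1, H)
    # Step 3: Q is the last time step, K = V are the first T time steps.
    Q = H0[-1]                                          # (H,)
    K = V = H0[:-1]                                     # (T, H)
    weights = F.softmax(K @ Q / T ** 0.5, dim=0)        # (T,)
    atten = torch.diag(weights) @ V                     # (T, H), Formula (4)
    # Step 4: sum the attention values over the time dimension.
    h_star = atten.sum(dim=0)                           # (H,)
    # Step 5: concatenate Q to the tail of h_star as this step's output.
    h_atten = torch.cat([h_star, Q], dim=0)             # (2H,)
    return h_atten, h_t, c_t
```

For multi-step prediction, this function would be called repeatedly with the updated \(h_t\) and \(c_t\), concatenating the per-step outputs along the time dimension.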

Methodology

Framework

This paper proposes a framework, shown in Fig. 1, for multi-series prediction based on connectors and EMD. It employs a pipeline of normalization and decomposition. First, it normalizes each series in the original dataset separately instead of using global normalization. Then, it inserts one of the two kinds of connectors between adjacent short series and concatenates them along the time dimension into a long series. Afterwards, this long series is decomposed using CEEMDAN.

Each individual series in the original dataset undergoes separate normalization, followed by concatenation with a specific connector. The resulting concatenated series is then decomposed into multiple sub-sequences, comprising a set of Intrinsic Mode Function (IMF) series (\(\textrm{IMF}_1\), \(\textrm{IMF}_2, \ldots , \textrm{IMF}_n\)) and a residual series (RES). Each sub-sequence serves as input for training the prediction model, and the integration of these output series serves as the final prediction result.

Figure 1. System framework.
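Under our reading of the framework, its data flow can be summarized by the following Python sketch; the helper callables (`make_connector`, `make_predictor` and its `fit_predict` interface) are hypothetical placeholders for the components described above, not the authors' code.

```python
import numpy as np
from PyEMD import CEEMDAN

def framework_sketch(series_list, make_connector, make_predictor):
    """End-to-end sketch of Fig. 1: normalize -> connect -> decompose -> predict -> sum."""
    # data pre-processing: separate min-max normalization of each short series
    normed = [(s - s.min()) / (s.max() - s.min()) for s in series_list]

    # concatenation with a connector inserted at every junction
    parts = [normed[0]]
    for prev, nxt in zip(normed[:-1], normed[1:]):
        parts.extend([make_connector(prev, nxt), nxt])
    long_series = np.concatenate(parts)

    # CEEMDAN decomposition into IMF_1 ... IMF_n plus a residual series
    imfs = CEEMDAN()(long_series)
    res = long_series - imfs.sum(axis=0)
    sub_sequences = list(imfs) + [res]

    # one prediction model per sub-sequence (LSTM or LSTM-TA); the per-model
    # outputs are summed to form the final prediction (denormalization omitted)
    preds = [make_predictor().fit_predict(sub) for sub in sub_sequences]
    return np.sum(preds, axis=0)
```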

Detailed steps

The separate normalization approach effectively eliminates value range discrepancies between series, resulting in a more uniform distribution of the decomposed series. However, there is almost no continuity between the adjacent series to be concatenated, i.e. there can be a significant variation at the junction, as shown in Fig. 2 at time steps 400, 800 and 1200. If these discontinuities are ignored during concatenation and subsequent decomposition, they lead to sub-sequences with distinct features at the junction points that can potentially impact subsequent analysis.

Figure 2. The images of series \(x_1(t)\) (left), \(x_2(t)\) (middle) and \(x_3(t)\) (right) as well as their first several sub-sequences after decomposition. In each group of images in this and the following figures, the first sub-image represents the series before decomposition, and each following sub-image represents \(\textrm{IMF}_1\), \(\textrm{IMF}_2\), etc.

For instance, consider a multi-series dataset with five series \(s_1(t) \sim s_5(t)\), where k is assigned 400 as an empirical value without loss of generality:

$$\begin{aligned} s_1(t)=\sin {\left( \frac{t}{k}-1/2\right) \pi },(t=0,1,2,\ldots ,k-1) \\ s_2(t)=\sin {\left( \frac{2t}{k}-1/2\right) \pi },(t=0,1,2,\ldots ,k-1) \\ s_3(t)=\sin {\left( \frac{2t}{k}\right) \pi },(t=0,1,2,\ldots ,k-1) \\ s_4(t)=\sin {\left( \frac{2t}{k}-1/4\right) \pi },(t=0,1,2,\ldots ,k-1) \\ s_5(t)=\sin {\left( \frac{2t}{k}-3/4\right) \pi },(t=0,1,2,\ldots ,k-1) \end{aligned}$$

The five series \(s_1(t)\sim s_5(t)\) need not be normalized since they all lie in the range [-1, 1]. Based on these series, we construct three synthetic datasets below. We sequentially concatenate \(s_1(t)\sim s_5(t)\) to get \(x_1(t)\):

$$\begin{aligned} x_1(t)={\left\{ \begin{array}{ll} s_1(t), & t=0,1,2,\ldots ,k-1 \\ s_2(t-k), & t=k,k+1,\ldots ,2k-1 \\ s_3(t-2k), & t=2k,2k+1,\ldots ,3k-1 \\ s_4(t-3k), & t=3k,3k+1,\ldots ,4k-1 \\ s_5(t-4k), & t=4k,4k+1,\ldots ,5k-1 \end{array}\right. } \end{aligned}$$

To simulate the mixing of multiple modes, a short-period, low-amplitude sine wave component is superimposed on \(x_1(t)\) to get \(x_2(t)\):

$$\begin{aligned} x_2(t)=x_1(t)+\gamma \sin {\frac{\pi t}{\alpha }} \end{aligned}$$

Furthermore, to better simulate the situation of the datasets selected in this paper, a random shock within ±0.1 is superimposed on \(x_1(t)\) to get \(x_3(t)\):

$$\begin{aligned} x_3(t)=x_1(t)+\textrm{uniform}(-\gamma ,\gamma ) \end{aligned}$$

where \(\alpha\) is initialized as 10 as an empirical value; \(\gamma\) is the amplitude of the vibration, initialized as 0.1, i.e. ten percent of the variation of the series; \(\textrm{uniform}(a, b)\) is a function that outputs a random floating-point number in the range (a, b). By adding the random vibration function uniform, it simulates the uncertainty of the variation between adjacent observations; thus \(x_2(t)\) and \(x_3(t)\) better reflect the numerical variation characteristics of the datasets selected in this paper than \(x_1(t)\).
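A small numpy sketch constructing the three synthetic series (with k = 400, α = 10 and γ = 0.1 as above) might look as follows.

```python
import numpy as np

k, alpha, gamma = 400, 10, 0.1
t = np.arange(k)

s1 = np.sin((t / k - 1 / 2) * np.pi)
s2 = np.sin((2 * t / k - 1 / 2) * np.pi)
s3 = np.sin((2 * t / k) * np.pi)
s4 = np.sin((2 * t / k - 1 / 4) * np.pi)
s5 = np.sin((2 * t / k - 3 / 4) * np.pi)

x1 = np.concatenate([s1, s2, s3, s4, s5])            # direct concatenation
T = np.arange(len(x1))
x2 = x1 + gamma * np.sin(np.pi * T / alpha)          # added short-period component
x3 = x1 + np.random.uniform(-gamma, gamma, len(x1))  # added random vibration
```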

We use CEEMDAN to decompose the series \(x_1(t)\), \(x_2(t)\) and \(x_3(t)\); the images of their first several sub-sequences are shown in Fig. 2. For \(x_1(t)\), its \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\) only generate waveforms at and near the first three junctions. The amplitude of the sub-sequence is directly proportional to the magnitude of change in the original series, and the values elsewhere are always zero. At the fourth junction, no waveform appears in the sub-sequence since the values are continuous there. \(\textrm{IMF}_3\) shows characteristics similar to those of \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\), but there is a significant deformation in the second segment. Although \(\textrm{IMF}_4\) approximately captures the overall trend of the original data, there is still a significant disparity in trend near the junction points compared to the original series. Moreover, this deviation becomes more pronounced as the magnitude of changes increases within the original data. As for \(x_2(t)\), besides the characteristics reflected in \(x_1(t)\), it shows the short-term regularity of the original series in \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\). However, the series junctions exhibit a longer period and significantly higher amplitude compared to other regions, deviating substantially from the original periodic regularity. Similarly, the sub-sequences of \(x_3(t)\) also demonstrate characteristics akin to those of \(x_1(t)\) and \(x_2(t)\), with a pronounced increase in amplitude at the junctions of the original series as well as distinct periodic features that differ from other sections.

Real-world datasets have more complex periodic regularities and variations than the synthetic datasets above (Sec. 4.1 details these datasets). Taking 5 randomly selected areas from the Monthly ReTail Sales of the USA (MRTS) dataset and another 5 random stocks from Stock-D as examples, after direct concatenation and decomposition, the images of their first several sub-sequences are shown in the left column of Fig. 3. As in the previous experiments on synthetic datasets, in the MRTS sample the amplitudes of \(\textrm{IMF}_1\sim \textrm{IMF}_4\) are significantly higher at and near the four junctions, where t = 336, 672, 974 and 1310, than elsewhere. In the Stock-D example, at the three junctions where t = 124, 248 and 372, the observation changes do not differ significantly from those nearby, so the phenomenon seen in the previous experiments is not obviously reflected. However, at the junction where t = 496, since the observations still have a high upward jump, a higher amplitude appears at the same position of \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\).

Figure 3. The images of the samples of the MRTS (top row) and Stock-D (bottom row) datasets as well as their first several sub-sequences after decomposition. In each row, the left figure is the series with direct concatenation, the middle figure is the series with the linear connector, and the right figure is the series with the linear and randomly vibrating connector.

From the aforementioned series decomposition experiments conducted on both synthetic and real-world datasets, it is observed that even after separate normalization of the datasets, direct concatenation of these series prior to decomposition results in significant amplitude spikes at the junctions compared to other regions. Consequently, this leads to deviations from periodic regularities in the decomposed sub-sequences near these junctions. Additionally, an increase in jump magnitude within the original series corresponds to amplified waveform deviations from the original periodic regularity at these junctions.

Connector insertion

To address the issue of waveform deviation in the decomposed series, caused by significant jumps during direct concatenation that disrupt the original periodic regularity and overall trend, we propose an alternative approach known as indirect concatenation of multi-series. This method introduces a connector between each pair of adjacent series. We consider the following two types of connector.

Linear Interpolation (LIP) Given the end value of the former series, \(y_{\textrm{end}1}\), and the start value of the latter series, \(y_{\textrm{start}2}\), the LIP connector interpolates a linear series of length s, where s is the number of interpolation values added between \(y_{\textrm{end}1}\) and \(y_{\textrm{start}2}\). There will be s values inserted, and the ith (i = 1, 2, ..., s) value of the LIP connector counted from \(y_{\textrm{end}1}\) is computed as in Formula (5).

$$\begin{aligned} y_i=y_{\textrm{end1}}+\frac{(y_{\textrm{start2}}-y_{\textrm{end1}})i}{s+1} \end{aligned}$$
(5)

The LIP connector relieves the ‘jumping’ phase between \(y_{\textrm{end}1}\) and \(y_{\textrm{start}2}\), reducing the IMF components decomposed to fit such phases. The length of the connector is also worth mentioning: empirical tests show that, given the length L of the short series, the connector length s works well within a certain range, as discussed in Sec. 4.3.

Linear Interpolation with Random Vibration (LRV) The straightforward linear interpolation is still not optimal for bridging the ‘jumping’ distance \(\delta = |y_{\textrm{end}1} - y_{\textrm{start}2}|\), because the linear sampling values lead to multiple IMFs with ‘firm’ frequencies. Based on Formula (5), a vibration function \(\textrm{uniform}(-d,d)\) is introduced into the LRV connector, making each interpolation point vary by a random value as shown in Formula (6).

$$\begin{aligned} y_i=y_{\textrm{end1}}+\frac{(y_{\textrm{start2}}-y_{\textrm{end1}})i}{s+1}+\textrm{uniform}(-d,d) \end{aligned}$$
(6)

The LRV connector introduces vibration with a uniform distribution within the range of \(\pm d\) on top of the LIP connector. The LRV connector brings in vibrations smoothly, rather than being too rigid (as simple linear interpolation is) with respect to the uncertainty of the original series, and thus keeps the short-term regularities of the whole series as consistent as possible.
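Formulas (5) and (6) translate directly into code; below is a minimal numpy sketch (the function names are ours) that also shows how a list of normalized series would be joined with either connector.

```python
import numpy as np

def lip_connector(y_end1, y_start2, s):
    """Linear interpolation (LIP) connector of length s, Formula (5)."""
    i = np.arange(1, s + 1)
    return y_end1 + (y_start2 - y_end1) * i / (s + 1)

def lrv_connector(y_end1, y_start2, s, d):
    """LIP connector plus uniform random vibration in [-d, d], Formula (6)."""
    return lip_connector(y_end1, y_start2, s) + np.random.uniform(-d, d, s)

def concatenate_with_connector(series_list, s, d=None):
    """Concatenate normalized series, inserting a connector at each junction.

    With d=None a LIP connector is used; otherwise an LRV connector with
    vibration amplitude d, e.g. concatenate_with_connector(normed, s=50, d=0.1).
    """
    parts = [series_list[0]]
    for prev, nxt in zip(series_list[:-1], series_list[1:]):
        if d is None:
            parts.append(lip_connector(prev[-1], nxt[0], s))
        else:
            parts.append(lrv_connector(prev[-1], nxt[0], s, d))
        parts.append(nxt)
    return np.concatenate(parts)
```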

Apart from the differences at the junctions introduced by the connectors, there are no disparities in the periodic characteristics of each sub-sequence within the regions representing the original data.

A comprehensive evaluation of the advantages and disadvantages of both connectors will be conducted via further experiments on diverse datasets in Sec. 4.

Experiments

This section first describes the datasets used in the experiments, then validates the usage of the two types of connectors; furthermore, it discusses the relationship between connector length and connection effects, as well as the usage of normalization in the concatenation.

Experiments are performed on 5 real-world datasets: Monthly ReTail Sales of the USA (MRTS), three selected and organized stock price datasets Stock-D/Stock-W/Stock-M, and the Socioeconomic Status Score (SES). The performance of the connector-based approach is evaluated using four metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and R-square (\(R^2\)). Each dataset is divided into training and test subsets in a certain proportion, with a series as the unit.
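For reference, the four metrics can be computed with a few lines of numpy (a straightforward sketch; the MAPE definition assumes non-zero targets).

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, RMSE, MAPE (in percent) and R^2 for a one-step prediction task."""
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    mape = np.mean(np.abs(err / y_true)) * 100           # assumes y_true != 0
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```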

After normalization, each dataset is processed by three methods: direct concatenation, concatenation with the LIP connector, and concatenation with the LRV connector. The three concatenated series for each dataset are decomposed and trained separately. For each dataset, the length of the connector, the number of sub-sequences, and the model selection for each sub-sequence in ‘Temporal Attention Integration’ are shown in the following subsections.

Dataset description

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Table 1 Characteristics of datasets.

Monthly ReTail Sales of the USA (MRTS): A database of sales from https://www.kaggle.com/datasets/landlord/usa-monthly-retail-trade (March 24, 2023). It contains monthly sales data for various fields of the US retail industry from January 1992 to May 2020. We selected the original statistical data (excluding revised data) stored in Excel format. After eliminating some aggregate (‘total’) items and considering the experimental settings, we select data from 28 fields.

Stock-D/Stock-W/Stock-M: A database of stock prices from https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction (June 26, 2023). These are three stock price datasets selected and organized from the competition dataset of the Tokyo Stock Exchange stock price prediction. Stocks are screened based on average daily volume. Stock-D is the daily closing price of 50 companies’ stocks from July 1 to December 30, 2019. Stock-W is the weekly price of 100 companies’ stocks over 51 weeks of 2019 (the stock market is closed for the whole first week of May). Stock-M is the monthly price of 120 companies’ stocks during 2017\(\sim\)2019.

Socioeconomic Status Score (SES): A database of macroeconomic indicators from https://www.kaggle.com/datasets/sdorius/globses (November 11, 2022). It contains socioeconomic status percentage scores for 149 countries every 10 years between 1880 and 2010.

The relevant characteristics of the above datasets are shown in Table 1. The prediction length is 1 and the number of neural network units is 32 for all datasets.

Connector validity

Table 2 Connector Validity.

The average results of multiple tests (at least 5 runs) are presented in Table 2. The best results are highlighted in bold, and second-best results are underlined. For each prediction model on the MRTS dataset, incorporating the LRV connector yields superior prediction results compared to direct concatenation, whereas adding only the LIP connector does not perform as well as direct concatenation. On the other four datasets, under identical conditions, incorporating the LIP connector demonstrates better prediction performance than both direct concatenation and the LRV connector.

The efficacy of various concatenation techniques for multiple time series differs significantly across datasets. This variability stems from the nature of the MRTS dataset, which captures monthly sales figures. The sales volumes of certain items, particularly seasonal products such as agricultural goods, sideline products, and clothing, are subject to annual fluctuations. Furthermore, specific events, including store anniversaries and annual shopping events like ‘year-end specials,’ can also lead to variations in yearly sales patterns.

When comparing the use of an LRV connector to the use of a LIP connector alone, the latter tends to diverge more from the periodic changes observed in the data, which can significantly affect the identification of the corresponding periodic patterns. On the other hand, the other four multi-series datasets, which include stock price variations, do not exhibit clear periodic regularities.

Additionally, datasets with sparse time series, such as SES, demonstrate a wide range of variation characteristics. Even when these series are concatenated, they do not reveal any discernible periodic regularity. Therefore, for these types of datasets, the LIP connector, without the random vibrations, can still provide a rough alignment with the variation characteristics of each individual series.

As a stage conclusion, the choice of concatenation method for multi-series datasets should be tailored to their inherent characteristics. It is advisable to insert LRV connectors for datasets that exhibit clear periodic patterns; this approach can effectively capture and preserve the periodic nature of the data. Conversely, a simpler method that adds a LIP connector, without the introduction of random vibrations, is more appropriate for datasets that do not display pronounced periodic traits. This alternative is better suited to datasets where the preservation of periodicity is not a primary concern, allowing for a more straightforward analysis of the underlying data trends.

Connector length concerns

The length of the connector also matters. If the connector is too short, a large variation remains between adjacent points, which produces high amplitude at the corresponding positions of the sub-sequences. If the connector is too long, it obviously increases the overhead of series decomposition. Accordingly, this experiment focuses on the length of the connector. The following experiments are conducted only on the sample of the MRTS dataset.

We set s = 10, 30, 50, 80 and 100 and concatenate the sample dataset with the two proposed concatenation methods in Subsection 3.3, then decompose the series. The images of their first several sub-sequences are shown in Figs. 4 and 5.

Figure 4. Images of the sample of the MRTS dataset concatenated by the LIP connector with different lengths s, as well as their first several sub-sequences after decomposition. The five figures horizontally represent the cases with s = 10, 30, 50, 80 and 100, respectively.

Figure 5. Images of the sample of the MRTS dataset concatenated by the LRV connector with different lengths s, as well as their first several sub-sequences after decomposition. The five figures horizontally represent the cases with s = 10, 30, 50, 80 and 100, respectively.

In Fig. 4, considering only the regions of the original data, there is no obvious difference in \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\) across the groups of images. There is also no obvious difference in \(\textrm{IMF}_3\) across the groups, except that when \(s=10\) a large amplitude appears near the junctions where t = 336, 984 and 1340 compared with the other groups. It can be seen from \(\textrm{IMF}_4\) in each group that, along with the increase of the connector length, the amplitude difference between the four junctions and other areas (except the second half of the third segment of the series, whose regularity is decomposed into \(\textrm{IMF}_4\) instead of \(\textrm{IMF}_5\)) gradually decreases, and when \(s = 80\), this difference has been basically eliminated.

Figure 5 reflects some of the features in Fig. 4 as well. In \(\textrm{IMF}_4\), along with the increase of the connector length, the amplitude difference between the four junctions and other areas also gradually decreases, and when \(s = 50\), this difference has been basically eliminated.

The correlation between the length of the connector, the number of sub-sequences and the model selection in ‘Temporal Attention Integration’ is shown in Table 3.

Table 3 Influence of Connector Length.

Normalization effects

The experiments show that it is necessary to normalize the short series. Two types of normalization are commonly employed: the min-max scaler and Z-normalization, which scales based on the mean and standard deviation. The former, the min-max scaler, is employed in our paper given its prevalence in existing research.

There are two ways of normalization shown below, global and separate normalizations.

Global normalization Each sequence segment is concatenated along the time dimension and normalized as a whole. For example, there are two sequences:

$$\begin{aligned} \{a_n\}=\{0,1,2,3,4\} \\ \{b_n\}=\{6,7,8,9,10\} \end{aligned}$$

After the global min-max scaler, the two sequences become:

$$\begin{aligned} \{a_n^\prime \}=\{0,0.1,0.2,0.3,0.4\} \\ \{b_n^\prime \}=\{0.6,0.7,0.8,0.9,1\} \end{aligned}$$

However, in multiple sequences, there may be large diversity in the ranges of values covered by the individual sequence segments (e.g., the price of one stock is tens to hundreds of dollars, while the price of another is only a few dollars). If they are normalized globally, this diversity will not be eliminated, which will still increase the difficulty and reduce the efficiency of neural network training, thus affecting the accuracy of prediction. To eliminate this diversity, the following method, named separate normalization, is used.

Separate normalization Each sequence segment is normalized separately and then concatenated along the time dimension for further research. For the sequences \(\{a_n\}\) and \(\{b_n\}\) above, after the separate min-max scaler, they will be as follows:

$$\begin{aligned} \{a_n^\prime \}=\{0,0.25,0.5,0.75,1\} \\ \{b_n^\prime \}=\{0,0.25,0.5,0.75,1\} \end{aligned}$$

The approach described effectively reduces the variability within each sequence segment. However, to ensure precise denormalization prior to output, it is essential to record the length of each segment when they are concatenated directly, as well as the individual maximum and minimum values of each segment, to facilitate the subsequent denormalization. In this paper, when the series are joined with connectors, it is necessary to record not only the maximum and minimum values but also the starting and ending positions of each segment to ensure accurate reconstruction and analysis.
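A compact numpy sketch of separate min-max normalization that records, for each segment, the scaling parameters and the start/end positions needed for later denormalization (the bookkeeping format is our illustrative choice):

```python
import numpy as np

def separate_minmax(series_list, connector_len):
    """Normalize each segment separately and record what denormalization needs."""
    normed, meta, start = [], [], 0
    for seg in series_list:
        lo, hi = seg.min(), seg.max()
        normed.append((seg - lo) / (hi - lo))
        # record min/max plus the start/end positions of the segment in the
        # concatenated series (connectors occupy the gaps in between)
        meta.append({"min": lo, "max": hi, "start": start, "end": start + len(seg)})
        start += len(seg) + connector_len
    return normed, meta

def denormalize(segment_pred, info):
    """Map a prediction for one segment back to its original value range."""
    return segment_pred * (info["max"] - info["min"]) + info["min"]
```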

For the multi-sequence datasets MRTS and SES, the two ways of normalization, ‘global’ and ‘separate’, are tested before EMD. Firstly, before decomposition, the sequences after the two types of normalization are input to LSTM, respectively, to train and predict. The results are collected in Table 4. The best results are highlighted in bold.

In Table 4, all the results for MRTS are better with separate normalization than with global normalization; the same holds for the SES data, except on MAPE. The reason why the MAPE of the separately normalized SES is higher than that of the globally normalized one is that the prediction error is larger for some items with smaller values.

Table 4 Normalization (without Decomposition) Effects.

Furthermore, for MRTS, the partially decomposed sequences after the two types of normalization are shown in Fig. 6. The specific selected sub-sequences are labeled on the left of the image. The comparison shows that, since lower observations account for a large part of the original data, the peaks in the obtained short- and medium-period sub-sequences are obviously biased towards the higher part of the original data after global normalization and decomposition. However, for the sequences normalized separately, the peak distribution after decomposition is more even.

Figure 6. The images of the MRTS dataset after global normalization (left) and separate normalization (right), and their partial sub-sequences after CEEMDAN.

The two sets of sub-sequences obtained above are input to LSTM, respectively, and the \(R^2\) comparison between them is shown in Table 5. After the decomposition of the globally normalized sequences, serious overfitting appears on two short-term regularity sequences, \(\textrm{IMF}_1\) and \(\textrm{IMF}_2\), which makes their predictions deviate significantly from the actual results. On the contrary, the network fits much better for the sequences normalized separately.

Table 5 Predicting Effects of Different Normalization.

Related works

Time series prediction

There exists systematic research on time series mining. In these studies, the experimental datasets vary in both the number of series and the length of each series. However, handling series of diverse sizes in a uniform manner is seldom emphasized, which highlights the significance of series concatenation.

Deep neural networks have been widely used in time series prediction13,14,15,16 due to their complex nonlinear characteristics. Ren et al. proposed an anomaly detection algorithm based on spectral residual and Convolutional Neural Networks (CNN) in13, proving its universality and effectiveness. Chen et al. proposed in14 Time-Aware Multi-Scale Recurrent Neural Networks (TAMS-RNNs), which can adaptively capture the multi-scale information of each time series at each time step. Cirstea et al. proposed a Distinct Filter Generation Network (DFGN) in15 to capture the different temporal dynamics of different entities, and a Dynamic Adjacency Matrix Generation Network (DAMGN) to generate dynamic graphs. Jin et al. proposed the Domain Adaptation Forecaster (DAF)16, which applies ___domain adaptation techniques via attention sharing to address the data scarcity issue.

Research on time series prediction also explores feature capture15,17 and model optimization18,19,20. Ding et al. introduced an extreme loss in17 to detect possible extreme events. Crabbe et al. proposed dynamic masks in21 to select features parsimoniously and legibly from a large number of inputs. Zaffran et al. applied Adaptive Conformal Inference (ACI) to general time series18, proposing an adaptive method, AgACI, which reduces parameter dependencies by online expert aggregation. Hasson et al. discussed stacked generalization in ensemble learning and applied it to time series prediction in19. Woo et al. proposed a time-index model in20 to automatically learn a functional form from the time series.

Series decomposition

Studies have explored the decomposition of periodic regularities and general trends from nonlinear and non-stationary series, such as STL (Seasonal and Trend decomposition using Loess)22, the discrete wavelet transform23, EMD8, VMD (Variational Mode Decomposition)24, SSA (Singular Spectrum Analysis)25 and STR (Seasonal-Trend decomposition based on Regression)26. There are two outstanding improvements to EMD: EEMD (Ensemble Empirical Mode Decomposition)11 and CEEMDAN (Complete Ensemble Empirical Mode Decomposition with Adaptive Noise)27.

Although the first model dates back to the 1990s10, long short-term memory has become popular in time series decomposition and prediction recently. Wang et al. proposed a multilevel wavelet decomposition network28 to build frequency-aware deep learning models for time series analysis, and proposed multi-frequency long short-term memory (mLSTM) for time series prediction. Tran et al. employed a seasonal-adjustment method that decomposes each time series into seasonal, trend and irregular components, and built prediction models for each component individually29. Wen et al. proposed a seasonal-trend decomposition method6, RobustSTL, which extracts the trend component by the least absolute deviations (LAD) loss with sparse regularization, and the seasonality component by non-local seasonal filtering. On this basis, they proposed a generalized ADMM (alternating direction method of multipliers) to speed up the computation30. Yang et al. proposed a model hybridizing EMD, stacked auto-encoders and extreme learning machines31. Dudek et al. proposed a seasonal-trend-dispersion decomposition (STD) to extract the trend, the seasonal component and a component related to the dispersion of the time series5.

In the aforementioned studies, the series are usually decomposed one at a time. It is hard to train a model on multiple series simultaneously because (1) the number of decomposed series varies, (2) multiple series exhibit distinct periodic regularities or even lack obvious periodic patterns altogether, and (3) the decomposition components may possess dissimilar periodic characteristics across different series. There is hardly any research exploring the concatenation of multiple series followed by mode decomposition.

Conclusion

This paper investigates pre-processing methods for multi-series data based on series decomposition, aiming to obtain sub-sequences that align better with the periodic characteristics and overall trend of the original series. Employing series decomposition on multi-series data, the paper examines the issue of trend deviation caused by the large sub-sequence amplitude differences arising from direct concatenation of multi-series. To address this problem, it proposes a pre-processing method that incorporates a connector between each pair of concatenated series. These connectors include linear and random vibrating types. The decomposed sub-sequences are then leveraged for training and predictive modeling. A comparative analysis of the decomposition and prediction outcomes between multi-series processed through direct concatenation and those employing the two different connector methods reveals that these methods effectively mitigate the issues related to direct concatenation. Furthermore, it is observed that the combination of linear and random vibrating connectors is well-suited for datasets with periodic features, whereas a simple linear connector is more fitting for datasets that do not exhibit clear periodic patterns.

This methodology significantly improves the accuracy of predictions. Moreover, the paper meticulously evaluates the model using representative datasets from a variety of fields, demonstrating the model’s applicability to a broad spectrum of common multi-series datasets of varying scales. The experimental results analysis suggests that while the method of inserting additional connectors mitigates the discontinuity at the junctions, it is not entirely comprehensive or optimal. There is no theoretical substantiation for the existence of an ideal connector. Although it may be feasible to develop a connector that better corresponds with the periodic characteristics identified in the training set, there is a lack of systematic criteria for evaluating the discovery of such characteristics. These challenges highlight the need for further research in this area.