Introduction

With the advancements in space geodetic technology and improvements in accuracy, many high-precision and high-spatiotemporal-resolution global navigation satellite system (GNSS) position time series have accumulated, which provide valuable support for studying the dynamics of different geophysical phenomena, such as tectonic plate movements1, crustal deformation2,3,4, postglacial rebound5 and sea-level changes6,7.

To extract geophysical signals (i.e., long-term trends and periodic signals) from noisy time series, many methods have been adopted, which can be divided into three categories: (1) frequency spectrum methods, including wavelet analysis (WA)8,9, least-squares wavelet analysis (LSWA)10,11, empirical mode decomposition12, local mean decomposition13, etc.; (2) empirical orthogonal function (EOF) methods, including principal component analysis (PCA)14, singular spectrum analysis (SSA)15,16,17, multichannel SSA18, and independent component analysis (ICA)19,20; and (3) machine learning methods such as support vector machines21. Different analysis methods have corresponding advantages for processing homogeneous geophysical time series; notably, all the methods mentioned above require that the time series be complete without missing data.

However, random or continuous data gaps normally exist in position time series due to receiver failure, outlier removal and other unknown reasons22. Therefore, making full use of the available information in existing time series to fill the missing data is a more reasonable and practical approach than discarding the missing data23,24. To our knowledge, the missing data can be divided into three categories: (1) neighbor point interpolation methods, including linear interpolation14, cubic spline interpolation25, cubic polynomial interpolation26, etc. (2) The improved frequency spectrum and EOF methods, which are used to fill the missing time series data. Qiu et al.27 proposed iteration empirical mode decomposition to fill the missing data in GNSS position time series and obtained significant interpolation effects. Xu28 proposed an iterative empirical orthogonal function to recover gappy GPS coordinate time series, which outperformed ordinary least square estimation interpolation in the wavelet analysis of GPS time series, especially for wide gaps. Wang et al.29 developed improved multichannel singular spectrum analysis to process GRACE monthly field solutions. (3) Machine learning methods for filling missing data, such as missForest23,30, long short-term memory (LSTM)31,32,33,34, and regularized expectation maximization (RegEM)35.

Compared with the abovementioned methods, especially similar algorithms such as missForest and LSTM, some limitations exist when missing data are filled. For example, the missForest method is time-consuming and biased36, and there is the flat-value problem of LSTM37. The Gaussian process (GP) is a new machine learning method with rigorous theoretical and statistical learning foundations38, which scientifically obtains prediction accuracy and probability at the same time39,40,41. In addition, the GP model has already been established for various spatial and temporal problems42. Moreover, a GP is a stochastic process in which any finite subcollection of random variables has a multivariate Gaussian distribution43. Conventional GP focuses only on the temporal relationship of time series to fill in missing data44,45. The spatial correlation between different stations for individual components is significant. However, the correlation between the position series of different coordinate components per station is not significant46. Considering the advantages of the GP and the impacts of adjacent station time series, an improved interpolation approach called the improved GP was developed in this study. The simulation and real data analysis experiments were carried out to test the performance of the improved GP for filling the missing data together with the conventional GP and cubic spline methods. Note that PCA was used in the real GNSS position time series analysis to evaluate the interpolation efficiency.

The remaining sections of the paper are organized as follows: Section "Methodology" presents the details of the improved GP. Section "Simulation experimental analysis" presents the simulation experiments used to test the interpolation performance of the improved GP. Section "Real GNSS position time series analysis" examines the effectiveness of the improved GP for interpolating the missing data in real GNSS position time series. A discussion and conclusions are presented in Sections "Discussion" and "Conclusions", respectively.

Methodology

Conventional Gaussian process

The GP model provides a method for modeling probability distributions over functions, which can be used in a Bayesian analytical framework. In particular, a GP is a stochastic process that can be fully specified by its mean function and covariance function. The mean function and covariance function are expressed as follows:

$$m\left({\varvec{x}}\right)=E\left[f\left({\varvec{x}}\right)\right]$$
(1)
$$k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)=E\left[(f\left({\varvec{x}}\right)-m\left({\varvec{x}}\right))(f({{\varvec{x}}}{\prime})-m\left({{\varvec{x}}}^{\boldsymbol{^{\prime}}}\right))\right],$$
(2)

where \({\varvec{x}}\) and \({{\varvec{x}}}{\prime}\) represent two different input samples. \(m({\varvec{x}})\) is the mean function, and \(k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)\) denotes the covariance function. Furthermore, the GP can be denoted as follows:

$$f\left({\varvec{x}}\right)\sim GP\left(m\left({\varvec{x}}\right), k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)\right)$$
(3)

It is usually considered a simple zero-mean GP.

The covariance function is a key factor in determining the notion of similarity. There are many choices for covariance functions, which may be reasonable. Formally, functions that generate a nonnegative definite covariance matrix for any set of data must be specified. From the perspective of modeling data, the covariance function is specified to obtain similar predictions with nearby inputs. The commonly used covariance functions include the squared exponential (SE) covariance function, rational quadratic (RQ) covariance function, neural network (NN) covariance function and other functions47,48.

The general forms of the SE covariance function, RQ covariance function and NN covariance function are expressed as follows:

$${k}_{SE}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}\text{exp}\left(-\frac{1}{2{l}^{2}}{\left({\varvec{x}}-{{\varvec{x}}}{\prime}\right)}^{2}\right)$$
(4)
$${k}_{RQ}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}{\left(1+\frac{{\left({\varvec{x}}-{{\varvec{x}}}{\prime}\right)}^{2}}{2\alpha {l}^{2}}\right)}^{-\alpha }$$
(5)
$${k}_{NN}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}\text{arcsin}\left(\frac{{\varvec{x}}\cdot {{\varvec{x}}}{\prime}}{l\sqrt{({l}^{2}+{{\varvec{x}}}^{2}({l}^{2}+{{{\varvec{x}}}{\prime}}^{2}))}}\right),$$
(6)

where \({\sigma }_{f}^{2}\) and \(l\) are hyperparameters of covariance functions.

The given training samples are \(D=\left\{\left({{\varvec{x}}}_{i},{y}_{i}\right)|i=1,\cdots ,n\right\}=\{X,{\varvec{y}}\}\), where \({{\varvec{x}}}_{i}\) represents the input variables with dimension d and X is the input matrix of all the input variables. \({y}_{i}\) denotes the output variable corresponding to each input variable, and \({\varvec{y}}\) is the output vector consisting of all the output variables. In the GP regression model,

$$y=f\left({\varvec{x}}\right)+\varepsilon ,$$
(7)

where x denotes the input vector and f is the function value. y represents the observation with additive noise, which is different from the function \(f({\varvec{x}})\). \(\varepsilon\) is Gaussian noise with an independent and identical distribution and obeys the distribution \(\varepsilon \sim N(0,{\sigma }_{n}^{2})\). \({\sigma }_{n}^{2}\) expresses the variance of the distribution. The given covariance function is used to calculate the likelihood function:

$$p\left(y|X,\theta \right)=N\left(y|0,K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right),$$
(8)

where \(\theta\) represents the hyperparameters of the model, \(K(X,X)\) is the covariance matrix, and \({I}_{n}\) denotes the identity matrix. Joint distribution of the observed value \(y\) and predicted \({y}^{*}\) is as follows:

$$\left[\begin{array}{c}y\\ {y}^{*}\end{array}\right]=\left(0,\left[\begin{array}{cc}K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}& K(X,{X}^{*})\\ K({X}^{*},X)& K({X}^{*},{X}^{*})\end{array}\right]\right),$$
(9)

where \(X\) and \({X}^{*}\) represents training inputs and test inputs, respectively. The key predictive equations can be obtained, which is based on Bayesian theorem.

$${y}^{*}|X,y,{X}^{*}\sim N(\mu \left({y}^{*}\right),\text{cov}({y}^{*}))$$
(10)
$$\mu \left({y}^{*}\right)=E\left[{y}^{*}|X,y,{X}^{*}\right]=K({X}^{*},X){\left[K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right]}^{-1}y$$
(11)
$$\text{cov}\left({y}^{*}\right)=K\left({X}^{*},{X}^{*}\right)-K\left({X}^{*},X\right){\left[K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right]}^{-1}K\left(X,{X}^{*}\right)$$
(12)

where \(\mu \left({y}^{*}\right)\) is the predicted mean and \(\text{cov}\left({y}^{*}\right)\) denotes the predicted covariance.

Furthermore, the optimal values of the hyperparameters are generally determined by maximizing the log marginal likelihood. One of the particularly appealing properties of the GP model is that principled and practical approaches exist for learning the hyperparameters of the mean, covariance and likelihood functions. Good adaptation of such hyperparameters can be essential for obtaining high-quality predictions49. The negative log marginal likelihood function \(L({\varvec{\theta}})\) and its partial derivatives of hyperparameters are represented as follows:

$$L\left({\varvec{\theta}}\right)=\frac{1}{2}{{\varvec{y}}}^{\text{T}}{{\varvec{C}}}^{-1}{\varvec{y}}+\frac{1}{2}\text{log}\left|{\varvec{C}}\right|+\frac{n}{2}\text{log}2\uppi$$
(13)
$$\frac{\partial L({\varvec{\theta}})}{\partial {\theta }_{i}}=\frac{1}{2}\text{tr}\left({\varvec{\upalpha}}{\boldsymbol{\alpha }}^{\text{T}}-{{\varvec{C}}}^{-1}\right)\frac{\partial {\varvec{C}}}{\partial {\theta }_{i}},$$
(14)

where \({\varvec{C}}=K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\), \(\boldsymbol{\alpha }={{\varvec{C}}}^{-1}{\varvec{y}}\).

Improved Gaussian process

Let X be an \(n\times p\) GNSS position time series matrix that requires interpolation, that is,

$${\varvec{X}}=\left({{\varvec{X}}}_{1},{{\varvec{X}}}_{2},\cdots ,{{\varvec{X}}}_{p}\right)=\left[\begin{array}{ccc}{x}_{11}& {x}_{12}& \begin{array}{cc}\cdots & {x}_{1p}\end{array}\\ {x}_{21}& {x}_{22}& \begin{array}{cc}\cdots & {x}_{2p}\end{array}\\ \begin{array}{c}\vdots \\ {x}_{n1}\end{array}& \begin{array}{c}\vdots \\ {x}_{n2}\end{array}& \begin{array}{cc}\begin{array}{c}\vdots \\ \cdots \end{array}& \begin{array}{c}\vdots \\ {x}_{np}\end{array}\end{array}\end{array}\right],$$
(15)

where n is the total number of observation epochs of the time series and p denotes the number of GNSS stations. Missing data inevitably exist in real position time series; therefore, the position time series of every station \({X}_{s}(S\in \left\{\text{1,2},\cdots ,p\right\}\) is assumed to contain missing data. \({X}_{s}\) can be divided into two parts: available values and missing values. The specific interpolation steps of the improved GP are as follows:

(1) The conventional GP method is first used to interpolate the missing data of every GNSS station.

(2) The two nearest GNSS stations form an observation pair time series \({{\varvec{X}}}_{ij}=({{\varvec{X}}}_{i},{{\varvec{X}}}_{j})\). \({{\varvec{X}}}_{ij}\) can be represented in the following four parts, as shown in Fig. 1.

Figure 1
figure 1

The four parts of \({{\varvec{X}}}_{ij}\).

where \({{\varvec{X}}}_{im-gp}\) represents the missing data of \({{\varvec{X}}}_{i}\), which are filled by conventional GP; \({{\varvec{X}}}_{jm}\) represents the missing data of \({{\varvec{X}}}_{j}\); and \({{\varvec{X}}}_{ib}\) and \({{\varvec{X}}}_{jb}\) represent the available values of \({{\varvec{X}}}_{i}\) and \({{\varvec{X}}}_{j}\), respectively.

(3) The improved GP is used to interpolate the missing data of \({{\varvec{X}}}_{j}\). \(\left\{\left({{\varvec{t}}}_{jb},{{\varvec{X}}}_{i-jb}\right),{{\varvec{X}}}_{jb}\right\}\) are training samples, and \(\left\{\left({{\varvec{t}}}_{jm},{{\varvec{X}}}_{i-jm}\right)\right\}\) are the impact factors that are regarded as the test inputs, which are used to interpolate the missing data of \({{\varvec{X}}}_{j}\). \({{\varvec{t}}}_{jb}\) are the epochs of observed values of \({{\varvec{X}}}_{j}\), and \({{\varvec{t}}}_{jm}\) are the missing data epochs of \({{\varvec{X}}}_{j}\). \({{\varvec{X}}}_{i-jb}\) denotes the values of \({{\varvec{X}}}_{i}\) corresponding to the observed epochs of \({{\varvec{X}}}_{j}\). Similarly, \({{\varvec{X}}}_{i-jm}\) denotes the values of \({{\varvec{X}}}_{i}\) corresponding to the missing epochs of \({{\varvec{X}}}_{j}\).

(4) The same procedures are adopted to address the missing data of the remaining stations.

(5) Appropriate evaluation indices are used to evaluate the interpolation performances of the improved GP, conventional GP and cubic spline methods (Fig. 2).

Figure 2
figure 2

The flow chart of the improved GP.

Simulation experimental analysis

To test the performance of the improved GP, simulation experiments were carried out. Note that the key to the GP algorithm normally lies in the selection of kernel functions. Through experimental comparison, the CovNNone and CovNoise composite covariance functions were used in this study. For the simulation cases, the true values of the missing data were known. Thus, three evaluation indices, including the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R)50, were used to evaluate the performances of the improved GP and other conventional methods for filling the missing data.

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({x}_{i}^{true}-{x}_{i}^{int}\right)}^{2}}$$
(16)
$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{x}_{i}^{true}-{x}_{i}^{int}\right|$$
(17)
$$R=\frac{\sum_{i=1}^{n}({x}_{i}^{true}-{\overline{x} }^{true})({x}_{i}^{int}-{\overline{x} }^{int}) }{\sqrt{\sum_{i=1}^{n}{\left({x}_{i}^{true}-{\overline{x} }^{true}\right)}^{2}}\sqrt{\sum_{i=1}^{n}{\left({x}_{i}^{int}-{\overline{x} }^{int}\right)}^{2}}}$$
(18)

where \({x}_{i}^{true}\) and \({x}_{i}^{int}\) are the true and interpolated values of missing data at the i epoch, respectively. R is the correlation coefficient between the filled missing data and the corresponding true values. The smaller the RMSEs and MAEs are and the higher the correlation coefficients are, the better the interpolation ability.

The time series of the GNSS stations used in this study were downloaded from the National Earthquake Data Center (https://www.eqdsc.com/). Two GNSS stations, SCPZ and SCYY, located in Sichuan Province, were used to generate missing data in simulation experiments with complete time series from January 2016 to December 2017; the correlation coefficients between the SCPZ and SCYY were 0.9923, 0.9974 and 0.7742 for the North (N), East (E) and Up (U) components, respectively. The GNSS position time series of SCYY and SCPZ are presented in Fig. 3.

Figure 3
figure 3

Position time series of stations SCYY (top) and SCPZ (bottom) for the N, E and U components.

Considering the impact of adjacent station SCYY, the time series of the SCYY station were used during the procedure of filling in missing data and were compared with conventional GP and cubic splines. Here, we used the 15% missing percentage as an example to show the processing details of the improved GP, conventional GP and cubic spline methods. Then, the incomplete time series was generated by randomly removing 15% of the total epochs missing from the SCPZ station. Figure 4 presents the differences between the filled values and true values of the three methods in a simulation experiment with a 15% missing-data percentage. The RMSEs of the improved GP were 1.08 mm, 1.34 mm and 4.33 mm for the N, E and U components, respectively, which were smaller than the 1.17 mm, 1.52 mm and 5.06 mm RMSEs of the conventional GP for the three components, similar to those of the MAEs (Table 1). In addition, the correlation coefficients between the filled values and true values of the improved GP were 0.9961, 0.9981 and 0.8189 for the three components, which were all higher than those of the conventional GP and cubic spline.

Figure 4
figure 4

Differences between the interpolated values and true values of the improved GP, conventional GP and cubic spline methods in a simulation experiment with 15% missing data.

Table 1 RMSEs, MAEs and correlation coefficients of the three interpolation methods for the N, E and U components in a simulation experiment with 15% missing data.

To obtain relatively reliable statistical results for the 15% missing-data percentage, 100 simulation experiments were carried out. The RMSEs, MAEs and correlation coefficients of 100 simulations of the improved GP, conventional GP and the cubic spline are presented in Fig. 5. Obviously, almost all the RMSEs and MAEs were smaller than those of the conventional GP and cubic spline; however, with higher correlation coefficients, the improved GP performed better at filling the missing data in the GNSS time series than the conventional GP and commonly used cubic spline methods. The mean RMSEs, MAEs and correlation coefficients of the improved GP for 100 simulations were 1.16 mm, 1.36 mm and 4.01 mm; 0.86 mm, 1.00 mm and 3.03 mm; and 0.9947, 0.9978 and 0.8325 for the N, E and U components, respectively, which all indicate that the improved GP can fill in the missing data more accurately and reliably than the other two methods. The specific statistical results are presented in Table 2.

Figure 5
figure 5

RMSEs, MAEs and correlation coefficients of 100 simulations of the improved GP, conventional GP and cubic spline for the 15% missing data.

Table 2 Mean RMSEs, MAEs and correlation coefficients of 100 simulation experiments for the three interpolation methods for the N, E and U components (15% missing-data percentage).

To further analyze the interpolation ability of the three methods at different missing percentages, we randomly generated incomplete time series of the SCPZ station by removing the missing percentage from 5 to 30% with 5% increments. The mean RMSEs, MAEs and correlation coefficients of 100 simulations at different missing percentages are shown in Fig. 6. Figure 6 clearly shows that the RMSEs and MAEs of the improved GP are the smallest, followed by those of the conventional GP, and that of the cubic spline is the largest; however, the opposite is true for the correlation coefficients. In addition, the mean RMSEs and MAEs of the vertical component are significantly larger than those of the two horizontal components. Moreover, as the percentage of missing data increases, the mean RMSEs, MAEs and correlation coefficients slightly increase for the three components; in particular, the cubic spline method is easily affected by increasing the percentage of missing data.

Figure 6
figure 6

Mean RMSEs, MAEs and correlation coefficients of 100 simulations at different missing percentages.

The relative improvements of the improved GP with respect to the conventional GP and cubic spline are calculated as follows:

$${\text{IMP}}_{\text{RMSE}}=\frac{{\text{RMSE}}_{\text{QT}}-{\text{RMSE}}_{\text{IGP}}}{{\text{RMSE}}_{\text{QT}}}\times 100\%$$
(19)
$${\text{IMP}}_{\text{MAE}}=\frac{{\text{MAE}}_{\text{QT}}-{\text{MAE}}_{\text{IGP}}}{{\text{MAE}}_{\text{QT}}}\times 100\%$$
(20)
$${\text{IMP}}_{\text{R}}=\frac{{\text{R}}_{\text{IGP}}-{\text{R}}_{\text{QT}}}{{\text{R}}_{\text{QT}}}\times 100\%$$
(21)

where \({\text{RMSE}}_{\text{IGP}}\), \({\text{MAE}}_{\text{IGP}}\) and \({\text{R}}_{\text{IGP}}\) represent the RMSE, MAE and correlation coefficient (R) of the improved GP, respectively, and \({\text{RMSE}}_{\text{QT}}\), \({\text{MAE}}_{\text{QT}}\) and \({\text{R}}_{\text{QT}}\) denote the RMSE, MAE and correlation coefficient (R) of two other methods (i.e., conventional GP and cubic spline), respectively.

As shown in Fig. 7, the relative improvements in the RMSE and MAE for the improved GP are particularly significant with respect to those of the conventional GP and cubic spline methods regardless of the percentage of missing data, especially for the two horizontal components. The relative improvements for the U component are greater than those for the N and E components. For the RMSE, the mean relative improvements of the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% for the N, E and U components, respectively. Among the three interpolation methods, the cubic spline performs worse than the improved and conventional GP methods, especially for high percentages of missing data. Overall, it is reasonable to conclude that the improved GP outperforms conventional GP and commonly used cubic spline methods in filling the missing data existing in GNSS position time series.

Figure 7
figure 7

Relative improvements in the RMSE, MAE and correlation coefficient for the improved GP with respect to the conventional GP (red line) and the cubic spline (blue line).

Real GNSS position time series analysis

To further evaluate the performance of the improved GP for filling the missing data existing in real position time series, the time series of eight GNSS stations from 2021 to 2022 were analyzed; the locations of the eight GNSS stations are shown in Fig. 8. In addition, the percentages of missing data for eight stations are presented in Fig. 9. Figure 9 shows that the percentage of missing data varies from station to station, especially for the SCSN station, where the percentage of missing data is up to 51.5%, and the percentages of missing data for the other stations are all less than 20%. The average percentage of missing data for eight stations is 18.6%. In addition to missing data due to human activities or the inevitable factors of the environment, some outlier epochs also exist in the GNSS position time series; in this study, the interquartile range (IQR) criterion51,52 was adopted to eliminate outliers.

Figure 8
figure 8

Location of the eight GNSS stations.

Figure 9
figure 9

Missing percentages of the eight stations.

Considering that the true values of missing data are unknown in the real GNSS position time series, it is impossible to evaluate the interpolation performance by using the evaluation indicators of the RMSE, MAE and correlation coefficient in the simulation experiment analysis. An important property of PCA53 is the number of principal directions, and PCs explain the maximum variance in the data. PCA is used to extract the common mode error (CME), which assumes that the first few PCs with larger variance represent the CME of the observation network. To extract the CME accurately, the filled time series of the observation network should maintain the direction of maximum variance as much as possible. Accordingly, the variance of the filled time series is projected in all directions to test the performance of various methods. The equation for calculating the variance in each principal direction is as follows54.

$${v}_{j}={{\varvec{w}}}_{j}^{\text{T}}{\varvec{S}}{{\varvec{w}}}_{j}$$
(22)

where \({{\varvec{w}}}_{j}\) is the j-th principal direction. \({\varvec{S}}\) denotes the covariance matrix, which is calculated as follows.

$${\varvec{S}}=\widehat{{\varvec{X}}}{\widehat{{\varvec{X}}}}^{\text{T}}/(nrows-1)$$
(23)

where \(\widehat{{\varvec{X}}}\) is the interpolated observation matrix and \(nrows\) represents the number of rows in the matrix \(\widehat{{\varvec{X}}}\).

To compare the interpolation efficiency of the improved GP, conventional GP and cubic spline methods, we performed the PCA approach on the complete time series after the missing data of the eight stations were filled via the three interpolation methods. The variance percentages of the filled time series obtained by the first three PCs derived from different interpolation methods16,54 were computed for the N, E and U components, which are presented in Table 3. Table 3 shows that the first three PCs of the improved GP can preserve 98.3%, 99.8% and 77.0% of the total variance in the three components, which are higher than the 97.9%, 99.7% and 76.2% of those of the conventional GP and the 96.6%, 99.6% and 72.3% of those of the cubic spline, respectively, which indicates that the improved GP performs better than the other two methods for filling the missing data in real GNSS position time series. In other words, its interpolation ability is the most prominent among the three interpolation methods.

Table 3 Variance proportions of the first three PCs for the three interpolation methods in the N, E and U directions (%).

Discussion

Accurately filling in missing data is valuable before analyzing GNSS position time series via frequency spectrum analysis and empirical mode decomposition methods, which require that the time series be complete. To solve this difficult problem, many researchers from different backgrounds have developed a series of improved algorithms based on wavelet analysis, empirical mode decomposition, local mean decomposition, empirical mode decomposition, singular spectrum analysis, principal component analysis, etc. In previous studies, most interpolation approaches filled in missing data on the basis of temporal relationships without spatial relationships, such as the impact of adjacent station time series. An effective interpolation method not only helps to obtain a consistent and complete time series but also lays the foundation for noise model identification and parameter estimation. Through a comprehensive investigation, an improved GP was developed to fill in the missing data existing in GNSS position time series, with comparisons of the conventional GP and commonly used cubic spline methods.

In addition, the noise present in the GNSS position time series contains not only white noise but also color noise, such as flicker noise, which has been confirmed by previous publications55. The larger RMSE and MAE of the U component than those of the N and E components may be related to the strong noise level in the U component. For the real GNSS position time series analysis, the proportion of the variance of the first three PCs in the N and E components is greater than that in the U component, mainly because there are significant long-term trends for the N and E components, which account for nearly more than 90% of the total variance; however, the U component is dominated by the periodic signals, which account for a relatively smaller percentage of the total variance. In addition, the combined covariance function of CovNNone and CovNoise is used to construct the improved and conventional GP models. All the results showed that considering the temporal and spatial correlation factors in filling the missing data improved the interpolation accuracy. The noise characteristics of GNSS position time series may be biased by several impact factors56. Weakening the flicker noise in time series may be conducive to missing data interpolation57. The improved GP combined with denoising approaches may improve the interpolation performance of missing data in GNSS position time series, which will be further investigated in the future.

Conclusions

An improved GP method for filling missing data existing in GNSS position time series was developed by considering the impacts of adjacent stations, which were treated with equal weights of time factors. After 100 repeated simulation experiments, the results revealed that all the RMSEs and MAEs were smaller than those of the conventional GP and cubic spline and that the correlation coefficients were higher, indicating that the improved GP can fill in the missing data more accurately regardless of the percentage of missing data. Specifically, the mean relative improvements in the RMSE for the improved GP with respect to the conventional GP were 21.2%, 21.3% and 8.3% for the N, E and U components, respectively. To further demonstrate the advantages of the improved GP over the conventional GP and cubic spline, incomplete time series from eight GNSS stations were adopted and filled by the three methods. The first three PCs of the improved GP could capture 98.3%, 99.8% and 77.0% of the total variances for the N, E and U components, respectively, which were greater than those of the conventional GP and cubic spline. Overall, the findings from both real-time series analysis and simulation experiments highlight the necessity of considering adjacent station time series when filling in missing GNSS position time series data and validate the advantages of the improved GP over the conventional GP and cubic spline methods.