An improved Gaussian process for filling the missing data in GNSS position time series considering the influence of adjacent stations

Qiu, Xiaomeng; Wang, Fengwei; Zhang, Qiuxi; Tao, Guoqiang; Zhou, Shijian

doi:10.1038/s41598-024-70421-7

Download PDF

Article
Open access
Published: 20 August 2024

An improved Gaussian process for filling the missing data in GNSS position time series considering the influence of adjacent stations

Xiaomeng Qiu¹,
Fengwei Wang²,
Qiuxi Zhang¹,
Guoqiang Tao¹ &
…
Shijian Zhou³

Scientific Reports volume 14, Article number: 19268 (2024) Cite this article

1459 Accesses
1 Citations
Metrics details

Subjects

Abstract

Due to various unavoidable reasons or gross error elimination, missing data inevitably exist in global navigation satellite system (GNSS) position time series, which may result in many analysis methods not being applicable. Typically, interpolating the missing data is a crucial preprocessing step before analyzing the time series. The conventional methods for filling missing data do not consider the influence of adjacent stations. In this work, an improved Gaussian process (GP) approach is developed to fill the missing data of GNSS time series, in which the time series of adjacent stations are applied to construct impact factors, together with a comparison of the conventional GP and the commonly used cubic spline methods. For the simulation experiments, the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R) are adopted to evaluate the performance of the improved GP. The results show that the filled missing data of the improved GP are closer to the true values than those of the conventional GP and cubic spline methods, regardless of the missing percentages ranging from 5 to 30%, with an interval of 5%. Specifically, the mean relative RMSE and MAE improvements for the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% and 12.7%, 16.2% and 11.01% for the North (N), East (E) and Up (U) components, respectively. In the real experiment, eight GNSS stations are analyzed using improved GP, together with conventional GP and a cubic spline. The results indicate that the first three principal components (PCs) of the improved GP can perverse 98.3%, 99.8% and 77.0% of the total variance for the N, E and U components, respectively. This value is obviously higher than those of the conventional GP and cubic spline. Therefore, we can conclude that the improved GP can better fill in the missing data in GNSS position time series than the conventional GP and cubic spline because of the impacts of adjacent stations.

Employing a two stage process with resampling, decision rules, and fine acquisition to optimize the acquisition stage of GNSS receiver performance

Article Open access 28 September 2024

A sub-μs accuracy GPS alternative using electrical transmission grids as precision timing networks

Article Open access 15 April 2024

Imaging evolution of Cascadia slow-slip event using high-rate GPS

Article Open access 03 May 2022

Introduction

With the advancements in space geodetic technology and improvements in accuracy, many high-precision and high-spatiotemporal-resolution global navigation satellite system (GNSS) position time series have accumulated, which provide valuable support for studying the dynamics of different geophysical phenomena, such as tectonic plate movements¹, crustal deformation^2,3,4, postglacial rebound⁵ and sea-level changes^6,7.

To extract geophysical signals (i.e., long-term trends and periodic signals) from noisy time series, many methods have been adopted, which can be divided into three categories: (1) frequency spectrum methods, including wavelet analysis (WA)^8,9, least-squares wavelet analysis (LSWA)^10,11, empirical mode decomposition¹², local mean decomposition¹³, etc.; (2) empirical orthogonal function (EOF) methods, including principal component analysis (PCA)¹⁴, singular spectrum analysis (SSA)^15,16,17, multichannel SSA¹⁸, and independent component analysis (ICA)^19,20; and (3) machine learning methods such as support vector machines²¹. Different analysis methods have corresponding advantages for processing homogeneous geophysical time series; notably, all the methods mentioned above require that the time series be complete without missing data.

However, random or continuous data gaps normally exist in position time series due to receiver failure, outlier removal and other unknown reasons²². Therefore, making full use of the available information in existing time series to fill the missing data is a more reasonable and practical approach than discarding the missing data^23,24. To our knowledge, the missing data can be divided into three categories: (1) neighbor point interpolation methods, including linear interpolation¹⁴, cubic spline interpolation²⁵, cubic polynomial interpolation²⁶, etc. (2) The improved frequency spectrum and EOF methods, which are used to fill the missing time series data. Qiu et al.²⁷ proposed iteration empirical mode decomposition to fill the missing data in GNSS position time series and obtained significant interpolation effects. Xu²⁸ proposed an iterative empirical orthogonal function to recover gappy GPS coordinate time series, which outperformed ordinary least square estimation interpolation in the wavelet analysis of GPS time series, especially for wide gaps. Wang et al.²⁹ developed improved multichannel singular spectrum analysis to process GRACE monthly field solutions. (3) Machine learning methods for filling missing data, such as missForest^23,30, long short-term memory (LSTM)^31,32,33,34, and regularized expectation maximization (RegEM)³⁵.

Compared with the abovementioned methods, especially similar algorithms such as missForest and LSTM, some limitations exist when missing data are filled. For example, the missForest method is time-consuming and biased³⁶, and there is the flat-value problem of LSTM³⁷. The Gaussian process (GP) is a new machine learning method with rigorous theoretical and statistical learning foundations³⁸, which scientifically obtains prediction accuracy and probability at the same time^39,40,41. In addition, the GP model has already been established for various spatial and temporal problems⁴². Moreover, a GP is a stochastic process in which any finite subcollection of random variables has a multivariate Gaussian distribution⁴³. Conventional GP focuses only on the temporal relationship of time series to fill in missing data^44,45. The spatial correlation between different stations for individual components is significant. However, the correlation between the position series of different coordinate components per station is not significant⁴⁶. Considering the advantages of the GP and the impacts of adjacent station time series, an improved interpolation approach called the improved GP was developed in this study. The simulation and real data analysis experiments were carried out to test the performance of the improved GP for filling the missing data together with the conventional GP and cubic spline methods. Note that PCA was used in the real GNSS position time series analysis to evaluate the interpolation efficiency.

The remaining sections of the paper are organized as follows: Section "Methodology" presents the details of the improved GP. Section "Simulation experimental analysis" presents the simulation experiments used to test the interpolation performance of the improved GP. Section "Real GNSS position time series analysis" examines the effectiveness of the improved GP for interpolating the missing data in real GNSS position time series. A discussion and conclusions are presented in Sections "Discussion" and "Conclusions", respectively.

Methodology

Conventional Gaussian process

The GP model provides a method for modeling probability distributions over functions, which can be used in a Bayesian analytical framework. In particular, a GP is a stochastic process that can be fully specified by its mean function and covariance function. The mean function and covariance function are expressed as follows:

$$m\left({\varvec{x}}\right)=E\left[f\left({\varvec{x}}\right)\right]$$

(1)

$$k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)=E\left[(f\left({\varvec{x}}\right)-m\left({\varvec{x}}\right))(f({{\varvec{x}}}{\prime})-m\left({{\varvec{x}}}^{\boldsymbol{^{\prime}}}\right))\right],$$

(2)

where ${\varvec{x}}$ and ${{\varvec{x}}}{\prime}$ represent two different input samples. $m({\varvec{x}})$ is the mean function, and $k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)$ denotes the covariance function. Furthermore, the GP can be denoted as follows:

$$f\left({\varvec{x}}\right)\sim GP\left(m\left({\varvec{x}}\right), k\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)\right)$$

(3)

It is usually considered a simple zero-mean GP.

The covariance function is a key factor in determining the notion of similarity. There are many choices for covariance functions, which may be reasonable. Formally, functions that generate a nonnegative definite covariance matrix for any set of data must be specified. From the perspective of modeling data, the covariance function is specified to obtain similar predictions with nearby inputs. The commonly used covariance functions include the squared exponential (SE) covariance function, rational quadratic (RQ) covariance function, neural network (NN) covariance function and other functions^47,48.

The general forms of the SE covariance function, RQ covariance function and NN covariance function are expressed as follows:

$${k}_{SE}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}\text{exp}\left(-\frac{1}{2{l}^{2}}{\left({\varvec{x}}-{{\varvec{x}}}{\prime}\right)}^{2}\right)$$

(4)

$${k}_{RQ}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}{\left(1+\frac{{\left({\varvec{x}}-{{\varvec{x}}}{\prime}\right)}^{2}}{2\alpha {l}^{2}}\right)}^{-\alpha }$$

(5)

$${k}_{NN}\left({\varvec{x}},{{\varvec{x}}}{\prime}\right)={\sigma }_{f}^{2}\text{arcsin}\left(\frac{{\varvec{x}}\cdot {{\varvec{x}}}{\prime}}{l\sqrt{({l}^{2}+{{\varvec{x}}}^{2}({l}^{2}+{{{\varvec{x}}}{\prime}}^{2}))}}\right),$$

(6)

where ${\sigma }_{f}^{2}$ and $l$ are hyperparameters of covariance functions.

The given training samples are $D=\left\{\left({{\varvec{x}}}_{i},{y}_{i}\right)|i=1,\cdots ,n\right\}=\{X,{\varvec{y}}\}$, where ${{\varvec{x}}}_{i}$ represents the input variables with dimension d and X is the input matrix of all the input variables. ${y}_{i}$ denotes the output variable corresponding to each input variable, and ${\varvec{y}}$ is the output vector consisting of all the output variables. In the GP regression model,

$$y=f\left({\varvec{x}}\right)+\varepsilon ,$$

(7)

where x denotes the input vector and f is the function value. y represents the observation with additive noise, which is different from the function $f({\varvec{x}})$. $\varepsilon$ is Gaussian noise with an independent and identical distribution and obeys the distribution $\varepsilon \sim N(0,{\sigma }_{n}^{2})$. ${\sigma }_{n}^{2}$ expresses the variance of the distribution. The given covariance function is used to calculate the likelihood function:

$$p\left(y|X,\theta \right)=N\left(y|0,K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right),$$

(8)

where $\theta$ represents the hyperparameters of the model, $K(X,X)$ is the covariance matrix, and ${I}_{n}$ denotes the identity matrix. Joint distribution of the observed value $y$ and predicted ${y}^{*}$ is as follows:

$$\left[\begin{array}{c}y\\ {y}^{*}\end{array}\right]=\left(0,\left[\begin{array}{cc}K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}& K(X,{X}^{*})\\ K({X}^{*},X)& K({X}^{*},{X}^{*})\end{array}\right]\right),$$

(9)

where $X$ and ${X}^{*}$ represents training inputs and test inputs, respectively. The key predictive equations can be obtained, which is based on Bayesian theorem.

$${y}^{*}|X,y,{X}^{*}\sim N(\mu \left({y}^{*}\right),\text{cov}({y}^{*}))$$

(10)

$$\mu \left({y}^{*}\right)=E\left[{y}^{*}|X,y,{X}^{*}\right]=K({X}^{*},X){\left[K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right]}^{-1}y$$

(11)

$$\text{cov}\left({y}^{*}\right)=K\left({X}^{*},{X}^{*}\right)-K\left({X}^{*},X\right){\left[K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}\right]}^{-1}K\left(X,{X}^{*}\right)$$

(12)

where $\mu \left({y}^{*}\right)$ is the predicted mean and $\text{cov}\left({y}^{*}\right)$ denotes the predicted covariance.

Furthermore, the optimal values of the hyperparameters are generally determined by maximizing the log marginal likelihood. One of the particularly appealing properties of the GP model is that principled and practical approaches exist for learning the hyperparameters of the mean, covariance and likelihood functions. Good adaptation of such hyperparameters can be essential for obtaining high-quality predictions⁴⁹. The negative log marginal likelihood function $L({\varvec{\theta}})$ and its partial derivatives of hyperparameters are represented as follows:

$$L\left({\varvec{\theta}}\right)=\frac{1}{2}{{\varvec{y}}}^{\text{T}}{{\varvec{C}}}^{-1}{\varvec{y}}+\frac{1}{2}\text{log}\left|{\varvec{C}}\right|+\frac{n}{2}\text{log}2\uppi$$

(13)

$$\frac{\partial L({\varvec{\theta}})}{\partial {\theta }_{i}}=\frac{1}{2}\text{tr}\left({\varvec{\upalpha}}{\boldsymbol{\alpha }}^{\text{T}}-{{\varvec{C}}}^{-1}\right)\frac{\partial {\varvec{C}}}{\partial {\theta }_{i}},$$

(14)

where ${\varvec{C}}=K\left(X,X\right)+{\sigma }_{n}^{2}{I}_{n}$, $\boldsymbol{\alpha }={{\varvec{C}}}^{-1}{\varvec{y}}$.

Improved Gaussian process

Let X be an $n\times p$ GNSS position time series matrix that requires interpolation, that is,

$${\varvec{X}}=\left({{\varvec{X}}}_{1},{{\varvec{X}}}_{2},\cdots ,{{\varvec{X}}}_{p}\right)=\left[\begin{array}{ccc}{x}_{11}& {x}_{12}& \begin{array}{cc}\cdots & {x}_{1p}\end{array}\\ {x}_{21}& {x}_{22}& \begin{array}{cc}\cdots & {x}_{2p}\end{array}\\ \begin{array}{c}\vdots \\ {x}_{n1}\end{array}& \begin{array}{c}\vdots \\ {x}_{n2}\end{array}& \begin{array}{cc}\begin{array}{c}\vdots \\ \cdots \end{array}& \begin{array}{c}\vdots \\ {x}_{np}\end{array}\end{array}\end{array}\right],$$

(15)

where n is the total number of observation epochs of the time series and p denotes the number of GNSS stations. Missing data inevitably exist in real position time series; therefore, the position time series of every station ${X}_{s}(S\in \left\{\text{1,2},\cdots ,p\right\}$ is assumed to contain missing data. ${X}_{s}$ can be divided into two parts: available values and missing values. The specific interpolation steps of the improved GP are as follows:

(1) The conventional GP method is first used to interpolate the missing data of every GNSS station.

(2) The two nearest GNSS stations form an observation pair time series ${{\varvec{X}}}_{ij}=({{\varvec{X}}}_{i},{{\varvec{X}}}_{j})$. ${{\varvec{X}}}_{ij}$ can be represented in the following four parts, as shown in Fig. 1.

where ${{\varvec{X}}}_{im-gp}$ represents the missing data of ${{\varvec{X}}}_{i}$, which are filled by conventional GP; ${{\varvec{X}}}_{jm}$ represents the missing data of ${{\varvec{X}}}_{j}$; and ${{\varvec{X}}}_{ib}$ and ${{\varvec{X}}}_{jb}$ represent the available values of ${{\varvec{X}}}_{i}$ and ${{\varvec{X}}}_{j}$, respectively.

(3) The improved GP is used to interpolate the missing data of ${{\varvec{X}}}_{j}$. $\left\{\left({{\varvec{t}}}_{jb},{{\varvec{X}}}_{i-jb}\right),{{\varvec{X}}}_{jb}\right\}$ are training samples, and $\left\{\left({{\varvec{t}}}_{jm},{{\varvec{X}}}_{i-jm}\right)\right\}$ are the impact factors that are regarded as the test inputs, which are used to interpolate the missing data of ${{\varvec{X}}}_{j}$. ${{\varvec{t}}}_{jb}$ are the epochs of observed values of ${{\varvec{X}}}_{j}$, and ${{\varvec{t}}}_{jm}$ are the missing data epochs of ${{\varvec{X}}}_{j}$. ${{\varvec{X}}}_{i-jb}$ denotes the values of ${{\varvec{X}}}_{i}$ corresponding to the observed epochs of ${{\varvec{X}}}_{j}$. Similarly, ${{\varvec{X}}}_{i-jm}$ denotes the values of ${{\varvec{X}}}_{i}$ corresponding to the missing epochs of ${{\varvec{X}}}_{j}$.

(4) The same procedures are adopted to address the missing data of the remaining stations.

(5) Appropriate evaluation indices are used to evaluate the interpolation performances of the improved GP, conventional GP and cubic spline methods (Fig. 2).

Simulation experimental analysis

To test the performance of the improved GP, simulation experiments were carried out. Note that the key to the GP algorithm normally lies in the selection of kernel functions. Through experimental comparison, the CovNNone and CovNoise composite covariance functions were used in this study. For the simulation cases, the true values of the missing data were known. Thus, three evaluation indices, including the root mean square error (RMSE), mean absolute error (MAE) and correlation coefficient (R)⁵⁰, were used to evaluate the performances of the improved GP and other conventional methods for filling the missing data.

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({x}_{i}^{true}-{x}_{i}^{int}\right)}^{2}}$$

(16)

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{x}_{i}^{true}-{x}_{i}^{int}\right|$$

(17)

$$R=\frac{\sum_{i=1}^{n}({x}_{i}^{true}-{\overline{x} }^{true})({x}_{i}^{int}-{\overline{x} }^{int}) }{\sqrt{\sum_{i=1}^{n}{\left({x}_{i}^{true}-{\overline{x} }^{true}\right)}^{2}}\sqrt{\sum_{i=1}^{n}{\left({x}_{i}^{int}-{\overline{x} }^{int}\right)}^{2}}}$$

(18)

where ${x}_{i}^{true}$ and ${x}_{i}^{int}$ are the true and interpolated values of missing data at the i epoch, respectively. R is the correlation coefficient between the filled missing data and the corresponding true values. The smaller the RMSEs and MAEs are and the higher the correlation coefficients are, the better the interpolation ability.

The time series of the GNSS stations used in this study were downloaded from the National Earthquake Data Center (https://www.eqdsc.com/). Two GNSS stations, SCPZ and SCYY, located in Sichuan Province, were used to generate missing data in simulation experiments with complete time series from January 2016 to December 2017; the correlation coefficients between the SCPZ and SCYY were 0.9923, 0.9974 and 0.7742 for the North (N), East (E) and Up (U) components, respectively. The GNSS position time series of SCYY and SCPZ are presented in Fig. 3.

Considering the impact of adjacent station SCYY, the time series of the SCYY station were used during the procedure of filling in missing data and were compared with conventional GP and cubic splines. Here, we used the 15% missing percentage as an example to show the processing details of the improved GP, conventional GP and cubic spline methods. Then, the incomplete time series was generated by randomly removing 15% of the total epochs missing from the SCPZ station. Figure 4 presents the differences between the filled values and true values of the three methods in a simulation experiment with a 15% missing-data percentage. The RMSEs of the improved GP were 1.08 mm, 1.34 mm and 4.33 mm for the N, E and U components, respectively, which were smaller than the 1.17 mm, 1.52 mm and 5.06 mm RMSEs of the conventional GP for the three components, similar to those of the MAEs (Table 1). In addition, the correlation coefficients between the filled values and true values of the improved GP were 0.9961, 0.9981 and 0.8189 for the three components, which were all higher than those of the conventional GP and cubic spline.

Table 1 RMSEs, MAEs and correlation coefficients of the three interpolation methods for the N, E and U components in a simulation experiment with 15% missing data.

Full size table

To obtain relatively reliable statistical results for the 15% missing-data percentage, 100 simulation experiments were carried out. The RMSEs, MAEs and correlation coefficients of 100 simulations of the improved GP, conventional GP and the cubic spline are presented in Fig. 5. Obviously, almost all the RMSEs and MAEs were smaller than those of the conventional GP and cubic spline; however, with higher correlation coefficients, the improved GP performed better at filling the missing data in the GNSS time series than the conventional GP and commonly used cubic spline methods. The mean RMSEs, MAEs and correlation coefficients of the improved GP for 100 simulations were 1.16 mm, 1.36 mm and 4.01 mm; 0.86 mm, 1.00 mm and 3.03 mm; and 0.9947, 0.9978 and 0.8325 for the N, E and U components, respectively, which all indicate that the improved GP can fill in the missing data more accurately and reliably than the other two methods. The specific statistical results are presented in Table 2.

Table 2 Mean RMSEs, MAEs and correlation coefficients of 100 simulation experiments for the three interpolation methods for the N, E and U components (15% missing-data percentage).

Full size table

To further analyze the interpolation ability of the three methods at different missing percentages, we randomly generated incomplete time series of the SCPZ station by removing the missing percentage from 5 to 30% with 5% increments. The mean RMSEs, MAEs and correlation coefficients of 100 simulations at different missing percentages are shown in Fig. 6. Figure 6 clearly shows that the RMSEs and MAEs of the improved GP are the smallest, followed by those of the conventional GP, and that of the cubic spline is the largest; however, the opposite is true for the correlation coefficients. In addition, the mean RMSEs and MAEs of the vertical component are significantly larger than those of the two horizontal components. Moreover, as the percentage of missing data increases, the mean RMSEs, MAEs and correlation coefficients slightly increase for the three components; in particular, the cubic spline method is easily affected by increasing the percentage of missing data.

The relative improvements of the improved GP with respect to the conventional GP and cubic spline are calculated as follows:

$${\text{IMP}}_{\text{RMSE}}=\frac{{\text{RMSE}}_{\text{QT}}-{\text{RMSE}}_{\text{IGP}}}{{\text{RMSE}}_{\text{QT}}}\times 100\%$$

(19)

$${\text{IMP}}_{\text{MAE}}=\frac{{\text{MAE}}_{\text{QT}}-{\text{MAE}}_{\text{IGP}}}{{\text{MAE}}_{\text{QT}}}\times 100\%$$

(20)

$${\text{IMP}}_{\text{R}}=\frac{{\text{R}}_{\text{IGP}}-{\text{R}}_{\text{QT}}}{{\text{R}}_{\text{QT}}}\times 100\%$$

(21)

where ${\text{RMSE}}_{\text{IGP}}$, ${\text{MAE}}_{\text{IGP}}$ and ${\text{R}}_{\text{IGP}}$ represent the RMSE, MAE and correlation coefficient (R) of the improved GP, respectively, and ${\text{RMSE}}_{\text{QT}}$, ${\text{MAE}}_{\text{QT}}$ and ${\text{R}}_{\text{QT}}$ denote the RMSE, MAE and correlation coefficient (R) of two other methods (i.e., conventional GP and cubic spline), respectively.

As shown in Fig. 7, the relative improvements in the RMSE and MAE for the improved GP are particularly significant with respect to those of the conventional GP and cubic spline methods regardless of the percentage of missing data, especially for the two horizontal components. The relative improvements for the U component are greater than those for the N and E components. For the RMSE, the mean relative improvements of the improved GP with respect to the conventional GP are 21.2%, 21.3% and 8.3% for the N, E and U components, respectively. Among the three interpolation methods, the cubic spline performs worse than the improved and conventional GP methods, especially for high percentages of missing data. Overall, it is reasonable to conclude that the improved GP outperforms conventional GP and commonly used cubic spline methods in filling the missing data existing in GNSS position time series.

Real GNSS position time series analysis

To further evaluate the performance of the improved GP for filling the missing data existing in real position time series, the time series of eight GNSS stations from 2021 to 2022 were analyzed; the locations of the eight GNSS stations are shown in Fig. 8. In addition, the percentages of missing data for eight stations are presented in Fig. 9. Figure 9 shows that the percentage of missing data varies from station to station, especially for the SCSN station, where the percentage of missing data is up to 51.5%, and the percentages of missing data for the other stations are all less than 20%. The average percentage of missing data for eight stations is 18.6%. In addition to missing data due to human activities or the inevitable factors of the environment, some outlier epochs also exist in the GNSS position time series; in this study, the interquartile range (IQR) criterion^51,52 was adopted to eliminate outliers.

Considering that the true values of missing data are unknown in the real GNSS position time series, it is impossible to evaluate the interpolation performance by using the evaluation indicators of the RMSE, MAE and correlation coefficient in the simulation experiment analysis. An important property of PCA⁵³ is the number of principal directions, and PCs explain the maximum variance in the data. PCA is used to extract the common mode error (CME), which assumes that the first few PCs with larger variance represent the CME of the observation network. To extract the CME accurately, the filled time series of the observation network should maintain the direction of maximum variance as much as possible. Accordingly, the variance of the filled time series is projected in all directions to test the performance of various methods. The equation for calculating the variance in each principal direction is as follows⁵⁴.

$${v}_{j}={{\varvec{w}}}_{j}^{\text{T}}{\varvec{S}}{{\varvec{w}}}_{j}$$

(22)

where ${{\varvec{w}}}_{j}$ is the j-th principal direction. ${\varvec{S}}$ denotes the covariance matrix, which is calculated as follows.

$${\varvec{S}}=\widehat{{\varvec{X}}}{\widehat{{\varvec{X}}}}^{\text{T}}/(nrows-1)$$

(23)

where $\widehat{{\varvec{X}}}$ is the interpolated observation matrix and $nrows$ represents the number of rows in the matrix $\widehat{{\varvec{X}}}$.

To compare the interpolation efficiency of the improved GP, conventional GP and cubic spline methods, we performed the PCA approach on the complete time series after the missing data of the eight stations were filled via the three interpolation methods. The variance percentages of the filled time series obtained by the first three PCs derived from different interpolation methods^16,54 were computed for the N, E and U components, which are presented in Table 3. Table 3 shows that the first three PCs of the improved GP can preserve 98.3%, 99.8% and 77.0% of the total variance in the three components, which are higher than the 97.9%, 99.7% and 76.2% of those of the conventional GP and the 96.6%, 99.6% and 72.3% of those of the cubic spline, respectively, which indicates that the improved GP performs better than the other two methods for filling the missing data in real GNSS position time series. In other words, its interpolation ability is the most prominent among the three interpolation methods.

Table 3 Variance proportions of the first three PCs for the three interpolation methods in the N, E and U directions (%).

Full size table

Discussion

Accurately filling in missing data is valuable before analyzing GNSS position time series via frequency spectrum analysis and empirical mode decomposition methods, which require that the time series be complete. To solve this difficult problem, many researchers from different backgrounds have developed a series of improved algorithms based on wavelet analysis, empirical mode decomposition, local mean decomposition, empirical mode decomposition, singular spectrum analysis, principal component analysis, etc. In previous studies, most interpolation approaches filled in missing data on the basis of temporal relationships without spatial relationships, such as the impact of adjacent station time series. An effective interpolation method not only helps to obtain a consistent and complete time series but also lays the foundation for noise model identification and parameter estimation. Through a comprehensive investigation, an improved GP was developed to fill in the missing data existing in GNSS position time series, with comparisons of the conventional GP and commonly used cubic spline methods.

In addition, the noise present in the GNSS position time series contains not only white noise but also color noise, such as flicker noise, which has been confirmed by previous publications⁵⁵. The larger RMSE and MAE of the U component than those of the N and E components may be related to the strong noise level in the U component. For the real GNSS position time series analysis, the proportion of the variance of the first three PCs in the N and E components is greater than that in the U component, mainly because there are significant long-term trends for the N and E components, which account for nearly more than 90% of the total variance; however, the U component is dominated by the periodic signals, which account for a relatively smaller percentage of the total variance. In addition, the combined covariance function of CovNNone and CovNoise is used to construct the improved and conventional GP models. All the results showed that considering the temporal and spatial correlation factors in filling the missing data improved the interpolation accuracy. The noise characteristics of GNSS position time series may be biased by several impact factors⁵⁶. Weakening the flicker noise in time series may be conducive to missing data interpolation⁵⁷. The improved GP combined with denoising approaches may improve the interpolation performance of missing data in GNSS position time series, which will be further investigated in the future.

Conclusions

An improved GP method for filling missing data existing in GNSS position time series was developed by considering the impacts of adjacent stations, which were treated with equal weights of time factors. After 100 repeated simulation experiments, the results revealed that all the RMSEs and MAEs were smaller than those of the conventional GP and cubic spline and that the correlation coefficients were higher, indicating that the improved GP can fill in the missing data more accurately regardless of the percentage of missing data. Specifically, the mean relative improvements in the RMSE for the improved GP with respect to the conventional GP were 21.2%, 21.3% and 8.3% for the N, E and U components, respectively. To further demonstrate the advantages of the improved GP over the conventional GP and cubic spline, incomplete time series from eight GNSS stations were adopted and filled by the three methods. The first three PCs of the improved GP could capture 98.3%, 99.8% and 77.0% of the total variances for the N, E and U components, respectively, which were greater than those of the conventional GP and cubic spline. Overall, the findings from both real-time series analysis and simulation experiments highlight the necessity of considering adjacent station time series when filling in missing GNSS position time series data and validate the advantages of the improved GP over the conventional GP and cubic spline methods.

Data availability

The datasets analyzed in this study are downloaded on this website: https://www.eqdsc.com/. The website is currently under maintenance and may not be accessible.

References

Turgut, U. et al. Monitoring the tectonic plate movements in Turkey based on the national continuous GNSS network. Arab. J. Geosci. 6, 3573–3580 (2013).
Article Google Scholar
Gu, G. & Wang, W. Advantages of GNSS in monitoring crustal deformation for detection of precursors to strong earthquakes. Positioning 4, 11–19 (2013).
Article Google Scholar
Richter, A. et al. Crustal deformation across the Southern Patagonian Icefield observed by GNSS. Earth Planet. Sci. Lett. 452, 206–215 (2016).
Article ADS CAS Google Scholar
Fang, J., He, M., Luan, W. & Jiao, J. Crustal vertical deformation of Amazon Basin derived from GPS and GRACE/GFO data over past two decades. Geodesy Geodyn. 12, 441–450 (2021).
Article Google Scholar
Pan, M. & Lars, E. S. Estimating present-day postglacial rebound and horizontal movements in Fennoscandia By Repeated GPS Campaigns in 1993 and 1997. Geophys. Res. Lett. 26, 771–774 (1999).
Article ADS Google Scholar
Montillet, J. P., Melbourne, T. I. & Szeliga, W. M. GPS vertical land motion corrections to sea-level rise estimates in the Pacific North-west. J. Geophys. Res.-Oceans 123, 1196–1212 (2018).
Article ADS Google Scholar
Alothman, A. O., Bos, M., Fernandes, R., Radwan, M. A. & Rashwan, M. Annual sea level variations in the Red Sea observed using GNSS. Geophys. J. Int. 221, 826–834 (2020).
Article ADS Google Scholar
Ji, K., Shen, Y. & Wang, F. Signal extraction from GNSS position time series using weighted wavelet analysis. Remote Sens. 12, 992 (2020).
Article ADS Google Scholar
Kaczmarek, A. & Kontny, B. Identification of the noise model in the time series of GNSS stations coordinates using wavelet analysis. Remote Sens. 10, 1611 (2018).
Article ADS Google Scholar
Ghaderpour, E., Ince, E. S. & Pagiatakis, S. Least-squares cross-wavelet analysis and its applications in geophysical time series. J. Geodesy 92, 1223–1236 (2018).
Article ADS Google Scholar
Ghaderpour, E. & Pagiatakis, S. LSWAVE: A MATLAB software for the least-squares wavelet and cross-wavelet analyses. GPS Solut. https://doi.org/10.1007/s10291-019-0841-3 (2019).
Article Google Scholar
Li, W. & Guo, J. Extraction of periodic signals in Global Navigation Satellite System (GNSS) vertical coordinate time series using the adaptive ensemble empirical mode decomposition method. Nonlinear Process. Geophys. 31, 99–113 (2024).
Article ADS CAS Google Scholar
Montillet, J. P., Tregoning, P., Mcclusky, S. & Yu, K. Extracting white noise statistics in GPS coordinate time series. IEEE Geosci. Remote Sens. Lett. 10, 563–567 (2013).
Article ADS Google Scholar
He, X. et al. Accuracy enhancement of GPS time series using principal component analysis and block spatial filtering. Adv. Space Res. 55, 1316–1327 (2015).
Article ADS Google Scholar
Li, C., Yang, P., Zhang, T. & Guo, J. Periodic signal extraction of GNSS height time series based on adaptive singular spectrum analysis. Geodesy Geodyn. 15, 50–60 (2024).
Article Google Scholar
Zhou, M., Guo, J., Shen, Y., Kong, Q. & Yuan, J. Extraction of common mode errors of GNSS coordinate time series based on multi-channel singular spectrum analysis. Chin. J. Geophys. 61, 4383–4395 (2018).
Google Scholar
Wang, F., Shen, Y., Li, W. & Chen, Q. Singular spectrum analysis for heterogeneous time series by taking its formal errors into account. Acta Geodynamica et Geomaterialia 15, 395–403 (2018).
Article Google Scholar
Wang, F., Shen, Y., Chen, T., Chen, Q. & Li, W. Improved multichannel singular spectrum analysis for post-processing GRACE monthly gravity field models. Geophys. J. Int. 223, 825–839 (2020).
Article ADS Google Scholar
Tan, W., Dong, D. & Chen, J. Application of independent component analysis to GPS position time series in Yunnan Province, southwest of China. Adv. Space Res. 69, 4111–4122 (2022).
Article ADS Google Scholar
Zhou, W., Ding, K., Liu, P., Lan, G. & Ming, Z. Spatiotemporal filtering for continuous GPS coordinate time series in mainland China by using independent component analysis. Remote Sens. 14, 2904 (2022).
Article ADS Google Scholar
Wang, H. & Liu, G. Automatic signal detection based on support vector machine. Acta Seismologica Sinica 20, 88–97 (2007).
Article ADS Google Scholar
Zhan, W., Huang, L., Liu, Z. & Meng, X. Effect of data defect on analyzing GNSS time series. J. Geodesy Geodyn. 33, 48–53 (2013).
Google Scholar
Zhang, S. et al. Imputation of GPS coordinate time series using MissForest. Remote Sens. 13, 2312 (2021).
Article ADS Google Scholar
Donders, A. R. T., Van Der Heijden, G. J., Stijnen, T. & Monns, K. G. Review: A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 59, 1087–1091 (2006).
Article PubMed Google Scholar
Dyer, S. A. & Dyer, J. S. Cubic-spline interpolation: Part 1. IEEE Instrum. Meas. Magaz. 4, 44–46 (2001).
Article Google Scholar
Mohammad, A. G., Marc, C., Rock, S. & Tsehaie, W. GPS interactive time series analysis software. GPS-Solutions 17, 595–603 (2013).
Article Google Scholar
Qiu, X., Wang, F., Zhou, Y. & Zhou, S. Iteration empirical mode decomposition method for filling the missing data of GNSS position time series. Acta Geodynamica et Geomaterialia 19, 271–279 (2022).
Article Google Scholar
Xu, C. Reconstruction of gappy GPS coordinate time series using empirical orthogonal functions. J. Geophys. Res. Solid Earth 121, 9020–9033 (2016).
Article Google Scholar
Wang, F., Shen, Y., Chen, Q. & Wang, W. Bridging the gap between grace and grace follow-on monthly gravity field solutions using improved multichannel singular spectrum analysis. J. Hydrol. 598, 126319 (2021).
Article Google Scholar
Thamelo, E. et al. A survey on missing data in machine learning. J. Big Data https://doi.org/10.1186/s40537-021-00516-9 (2021).
Article Google Scholar
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Computat. 9, 1735–1780 (1997).
Article CAS Google Scholar
Yin, L. et al. Reconstruction of Gappy GPS coordinate time series based on long short-term memory network. J. Geomat. Sci. Technol. 35, 331–336 (2018).
Google Scholar
Hosseini, S. et al. Assessment of the ground vibration during blasting in mining projects using different computational approaches. Sci. Rep. 13, 18582 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Khatti, J. & Grover, K. S. Assessment of fine-Grained soil compaction parameters using advanced soft computing techniques. Arab. J. Geosci. 16, 1–31 (2023).
Article Google Scholar
Schneider, T. E. Analysis of incomplete climate data: estimation of mean values and covariance matrices and imputation of missing values. J. Clim. 14, 853–871 (2001).
Article ADS Google Scholar
Jin, H., Jung, S. & Won, S. MissForest with feature selection using binary particle swarm optimization improves the imputation accuracy of continuous data. Genes Genom. 44, 651–658 (2022).
Article Google Scholar
Zhou, X., Li, W., Yang, Y. & Li, W. The TLSTM interpolation method and its application for long-term missing GNSS time series. J. Geomat. 48, 13–19 (2023).
Google Scholar
Rasmussen, C. E. & Williams, C. K. I. Gaussian Processes for Machine Learning (MIT Press, 2006).
Google Scholar
Williams C.K.I.; Rasmussen C.E. Gaussian processes for regression. Adv. Neural Inf. Process. Syst. 8 (1995).
Chen, Z., Li, D., Liu, J. & Gao, K. Application of Gaussian processes and transfer learning to prediction and analysis of polymer properties. Computat. Mater. Sci. 216, 111859 (2023).
Article CAS Google Scholar
Tsang, W. K. & Benoit, D. F. Gaussian processes for daily demand prediction in tourism planning. J. Forecast. 39, 551–568 (2019).
Article MathSciNet Google Scholar
Mackay, D. J. Introduction to Gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 168, 133–166 (1998).
Google Scholar
Do C.B & Lee H. Gaussian processes. Stanford University, Stanford, CA, Accessed Dec. 2007, 2 (2017).
Xu, K. et al. Reconstruction of geodetic time series with missing data and time-varying seasonal signals using Gaussian process for machine learning. GPS Solut. 28, 79 (2024).
Article Google Scholar
Khatti, J. & Grover, K. S. Prediction of UCS of fine-grained soil based on machine learning part 2: Comparison between hybrid relevance vector machine and Gaussian process regression. Multiscale Multidiscip. Model. Exp. Des. 7, 1–41 (2023).
Google Scholar
Amiri-Simkooei, A. R. Noise in multivariate GPS position time series. J. Geodesy 83, 175–187 (2009).
Article ADS Google Scholar
Khatti, J. & Grover, K. S. Computation of permeability of soil using artificial intelligence approaches. Int. J. Eng. Adv. Technol. 11, 257–266 (2021).
Article Google Scholar
Khatti, J. & Grover, K. S. Determination of the optimum performance AI model and methodology to predict the compaction parameters of soils. ICTACT J. Soft Comput. 12, 2640–2650 (2022).
Google Scholar
Rasmussen, C. E. & Nickisch, H. Gaussian processes for machine learning (GPML) toolbox. J. Mach. Learn. Res. 11, 3011–3015 (2010).
MathSciNet Google Scholar
Khatti, J. & Grover, K. S. Assessment of the uniaxial compressive strength of intact rocks: An extended comparison between machine and advanced machine learning models. Multiscale Multidiscip. Model. Exp. Des. https://doi.org/10.1007/s41939-024-00408-4 (2024).
Bao, Z., Chang, G., Zhang, L., Chen, G. & Zhang, S. Filling missing values of multi-station GNSS coordinate time series based on matrix completion. Measurement 183, 109862 (2021).
Article Google Scholar
Langbein, J. & Bock, Y. High-rate real-time GPS network at Parkfield: Utility for detecting fault slip and seismic displacements. Geophys. Res. Lett. https://doi.org/10.1029/2003GL019408 (2004).
Article Google Scholar
Li, W. & Shen, Y. The consideration of formal errors in spatiotemporal filtering using principal component analysis for regional GNSS position time series. Remote Sens. 10, 534 (2018).
Article ADS Google Scholar
Tripathi, S. & Govindaraju, R. S. Engaging uncertainty in hydrologic data sets using principal component analysis: BaNPCA algorithm. Water Resour. Res. https://doi.org/10.1029/2007WR006692 (2008).
Article Google Scholar
Williams, S. D. et al. Error analysis of continuous GPS position time series. J. Geophys. Res. Atmos. https://doi.org/10.1029/2003JB002741 (2004).
Article Google Scholar
Chen, G., Zhao, Q., Wei, N. & Liu, J. Impacts on noise analyses of GNSS position time series caused by seasonal signal, weight matrix, offset, and Helmert transformation parameters. Remote Sens. 10, 1584 (2018).
Article ADS Google Scholar
Yang, B., Yang, Z., Tian, Z. & Liang, P. Weakening the flicker noise in GPS vertical coordinate time series using hybrid approaches. Remote Sens. 15, 1716 (2023).
Article ADS Google Scholar

Download references

Acknowledgements

This work is funded by the Science and Technology Project of the Education Department of Jiangxi Province (GJJ2203708) and the Natural Science Foundation of China (42374017). We acknowledge the data products of the National Earthquake Data Center for providing the GNSS position time series (https://www.eqdsc.com/).

Author information

Authors and Affiliations

Gandong University, Fuzhou, People’s Republic of China
Xiaomeng Qiu, Qiuxi Zhang & Guoqiang Tao
State Key Laboratory of Marine Geology, Tongji University, Shanghai, People’s Republic of China
Fengwei Wang
School of Software, Nanchang Hangkong University, Nanchang, People’s Republic of China
Shijian Zhou

Authors

Xiaomeng Qiu
View author publications
Search author on:PubMed Google Scholar
Fengwei Wang
View author publications
Search author on:PubMed Google Scholar
Qiuxi Zhang
View author publications
Search author on:PubMed Google Scholar
Guoqiang Tao
View author publications
Search author on:PubMed Google Scholar
Shijian Zhou
View author publications
Search author on:PubMed Google Scholar

Contributions

F.W. proposed the key idea, supervised the research and reviewed the manuscript; X.Q. carried out the experiments and data analysis and wrote the manuscript; Q.Z., G.T. and S.Z. revised the manuscript. All authors have read and agreed to publish version of the manuscript.

Corresponding author

Correspondence to Fengwei Wang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Qiu, X., Wang, F., Zhang, Q. et al. An improved Gaussian process for filling the missing data in GNSS position time series considering the influence of adjacent stations. Sci Rep 14, 19268 (2024). https://doi.org/10.1038/s41598-024-70421-7

Download citation

Received: 24 January 2024
Accepted: 16 August 2024
Published: 20 August 2024
DOI: https://doi.org/10.1038/s41598-024-70421-7