Abstract
Cross-city transfer learning aims to apply the knowledge and models learned from data-rich cities to data-poor cities in order to solve the cold-start problem. Existing methods directly transfer a model built on developed cities to underdeveloped cities without considering the similarity between them, which can cause a transfer mismatch and, in turn, degrade transfer performance. Meanwhile, existing transfer learning methods cannot effectively extract the time series features of the data, and therefore fail to achieve adaptive positive transfer across cities. To solve these problems, we propose a similarity-based cross-city transfer learning method named TransCSM, which embeds urban similarity into an adaptation transfer learning framework to achieve the desired data transfer. Specifically, we first construct an urban similarity model, which uses urban POI (Point of Interest) data to group cities with similar characteristics into the same cluster. Then, we build a feature extractor network that uses a convolutional neural network (CNN) and a Gated Recurrent Unit (GRU) to extract more representative features of time series data. Afterwards, we build an adaptation transfer learning framework to transfer data within the same city cluster, which ensures the reliability of cross-city transfer results. Finally, we evaluate the proposed method on multiple public POI datasets from the Baidu Map API, and extensive experimental results demonstrate that it achieves superior performance compared with state-of-the-art methods.
Introduction
Constructing more aesthetically pleasing and user-friendly smart cities has become a significant aspect of contemporary urban development1. In recent years, deep learning has led to significant progress in building smart cities, greatly improving the ability to capture nonlinear correlations in data2,3. With the widespread use of sensors, large-scale urban clusters generate a large amount of valuable data4. However, deploying and maintaining a large number of sensors in small and medium-sized cities entails high costs, which limits the capacity of these cities to collect sufficient usable data over prolonged periods, as shown in Fig. 1. As a result, these cities may lack the data needed to build smart cities5,6, which is known as the cold-start problem.
Fig. 1. Comparison of POI data quantity in different cities. Beijing, a developed city, differs significantly in the quantity of POI data from less developed cities such as Hefei and Taiyuan. Because of uneven urban development, large amounts of historical data may not yet have been collected in some underdeveloped cities.
Transfer learning provides an effective way to address data scarcity by applying models trained in data-rich developed cities to data-poor, less developed cities7,8. Such a learning framework allows the smart construction of underdeveloped cities to make leapfrog progress despite insufficient local data. In recent years, many transfer learning methods have been proposed to solve the data scarcity problem. For example, Wang et al.9 proposed a regional transfer learning method named RegionTrans for cross-city deep spatiotemporal prediction, which uses regional similarity regularization to transfer knowledge between cities. Guo et al.7 proposed a collaborative-filtering-based dual knowledge transfer framework named CityTransfer, which transfers chain store knowledge from semantically related domains to recommend chain store layouts in new cities. Huang et al.10 proposed a graph-neural-network-based traffic prediction method called TEEPEE, which uses traffic data from developed cities to predict traffic data in less developed cities. Wei et al.11 proposed a flexible multimodal transfer learning method named FLORAL, which learns semantically related dictionaries for multiple modalities from a source domain and transfers them to a target domain.
Traditional transfer learning approaches, such as domain adaptation, have been widely explored. However, recent advancements in meta-learning and zero-shot learning provide new perspectives on enhancing cross-city adaptation. Meta-learning, often referred to as “learning to learn,” enables rapid adaptation to new cities with minimal labeled data, while zero-shot learning leverages shared feature representations to infer knowledge about unseen cities. These methodologies have been successfully applied in various domains, such as state estimation in lithium-ion batteries using Unsupervised Domain Adaptation (UDA)12 and hypergraph-based semi-supervised feature selection13. However, the cross-city studies above conduct transfer learning directly between two cities and do not quantitatively analyze whether each pair of cities is suitable for data transfer. They also fail to extract the time series features needed for precise cross-city transfer, which inevitably results in inaccurate data transfer.
To address these challenges, we propose a similarity-based cross-city transfer learning method named TransCSM, which embeds urban similarity into an adaptation transfer learning framework to achieve the desired data transfer. TransCSM includes two parts: the City Similarity Module and the Adaptation Transfer Module. In the City Similarity Module, we first obtain a 19-dimensional POI data distribution map for each city, where each dimension represents one property of the city, such as shopping, and is segmented into multiple smaller regions containing several POIs to better characterize the city in that dimension. To find the optimal segmentation size, we introduce thermal maps to measure the similarity of different cities and select the optimal size from several candidate grid sizes. These operations yield more representative distribution features for each city, and we then employ the k-means algorithm to group highly similar cities into the same cluster. In the Adaptation Transfer Module, ConvGRU and Domain Adaptation are jointly introduced to achieve data transfer across cities. The ConvGRU model extracts deep time series features of both source and target cities within the same city cluster, combining one-dimensional convolutional neural networks (1DCNN) and GRU structures to make feature extraction more effective. In the Domain Adaptation part, we design a domain adversarial neural network to achieve cross-city adaptive transfer: the result of discriminating the data source (target city or source city) is used as feedback that drives the network to keep mining, at a deeper level, the feature distribution shared by the source and target cities, until the data source can no longer be distinguished. By integrating the City Similarity Module and the Adaptation Transfer Module into a unified framework, the performance of cross-city data transfer is significantly improved.
The contributions of this paper are as follows:
- We propose a novel similarity-based cross-city adaptation transfer learning method, which integrates city similarity, ConvGRU and domain adaptation into a unified framework to improve the performance of cross-city data transfer.
- To the best of our knowledge, this is the first work to embed city similarity into an adaptation transfer learning model, which significantly reduces the influence of negative transfer.
- We conduct extensive experiments on various real-world POI datasets from the Baidu Map API, and the results verify the effectiveness of the proposed method.
The rest of this paper is organized as follows. Section II reviews related work on city similarity frameworks and measurements and on urban transfer learning. Section III introduces our method in detail. Section IV presents comparative experimental results for our method and existing methods. Section V summarizes our conclusions and future work.
Related work
In this section, we briefly review two lines of work related to our method: city similarity measurement and urban transfer learning.
City similarity framework and measurements
Some researchers have proposed measures of data similarity and used them for similarity assessment and recommendation. For example, Wang et al.14 extracted a new class of similarity measures from work on fuzzy power sets and discussed the properties of the notion of approximate equality corresponding to these measures. Yan et al.15 discussed the spatial similarity and popularity of urban TV programs and applied them to advertising and video recommendation. Santini et al.16 proposed a fuzzy-logic-based similarity measure called FFC, several functions of which matched the results of human experiments, and used the FFC model to model similarity evaluation based on fuzzy judgments of attributes.
Existing methods for measuring the similarity of cities can be roughly grouped into two categories: those based on the attributes and dimensions of cities, and those based on city representations. In the first category, Mulligan et al.17 studied the constantly changing population size distribution of large cities and classified them into clusters based on the similarity of their population growth rates, ranking them from fastest to slowest growing. Crucitti et al.18 showed that spatial analysis can visualize and characterize urban structure, demonstrating that planned and self-organized cities clearly belong to two different universal categories. In the second category, Preotiuc-Pietro et al.19 proposed three vector-based city representation methods using venue data from 17 cities in the United States, totaling more than 10 million venues, and concluded that different assumptions in urban cognition may lead to different induced urban descriptions and classifications. McKenzie et al.20 used shared micro-mobility patterns to characterize cities, evaluating city similarity through the patterns of vehicle movement trajectories, including trams and bicycles.
The aforementioned methods focus only on data similarity and urban similarity models and ignore their contribution to urban data transfer, so they do not effectively improve the overall performance of transfer learning.
Urban transfer learning
Research on urban transfer learning models has progressed significantly and has been applied in many practical applications. Some researchers focus on achieving effective transfer through data similarity evaluation, data weighting and similar techniques. For example, Fawaz et al.21 proposed judging the similarity of time series based on DTW (Dynamic Time Warping)22; the two most similar sets of data were selected for training and then fine-tuned to achieve transfer. However, DTW operates only in the time dimension, which discards information such as inter-dimensional interactions. Jung et al.23 used the Pearson correlation coefficient to evaluate the correlation of data between cities, and then used a DNN to extract data features. However, the Pearson correlation coefficient fails when the data are non-linearly correlated. Zeng et al.24 used the TrAdaBoost method to weight the source and target city data, but the results rely heavily on the weight configuration, and inappropriate initialization can lead to very poor results. Moreover, Jung et al.23 and Zeng et al.24 did not make effective use of the time series features of the data.
Transfer learning has emerged as a pivotal technology for harnessing data from source domains to enhance the performance of target tasks25. Some researchers focus on learning the features common to both domains through models to achieve effective transfer. For example, Gautam26 trained an LSTM model using Covid-19 data from countries with sufficient data and applied it to countries with insufficient data. Although effective domain adaptation was not performed, the overall results suggest that model-based transfer learning across geographic regions is feasible. Xin et al.27 used an auto-encoder with a CNN to extract data features and trained the model jointly on source and target domain data, so that the model was trained with sufficient data and could also learn target domain features. However, inadequate target data weakens the model's feature learning. Hu et al.28 used a deep auto-encoder model in which the cities share the hidden layers while each city has its own output layer during training, thereby sharing model parameters. Wan et al.29 proposed a cost-sensitive transfer learning framework to address heterogeneous interpretation features between two datasets, which can represent cross-city data using unified features in the original interpretation feature space. Bappee et al.30 tried to achieve city self-adaptation by reducing the KL divergence between the features of two cities. Traditional domain adaptation techniques, such as UDA, seek to minimize feature distribution shifts between cities. However, recent research in meta-learning offers a promising alternative by enabling quick adaptation with limited target data. For instance, unsupervised domain adaptation has been applied to lithium-ion battery state estimation across different battery types12. This approach aligns well with cross-city adaptation, where data distributions differ but share underlying structures. Similarly, hypergraph Laplacian-based semi-supervised feature selection13 demonstrates the effectiveness of capturing complex relationships in sparse data, which is crucial for urban computing applications. The idea of deep domain adaptation is used in31,32,33,34. However, the above methods also do not effectively utilize the time series features of the data when performing feature extraction and need to be further improved.
Proposed method
Overall architecture
In this section, we describe the architecture of our proposed TransCSM, as shown in Fig. 2. TransCSM comprises two main components: the city similarity module and the adaptation transfer learning module. The first component consists of the POI Data Preprocessing block and the Cluster&Classify block. The POI Data Preprocessing block processes the different kinds of POI data of each city, and its output is fed to the next block. The Cluster&Classify block divides similar cities into the same subset according to their urban characteristics, and the classification results serve as a reference for selecting source and target cities for transfer learning in the next step. Finally, the adaptation transfer learning module, i.e., the DA-ConvGRU model, uses a convolutional neural network and domain adaptation to extract the distribution shared by the source and target cities and to predict the missing POI data of the target cities.
City similarity model
Urban POI data are the fundamental spatial big data for urban research. They contain location and attribute information for various urban facilities across 19 dimensions: catering, entertainment, infrastructure, shopping, banking and finance, medical care, life services, sports and fitness, automobiles, construction properties, companies and enterprises, education, hotels and guesthouses, institutional groups, place names and addresses, cultural venues, tourism services, transportation facilities, and commercial residences. We need to mine the characteristics of a city from these data. Therefore, we propose a city similarity model that includes two parts: POI Data Preprocessing and Cluster&Classify.
POI data preprocessing
Urban POI data serve as a fundamental spatial representation for urban research, encapsulating diverse urban functional characteristics. Our model leverages a 19-dimensional POI feature representation to measure city similarity. This choice is motivated by urban planning theories, where POI distributions reflect functional zoning, socio-economic activities, and urban morphology. However, to further validate its effectiveness, we compare our approach with two widely used city similarity measurement methods: graph-based representations and hierarchical clustering.
Graph-based methods capture complex spatial dependencies, but they require significant computational resources and graph structure tuning, making them less scalable for large urban datasets. Our 19-dimensional POI representation is computationally efficient and provides a direct interpretable representation of urban functions, making it easier to integrate into downstream tasks such as transfer learning. Hierarchical clustering provides an unsupervised approach to grouping cities, but it lacks a structured feature representation for downstream tasks. Our 19-dimensional POI representation enables direct vector-based similarity computations (e.g., cosine similarity), facilitating seamless integration into machine learning models.
To evaluate these approaches, we conduct experiments on a dataset of 60 cities and compare their similarity scores with real-world urban attributes (e.g., GDP, demographic distribution, land use). The results show that our 19-dimensional POI feature representation achieves a cosine similarity correlation of 0.82 with urban socio-economic indicators, outperforming hierarchical clustering (0.76) and graph-based methods (0.79). Graph-based methods require significantly more computation time due to graph construction and embedding training, and hierarchical clustering lacks a structured feature representation, making it less effective for predictive modeling. These findings demonstrate that our 19-dimensional POI representation captures city similarity with high interpretability and computational efficiency, making it a practical choice for large-scale urban studies.
As shown in Fig. 3, we first extract the center point of each city from Baidu Map, which uniquely determines an approximate 60 km \(\times\) 60 km rectangle for the city. Next, we grid each city rectangle using a 2 km \(\times\) 2 km cell as the basic unit. After the POIs are collected, we obtain for every city a set of 19 city POI data maps of size 30 \(\times\) 30, one for each of the 19 types of POI data.
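The gridding step above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each POI has already been projected to kilometre offsets from the city center and carries an integer category index from 0 to 18; the function name `build_poi_maps` and this coordinate convention are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

N_CATEGORIES = 19                  # POI dimensions listed in the text
SIDE_KM = 60.0                     # side of the city rectangle
CELL_KM = 2.0                      # side of one basic grid cell
N_CELLS = int(SIDE_KM / CELL_KM)   # 30 cells per side


def build_poi_maps(pois: Sequence[Tuple[float, float, int]]) -> List[List[List[int]]]:
    """Count POIs per (category, row, col).

    Each POI is (x_km, y_km, category), with x_km and y_km measured from the
    city center. This projected coordinate system is an assumption.
    """
    maps = [[[0] * N_CELLS for _ in range(N_CELLS)] for _ in range(N_CATEGORIES)]
    half = SIDE_KM / 2.0
    for x_km, y_km, category in pois:
        # Skip POIs that fall outside the 60 km x 60 km rectangle.
        if not (-half <= x_km < half and -half <= y_km < half):
            continue
        col = int((x_km + half) // CELL_KM)
        row = int((y_km + half) // CELL_KM)
        maps[category][row][col] += 1
    return maps
```

With raw latitude and longitude, a projection to planar kilometres would be needed first; that step is omitted here.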
To obtain detailed information on local data in the city, we need an extraction grid to extract POI information from the city POI data map. To determine the optimal size of the extraction grid, we extract POI information with six grid sizes and select the most suitable one based on the thermal maps. The six sizes fall into three groups. Fine-grained grids (2 \(\times\) 2, 3 \(\times\) 3) give higher spatial resolution and capture more local variations. Medium grids (5 \(\times\) 5, 6 \(\times\) 6) give a balanced trade-off between granularity and density. Coarse grids (10 \(\times\) 10, 15 \(\times\) 15) give lower resolution and capture only macro-scale city structures. The specific process is shown in Fig. 4. Taking the 10 \(\times\) 10 and 15 \(\times\) 15 extraction grids as examples, we randomly select a portion of the cities and perform feature extraction on their city POI data maps. The outputs are 19-dimensional feature rectangles of the cities, with each feature unit containing a certain number of POIs.
We draw POI thermal maps of 19 dimensions based on the number of interest points (different feature extraction grids correspond to different series of thermal maps). By comparing the thermal maps drawn based on different feature extraction grids, we can determine the optimal size of the feature extraction grid. We use the cosine similarity to measure the effectiveness of judging similarity in thermal maps.
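As a rough sketch of this comparison, the code below aggregates one 30 × 30 POI map with a square extraction grid and compares two cities by cosine similarity. Summing counts inside non-overlapping windows is an assumption about how the extraction grid is applied, and the helper names are hypothetical. All six candidate sizes (2, 3, 5, 6, 10, 15) divide 30, so no padding is needed.

```python
import math
from typing import List

Grid = List[List[float]]


def pool_counts(city_map: Grid, size: int) -> Grid:
    """Sum POI counts over non-overlapping size x size windows."""
    n = len(city_map)
    if n % size != 0:
        raise ValueError("extraction grid size must divide the map side")
    out = []
    for r in range(0, n, size):
        out_row = []
        for c in range(0, n, size):
            out_row.append(sum(city_map[i][j]
                               for i in range(r, r + size)
                               for j in range(c, c + size)))
        out.append(out_row)
    return out


def cosine_similarity(a: Grid, b: Grid) -> float:
    """Cosine similarity between two equally sized maps, flattened row by row."""
    u = [v for row in a for v in row]
    w = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(u, w))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in w))
    return dot / norm if norm > 0 else 0.0
```

Repeating the comparison for each candidate size and each POI dimension gives one similarity score per grid size, which can then be compared across sizes.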
Cluster and classify
After preprocessing the POI data, we obtain the data feature matrix for every city, which is the input in this block to classify cities using cluster algorithms. Firstly, we use the Elbow-SSE method (Eq. (1)) to determine how many types of cities should be classified (i.e., determine the value of k in the K-means algorithm). Notably, considering the wide variety of POI types, we need to determine the optimal value of k based on different POI types.
In Eq. (1), C is the set of clusters of high-similarity cities, and \(C_i\) is the \(i^{th}\) cluster. A sample point in cluster \(C_i\) is denoted by p, the centroid of cluster \(C_i\) is denoted by \(m_i\), and the clustering error over all samples is denoted by SSE, which represents the quality of the clustering. While k is smaller than the true number of clusters, increasing k markedly improves the degree of aggregation of each cluster, so SSE decreases sharply. Once k reaches the true number of clusters, further increases bring little additional aggregation, so the decrease in SSE slows and the curve flattens out. The value of k at this elbow is taken as the optimal value.
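Written with the symbols defined above, the clustering error is the usual within-cluster sum of squared errors of k-means. One form of Eq. (1) consistent with these definitions, assuming the Euclidean norm, is:

```latex
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{p \in C_i} \left\lVert p - m_i \right\rVert^{2}
```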
After determining k, we will cluster cities. The detailed calculation process is illustrated in Algorithm 1.
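The clustering step can be sketched with plain Lloyd-style k-means and an SSE curve for the elbow method. This is a generic illustration and not a transcription of Algorithm 1: the deterministic farthest-point initialization, the function names and the iteration cap are assumptions made so that the example is reproducible.

```python
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, iters: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels: Optional[List[int]] = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                      for p in points]
        if new_labels == labels:   # assignments unchanged: converged
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:            # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def sse(points: List[Vector], labels: List[int], centroids: List[List[float]]) -> float:
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


def elbow_curve(points: List[Vector], k_max: int) -> List[Tuple[int, float]]:
    """SSE for k = 1..k_max; the elbow of this curve suggests the value of k."""
    curve = []
    for k in range(1, k_max + 1):
        labels, centroids = kmeans(points, k)
        curve.append((k, sse(points, labels, centroids)))
    return curve
```

Here each point would be the flattened POI feature vector of one city; the curve is then inspected for the value of k after which SSE stops dropping sharply.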
Adaptation transfer learning module
The POI data of the source and target cities have different distributions, which can affect the adaptability and the effectiveness of transfer between cities. In fact, the most useful knowledge in the transfer process is the set of features that the two cities share, which guarantees positive transfer. Therefore, we aim to extract features that follow the same distribution in the source and target cities. Based on this, we propose the DA-ConvGRU model, which contains the Feature Extractor Network and the Adaptation Transfer Learning Framework. We define the input as \({{\{(x^i_S,y^i_S)\}}^{N_S}_{i=1}}\) and \({{\{(x^i_T)\}}^{N_T}_{i=1}}\), and the output as \({y^i_T}\) and \(d_i\in \{0,1\}\). \(Y_T\) represents a certain type of missing POI data that the target city wants to predict, such as education, and \(X_T\) represents the existing POI data of the target city. \(Y_S\) is the POI data of the source city. \(N_S\) and \(N_T\) represent the total amount of POI data of the source and target city respectively, and different categories of POIs may have different N. When d = 0, the feature data come from the source domain; otherwise they come from the target domain.
Feature extractor network
As shown in Fig. 2, we use a two-layer 1DCNN combined with a two-layer GRU to extract time series features. The first 1DCNN layer extracts low-dimensional features from the data, and the second layer further extracts the main features from the output of the first. After that, a max pooling layer retains the main features of the data while reducing the dimension. Dimension reduction lowers the computation and complexity, which is especially beneficial for long series. The time series features are then further extracted by the GRU, which exploits its long-term memory. We place the 1DCNN before the GRU to reduce the dimension of the GRU input for two reasons: the feature dimension at each GRU input step is fixed, which suits a 1DCNN well, and the translation invariance of the 1DCNN does not destroy the temporal features during feature extraction, so the GRU can work more efficiently on the result. Therefore, the feature extraction model effectively extracts and utilizes the time series features of the data.
During the forward pass, the feature extractor \(H_f\), with parameters \(\theta _f\), maps the data into a new feature space f, i.e., \(f=H_{f}\left( x; \theta _{f}\right)\), which is part of the input to the next component, the Adaptation Transfer Learning Framework.
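To make the data flow of the feature extractor concrete, the following toy sketch chains two 1D convolutions, max pooling and two stacked GRU layers in pure Python. It is only a scalar illustration under strong assumptions: one channel, a scalar hidden state, hand-supplied weights, and a common GRU gate convention. The layer sizes, kernels and parameter names are hypothetical, and a real model would be built with a deep learning framework.

```python
import math
from typing import Dict, List, Sequence


def conv1d(x: Sequence[float], kernel: Sequence[float], bias: float = 0.0) -> List[float]:
    """Valid (unpadded) 1D convolution in cross-correlation form, followed by ReLU."""
    k = len(kernel)
    return [max(0.0, bias + sum(kernel[j] * x[i + j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def max_pool1d(x: Sequence[float], size: int = 2) -> List[float]:
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def sigmoid(v: float) -> float:
    """Numerically stable logistic function."""
    return 0.5 * (1.0 + math.tanh(0.5 * v))


def gru_step(x: float, h: float, w: Dict[str, float]) -> float:
    """One GRU step with scalar input and scalar hidden state."""
    z = sigmoid(w["wz"] * x + w["uz"] * h + w["bz"])            # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h + w["br"])            # reset gate
    cand = math.tanh(w["wh"] * x + w["uh"] * (r * h) + w["bh"])  # candidate state
    return (1.0 - z) * h + z * cand


def extract_features(series: Sequence[float],
                     k1: Sequence[float], k2: Sequence[float],
                     gru1: Dict[str, float], gru2: Dict[str, float]) -> float:
    """Two conv layers -> max pool -> two stacked GRU layers; returns the last state."""
    feat = max_pool1d(conv1d(conv1d(series, k1), k2))
    h1 = h2 = 0.0
    for x in feat:
        h1 = gru_step(x, h1, gru1)   # first GRU layer reads the pooled features
        h2 = gru_step(h1, h2, gru2)  # second GRU layer reads the first layer's state
    return h2
```

The returned value plays the role of the feature f: the convolutions and pooling shorten the sequence before the recurrent layers process it step by step.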
Adaptation transfer learning framework
The Adaptation Transfer Learning Framework consists of two blocks: the Predictor \(H_{y}\) and the Domain Classifier \(H_{d}\). Their parameters are denoted by \(\theta _y\) and \(\theta _d\), respectively. After obtaining the feature space f from the previous step, the Predictor and the Domain Classifier map f to the predicted y and the domain label \(d_i\) through their mapping functions \(H_{y}\left( f ; \theta _{y}\right)\) and \(H_{d}\left( f ; \theta _{d}\right)\).
In Predictor, we denote the loss of the Predictor as \({\mathscr {L}}_{y}\). Our primary goal of training is to predict the y-value more accurately, i.e., to make the \({\mathscr {L}}_{y}\) much smaller.
Afterwards, we use the Domain Classifier to achieve adaptability and effective transfer between cities by mining the feature distribution shared by the source and target cities. The original function of the Domain Classifier is to distinguish the source of the data (source city or target city) by maximizing the differences between features of different domains. However, we want the feature distributions of the two mapped cities to be as similar as possible in order to obtain the best predictions. To achieve this goal, we use the domain classifier loss \({\mathscr {L}}_{d}\) as an auxiliary loss that estimates the degree of dissimilarity between domains. Thus, we combine \({\mathscr {L}}_{y}\) and \({\mathscr {L}}_{d}\) to form the final loss \({\mathscr {L}}\) shown in Eq. (2).
where \({\mathscr {L}}_{d}\) consists of the classification losses in the source and target domains. To reduce the influence of \({\mathscr {L}}_{d}\) on the optimization of the model parameters relative to \({\mathscr {L}}_{y}\), we define \(\alpha \in (0, 1)\). To minimize the loss \({\mathscr {L}}\), \({\mathscr {L}}_{y}\) should be smaller and \({\mathscr {L}}_{d}\) should be larger. The optimal values of the model parameters \({\hat{\theta }}_{f}, {\hat{\theta }}_{y}, {\hat{\theta }}_{d}\) can then be expressed as shown in Eq. (3) and (4).
The expressions of \({\mathscr {L}}_{y}^{i}\) and \({\mathscr {L}}_{d}^{i}\) are shown in Eq. (5) and (6):
where \({\mathscr {L}}_{y}^{i}\) is the pth power of the absolute difference between the real value and the predicted one, which corresponds to the Mean Absolute Error when p is 1 and to the Mean Squared Error when p is 2. \({\mathscr {L}}_{d}^{i}\) is the binary cross-entropy loss of the domain labels.
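For reference, the objective described in this subsection matches the standard domain-adversarial formulation. The following is one form of Eqs. (2)–(6) that is consistent with the definitions above. It is a sketch under assumptions: the averaging over \(N_S\) and \(N_T\), the symbol \(\hat{d}_i\) for the classifier output \(H_d(f_i;\theta_d)\), and the symbol \(\hat{y}_i\) for the predictor output \(H_y(f_i;\theta_y)\) are notational choices made here.

```latex
% Eq. (2): prediction loss minus the weighted domain classification loss
\mathscr{L}(\theta_f,\theta_y,\theta_d)
  = \frac{1}{N_S}\sum_{i=1}^{N_S}\mathscr{L}_y^{i}(\theta_f,\theta_y)
  - \alpha\left(
      \frac{1}{N_S}\sum_{i=1}^{N_S}\mathscr{L}_d^{i}(\theta_f,\theta_d)
    + \frac{1}{N_T}\sum_{i=1}^{N_T}\mathscr{L}_d^{i}(\theta_f,\theta_d)
    \right)

% Eqs. (3)-(4): saddle point of the adversarial objective
(\hat{\theta}_f,\hat{\theta}_y)
  = \arg\min_{\theta_f,\theta_y}\mathscr{L}(\theta_f,\theta_y,\hat{\theta}_d),
\qquad
\hat{\theta}_d
  = \arg\max_{\theta_d}\mathscr{L}(\hat{\theta}_f,\hat{\theta}_y,\theta_d)

% Eqs. (5)-(6): per-sample prediction and domain losses
\mathscr{L}_y^{i} = \left| y_i - \hat{y}_i \right|^{p},
\qquad
\mathscr{L}_d^{i} = -\left[ d_i \log \hat{d}_i + (1-d_i)\log\bigl(1-\hat{d}_i\bigr) \right]
```

The minus sign in front of \(\alpha\) reflects the statement that minimizing \({\mathscr {L}}\) requires a smaller \({\mathscr {L}}_{y}\) and a larger \({\mathscr {L}}_{d}\).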
In addition, the Gradient Reversal Layer (GRL) automatically inverts the gradient direction during backpropagation and acts as an identity transformation during forward propagation, as represented in Eq. (7).
where the parameter \(\lambda\) is not fixed, but changes dynamically. Its variation expression is shown in Eq. (8).
where p represents the relative progress of training, i.e., the ratio of the current iteration number to the total number of iterations. \(\lambda\) takes small values at early stages, allowing the feature extractor to focus on prediction accuracy, and gradually increases over the iterations, strengthening domain adaptation. This schedule prevents the gradient vanishing problem and stabilizes adversarial training. During backpropagation, the gradient of the domain classifier’s classification loss is automatically inverted before it is propagated back to the feature extractor’s parameters, thus achieving a GAN-like adversarial loss. We use ReLU as the activation function. The parameters are updated using stochastic gradient descent (SGD) as in Eqs. (9)–(11):
To improve stability, we dynamically adjust \(\mu\).
where \(\mu _0\) is the initial learning rate, \(\gamma\) is the decay factor, and \(t\) is the current epoch.
where \(\eta\) is a small factor controlling learning rate increase.
where \(\mu _{\text {min}}\) ensures the learning rate does not vanish. Integrating the adaptive \(\mu (t)\):
where \(\mu (t)\) follows:
The proposed ConvGRU + Domain Adaptation framework is designed to balance predictive performance and computational efficiency. Below, we discuss the computational characteristics of each component. ConvGRU is chosen over ConvLSTM because of its reduced parameter complexity. Specifically, ConvLSTM maintains separate memory and hidden states, leading to higher computational overhead. ConvGRU, by contrast, removes the memory cell mechanism, resulting in faster training and inference while preserving strong spatiotemporal modeling capabilities. The Adaptation Transfer Learning module introduces a domain classifier with gradient reversal. This additional component incurs moderate computational overhead but significantly enhances cross-city generalization, as demonstrated in our experiments. We mitigate the overhead by sharing the feature extraction layers between the label predictor and the domain classifier, ensuring efficient resource utilization.
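The gradient reversal layer and the growing \(\lambda\) described earlier in this section can be sketched in a few lines. The schedule below is the sigmoid-shaped ramp commonly used with domain-adversarial networks, with a steepness constant `gamma = 10`; both the functional form and the constant are assumptions, chosen because they match the described behavior of small early values that increase with training progress. The class and function names are hypothetical.

```python
import math
from typing import List, Sequence


def grl_lambda(progress: float, gamma: float = 10.0) -> float:
    """Assumed schedule: 0 at progress = 0, approaching 1 as progress -> 1."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0


class GradientReversal:
    """Identity in the forward pass; multiplies gradients by -lambda in the backward pass."""

    def __init__(self, lam: float) -> None:
        self.lam = lam

    def forward(self, features: Sequence[float]) -> List[float]:
        # Features pass through unchanged to the domain classifier.
        return list(features)

    def backward(self, grad_output: Sequence[float]) -> List[float]:
        # The sign flip makes the feature extractor ascend the domain loss
        # while the domain classifier itself still descends it.
        return [-self.lam * g for g in grad_output]
```

In use, `grl_lambda(current_iter / total_iters)` would be recomputed at each iteration and assigned to the layer sitting between the feature extractor and the domain classifier.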
Experiment
In this section, we conduct a comprehensive evaluation of our proposed approach on real-world datasets. Specifically, we compare the effectiveness of our approach with several state-of-the-art methods for the task of cold-start express delivery outlets recommendation. Furthermore, we assess the effectiveness of the City Similarity module and analyze its contribution to the overall performance of our model. Additionally, we investigate the sensitivity of our proposed approach to hyper-parameters, thereby providing insights into the optimal parameter settings.
Experimental settings
Datasets and configurations
The datasets used in the experiments consist of two parts: real-world POI datasets for different cities and real-world datasets from several express delivery enterprises. For the former, we use the publicly available POI data of the Baidu Maps API. We select 60 cities of different levels based on the 2020 City Commercial Charm Ranking and collect POI data in 19 dimensions for these cities, covering various aspects of urban life such as catering, shopping and finance, for a total of approximately 3 million POI records. These datasets are used by the City Similarity module to group highly similar cities into the same cluster. The latter come from express delivery enterprises that own many express delivery outlets in multiple cities in China. In our experiments, we obtain 5 datasets that contain over 600 express delivery outlets from 5 different cities in China for 2019, 2020, 2021 and 2022, including 3 first-tier cities (Beijing, Shanghai and Guangzhou) and 2 new first-tier cities (Changsha and Foshan). The TransCSM model and its baselines are evaluated on the POI data, comprising name and location, collected for each city. For express delivery outlet prediction, we use outlet POI data from the source and target cities for the first 2 years (2019 and 2020) and from the source city for the following 2 years (2021 and 2022) to predict the outlet POI data of the target city for 2021 and 2022, including the candidate location of each courier outlet in the target city. Eq. (19) represents the error between the predicted and true values.
All experiments are conducted on 4 TITAN RTX GPUs. All baselines and the proposed model are implemented with PyTorch 1.1.0 and CUDA 10.0. For express delivery outlet recommendation, our model is trained using the SGD optimizer with an initial learning rate of 0.05. The proposed models are trained for 50 epochs with a batch size of 32, and the domain discriminant loss factor \(\alpha\) in Eq. (2) is set to 0.5. Additionally, a time window size of 12 is used to ensure effective forecasting accuracy. The total training time of our TransCSM model in this experiment is 36 hours.
Evaluation metrics
Our model predicts the candidate locations for each express delivery outlet in the target city. Therefore, we calculated the Euclidean distance between the predicted express delivery outlet coordinates and the nearest real express delivery outlet coordinates, where \(Y_{i}\) and \({\hat{Y}}_{i}\) are ground truth and predicted express delivery outlet coordinates, i.e., Eq. (19) :
Mean Absolute Error (MAE) directly reflects the average magnitude of prediction errors, which makes it easy to interpret in real-world applications. Because MAE weights all errors equally, it is less sensitive to outliers and is a reliable metric when large deviations are not the primary concern, for example in urban demand forecasting or logistics network optimization, where consistent performance across all samples matters most. Root Mean Square Error (RMSE) penalizes larger errors more heavily because of the squaring, so it is sensitive to extreme deviations. RMSE therefore provides a more risk-aware evaluation, but a few large errors can influence it disproportionately. Given the characteristics of our prediction task, we adopt MAE as the primary metric and also report RMSE to capture the impact of large prediction errors, ensuring that the model performs well both on average and in extreme cases. The two metrics are defined in Eqs. (20) and (21), where m is the number of samples:
\[ \mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m} \left| Y_{i} - {\hat{Y}}_{i} \right| \tag{20} \]
\[ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} \left( Y_{i} - {\hat{Y}}_{i} \right)^{2}} \tag{21} \]
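The evaluation procedure described above can be sketched in a few lines. This is a minimal illustration, assuming outlet coordinates are already projected to a planar system so that Euclidean distance is meaningful (raw latitude and longitude would need a geodesic distance instead). The function names and the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def nearest_distance(pred: Point, real_outlets: Sequence[Point]) -> float:
    """Euclidean distance from a predicted location to the closest real outlet."""
    return min(math.dist(pred, real) for real in real_outlets)


def mae(errors: Sequence[float]) -> float:
    """Mean absolute error over per-sample errors."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors: Sequence[float]) -> float:
    """Root mean square error over per-sample errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Toy coordinates for illustration only, not data from the paper.
predicted = [(0.0, 0.0), (3.0, 4.0)]
real = [(0.0, 1.0), (3.0, 0.0)]
errors = [nearest_distance(p, real) for p in predicted]
print("MAE:", mae(errors), "RMSE:", rmse(errors))
```

Because RMSE squares each error before averaging, the single larger distance in the toy data pulls RMSE above MAE, which is the sensitivity to large deviations discussed above.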
Baseline models
We compare the proposed model against two categories of approaches: non-transfer learning and transfer learning methods. In the non-transfer learning category, we use T-GCN, ASTGCN, LSTM, and XGBoost as baselines. In the transfer learning category, we select DANN, RegionTrans, and MetaST as benchmark models. The list below also includes TL_DCRNN, CNN-LSTM, and a Transformer-based model.
- T-GCN35: A temporal graph convolutional network model combining a graph convolutional network (GCN) with a gated recurrent unit (GRU).
- ASTGCN36: A spatial-temporal graph convolutional network model based on an attention mechanism, which improves its ability to learn from graph-structured data.
- LSTM37: A special variant of the recurrent neural network with a “gate” structure, which determines whether data is updated or discarded through the logical control of gate units, effectively improving prediction accuracy.
- XGBoost38: A gradient-boosting algorithm that uses multiple trees to make predictions together and adds up the scores of all trees as the predicted value.
- DANN39: A representative approach for domain adaptation through adversarial learning, which consists of a feature extractor, a label predictor, and a domain classifier.
- RegionTrans9: A transferable model for effectively matching source cities with corresponding regions in target cities.
- MetaST40: A spatial-temporal network architecture that employs a meta-learning paradigm to facilitate adaptation to the target city and enable knowledge transfer.
- TL_DCRNN41: A transfer model based on the Diffusion Convolutional Recurrent Neural Network, which transfers knowledge between the source city and the target city. The resulting model can be used, for example, to predict traffic on unseen networks.
- CNN-LSTM42: A hybrid deep learning model that combines Convolutional Neural Networks (CNNs) for spatial feature extraction with Long Short-Term Memory (LSTM) networks for temporal modeling.
- Transformer-based Model43: A deep learning model leveraging self-attention mechanisms to capture long-range dependencies in sequential data.
Experimental results
As illustrated in Table 1, we cluster the 60 cities using four basic classification units, namely 15 * 15, 10 * 10, 6 * 6, and 3 * 3. The number of clusters K is set to 3.
From the 60 cities, we select 10 as samples and visualize their classification and feature extraction results as heat maps of urban features in Fig. 5, using basic classification units of 3 * 3, 6 * 6, 10 * 10, and 15 * 15. We compute the cosine similarity scores for different grid sizes and analyze their stability across multiple city pairs. The cosine similarity between city POI heatmaps is calculated as follows:
\[ \mathrm{sim}(A, B) = \frac{\sum_{i} H_A^i H_B^i}{\sqrt{\sum_{i} \left(H_A^i\right)^{2}}\,\sqrt{\sum_{i} \left(H_B^i\right)^{2}}} \]
where \(H_A^i\) and \(H_B^i\) represent the POI density of the \(i\)-th POI category in cities \(A\) and \(B\), respectively. To determine the optimal grid size, we evaluate cosine similarity scores across different grid resolutions. The results are summarized in Table 3.
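The cosine similarity between two POI density vectors can be sketched as follows. The function name and the toy density values are hypothetical, and returning 0.0 for an all-zero vector is a convention chosen for this sketch rather than something the paper specifies.

```python
import math
from typing import Sequence


def cosine_similarity(h_a: Sequence[float], h_b: Sequence[float]) -> float:
    """Cosine similarity between the POI density vectors of two cities.

    Each vector holds one density value per POI category, in the same order.
    """
    if len(h_a) != len(h_b):
        raise ValueError("density vectors must have the same length")
    dot = sum(a * b for a, b in zip(h_a, h_b))
    norm_a = math.sqrt(sum(a * a for a in h_a))
    norm_b = math.sqrt(sum(b * b for b in h_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an empty profile as dissimilar
    return dot / (norm_a * norm_b)


# Toy densities for three POI categories (illustrative values only).
city_a = [1.0, 2.0, 3.0]
city_b = [2.0, 4.0, 6.0]  # same profile as city_a at twice the scale
print(cosine_similarity(city_a, city_b))
```

Because the measure depends only on the direction of the vectors, two cities with the same category mix score as highly similar even when their absolute POI counts differ.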
A basic classification unit of 15 * 15 is too large to discern urban features effectively. A 15 * 15 grid blurs the features, which results in fewer categories in urban clustering and weakens the differentiation of urban similarity. At the other extreme, grids of 3 * 3 and 6 * 6 produce features that are too independent and scattered, which makes it difficult to extract complete city features and increases the number of categories in city clustering. The 6 * 6 unit is more moderate than 3 * 3, but the urban features it yields are still relatively complex. Both effects reduce the effectiveness of city clustering and, in turn, the overall efficiency of transfer learning. From this analysis, we select 10 \(\times\) 10 as the optimal grid size because it provides stable similarity measurements (low variance) while preserving urban functional details.
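The effect of the unit size on how concentrated or scattered the features look can be illustrated with a small binning sketch. It assumes that the basic classification unit denotes the side length of a square cell, expressed in the same (unstated) distance unit as the POI coordinates, and that coordinates are planar. Both points are assumptions, and `poi_cell_counts` is a hypothetical helper.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def poi_cell_counts(points: Sequence[Tuple[float, float]],
                    cell_size: float) -> Counter:
    """Count POIs per square cell of side `cell_size`.

    Cells are indexed by (column, row) relative to the coordinate origin.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    counts: Counter = Counter()
    for x, y in points:
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        counts[cell] += 1
    return counts


# Toy POI coordinates (illustrative only).
pois = [(1.0, 1.0), (2.0, 9.0), (12.0, 3.0), (29.0, 29.0)]
for size in (3.0, 10.0):
    cells = poi_cell_counts(pois, size)
    print(f"cell size {size}: {len(cells)} occupied cells")
```

With the smaller cell every toy POI lands in its own cell, while the larger cell merges nearby POIs, mirroring the trade-off between scattered and blurred features discussed above.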
Here we focus on the 10 * 10 classification results shown in Fig. 5d–f. The numbers of cities in class 0, class 1, and class 2 are 25, 11, and 24, respectively (see Table 2). We present experimental results for three cities: Chenzhou, Beijing, and Kunming, representing third-tier, first-tier, and second-tier cities, respectively. The darker the color in the heat map of urban features, the more concentrated the distribution of POI data in that area. The POI data of both Chenzhou and Kunming are concentrated in the central area, so our analysis indicates that both are centrally developed cities, albeit with different degrees of centralization: Chenzhou is less centralized than Kunming. The observed results are consistent with the actual development of the cities. In contrast, Beijing, one of the most developed cities in China, exhibits high levels of development in both the central and surrounding areas, as demonstrated by the heat map.
We found that the clustering results do not correlate with the tier classification of cities. Instead, they are primarily determined by POI distribution. City tier classification is primarily based on economic indicators, population size, and administrative importance, whereas our clustering methodology focuses on spatial and functional similarities derived from POI distributions. Therefore, two cities with similar urban functionalities (e.g., commercial and residential area distributions) may be grouped together, even if they belong to different tiers.
Model performance comparison
This experiment explores the effectiveness of transfer learning on six transfer tasks, namely Beijing to Guangzhou (2021), Beijing to Guangzhou (2022), Changsha to Foshan (2021), Changsha to Foshan (2022), Wuhan to Xian (2021), and Wuhan to Xian (2022). According to our city clustering results (Table 2), Beijing and Guangzhou belong to class 1, Changsha and Foshan belong to class 0, and Wuhan and Xian belong to class 2. Therefore, we conduct transfer learning between cities of the same class to demonstrate the effectiveness of our proposed transfer learning framework based on city similarity. The performance of the prediction model is evaluated using MAE as the major metric and RMSE as an auxiliary metric. The comparison results with various baselines are presented in Table 4, with the best results highlighted in bold. TransCSM outperforms all the compared baselines, achieving the lowest MAE and RMSE. Among the baselines, however, some transfer models are slightly inferior in both MAE and RMSE to the best non-transfer model, XGBoost.
Compared with the non-transfer models, TransCSM obtains additional information from source city data during training and reduces the impact of defective prior knowledge in the target city’s data. We also compare DANN with the other baselines and observe that DANN slightly outperforms them in MAE and RMSE, from which we conclude that domain adaptation is effective for data transfer between two cities. Since DANN and our model both adopt the domain adaptation strategy, this indicates that adaptation transfer in our framework is effective for learning domain-invariant feature representations for data transfer.
Among the compared methods, non-transfer models are overall inferior to the transfer models, except for XGBoost. XGBoost can learn the temporal characteristics of a single express delivery outlet from a single node and thereby obtain valuable predictions. However, long intervals between data may hinder XGBoost’s ability to accurately predict future express delivery outlets in target cities. Therefore, TransCSM outperforms XGBoost when it comes to learning the features of source cities and effectively leveraging the time information of target cities. The experimental results of CNN-LSTM and Transformer-based models are worse than those of TransCSM (w/o DA). Compared with CNN-LSTM and Transformer-based models, ConvGRU offers the following advantages:
- Efficient spatial-temporal encoding: the convolutional structure extracts local spatial correlations, while the GRU captures temporal dependencies with fewer parameters than LSTM.
- Reduced computational complexity: unlike LSTM, which maintains a separate cell state and three gates, GRU uses only two gates, which makes it computationally more efficient and suitable for large-scale urban datasets.
- Scalability and stability: Transformer-based models capture long-range dependencies well, but their self-attention has quadratic complexity in sequence length.
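The parameter saving of GRU over LSTM can be made concrete with a short count. The sketch below uses the textbook formulation with one bias vector per block (some frameworks store two, which changes the totals but not the 3:4 ratio). The input and hidden sizes are hypothetical and are not the configuration used in the paper.

```python
def recurrent_params(input_size: int, hidden_size: int, n_blocks: int) -> int:
    """Trainable parameters of one recurrent layer.

    Each block has an input weight matrix (hidden x input), a recurrent
    weight matrix (hidden x hidden), and one bias vector (hidden).
    """
    per_block = (hidden_size * input_size
                 + hidden_size * hidden_size
                 + hidden_size)
    return n_blocks * per_block


def gru_params(input_size: int, hidden_size: int) -> int:
    # GRU: reset gate, update gate, candidate state -> 3 blocks
    return recurrent_params(input_size, hidden_size, 3)


def lstm_params(input_size: int, hidden_size: int) -> int:
    # LSTM: input, forget, and output gates plus candidate cell state -> 4 blocks
    return recurrent_params(input_size, hidden_size, 4)


# Hypothetical sizes chosen only for illustration.
input_size, hidden_size = 19, 64
g = gru_params(input_size, hidden_size)
l = lstm_params(input_size, hidden_size)
print(f"GRU: {g} parameters, LSTM: {l} parameters, ratio {g / l:.2f}")
```

For any choice of sizes, the GRU layer in this formulation has three quarters of the parameters of the corresponding LSTM layer.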
TransCSM outperforms all three transfer baselines, namely MetaST, RegionTrans and DANN. Unlike these methods, we conduct prediction tasks for express delivery outlets within the same city cluster and use CNN and GRU to extract more representative features of time series data, which improves the reliability of cross-city data transfer and the predictive ability of the model. Furthermore, TransCSM shows superior prediction stability compared with the non-transfer models. Predicting express delivery outlets in a target city requires a significant amount of historical data to extract long-term data dependencies, yet such data are scarce for the target city and difficult to acquire. TransCSM addresses this by embedding urban similarity into an adaptation transfer learning framework to achieve the desired data transfer, a mechanism that the other baselines do not use.
Transfer Component Analysis (TCA) projects the source and target domain data into a shared latent space, minimizing domain discrepancy while preserving task-related information. The transferability score is computed as:
where \(\textbf{Z}_{\mathscr {S}}\) and \(\textbf{Z}_{\mathscr {T}}\) are the feature representations after TCA transformation. Higher TCA scores indicate better domain alignment and less negative transfer.
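As background for the alignment that TCA performs: TCA is usually formulated as reducing the maximum mean discrepancy (MMD) between the projected source and target features. The sketch below computes a linear-kernel MMD between two small feature sets. It is an illustrative stand-in and not the transferability score defined by the authors, and its direction is reversed: a smaller MMD means closer alignment, whereas a higher TCA score indicates better alignment. The function names and toy features are hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def linear_mmd2(z_s: Sequence[Sequence[float]],
                z_t: Sequence[Sequence[float]]) -> float:
    """Squared MMD with a linear kernel: squared distance between feature means."""
    mu_s = mean_vector(z_s)
    mu_t = mean_vector(z_t)
    return sum((a - b) ** 2 for a, b in zip(mu_s, mu_t))


# Toy projected features (illustrative only).
z_source = [[0.0, 0.0], [2.0, 2.0]]
z_near = [[0.5, 0.5], [1.5, 1.5]]   # same mean as the source features
z_far = [[1.0, 1.0], [3.0, 3.0]]    # mean shifted by one unit per dimension
print(linear_mmd2(z_source, z_near), linear_mmd2(z_source, z_far))
```

The linear kernel only compares feature means, so two domains with equal means but different spreads would look aligned; kernel-based MMD variants compare richer statistics.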
Based on the city clustering results in Table 2, we know that Beijing and Changsha belong to class 1 and class 0 respectively, while both Shanghai and Changsha belong to class 0. Beijing to Changsha suffers from the strongest negative transfer effect in Table 5, suggesting significant feature distribution mismatch. The results demonstrate that transfer learning between cities of the same type can effectively reduce negative transfer.
Ablation study
TransCSM comprises three components: City Similarity (CS) enables transfer between similar cities, ConvGRU extracts representative features of time series data, and Domain Adaptation (DA) uses a domain adversarial neural network to achieve cross-city adaptive transfer. Accordingly, we introduce three TransCSM variants.
- TransCSM (w/o CS): The transfer task is conducted between cities in dissimilar clusters.
- TransCSM (w/o ConvGRU): The ConvGRU module is removed from the adaptation transfer module.
- TransCSM (w/o DA): The domain adaptation module is removed from the adaptation transfer module.
The results of TransCSM (w/o ConvGRU) and TransCSM (w/o DA) are shown in Table 4. Without the DA or ConvGRU block, the performance of the model falls short of TransCSM. Nevertheless, TransCSM (w/o ConvGRU) and TransCSM (w/o DA) still perform better than the non-transfer models in all four transfer tasks. This improvement is attributable to the availability of data from the source city as learning samples for predicting express delivery outlets in the target city. TransCSM outperforms TransCSM (w/o ConvGRU), indicating that the time series information extracted by the ConvGRU module improves the prediction of express delivery outlets. TransCSM also outperforms TransCSM (w/o DA), indicating that the adversarial neural network helps achieve adaptive cross-city transfer.
Based on the city clustering results in Table 2, we designed TransCSM (w/o CS) for transfer between different types of cities, covering the following cases: class 1 to class 0, class 0 to class 1, class 0 to class 2, class 1 to class 2, class 2 to class 0 and class 2 to class 1. Table 6 shows the results of TransCSM (w/o CS) and TransCSM in 2021, and Table 7 shows the corresponding results in 2022. TransCSM outperforms TransCSM (w/o CS), indicating that data transfer between different city clusters decreases transfer performance. We employ domain adaptation techniques to align feature distributions between the source and target cities, enabling better generalization in cross-city knowledge transfer. Specifically, by integrating a domain classifier with gradient reversal, our model learns domain-invariant features, mitigating the negative effects of distribution shifts. Experimental results further confirm that domain adaptation significantly improves model transferability compared to direct transfer without adaptation, highlighting the necessity of incorporating domain adaptation mechanisms for effective urban knowledge transfer.
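The gradient reversal mentioned above can be written compactly. In the standard DANN formulation of Ganin and Lempitsky39, the layer acts as the identity in the forward pass and flips the sign of the gradient in the backward pass. The symbols below (\(R_\lambda\) for the layer, \(\theta_f\) for the feature extractor parameters, \(\mathcal{L}_y\) and \(\mathcal{L}_d\) for the prediction and domain classification losses, \(\mu\) for the learning rate) follow that formulation and are assumptions here; the notation of this paper, including how the factor \(\alpha\) enters Eq. (2), may differ.

```latex
% Forward pass: identity. Backward pass: gradient scaled by -\lambda.
R_\lambda(\mathbf{x}) = \mathbf{x},
\qquad
\frac{\mathrm{d} R_\lambda}{\mathrm{d} \mathbf{x}} = -\lambda \mathbf{I}

% Resulting update of the feature extractor parameters:
\theta_f \leftarrow \theta_f - \mu \left(
    \frac{\partial \mathcal{L}_y}{\partial \theta_f}
    - \lambda \frac{\partial \mathcal{L}_d}{\partial \theta_f}
\right)
```

The minus sign in front of the domain term means the feature extractor is updated to make the domain classifier's task harder, which is what pushes the learned features toward being domain-invariant.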
Transferability of source data quantity
Determining the optimal quantity of data from the source city required for effective transfer learning in the target city becomes crucial. Identifying this appropriate data volume could potentially reduce the model’s computational complexity and enhance its overall efficiency. In this study, we examine the impact of the source-target data quantity size ratio n on transfer performance.
As depicted in Fig. 6, it can be observed that both MAE and RMSE exhibit a decline prior to n = 10, with the rate of decrease diminishing subsequently. This suggests that when the “richness” of the source data reaches saturation, an additional influx of source data does not contribute to enhancing transferability. Furthermore, augmenting n necessitates a larger volume of training data. Nonetheless, as demonstrated in Fig. 6, the escalation of n does not correspond to a uniform linear enhancement in performance, thereby giving rise to superfluous computations.
Limitation and future works
Our current evaluation is based on the Baidu POI dataset, which, while comprehensive for Chinese cities, may limit the generalizability of our findings. Future work will incorporate additional datasets, including:
- Amap POI (China) for cross-validation within the same geographic region.
- OpenStreetMap (OSM) POI (Global) to assess model performance across different continents.
- Yelp POI (USA) to examine economic disparities in urban distributions.
A broader dataset will enable a more comprehensive evaluation of the model’s adaptability across different geographic and economic environments.
Our experiments focus solely on express delivery prediction, a task that is closely related to urban infrastructure and can be effectively analyzed using POI data. However, this single-task setting may limit the generalizability of our approach in real-world applications. Future work should explore additional urban computing tasks, such as traffic flow prediction, business district identification, and emergency service optimization, to further validate the adaptability of our method. While POI similarity is a key factor in cross-city knowledge transfer, our findings suggest that urban development models (e.g., centralized vs. decentralized city structures) may also impact the transferability of knowledge between cities. Future research should incorporate explicit urban morphology indicators, such as road network density, land use patterns, and population distribution, to better quantify these effects.
Conclusion
In this article, we propose a similarity-based cross-city transfer learning method named TransCSM, which embeds urban similarity into an adaptation transfer learning framework to achieve the desired data transfer. TransCSM first constructs an urban similarity model, which uses urban POI data to group cities with similar characteristics into the same cluster. A feature extractor network (ConvGRU) then uses CNN and GRU to extract more representative features of time series data. Finally, an adaptation transfer learning framework performs data transfer within the same city cluster, which ensures the reliability of cross-city data transfer results.
In assessing the efficacy of TransCSM, we undertake a city similarity clustering analysis utilizing the urban Points of Interest (POI) dataset, followed by site selection recommendation tasks employing real-world express delivery outlet datasets. A comparison with contemporary state-of-the-art benchmarks demonstrates the enhanced performance of TransCSM over extant methodologies. Moreover, we scrutinize the performance of distinct TransCSM variants to discern the influence of individual modules on the model’s transferability. Concomitantly, we explore the source-target data quantity size ratio in the transfer process with the objective of diminishing the computational workload.
In forthcoming research endeavors, we aim to investigate alternative transfer learning methodologies by examining aspects such as city similarity and feature extraction, with the ultimate objective of developing a comprehensive solution for transfer recommendation predicated on city similarity. We seek to foster collaboration with organizations possessing a substantial number of express delivery outlets, thereby acquiring sufficient data to substantiate the efficacy of our proposed framework.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
References
Zheng, Y., Capra, L., Wolfson, O. & Yang, H. Urban computing: concepts, methodologies, and applications. ACM Trans. Intel. Syst. Technol. 38 (2014).
Dai, S. & Meng, F. Addressing modern and practical challenges in machine learning: A survey of online federated and transfer learning. Appl. Intell. 53, 11045–11072 (2023).
Liu, Y. et al. A deep learning-based acute coronary syndrome-related disease classification method: A cohort study for network interpretability and transfer learning. Appl. Intell. 53, 25562–25580 (2023).
Zanella, A., Bui, N., Castellani, A., Vangelista, L. & Zorzi, M. Internet of things for smart cities. IEEE Internet of Things J. 22–32 (2014).
Wang, P., Lai, J., Huang, Z., Tan, Q. & Lin, T. Estimating traffic flow in large road networks based on multi-source traffic data. IEEE Trans. Intel. Transp. Syst. 1–12 (2020).
Panda, D. K. & Ray, S. Approaches and algorithms to mitigate cold start problems in recommender systems: a systematic literature review. J. Intell. Inf. Syst. 59, 341–366 (2022).
Guo, B., Li, J., Zheng, V. W., Wang, Z. & Yu, Z. Citytransfer: Transferring inter- and intra-city knowledge for chain store site recommendation based on multi-source urban data. Proc. ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (2018).
Huang, J., Wen, J., Chen, L. & Liu, H.-L. Transfer learning based evolutionary algorithm framework for multi-objective optimization problems. Appl. Intell. 53, 18085–18104 (2023).
Wang, L., Geng, X., Ma, X., Liu, F. & Yang, Q. Cross-city transfer learning for deep spatio-temporal prediction. Preprint at arXiv:1802.00386 (2018).
Huang, Y., Song, X., Zhang, S. & Yu, J. J. Transfer learning in traffic prediction with graph neural networks. In: 2021 IEEE International Intelligent Transportation Systems Conference, 3732–3737 (2021).
Wei, Y., Zheng, Y. & Yang, Q. Transfer knowledge between cities. In Acm Sigkdd International Conference on Knowledge Discovery and Data Mining, 1905–1914 (2016).
Badfar, M., Chinnam, R. B. & Yildirim, M. Enhancing lithium-ion battery state-of-charge estimation across battery types via unsupervised domain adaptation. PHM Soc. Eur. Conf. 8, 8–8 (2024).
Sheikhpour, R., Berahmand, K., Mohammadi, M. & Khosravi, H. Sparse feature selection using hypergraph laplacian-based semi-supervised discriminant analysis. Pattern Recogn. 157, 110882 (2025).
Wang, L., Guo, B. & Yang, Q. Smart city development with urban transfer learning. Computer 32–41 (2018).
Yan, H., Liu, J., Li, Y., Jin, D. & Chen, S. Spatial popularity and similarity of watching videos in a large city. In: 2016 IEEE Global Communications Conference, 1–6 (2016).
Santini, S. & Jain, R. Similarity measures. IEEE Transactions on pattern analysis and machine Intelligence 871–883 (1999).
Mulligan, G. F. & Crampton, J. P. Population growth in the world’s largest cities. Cities (2005).
Crucitti, P., Latora, V. & Porta, S. Centrality measures in spatial networks of urban streets. Phys. Rev. E (2006).
Preoţiuc-Pietro, D., Cranshaw, J. & Yano, T. Exploring venue-based city-to-city similarity measures. In: Proc. 2nd ACM SIGKDD International Workshop on Urban Computing, 16 (2013).
McKenzie, G. Shared micro-mobility patterns as measures of city similarity: Position paper. In: Proc. 1st ACM SIGSPATIAL International Workshop on Computing with Multifaceted Movement Data, 1–4 (2019).
Fawaz, H. I., Forestier, G., Weber, J., Idoumghar, L. & Muller, P. A. Transfer learning for time series classification. 2018 IEEE international conference on big data (Big Data) (2018).
Li, H. L., Liang, Y. & Wang, S. C. Review on dynamic time warping in time series data mining. Control and Decision 1345–1353 (2018).
Jung, S. M., Park, S., Jung, S. W. & Hwang, E. Monthly electric load forecasting using transfer learning for smart cities. Sustainability (2020).
Zeng, P., Sheng, C. & Jin, M. A learning framework based on weighted knowledge transfer for holiday load forecasting. J. Modern Power Syst. Clean Energy (2018).
Cao, L. & Song, Y. Robust transfer learning for high-dimensional quantile regression model with linear constraints. Appl. Intell. 1–12 (2024).
Gautam, Y. Transfer learning for covid-19 cases and deaths forecast using lstm network. ISA Transactions (2021).
Xin, B. & Peng, W. Prediction for chaotic time series-based ae-cnn and transfer learning. Complexity 2680480 (2020).
Hu, Q., Zhang, R. & Zhou, Y. Transfer learning for short-term wind speed prediction with deep neural networks. Renew. Energy 83–95 (2016).
Wan, J. & Zhu, S. Cross-city crash severity analysis with cost-sensitive transfer learning algorithm. Expert Syst. Appl. (2022).
Bappee, F. K., Soares, A., Petry, L. M. & Matwin, S. Examining the impact of cross-domain learning on crime prediction. J. Big Data (2021).
Yang, T., Yu, X., Ma, N., Zhao, Y. & Li, H. A novel domain adaptive deep recurrent network for multivariate time series prediction. Eng. Appl. Artif. Intelli. 104498 (2021).
Costa, P., Akcay, A., Zhang, Y. & Kaymak, U. Remaining useful lifetime prediction via deep domain adaptation. Reliab. Eng. Syst. Saf. (2020).
Wen, B. C., Xiao, M. Q., Wang, X. Q. & Chen, X. Data-driven remaining useful life prediction based on domain adaptation (PeerJ, Computer science, 2021).
Wilson, G., Doppa, J. R. & Cook, D. Multi-source deep domain adaptation with weak supervision for time-series sensor data (2020).
Zhao, L., Song, Y., Zhang, C., Liu, Y. & Li, H. T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Trans. Intell. Transp. Syst. 1–11 (2019).
Guo, S., Lin, Y., Feng, N., Song, C. & Wan, H. Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In: Proc. AAAI conference on artificial intelligence, 922–929 (2019).
Chung, J., Gulcehre, C., Cho, K. & Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. Preprint at arXiv:1412.3555 (2014).
Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. Proc. 22nd acm sigkdd international conference on knowledge discovery and data mining 785–794 (2016).
Ganin, Y. & Lempitsky, V. Unsupervised domain adaptation by backpropagation (2015).
Yao, H., Liu, Y., Wei, Y., Tang, X. & Li, Z. Learning from multiple cities: A meta-learning approach for spatial-temporal prediction. The World Wide Web Conference 2181–2191 (2019).
Mallick, T., Balaprakash, P., Rask, E. & Macfarlane, J. Transfer learning with graph neural networks for short-term highway traffic forecasting. In 2020 25th International Conference on Pattern Recognition (ICPR) (2021).
Shi, X. et al. Convolutional lstm network: A machine learning approach for precipitation nowcasting. arXiv: Computer Vision and Pattern Recognition (2015).
Vaswani, A. et al. Attention is all you need. Advances in neural information processing systems 30 (2017).
Acknowledgements
This paper is supported by Science and Technology Research and Development Program of China National Railway Group Corporation Limited (N2023S027) and Guizhou Provincial Basic Research Program (Natural Science) Qiankehe Foundation -ZK[2022] General 244.
Author information
Contributions
HW: Methodology, Writing, and Experimental Design. EC, CQ, GZ, ZZ, ZW, GL, PL: Programming, Experimental Implementation. XC: Investigation, Supervision, Validation. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Wang, H., Che, X., Chang, E. et al. Similarity based city data transfer framework in urban digitization. Sci Rep 15, 10776 (2025). https://doi.org/10.1038/s41598-025-94987-y