CNSW 1.0: Prefectural Reconstruction of China’s Surface Water Resources Using Machine Learning Methods

Wang, Qichen; Zhao, Fubo; Wang, Xi; Shi, Wenbo; Shan, Yinuo; Li, Qiang; Liu, Dengfeng; Wu, Yiping

doi:10.1038/s41597-025-05389-8

Download PDF

Data Descriptor
Open access
Published: 19 June 2025

CNSW 1.0: Prefectural Reconstruction of China’s Surface Water Resources Using Machine Learning Methods

Qichen Wang ORCID: orcid.org/0009-0006-3356-7806¹,
Fubo Zhao¹,
Xi Wang¹,
Wenbo Shi¹,
Yinuo Shan¹,
Qiang Li²,
Dengfeng Liu³ &
…
Yiping Wu¹

Scientific Data volume 12, Article number: 1032 (2025) Cite this article

1003 Accesses
2 Altmetric
Metrics details

Subjects

Abstract

A comprehensive and long-term dataset of prefectural surface water resources is crucial for effective water resources management in China. However, there has been a significant gap in the availability of such datasets, with no existing datasets providing comprehensive long-term coverage. To address this gap, we have developed CNSW 1.0, the first long-term (2000–2020) dataset of prefectural surface water resources in China. Utilizing surface water resources data from official water resources bulletins, we employed 14 advanced machine learning models to reconstruct the CNSW 1.0 dataset. The resulting dataset exhibits high accuracy, with an R² of 0.98 for total surface water resources and acceptable level of bias across China. CNSW 1.0 not only outperforms existing datasets like CNRD v1.0, GRUN, and ISIMIP in terms of simulation accuracy and spatial distribution but also fills a critical gap in water resources data for China. This dataset is expected to be an invaluable tool for developing more informed water resources management strategies at the administrative level in China, particularly in the context of climate change.

The increasing water stress projected for China could shift the agriculture and manufacturing industry geographically

Article Open access 23 July 2024

Gainers and losers of surface and terrestrial water resources in China during 1989–2016

Article Open access 10 July 2020

A high-resolution multi-scale industrial water use dataset in China

Article Open access 05 December 2024

Background & Summary

In March 2023, the United Nations (UN) convened the second global Water Conference, the first such gathering in nearly 50 years, with the aim of accelerating progress on water-related Sustainable Development Goals (SDGs)¹. At this meeting, the UN Water Community emphasized the importance of prioritizing water in the context of climate change^2,3, as water is the primary medium through which humans experience the effects of climate change. Surface water, a critical source for human societies^4,5, constitutes a significant portion of the world’s fresh water resources^6,7. In some countries, particularly China, surface water resources management policies are typically formulated based on prefecture-level administrative units, including prefectures, leagues, and regions^8,9. However, significant gaps remain in the research on surface water resources in China, primarily due to the absence of a comprehensive dataset at the prefectural level. The lack of complete and usable surface water resources dataset has hindered the development of effective water resources management strategies in China.

Currently, several surface runoff datasets are available at global and national levels, including CNRD v1.0, GRUN, ISIMIP2a, and ISIMIP3a^10,11,12,13. These datasets are primarily generated through simulations that incorporate key geographical elements, such as meteorology and vegetation, at the grid scale, rather than directly observed surface water data^14,15,16,17. This approach introduces significant limitations, as the absence of observed surface water data as a benchmark makes it difficult to comprehensively assess the accuracy of these simulations. Additionally, substantial discrepancies in the simulated surface water quantities and spatial distributions among these datasets, further complicating their reliability¹⁸. Moreover, the relatively coarse spatial resolutions of these datasets (0.5° or 0.25°) restrict their ability to accurately calculate surface water resources at the prefectural level, these datasets are only lower than that of certain smaller prefectures, introducing significant uncertainty when comparing these datasets to measured data. As a result, these datasets may not fully capture surface water resources at the prefecture scale and require validation with observed data to be considered reliable for water resources management.

Although surface water resources data for most prefectures in China are available through national and provincial water resources bulletins, no comprehensive dataset encompassing all prefectures has been compiled. This gap largely stems from the substantial and intricate effort required to create a long-term, nationwide dataset. Additionally, administrative changes and redivisions in many prefectures since the early 2000s, combined with incomplete provincial water resources bulletins, have led to significant data gaps¹⁹. These challenges further complicate the development of a thorough surface water resources dataset. Consequently, there is an urgent need to develop a dataset based on measured data to comprehensively document China’s surface water resources.

To address the existing gap in surface water resources datasets, we compiled and integrated publicly available surface water data from national and provincial water resources bulletins at the prefecture level. Using 14 machine learning models, we predicted and filled in missing data (21.3%), resulting in the creation of a high-precision, long-term dataset spanning from 2000 to 2020, named the “China National Surface Water Resources Dataset (CNSW 1.0).” This dataset includes 16 separate datasets covering China’s surface water resources over 21 years, across 341 lower-level administrative units in mainland China, excluding Hong Kong and Macau. CNSW 1.0 serves as a benchmark for validating the accuracy of surface runoff products and enables exploration of the temporal and spatial dynamics of surface water resources in China. It supports the development and implementation of water resources management policies at the administrative level. Additionally, it aligns with the goals set by the World Water Congress and supports the achievement of SDGs by 2030, promoting a harmonious balance between human activities and nature and enhancing proactive risk management.

Method

CNSW 1.0 reconstruction framework

The reconstruction of CNSW 1.0 involves two main steps, as illustrated in Fig. 1. First, we collected surface water resources data from provincial water resources bulletins in China spanning from 2000 to 2020. This initial dataset contained 21.3% missing values (Supplementary Table 1) and serves as the foundation for CNSW 1.0, acting as the dependent variable for predicting missing data using 14 machine learning models. Second, we selected 21 key variables (listed in Supplementary Table 2) related to meteorology, soil type, vegetation, and topography as predictors. Meteorological factors serve as the direct drivers of surface water resources, while soil (contents of Slit, Sand, and Clay), topographic (DEM and Slope), and vegetation (NDVI, LAI, and GPP) factors represent surface characteristics that directly influence the partitioning of precipitation²⁰. Therefore, these factors are crucial for predicting the formation processes of surface water resources. Variable preprocessing methods for machine learning models are in Supplementary Note 2. The dataset was then divided into training and prediction subsets, with a 9:1 split for model training and testing. After training the models, we predicted the missing surface water data and combined these predictions with the measured dataset to form CNSW 1.0. The final dataset includes 16 components: 14 derived from machine learning predictions, one based on the median of predictions, and one based on the mean of predictions. To address the issue of administrative boundary adjustments, this study distinguished two scenarios and adopted different approaches accordingly. First, when entire prefectures were merged into new administrative units (e.g., Hancheng City in Shaanxi Province and Laiwu City in Shandong Province), data were aggregated directly by summation. Second, for cases involving the division of one prefectures into multiple cities (e.g., the 2003 administrative adjustment of Ningxia Autonomous Region and the division and merger of Chaohu City in Anhui Province), machine learning models were employed to reconstruct and predict data, ensuring data consistency and accuracy.

Machine learning models and performance evaluation

Machine learning regression models are widely used for predicting continuous variables^{21,22,23,24,25} and have been extensively applied in hydrology for tasks such as streamflow prediction^{26,27,28,29,30,31}, precipitation forecasting^{32,33,34,35,36,37}, water quality assessment^38,39,40, groundwater level change prediction^{41,42,43,44,45}, and evapotranspiration estimation^{23,46,47,48,49,50}. Key models include linear regressions (e.g., generalized linear models, elastic net regression)^30,34,42,45, decision tree regression²⁷, support vector machine regression^{26,27,31,38,45,46,47,50}, ensemble learning models (random forest, gradient boosting regression)^{27,32,34,38,45,47,50}, neural network models^29,43,45,50 and generalized additive models^51,52,53). These algorithms used historical data to identify patterns, enabling both the reconstruction of missing historical data and predictions of future data^{54,55,56,57,58,59,60,61}. In this study, 14 machine learning models (Table 1) were utilized to predict missing surface water resources data in China for the period of 2000–2020. Detailed explanations of the principles and parameter tuning methods for these machine learning models are provided in Supplementary Note 1. Model performance was evaluated using metrics such as root mean square error (RMSE), coefficient of determination (R²), percent bias (PBIAS), and normalized error (NE), with detailed calculation formulas provided in Eqs. 1–4.

$$RMSE=\sqrt{\frac{1}{{\rm{n}}}{\sum }_{i=1}^{n}{(\overline{{y}_{i}}-{y}_{i})}^{2}}$$

(1)

$${R}^{2}=\frac{{({\sum }_{i=1}^{n}({y}_{i}-\overline{{y}_{i}})({y}_{i}-\overline{y}))}^{2}}{{\sum }_{i=1}^{n}{({y}_{i}-\overline{{y}_{i}})}^{2}{({y}_{i}-\overline{y})}^{2}}$$

(2)

$$PBIAS=\frac{{\sum }_{i=1}^{n}|{y}_{i}-\overline{{y}_{i}}|}{{\sum }_{i=1}^{n}{y}_{i}}\times 100 \% $$

(3)

$$NE=\frac{|{y}_{i}-\overline{{y}_{i}}|}{\frac{1}{{\rm{n}}}{\sum }_{i=1}^{n}\overline{{y}_{i}}}$$

(4)

Where $\overline{{y}_{i}}$ is the Predicted value, y_i is the observed value, $\overline{y}$ is the mean of the observed values, and n is the sample size.

Table 1 Machine learning models used in this study.

Full size table

Quality evaluation and comparison of reconstructed CNSW 1.0

In addition to evaluating the performance of the machine learning models, it is crucial to assess the simulation accuracy of CNSW 1.0 across China and within specific administrative regions. The validation and comparison framework for CNSW 1.0 is illustrated in Supplementary Figure 2. We collected annual total surface water volume data for China and its provincial units from 2000 to 2020, and computed the total surface water volume for seven administrative regions: North China (NC), East China (EC), Central China (CC), South China (SC), Northeast China (NE), Northwest China (NW), and Southwest China (SW) (Supplementary Figure 1). This dataset, covering the period from 2003 to 2020(several provinces did not publicly disclose their annual surface water resources data from 2000–2002), was used to create the Chinese Surface Water Resources Simulation Accuracy Validation Dataset. Validation of CNSW 1.0 was performed using this dataset as a benchmark, with the coefficient of determination (R²) and percentage bais (PBIAS) employed as the metric. The R² values and percentage bais (PBIAS) for CNSW 1.0 were calculated for both national and regional scales, with the formula for R² and PBIAS provided in Eqs. (2, 3).

Furthermore, the CNSW 1.0 constructed in this study was compared with other 4 runoff datasets in terms of simulation accuracy, time series, and spatial distribution. The runoff datasets used for comparison include CNRD v1.0, GRUN, ISIMIP2a, and ISIMIP3a. ISIMIP2a includes averages of 18 model simulations derived from 6 models with 3 types of driving data, while ISIMIP3a includes averages of 14 runoff datasets. As the time scale of ISIMIP2a only extends to 2010, the comparison of the multi-year average spatial distribution characteristics of surface water resources used data from 2000 to 2010, covering 11 years for each dataset.

Data Records

The CNSW 1.0 encompasses surface water resources data for 341 prefectural-level administrative units in mainland China over a period from 2000 to 2020, comprising a total of 16 surface water datasets. The data are stored in both CSV and shapefile (SHP) formats, unit is mm. The naming convention for annual spatial distribution shapefiles follows the structure “CNSW_1.0_YYY_ZZZZ.shp,” where “YYY” represents the model’s name, and “ZZZZ” denotes the year. The naming convention for multi-year average spatial distribution shapefiles is “CNSW_1.0_mean_YYY.shp,” and for the spatial evolution trend shapefiles, it is “CNSW_1.0_trend_YYY.shp.” A total of 368 shapefiles are provided, which approximately take up 7.2GB of disk space, and it can be accessed through figshare⁶². Additional information, such as region, province, and prefecture area, is also stored in the files.

Technical Validation

Validation of machine learning models

Among these models, the RF model demonstrates the lowest RMSE at 53.87 mm, outperforming the BR, GBM, and SVR models, which have RMSE values of 54.23 mm, 77.59 mm, and 91.88 mm, respectively (Table 2). RF also achieves the highest R² of 0.98, followed closely by BR, GBM, and SVR with R² values of 0.98, 0.97, and 0.95. BR shows the lowest PBIAS at 6.79%, indicating the best performance in terms of bias, while RF, SVR, and GBM follow with PBIAS values of 6.85%, 11.27%, and 11.45%, respectively. Conversely, the PR and BLR models exhibit poorer performance, with PBIAS values of 26.56% and 24.39%, respectively. The DTR model performs the worst, with an RMSE of 178.08 mm, R² of 0.82, and PBIAS of 28.6% (Supplementary Figure 3). Overall, RF and BR models demonstrate the best performance in the training dataset, with GBM and SVR also performing well, while PR, BLR, and DTR show inferior results.

Table 2 Machine learning models used in this study.

Full size table

In evaluating the performance of the test datasets, SVR exhibits the best results, with an RMSE of 93.07 mm, an R² of 0.95, and a PBIAS of 14.87%. The RF model also performs commendably, with an RMSE of 98.77 mm, an R² of 0.94, and a PBIAS of 15.13%. The BR and GBM models also demonstrate strong performance, with BR achieving an RMSE of 102.82 mm, an R² of 0.93, and a PBIAS of 15.72%, while GBM shows an RMSE of 99.27 mm, an R² of 0.94, and a PBIAS of 15.74%. Density scatter plots (Supplementary Figure 4) illustrate that the predictions of SVR, RF, GBM, and BR closely align with the y = x line, indicating superior performance. Conversely, the DTR model continues to underperform, with an RMSE of 167.50 mm, an R² of 0.82, and a PBIAS of 29.08%. PR and BLR also show suboptimal performance.

In the training dataset (Fig. 2), the RF and BR models significantly outperform the other models. In the test dataset, SVR, RF, BR, and GBM models all derive superior results compared to the other models, with SVR slightly surpassing RF, BR, and GBM. Overall, SVR achieves the best performance in the test dataset, while RF, BR, and GBM also perform well. A comprehensive analysis of both training and test datasets highlights that BR, GBM, RF, and SVR exhibit the strongest overall performance, whereas BLR, PR, and DTR show relatively poor performance.

Time series

CNSW 1.0 derived 16 datasets at the prefecture-level. In addition to the 14 datasets derived from various machine learning models, this study also calculates the average and median of these model predictions, resulting in two additional datasets of prefecture-level surface water resources for China. We analyzed the temporal trends in surface water runoff depth (Fig. 3a), the multi-year average spatial distribution (Fig. 3b), and the spatial evolution trends (Fig. 3c) of surface water resources in China from 2000 to 2020 across all 16 datasets.

Temporal analysis (Fig. 3a) reveals that, except for the dataset simulated by the MLP model, surface water resources in China exhibited an upward trend from 2000 to 2020, though this trend is not statistically significant. Notably, years such as 2002, 2006, 2010, 2012, 2015, 2016, 2019, and 2020 experienced relatively high surface water resources, while years like 2004, 2007, 2009, and 2011 were characterized by lower levels. According to the China Water resources Bulletin, the total surface water resources in China increased at an average annual rate of 1.77 mm from 2000 to 2020. Among the simulations, GAM, GPR, KNN, RF, and SVR closely matched the observed values, whereas the MLP model exhibited a trend contrary to the actual surface water resources trend, indicating suboptimal performance.

Spatial distribution

Figure 3b illustrates the multi-year average spatial distribution of surface water resources in China as simulated by each model. While all models consistently identify the abundant water regions in SE, their results diverge in Northwest, Northeast, and Southwest. The areas with the least surface water resources are found in NC and NW, particularly within endorheic basins and non-monsoon regions. In contrast, SE displays the highest surface water availability which were largely attributed to its elevated precipitation levels.

The analysis reveals that regions experiencing a decline in surface water resources are concentrated in four key areas: the NW inland river basins, NC and EC, the Southwestern river basins, and the Southeastern coastal region (Fig. 3c). Conversely, the most substantial increases in surface water resources are observed in the border areas of Jiangxi, Anhui, Zhejiang, and Jiangsu provinces. Additionally, Qinghai Province, Sichuan Province, Guizhou Province, and the three northeastern provinces exhibit notable increases in surface water resources. The Ili Kazakh Autonomous Prefecture in Xinjiang also shows significant growth in surface water availability.

Quality evaluation and comparison

This study assesses the deviation of each dataset in CNSW 1.0 from the total surface water resources reported by the China Water Resources Bulletin using the formula for annual totals. A PBIAS value closer to zero indicates a smaller discrepancy between the dataset and the Bulletin’s reported (Tables 3, 4). Among the datasets, simulations by BLR, BR, ENR, GLM, LR, and SVR closely matched the observed totals in the early years. However, in recent years, these simulations generally show lower values than those observed. Specifically, BLR’s simulated totals were accurate up to 2012 but fell below observed values post-2013. BR’s simulations were higher than observed totals before 2007 but were lower in most subsequent years, except for 2015. ENR showed similar discrepancies. GLM results were higher than observed values before 2006 (except 2003) and slightly lower thereafter (except 2008 and 2012). LR exhibited discrepancies comparable to GLM. SVR data were slightly higher than observed values until 2015 (except 2003) but fell below observed totals from 2016 onward.

Table 3 PBIAS of CNSW 1.0 (1).

Full size table

Table 4 PBIAS of CNSW 1.0 (2).

Full size table

Across all datasets, simulations of total surface water resources in China by DTR, PR, RF, GAM, GBM, GPR, KNN, Median, and Average generally yielded lower values than the observed totals. Specifically, DTR consistently underestimated the total surface water resources, with an average discrepancy of approximately 9%. Simulated totals from PR and RF were also slightly lower than observed values. GAM’s simulations were below observed totals in all years except 2000 and 2002. GBM simulations were lower in all years except 2000–2003. GPR’s simulations were lower than observed values in all years except 2000. The discrepancies for KNN and Median were similar to those of GPR. The Average dataset’s simulations closely aligned with observed values before 2006 but showed a slight decrease afterward. Notably, only the MLP consistently overestimated total surface water resources across almost all years, with a significant overestimation of 38% in 2000. Additionally, while observed surface water resources increased from 2014 to 2015, MLP simulations indicated a decrease, and in 2018, MLP showed an increasing trend contrary to the observed decline.

In addition to assessing discrepancies in simulated total surface water resources across China, we also evaluated the accuracy of CNSW 1.0 at the provincial and regional scales using metrics of R² and PBIAS (Fig. 4 and Supplementary Figures 5, 6). Nationally, eight datasets—GAM, GBM, GPR, KNN, PR, RF, and the average and median values—demonstrated exceptional simulation accuracy (R² > 0.95), with RF achieving the highest accuracy (R² = 0.98). However, PBIAS analysis revealed varying stability. BR showed the lowest absolute national bias (−1.65%), indicating excellent stability, whereas RF, despite high accuracy, exhibited a negative bias (−8.14%). In contrast, MLP had both poor accuracy (R² = 0.55) and severe overestimation (PBIAS = 11.55%), highlighting significant deficiencies. At the regional scale, substantial spatial variability was observed. CC showed near-perfect accuracy (R² = 0.99) and negligible bias (PBIAS = 0.09%) due to complete data coverage. SC also displayed high accuracy (R² > 0.95) and minimal bias (PBIAS < 0.55%). EC had generally excellent accuracy (R² > 0.95), with RF (R² > 0.95, PBIAS = 0.79%) and SVR (PBIAS = 0.21%) performing optimally. WS exhibited accuracy discrepancies, with GAM, PR, and RF showing good performance (R² = 0.90–0.95), yet significant underestimations occurred (RF: PBIAS = −22.07%). WN faced greater challenges due to sparse data; RF achieved highest accuracy (R² = 0.88) and relatively lower bias (PBIAS = 5.51%), whereas MLP and GAM showed extreme deviations (PBIAS: 64.56%, 50.79%). In NC, all models had moderate accuracy (R² < 0.90), with severe MLP overestimation (PBIAS = 206.42%) contrasting with BLR’s best regional accuracy (R² = 0.880). NC showed RF as the top-performing model (R² = 0.837, PBIAS = 3.73%), whereas ENR and MLP obviously underperformed (R² < 0.40). Several provinces, including Beijing, Chongqing, and Shanghai, showed perfect accuracy (R² = 1.00, PBIAS = 0%). RF consistently demonstrated robust accuracy and bias control across diverse provinces: Ningxia (R² = 0.97, PBIAS = −0.64%), Anhui (R² = 0.91, PBIAS = −0.43%), and Yunnan (R² = 0.95). Conversely, severe datasets deficiencies were evident in Heilongjiang and Shanxi provinces, with MLP showing obvious overestimations (Heilongjiang: PBIAS = 367.27%, Shanxi: PBIAS = 145.80%) and negligible accuracy (R² ≈ 0).

Considering both R² and PBIAS, the RF model emerged as the most optimal choice across national, regional, and provincial scales. While the BR model exhibited the lowest absolute bias at the national level, its performance declined significantly in regional applications, particularly in East and Southwest China. In contrast, RF effectively balanced high predictive accuracy with controlled bias, consistently maintaining R² values above 0.80 and achieving notably low biases even in regions with complex hydrological conditions. This comprehensive evaluation highlights RF’s robustness and adaptability, making it the preferred model for reliable surface water resource simulations across diverse hydrological contexts.

Intermodel comparison

This study compares the CNSW 1.0 dataset with four other datasets—CNRD v1.0, GRUN, ISIMIP2a, and ISIMIP3a—across four dimensions: simulation accuracy, total surface water resources discrepancies in China, prefecture-level discrepancies, and spatial distribution from 2000 to 2010. The end years for the datasets are 2018 for CNRD v1.0, 2014 for GRUN, 2020 for ISIMIP2a, and 2019 for ISIMIP3a. For consistency, spatial distribution analysis is restricted to the 2000–2010 period, while other comparisons utilize the full available time span. ISIMIP2a and ISIMIP3a averages are based on 18 and 14 qualifying models, respectively.

The comparative analysis of CNSW 1.0 datasets (RF, Average, Median) against alternative runoff datasets (CNRD v1.0, GRUN, ISIMIP2a, and ISIMIP3a) reveals significant differences in both R² and PBIAS at national, regional, and provincial scales. At the national scale (Fig. 5a,b), ISIMIP2a shows the highest simulation accuracy (R² = 0.98), equal to CNSW 1.0’s RF model (R² = 0.98). However, ISIMIP2a demonstrates substantial bias (PBIAS = −10%) compared to CNSW 1.0 datasets, which maintain biases within 10%, highlighting their stability. Conversely, GRUN significantly underestimates water resources nationally (PBIAS = −31.7%), while CNRD v1.0 and ISIMIP3a display substantial overestimations exceeding 10%. Regionally, CNSW 1.0’s RF consistently exhibits superior overall performance. In CC, RF achieves near-perfect accuracy (R² ≈ 1.0, PBIAS ≈ 0%), in stark contrast to the substantial biases observed in CNRD v1.0 (44.37%) and GRUN (−56.1%). Similarly, in EC, RF maintains high precision (R² > 0.95, PBIAS < 1%), whereas CNRD v1.0 and GRUN exhibit considerable deviations. In NC, RF outperforms alternative datasets with superior accuracy (R² = 0.84, PBIAS = 3.73%), significantly reducing the overestimation seen in CNRD v1.0, ISIMIP2a, and ISIMIP3a. For EN, RF achieves comparable accuracy (R² ~0.88, PBIAS = 0.66%) to ISIMIP datasets (PBIAS = −0.24% and 13.64% for ISIMIP2a and ISIMIP3a), while GRUN markedly underestimates regional water resources (−57.52%). In Northwest China, comparative datasets systematically overestimate surface water resources (CNRD v1.0: 59.23%, GRUN: 47.44%, ISIMIP3a: 31.08%), whereas ISIMIP2a underestimates (−18.58%). In contrast, RF maintains robust performance with controlled bias (R² = 0.88, PBIAS = 5.51%). In humid South China, RF demonstrates exceptional accuracy (R² > 0.95, PBIAS ≈ −0.51%), distinctly outperforming comparative datasets such as GRUN and CNRD v1.0. The WS region presents challenges for all models, with RF exhibiting a systematic underestimation, yet remaining comparable to GRUN and performing better than ISIMIP2a. These findings underscore RF’s overall superiority in maintaining high accuracy and minimizing bias across diverse hydrological conditions.

At the provincial scale (Fig. 6a,b), CNSW 1.0 consistently outperforms comparative datasets, with RF and median estimates maintaining biases within 1% in Guangdong, Guangxi, Hainan, Guizhou, and Jiangxi. In contrast, alternative datasets exhibit extreme deviations: CNRD v1.0 overestimates water resources in Hebei (276.13%) and Ningxia (365.32%), while GRUN underestimates in Heilongjiang (−51.77%) and Jilin (−62.36%). ISIMIP2a shows severe overestimations in North China (Hebei: 248.90%, Shanxi: 182.93%) and underestimations in the Southwest (Sichuan: −28.61%, Xinjiang: −59.46%). ISIMIP3a amplifies these biases, particularly in Hebei (422.99%) and Shanxi (346.62%), though with improved performance in Guangdong (−7.82%). Despite localized challenges, such as RF biases in Tibet (−51.97%), CNSW 1.0 exhibits superior stability across diverse hydrological conditions.

Overall, CNSW 1.0’s datasets effectively integrate observed data with interpolative methods, achieving robust accuracy (R² is generally above 0.90) and superior bias control at all scales. While comparative datasets occasionally match CNSW 1.0’s accuracy in specific regions, they often exhibit varied regional biases due to their reliance on purely predictive methods. This analysis underscores CNSW 1.0’s distinct advantage in accurately simulating China’s prefectural surface water resources across diverse hydrological conditions.

Figure 7 depicts the discrepancies in prefecture-level simulations between CNSW 1.0 and the four runoff datasets. CNSW 1.0 operates at a spatial resolution of 0.25°, whereas the other datasets use a coarser resolution of 0.5°. GRUN’s simulations consistently yield lower values than CNSW 1.0, exhibiting the lowest R² value. In contrast, CNRD v1.0 forecasts higher runoff values compared to CNSW 1.0. ISIMIP2a tends to overestimate runoff in prefectures with less than 400 mm of surface water resources and underestimate in those with more than 400 mm. ISIMIP3a shows a similar pattern but with a threshold of 800 mm, overestimating in prefectures with less than 800 mm and underestimating in those with more. The density scatter plots illustrating discrepancies for all models in ISIMIP2a and ISIMIP3a are provided in Supplementary Figures 9, 10.

Figure 8 illustrates the time series trends of measured versus simulated total surface water resources in China. CNRD v1.0 consistently overestimates values, with significant discrepancies in 2007 and 2013. GRUN tends to underestimate, with notable inconsistencies in 2001, 2007, and 2014. ISIMIP3a also shows overestimation, though less pronounced than CNRD v1.0, with discrepancies in 2007 and 2014. ISIMIP2a aligns most closely with the measured data, except for 2002, although its time series extends only to 2010, shorter than CNSW 1.0 and the other datasets. Overall, CNSW 1.0 demonstrates superior performance in terms of simulation accuracy, time series alignment, and total volume compared to the four other runoff datasets.

Among these, CNRD v1.0 most closely aligns with CNSW 1.0, particularly in North China (Fig. 9). GRUN displays notably different spatial distribution patterns, with generally lower surface water levels across all regions. ISIMIP2a accurately reflects the surface water distribution in southern China but shows significant deviations from CNSW 1.0 in northern regions. ISIMIP3a fails to capture the spatial variability of surface water resources in southern China and exhibits poor performance in the north. The spatial distribution characteristics for all years in ISIMIP2a and ISIMIP3a are further detailed in Supplementary Figures 11, 12.

Usage Note

In this study, we developed a framework for reconstructing prefectural-scale surface water resources. By analyzing surface water data from 341 prefectures in China for the period 2000 to 2020 and integrating predictions from 14 machine learning regression models to address missing data, we created China’s first long-term, prefecture-level surface water resources dataset, named CNSW 1.0. This dataset, covering 2000 to 2020, was rigorously validated using multiple methods. Systematic precision validation and comparative evaluations demonstrate that CNSW 1.0 exceeds current mainstream natural runoff gridded datasets in simulation accuracy, temporal consistency, and spatial distribution, effectively filling the gap in China’s prefectural-level surface water data.

The reconstructed CNSW 1.0 dataset can facilitate further research in the following areas:

1)
Water use planning and management: This dataset can assist in the rational planning of agricultural, industrial, and domestic water use, thereby enhancing water use efficiency.
2)
Disaster prevention and mitigation: The dataset supports monitoring and analysis of surface water changes, enabling the prediction and management of natural disasters such as floods and droughts, thus aiding in the formulation of measures to minimize losses.
3)
Policy formulation: The surface water data provides a scientific basis for government and relevant agencies to design and implement effective water resources policies and regulations.
4)
Benchmark for hydrological models: CNSW 1.0 serves as benchmark data for hydrological model calibration, providing a valuable reference for improving model accuracy and reliability.

All data are stored in CSV and SHP formats, allowing users to read and process them using platforms such as R, Python, MATLAB, and ArcGIS. Users can select the most appropriate data for their research needs. Among all datasets, we highly recommend CNSW 1.0 (RF). For studies focusing on specific administrative regions, users can take the average of the top five models with the highest simulation accuracy for that region to obtain a more precise regional surface water resources dataset.

Code availability

The R codes for generating and processing the CNSW 1.0 can be accessed through figshare⁶². Climate variable data were obtained from remote sensing products of CRU (https://crudata.uea.ac.uk/cru/data/hrg/). Vegetation data were derived from the Moderate Resolution Imaging Spectroradiometer (MODIS) surface reflectance and vegetation index products, with a resolution of 0.05 degrees (https://modis.gsfc.nasa.gov). The soil data were collected from the Harmonized World Soil Database (HWSD) v. 1.2 (https://www.fao.org/soils-portal/data-hub/soil-maps-and-databases/harmonized-world-soil-database-v12/en/) and topography data were collected from the Global Multiresolution topography Elevation Data 2010 (GMTED2010) released by the U.S. Geological Survey and the National Geospatial-Intelligence Agency (https://www.usgs.gov/coastal-changes-and-impacts/gmted2010).

References

https://www.worldwatercongress.cn/. The XVIII World Water Congress, 2023.
Sjöstrand, K. Water for sustainable development. Nature Water 1, 568–572, https://doi.org/10.1038/s44221-023-00108-2 (2023).
Article Google Scholar
Rahman, M. et al. As the UN meets, make water central to climate action. Nature 615, 582–585, https://doi.org/10.1038/d41586-023-00793-9 (2023).
Article ADS CAS PubMed Google Scholar
Nguyen, A. D. et al. Attentional ensemble model for accurate discharge and water level prediction with training data enhancement. Engineering Applications of Artificial Intelligence 126, https://doi.org/10.1016/j.engappai.2023.107073 (2023).
Richards, L. A. et al. A systematic approach to understand hydrogeochemical dynamics in large river systems: Development and application to the River Ganges (Ganga) in India. Water Research 211, https://doi.org/10.1016/j.watres.2022.118054 (2022).
Vörösmarty, C. J. et al. Global threats to human water security and river biodiversity. Nature 467, 555–561, https://doi.org/10.1038/nature09440 (2010).
Article ADS CAS PubMed Google Scholar
Mekonnen, M. M. & Hoekstra, A. Y. Four billion people facing severe water scarcity. Science Advances 2, e1500323 https://doi.org/10.1126/sciadv.1500323
Zhou, F. et al. Deceleration of China’s human water use and its key drivers. Proceedings of the National Academy of Sciences of the United States of America 117, 7702–7711, https://doi.org/10.1073/pnas.1909902117 (2020).
Article ADS CAS PubMed PubMed Central Google Scholar
Zhang, Z. et al. City level water withdrawal and scarcity accounts of China. Scientific Data 11, 449 (2024).
Article CAS PubMed PubMed Central Google Scholar
Ghiggi, G., Humphrey, V., Seneviratne, S. I. & Gudmundsson, L. GRUN: an observation-based global gridded runoff dataset from 1902 to 2014. Earth System Science Data 11, 1655–1674, https://doi.org/10.5194/essd-11-1655-2019 (2019).
Article ADS Google Scholar
Gou, J. et al. Sensitivity Analysis-Based Automatic Parameter Calibration of the VIC Model for Streamflow Simulations Over China. Water Resources Research 56, https://doi.org/10.1029/2019WR025968 (2020).
Gou, J. et al. CNRD v1.0: A High-Quality Natural Runoff Dataset for Hydrological and Climate Studies in China. Bulletin of the American Meteorological Society 102, E929–E947, https://doi.org/10.1175/BAMS-D-20-0094.1 (2021).
Article Google Scholar
Miao, C. et al. High-quality reconstruction of China’s natural streamflow. Science Bulletin 67, 547–556, https://doi.org/10.1016/j.scib.2021.09.022 (2022).
Article ADS PubMed Google Scholar
Ayzel, G. Runoff for Russia (RFR v1.0): The Large-Sample Dataset of Simulated Runoff and Its Characteristics. Data 8, https://doi.org/10.3390/data8020031 (2023).
Li, Y. et al. Multi-model analysis of historical runoff changes in the Lancang-Mekong River Basin-Characteristics and uncertainties. Journal of Hydrology 619, https://doi.org/10.1016/j.jhydrol.2023.129297 (2023).
Wang, C. et al. Historical and projected future runoff over the Mekong River basin. Earth System Dynamics 15, 75–90, https://doi.org/10.5194/esd-15-75-2024 (2024).
Article ADS CAS Google Scholar
Wang, J. et al. Parameter regionalization of the FLEX-Global hydrological model. Science China-Earth Sciences 64, 571–588, https://doi.org/10.1007/s11430-020-9706-3 (2021).
Article ADS CAS Google Scholar
Pan, Z. et al. Comparing multi-source runoff data in different watersheds across China. Geographical Research 43, 1004–1017 (2024).
Google Scholar
Zhou, F. et al. Deceleration of China’s human water use and its key drivers. Proceedings of the National Academy of Sciences 117, 7702–7711, https://doi.org/10.1073/pnas.1909902117 (2020).
Article ADS CAS Google Scholar
Yan, J., Jia, S., Lv, A. & Zhu, W. Water Resources Assessment of China’s Transboundary River Basins Using a Machine Learning Approach. Water Resources Research 55, 632–655, https://doi.org/10.1029/2018wr023044 (2019).
Article ADS Google Scholar
Karim, F., Armin, M. A., Ahmedt-Aristizabal, D., Tychsen-Smith, L. & Petersson, L. A Review of Hydrodynamic and Machine Learning Approaches for Flood Inundation Modeling. Water 15, https://doi.org/10.3390/w15030566 (2023).
Kratzert, F. et al. Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences 23, 5089–5110, https://doi.org/10.5194/hess-23-5089-2019 (2019).
Article ADS Google Scholar
Pan, S. et al. Evaluation of global terrestrial evapotranspiration using state-of-the-art approaches in remote sensing, machine learning and land surface modeling. Hydrology and Earth System Sciences 24, 1485–1509, https://doi.org/10.5194/hess-24-1485-2020 (2020).
Article ADS Google Scholar
Sit, M. et al. A comprehensive review of deep learning applications in hydrology and water resources. Water Science and Technology 82, 2635–2670, https://doi.org/10.2166/wst.2020.369 (2020).
Article PubMed Google Scholar
Zounemat-Kermani, M., Batelaan, O., Fadaee, M. & Hinkelmann, R. Ensemble machine learning paradigms in hydrology: A review. Journal of Hydrology 598, https://doi.org/10.1016/j.jhydrol.2021.126266 (2021).
Adnan, R. M. et al. Least square support vector machine and multivariate adaptive regression splines for streamflow prediction in mountainous basin using hydro-meteorological data as inputs. Journal of Hydrology 586, https://doi.org/10.1016/j.jhydrol.2019.124371 (2020).
Cheng, M., Fang, F., Kinouchi, T., Navon, I. M. & Pain, C. C. Long lead-time daily and monthly streamflow forecasting using machine learning methods. Journal of Hydrology 590, https://doi.org/10.1016/j.jhydrol.2020.125376 (2020).
Cho, K. & Kim, Y. Improving streamflow prediction in the WRF-Hydro model with LSTM networks. Journal of Hydrology 605 https://doi.org/10.1016/j.jhydrol.2021.127297 (2022).
Kumar, V., Kedam, N., Sharma, K. V., Mehta, D. J. & Caloiero, T. Advanced Machine Learning Techniques to Improve Hydrological Prediction: A Comparative Analysis of Streamflow Prediction Models. Water 15, https://doi.org/10.3390/w15142572 (2023).
Niu, W.-j. & Feng, Z.-k. Evaluating the performances of several artificial intelligence methods in forecasting daily streamflow time series for sustainable water resources management. Sustainable Cities and Society 64, https://doi.org/10.1016/j.scs.2020.102562 (2021).
Shamshirband, S. et al. Predicting Standardized Streamflow index for hydrological drought using machine learning models. Engineering Applications of Computational Fluid Mechanics 14, 339–350, https://doi.org/10.1080/19942060.2020.1715844 (2020).
Article Google Scholar
Elbeltagi, A. et al. Prediction of meteorological drought and standardized precipitation index based on the random forest (RF), random tree (RT), and Gaussian process regression (GPR) models. Environmental Science and Pollution Research 30, 43183–43202, https://doi.org/10.1007/s11356-023-25221-3 (2023).
Article PubMed Google Scholar
Fan, J. et al. Comparison of Support Vector Machine and Extreme Gradient Boosting for predicting daily global solar radiation using temperature and precipitation in humid subtropical climates: A case study in China. Energy Conversion and Management 164, 102–111, https://doi.org/10.1016/j.enconman.2018.02.087 (2018).
Article ADS Google Scholar
Jose, D. M., Vincent, A. M. & Dwarakish, G. S. Improving multiple model ensemble predictions of daily precipitation and temperature through machine learning techniques. Scientific Reports 12, https://doi.org/10.1038/s41598-022-08786-w (2022).
Madakumbura, G. D., Thackeray, C. W., Norris, J., Goldenson, N. & Hall, A. Anthropogenic influence on extreme precipitation over global land areas seen in multiple observational datasets. Nature Communications 12, https://doi.org/10.1038/s41467-021-24262-x (2021).
Pirone, D., Cimorelli, L., Del Giudice, G. & Pianese, D. Short-term rainfall forecasting using cumulative precipitation fields from station data: a probabilistic machine learning approach. Journal of Hydrology 617, https://doi.org/10.1016/j.jhydrol.2022.128949 (2023).
Sachindra, D. A., Ahmed, K., Rashid, M. M., Shahid, S. & Perera, B. J. C. Statistical downscaling of precipitation using machine learning techniques. Atmospheric Research 212, 240–258, https://doi.org/10.1016/j.atmosres.2018.05.022 (2018).
Article ADS Google Scholar
Alizadeh, M. J. et al. Effect of river flow on the quality of estuarine and coastal waters using machine learning models. Engineering Applications of Computational Fluid Mechanics 12, 810–823, https://doi.org/10.1080/19942060.2018.1528480 (2018).
Article Google Scholar
Chen, K. et al. Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data. Water Research 171, https://doi.org/10.1016/j.watres.2019.115454 (2020).
Nasir, N. et al. Water quality classification using machine learning algorithms. Journal of Water Process Engineering 48, https://doi.org/10.1016/j.jwpe.2022.102920 (2022).
Adnan, R. M. et al. Modelling groundwater level fluctuations by ELM merged advanced metaheuristic algorithms using hydroclimatic data. Geocarto International 38, https://doi.org/10.1080/10106049.2022.2158951 (2023).
Haggerty, R., Sun, J., Yu, H. & Li, Y. Application of machine learning in groundwater quality modeling-A comprehensive review. Water Research 233, https://doi.org/10.1016/j.watres.2023.119745 (2023).
Panahi, M., Sadhasivam, N., Pourghasemi, H. R., Rezaie, F. & Lee, S. Spatial prediction of groundwater potential mapping based on convolutional neural network (CNN) and support vector regression (SVR). Journal of Hydrology 588, https://doi.org/10.1016/j.jhydrol.2020.125033 (2020).
Podgorski, J. & Berg, M. Global threat of arsenic in groundwater. Science 368, 845-+, https://doi.org/10.1126/science.aba1510 (2020).
Article ADS CAS PubMed Google Scholar
Quoc Bao, P. et al. Groundwater level prediction using machine learning algorithms in a drought-prone area. Neural Computing & Applications 34, 10751–10773, https://doi.org/10.1007/s00521-022-07009-7 (2022).
Article Google Scholar
Granata, F. Evapotranspiration evaluation models based on machine learning algorithms-A comparative study. Agricultural Water Management 217, 303–315, https://doi.org/10.1016/j.agwat.2019.03.015 (2019).
Article Google Scholar
Huang, G. et al. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. Journal of Hydrology 574, 1029–1041, https://doi.org/10.1016/j.jhydrol.2019.04.085 (2019).
Article ADS Google Scholar
Jiang, C. & Ryu, Y. Multi-scale evaluation of global gross primary productivity and evapotranspiration products derived from Breathing Earth System Simulator (BESS). Remote Sensing of Environment 186, 528–547, https://doi.org/10.1016/j.rse.2016.08.030 (2016).
Article ADS Google Scholar
Mostafa, R. R., Kisi, O., Adnan, R. M., Sadeghifar, T. & Kuriqi, A. Modeling Potential Evapotranspiration by Improved Machine Learning Methods Using Limited Climatic Data. Water 15, https://doi.org/10.3390/w15030486 (2023).
Xu, T. et al. Evaluating Diffferent Machine Learning Methods for Upscaling Evapotranspiration from Flux Towers to the Regional Scale. Journal of Geophysical Research-Atmospheres 123, 8674–8690, https://doi.org/10.1029/2018JD028447 (2018).
Article ADS Google Scholar
Goetz, J. N., Brenning, A., Petschko, H. & Leopold, P. Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Computers & Geosciences 81, 1–11, https://doi.org/10.1016/j.cageo.2015.04.007 (2015).
Article ADS Google Scholar
Singh, D. et al. Machine-learning- and deep-learning-based streamflow prediction in a hilly catchment for future scenarios using CMIP6 GCM data. Hydrology and Earth System Sciences 27, 1047–1075, https://doi.org/10.5194/hess-27-1047-2023 (2023).
Article ADS CAS Google Scholar
Xiao, Q. et al. Separating emission and meteorological contributions to long-term PM2.5 trends over eastern China during 2000-2018. Atmospheric Chemistry and Physics 21, 9475–9496, https://doi.org/10.5194/acp-21-9475-2021 (2021).
Article ADS CAS Google Scholar
Bian, L., Qin, X., Zhang, C., Guo, P. & Wu, H. Application, interpretability and prediction of machine learning method combined with LSTM and LightGBM-a case study for runoff simulation in an arid area. Journal of Hydrology 625, https://doi.org/10.1016/j.jhydrol.2023.130091 (2023).
Chen, C. et al. CRML: A Convolution Regression Model With Machine Learning for Hydrology Forecasting. IEEE Access 7, 133839–133849, https://doi.org/10.1109/ACCESS.2019.2941234 (2019).
Article Google Scholar
Du, S. et al. Control of climate and physiography on runoff response behavior through use of catchment classification and machine learning. Science of the Total Environment 899, https://doi.org/10.1016/j.scitotenv.2023.166422 (2023).
Gaertner, B. Geospatial patterns in runoff projections using random forest based forecasting of time-series data for the mid-Atlantic region of the United States. Science of the Total Environment 912, https://doi.org/10.1016/j.scitotenv.2023.169211 (2024).
Jung, J., Han, H., Kim, K. & Kim, H. S. Machine Learning-Based Small Hydropower Potential Prediction under Climate Change. Energies 14, https://doi.org/10.3390/en14123643 (2021).
Liu, S., Lei, P. & Koyamada, K. LSTM Based Hybrid Method for Basin Water Level Prediction by Using Precipitation Data. Journal of Advanced Simulation in Science and Engineering 8, 40–52, https://doi.org/10.15748/jasse.8.40 (2020).
Article Google Scholar
Sarzaeim, P., Bozorg-Haddad, O., Bozorgi, A. & Loaiciga, H. A. Runoff Projection under Climate Change Conditions with Data-Mining Methods. Journal of Irrigation and Drainage Engineering 143, https://doi.org/10.1061/(ASCE)IR.1943-4774.0001205 (2017).
Wi, S. & Steinschneider, S. Assessing the Physical Realism of Deep Learning Hydrologic Model Projections Under Climate Change. Water Resources Research 58, https://doi.org/10.1029/2022WR032123 (2022).
Wang, Q. et al. 1CNSW 1.0: Prefectural Reconstruction of China’s Surface Water Resources Using Machine Learning Methods. figshare https://doi.org/10.6084/m9.figshare.26952454 (2025).

Download references

Acknowledgements

This study was funded by the National Key Research and Development Program of China (2022YFF1302200), the National Science Foundation of China (42101029 and 31961143011), the China Postdoctoral Science Foundation (2020M683451 and 2022T150513), the Shaanxi Major Theoretical and Practical Program (20ST-106), the Innovation Team of Shaanxi Province (2021TD-52).

Author information

Authors and Affiliations

Institute of Global Environmental Change, Xi’an Jiaotong University, Xi’an, Shaanxi Province, 710049, China
Qichen Wang, Fubo Zhao, Xi Wang, Wenbo Shi, Yinuo Shan & Yiping Wu
College of Forestry, Northwest A&F University, Yangling, 712100, China
Qiang Li
State Key Laboratory of Eco-Hydraulics in Northwest Arid Region of China, School of Water Resources and Hydropower, Xi’an University of Technology, Xi’an, 710048, China
Dengfeng Liu

Authors

Qichen Wang
View author publications
Search author on:PubMed Google Scholar
Fubo Zhao
View author publications
Search author on:PubMed Google Scholar
Xi Wang
View author publications
Search author on:PubMed Google Scholar
Wenbo Shi
View author publications
Search author on:PubMed Google Scholar
Yinuo Shan
View author publications
Search author on:PubMed Google Scholar
Qiang Li
View author publications
Search author on:PubMed Google Scholar
Dengfeng Liu
View author publications
Search author on:PubMed Google Scholar
Yiping Wu
View author publications
Search author on:PubMed Google Scholar

Contributions

Fubo Zhao conceived the idea and designed the research. Qichen Wang carried out data analysis and prepared the original draft. Fubo Zhao, Xi Wang, Wenbo Shi, Yinuo Shan and Yiping Wu reviewed the manuscript.

Corresponding authors

Correspondence to Fubo Zhao or Yiping Wu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supporting information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Wang, Q., Zhao, F., Wang, X. et al. CNSW 1.0: Prefectural Reconstruction of China’s Surface Water Resources Using Machine Learning Methods. Sci Data 12, 1032 (2025). https://doi.org/10.1038/s41597-025-05389-8

Download citation

Received: 12 September 2024
Accepted: 11 June 2025
Published: 19 June 2025
DOI: https://doi.org/10.1038/s41597-025-05389-8