The 500-meter long-term winter wheat grain protein content dataset for China from multi-source data

Xu, Xiaobin; Zhou, Lili; Taylor, James; Casa, Raffaele; Fan, Chengzhi; Song, Xiaoyu; Yang, Guijun; Huang, Wenjiang; Li, Zhenhai

doi:10.1038/s41597-024-03866-0


Data Descriptor
Open access
Published: 19 September 2024


Xiaobin Xu ORCID: orcid.org/0000-0002-5210-2338¹,
Lili Zhou¹,
James Taylor²,
Raffaele Casa³,
Chengzhi Fan¹,
Xiaoyu Song⁴,
Guijun Yang⁴,
Wenjiang Huang⁵ &
…
Zhenhai Li¹

Scientific Data volume 11, Article number: 1025 (2024)


Abstract

In China, the need for precise wheat grain protein content (GPC) data is rising with growing food consumption demands and global market competition. However, due to the lack of extensive, long-term, high-resolution benchmark data, previous GPC studies have primarily focused on experimental fields, small geographic units, and limited temporal scopes. Additionally, the diverse geographical terrain in China exacerbates the challenges of large-scale GPC estimation. To address this challenge and the data gap, the first 500-meter spatial resolution, long-term winter wheat dataset covering major planting regions in China (CNWheatGPC-500) was created by integrating multi-source data from ERA5 and MODIS. The results demonstrate that the GPC estimation model based on the hierarchical linear model significantly outperformed the other conventional models. On the validation dataset, it achieved an R² of 0.45 and an RMSE of 0.96%. In regional cross-validation, the RMSE values ranged from 0.90% in Gansu to 1.32% in Anhui. For leave-one-year-out cross-validation, the RMSE values ranged from 0.77% to 1.11%. CNWheatGPC-500 offers valuable insights for enhancing wheat production, quality control, and agricultural decision-making.

Background & Summary

Wheat (Triticum aestivum L.) is one of the world’s primary staple crops, being a crucial source of dietary calories and protein¹. In the global context, China holds a significant position as both the largest producer and consumer of wheat, thereby exerting substantial influence over worldwide wheat dynamics². Notably, within China, winter wheat takes on a pivotal role, contributing to approximately 85% of the nation’s total grain output during the summer harvest³. This prominence is attributed to its exceptional yield, elevated protein content, and adaptability to local environmental conditions and dietary preferences⁴. Therefore, under the conditions of frequent climate change and escalating geopolitical conflicts, the timely acquisition of comprehensive production performance information at a broad regional scale for winter wheat is instrumental for an improved understanding of food supply, which is pivotal for achieving food security and sustainable development goals. Although previous research^5,6,7 has primarily focused on wheat yield estimation, recent studies^8,9,10 have increasingly emphasized the importance of grain quality estimation. Grain quality is as vital as grain yield because it directly influences the demand strength of the consumer markets, international market competitiveness, and the profitability of production systems¹¹. The grain protein content (GPC) of wheat is a crucial trait that characterizes grain quality, significantly affecting the viscoelastic properties and milling characteristics of flour¹⁰. Furthermore, it directly determines the nutritional and economic value of wheat¹². Thus, an accurate understanding of wheat GPC holds paramount economic and practical significance.

Spatial monitoring of GPC has evolved over time. It initially began with labor-intensive methods like manual sampling. These samples were georeferenced and interpolated using geostatistics to create spatial GPC layers¹³. With advancements in remote sensing (RS) and earth observation technologies, RS-based GPC monitoring has provided more potential. Beyond conventional GPC mapping, it offers the capability for early estimation. Sensors, spanning from handheld devices¹⁴ to proximal ground-based systems^15,16, aerial platforms^17,18,19, and satellite platforms^20,21,22, have progressively emerged as increasingly favored data acquisition tools due to their economic efficiency and non-destructive advantages. These sensors are integrated with ground-based GPC observations, enabling the establishment of a predictive system for generating spatial GPC layers with predetermined levels of accuracy²³. RS sensors utilize received spectral reflectance signals, particularly in the red and near-infrared wavelengths, to rapidly capture crop growth characteristics and health information^24,25,26. Nonetheless, there is a common need to enhance the information of interest while disregarding responses that are less relevant and considered as noise. For instance, GPC is closely related to crop nitrogen, chlorophyll, and water content^27,28, making information sensitive to these variables worthy of additional consideration. This highlights the significance of vegetation indices (VIs). The widely applied Normalized Difference VI (NDVI) in agriculture provides comprehensive and actionable vegetation information^29,30. However, it can encounter saturation effects in areas with high vegetation cover³¹. Three-band VIs, especially the Enhanced VI (EVI), have been highlighted due to their ability to mitigate saturation³¹. This is crucial for GPC monitoring as the critical phase for grain protein synthesis often occurs in the later stages of crop growth³², coinciding with high biomass. 
Previous efforts^33,34,35 have shown that EVI has the strongest correlation with GPC among numerous VIs. Relying exclusively on RS-based GPC monitoring may yield favorable results in small-scale experimental fields or individual growing seasons^9,10. However, these methods often fall short when extended to large-scale and multi-year applications^36,37. Different environments exert different influences on the response of RS factors to GPC. These environmental factors usually include variables directly related to GPC^8,35,38,39, such as evapotranspiration (ET), temperature (Tem), precipitation (Pre), and solar radiation (SR). They influence the protein nutritional value of wheat grains by exerting effects on photosynthesis, growth rates, nutrient utilization efficiency, and the translocation of carbon and nitrogen from source to sink^40,41. In light of this prior knowledge, researchers have chosen to employ multisource data fusion for GPC estimation, which has led to the development of various methods.

There are three primary categories of methods used for predicting GPC. The first category is empirical methods, including empirical regression based on simple mathematical models and complex machine learning algorithms. Empirical regression methods typically require fewer data points and offer transparent models that are easy to comprehend⁴². However, these methods often lack precision and exhibit relatively weak generalization capabilities¹⁸. Research on machine learning algorithms for GPC estimation has been increasing^9,10,43, and such methods typically demand a substantial amount of data for accurate model training. By adjusting factors such as algorithm complexity and regularization, model generalization can be enhanced⁴⁴. Nevertheless, most machine learning algorithms are commonly perceived as “black boxes” due to their limited interpretability¹². The second category is physically-based process models. These models aim to simulate and predict crop growth and development by considering various factors, such as soil properties, meteorological conditions, and agricultural management practices⁴⁵. Data assimilation provides a method for integrating RS observations into process models. Leaf area index or leaf nitrogen content are often used as state variables in data assimilation systems to correct crop growth model behavior, thereby improving model accuracy⁴⁶. The application of GPC simulation using these methods has yielded favorable results at field and small-scale levels^34,47,48. However, the high demand for dense data input and substantial localized parameter calibration makes employing these models for large-scale yield estimations more challenging. The third category involves semi-mechanistic models. These models combine fundamental mechanistic equations or rules, explaining the core mechanism of an observed phenomenon, with empirical parameters to account for complexities or uncertainties that the model may not fully describe^49,50. 
Given these advantages, semi-mechanistic models have garnered increased attention, particularly in estimating agricultural surface parameters^51,52,53. Among them, Li et al.⁵⁴ demonstrated the effectiveness of the hierarchical linear model (HLM) for GPC estimation, which integrates climate-RS spatial nesting theories, in interannual application scenarios, and Xu et al.³⁵ confirmed it in cross-regional application scenarios. However, most research still focuses on limited regions and years, and the scalability of the results requires further validation. Additionally, there are only a limited number of studies regarding the monitoring of national wheat GPC in China and its correlation with climate^8,38. Due to data limitations, the minimum geographic units analyzed in current studies are at the county level, rather than at the more detailed pixel level.

The synergy of advancements in RS technology, the application of sensor technology, agricultural information management systems, agricultural digitization, and open data policies has propelled the generation of vast agricultural resource products^55,56. These products encompass datasets related to crop planting distribution^57,58,59, crop phenology^60,61,62, and yield^63,64,65, among others. The creation of these datasets holds significant value, as it contributes to the realization of sustainable agricultural development, food security, grain production, and rural economic growth. Nevertheless, so far, no large-scale and long-term datasets related to wheat GPC have been made publicly available. This data gap in the field signifies a need for further research and data acquisition to fill it.

In view of this situation, this study leveraged multi-source climate and RS data spanning from 2008 to 2019 to develop a nationwide GPC estimation model based on HLM for China. This model was employed to generate a high-resolution, long-term GPC dataset with a spatial resolution of 500 m (CNWheatGPC-500). Specifically, first, a multi-source meteorological dataset was integrated, encompassing evapotranspiration, temperature, precipitation, and solar radiation, based on pixel-by-pixel phenological attributes. This integration effectively alleviated the phenological variations across different regions. Second, taking into account different levels of spatial data structures and integrating meteorological datasets with maximum EVI, an HLM-based GPC estimation model was constructed. Finally, the performance of the GPC estimation model was rigorously assessed through extensive multi-model comparisons and cross-validation. The validated model was then applied to all winter wheat cultivation areas in the study region to generate CNWheatGPC-500.

Methods

Study area

China’s winter wheat cultivation is a result of its conducive natural environment, rich cultivation history, and alignment with dietary preferences. With its vast territory covering approximately 9.6 million km², China displays a wide range of climate and soil conditions, which, in turn, leads to the varying adaptability of winter wheat across different regions. Each region boasts suitable planting areas. The focus of this study includes 13 provinces and municipalities directly under the Central Government (Fig. 1(a)), encompassing nearly all major winter wheat cultivation regions in China. These regions have been further categorized into five primary agricultural subregions based on climate characteristics, soil conditions, and topography. The vector data used for these delineations are derived from the Agricultural Zoning Map of China, depicted in Fig. 1(b). Subregion A covers the northwestern region of Gansu and Xinjiang, while Subregion B includes the Loess Plateau, notably the Guanzhong Plain. Subregion C corresponds to the North China Plain, and Subregion D covers the southwestern region, which includes the Sichuan Basin. Subregion E includes the plain areas in the middle and lower reaches of the Yangtze Valley.

Data acquisition and preprocessing

In this study, a comprehensive dataset was employed, encompassing geospatial data for winter wheat, satellite imagery, historical meteorological data, and field-based GPC data. These diverse data sources were selected to facilitate analysis of winter wheat characteristics and enhance the precision of GPC estimation models. The geospatial data of winter wheat provides key information about wheat planting patterns and phenology, serving as the base layer for this study and establishing the temporal framework for meteorological data synthesis. Satellite imagery data refines the spatial scale by offering high-resolution imagery of the research area. This is particularly valuable for closely monitoring the growth and nutritional status of winter wheat. The utilization of historical meteorological data plays a pivotal role in driving grain protein synthesis and nutrient transfer within the wheat. The pronounced variability in meteorological conditions among different regions exerts a direct influence on the diversity in GPC. Consequently, the inclusion of meteorological data can facilitate the development of more robust GPC estimation models, particularly in the presence of spatial heterogeneity. Finally, field-based GPC data serves to calibrate and validate GPC estimation models based on RS and meteorological data. Together, these datasets form a robust foundation for creating the CNWheatGPC-500 dataset, as elaborated in the subsequent sections and Table 1.

Table 1 Input dataset details for CNWheatGPC-500 generation.


Geospatial data for winter wheat

The winter wheat phenological data and spatial distribution data in this study were obtained from the winter wheat phenological dataset in China from 2000 to 2019 produced by Luo et al.⁶⁰. In this dataset, the wheat layer was extracted from a 1 km national land cover dataset (NLCD). Detailed information about the NLCD dataset can be found at http://www.resdc.cn/Default.aspx. Notably, the dataset comprises essential winter wheat phenological periods, namely, green-up and emergence (GE), heading (HE), and maturity (MA). Of particular significance is that it provides spatiotemporal coverage that aligns with the specific needs of this study.

Satellite imagery data

The Moderate Resolution Imaging Spectroradiometer (MODIS) MCD43A4 dataset was accessed through the Google Earth Engine (GEE) platform. This dataset includes data from 2000 onward and features a spatial resolution of 500 m along with a retrieval period of 16 days. The utilization of RS features from this dataset is widespread in the study of surface crop parameter inversion^66,67. Among these features, EVI stands out as an improvement of NDVI, specifically designed to minimize the impact of atmospheric scattering effects⁶⁸. This refinement significantly bolsters the detectable vegetation signals, providing a more precise estimation of vegetation growth and health status, even in the face of varying conditions across different temporal and spatial contexts⁶⁹. Hence, considering the accessibility of high-resolution, long-term time series data, this study proceeded to compute the EVI using MODIS data for estimating wheat GPC. The calculation equation is as follows:

$${\rm{EVI}}=2.5{\rm{\times }}\frac{\left({\rm{NIR}}-{\rm{R}}\right)}{\left({\rm{NIR}}+6{\rm{\times }}{\rm{R}}-7.5{\rm{\times }}{\rm{B}}+1\right)},$$

(1)

where NIR, R, and B represent the reflectance values of MODIS products in the near-infrared, red, and blue bands, respectively. It is worth noting that the EVI employed in this study corresponds to the maximum synthesis achieved during the winter wheat growth season. This specific choice is grounded in the fact that the maximum EVI value typically conveys vital information about the physiological characteristics and growth status of wheat during the heading stage^51,60. This period is crucial as it directly influences the processes of wheat photosynthesis and nutrient transfer, which are closely tied to protein synthesis^70,71.
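As an illustration of Eq. (1) and the seasonal maximum described above, the following is a minimal per-pixel sketch in plain Python. The function names and the reflectance values are hypothetical and chosen only for illustration; they are not taken from the authors' processing chain, which was run on the GEE platform.

```python
def evi(nir: float, red: float, blue: float) -> float:
    """Enhanced Vegetation Index from band reflectances, following Eq. (1)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def seasonal_max_evi(series):
    """Maximum EVI over one growing season for a single pixel.

    `series` is an iterable of (nir, red, blue) reflectance tuples, one per
    acquisition date. Missing observations are passed as None and skipped.
    Returns None when no valid observation is available.
    """
    values = [evi(*obs) for obs in series if obs is not None]
    return max(values) if values else None


# Hypothetical reflectances for one pixel at three dates (illustrative only).
pixel_series = [(0.30, 0.10, 0.05), None, (0.45, 0.05, 0.03)]
peak = seasonal_max_evi(pixel_series)
```

Taking the maximum rather than a value on a fixed date means the selected observation follows each pixel's own growth peak, which is the motivation given in the text for using the maximum synthesis.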

Historical meteorological data

The ET, Tem, Pre, and SR data were chosen to reconstruct the required historical meteorological input dataset for this study. ET represents the combined processes of soil water evaporation and plant transpiration, comprehensively reflecting water demand, water stress, and the management of agricultural water⁷². ET plays a pivotal role in influencing crop growth and protein synthesis and has been widely employed in crop yield estimation⁷³. However, its incorporation into the estimation of GPC at the national scale represents an unexplored approach. Furthermore, there is a well-established recognition that Tem, Pre, and SR are critical in facilitating nutrient absorption, translocation, and protein synthesis within wheat plants^39,74,75. Previous studies^35,76 harnessed these factors to estimate wheat GPC, yielding favorable results. The ET data for this study were obtained from the MODIS MOD16A2 product via the GEE platform. Tem, Pre, and SR data were sourced from the European Centre for Medium-Range Weather Forecasts (ECMWF), ensuring a consistent source for these three variables. Detailed data specifications can be found in Table 1.

Considering the challenges of regional and interannual phenological variations across extensive spatial and temporal scales, this study integrated phenological grid data with their inherent temporal structures to reconstruct meteorological datasets on a pixel-by-pixel basis, as depicted in Fig. 2. First, wheat phenological data was paired with meteorological grids. This started by determining the day of year (DOY) corresponding to MA for each pixel within the meteorological grid. Subsequently, cumulative values for the respective meteorological variables were computed at 30-day intervals, over a total period of 90 days before MA, resulting in the generation of three effective cumulative values for each meteorological variable. These 30-day time intervals were denoted as T1, T2, and T3, substituting the conventional empirical selection of specific months. The 90-day timeframe generally encompasses the critical phenological stages of wheat growth (e.g. HE and MA). This observation period should capture the influence of the most critical phenological stages on wheat grain protein synthesis, which is essential for accurate monitoring and analysis of GPC^35,41.
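The per-pixel accumulation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, the maturity day itself is assumed to be the last day of the 90-day period, and T1 is assumed to be the earliest of the three windows (the text does not specify the ordering).

```python
def window_sums(daily, ma_doy, window=30, n_windows=3):
    """Sum a daily variable over consecutive windows ending at maturity.

    daily  : list in which daily[d - 1] is the value on day of year d
    ma_doy : day of year (DOY) of maturity (MA) for this pixel
    Returns [T1, T2, T3]; T1 is assumed to be the earliest window.
    """
    start = ma_doy - window * n_windows  # 0-based index of the first day in T1
    if start < 0:
        raise ValueError("series does not cover the full period before maturity")
    return [
        sum(daily[start + k * window : start + (k + 1) * window])
        for k in range(n_windows)
    ]


# Illustrative input: the value on each day equals its DOY, maturity on DOY 100.
daily_values = list(range(1, 366))
t1, t2, t3 = window_sums(daily_values, ma_doy=100)
```

Because the windows are anchored to each pixel's own maturity date, two pixels with different phenology are compared over the same growth phase rather than the same calendar month.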

Field-based GPC data

Ground survey points were densely distributed in the study area, as illustrated in Fig. 1(a). A total of 2648 GPC data points for winter wheat were collected in the field from 2008 to 2019, which followed a normal distribution (Fig. 3(a)). The data size across different provinces is illustrated in Fig. 3(b), with Anhui, Hebei, Henan, and Jiangsu among the provinces exhibiting the most extensive datasets. These provinces are situated in the primary cultivation regions for winter wheat in China, accounting for 60% of the nation’s total wheat acreage⁷⁷. It should be noted that data for the year 2015 were not available, while the data for the subsequent years exhibited consistent and comprehensive coverage, as illustrated in Fig. 3(c). The on-site measurement of GPC strictly adhered to standard measurement methods. First, during the annual maturity period, sample points were randomly selected and precisely located within each sample area using GPS. Second, winter wheat was manually harvested from each 1 m² sample area employing the five-point sampling method. Each sample area was positioned at a minimum distance of 2 m from the field’s edge to mitigate edge effects. Third, the samples were transported to the laboratory, where the plants were separated, impurities were removed, and the grains were dried to attain a standardized moisture content of 14%. Finally, the wheat grains were weighed. The protein content was then measured using an Infratec™ 1241 near-infrared grain analyzer (FOSS A/S, Hillerød, Denmark).

Workflow

Figure 4 illustrates the workflow of this study. The first step is data acquisition and preprocessing, which involves the fusion of multisource meteorological data with phenological information at the pixel level, the extraction of RS data, and on-site GPC measurements. These data are used to create the calibration and validation dataset. The subsequent step encompasses model development and accuracy evaluation, in which the GPC estimation model is constructed using this dataset and the HLM approach. Multiple evaluations, including model comparisons and cross-validation, are employed to assess the GPC estimation models. The final step, labeled dataset and validation, leverages the GPC estimation model to create the CNWheatGPC-500 dataset, which covers the study area from 2008 to 2019.

Development of hierarchical linear model (HLM)

The generation of large-scale and long-term GPC datasets typically faces challenges of spatial heterogeneity and interannual phenological differences. The conventional strategy is to comprehensively utilize multi-source environmental data such as RS and meteorology to address this challenge. However, RS data is typically acquired at relatively high resolutions, offering detailed information about wheat growth for each small-scale geographic unit (pixel). Meteorological data, on the other hand, is usually collected at larger geographic units, covering extensive geographic regions. These geographic regions with distinct climates influence the relationship between GPC and RS data. Consequently, the complex interactions among these diverse data sources result in significant data nesting. HLM is a powerful tool employed for the explanation and modeling of nested data⁷⁸, and it has found widespread application in studies related to vegetation growth, crop yield and GPC estimation^{35,51,52,53,54,79,80}. HLM enables the partitioning of variability in nested data into two components: one arising from the individual level (i.e., how RS data responds to GPC), and the other stemming from the group level (i.e., how meteorological data influences the relationship between RS data and GPC). The general forms of the two-level relationships are presented as:

$${\rm{Layer}}\,1:{{\rm{GPC}}}_{{\rm{ij}}}={\beta }_{0{\rm{j}}}+{\beta }_{1{\rm{j}}}\times {\rm{EVI}}+{{\rm{r}}}_{{\rm{ij}}},$$

(2)

where GPC_ij is the grain protein content of individual i within group j, β_0j and β_1j represent the intercept and slope, respectively, and r_ij is the random error. This layer forms the first-level linear relationship between EVI and wheat GPC. The selected meteorological data affect the relationship between wheat GPC and EVI, resulting in variations in slope and intercept:

$${\rm{Layer}}\,2:{\beta }_{{\rm{mj}}}={\gamma }_{{\rm{m0}}}+{\sum }_{1}^{n}({\gamma }_{{\rm{mn1}}}\times {{\rm{ET}}}_{{\rm{n}}})+{\sum }_{1}^{n}({\gamma }_{{\rm{mn2}}}\times {{\rm{Tem}}}_{{\rm{n}}})+{\sum }_{1}^{n}({\gamma }_{{\rm{mn3}}}\times {{\rm{Pre}}}_{{\rm{n}}})+{\sum }_{1}^{n}({\gamma }_{{\rm{mn4}}}\times {{\rm{SR}}}_{{\rm{n}}})+{{\rm{\mu }}}_{{\rm{mj}}},$$

(3)

where β_mj denotes β_0j (m = 0) or β_1j (m = 1) from the Level 1 model, and γ_m0 is the intercept. γ_mn1 to γ_mn4 are the coefficients of the four meteorological factors. The index n takes the values 1, 2, and 3, corresponding to the meteorological data synthesized for the n-th time interval (T1, T2, and T3). μ_mj is the Level 2 random effect, which accounts for the correlation and variability between individuals within a group. The fixed and random effects parameters in HLM are estimated using maximum likelihood estimation⁸¹. Model construction and parameter estimation in this study were carried out in the jamovi statistical software (The jamovi project)⁸². Furthermore, the independent variables were centered so that their means are zero. This approach can reduce parameter collinearity and make variable scaling more uniform, thereby enhancing the quality and interpretability of the model.
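For readers who prefer a single expression, substituting Eq. (3) into Eq. (2) gives the combined form below. It is only an algebraic restatement of the two layers, with the sums written over n = 1 to 3 for the three time intervals, and it introduces no new terms:

$${{\rm{GPC}}}_{{\rm{ij}}}=\left[{\gamma }_{00}+\sum _{n=1}^{3}\left({\gamma }_{0n1}{{\rm{ET}}}_{n}+{\gamma }_{0n2}{{\rm{Tem}}}_{n}+{\gamma }_{0n3}{{\rm{Pre}}}_{n}+{\gamma }_{0n4}{{\rm{SR}}}_{n}\right)+{\mu }_{0{\rm{j}}}\right]+\left[{\gamma }_{10}+\sum _{n=1}^{3}\left({\gamma }_{1n1}{{\rm{ET}}}_{n}+{\gamma }_{1n2}{{\rm{Tem}}}_{n}+{\gamma }_{1n3}{{\rm{Pre}}}_{n}+{\gamma }_{1n4}{{\rm{SR}}}_{n}\right)+{\mu }_{1{\rm{j}}}\right]\times {\rm{EVI}}+{{\rm{r}}}_{{\rm{ij}}}$$

The first bracket is the group-specific intercept and the second is the group-specific EVI slope. Expanding the second bracket yields products of EVI with each meteorological term, which is how the meteorological conditions modify the response of GPC to EVI.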

To further characterize the level of data nesting in the GPC estimation model, this study uses the Intraclass Correlation Coefficient (ICC). ICC serves as a crucial statistical metric for assessing the correlation among individual-level data within group-level data, the variability between different groups, and the effectiveness of hierarchical data⁵³. Its value ranges from 0 to 1, and a higher ICC value indicates a better fit to the characteristics of nested data, making it more suitable for HLM. The calculation is as follows:

$${\rm{ICC}}={{\rm{\sigma }}}_{{\rm{\mu }}0}^{2}/\left({{\rm{\sigma }}}_{{\rm{\mu }}0}^{2}+{{\rm{\sigma }}}^{2}\right),$$

(4)

where σ² is the within-group variance, σ_μ0² is the between-group variance.
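Eq. (4) can be written as a small helper once the two variance components have been estimated by the fitted model. The sketch below is illustrative only: the function name is hypothetical and the variance values are made up, not taken from the paper.

```python
def icc(between_var: float, within_var: float) -> float:
    """Intraclass correlation coefficient following Eq. (4).

    between_var : between-group variance (sigma_mu0 squared)
    within_var  : within-group variance (sigma squared)
    """
    if between_var < 0 or within_var < 0:
        raise ValueError("variance components must be non-negative")
    total = between_var + within_var
    if total == 0:
        raise ValueError("total variance is zero, so the ICC is undefined")
    return between_var / total


# Hypothetical variance components (illustrative only, not from the paper).
example = icc(between_var=0.6, within_var=0.9)
```

An ICC of 0.4 in this made-up case would mean that 40% of the total variance lies between groups, which is the interpretation applied to the provincial and annual groupings in the Technical Validation section.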

Comparative regression algorithms

In addition to HLM, several other machine learning and statistical algorithms have been considered for estimating GPC in wheat. The algorithms tested in the present work include Random Forest (RF), Support Vector Machine (SVM), and Multiple Linear Regression (MLR). Each of these methods possesses its unique characteristics, offering valuable insights into agricultural parameter estimation in previous studies^73,83,84.

RF is an ensemble learning algorithm widely used in agricultural modeling and RS applications. It is particularly attractive for GPC estimation due to its capability to handle high-dimensional datasets and nonlinear relationships. The RF algorithm employs multiple decision trees created from bootstrapped samples⁸⁵. Each tree splits nodes using random feature subsets. The outcomes from each tree are aggregated using either majority voting or averaging of the predicted values. Moreover, RF provides opportunities for performance optimization through hyperparameter tuning. In this study, grid search and cross-validation techniques are utilized to identify the optimal hyperparameter combination, resulting in a precisely calibrated RF model for GPC estimation.

SVM is another powerful algorithm employed in the domain of agricultural modeling and RS, making it an appealing choice for these data. SVM operates by determining the ideal hyperplane for effectively distinguishing between various data point categories⁸⁶. This hyperplane is identified by maximizing the margin between the nearest data points of different categories. The effectiveness of SVM lies in its ability to capture complex relationships in the data, which is especially valuable for the GPC estimation. Similarly, grid search and cross-validation techniques are used to fine-tune the SVM model.

MLR is a statistical method used to model the relationship between a dependent variable and multiple independent variables by fitting a linear equation⁸⁷. Its primary purpose is to predict the dependent variable’s values based on the values of the predictor variables. MLR provides quantifiable coefficients for each predictor variable, representing the strength and direction of their relationships with GPC. The simplicity and transparency of MLR offer a useful reference against more complex algorithms, contributing to a comprehensive evaluation of their respective performance.

Experimental setup

a) Inter-group variability: ICC computations were used to evaluate the hierarchy and variability in the multi-source environmental data employed for GPC estimation across different interannual and regional groupings. This step aims to clarify the advantages associated with the HLM.
b) Model comparison: The calibration and validation datasets were selected at random in a 60% to 40% ratio, respectively. The calibration dataset served as the basis for constructing the HLM model, which was subsequently validated using the separate validation dataset. In parallel, the exact same subsets of data were used to calibrate and validate the comparison modelling approaches (RF, SVM and MLR).
c) Model robustness: Cross-validation was executed using two distinct methods: leave-one-year-out and leave-one-region-out. These methods ensured that model validation occurred independently of the data used for model training.
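The data partitions used in steps b) and c) can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function names, the random seed, and the example year labels are assumptions made here for demonstration.

```python
import random


def random_split(n, calib_fraction=0.6, seed=0):
    """Randomly split sample indices 0..n-1 into calibration and validation sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # the seed is arbitrary, for reproducibility
    cut = round(n * calib_fraction)
    return idx[:cut], idx[cut:]


def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group label.

    `groups` holds one label per sample, for example the sampling year
    (leave-one-year-out) or the province (leave-one-region-out).
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# 2648 field samples, as reported in the text, split 60% / 40%.
calib_idx, valid_idx = random_split(2648)

# Hypothetical year labels for five samples (illustrative only).
years = [2008, 2008, 2009, 2010, 2010]
folds = list(leave_one_group_out(years))
```

In the group-wise scheme no sample from the held-out year or region is seen during training, so each fold tests how well the model transfers to an unseen year or region.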

The CNWheatGPC-500 dataset was generated by retraining the best-performing model using the entire available dataset, and individual models were built for each province to make use of the substantial data available from each region. Exploration of the spatiotemporal patterns was carried out using the generated CNWheatGPC-500 dataset. This experimental design allowed for rigorous model assessment, validation, and the investigation of GPC variations at both regional and temporal scales, ensuring the reliability of the results.

Statistical analysis methods

The R-squared (R²), Root Mean Square Error (RMSE), and normalized RMSE (nRMSE) serve as essential metrics for the comparison and assessment of the GPC estimation models. These metrics are widely used in quantifying the models’ accuracy and effectiveness^35,88. They are calculated as follows:

$${{\rm{R}}}^{2}=1-\frac{{\rm{SSE}}}{{\rm{SST}}},$$

(5)

$${\rm{RMSE}}=\sqrt{\frac{\mathop{\sum }\limits_{i=1}^{n}{({y}_{i}-{\hat{y}}_{i})}^{2}}{n}},$$

(6)

$${\rm{nRMSE}}=\frac{{\rm{RMSE}}}{\bar{y}},$$

(7)

where SSE is the sum of squared errors, representing the fitting error of the regression model; SST is the total sum of squares, representing the overall variability of the data; n is the number of sample points; ${y}_{i}$ is the observed value of the i-th sample point; ${\hat{y}}_{i}$ is the model-predicted value for the i-th sample point; and $\bar{y}$ is the mean of the observed values.
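The three metrics in Eqs. (5) to (7) can be computed with the standard library alone. The sketch below uses hypothetical function names and made-up GPC values for illustration. It returns nRMSE as a fraction, so it must be multiplied by 100 to match the percentages reported in this paper.

```python
import math


def r_squared(obs, pred):
    """Coefficient of determination, 1 - SSE / SST (Eq. 5)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((y - p) ** 2 for y, p in zip(obs, pred))
    sst = sum((y - mean_obs) ** 2 for y in obs)
    return 1.0 - sse / sst


def rmse(obs, pred):
    """Root mean square error (Eq. 6), in the units of the observations."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(obs, pred)) / len(obs))


def nrmse(obs, pred):
    """RMSE divided by the mean observed value (Eq. 7), as a fraction."""
    return rmse(obs, pred) / (sum(obs) / len(obs))


# Hypothetical observed and predicted GPC values in percent (illustrative only).
observed = [12.0, 14.0, 16.0]
predicted = [13.0, 14.0, 15.0]
scores = (r_squared(observed, predicted),
          rmse(observed, predicted),
          nrmse(observed, predicted))
```

As a rough consistency check on the reported validation results, an RMSE of 0.96% together with an nRMSE of 6.90% implies a mean observed GPC of about 13.9%.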

Data Records

The first 500-meter spatial resolution, long-term winter wheat dataset covering major planting regions in China (CNWheatGPC-500) was created by integrating multi-source data from ERA5 and MODIS. Distributed under the Creative Commons Attribution 4.0 International license, the CNWheatGPC-500 dataset not only advances our understanding of winter wheat GPC in China but also facilitates research and analysis in an open and collaborative manner. The dataset files are named YearCNWheatGPC-500, where “Year” represents each year from 2008 to 2019. The dataset covers 12 years and is provided in TIF format. The CNWheatGPC-500 product generated in this study is available at https://doi.org/10.5281/zenodo.10066544⁸⁹. Kindly contact the authors for further inquiries and more detailed information.

Technical Validation

Inter-group variability in multisource data

Significant variations across years were observed by calculating Intraclass Correlation Coefficient (ICC) for various provinces (Fig. 5(a)). Gansu Province displayed relatively lower interannual differences (ICC = 0.20), while Shanxi Province exhibited higher interannual variations (ICC = 0.79). This suggests that a significant proportion of the total variability, ranging from 20% to 79%, can be attributed to inter-group variance. Furthermore, the analysis extended to provincial differences for various years (Fig. 5(b)). For instance, a relatively low ICC value was found in 2018, indicating minimal differences among provinces and a higher degree of consistency within that year. Conversely, 2010 exhibited a notably high ICC value, implying substantial differences among provinces during that particular year. Despite these discrepancies, the overall ICC values remained notably high, highlighting the multi-layered nature of spatial data, where a level of consistency exists within groups, but disparities between groups are pronounced. Incorporating diverse agricultural subregions into the analysis has generated additional insights (Fig. 5(c–g)). While scatter plots suggest that the relationship between GPC and the EVI may not be overt, it does exhibit a notable degree of correlation within different groups.

Assessment and selection of GPC estimation models

Evaluating the performance of GPC estimation models is essential for validating the reliability of CNWheatGPC-500. Figure 6 presents the performance and accuracy of various models on the calibration dataset. The RF model (Fig. 6(a)) exhibits a favorable performance with an R² of 0.59, RMSE of 0.89%, and nRMSE of 6.39%. However, a slight bias is visible as it tends to underestimate high GPC values and overestimate low GPC values. In Fig. 6(b), the performance of the SVM model is shown, presenting acceptable results (R² = 0.43, RMSE = 1.04%, nRMSE = 7.5%), though worse than RF in terms of accuracy. The MLR model results, shown in Fig. 6(c), exhibited the lowest accuracy with an R² of 0.11, RMSE of 1.29%, and nRMSE of 9.32%. Lastly, an R² value of 0.57, an RMSE of 0.89%, and an nRMSE of 6.43% are observed for the HLM in Fig. 6(d). While HLM slightly underperformed compared to RF in terms of R², the HLM displayed a more balanced estimation performance, particularly aligning with actual values in the lower and higher GPC ranges.

The performance of the GPC estimation models on the validation dataset can be observed in Fig. 7. The R², RMSE, and nRMSE (Fig. 7(a)) for the GPC estimated by RF compared to the measured GPC were 0.39, 0.99%, and 7.12%, respectively. Although it still delivered reasonable estimations, the performance of RF was comparatively diminished in the validation dataset. Figure 7(b) shows the SVM model results (R² = 0.29, RMSE = 1.09%, nRMSE = 7.86%) indicating a drop in performance as well. In Fig. 7(c), the MLR model maintained its R² at 0.11, but experienced an increase in both RMSE (1.19%) and nRMSE (8.61%), and continued to be the worst performing model. Notably, Fig. 7(d) reveals the robust performance of the HLM in the validation dataset with an R² of 0.45, RMSE of 0.96%, and nRMSE of 6.90%. Among all models, HLM stands out with the highest validation accuracy.
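
For readers who wish to reproduce accuracy statistics of this kind, the sketch below computes R², RMSE, and nRMSE from paired measured and estimated values. The exact definitions used in the study are not given in this section, so two assumptions are made: R² is taken as the coefficient of determination, and nRMSE is the RMSE divided by the mean of the measured values, expressed as a percentage. The reported RMSE and nRMSE pairs imply a normalizing value close to 13.9%, which is consistent with a mean-based normalization but does not confirm it. The function name `regression_metrics` is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return R^2, RMSE, and nRMSE (%) for paired observations.

    R^2 is computed as 1 - SS_res / SS_tot; nRMSE is RMSE divided by the
    mean observed value, times 100. Both definitions are assumptions.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": rmse,
        "nrmse": 100.0 * rmse / mean_obs,
    }


if __name__ == "__main__":
    # Hypothetical measured and estimated GPC values (%).
    print(regression_metrics([12.0, 14.0, 16.0], [13.0, 14.0, 15.0]))
```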

Cross-validation of HLM across multiple years and regions

Further cross-validation was performed to evaluate the performance of HLM. Figure 8(a) illustrates the results of a leave-one-region-out cross-validation conducted across provinces in China, where each province’s results represent its independent validation, having not participated in the model training. While R² values varied across provinces, ranging from 0.07 in Hebei to 0.41 in Xinjiang, the overall precision remained acceptable, illustrating the influence of regional differences. The RMSE plays a more crucial role than R² in assessing the validation accuracy of HLM. The RMSE values, ranging from 0.90% in Gansu to 1.32% in Anhui, displayed a relatively consistent pattern across provinces. Figure 8(b) displays the accuracy of leave-one-year-out cross-validation. The R² values exhibit variance across years due to sample size discrepancies, ranging from 0.19 to 0.62. Conversely, the RMSE values exhibit a relatively balanced distribution, ranging from 0.77% to 1.11%. Each year’s results consistently demonstrated a satisfactory performance. In summary, the cross-validation results demonstrated the resilience of HLM. Even when faced with data gaps for a specific year, the HLM models constructed by individual provinces consistently generated satisfactory GPC estimations for that particular year, emphasizing the reliability of the GPC dataset.
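
The leave-one-region-out and leave-one-year-out procedures share the same structure: each group is withheld in turn, the model is fitted on the remaining groups, and the error is computed on the withheld group. The sketch below illustrates that loop only. It substitutes a simple one-predictor least-squares line for the HLM, so it is not the authors' model, and the record layout `(group, x, y)` and the function names are assumptions.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def leave_one_group_out_rmse(records):
    """Return {group: RMSE} with each group held out once.

    records: list of (group, x, y) tuples, where group is for example a
    province name or a year, x is a predictor, and y is the measured GPC.
    """
    results = {}
    for held_out in sorted({g for g, _, _ in records}):
        train = [(x, y) for g, x, y in records if g != held_out]
        test = [(x, y) for g, x, y in records if g == held_out]
        slope, intercept = fit_line([x for x, _ in train],
                                    [y for _, y in train])
        squared = [(y - (slope * x + intercept)) ** 2 for x, y in test]
        results[held_out] = math.sqrt(sum(squared) / len(squared))
    return results


if __name__ == "__main__":
    # Hypothetical records: (province, predictor value, measured GPC in %).
    data = [("A", 0.30, 13.1), ("A", 0.35, 13.6), ("B", 0.40, 14.2),
            ("B", 0.45, 14.5), ("C", 0.50, 15.1), ("C", 0.55, 15.4)]
    print(leave_one_group_out_rmse(data))
```

Because the withheld group never contributes to fitting, each per-group RMSE is an independent estimate of how the model transfers to an unseen province or year.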

Code availability

The scripts for the comparative algorithms used in this study can be obtained at https://zenodo.org/records/10571132⁹⁰. The software implementing the algorithms for dataset generation and analysis is thoroughly listed in the methodology.

References

Cai, Y. et al. Integrating satellite and climate data to predict wheat yield in Australia using machine learning approaches. Agricultural and Forest Meteorology 274, 144–159, https://doi.org/10.1016/j.agrformet.2019.03.010 (2019).
Zhai, Y. et al. Impact-oriented water footprint assessment of wheat production in China. Science of The Total Environment 689, 90–98, https://doi.org/10.1016/j.scitotenv.2019.06.262 (2019).
Cao, J. et al. Identifying the Contributions of Multi-Source Data for Winter Wheat Yield Prediction in China. Remote Sensing 12, https://doi.org/10.3390/rs12050750 (2020).
Huang, J. et al. Improving winter wheat yield estimation by assimilation of the leaf area index from Landsat TM and MODIS data into the WOFOST model. Agricultural and Forest Meteorology 204, 106–121, https://doi.org/10.1016/j.agrformet.2015.02.001 (2015).
Silvestro, P. et al. Estimating Wheat Yield in China at the Field and District Scale from the Assimilation of Satellite Data into the Aquacrop and Simple Algorithm for Yield (SAFY) Models. Remote Sensing 9, https://doi.org/10.3390/rs9050509 (2017).
Lobell, D.B. et al. Sight for Sorghums: Comparisons of Satellite- and Ground-Based Sorghum Yield Estimates in Mali. Remote Sensing 12, https://doi.org/10.3390/rs12010100 (2019).
Mavromatis, T. Spatial resolution effects on crop yield forecasts: An application to rainfed wheat yield in north Greece with CERES-Wheat. Agricultural Systems 143, 38–48, https://doi.org/10.1016/j.agsy.2015.12.002 (2016).
Zhou, W., Liu, Y., Ata-Ul-Karim, S.T., Ge, Q. Spatial difference of climate change effects on wheat protein concentration in China. Environmental Research Letters 16, https://doi.org/10.1088/1748-9326/ac3401 (2021).
Liu, S., Hu, Z., Han, J., Li, Y., Zhou, T. Predicting grain yield and protein content of winter wheat at different growth stages by hyperspectral data integrated with growth monitor index. Computers and Electronics in Agriculture 200, https://doi.org/10.1016/j.compag.2022.107235 (2022).
Longmire, A. R., Poblete, T., Hunt, J. R., Chen, D. & Zarco-Tejada, P. J. Assessment of crop traits retrieved from airborne hyperspectral and thermal remote sensing imagery to predict wheat grain protein content. ISPRS Journal of Photogrammetry and Remote Sensing 193, 284–298, https://doi.org/10.1016/j.isprsjprs.2022.09.015 (2022).
Khanal, S., Kc, K., Fulton, J.P., Shearer, S., Ozkan, E. Remote Sensing in Agriculture—Accomplishments, Limitations, and Opportunities. Remote Sensing 12, https://doi.org/10.3390/rs12223783 (2020).
Ma, J., Zheng, B. & He, Y. Applications of a Hyperspectral Imaging System Used to Estimate Wheat Grain Protein: A Review. Frontiers in Plant Science 13, 837200, https://doi.org/10.3389/fpls.2022.837200 (2022).
Basso, B. et al. Landscape Position and Precipitation Effects on Spatial Variability of Wheat Yield and Grain Protein in Southern Italy. Journal of Agronomy and Crop Science 195, 301–312, https://doi.org/10.1111/j.1439-037X.2008.00351.x (2009).
McLellan, E. L. et al. The Nitrogen Balancing Act: Tracking the Environmental Performance of Food Production. BioScience 68, 194–203, https://doi.org/10.1093/biosci/bix164 (2018).
Diacono, M., Rubino, P. & Montemurro, F. Precision nitrogen management of wheat. A review. Agronomy for Sustainable Development 33, 219–241, https://doi.org/10.1007/s13593-012-0111-z (2013).
Øvergaard, S. I., Isaksson, T. & Korsaeth, A. Prediction of Wheat Yield and Protein Using Remote Sensors on Plots—Part I: Assessing near Infrared Model Robustness for Year and Site Variations. Journal of Near Infrared Spectroscopy 21, 117–131, https://doi.org/10.1255/jnirs.1042 (2013).
Rodrigues, F.A. Jr. et al. Multi-Temporal and Spectral Analysis of High-Resolution Hyperspectral Airborne Imagery for Precision Agriculture: Assessment of Wheat Grain Yield and Grain Protein Content. Remote Sensing 10, 930, https://doi.org/10.3390/rs10060930 (2018).
Hama, A., Tanaka, K., Mochizuki, A., Tsuruoka, Y., Kondoh, A. Estimating the Protein Concentration in Rice Grain Using UAV Imagery Together with Agroclimatic Data. Agronomy 10, https://doi.org/10.3390/agronomy10030431 (2020).
Barmeier, G., Hofer, K. & Schmidhalter, U. Mid-season prediction of grain yield and protein content of spring barley cultivars using high-throughput spectral sensing. European Journal of Agronomy 90, 108–116, https://doi.org/10.1016/j.eja.2017.07.005 (2017).
Azzari, G. et al. Satellite mapping of tillage practices in the North Central US region from 2005 to 2016. Remote Sensing of Environment 221, 417–429, https://doi.org/10.1016/j.rse.2018.11.010 (2019).
Franch, B. et al. Improving the timeliness of winter wheat production forecast in the United States of America, Ukraine and China using MODIS data and NCAR Growing Degree Day information. Remote Sensing of Environment 161, 131–148, https://doi.org/10.1016/j.rse.2015.02.014 (2015).
Zhou, W. et al. Integrating climate and satellite remote sensing data for predicting county-level wheat yield in China using machine learning methods. International Journal of Applied Earth Observation and Geoinformation 111, https://doi.org/10.1016/j.jag.2022.102861 (2022).
Bastos, L. M., Froes de Borja Reis, A., Sharda, A., Wright, Y., Ciampitti, I. A. Current Status and Future Opportunities for Grain Protein Prediction Using On- and Off-Combine Sensors: A Synthesis-Analysis of the Literature. Remote Sensing 13, https://doi.org/10.3390/rs13245027 (2021).
Tian, J. et al. Simultaneous estimation of fractional cover of photosynthetic and non-photosynthetic vegetation using visible-near infrared satellite imagery. Remote Sensing of Environment 290, 113549, https://doi.org/10.1016/j.rse.2023.113549 (2023).
Elvidge, C. D. & Chen, Z. Comparison of broad-band and narrow-band red and near-infrared vegetation indices. Remote Sensing of Environment 54, 38–48, https://doi.org/10.1016/0034-4257(95)00132-K (1995).
Miura, T., Huete, A. & Yoshioka, H. An empirical investigation of cross-sensor relationships of NDVI and red/near-infrared reflectance using EO-1 Hyperion data. Remote Sensing of Environment 100, 223–236, https://doi.org/10.1016/j.rse.2005.10.010 (2006).
Wang, Z. et al. Predicting grain yield and protein content using canopy reflectance in maize grown under different water and nitrogen levels. Field Crops Research 260, 107988, https://doi.org/10.1016/j.fcr.2020.107988 (2021).
Savaşlı, E., Karaduman, Y., Önder, O. & Ateş, Ö. Prediction of grain protein content and gluten quality of bread wheat in the early vegetation period by optical sensors. Journal of Cereal Science 102, 103354, https://doi.org/10.1016/j.jcs.2021.103354 (2021).
Johnson, D. M. An assessment of pre- and within-season remotely sensed variables for forecasting corn and soybean yields in the United States. Remote Sensing of Environment 141, 116–128, https://doi.org/10.1016/j.rse.2013.10.027 (2014).
Wolanin, A. et al. Estimating crop primary productivity with Sentinel-2 and Landsat 8 using machine learning methods trained with radiative transfer simulations. Remote Sensing of Environment 225, 441–457, https://doi.org/10.1016/j.rse.2019.03.002 (2019).
Wang, W. et al. Estimating leaf nitrogen concentration with three-band vegetation indices in rice and wheat. Field Crops Research 129, 90–98, https://doi.org/10.1016/j.fcr.2012.01.014 (2012).
Wu, W. et al. Booting stage is the key timing for split nitrogen application in improving grain yield and quality of wheat – A global meta-analysis. Field Crops Research 287, 108665, https://doi.org/10.1016/j.fcr.2022.108665 (2022).
Tan, C. et al. Predicting grain protein content of field-grown winter wheat with satellite images and partial least square algorithm. PLoS One 15, e0228500, https://doi.org/10.1371/journal.pone.0228500 (2020).
Chen, P. Estimation of Winter Wheat Grain Protein Content Based on Multisource Data Assimilation. Remote Sensing 12, https://doi.org/10.3390/rs12193201 (2020).
Xu, X. et al. Prediction of Wheat Grain Protein by Coupling Multisource Remote Sensing Imagery and ECMWF Data. Remote Sensing 12, https://doi.org/10.3390/rs12081349 (2020).
Zhao, H., Song, X., Yang, G., Li, Z., Zhang, D., Feng, H. Monitoring of Nitrogen and Grain Protein Content in Winter Wheat Based on Sentinel-2A Data. Remote Sensing 11, https://doi.org/10.3390/rs11141724 (2019).
Song, Y. et al. Improving the Prediction of Grain Protein Content in Winter Wheat at the County Level with Multisource Data: A Case Study in Jiangsu Province of China. Agronomy 13, https://doi.org/10.3390/agronomy13102577 (2023).
Zhao, Y. et al. Spatial heterogeneity of county-level grain protein content in winter wheat in the Huang-Huai-Hai region of China. European Journal of Agronomy 134, https://doi.org/10.1016/j.eja.2022.126466 (2022).
Mußhoff, O. & Vollmer, E. Average Protein Content and Its Variability in Winter Wheat: A Forecast Model based on Weather Parameters. Earth Interactions 22, 1–24, https://doi.org/10.1175/ei-d-18-0011.1 (2018).
Raghuram, N., Sharma, N. 4.17 - Improving Crop Nitrogen Use Efficiency★. In Comprehensive Biotechnology (Third Edition), Moo-Young, M.R., Nandula, Sharma, N., Eds.; Pergamon: Oxford, pp. 211–220 (2019).
Li, Z. et al. Remote sensing of quality traits in cereal and arable production systems: A review. The Crop Journal 12, 45–57, https://doi.org/10.1016/j.cj.2023.10.005 (2023).
Wang, L., Tian, Y., Yao, X., Zhu, Y. & Cao, W. Predicting grain yield and protein content in wheat by fusing multi-sensor and multi-temporal remote-sensing images. Field Crops Research 164, 178–188, https://doi.org/10.1016/j.fcr.2014.05.001 (2014).
Li, W. et al. Monitoring rice grain protein accumulation dynamics based on UAV multispectral data. Field Crops Research 294, 108858, https://doi.org/10.1016/j.fcr.2023.108858 (2023).
Xiao, C. et al. Short and mid-term sea surface temperature prediction using time-series satellite data and LSTM-AdaBoost combination approach. Remote Sensing of Environment 233, 111358, https://doi.org/10.1016/j.rse.2019.111358 (2019).
El-Sharkawy, M. A. Overview: Early history of crop growth and photosynthesis modeling. Biosystems 103, 205–211, https://doi.org/10.1016/j.biosystems.2010.08.004 (2011).
Jin, X. et al. A review of data assimilation of remote sensing and crop models. European Journal of Agronomy 92, 141–152, https://doi.org/10.1016/j.eja.2017.11.002 (2018).
Li, Z. et al. Estimating wheat yield and quality by coupling the DSSAT-CERES model and proximal remote sensing. European Journal of Agronomy 71, 53–62, https://doi.org/10.1016/j.eja.2015.08.006 (2015).
Li, Z. et al. Assimilation of Two Variables Derived from Hyperspectral Data into the DSSAT-CERES Model for Grain Yield and Quality Estimation. Remote Sensing 7, 12400–12418, https://doi.org/10.3390/rs70912400 (2015).
Balaguer-Romano, R. et al. A semi-mechanistic model for predicting daily variations in species-level live fuel moisture content. Agricultural and Forest Meteorology 323, 109022, https://doi.org/10.1016/j.agrformet.2022.109022 (2022).
Dong, J., Lu, H., Wang, Y., Ye, T. & Yuan, W. Estimating winter wheat yield based on a light use efficiency model and wheat variety data. ISPRS Journal of Photogrammetry and Remote Sensing 160, 18–32, https://doi.org/10.1016/j.isprsjprs.2019.12.005 (2020).
Zhao, Y. et al. ChinaWheatYield30m: a 30 m annual winter wheat yield dataset from 2016 to 2021 in China. Earth System Science Data 15, 4047–4063, https://doi.org/10.5194/essd-15-4047-2023 (2023).
Li, Z. et al. Comparison and transferability of thermal, temporal and phenological-based in-season predictions of above-ground biomass in wheat crops from proximal crop reflectance data. Remote Sensing of Environment 273, https://doi.org/10.1016/j.rse.2022.112967 (2022).
Wang, C., Jiang, Q.o., Engel, B., Mercado, J.A.V., Zhang, Z. Analysis on net primary productivity change of forests and its multi–level driving mechanism – A case study in Changbai Mountains in Northeast China. Technological Forecasting and Social Change 153, https://doi.org/10.1016/j.techfore.2020.119939 (2020).
Li, Z. et al. A hierarchical interannual wheat yield and grain protein prediction model using spectral vegetative indices and meteorological data. Field Crops Research 248, https://doi.org/10.1016/j.fcr.2019.107711 (2020).
Hussain, S. et al. Global Trends and Future Directions in Agricultural Remote Sensing for Wheat Scab Detection: Insights from a Bibliometric Analysis. Remote Sensing 15, https://doi.org/10.3390/rs15133431 (2023).
Atzberger, C. Advances in Remote Sensing of Agriculture: Context Description, Existing Operational Monitoring Systems and Major Information Needs. Remote Sensing 5, 949–981, https://doi.org/10.3390/rs5020949 (2013).
Zhang, T., Cheng, C. & Wu, X. Mapping the spatial heterogeneity of global land use and land cover from 2020 to 2100 at a 1 km resolution. Scientific Data 10, 748, https://doi.org/10.1038/s41597-023-02637-7 (2023).
Dong, J. et al. Early-season mapping of winter wheat in China based on Landsat and Sentinel images. Earth System Science Data 12, 3081–3095, https://doi.org/10.5194/essd-12-3081-2020 (2020).
Peng, Q. et al. A twenty-year dataset of high-resolution maize distribution in China. Scientific Data 10, 658, https://doi.org/10.1038/s41597-023-02573-6 (2023).
Luo, Y., Zhang, Z., Chen, Y., Li, Z. & Tao, F. ChinaCropPhen1km: a high-resolution crop phenological dataset for three staple crops in China during 2000–2015 based on leaf area index (LAI) products. Earth System Science Data 12, 197–214, https://doi.org/10.5194/essd-12-197-2020 (2020).
Tran, K. H. et al. HP-LSP: A reference of land surface phenology from fused Harmonized Landsat and Sentinel-2 with PhenoCam data. Scientific Data 10, 691, https://doi.org/10.1038/s41597-023-02605-1 (2023).
Niu, Q. et al. A 30 m annual maize phenology dataset from 1985 to 2020 in China. Earth System Science Data 14, 2851–2864, https://doi.org/10.5194/essd-14-2851-2022 (2022).
Luo, Y. et al. GlobalWheatYield4km: a global wheat yield dataset at 4-km resolution during 1982-2020 based on deep learning approach. Earth System Science Data 1–21, https://doi.org/10.5194/essd-2022-423 (2022).
Qin, X., Wu, B., Zeng, H., Zhang, M., Tian, F. GGCP10: A Global Gridded Crop Production Dataset at 10km Resolution from 2010 to 2020. Earth System Science Data https://doi.org/10.5194/essd-2023-346 (2023).
Cernay, C., Pelzer, E. & Makowski, D. A global experimental dataset for assessing grain legume production. Scientific Data 3, 160084, https://doi.org/10.1038/sdata.2016.84 (2016).
Campos-Taberner, M. et al. Global Estimation of Biophysical Variables from Google Earth Engine Platform. Remote Sensing 10, 1167 (2018).
Zhang, C. & Diao, C. A Phenology-guided Bayesian-CNN (PB-CNN) framework for soybean yield estimation and uncertainty analysis. ISPRS Journal of Photogrammetry and Remote Sensing 205, 50–73, https://doi.org/10.1016/j.isprsjprs.2023.09.025 (2023).
Huete, A. et al. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sensing of Environment 83, 195–213, https://doi.org/10.1016/S0034-4257(02)00096-2 (2002).
Zhen, Z., Chen, S., Yin, T. & Gastellu-Etchegorry, J.-P. Globally quantitative analysis of the impact of atmosphere and spectral response function on 2-band enhanced vegetation index (EVI2) over Sentinel-2 and Landsat-8. ISPRS Journal of Photogrammetry and Remote Sensing 205, 206–226, https://doi.org/10.1016/j.isprsjprs.2023.09.024 (2023).
Shewry, P. R. et al. Storage product synthesis and accumulation in developing grains of wheat. Journal of Cereal Science 50, 106–112, https://doi.org/10.1016/j.jcs.2009.03.009 (2009).
Shewry, P. R. et al. An integrated study of grain development of wheat (cv. Hereward). Journal of Cereal Science 56, 21–30, https://doi.org/10.1016/j.jcs.2011.11.007 (2012).
da Silva, E. H. F. M., Hoogenboom, G., Boote, K. J., Gonçalves, A. O. & Marin, F. R. Predicting soybean evapotranspiration and crop water productivity for a tropical environment using the CSM-CROPGRO-Soybean model. Agricultural and Forest Meteorology 323, 109075, https://doi.org/10.1016/j.agrformet.2022.109075 (2022).
Cheng, M. et al. Combining multi-indicators with machine-learning algorithms for maize yield early prediction at the county-level in China. Agricultural and Forest Meteorology 323, https://doi.org/10.1016/j.agrformet.2022.109057 (2022).
Uhlen, A. K. et al. Effects of Cultivar and Temperature During Grain Filling on Wheat Protein Content, Composition, and Dough Mixing Properties. Cereal Chemistry 75, 460–465 (2007).
Dalla Marta, A., Grifoni, D., Mancini, M., Zipoli, G. & Orlandini, S. The influence of climate on durum wheat quality in Tuscany, Central Italy. International Journal of Biometeorology 55, 87–96, https://doi.org/10.1007/s00484-010-0310-8 (2011).
Lee, B.-H., Kenkel, P. & Brorsen, B. W. Pre-harvest forecasting of county wheat yield and wheat quality using weather information. Agricultural and Forest Meteorology 168, 26–35, https://doi.org/10.1016/j.agrformet.2012.08.010 (2013).
Wen, P. et al. Adaptability of wheat to future climate change: Effects of sowing date and sowing rate on wheat yield in three wheat production regions in the North China Plain. Sci Total Environ 901, 165906, https://doi.org/10.1016/j.scitotenv.2023.165906 (2023).
Muller, K. E., Stewart, P. W. 4. Generalizations of the Multivariate Linear Model, Linear Model Theory: Univariate, Multivariate, and Mixed Models. (2012).
Zhao, Y. et al. Should phenological information be applied to predict agronomic traits across growth stages of winter wheat? The Crop Journal 10, 1346–1352, https://doi.org/10.1016/j.cj.2022.08.003 (2022).
Zhu, B. et al. A Regional Maize Yield Hierarchical Linear Model Combining Landsat 8 Vegetative Indices and Meteorological Data: Case Study in Jilin Province. Remote Sensing 13, https://doi.org/10.3390/rs13030356 (2021).
Huang, S. Linear regression analysis. In International Encyclopedia of Education (Fourth Edition), Tierney, R. J., Rizvi, F., Ercikan, K. H., Sijia, Eds., Elsevier: Oxford, 2023, pp. 548–557.
The jamovi project. jamovi (Version 2.3). Retrieved from https://www.jamovi.org (2023).
Feng, P. et al. Dynamic wheat yield forecasts are improved by a hybrid approach using a biophysical model and machine learning technique. Agricultural and Forest Meteorology 285–286, https://doi.org/10.1016/j.agrformet.2020.107922 (2020).
Gos, M., Krzyszczak, J., Baranowski, P., Murat, M., Malinowska, I. Combined TBATS and SVM model of minimum and maximum air temperatures applied to wheat yield prediction at different locations in Europe. Agricultural and Forest Meteorology 281, https://doi.org/10.1016/j.agrformet.2019.107827 (2020).
Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, pp. 278–282 (1995).
Cortes, C. & Vapnik, V. Support-vector networks. Machine Learning 20, 273–297, https://doi.org/10.1007/BF00994018 (1995).
Lindley, D. Introduction to the practice of statistics, by David S. Moore and George P. McCabe. Pp. 825 (with appendices and CD-ROM).£ 27.95. 1999. ISBN 0 7167 3502 4 (WH Freeman). The Mathematical Gazette 83, 374–375 (1999).
Kamir, E., Waldner, F. & Hochman, Z. Estimating wheat yields in Australia using climate records, satellite image time series and machine learning methods. ISPRS Journal of Photogrammetry and Remote Sensing 160, 124–135, https://doi.org/10.1016/j.isprsjprs.2019.11.008 (2020).
Xu, X. et al. The first 500-meter, long-term winter wheat grain protein content dataset for China from multi-source data. https://doi.org/10.5281/zenodo.10066544 (2023).
Xu, X. Comparative algorithms for generating the CNWheatGPC-500 dataset. https://doi.org/10.5281/zenodo.10571131 (2024).

Acknowledgements

This study was supported by the National Natural Science Foundation of China (42271396), the Key Research and Development project of Shandong Province (LJNY202103), the European Space Agency (ESA) and Ministry of Science and Technology of China (MOST) Dragon project (95250), and the Fengyun Application Pioneering Project (FY-APP-2022.0306).

Author information

Authors and Affiliations

College of Geodesy and Geomatics, Shandong University of Science and Technology, Qingdao, 266590, PR China
Xiaobin Xu, Lili Zhou, Chengzhi Fan & Zhenhai Li
UMR ITAP, Montpellier SupAgro, Irstea, Univ. Montpellier, Montpellier, 34000, France
James Taylor
DAFNE, Università della Tuscia, Via San Camillo de Lellis, 01100, Viterbo, Italy
Raffaele Casa
Key Laboratory of Quantitative Remote Sensing in Ministry of Agriculture and Rural Affairs, Information Technology Research Center, Beijing Academy of Agriculture and Forestry Sciences, Beijing, 100097, China
Xiaoyu Song & Guijun Yang
State Key Laboratory of Remote Sensing Science, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing, 100094, China
Wenjiang Huang

Contributions

Xiaobin Xu: Formal analysis, Conceptualization, Writing – original draft, Visualization, Validation. Lili Zhou: Visualization, Writing – original draft. James Taylor: Writing – revision. Raffaele Casa: Writing – revision. Chengzhi Fan: Validation. Xiaoyu Song: Sample collection. Guijun Yang: Sample collection. Wenjiang Huang: Sample collection. Zhenhai Li: Conceptualization, Writing – revision.

Corresponding author

Correspondence to Zhenhai Li.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Xu, X., Zhou, L., Taylor, J. et al. The 500-meter long-term winter wheat grain protein content dataset for China from multi-source data. Sci Data 11, 1025 (2024). https://doi.org/10.1038/s41597-024-03866-0

Received: 15 February 2024
Accepted: 11 September 2024
Published: 19 September 2024
DOI: https://doi.org/10.1038/s41597-024-03866-0