Abstract
This study develops predictive models for rice yield using multivariate techniques. Stepwise multiple regression, discriminant function analysis and logistic regression are used to forecast crop yield in specific districts of Haryana. The time series data on the rice crop were divided into two and three classes based on crop yield. The yearly time series data of rice yield from 1980–81 to 2020–21 were taken from various issues of the Statistical Abstracts of Haryana. The study also used fortnightly meteorological data sourced from the Agrometeorology Department of CCS HAU, India. The performance of the predictive models was compared using Root Mean Square Error, Predicted Error Sum of Squares, Mean Absolute Deviation and Mean Absolute Percentage Error. The results indicated that discriminant function analysis predicted rice yield more accurately than logistic regression. Importantly, the optimum time for forecasting rice yield was found to be 1 month before harvest, which offers valuable insight for agricultural planning and decision-making. This approach combines weather data with advanced statistical techniques and shows the potential for more precise and informed agricultural practices.
Introduction
Rice is the most significant and widely planted crop in India, and India ranks second in global rice output. The north-eastern, southern, and south-eastern regions of India account for 92% of the country's rice output. Approximately 44 million hectares of land in India are used for rice farming1. The country's rice production is influenced by monsoon patterns, irrigation facilities, government policies, and market demand. States like Punjab, West Bengal, Uttar Pradesh, Andhra Pradesh, and Telangana are significant contributors to India's rice production. Haryana is a major agricultural state in India, known for its wheat and rice production. The state has made significant strides in agricultural technology, irrigation facilities, and crop management practices. Haryana's rice production is supported by water resources from rivers like the Yamuna and by the infrastructure for rice cultivation2.
Weather variables can have differing effects on crops at various stages of development. The influence of weather on crop yield therefore depends not only on the intensity of the weather variables but also on their distribution pattern throughout the crop season. This highlights the need to divide the entire crop season into smaller intervals and analyse crop-weather relationships within these intervals. However, this approach increases the number of variables in the model, leading to a large number of parameters that must be estimated from the available data. With limited data, it may be difficult to estimate these parameters precisely. A technique that uses a manageable number of parameters while still considering the entire weather distribution can therefore be a viable solution.
India possesses one of the world's most effective systems for gathering, organizing, and summarizing data on crop production. This system relies on official inputs regarding area and yield obtained from states, which in turn collect data from districts and further sources. Area data is derived from comprehensive assessments conducted by revenue agencies, while yield data comes from crop cutting experiments. The Directorate of Economics and Statistics, Ministry of Agriculture in New Delhi issues initial forecasts (advance estimates) for major cereal and commercial crops. However, the final estimates are typically provided several months after the actual crop harvest. As a result, one drawback of the Department of Agriculture's yield estimates is the delay and potential quality issues with the statistics. Therefore, there is significant room for enhancing the conventional system.
Climate change has become a significant concern, prompting researchers to examine its effects on crop growth and yield. They are also focused on identifying appropriate management strategies to maintain crop productivity in the face of projected climate changes. To achieve a quantitative understanding of how crops respond to climate shifts, researchers are developing statistical models that consider the crop's time-series behaviour along with various climatic factors. Summer crops are particularly vulnerable to low temperatures during reproductive stages, and their differing responses to a temperature decrease can significantly affect crop yields. The challenge lies in integrating this information into the forecasting process and subsequently into decision-making. For an operational yield model to gain widespread adoption, the data must be available well before the crop harvest, and the data collection process should be cost-effective. Weather parameters are readily available during the maximum vegetative stage of the crop and can be highly effective for developing accurate yield forecasting models.
Therefore, it becomes essential to predict the rice yield in advance. The interplay between climatic factors and the crop's growth stages significantly influences both the total yield and the ability to forecast it before harvesting. Accurate predictions are invaluable, not just for farmers but also for trade, industry, and policymakers. Projections of rice yield play a crucial role in decisions related to pricing, storage, marketing strategies, and distribution channels, and so have substantial impacts across the agricultural sector and beyond. In past years, many researchers have tried different techniques to develop rice yield forecasting models, such as weather-indices models3, artificial neural networks4, principal component analysis5 and time series6.
Beyond these techniques, several researchers have forecast crop yield using statistical techniques such as ordinal logistic regression and discriminant function analysis. Goyal7 studied several statistical techniques for estimating wheat yield, including multiple linear regression, discriminant analysis and principal component analysis. Sharma et al.8 estimated wheat yield from environmental factors using ordinal logistic regression. Kumari and Kumar9 used ordinal logistic regression for forecasting yield. Kumari et al.10 explored crop production in Kanpur district of Uttar Pradesh, using logistic regression to predict future yield. Kumar et al.11 developed pre-harvest yield forecast models using advanced statistical techniques based on meteorological parameters. Kumari et al.12 contrasted Bayesian discriminant analysis with discriminant function analysis (scores). Goyal and Verma13 used various statistical techniques for pre-harvest crop yield estimation. Priya et al.14 used discriminant function analysis to forecast sugarcane yield in Coimbatore district, Tamil Nadu, from monthly weather data, with the yield categorized into two and three groups. The scores derived from this analysis, along with the trend, were incorporated as regressors in the yield forecast models. A comparison of the two-group and three-group forecasting models indicated that the three-group models were more effective. Against this background, the current research aims to create pre-harvest predictive models for rice yield using weather data and advanced statistical methods.
Materials and methods
Area and crop covered
The research was conducted in Karnal district, Haryana, India, located at approximately 29.6857° N latitude and 76.9905° E longitude, within the eastern plain zone of Haryana. The area has an average annual rainfall of around 766 mm and deep alluvial, medium to medium-heavy textured soils that are easy to plough, which offers favourable conditions for agriculture. The combination of favourable climate, soil, and extensive irrigation facilities makes rice and wheat cultivation a natural choice for the region. Rice is typically cultivated during the kharif season, while wheat is grown during the rabi season, taking advantage of the optimal growing conditions during these periods.
Data description
Time series data on rice crop yield for the Karnal district of Haryana, from 1980–81 to 2020–21, were extracted from various issues of the Statistical Abstract of Haryana. Daily meteorological data were obtained from the Meteorology Department at C.C.S Haryana Agricultural University in Hisar, Haryana, as well as from CSSRI in Karnal.
Computation of weather parameters
The weather parameters considered were maximum temperature (°C), minimum temperature (°C), average relative humidity (%), sunshine hours (h) and cumulative rainfall (mm), which influence crop growth, the different physiological stages and the rate of phenological development. The fortnightly weather variables were derived from the daily observations,
where \({\text{TMAX}}_{i}\) = maximum temperature on the \(i\)th day; \({\text{TMIN}}_{j}\) = minimum temperature on the \(j\)th day; \({\text{ARH}}_{k}\) = relative humidity on the \(k\)th day; \({\text{SSH}}_{l}\) = sunshine hours on the \(l\)th day; \({\text{ARF}}_{m}\) = rainfall on the \(m\)th day (the indices \(i\), \(j\), \(k\), \(l\) and \(m\) run over the daily weather data).
These data were organized into different fortnight periods based on SMW (Standard Meteorological Weeks) relevant to the growth stages of the rice crop.
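The aggregation from daily records to fortnightly variables can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes that a fortnight spans two standard meteorological weeks (14 days), that temperature, humidity and sunshine hours are averaged, and that rainfall is accumulated. The function names, the record keys and any input values are hypothetical.

```python
from statistics import mean


def split_into_fortnights(daily, length=14):
    """Cut a list of daily records into consecutive blocks of `length` days.

    Assumption: one fortnight = two standard meteorological weeks = 14 days.
    """
    return [daily[i:i + length] for i in range(0, len(daily), length)]


def aggregate_fortnight(days):
    """Collapse the daily records of one fortnight into a single record.

    Each daily record is a dict with the (hypothetical) keys
    tmax, tmin, rh, ssh and rf.
    """
    return {
        "TMAX": mean(d["tmax"] for d in days),  # mean daily maximum temperature
        "TMIN": mean(d["tmin"] for d in days),  # mean daily minimum temperature
        "ARH": mean(d["rh"] for d in days),     # mean relative humidity
        "SSH": mean(d["ssh"] for d in days),    # mean sunshine hours
        "ARF": sum(d["rf"] for d in days),      # rainfall is accumulated, not averaged
    }
```

Applying `aggregate_fortnight` to each block returned by `split_into_fortnights` gives one row of weather regressors per fortnight of the crop season.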
Methodology used for the study
Figure 1 illustrates the stages involved in preparing the model.
The forecasting models were fitted using data from 1980–81 to 2017–18, and the following 3 years, i.e. 2018–19 to 2020–21, were used to validate the developed models. Initially, a linear regression of rice yield (the response variable) on year (the explanatory variable) was fitted using yield data from 1980–81 to 2017–18. The fitted equation was then used to calculate the residuals, and the residuals were used to classify the crop yield into groups. In the two-group case, negative residuals were assigned a zero and positive residuals a one. In the three-group case, the residuals were arranged in ascending order and the crop yield was divided into three classes: adverse (group 0), normal (group 1), and congenial (group 2). Logistic regression and discriminant function analysis were applied to these two and three groups to obtain the probabilities and scores for the yield forecasting models. A stepwise linear regression method was then used to develop the forecast models, with the derived probabilities or scores and the year as regressors. The accuracy of the fitted models was assessed by comparing the mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute deviation (MAD), and predicted error sum of squares (PRESS). These measures were critical in validating the reliability and performance of the fitted models.
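The detrending and grouping step described above can be sketched in a few lines. The sketch fits the linear trend of yield on year by ordinary least squares and then labels each year from its residual. The text does not state how the ordered residuals were cut into three classes, so equal-sized thirds are an assumption here, and all function names are hypothetical.

```python
def fit_trend(years, yields):
    """Ordinary least squares fit of yield = intercept + slope * year."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(yields) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, yields))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def trend_residuals(years, yields):
    """Observed yield minus the fitted linear trend, year by year."""
    intercept, slope = fit_trend(years, yields)
    return [y - (intercept + slope * x) for x, y in zip(years, yields)]


def two_groups(residuals):
    """Group 0 for negative residuals, group 1 for positive residuals."""
    return [1 if r > 0 else 0 for r in residuals]


def three_groups(residuals):
    """Rank residuals in ascending order and cut them into three blocks.

    Assumption: equal-sized thirds (0 = adverse, 1 = normal, 2 = congenial).
    """
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * 3 // n
    return labels
```

The resulting labels are the class variable to which discriminant function analysis or logistic regression is then applied.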
Forecast models
The techniques of discriminant function analysis and logistic regression were applied to rice yield forecasting. Regression models with logistic probabilities or discriminant scores, along with trend (year), as regressors were fitted for quantitative rice yield forecasting, following Kumari and Kumar9.
Two group method
Discriminant scores
$${\text{Yield}} = \alpha_{0} + \beta_{1} Z + \beta_{2} T + \varepsilon$$
where \(\alpha_{0}\) represented the intercept of the equation; T denoted the time period, measured in years; the \(\beta_{i}\)'s were the coefficients in the regression equation; Z referred to the discriminant score; and \(\varepsilon\) symbolized the error term, assumed to be normally distributed with a mean of 0 and a variance of \(\sigma^{2}\), i.e. \(\varepsilon\) ~ N(0, \(\sigma^{2}\)).
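The text does not spell out how the discriminant score Z is computed. One common choice for two groups is Fisher's linear discriminant with a pooled covariance matrix, sketched below for two weather predictors. This is an illustrative assumption rather than the authors' exact procedure, and the function names and any inputs are hypothetical.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) observations."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter_matrix(rows, centre):
    """2 x 2 sum of squares and cross-products about `centre`."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - centre[0], r[1] - centre[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fisher_weights(group0, group1):
    """Weights w = S_pooled^{-1} (mean1 - mean0) for two predictors."""
    m0, m1 = mean_vector(group0), mean_vector(group1)
    s0, s1 = scatter_matrix(group0, m0), scatter_matrix(group1, m1)
    dof = len(group0) + len(group1) - 2
    p = [[(s0[a][b] + s1[a][b]) / dof for b in range(2)] for a in range(2)]
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    # Explicit inverse of the 2 x 2 pooled covariance matrix.
    inv = [[p[1][1] / det, -p[0][1] / det],
           [-p[1][0] / det, p[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]


def discriminant_score(weights, x):
    """Score Z for one observation x = (x1, x2)."""
    return weights[0] * x[0] + weights[1] * x[1]
```

Under this sketch, the score of each year would enter the regression of yield on Z and T as the regressor Z.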
Ordinal logistic regression
The explanatory variable X and the response variable Y had the following linear relationship:
$$Y = \alpha + \beta X + \varepsilon$$
where \(\alpha\) represented the intercept of the equation; \(\beta\) was the regression coefficient, denoting the change in Y for a unit change in X; and \(\varepsilon\) ~ N(0, \(\sigma^{2}\)).
The model was given by
$${\text{Yield}} = \beta_{0} + \beta_{1} P_{1} + \beta_{2} T + \varepsilon$$
where \(\beta_{0}\) was the intercept of the equation; T represented the time period, expressed in years; \(P_{1}\) denoted the probability of the response variable Y being equal to 1; the \(\beta_{i}\)'s were the regression coefficients, indicating the influence of each explanatory variable on the response variable; and \(\varepsilon\) represented the error term.
Three groups
Discriminant function analysis (scores)
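$${\text{Yield}} = \alpha_{0} + \beta_{1} Z_{1} + \beta_{2} Z_{2} + \beta_{3} T + \varepsilon$$
where \(Z_{1}\) and \(Z_{2}\) represent the discriminant scores and the other terms retain their previously defined meanings.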
Ordinal logistic regression
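$${\text{Yield}} = \beta_{0} + \beta_{1} P_{1} + \beta_{2} P_{2} + \beta_{3} T + \varepsilon$$
where \(P_{1}\) and \(P_{2}\) are the probabilities of Y = 1 and Y = 2 and the other terms retain their previously defined meanings.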
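How the class probabilities arise from an ordinal logistic model can be illustrated with the standard cumulative-logit (proportional odds) form, in which P(Y ≤ j) is a logistic function of a cutpoint minus a linear predictor. The sketch below assumes this form; the cutpoints and the linear predictor `eta` are hypothetical inputs that would come from a fitted model, not values from the study.

```python
import math


def logistic(t):
    """Standard logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def category_probabilities(eta, cutpoints):
    """Class probabilities under a cumulative-logit model.

    eta       : linear predictor built from the weather variables
    cutpoints : increasing thresholds, one fewer than the number of classes
    Returns [P(Y = 0), P(Y = 1), ...], which sum to 1.
    """
    cumulative = [logistic(c - eta) for c in cutpoints] + [1.0]
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs
```

With two cutpoints the function returns three probabilities, of which the second and third play the role of P1 and P2 above. With a single cutpoint it reduces to the two-group case, where the second entry plays the role of P1.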
Comparative performance measures
The performance of a model in predicting rice yield, or any other variable, is typically evaluated using metrics that assess its accuracy, reliability, and generalizability. The standard metrics employed here were the mean absolute percentage error (MAPE), root mean square error (RMSE), mean absolute deviation (MAD), and predicted error sum of squares (PRESS). These metrics offer insight into the accuracy and reliability of a model's predictions, allowing a comprehensive evaluation of its performance (Kumar et al.11).
(a) Predicted error sum of squares (PRESS)
The PRESS statistic was defined as
$${\text{PRESS}} = \sum\limits_{i = 1}^{n} \left( Y_{i} - \hat{Y}_{i} \right)^{2}$$
where \(Y_{i}\) was the value of the dependent variable for the \(i\)th observation, i.e. the rice yield of the \(i\)th year, \(\hat{Y}_{i}\) was the forecast of \(Y_{i}\) computed from the model fitted without the \(i\)th data point, and n was the number of observations. It is generally regarded as a measure of how well a model will perform in predicting the rice yield.
(b) Root mean square error of forecasts
$${\text{RMSE}} = \sqrt{\frac{1}{n}\sum\limits_{i = 1}^{n} \left( Y_{i} - \hat{Y}_{i} \right)^{2}}$$
where \(Y_{i}\) and \(\hat{Y}_{i}\) were the observed and forecasted values of the rice yield, respectively, and n was the number of years for which forecasting was done.
(c) Mean absolute percentage error of forecast
The formula of MAPE is given as:
$${\text{MAPE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} \left| \frac{Y_{i} - \hat{Y}_{i}}{Y_{i}} \right| \times 100$$
where \(Y_{i}\) and \(\hat{Y}_{i}\) were the observed and forecasted values of the rice yield, respectively, and n was the number of years for which forecasting was done.
(d) Mean absolute deviation
The formula of MAD is given as:
$${\text{MAD}} = \frac{1}{n}\sum\limits_{i = 1}^{n} \left| Y_{i} - \hat{Y}_{i} \right|$$
where \(Y_{i}\) and \(\hat{Y}_{i}\) were the observed and forecasted values of the rice yield, respectively, and n was the number of years for which forecasting was done.
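The four measures can be computed as in the following sketch. RMSE, MAPE and MAD follow their standard definitions. For PRESS, the sketch refits a simple one-regressor linear model with each observation left out in turn, which is an illustrative simplification of the multi-regressor forecast models used in the study. All function names are hypothetical.

```python
import math


def rmse(observed, forecast):
    """Root mean square error of the forecasts."""
    n = len(observed)
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(observed, forecast)) / n)


def mape(observed, forecast):
    """Mean absolute percentage error, in percent."""
    n = len(observed)
    return 100.0 * sum(abs(y - f) / abs(y) for y, f in zip(observed, forecast)) / n


def mad(observed, forecast):
    """Mean absolute deviation of the forecasts."""
    n = len(observed)
    return sum(abs(y - f) for y, f in zip(observed, forecast)) / n


def fit_line(x, y):
    """Least squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def press(x, y):
    """Leave-one-out prediction error sum of squares.

    Each observation is predicted from a model fitted without it.
    """
    total = 0.0
    for i in range(len(x)):
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (intercept + slope * x[i])) ** 2
    return total
```

Lower values of all four measures indicate a better-performing forecast model, which is the basis on which the fitted models are compared.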
Results and discussion
The forecasting models were developed using several statistical procedures, viz. multiple linear regression, stepwise regression, ordinal logistic regression and discriminant function analysis, each with its own strengths and considerations. By employing these procedures, the analysis attempted to capture different aspects of the relationship between weather patterns and rice crop yield. Each method may emphasize certain patterns within the data and so offer a different view of how weather variables affect the yield.
Forecast models
The stepwise linear regression technique was used to fit regression models with the probabilities derived from ordinal logistic regression (OLR), the scores from discriminant function analysis15, and the year as independent variables, and with yield as the dependent variable. The resulting rice yield forecast models for different fortnights are detailed in Table 1.
Selected forecast models
The rice yield was divided into two groups: negative residuals representing low yield (0) and positive residuals representing high yield (1). The models used the probability (P) and the score (Z) as predictors, and the outcomes, along with adjusted R\(^{2}\) values, are presented in Table 2. When considering three yield groups, 40 years of yield data were sorted into low (0), medium (1), and high (2) categories based on residuals. The same methodology was applied, but with two probabilities (P1 and P2) and two scores (Z1 and Z2).
Table 2 gives the models selected by stepwise linear regression with discriminant scores (Z), logistic probabilities (P) and time (T) as regressors. The coefficient of determination is 67.7% for the discriminant score models and 64.7% for the logistic regression models for forecasting the rice yield.
Table 3 shows the models developed by stepwise linear regression with discriminant scores (Z1 and Z2), logistic probabilities (P1 and P2) and year (T) as regressors.
In Table 4, the values of the performance measures PRESS, RMSE, MAPE and MAD were lowest for the three-group model based on discriminant scores. Figure 2 plots the observed and predicted yields to provide a clear visual comparison of the selected models. It can be seen from Fig. 2 that the yield decreased suddenly in 1996 and 1999. Such a yield gap may occur in irrigated rice under tropical climates, and the sudden increase in 2007–2008 may be due to more favourable conditions than in other years16. The average absolute relative deviation for the training set was approximately 28.08%, while for the selected model, i.e. discriminant function analysis based on scores, it was 3.37%.
Discussion
The results of the present study showed significant variations in rice yield due to the effect of different weather parameters. The study emphasized the importance of timely and reliable crop forecasts for an agrarian economy, which are essential for planning, policy formulation, and implementation related to crop procurement, price structure, distribution, and import–export decisions. It included fitting stepwise linear regression models with yield as the response variable and weather parameters and year as regressors, and studying ordinal logistic regression and discriminant function analysis for forecasting the crop yield based on weather variables. The objectives were to create a yield forecast model for the rice crop using ordinal logistic regression with weather data and to compare the performance of ordinal logistic regression and discriminant function analysis.
Johnson et al.17 studied the relationship between weather conditions and outbreaks of potato late blight in the semiarid region of south-central Washington. They used linear discriminant analysis and logistic regression analysis to forecast late-blight outbreaks based on weather variables, and found that the logistic regression model outperformed discriminant function analysis in predicting outbreaks. Hassan et al.18 reported that discriminant analysis and binary logistic regression enabled more accurate prediction of autism spectrum disorder than principal component analysis. In a study of COVID-1919, multivariate analysis was used to identify the key distinguishing factors between COVID-19 patients and healthy individuals, as well as between severe and moderate cases. Discriminant analysis and binary logistic regression models achieved a classification accuracy ranging from 71 to 100%, and the differentiation of severe from moderate cases was primarily associated with reduced levels of natural killer cells and activated class-switched memory B cells, higher neutrophil frequency, and decreased HLA-DR expression on monocytes in severe COVID-19 patients. Garde et al.20 studied different statistical models based on weather parameters, developing the models using data from 1990 to 2012 and validating them using the remaining data from 2013 to 2016. The adjusted R\(^{2}\) values ranged from 73.00 to 93.30% across the models, and the best forecast model was selected based on high adjusted R\(^{2}\) values, forecast error, and RMSE. In Navsari district, the discriminant function analysis technique (Model-5) was found to be superior to logistic regression analysis (Model-12) for pre-harvest forecasting of rice crop yield.
The present study used stepwise linear regression, ordinal logistic regression, and discriminant function analysis based on scores for forecasting the rice yield. The detrended yield was divided into two and three categories, with years and weather variables considered as regressors, and the accuracy of the fitted models was compared using PRESS, RMSE, MAPE and MAD. With both two and three categories, discriminant function analysis (scores) performed better than the ordinal logistic regression method. The best time for forecasting rice yield was found to be 1 month before harvesting, i.e. the 21st fortnight.
Conclusion
In this study, we explored the prediction of rice yield using different statistical methods, namely regression analysis, ordinal logistic regression, and discriminant function analysis, and highlighted their roles in forecasting the rice yield. Each approach played a part in understanding, modeling, and categorizing the yield data. Several multivariate models were fitted using fortnightly weather data by categorizing the yield data into two and three groups. The performance of the selected models was compared using PRESS, MAPE, RMSE, and MAD to assess their effectiveness. According to the findings of the present study, discriminant function analysis based on scores provided the best results compared with the other methods. Hence, it is concluded that discriminant function analysis (scores) with three groups performed best among the multivariate techniques considered. The results of this study may be useful for policy planners and other stakeholders in making informed judgments about how to set up domestic and international trade, distribution, storage, and procurement, as well as how to manage adequate inventories ahead of time. In future studies, a system based on crop growth simulation models may be used to forecast the crop yield from meteorological parameters.
Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
References
Anonymous. https://en.wikipedia.org/wiki/Rice_production_in_India.
Pathak, H., Tripathi, R., Jambhulkar, N., Bisen, J. & Panda, B. Eco-regional-based rice farming for enhancing productivity, profitability and sustainability (2020).
Rajavel, M. et al. Development of rice yield forecast in mid-season using weather indices based agrometeorological model in Chhattisgarh. Vayu Mandal 44, 38–45 (2018).
Niedbała, G. & Kozlowski, J. Application of artificial neural networks for multi-criteria yield prediction of winter wheat. J. Agric. Sci. Technol. 21, 51–61 (2019).
Yadav, R., Sisodia, B. & Sunil, K. Application of principal component analysis in developing statistical models to forecast crop yield using weather variables. Mausam 65, 357–360 (2014).
Devi, M., Kumar, J., Malik, D. & Mishra, P. Forecasting of wheat production in Haryana using hybrid time series model. J. Agric. Food Res. 5, 100175 (2021).
Goyal, M. Use of different multivariate techniques for pre-harvest wheat yield estimation in Hisar (Haryana). Int. J. Comput. Math. 12, 6–11 (2016).
Sharma, A. et al. Prediction of wheat yield using ordinal logistic regression based on weather parameters. Environ. Ecol. 1880–1885 (2022).
Kumari, V. & Kumar, A. Forecasting of wheat (Triticum aestivum) yield using ordinal logistic regression. Indian J. Agric. Sci. 84, 691–694 (2014).
Kumari, V., Agrawal, R. & Kumar, A. Use of ordinal logistic regression in crop yield forecasting. Mausam 67, 913–918 (2016).
Kumar, J., Devi, M., Verma, D., Malik, D. & Sharma, A. Pre-harvest forecast of rice yield based on meteorological parameters using discriminant function analysis. J. Agric. Food Res. 5, 100194 (2021).
Kumari, V., Aditya, K., Chandra, H. & Kumar, A. Bayesian discriminant function analysis based forecasting of crop yield in Kanpur district of Uttar Pradesh (2019).
Goyal, M. & Verma, U. Spectral-weather–crop yield forecasting: Discriminant function analysis. J. Appl. Probab. 10, 1–14 (2015).
Priya, S. R. K., Balambiga, R. K., Mishra, P. & Das, S. S. Sugarcane yield forecast using weather based discriminant analysis. Smart Agric. Technol. https://doi.org/10.1016/j.atech.2022.100076 (2023).
Farshadfar, E., Romena, H. & Safari, H. Evaluation of variability and genetic parameters in agro-physiological traits of wheat under rain-fed condition. Int. J. Agric. Crop Sci. 5, 1015 (2013).
Duwayri, M., Tran, D. & Nguyen, V. Reflections on yield gaps in rice production. Int. Rice Comm. Newsl. 48, 13–26 (1999).
Johnson, D. A., Alldredge, J. R. & Vakoch, D. L. Potato late blight forecasting models for the semiarid environment of south-central Washington. Phytopathology 86, 480–484 (1996).
Hassan, W. M., Al-Dbass, A., Al-Ayadhi, L., Bhat, R. S. & El-Ansary, A. Discriminant analysis and binary logistic regression enable more accurate prediction of autism spectrum disorder than principal component analysis. Sci. Rep. 12, 3764 (2022).
Bean, J. et al. Multivariate indicators of disease severity in COVID-19. Sci. Rep. 13, 5145 (2023).
Garde, Y., Banakara, K. & Pandya, H. Different statistical models based on weather parameters in Navsari district of Gujarat. MAUSAM 74, 795–806 (2023).
Acknowledgements
The authors of this article would like to express their sincere gratitude to the members of the research team who contributed to the successful completion of this study.
Author information
Contributions
A.J. conceived and designed the project; A.J., J.K., M.R., P.K., M.G., N.P., P.F., and M.R. wrote and revised the MS. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Sharma, A., Kumar, J., Redhu, M. et al. Estimation of rice yield using multivariate analysis techniques based on meteorological parameters. Sci Rep 14, 12626 (2024). https://doi.org/10.1038/s41598-024-63596-6
DOI: https://doi.org/10.1038/s41598-024-63596-6