Introduction

Traffic accidents are one of the most common events in the world and also one of the leading causes of unnatural deaths. According to data on the age distribution of all-cause mortality collected in 2019, road traffic accidents are a crucial cause of death for children and young people aged 5–29 years, and the 12th leading cause of death when considering all age groups1. The occurrence of traffic accidents not only poses a threat to human health but also directly leads to property losses. According to the People’s Republic of China statistical communiqué on national economic and social development in 2023 (National Bureau of Statistics of China, http://www.stats.gov.cn/), the death toll from road traffic per 10,000 vehicles was 1.38 people. Furthermore, according to the Chengdu Statistical Yearbook, direct economic losses caused by traffic accidents in Chengdu reached 4.11 million yuan in 2022 (Chengdu Municipal Statistics Bureau data, https://cdstats.chengdu.gov.cn/). In recent years, the frequency of traffic accidents in China has improved compared to the past. However, China is still the world’s most populous country, ranking second in the number of traffic accidents, just behind India. In addition, the casualties, psychological disorders, and property losses caused by traffic accidents remain significant issues for the victims themselves, their families, and society as a whole. Identifying the factors that influence traffic accidents and presenting alerts based on these factors can be an effective approach to reduce the occurrence of accidents. To further clarify the factors that contribute to traffic accidents, this article focuses on air pollution and climate conditions(APCC), conducting an in-depth analysis to find the details of the impacts of APCC on road traffic accidents.

Against the backdrop of increasingly prominent global environmental issues, China has proposed the carbon peak and carbon neutrality goals and outlines the timeline and roadmap for these goals in the 14th Five-Year Plan, committing to realizing sustainable development and enhancing the construction of ecological civilization. China has provided a strong impetus for the international community in addressing climate change and fully implementing the Paris Agreement. Furthermore, China has injected new momentum into the global green recovery post-pandemic and the shared effort to build a global community of life on earth. The successful introduction and implementation of green and low-carbon transportation policies have led to the widespread adoption of shared bicycles and electric vehicles, significantly contributing to the achievement of carbon neutrality in China’s transportation sector. A report jointly released by Aurora Mobile (NASDAQ: JG) and Xi’an Jiaotong University (https://www.moonfox.cn/insight/report/944) shows that, based on data from January 2009 to April 2021, approximately 8 million shared electric bicycles have been deployed nationwide, which has led to a reduction of about 1.636 million tons of carbon emissions per year. Furthermore, 31% of shared electric bike trips have replaced high-carbon travel modes, mainly private cars and motorcycles.

In recent years, Chinese authorities have repeatedly emphasized the importance of protecting and improving people’s livelihoods. Starting with addressing the prominent issues that the public is most concerned about, efforts have been made to ensure the well-being of the people. According to the China Statistical Yearbook 2023 (National Bureau of Statistics of China, https://www.stats.gov.cn/sj/ndsj/2023/indexch.htm), a total of 256,409 traffic accidents occurred in China in 2022, resulting in direct property losses of 1.24 billion yuan. Among them, the number of traffic accidents involving motor vehicles was 215,627, representing approximately 84. 1% of the total. According to data for 2019 on the age distribution of all-cause mortality, road traffic accidents are the leading cause of death for children and young people aged 5–29 years and the 12th leading cause of death when considering all age groups1. In terms of motorcycles alone, one of the most common types of transportation, the risk of death in a traffic accident for motorcyclists is 34 times higher than for those driving other types of motor vehicle per mile traveled. Also, they are 8 times more likely to be injured than others2. Relevant authorities attach great importance to the casualties and economic losses caused by traffic accidents, continuously improve the system of traffic legal system, and strengthen the management of traffic hazards at their sources. Therefore, a more in-depth study is needed on the causes of road traffic accidents.

The factors that lead to road traffic accidents are complex, and the occurrence is mainly affected by three aspects: driver characteristics, driving skills, and environmental factors3. The World Health Organization(WHO) believes that air pollution has a significant impact on human health. Although there has been no strong evidence to show that air pollution is necessarily associated with specific diseases, it has already posed a severe threat to human well-being4. According to research conducted by Global Burden of Disease (GBD), in 2015, China ranked among the top 10 countries with the highest death rates due to air pollution. Additionally, the study indicated that exposure to air pollution contributed to 1.1 million premature deaths in China, with 1478.9 disability-adjusted life-years (DALYs) per 100,000 people5. The total DALYs of all individuals, reflecting the degree of disease burden, can be used to measure the gap between the current health status and the ideal health status (no one suffers from disease or disability) of a population. Meanwhile, people’s ability to perceive risk and their coping behavior would be affected when living under adverse air conditions for an extended period6.

In recent years, due to the boom in mobile internet technology, the food delivery industry in China has experienced rapid development. According to the China Internet Network Information Center (CNNIC), as of December 2023, the number of online takeaway food delivery users in China reached 545 million, an increase of 23.38 million compared to December 2022, accounting for 49.9% of the total Internet users. The growth of the takeaway industry not only has brought convenience to the public but also has generated economic benefits for society. Most delivery workers use motorcycles or electric bikes for delivery, and the widespread adoption of food delivery services has led to an increase in the number of electric bikes and motorcycles. This has aroused widespread public concern and debate over road transport safety. Malka et al. applied occupational safety and ambidexterity theory, the study explored the factors contributing to the reduction of work-related road accidents. The researchers found that when customer service is prioritized, an increase in road safety priority is not associated with a decrease in road accidents, and vice versa. Furthermore, a dual-priority situation on both road safety and customer service leads to a higher road accident rate compared to a sole focus on road safety7. Under the ’on-time delivery’ requirement of food delivery companies, common traffic violations such as running red lights, speeding, and driving against traffic have become increasingly prominent among delivery workers, resulting in significant traffic accidents.

According to the China Urban Transportation Report for 2023 (https://jiaotong.baidu.com), as a megacity, Chengdu’s commuting congestion index reached 1.708 in Q3 of 2023, ranking seventh among cities of similar size. The Chengdu weekend congestion index was ranked fifth nationwide. In recent years, with increasing commuting distances, many commuters have also chosen the ‘subway + shared bicycle’ combination for their daily commute. Furthermore, as reported by Zhicheng Data Platform in 2023 (https://www.datayicai.com), based on Ele.me data, one of China’s leading food delivery companies, regarding orders from June 2022 to May 2023, Chengdu ranked sixth nationwide in terms of 24-h retail food delivery market activity. As a new first-tier city, Chengdu’s nighttime retail vitality is comparable to first-tier cities. Due to the aforementioned reasons, this article chooses Chengdu as the research subject to explore the occurrence of road traffic accidents in-depth and analyze potential causes of road traffic accidents from environmental causes. This article primarily utilizes data on air pollution (including pollutants such as ozone, CO, etc.) and weather conditions (including temperature, precipitation, and other indices). The number of emergency calls due to road traffic accidents in Chengdu from 2022 to 2023 is applied as a standard to assess the occurrence of traffic accidents in the city during this period. The aim is to identify the potential correlation between road traffic accidents and APCC in Chengdu.

With increasing public concern about environmental pollution, the potential impact of air quality and weather conditions on road traffic safety has also attracted more and more attention from researchers and policymakers. Existing studies have shown that traffic density is positively correlated with air pollutants such as \(\mathrm {NO_2}\)8. Sager adopted an instrumental variable (IV) approach to estimate the causal effect of air pollution on accidents. Based on data within 153 NUTS regions in the UK between 2009 and 2014, the study demonstrated that exogenous \(PM_{2.5}\) spikes induced by temperature inversions increased traffic collision risks by 0.3–0.6% per \(\upmu \text {g/m}^3\), establishing air pollution as a modifiable risk factor for traffic accidents9. Moreover, it is found that air pollution has significant short-term negative effects on health, especially on the respiratory and cardiovascular systems10. These health problems may indirectly affect the driver’s cognitive ability and reaction speed, thereby increasing the risk of traffic accidents. Logically thinking, as the number of vehicles on the road increases, the likelihood of accidents and collisions also increases. Zou et al. explored the correlation of indices related to fatal traffic accidents in California and Arizona using a negative binomial model and a log-change model. Their study found that increases in temperature and precipitation were associated with a significant increase in the number of traffic accidents11. Furthermore, other studies have shown a correlation between air quality and residents’ emotional responses12. Based on social media postings, Shan et al. investigated the relationship between \(PM_{2.5}\) and people’s emotions. The results indicate that the emotional intensity of netizens is significantly positively affected by levels of \(PM_{2.5}\)13. Similarly, Baryshnikova et al. examined the relationship between air pollution and motor vehicle collisions in New York City and concluded that air pollution not only harms health, but also leads to costly social impacts unrelated to health14.

Based on a city-level daily panel data of over 470,000 traffic accident cases in China between 2014 and 2018, Wang et al. found that in contrast to warm days, the occurrence of traffic accident crimes increased by 11.9% on days when the maximum temperature exceeded 100\(^{\circ }F\)15. This indicates that high temperature has a significant negative impact on road safety in China, with variations between different populations and regions. Additionally, weather indices such as temperature, humidity, and precipitation can affect drivers’ emotions, perceptions, and driving conditions16, which in turn are associated with traffic accidents.

This paper contributes to the analysis of potential factors that influence traffic accidents in the following aspects. First, three datasets are collected from Chengdu, a major city in southwestern China, where the local government has implemented numerous measures to control air pollution, reflecting a high level of attention to this issue. Second, through regression analysis, linear and nonlinear models are established to explore the relationship and extent of the impact of various APCC indices on the number of traffic accidents, while also examining the correlation between accidents and weekends. Third, a procedure combines stepwise regression and multicollinearity tests is conducted to select the variables for model construction. After that, correlated variables with less multicollinearity are reserved for study.

Several statistical methods are adopted in the procedure. The Pearson correlation test is applied initially to capture the correlations between variables, and the test result indicates a strong correlation between them. When multicollinearity problems arise, the standard errors of the coefficients can be inflated, resulting in the statistical insignificance of variables that would otherwise be significant17. In order to prevent regression model from multicollinearity problems, the variance inflation factor (VIF) is utilized to quantify the degree of multicollinearity between independent variables. The variables are then preliminarily selected by a stepwise regression based on the Akaike Information Criterion18. In multiple regression analysis, it is important not to rely solely on collinearity diagnostic measures (such as VIF, condition number, etc.), but to integrate other factors for a comprehensive evaluation19. By removing some overlapping information, six independent variables are finally reserved and a linear regression model is constructed based on them.

However, certain nonlinear interrelationships between variables remain unexplored. In the procedure, nonlinear analysis including the eXtreme Gradient Boosting decision tree (XGBoost)20 and the Jackknife Mutual Information (JMI) estimator is considered to capture this complex nonlinear relationship. The JMI method is proposed by Zeng et al.21. Similarly to how the Pearson correlation coefficient measures linear relationships22, Mutual Information (MI) quantifies nonlinear dependencies between variables23. By ranking variables based on their marginal importance using MI, the most suitable variables can be selected to construct nonlinear models. Unlike traditional MI estimators, the JMI method by Zeng et al. is data-driven and does not rely on tuning parameters like bandwidth selection, making it more robust for quantifying nonlinear dependencies between variables. Similarly, Wang et al. utilized XGBoost to unravel the intricate nonlinear relationships between five categories of independent variables and public transport usage among older adults in China24. Based on the result of the nonlinear analysis, generalized additive models are constructed to further elucidate the nonlinear interrelationships among variables.

The rest of this article is framed as follows. Firstly, in Materials and Methods, three datasets utilized in this article are introduced, which contain aspects of road traffic accident records and APCC indices, respectively. In Results, descriptive statistics is made. As an early-stage preparation, a generalized introduction is illustrated on characteristics of road safety and APCC indices. To build a linear regression model, several methods, including stepwise regression, are adopted for variable selection. Based on the AIC criterion, the variables are initially screened. Later, multicollinearity tests are conducted to compare and retain variables with less significant multicollinearity. Standard data is then used to establish a multivariate linear regression model, preliminarily revealing the correlation between traffic accidents in Chengdu and APCC. To further explore the potential impact of APCC on traffic accidents, an in-depth discussion of the nonlinear dependencies between variables is conducted, using the XGBoost model and a jackknife version of mutual information21. The results indicated that nonlinear models achieve better fit performance than linear ones. Finally, in Section , the contributions and limitations of this study are discussed, along with future directions. Conclusions are given in Section .

Materials and methods

Data source

This article comprehensively consider three datasets to explore the association between traffic accidents occurrences and APCC. The first dataset comprises data from network hospital medical record data for emergency calls made between January 1, 2022 and November 30, 2023, provided by the Chengdu First Aid Command Center. In instances where multiple injuries arise from the same accident, duplicated records are removed based on the “event number”. After the data process, variables such as date, traffic accident numbers, and whether the date is on weekends are reserved.

The climate-related conditions indices in the second dataset are sourced from NASA POWER (https://power.larc.nasa.gov/). Chengdu is the target ___location, with coordinates of 104.0266 longitude and 30.7066 latitude. The observation period corresponds to that of the first dataset. All indices, except for earth skin temperature(TS), are measured at 2 meters above the ground, including temperature (T2M), relative humidity(\(RH_{2M}\)), precipitation corrected (PRECTOTCORR, henceforth abbreviated as PORC).

The air pollution indices in the third dataset are downloaded from the Environment Big Data (https:/www.ebd120.com/), which are collected from the China General Environmental Monitoring Station, and the real-time air quality release platforms in provinces and cities. Data contain daily average values of six common air pollutants, including particulate matter with diameter less than or equal to 2.5\(\upmu \text {m}\) (\(PM_{2.5}\)), particulate matter with diameter less than or equal to 10\(\upmu \text {m}\)(\(PM_{10}\)), sulphur dioxide (\(SO_2\)), nitrogen dioxide (\(NO_2\)), carbon monoxide (CO) and ozone (\(O_3\)). Data for these pollutants were monitored at 12 fixed stations across the city (Junping Street, Dashi West Road, Jinquan Lianghe, Shahepu, Sanwayao, Shilidian, Longquanyi District Government, Checheng East 7th Road, Qingbaijiang Technician Branch, and Science City, Jinbo Road, Huayang District).

The three datasets are constructed based on comprehensive city-wide data from Chengdu, integrating observations collected across multiple geographically distributed sites. The collection of RTA data is based on the emergency calls reported to the 120 First Aid Command Center. Consequently, recording factors such as traffic volume or road construction conditions at the actual accident locations poses significant challenges. challenges. In addition, this article aims to investigate the relationship between road traffic accident counts and overall daily pollution levels and weather conditions. As the data involved is collected and analyzed on a daily basis, time-of-day variations such as hourly fluctuations in traffic patterns or pollution exposure are not incorporated into the analysis.

Based on the above datasets, the ambulance dispatch time from the emergency center is adopted as the statistical standard for the time point of traffic accidents. APCC indices are applied as explanatory variables to facilitate further analysis.

Statistical analysis

Section 3, as illustrated in Fig. 1, explores the potential correlation between traffic accidents and APCC indices, from both linear and nonlinear perspectives. Through the Pearson correlation coefficient test and the variance inflation factor (VIF) method, this article performs a preliminary screening of variables with strong correlations and multicollinearity. Subsequently, on the basis of Akaike’s information criterion (AIC), the model with the lowest multicollinearity is retained. VIF is again applied for a multicollinearity test and then a multivariate linear regression model is constructed. Although linear models can provide an initial assessment of the linear relationships between the response variable(traffic accident numbers) and the explanatory variables, they are limited in capturing the potential nonlinear dependencies. For instance, the impact of certain pollutants on accident numbers may plateau or fluctuate under specific conditions. Similarly, climate indices such as temperature and precipitation conditions may exert varying degrees of influence depending on their ranges, leading to nonlinear interactions. Consequently, this article integrates linear and nonlinear models, aiming to explore the complex relationship between variables more thoroughly, to enhance the explanatory and predictive power of the model.

This article prioritizes additive models (GAM and XGBoost with additive constraints) over other black-box approaches to ensure alignment with public health needs. GAM models can capture complex nonlinear relationships between variables without defining specific functions. The results can be interpreted by checking the shape of the smoothing functions. XGBoost is subsequently adopted as a complementary machine learning method due to the additive modeling structure through gradient boosting, which aligns with GAM in interpreting. Additionally, XGBoost is able to visualize feature contributions via SHAP (SHapley Additive exPlanations) values. These two approaches ensure robustness by comparing the statistically nonlinear inference of GAM with the high predictive accuracy of XGBoost in high-dimensional cases.

Fig. 1
figure 1

Steps of analysis in section3.

Given the response variable y, where f(x) and g(z) represent functions with potential nonlinear dependencies, a general regression problem involving multiple variables can be expressed as

$$\begin{aligned} y=f(x)+g(z)+\epsilon . \end{aligned}$$
(1)

Based on the Taylor series, f(x) and g(z) can be expanded around \(x=x_0\) and \(z=z_0\) respectively, which is

$$\begin{aligned} y&=f(x)+g(z)+\epsilon \nonumber \\&=f(x_0)+f'(x_0)(x-x_0)+\frac{f''(x_0)}{2}(x-x_0)^2+g(z_0)+g'(z_0)(z-z_0)+\frac{g''(z_0)}{2}(z-z_0)^2+\epsilon \nonumber \\&=f(x_0)+g(z_0)-x_0f'(x_0)-z_0g'(z_0)+f'(x_0)x+g'(z_0)z+\frac{1}{2}[(x-x_0)^2f'(x_0)+(z-z_0)^2g'(z_0)]+x^*+z^*\epsilon \nonumber \\&=\beta _0+\gamma _0+(1-x_0)f'(x_0)x+(1-z_0)g'(z_0)z+\frac{1}{2}f'(x_0)x^2+\frac{1}{2}g'(z_0)z^2+\epsilon , \end{aligned}$$
(2)

where \(x^*=\frac{f''(x_0)}{2}(x-x_0)^2\), and \(z^*=\frac{g''(z_0)}{2}(z-z_0)^2.\) In statistical regression, the first- and second-order coefficients of x and z can be regarded as the coefficients obtained from regression. Specifically,

$$\begin{aligned} y=\beta _0+\gamma _0+\beta _1x+\gamma _1z+\beta _2x^2+\gamma _2z^2+\epsilon . \end{aligned}$$
(3)

According to Eq.(3), when only the linear relationships in the nonlinear dependencies f(x) and g(z) are considered, the analyzed model becomes

$$\begin{aligned} \left\{ \begin{aligned} y&=\beta _0+\gamma _0+\beta _1x+\gamma _1z+\epsilon ^*,\\ \epsilon ^*&=\beta _2x^2+\gamma _2z^2+\epsilon .\\ \end{aligned} \right. \end{aligned}$$
(4)

This simplification highlights the limitation of purely linear models when underlying nonlinear dependencies exist. In such cases, important information about higher-order effects, represented by the quadratic terms in \(\epsilon ^*\), is omitted. This loss of information reduces the model’s explanatory power and may result in inaccurate or incomplete interpretations of the relationships between variables. Therefore, for complex issues such as the correlation between traffic accidents and APCC, it is essential to conduct tests for potential nonlinear dependencies and incorporate them into the modeling process.

Although the Pearson correlation coefficient method is a powerful tool for measuring the linear correlation between variables, it is ineffective in diagnosing nonlinear dependencies. Mutual Information (MI), a fundamental concept in information theory, has been widely used to quantify dependencies between variables. However, existing estimation methods often exhibit unstable statistical performance due to their reliance on a set of tuning parameters. To address this issue, Zeng et al. proposed a Jackknife approach to the estimation of the mutual information estimation method (JMI)21. Under conditions where the bandwidths are set equally, JMI achieves a unique asymptotic maximum. Furthermore, Zeng et al. demonstrated the stability of JMI in performing nonlinear dependency tests. In Results, this article employs the JMI method to investigate potential nonlinear dependencies among variables.

Results

Study population

Table 1 presents the descriptive statistics of road traffic accidents (RTA) and APCC indices from January 1, 2022, to November 30, 2023. During this period, approximately 80 days of air pollution data are missing. In the subsequent analysis, we employ the Multiple Imputation method to address missing values in the selected environmental pollution index. Upon reestimating the model, no significant difference in the t test of coefficients is found compared to the model using Listwise-deletion (detele the rows that contain missing data). Consequently, approximately 80 days of missing data is ultimately removed. Details are provided in subsequent sections.

Based on data within the scope of the study, the average daily number of vehicle traffic accidents in Chengdu is 102.97. During this period, the minimum number of such accidents in a single day is 11, and the maximum is 192, indicating significant variability. However, on most days, the number of accidents fluctuates between 89 and 122. During the 23-month period, which is nearly two years, Chengdu’s daily average temperature, earth skin temperature, humidity, and precipitation are \({17.83}^{\circ }\)C, \({17.97}^{\circ }\)C, 74.19% and 3.12 mm, coincide with the characteristics of its subtropical monsoon climate. The average daily concentrations of air pollutants \(PM_{2.5}\), \(PM_{10}\), \(SO_2\), \(NO_2\), CO, \(O_3\) are 38.52\(\upmu \text {g/m}^3\), 58.96\(\upmu \text {g/m}^3\), 3.83\(\upmu \text {g/m}^3\), 28.75\(\upmu \text {g/m}^3\), 0.63\(\upmu \text {g/m}^3\) and 90.66\(\upmu \text {g/m}^3\), corresponding to an average air quality level of “good”.

To further investigate the impact of air pollution degree and weekends on the number of traffic accidents, we introduce two binary variables based on the third dataset: “Air Pollution Degree” (polldegree) and “Is Weekend” (IsWeekend). This enhancement allows us to explore the potential effects of general air quality and rest days on accident occurrences, thereby enriching the regression analysis model.

Table 1 Descriptive statistics of RTA and APCC indices in Chengdu (2022.1.1-2023.11.30).

Variable selection

Figure 2 presents a heatmap of correlation coefficients among variables, where deeper colors indicate stronger correlations between pairs of variables. The heatmap reveals significant correlations among certain APCC indices, indicating the presence of multicollinearity. To ensure reliable estimates in subsequent modeling analyzes, accurately reflect real-world conditions, and obtain precise significance test results, it is necessary to process these variables further. In particular, \(PM_{2.5}\) and \(PM_{10}\), as well as TS and T2M, exhibit strong correlations. The existence of multicollinearity between these pairs will be demonstrated in further detail in subsequent analyses.

Fig. 2
figure 2

Plot of correlation coefficients of variables.

To further select variables with low multicollinearity, the Variance Inflation Factor (VIF) is first employed for evaluation. For the i-th independent variable \(X_i\) in the regression model, the VIF can be calculated by

$$VIF_i=1/(1-R_i^2),$$

where \(R_i^2\) represents the unadjusted coefficient of determination for regressing \(X_i\) on the remaining independent variables. \(VIF>10\) indicates significant multicollinearity among variables. As shown in Table 2, \(PM_{2.5},\) \(PM_{10},\) T2M (temperature) and TS (earth skin temperature) exhibit high correlations, consistent with the observations in Fig.2.

To mitigate the impact of multicollinearity, a stepwise regression based on the Akaike Information Criterion (AIC) is applied for preliminary variable selection. Based on AIC, the trade-off between model fit and complexity can be quantified, which is

$$AIC=2p-2ln(L),$$

p is the number of parameters in this model and L represents the maximum likelihood. A lower AIC indicates a better balance of accuracy and simplicity. In a forward stepwise regression procedure, it always starts with a baseline model. Then the predictors that reduce AIC most significantly will be added to the model, repeating this step until no remaining variables could improve AIC. The results indicate that excluding the variable of pollution degree (polldegree), \(PM_{10}\), ozone (\(O_3\)) and \(SO_2\) yields the lowest AIC value for the model.

Subsequent reassessment of multicollinearity among variables reveals persistent strong correlations between certain variables, as shown in Table 3. This article further refines the variables by removing those with overlapping information. Between TS (earth skin temperature) and T2M (temperature), TS (hereafter referred to as “temperature”) is retained. And \(PM_{10}\) is excluded. This adjustment further reduces the multicollinearity in the model. Subsequently, the correlation coefficients among the retained variables \(NO_2\), CO, TS, \(RH_2M\), PORC and IsWeekend are examined, resulting in the correlation matrix shown as

$$\begin{aligned} \begin{pmatrix} & SO_2 & NO_2 & CO & TS & RH_{2M} & PORC & IsWeekend\\ SO_2 & 1.00 & 0.45 & 0.39 & 0.05 & -0.07 & -0.18 & <-0.01\\ NO_2 & 0.45 & 1.00 & 0.66 & -0.36 & -0.11 & -0.40 & -0.09\\ CO & 0.39 & 0.66 & 1.00 & -0.35 & 0.21 & -0.27 & -0.02\\ TS & 0.05 & -0.36 & -0.35 & 1.00 & -0.26 & 0.26 & -0.01\\ RH_{2M} & -0.07 & -0.11 & 0.21 & -0.26 & 1.00 & 0.27 & -0.03\\ PORC & -0.18 & -0.40 & -0.27 & 0.26 & 0.27 & 1.00 & -0.06\\ IsWeekend & -0.01 & -0.09 & -0.02 & -0.01 & -0.03 & -0.06 & 1.00 \end{pmatrix} \end{aligned}$$

The three air pollution indices -\(\mathrm {SO_2}\), \(\mathrm {NO_2}\) and \(\textrm{CO}\) - exhibit relatively high correlation coefficients, with \(\mathrm {NO_2}\) correlated with the other two pollutants at 0.45 and 0.66, respectively. This suggests a strong interrelationship among these pollutants. To mitigate the impact of multicollinearity on parameter estimation, \(\mathrm {NO_2}\) is retained as the representative air pollutant index25.

To assess multicollinearity among the remaining indices, the kappa value (condition number) is utilized as a diagnostic measure. It is derived from the eigenvalues of the design matrix, which is

$$k=\lambda _{max}/\lambda _{min},$$

\(\lambda _{max}\) and \(\lambda _{min}\) represents the largest and the smallest eigenvalue of \(X'X\). A kappa value less than 100 suggests minimal multicollinearity. The analysis shows that \(k \approx 5.12\), which is well below the threshold of 100. Therefore, it can be concluded that multicollinearity in the current model is negligible and does not significantly impact the model fit.

Table 2 Results of the VIF test for variables.
Table 3 VIF results with some indices excluded.

Results of multivariate linear regression

After standardizing the independent variables, excluding the binary variable IsWeekend, this article constructs a model with the number of traffic accidents as a dependent variable. Multivariate linear regression is a statistical method for modeling the linear relationship between a continuous dependent variable and multiple independent variables. A multivariate linear model can be expressed as

$$\varvec{Y}=\varvec{x} \varvec{\alpha }+\varvec{\epsilon },$$

where \(\varvec{Y}=(y_{ij})_{n\times 1}\), which is the dependent variable. \(\varvec{x}=(x_{ij})_{n\times p}\), \(j=1,2,\dots ,p\), are the independent variables. p denotes the number of independent variables and n represents the total sample size. \(\varvec{\alpha }=(\alpha _{ij})_{p\times 1}\) is the coefficient vector and \(\varvec{\epsilon }=(\epsilon _{ij})_{n\times 1}\) is the error term. In this article, the dependent variable is the number of RTAs that occur every day, and the independent variables are TS (earth skin temperature), \(\textrm{NO}_2\) concentration,\(\mathrm {RH_{2M}}\) (relative humidity), PORC (precipitation) and IsWeekend (whether it is on weekends). And coefficient estimates for the linear model are derived by ordinary least squares (OLS) estimation, whose objective function is

$$min_{\alpha }\sum _{i=1}^{n}(y_i-x_i^T\alpha )^2.$$

The results presented in Table 4 indicate that all variables, except \(\textrm{RH}_{2 \textrm{M}}\) (relative humidity), are significantly associated with the occurrence of traffic accidents. Increases in TS, \(\textrm{NO}_2\) concentration,\(\mathrm {RH_{2M}}\), and PORC have varying degrees of impact on the frequency of traffic accidents. Notably, fewer accidents occur on weekends compared to weekdays.

Table 4 Regression result of remaining index variables and RTAs.

We conduct further validation of results the model. A Breusch-Pagan test (BP=36.24, p value<0.05) confirms heteroscedasticity in the model, with disturbance terms exhibiting non-constant variances. The potential reasons could be undiscovered nonlinear relationships between variables, indicating possible model misspecification. To further modify the model, the Ramsey RESET test is also utilized by introducing squared and cubic terms, aiming to verify whether the existing linear model adequately captured linear relationships between variables while ensuring no omitted nonlinear terms26. The test result (RESET=15.58, p-vlaue<0.05) indicates that incorporating nonlinear terms into the model is statistically justified.

Additionally, the degrees of freedom in the F test and the t test are sufficiently large, the significance tests for model and parameters in multiple linear regression approximately follow a normal distribution and exhibit robustness to violations of error term assumptions. A cointegration test is also performed on the model. Based on the results of the Augmented Dickey-Fuller (ADF) test (p = 0.01 < 0.05), the null hypothesis (i.e., no unit root exists) is rejected, indicating a long-term equilibrium relationship between the independent variables and the dependent variable.

To address potential heteroscedasticity in regression residuals, this article utilizes the Weighted Least Square Estimation to build a model. Unlike ordinary least squares (OLS) estimation, weighted least squares (WLS) assigns specific weights to address heteroscedasticity in the error term. The objective function can be expressed as

$$min_{\alpha }\sum _{i=1}^{n}\omega _i(y_i-x_i^T\alpha )^2,$$

\(\omega _i\) represents the weights of the i-th observation. In this article, we let \(\omega _i=1/{|\sigma _i|},\) \(\sigma _i\) denotes the i-th residual. The result shows that the coefficient estimates for variables \(NO_2\), TS and PORC are 3.28, 15.45, and 3.37, with the corresponding p values of \(3.24\times 10^{-14}\), \(2\times 10^{-16}\), and \(3.05\times 10^{-9}\). All coefficient estimates demonstrate statistically significant results. Furthermore, we report heteroscedasticity-consistent standard errors using the HC3 estimator27. This approach provides valid hypothesis tests (t statistics and p values) without requiring prior diagnostic tests for heteroscedasticity. The results are shown in Table 5.

Table 5 Regression Results with Heteroscedasticity-Consistent Standard Errors.

The results of the multivariate regression analysis are presented in a nomogram (Fig.3). To facilitate a more intuitive estimate, the data used to construct the nomogram have not been standardized.

Fig. 3
figure 3

Nomogram for Multiple Regression.

Missing \(NO_2\) concentrations (11.3% of records) are addressed through Multiple Imputation. Multiple imputation is an approach to handling missing data that addresses the uncertainty in missing values by generating multiple plausible imputations through the distribution of observed variables28. The results of the linear model based on the processed data have been compared with the results in Table 5. The coefficient estimates for variables \(NO_2\), TS, and PORC are 3.79, 15.40, and 3.42, respectively, with corresponding p values of 0.0004, \(<2\times 10^{-16},\) and 0.0016 in the significance tests. Since these results show slight differences compared to those of the original model, this article ultimately retains the findings derived from the Listwise-deletion approach.

The nomogram provides a more intuitive representation of the relationships between various APCC indices and the daily number of traffic accidents. Indices such as \(NO_2\), TS, \(RH_{2M}\), PORC and IsWeekend all have varying degrees of positive impact on the number of traffic accidents. Among these, TS, representing temperature, has the most significant effect. As the temperature increases, the frequency of traffic accidents may also increase.

Results of Xgboost and JMI

Based on the aforementioned analysis, this article takes a further step to build nonlinear models. To investigate the nonlinear relationships between the APCC indices and the number of traffic accidents, the integrated dataset is randomly divided into training and testing sets. Using ordinary least squares estimation (OLSE), Lasso regression (Least Absolute Shrinkage and Selection Operator), and XGBoost, this article conducts 100 repeated experiments. The hold-out method is utilized to randomly divide the data into training and test sets by 7: 3. The mean squared error (MSE) is used as an evaluation metric to compare the performance of the three models. The results are presented in Table 6.

Lasso regression is a regularized method that combines ordinary least squares (OLS) with an L1-penalty term to prevent overfitting and perform automatic feature selection. The objective function of Lasso can be described as

$$min_\alpha \{\frac{1}{2n}\sum _{i=1}^{n}(y_i-x_i^T\alpha )^2+\lambda \sum _{j=1}^{p}|\alpha _j|\},$$

where \(\lambda\) is the regularization parameter that controls the strength of penalty and \(\lambda \ge 0\). \(\sum _{j=1}^{p}|\alpha _j|\) is the L1-norm penalty. When selecting \(\lambda\) that minimizes the prediction error the most, we perform a cross-validation using “cv.glmnet” in R language. This optimal value of \(\lambda\) is then used to extract the corresponding regression coefficients and make predictions on a test dataset.

XGBoost is an efficient machine learning algorithm based on gradient-boosted decision trees. This algorithm introduces regularization into the objective function, which is

$$L=\sum _{i}l(y_i,\hat{y}_i)+\sum _{k}\Omega (f_k),$$

where l is the loss function. \(\Omega\) is the regularization term for the k-th tree, and \(\Omega (f_k)=\gamma T+1/2\lambda ||\omega ||^2,\) T represents the number of leaves in the tree, \(\omega\) denotes the weights of leaves. \(\gamma\) and \(\lambda\) are hyperparameters that control complexity. In the XGBoost model, the parameters are configured to optimize the performance of regression tasks. The objective function is set to the squared error loss function as it is effective in penalizing larger prediction errors. Other parameters are maintained at their default settings, as provided by the XGBoost package in R. Before constructing a model, the training data is formatted into a specialized structure by “xgb.DMatrix” which enhances the efficiency of the XGBoost algorithm.

Both Lasso regression and XGBoost perform variable selection. Lasso through L1-penalty and XGBoost through feature importance rankings derived from gradient boosting. To evaluate the performances of each model, we compare the mean squared error (MSE) of models trained on the dataset before data selection (retaining 12 predictors post-selection) against a baseline ordinary least squares (OLS) model using the same dataset. The results show that XGBoost achieves the lowest MSE, significantly outperforming Lasso and OLS, thus validating its superior predictive accuracy and robustness.

Table 6 Average MSE of 100 experiments.

The average Mean Squared Error (MSE) from the 100 models fitted using XGBoost is significantly lower than that of the other two linear models. This suggests that there may be nonlinear relationships between the number of RTAs and various indices. Consequently, the nonlinear XGBoost method is better suited to capture these relationships, resulting in improved model fitting.

Afterwards, SHAP values (SHapley Additive exPlanations) are introduced, which are based on the Shapley values from game theory. In such games, when a group achieves a certain outcome, the Shapley value fairly distributes the contribution of each participant, satisfying four properties: efficiency, symmetry, dummy, and additivity29. “Efficiency” means that the sum of all feature contributions is equal to the difference between the prediction and the average prediction. In other words, this prediction of a model can be interpreted as the sum of the contributions of each feature. “Symmetry” means that if two features contribute equally to the model, they have identical SHAP values. Analogously, if a feature has zero marginal contribution across all possible models, its Shapley value is zero. This is what “Dummy” means. “Additivity” means that the combined prediction of multiple models is equal to the sum of every prediction of each model30. Shapley demonstrated that when these four properties are met, the solution is both fair and unique29. Considering the computational challenges of determining Shapley values, Lundberg et al. introduced the SHAP method to estimate them, integrating this approach into machine learning to enhance the interpretability of the model31. In this study, the incorporation of SHAP values provides the possibility of quantifying the impact of each feature on the occurrence of RTAs, identifying the most influential factors in the model. The results based on XGBoost in this section are presented in Fig. 4.

Fig. 4
figure 4

SHAP value of each feature.

According to the ranking result of average SHAP values, an increase in the temperature-related indices (TS and T2M) is more likely to lead to traffic accidents. Air pollution indices also affect road traffic safety to varying degrees. Specifically, an increase in \(\textrm{NO}_2\) concentration tends to increase the frequency of RTAs, while an increase in \(PM_{2.5}\) concentration appears to reduce the number of accidents. This suggests that different air pollution indices exhibit varying impacts on road traffic safety, which requires further discussion. Furthermore, the IsWeekend variable (indicating whether it is a weekend) shows a negative correlation with the number of traffic accidents, with the accident frequency significantly lower on weekends compared to weekdays.

Table 6 presents the ranking of the feature importance derived from the XGBoost model. In this context, “Gain” represents the relative contribution of each feature to the model. A higher Gain value indicates a greater impact on the model’s predictive capability. “Cover” denotes the average coverage of a feature across all the trees where it appears, reflecting the proportion of observations associated with that feature. “Frequency” refers to the percentage of times a feature has been utilized in the trees. A higher frequency suggests a more frequent usage of that feature in the model. As shown in Table 7, TS (earth skin temperature) has the most significant influence on RTAs, while IsWeekend (whether it is on weekends) has the least impact. The XGBoost regression results indicate potential nonlinear relationships between the independent variables and the response variable.

Table 7 Ranking result of feature importance under XGBoost.

According to Eq.(2) to Eq.(4), when a modeling framework assumes exclusively linear relationships between variables, while omitting potential nonlinear dependencies, the error term in linear model would inherently incorporate these omitted nonlinear components. Hence, introducing bias in parameter estimates and affecting the validity of statistical inferences. Then the residual of the linear model would not be independent from covariates.

To further assess this nonlinearity, this article employs the Jackknife Mutual Information (JMI) estimator and set the number of permutations to 50. Let \(S= \{(\varvec{x}_i,\varvec{y}_i),i=1,\dots ,n\}\) denote a random sample from \((\varvec{X},\varvec{Y}).\) For each iteration \(k=1,2,\dots ,N\), generate a permuted dataset \(S_{(k)}=\{(\varvec{x}_i.\varvec{y}_{\sigma _k(i)}),i=1,\dots ,n\},\) where \(\sigma _k(\cdot )\) is a random permutation of \(1,2,\dots ,n\). Then the JMI estimator for each time can be calculated based on \(S_{(k)}\). Under the null hypothesis (\(\varvec{X} \perp \varvec{Y}\)), the distribution of \(\widehat{JMI}\) can be approximated by the empirical distribution \(T=\{JMI_{(1)},JMI_{2},\dots ,JMI_{(N)}\}\)21. In contrast with tests based on asymptotic distributions that require data with a large sample size, the permutation technique proposed by Zeng et al. can give a precise distribution for small samples. In R, the “JMI” package provides functions to perform JMI tests.

This article conducts the JMI test of residuals of the linear regression aforementioned and five APCC indices. In terms of the linear regression results presented in Table 4, applying the JMI method, the nonlinear dependencies between the residuals of the linear model and the selected variables are assessed. The findings are summarized in Table 8. Notably, the p value for the variables \(\textrm{NO}_2\) and TS is less than 0.05, indicating the dependency between them and the residuals of the linear model. Then further indicate the presence of nonlinear dependencies that the linear model fails to adequately capture.

Table 8 Results of nonlinear dependencies based on JMI.

Results of GAM

Based on the JMI result in Table 8, three generalized additive models have been developed to further explore the nonlinear relationships between variables. GAM is an extension of generalized linear models. By replacing linear terms with smooth functions, GAM allows for nonlinear relationships between predictors and the response variable. In addition, one of the key features of GAMs is the ability to visualize smooth terms. These smooth functions can be plotted to show how each predictor influences the response variable, independent of the other variables in the model. The plots provide insight into the shape and strength of the relationships between variables, such as whether the effect is increasing, decreasing, or nonmonotonic. A GAM is defined as

$$y_i=\alpha _0+\sum _{j=1}^{p}f(x_{ij})+\epsilon _i,$$

\(f(x_{ij})\)is a smooth nonlinear function.

In this article, GAM1 assumes that the relationships between \(\textrm{NO}_2\) and TS with the response variable (count) are nonlinear. GAM2 applies a smoothing function only to \(\textrm{NO}_2\), allowing for a nonlinear relationship between \(\textrm{NO}_2\) and the count of road traffic accidents. GAM3 allows only the temperature (TS) to have a nonlinear relationship with the response variable (count).

Based on the standardized dataset, a ten-fold cross-validation has been performed for iterations of 200 times, to evaluate three candidate GAM models through MSE (Table 9). The dataset is partitioned into 10 subsets, each fold iteratively designated as the test set (30% of the data) and the remaining 70% as the training set. For each iteration, 9 subsets train the model, while the 10th tests performance. This cycle repeats 10 times, ensuring each subset serves as the test set once. Metrics such as MSE are averaged across folds to reduce overfitting bias. According to the results in Table 9, it is evident that GAM2 has the highest MSE, corresponding to the scenario where it is assumed only \(\textrm{NO}_2\) has a nonlinear relationship with the response variable. Compared to the MSE of GAM3, it is initially inferred that \(\textrm{RH}_{2 \textrm{M}}\) has a stronger nonlinear dependence on the response variable. Overall, the fully nonlinear model GAM1 exhibits better fitting performance.

Table 9 MSE results of the GAMs.

Figure 5 presents the visualization of the smooth terms (e.g. \(\textrm{NO}_2\) and TS) under this model, allowing for a further discussion of the nonlinear relationship between temperature and the number of traffic accidents. The horizontal axis represents the average temperature, and for the smooth term (the blue curve in Fig.5), the vertical axis quantifies the effect on the smooth term s(TS) for each \(1^{\circ }{\textrm{C}}\) increase or decrease in earth skin temperature (TS). The variation in s(TS) can be roughly divided into three stages, separated by dashed lines in the figure.

The first segment of the curve shows a linear increase, and each unit increase in temperature results in an average increase of 46.94 in s(TS). When the actual temperature is below the average temperature, the further it deviates from the average, the stronger this nonlinear dependence becomes. The second segment of the image exhibits a nonlinear trend, initially increasing and then decreasing. In the second stage, the effect of s(TS) on the count of the response variable shifts from negative to positive. Nevertheless, as the average temperature increases, s(TS) slightly decreases and its impact on the response variable gradually diminishes. In this stage, each unit increase in earth skin temperature (TS) corresponds to an increase of 8.17 in s(TS). In the final stage, the segment shows a steady linear increase. Each increase in TS unit leads to an average increase of 15.82 in s(TS).

Variations in the GAM-fitted curve across these three stages indicate that when the actual temperature significantly deviates from the average temperature in Chengdu, either higher or lower, a temperature change is more strongly associated with a varying frequency of traffic accidents. The ultimate GAM can be expressed as

$$count=s(NO_2)+s(TS)+\mathrm {RH_{2M}}+PORC+IsWeekend.$$
Fig. 5
figure 5

TS in GAM1.

The significance tests for coefficient estimates show p values of 0.8543, \(2.05\times 10^{-5}\) and 0.0061 for linear variables \(\mathrm {RH_{2M}}\), \(\textrm{PORC}\), and \(\textrm{IsWeekend}\), 0.0046 and \(<2\times 10^{-16}\) for nonlinear terms \(s(NO_2)\) and s(TS), indicating statistically significant results for all predictors except variable \(\mathrm {RH_{2M}}\). To further verify the significance of air pollution as a potential contributor to road traffic accidents, we conduct a likelihood ratio test (LRT) that compares the fully linear model against the model incorporating nonlinear terms. The results demonstrate statistically superior explanatory power in nonlinear terms (\(\chi ^2=54257\), p value\(<2\times 10^{-16}\)), thereby indicating the necessity of modeling nonlinear dependencies when analyzing air pollution effects.

Discussion

This study demonstrates the following points: First, there is a strong correlation between the occurrence of traffic accidents in Chengdu and the climate conditions, as well as environmental pollution degrees. Second, variables such as \(\textrm{NO}_2\), TS (earth skin temperature), \(\mathrm {RH_{2M}}\) (relative humidity at 2 meters) and \(\textrm{PORC}\) (precipitation) have a positive impact on the occurrence of traffic accidents. The results of the linear regression indicate that the effect of \(\mathrm {RH_{2M}}\) on the response variable is not significant, while \(\textrm{IsWeekend}\) (whether it is on the weekend) has a significant negative impact on the number of traffic accidents. Third, the study detects potential nonlinear relationships between APCC indices and the number of traffic accidents. Taking TS as an example, when the actual temperature is higher or lower than the average temperature in Chengdu within the study range, the direction of its impact on traffic accidents varies.

The findings in this article are consistent with the existing literature. In multiple linear regression analysis, earth skin temperature (\(\textrm{TS}\)) is found to have the most significant impact on traffic accidents, consistent with the variable importance ranking from the nonlinear analysis in Results using the XGBoost model. Furthermore, the dependency between TS and the response variable count is examined in greater depth, revealing that when Chengdu’s temperature is at low or high local levels, it more significantly affects the number of traffic accidents. This finding is in agreement with the studies by Wang et al. and Lee et al.15,32.

In addition, precipitation (\(\textrm{PORC}\)) has a significant impact on road traffic safety. As precipitation increases, the number of traffic accidents tends to increase, corroborating previous research. Chung et al. found that precipitation leads to a temporary reduction in maximum traffic flow and vehicle speed on roads33. Eisenberg analyzed daily data and discovered a positive correlation between precipitation and fatal accidents16.

Moreover, the results of the linear regression indicate that the relative humidity (\(\mathrm {RH_{2M}}\)) has a positive but statistically insignificant effect on traffic accidents. Traffic is safer on weekends, possibly due to the reduced traffic volume during the study range.

The traffic accident dataset used in this study includes only incidents where emergency calls were made, thus not all accidents in Chengdu were covered. And datasets are all structured on a daily basis, and the influences of time-of-day variations are ignored. Furthermore, among air pollution indices, \(\textrm{NO}_2\) is significantly positively correlated with the frequency of traffic accidents. The specific reasons for this phenomenon are not yet clear. One possible reason is that \(\textrm{NO}_2\) is a major pollutant in vehicle exhaust emissions34. Furthermore, direct emissions of \(\textrm{NO}_2\) from vehicles significantly affect \(\textrm{NO}_2\) near-road concentrations35. Consequently, a more in-depth analysis is required to understand the reasons behind the increased frequency of traffic accidents due to environmental pollution.

Conclusion

Generally, this study, based on data from a southwestern city in mainland China, examines the relationship between road traffic safety and APCC. It provides evidence that earth skin temperature (\(\textrm{TS}\)), corrected precipitation (\(\textrm{PORC}\)), relative humidity (\(\mathrm {RH_{2M}}\)), \(\textrm{NO}_2\), and whether it is a weekend (IsWeekend) influence the frequency of traffic accidents in Chengdu to varying degrees. Future research could explore the impact of APCC on variables such as the severity of casualties in traffic accidents using larger datasets.