Introduction

Rising air pollution in developing nations has been linked to increased adverse health effects in humans1. Elevated pollution levels do not remain confined to the source region; pollutants also undergo long-range transport driven by meteorology2,3. To address this issue, it is essential to conduct high-resolution spatial and temporal monitoring of key air pollutants. For this purpose, ground-based monitoring using high-quality reference-grade instruments (Federal Reference Method (FRM)/Federal Equivalent Method (FEM)) is the most reliable approach. These instruments provide high-quality pollutant data and are recognized as standard instruments by major international air pollution regulatory bodies4,5,6,7.

A major constraint to the widespread use of FRM/FEM instruments is their high initial setup and maintenance costs. As a result, the network of such stations is sparse in developing countries compared to developed nations. In the past decades, low-cost sensors (LCS) for air quality monitoring have emerged and are now used globally8,9. These sensors play a crucial role in air quality monitoring by providing real-time, high-resolution temporal and spatial data, owing to their easy and affordable deployment and low maintenance costs. LCS are capable of monitoring gases, Particulate Matter (PM), volatile organic compounds (VOCs), and airborne microorganisms10,11. Gas sensors typically utilize electrochemical and metal oxide technologies, while PM sensors generally operate on light scattering principles. The size-specific signals produced by these sensors are converted into mass concentrations using algorithms12,13,14,15,16.

Despite their various advantages, the use of LCS has limitations. A significant drawback is the quality of the data generated compared to FRM/FEM instruments. The discrepancies in PM measurements between LCS and FRM/FEM instruments arise from differences in their operating principles17,18,19. Additionally, LCS are sensitive to meteorological factors, with many studies highlighting the significant impact of Relative Humidity (RH)20,21. Therefore, to enhance the accuracy of air quality data from these sensors, calibration against FRM/FEM instruments is necessary, typically achieved through collocation in either laboratory or field setups. The field method is often considered to be more reliable, as it allows sensors to be tested under ambient conditions similar to those they will encounter in real-world monitoring22. For calibration, researchers have used various models, ranging from simple linear regression to more complex Machine Learning (ML) models, with ML models generally yielding the best performance in most studies23,24.

The performance of different calibration models varies by location, and to date, few studies have explored these models to correct LCS data in India25,26. This study is the first to conduct long-term, in-field calibration of two widely used LCS: PA, which has the largest global network, and ATMOS, which has the largest network in India, specifically in NW-IGP. In this study, we assessed the performance of these sensors in measuring PM2.5 against FEM Beta Attenuation Monitor (BAM) and examined the influence of ambient meteorology on LCS performance.

Five ML models, namely Multiple Linear Regression (MLR), Decision Tree (DT), Random Forest (RF) Regression Model, Support Vector Machine (SVM), and XGBoost (XGB), together with an empirical RH correction methodology, were employed to calibrate raw measurements from these sensors. The LCS were collocated with the BAM for 10 months, covering all seasons. The study identifies the best-performing ML model, capable of correcting raw sensor measurements in spatial networks across similar urban areas throughout the Indo-Gangetic Plains (IGP) and beyond. The outstanding performance of the model enhances spatial and temporal air quality monitoring, aiding in the identification of local air pollution hotspots, accurate exposure assessment, health risk evaluation, and informed source-oriented policy making. A graphical abstract of the study is presented in Fig. 1.

Fig. 1: Graphical representation of the study.

Results and discussion

Raw ATMOS and PA measurements

The raw PM2.5 measurements from the ATMOS sensor ranged from 0.02 to 328.35 µg/m3, with a mean value of 51.38 µg/m3. In comparison, the raw PM2.5 values from the PA sensor ranged from 1.27 to 537.68 µg/m3, with a mean value of 91.59 µg/m3. The BAM PM2.5 measurements ranged from 0.91 to 213.63 µg/m3, with a mean value of 45.29 µg/m3. A time series comparison of PM2.5 measurements from all these instruments is presented in Fig. 2. Both PA and ATMOS measurements tended to overestimate PM2.5 levels compared to BAM, particularly at higher concentrations, with PA showing a greater degree of overestimation. The root mean square error (RMSE) of the raw measurements was 34.6 µg/m3 for ATMOS and 77.67 µg/m3 for PA. The mean absolute error (MAE) was 24.19 µg/m3 for ATMOS and 54.52 µg/m3 for PA.

Fig. 2: Comparison of Raw LCS PM2.5 measurements with BAM.

a Comparison as time series. Red, blue, and green lines indicate the raw PM2.5 measurements by the ATMOS, Purple Air, and BAM monitors, respectively. b Regression plots of ATMOS and Purple Air against BAM, along with R-squared and RMSE values.

The coefficient of determination (COD) between ATMOS and BAM was 0.40, while it was 0.43 between PA and BAM. Scatter plots and linear fit for these parameters are shown in Fig. 2. In global studies involving long-term field collocations with BAM, higher COD values (ranging from 0.8 to 0.9) have been recorded for PA sensors. However, these results were obtained in locations with very low PM2.5 concentrations27,28. All statistics of raw measurements, including minimum and maximum value and standard deviation, are summarized in Table 1.

Table 1 Statistics of raw PM2.5 measurements by PA, ATMOS, and BAM in Chandigarh, India

Correlation plots of ATMOS PM2.5, PA PM2.5, and their respective RH and Temperature (T) with BAM PM2.5 are presented in Fig. 3. Long-term collocation studies in the IGP region of India using these LCS are limited. One study found lower COD values, ranging from 0.55 to 0.74, for PA sensors during long-term collocation with BAM in IGP cities, where PM2.5 levels were higher29. Another study reported a COD of 0.32 during a two-month collocation with BAM in Chandigarh30. Research indicates that the composition and size of PM, along with ambient meteorological conditions, significantly impact LCS performance31. This variability may explain why the performance of these sensors differs from city to city within the same region. The lower performance of these sensors in this study could also be attributed to discrepancies between the ambient conditions under which the sensors were calibrated post-manufacturing and those present in the study area.

Fig. 3: Pearson correlation plot of ATMOS, PA, and their respective RH and T with BAM PM2.5 (labeled as PM2.5_CQM).

The values inside the boxes indicate the p-value, which is significant.

Meteorological factors, particularly RH, have been identified as a major contributor to the measurement uncertainty in LCS32. This bias in raw measurements for both LCS is evident in Fig. 2 and is further compared in Table 2. The data in Table 2 shows that the detection of particulates increases with rising RH, for both the ATMOS and PA sensors. However, this effect is negligible in BAM measurements as it has a heater at the inlet, which regulates the RH by maintaining a setpoint33,34.

Table 2 Comparison of mean and median PM2.5 recorded by LCS and BAM in different RH bins

Performance of calibration models

The performance of each ML model was assessed using RMSE, MAE, and Mean Absolute Percentage Error (MAPE) of the corrected PM2.5 values from the LCS (testing split data). The DT model demonstrates the best performance for both sensors among all the models. Using DT for ATMOS, RMSE decreased from 34.6 µg/m3 to 0.731 µg/m3, the MAE dropped from 24.19 µg/m3 to 0.177 µg/m3, and the MAPE improved from 57.9% to 0.41%. Using DT for PA, RMSE reduced from 77.7 µg/m3 to 0.61 µg/m3, the MAE fell from 54.52 µg/m3 to 0.135 µg/m3 and the MAPE from 125.74% to 0.37%. The DT’s ability to capture non-linear relationships and adapt to complex patterns in the data likely contributed to its superior performance, particularly in predicting PM2.5 concentration.

Additionally, decision trees are valued for their interpretability, making it easier to understand and validate the model’s decision-making process35. While previous studies have indicated that RF models often outperform DT on larger datasets, the DT model tends to perform better on smaller datasets36. In this study, the DT model may be performing better by partitioning the feature space37 and efficiently capturing complicated patterns38, while other models may struggle to capture non-linear relationships, resulting in lower performance on our dataset for our study region. This finding highlights the advantages of the DT algorithm within R and provides valuable insights for calibrating LCS.

While the performance of the DT model is exceptionally good on the testing/validation dataset, it was further evaluated on an unseen dataset, which was not part of the initial training and testing/validation dataset, to assess potential overfitting39. A total of 1849 raw PM2.5 measurements for ATMOS and 1391 raw PM2.5 measurements for PA were used for this analysis. The DT model achieved an R² of 0.986 for PA and an R² of 0.987 for ATMOS on the unseen dataset, indicating strong generalization and minimal overfitting, which supports the robustness of the DT model. The evaluation metrics are presented in Table 3, while the linear regression plots and models of corrected LCS measurements (testing with unseen dataset) alongside the corresponding BAM measurements are illustrated in Figs. S1 to S4.

Table 3 Performance metrics of both LCS for corrected (testing) data in Chandigarh, India

The COD values of corrected PM2.5 measurements from the other models ranged from 0.42 to 0.72 for ATMOS and from 0.52 to 0.79 for PA. While the performance metrics among these models did not significantly differ, the DT model notably outperformed all others. The RH correction methodology performed better than all ML models except for DT. This methodology involved calculating the k and m factors in two ways: separately for each ‘14-day sliding window’, and as ‘constant m and k’ values for the whole dataset. The correction equation utilizing the m and k factors derived from 14-day windows improved the R2 to 0.72 for ATMOS and 0.79 for PA (Table 3 and Fig. 4). In contrast, the same equation applied with optimized but constant m and k factors did not yield satisfactory results (Supplementary Figs. S5 and S6). In a separate study, various ML calibration models were tested on nine LCS over a nine-month collocation period in Chennai, where the SVM model outperformed the others26.

Fig. 4: Regression plots and statistics of corrected LCS measurements (testing data) and corresponding BAM measurements.

a Regression plots and statistics of ATMOS. b Regression plots and statistics of Purple Air. The best-fit line in each plot is shown in red, and the values of intercept, slope, and R-squared are given separately in each plot.

This variation in model performance across various regions may be attributed to differences in aerosol morphology, the composition of contributing sources, and meteorological conditions12,40. Detailed performance metrics for all models are provided in Table 3. Scatter plots and statistics of linear regression for corrected LCS measurements (testing data in case of ML models) alongside corresponding BAM measurements are depicted in Fig. 4. Model residuals and comparison of data corrected by models with BAM are given in Supplementary Figs. S7 to S26 for ATMOS and Supplementary Figs. S27 to S46 for PA in the supplementary file.

Effect of RH on LCS measurements

To evaluate the effect of RH on the performance of LCS, raw PM2.5 measurements of both LCS were categorized into four groups based on the corresponding ambient RH levels: RH ≤ 25%, 25% < RH ≤ 50%, 50% < RH ≤ 75%, and RH > 75%. The COD was then calculated between LCS PM2.5 measurements and the corresponding BAM PM2.5 values for each group. The highest COD values for both LCS were found in the highest RH group, i.e., RH > 75% (COD = 0.59 for ATMOS and 0.63 for PA). Conversely, the lowest COD was recorded for the RH ≤ 25% group. Scatter plots for these four groups are shown in Fig. 5, while linear regression statistics are presented in Table 4.
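
The grouping described above can be sketched as follows. This is an illustrative Python sketch, not the code used in the study (the analysis was performed in R); the function names and the toy values are assumptions, and the COD is computed here as the squared Pearson correlation, which equals the R2 of a simple linear fit.

```python
def rh_group(rh):
    """Assign an RH value (%) to one of the four groups used in the text."""
    if rh <= 25:
        return "RH<=25"
    if rh <= 50:
        return "25<RH<=50"
    if rh <= 75:
        return "50<RH<=75"
    return "RH>75"


def cod(x, y):
    """Coefficient of determination of a simple linear fit (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either series is constant
    return sxy * sxy / (sxx * syy)


def cod_by_rh_group(pm_lcs, pm_bam, rh):
    """Split paired hourly values by RH group and compute the COD per group."""
    groups = {}
    for lcs, bam, h in zip(pm_lcs, pm_bam, rh):
        xs, ys = groups.setdefault(rh_group(h), ([], []))
        xs.append(lcs)
        ys.append(bam)
    # At least three pairs are required for a meaningful fit (assumed threshold).
    return {g: cod(xs, ys) for g, (xs, ys) in groups.items() if len(xs) >= 3}


# Toy values for illustration only; these are not measurements from the study.
example = cod_by_rh_group(
    pm_lcs=[10, 20, 30, 12, 25, 31],
    pm_bam=[5, 10, 15, 9, 8, 20],
    rh=[20, 22, 24, 80, 85, 90],
)
```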

Fig. 5: Comparison of LCS measurements with the reference instrument.

a Regression of raw LCS PM2.5 measurements against BAM PM2.5 measurements in different RH groups, i.e., ‘RH ≤ 25%’, ‘25% < RH ≤ 50%’, ‘50% < RH ≤ 75%’ and ‘RH > 75%’. The scatter points and best-fit lines in red are for Purple Air versus BAM; those in black are for ATMOS versus BAM. b Box plots of relative humidity and temperature, measured by the low-cost sensors and the reference instruments.

Table 4 Linear regression statistics of LCS with BAM measurements in Chandigarh, India

The RH and T readings from these sensors differed from those measured at the reference Continuous Ambient Air Quality Monitoring Station (CAAQMS). Both LCS recorded higher T and lower RH compared to the reference station (see Figs. 5 and 6). This discrepancy arises because the temperature sensors are placed inside the sensor cabinets, which heat up, so the recorded temperatures are higher than the true ambient values. Consequently, the RH sensors inside these cabinets also indicate lower RH than the ambient levels measured by the reference instrument. The comparison of these instruments is shown in Figs. 5 and 6. Many studies have reported a low correlation between LCS PM2.5 measurements and reference instruments at low RH levels. This can be attributed to RH effects on the size distribution of PM2.5 particles41,42. Higher RH facilitates hygroscopic growth of aerosols, improving detection by light-scattering devices due to increased particle size, which enhances light scattering and alters the refractive index43.

Fig. 6: Time series comparison of low-cost sensors-measured relative humidity and temperature with reference instrument.

a Comparison of temperature. b Comparison of relative humidity. The red, blue, and green lines show the measurements by ATMOS, Purple Air, and the reference instruments, respectively.

To curb this bias, some studies have integrated calibration algorithms with hygroscopic factor (k factor derived from Köhler theory) and size distribution correction factor (m factor), resulting in improved calibration outcomes32,44. In the current study, these k and m factors were also derived empirically, following the same methodology, to assess the calibration performance. The correction method utilizing these factors outperformed all the ML models except for the DT model (as shown in Table 3 and Fig. 4). The ML models in this study were trained with RH and temperature variables incorporated into their algorithms, leading to excellent calibration results, particularly with the DT model.
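
The text does not state the numerical procedure used to derive k and m. One plausible approach, sketched here in Python purely as an assumption, is a grid search over k combined with a closed-form least-squares estimate of m for each candidate k, using the correction form of Eq. (1). The function names, the k grid, and the synthetic data are hypothetical; for the sliding-window variant, the same fit would be repeated on each 14-day subset.

```python
def growth_term(rh, k):
    """Denominator of Eq. (1): 1 + k / (100/RH - 1), defined for 0 < RH < 100."""
    if not 0.0 < rh < 100.0:
        raise ValueError("RH must lie strictly between 0 and 100 %")
    return 1.0 + k / (100.0 / rh - 1.0)


def fit_k_m(pm_lcs, rh_lcs, pm_ref, k_grid=None):
    """Return the (k, m) pair minimizing the squared error against the reference.

    Assumed procedure: for each k on a grid, m has the closed-form
    least-squares solution m = sum(x * ref) / sum(x * x), where
    x = PM_LCS / growth_term(RH, k).
    """
    if k_grid is None:
        k_grid = [i / 100 for i in range(0, 101)]  # 0.00 to 1.00 (assumed range)
    best = None
    for k in k_grid:
        x = [p / growth_term(h, k) for p, h in zip(pm_lcs, rh_lcs)]
        sxx = sum(v * v for v in x)
        if sxx == 0:
            continue
        m = sum(v * r for v, r in zip(x, pm_ref)) / sxx
        sse = sum((m * v - r) ** 2 for v, r in zip(x, pm_ref))
        if best is None or sse < best[0]:
            best = (sse, k, m)
    if best is None:
        raise ValueError("no valid fit found")
    return best[1], best[2]


# Synthetic check: build reference values from known factors and recover them.
pm = [40.0, 80.0, 120.0, 60.0, 150.0]
rh = [30.0, 50.0, 70.0, 85.0, 60.0]
ref = [0.8 * p / growth_term(h, 0.4) for p, h in zip(pm, rh)]
k_hat, m_hat = fit_k_m(pm, rh, ref)
```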

Materials and methods

Study area

The study area for this research is Chandigarh, a city in northern India (30°45′N 76°47′E) and one of the union territories of India. It has a total area of 114 km2 and a population exceeding 1 million45. Its climate is characterized as humid subtropical, with temperatures varying from season to season (−1 to 45 °C). The land use pattern of the city is predominantly urban, with a few rural pockets on its outskirts. Notably, Chandigarh has the highest vehicle density in the country, contributing significantly to local air pollution. Additionally, its proximity to the states of Punjab and Haryana leads to a seasonal influx of air pollution, particularly from stubble burning in those regions46,47.

Instrumentation and in-field collocation

For the in-field collocation experiment, two LCS, PA and ATMOS, were positioned alongside the BAM (FEM instrument) at the same height (around 12 feet above the ground). Both LCS operate on the nephelometric principle, measuring the light scattered by particles, with a laser as the light source. The measurements were conducted from 13th October 2020 to 28th July 2021, covering the Post-Monsoon, Winter, Summer, and Monsoon seasons. Hourly averaged PM2.5 data were downloaded from the APIs of the respective sensors, while hourly FEM PM2.5 measurements for the same period were retrieved from the Central Pollution Control Board (CPCB) repository. Data were collected from the BAM (PM101M), which is part of the CAAQMS established by the CPCB. Additionally, both LCS recorded their own meteorological parameters, including RH and T, which were incorporated into their calibration models. Ambient RH and T data were sourced from the same CAAQMS as mentioned above.

Data processing

A total number of 5222 hourly values from the ATMOS sensor were utilized after excluding data points with erroneous/unrealistic meteorological values for training and testing of model calibration against corresponding BAM measurements. For the PA sensor, 5839 data points were used for training and testing purposes and calibrated with corresponding BAM values. Data points with excessively high values (e.g., temperature readings of 3 data points exceeding 1000 °C), identified as unrealistic (outliers) during plots visualization, were removed from the dataset48. This pre-processing step was crucial to ensure the ML models were trained accurately, thereby enhancing the reliability of the results. This approach follows standard data processing practices as established in previous studies49,50.
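
A minimal Python sketch of this kind of screening step is given below. The only criterion stated in the text is the removal of clearly unrealistic readings (such as temperatures above 1000 °C); the plausibility bounds, the record layout, and the function name used here are assumptions chosen for illustration.

```python
def filter_plausible(records, t_range=(-10.0, 60.0), rh_range=(0.0, 100.0)):
    """Keep hourly records whose meteorological values are physically plausible.

    Each record is a dict with keys 'pm25', 'rh', and 'temp' (assumed layout).
    The bounds are illustrative assumptions, not the study's thresholds.
    """
    kept = []
    for rec in records:
        temp, rh, pm = rec.get("temp"), rec.get("rh"), rec.get("pm25")
        if temp is None or rh is None or pm is None:
            continue  # incomplete record
        if not t_range[0] <= temp <= t_range[1]:
            continue  # e.g. a spurious reading above 1000 degrees C
        if not rh_range[0] <= rh <= rh_range[1]:
            continue
        if pm < 0:
            continue  # negative concentrations are not physical
        kept.append(rec)
    return kept


# Toy records for illustration only.
hourly = [
    {"pm25": 42.0, "rh": 55.0, "temp": 28.0},
    {"pm25": 61.0, "rh": 48.0, "temp": 1200.0},  # unrealistic temperature
    {"pm25": 35.0, "rh": 70.0, "temp": 31.5},
]
clean = filter_plausible(hourly)
```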

Calibration models

Previous studies have demonstrated that models such as Multivariate Linear Regression, Decision Tree, Support Vector Machine, Random Forest, and XGBoost have effectively calibrated LCS in various regions worldwide51,52. In this study, the raw data were randomly divided into two subsets: 70% of the dataset was allocated for model training (calibration), while the remaining 30% was used to evaluate the models against different statistical parameters. Recognizing the established relationship between PM2.5, RH, and T, we included the LCS PM2.5, RH, and T as independent variables, with BAM PM2.5 values serving as the dependent variable12,21. A description of all the ML models and their internal characteristics is provided in the following sub-sections.
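
The random 70/30 partition can be illustrated with a short Python sketch. The seed, the rounding rule, and the function name are assumptions; the text specifies only that the split was random with these proportions.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Randomly partition rows into training and testing subsets.

    The fixed seed is an assumption added for reproducibility.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Illustrative use with placeholder row identifiers.
rows = list(range(100))
train_rows, test_rows = train_test_split(rows)
```

Every row lands in exactly one subset, so the testing data are never seen during training.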

Multiple linear regression

The MLR model extends simple linear regression to datasets with multiple predictor variables while maintaining a single outcome53,54. The MLR model assumes that changes in each independent variable are associated with consistent changes in the dependent variable, i.e., that the relationship between them is linear55.
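
As an illustration of the MLR form, BAM PM2.5 ≈ b0 + b1·PM2.5(LCS) + b2·RH + b3·T, the following Python sketch fits the coefficients by ordinary least squares through the normal equations. It is not the R implementation used in the study; the function names and the synthetic data (generated from known coefficients so that the fit can be checked) are assumptions.

```python
def fit_mlr(X, y):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(p):
            if r != col:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]  # [b0, b1, b2, b3]


def predict_mlr(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


# Synthetic predictors (LCS PM2.5, RH, T) and a target built from known coefficients.
X = [(50, 30, 20), (80, 60, 15), (120, 80, 10), (30, 40, 35), (200, 55, 25), (60, 90, 5)]
y = [2.0 + 0.5 * pm - 0.1 * rh + 0.2 * t for pm, rh, t in X]
coef = fit_mlr(X, y)
```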

Decision tree

The DT is a non-linear model that recursively splits the dataset into subsets based on the most significant features. Its hierarchical composition enables the efficient handling of complex relationships56. In R, DT is a straightforward and interpretable predictive model that provides easy implementation of complex associations within datasets. We have used the inbuilt library of R to develop a DT ML model57,58,59. In the R library for DT regression, the internal structure involves recursively partitioning the dataset based on selected features to create a hierarchical tree, where nodes represent splitting criteria and terminal leaves contain regression predictions. This design enables the models to efficiently capture non-linear relationships in the data56. The key assumptions made in this model are that the data points are independent and identically distributed, the most important feature is used for splitting the data at each node, and there is a non-linear relationship between dependent and independent variables.
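
The recursive partitioning described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions (splits chosen by minimizing the summed squared error of the two children, a fixed maximum depth, and mean-valued leaves); it is not the R library used in the study, and the names and toy data are hypothetical.

```python
def _sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def build_tree(X, y, max_depth=3, min_leaf=1):
    """Recursively split on the feature/threshold pair that minimizes child SSE."""
    mean = sum(y) / len(y)
    if max_depth == 0 or len(y) < 2 * min_leaf or _sse(y) == 0.0:
        return {"value": mean}  # terminal leaf holds the regression prediction
    best = None
    for j in range(len(X[0])):
        levels = sorted(set(row[j] for row in X))
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = _sse(left) + _sse(right)
            if best is None or cost < best[0]:
                best = (cost, j, thr)
    if best is None:
        return {"value": mean}
    _, j, thr = best
    li = [i for i, row in enumerate(X) if row[j] <= thr]
    ri = [i for i, row in enumerate(X) if row[j] > thr]
    return {
        "feature": j,
        "threshold": thr,
        "left": build_tree([X[i] for i in li], [y[i] for i in li], max_depth - 1, min_leaf),
        "right": build_tree([X[i] for i in ri], [y[i] for i in ri], max_depth - 1, min_leaf),
    }


def predict_tree(node, row):
    """Follow the splitting criteria from the root down to a leaf."""
    while "value" not in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


# Toy data: columns are (LCS PM2.5, RH); the target depends only on PM2.5.
X = [(10, 30), (20, 80), (30, 40), (110, 35), (120, 85), (130, 45)]
y = [5.0, 5.0, 5.0, 50.0, 50.0, 50.0]
tree = build_tree(X, y, max_depth=2)
```

In this toy case the root split falls on the PM2.5 column, because no RH threshold separates the two target levels.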

Random forest regression model

RF is valued for its high predictive accuracy and its ability to handle complex datasets. It operates by combining multiple decision trees through a process called bagging, which incorporates both feature randomness and bootstrap sampling. The internal structure of RF consists of an ensemble of decision trees, each trained on a subset of the data, using a random selection of features. Predictions from these trees are aggregated through a voting or averaging mechanism56,60,61,62. This ensemble approach improves robustness, mitigates overfitting, and provides a powerful tool for classification and regression tasks. RF regression is practically applied to datasets with complex relationships, non-linear patterns, and a large number of features (datasets with a substantial number of independent variables). We have utilized the inbuilt library of R studio and developed the RF model to calibrate LCS56,63,64. In R’s RF regression (e.g., the ‘randomForest’ package), the final prediction is an average or weighted combination of individual tree predictions, leading to improved accuracy and robustness. The key assumptions of this model include that the data points are independent and identically distributed, there is a non-linear relationship between dependent and independent variables, and there is sufficient data availability to construct multiple trees65.

Support vector machine

SVM is an ML model that, when used for regression tasks, identifies an optimal hyperplane to minimize the error between predicted and actual values46. This hyperplane can then be used to estimate the label for unseen data29,66, and SVM may effectively capture non-linear relationships30. In the R studio, the internal structure of SVM (e1071 library) focuses on optimizing hyperplane parameters and support vectors to minimize regression errors31,67,68. The assumptions underlying SVM include that the data can be separated by a hyperplane with a maximum margin, that features are scaled when kernel functions are used, that the data points are independent, and that there is a non-linear relationship between dependent and independent variables67.

XGBoost

XGB is a powerful ML algorithm that combines the strengths of gradient boosting and tree-based models, which leads to high prediction accuracy69. It operates by iteratively training an ensemble of decision trees, with each new tree added to correct the residual errors of the previous iterations70. This algorithm performs gradient boosting with regularization techniques. In the R library for XGB (‘xgboost’ package), the internal structure involves an ensemble of decision trees, each sequentially added to minimize the loss function along its gradient71,72,73. The model’s strength lies in its optimization for both predictive accuracy and regularization, achieved by combining weak learners into a robust predictive model. The assumptions made in the model include that the final prediction is an additive combination of all individual trees, that the input features are relevant to the model, and that there is a non-linear relationship between dependent and independent variables74.

Empirical RH correction methodology

This methodology was adopted from a study where, to minimize bias in LCS measurements, occurring due to hygroscopic growth of particles and varying size distribution, k and m factors were introduced63. The following Eq. (1) was used for LCS calibration:

$$\mathrm{PM}_{2.5}=\mathrm{PM}_{2.5\,\mathrm{LCS}}\times \frac{\mathrm{m}}{1+\dfrac{\mathrm{k}}{\dfrac{100}{\mathrm{RH}_{\mathrm{LCS}}}-1}}$$
(1)

Here, k is the hygroscopic growth parameter derived from κ-Köhler theory, and m is the particle size distribution correction factor. More details are provided in a study by Patel et al.32. The above equation does not apply when RH_LCS = 100%, because the term 100/RH_LCS − 1 then equals zero and the ratio involving k is undefined.
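
Eq. (1) translates directly into a small function, sketched here in Python for illustration. The function name and the example values of k and m are assumptions and are not the factors derived in this study; the guard reflects the stated restriction at RH_LCS = 100% and also excludes RH_LCS = 0, where 100/RH_LCS is undefined.

```python
def rh_corrected_pm25(pm25_lcs, rh_lcs, k, m):
    """Apply Eq. (1): PM2.5 = PM2.5_LCS * m / (1 + k / (100 / RH_LCS - 1))."""
    if not 0.0 < rh_lcs < 100.0:
        raise ValueError("Eq. (1) is undefined for RH_LCS <= 0 or RH_LCS >= 100 %")
    growth = 1.0 + k / (100.0 / rh_lcs - 1.0)
    return pm25_lcs * m / growth


# Illustrative values only: at RH = 50 % the inner term 100/RH - 1 equals 1,
# so the denominator is 1 + k.
corrected = rh_corrected_pm25(pm25_lcs=100.0, rh_lcs=50.0, k=0.5, m=1.0)
```

With k > 0, the correction grows stronger as RH approaches 100%, which is consistent with the overestimation of raw LCS readings at high humidity.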

Performance evaluation metrics

To evaluate the calibration performance of these sensors, several metrics were chosen based on the literature review and recommendations from the EPA75. These metrics included the coefficient of determination, root mean square error, mean absolute error, and mean absolute percentage error. RMSE is commonly used to assess the error between two data sets; in this study, it was applied to calculate the error between raw/corrected sensor measurements and corresponding measurements from reference instruments. MAE measures the average absolute difference between the raw/corrected values and the reference instrument’s values, while MAPE expresses that difference as a percentage of the reference value. The analysis was done in R Studio (V-4.2.2). These metrics were applied separately to the raw PM2.5 values of LCS and the corrected PM2.5 values obtained from each model.
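
For reference, the four metrics can be written out as in the following Python sketch. The study computed them in R; the function names and the toy values below are assumptions, and the COD is taken here as the squared Pearson correlation between sensor and reference values.

```python
import math


def rmse(ref, est):
    """Root mean square error between reference and estimated values."""
    return math.sqrt(sum((e - r) ** 2 for r, e in zip(ref, est)) / len(ref))


def mae(ref, est):
    """Mean absolute error."""
    return sum(abs(e - r) for r, e in zip(ref, est)) / len(ref)


def mape(ref, est):
    """Mean absolute percentage error (%); reference values must be non-zero."""
    return 100.0 * sum(abs((e - r) / r) for r, e in zip(ref, est)) / len(ref)


def r_squared(ref, est):
    """Coefficient of determination as the squared Pearson correlation."""
    n = len(ref)
    mean_r = sum(ref) / n
    mean_e = sum(est) / n
    sxy = sum((r - mean_r) * (e - mean_e) for r, e in zip(ref, est))
    sxx = sum((r - mean_r) ** 2 for r in ref)
    syy = sum((e - mean_e) ** 2 for e in est)
    return sxy * sxy / (sxx * syy)


# Toy values: the sensor reads exactly 1.5 times the reference.
bam = [10.0, 20.0, 40.0, 80.0]
lcs = [15.0, 30.0, 60.0, 120.0]
metrics = {
    "RMSE": rmse(bam, lcs),
    "MAE": mae(bam, lcs),
    "MAPE": mape(bam, lcs),
    "R2": r_squared(bam, lcs),
}
```

In this toy case the R2 is 1 even though the sensor overestimates by 50%, which illustrates why the error metrics are reported alongside the COD.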