Introduction

Various computational models have been developed for the analysis and correlation of separation processes in which a solute is removed from solution via mass transfer. These models are primarily based on the estimation of diffusional and convective mass transfer rates and the use of thermodynamics for phase equilibria in the solution and solid phases1,2,3. Determining the concentration distribution of the solute in an adsorption process is of great importance, as it allows the distribution and separation efficiency to be correlated with the process parameters.

The computational fluid dynamics (CFD) method is typically used to determine the concentration distribution in adsorption processes; the governing transport phenomena equations are solved numerically to obtain the distribution of solute concentration in the solution. By calculating the concentration values at the outlet in continuous mode, or at the end of the process for batch mode, one can estimate the separation efficiency and the time required for the desired separation4,5,6. Although CFD is a powerful means of evaluating separation efficiency, it suffers from drawbacks such as high computational expense, and methods based on machine learning (ML) can therefore replace the physical models.

Machine learning models have gained significant traction as powerful tools for the analysis and prediction of data in various industries7,8. These methodologies enable the extraction of valuable insights and complex patterns from intricate datasets, thereby empowering researchers to generate accurate predictions and make well-informed decisions. Significant advances in the ___domain of ML over the past decade have led to the emergence of various algorithms and models designed to address a wide range of complex problems9,10,11. Unlike the CFD method, ML can learn from the data to offer quick and accurate predictions of the solute concentration distribution throughout the process12,13. For adsorption, this can be done by relating solute concentration to spatial coordinates such as x and y14. From a practical point of view, expressing solute concentration as a function of x and y can help optimize the process and calculate the separation efficiency for various solid adsorbents, thereby saving measurement time and cost.

The present study employed AdaBoost regression as the ensemble method, incorporating three distinct base models: decision tree (DT), support vector regression (SVR), and Gaussian process regression (GPR). To our knowledge, this is the first time these have been applied to predict solute concentration in an adsorption process using mesoporous silica materials. AdaBoost regression is a widely utilized and highly efficient methodology for constructing precise regression models, achieved by amalgamating the forecasts of numerous weak regressors through an adaptive and iterative procedure15.

The DT regression approach is a simple yet powerful predictive modeling technique. By recursively splitting the dataset, it creates a tree-like structure in which each leaf (terminal) node contains the predicted numeric value. DT regression is known for its interpretability, making it a popular choice when understanding feature importance is crucial. However, it can be prone to overfitting noisy data, and its performance may be limited for highly complex relationships within the data16.

SVR is an adaptation of support vector machines (SVMs) for regression objectives. SVR endeavors to pinpoint an ideal hyperplane that adeptly captures the inherent data patterns while accounting for a predefined range of acceptable deviations, referred to as the epsilon-insensitive zone17.

GPR is a versatile technique for modeling non-linear relationships within data. GPR employs mathematical functions to model the uncertainty of predictions based on the available training set, with a kernel function quantifying the similarity and correlation between data points. GPR is capable of effectively managing noisy observations and intricate data patterns, although computation on large datasets can pose challenges18.

The main contribution and innovation of the current study is the development of a modeling framework based on machine learning and the integration of CFD and ML for prediction of the solute concentration distribution in an adsorption process. Furthermore, the built models are optimized using an advanced algorithm, particle swarm optimization, to enhance their performance. The boosting strategy, combined with the optimizer, considerably enhances model accuracy, which has practical application in the design and optimization of adsorption processes for the selective removal of species from solution, such as in water treatment.

Material and methods

In this section, we outline the comprehensive methodology employed for the regression analysis of a concentration dataset comprising more than 19,000 data points. The primary objective of this analysis is to model and predict the concentration (C) in a given environment based on two coordinates (x and y). To ensure the robustness of our analysis, we follow a structured workflow that encompasses data preprocessing, model selection, hyper-parameter optimization, and ensemble learning. The methodology can be summarized into the following key steps:

1. Data preprocessing: We commence by scrutinizing the dataset for outliers using Cook’s distance, a widely accepted method for identifying influential data points. To facilitate consistent model training, we employ min–max scaling to normalize the input features (x and y). Subsequently, we partition the dataset into a training set (75%) and a test set (25%) to enable model evaluation (a code sketch of this step follows the list).

2. Model selection: Three distinct regression models serve as the foundation of our analysis: DT, SVR, and GPR. Our choice of these models was based on their ability to capture nonlinear relationships and handle large datasets. In addition, we use the AdaBoost ensemble method to combine the strengths of these base models to improve prediction accuracy.

3. Hyper-parameter optimization: To harness the full potential of each base model and the ensemble method, we employ particle swarm optimization (PSO) for hyper-parameter tuning. PSO is utilized to search for optimal hyper-parameter configurations, such as tree depth for DT, kernel selection for SVR, and the alpha value for GPR.
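
To make step 1 concrete, the following is a minimal sketch of the preprocessing pipeline, assuming the CFD data are stored in a CSV file; the file name `adsorption_cfd.csv` and the column names are hypothetical, not taken from the study.

```python
# Preprocessing sketch: min-max scaling of the inputs and a 75/25 split.
# File and column names are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("adsorption_cfd.csv")            # hypothetical file name
X = MinMaxScaler().fit_transform(df[["x", "y"]])  # normalize x and y to [0, 1]
y = df["C"].to_numpy()                            # target: concentration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```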

This structured approach (visually shown in Fig. 1), from data preprocessing to ensemble learning, aims to uncover the best regression model for predicting chemical concentrations within the spatial ___domain defined by x and y coordinates. In the subsequent sections, we delve into the specifics of each step, presenting experimental results and insights derived from this comprehensive analysis. The building blocks of this modeling structure are described in the rest of this section.

Fig. 1

Schematic representation of the overall modeling workflow, including data preprocessing, model selection, hyperparameter optimization using PSO, and ensemble learning using AdaBoost regression.

Dataset description

The dataset under examination contains over 19,000 entries, which represent a diverse collection of data points14. The data were collected from a CFD simulation of mass transfer in continuous mode, considering a single adsorbent particle: a mesoporous silica structure with high surface area for adsorption. Diffusion was considered inside the porous adsorbent, while mass transfer in the solution was assumed to occur by both diffusion and convection due to the fluid flow. The convection term is significant in the bulk of the solution because of velocity dominance, and the effect of viscous forces is simulated via the momentum equation14.

The mass transfer model comprises diffusion and convection in the bulk fluid, and diffusion only inside the porous solid. The general form of the mass transfer equation may be written as19:

$$\frac{\partial C}{\partial t} = \nabla \cdot \left( D\nabla C \right) - \nabla \cdot \left( VC \right) + R$$
(1)

where C is the concentration of solute (mol/m3), D is the diffusivity (m2/s), V is the velocity (m/s), and t is time (s). R is the chemical reaction term, which is zero in this case.

The fluid flow in the bulk of the solution is governed by the Navier–Stokes equations19:

$$\rho \frac{dV}{dt} = - \nabla p + \mu \nabla^{2} V + F$$
(2)
$$\nabla \cdot V = 0$$
(3)

where V is velocity, p is pressure, \(\rho\) is the fluid density, \(\mu\) is the viscosity, and F is the body force.

The coupled mass transfer and Navier–Stokes equations were solved by the finite element method using COMSOL software to determine the concentration distribution of the solute at various locations in the process, i.e., at coordinates x and y. A two-dimensional model was assumed for the mass transfer simulation of the adsorption process. The concentration (C) data were then used in the next step for ML modeling; as such, C as a function of x and y is used for the ML analysis.

Due to its size and diversity, the dataset can be used for research and modeling, allowing a thorough exploration of various correlations and patterns. The inputs are x and y, measured in meters, while the output feature is C, expressed in moles per cubic meter (mol/m3)14. The histograms of the variables are shown in Fig. 2.
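
The histograms of Fig. 2 can be reproduced directly from the dataset; a brief sketch, assuming the DataFrame `df` from the preprocessing example above:

```python
# Histograms of the two inputs and the output (cf. Fig. 2).
import matplotlib.pyplot as plt

df[["x", "y", "C"]].hist(bins=50, figsize=(10, 3), layout=(1, 3))
plt.tight_layout()
plt.show()
```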

Fig. 2

Histograms showing the distribution of input variables (x and y in meters) and the output variable (C in mol/m3), illustrating their statistical spread and skewness in the dataset.

Outlier detection: Cook’s distance

In the process of analyzing the dataset and building regression models, it is essential to identify potential outliers that might influence the model’s performance and reliability. Cook’s distance is a valuable metric for detecting influential data points in regression analysis. It measures the impact of each data point on the model’s predictions and parameter estimates.

For a given data point i, Cook’s distance \(D_{i}\) is defined as20,21:

$$D_{i} = \frac{\sum\nolimits_{j = 1}^{n} \left( \hat{Y}_{j} - \hat{Y}_{j\left( i \right)} \right)^{2}}{p \cdot \hat{\sigma}^{2}}$$
(4)

where n stands for the total number of data points, \(\hat{Y}_{j}\) denotes the fitted value for the j-th observation, and \(\hat{Y}_{j\left( i \right)}\) denotes the corresponding fitted value when the model is refit with the i-th data point removed. Also, p represents the number of model parameters (including the intercept) and \(\hat{\sigma}^{2}\) stands for the estimated variance of the error term.

Cook’s distance quantifies how much the model’s predictions and parameters change when the i-th data point is excluded. Larger Cook’s distances indicate that the data point significantly influences the model.

A common threshold for identifying influential points is a Cook’s distance greater than \(4/n\), which corresponds to a point having a substantial impact on the model.

In our analysis, Cook’s distance was used as a preliminary step for outlier detection. Data points with Cook’s distance values exceeding the threshold were flagged as potential outliers and considered for further investigation and preprocessing.

By identifying and addressing influential outliers, we aimed to improve the robustness and accuracy of our regression models.
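
A minimal sketch of this screening step, using a preliminary ordinary least squares fit (statsmodels computes Eq. (4) directly) together with the 4/n rule; variable names reuse the preprocessing sketch above:

```python
# Flag influential points via Cook's distance from a preliminary OLS fit.
import numpy as np
import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
cooks_d, _ = ols.get_influence().cooks_distance  # D_i for every point

threshold = 4 / len(y)                           # common 4/n rule
keep = cooks_d <= threshold
print(f"flagged {np.count_nonzero(~keep)} of {len(y)} points")
X_clean, y_clean = X[keep], y[keep]              # retained observations
```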

Ensemble method: AdaBoost regression

AdaBoost, short for Adaptive Boosting, is an ensemble regression method that amalgamates forecasts from numerous weak learners, typically decision trees, to craft a potent predictive model. It is particularly useful in situations where the input and output variables have complex relationships.

In the AdaBoost algorithm, the assignment of weights to each weak learner is determined by their respective accuracies in predicting the target variable. In the context of ML, weak learners that exhibit superior performance are assigned greater weights, while those that demonstrate inferior performance are assigned lesser weights. The final prediction is achieved by combining the weighted predictions contributed by each independent weak learner22.

The key equation for AdaBoost regression involves calculating the weighted sum of the weak learners’ predictions15:

$$F\left( x \right) = \sum\limits_{t = 1}^{T} \alpha_{t} h_{t} \left( x \right)$$
(5)

In this context, the final prediction, denoted F(x), is a weighted combination of the weak learners: \(\alpha_{t}\) signifies the weight attributed to the weak learner \(h_{t}\left( x \right)\), while T represents the total number of weak learners in the ensemble.

AdaBoost incrementally adjusts the weights of the training samples to emphasize those with the largest prediction errors in the prior iteration, thus enhancing predictive accuracy. The steps of this model are shown in Fig. 3.
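
As a sketch, this scheme is available in scikit-learn’s AdaBoostRegressor; below it is paired with a shallow decision tree as the weak learner \(h_{t}\). The hyperparameter values shown are placeholders, since in the study they were tuned with PSO.

```python
# AdaBoost ensemble (Eq. 5) with a decision-tree weak learner.
# Values are illustrative; the study tuned them with PSO.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

ada_dt = AdaBoostRegressor(
    estimator=DecisionTreeRegressor(max_depth=8),  # weak learner h_t
    n_estimators=100,                              # T in Eq. (5)
    learning_rate=0.5,
    random_state=42,
).fit(X_train, y_train)
```

Note that the `estimator` keyword requires scikit-learn 1.2 or later; earlier releases use `base_estimator`.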

Fig. 3

Flowchart illustrating the iterative steps of the AdaBoost regression algorithm, highlighting how weak learners are weighted and combined to form the final predictive model.

Base models

The models employed in this study comprise DT, GPR, and SVR. These models function as the fundamental components within the ensemble learning structure known as Adaptive Boosting Regression (AdaBoost).

The DT regression model is a versatile and interpretable ML technique used in our analysis. It constructs a tree-like structure that recursively partitions the dataset based on the input features, x and y, into regions with distinct predicted outcomes. In the context of regression, each leaf node of the tree represents a predicted numeric value for the target variable, C (concentration). DT regression identifies complex, nonlinear associations between input features and the target variable. This is especially beneficial for our analysis, as it offers insights into the hierarchical significance of input features in predicting chemical concentrations, facilitating result interpretation. The model’s hyperparameters, such as tree depth and split criteria, are optimized through PSO to ensure optimal performance in our regression task23.
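
A standalone sketch of this base model, with illustrative values for the PSO-tuned hyperparameters:

```python
# Standalone DT regressor; max_depth and the split criterion are the
# hyperparameters tuned by PSO in the study (values here are illustrative).
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(max_depth=10, criterion="squared_error")
dt.fit(X_train, y_train)
```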

In the context of regression tasks, researchers have introduced SVR, which is a specialized version derived from the general-purpose support vector machine (SVM). SVR is designed to identify an optimal hyperplane that effectively accommodates the dataset while regulating deviations within an epsilon-insensitive zone. This approach seeks to strike a delicate balance between accurately fitting the training data and averting overfitting. Leveraging the kernel trick, SVR transforms the data into higher-dimensional space, thereby enhancing its capacity to capture intricate, non-linear associations within the dataset24. SVR demonstrates proficiency in managing datasets with high dimensionality and exhibits reduced sensitivity to outliers. Nevertheless, the key to achieving peak performance lies in the judicious selection of hyperparameters and kernel functions24.
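
A corresponding SVR sketch with an RBF kernel; C, epsilon, and the kernel choice are the PSO-tuned hyperparameters, so the values below are illustrative only:

```python
# SVR base model with an RBF kernel (illustrative hyperparameter values).
from sklearn.svm import SVR

svr = SVR(kernel="rbf", C=10.0, epsilon=0.01)
svr.fit(X_train, y_train)
```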

GPR represents a Bayesian approach, wherein it characterizes the connection between input attributes X and the target variable Y as a probabilistic distribution encompassing various functions. This predictive distribution takes the form of25:

$$Y\sim {\mathcal{GP}}\left( {\mu \left( X \right),k\left( {X,X^{\prime}} \right)} \right)$$
(6)

Within this equation, \(\mu \left( X \right)\) signifies the mean function, while \(k\left( {X,X^{\prime}} \right)\) serves as the kernel function, responsible for evaluating the similarity between data points X and \(X^{\prime}\). GPR can capture intricate non-linear associations within the dataset and excels in managing noisy observations. Nevertheless, it may encounter computational challenges when confronted with extensive datasets, primarily due to the requirement of inverting a covariance matrix. The method is described in Fig. 4.
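
A GPR sketch following Eq. (6), using an RBF kernel plus a noise term; the alpha (noise regularization) value is the PSO-tuned hyperparameter, and the subsampling below is only a workaround for the cubic cost noted above:

```python
# GPR base model: RBF kernel plus a noise term (Eq. 6).
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6, random_state=42)
gpr.fit(X_train[:2000], y_train[:2000])           # subsample: O(n^3) cost
mu, sigma = gpr.predict(X_test, return_std=True)  # mean and uncertainty
```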

Fig. 4

Process flow diagram of the GPR model, including key components such as kernel function selection, mean prediction calculation, and uncertainty quantification.

Hyperparameter tuning: particle swarm optimization (PSO)

PSO is a nature-inspired optimization method that emulates the social behavior of bird flocks, with particles traversing a multi-dimensional search space. The PSO algorithm is utilized in hyperparameter tuning to detect the optimal hyperparameters that enhance the fitting accuracy of machine learning models26,27.

PSO initializes a population, with each particle symbolizing a distinct configuration of hyperparameters. These particles systematically modify their locations in the hyperparameter space according to their present performance and the experiences of their peers. The PSO algorithm efficiently navigates the hyperparameter space and progressively approaches the optimal solution by tracking the most successful particles in the swarm28. One of the unique aspects of the PSO algorithm is its ability to strike a balance between exploration and exploitation. This adaptive behavior makes the PSO algorithm a valuable tool for hyperparameter tuning, where the objective is to discover hyperparameter configurations that yield the best model performance.
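
The mechanics can be made concrete with a minimal from-scratch PSO sketch that tunes two SVR hyperparameters (C and epsilon) against cross-validated R2; this is a simplified stand-in for the full PSO used in the study, and the bounds, swarm size, and coefficients are illustrative:

```python
# Minimal PSO sketch for tuning SVR's C and epsilon (illustrative).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
bounds = np.array([[0.1, 100.0],   # C
                   [1e-4, 0.5]])   # epsilon
n_particles, n_iters, dim = 10, 20, 2

def cost(p):
    # Negative mean CV R2, so lower is better.
    model = SVR(kernel="rbf", C=p[0], epsilon=p[1])
    return -cross_val_score(model, X_train, y_train,
                            cv=3, scoring="r2").mean()

pos = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_particles, dim))
vel = np.zeros_like(pos)
pbest, pbest_cost = pos.copy(), np.array([cost(p) for p in pos])
gbest = pbest[pbest_cost.argmin()]

w, c1, c2 = 0.7, 1.5, 1.5          # inertia and acceleration coefficients
for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, bounds[:, 0], bounds[:, 1])
    costs = np.array([cost(p) for p in pos])
    improved = costs < pbest_cost
    pbest[improved], pbest_cost[improved] = pos[improved], costs[improved]
    gbest = pbest[pbest_cost.argmin()]

print("best C, epsilon:", gbest)
```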

Results and discussion

In this section, the results of our work involving three base regression models (DT, SVR, and GPR) combined with the AdaBoost ensemble method are presented and analyzed. The project was built using Python 3.10, harnessing powerful libraries for data analysis and ML. Scikit-learn provided a robust set of ML algorithms for predictive modeling, while Matplotlib enabled data visualization to illustrate trends and results. NumPy and Pandas supported efficient numerical operations and data management, streamlining the analysis process.

We assessed the efficacy of each model utilizing two fundamental metrics: the Coefficient of Determination (R2-score) and the mean squared error (MSE). These metrics provide insights into the model’s goodness of fit and the magnitude of prediction errors, respectively. The models were assessed on training, fivefold cross-validation, and test sets, with results summarized in Table 1, which includes R2-score and MSE for each phase.
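
As a sketch of this evaluation, reusing the fitted `ada_dt` ensemble from the earlier example (the same calls apply to the SVR- and GPR-based variants):

```python
# R2 and MSE on 5-fold CV and the held-out test set (cf. Table 1).
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

cv_r2 = cross_val_score(ada_dt, X_train, y_train, cv=5, scoring="r2")
y_hat = ada_dt.predict(X_test)
print(f"CV R2    = {cv_r2.mean():.5f}")
print(f"test R2  = {r2_score(y_test, y_hat):.5f}")
print(f"test MSE = {mean_squared_error(y_test, y_hat):.4g}")
```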

Table 1 Performance of the models for fitting the data.

The findings illustrate the effectiveness of the AdaBoost ensemble technique in enhancing the predictive precision of the foundational regression models. AdaBoost integrated with decision tree (ADA-DT) attained a remarkable R2-score of 0.96984, indicating robust agreement between predictions and actual data. AdaBoost combined with support vector regression (ADA-SVR) performed even better, with an R2-score of 0.97148, showcasing excellent predictive accuracy. AdaBoost combined with Gaussian process regression (ADA-GPR) also exhibited commendable performance, with an R2-score of 0.95963. The ADA-SVR model was therefore selected for the final analysis. Figure 5 compares the observed and predicted output values. Partial dependencies are shown in Figs. 6 and 7, which agree with the results reported in the literature14.
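
Partial dependence plots of the kind shown in Figs. 6 and 7 can be produced with scikit-learn’s model-agnostic inspection tools; in this sketch `ada_dt` stands in for the fitted ADA-SVR model:

```python
# One-way partial dependence of predicted C on x and y (cf. Figs. 6 and 7).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    ada_dt, X_train, features=[0, 1], feature_names=["x (m)", "y (m)"])
plt.show()
```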

Fig. 5

Comparison of predicted versus observed concentration values (C in mol/m3) using the ADA-SVR model, demonstrating model accuracy and fit.

Fig. 6

Partial dependence plot showing the relationship between the input feature x (m) and predicted concentration C (mol/m3), derived from the ADA-SVR model.

Fig. 7

Partial dependence plot showing the relationship between the input feature y (m) and predicted concentration C (mol/m3), highlighting model interpretation of spatial influence.

The close alignment of R2-scores and MSE values across training, fivefold cross-validation, and test sets, as shown in Table 1, indicates the absence of significant overfitting in the AdaBoost ensemble models. For instance, the ADA-SVR model’s cross-validation R2-score of 0.96629 is only marginally lower than its test R2-score of 0.97148, with a similar trend observed in MSE values. Likewise, ADA-DT and ADA-GPR exhibit consistent performance across all sets, suggesting that the models generalize well to unseen data. This robustness, achieved through careful hyper-parameter tuning via particle swarm optimization and the ensemble approach, underscores the reliability of the predictive models for practical applications.

Figure 8 displays the final prediction surface in a three-dimensional representation, while Fig. 9 exhibits the contour plot of the output. The significance of the input features is demonstrated in Fig. 10. No significant concentration change was observed in the bulk of the solution far from the solid adsorbent, which implies that the concentration change is important near the surface of the porous adsorbent. This indicates the formation of a solute concentration boundary layer governed by molecular diffusion (see Fig. 9). It is also notable that molecular diffusion is significant inside the porous adsorbent, causing a substantial concentration change inside the mesoporous silica. Similar variations were reported by Sun et al.14 for hybrid machine learning modeling of solute concentration in adsorption processes.
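
Plots of the kind shown in Figs. 8 and 9 can be generated by evaluating the fitted model on a regular grid of the (scaled) coordinates; a sketch, again with `ada_dt` standing in for the final model:

```python
# Prediction surface and contour over the spatial ___domain (cf. Figs. 8, 9).
import numpy as np
import matplotlib.pyplot as plt

gx, gy = np.meshgrid(np.linspace(0, 1, 100), np.linspace(0, 1, 100))
C_pred = ada_dt.predict(np.column_stack([gx.ravel(), gy.ravel()]))
C_pred = C_pred.reshape(gx.shape)

fig = plt.figure(figsize=(10, 4))
ax3d = fig.add_subplot(121, projection="3d")
ax3d.plot_surface(gx, gy, C_pred, cmap="viridis")              # cf. Fig. 8
ax2d = fig.add_subplot(122)
cf = ax2d.contourf(gx, gy, C_pred, levels=30, cmap="viridis")  # cf. Fig. 9
fig.colorbar(cf, ax=ax2d, label="C (mol/m$^3$)")
plt.tight_layout()
plt.show()
```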

Fig. 8

Three-dimensional surface plot of the final concentration predictions (C in mol/m3) over the spatial ___domain (x, y), generated using the ADA-SVR model.

Fig. 9

Contour plot of predicted concentration values (C in mol/m3) over the x–y spatial plane, illustrating spatial distribution patterns and formation of boundary layer.

Fig. 10

Relative importance of input features (x and y) in predicting concentration C, as determined by the ADA-SVR model during training.

In addition to assessing the accuracy of the ML models, we evaluated their computational efficiency, focusing on CPU time and memory usage. These metrics are essential for determining the feasibility of deploying the models in practical, resource-limited settings. For example, the overall training time for ADA-DT was approximately 531 s, ADA-GPR required 590 s, and ADA-SVR took 672 s. These differences highlight the trade-offs between model complexity and computational cost, which are critical considerations for scalability and real-world application. Future efforts could explore optimization techniques, such as parallel processing or model simplification, to reduce these computational demands while maintaining performance.

While this study offers valuable insights into the performance of the tested ML algorithms, several limitations should be noted to provide a balanced perspective. Although the dataset is large, it was generated from a single simulated configuration, which may limit the generalizability of the findings to more diverse scenarios. Furthermore, the study focused on a specific subset of algorithms, leaving out other potentially effective methods that could yield different results. Real-world factors, such as data imbalance or noisy inputs, were not fully addressed, potentially affecting the models’ robustness. To build on this work, future research could incorporate larger and more varied datasets, test a broader range of algorithms, and account for additional practical constraints. These steps would enhance the applicability and reliability of the models in diverse contexts.

Conclusion

Our study found that ensemble modeling, specifically AdaBoost, improves the predictive performance of the base regression models (DT, SVR, and GPR) on a large dataset in which spatial coordinates (x, y) predict concentration (C) in mol/m3.

Our findings indicate that AdaBoost, in combination with these base models, consistently produces highly accurate predictions. AdaBoost combined with DT and SVR yielded R2-scores of 0.96984 and 0.97148, respectively, signifying strong alignment between predictions and actual observations. Furthermore, the combination of AdaBoost and GPR attained an R2-score of 0.95963. These findings highlight the efficacy of AdaBoost as a formidable ensemble technique for enhancing the precision of spatial regression models.

Furthermore, our utilization of PSO for hyper-parameter tuning showcased the importance of optimizing model parameters to achieve optimal predictive performance. This fine-tuning process significantly contributed to the superior results obtained in our study.

The implications of our research extend to various domains, including environmental science, where accurate predictions of spatially distributed concentrations are critical for informed decision-making. By leveraging ensemble techniques like AdaBoost, researchers and practitioners can harness the power of diverse base models to improve the reliability of predictions in complex spatial datasets.

In summary, our study highlights the synergy between ensemble learning, hyper-parameter optimization, and spatial regression modeling, offering a valuable framework for enhancing predictive accuracy and advancing the application of ML in spatial data analysis. These insights pave the way for more precise predictions and informed decisions in fields reliant on spatial concentration modeling.