Figure 5

From: Machine-learning-guided recognition of α and β cells from label-free infrared micrographs of living human islets of Langerhans

Supervised-learning results from four different models. (a) After the feature matrix and target vector are created, the data undergo several preprocessing steps to enhance the performance and stability of classification. The process starts with manual cleaning: only cells with clearly defined identities are retained, excluding over a thousand cells and leaving a cleaned dataset of 861 cells. Preprocessing then includes encoding categorical features, handling missing values, handling outliers, and scaling the data. The dataset is rebalanced using SMOTE (Synthetic Minority Oversampling Technique) and split into training and test sets; after SMOTE, the training set comprises 970 cells and 151 features. Before training, cross-validation and hyperparameter tuning are performed to obtain a stable, high score, and the model is then evaluated on the test set, which can be regarded as new, unseen data (these steps are sketched in the first code example below).

(b) Four algorithms are tested and compared: multivariate logistic regression, a boosted decision tree (XGBoost), a support vector machine (SVM) classifier, and a k-nearest-neighbors (KNN) binary classifier. Each algorithm is optimized over the most common hyperparameter ranges using grid search (see the second sketch below).

(c) Evaluation of precision, recall, F1 score, and the area under the ROC curve (AUC) shows that XGBoost is the most promising algorithm in terms of classification performance and stability. XGBoost is further optimized with Optuna, which allows a wider hyperparameter range to be searched to improve its performance (see the final sketch below).
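A minimal sketch of the preprocessing pipeline described in panel (a). The file name, column names, split ratio, and the imputation and outlier strategies are illustrative assumptions, not details taken from the paper; SMOTE is applied after the split, consistent with the caption's note that the post-SMOTE training set contains 970 cells.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

df = pd.read_csv("islet_features.csv")        # hypothetical cleaned dataset (861 cells)
X = df.drop(columns=["cell_type"])            # 151 features per cell
y = (df["cell_type"] == "beta").astype(int)   # encode the binary target

# Hold out a test set BEFORE oversampling so it stays genuinely unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Impute missing values, clip outliers, and scale, fitting on training data only.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train, X_test = imputer.transform(X_train), imputer.transform(X_test)
lo, hi = np.percentile(X_train, [1, 99], axis=0)
X_train, X_test = np.clip(X_train, lo, hi), np.clip(X_test, lo, hi)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Rebalance the minority class in the training set only.
X_train, y_train = SMOTE(random_state=0).fit_resample(X_train, y_train)
```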
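A minimal sketch of the model comparison in panel (b), continuing from the variables above: each of the four classifiers is tuned with cross-validated grid search. The specific grids shown here are common illustrative ranges, not the ones reported in the paper.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "xgboost":  (XGBClassifier(eval_metric="logloss"),
                 {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}),
    "svm":      (SVC(probability=True), {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}),
    "knn":      (KNeighborsClassifier(), {"n_neighbors": [3, 5, 9, 15]}),
}

best = {}
for name, (estimator, grid) in models.items():
    # 5-fold cross-validated grid search on the training data only.
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc", n_jobs=-1)
    search.fit(X_train, y_train)
    best[name] = search.best_estimator_
    print(f"{name}: CV AUC = {search.best_score_:.3f}")
```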
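A minimal sketch of panel (c), again continuing from the snippets above: the tuned models are scored on the held-out test set with the four named metrics, and XGBoost is then re-tuned with Optuna over a wider space. The Optuna search space below is an assumption for illustration only.

```python
import optuna
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Held-out evaluation of each grid-searched model.
for name, model in best.items():
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name,
          f"precision={precision_score(y_test, pred):.3f}",
          f"recall={recall_score(y_test, pred):.3f}",
          f"F1={f1_score(y_test, pred):.3f}",
          f"AUC={roc_auc_score(y_test, proba):.3f}")

# Wider hyperparameter search for XGBoost with Optuna.
def objective(trial):
    params = {
        "max_depth": trial.suggest_int("max_depth", 2, 12),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.5, log=True),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(eval_metric="logloss", **params)
    return cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print("best CV AUC:", study.best_value, "params:", study.best_params)
```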
