Introduction

In vitro fertilization-embryo transfer (IVF-ET) has transformed the landscape of infertility treatment, offering hope to millions of couples worldwide. Advances in culture conditions have increasingly supported extending embryo culture to Day 5 or 6 (blastocyst stage), rather than the conventional Day 2 or 3 (cleavage stage), in an effort to enhance the live birth rates1. However, this strategy also raises concerns about the reduction in available embryos2 and its further impact on the cumulative success rate of each IVF cycle3,4 due to suboptimal culture conditions in vitro. Consequently, despite the potential benefits of extended culture, its overall safety and efficacy warrant careful scrutiny5.

To enhance the safety, some broad recommendations have been offered: American Society for Reproductive Medicine (ASRM) supports blastocyst culture for good-prognosis patients2and National Collaborating Centre for and Children’s (NICE) advocates cleavage-stage transfer for patients with limited embryos6. However, these recommendations focus on singular or narrowly defined factors. There is a wide variation (0-100%) in blastocyst formation rates among patients7which heightens the risk of losing potentially viable embryos. In practice, multiple patient-specific prognostic elements must be evaluated simultaneously. There is a clear clinical need for a cycle-based model to predict blastocyst yield, which would support individualized risk-benefit assessments and enhance the understanding of patient heterogeneity, ultimately improving the precision of extended embryo culture strategies.

Existing cycle-based models have been limited to binary prediction of complete culture failure8,9,10. These tools overlook the availability of surplus embryos after transferring a viable blastocyst, an essential consideration when determining whether to pursue extended embryo culture1.Although few studies have identified several potential determinants of blastocyst yield (e.g., oocyte number, day-3 embryo quality) using traditional statistical methods11,12complex interactions among these parameters may limit the predictive accuracy of such models. Machine learning, known for its ability to capture nonlinear relationships13has not yet been applied to blastocyst yield prediction. To address this gap, we aimed to develop and internally validate machine learning models to quantitatively predict blastocyst yields.

We set out to create models that are clinically acceptable in performance, methodologically transparent, and reliable for supporting decisions on extended embryo culture. Following the TRIPOD + AI guidelines for clinical prediction model reporting14we analyzed over 9,000 IVF/intracytoplasmic sperm injection (ICSI) cycles using a structured and transparent methodology. Potential clinical predictors were incorporated to establish the initial feature set, and the dataset was randomly split into training and testing subsets. We trained three machine learning models alongside a baseline linear regression model using backward feature selection, iteratively removing the least informative features from the maximal set. Internal validation was performed on the testing set with multiple performance metrics to assess robustness in our center.

Results

Dataset characteristics

A total of 9,649 cycles were included in our dataset, of which 3,927 (40.7%) produced no usable blastocysts, 3,633 (37.7%) yielded one or two usable blastocysts, and 2,089 cycles (21.6%) resulted in three or more usable blastocysts. This dataset was randomly split into training and test sets (see Methods). Potential features and cycle-based outcome for both sets are summarized in Table 1.

Table 1 Patient and cycle characteristics.

Model evaluation

We performed model-RFE analysis to identify the optimal feature subset and compare the performance of the models (Fig. 1). The RFE results showed that all models maintained stable performance with 8 to 21 features. A sharp decline in R2 values occurred during the final phase of feature selection, when the number of features reduced to 6 or fewer. The MAE curves followed a similar trend. Table 2 provides a comparative summary of the optimal performance metrics across all models.

Fig. 1
figure 1

Performance comparison of machine learning models using recursive feature elimination (RFE). The figure illustrates the impact of RFE on model performance across four machine learning algorithms: Light Gradient Boosting Machine (LightGBM), Linear Regression (LR), Support Vector Machine (SVM), and Extreme Gradient Boosting (XGBoost). Features are systematically eliminated from 21 down to 2. The top panel presents the test R2 (coefficient of determination), where higher values indicate better model fit, while the bottom panel displays the test Mean Absolute Error (MAE), where lower values represent better prediction accuracy.

Table 2 Performance of models.

The three machine learning models had remarkably similar patterns, achieving optimal performance with approximately 8 to 11 features, reaching R2 values of 0.67–0.68 and MAE values of 0.79–0.81. In contrast, the linear regression model consistently underperformed, with an R2 of 0.59 and an MAE of 0.94. Among the three machine learning models with comparable performance, LightGBM was selected as the preferred choice due to its use of fewer features (8 versus 10–11 for SVM and XGBoost), which reduces overfitting risk and enhances simplicity for clinical application. It also offers greater interpretability than the complex kernel transformations used in SVM. Overall, LightGBM provides the best balance of accuracy, practicality and interpretability, making it well-suited for clinical decision support.

Subgroup characteristics by blastocyst yield

Given that poor prognosis patients face more urgent decision-making dilemmas during extended culture due to inherently lower blastocyst yields, we focused our subgroup analysis in this population. Figure 2 illustrates the three-class distribution of predicted versus actual blastocyst yields (0, 1–2, and ≥ 3 blastocysts) in the overall test sets and specific subgroups, with confusion matrices presented as bar plots. In the overall cohort, the predicted yields are noticeably skewed in favor of the actual yields. However, in the subgroups, this trend varies when actual yields are ≥ 3, as predictions tend to fall into the lower categories (0 and 1–2), suggesting that the model may underestimate yields in these subgroups. Fortunately, the proportion of actual yields ≥ 3 is relatively low in these subgroups, at 13.4% for advanced maternal age, 8.3% for poor embryo morphology, and 2.5% for low embryo count, compared to 21.6% in the overall test sets.

Fig. 2
figure 2

Distribution of predicted versus actual blastocyst yields in bins across the overall cohort and subgroups. The confusion matrices visualized as bar plots show the relationship between predicted and actual blastocyst yields (0, 1–2, and ≥ 3) for the overall cohort and three clinical subgroups. The plots illustrate the class imbalance and prediction patterns across different clinical scenarios, with notably skewed distributions in adverse subgroups.

Table 3 presents the comprehensive evaluation metrics of our model on three-class classification. For the overall cohort, it achieved an accuracy of 0.678 and a Kappa coefficient of 0.5. When compared to the overall cohort, the three subgroups exhibited small variation in accuracy, ranging from 0.675 to 0.71, while the Kappa coefficients showed a decrease, ranging from 0.365 to 0.472. Additionally, F1(0) scores increased, whereas the F1 scores for the 1–2 and ≥ 3 blastocyst cases declined.

Table 3 Model performance metrics across the overall cohort and subgroups.

Model interpretation

Feature importance analysis revealed the primary predictors of blastocyst yiled (Fig. 3A). The LightGBM model identified eight key features, with the number of extended culture embryos emerging as the most critical predictor (61.5%). Other predictors included Day 3 embryo-related metrics: mean cell number (10.1%), the proportion of 8-cell embryos (10.0%), the proportion of symmetry (4.4%), and mean fragmentation (2.7%), while Day 2 characteristics—the proportion of 4-cell embryos (7.1%)—also contributed substantially. Demographic and treatment-related factors, including female age (2.4%) and the number of 2PN embryos (1.7%), demonstrated relatively lower importance in predicting blastocyst development.

Fig. 3
figure 3

Feature importance and partial dependence analysis using LightGBM. (A) The bar plot reveals the relative importance of features in the LightGBM model, with values quantifying each feature’s proportional contribution to the model’s predictive performance. (B) Individual conditional expectation and partial dependence plots illustrate the nuanced effects of the top six features on blastocyst yields. Thirty gray lines track the prediction trajectories of 30 samples, illustrating how predictions dynamically shift as a specific feature varies while other features remain constant. The red line delineates the mean effect across all samples, providing a comprehensive view of each feature’s impact on model predictions.

Individual conditional expectation (ICE) and partial dependence plots (Fig. 3B) elucidated how the top six features modulated model predictions. The number of extended culture embryos, mean cell number (D3), proportion of 8-cell embryos (D3), proportion of symmetry (D3), and proportion of 4-cell embryos (D2) positively influenced blastocyst yield, while fragmentation negatively impacted it. Although these general trends were evident, substantial variability in individual predictions at specific feature values underscores that blastocyst yield results from a complex interplay of multiple factors rather than being determined by a single predictor.

Discussion

In this study, we developed machine learning models to quantitatively predict blastocyst yields in IVF cycles, demonstrating that machine learning algorithms significantly outperformed traditional linear regression. LightGBM emerged as the most effective model, achieving superior predictive performance with fewer features and offering enhanced interpretability compared to other models. We also identified the number of embryos in extended culture and early embryo morphology contributed substantially to the prediction of blastocyst yield. Our research provides a cycle-level perspective for personalized decision-making regarding embryo culture strategies, differing from previous tools that focused on binary predictions for individual embryo 15,16,17,18 or per cycle 8,9,10.

The standard evaluation across multiple models and the use of explainable approach here, address critical gaps in prior studies11,12,19and align with recent methodological recommendations14,20,21. Establishing statistical associations11,12does not imply the ability to make generalized predictions; therefore, evidence for prediction requires testing the model on data separate from that used to estimate its parameters22. While summing individual embryo probabilities, as suggested by Jiang et al., can also estimate blastocyst yield, this indirect mapping approach fails to identify key factors and struggles to fully utilize features that have been shown to have independent effects in cycle-level binary predictions, such as the proportion of high-quality embryos and female age8,9. This methodological shift not only enhanced performance but also provides biological insights, which are significant for health workers in applications23.

Regarding optimal model selection, various levels of assessment exist. First and foremost, accuracy is crucial; a method lacking accuracy is irrelevant, even if it is easy to understand24. Machine learning’s superior performance highlights the limitations of traditional linear regression in capturing complex, non-linear interactions inherent to blastocyst development25. Moreover, the close performance among LightGBM, XGBoost, and SVM suggests that we may be approaching the theoretical prediction limit with the current feature set. Further improvements of predictive accuracy may require the identification of novel predictors.

For high-performing models, we then focus on their interpretability– a critical ethical consideration in IVF practice21. For clinicians, using decision support without the underlying factors driving those decisions is difficult. Fewer biomarkers in treatment guidelines enhance clinicians’ comprehension, making the number of variables a key factor in assessing a method’s interpretability24,26. The understandability of methods is also a key aspect of explainability24. While both tree-based models, such as LightGBM, and kernel-based models, like SVM, can visualize their decision-making processes, SVMs tend to be less intuitive and harder to comprehend raising concerns about their visualizations27. LightGBM, which utilizes fewer features and is relatively easy to understand, strikes a balance between performance and interpretability, which is supported by a recent study predicting the number of oocytes retrieved28.

Our interpretability analysis provides a global view on the cohort using importance ranking and also instance-level explanations through ICE plots26. Importance ranking revealed the number of embryos for extended culture as the primary predictor of blastocyst yield. This highlights the importance of having an adequate number of embryos to optimize extended culture outcomes, a key focus of previous studies and guidelines aimed at avoiding extended culture failure6,29. However, other previously recognized predictors such as 2PN embryos11 and oocyte yield11,12did not emerge as significant factors in our model. Their strong correlation with the number of embryos for extended culture (r: 0.76–0.84) suggests they may influence blastocyst yield indirectly, with this number acting as a mediating factor.

Interestingly, partial independency plots revealed that blastocyst yields progressively increased with higher mean cell numbers in Day 3 embryos, even beyond the eight-cell stage. This finding highlights the developmental advantage of rapid-cleaving embryos during in vitro culture, consistent with prior studies30,31and warrants further investigation. Additionally, the proportion of four-cell embryos on Day 2 emerged as an independent predictor, aligning with recent research32,33. These findings suggest that current assessment protocols may undervalue early developmental kinetics34.

ICE plots provide natural direct effects by holding confounders other than exposures constant to isolate the effects caused by changes in exposure35. We observed that single-feature interventions elicit varied responses across individuals. For instance, increasing the number of embryos for extended culture to five resulted in predicted blastocyst yields ranging from 0 to 3 across cycles. Such variability underscores the importance of personalized interventions in clinical practice, moving beyond single-feature threshold approaches. Understanding the complex interplay of multiple factors is crucial for guiding future policy development and research directions.

The subgroup analysis revealed variations in our predictive model for clinically complex cases. While accuracy remained stable (68-71%), predictive consistency declined (kappa: 0.37–0.50). In the “low embryo numbers” subgroup, the model performed poorly in predicting cycles with ≥ 3 blastocysts (F1 score = 0.14) but achieved a high F1 score (0.8) in the 0-blastocyst category. This aligns with Bayes’ theory, which indicates that false alarm rates are high for events with low baseline occurrence rates (≥ 3 blastocysts account for only 2.5%, while 0 blastocysts account for 61.6%). Overall, due to the severe class imbalance in the subgroups, the model primarily demonstrates binary classification capabilities and struggles with nuanced high or low yield predictions. However, since having ≥ 3 blastocysts in subgroups is a clinically rare event, this does not undermine the model’s utility.

This study also has several limitations. The single-center retrospective design and the lack of evaluation on an independent external cohort may constrain the robustness of the model. Furthermore, the non-all-blastocyst-culture policy at our center introduced an inherent selection bias: by transferring or freezing the highest-quality embryos on Day 3, the remaining embryos available for extended culture were likely of lower developmental potential. This practice may have led to an underestimation of actual blastocyst yield compared to settings that pursue whole-cohort blastocyst culture. As a result, our dataset reflects only a subset of clinical workflows 36,37,38. Future prospective studies including all embryos produced would help develop more comprehensive and generalizable prediction models.

Despite these limitations, our study represents an important step toward personalized reproductive medicine by providing a quantitative framework for predicting blastocyst yields and identifying key biological determinants. Note that our framework is intended as a proof-of-concept, demonstrating that quantitative blastocyst yield prediction is feasible through tailored model development based on local clinical practices. By supporting individualized clinical decision-making and patient counseling, this approach may help reduce psychological stress associated with uncertainty in IVF outcomes. Moving forward, efforts to validate and refine this model across diverse clinical settings, including centers employing full-cohort blastocyst culture or using time-lapse technologies, will be essential. Additionally, integrating dynamic morphokinetic data and molecular biomarkers into predictive models could further enhance their accuracy and clinical applicability.

Methods

Study subjects

A retrospective study was conducted at the Center for Reproductive Medicine, Nanfang Hospital, Southern Medical University, China, between January 2016 and May 2022. The study was approved by the Institutional Review Board of Nanfang Hospital, as authorized by the Ethical Committee (approval number: NFEC-2024-326). Informed consent was waived due to the use of non-identifiable patient records.

The inclusion criteria were as follows: (1) Autologous IVF/ICSI cycles performed at our center without egg or sperm donation; (2) Cycles with embryos undergoing extended culture to day 5/6; (3) No restrictions on patient characteristics (e.g., female age), ovarian stimulation protocols (conventional antagonist, agonist, or other unconventional regimens), and gamete states (fresh or frozen). The exclusion criteria included: (1) Cycles recorded with missing blastulation outcomes; (2) Cycles that contained an exceptionally high number of embryos in extended culture (> 20). We initially collected a total of 9,857 blastocyst culture cycles. After applying exclusion criteria, we removed 128 cycles with incomplete outcome labels and 80 cycles with an unusually high number of embryos. Our final dataset included 9,649 cycles.

Treatment procedures

Most patients underwent personalized ovarian stimulation protocols using standard agonist or antagonist regimens, while a few received alternative stimulation approaches, such as minimal stimulation, individualized based on patient-specific characteristics including age, antral follicle count, menstrual cycle, body mass index (BMI), and anticipated ovarian response. Ovarian stimulation typically began between the 2nd and 5th days of the menstrual cycle or during the luteal phase. Patients received 75–375 IU of gonadotropin daily during the ovarian stimulation process. Recombinant FSH or urinary hMG (human menopausal gonadotropin) was administered for gonadotropin stimulation. The initial dosage was maintained for 4 days and then adjusted based on the patient’s follicular growth response and serum E2 levels. For antagonist protocols, GnRH antagonists were flexibly administered until the trigger day. Final oocyte maturation was triggered when at least one follicle reached 18 mm in diameter, using intramuscular human chorionic gonadotropin (hCG: 2000–10000 IU, Livzen, China; or 250 µg, Ovidrel, Merck-Serono, Switzerland), with or without Triptorelin (0.2 mg, Decapeptyl, Ferring, Switzerland). Oocyte retrieval occurred 34–36 h post-hCG administration under transvaginal ultrasound guidance.

Both conventional IVF and ICSI were employed as primary fertilization techniques. For a few patients with severe male factor infertility, percutaneous epididymal sperm aspiration or testicular sperm aspiration were additionally utilized to retrieve sperm for subsequent fertilization. Following insemination, oocytes were cultured in pre-equilibrated cleavage medium under mineral oil in incubators maintained at 37 °C, 6% CO2, and 5% O2 in a humidified atmosphere. On Day 3, the top embryos (with 73% being the top 1–2) was selected for fresh transfer or cryopreservation and the remaining embryos were extended cultured in blastocyst medium under the same conditions. This non-all-blastocyst-culture strategy aims to avoid having no embryos for transfer after a failed extended culture. Embryo development was monitored according to the Istanbul consensus timelines34with detailed records of cell number, fragmentation, and blastomere symmetry on Day 2 and 3. According to the Istanbul consensus, a good day-3 embryo was defined as having 8 equally-sized, mononucleated blastomeres with less than 10% fragmentation. Blastocyst formation was defined by the development of viable embryos suitable for transfer or cryopreservation. Usable blastocysts were classified as those embryos achieving a Gardner score of ≥ 3BC on Day 5 or 6. Blastocyst yield was calculated as the total number of usable blastocysts per cycle.

Feature selection

A comprehensive set of 21 features were selected from the database based on three criteria: (1) temporal relevance – only data collected before extended culture initiation were included; (2) literature-based predictors – twelve established factors such as female age, fertilization method, number of extended culture embryos8infertility cause, number of oocytes retrieved, and number of normally fertilized oocytes (2 pronuclei;11 were incorporated. Additionally, Day 3 embryo properties (e.g. proportion of 8-cell and > 8-cell embryos, mean cell number, proportion of < 10% fragmentation rate, mean fragmentation and proportion of symmetry) were included11,19; (3) expert-driven parameters – nine clinically relevant features such as patient demographics (male age, BMI, infertility type), stimulation protocols, total gonadotropin dosage, stimulation duration, and Day 2 embryo characteristics (e.g. proportion of 4-cell embryos, proportion of > 4-cell embryos and mean cell number) were selected.

Missing data were addressed through imputation methods, with mean imputation applied to continuous variables (e.g., female age, BMI) and mode imputation for categorical variables (e.g., fertilization method, stimulation protocol). All included variables demonstrated missing rates below 5%. Detailed definitions of the included features are provided in Supplementary Table S1, and Pearson correlation coefficients among numerical variables are shown in Supplementary Figure S1.

Recursive Feature Elimination (RFE), a wrapper-based feature selection method39was applied to enhance model performance and reduce dimensionality. RFE was conducted independently for each model, beginning with the full set of features. Feature importance was iteratively assessed, and the least significant features were systematically removed until only two features remained. Performance metrics for each model were recorded at each iteration across feature subsets of varying sizes. The optimal feature subset was identified by balancing model performance and feature parsimony. The highest R-squared (R2) value achieved across all feature subsets was first identified, and a tolerance threshold of 0.005 below this maximum was applied. Among all the feature subsets meeting this criterion, the one containing the fewest features was selected as optimal. This approach ensured the selection of a minimal feature set while maintaining near-optimal performance.

Model development

For model development, the dataset was randomly split into training (70%) and test (30%) sets, with the test set held out during training to ensure unbiased performance evaluation. Three machine learning models – Support Vector Machine (SVM), Light Gradient Boosting Machine (LightGBM) and Xtreme Gradient Boosting (XGBoost) – were evaluated alongside a linear regression baseline to capture linear and non-linear relationships, handle feature interactions, and mitigate overfitting.

Hyperparameter optimization was performed using grid search with 5-fold cross-validation40. The training set was divided into five equal parts, with four used for training and one for validation in each iteration. This process was repeated five times, allowing each subset to serve as validation data once. Performance metrics were averaged across all iterations, providing reliable hyperparameter evaluation and reduced overfitting risks. After each feature reduction in RFE, optimal hyperparameters were re-searched to ensure the best model performance with the updated feature set. The optimal hyperparameter configurations for each model, along with the corresponding optimal feature set, are detailed in Supplementary Table 2. Following optimization, each model was retrained on the full training set using the identified optimal hyperparameters and feature subset. Model performance was then assessed on the held-out test set to evaluate the generalization capacity.

Model evaluation

All analyses were performed using R Statistical Software (v4.4.1, RStudio 2023.9.1). Regression performance was measured using R2, mean absolute error (MAE), and root mean square error (RMSE). R2 quantifies the proportion of variance in the dependent variable explained by the model, with values ranging from 0 to 1, where higher values indicate better model performance. MAE represents the average absolute difference between predicted and actual values, with lower MAE indicating greater accuracy. RMSE, calculated as the square root of the mean squared differences between predicted and actual values, penalizes larger errors more heavily than MAE, making it more sensitive to outliers.

After consulting two experienced clinicians, it was determined that considering blastocyst yield in ranges, rather than as precise numbers, holds greater practical significance. Consequently, the predict values of the target variable were discretized into bins, resulting in a three-category system (0, 1–2, and ≥ 3 blastocysts). The three-class classification was evaluated using multiple metrics, including accuracy, Cohen’s kappa coefficient, and F1 scores. Accuracy reflects the proportion of correct predictions out of the total predictions made, with values ranging from 0 to 1, where higher values indicating better performance. However, as accuracy can be misleading in cases of imbalanced class distributions, Cohen’s kappa coefficient and class-specific F1 scores were also employed to provide a more comprehensive evaluation. Cohen’s kappa measures the level of agreement between predicted and actual categories while a numbering for the possibility of agreement occurring by chance. Kappa values range from − 1 to 1 and are interpreted as follows: >0.80 as excellent, 0.61–0.80 good, 0.41–0.60 moderate, 0.21–0.40 fair, 0-0.20 poor, and < 0 worse than random. The F1 score, ranging from 0 to 1, reflects the balance between precision and recall. It provides a harmonic mean of these two metrics, offering insight into the model’s performance across different classes. F1 scores above 0.9 are considered excellent, those above 0.8 very good, and scores exceeding 0.7 are deemed good. Class-specific F1 scores were calculated to assess performance across individual categories.