Introduction

Oral diseases are among the most common and preventable non-communicable diseases worldwide1. According to the World Health Organization’s (WHO) global oral health status report (2022), 75% of the population is affected by oral diseases, including dental decay and periodontal disease2. Tooth decay presents a significant global health concern, particularly among children and adolescents, who face increased susceptibility due to limited resources and awareness regarding dental care, notably in Low- and Middle-Income Countries (LMICs)3,4. Globally, dental decay affects 34.1% of adolescents aged 12–19 years3,4. Timely and accurate diagnosis is imperative for decay prevention, as it significantly influences prognosis5. However, data on the prevalence of dental decay in LMICs are limited. Although the WHO acknowledges dental decay as a significant health burden affecting 60–90% of children, it continues to be overlooked in these regions6,7. Prevalence data are essential for dental practitioners and policymakers to understand the burden of the disease and implement preventive measures8,9.

Although dentists can detect decay, socio-economic barriers to healthcare access and the scarcity of resources in low-income settings make Artificial Intelligence (AI)-based models a viable alternative, enabling machines to generate results comparable to those of humans4,10,11,12. Convolutional Neural Networks (CNNs) and other Deep Learning (DL) techniques employed in computer vision have demonstrated effectiveness in detecting, segmenting, and categorizing anatomical structures and pathologies across diverse image datasets13,14. Leveraging CNNs for diagnosing dental decay from intraoral photographs presents a potentially cost-effective and accessible means to enhance oral healthcare15. Multiple studies have showcased the efficacy of DL models in detecting and categorizing decay in intraoral images16,17,18,19,20. Researchers have utilized various DL models, including the YOLO, R-CNN, SSD-MobileNetV2, ResNet, and RetinaNet algorithms, for the detection of dental decay18,21,22,23,24. Transformer-based models have also been employed for similar purposes; however, their extensive computational requirements have limited their widespread exploration in dental imaging25.

These studies also lack the integration of Explainable AI (XAI) methods. XAI elucidates the decision-making process of CNNs, which is not always transparent. Methods such as GradCAM and EigenCAM aid in comprehending the rationale behind AI decisions, thereby supporting the validation of model outputs26.

Additionally, despite considerable efforts by researchers to develop algorithms for decay detection, these advancements are seldom translated into deployable solutions4,27. Hence, this study aimed to assess the effectiveness of a CNN-based model in identifying dental decay using clinical photographs and to subsequently adapt it into a smartphone-based application28. This open-source app has the potential to be used by community health workers and similar users to assess decay prevalence in populations28. Additionally, a direct comparison between the performance of AI and dentists is needed to determine the real-life applicability of such an application.

Materials and methods

Study design and sample size

This validation study was conducted at the Aga Khan University Hospital (AKUH), Karachi, Pakistan, following the STARD-AI guidelines (ERC#: 2023-9434-27025). All experimental methods were performed in accordance with this checklist and were approved by the Ethical Review Committee (ERC) of AKUH prior to the commencement of the study. The dataset utilized in this study was previously collected during another study, for which informed consent and assent were obtained from the participants (ERC#: 2021-5943-16892). Around 6% of the images in the dataset were obtained prospectively from the adolescent population in Karachi. Intraoral pictures of both primary and secondary dentitions were captured on mobile phones to incorporate images with varying resolutions and thereby increase the diversity of the dataset, which comprised a total of 7,465 intra-oral images. The methodology is summarized in Fig. 1.

Fig. 1
figure 1

Depiction of the workflow of training the AI model for detecting decay on intra-oral photographs as well as the comparison with junior dentists to determine its real-life applicability.

Inclusion criteria

Intra-oral images of adolescents and young adults of both genders, between 8 and 24 years of age, were included.

Exclusion criteria

Intra-oral images of patients with developmental dental anomalies, tetracycline staining, cleft lip/palate and oral pathologies were excluded.

Dataset preparation

The intra-oral images were transferred to and visually analyzed by two dentists, each with more than two years of experience, on an HP Desktop Pro (G3 Intel(R) Core(TM) i5-9400, built-in Intel(R) UHD Graphics 630, with an HP P19b G4 WXGA monitor, 1366 × 768). The annotators were calibrated on the annotation task prior to the commencement of the study and were trained to localize and annotate/label all carious teeth, if present, in the intraoral images. The annotation process was carried out in an open-source annotation tool, ‘LabelMe V5.4.1’ (Massachusetts Institute of Technology-MIT, Cambridge, Massachusetts, United States)29. The annotators used bounding boxes to record the height, width, and corner coordinates of the box around each tooth with visible decay (labelled ‘D’)2. These annotations were subsequently verified by two senior dentists with at least four years of postgraduate experience, and any conflicts were resolved through discussion. Cohen’s kappa statistics revealed inter- and intra-examiner reliability of 0.80. These annotations were employed for training the models.

The annotated dataset comprised images and their corresponding JavaScript Object Notation (JSON) files. The JSON files contained the ___location of decay in each intraoral picture and were converted into the standardized YOLO format using Python code for training the AI model30. A 70:20:10 ratio was used to provide sizable training, validation, and testing datasets.
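Although the authors’ conversion script is not reproduced here, the step can be illustrated with a minimal sketch. The snippet below assumes LabelMe-style JSON files (rectangle shapes labelled ‘D’ with pixel-coordinate corner points) and writes one YOLO-format label file per image (class x_center y_center width height, normalised to the image dimensions); the directory names are hypothetical.

```python
import json
from pathlib import Path

LABEL_MAP = {"D": 0}  # decay is the only class in this sketch

def labelme_to_yolo(json_path: Path, out_dir: Path) -> None:
    """Convert one LabelMe JSON file into a YOLO-format .txt label file."""
    data = json.loads(json_path.read_text())
    img_w, img_h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data.get("shapes", []):
        if shape.get("shape_type") != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # Normalise the box centre and size to [0, 1], as YOLO expects.
        xc = (x1 + x2) / 2 / img_w
        yc = (y1 + y2) / 2 / img_h
        w = abs(x2 - x1) / img_w
        h = abs(y2 - y1) / img_h
        cls = LABEL_MAP[shape["label"]]
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    # Images with no decay yield an empty label file (a 'healthy' example).
    (out_dir / (json_path.stem + ".txt")).write_text("\n".join(lines))

if __name__ == "__main__":
    out = Path("labels"); out.mkdir(exist_ok=True)
    for jp in Path("annotations").glob("*.json"):
        labelme_to_yolo(jp, out)
```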

Deep learning models and training

The YOLOv5s model was initialized from the pretrained ‘coco.pt’ checkpoint acquired from GitHub and trained for 200 epochs with a batch size of 6031. The dataset consisted of 7,465 images in total, of which 1,799 contained disease and the rest did not. These images were a mix of five different views, namely frontal, right buccal, left buccal, maxillary, and mandibular. The dataset was split into 5,226, 1,493, and 746 images for training, validation, and testing, respectively, keeping a fixed ratio of 70:20:10. The overall process was performed on a T4 (15 GB) GPU on Google Colab32. The training time of the algorithm was estimated at 413 min at a rate of 124 s/epoch, with a total estimated carbon footprint of 0.27 kg CO2 eq. The performance of the algorithm was evaluated using the YOLO validation script.
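As a rough sketch of how such a run can be reproduced with the YOLOv5 repository’s standard train.py and val.py entry points: the dataset YAML name, image size, and directory layout below are assumptions, not the authors’ exact configuration (the official COCO-pretrained checkpoint is normally distributed as yolov5s.pt rather than coco.pt).

```python
import subprocess
from pathlib import Path

# Hypothetical dataset config pointing at the converted YOLO-format labels,
# written inside a clone of github.com/ultralytics/yolov5.
Path("yolov5/caries.yaml").write_text(
    "train: ../datasets/caries/images/train\n"
    "val: ../datasets/caries/images/val\n"
    "test: ../datasets/caries/images/test\n"
    "nc: 1\n"
    "names: ['D']\n"
)

# Train for 200 epochs with batch size 60, as reported above.
subprocess.run(
    ["python", "train.py", "--data", "caries.yaml", "--weights", "yolov5s.pt",
     "--epochs", "200", "--batch-size", "60", "--img", "640"],
    check=True, cwd="yolov5",
)

# Evaluate on the held-out test split with the YOLO validation script.
subprocess.run(
    ["python", "val.py", "--data", "caries.yaml",
     "--weights", "runs/train/exp/weights/best.pt", "--task", "test"],
    check=True, cwd="yolov5",
)
```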

Additionally, a Detection Transformer (DeTR) was fine-tuned from the pretrained ‘facebook/detr-resnet-50’ model acquired from HuggingFace for 100 epochs with a batch size of 428. This dataset contained the 1,799 labeled images, split into 1,275, 361, and 163 images for training, validation, and testing, respectively, keeping an approximately fixed ratio of 70:20:10. The overall process was performed on an L4 (22.5 GB) GPU on Google Colab32. The training time of the model was estimated at 385 min at a rate of 232 s/epoch, with a total estimated carbon footprint of 0.26 kg CO2 eq. The performance of the algorithm was evaluated using the COCO evaluator.
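A minimal sketch of the fine-tuning step with the HuggingFace transformers API is shown below, assuming a single ‘decay’ class to match the ‘D’ label; the optimiser settings and the training_step helper are illustrative, not the authors’ exact code.

```python
import torch
from transformers import DetrForObjectDetection, DetrImageProcessor

# Load the pretrained checkpoint and re-head it for a single 'decay' class.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50",
    num_labels=1,
    ignore_mismatched_sizes=True,  # replaces the 91-class COCO head
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(image, boxes_cxcywh):
    """One optimisation step; boxes are normalised (cx, cy, w, h) tensors."""
    inputs = processor(images=image, return_tensors="pt")
    labels = [{
        "class_labels": torch.zeros(len(boxes_cxcywh), dtype=torch.long),
        "boxes": boxes_cxcywh,  # shape (num_boxes, 4), values in [0, 1]
    }]
    outputs = model(pixel_values=inputs["pixel_values"], labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```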

XAI implementation

After training the model, EigenCAM was used to identify the image areas relevant to the decision-making of YOLOv5s by generating heat maps that highlight significant regions. As shown in Fig. 2, the EigenCAM heatmaps adequately localize the decay in the intra-oral photographs, as evidenced by their position and size.
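As a rough sketch of how such heatmaps can be produced, the snippet below follows the pytorch-grad-cam library’s published YOLOv5 example; the weights path (‘best.pt’), input image, and choice of target layer are illustrative assumptions rather than the authors’ exact configuration.

```python
import cv2
import numpy as np
import torch
from pytorch_grad_cam import EigenCAM
from pytorch_grad_cam.utils.image import show_cam_on_image

# Load fine-tuned YOLOv5 weights via torch.hub (path is hypothetical).
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")
model.eval()

# Read and normalise an intra-oral photograph.
img = cv2.cvtColor(cv2.imread("intraoral.jpg"), cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (640, 640))
rgb = np.float32(img) / 255.0
tensor = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

# EigenCAM projects the target layer's activations onto their first
# principal component, so no class-specific gradients are required.
target_layers = [model.model.model.model[-2]]  # per the library's YOLOv5 tutorial
cam = EigenCAM(model, target_layers)
grayscale_cam = cam(tensor)[0, :]
heatmap = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)
cv2.imwrite("eigencam_overlay.jpg", cv2.cvtColor(heatmap, cv2.COLOR_RGB2BGR))
```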

Fig. 2
figure 2

The results of the application of XAI on images to determine the decision-making processes of YOLOv5s model.

Sample size calculation for direct comparison of AI with junior dentists

The sample size for a direct comparison between the AI and dentists on the decay detection task was calculated for a paired-samples t-test comparing the accuracy of the two, based on a study by Mertens et al.33 A sample size of 251 teeth was required to detect at least a 10% absolute difference in accuracy, assuming a standard deviation of 0.4, 80% power, and a 5% alpha level. This sample size was then inflated by a design effect of 2.8, derived from a cluster size of 10 and an intraclass correlation coefficient (ICC) of 0.2, to account for the clustered design with multiple teeth per photograph. Hence, the overall number of teeth to be assessed was 703, corresponding to 70 photographs (each containing 10 teeth). The sample size was calculated using STATA version 18.
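The adjustment follows the standard design-effect formula for clustered observations; substituting the reported cluster size and ICC reproduces the stated figures:

\[
\mathrm{DE} = 1 + (m - 1)\rho = 1 + (10 - 1)(0.2) = 2.8,
\qquad n_{\mathrm{adj}} = 251 \times 2.8 \approx 703,
\]

where m is the number of teeth per photograph and ρ is the ICC.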

To assess the performance of the AI and junior dentists in decay detection on intra-oral photographs, the ground truth had to be established first. To serve as the gold standard, two subject experts with over four years of postgraduate clinical experience labeled a total of 70 unseen images for decay. The performance of the AI model and of junior dentists with at least two years of clinical experience was gauged against these labels.

Results

Comparison between DeTR and YOLOv5s

The performance comparison between the transformer-based DeTR and the CNN-based YOLOv5s is detailed in Table 1. DeTR exhibited suboptimal performance, with a sensitivity of 34.4% and a precision of 26.9%, resulting in an F1 score of 30.1%. Conversely, the CNN-based YOLOv5s outperformed DeTR, achieving a precision of 90.7%, a sensitivity of 85.6%, and an F1 score of 88.0%.
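As a check on the reported metrics, the F1 score is the harmonic mean of precision (P) and sensitivity (R); substituting the YOLOv5s values reproduces the figure above:

\[
F_1 = \frac{2PR}{P + R} = \frac{2 \times 0.907 \times 0.856}{0.907 + 0.856} \approx 0.880.
\]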

Table 1 Comparison of performances of both AI models: DeTR and YOLOv5s.

Direct comparison of YOLOv5s and junior dentists

Table 2 provides a performance evaluation comparing junior dentists in decay detection to the trained YOLOv5s algorithm. The results indicate that junior dentists achieved a sensitivity of 64.1%, a precision of 83.3%, and an F1 score of 72.4%. In contrast, the YOLOv5s algorithm outperformed the junior dentists, with a sensitivity of 67.5%, a precision of 84.3%, and an F1 score of 75.0%. The significance level was maintained at p ≤ 0.05 for all statistical analyses. A chi-squared test revealed no significant difference in the performance metrics of the two groups, indicating that the AI and humans performed comparably in this task (p = 0.157). The area under the Receiver Operating Characteristic (ROC) curve was 0.79 (CI 0.69–0.86) for the AI and 0.76 (CI 0.68–0.86) for the junior dentists, as shown in Fig. 3.

Table 2 Comparison of performances of junior dentists and YOLOv5s.
Fig. 3
figure 3

The ROC of the performance of the AI model compared with that of junior dentists.

Deployable application

Owing to its superior results compared with the junior dentists, the YOLOv5s algorithm was then translated into a smartphone-based application; further details of the app development are provided in the linked GitHub repository28. This application can be used by healthcare workers to determine the disease prevalence of populations.
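The authors’ exact deployment pipeline lives in the linked repository; one common route for putting YOLOv5 weights on a smartphone, shown below purely as an illustrative possibility, is the repository’s own export script, which converts the trained checkpoint to a mobile-friendly format such as TFLite.

```python
import subprocess

# Export the trained checkpoint to TensorFlow Lite for on-device inference.
# (Illustrative only; the authors' actual pipeline is in their repository.)
subprocess.run(
    ["python", "export.py",
     "--weights", "runs/train/exp/weights/best.pt",
     "--include", "tflite",
     "--img", "640"],
    check=True, cwd="yolov5",
)
```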

Discussion

This study assessed the effectiveness of an object detection model (YOLOv5s) in identifying decay on clinical photographs and its deployment into a smartphone application. The significance of this study lies in the potential of this open-sourced application that can be widely implemented to assess decay prevalence in populations, making dental health surveillance more accessible. Moreover, the comparison between AI and human performance in detecting dental decay also provides valuable insights into the capabilities and limitations of AI in deployment.

To develop a deployable algorithm for decay detection on intra-oral photographs, the authors of this study employed the YOLOv5s algorithm. This algorithm specializes in object detection and has the advantage of allowing simultaneous classification and localization of multiple objects in image datasets34. These algorithms are also capable of operating in real-time on high-resolution images, contributing to their widespread use across many fields, including dentistry34. In our study, this algorithm showed superior performance, with a precision of 90.7%, a sensitivity of 85.6%, and an F1 score of 88.0%. This indicates that the model detects decay with higher accuracy than other reported object-detection studies for decay identification18,35,36. A study by Thanh et al. employed the YOLOv3 algorithm, achieving a sensitivity of 74% for decay detection in intra-oral images35. The improvement in the current study may be attributable to its larger dataset of 7,465 images, compared with the 2,652 intra-oral images used in their study, making the model more robust. A study by Kim et al. developed a decay detection smartphone application similar to that of the authors of this study18. In their study, the YOLOv3 algorithm had a mean Average Precision (mAP) of 94.46%, which is comparable to the current study18.

Another notable distinction of this study is the direct comparison of the performance of the trained YOLOv5s algorithm with that of dentists. Interestingly, the authors noted that the trained YOLOv5s identified decay better than the junior dentists. The sensitivity of decay detection by the algorithm was 67.5%, compared with 64.1% for the junior dentists, although the difference in performance was not statistically significant. A study by Alam et al. found that the sensitivity of decay detection on clinical examination by humans was 86%, compared with 92% for the trained AI algorithm, with no significant difference between the groups37. The authors can conclude that the trained AI model in the current study performs at least as well as a junior dentist with at least two years of clinical experience. This validates its potential use as a decay detection application in community settings where access to trained dentists is scarce.

An additional key strength of this study is the implementation of XAI techniques for rationalizing the performance of the YOLOv5s algorithm, distinguishing it from other reported studies18,35,36. The authors employed EigenCAM to generate heatmaps localizing the area of interest for the algorithm, leading to the identification of decay in the intra-oral photographs. This transparency allows researchers and practitioners to understand the AI’s rationale, ensuring trustworthy and validated decisions38. XAI techniques like EigenCAM mitigate the black-box nature of AI, transforming it into a more transparent tool38. This fosters trust and provides valuable insights for refining AI models. By clearly identifying the features considered by the AI, researchers can assess accuracy, identify biases, and improve performance. Only one previous study utilized XAI (GradCAM) for a similar problem39. However, the literature indicates that EigenCAM has faster processing speeds and produces more precise, well-highlighted visual explanations than GradCAM40.

The authors of this study also experimented with the recently introduced DeTR algorithm for the object detection task, aiming to achieve comparable if not better results, but encountered several challenges. Transformer-based algorithms are resource-intensive and perform poorly when the objects to be detected occupy a small portion of the image, as is the case for a decay lesion within a full intra-oral photograph41. Additionally, this algorithm requires a larger number of training images than the YOLOv5s algorithm42. A study by Jiang et al. utilized a transformer in their model for decay detection, opting to incorporate only the transformer backbone rather than the entire DeTR framework to mitigate computational costs43. In contrast, this study used the full DeTR framework and observed poor performance, suggesting that using only a transformer backbone may not only reduce computational costs but also improve the model’s performance in detecting decay. Another hindrance encountered during training of the DeTR algorithm was its inability to process images with no disease: because images without decay lesions had empty annotation files, the algorithm excluded these ‘healthy’ images entirely, so the dataset used to train it contained only examples of ‘diseased’ teeth. This led to class imbalance and an inaccurate representation of disease prevalence in the training dataset. A further disadvantage of the DeTR was its incompatibility with the XAI techniques used here, so the rationale for its results could not be determined. Finally, since the goal was deployment as an open-source application, the DeTR’s higher computational needs preclude translation to edge devices42. For this reason, even had the algorithm performed better, the authors would not have recommended its deployment for determining the prevalence of disease/decay.

This study has limitations: the authors could only successfully execute the decay detection task, as identifying fillings and missing teeth was not possible due to the limited number of such examples in the dataset. This may be due to the sample population, as the images were predominantly collected from rural areas of Pakistan where dental care is limited. Consequently, there were very few examples of filled teeth, and the number of images of missing teeth was also low, presumably because the included adolescents were still undergoing dental eruption.

In the future, the authors plan to utilize this application to assess disease prevalence across various populations, particularly in low-resource settings. This approach will enable the estimation of disease burden in rural areas, potentially guiding healthcare resources to these regions for treatment. Moreover, the app will be used to capture images from urban populations, and the algorithm will be further trained on these images to include the detection of filled and missing teeth. The authors aim to improve the application’s generalizability, enabling it to be used for DMFT scoring by community dentists across different populations. The authors also acknowledge that diagnosing decay involves clinical, tactile, and radiographic examinations. Therefore, while the application can identify areas of concern, a dentist, albeit remote, may need to corroborate the findings to confirm the presence of decay.

Conclusion

The YOLOv5s algorithm trained in this study demonstrates strong performance in detecting dental decay on intra-oral photographs, as evidenced by results comparable to those of junior dentists. When deployed, the application will assist in determining the caries index of populations, enabling the assessment of disease prevalence. Subsequently, necessary measures can be implemented to reduce the disease burden at the community level.