Introduction

Pain is an unpleasant sensory and emotional experience associated with, or resembling that associated with, actual or potential tissue damage1. It has been reported that approximately 20% of patients worldwide suffer from pain2. Critically ill children, compared with other pediatric populations, are particularly vulnerable to pain because of their medical or surgical conditions, compounded by the many medical procedures they must frequently and unavoidably undergo3. Research findings from 66 Chinese medical institutions revealed that pain management in China remains suboptimal, with a relatively high prevalence of pain among hospitalized children of all age groups, especially those who are critically ill4. Pain experienced during hospitalization can result in short-term adverse effects, such as increased rates of delirium and decreased sleep quality. In the long term, it may affect children’s healthcare compliance and other health behaviors, causing significant negative consequences5.

Recognizing pain and accurately assessing its intensity are fundamental to effective pain management in critically ill children. While self-reporting is considered the most reliable method for pain assessment, it is often infeasible for critically ill children who cannot, or will not, verbally communicate their pain to healthcare providers. In such cases, observational measures based on facial expressions and behavioral cues are essential6,7,8,9. Unfortunately, there is still no universal pain assessment tool that can be used for all children. Observational pain assessment tools, in particular, face challenges such as variability among assessors and the time constraints that regular assessments impose on staff.

Recently, with the rapid development of artificial intelligence and computer vision technology, the accuracy of image-based facial expression recognition has continued to improve. Algorithms based on facial image analysis show promise for assessing pain in children within intensive care settings10,11,12,13. High-quality, accessible datasets are essential for training these algorithms, yet such resources remain scarce. Several adult pain datasets exist. The UNBC-McMaster Shoulder Pain Archive documents 129 adults with chronic shoulder pain during movement exercises, featuring video recordings with facial action coding and self-reported pain scores14. The BioVid Heat Pain Database captures physiological signals (ECG, EMG, EDA) and facial videos from 90 healthy adults under controlled heat stimuli, including pain thresholds and subjective ratings15. Similarly, the X-ITE Pain Database from the University of Magdeburg contains comparable physiological measurements from 134 participants across varied pain stimuli (thermal, electrical, and pressure), with comprehensive pain annotations. These resources maintain ethical research access protocols16.

The most notable neonatal pain facial expression datasets include the facial expression of neonatal pain (FENP) dataset, the classification of pain expressions (COPE) dataset, the infants’ pain assessment dataset (IPAD), and the acute pain in neonates dataset (APN-db). The FENP dataset17, introduced by the Nanjing University of Posts and Telecommunications, contains 11,000 facial expression images of 106 Chinese neonates from two children’s hospitals. The images are classified into four levels: severe pain, mild pain, crying, and calmness, with 2,750 images per category. The COPE dataset18 contains 204 images captured from 26 healthy infants aged between 18 h and 3 days undergoing stress- or pain-inducing stimuli. The images are labeled as rest, cry, air stimulus, friction, and pain; the dataset lacks pain intensity ranking and offers limited information because of its small sample size. The IPAD dataset19 originates from 31 neonates of various ethnic backgrounds, including Caucasians, African Americans, and Asians, who were admitted to the neonatal intensive care unit (NICU). It encompasses facial expressions, body movements, and vocalizations observed during medical procedures such as heel-prick blood sampling. The APN-db dataset20 was compiled in NICU and vaccination departments and comprises 213 videos of newborns and infants, aged 0 to 6 months, undergoing clinical procedures that trigger facial expressions of pain. Among the neonatal pain datasets mentioned, only the COPE dataset is publicly accessible.

Despite these valuable contributions to pain assessment research, significant limitations persist in applying existing datasets to the pediatric critical care environment. The challenges include the wide age range, from 1 to 18 years, which results in greater diversity of facial expressions; the presence of multiple medical tubes, which partially obstruct facial images; and the complex, real-world collection environment. With the help of healthcare workers at the Children’s Hospital of Fudan University, we collected data on children’s pain facial expressions in the pediatric intensive care unit (PICU) and cardiac intensive care unit (CICU) of the hospital over thirteen months. We have built the Pain Facial Expressions of Critically Ill Children (PFECIC) dataset, a large dataset of facial expressions of critically ill children experiencing procedural pain, which currently includes 119 videos and 6,698 images. We hypothesize that an algorithm model trained on the PFECIC dataset will demonstrate better accuracy and generalization performance, indicating PFECIC’s superior usability and comprehensiveness.

Methods

This study was approved by the institutional review board of the Children’s Hospital of Fudan University (No. 2023-151), and all methods were carried out in accordance with the Declaration of Helsinki. Informed consent was obtained from a legal guardian for study participation and portrait rights usage authorization. To protect personal privacy, sensitive information such as the names, dates of birth, and diagnoses of the participating children was removed. The ten guiding principles jointly identified by the US Food and Drug Administration (FDA), Health Canada, and the United Kingdom’s Medicines and Healthcare products Regulatory Agency (MHRA) were followed21.

PFECIC dataset construction

Data collection environment

We collected videos and images of children’s facial expressions related to pain experienced while undergoing medical procedures in the PICU and CICU at the Children’s Hospital of Fudan University. Participants were recruited from children admitted to the PICU and CICU. The inclusion criteria were as follows: age between 28 days and 18 years; scheduled to undergo only one of the listed medical procedures under non-emergent conditions; and written informed consent obtained for both study participation and portrait rights usage authorization. Exclusion criteria included children in deep sedation, in prone positions, or with more than one-third of the face covered.

Data collection equipment

The equipment for collecting children’s facial expression videos and images was one Hikvision surveillance camera (1920 × 1080, 60 Hz) with a fixed bracket. The camera was installed directly above the head of the child’s bed, with the lens pointing vertically downward and directly facing the child’s face, as shown in Fig. 1. The shooting angle was limited to 30 degrees, ensuring that the face occupied at least half of the frame. Two staff members from the marketing department were designated as video recorders.

Fig. 1
figure 1

Detailed 3D visualization of facial expression video collection equipment setup. Created using SketchUp Pro 2014 (version 14.0490, https://www.sketchup.com/zh-cn).

Video collection process

The PFECIC dataset contains five facial expression statuses classified according to the ‘facial muscles’ category of the COMFORT behavior scale. To collect the corresponding data, we first needed to identify suitable scenarios in real PICU or CICU environments. Consequently, data on the frequency of painful medical procedures performed in the PICU and CICU from January 1, 2019, to December 31, 2019, were extracted to identify the common painful procedures; pre-COVID-19 data were used so that the findings would reflect routine practice patterns. Additionally, interviews were conducted with ten senior nurses and two advanced practice nurses (APNs) working in the PICU or CICU to enrich the information. Ultimately, the seven most common painful medical procedures were identified: nebulization suction, tracheal suction, surgical debridement/dressing change, peripheral venous catheterization, arterial catheterization, intramuscular/subcutaneous injection, and urinary catheterization.

One senior nurse from the PICU and another from the CICU, both skilled in pain assessment, were appointed as data collection coordinators for their respective departments. They were trained to use maximum variation sampling22 to select patients meeting the inclusion criteria, considering factors such as age, gender, mechanical ventilation, type of procedure, and pain score rated at the bedside. Once a suitable patient was identified, the nurse coordinator called the video recording team, and a team member arrived at the scene within 5 to 10 min to start recording. Recording began two minutes before the bedside nurse or intern started the procedure, continued throughout the procedure, and ended two minutes after its completion23,24,25. During the procedure, the nurse coordinator assessed pain at the bedside using the COMFORT behavior scale. The video collection procedure is depicted in Fig. 2.

Fig. 2
figure 2

The flowchart of the data collection procedure for facial expression videos in children.

Data labels

The captured recordings were pre-processed. Videos in which children moved their bodies during recording such that their faces were obscured or partially obscured by healthcare workers or the children’s limbs, or in which external stimuli interfered, were excluded. Thus, only videos featuring clear and complete facial information were included in the subsequent annotation step.

Six experienced nurses from the hospital’s pain management team were selected to accurately classify the facial expression statuses for each video. Before annotation, they were trained to consistently use the COMFORT behavior scale to evaluate facial expressions (1 point: facial muscles totally relaxed; 2 points: normal facial tone; 3 points: tension evident in some muscles, not sustained; 4 points: tension evident throughout muscles, sustained; 5 points: facial muscles contorted and grimacing). Each video was triple-annotated, with three nurses annotating independently using a custom software annotation tool developed by the study team. The annotator first watched the entire video, pausing at moments that could be scored according to the “facial muscles” category of the COMFORT behavior scale; at these points, the segment was split and then annotated. The tool included basic video player functions such as play, pause, stop, forward, backward, and split. Additionally, after a video segment was split, a dialog box popped up for inputting the rating score. The annotation process is illustrated in Fig. 3. Segment annotations agreed upon by at least two annotators were adopted; in cases where all three annotations differed, a working group discussion was initiated to resolve discrepancies and reach consensus.
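The adjudication rule described above (a segment label is adopted when at least two of the three annotators agree; otherwise the segment goes to a working-group discussion) can be expressed as a short routine. The sketch below is an illustrative reconstruction only, not part of the study’s annotation software, and the function name is hypothetical.

```python
from collections import Counter
from typing import Optional

def adjudicate(scores: list) -> Optional[int]:
    """Return the consensus COMFORT 'facial muscles' score (1-5) for one
    video segment, or None when all three annotators disagree and the
    segment must be escalated to a working-group discussion."""
    assert len(scores) == 3, "each segment is triple-annotated"
    label, votes = Counter(scores).most_common(1)[0]
    return label if votes >= 2 else None

# Two annotators give 4 points and one gives 3, so the consensus is 4;
# three different scores yield None and trigger a discussion.
print(adjudicate([4, 3, 4]))  # 4
print(adjudicate([2, 3, 5]))  # None
```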

Fig. 3
figure 3

Data annotation flowchart.

Afterward, the classified video segments were processed frame by frame by an algorithm engineer, with each frame within a segment being automatically annotated. The initial frame (F1) and its adjacent frame (F2) were selected, and the difference between the two frames was calculated using the frame difference method. If the frame difference was less than a given threshold ε, indicating minimal change in the child’s facial expression, F2 was discarded. The difference between F1 and the third consecutive frame (F3) was then calculated, and this process continued until the frame difference exceeded the threshold ε or the last frame of the segment was reached. If the frame difference exceeded the threshold ε, indicating a significant change in the child’s facial expression, the frame was included as valid image data in the PFECIC dataset.
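A minimal sketch of this frame-difference selection is given below. It assumes OpenCV for frame reading and the mean absolute grayscale difference as the difference measure; the actual implementation, the value of the threshold ε, and whether each kept frame becomes the new reference for subsequent comparisons were not reported, so these details are assumptions.

```python
import cv2
import numpy as np

def select_keyframes(segment_path: str, eps: float = 12.0):
    """Keep a frame only when it differs from the current reference frame
    by more than the threshold eps (mean absolute grayscale difference)."""
    cap = cv2.VideoCapture(segment_path)
    ok, ref = cap.read()
    if not ok:
        return []
    ref_gray = cv2.cvtColor(ref, cv2.COLOR_BGR2GRAY)
    keyframes = [ref]                           # the initial frame F1 is kept
    while True:
        ok, frame = cap.read()
        if not ok:                              # last frame of the segment reached
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        diff = float(np.mean(cv2.absdiff(ref_gray, gray)))
        if diff > eps:                          # noticeable expression change: keep frame
            keyframes.append(frame)
            ref_gray = gray                     # assumed: kept frame becomes the new reference
        # otherwise discard the frame and compare the next one to the same reference
    cap.release()
    return keyframes
```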

Dataset’s clinical validation

We conducted recognition experiments on the PFECIC and COPE datasets separately using an algorithm based on the Swin Transformer26. The PFECIC dataset was divided into training, validation, and test sets in a 7:2:1 ratio, comprising 4949, 1233, and 769 images, respectively. The sample sizes for classification points 1 to 5 in the training set were 7, 1344, 1156, 1067, and 1115, respectively; the validation and test sets were distributed proportionally. Since the PFECIC dataset contains objects other than faces, such as hospital beds and medical equipment, face detection was conducted first, followed by expression recognition. Similarly, the COPE dataset was divided into training, validation, and test sets in a 7:2:1 ratio, with 140 images in the training set, 40 in the validation set, and 24 in the test set. Since the COPE dataset concentrates exclusively on the facial area, face detection was not required.
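As a rough illustration of the two-stage pipeline used for PFECIC (face detection first, then expression recognition), the snippet below crops the largest detected face before it is passed to the classifier. The study does not report which face detector was used, so OpenCV’s bundled Haar cascade serves purely as a stand-in, and the function name and margin are hypothetical.

```python
import cv2

# Stand-in face detector; the detector actually used in the study is not reported.
_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_largest_face(image_bgr, margin: float = 0.1):
    """Return the largest detected face crop (with a small margin), or None
    if no face is found; the crop is then fed to the expression classifier."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = _face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])      # largest bounding box
    dx, dy = int(w * margin), int(h * margin)
    h_img, w_img = image_bgr.shape[:2]
    x0, y0 = max(x - dx, 0), max(y - dy, 0)
    x1, y1 = min(x + w + dx, w_img), min(y + h + dy, h_img)
    return image_bgr[y0:y1, x0:x1]
```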

The experiments were implemented using the PyTorch 1.11 deep learning framework and conducted on eight GeForce GTX 1080 Ti GPUs. The data were augmented during training to increase the effective dataset size, using random horizontal flipping and random cropping with a crop size of 224. The model training parameters were a batch size of 16, a learning rate of 0.0001, a weight decay of 0.00001, and 150 training epochs. To evaluate the effectiveness of the datasets for training and testing deep learning models based on this algorithm, we measured accuracy, precision, recall, the harmonic mean of precision and recall (F1-score), and the false positive rate (FPR). Detailed information about the algorithm and its evaluation metrics can be found in the Supplement.
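The reported configuration (random horizontal flip, 224 random crop, batch size 16, learning rate 1e-4, weight decay 1e-5, 150 epochs) could be assembled roughly as follows. The study does not state the optimizer, the Swin Transformer implementation, or any resizing step, so the AdamW optimizer, the timm model, and the preliminary resize below are assumptions rather than reported details.

```python
import torch
import timm
from torchvision import transforms

# Training-time augmentation reported in the study: random horizontal flip
# and a 224x224 random crop; the preceding resize is an assumed preprocessing step.
train_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Swin Transformer base with a 5-way head for the five facial expression levels;
# the timm model name and the AdamW optimizer are assumptions.
model = timm.create_model("swin_base_patch4_window7_224",
                          pretrained=True, num_classes=5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = torch.nn.CrossEntropyLoss()

BATCH_SIZE, EPOCHS = 16, 150

def train_one_epoch(model, loader, device="cuda"):
    """One pass over the training loader; `loader` yields (B, 3, 224, 224) batches."""
    model.train().to(device)
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```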

Results

Results of the PFECIC dataset

The PFECIC dataset comprises 119 pain expression videos from 53 critically ill Chinese children, recorded across seven major clinical procedures that induce pain in the PICU and CICU of Children’s Hospital of Fudan University. The dataset also includes 6,951 annotated pain expression images, categorized into five facial expression levels (1 to 5 points), with 375, 1,887, 1,624, 1,499, and 1,566 images per category, respectively. Table 1 presents the basic information, while Fig. 4 provides sample images.

Table 1 Basic information of recorded videos (n = 119).
Fig. 4
figure 4

Sample images of five pain facial expression levels in the PFECIC dataset.

Experiment result and comparative analyses

As illustrated in Fig. 5, the pain expression levels in the PFECIC dataset exhibit finer granularity and a more balanced data distribution than those in the COPE dataset.

Fig. 5
figure 5

Confusion matrix-based performance comparison between the two datasets.

Moreover, as shown in Table 2, for the PFECIC dataset the model performs better on facial expressions rated 1, 2, and 5 points than on the other two levels across all five metrics. The ROC curves in Fig. 6 show AUC values ranging from 0.764 to 0.968. Facial expressions rated 3 and 4 points are relatively subtle and therefore more challenging to distinguish.

Table 2 Model performance trained and tested based on the PFECIC dataset.
Fig. 6
figure 6

ROC curves on the PFECIC dataset.

The performance on the COPE dataset is displayed in Table 3, which suggests a significant class imbalance favoring the “non-pain” class. The ROC curve in Fig. 7 exhibits a step-like pattern rather than a smooth curve, which typically indicates evaluation on a small dataset in which each step corresponds to individual test cases. With an AUC of 0.829, the model demonstrates good overall discriminative ability between the pain and non-pain classes despite the class imbalance.

Table 3 The performance results from the model trained on the COPE dataset.
Fig. 7
figure 7

ROC curves on the COPE dataset.

We trained the Swin Transformer_base model on the PFECIC dataset and tested it on the COPE dataset. The results, shown in Table 4, indicate that the accuracy of pain recognition is significantly higher than that of the model trained solely on the COPE dataset. This suggests that increasing the amount of training data can effectively enhance the performance of deep learning models and improve their generalization capability.

Table 4 The performance results from the model trained on the PFECIC and then tested on the COPE dataset.

For a more comprehensive comparison between the PFECIC and COPE datasets, and given that the COPE dataset has only two categories (pain and non-pain), we defined the 1-point status in PFECIC as non-pain and the 2- to 5-point statuses as pain to ensure a fair comparison. The performance metrics of the Swin Transformer_base model trained on the PFECIC and COPE datasets are compared in Fig. 8. The PFECIC dataset shows an improvement over the COPE dataset, with an increase in accuracy of 16.6%, precision of 15.7%, recall of 23%, and F1-score of 24.2%, and a decrease in false positive rate of 30%.
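The binary mapping used for this comparison (1 point mapped to non-pain, 2 to 5 points mapped to pain) and the derived metrics can be reproduced with a few lines of scikit-learn. This is a minimal sketch under the stated mapping, not code from the study; the false positive rate is computed directly from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

def binarize(levels):
    """Map the five PFECIC levels to COPE-style binary labels:
    1 point -> 0 (non-pain), 2-5 points -> 1 (pain)."""
    return (np.asarray(levels) >= 2).astype(int)

def binary_metrics(true_levels, pred_levels):
    """Accuracy, precision, recall, F1-score, and false positive rate
    computed on the binarized labels."""
    y_true, y_pred = binarize(true_levels), binarize(pred_levels)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```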

Fig. 8
figure 8

Histogram comparing the performance metrics of the two datasets.

Discussion

Principal findings

Pain is a complex phenomenon. Pain-induced facial expressions share certain basic action units and physiological responses, such as raised eyebrows, lowered mouth corners, widened eyes, and an open mouth, with expressions triggered by other factors. Nevertheless, they also exhibit notable differences in specific expression combinations, facial symmetry, and duration. For example, pain-related expressions are typically brief and occur frequently, fluctuating with the intensity of the pain, whereas expressions associated with other emotions, such as happiness or sadness, tend to last longer. Deep learning-based pain facial expression recognition algorithms offer advantages such as high accuracy and strong generalization performance; however, they require large-scale, high-quality datasets for effective training27.

Existing literature shows limitations in facial expression datasets for children’s pain analysis, including small scale, non-standardized construction processes, and inadequate coverage of various age groups28. In this study, we established the PFECIC dataset, which captures the facial expressions of children experiencing real, procedure-related pain and exhibits the typical characteristics of pain-induced expressions. A standardized data collection process, along with pain ratings triple-annotated by trained, experienced senior nurses, ensures the dataset’s rigor, diversity, rationality, and usability.

The comparative analysis experiment utilizes the publicly available COPE dataset18, which comprises data from 26 Caucasian neonates between the ages of 18 h and 3 days. The ROC curve exhibits a step-like pattern, indicating the dataset’s small scale and limited generalization capability. Moreover, the four stimuli used on the neonates are not typical clinical procedures that induce pain in children, making them unrepresentative of typical pediatric pain expressions. In contrast, the confusion matrix results from the PFECIC dataset reveal that mispredictions predominantly occur around values adjacent to the true labels. This suggests that the annotation process for the PFECIC dataset is reasonable and reliable. Consequently, the PFECIC dataset established in this study offers greater usability and accuracy for analyzing pain expressions in critically ill children. Notably, it represents the first comprehensive dataset of pain-related facial expressions in critically ill children, covering an age range of 1 to 18 years. This broad coverage enables the application of artificial intelligence algorithms for pain expression analysis across all pediatric age groups.

Existing pediatric pain datasets primarily focus on acute procedural pain in otherwise healthy neonates and infants, failing to capture the complex and variable pain expressions seen in critically ill children. These datasets typically document responses to brief, standardized painful procedures conducted in controlled settings, which differ significantly from the prolonged, fluctuating pain experiences common in intensive care units. Moreover, most available datasets lack comprehensive pain assessments for children who are sedated or reliant on specialized medical equipment, such as respiratory machines and tracheal tubes29. In contrast, the PFECIC dataset is the first to specifically address pain facial expressions in critically ill children, encompassing both mechanically and non-mechanically ventilated patients. This dataset enables consistent and accurate pain level evaluation, providing valuable insights for physicians to administer analgesics more effectively. Furthermore, PFECIC surpasses all previously mentioned datasets in scale, making it a superior resource for training advanced deep learning algorithms, ultimately enhancing pain recognition performance in critically ill pediatric patients.

Limitations and future research

Although the dataset has achieved satisfactory results, there are still some limitations. First, oxygen tubes, nasogastric tubes, and endotracheal tubes worn by critically ill children can partially obstruct facial images, posing challenges for facial expression-based pain assessment in children. Second, relying solely on facial expression analysis is insufficient for accurately assessing pain in children; recognition algorithms incorporating multimodal data, such as body movements and physiological indicators, are required. Third, statistical parity in the distribution of procedure types was not achieved across demographic subgroups, introducing potential confounding variables. Fourth, dataset annotations were based exclusively on observational assessments, without incorporating self-reported pain metrics for children with verbal communication abilities.

Further research should encompass the development of multimodal recognition algorithms based on the established dataset and the expansion of application scenarios for existing datasets. We are enhancing the PFECIC dataset by adding videos of each child’s face and limb movements, together with physiological indicators. It is also essential to collect data from a broader range of scenarios in the future, including general ward settings, and to incorporate various types of acute and chronic pain. Additionally, integrating validated pediatric self-reported pain scales will provide convergent validity evidence and enhance the database’s ecological relevance. This will be the first multimodal dataset for pediatric pain assessment, laying the groundwork for multimodal monitoring-based pain assessment models.

Conclusion

In this paper, we introduced PFECIC, a novel dataset of pain facial expressions of critically ill children, to advance research on pain facial expression recognition in critically ill patients. The PFECIC dataset comprises 119 videos capturing children’s pain expressions, along with 6,951 pain facial expression images sourced from 53 critically ill Chinese children treated at the Children’s Hospital of Fudan University. In this study, we employed a deep learning-based pain expression analysis algorithm to evaluate the PFECIC dataset. The PFECIC dataset demonstrates superior accuracy, rationality, usability, and comprehensiveness for training algorithm models and can serve as a valuable resource for training and testing pain assessment algorithms for critically ill children.

Declaration