Introduction

Research background and motivations

As science and technology rapidly develop, artificial intelligence (AI) has penetrated all aspects of life, and the field of physical education (PE) is no exception1,2,3. In recent years, PE has received more and more attention, because it is not only related to students’ physical health but also closely correlated to their mental health, teamwork ability, and the cultivation of lifelong exercise habits. However, traditional PE teaching methods often rely on teachers’ intuitive teaching and students’ self-perception, lacking objective and accurate data support and immediate feedback4,5,6. Especially in the process of teaching and correcting complex movements, such as sit-ups, teachers need to invest a lot of time and energy to give guidance and feedback one by one. While students may find it difficult to improve their skills quickly because of the lack of timely and accurate feedback. Therefore, how to utilize modern scientific and technological means, especially AI and deep learning (DL) technology, to optimize PE teaching methods and improve teaching effects has become one of the current research hotspots.

As a vital branch of the AI field, DL provides new ideas for PE innovation with its powerful data processing and model generalization abilities7,8,9. In the PE ___domain, DL technology can be applied to action recognition, posture estimation, physical fitness assessment, and other aspects to furnish teachers and students with more accurate and objective data support and immediate feedback10,11. The application of this technology can help students better understand and master motor skills and facilitate teachers to assess students’ motor levels more accurately and make personalized teaching plans. In recent years, as an open-source cross-platform framework of Google, MediaPipe framework has shown great application potential in many fields with its high efficiency, flexibility and ease of use. MediaPipe integrates a variety of advanced computer vision and machine learning technologies, and is especially good at dealing with complex tasks such as human posture estimation and facial recognition. In the field of physical education, the introduction of MediaPipe framework provides a new solution for sports detection and evaluation, which is expected to achieve more scientific, accurate, and personalized teaching. Wei et al.12 pointed out that the design mode of wearable AI devices came from the development of the Internet. The core of hardware is various physiological information sensors and wearable technology, and the core technology of software is wireless network transmission and statistical data processing. Huang et al.13 realized the identification of risk factors for sleep disorders through machine learning. Huang et al.14 further explored the relationship between the increase of vitamin B6 intake and the improvement of lung function using machine learning methods.

In junior high school, students undergo a crucial phase of physical and psychological development, making the impact of physical education (PE) particularly profound. However, due to their age characteristics and cognitive limitations, traditional PE teaching methods often struggle to engage their interest and enthusiasm. Hence, it is of both theoretical and practical importance to develop an AI-assisted sports game teaching method tailored to junior high school students. As a common physical training action, sit-ups are widely used in physical education, fitness training and rehabilitation. However, traditional sit-ups often rely on teachers’ intuitive judgment and students’ self-perception, so it is difficult to ensure the accuracy and effectiveness of the movements. In addition, due to the differences in students’ physique and skill level, it is often difficult for teachers to give full consideration in the guidance process, which leads to some students being unable to get timely correction and feedback. This study selects sit-ups as the research focus. It aims to achieve accurate identification and evaluation of sit-ups by integrating the MediaPipe framework with the ST-GCN algorithm. The approach provides scientific data support and personalized teaching suggestions for teachers. As a result, it enhances students’ training effectiveness and overall experience.

The significance of this study lies in: firstly, it provides an immediate and accurate sports feedback mechanism for junior high school students, which helps them to correct their movements and improve their sports skills. Secondly, through the data-driven personalized teaching scheme, teachers can more accurately understand the sports state of each student, implement differentiated teaching, and promote educational equity. Finally, the development of the application of sports games enriches the teaching methods of sports, stimulates students’ interest in sports activities, and lays a solid foundation for cultivating lifelong sports habits.

Research statement

The significance of physical education for junior high school students is undeniable. However, traditional teaching methods no longer fully meet the demands of modern education. Therefore, exploring new approaches is essential to enhance teaching effectiveness. The integration of deep learning technology into physical education holds significant potential and promising prospects. However, current research in this area remains limited and requires further exploration and practical application. Additionally, the MediaPipe framework, a powerful machine learning tool, has been successfully implemented across various fields. Yet, its application in physical education is still relatively unexplored. Therefore, this study aims to utilize MediaPipe in sports activities to assess its feasibility and effectiveness in practical application.

Research objectives

The main objective of this study is to explore a deep learning algorithm for motor action recognition in junior middle school suitable physical education teaching under the background of artificial intelligence. (1) Describe the research background, objectives, and research necessity; (2) Summarize the research status of physical education teaching and posture estimation supported by artificial intelligence; (3) Introduce the method of artificial intelligence-assisted sports activities, and propose an algorithm of action recognition and posture estimation based on deep learning; (4) Experimental design to verify the application feasibility of the action recognition algorithm based on MediaPipe framework in real space; (5) Summarize the research contribution, limitations and future research direction.

Related works

The swift progress of AI technology has brought revolutionary changes to PE teaching. Huang et al.15 used machine learning model to predict physiological indexes. Liao et al.16 constructed a key point detection model for youth football training based on DL cellular neural networks. Huang et al.17 proposed a comprehensive framework of machine learning for medical applications. It included an initial machine learning selection method, which used bootstrap simulation to calculate the confidence interval of the accuracy statistics of many models. AI can provide more scientific and precise guidance for PE teaching through data collection, analysis, and processing. Table 1 shows the research progress of action recognition technology.

Table 1 Research progress of motion recognition technology.

Demrozi et al.18 focused on key technologies in the field of sports AI, encompassing motion behavior recognition and motion image analysis technologies. Motion behavior recognition technology assisted athletes’ training and game analysis through sensor data collection, data preprocessing, and action recognition algorithms. Motion image analysis technology could understand the movements and tactics of athletes by analyzing game or training images. Agarwal & Alam19 pointed out that AI could monitor students’ movement status and performance in real-time and provide instant feedback and guidance to students. This helped students correct errors in movements promptly and improve their sports skills. In basketball teaching, AI could identify students’ shooting actions through video analysis technology and offer suggestions on shooting posture, strength, angle, and other aspects.

Motion pose estimation is a critical research direction in the computer vision field, aiming to accurately identify human poses from images or videos. Ferrar et al.20 proposed a human posture estimation method based on multi-sensor fusion. By fusing the data of inertial measurement unit, optical camera and depth sensor, the accuracy and robustness of attitude estimation were improved. Dua et al.21 introduced the attention mechanism into the task of human posture estimation to improve the model’s attention to key information. The principle of attention mechanism and its application in attitude estimation were introduced in detail, including spatial attention mechanism and channel attention mechanism. Mazzia et al.22 used the information of bone points to estimate and identify human posture. Firstly, the bone points in the image were detected by the deep learning model, and then the attitude estimation was carried out based on the connection relationship among the bone points.

It is worth noticing that the application of MediaPipe framework in different fields also provides valuable reference for this study. For example, in the field of medical rehabilitation, MediaPipe has been used for the evaluation and training of patients’ motor ability, and provides accurate data support for rehabilitation teachers by monitoring patients’ motor posture in real time. These successful cases show that the MediaPipe framework has strong application potential and wide adaptability.

Proposed solution

AI-assisted sports activities in junior high school

From data analysis and motion pose estimation to virtual reality training, smart wearable devices, and intelligent referee systems, AI technology has brought revolutionary changes to various aspects of sports activities. In sports activities, data analysis is crucial23,24. AI technology collects and analyzes a large amount of match data, athlete training data, etc., to provide coaches and athletes with precise data support and decision-making basis. Specifically, AI can employ machine learning algorithms to mine athletes’ historical data, discover patterns and trends, and develop personalized training plans and game strategies for athletes25,26,27.

Computer vision technology is a vital tool for motion pose estimation and motion behavior recognition28,29. By capturing students’ motion images with cameras and utilizing DL algorithms to process and analyze the images, it is possible to accurately identify students’ motion poses and behaviors. This technology can furnish students with real-time feedback and guidance to help them correct errors in movements and improve their sports skills.

DL technology is one of the core technologies in the AI field, which automatically learns feature representations in data and achieves high-precision prediction and classification30,31,32. In PE teaching, DL technology can be applied to motion pose estimation, physical fitness assessment, motion behavior recognition, and many other aspects. By training neural network models, precise analysis and processing of students’ motion data can be achieved, offering scientific decision support for teachers.

Sensor technology can monitor students’ motion data and physiological indicators in real-time, such as heart rate, number of steps, movement trajectory, etc. By transmitting these data to smart devices or cloud servers for analysis and processing, personalized exercise suggestions and guidance can be provided to students. Meanwhile, sensor technology can also be used to monitor students’ motion status and safety conditions, ensuring their safety during sports activities. The application of AI technology in sports activities is summarized, as detailed in Table 2.

Table 2 The application of AI technology in sports activities.

Human pose estimation method and the Mediapipe framework in sports games

Human pose estimation refers to the extraction of the positions of human joints from images or videos using computer vision technology, thereby estimating and recognizing human body poses33,34,35. In the field of sports, human pose estimation is mainly used for athletes’ technical analysis, motion correction, and development of training plans. Depending on the task, human pose estimation can be divided into two-dimensional (2D) and three-dimensional (3D) pose estimation. Based on the recognition of individuals, it can be classified into single-person and multi-person pose estimation. The studies involving human participants were reviewed and approved by Department of Sports Studies, Facility of Educational Studies, Universiti Putra Malaysia Ethics Committee (Approval Number: 2022.200219819996). The participants provided their written informed consent to participate in this study. All methods were performed in accordance with relevant guidelines and regulations.

The 2D human pose estimation method is mainly employed to extract the 2D coordinate information of human joints from images or videos. This method is primarily used for technical analysis and motion correction of athletes. Common 2D human pose estimation methods include template matching-based and DL-based methods.

The template matching-based method is a traditional approach to human pose estimation. This method first establishes a template library of human poses and then matches the input image with the templates in the library to find the most similar template, thus obtaining the position information of human joints36,37. This method is simple and easy to implement, but it has poor capabilities in handling intricate backgrounds and occlusions.

The DL-based method is currently the mainstream approach in the 2D human pose estimation ___domain. This method trains the DL model to automatically extract the position information of human joints from images or videos, involving key point positions, joint angles, and body poses. Moreover, this method has strong feature extraction capabilities and generalization abilities, enabling it to deal with complex occlusions and backgrounds.

3D CNN achieves direct processing of 3D data by introducing 3D convolution kernels in the convolutional layers38,39,40. These kernels slide in the 3D space (height, width, and depth) and perform convolution operations with local regions in the input data to extract 3D features. This operation enables 3D CNN to capture spatial relationships and temporal dynamics in 3D data. In 3D CNN, components such as activation functions, pooling, fully connected (FC), and convolutional layers appear alternately, collectively forming the entire network structure. Convolutional layers extract features through convolution operations; pooling layers are utilized to mitigate the size of feature maps while retaining critical information. FC layers merge features and output the final classification or regression results, and activation functions introduce nonlinearity to enhance the network’s representation ability. By connecting a series of auxiliary output nodes to the last hidden layer of CNN, the extracted features during the training process more accurately approximate the computed high-level behavioral motion feature vectors, as revealed in Fig. 1.

Fig. 1
figure 1

Feature extraction of 3D CNN.

MediaPipe is an open-source framework developed by Google for implementing machine learning models in the media processing ___domain, encompassing images, videos, and audio, among others. This study selects sit-ups as the research focus, aiming to achieve accurate identification and evaluation of sit-ups by integrating the MediaPipe framework with the ST-GCN algorithm. The approach provides scientific data support and personalized teaching suggestions for teachers. As a result, it enhances students’ training effectiveness and overall experience. The core framework of MediaPipe is implemented in C + + and supports languages like Java and Objective C. Its main concepts include packets, streams, calculators, graphs, and subgraphs. MediaPipe graphs are directed, with packets flowing from data sources into the graph until they leave at the sink node.

Taking a hand key point detection model as an example, as presented in Fig. 2, the model operates on the cropped image area defined by the hand detector and returns high-fidelity 3D hand key points.

Fig. 2
figure 2

Key point ___location of 3D hand joint coordinates based on MediaPipe.

BlazePose is a pose estimation algorithm based on the MediaPipe framework, specifically designed for real-time human pose estimation. It features high accuracy, efficiency, and real-time performance, suitable for various platforms including mobile devices, desktop computers, and embedded systems. BlazePose uses a lightweight neural network architecture based on MobileNetV3 design, with lower computational and parameter requirements. This network structure achieves efficient feature extraction and information fusion through depth-wise separable convolutions and lightweight attention mechanisms. The BlazePose algorithm-based human pose prediction process is illustrated in Fig. 3.

Fig. 3
figure 3

Human pose prediction process based on BlazePose algorithm.

Fine-grained action segmentation and recognition based on DL

In the ___domain of video analysis and understanding, fine-grained action segmentation and recognition are pivotal research directions. Traditional action recognition methods typically concentrate on class labels for entire videos or video segments, while fine-grained action segmentation requires precisely segmenting each action in the video and assigning corresponding class labels40,41,42. In the fine-grained action segmentation and recognition of human skeletal motion, this study applies the Spatial Temporal-Graph Convolutional Network (ST-GCN) method (Fig. 4).

Fig. 4
figure 4

ST-GCN architecture.

The pseudo code of ST-GCN is shown in Table 3.

Table 3 The pseudo code of ST-GCN.

In each time step, the skeleton diagram structure is rolled up in space to capture the spatial relationship between key points41,42,43. This is usually achieved by defining an adjacency matrix, which describes the connection relationship between key points. Between adjacent time steps, the time graph convolution operation is carried out on the skeleton graph structure to capture the position change of key points. This can be achieved by treating the key points of adjacent time steps as nodes in the graph and defining edges between these nodes.

The internal connections of joints within a single frame are represented by the adjacency matrix A and the identity matrix I representing the self-connections. In the case of a single frame, ST-GCN using a partitioning strategy can be achieved through the following equations:

$${f_{out}}={\Lambda ^{ - \frac{1}{2}}}{\left( {A+I} \right)^{ - \frac{1}{2}}}{f_{in}}W$$
(1)
$${\Lambda ^{ii}}=\sum\nolimits_{j} {\left( {{A^{ij}}+{I^{ij}}} \right)}$$
(2)

W indicates that weight vectors of multiple output channels are superimposed to form a weight matrix.

In the spatiotemporal context, input feature maps can be represented as (C, V, T) dimensional tensors. Graph convolution is achieved by performing\(1 \times \Gamma\) standard 2D convolution and multiplying the resulting tensor with the normalized adjacency matrix\({\Lambda ^{ - \frac{1}{2}}}\left( {A+I} \right){\Lambda ^{ - \frac{1}{2}}}\) in the second dimension.

In the distance partitioning strategy,\({A_0}=I\)and\({A_1}=A\), there are:

$${f_{out}}=\sum\nolimits_{j} {\Lambda _{j}^{{ - \frac{1}{2}}}} \left( {{A_j}} \right)\Lambda _{j}^{{ - \frac{1}{2}}}{f_{in}}{W_j}$$
(3)

Similarly, it can be obtained that:

$$\Lambda _{j}^{{ii}}=\sum\nolimits_{k} {\left( {A_{j}^{{ik}}} \right)+\alpha }$$
(4)

For every adjacency matrix, there is a learnable weight matrix M. That is, the equation before the substitution is:

$${f_{out}}={\Lambda ^{ - \frac{1}{2}}}\left[ {\left( {A+I} \right) \otimes M} \right]{\Lambda ^{ - \frac{1}{2}}}{f_{in}}W$$
(5)

represents the product of elements between two matrices. The mask M is initialized as a matrix with all 1.

This study uses sit-up exercises in sports activities as an example. In sit-up action detection and counting, the analysis and processing of action data often involve iterative convergence and nonlinear fitting methods. The characteristic of sit-up actions is the periodic variation of the body between lying down and sitting up, which can be detected by analyzing the positional changes of a certain body part, such as the head or hips. To ensure the normativity and accuracy of sit-up actions and prevent excessive involvement of other body parts, this study constructs a reasonable penalty function by setting loss values and corresponding penalty factors for each joint or body part. The calculation of the penalty value can be written as:

$$S=\sum\limits_{{{s_i}}}^{{{e_i}}} {\left( {\alpha \cdot los{s_{wriet}}+\beta \cdot los{s_{knee}}+\gamma \cdot los{s_{ankle}}} \right)}$$
(6)

\(\alpha\),\(\beta\), and \(\gamma\) are all penalty factors; \(los{s_{wriet}}\), \(los{s_{knee}}\), and \(los{s_{ankle}}\), represent the loss values of the wrist, knee, and ankle, respectively. The calculation for the loss value of each part of the body can be expressed as:

$$los{s_j}=\sqrt {{{\left( {{\alpha _j} - threshold} \right)}^2}}$$
(7)

\({\alpha _j}\)refers to the parameter value calculated for each part of the current body.

The thresholds and penalty factors for parameters of various body parts are outlined in Table 4.

Table 4 The thresholds and penalty factors for parameters of various body parts.

In order to ensure that ST-GCN model can accurately and robustly deal with all kinds of complex and bad data conditions, this study has carried out detailed design and implementation in data preprocessing and enhancement. Firstly, key frames are extracted from the input video at a fixed frame rate to ensure that the model can handle consistent data input. Then, the extracted video frames are normalized, including image size adjustment, color space conversion (such as from RGB to gray image or YUV space) and pixel value normalization to improve the adaptability of the model to different lighting and color conditions. The pre-trained human detection model is used to detect the human body in the video frame to determine the position and size of the human body. According to the detection results, the video frame is cut, and only the area containing human body is reserved to reduce the interference of irrelevant information. In the aspect of data enhancement, the motion blur effect is simulated. By smoothing the video frame on the time axis or adding a blur filter, the image quality is simulated when the camera is moving rapidly or shaking.

Evaluation

Datasets collection

The experimental data of this study include the HMDB51 dataset and video data from 5 experimental subjects. HMDB51 is a widely used action recognition dataset that contains complex actions from various movie clips. These videos typically contain various interfering factors such as complex backgrounds, lighting changes, and camera movements due to their origin from movie scenes. In this study, video data from five subjects are used, and these subjects are tested in five different environments to evaluate the detection accuracy of the ST-GCN algorithm proposed here in different scenarios. All the data involved in the study comply with data protection and privacy laws. All personal information is anonymized during data processing to ensure the privacy and data security of participants. This includes strict control over the storage, processing and use of data to prevent unauthorized access and disclosure. The research methods and procedures all follow the research ethics and operational norms. During the experiment, the safety of the experimental environment, the physical health and psychological safety of the participants are ensured. In addition, the technical specifications of data collection, recording and analysis are followed, which ensures the scientific research and the reliability of data. It is ensured that all participants have provided written consent before participating in this study and have been approved by the ethics committee of their school. Before the experiment began, the research purpose, methods, potential risks, and participants’ rights were explained in detail to ensure their full understanding and voluntary participation. All data collection and processing complied with ethical standards, and participant privacy was protected. Informed consent forms, containing detailed terms, were distributed and required to be signed by the participants’ parents before they could continue. Each subject completed a certain number of sit-ups under the prescribed exercise standards, and recorded their sports videos through cameras installed in the experimental environment. Video data is used to test the performance of the algorithm in a relatively controlled environment to ensure the robustness and reliability of the algorithm. The video data are recorded by 5 experimental subjects in five different environments and used to test sit-up action detection in relatively controlled environments, and video samples.

Regarding data privacy and security, this study implements rigorous measures to protect participants’ identities and personal information. During data collection and storage, all personal identifiers undergo anonymization procedures. Written informed consent is obtained from both participants and their parents, with detailed explanations of the research objectives, methodologies, and potential risks. All data transmissions and storage processes employ encryption protocols to prevent unauthorized access and misuse. Regular data backup procedures and recovery tests are conducted to ensure data integrity and availability.

Although the proposed methodology relies on high-quality input data, such data may be affected by environmental factors including lighting conditions and camera angles. To mitigate these influences, this study adopts multiple countermeasures. Specifically, data normalization and augmentation techniques are applied, encompassing image resizing, color space conversion (e.g., from RGB to grayscale or YUV space), and pixel value normalization to enhance model robustness against illumination variations. Furthermore, the training dataset incorporates data collected from diverse camera angles and environments to strengthen the model’s generalization ability across different scenarios.

Experimental environment

Hardware configuration: CPU: Intel(R) Core(TM) i7-8700; GPU: NVIDIA T4. Software configuration: development environment: python 3.7.1 + CUDA 10.1, image processing: cv2 4.6.0.66, mediapipe 0.8.10.

To reduce the operational complexity of the system, this study has developed a user-friendly graphical user interface (GUI). The GUI simplifies steps such as model configuration, data input, and result output, making the system accessible to teachers and students without technical backgrounds. It offers a one-click initialization feature that automatically sets model parameters, eliminating the need for manual adjustments. An intuitive model selection interface allows users to choose pre-trained models like ST-GCN or other suitable models via drop-down menus or icons. Additionally, comprehensive user manuals and instructional videos are provided to guide users through system setup and usage. To further enhance user experience, a suite of automation tools has been integrated, including data preprocessing and model training tools. These tools automate tedious tasks such as data cleaning, format conversion, and model parameter tuning, thereby reducing the operational burden on users.

Parameters setting

The initial learning rate of the ST-GCN action recognition algorithm is set to 0.001. The number of iterations is set to 100 to ensure that the model is sufficiently trained but not overfitted; weight decay is 0.0001; the batch size is 32 to ensure the stability and speed of model training. Adam optimizer is chosen to automatically adjust the learning rate for each parameter to accelerate model convergence and reduce oscillation. Dropout layers are added to certain layers of the model to prevent overfitting.

Before entering the processing pipeline, all collected video data underwent rigorous anonymization. Specifically, during recording, no personally identifiable information (such as facial features) is captured. When storing the data, unique identifiers are used instead of real names or student numbers. To prevent data leakage, all data in transit, including that uploaded to the cloud, are encrypted using industry-standard protocols. Additionally, data stored on the server side are further secured with robust encryption algorithms. To ensure all operations complied with ethical standards and had necessary legal backing, this study establishes detailed data access policies and ensured transparency throughout their implementation.

Performance evaluation

In the HMDB51 dataset, the proposed algorithm’s average detection accuracy reaches 88.3%. In the video dataset from 5 different scenarios, the proposed algorithm’s average detection and counting accuracy of sit-up movements are depicted in Fig. 5. It can be observed that in diverse testing scenarios, due to the complexity of the environmental background, the accuracy of the action recognition algorithm is indeed affected and fluctuates to a certain extent. When the background contains many moving objects, textures, or colors similar to the target action, the algorithm may misidentify these background elements as part of the target, leading to a decrease in accuracy.

Fig. 5
figure 5

Detection and counting accuracy in diverse scenarios.

Furthermore, the proposed algorithm is compared with other mainstream action recognition algorithm models on the HMDB51 dataset to verify the proposed method’s effectiveness. Among them, Spatial-Temporal Synchronous Graph Convolutional Networks (STS-GCN) can directly capture local temporal-spatial correlation at the same time. It constructs a local spatio-temporal graph, connects the single spatial maps with adjacent time steps into a map, and captures the complex local spatio-temporal correlation in these local spatio-temporal graphs through the convolution module of spatio-temporal synchronization maps. MotionMoxer is an algorithm related to motion generation, motion simulation, or similar fields. Figures 6 and 7 respectively show the comparison results of different algorithms in Mean Absolute Error (MAE) and Mean Per Joint Position Error (MPJPE) indexes. The results reveal that the proposed method has strengths in long-term prediction, such as over 500ms, especially at 1000ms, MAE and MPJPE values are 71.1 and 1.04, respectively, which are markedly lower than other algorithms.

Fig. 6
figure 6

Comparison of different algorithms on the MAE index.

Fig. 7
figure 7

Comparison of various algorithms on the MPJPE index.

Discussion

In PE, AI can provide more accurate data support for coaches and offer more personalized and appropriate sports games for junior high school students. With the swift progress of DL technology, the algorithm based on ST-GCN has garnered prominent outcomes in the human action recognition ___domain44,45. During model deployment, infrastructure and equipment requirements are crucial. High-performance computing resources and storage devices are essential for stable model operation and efficient data processing. However, in practical applications, such as sports teaching, hardware resources are often limited, and costs are constrained. On one hand, high-performance GPUs and CPUs are vital for model training and inference. These devices are typically expensive and energy-intensive, making it difficult for many schools and educational institutions to afford the long-term operational costs. On the other hand, large amounts of video data require substantial storage space, and high-performance storage devices are costly and require regular maintenance and updates. These infrastructure and equipment requirements not only increase the cost of model deployment but also limit the model’s widespread adoption and application. In particular, in remote areas or resource-poor schools, hardware limitations may prevent full utilization of the model’s advantages to aid teaching. To mitigate the impact of infrastructure and equipment requirements on model deployment, cloud computing can provide flexible computing resources and storage space, dynamically allocated based on actual needs. Deploying the model on cloud servers leverages the cloud’s powerful computing capabilities to accelerate training and inference, reducing local hardware demands. Edge computing pushes computing tasks and data storage to the network edge, reducing data transmission latency and improving processing efficiency. By deploying the model on edge devices near the data source, such as smart cameras or smartphones, real-time data processing and feedback can be achieved, reducing bandwidth and latency. This technology is especially suitable for sports teaching scenarios with high real-time requirements, providing students with more timely and accurate guidance.

The ST-GCN motion detection algorithm based on the MediaPipe framework constructed in this study realizes accurate recognition of sit-ups by constructing a spatiotemporal graph model of the human skeletal point sequence. Initially, the algorithm obtains the data of human skeletal points through pose estimation technology. Subsequently, a spatio-temporal graph model is constructed to represent human skeletal points as nodes in the graph, and the connection relationship between nodes is represented as edges. Zhang & Tao46 found that there were two types of edges in spatiotemporal graphs. Spatial edges represent the connection between nodes at the same time step, while temporal edges indicate the connection between the same nodes at different time steps, which is consistent with the findings of this study. Shu et al.47 pointed out that through multiple layers of spatiotemporal graph convolution operations, the ST-GCN algorithm can capture spatial and temporal changes in human skeletal points, thereby achieving accurate recognition of sit-up actions. Table 5 shows the advantages of this study compared with the research in this field.

Table 5 Research progress of motion recognition technology.

Leveraging personalized teaching, precise data analysis, and gamified teaching, more high-quality and efficient PE services can be furnished to students. The ST-GCN model effectively integrates the spatial position and temporal dynamic information of human skeleton points by constructing a spatio-temporal graph model and realizes high-precision recognition of complex motion patterns. This technical breakthrough provides a powerful tool for action analysis in physical education teaching. Real-time data acquisition and processing based on attitude estimation technology enables the system to provide students with action feedback immediately, help them adjust their posture in time, and improve the training effect. This instant feedback mechanism is particularly important for beginners, which can significantly shorten the time to master skills.

When using video data for pose recognition and motion analysis, it is crucial to prevent the leakage of students’ facial features and identity information. To this end, this study employs techniques such as facial blurring and implements strict encryption measures during data storage and transmission. However, as technology continues to evolve, privacy protection policies must also be updated and refined to address potential new risks of privacy breaches. When storing and analyzing large amounts of student data, it is essential to establish strict data access and control mechanisms to prevent unauthorized access and misuse. Regular data backups and recovery tests are also necessary to ensure data integrity and availability.

In terms of the significance for PE practice, this study integrates AI and DL technologies into physical activities, providing new insights for PE in junior high school and significantly enhancing teaching and learning experiences. Real-time feedback and personalized guidance based on precise motion detection effectively improve students’ motor skills and engagement. This study also highlights the potential of technology in supporting differentiated teaching strategies, offering tailored instructional support based on the diverse needs and abilities of individual students. However, the successful application of these technologies in practice requires careful consideration of infrastructure, teacher training, and student privacy. Future research should explore the practical challenges and opportunities of deploying these technologies in diverse educational settings to maximize their impact on PE outcomes.

This study also has several limitations. First, the small sample size may affect the generalizability of the findings. The study focuses on a limited number of participants and examines only one type of exercise—sit-ups. While this narrow scope facilitates in-depth analysis, it restricts the applicability of the results to other types of physical activities. Future studies should expand the sample size, include more diverse participants, and extend the study to a wider range of exercises to enhance the robustness and generalizability of the results. Moreover, despite extensive efforts in data preprocessing and augmentation to enhance the model’s robustness against poor data conditions, the model’s generalization ability remains somewhat limited. Particularly when dealing with rare or complex sports activities, the model’s recognition performance may decline. This is mainly due to the limitations of the training dataset, which prevents the model from fully learning the characteristics of these activities. To improve the model’s generalization, future research could explore a broader range of sports activities, including various sports, dance, and fitness classes. By collecting video data of these activities and constructing a more diverse training dataset, a more robust and generalizable model can be trained. For the recognition of fine-grained or rapidly changing motions, future research could investigate more advanced deep learning algorithms and techniques, such as attention mechanisms and memory networks. These techniques can help the model better capture and understand motion details, thereby improving recognition accuracy and robustness. The proposed system in this study exhibits a primary limitation in its dependence on advanced hardware and computational resources, which may hinder adoption by financially constrained schools. To address this issue, cloud computing and edge computing technologies present viable solutions. Cloud computing provides flexible computational resources and storage capacity, enabling schools to access the model without requiring local high-performance hardware. Edge computing, conversely, processes data near its source, reducing transmission latency and improving processing efficiency - particularly advantageous for PE scenarios demanding real-time responsiveness. An additional challenge lies in the substantial processing power required for real-time analysis and feedback, potentially causing latency issues. To minimize delays, this study implements model efficiency optimizations including lightweight neural network architectures and optimized convolution operations. Furthermore, deploying the model on high-performance hardware significantly reduces processing time and latency. Future work may explore additional optimization strategies such as model pruning and quantization to enhance real-time performance.

Conclusion, and future works

Research contribution

The development of AI drives the continuous innovation of sports activities, and the action recognition and attitude estimation methods in sports activities are constantly updated. In this paper, a MediaPipe framework-based ST-GCN method is proposed for the fine-grained segmentation and recognition of human skeletons. Between adjacent time steps, a time graph convolution operation is performed on the skeleton structure to capture the position changes of key points. The research results facilitate understanding the information in motion scenes more comprehensively and improve the recognition accuracy of specific movements.

Future works and research limitations

Although this study achieves good performance on a specific dataset, the model’s generalization ability remains a challenge. This means that when the model is applied to new, unseen data, its performance can deteriorate. In future studies, larger and more diverse datasets should be collected to cover a wider range of motor movements and changes in detail. This helps to foster the model’s generalization ability and recognition accuracy. During data collection, the diversity and complexity of sports teaching activities often result in highly heterogeneous video data. Differences in teaching styles and movement performances across various classes, teachers, and students make data labeling and preprocessing particularly challenging. Additionally, in some movements, students may exhibit varied forms due to insufficient strength or lack of skill proficiency, making it difficult for the model to accurately capture and recognize these actions. In future research, efforts should be made to ensure the diversity and representativeness of the data while employing data augmentation techniques to enhance the model’s generalization ability. Regarding the variability in student performance, the training dataset can be enriched by introducing more student samples and types of movements, thereby improving the model’s recognition accuracy. While the proposed method demonstrates excellent performance in recognizing structured movements like sit-ups, its effectiveness in detecting more complex, dynamic actions remains limited. Future research should focus on compiling more diverse datasets encompassing various complex and dynamic physical activities. The investigation of advanced DL algorithms incorporating attention mechanisms and memory networks could also improve the model’s capacity to capture and interpret fine-grained movement details.