RaGeoSense for smart home gesture recognition using sparse millimeter wave radar point clouds


In this section, we experimentally evaluate RaGeoSense to assess the overall performance of the system across several experimental dimensions. For this purpose, 10 users were recruited to provide training data and an additional 4 users were recruited to test the system's recognition performance on 8 gestures.

Overall performance

The overall performance of gesture recognition in both simple and complex environments was first evaluated, and the impact of different experimental parameters on the model was analyzed. As shown in Fig. 14, the training loss continuously decreases and gradually approaches zero, indicating that the model effectively fits the training data. Meanwhile, the validation loss drops rapidly in the early stages and then stabilizes, and the validation accuracy rises significantly and eventually approaches the training accuracy. These trends suggest that the model not only converges well during training but also achieves high prediction accuracy on the validation set, demonstrating strong generalization ability without significant overfitting.

Fig. 14

Training and validation loss vs. accuracy.

Overall, the training loss and validation loss decreased significantly, and both the training and validation accuracies approached 97%, indicating that the model performed well on both sets without significant overfitting. Figure 15 shows the confusion matrix for the eight-gesture classification. The confusion matrix shows that the gestures most often confused with one another share common features: (A) "raise" and (B) "lower" are both performed in the y–z plane; (E) "pull" and (F) "push" both move along the y-axis, only in opposite directions; (G) "draw a circle clockwise" and (H) "draw a circle counterclockwise" are likewise both drawn in the y–z plane and, like (A) and (B), involve motion along the z-axis with the arm moving toward or away from the body, so they exhibit similar characteristics. (C) "swipe right" and (D) "swipe left" are effectively the same gesture executed in opposite directions. (H) "draw a circle counterclockwise" is more distinctive than the other gestures, and its recognition accuracy reaches 100% under simple environmental conditions. Overall, the model is highly accurate, with an overall accuracy of 96.8%; even in complex environments, the accuracy of all 8 gestures exceeds 94%.

Fig. 15

Gesture categorization confusion matrix. (a) Simple environment. (b) Complex environment.

Effects of different distances and angles

Effects of different distances

We asked the experimenters to complete the eight arm gestures at different distances and compared performance in an open environment (an empty hall) and a complex environment (a laboratory); the setup is shown in Fig. 16. The radar-to-user distance was set to 1 m, 1.5 m, 2 m, 2.5 m, 3 m, 3.5 m, and 4 m to test system performance and find the optimal operating distance. The results show that recognition accuracy is consistently higher in the hall than in the laboratory, likely because of the laboratory's greater complexity and additional interference. The average gesture recognition rate in both environments exceeds 85%, indicating high gesture recognition performance.

Fig. 16

Schematic of different distances with error rates.

The experimental results further show that recognition in the open scene is consistently better than in the complex scene, regardless of distance. The recognition rate peaks when the radar is about 2 m from the subject: at that distance the propagation path is short, signal attenuation is small, the sensing coverage of the gesture is wide, and the reflected signal is strong. When the distance exceeds 3.5 m, recognition accuracy gradually decreases, because the number of points reflected from the human body drops and their distribution becomes sparse, reducing the amount of information and making it difficult to extract effective features. Meanwhile, as the distance increases, the gesture occupies a smaller radar cross-section, degrading the integrity of the spatial structure features. In addition, the minimum separation at which the radar can discriminate two targets grows with range, so resolution decreases, which in turn lowers gesture recognition accuracy.
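The loss of spatial detail with range can be made concrete: for a fixed angular resolution, the cross-range distance below which two reflectors merge into a single detection grows linearly with range. The 15° value below is an assumed, typical azimuth resolution for a low-cost mmWave sensor, not a measured parameter of our device:

```latex
\delta_{\text{cross}} \approx R\,\theta_{\text{res}},
\qquad
\theta_{\text{res}} = 15^{\circ} \approx 0.262\ \text{rad}
\;\Rightarrow\;
\delta_{\text{cross}} \approx
\begin{cases}
0.52\ \text{m}, & R = 2\ \text{m},\\
1.05\ \text{m}, & R = 4\ \text{m}.
\end{cases}
```

Under this assumption, doubling the range from 2 m to 4 m doubles the smallest resolvable gap between scatterers, which is consistent with the sparser, less structured point clouds observed at long range.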

Effects of different angles

The performance of millimeter-wave radar gesture recognition systems is usually closely related to the angle off the radar boresight. The radar's field of view spans −55° to +55°, so the experiment tested arrival angles of 0°, ±10°, ±20°, ±30°, and ±40° to assess the system's performance at different angles. The results of the experiment are shown in Fig. 17.

Fig. 17

Schematic of different angles and error rates.

The experimental results show that both the environment and the arrival angle have a significant effect on the gesture recognition rate. When the azimuth angle between the experimenter and the device is 0°, the overall recognition accuracy reaches 95%. Within ±20°, recognition performance remains excellent, with accuracy around 90%: in this range the signal path to the radar is short, attenuation is low, and the reflected signals are strong, which helps accurate feature extraction. At ±30°, however, recognition accuracy drops significantly to about 75%, because the longer signal path increases attenuation and weakens the reflected signal, making feature extraction more difficult. In addition, signals received at the edge of the radar's field of view are weaker, increasing recognition uncertainty and error.

In addition, we note that recognition accuracy at negative angles is slightly higher than at positive angles. This difference may arise because, at negative angles, the right-hand gesture is closer to the central region of the radar's field of view, reducing errors caused by field-of-view edge effects. At the same time, the hand movement is closer to the radar, so signal reflections are stronger and the radar captures gesture details more clearly, improving recognition accuracy. In summary, RaGeoSense maintains high recognition performance within a small angular offset, especially in the central region of the radar's field of view.

Effects of different speeds

To evaluate the effect of execution speed on RaGeoSense's recognition performance, two participants, standing 2 m from the radar at the 0° position, performed all gestures at three speeds: fast (completed in about 1.5 s), medium (about 3 s), and slow (about 4.5 s). Participants paced the gestures by their own perception of speed; no strict time limit was imposed. In total, 1200 gesture instances were collected for testing. The results of the experiment are shown in Fig. 18.

Fig. 18

Accuracy at different speeds.

The results show that recognition accuracy is consistently higher for medium (96.28%) and fast (93.05%) gestures, while accuracy for slow gestures drops by about 8%. This finding suggests that the speed of gesture execution has a significant effect on recognition performance. Further analysis found that the CFAR detection algorithm is highly sensitive to gesture speed: at lower speeds the gesture's motion changes more slowly, yielding sparser point cloud data that is harder for the algorithm to process, which ultimately reduces recognition accuracy.

The underlying reason is that slower movements produce a sparse distribution of captured points in which clear feature patterns are difficult to form, so at slow speeds the recognition algorithm struggles to track the complete trajectory and details of the gesture. To improve slow-gesture recognition, the existing algorithm needs to be made more robust to sparse point cloud data, improving the stability and accuracy of the system for slow gestures.
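To make the CFAR sensitivity concrete, below is a minimal one-dimensional cell-averaging CFAR sketch; all parameter values are illustrative assumptions, not the settings used by RaGeoSense. A slow gesture yields a weak, smeared return, so with the same threshold multiplier fewer cells are detected and the resulting point cloud is sparser.

```python
# Minimal 1-D cell-averaging CFAR sketch (illustrative parameter values).
import numpy as np

def ca_cfar(power, num_train=8, num_guard=2, scale=4.0):
    """Return a boolean detection mask over a 1-D power profile.

    power     : per-cell signal power (e.g., one Doppler bin row)
    num_train : training cells on each side used to estimate the noise level
    num_guard : guard cells on each side excluded from the estimate
    scale     : threshold multiplier controlling the false-alarm rate
    """
    n = len(power)
    detections = np.zeros(n, dtype=bool)
    half = num_train + num_guard
    for i in range(half, n - half):
        # Noise estimate: average of the training cells, guard cells excluded.
        left = power[i - half : i - num_guard]
        right = power[i + num_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = power[i] > scale * noise
    return detections

# A fast gesture gives a strong, compact return; a slow one a weak, smeared return.
rng = np.random.default_rng(0)
noise_floor = rng.exponential(1.0, 128)
fast = noise_floor.copy(); fast[60:64] += 20.0
slow = noise_floor.copy(); slow[60:72] += 2.5
print(ca_cfar(fast).sum(), "cells detected for the fast gesture")
print(ca_cfar(slow).sum(), "cells detected for the slow gesture")
```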

Experimental results for multi-user scenarios

To more fully demonstrate the capability of the RaGeoSense system in a smart home environment, we designed two multi-user scenario experiments. In both, the subject stood 2 m from the radar at the 0° position, and additional participants were gradually introduced as interference. In the first experiment, the added participants performed gestures different from the subject's at various positions and angles; in the second, they walked in the background behind the subject.

In addition, we optimized the filtering algorithm. During K-means clustering and straight-through filtering, the system identifies multiple cluster centers and selects the target closest to the radar as the subject. Based on this cluster center, the system automatically adjusts the window position before performing the subsequent operations. All other experimental settings are consistent with the single-user experiments.
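The subject-selection step described above can be sketched as follows, assuming each frame is an (x, y, z) point array with the radar at the origin; the function name, cluster count, and window half-width are illustrative assumptions rather than the system's exact settings.

```python
# Sketch of nearest-cluster subject selection with a re-centered window.
import numpy as np
from sklearn.cluster import KMeans

def select_subject(points, max_users=3, window_half_width=0.8):
    """Cluster a point cloud frame and keep only the user nearest the radar."""
    kmeans = KMeans(n_clusters=max_users, n_init=10).fit(points)
    centers = kmeans.cluster_centers_
    # Pick the cluster whose center is closest to the radar (the origin).
    nearest = np.argmin(np.linalg.norm(centers, axis=1))
    subject_points = points[kmeans.labels_ == nearest]
    # Re-center the straight-through (pass-through) window on that cluster.
    center = centers[nearest]
    mask = np.all(np.abs(subject_points - center) < window_half_width, axis=1)
    return subject_points[mask]
```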

Fig. 19

Accuracy for different numbers of interfering users. (a) Making different gestures. (b) Walking in the background.

The experimental results are shown in Fig. 19. In most cases, interference from multiple gestures has a notable impact on recognition accuracy. In an open environment, recognition accuracy drops by 9.7 percentage points (from 95.7% to 86.0%) when multiple users perform gestures simultaneously. This is primarily due to the high complexity and similarity of the gesture signals, which cause signal aliasing and feature confusion, challenges that are harder to resolve than background noise or simple motion interference. In comparison, when multiple users walk around the subject without performing gestures, recognition accuracy drops by 5.1 percentage points (from 95.7% to 90.6%), mainly because of increased background dynamics rather than misclassification. These two types of interference were chosen to reflect typical multi-user conditions in real-world smart home scenarios: concurrent interaction and passive background movement. To improve robustness under these conditions, we optimized the K-means clustering in the preprocessing stage to detect multiple point cloud clusters and dynamically select the cluster closest to the radar as the primary subject. While multi-user interaction introduces challenges, the system still maintains high recognition accuracy, demonstrating strong anti-interference capability. We acknowledge that more complex multi-user scenarios, such as overlapping or continuous interactions between users, have not yet been explored; they will be considered in future work to further enhance the system's generalizability in practical applications.

Effect of point cloud filtering

In our gesture recognition system, point cloud filtering is one of the key steps for improving recognition performance. To analyze in depth how different filtering methods affect recognition accuracy, we designed ablation experiments comparing overall recognition with no filtering, K-means clustering straight-through filtering, frame-difference filtering, and median filtering at different distances (1 to 4 m) and different angles (0° to ±40°). The experimental results are shown in Fig. 20.

Fig. 20

Comparison of point cloud filtering between different distances and angles.

Experiments show that the average contribution of K-means clustering straight-through filtering is 29.4%. This method automatically determines the cluster center and window position from the characteristics of the point cloud data, improving the system's adaptivity; at longer distances and larger angles its contribution reaches 38%, making it particularly significant for system performance. Frame-difference filtering, which removes the static background and retains dynamic objects, performs well across distances and angles, with an average contribution of 13.4%; its strength lies in markedly enhancing the detection of dynamic gestures, which benefits gesture feature extraction against specific backgrounds. In contrast, median filtering contributes only 5.3% on average; although its effect is less pronounced than that of the other two filters, it attenuates individual differences between users and helps improve the model's generalization. Median filtering can therefore be used to improve the model's adaptability across users, even though its contribution to the overall recognition rate is relatively limited.
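As an illustration of the frame-difference idea, the sketch below drops points that fall in voxels already occupied in the previous frame; the voxel size is an assumed value and the implementation is a simplification of the differencing performed by the full system.

```python
# Illustrative frame-difference filter for sparse radar point clouds.
import numpy as np

def frame_difference_filter(prev_frame, curr_frame, voxel=0.05):
    """Keep only points of curr_frame whose voxel was empty in prev_frame.

    Static background reflectors occupy the same voxels frame after frame,
    so only dynamic (gesture) points survive the filter.
    """
    prev_voxels = {tuple(v) for v in np.floor(prev_frame / voxel).astype(int)}
    curr_voxels = np.floor(curr_frame / voxel).astype(int)
    keep = np.array([tuple(v) not in prev_voxels for v in curr_voxels])
    return curr_frame[keep]
```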

In summary, K-means clustering straight-through filtering and frame-difference filtering contribute most to recognition accuracy and are well suited to long-distance and large-angle scenarios, whereas median filtering mainly enhances the model's generalization. With the three proposed filtering methods combined, the quality of the human action point cloud is significantly improved, which in turn significantly improves the classification accuracy of RaGeoSense.

Effect of data augmentation

Data augmentation plays an important role in improving the generalization ability and robustness of models. We analyze in detail how these methods affect model performance in terms of recognition accuracy and real-time behavior. The RaGeoSense system uses four augmentation methods: additive noise, time warping, time shifting, and sequence inversion. Experiments were conducted in open and complex environments to test recognition accuracy in different scenarios and to compare single augmentation methods against their combination; the experimental results are shown in Table 4.

Table 4 Comparison of data augmentation accuracy.

Data augmentation significantly improved the recognition accuracy of the model, especially in complex environments: the combined approach raised accuracy from 93.51% to 95.20% in open environments and from 90.77% to 92.56% in complex environments. Augmentation also improves the robustness of the model, particularly when handling different gesture speeds. Time warping contributes the most to robustness but also imposes the largest computational overhead, about 6%. With all augmentation methods combined, the computational overhead increases by about 10%, which affects real-time performance somewhat but remains within acceptable limits.
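For concreteness, minimal sketches of the four augmentations are given below, applied to a gesture sample represented as a (T, D) array of per-frame features; all parameter values are illustrative assumptions, not the settings used in the experiments.

```python
# Minimal sketches of the four augmentation methods (assumed parameters).
import numpy as np

def add_noise(seq, sigma=0.01):
    """Additive Gaussian noise, simulating measurement jitter."""
    return seq + np.random.normal(0.0, sigma, seq.shape)

def time_warp(seq, factor=1.2):
    """Resample the time axis to simulate a faster execution (length preserved);
    np.interp holds the final pose once the warped time runs past the end."""
    t = np.linspace(0.0, 1.0, len(seq))
    return np.stack(
        [np.interp(t * factor, t, seq[:, d]) for d in range(seq.shape[1])],
        axis=1,
    )

def time_shift(seq, max_shift=5):
    """Circularly shift the sequence in time to vary the start point."""
    return np.roll(seq, np.random.randint(-max_shift, max_shift + 1), axis=0)

def reverse(seq):
    """Sequence inversion: reverse the time axis."""
    return seq[::-1].copy()
```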

Effect of different sampling parameters

While the motion features within a single frame are not obvious, the arm motion across consecutive frames forms a spatio-temporal structure. To process these data, this experiment uses sliding sequence sampling with point cloud tiling, and evaluates system performance while adjusting the number of frames sampled per sequence.
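A minimal sketch of this sampling scheme is shown below, assuming the recording is a list of per-frame point arrays; the 45-frame default mirrors the best-performing window length discussed next, while the function name and step value are illustrative assumptions.

```python
# Sliding sequence sampling with point cloud tiling (illustrative sketch).
import numpy as np

def sliding_windows(frames, seq_len=45, step=5):
    """Tile `frames` (a list of per-frame point arrays) into overlapping samples.

    seq_len : number of consecutive frames stitched into one sample
    step    : slide length; a smaller step yields more overlap and more samples
    """
    samples = []
    for start in range(0, len(frames) - seq_len + 1, step):
        window = frames[start : start + seq_len]
        samples.append(np.concatenate(window, axis=0))  # stack the window's points
    return samples
```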

We examine the effects of different sampling sequence lengths and slide lengths. As can be seen in Fig. 21, the results vary significantly with sequence length. For sampling lengths of 10–25 frames, recognition accuracy is low, although data processing time is short and time complexity is low. As the sequence length increases, the overall recognition rate rises above 90% at 30 frames and continues to improve, reaching 95.2% in the open environment at 45 frames. Beyond 45 frames, however, accuracy begins to decrease, because an overly large sequence window introduces information redundancy. Time complexity and preprocessing time also grow with the number of stitched frames, and past about 40 frames the accuracy gains become marginal. Sequences that are too short cannot capture long-range dependencies and global patterns, leaving the model with an insufficient view of global features; sequences that are too long increase computational complexity, consume excessive resources, and can dilute key features in a large volume of data, reducing recognition performance.

In the RaGeoSense system, increasing the sampling sequence length significantly improves recognition accuracy, because longer sequences provide richer spatio-temporal features that help the model capture the dynamic changes of gestures. However, this also increases computational overhead, affecting the system's real-time performance and energy efficiency. Smart home scenarios usually demand low-latency, energy-efficient interaction, so an optimal balance between computational complexity and real-time performance must be found. The experiments show that once the sequence length reaches 45 frames, further increases yield limited accuracy gains at a significantly higher computational cost. Depending on the requirements of the application, the sequence length can be adjusted dynamically: shortened in scenarios with strict real-time requirements to reduce computation, or lengthened where accuracy matters most.

Fig. 21

Different sampling sequence lengths and shift lengths.

In addition, the slide length of the sequence has a significant effect on the results. As the slide length increases, sampling is faster and fewer samples are generated, allowing the model to converge faster; however, the experiments revealed a significant decrease in recognition accuracy. A larger step reduces the overlap between the generated time-series samples, so the model cannot fully learn continuous temporal features and may skip important information, whereas a smaller step produces more overlapping data that helps the model better learn the data distribution. Therefore, increasing the slide length to reduce computation is not suitable for this system, as it degrades recognition performance.

Comparison between single models

This system uses an ensemble of models to aggregate features and improve recognition accuracy. To evaluate the feature extraction ability of individual models, five models are compared: Random Forest, XGBoost, GBDT, PointNet, and Decision Tree, and the effect of the different spatial feature extraction modules on the experimental results is evaluated. The experiment uses the gesture set collected in an open environment with a 4:1 ratio of training to test samples. The comparison results are shown in Fig. 22.
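A sketch of how the tree-based single-model comparison might be run is shown below, assuming each gesture sample has been flattened to a fixed-length feature vector with an integer class label; PointNet is omitted because it requires a separate deep-learning pipeline, and all hyperparameters here are library defaults rather than the paper's settings.

```python
# Sketch of the single-model comparison on flattened gesture features.
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=200),
    "GBDT": GradientBoostingClassifier(),
    "XGBoost": XGBClassifier(),
}

def compare(X, y):
    """Train/test each model with the 4:1 split used in the experiment."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        print(f"{name}: {accuracy_score(y_te, model.predict(X_te)):.3f}")
```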

Fig. 22

Accuracy and error rates for different single models.

The comparison results show that the accuracies of Random Forest and Decision Tree are significantly lower than those of the other three models, both falling below 85%. The decision tree model is prone to overfitting the training data, especially when the data are noisy or the features complex. Although Random Forest mitigates overfitting by ensembling multiple decision trees, it still hits a performance bottleneck on high-dimensional, sparse millimeter-wave radar point cloud data: the randomness of feature and sample selection when constructing each tree can cause important features to be under-captured.

In the experiments, PointNet achieved higher recognition accuracy for certain small-amplitude gestures, such as (E) "pull" and (F) "push". This is because PointNet's architecture processes point cloud data directly, preserving subtle geometric information and local features. However, PointNet is prone to confusion between gestures with opposite directions: such gesture pairs tend to be symmetric, and PointNet may not adequately capture the differences between these symmetric patterns.

In contrast, XGBoost and GBDT handle this problem effectively because they capture nonlinear relationships in the data. Gestures with opposite directions may exhibit nonlinear differences in the point cloud data, and these models capture such features through the split points of their decision trees, improving recognition accuracy. The right panel of Fig. 22 shows the error rates of the five models: XGBoost and GBDT have significantly lower error rates than the other models, and their feature extraction results are correspondingly better.

Comparison with existing work

To validate the performance of RaGeoSense, we compared it with existing millimeter-wave radar gesture recognition methods; the experimental results are shown in Table 5. Doppler image-based methods mainly utilize Doppler frequency shift information to generate 2D spectral images, which suits dynamic, simple gestures such as sliding and waving but limits the representation of spatial information. The first two comparison experiments use the datasets of the respective methods. The first uses a deep convolutional neural network (DCNN) to classify Doppler spectra, but its real-time recognition accuracy is limited [34]. The second significantly enhances the expressive power of gesture recognition by fusing multidimensional features such as range-time maps, Doppler-time maps, and angle-time maps [37], but it is less robust to environmental disturbances.

Table 5 Comparison with other methods.

Point cloud-based methods carry richer spatial information; the latter three experiments are evaluated using the point cloud data and four-dimensional features from this paper. A neural network method with an attention mechanism improves robustness by adaptively focusing on important features and suppressing noise, but offers limited real-time improvement, with a recognition accuracy of 93.10% [50]. Methods such as PointNet that directly process sparse point clouds can effectively extract spatial geometric features and achieve high accuracy (98.53%) on the gesture recognition task, but their computational complexity is high and real-time performance poor [51]. In contrast, RaGeoSense enhances the continuity of temporal features through sliding sequence sampling and point cloud splicing, and jointly analyzes the spatial structure and temporal dynamics of gestures, improving recognition accuracy while ensuring real-time response and significantly broadening the system's applicability. In addition, some deep learning-based millimeter-wave radar gesture recognition methods are prone to vanishing or exploding gradients when dealing with complex gestures, which destabilizes training and performance. RaGeoSense largely avoids this gradient problem by using GBDT and XGBoost to extract spatial features and combining them with the gated recurrent structure of the LSTM, improving model stability and classification efficiency. Experimental results show that the average response time of RaGeoSense is only 103 ms, faster than most methods; it balances recognition accuracy with real-time performance and has broad applicability in scenarios such as the smart home.
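To make the hybrid architecture concrete, the sketch below shows the temporal half of such a pipeline: per-frame spatial feature vectors (assumed here to come from the GBDT/XGBoost stage) are fed to an LSTM classifier over the 8 gesture classes. All dimensions, names, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Conceptual sketch: tree-derived spatial features -> LSTM -> gesture class.
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_classes=8):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, x):              # x: (batch, seq_len, feat_dim)
        _, (h_n, _) = self.lstm(x)     # h_n: (num_layers, batch, hidden)
        return self.head(h_n[-1])      # logits over the 8 gesture classes

model = GestureLSTM()
logits = model(torch.randn(4, 45, 64))  # 4 samples, 45-frame sequences
print(logits.shape)                     # torch.Size([4, 8])
```

The gating in the LSTM regulates gradient flow through time, which is the property the text credits for avoiding vanishing or exploding gradients on long gesture sequences.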
