
Comparative analysis and optimization of detection, liveness, and recognition models in an integrated face-based attendance framework

Citation

La Q. T., Nguyen T. Q., Bui H. N., Pham T. C., Nguyen T. N. Comparative analysis and optimization of detection, liveness, and recognition models in an integrated face-based attendance framework // Актуальные исследования. 2026. №15 (301). URL: https://apni.ru/article/14794-comparative-analysis-and-optimization-of-detection-liveness-and-recognition-models-in-an-integrated-face-based-attendance-framework

Abstract

This paper reworks the source study into a comparative and optimization-focused manuscript centered on model selection for an integrated face-based attendance framework. Rather than presenting the system mainly as a unified architecture, the paper analyzes how different detectors, liveness models, and recognition backbones behave under common deployment constraints and how those observations inform pipeline optimization. The benchmark spans three model families for each block: MTCNN, YOLOv8, and YOLO11 for detection; MiniFASNet, MediaPipe, and a 68-landmark challenge-response method for liveness; and EfficientNet-B3, ResNet50, and FaceNet for recognition. The source results indicate that YOLO11 offers the strongest detection performance among the tested candidates, the landmark-based PAD design yields the most practical liveness trade-off under the reported setup, and FaceNet provides the best identity discrimination. Building on those observations, the paper derives an optimization strategy based on block-wise benchmarking, two-stage fine-tuning, early stopping, identity-balanced sampling, and threshold-aware inference. The analysis shows that effective attendance systems should be optimized as heterogeneous pipelines rather than as monolithic models, with each block chosen for its local objective, interface quality, and deployment cost.

Article text

1. Introduction

Face-based attendance systems operate under a mixed set of goals. They must localize a face reliably, verify that the presenter is live, and then match the face to an enrolled identity. These goals are related but not identical, and the best model for one block is not automatically the best model for another. This is why end-to-end system quality is often limited less by any single model and more by how candidate models are compared, selected, and tuned before integration.

The source manuscript already contained block-level comparisons across detection, anti-spoofing, and recognition stages. This paper restructures those results into a comparative analysis. The emphasis is therefore placed on experimental design, benchmark interpretation, and optimization logic. In particular, the paper asks three questions: (1) which detector best balances real-time localization and robustness, (2) which liveness strategy is most appropriate for a consumer-camera attendance pipeline, and (3) which recognition backbone yields the most discriminative identity representations after domain adaptation? The answers are then translated into a practical optimization recipe for integrated deployment.

2. Benchmark Design

2.1. Candidate Models by Block

The benchmark was designed to compare representative models with distinct inductive biases and deployment profiles [1, 2, 3, 4, 5, 6, 7, 8, 9]. In the detection block, MTCNN represents a classical cascaded CNN detector, while YOLOv8 and YOLO11 represent modern one-stage detection architectures [4, 5]. In the liveness block, MiniFASNet represents a lightweight appearance-based anti-spoofing family, MediaPipe provides dense facial geometry for real-time landmark analysis [7], and the selected 68-landmark method emphasizes interpretable geometric motion cues [8, 9]. In the recognition block, ResNet50 offers a strong residual-feature baseline [2], EfficientNet-B3 introduces a parameter-efficient scaling strategy [3], and FaceNet represents an embedding-centric recognition paradigm [1].

Table 1. Candidate models and comparison criteria by pipeline block

| Block | Candidates | Comparison objective | Decision criterion |
|---|---|---|---|
| Detection | MTCNN, YOLOv8, YOLO11 | Localization quality and robustness | Choose the detector with the strongest practical accuracy/recall balance |
| Liveness/PAD | MiniFASNet, MediaPipe, FacialLandmark68 | Spoof filtering under RGB-only deployment | Prioritize recall/F1 and low operational cost |
| Recognition | EfficientNet-B3, ResNet50, FaceNet | Identity discrimination and enrollment scalability | Prioritize accuracy/F1 and efficient embedding inference |

2.2. Datasets and Split Strategy

The benchmark inherits the source study’s multi-source data construction. Detection training and testing used images from WIDER FACE, Face-Detection-Dataset, and FDDB. The PAD benchmark used live and spoof samples drawn from CelebA-Spoof, WFLW, and iBUG-300W, with binary live/spoof labels and augmentation to increase robustness. The recognition benchmark used identity-labeled samples organized from CelebA-Spoof, CASIA-FASD, and LCC-FASD. Importantly, the study reports subject-level or identity-level separation where applicable to reduce leakage across train and test splits.
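The identity-level separation described above can be sketched as follows. This is a minimal illustration, not the source study's actual tooling: the `(identity, path)` sample format, the function name, and the split fraction are assumptions.

```python
import random
from collections import defaultdict

def identity_level_split(samples, test_frac=0.2, seed=0):
    """Split (identity, path) pairs so that no identity appears in both
    the train and test sets, preventing subject-level leakage."""
    by_id = defaultdict(list)
    for ident, path in samples:
        by_id[ident].append(path)
    ids = sorted(by_id)
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    test_ids = set(ids[:n_test])
    train = [(i, p) for i in ids if i not in test_ids for p in by_id[i]]
    test = [(i, p) for i in test_ids for p in by_id[i]]
    return train, test
```

Splitting at the identity level (rather than the image level) is what makes the reported recognition numbers meaningful: a model cannot score well simply by memorizing faces it saw during training.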

2.3. Evaluation Metrics and Practical Criteria

Each block was evaluated with metrics aligned to its function. Detection used accuracy, precision, recall, and F1-score to capture localization quality. The PAD block used precision, recall, and F1-score, along with normalized mean error (NME) for landmark localization, because the liveness method depends on geometric stability. Recognition used accuracy, precision, recall, and F1-score to assess identity discrimination. Beyond quality metrics, the source study considered deployment factors such as training stability, inference feasibility on limited hardware, and compatibility with cloud environments using NVIDIA T4 GPUs.

3. Comparative Results

3.1. Detection Models

Table 2. Detection model comparison

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|
| MTCNN | 38.17 | 39.0 | 24.4 | 30.16 |
| YOLOv8 | 59.0 | 90.2 | 58.1 | 69.0 |
| YOLO11 | 60.0 | 89.1 | 58.2 | 69.0 |

The detector comparison reveals a clear pattern: modern YOLO variants substantially outperform MTCNN under the reported conditions. MTCNN reached only 38.17% accuracy and an F1-score of 30.16%, while YOLOv8 and YOLO11 both achieved a 69.0% F1-score with much higher precision. YOLO11 edges out YOLOv8 in accuracy and recall, making it the more suitable choice when the detector acts as the pipeline's gatekeeper. Even a modest recall improvement is valuable here because missed detections reduce both liveness opportunities and recognition throughput.

3.2. Liveness / PAD Models

Table 3. Liveness/PAD model comparison

| Model | NME (%) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|
| MiniFASNet | N/A | 71.0 | 17.22 | 22.2 |
| MediaPipe | 72.85 | 73.1 | 49.5 | 48.5 |
| FacialLandmark68 | 72.96 | 73.3 | 49.7 | 48.6 |

The PAD comparison is more nuanced. MiniFASNet showed acceptable precision but extremely low recall, indicating that it missed too many spoof-related cases under the tested configuration. MediaPipe and the 68-landmark method were nearly tied, but the landmark method performed marginally better on recall and F1-score. Because liveness in attendance is a security filter rather than a cosmetic enhancement, this small edge matters. Moreover, the landmark-based design remains attractive because it is interpretable, compatible with standard RGB cameras, and easier to embed into an interactive challenge-response flow.
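As a concrete illustration of the geometric cues a landmark-based challenge-response flow can use, the eye-aspect-ratio (EAR) blink test of [9] reduces six eye-contour landmarks to a scalar that dips sharply during a blink. The sketch below follows that paper's formulation; the threshold and minimum-frame values are illustrative defaults, not settings reported in the source study.

```python
import math

def ear(eye):
    """Eye aspect ratio from six (x, y) eye-contour landmarks p1..p6,
    ordered as in Soukupova & Cech [9]: EAR stays roughly constant for
    an open eye and drops toward zero when the eye closes."""
    d = math.dist
    return (d(eye[1], eye[5]) + d(eye[2], eye[4])) / (2.0 * d(eye[0], eye[3]))

def blink_detected(ear_series, thresh=0.21, min_frames=2):
    """Report a blink when EAR stays below `thresh` for at least
    `min_frames` consecutive frames (a simple temporal liveness cue)."""
    run = 0
    for value in ear_series:
        run = run + 1 if value < thresh else 0
        if run >= min_frames:
            return True
    return False
```

Because the test is purely geometric, it works with commodity RGB cameras and its failure modes (missed or jittery landmarks) are directly visible, which is part of why the landmark-based design is described above as interpretable.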

3.3. Recognition Models

Table 4. Recognition model comparison

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|
| EfficientNet-B3 | 91.72 | 87.18 | 97.83 | 92.20 |
| ResNet50 | 92.11 | 86.83 | 99.28 | 92.64 |
| FaceNet | 96.31 | 95.13 | 97.61 | 96.35 |

The recognition comparison shows the clearest winner. FaceNet achieved 96.31% accuracy, 95.13% precision, 97.61% recall, and 96.35% F1-score, outperforming both EfficientNet-B3 and ResNet50. This result suggests that for attendance verification, a compact metric-learning embedding can be more effective than treating the task primarily as a conventional classification problem. The advantage is operational as well as statistical: embedding-based recognition is naturally suited to enrollment updates and similarity thresholding.
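The embedding-based matching mode can be sketched in a few lines: enrollment stores one embedding per identity, and a probe is accepted only if its best similarity clears a threshold. This is a minimal illustration assuming cosine similarity and a flat gallery dictionary; the threshold value is a placeholder, not a calibrated number from the source study.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def identify(probe, gallery, threshold=0.6):
    """Return (best_id, score) if the best match clears the threshold,
    else (None, score). `gallery` maps enrolled id -> embedding, so
    adding a new person is just one dictionary insert, no retraining."""
    best_id, best_s = None, -1.0
    for ident, emb in gallery.items():
        s = cosine_sim(probe, emb)
        if s > best_s:
            best_id, best_s = ident, s
    return (best_id, best_s) if best_s >= threshold else (None, best_s)
```

The operational advantage noted above is visible here: enrollment updates never touch the model, and the open-set "unknown" decision is a single threshold comparison.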

4. Optimization of the Integrated Pipeline

4.1. Why Block-Wise Optimization Matters

A key lesson from the benchmark is that attendance systems should be optimized as heterogeneous pipelines. Detection, liveness, and recognition impose different computational and statistical burdens, so model choice should be localized to each block. The optimal detector is not the optimal recognizer, and the optimal recognizer may be too expensive or too opaque for liveness. By benchmarking candidates separately, the system designer can select the best detector for spatial reliability, the best PAD model for security and interaction cost, and the best recognizer for discriminative embeddings.

4.2. Training and Fine-Tuning Strategy

The source study’s final optimization strategy is sensible and transferable. YOLO11 is fine-tuned with standard detection objectives, data augmentation, early stopping, and learning-rate scheduling. The landmark/PAD module is tuned to reduce normalized mean error under realistic capture conditions so that geometric cues remain stable enough for temporal liveness tests. FaceNet is optimized in two stages: first, the backbone is frozen while the classification head adapts to the target identities; second, the backbone is unfrozen for end-to-end refinement. This staged procedure reduces catastrophic drift in the early epochs and encourages more stable convergence.
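The staged schedule and early stopping described above can be expressed framework-agnostically. The epoch counts, learning rates, and patience below are illustrative placeholders, not the source study's hyperparameters.

```python
def two_stage_schedule(total_epochs, freeze_epochs=5,
                       head_lr=1e-3, finetune_lr=1e-4):
    """Yield (epoch, backbone_trainable, lr): first a head-only warm-up
    with the backbone frozen, then end-to-end refinement at a lower LR."""
    for epoch in range(total_epochs):
        if epoch < freeze_epochs:
            yield epoch, False, head_lr
        else:
            yield epoch, True, finetune_lr

class EarlyStopping:
    """Stop training when validation loss fails to improve by at least
    `min_delta` for `patience` consecutive epochs."""
    def __init__(self, patience=3, min_delta=1e-4):
        self.patience, self.min_delta = patience, min_delta
        self.best, self.bad = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad = val_loss, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> stop training
```

In a real training loop, the `backbone_trainable` flag would toggle `requires_grad` on the backbone parameters and the optimizer would be rebuilt (or its learning rate updated) at the stage boundary.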

Two additional optimization choices are especially important. The first is identity-balanced sampling, which reduces bias toward classes with more images. The second is threshold-aware inference. Because the final recognition decision is based on similarity in embedding space, the operational threshold should be calibrated using validation data, and it should ideally be re-estimated when the camera domain or roster composition changes.
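Threshold-aware inference can be made concrete with a simple calibration sweep over validation pairs: genuine (same-identity) and impostor (different-identity) similarity scores are collected, and the operating threshold is chosen to maximize F1. The pair construction and the F1 objective are assumptions for illustration; a deployment might instead fix a target false-accept rate.

```python
def calibrate_threshold(genuine, impostor):
    """Sweep every observed score as a candidate threshold and return
    the (threshold, f1) pair that maximizes F1 on validation data.
    `genuine` and `impostor` are lists of similarity scores."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(genuine) | set(impostor)):
        tp = sum(s >= t for s in genuine)    # genuine pairs accepted
        fp = sum(s >= t for s in impostor)   # impostor pairs accepted
        fn = len(genuine) - tp               # genuine pairs rejected
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

Re-running this sweep when the camera domain or roster changes, as recommended above, is cheap because it only needs scores, not gradient updates.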

4.3. Deployment Recommendations

Based on the reported evidence, a practical deployment recipe emerges. Use YOLO11 as the default detector for robust entrance filtering. Use landmark-based challenge-response PAD when only commodity RGB cameras are available and latency budgets are tight. Use FaceNet as the verification engine for scalable identity management. Around these core choices, employ frame subsampling, confidence gating, early exit on failed liveness, and cache-friendly embedding storage. In cloud-assisted attendance, asynchronous upload can be combined with local pre-screening so that weak or clearly spoofed sessions never reach the identity server.
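The deployment tactics above (frame subsampling, confidence gating, early exit on failed liveness) compose into a single control loop. The sketch below is schematic: the callback signatures, stride, and confidence threshold are illustrative assumptions, not the source system's interfaces.

```python
def attendance_session(frames, detect, liveness_ok, recognize,
                       det_conf=0.5, frame_stride=3):
    """Gated per-session pipeline:
    - frame subsampling: process every `frame_stride`-th frame only;
    - confidence gating: skip frames with weak/absent detections;
    - early exit: abort the session if liveness fails, so spoofed
      sessions never reach the (more expensive) recognition stage."""
    for i, frame in enumerate(frames):
        if i % frame_stride:
            continue                      # frame subsampling
        box, conf = detect(frame)         # -> (bbox or None, confidence)
        if box is None or conf < det_conf:
            continue                      # confidence gating
        if not liveness_ok(frame, box):
            return None                   # early exit on failed liveness
        return recognize(frame, box)      # identity match for survivors
    return None
```

In the cloud-assisted variant described above, `recognize` would be the only step that contacts the identity server, so everything rejected earlier stays local.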

5. Discussion

The benchmark also reveals where further optimization is still needed. The liveness results remain materially lower than the recognition results, which indicates that PAD is currently the weakest block in the integrated framework. This is not surprising: liveness generalization is harder than identity discrimination, especially when operating with only RGB video and limited attack diversity. Future optimization should therefore focus on temporal modeling, attack-wise calibration, and richer PAD metrics such as APCER, BPCER, and ACER.
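For reference, the PAD metrics named above follow the ISO/IEC 30107-3 definitions: APCER is the fraction of attacks accepted as bona fide, BPCER the fraction of bona fide presentations rejected, and ACER their average. The sketch below assumes a single attack type and a scalar liveness score where higher means more likely live (the standard takes APCER as the maximum over attack types).

```python
def pad_metrics(attack_scores, bona_fide_scores, threshold):
    """ISO/IEC 30107-3 style PAD error rates for one attack type.
    Scores at or above `threshold` are classified as bona fide."""
    apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    bpcer = sum(s < threshold for s in bona_fide_scores) / len(bona_fide_scores)
    acer = (apcer + bpcer) / 2
    return apcer, bpcer, acer
```

Reporting APCER/BPCER at a fixed operating point would make the PAD block comparable across attack types in a way that precision/recall alone does not.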

Another point is that the reported evaluation is predominantly block-wise. For real deployment, the next benchmark iteration should include end-to-end measures such as average transaction latency, percentage of successful attendance sessions, dropout rate per block, and robustness under low-bandwidth or cross-device scenarios. Even so, the present results are sufficient to justify the current selection and optimization strategy.

6. Conclusion

This paper transformed the source manuscript into a comparative and optimization-oriented analysis of an integrated face-based attendance framework. The results support a final pipeline composed of YOLO11 for detection, FacialLandmark68 for liveness, and FaceNet for identity recognition. More importantly, the study shows that robust attendance performance emerges from disciplined block-wise benchmarking and targeted optimization rather than from relying on a single dominant model family. The derived selection and tuning strategy provides a practical blueprint for future attendance systems that must balance security, accuracy, latency, and resource constraints.

References

  1. Schroff F., Kalenichenko D., Philbin J. A Unified Embedding for Face Recognition and Clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  2. He K., Zhang X., Ren S., Sun J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  3. Tan M., Le Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. Proceedings of the 36th International Conference on Machine Learning, 2019.
  4. Ultralytics. Explore Ultralytics YOLOv8. Official documentation, 2023.
  5. Ultralytics. Ultralytics YOLO11. Official documentation, 2024.
  6. Yu Z., Zhao C., Lei Z. Face Presentation Attack Detection. arXiv preprint arXiv:2212.03680, 2022.
  7. Grishchenko I., Bazarevsky V., Raveendran K., et al. Attention Mesh: High-fidelity Face Mesh Prediction in Real-time. arXiv preprint arXiv:2006.10962, 2020.
  8. Kazemi V., Sullivan J. One Millisecond Face Alignment with an Ensemble of Regression Trees. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  9. Soukupová T., Čech J. Real-Time Eye Blink Detection using Facial Landmarks. 21st Computer Vision Winter Workshop, 2016.
