1. Introduction
Automated attendance has long been positioned as a practical biometric application because it can reduce manual roll calls, shorten administrative processing time, and improve traceability. Among biometric modalities, face recognition is particularly suitable for attendance because it is contactless, non-invasive, and naturally compatible with cameras already embedded in phones, tablets, and laptops. However, an attendance scenario is significantly more challenging than a controlled authentication kiosk. Images may be captured under inconsistent lighting, with pose changes, partial occlusions, background clutter, and varying camera quality. In addition, a system that only recognizes identity without validating liveness can be deceived by printed photographs, replayed videos, or other presentation attacks.
The source manuscript developed a three-block attendance pipeline that explicitly combines face detection, anti-spoofing, and identity recognition. This paper re-frames that work as an architecture-oriented contribution. The core argument is that reliable attendance in unconstrained environments depends not merely on choosing accurate models, but on defining a secure execution order and strict data contracts among system components. In this reformulated view, the contribution is a modular and security-aware recognition architecture in which each block narrows the uncertainty for the next one: the detector limits processing to plausible face regions, the PAD block verifies that the subject is live, and the recognition block performs identity verification only after the input has passed the preceding security gate.
The rest of the paper is organized as follows. Section 2 summarizes the technical background motivating the architectural choices. Section 3 details the proposed modular pipeline and the role of each block. Section 4 presents the dataset design and evaluation protocol inherited from the source study. Section 5 discusses the reported results and interprets them from a deployment and security perspective. Section 6 outlines limitations and future work, and Section 7 concludes the paper.
2. Background and Design Motivation
Modern face recognition systems typically comprise at least three conceptual tasks: localization, liveness assessment, and identity representation. FaceNet remains a foundational choice for representation learning because it maps faces into a compact Euclidean embedding space and supports scalable verification through distance-based comparison [1]. Residual learning, as introduced in ResNet, makes deep feature extraction stable [2], while EfficientNet demonstrates that compound scaling can improve the accuracy–efficiency trade-off in convolutional networks [3]. On the detection side, recent Ultralytics models employ anchor-free heads and optimized training/inference pipelines that are well suited to real-time localization [4, 5]. In parallel, the PAD literature has matured from simple photo-attack detection toward broader protection against both physical and digital presentation attacks [6]. These developments make an integrated design feasible, but they also highlight a central engineering question: how should the individual capabilities be assembled into a trustworthy attendance pipeline?
A purely accuracy-driven design is insufficient. If recognition is executed before liveness validation, an attacker may exploit a strong recognizer with a simple spoofing artifact. If PAD is executed on loosely localized faces, geometric measurements become unstable and false decisions increase. A production-ready attendance system therefore requires architectural ordering, not just model selection. The source study addresses this by treating the pipeline as a sequence of constrained transformations. The present paper adopts and expands that perspective.
3. Proposed Modular and Security-Aware Architecture
The proposed architecture receives a short face video rather than a single still image. This design enables temporal reasoning for liveness and increases robustness to frame-level failures. The video is decomposed into frames, and each frame is processed by the detection block to identify a face bounding box. Only detected face crops are forwarded to the PAD block. If the subject successfully satisfies the liveness challenge, a subset of the validated face crops is passed to the recognition block to compute embeddings and verify identity. In other words, the pipeline enforces a security-first policy: no recognition without detection, and no recognition without liveness.
Table 1. Pipeline blocks, selected methods, and their security roles.

| Block | Input | Selected method | Primary function | Security role |
|---|---|---|---|---|
| Face detection | Video frames | YOLO11 | Localize valid face regions | Reject non-face or weakly localized inputs |
| PAD / liveness | Detected face crops | 68-landmark challenge-response module | Verify that the presenter is live | Block spoofed or replayed presentations before recognition |
| Identity recognition | Live-validated face crops | FaceNet | Generate embeddings and verify identity | Accept recognition only after liveness verification |
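The security-first ordering described above can be expressed as a minimal gate chain. The sketch below is illustrative only: the `FaceCrop` record and the callback names are assumptions introduced here, not part of the source implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class FaceCrop:
    frame_index: int
    box: tuple          # (x1, y1, x2, y2) in pixels
    confidence: float
    pixels: object      # cropped image data, e.g. a numpy array

def run_attendance_pipeline(
    frames: List[object],
    detect: Callable[[object, int], Optional[FaceCrop]],
    is_live: Callable[[List[FaceCrop]], bool],
    verify_identity: Callable[[List[FaceCrop]], Optional[str]],
) -> Optional[str]:
    """Security-first ordering: detection -> liveness -> recognition.
    Returns a matched identity, or None if any gate fails."""
    crops = [c for c in (detect(f, i) for i, f in enumerate(frames)) if c]
    if not crops:
        return None                # gate 1: no confident face localized
    if not is_live(crops):
        return None                # gate 2: liveness challenge failed
    return verify_identity(crops)  # gate 3: identity decided last
```

The point of the structure is that recognition is unreachable unless both earlier gates succeed, which is exactly the "no recognition without detection, no recognition without liveness" policy.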
3.1. Block 1: Face Detection as the Entry Gate
The first block is responsible for constraining the visual search space. The source study evaluated MTCNN, YOLOv8, and YOLO11, selecting YOLO11 as the final detector because it achieved the highest accuracy and slightly better recall under the reported conditions; this is also consistent with the strong real-time orientation of recent Ultralytics detectors [4, 5]. In the final architecture, YOLO11 processes each frame and outputs face bounding boxes with confidence scores. Non-maximum suppression is used to suppress overlapping boxes, and the retained regions become the only inputs accepted by the subsequent block. This design has two direct benefits. First, it reduces computational waste because later stages operate on cropped face regions rather than full frames. Second, it improves security by limiting downstream analysis to localized faces instead of arbitrary background patterns.
Architecturally, the detector functions as an input validator. If a frame does not contain a sufficiently confident face, that frame should not contribute to liveness or identity decisions. In operational deployments, this strategy reduces the likelihood that noisy frames, background posters, or incidental bystanders contaminate the attendance transaction.
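The confidence gating and non-maximum suppression step can be sketched as a greedy filter over scored boxes. The thresholds below are common illustrative defaults, not values taken from the source study.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def gate_detections(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Drop low-confidence boxes, then apply greedy NMS: the highest-scoring
    box wins any overlap above iou_thresh. Returns the indices kept."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in kept):
            kept.append(i)
    return kept
```

Only the surviving indices would be cropped and forwarded to the PAD block; everything else is discarded at the entry gate.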
3.2. Block 2: Presentation Attack Detection as a Security Firewall
The second block is the most security-critical component because it determines whether the presented face belongs to a live participant rather than to an attack instrument. Instead of relying on specialized depth sensors or computationally heavy texture models, the source framework adopts a landmark-based challenge-response design. A detected face is analyzed through 68 facial landmarks, and liveness is inferred from temporal geometric changes associated with instructed actions such as eye closure, blinking, smiling, or sadness [8, 9]. The underlying idea is pragmatic: attendance systems commonly operate on commodity RGB cameras, so liveness verification must remain feasible without dedicated hardware.
This block is best understood as a firewall rather than as a classifier in isolation. Its role is to filter recognition requests using a rule-governed temporal test. For eye-based verification, the system tracks the eye aspect ratio across frames and validates that open and closed states both occur above a minimum temporal threshold [9]. For expression-based verification, the system evaluates geometric changes in mouth and eyebrow landmarks. Because the challenge is randomized or prompted interactively, the attacker must not only display a target face but also reproduce temporally coherent biological motion. Although landmark-based PAD is not a complete defense against sophisticated replay attacks, it offers a favorable security–cost trade-off for mobile attendance settings.
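For the eye-based challenge, the eye aspect ratio can be computed from the six landmarks of each eye in the 68-point scheme, with a temporal check that both open and closed states persist. The sketch below uses the standard EAR formulation; the specific threshold and frame-count values are illustrative assumptions, not parameters reported by the source study.

```python
import math

def ear(eye):
    """Eye aspect ratio from the six eye landmarks (p1..p6) of the
    68-point scheme: EAR = (|p2-p6| + |p3-p5|) / (2 |p1-p4|)."""
    return (math.dist(eye[1], eye[5]) + math.dist(eye[2], eye[4])) / (
        2.0 * math.dist(eye[0], eye[3]))

def blink_validated(ear_series, closed_thresh=0.21, min_frames=2):
    """Pass the challenge only if both the closed-eye and open-eye states
    persist for at least `min_frames` consecutive frames each."""
    def longest_run(pred):
        best = cur = 0
        for v in ear_series:
            cur = cur + 1 if pred(v) else 0
            best = max(best, cur)
        return best
    return (longest_run(lambda v: v < closed_thresh) >= min_frames
            and longest_run(lambda v: v >= closed_thresh) >= min_frames)
```

Requiring a minimum duration for both states is what makes a static printed photograph fail: it can hold one state indefinitely but cannot produce the transition.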
From an architectural viewpoint, the PAD block turns face recognition from a passive perception task into an active verification workflow. This shift is important. Passive attendance pipelines often optimize convenience at the expense of trust, whereas challenge-response PAD adds a modest interaction burden in exchange for significantly better resistance to low-cost spoofing.
3.3. Block 3: FaceNet-Based Identity Recognition
Once liveness has been established, the framework extracts a fixed number of validated face crops and forwards them to the identity block. The source study compared EfficientNet-B3, ResNet50, and FaceNet; FaceNet produced the strongest recognition metrics, an outcome consistent with its core design objective of learning compact embeddings in which same-identity samples lie close together and different identities are separated by a margin [1]. Inference then becomes a metric-comparison problem rather than a full retraining problem for every enrollment update.
The recognition block therefore contributes both accuracy and operational scalability. Because decisions are made in embedding space, new users can be enrolled with limited data, and template comparison remains efficient even when the attendance roster grows. This is especially useful in universities, events, or volunteer programs where the set of participants changes over time.
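Verification in embedding space then reduces to a nearest-template distance test along the following lines. The gallery structure and the distance threshold here are illustrative assumptions; the source study does not specify its matching rule at this level of detail.

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def verify(probe_emb, gallery, threshold=1.1):
    """FaceNet-style verification: compare an L2-normalized probe embedding
    against enrolled templates by Euclidean distance and accept the nearest
    identity if it falls under `threshold` (value illustrative)."""
    probe = l2_normalize(probe_emb)
    best_id, best_d = None, float("inf")
    for identity, template in gallery.items():
        t = l2_normalize(template)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(probe, t)))
        if d < best_d:
            best_id, best_d = identity, d
    return best_id if best_d <= threshold else None
```

Enrolling a new user is just adding one more template to the gallery, which is why embedding-based recognition scales with roster churn without retraining.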
3.4. Integration Logic and Failure Containment
The strength of the framework lies not only in the three constituent blocks but also in the way they are connected. Each block consumes a narrower and more trustworthy input than the previous one. Detection constrains spatial uncertainty, PAD constrains authenticity uncertainty, and recognition resolves identity uncertainty. Because the interfaces are deterministic and well defined, error analysis can also be performed block by block. A missed detection is not confused with a spoofing failure, and a failed liveness challenge is not misinterpreted as an identity mismatch.
This modularity simplifies maintenance and future upgrades. For example, a stronger detector or a more advanced PAD model can be substituted while preserving the rest of the pipeline, provided the input–output contracts are maintained. Such separation is especially valuable in real deployments, where datasets evolve, hardware changes, and policy requirements become stricter over time.
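The block-wise failure attribution described above can be made explicit with a structured transaction outcome. The enum names below are hypothetical, introduced only to illustrate the idea.

```python
from enum import Enum

class Outcome(Enum):
    NO_FACE = "no_face"        # block 1 failure: detection
    NOT_LIVE = "not_live"      # block 2 failure: liveness
    UNKNOWN_ID = "unknown_id"  # block 3 failure: recognition
    ACCEPTED = "accepted"

def attribute_failure(crops, live, identity):
    """Map a transaction to the first block that failed, so a missed
    detection is never conflated with a spoofing or identity failure."""
    if not crops:
        return Outcome.NO_FACE
    if not live:
        return Outcome.NOT_LIVE
    if identity is None:
        return Outcome.UNKNOWN_ID
    return Outcome.ACCEPTED
```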
4. Experimental Design and Data Construction
The source manuscript assembled a multi-source dataset aligned with the three-block architecture [10]. For detection, face localization data were drawn from WIDER FACE, a Face-Detection-Dataset subset, and FDDB. For PAD, live and spoof examples were aggregated from CelebA-Spoof, WFLW, and iBUG-300W, with spoof categories merged into a binary live-versus-spoof task. For identity recognition, the study repurposed CelebA-Spoof, CASIA-FASD, and LCC-FASD into an identity-labeled corpus. The central methodological strength of this design is that each data partition mirrors the functional input requirements of one pipeline block while still contributing to the integrity of the full attendance workflow.
The model-selection protocol was block-wise. Rather than comparing full pipelines end to end from the outset, the study evaluated multiple representative candidate models per block under common hardware conditions. This strategy makes engineering sense because it localizes trade-offs. The best deployment choice is often not the globally most complex model, but the model that best satisfies the objective and resource profile of its own block.
Table 2. Block-wise candidate models and selection rationale.

| Block | Candidate models | Selected model | Selection rationale from source results |
|---|---|---|---|
| Detection | MTCNN, YOLOv8, YOLO11 | YOLO11 | Best reported accuracy (60%) and slightly better recall (58.2%) among the tested detectors |
| PAD | MiniFASNet, MediaPipe, FacialLandmark68 | FacialLandmark68 | Marginally best recall/F1 among evaluated liveness models while remaining low-cost and interpretable |
| Recognition | EfficientNet-B3, ResNet50, FaceNet | FaceNet | Best reported recognition metrics: 96.31% accuracy, 95.13% precision, and 96.35% F1-score |
5. Results and Discussion
The reported results support the final configuration of YOLO11 + FacialLandmark68 + FaceNet. In the detection block, the two YOLO variants substantially outperformed MTCNN, with YOLO11 offering the best reported accuracy and recall. In the liveness block, the landmark-based approach slightly surpassed MediaPipe on the available metrics while preserving low computational complexity. MiniFASNet exhibited reasonable precision but poor recall in the reported experiment, making it risky for a security-sensitive setting where missed spoof attacks are costly. In the recognition block, FaceNet clearly exceeded EfficientNet-B3 and ResNet50 on accuracy, precision, and F1-score.
Architecturally, these results indicate that a practical attendance system benefits from pairing a strong real-time detector with a low-cost, interaction-based PAD layer and an embedding-centric recognizer. This combination does not attempt to maximize every isolated benchmark, but it creates a coherent system profile: spatially stable, security-aware, and lightweight enough for real deployment. The modular arrangement also improves interpretability. Administrators can diagnose whether a user failed attendance because the face was not localized, the liveness challenge was not passed, or the identity was not matched.
The source manuscript also reports stable convergence behavior for the chosen models during fine-tuning. Although the available evaluation is primarily block-wise rather than fully transactional, the evidence is sufficient to justify the proposed architecture as a solid systems baseline. For publication purposes, the paper’s main value lies in showing how PAD can be embedded directly into the attendance workflow instead of being appended as a secondary feature.
6. Limitations and Future Work
Two limitations should be acknowledged. First, the selected landmark-based PAD approach is attractive for cost-sensitive deployment, but it remains less comprehensive than advanced texture-temporal or multimodal PAD systems when confronting sophisticated replay or mask attacks. Second, the source evaluation emphasizes block-level metrics; future work should report end-to-end transaction success, full pipeline latency, attack-wise ACER or APCER/BPCER, and robustness under cross-device deployment.
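For reference, the ISO-style PAD metrics mentioned here admit a simple computation. This is a minimal sketch under one assumed convention: scores are liveness scores, with values above the threshold classified as live.

```python
def pad_error_rates(live_scores, attack_scores, threshold):
    """APCER: fraction of attack presentations accepted as live.
    BPCER: fraction of bona fide (live) presentations rejected.
    ACER: the mean of the two, (APCER + BPCER) / 2."""
    apcer = sum(s > threshold for s in attack_scores) / len(attack_scores)
    bpcer = sum(s <= threshold for s in live_scores) / len(live_scores)
    return apcer, bpcer, (apcer + bpcer) / 2.0
```

Reporting these per attack type (print, replay, mask) rather than as a single pooled number is what "attack-wise" evaluation would mean for the proposed pipeline.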
Future research can therefore proceed along three directions. One direction is to replace or augment the current PAD module with a stronger temporal deep model while preserving the same modular interfaces. A second direction is to study adaptive thresholding and confidence fusion across blocks. A third direction is deployment validation in real classrooms or events, including network variability, user behavior, and privacy-aware template management.
7. Conclusion
This paper presented a reformulated architecture-centric version of the original integrated face-recognition study. The resulting design is a modular and security-aware attendance framework in which face detection, presentation attack detection, and identity recognition are executed as a deterministic sequence with explicit interfaces. The reported block-wise results support the use of YOLO11 for detection, landmark-based challenge-response PAD for liveness, and FaceNet for identity verification. More importantly, the study demonstrates that robust attendance is not merely a model-selection problem; it is an architectural problem requiring careful ordering, failure containment, and security-aware integration. Under these principles, the proposed framework offers a practical foundation for real-world attendance systems built on commodity cameras and scalable recognition back ends.