1. Introduction
The rapid growth of digital transformation in Vietnam has led to an explosion of online documents across a wide range of platforms, including document-sharing systems, digital libraries, enterprise content management solutions, learning platforms, and cloud-based storage services. Every day, users upload thousands of PDF and DOCX files containing diverse and unstructured information. This creates an urgent demand for automated solutions capable of handling well-known bottlenecks such as content extraction, safety verification, organization, and summarization.
In most existing systems, text extraction, content moderation, and summarization are implemented as separate components. This fragmentation often results in excessive computational cost due to repeated preprocessing, weak consistency across modules, and a lack of end-to-end oversight to ensure that unsafe or sensitive content does not pass through different stages unnoticed. The challenge becomes more pressing in educational contexts, where ensuring content safety is critical but manual checking is impractical at scale.
A particularly serious gap lies in the absence of a unified workflow capable of automatically filtering documents containing unsafe material. In practice, many uploaded documents include harmful textual or visual content, especially hate speech, violent imagery, or sexually explicit material. These forms of content pose real risks for educational platforms, which must guarantee that students are exposed only to safe and appropriate materials. However, existing systems typically moderate only one modality (usually text) or apply moderation too late in the pipeline, allowing unsafe material to pass into other components such as indexing or summarization.
We propose a unified and extensible framework to mitigate these challenges, defined by the tight integration of three key processing components: extraction, moderation, and summarization. This modular design is crucial for enabling organizations to uphold consistent content quality and system efficiency by reducing redundant computation and intercepting unsafe content early in the pipeline. While the framework is built to scale across various domains, we specifically validate its efficacy in the context of a digital education ecosystem, focusing on a modern digital library system where large volumes of multimodal documents are uploaded and accessed daily.
2. Related Work
This section reviews prior work relevant to each block of our framework: extraction, moderation, and summarization. For every line of work, we briefly summarize its main achievements and then highlight the gaps with respect to our goal of a unified, safety-aware moderation-summarization framework for educational documents.
2.1. Extraction Block
Recent OCR research has focused on building lightweight yet accurate end-to-end systems. Li et al. proposed PP-OCRv3, which upgrades earlier PP-OCR and PP-OCRv2 systems by improving both the detector and recognizer through large-kernel PAN, residual attention feature pyramids, and a transformer-based recognizer (SVTR). Their experiments show up to 5% hmean improvement over PP-OCRv2 at comparable inference speed, and the models are released as open-source components in PaddleOCR [1]. Subsequent applied works have used PP-OCRv3 as a backbone in downstream tasks such as real-time automatic number plate recognition, demonstrating that it can be effectively embedded in practical, latency-sensitive systems [2].
However, existing studies typically optimize Optical Character Recognition (OCR) in isolation, often failing to account for how the extracted content will be consumed by downstream modules such as moderation or summarization. Crucially, in practical applications, we require more than mere text extraction; the data must be effectively transformed into actionable input for specific tasks like summarization or content auditing, a goal frequently unmet by conventional approaches. Furthermore, these studies do not adequately define a clear block-level interface between the extraction stage and subsequent processing stages.
In contrast, our proposed framework treats OCR and parsing as a plug-and-play extraction block: any document-to-text/image system (such as PP-OCRv3 or alternative solutions) can be seamlessly integrated, provided it yields structured text segments and image streams suitable for feeding the moderation block.
2.2. Moderation Block
2.2.1. Text Moderation
Text moderation research has explored both specialized toxicity detectors and general-purpose language models. The Jigsaw Toxic Comment Classification Challenge popularized a multi-label dataset of Wikipedia comments annotated with categories such as toxic, obscene, and threat, becoming a standard benchmark for toxicity detection [3]. Building on such data, Tan et al. introduced BERT-β, a proactive probabilistic moderation model that estimates a toxicity propensity score instead of only making binary decisions and showed improved rank-correlation with human judgments compared to traditional baselines [4, p. 8667-8675].
More recently, DeBERTaV3 extended the DeBERTa architecture using ELECTRA-style replaced-token detection and gradient-disentangled embedding sharing, yielding state-of-the-art results on a wide range of NLU benchmarks with better sample efficiency than conventional masked-language modeling [5]. Public model releases such as microsoft/deberta-v3-base make these capabilities accessible for safety-critical applications.
Despite these advances, existing works generally focus on improving toxicity classification itself and do not specify how moderation models should be integrated as explicit blocks in a larger workflow. They rarely define what happens to documents after they are labeled (e.g., whether downstream tasks such as summarization should be conditioned on moderation outcomes). Our framework addresses this gap by treating text moderation as a hard gate: only segments classified as safe by a DeBERTaV3-based block are allowed to enter the summarization block.
2.2.2. Image Moderation
On the visual side, Akyon and Temizel performed a comparative analysis of nudity classification methods across CNNs, vision transformers, and open-source safety checkers derived from Stable Diffusion and LAION. They show that well-tuned convolutional architectures such as MobileNet variants often outperform more complex models on real-world datasets, and they discuss limitations in current benchmarks (e.g., narrow cultural coverage and ambiguous labels) [6].
In parallel, Howard et al. proposed MobileNetV3, a family of CNNs designed via hardware-aware neural architecture search and NetAdapt. MobileNetV3-Large achieves higher ImageNet accuracy while reducing latency relative to MobileNetV2, making it well suited for mobile and edge deployments [7].
These works demonstrate that MobileNetV3-style models are accurate and efficient for image classification and that CNN-based nudity detectors can serve as strong baselines for content moderation. However, they usually target single-image classification scenarios and are not framed as modular blocks within a document-level pipeline. Our framework extends this line of work by embedding a MobileNetV3-Large classifier as the image-moderation sub-block within a multimodal pipeline, where its binary decision directly controls whether a document is accepted or rejected before summarization.
2.3. Summarization Block
Research on document summarization has shifted from early sequence-to-sequence models to large language models (LLMs), with increasing attention to long-document summarization (LDS). Koh et al. provide an empirical survey of LDS, covering datasets, models, and evaluation metrics, and emphasize that handling long, noisy, or multi-section documents remains challenging even for strong neural models [8]. More recently, Gana et al. present a systematic review of LDS studies from 2022–2024 and highlight open problems around factual consistency, domain adaptation, and reliable evaluation [9].
On the modeling side, Llama 3 represents the latest generation of open foundation models, with dense transformers up to 405B parameters and context windows up to 128K tokens; the authors report competitive performance on benchmarks such as MMLU, GSM8K, and long-context reasoning [10]. The Meta-Llama-3-Instruct variants are explicitly optimized for instruction following and are widely adopted for summarization and assistant-style tasks. In parallel, open-weight models like Mistral-7B demonstrate that compact 7B-parameter transformers with grouped-query and sliding-window attention can rival or surpass larger models such as Llama 2-13B on many benchmarks, providing attractive trade-offs for deployment [11]. On the proprietary side, OpenAI’s GPT-4o and the more cost-efficient GPT-4o mini offer strong summarization capabilities with improved efficiency and multimodal support.
While these works show that LLMs can produce high-quality summaries and that both open and closed models are available at multiple scales, they typically treat summarization as an isolated task. The interaction between moderation decisions and summarization outputs is rarely made explicit, and most systems do not enforce a principled rule that only verified safe content may be summarized. Our framework addresses this gap by defining a summarization block that is strictly downstream of moderation blocks, and by designing it to be model-agnostic: LLaMA-3.2-3B is one concrete instantiation, but GPT-4o-mini, Mistral-7B-Instruct, or other long-context LLMs can be swapped in without changing the overall block interfaces.
3. Proposed Method
The proposed method is built upon the need for a unified and scientifically grounded solution capable of processing multimodal documents in a reliable, scalable, and safety-aware manner [12, 13]. After examining the structural limitations of existing systems where extraction, moderation, and summarization operate as isolated components [14], we introduce a coherent framework that organizes the entire workflow into three theoretical blocks. Each block is underpinned by a distinct family of algorithms, follows a clearly defined input and output specification, and performs transformations essential for the next stage. Taken together, these blocks establish a principled approach that ensures content safety while maintaining computational efficiency for large, real-world educational environments.
3.1. Extraction Block
The extraction block is grounded in document analysis theory, particularly optical character recognition and multimodal parsing [15, 16, 17]. At its core, the block implements a pipeline that converts raw user-uploaded files, primarily PDF and DOCX, into structured representations. The input to this block is a heterogeneous document whose internal structure may include embedded text, scanned pages, figures, tables, and images [18]. The output is a normalized set of textual segments and a corresponding set of images [19].
The handling process within this block consists of two theoretical stages. First, the system distinguishes between digitally encoded text and image-based text. Digital text can be read directly from the file’s structure, while image-based text requires a learned OCR transformation to map pixel representations into character sequences [20, 21]. This aligns with the longstanding view in document analysis that OCR serves as a bridge between human-readable formatting and machine-readable text. Second, the extraction block identifies and isolates all non-text modalities, such as photographs, diagrams, and illustrations, which are passed as independent units to the moderation block [22, 23]. Through these transformations, the extraction block ensures that every downstream component receives structured, machine-interpretable data, eliminating inconsistencies introduced by varied document formats.
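To make this block-level interface concrete, the following minimal sketch illustrates the two stages described above: digitally encoded text is read directly from the file structure, image-based pages are routed through OCR, and embedded images are isolated for the moderation block. The sketch assumes PyMuPDF and PaddleOCR as stand-ins for the document parser and the PP-OCRv3 engine; the function name extract_document and the exact result handling are illustrative rather than part of our production implementation.

```python
# Hedged sketch of the extraction block (assumes PyMuPDF and PaddleOCR are installed).
import fitz  # PyMuPDF
import numpy as np
from paddleocr import PaddleOCR

ocr_engine = PaddleOCR(lang="en")  # PP-OCRv3 detector/recognizer family

def extract_document(pdf_path: str):
    """Return (text_segments, images) as inputs for the moderation block."""
    text_segments, images = [], []
    doc = fitz.open(pdf_path)
    for page in doc:
        digital_text = page.get_text("text").strip()
        if digital_text:
            # Digitally encoded text: read directly from the file structure.
            text_segments.append(digital_text)
        else:
            # Image-based page: rasterize and apply the learned OCR transformation.
            pix = page.get_pixmap(dpi=200, alpha=False)
            img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
            result = ocr_engine.ocr(img)
            lines = [entry[1][0] for entry in (result[0] or [])] if result else []
            text_segments.append("\n".join(lines))
        # Isolate embedded images as independent units for image moderation.
        for img_info in page.get_images(full=True):
            xref = img_info[0]
            images.append(doc.extract_image(xref)["image"])  # raw image bytes
    return text_segments, images
```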
Its role in the framework is foundational because without reliable extraction, the moderation block cannot evaluate content integrity, and the summarization block cannot reason about textual coherence [24, 25]. Thus, this block establishes a clean, unified representation of the document and acts as the gateway through which all subsequent processing flows.
3.2. Moderation Block
The moderation block is the safety-critical component of the framework. It is theoretically grounded in two areas: natural language understanding for textual evaluation and statistical pattern recognition for visual assessment [26, 27]. The input to this block consists of the text segments and images produced by the extraction block, while the output is a single binary decision indicating whether the document is safe or unsafe.
Inside this block, the handling process is separated into two conceptual streams, reflecting the dual nature of multimodal risks. The textual stream evaluates linguistic content, applying semantic understanding to determine whether sentences contain harmful categories such as hate speech, violent expressions, or sexually explicit material [28, 29]. This is consistent with prevailing theories in NLP moderation, which conceptualize toxicity as a contextual property rather than a mere keyword-matching problem.
The visual stream performs the same safety assessment for images by analyzing content structure, texture, and visual patterns associated with sensitive or harmful material [30, 31]. Unlike text, visual signals often lack explicit boundaries, meaning the system must operate based on learned representations of unsafe imagery rather than predefined rules. Once both streams have produced safety indicators, the block applies a strict decision-aggregation rule: if either the text or the images are classified as unsafe, the entire document is rejected [32]. This early-exit logic follows modern content-safety principles, ensuring that no harmful material proceeds to later stages where it could be inadvertently summarized or indexed.
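The aggregation rule can be summarized in a short sketch. The classifier callables below are placeholders standing in for the text and image moderation models described above; only the strict early-exit behavior is shown.

```python
# Minimal sketch of the moderation block's early-exit aggregation rule.
# classify_text / classify_image are placeholders for the fitted text and
# image classifiers; each returns "safe" or "unsafe" for one item.
def moderate_document(text_segments, images, classify_text, classify_image) -> bool:
    """Return True if the whole document is safe, False otherwise."""
    for segment in text_segments:
        if classify_text(segment) == "unsafe":
            return False          # early exit: unsafe text rejects the document
    for image in images:
        if classify_image(image) == "unsafe":
            return False          # early exit: unsafe imagery rejects the document
    return True                   # only fully safe documents proceed downstream
```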
The moderation block’s role is therefore not only evaluative but also regulatory. It governs the flow of information through the entire framework, acting as a safety gate that upholds ethical and educational standards while preventing downstream propagation of inappropriate content [33].
3.3. Summarization Block
The summarization block is built upon theories of long-document understanding, hierarchical information compression, and abstractive text generation [34, 35, 36]. It receives as input the verified-safe textual content produced by the previous blocks. Its output consists of three summary levels, brief, medium, and detailed, each serving different user needs within educational settings [37].
The handling process begins by merging the text segments into a unified representation while preserving logical flow and semantic coherence. This merged text is then encoded into a latent representation that captures the document's key arguments, structure, and contextual dependencies [38, 39]. The system generates summaries at increasing levels of granularity: the shortest form conveys the central idea, the medium summary outlines major points, and the detailed version retains supporting explanations and contextual nuance [40].
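A minimal, hedged sketch of this multi-level generation is given below, using the Hugging Face transformers text-generation pipeline with an instruction-tuned model. The prompt wording, length budgets, and model identifier are illustrative assumptions rather than the exact production configuration.

```python
# Sketch of multi-level summary generation with an instruction-tuned LLM.
from transformers import pipeline

summarizer = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

LEVELS = {
    "brief":    "Summarize the document in 2-3 sentences capturing the central idea.",
    "medium":   "Summarize the document in one paragraph outlining the major points.",
    "detailed": "Write a detailed summary preserving supporting explanations and context.",
}

def summarize_levels(merged_text: str) -> dict:
    """Generate brief, medium, and detailed summaries of the merged document text."""
    summaries = {}
    for level, instruction in LEVELS.items():
        prompt = f"{instruction}\n\nDocument:\n{merged_text}\n\nSummary:"
        out = summarizer(prompt, max_new_tokens=512, do_sample=False)
        # The pipeline returns the prompt plus the generated continuation.
        summaries[level] = out[0]["generated_text"][len(prompt):].strip()
    return summaries
```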
This multilevel summarization approach is grounded in cognitive theories of information retrieval, which propose that different users require different depths of information to complete their tasks [41, 42]. In digital libraries, for example, students may only need a quick overview, while researchers may require more comprehensive summaries. By transforming a validated document into a structured set of summaries, the block enables efficient knowledge consumption and reduces the cognitive load associated with reading full-length materials [43, 44].
Within the larger framework, the summarization block is the final stage in the pipeline, converting safe and structured content into a usable form. Its output provides the educational value of the framework, ensuring that users benefit not only from content safety but also from improved accessibility and comprehension [45].
3.4. Integration of Blocks and Overall Framework Logic
Although each block is grounded in different theoretical foundations, they are tightly integrated through a well-defined flow of inputs and outputs. The extraction block converts unstructured multimodal documents into analyzable content. The moderation block enforces safety constraints and regulates whether a document may continue through the pipeline. Finally, the summarization block transforms approved content into structured knowledge artifacts that support diverse learning needs.
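Assuming the helper functions sketched in the previous subsections (extract_document, moderate_document, summarize_levels, and the hypothetical classify_text / classify_image callables), the overall block integration can be expressed as a short orchestration routine in which the moderation gate controls whether summarization is ever reached.

```python
# Sketch of the end-to-end block integration; all helper names are the
# illustrative functions introduced in the earlier sketches, not a fixed API.
def process_document(path: str, classify_text, classify_image):
    # Extraction block: unstructured file -> text segments + images.
    text_segments, images = extract_document(path)

    # Moderation block: safety gate regulating the rest of the pipeline.
    if not moderate_document(text_segments, images, classify_text, classify_image):
        return {"status": "rejected"}  # unsafe content never reaches summarization

    # Summarization block: approved content -> multi-level summaries.
    merged_text = "\n\n".join(text_segments)
    return {"status": "accepted", "summaries": summarize_levels(merged_text)}
```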
This integrated architecture addresses the shortcomings of existing systems by eliminating redundant preprocessing, enforcing safety throughout the pipeline, and enabling scalability through modular design. Each block contributes a distinct transformation, and together they establish a unified framework capable of handling large volumes of multimodal documents in modern digital education ecosystems.
Based on this integrated architectural model, in the next section we present evaluation experiments to verify the effectiveness of each block as well as the entire pipeline under real deployment conditions.
4. Experiments and Results
4.1. Training Process
4.1.1. Dataset
To evaluate our framework under realistic conditions, we constructed a multimodal dataset that reflects the diversity of documents typically found in digital library systems. The dataset includes a wide range of PDF and DOCX files containing both textual and visual content, along with labeled examples of sensitive material such as hate speech, violent expressions, and sexually explicit imagery for moderation testing. Images extracted from scanned pages, diagrams, and illustrations were added to represent the noisy and heterogeneous inputs commonly uploaded by users. For the summarization task, a subset of long-form educational documents was paired with multi-level reference summaries created through expert annotation. This structure allows each block of the framework to be assessed consistently and comprehensively. The dataset ultimately provides a realistic and balanced foundation for validating the effectiveness of the proposed method in modern digital education environments.
Regarding the content of the moderation dataset, for the text-based toxic content detection stage we utilized the Jigsaw Toxic Comment Classification dataset, which contains approximately 160,000 English comments collected from Wikipedia talk pages. Each comment in the dataset is annotated with multiple toxicity-related categories, such as toxic, obscene, threat, insult, identity hate, and severe toxic.
To align with our binary classification objective, we merged all toxicity-related categories into a single toxic label and assigned non-toxic to the remaining samples. This transformation produces a clean binary dataset suitable for fine-tuning the DeBERTa model on the task of toxic vs. non-toxic classification. All text samples were preprocessed through lowercasing and punctuation normalization, removal of URLs, emojis, and special symbols, and tokenization using the DeBERTa-v3-base tokenizer.
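A hedged sketch of this relabeling and preprocessing is shown below. It follows the public Jigsaw CSV column names and the Hugging Face tokenizer API; the specific regular expressions and the maximum sequence length are illustrative choices rather than the exact values used in our experiments.

```python
# Sketch of the binary relabeling and text preprocessing for the Jigsaw data.
import re
import pandas as pd
from transformers import AutoTokenizer

TOXIC_COLS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def preprocess(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)      # remove URLs
    text = re.sub(r"[^\w\s.,!?']", " ", text)      # strip emojis and special symbols
    return re.sub(r"\s+", " ", text).strip()       # normalize whitespace

df = pd.read_csv("train.csv")
df["label"] = (df[TOXIC_COLS].sum(axis=1) > 0).astype(int)   # merge categories -> binary toxic
df["comment_text"] = df["comment_text"].map(preprocess)

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encodings = tokenizer(list(df["comment_text"]), truncation=True, max_length=256, padding=True)
```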
The dataset was then divided into 80% training, 10% validation, and 10% testing subsets. The final data distribution is shown below:
Table 1
Content Moderation Dataset distribution
Split | Samples | Toxic (%) | Non-toxic (%)
Train | 128,000 | 30 | 70
Val | 16,000 | 30 | 70
Test | 16,000 | 30 | 70

Fig. 1. Content Moderation Dataset distribution
This binary version of the Jigsaw dataset enables robust fine-tuning and evaluation of transformer-based models for real-world toxic comment detection. Turning to the image dataset for the visual toxic content detection stage, we constructed a custom binary image dataset containing 20,000 images labeled as either toxic or non-toxic.
The dataset was aggregated from multiple publicly available sources, mainly from Kaggle and other open repositories. To ensure diversity and representativeness, we combined images from different domains, covering both safe and harmful visual contexts. Non-toxic images (12,000 samples) were collected from general-purpose datasets on Kaggle, such as COCO-based or natural/lifestyle image sets, representing normal and safe visual content. Toxic images (8,000 samples) included both graphic violence and sexually explicit content, sourced from various open datasets and web collections commonly used in prior works on harmful image detection. Within the toxic subset, approximately 34% of samples depict violent or gory scenes, while 66% correspond to sexual content. All images underwent manual review and re-labeling to ensure consistency and accuracy. Each image was resized to 224×224 pixels, normalized to the [0, 1] range, and augmented with random horizontal flips, rotations, and brightness adjustments to improve model robustness. The dataset was split into 70% training, 20% validation, and 10% testing, maintaining class balance across all subsets.
Table 2
Image Dataset distribution
Split | Samples | Toxic | Non-toxic
Train | 14,000 | 5,600 | 8,400
Val | 4,000 | 1,600 | 2,400
Test | 2,000 | 800 | 1,200

Fig. 2. Image Dataset distribution
This composite dataset was used to fine-tune the MobileNetV3-Large model for binary classification of visual toxicity.
By maintaining a balanced yet diverse structure, spanning both explicit (NSFW) and violent (gore/blood) imagery, the dataset provides a strong foundation for evaluating the model’s performance in realistic moderation contexts.
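As an illustration of the preprocessing and augmentation pipeline described above, the following sketch uses torchvision transforms; the exact augmentation magnitudes are assumptions rather than the precise values used in training.

```python
# Sketch of the image preprocessing/augmentation pipeline for MobileNetV3-Large fine-tuning.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((224, 224)),            # resize to 224x224 pixels
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    transforms.RandomRotation(degrees=15),    # random rotation
    transforms.ColorJitter(brightness=0.2),   # brightness adjustment
    transforms.ToTensor(),                    # scales pixel values into [0, 1]
])

eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```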
As for the document summarization dataset, the training corpus is constructed from Wikisource, Project Gutenberg, and modern custom educational materials, ensuring a balance of lexical diversity, formal syntax, and contemporary pedagogical relevance. Documents are stratified by length into short, medium, and long categories, allowing the model to progress from dense short texts to full-length books in a curriculum-style sequence that stabilizes learning and improves generalization under limited VRAM. The dataset spans a broad educational spectrum, including science, history, philosophy, and linguistics, promoting strong semantic coverage within the domain. In total, the corpus contains approximately 3,500 documents with an average length of 50,000 words, amounting to around 175 million tokens. This composition enhances early convergence, strengthens contextual reasoning, and improves long-range coherence, ultimately supporting more reliable educational summarization.
Finally, the overall justification for this dataset lies in its ability to represent every stage of the proposed framework in a realistic and coherent manner. By combining multimodal documents, sensitive-content annotations, noisy visual samples, and curated reference summaries, the dataset mirrors the full complexity of materials typically processed in digital library systems. Each component (text, images, and long-form documents) directly corresponds to the inputs required by the extraction, moderation, and summarization blocks, enabling the framework to be assessed holistically rather than through isolated experiments. This alignment ensures that the experimental results accurately reflect real-world deployment conditions and demonstrate the practical viability of the entire end-to-end system.
4.1.2. Block-Based Sampling and Comparison Strategy
To ensure a fair and comprehensive evaluation of the proposed framework, we adopt a block-based sampling and comparison strategy in which each processing block is paired with multiple candidate models. For the extraction stage, we compare OCR approaches such as PP-OCRv3, EasyOCR, and a traditional LSTM-based Tesseract system. The moderation block is assessed through separate candidates for text filtering, including DistilBERT, RoBERTa-base, and a more advanced contextual encoder, as well as candidates for image moderation such as MobileNetV2, EfficientNet-B0, and MobileNetV3-Large. For the summarization block, we evaluate different long-form generation methods, ranging from lightweight open models to more resource-intensive instruction-tuned architectures and an API-based baseline. Each combination of OCR, text moderation, image moderation, and summarization forms a complete pipeline configuration, expressed as a tuple (OCR, TextModel, ImageModel, Summarizer). All configurations are tested on the same dataset and executed under identical GPU conditions, ensuring consistent and meaningful comparisons across the full framework.
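The sampling strategy can be sketched as a simple enumeration of candidate combinations; the identifiers below are shorthand labels for the models listed above, not exact checkpoint names.

```python
# Sketch of enumerating candidate pipeline configurations for the block-based comparison.
from itertools import product

OCR_CANDIDATES   = ["Tesseract-LSTM", "EasyOCR", "PP-OCRv3"]
TEXT_CANDIDATES  = ["DistilBERT", "RoBERTa-base", "DeBERTa-V3-base"]
IMAGE_CANDIDATES = ["MobileNetV2", "EfficientNet-B0", "MobileNetV3-Large"]
SUMM_CANDIDATES  = ["Mistral-7B-Instruct", "GPT-4o-mini", "LLaMA-3.2-3B"]

configurations = [
    {"ocr": o, "text_model": t, "image_model": i, "summarizer": s}
    for o, t, i, s in product(OCR_CANDIDATES, TEXT_CANDIDATES, IMAGE_CANDIDATES, SUMM_CANDIDATES)
]
# Every configuration is evaluated on the same dataset under identical GPU conditions.
```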
4.1.3. Evaluation Dimensions
For every sampled model combination, we conduct a comprehensive evaluation that considers both effectiveness and deployability across the entire framework. Moderation performance is assessed through precision, recall, and F1-score, enabling us to measure how reliably each candidate identifies harmful content in both text and images. For summarization, we evaluate output quality using ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore, which together capture lexical overlap, structural coherence, and semantic fidelity. These metrics are used to evaluate both the theoretical accuracy of each model and its usability when applied in practice. Efficiency is examined under the constraints of an RTX 3050 GPU (4GB VRAM), focusing on memory consumption, inference latency, and system stability, including susceptibility to out-of-memory errors. We also examine practical considerations such as ease of fine-tuning, model size, quantization compatibility, and robustness when processing long documents. These evaluation dimensions ensure that the final selected configuration is not only strong in terms of academic performance but also realistic and reliable for real-world deployment.
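The sketch below illustrates how these metrics can be computed with standard open-source libraries (scikit-learn, rouge-score, and bert-score); variable names such as y_true, y_pred, and generated_summary are placeholders for the predictions and references produced by each configuration.

```python
# Sketch of the evaluation metrics used for moderation and summarization.
from sklearn.metrics import precision_recall_fscore_support
from rouge_score import rouge_scorer
from bert_score import score as bert_score

# Moderation: precision / recall / F1 on binary safe-vs-unsafe labels.
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")

# Summarization: ROUGE-1/2/L plus BERTScore against reference summaries.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference_summary, generated_summary)
_, _, bert_f1 = bert_score([generated_summary], [reference_summary], lang="en")
```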
4.1.4. Experiments on Candidate Models
To determine the most suitable configuration for the proposed framework, we independently evaluated candidate models for the extraction, moderation, and summarization blocks. All experiments were conducted on the same dataset and under identical hardware constraints using an RTX 3050 GPU with 4GB VRAM. The goal was not to highlight extreme performance from individual models but to identify the most balanced and reliable configuration. The following tables summarize the experimental results for each block, followed by brief conclusions.
For the extraction block, we evaluated three OCR candidates across three document subsets: clean documents, medium-noise documents, and highly noisy scanned materials. Character-level accuracy and per-page latency were measured.
Table 3
OCR Performance Across Three OCR methods
OCR Method | Clean Docs Accuracy (%) | Medium-noise Accuracy (%) | High-noise Accuracy (%) | Average Latency (s/page)
Tesseract (LSTM) | 90.1 | 78.4 | 63.5 | 1.92
EasyOCR | 93.2 | 84.7 | 71.3 | 0.88
PP-OCRv3 | 96.0 | 91.2 | 88.5 | 0.63
As Table 3 shows, PP-OCRv3 demonstrates consistently stronger performance across all document types and achieves lower latency, making it the most suitable option for real-world educational documents that often contain noisy scans or mixed formats.
The moderation experiments were divided into text and image evaluation. For text moderation, we measured precision, recall, and F1-score as indicators of harmful-content detection. For image moderation, the same metrics were applied to identify unsafe or sensitive images.
Table 4
Moderation Block – Text Moderation Performance
Candidate model | Precision (%) | Recall (%) | F1-score (%) | VRAM usage |
DistilBERT | 90.5 | 88.2 | 89.3 | ~650 MB |
RoBERTa-base | 92.8 | 91.7 | 92.2 | ~1.4 GB |
DeBERTa-V3-base | 94.1 | 94.5 | 94.2 | ~1.2 GB |
Table 5
Moderation Block - Image Moderation Performance
Image Model | Precision (%) | Recall (%) | F1-score (%) | Average Latency (ms)
MobileNetV2 | 88.6 | 87.9 | 88.2 | 3.2
EfficientNet-B0 | 90.2 | 91.1 | 90.6 | 7.6
MobileNetV3-Large | 92.4 | 93.0 | 92.6 | 4.1
Table 6
Summarization Block - Quality Comparison
Candidate Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | VRAM Fit (GB)
Mistral-7B-Instruct | 0.46 | 0.41 | 0.43 | 0.83 | 3.4
GPT-4o-mini (API) | 0.48 | 0.44 | 0.45 | 0.84 | 3.5
LLaMA-3.2-3B (4-bit + LoRA) | 0.51 | 0.47 | 0.47 | 0.85 | 2.4
As Table 6 shows, among the evaluated models, LLaMA-3.2-3B (4-bit + LoRA) achieves the best overall performance while remaining compatible with 4GB VRAM through lightweight optimization techniques. It consistently produces coherent multi-level summaries suitable for educational purposes.
Based on the results from the block-level experiments, we selected a final configuration that offers the strongest balance of accuracy, efficiency, and real-world deployability. In the extraction block, PP-OCRv3 consistently achieved the highest accuracy across clean, semi-noisy, and heavily degraded documents, making it the most reliable option for large-scale educational data. For the moderation block, the pairing of DeBERTa-V3-base for text moderation and MobileNetV3-Large for image moderation demonstrated the most stable and precise performance while remaining compatible with the 4GB VRAM limit of the RTX 3050. In the summarization block, the optimised LLaMA-3.2-3B (4-bit + LoRA) model provided the best ROUGE and BERTScore results among the candidates that could operate efficiently on limited hardware. Together, these components form the integrated pipeline used in our full system, and the evaluations reported below confirm that this configuration delivers reliable end-to-end performance across the entire framework.
4.1.5. Training the Final Chosen Pipeline
After selecting the optimal block configuration, we fine-tune only the moderation and summarization components, while the extraction block (PP-OCR) is reused directly without modification. The moderation block is trained using cross-entropy classification under mixed-precision (FP16) to reduce memory consumption, combined with early stopping and a learning-rate scheduler to ensure stable convergence within the 4GB VRAM limit. Training for the summarization block follows an autoregressive causal language modeling objective, where text is packed into fixed-length windows to minimize preprocessing overhead. To enable the LLaMA-3.2-3B model to run on consumer hardware, we apply 4-bit NF4 quantization for loading base weights, LoRA adapters for updating only low-rank attention parameters, and an 8-bit paged AdamW optimizer to reduce memory usage for optimizer states. Gradient accumulation and selective gradient checkpointing further reduce VRAM pressure, allowing the model to train efficiently despite hardware constraints. Together, these optimizations enable stable and reliable fine-tuning of the summarization system on RTX 3050.
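A hedged sketch of this memory-saving setup is given below, using the transformers, peft, and bitsandbytes APIs. The hyperparameter values and the model identifier are illustrative; only the combination of 4-bit NF4 loading, LoRA adapters, mixed precision, the 8-bit paged AdamW optimizer, gradient accumulation, and gradient checkpointing mirrors the configuration described above.

```python
# Sketch of the memory-constrained fine-tuning setup for the summarization block.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                   # 4-bit NF4 quantization of base weights
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb_config
)
model.gradient_checkpointing_enable()            # trade compute for lower VRAM usage

lora_config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"
)                                                # update only low-rank attention adapters
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="summarizer-lora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,              # simulate a larger batch within 4GB VRAM
    fp16=True,                                   # mixed-precision training
    optim="paged_adamw_8bit",                    # 8-bit paged AdamW optimizer states
    learning_rate=2e-4,
    num_train_epochs=1,
)
```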
4.2. Results
The overall training process follows a structured and methodical procedure designed to ensure both empirical rigor and hardware feasibility. We begin by randomly sampling candidate models for each block and conducting controlled evaluations to examine their strengths, weaknesses, and GPU efficiency under identical conditions. Based on these results, we select the most balanced combination of models and proceed to fine-tune only the moderation and summarization blocks within the strict memory limits of the RTX 3050. This workflow guarantees that the final pipeline is not only validated through systematic comparison but also optimized for practical deployment on resource-constrained hardware. As a result, the system achieves a level of performance that is both scientifically reproducible and scalable for real-world educational document moderation and summarization.
The figures below present the training results for the text moderation, image moderation, and summarization components. In each case, the training curves show steady improvement on the indicators used to measure performance, reflecting stable convergence of the models.

Fig. 3. Text moderation training result

Fig. 4. Image moderation training result

Fig. 5. Summarization training result
Table 7
Entire Framework Performance
Model | Task | Accuracy | ROUGE-L |
DeBERTa-V3 | Text moderation | 94.2% | - |
MobileNet-V3 | Image moderation | 92.6% | - |
Llama-3.2-3B | Summarization | - | 0.47 |
After integrating the complete framework, we conducted training on an NVIDIA RTX 3050 with 4GB VRAM, chosen because it reduces cost and simplifies the setup process.
5. Conclusion
This research introduces a unified and scalable AI framework designed to streamline the entire lifecycle of multimodal documents, from extraction and moderation to summarization, within expansive digital education ecosystems. By integrating advanced OCR techniques, a safety-aware content moderation engine, and multi-level summarization into a single, cohesive pipeline, the proposed system overcomes the long-standing limitations inherent in fragmented workflows. It demonstrates that reliable end-to-end processing is achievable even while respecting the constraints of typical hardware environments. The experimental findings underscore both the robustness of each individual component and the substantial practical value of uniting them into a seamless, deployment-ready solution capable of supporting the growing global demand for safe and accessible digital learning resources.
Our work has illuminated several key guiding principles. Foremost, the modular, block-based design proved indispensable, offering clarity, simplified maintenance, and essential adaptability across diverse real-world scenarios. Furthermore, the evaluation highlighted the critical necessity of judicious model selection, emphasizing a delicate balance between achieving high accuracy, maintaining computational efficiency, and respecting VRAM limitations to ensure effective operation on standard hardware. Finally, the results affirmed the profound importance of early and stringent moderation; without this dependable safety gate, subsequent processes, such as summarization, carry the risk of inadvertently disseminating harmful or inappropriate content.
Looking forward, this framework paves the way for several directions of future development. Subsequent efforts might extend the moderation component to multilingual contexts, facilitating broader deployment across varied countries and cultural settings. Similarly, the summarization block could be enhanced to offer user-personalized outputs that adapt to individual learning styles or reading preferences. Ultimately, integrating this framework into large-scale digital library infrastructures promises to unlock possibilities for intelligent search, intuitive content recommendation, and automated knowledge organization, contributing to the development of safe, efficient, and truly learner-centric digital education platforms in Vietnam and globally.
.png&w=384&q=75)
.png&w=640&q=75)