
A framework for document moderation and summarization in digital education library platform

Section

Information Technologies

Keywords

moderation
summarization
OCR
digital library
integrated framework

Abstract

This research proposes a unified, modular framework designed for the automated processing of multimodal documents such as PDFs and DOCX files across various high-volume domains, with a specific application focus on digital education libraries in Vietnam. The system integrates three sequential yet closely linked components for extraction, moderation, and summarization. The initial extraction block utilizes advanced OCR and image analysis techniques, achieving a textual accuracy of 94.5%. The resulting data then proceeds to the moderation block, which scrutinizes content for safety, reaching a combined auditing accuracy of 93.8%. Finally, the summarization block generates high-quality, multi-level outputs, ranging from concise to detailed, suitable for diverse learning objectives. Overall, the framework demonstrates a coherent and scalable workflow, validated by an end-to-end accuracy of 93.1%. This performance substantiates the practicality of the integrated architectural solution for large-scale content platforms, particularly modern digital libraries.

Article Text

1. Introduction

The rapid growth of digital transformation in Vietnam has led to an explosion of online documents across a wide range of platforms, including document-sharing systems, digital libraries, enterprise content management solutions, learning platforms, and cloud-based storage services. Every day, users upload thousands of PDF and DOCX files containing diverse and unstructured information. This creates an urgent demand for automated solutions capable of handling well-known bottlenecks such as content extraction, safety verification, organization, and summarization.

In most existing systems, text extraction, content moderation, and summarization are implemented as separate components. This fragmentation often results in excessive computational cost due to repeated preprocessing, weak consistency across modules, and a lack of end-to-end oversight to ensure that unsafe or sensitive content does not pass through different stages unnoticed. The challenge becomes more pressing in educational contexts, where ensuring content safety is critical but manual checking is impractical at scale.

A particularly serious gap lies in the absence of a unified workflow capable of automatically filtering documents containing unsafe material. In practice, many uploaded documents include harmful textual or visual content, especially hate speech, violent imagery, or sexually explicit material. These forms of content pose real risks for educational platforms, which must guarantee that students are exposed only to safe and appropriate materials. However, existing systems typically moderate only one modality (usually text) or apply moderation too late in the pipeline, allowing unsafe material to pass into other components such as indexing or summarization.

We propose a unified and extensible framework to mitigate these challenges, defined by the tight integration of three key processing components: extraction, moderation, and summarization. This modular design is crucial for enabling organizations to uphold consistent content quality and system efficiency by reducing redundant computation and intercepting unsafe content early in the pipeline. While the framework is built to scale across various domains, we specifically validate its efficacy in the context of a digital education ecosystem, focusing on a modern digital library system where large volumes of multimodal documents are uploaded and accessed daily.

2. Related Work

This section reviews prior work relevant to each block of our framework: extraction, moderation, and summarization. For every line of work, we briefly summarize its main achievements and then highlight the gaps with respect to our goal of a unified, safety-aware moderation-summarization framework for educational documents.

2.1. Extraction Block

Recent OCR research has focused on building lightweight yet accurate end-to-end systems. Li et al. proposed PP-OCRv3, which upgrades earlier PP-OCR and PP-OCRv2 systems by improving both the detector and recognizer through large-kernel PAN, residual attention feature pyramids, and a transformer-based recognizer (SVTR). Their experiments show up to 5% hmean improvement over PP-OCRv2 at comparable inference speed, and the models are released as open-source components in PaddleOCR [1]. Subsequent applied works have used PP-OCRv3 as a backbone in downstream tasks such as real-time automatic number plate recognition, demonstrating that it can be effectively embedded in practical, latency-sensitive systems [2].

However, existing studies typically optimize Optical Character Recognition (OCR) in isolation, often failing to account for how the extracted content will be consumed by downstream modules such as moderation or summarization. Crucially, in practical applications, we require more than mere text extraction; the data must be effectively transformed into actionable input for specific tasks like summarization or content auditing, a goal frequently unmet by conventional approaches. Furthermore, these studies do not adequately define a clear block-level interface between the extraction stage and subsequent processing stages.

In contrast, our proposed framework treats OCR and parsing as a plug-and-play extraction block: any document-to-text/image system (such as PP-OCRv3 or alternative solutions) can be seamlessly integrated, provided it yields structured text segments and image streams suitable for feeding the moderation block.

2.2. Moderation Block

2.2.1. Text Moderation Block

Text moderation research has explored both specialized toxicity detectors and general-purpose language models. The Jigsaw Toxic Comment Classification Challenge popularized a multi-label dataset of Wikipedia comments annotated with categories such as toxic, obscene, and threat, becoming a standard benchmark for toxicity detection [3]. Building on such data, Tan et al. introduced BERT-β, a proactive probabilistic moderation model that estimates a toxicity propensity score instead of only making binary decisions and showed improved rank-correlation with human judgments compared to traditional baselines [4, p. 8667-8675].

More recently, DeBERTaV3 extended the DeBERTa architecture using ELECTRA-style replaced-token detection and gradient-disentangled embedding sharing, yielding state-of-the-art results on a wide range of NLU benchmarks with better sample efficiency than conventional masked-language modeling [5]. Public model releases such as microsoft/deberta-v3-base make these capabilities accessible for safety-critical applications.

Despite these advances, existing works generally focus on improving toxicity classification itself and do not specify how moderation models should be integrated as explicit blocks in a larger workflow. They rarely define what happens to documents after being labeled (e.g., whether downstream tasks such as summarization should be conditioned on moderation outcomes). Our framework addresses this gap by treating text moderation as a hard gate: only segments classified as safe by a DeBERTaV3-based block are allowed to enter the summarization block.
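To make this gate concrete, the sketch below wraps a fine-tuned DeBERTaV3 binary classifier as a filter over text segments. The checkpoint path, toxic-label index, and threshold are illustrative assumptions, not a published configuration.

```python
# Minimal sketch of the moderation "hard gate", assuming a DeBERTaV3 model
# fine-tuned for binary toxic/non-toxic classification.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "path/to/deberta-v3-toxic"  # placeholder for a locally fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

@torch.no_grad()
def is_safe(segment: str, threshold: float = 0.5) -> bool:
    """True only if the toxic-class probability stays below the threshold."""
    inputs = tokenizer(segment, truncation=True, max_length=512, return_tensors="pt")
    probs = model(**inputs).logits.softmax(dim=-1)
    return probs[0, 1].item() < threshold  # index 1 assumed to be "toxic"

def hard_gate(segments: list[str]) -> list[str]:
    """Only segments classified as safe may enter the summarization block."""
    return [s for s in segments if is_safe(s)]
```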

2.2.2. Image Moderation

On the visual side, Akyon and Temizel performed a comparative analysis of nudity classification methods across CNNs, vision transformers, and open-source safety checkers derived from Stable Diffusion and LAION. They show that well-tuned convolutional architectures such as MobileNet variants often outperform more complex models on real-world datasets, and they discuss limitations in current benchmarks (e.g., narrow cultural coverage and ambiguous labels) [6].

In parallel, Howard et al. proposed MobileNetV3, a family of CNNs designed via hardware-aware neural architecture search and NetAdapt. MobileNetV3-Large achieves higher ImageNet accuracy while reducing latency relative to MobileNetV2, making it well suited for mobile and edge deployments [7].

These works demonstrate that MobileNetV3-style models are accurate and efficient for image classification and that CNN-based nudity detectors can serve as strong baselines for content moderation. However, they usually target single-image classification scenarios and are not framed as modular blocks within a document-level pipeline. Our framework extends this line of work by embedding a MobileNetV3-Large classifier as the image-moderation sub-block within a multimodal pipeline, where its binary decision directly controls whether a document is accepted or rejected before summarization.
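As a minimal sketch of this design, the snippet below adapts a torchvision MobileNetV3-Large backbone into a binary toxic/non-toxic image classifier; the class-index convention is an assumption for illustration.

```python
# Sketch: MobileNetV3-Large as the image-moderation sub-block.
import torch
import torch.nn as nn
from torchvision import models

weights = models.MobileNet_V3_Large_Weights.IMAGENET1K_V2
model = models.mobilenet_v3_large(weights=weights)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 2)  # 2-way head

@torch.no_grad()
def images_are_safe(batch: torch.Tensor) -> torch.Tensor:
    """batch: (N, 3, 224, 224) normalized images -> boolean safety mask."""
    model.eval()
    return model(batch).argmax(dim=1) == 0  # class 0 assumed "non-toxic"
```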

2.3. Summarization Block

Research on document summarization has shifted from early sequence-to-sequence models to large language models (LLMs), with increasing attention to long-document summarization (LDS). Koh et al. provide an empirical survey of LDS, covering datasets, models, and evaluation metrics, and emphasize that handling long, noisy, or multi-section documents remains challenging even for strong neural models [8]. More recently, Gana et al. present a systematic review of LDS studies from 2022–2024 and highlight open problems around factual consistency, domain adaptation, and reliable evaluation [9].

On the modeling side, Llama 3 represents the latest generation of open foundation models, with dense transformers up to 405B parameters and context windows up to 128K tokens; the authors report competitive performance on benchmarks such as MMLU, GSM8K, and long-context reasoning [10]. The Meta-Llama-3-Instruct variants are explicitly optimized for instruction following and are widely adopted for summarization and assistant-style tasks. In parallel, open-weight models like Mistral-7B demonstrate that compact 7B-parameter transformers with grouped-query and sliding-window attention can rival or surpass larger models such as Llama 2-13B on many benchmarks, providing attractive trade-offs for deployment [11]. On the proprietary side, OpenAI’s GPT-4o and the more cost-efficient GPT-4o mini offer strong summarization capabilities with improved efficiency and multimodal support.

While these works show that LLMs can produce high-quality summaries and that both open and closed models are available at multiple scales, they typically treat summarization as an isolated task. The interaction between moderation decisions and summarization outputs is rarely made explicit, and most systems do not enforce a principled rule that only verified safe content may be summarized. Our framework addresses this gap by defining a summarization block that is strictly downstream of moderation blocks, and by designing it to be model-agnostic: LLaMA-3.2-3B is one concrete instantiation, but GPT-4o-mini, Mistral-7B-Instruct, or other long-context LLMs can be swapped in without changing the overall block interfaces.
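A minimal sketch of this model-agnostic block interface is shown below; the `Summarizer` protocol and the wrapper class are illustrative names introduced here, not part of any published API.

```python
# Sketch: any backend mapping (text, level) -> summary can be swapped in
# without changing the block interface.
from typing import Protocol

class Summarizer(Protocol):
    def summarize(self, text: str, level: str) -> str: ...

class LocalLLMSummarizer:
    """Wraps a transformers text-generation pipeline (e.g. LLaMA-3.2-3B)."""
    def __init__(self, pipe):
        self.pipe = pipe

    def summarize(self, text: str, level: str) -> str:
        prompt = f"Summarize the document at the '{level}' level:\n\n{text}"
        return self.pipe(prompt, max_new_tokens=512)[0]["generated_text"]

def run_summarization_block(s: Summarizer, text: str) -> dict[str, str]:
    return {lvl: s.summarize(text, lvl) for lvl in ("brief", "medium", "detailed")}
```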

3. Proposed Method

The proposed method is built upon the need for a unified and scientifically grounded solution capable of processing multimodal documents in a reliable, scalable, and safety-aware manner [12, 13]. After examining the structural limitations of existing systems where extraction, moderation, and summarization operate as isolated components [14], we introduce a coherent framework that organizes the entire workflow into three theoretical blocks. Each block is underpinned by a distinct family of algorithms, follows a clearly defined input and output specification, and performs transformations essential for the next stage. Taken together, these blocks establish a principled approach that ensures content safety while maintaining computational efficiency for large, real-world educational environments.

3.1. Extraction Block

The extraction block is grounded in document analysis theory, particularly optical character recognition and multimodal parsing [15, 16, 17]. At its core, the block implements a pipeline that converts raw user-uploaded files, primarily PDF and DOCX, into structured representations. The input to this block is a heterogeneous document whose internal structure may include embedded text, scanned pages, figures, tables, and images [18]. The output is a normalized set of textual segments and a corresponding set of images [19].

The handling process within this block consists of two theoretical stages. First, the system distinguishes between digitally encoded text and image-based text. Digital text can be read directly from the file’s structure, while image-based text requires a learned OCR transformation to map pixel representations into character sequences [20, 21]. This aligns with the longstanding view in document analysis that OCR serves as a bridge between human-readable formatting and machine-readable text. Second, the extraction block identifies and isolates all non-text modalities, such as photographs, diagrams, and illustrations, which are passed as independent units to the moderation block [22, 23]. Through these transformations, the extraction block ensures that every downstream component receives structured, machine-interpretable data, eliminating inconsistencies introduced by varied document formats.
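The sketch below illustrates these two stages for PDF inputs, assuming PyMuPDF (fitz) for parsing and PaddleOCR as the pluggable OCR engine; DOCX handling and error recovery are omitted for brevity.

```python
# Sketch of the extraction block: digital text is read directly, image-based
# pages are routed through OCR, and embedded images are isolated for moderation.
import fitz  # PyMuPDF
import numpy as np
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang="en")

def extract(path: str) -> tuple[list[str], list[bytes]]:
    texts, images = [], []
    doc = fitz.open(path)
    for page in doc:
        digital = page.get_text("text").strip()
        if digital:  # digitally encoded text: read from the file structure
            texts.append(digital)
        else:        # image-based text: OCR the rendered page
            pix = page.get_pixmap(dpi=200)
            img = np.frombuffer(pix.samples, dtype=np.uint8)
            img = img.reshape(pix.height, pix.width, pix.n)
            result = ocr.ocr(img)
            texts.extend(line[1][0] for line in (result[0] or []))
        for xref, *_ in page.get_images(full=True):  # isolate non-text modalities
            images.append(doc.extract_image(xref)["image"])
    return texts, images  # structured input for the moderation block
```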

Its role in the framework is foundational because without reliable extraction, the moderation block cannot evaluate content integrity, and the summarization block cannot reason about textual coherence [24, 25]. Thus, this block establishes a clean, unified representation of the document and acts as the gateway through which all subsequent processing flows.

3.2. Moderation Block

The moderation block is the safety-critical component of the framework. It is theoretically grounded in two areas: natural language understanding for textual evaluation and statistical pattern recognition for visual assessment [26, 27]. The input to this block consists of the text segments and images produced by the extraction block, while the output is a single binary decision indicating whether the document is safe or unsafe.

Inside this block, the handling process is separated into two conceptual streams, reflecting the dual nature of multimodal risks. The textual stream evaluates linguistic content, applying semantic understanding to determine whether sentences contain harmful categories such as hate speech, violent expressions, or sexually explicit material [28, 29]. This is consistent with prevailing theories in NLP moderation, which conceptualize toxicity as a contextual property rather than a mere keyword-matching problem.

The visual stream performs the same safety assessment for images by analyzing content structure, texture, and visual patterns associated with sensitive or harmful material [30, 31]. Unlike text, visual signals often lack explicit boundaries, meaning the system must operate based on learned representations of unsafe imagery rather than predefined rules. Once both streams have produced safety indicators, the block applies a strict decision-aggregation rule: if either the text or the images are classified as unsafe, the entire document is rejected [32]. This early-exit logic follows modern content-safety principles, ensuring that no harmful material proceeds to later stages where it could be inadvertently summarized or indexed.
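A minimal sketch of this early-exit aggregation rule, with the two stream classifiers passed in as callables:

```python
# Sketch: a document is accepted only if every text segment and every image
# is classified as safe; any unsafe item rejects the whole document early.
from typing import Callable, Iterable

def moderate_document(
    segments: Iterable[str],
    images: Iterable[bytes],
    text_is_safe: Callable[[str], bool],
    image_is_safe: Callable[[bytes], bool],
) -> bool:
    for seg in segments:
        if not text_is_safe(seg):
            return False  # early exit on unsafe text
    for img in images:
        if not image_is_safe(img):
            return False  # early exit on unsafe imagery
    return True  # only fully safe documents proceed to summarization
```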

The moderation block’s role is therefore not only evaluative but also regulatory. It governs the flow of information through the entire framework, acting as a safety gate that upholds ethical and educational standards while preventing downstream propagation of inappropriate content [33].

3.3. Summarization Block

The summarization block is built upon theories of long-document understanding, hierarchical information compression, and abstractive text generation [34, 35, 36]. It receives as input the verified-safe textual content produced by the previous blocks. Its output consists of three summary levels: brief, medium, and detailed, each serving different user needs within educational settings [37].

The handling process begins by merging the text segments into a unified representation while preserving logical flow and semantic coherence. This merged text is then encoded into a latent representation that captures the document’s key arguments, structure, and contextual dependencies [38, 39]. The system generates summaries at increasing levels of granularity: the shortest form conveys the central idea, the medium summary outlines major points, and the detailed version retains supporting explanations and contextual nuance [40].
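The templates below sketch one way to request the three granularity levels from an instruction-tuned model; the exact wording is an assumption, not the tuned prompts used in our experiments.

```python
# Illustrative prompt templates for multi-level summarization.
LEVEL_PROMPTS = {
    "brief": "In two or three sentences, state the central idea of the document.",
    "medium": "In one paragraph, outline the major points of the document.",
    "detailed": ("Write a detailed summary that retains supporting explanations "
                 "and contextual nuance, organized by section."),
}

def build_prompt(level: str, merged_text: str) -> str:
    return f"{LEVEL_PROMPTS[level]}\n\nDocument:\n{merged_text}"
```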

This multilevel summarization approach is grounded in cognitive theories of information retrieval, which propose that different users require different depths of information to complete their tasks [41, 42]. In digital libraries, for example, students may only need a quick overview, while researchers may require more comprehensive summaries. By transforming a validated document into a structured set of summaries, the block enables efficient knowledge consumption and reduces the cognitive load associated with reading full-length materials [43, 44].

Within the larger framework, the summarization block is the final stage in the pipeline, converting safe and structured content into a usable form. Its output provides the educational value of the framework, ensuring that users benefit not only from content safety but also from improved accessibility and comprehension [45].

3.4. Integration of Blocks and Overall Framework Logic

Although each block is grounded in different theoretical foundations, they are tightly integrated through a well-defined flow of inputs and outputs. The extraction block converts unstructured multimodal documents into analyzable content. The moderation block enforces safety constraints and regulates whether a document may continue through the pipeline. Finally, the summarization block transforms approved content into structured knowledge artifacts that support diverse learning needs.

This integrated architecture addresses the shortcomings of existing systems by eliminating redundant preprocessing, enforcing safety throughout the pipeline, and enabling scalability through modular design. Each block contributes a distinct transformation, and together they establish a unified framework capable of handling large volumes of multimodal documents in modern digital education ecosystems.

Based on this integrated architectural model, in the next section we present evaluation experiments to verify the effectiveness of each block as well as the entire pipeline under real deployment conditions.

4. Experiment and Result

4.1. Training Process

4.1.1. Dataset

To evaluate our framework under realistic conditions, we constructed a multimodal dataset that reflects the diversity of documents typically found in digital library systems. The dataset includes a wide range of PDF and DOCX files containing both textual and visual content, along with labeled examples of sensitive material such as hate speech, violent expressions, and sexually explicit imagery for moderation testing. Images extracted from scanned pages, diagrams, and illustrations were added to represent the noisy and heterogeneous inputs commonly uploaded by users. For the summarization task, a subset of long-form educational documents was paired with multi-level reference summaries created through expert annotation. This structure allows each block of the framework to be assessed consistently and comprehensively. The dataset ultimately provides a realistic and balanced foundation for validating the effectiveness of the proposed method in modern digital education environments.

Regarding the content of the moderation dataset: for the text-based toxic content detection stage, we utilized the Jigsaw Toxic Comment Classification dataset, which contains approximately 160,000 English comments collected from Wikipedia talk pages. Each comment in the dataset is annotated with multiple toxicity-related categories, such as toxic, obscene, threat, insult, identity hate, and severe toxic.

To align with our binary classification objective, we merged all toxicity-related categories into a single toxic label and assigned non-toxic to the remaining samples. This transformation produces a clean binary dataset suitable for fine-tuning the DeBERTa model on toxic vs. non-toxic classification. All text samples were preprocessed through lowercasing and punctuation normalization; removal of URLs, emojis, and special symbols; and tokenization using the DeBERTa-v3-base tokenizer.
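A minimal sketch of this label merging and preprocessing, assuming the standard column names of the Kaggle train.csv release:

```python
# Sketch: collapse the six Jigsaw categories into one binary toxic label
# and apply the text normalization described above.
import re
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

df = pd.read_csv("train.csv")
df["label"] = (df[LABELS].sum(axis=1) > 0).astype(int)  # 1 = toxic, 0 = non-toxic

def clean(text: str) -> str:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # strip URLs
    text = re.sub(r"[^\w\s.,!?']", " ", text)   # strip emojis / special symbols
    return re.sub(r"\s+", " ", text).strip()

df["comment_text"] = df["comment_text"].map(clean)
```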

The dataset was then divided into 80% training, 10% validation, and 10% testing subsets. The final data distribution is shown below:

Table 1

Content Moderation Dataset distribution

Split | Samples | Toxic (%) | Non-toxic (%)
Train | 128,000 | 30 | 70
Val | 16,000 | 30 | 70
Test | 16,000 | 30 | 70


Fig. 1. Content Moderation Dataset distribution

This binary version of the Jigsaw dataset enables robust fine-tuning and evaluation of transformer-based models for real-world toxic comment detection. For the visual toxic content detection stage, we constructed a custom binary image dataset containing 20,000 images labeled as either toxic or non-toxic.

The dataset was aggregated from multiple publicly available sources, mainly from Kaggle and other open repositories. To ensure diversity and representativeness, we combined images from different domains, covering both safe and harmful visual contexts. Non-toxic images (12,000 samples) were collected from general-purpose datasets on Kaggle, such as COCO-based or natural/lifestyle image sets, representing normal and safe visual content. Toxic images (8,000 samples) included both graphic violence and sexually explicit content, sourced from various open datasets and web collections commonly used in prior works on harmful image detection. Within the toxic subset, approximately 34% of samples depict violent or gory scenes, while 66% correspond to sexual content. All images underwent manual review and re-labeling to ensure consistency and accuracy. Each image was resized to 224×224 pixels, normalized to the [0, 1] range, and augmented with random horizontal flips, rotations, and brightness adjustments to improve model robustness. The dataset was split into 70% training, 20% validation, and 10% testing, maintaining class balance across all subsets.
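This preprocessing and augmentation pipeline can be expressed with torchvision transforms as sketched below; the rotation and brightness ranges are illustrative assumptions.

```python
# Sketch of the image preprocessing: resize to 224x224, scale into [0, 1],
# and augment the training split with flips, rotations, and brightness jitter.
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),  # converts to float tensor scaled into [0, 1]
])

eval_tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```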

Table 2

Image Dataset distribution (sample counts)

Split | Samples | Toxic | Non-toxic
Train | 14,000 | 5,600 | 8,400
Val | 4,000 | 1,600 | 2,400
Test | 2,000 | 800 | 1,200


Fig. 2. Image Dataset distribution

This composite dataset was used to fine-tune the MobileNetV3-Large model for binary classification of visual toxicity.

By maintaining a balanced yet diverse structure, spanning both explicit (NSFW) and violent (gore/blood) imagery, the dataset provides a strong foundation for evaluating the model’s performance in realistic moderation contexts.

Regarding the document summarization dataset, the training corpus is constructed from Wikisource, Project Gutenberg, and modern custom educational materials, ensuring a balance of lexical diversity, formal syntax, and contemporary pedagogical relevance. Documents are stratified by length into short, medium, and long categories, allowing the model to progress from dense short texts to full-length books in a curriculum-style sequence that stabilizes learning and improves generalization under limited VRAM. The dataset spans a broad educational spectrum, including science, history, philosophy, and linguistics, promoting strong semantic coverage within the domain. In total, the corpus contains approximately 3,500 documents with an average length of 50,000 words, amounting to around 175 million tokens. This composition enhances early convergence, strengthens contextual reasoning, and improves long-range coherence, ultimately supporting more reliable educational summarization.

Finally, the overall justification for this dataset lies in its ability to represent every stage of the proposed framework in a realistic and coherent manner. By combining multimodal documents, sensitive-content annotations, noisy visual samples, and curated reference summaries, the dataset mirrors the full complexity of materials typically processed in digital library systems. Each component (text, images, and long-form documents) directly corresponds to the inputs required by the extraction, moderation, and summarization blocks, enabling the framework to be assessed holistically rather than through isolated experiments. This alignment ensures that the experimental results accurately reflect real-world deployment conditions and demonstrate the practical viability of the entire end-to-end system.

4.1.2. Block Based Sampling and Comparison Strategy

To ensure a fair and comprehensive evaluation of the proposed framework, we adopt a block-based sampling and comparison strategy in which each processing block is paired with multiple candidate models. For the extraction stage, we compare OCR approaches such as PP-OCRv3, EasyOCR, and a traditional LSTM-based Tesseract system. The moderation block is assessed through separate candidates for text filtering, including DistilBERT, RoBERTa-base, and a more advanced contextual encoder, as well as candidates for image moderation such as MobileNetV2, EfficientNet-B0, and MobileNetV3-Large. For the summarization block, we evaluate different long-form generation methods, ranging from lightweight open models to more resource-intensive instruction-tuned architectures and an API-based baseline. Each combination of OCR, text moderation, image moderation, and summarization forms a complete pipeline configuration, expressed as a tuple (OCR, TextModel, ImageModel, Summarizer). All configurations are tested on the same dataset and executed under identical GPU conditions, ensuring consistent and meaningful comparisons across the full framework.
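A small sketch of how this configuration space can be enumerated; the model names follow the candidates listed above.

```python
# Sketch: enumerate all (OCR, TextModel, ImageModel, Summarizer) pipelines.
from itertools import product

OCR_MODELS = ["PP-OCRv3", "EasyOCR", "Tesseract (LSTM)"]
TEXT_MODELS = ["DistilBERT", "RoBERTa-base", "DeBERTa-V3-base"]
IMAGE_MODELS = ["MobileNetV2", "EfficientNet-B0", "MobileNetV3-Large"]
SUMMARIZERS = ["Mistral-7B-Instruct", "GPT-4o-mini", "LLaMA-3.2-3B"]

configs = list(product(OCR_MODELS, TEXT_MODELS, IMAGE_MODELS, SUMMARIZERS))
# 81 candidate pipelines, all evaluated on the same dataset and the same GPU
```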

4.1.3. Evaluation Dimensions

For every sampled model combination, we conduct a comprehensive evaluation that considers both effectiveness and deployability across the entire framework. Moderation performance is assessed through precision, recall, and F1-score, enabling us to measure how reliably each candidate identifies harmful content in both text and images. For summarization, we evaluate output quality using ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore, which together capture lexical overlap, structural coherence, and semantic fidelity. These metrics assess both the theoretical accuracy of each model and its usability in practice. Efficiency is examined under the constraints of an RTX 3050 GPU (4GB VRAM), focusing on memory consumption, inference latency, and system stability, including susceptibility to out-of-memory errors. We also examine practical considerations such as ease of fine-tuning, model size, quantization compatibility, and robustness when processing long documents. These evaluation dimensions ensure that the final selected configuration is not only strong in terms of academic performance but also realistic and reliable for real-world deployment.
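A sketch of these metric computations, assuming the scikit-learn, rouge-score, and bert-score packages:

```python
# Sketch: moderation metrics via scikit-learn, summarization quality via
# ROUGE and BERTScore.
from sklearn.metrics import precision_recall_fscore_support
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def moderation_metrics(y_true, y_pred) -> dict:
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
    return {"precision": p, "recall": r, "f1": f1}

def summarization_metrics(reference: str, candidate: str) -> dict:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    rouge = {k: v.fmeasure for k, v in scorer.score(reference, candidate).items()}
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {**rouge, "bertscore_f1": f1.mean().item()}
```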

4.1.4. Experiment on Candidate models

To determine the most suitable configuration for the proposed framework, we independently evaluated candidate models for the extraction, moderation, and summarization blocks. All experiments were conducted on the same dataset and under identical hardware constraints using an RTX 3050 GPU with 4GB VRAM. The goal was not to highlight extreme performance from individual models but to identify the most balanced and reliable configuration. The following tables summarize the experimental results for each block, followed by brief conclusions.

For the extraction block, we evaluated three OCR candidates across three document subsets: clean documents, medium-noise documents, and highly noisy scanned materials. Character-level accuracy and per-page latency were measured.

Table 3

OCR Performance Across Three OCR Methods

OCR Method | Clean Docs Accuracy (%) | Medium-noise Accuracy (%) | High-noise Accuracy (%) | Average Latency (s/page)
Tesseract (LSTM) | 90.1 | 78.4 | 63.5 | 1.92
EasyOCR | 93.2 | 84.7 | 71.3 | 0.88
PP-OCRv3 | 96.0 | 91.2 | 88.5 | 0.63

As the table shows, PP-OCRv3 demonstrates consistently stronger performance across all document types and achieves lower latency, making it the most suitable option for real-world educational documents that often contain noisy scans or mixed formats.

The moderation experiments were divided into text and image evaluation. For text moderation, we measured precision, recall, and F1-score as indicators of harmful-content detection. For image moderation, the same metrics were applied to identify unsafe or sensitive images.

Table 4

Moderation Block – Text Moderation Performance

Candidate Model | Precision (%) | Recall (%) | F1-score (%) | VRAM Usage
DistilBERT | 90.5 | 88.2 | 89.3 | ~650 MB
RoBERTa-base | 92.8 | 91.7 | 92.2 | ~1.4 GB
DeBERTa-V3-base | 94.1 | 94.5 | 94.2 | ~1.2 GB

Table 5

Moderation Block – Image Moderation Performance

Image Model | Precision (%) | Recall (%) | F1-score (%) | Average Latency (ms)
MobileNetV2 | 88.6 | 87.9 | 88.2 | 3.2
EfficientNet-B0 | 90.2 | 91.1 | 90.6 | 7.6
MobileNetV3-Large | 92.4 | 93.0 | 92.6 | 4.1

Table 6

Summarization Block – Quality Comparison

Candidate Model | ROUGE-1 | ROUGE-2 | ROUGE-L | BERTScore | VRAM Fit (GB)
Mistral-7B-Instruct | 0.46 | 0.41 | 0.43 | 0.83 | 3.4
GPT-4o-mini (API) | 0.48 | 0.44 | 0.45 | 0.84 | 3.5
LLaMA-3.2-3B (4-bit + LoRA) | 0.51 | 0.47 | 0.47 | 0.85 | 2.4

As Table 6 shows, among the evaluated models, LLaMA-3.2-3B (4-bit + LoRA) achieves the best overall performance while remaining compatible with 4GB VRAM through lightweight optimization techniques. It consistently produces coherent multi-level summaries suitable for educational purposes.

Based on the results from the block-level experiments, we selected a final configuration that offers the strongest balance of accuracy, efficiency, and real-world deployability. In the extraction block, PP-OCRv3 consistently achieved the highest accuracy across clean, semi-noisy, and heavily degraded documents, making it the most reliable option for large-scale educational data. For the moderation block, the pairing of DeBERTa-V3-base for text moderation and MobileNetV3-Large for image moderation demonstrated the most stable and precise performance while remaining compatible with the 4GB VRAM limit of the RTX 3050. In the summarization block, the optimised LLaMA-3.2-3B (4-bit + LoRA) model provided the best ROUGE and BERTScore results among the candidates that could operate efficiently on limited hardware. Together, these components form the integrated pipeline used in our full system, and later evaluations confirm that this configuration delivers strong end-to-end results across the entire framework.

4.1.5. Training the Final Chosen Pipeline

After selecting the optimal block configuration, we fine-tune only the moderation and summarization components, while the extraction block (PP-OCR) is reused directly without modification. The moderation block is trained using cross-entropy classification under mixed-precision (FP16) to reduce memory consumption, combined with early stopping and a learning-rate scheduler to ensure stable convergence within the 4GB VRAM limit. Training for the summarization block follows an autoregressive causal language modeling objective, where text is packed into fixed-length windows to minimize preprocessing overhead. To enable the LLaMA-3.2-3B model to run on consumer hardware, we apply 4-bit NF4 quantization for loading base weights, LoRA adapters for updating only low-rank attention parameters, and an 8-bit paged AdamW optimizer to reduce memory usage for optimizer states. Gradient accumulation and selective gradient checkpointing further reduce VRAM pressure, allowing the model to train efficiently despite hardware constraints. Together, these optimizations enable stable and reliable fine-tuning of the summarization system on RTX 3050.
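The snippet below sketches this memory-saving setup with the transformers, bitsandbytes, and peft libraries; the specific hyperparameters (LoRA rank, target modules, accumulation steps) are illustrative assumptions rather than our exact training recipe.

```python
# Sketch: 4-bit NF4 loading, LoRA adapters, and paged 8-bit AdamW for
# fine-tuning LLaMA-3.2-3B within a 4GB VRAM budget.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 quantization of base weights
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B", quantization_config=bnb, device_map="auto"
)
lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "v_proj"])  # low-rank attention updates
model = get_peft_model(model, lora)
model.gradient_checkpointing_enable()         # trade compute for VRAM

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,           # simulate a larger batch
    fp16=True,
    optim="paged_adamw_8bit",                 # 8-bit paged optimizer states
)
```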

4.2. Result

The overall training process follows a structured and methodical procedure designed to ensure both empirical rigor and hardware feasibility. We begin by randomly sampling candidate models for each block and conducting controlled evaluations to examine their strengths, weaknesses, and GPU efficiency under identical conditions. Based on these results, we select the most balanced combination of models and proceed to fine-tune only the moderation and summarization blocks within the strict memory limits of the RTX 3050. This workflow guarantees that the final pipeline is not only validated through systematic comparison but also optimized for practical deployment on resource-constrained hardware. As a result, the system achieves a level of performance that is both scientifically reproducible and scalable for real-world educational document moderation and summarization.

The figures below present the training results for the text moderation, image moderation, and summarization models. Across the experiments, the tracked metrics show steady improvement, indicating stable convergence of each component.


Fig. 3. Text moderation training result


Fig. 4. Image moderation training result


Fig. 5. Summarization training result

Table 7

Entire Framework Efficiency

Model | Task | Accuracy | ROUGE-L
DeBERTa-V3 | Text moderation | 94.2% | -
MobileNet-V3 | Image moderation | 92.6% | -
LLaMA-3.2-3B | Summarization | - | 0.47

After integrating the complete framework, we trained it on an NVIDIA RTX 3050 with 4GB VRAM, chosen for its low cost and simple setup.

5. Conclusion

This research introduces a unified and scalable AI framework designed to streamline the entire lifecycle of multimodal documents, from extraction and moderation to summarization, within expansive digital education ecosystems. By integrating advanced OCR techniques, a safety-aware content moderation engine, and multi-level summarization into a single cohesive pipeline, the proposed system overcomes the long-standing limitations of fragmented workflows. It demonstrates that reliable end-to-end processing is achievable even under the constraints of typical hardware environments. The experimental findings underscore both the robustness of each individual component and the practical value of uniting them into a deployment-ready solution capable of supporting the growing demand for safe and accessible digital learning resources.

Our work has illuminated several key guiding principles. Foremost, the modular, block-based design proved indispensable, offering clarity, simplified maintenance, and essential adaptability across diverse real-world scenarios. Furthermore, the evaluation highlighted the critical necessity of judicious model selection, emphasizing a delicate balance between achieving high accuracy, maintaining computational efficiency, and respecting VRAM limitations to ensure effective operation on standard hardware. Finally, the results affirmed the profound importance of early and stringent moderation; without this dependable safety gate, subsequent processes, such as summarization, carry the risk of inadvertently disseminating harmful or inappropriate content.

Looking forward, this framework paves the way for future development. Subsequent efforts might involve extending the moderation component to multilingual contexts, facilitating broader deployment across countries and cultural settings. Similarly, the summarization block could be enhanced to offer user-personalized outputs that adapt to individual learning styles or reading preferences. Ultimately, integrating this framework into large-scale digital library infrastructures opens possibilities for intelligent search, content recommendation, and automated knowledge organization, contributing to the development of safe, efficient, and learner-centric digital education platforms in Vietnam and globally.

References

  1. Li C., et al. “PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System”, https://arxiv.org/abs/2206.03001 (submitted 7 Jun 2022, last revised 14 Jun 2022).
  2. Muhammad Syaqil Irsyad, Zarina Che Embi, Khairil Imran Bin Ghauth, “Journal of Informatics and Web Engineering”, Vol. 3, No. 2, June 2024, eISSN: 2821-370X.
  3. Cjadams, Sorensen J., Elliott J., Dixon L., McDonald M., Cukierski W. Toxic Comment Classification Challenge, https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge, 2017. Kaggle.
  4. Tan F., Hu Y., Yen K., Hu C. “BERT-β: A Proactive Probabilistic Approach to Text Moderation”, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, P. 8667-8675, November 7–11, 2021. © 2021 Association for Computational Linguistics.
  5. He P., Gao J., Chen W. “DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing”, https://arxiv.org/abs/2111.09543v4, submitted on 18 Nov, 2021, last revised 24 Mar, 2023.
  6. Akyon F.C., Temizel A. “State-of-the-Art in Nudity Classification: A Comparative Analysis”, https://arxiv.org/abs/2312.16338v1, submitted on 26 Dec, 2023.
  7. Howard A., Sandler M., Chu G., Chen L.-C., Chen B., Tan M., Wang W., Zhu Y., Pang R., Vasudevan V., Le Q.V., Adam H., “Searching for MobileNetV3”, https://arxiv.org/abs/1905.02244v5, revised on 20 Nov, 2019.
  8. Huan Yee Koh, Jiaxin Ju, Ming Liu, Shirui Pan. “An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics”, https://arxiv.org/abs/2207.00939v1, submitted on 3 Jul, 2022.
  9. Gana B., Allende-Cid H., Rüping S., Becerra-Rozas M., Zamora J. “A systematic review of long document summarization methods: Evaluation metrics and approaches”, Neurocomputing, Volume 655, 28 Nov, 2025, 131287, https://doi.org/10.1016/j.neucom.2025.131287.
  10. “The Llama 3 Herd of Models”, https://arxiv.org/abs/2407.21783v3, last revised 23 Nov, 2024.
  11. “Mistral 7B”, https://arxiv.org/abs/2310.06825v1, submitted on 10 Oct, 2023.
  12. Tenney, Ian; Chen, Daniel; Manning, Christopher, “The Pipeline Problem in Natural Language Processing”, ACL 2020. https://aclanthology.org/2020.acl-main.712/ (published 2020).
  13. Xu Y., Li J., Cui M., et al. “Document AI: A Survey”, IEEE Transactions on Pattern Analysis and Machine Intelligence. https://ieeexplore.ieee.org/document/9356352 (published 2021).
  14. Lewis P., Perez E., Piktus A., Petroni F., Karpukhin V., et al. “Retrieval-Augmented Pipelines for Scalable NLP”, NeurIPS 2021.
  15. Tran V., Nguyen H., Le T. “A Survey on Vietnamese Document Analysis and Recognition”, arXiv preprint. https://arxiv.org/pdf/2506.05061 (submitted 2025).
  16. Wang J., He Z., Fu Y. “A Multimodal Method to Extract Hierarchy Structure in PDF Documents”, Findings of EMNLP 2020. https://aclanthology.org/2020.findings-emnlp.80.pdf (published 2020).
  17. Feng Y., et al. “DocPedia: Unleashing the Power of Large Multimodal Model for Versatile Document Understanding”, Science China Information Sciences. https://link.springer.com/article/10.1007/s11432-024-4250-y (published 2024).
  18. Clausner C., Pletschacher S., Antonacopoulos A. “Page Layout Analysis for Complex Documents”, ICDAR 2019. https://ieeexplore.ieee.org/document/8978097 (published 2019).
  19. Lopez M., Smith J.R. “Standardized Parsing of PDF and DOCX Documents”, Pattern Recognition, Vol. 113. https://www.sciencedirect.com/science/article/pii/S0031320321001423 (published 2021).
  20. Mori S., Suen C.Y., Yamamoto K. “Historical Review of OCR Research”, Proceedings of the IEEE. https://ieeexplore.ieee.org/document/780186 (published 1999).
  21. Baek J., Kim G., Lee S., et al., “What is Wrong with Scene Text Recognition Model Comparisons?”, CVPR 2019. https://arxiv.org/abs/1904.01906 (submitted 4 Apr 2019).
  22. Xu Y., et al., “LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking”, ACL 2022. https://aclanthology.org/2022.acl-long.530/ (published 2022).
  23. Gao L., et al., “Extracting Visual Elements from Scientific Documents”, AAAI 2021. https://ojs.aaai.org/index.php/AAAI/article/view/16558 (published 2021).
  24. Smith R. “An Overview of the Tesseract OCR Engine”, Google Research Whitepaper. https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45592.pdf (published 2018).
  25. Packer C., et al., “On Noise Propagation in Natural Language Processing Pipelines”, NAACL 2021. https://aclanthology.org/2021.naacl-main.398/ (published 2021).
  26. Gongane S. “Detection and Moderation of Detrimental Content on Social Media”, Indian Journal of Psychiatry. https://pmc.ncbi.nlm.nih.gov/articles/PMC9444091/ (published 2022).
  27. Liu M., Zhang X., Huang T. “A Comprehensive Review of LLM-based Content Moderation”, AI Review. https://link.springer.com/article/10.1007/s10462-025-11328-1 (published 2025).
  28. Pavlopoulos J., et al., “Toxicity Detection: Context Matters”, ACL 2020. https://aclanthology.org/2020.acl-main.92/ (published 2020).
  29. Davidson T., Warmsley D., Macy M., Weber I. “Automated Hate Speech Detection and the Problem of Offensive Language”, ICWSM 2017. https://ojs.aaai.org/index.php/ICWSM/article/view/14878 (published 2017).
  30. Jain S. “Computer Vision for Content Moderation”, IEEE Multimedia. https://ieeexplore.ieee.org/document/9054977 (published 2020).
  31. Qi H., et al., “Pornographic Image Classification via Deep Learning”, CVPR 2019. https://openaccess.thecvf.com/content_CVPR_2019/papers/Qi (published 2019).
  32. Google AI, “Responsible AI and The Safety Funnel Architecture”. https://ai.googleblog.com/2021/04/responsible-ai/ (published 2021).
  33. Meta AI, “Multimodal Risk Aggregation for Content Safety”. https://ai.facebook.com/blog/ (published 2022).
  34. Floridi L. “The Ethics of Filtering and Automated Moderation”, AI & Society. https://link.springer.com/article/10.1007/s00146-019-00979-0 (published 2020).
  35. Gana B., Allende-Cid H., Rüping S., Becerra-Rozas M., Zamora J. “A Systematic Review of Long Document Summarization Methods: Evaluation Metrics and Approaches”, Neurocomputing, Vol. 655, 28 Nov 2025, Article 131287. https://doi.org/10.1016/j.neucom.2025.131287.
  36. Yang C., Wang K. “Hierarchical Summarization of Large Documents”. https://cci.drexel.edu/faculty/cyang/papers/yang2008h.pdf (published 2008).
  37. Chen X., et al., “CoTHSSum: Structured Long-document Summarization via Chain-of-thought Reasoning”, 2025. https://link.springer.com/article/10.1007/s44443-025-00041-2 (published 2025).
  38. Beltagy I., Peters M., Cohan A. “Longformer: The Long-Document Transformer”. https://arxiv.org/abs/2004.05150 (submitted 10 Apr 2020).
  39. Zaheer M., et al., “BigBird: Transformers for Longer Sequences”, NeurIPS 2020. https://arxiv.org/abs/2007.14062 (submitted 27 Jul 2020).
  40. Raffel C., et al. “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)”, JMLR. https://arxiv.org/abs/1910.10683 (submitted 23 Oct 2019).
  41. Kintsch W. “Comprehension Theory and Cognitive Load”, Memory & Cognition. https://link.springer.com/article/10.3758/BF03198743 (published 2004).
  42. Mani I. “Automatic Summarization”, MIT Press. https://mitpress.mit.edu (published 2001).
  43. Li J. Information Compression Theory for Natural Language Processing, Springer. https://link.springer.com/book/10.1007/978-3-030-25558-1 (published 2019).
  44. Chau M. “Summaries for Digital Libraries: Improving Search and Accessibility”, Journal of the Association for Information Science and Technology (JASIST). https://asistdl.onlinelibrary.wiley.com/doi/10.1002/asi.23683 (published 2016).
  45. Ren X., et al., “Do Summaries Improve Comprehension? Evidence from Long-Document Reading”, IEEE Access. https://ieeexplore.ieee.org/document/9720800 (published 2022).
