Article type: Research Article
Authors
Information Technology Research Institute, Iranian Research Institute for Information Science and Technology (IranDoc), Tehran, Iran
Abstract
Keywords
Topics
Article title [English]
Authors [English]
Optical Character Recognition (OCR) plays a pivotal role in document digitization. Using image processing algorithms and deep learning models, an OCR system detects and extracts the characters in a document image and outputs structured text that computer systems can process and analyze. Its applications in intelligent document management include archiving and indexing, automatic information extraction from documents, and efficient information retrieval. OCR also provides the foundation for developing specialized language models and text processing systems in scientific and industrial domains. By converting paper documents into digital data, it enables efficient storage and retrieval of information and opens the way to AI-driven document analysis.
In recent years, Transformer-based vision-language models, built on massive pre-trained models and extensive datasets, have made remarkable progress in OCR, particularly for English. Their performance on languages with more complex writing systems, such as Persian, still faces significant challenges, and only a few studies, of narrow applicability, have addressed the problem. This research investigates the performance and challenges of vision-language models in extracting Persian text from images. Persian's connected letter forms, right-to-left writing direction, and overlapping characters make it considerably harder to recognize than Latin-script languages.
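Two of the script properties named above, contextual letter forms and right-to-left direction, are visible directly in the Unicode character database. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` and the choice of the letter BEH are illustrative assumptions, not part of the study.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return the Unicode name and bidirectional class of one character."""
    return unicodedata.name(ch), unicodedata.bidirectional(ch)


# One abstract letter (BEH) has four contextual presentation forms,
# depending on whether it stands alone or joins its neighbours.
beh = "\u0628"
forms = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]  # isolated, final, initial, medial

for form in forms:
    # NFKC normalization folds every presentation form back to the base letter.
    assert unicodedata.normalize("NFKC", form) == beh
    print(describe(form))

# The base letter carries the right-to-left class "AL" (Arabic Letter),
# while a Persian digit carries "EN" and is laid out left to right.
print(describe(beh))
print(describe("\u06F1"))  # Persian digit one
```

An OCR model sees the four shapes as different glyphs in the image but must emit the same code point for all of them, which is one reason cursive scripts are harder than scripts with a fixed shape per letter.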
In this study, a high-quality dataset of 174,361 Persian sentences, each in both text and image form and sourced from digital books, was first generated to train models for Persian. A separate, diverse, and challenging evaluation dataset was also designed to cover real-world scenarios. A vision-language model with a two-stage architecture was then proposed: a pre-trained visual encoder extracts complex visual features, and a language decoder pre-trained on a large corpus of Persian text generates the corresponding text while adhering to Persian grammar and orthography. This decoupling allows the two components to be optimized independently, giving the model stronger visual understanding of images alongside deeper linguistic knowledge of Persian. The final model was fine-tuned on the training dataset described above.
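The two-stage design can be written in the standard encoder-decoder form. This is a generic formulation, not one taken from the paper: the symbols $E_{\phi}$ (visual encoder), $\theta$ (decoder parameters), $x$ (input image), and $y_1, \dots, y_T$ (output tokens) are assumptions introduced here for illustration.

```latex
\begin{align}
  \mathbf{h} &= E_{\phi}(x), \\
  p_{\theta}(y \mid x) &= \prod_{t=1}^{T} p_{\theta}\bigl(y_t \mid y_{<t}, \mathbf{h}\bigr), \\
  \mathcal{L}(\phi, \theta) &= -\sum_{t=1}^{T} \log p_{\theta}\bigl(y_t \mid y_{<t}, \mathbf{h}\bigr).
\end{align}
```

Under this reading, pre-training fixes good starting values for $\phi$ and $\theta$ separately, and fine-tuning on the image-sentence pairs minimizes $\mathcal{L}$ over both.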
The results were compared against Tesseract, an OCR engine based on convolutional and recurrent neural networks whose development was long sponsored by Google, and Qwen2.5-VL-7B, a vision-language model introduced by Alibaba. The evaluations show that the proposed model achieves 98% word-level accuracy (WER = 2%) on the Persian sentence test dataset, attesting to its strong capability in processing Persian text. Nevertheless, error analysis reveals that the model performs poorly on Persian numerals, Latin numerals, and English words within mixed-language texts. The weakness in Persian numeral recognition is observed across all evaluated models and is attributed primarily to the lack of structural and linguistic diversity in the training dataset. Enriching the training dataset with more diverse samples and then retraining the model is therefore proposed as an essential step toward a comprehensive, Persian-centric OCR system that is effective in real-world conditions.
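The relation between the two reported figures (98% word-level accuracy, WER = 2%) follows from the usual definition of word error rate: the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a minimal implementation of that definition; the function name `word_error_rate` and whitespace tokenization are assumptions, and the abstract does not state whether the authors scored per sentence or over the whole corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five gives a WER of 0.2.
wer = word_error_rate("a b c d e", "a b x d e")
print(f"WER = {wer:.2f}, word accuracy = {1 - wer:.2f}")
```

Because insertions count as errors, WER can exceed 1 on very poor output, so reading accuracy as 1 − WER is only meaningful when the error rate is small, as it is here.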
Notably, Tesseract, whose convolutional and recurrent architecture is simpler and has fewer parameters than vision-language models, delivers overall performance (including on English text) that is competitive with, or even superior to, some large-scale hybrid models. This is likely because its training data is comprehensive and well suited both to OCR and to the model's architecture. Sufficient clean Persian data remains scarce; nevertheless, the results indicate that diverse datasets strengthen Transformer-based OCR approaches relative to classical convolutional neural network-based methods.
Keywords [English]