A Survey of Semantic Search and Retrieval Approaches for Persian and Arabic Texts

Document Type : Original Article

Author

Assistant Professor, Information Dissemination and Knowledge Exchange, Islamic Sciences and Culture Academy, Qom, Iran

10.22091/stim.2024.10402.2067

Abstract

Purpose: In recent decades, web search engines have become essential tools for accessing information. As the volume of information on the web grows, so does the demand for locating relevant and meaningful information. Traditional search engines typically retrieve results through keyword matching, ranking texts by how often matching terms occur. This approach often returns irrelevant results. The problem is more pronounced in Persian and Arabic, whose complex grammar is difficult for machines to process. The aim of this research is to review and present solutions for semantic search and retrieval of Persian and Arabic texts.
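The limitation of keyword matching described above can be illustrated with a minimal Python sketch. The documents, the query, and the function name `keyword_score` are illustrative assumptions rather than material from the surveyed studies. The scorer counts exact token matches only, so a broken plural and a derived noun sharing the root of the query word receive a score of zero.

```python
from collections import Counter


def keyword_score(query: str, document: str) -> int:
    """Count exact occurrences of each query token in the document."""
    doc_counts = Counter(document.split())
    return sum(doc_counts[token] for token in query.split())


# Toy Arabic documents (illustrative data, not from the surveyed studies).
documents = {
    "d1": "كتاب التاريخ",   # "the history book"
    "d2": "كتب التاريخ",    # "the history books" (broken plural of the query word)
    "d3": "مكتبة المدينة",  # "the city library" (derived from the same root)
}

query = "كتاب"  # "book"

# Only d1 contains the exact surface form of the query token.
scores = {doc_id: keyword_score(query, text) for doc_id, text in documents.items()}
print(scores)
```

In this toy setting only `d1` is retrieved, although `d2` is clearly relevant to a user searching for "book".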
Method: This research is a content analysis study that used the library method for data collection. The sources consulted included scientific articles, books, theses, and reports. For collecting Persian articles, sources, and for collecting English articles, sources with publication dates from 2020 onwards were used.
The content analysis method was utilized to analyze the collected data. By employing data analysis and interpretation methods, the results of previous studies were reviewed and evaluated alongside the new findings of the research. This evaluation involved identifying the issues and constraints of current semantic search engines and offering suggestions for enhancement.
Findings: Research on semantic search and information retrieval for Persian and Arabic texts employs methods based on semantic text analysis and processing, using pre-trained language models, clustering algorithms such as K-Means, and knowledge resources such as knowledge graphs. The dataset, the way models and algorithms are used, and the method of semantic search and retrieval between words all influence a system's performance and accuracy. The reviewed studies report a wide range of methods and algorithms for semantic text search and retrieval, each of which can produce different results. Each method is able to retrieve the meaning of texts, but the methods differ in search accuracy, and some outperform others. The stronger methods draw on techniques such as topic analysis, neural networks, and vector representations. The appropriate method should nevertheless be chosen based on the nature of the problem and the characteristics of the data, since each problem and dataset may have its own requirements. Selecting a suitable method and tuning its parameters is critical for optimal performance.
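As a rough illustration of two of the techniques named above, the following Python sketch ranks documents by cosine similarity over vector representations and groups them with K-Means. The two-dimensional vectors are hand-made assumptions standing in for embeddings that a pre-trained language model would produce, and the K-Means routine is plain Lloyd's algorithm with caller-supplied initial centroids, not the enhanced variant of any surveyed study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kmeans(points, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm; returns the cluster index of each point."""
    centroids = [list(c) for c in initial_centroids]  # avoid mutating the input
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical 2-D "embeddings"; real ones would come from a language model.
doc_vectors = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.2],
    "d3": [0.1, 0.9],
    "d4": [0.2, 0.8],
}
query_vector = [1.0, 0.0]

# Semantic ranking: most similar documents first.
ranked = sorted(
    doc_vectors,
    key=lambda d: cosine(query_vector, doc_vectors[d]),
    reverse=True,
)

# Clustering: group the documents into two clusters.
points = list(doc_vectors.values())
labels = kmeans(points, [points[0], points[2]])
print(ranked, labels)
```

With these toy vectors, documents close to the query direction are ranked first, and the clustering separates the two visibly distinct groups. Results on real data would depend on the embedding model, the number of clusters, and the initialization.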
Conclusion: Each of the presented methods offers its own solutions to the issues and linguistic characteristics of Persian and Arabic. The methods variously use pre-trained language models such as BERT, clustering algorithms such as K-Means, and retrieval systems based on knowledge resources such as knowledge graphs. The presented solutions also rely on specific datasets and resources for training and evaluation. Differences in the datasets, and in how the models and algorithms are used and configured, are critical. Some methods retrieve information based on meaning and semantic relationships between words, while others use keyword-based and root-based methods. This variation in search and retrieval method can affect a system's performance and accuracy, so each method retrieves information with different performance and accuracy depending on how its models, algorithms, and data sources are used.
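The root-based retrieval mentioned above can be sketched in Python as follows. The lexicon `ROOT_LEXICON` and the helper names are hypothetical; a real system would obtain roots from a morphological analyzer rather than a hand-made lookup table. The sketch only shows the matching idea: tokens are compared by root instead of by surface form.

```python
# Hypothetical hand-made root lexicon (assumption for illustration only).
ROOT_LEXICON = {
    "كتاب": "كتب",   # "book"
    "كتب": "كتب",    # "books"
    "مكتبة": "كتب",  # "library"
    "مدرسة": "درس",  # "school"
    "درس": "درس",    # "lesson"
}


def root_of(token: str) -> str:
    """Return the root of a token, or the token itself if it is unknown."""
    return ROOT_LEXICON.get(token, token)


def root_match(query: str, document: str) -> bool:
    """True if any document token shares a root with any query token."""
    query_roots = {root_of(token) for token in query.split()}
    return any(root_of(token) in query_roots for token in document.split())


# Toy single-word documents (illustrative data).
documents = {"d1": "كتاب", "d2": "مكتبة", "d3": "مدرسة"}

# Query "كتب" ("books") matches every document sharing its root.
hits = [doc_id for doc_id, text in documents.items() if root_match("كتب", text)]
print(hits)
```

Matching by root recovers the morphological variant in `d1`, but it also returns `d2` ("library") for a query about books. This is one way in which root-based and meaning-based methods can differ in accuracy.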

