Translating or Stealing? Probing the Limits of Cross-lingual Plagiarism Detection Systems in Literary Texts
DOI:
https://doi.org/10.33806/ijaes1026
Keywords:
Arabic, cross-lingual plagiarism, detection accuracy, detection precision, translation
Abstract
This study compares three plagiarism detection systems (Rabin-Karp, KNN, and Word2Vec) to measure how effectively they detect cross-lingual plagiarism in Arabic literary texts translated from English. The dataset is an Arabic translation of Daly Walker’s ‘I am the Grass’ (2012), produced by the authors and evaluated by three translators, each with more than ten years of experience. The text is divided into 60 percent directly translated, 30 percent paraphrased, and 10 percent original content. KNN achieved the highest precision in detecting cross-lingual plagiarism (26.7%), while Word2Vec performed best on paraphrased content (16.7%). Rabin-Karp was the most reliable in detecting original content (80% precision); however, all three systems showed low overall accuracy (23–26%). These findings highlight the limitations of current systems when applied to Arabic texts, primarily due to the language’s morpho-syntactic and lexical complexities. Because the study analyzes a single text, it recommends extending the analysis to multiple genres to improve generalizability. It also recommends developing more sophisticated hybrid plagiarism detection systems and building rich Arabic corpora to improve their performance.
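The abstract reports high precision for one class alongside low overall accuracy, and a small worked example may help show how the two can coexist. The sketch below is illustrative only: the labels, the ten-segment toy data, and the helper names `per_class_precision` and `accuracy` are assumptions made for this example, not the study's data or implementation. Only the 60/30/10 split of the gold labels mirrors the abstract.

```python
from collections import Counter


def per_class_precision(gold, predicted):
    """For each label: correct predictions of that label / all predictions of it."""
    predicted_counts = Counter(predicted)
    correct_counts = Counter(p for g, p in zip(gold, predicted) if g == p)
    return {label: correct_counts[label] / n for label, n in predicted_counts.items()}


def accuracy(gold, predicted):
    """Share of segments whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical toy data: 10 segments in a 60/30/10 split (not the study's results).
gold = ["translated"] * 6 + ["paraphrased"] * 3 + ["original"]
predicted = [
    "paraphrased", "paraphrased", "paraphrased",
    "paraphrased", "translated", "paraphrased",   # gold: translated
    "translated", "translated", "paraphrased",    # gold: paraphrased
    "original",                                   # gold: original
]

precision = per_class_precision(gold, predicted)
overall = accuracy(gold, predicted)

for label in ("translated", "paraphrased", "original"):
    print(f"{label}: precision = {precision[label]:.3f}")
print(f"overall accuracy = {overall:.3f}")
```

In this toy case the detector labels only one segment as original and gets it right, so precision for that class is perfect even though most segments overall are misclassified. Per-class precision and overall accuracy therefore answer different questions and should be read together.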