Forthcoming

Translating or Stealing? Probing the Limits of Cross-lingual Plagiarism Detection Systems in Literary Texts

DOI:

https://doi.org/10.33806/ijaes1026

Keywords:

Arabic, cross-lingual plagiarism, detection accuracy, detection precision, translation

Abstract

This study compares three plagiarism detection systems (Rabin-Karp, KNN, and Word2Vec) to measure their effectiveness in detecting cross-lingual plagiarism in Arabic literary texts translated from English. The dataset consisted of an Arabic translation of Daly Walker’s ‘I am the Grass’ (2012), produced by the authors and evaluated by three translators, each with more than ten years of experience. The dataset was divided into 60 percent directly translated, 30 percent paraphrased, and 10 percent original content. Findings showed that KNN achieved the highest precision in detecting cross-lingual plagiarism (26.7%), while Word2Vec performed best with paraphrased content (16.7%). Rabin-Karp was the most reliable in detecting original content (80% precision); however, all three systems demonstrated low overall accuracy (23–26%). These findings highlight the limitations of current systems when applied to Arabic texts, primarily due to the language’s morpho-syntactic and lexical complexities. Because the study analyzes a single text, its scope is limited, and it recommends extending the analysis to multiple genres for broader generalizability. It further recommends developing more sophisticated, hybrid plagiarism detection systems and building rich Arabic corpora to enhance their performance.

Author Biographies

Abdulfattah Omar, Port Said University, Egypt

Associate Professor of Linguistics

Port Said University, Port Said, Egypt

The Australian National University, Australia

Email: a.a.omar2010@gmail.com

Wafya Ibrahim Hamouda, Tanta University, Egypt

Associate Professor of Linguistics

Tanta University, Tanta, Egypt

Email: wafia.hamouda@edu.tanta.edu.eg

Waheed M. A. Altohami, Prince Sattam bin Abdulaziz University, Saudi Arabia

Associate Professor of Linguistics

Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia

Mansoura University, Mansoura, Egypt

Email: w.m.altohami@gmail.com 
