Natural Language Processing Approaches to Text Data Augmentation: A Computational Linguistic Analysis

Authors

Hoda Zaiton and Sameh Al-Ansary

DOI:

https://doi.org/10.33806/ijaes.v25i1.682

Keywords:

easy data augmentation, natural language processing, neural augmentation, text data augmentation, text genre

Abstract

In Natural Language Processing (NLP) tasks, problems such as insufficient or skewed data are frequently encountered. One practical solution is to generate additional textual data. Text Data Augmentation (TDA) refers to small changes made to available text at the character, word, or sentence level to produce synthetic data, which is then fed into data loaders to train the model. By learning from a wider range of instances, models can enhance their resilience and generalization ability. Although the NLP community has studied many data augmentation (DA) approaches extensively, recent research suggests that the relationship between the various DA techniques now in use is not fully understood in practice. This study therefore applies and extends TDA across varied tools and multiple settings or contexts. To carry out a thorough practical implementation of NLP DA approaches, comparing how they perform and highlighting significant similarities and differences across these scenarios, the work relies on several tools for easy data augmentation and neural-based augmentation. The findings suggest that some typical DA techniques may not be suitable in certain circumstances or text environments; in particular, the initial results indicate that the context and word count of a text can have a significant impact on the quality of the synthetic data.
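
As a purely illustrative sketch of the word-level operations described above, the short Python program below implements two easy data augmentation operations (random swap and random deletion) in the spirit of Wei and Zou (2019), with a commented pointer to a neural, masked-language-model alternative; the function names, parameters, and example sentence are assumptions made for illustration and are not taken from the article itself.

import random

def random_swap(words, n_swaps=1):
    # Swap the positions of two randomly chosen words, n_swaps times.
    words = list(words)
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p=0.1):
    # Drop each word independently with probability p, keeping at least one word.
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

if __name__ == "__main__":
    sentence = "text data augmentation generates synthetic training examples"
    tokens = sentence.split()
    print(" ".join(random_swap(tokens, n_swaps=2)))
    print(" ".join(random_deletion(tokens, p=0.2)))

# A neural-based alternative is contextual substitution with a masked language
# model (cf. Kobayashi 2018; Wu et al. 2019); assuming the Hugging Face
# `transformers` package is available, it could look roughly like:
# from transformers import pipeline
# fill = pipeline("fill-mask", model="bert-base-uncased")
# print(fill("text data augmentation generates [MASK] training examples")[0]["sequence"])

Each call produces a slightly perturbed copy of the input sentence that can be added to the training data alongside the original.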

Author Biographies

Hoda Zaiton, Alexandria University in collaboration with Arab Academy for Science, Technology and Maritime Transport (AASTMT), Egypt

MA Candidate, Applied Linguistics

College of Language and Communication (CLC)

The Arab Academy for Science, Technology and Maritime Transport (AASTMT),

in collaboration with the Institute of Applied Linguistics and Translation, Faculty of Arts, Alexandria University, Egypt.

Sameh Al-Ansary, Alexandria University, Egypt

Professor of Computational Linguistics

Phonetics and Phonology Department

Faculty of Arts, Alexandria University, Egypt.

References

Abu-Ssaydeh, Abdul-Fattah and Najib Jarad. (2016). ‘Complex sentences in English legislative texts: Patterns and translation strategies’. International Journal of Arabic-English Studies (IJAES), 16(1): 111-128.

Al-Taher, Mohammad Anwar. (2019). ‘Google Translate’s rendition of verb-subject structures in Arabic news reports’. International Journal of Arabic-English Studies (IJAES), 19(1): 195-208. https://doi.org/10.33806/ijaes.19.1.11.

Bax, Stephen. (ed.). (2011). Discourse and Genre. London: Macmillan Education UK.

Belinkov, Yonatan and Yonatan Bisk. (2018). ‘Synthetic and natural noise both break neural machine translation’. In 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings, 1–13.

Bojanowski, Piotr, Edouard Grave, Armand Joulin and Tomas Mikolov. (2017). ‘Enriching word vectors with subword information’. Transactions of the Association for Computational Linguistics, 5: 135–46. https://doi.org/10.1162/tacl_a_00051.

Chalkidis, Ilias, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Martin Katz and Nikolaos Aletras. (2022). ‘LexGLUE: A benchmark dataset for legal language understanding in English’. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 4310–4330. https://doi.org/10.18653/v1/2022.acl-long.297

Coulombe, Claude. (2018). ‘Text data augmentation made simple by leveraging NLP Cloud APIs’. Available at: http://arxiv.org/abs/1812.04718

Couture, Barbara. (1986). ‘Effective ideation in written text: A functional approach to clarity and exigence’. Faculty Publications, Department of English, 67: 69-92.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee and Kristina Toutanova. (2019). ‘BERT: Pre-training of deep bidirectional transformers for language understanding’. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423

Fadaee, Marzieh, Arianna Bisazza and Christof Monz. (2017). ‘Data augmentation for low-resource neural machine translation’. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Volume 2: Short Papers, 567–573. https://doi.org/10.18653/v1/P17-2090

Feng, Steven Y., Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura and Eduard Hovy. (2021). ‘A survey of data augmentation approaches for NLP’. In Findings of the Association for Computational Linguistics: ACL/IJCNLP, 968–988. https://doi.org/10.18653/v1/2021.findings-acl.84

Futrell, Richard, Kyle Mahowald, and Edward Gibson. (2015). ‘Quantifying word order freedom in dependency corpora’. In Proceedings of the Third International Conference on Dependency Linguistics, 91–100.

Gangal, Varun, Steven Y. Feng, Malihe Alikhani, Teruko Mitamura and Eduard Hovy. (2022). ‘NAREOR: The narrative reordering problem’. In Proceedings of the 36th AAAI Conference on Artificial Intelligence, 36(10): 10645–53. https://doi.org/10.1609/aaai.v36i10.21309.

Greene, Derek and Pádraig Cunningham. (2006). ‘Practical solutions to the problem of diagonal dominance in kernel document clustering’. In Proceedings of the 23rd International Conference on Machine Learning (ICML ’06), 377-384. https://dl.acm.org/doi/10.1145/1143844.1143892

Gulordava, Kristina, Piotr Bojanowski, Edouard Grave, Tal Linzen and Marco Baroni. (2018). ‘Colorless green recurrent networks dream hierarchically’. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, Volume 1 (Long Papers), 1195–1205. https://doi.org/10.18653/v1/N18-1108

Haralabopoulos, Giannis, Mercedes Torres Torres, Ioannis Anagnostopoulos and Derek McAuley. (2021). ‘Text data augmentations: Permutation, antonyms and negation’. Expert Systems with Applications, 177: 114769. https://doi.org/10.1016/j.eswa.2021.114769

Johns, Ann M. (2008). ‘Genre awareness for the novice academic student: An ongoing quest’. Language Teaching, 41 (2): 237–52. https://doi.org/10.1017/s0261444807004892

Karpukhin, Vladimir, Omer Levy, Jacob Eisenstein and Marjan Ghazvininejad. (2019). ‘Training on synthetic noise improves robustness to natural noise in machine translation’. In Proceedings of the 5th Workshop on Noisy User-generated Text, W-NUT@EMNLP 2019, 42–47. https://doi.org/10.18653/v1/D19-5506

Kim, Hyeon Soo, Hyejin Won, and Kyung Ho Park. (2022). ‘PMixUp: Simultaneous utilization of part-of-speech replacement and feature space interpolation for text data augmentation’. https://openreview.net/forum?id=O4fNuE8F51T

Kobayashi, Sosuke. (2018). ‘Contextual augmentation: Data augmentation by words with paradigmatic relations’. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), 452–457. https://doi.org/10.18653/v1/N18-2072

Kolomiyets, Oleksandr, Steven Bethard, and Marie-Francine Moens. (2011). ‘Model-Portability experiments for textual temporal analysis’. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2: 271–276.

Ma, Edward. (2019). Nlpaug: Data Augmentation for NLP. Available at: https://github.com/makcedward/nlpaug

McCarthy, Philip M. (2005). An assessment of the range and usefulness of lexical diversity measures and the potential of the measure of textual, lexical diversity. PhD Dissertation, The University of Memphis.

Mikolov, Tomáš, Wen-tau Yih, and Geoffrey Zweig. (2013). ‘Linguistic regularities in continuous space word representations’. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 746–751.

Miller, George A. (1995). ‘WordNet: A lexical database for English’. Communications of the ACM, 38 (11): 39–41. https://doi.org/10.1145/219717.219748

Nazarenko, Adeline, and Adam Wyner. (2017). ‘Legal NLP Introduction’. Traitement automatique des langues, 58(2): 7-19.

Lakshmana Pandian, S., and T. V. Geetha. (2008). ‘Morpheme based language model for Tamil part-of-speech tagging’. Polibits, 38: 19–25.

Pavlick, Ellie, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme and Chris Callison-Burch. (2015). ‘PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings and style classification’. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 425-430.

Pellicer, Lucas Francisco Amaral Orosco, Taynan Maier Ferreira and Anna Helena Reali Costa. (2023). ‘Data augmentation techniques in natural language processing’. Applied Soft Computing, 132: 109803. https://doi.org/10.1016/j.asoc.2022.109803.

Pennington, Jeffrey, Richard Socher and Christopher Manning. (2014). ‘GloVe: Global vectors for word representation’. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. https://doi.org/10.3115/v1/D14-1162

Pratt, Lorien. (1996). ‘Special issue: Reuse of neural networks through transfer’. Connection Science, 8(2).

Radford, Alec, Karthik Narasimhan, Tim Salimans and Ilya Sutskever. (2018). ‘Improving language understanding by generative pre-training’. Available at: https://api.semanticscholar.org/CorpusID:49313245.

Sabty, Caroline, Islam Omar, Fady Wasfalla, Mohamed Islam and Slim Abdennadher. (2021). ‘Data augmentation techniques on Arabic data for named entity recognition’. Procedia Computer Science, 189: 292–99. https://doi.org/10.1016/j.procs.2021.05.092.

Şahin, Gözde Gül, and Mark Steedman. (2018). ‘Data augmentation via dependency tree morphing for low resource languages’. In Conference on Empirical Methods in Natural Language Processing, 5004–5009. https://doi.org/10.18653/v1/D18-1545

Sennrich, Rico, Barry Haddow and Alexandra Birch. (2015). ‘Improving neural machine translation models with monolingual data’. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 86–96. https://doi.org/10.18653/v1/P16-1009

Shen, Yutong, Jiahuan Li, Shujian Huang, Yi Zhou, Xiaopeng Xie, and Qinxin Zhao. (2022). ‘Data augmentation for low-resource word segmentation and POS tagging of ancient Chinese texts’. In Proceedings of the Second Workshop on Language Technologies for Historical and Ancient Languages, 169–173.

Shorten, Connor, Taghi M. Khoshgoftaar and Borko Furht. (2021). ‘Text data augmentation for deep learning’. Journal of Big Data, 8 (1): 101. https://doi.org/10.1186/s40537-021-00492-0.

Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich. (2015). ‘Going deeper with convolutions’. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9.

Wang, Jiapeng and Yihong Dong. (2020). ‘Measurement of text similarity: A survey’. Information, 11 (9): 421. https://doi.org/10.3390/info11090421.

Wang, William and Diyi Yang. (2015). ‘That’s so annoying!!!: A lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets’. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2557–2563. http://dx.doi.org/10.18653/v1/D15-1306

Wei, Jason and Kai Zou. (2019). ‘EDA: Easy data augmentation techniques for boosting performance on text classification tasks’. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 6382–6388.

Wu, Xing, Shangwen Lv, Liangjun Zang, Jizhong Han and Songlin Hu. (2019). ‘Conditional BERT contextual augmentation’. In Computational Science - ICCS 2019 - 19th International Conference, Proceedings, Part IV, 84–95. http://dx.doi.org/10.1007/978-3-030-22747-0_7.

Xiang, Rong, Emmanuele Chersoni, Qin Lu, Chu-Ren Huang, Wenjie Li and Yunfei Long. (2021). ‘Lexical data augmentation for sentiment analysis’. Journal of the Association for Information Science and Technology, 72 (11): 1432–47. https://doi.org/10.1002/asi.24493.

Zhang, Xiang, Junbo Zhao and Yann LeCun. (2015). ‘Character-level convolutional networks for text classification’. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, 649–657.

Zhu, Jinhua, Fei Gao, Lijun Wu, Yingce Xia, Tao Qin, Wengang Zhou, Xueqi Cheng and Tie-Yan Liu. (2019). ‘Soft contextual data augmentation for neural machine translation’. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 5539–5544. http://dx.doi.org/10.18653/v1/P19-1555

Date of Publication

2024-06-23

How to Cite

Zaiton, H., & Al-Ansary, S. (2024). Natural Language Processing Approaches to Text Data Augmentation: A Computational Linguistic Analysis. International Journal of Arabic-English Studies. https://doi.org/10.33806/ijaes.v25i1.682

Section

Table of Contents
Received 2024-02-07
Accepted 2024-06-11
Published 2024-06-23