spaCy in Vietnamese Natural Language Processing

spaCy in Vietnamese Natural Language Processing: Custom Models, Integrations, and Advancements

Abstract: spaCy, a leading industrial-strength NLP library in Python, excels in efficient, production-ready pipelines for tokenization, named entity recognition (NER), dependency parsing, and more, leveraging neural architectures like Transformers for high-resource languages. For Vietnamese, a tonal, low-resource language with syllabic ambiguities and diacritic complexities, spaCy’s multilingual …

NLTK in Vietnamese Natural Language Processing

NLTK in Vietnamese Natural Language Processing: Adaptations, Challenges, and Integrations

Abstract: The Natural Language Toolkit (NLTK), a cornerstone Python library for natural language processing (NLP), offers extensible tools for tokenization, part-of-speech (POS) tagging, and syntactic parsing, primarily optimized for high-resource languages like English. For Vietnamese, a low-resource, tonal, isolating language with ambiguous word boundaries, NLTK’s out-of-the-box performance …

SeaLLM

SeaLLM: Southeast Asian Large Language Models for Multilingual and Low-Resource NLP

Abstract: SeaLLM, a suite of open-source large language models (LLMs) tailored for Southeast Asian (SEA) languages, aims to make NLP more equitable by addressing the under-representation of low-resource languages such as Vietnamese, Indonesian, Thai, and Khmer. Developed by DAMO Academy at Alibaba Group and collaborators, the …

PhoGPT

PhoGPT: A Generative Pre-Trained Transformer for Vietnamese Language Tasks

Abstract: PhoGPT, a pioneering generative language model for Vietnamese, advances low-resource NLP by adapting a GPT-style architecture to capture the language’s tones, syllabic structure, and contextual nuances. Released in 2023 by VinAI Research, PhoGPT (available in 4B- and 7.5B-parameter variants) excels …

Underthesea

Underthesea: An Open-Source Python Toolkit for Vietnamese Natural Language Processing

Abstract: Underthesea, a versatile Python library for Vietnamese Natural Language Processing (VNLP), has solidified its position as a go-to resource since its 2019 launch, offering streamlined pipelines for tokenization, part-of-speech (POS) tagging, named entity recognition (NER), chunking, and dependency parsing. Engineered to navigate Vietnamese’s tonal …

VnCoreNLP

VnCoreNLP: A Java-Based Toolkit for Vietnamese Natural Language Processing

Abstract: VnCoreNLP, an open-source Java toolkit for Vietnamese Natural Language Processing (VNLP), has been a foundational resource since its release in 2018, providing an integrated pipeline for word segmentation, part-of-speech (POS) tagging, named entity recognition (NER), and dependency parsing. Tailored to Vietnamese’s tonal, syllabic, and isolating …

PyVi

PyVi: A Python-Based Toolkit for Vietnamese Natural Language Processing

Abstract: PyVi, an open-source Python library dedicated to Vietnamese Natural Language Processing (VNLP), has become instrumental in democratizing access to linguistic tools for low-resource languages. Released in 2018, PyVi integrates efficient algorithms for core tasks such as word segmentation, part-of-speech tagging, named entity recognition, and dependency …

PhoBERT

PhoBERT: A Transformer-Based Pre-Trained Language Model for Vietnamese Natural Language Processing

Abstract: PhoBERT, a robust pre-trained language model tailored for Vietnamese, has revolutionized Natural Language Processing (NLP) tasks in low-resource settings. Derived from the RoBERTa architecture, PhoBERT leverages extensive monolingual corpora to encode tonal and syllabic nuances inherent to Vietnamese. This report offers a comprehensive …

Advances in Vietnamese Natural Language Processing: Challenges, Techniques, and Future Directions

Abstract: Vietnamese Natural Language Processing (VNLP) has emerged as a critical subfield within computational linguistics, addressing the unique linguistic characteristics of the Vietnamese language. As a low-resource, tonal, isolating language with a syllabic structure and ambiguous word boundaries, Vietnamese poses significant challenges for standard NLP pipelines developed primarily for high-resource languages like English. This report provides a …