Thai Monitor Corpus: Challenges and Contribution to Thai NLP

Wirote Aroonmanakun; Natawut Nupairoj; Veera Muangsin; Songphan Choemprayong

Abstract

Building corpora is a necessary task for NLP and for other research fields such as linguistics, language teaching, and translation. Only a few Thai corpora have been created and released, and most of them are static and small. They are not designed as monitor corpora, which can grow over time. The concept of a monitor corpus resembles the research area known as Big Data, which has gained increasing interest in the past few years because of the extensive growth of data available online. This paper first discusses the differences between a monitor corpus and Big Data. It then outlines the design and framework for developing a Thai Monitor Corpus (TMC). To carry out this task, techniques and methods from Big Data research that are suitable for storing texts are selected and summarized. Section 3 reports the progress of this work, and the plan for further development and use of the TMC is sketched. The paper concludes by pointing out the relationship between the two research fields, NLP and Big Data, and by reviewing what each contributes to the other.

Keywords

Thai corpus, monitor corpus, NLP, Big Data
