What is the text of a Tweet?

Joan Codina and Jordi Atserias
Fundació Barcelona Media


Twitter is a popular microblogging and social medium for broadcasting news, staying in touch with friends and sharing opinions using up to 140 characters per message. In general, user-generated content (e.g. blogs, tweets) differs from the kind of text on which traditional Natural Language Processing (NLP) tools have been developed and trained. The use of non-standard language, emoticons, spelling errors, irregular letter casing, unusual punctuation, etc. means that applying NLP to user-generated content is still an open issue (Kobus et al., 2008; Simard and Deslauriers, 2000; CAW2.0 workshop). Moreover, in Twitter, all the peculiarities of user-generated content are magnified by the message length restriction and by several conventions particular to the Twitter framework (user references, hashtags, etc.). Although previous works have proposed some heuristics for handling them, e.g. (Kaufmann and Kalita, 2010), as far as we know no evaluation has been carried out to measure the impact of these heuristics on text processing.

This work focuses on the effect of Twitter metalanguage elements on text processing. The paper presents three different techniques for pre-processing text before applying a PoS tagger: synonym substitution, for normalizing non-standard text, and text filtering and PoS filtering strategies, for removing Twitter meta-elements.
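As a rough illustration of what such pre-processing might look like, the following minimal Python sketch applies a text filter for Twitter meta-elements followed by a dictionary-based synonym substitution. It is not the implementation evaluated in this work: the regular expressions, the `SYNONYMS` table, the function names and the decision to keep the word of a hashtag while dropping the `#` sign are all assumptions made for the example. PoS filtering, which operates on the tagger output, is not covered.

```python
import re

# Hypothetical normalization table (an assumption for this sketch, not taken
# from the paper): non-standard form -> standard Spanish form.
SYNONYMS = {"q": "que", "xq": "porque", "tb": "también"}

URL = re.compile(r"https?://\S+")
RETWEET = re.compile(r"^RT\b[\s:]*")   # leading retweet marker
MENTION = re.compile(r"@\w+:?")        # user reference, with optional colon
HASHTAG = re.compile(r"#(\w+)")        # hashtag; group 1 is the bare word


def filter_metaelements(text):
    """Remove URLs, the retweet marker and user references; keep hashtag words."""
    text = URL.sub("", text)
    text = RETWEET.sub("", text)
    text = MENTION.sub("", text)
    text = HASHTAG.sub(r"\1", text)
    # Collapse the whitespace left behind by the removed elements.
    return " ".join(text.split())


def substitute_synonyms(text):
    """Replace whitespace-delimited tokens found in the normalization table."""
    return " ".join(SYNONYMS.get(tok.lower(), tok) for tok in text.split())


def preprocess(text):
    """Filter meta-elements first, then normalize the remaining tokens."""
    return substitute_synonyms(filter_metaelements(text))


if __name__ == "__main__":
    print(preprocess("RT @juan: xq dices q no #fiesta http://t.co/abc"))
```

The output of `preprocess` would then be passed to the PoS tagger. Note that tokens with attached punctuation (e.g. `tb?`) are not matched by this simple lookup; a real system would need proper tokenization first.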

Although the techniques presented are language independent, we focus on the processing of Spanish tweets, and specifically on PoS tagging, which is a basic preliminary step for more complex NLP tasks such as Named Entity Recognition or parsing. A small evaluation is performed to compare the different approaches on the task of PoS tagging tweets in Spanish.