Linguistic Resources for Genre-Independent Language Technologies: User-Generated Content in BOLT

Jennifer Garland, Stephanie Strassel, Safa Ismael, Zhiyi Song, Haejoong Lee
Linguistic Data Consortium, University of Pennsylvania


We describe an ongoing effort to collect and annotate very large corpora of user-contributed content in multiple languages for the DARPA BOLT program, whose goals include the development of genre-independent machine translation and information retrieval systems. Initial work includes the collection of several hundred million words of online discussion forum threads in English, Chinese and Egyptian Arabic, with multi-layered linguistic annotation of a portion of the collected data. Future phases will target additional informal genres such as Twitter and text messaging. We provide details of the collection strategy, review some of the particular technical and annotation challenges stemming from these genres, and conclude with a discussion of the strategies we have adopted for tackling these issues.