4/26/12

:: LWC: Leiden Weibo Corpus ::


"It is my pleasure to announce to you today the Leiden Weibo Corpus (LWC),
an annotated 100-million-word linguistic corpus containing 5.1 million
messages from Sina Weibo, China's premier Twitter-like microblogging
service.

The LWC is freely available online at http://lwc.daanvanesch.nl/. Data for
the LWC was collected in January 2012. As such, it contains many linguistic
phenomena that may not be found in older corpora, such as suffixation with
"-ing", an aspectual marker borrowed from English.

Furthermore, Sina Weibo messages come with valuable metadata, such as the
user's gender and location. This information allows the LWC to calculate
how often words are used in different provinces and cities across China,
which is useful for research into regional lexical variation.

Naturally, the LWC also supports searching for single words or grammar
patterns, such as "any verb followed by an aspectual particle and then a
noun". This feature may also be of interest to students and teachers of
Mandarin who are looking for example sentences.

Please feel free to forward this announcement to anyone who might be
interested."
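
As a rough illustration of the grammar-pattern search mentioned in the announcement ("any verb followed by an aspectual particle and then a noun"), here is a minimal Python sketch that scans a part-of-speech-tagged sentence for a sequence of tags. It is not the LWC's own implementation: the function name `find_pattern`, the tag labels "V", "AS", "N" and "PN", and the sample sentence are all assumptions made for the example, and the corpus may use a different tagset and query syntax.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); tag names are assumed


def find_pattern(tokens: Sequence[Token], pattern: Sequence[str]) -> List[List[str]]:
    """Return every run of consecutive words whose tags equal `pattern`."""
    if not pattern:
        return []
    matches = []
    width = len(pattern)
    # Slide a window of the pattern's length across the sentence.
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            matches.append([word for word, _ in window])
    return matches


# Hypothetical tagged sentence: "I ate a meal".
sentence = [("我", "PN"), ("吃", "V"), ("了", "AS"), ("饭", "N")]

# Verb, then aspectual particle, then noun.
hits = find_pattern(sentence, ["V", "AS", "N"])
print(hits)
```

A real corpus query engine would typically run against an index rather than scanning each message, but the matching idea is the same.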
