Word of the day – corpus

A text corpus (pl. corpora) is a large and structured set of texts, usually stored, processed and analysed electronically. Corpora are used for statistical analysis, checking occurrences and validating linguistic rules. Dictionary makers also use them to find definitions of words. The word corpus comes from the Latin for body.

According to an article in the New York Times on this topic that I found today, the verb migrate is used much more frequently with the direction south than with north. Pink things tend to be fluffy, while green things are more likely to be fuzzy. We tend to chide ourselves but we are more likely to lambaste others. The word fake is most commonly associated with smiles, tans, IDs, passports, fur and boobs.
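Patterns like these can be found by counting which words turn up near a given word. Here's a rough sketch of that idea in Python. The sample sentences and the function name collocates are made up for illustration, and this is not how the article or the corpus it draws on actually does it; real corpus tools work on vastly more text and use more careful statistics.

```python
from collections import Counter
import re

def collocates(text, target, window=2):
    """Count the words that appear within `window` words of `target`."""
    # Crude tokenisation: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word == target:
            start = max(0, i - window)
            # Neighbours before and after the target, excluding the target itself.
            neighbours = words[start:i] + words[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Invented sample text, standing in for a real corpus.
sample = (
    "Geese migrate south in autumn. Swallows migrate south too. "
    "Some birds migrate north in spring."
)
counts = collocates(sample, "migrate")
print(counts["south"], counts["north"])  # prints: 2 1
```

With only three sentences the result means nothing, of course; the point is just that "south beats north" is a matter of counting neighbours across a very large body of text.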

The article contains many other interesting examples, all taken from the Oxford English Corpus (OEC), a 1.8-billion-word database of written and spoken English.

I found another corpus of English today that’s accessible online: the British National Corpus. At only 100 million words it’s smaller than the OEC, and it covers mainly British English.

Do you know of similar corpora for other languages?

This entry was posted in English, Language, Words and phrases.

3 Responses to Word of the day – corpus

  1. teaandcrumpets says:

    http://corpus.byu.edu/ has a nice collection of corpora. British English, American English, TIME Magazine, Spanish and Portuguese.

  2. Paul says:

    Just noticed this today (reported in BBC online news story) – corpus of Scots – http://www.scottishcorpus.ac.uk