Word of the day – corpus

A text corpus (pl. corpora) is a large and structured set of texts, usually stored, processed and analysed electronically. Corpora are used for statistical analysis, for checking how often words and constructions occur, and for validating linguistic rules. Dictionary makers also use them to work out the definitions of words from the way they are actually used. The word corpus comes from the Latin for body.

According to an article in the New York Times on this topic that I found today, the verb migrate is used much more frequently with the direction south than with north. Pink things tend to be fluffy, while green things are more likely to be fuzzy. We tend to chide ourselves but we are more likely to lambaste others. The word fake is most commonly associated with smiles, tans, IDs, passports, fur and boobs.
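
To give a rough idea of how findings like these come out of a corpus, here is a minimal Python sketch that counts which words turn up near a given word. The function names (tokenize, collocates) and the three toy sentences are my own inventions for illustration and are not taken from the OEC. Real corpus tools work on vastly more text and typically use more careful statistics than raw counts.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=2):
    """Count the words that occur within `window` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - window)
        # Words just before and just after the node word.
        neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(neighbours)
    return counts


# Toy "corpus": invented sentences, assumed purely for illustration.
toy_corpus = (
    "Swallows migrate south in autumn. "
    "Geese migrate south every winter. "
    "Some butterflies migrate north in spring."
)

counts = collocates(tokenize(toy_corpus), "migrate")
print(counts["south"], counts["north"])  # prints: 2 1
```

In this toy text, south appears next to migrate twice and north once. This simple version ignores sentence boundaries, so a word at the end of one sentence can count as a neighbour of a word at the start of the next.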

The article contains many other interesting examples, all taken from the Oxford English Corpus (OEC), a 1.8-billion-word database of written and spoken English.

I found another corpus of English today that’s accessible online: the British National Corpus. It’s smaller than the OEC, at only 100 million words, and covers mainly British English.

Do you know of similar corpora for other languages?

3 thoughts on “Word of the day – corpus”

teaandcrumpets says:

3 August 2007 at 7:57 pm

http://corpus.byu.edu/ has a nice collection of corpora: British English, American English, TIME Magazine, Spanish and Portuguese.

suchosch says:

10 August 2007 at 11:15 pm

Czech National Corpus – http://ucnk.ff.cuni.cz/english/index.html

Paul says:

13 August 2007 at 8:52 am

Just noticed this today (reported in a BBC online news story): a corpus of Scots – http://www.scottishcorpus.ac.uk

Omniglot Blog

Adventures in the world of words and language