I’m currently involved in a few projects that use corpus analysis (What is Corpus Linguistics?), so I am developing this page as a repository of the most commonly used linguistic corpora of contemporary British and American English, most of them with general access. (More soon!)
- British National Corpus (BNC): contains 100 million words of text from a wide range of genres
- BYU Google Books corpora: for written English data
- Collins Wordbanks Online English corpus: contains 56 million words of contemporary written and spoken text, both British and American English
- Corpus of Contemporary American English (COCA): contains more than 520 million words of text, equally divided among spoken, fiction, popular magazine, newspaper, and academic texts
- CHILDES (+CLAN): a large and widely used child language data exchange system
- Michigan Corpus of Academic Spoken English: transcribed academic speech (native and non-native)
- Phrase Detectives Corpus: a recently released (May 2017) corpus developed by the University of Essex, with 19,012 words across 40 documents, annotated for anaphora through the Phrase Detectives game
- TIME Magazine Corpus of American English: based on 100 million words of text in about 275,000 TIME magazine articles published from 1923 to 2006
- Word Frequency Data: a corpus of contemporary American English
A short list of websites that provide many different types of linguistic corpora in various languages:
- BYU Corpora
- Corpus-based Linguistics Links
- Data and corpora from Max Planck Institute for Psycholinguistics
- Introduction to Corpus Linguistics, University of Lancaster
- Linguistic Data Consortium (LDC)
- Resources for Corpora, Stanford Linguistics
- Texts & Corpora compiled by the Linguist List