List of text corpora
The following is a list of text corpora in various languages. A text corpus (plural: text corpora) is a large, structured set of texts, nowadays usually stored and processed electronically. Text corpora are used for statistical analysis and hypothesis testing, for example to check occurrences or to validate linguistic rules within a specific language territory.
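As a rough illustration of the kind of occurrence checking mentioned above, the following minimal Python sketch counts word frequencies in a tiny in-memory text. The sample sentence, the simple regular-expression tokenizer, and the helper names `word_frequencies` and `relative_frequency` are assumptions made for this example; real corpora are far larger and usually come with their own tokenization and query tools.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count lowercase word tokens in a text (naive tokenizer, for illustration)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def relative_frequency(counts, word):
    """Return the share of all tokens accounted for by `word`."""
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


# Hypothetical stand-in for a corpus; a real corpus would be read from files.
sample = "The cat sat on the mat. The dog sat too."
counts = word_frequencies(sample)

print(counts.most_common(2))
print(relative_frequency(counts, "sat"))
```

The same counting approach scales to larger collections by feeding each document through `word_frequencies` and summing the resulting counters.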
English language
- Google Books Ngram Corpus[1][2]
- American National Corpus
- Bank of English
- British National Corpus
- Corpus Juris Secundum
- Corpus of Contemporary American English (COCA): 425 million words, 1990–2011. Freely searchable online.
- Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB.
- International Corpus of English
- Oxford English Corpus
- Scottish Corpus of Texts & Speech
- Corpus Resource Database (CoRD), more than 80 English language corpora.[3]
European languages
- Bulgarian National Corpus[4]
- CETENFolha
- Croatian Language Corpus
- Croatian National Corpus
- Czech National Corpus[5]
- Google Books Ngram Corpus
- Russian National Corpus
- General Internet Corpus of Russian
- Slovenian National Corpus
- Thesaurus Linguae Graecae (Ancient Greek)
- Eastern Armenian National Corpus (EANC): 110 million words. Freely searchable online.
- National Corpus of Polish
- German Reference Corpus (DeReKo): more than 4 billion words of contemporary written German.
- Free corpus of German mistakes from people with dyslexia
- Spanish text corpus by Molino de Ideas, which contains 660 million words.[6]
- CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania.[7]
- Reference Corpus of Contemporary Portuguese (CRPC)
Middle Eastern languages
- Hamshahri Corpus (Persian a.k.a. Farsi)
- Persian in MULTEXT-EAST corpus (Persian a.k.a. Farsi)[8]
- Amarna letters (for Akkadian, Egyptian, Sumerograms, etc.)
- TEP: Tehran English-Persian Parallel Corpus[9]
- TMC: Tehran Monolingual Corpus, a standard corpus for Persian language modeling[9]
- Persian Today Corpus: The Most Frequent Words of Today Persian, based on a one-million-word corpus (in Persian: Vāže-hā-ye Porkārbord-e Fārsi-ye Emrūz), Hamid Hassani, Tehran, Iran Language Institute (ILI), 322 pp. ISBN 964-8699-32-1
- Kurdish-corpus.uok.ac.ir (Kurdish corpus, Sorani dialect), University of Kurdistan, Department of English Language and Linguistics
- Bijankhan Corpus: a contemporary Persian corpus for NLP research, University of Tehran, 2012
- Quranic Arabic Corpus (Classical Arabic)
- Turkish National Corpus[10]
East Asian languages
- Kotonoha Japanese language corpus[11]
- LIVAC Synchronous Corpus (Chinese)
Parallel corpora of diverse languages
- Europarl Corpus: proceedings of the European Parliament from 1996 to 2011
- EUR-Lex corpus: collection of texts in all official languages of the European Union, created from the EUR-Lex database[12]
- OPUS: Open source Parallel Corpus in many languages[13]
- Tatoeba: a parallel corpus which contains about 2,288,000 sentences in 122 languages.[14]
- NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, cmn, vie)[15] (legacy repo)
- SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[16]
Comparable corpora
- WaCky: the Web-As-Corpus Kool Yinitiative (eng, fre, deu, ita)
- Disambiguating Similar Language Corpora Collection (DSLCC)[17] (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish)
- Wikipedia Comparable Corpora (41 million aligned Wikipedia articles for 253 language pairs)
References
- [1] Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
- [2] "PhraseFinder". A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
- [3] "Corpus Resource Database (CoRD)". Department of English, University of Helsinki.
- [4] "Under Update". search.dcl.bas.bg. Retrieved 12 January 2014.
- [5] https://ucnk.ff.cuni.cz/english/index.php
- [6] (Spanish) "Molinolabs - corpus". molinolabs.com. Retrieved 12 January 2014.
- [7] "CorALit – CorALit - Lietuvių mokslo kalbos tekstynas". coralit.lt. Retrieved 12 January 2014.
- [8] "Available from CLARIN".
- [9] "University of Tehran NLP Lab". ece.ut.ac.ir. Retrieved 12 January 2014.
- [10] "Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage". tnc.org.tr. Retrieved 12 January 2014.
- [11] "KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言". kotonoha.gr.jp. Retrieved 12 January 2014.
- [12] "EUR-Lex Corpus". sketchengine.co.uk. Retrieved 27 October 2016.
- [13] "OPUS - an open source parallel corpus". opus.lingfil.uu.se. Retrieved 12 January 2014.
- [14] "Tatoeba - Number of sentences per language". tatoeba.org. Retrieved 13 January 2014.
- [15] Liling Tan and Francis Bond (14 May 2012). "Building and Annotating the Linguistically Diverse NTU-MC (NTU-Multilingual Corpus)" (PDF). International Journal of Asian Language Processing. 22 (4): 161–174.
- [16] Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of The use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
- [17] Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. 2014. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of The 7th Workshop on Building and Using Comparable Corpora (BUCC).
This article is issued from Wikipedia - version of the 11/15/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.