Concordance Overview



In its simplest form, a Keyword-in-Context (KWIC) Concordance  is a listing of some or all of the words in a text or set of texts, surrounded by the text that they are embedded in.  Here is a section of a concordance of just the first sentence of this page:

 surrounded by the text   that   they are embedded in.  H
sting of some or all of   the    words in a text, or set
of texts, surrounded by   the    text that they are embedd

Typically, the concordance lines would show more of the surrounding text, so the user could more clearly understand how the words are used.
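
For readers who want to see the idea in code, here is a minimal sketch of how a KWIC listing can be built. It is not the course's concordance program; the function name kwic and its width parameter are made up for this illustration, and the input is simply the first sentence of this page.

```python
import re

def kwic(text, keyword, width=20):
    """Return one concordance line per whole-word occurrence of keyword.

    Each line shows up to `width` characters of context on either side.
    The left context is right-aligned so the keywords line up in a column.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        start, end = match.span()
        left = text[max(0, start - width):start]
        right = text[end:end + width]
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

sentence = (
    "In its simplest form, a Keyword-in-Context (KWIC) Concordance is a "
    "listing of some or all of the words in a text or set of texts, "
    "surrounded by the text that they are embedded in."
)

for line in kwic(sentence, "text", width=20):
    print(line)
```

Raising width shows more of the surrounding text on each line. Because the pattern matches whole words only, "texts" and "Context" are not listed under "text".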

The purpose of a concordance is to study how words are used in a language, and to allow us to acquire a deeper understanding of meaning and usage than can be obtained from a dictionary.  As an example, consider the words tan and auburn.  Both can be used to mean a color; both indicate a brownish hue.  This much you can find in a dictionary.  But in a dictionary, you would not find that auburn is used frequently to describe hair color but never to describe skin color.  Nor would you find that tan is not used to describe hair.  A concordance that draws on a large amount of text from the target language, however, could show you many occurrences of these two words at a glance (and other meanings as well, of course, such as the use of tan as an abbreviation for the trigonometric tangent).  In this way you could infer how native speakers use the words, and how these usages may be limited to specific situations.
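
The same kind of inference can be roughed out in code by counting which words follow a keyword. The sketch below is only an illustration: the function name next_word_counts is hypothetical, and the sample sentences are invented for this example, not drawn from any real corpus.

```python
import re
from collections import Counter

def next_word_counts(text, keyword):
    """Count the words that immediately follow `keyword` in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(
        following
        for current, following in zip(words, words[1:])
        if current == keyword
    )

# Invented sample sentences, for illustration only; not real corpus data.
sample = (
    "She brushed her auburn hair. His tan skin showed under the collar. "
    "The girl with auburn hair smiled. After the trip his skin was tan."
)

print(next_word_counts(sample, "auburn"))
print(next_word_counts(sample, "tan"))
```

This simple version ignores sentence boundaries and only looks one word to the right. With a large real corpus, counts like these would hint at the hair and skin pattern described above, but you would still want to read the concordance lines themselves.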

Acquiring this sense of how words are actually used (as opposed to just what they mean) will help in creating the best possible translations.  For example, if you were reading an English story in which someone's skin was described as auburn, you would immediately know that something unusual was intended: perhaps the word is used for comic effect.  Your translation, then, would attempt to accomplish the intended comic effect in the target language.  If you didn't know that auburn is not normally used to describe skin, but only knew that it is a brownish-red color, you would probably just translate the word to the target-language equivalent and lose the intended comic effect.


Clearly, the more text that goes into a concordance, the more useful the results will be.  Doing a concordance on a sentence or paragraph cannot tell you very much about patterns of usage in the language.  For most European languages, tens or even hundreds of megabytes of electronic text are already publicly available.  A single such collection is called a corpus (plural: corpora).

Some corpora consist of a broad selection of materials from the language - novels, plays, newspaper articles, transcriptions of authentic speech, and so on.  Others are specialized - religious documents or political writings or the works of a single author, etc.  Clearly, if you are studying the works of Shakespeare you would want his collected plays and poems in a corpus and nothing more.  If you are interested in the language of current events, you would want newspapers or political writings.  But if you are interested in written language in general, you would want as broad a selection as possible.

The less commonly taught languages generally do not have prepared corpora.  For this course, some medium-sized corpora have been prepared for your use.  With the concordance programs supplied with this course, you will also be able to create your own corpora by downloading documents from the internet or obtaining electronic versions of texts in other ways.  Most likely these corpora will not reach megabyte size, but using your own data with our concordance programs should give you a good feel for the way that concordances can be used.

In fact, the concordance programs used in this course were not developed to process megabytes of text data.  There are professional concordance programs available that can do that, and there are web sites (for some European languages, including English) that can access huge corpora.

Our programs, on the other hand, are free and can be used on any computer with an internet connection and a web browser.  They can use texts that you supply, not just already prepared texts.  However, they will probably choke if you feed them too much data.  (At this writing, 240 printed pages, or a bit less than half a megabyte of text, have been processed without causing a crash.)

Corpus Preparation

If you want to create your own corpus, you can do so by any convenient means... but you must ensure that the data you create is pure text.  By that, I mean text with no hidden format codes, font information, or other embedded information.  Here is one suggested workflow to create a text file and use it in our concordance programs:

Special Notes for Thai:

1. The Thai must be Unicode Thai.  Some web sites use older encodings that will not work.

2. Someone must manually insert spaces between words.  The software cannot guess at word divisions.  Sorry!
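
If you are not sure whether a Thai file is already Unicode, a short script can help. The sketch below makes one assumption that this page does not state: that the "older encoding" is TIS-620, a common legacy Thai encoding. The function names are made up for this example. It tries UTF-8 first, falls back to TIS-620, and then checks for characters in the Thai Unicode block (U+0E00 to U+0E7F).

```python
def to_unicode_text(raw_bytes):
    """Decode bytes to text, trying UTF-8 first and then TIS-620.

    Assumption: any file that is not valid UTF-8 is TIS-620.
    Other legacy encodings would need their own codec name.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("tis_620")

def contains_thai(text):
    """Return True if any character is in the Thai Unicode block."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)

# Two encodings of the same Thai phrase, with a manually inserted space.
legacy_bytes = "ภาษา ไทย".encode("tis_620")
unicode_bytes = "ภาษา ไทย".encode("utf-8")

for raw in (legacy_bytes, unicode_bytes):
    text = to_unicode_text(raw)
    # Once spaces separate the words, splitting on whitespace is enough.
    print(contains_thai(text), text.split())
```

After decoding, saving the text as UTF-8 gives you a Unicode file. The script cannot insert the word spaces for you; it only shows that, once they are there, the words can be separated by splitting on whitespace.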

Concordance Usage and Options

Step 1: Provide a Document to Use

Step 2: Choose Concordance Type

Step 3: Choose Matching Type (only used for single word display)

Note that there is a link on the concordance screen to Regular Expression Help.
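
To give a feel for what a regular expression adds, here is a small sketch. The matching types that the concordance screen actually offers are not listed here, so the two modes below (whole word and regular expression) are assumptions, the function name find_keywords is hypothetical, and the sample sentence is invented. The syntax shown is that of Python's re module, which may differ in detail from what the concordance program accepts.

```python
import re

def find_keywords(text, query, use_regex=False):
    """Return every string in `text` matched by `query`.

    With use_regex=False the query is treated as a literal whole word.
    With use_regex=True the query is used as a regular expression.
    """
    if use_regex:
        pattern = re.compile(query, re.IGNORECASE)
    else:
        pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [match.group() for match in pattern.finditer(text)]

# Invented sample sentence, for illustration only.
sample = "A tanned sailor with tan skin sat tanning near the tannery."

print(find_keywords(sample, "tan"))                        # whole word only
print(find_keywords(sample, r"\btan\w*", use_regex=True))  # any word starting with "tan"
```

The whole-word search finds only "tan", while the pattern \btan\w* also picks up "tanned", "tanning", and "tannery". A pattern like this is handy when you want all the inflected forms of a word in one concordance.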

Step 4: Enter how many characters to display before and after the word


Click on the Submit button to create and display the concordance results.  The actual work is done on the SEAsite server and the results are sent back to your computer and displayed on the screen by your browser.  This may take anywhere from a second to several minutes, depending on the amount of data to be sent back and forth and processed on the server.

The concordance output consists of

You can now

And then...

You can use your browser's Back button to go back and run a new concordance on a new word or corpus.  When you return to that screen, the Reset button will clear all fields to default values.