How StringNet differs from a corpus and why

The most widely known resources for accessing language data for analysis are corpora, and the most common way of accessing corpus content is through concordancing software, whose most common function is the keyword-in-context (KWIC) search. StringNet isn’t a corpus. Here I will try to sketch the basic differences and some of the reasons for them.

StringNet is a massive archive of multiword patterns that have been statistically derived from a corpus (the BNC). Further, StringNet massively cross-indexes these patterns of English (2.2 billion of them) to each other, making it easy to navigate from one pattern to other related patterns (say, from ‘count yourself lucky’ to ‘consider yourself lucky’ to ‘consider yourself fortunate’ to ‘[verb] yourself [adj]’ and on to many others). Below I try to clarify how this makes StringNet different from corpora and their concordancing software. I hope to show how StringNet relies on corpora but tries to bridge the gap between what users might want from corpora and what corpora and concordancing tools offer.
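To make the idea of cross-indexed patterns a little more concrete, here is a minimal Python sketch. It is not how StringNet is actually built (this post does not describe the implementation). The pattern list, the `*` wildcard key, and the function names `build_index` and `neighbours` are all assumptions made up for illustration. The sketch links two patterns whenever they differ in exactly one slot, which is one simple way to get from ‘count yourself lucky’ to ‘consider yourself lucky’ to ‘consider yourself fortunate’.

```python
from collections import defaultdict

# Toy pattern list, invented for illustration (not taken from StringNet).
PATTERNS = [
    ("count", "yourself", "lucky"),
    ("consider", "yourself", "lucky"),
    ("consider", "yourself", "fortunate"),
    ("consider", "yourself", "warned"),
]


def build_index(patterns):
    """Map each pattern-with-one-slot-wildcarded to the patterns that fit it."""
    index = defaultdict(set)
    for pattern in patterns:
        for i in range(len(pattern)):
            key = pattern[:i] + ("*",) + pattern[i + 1:]
            index[key].add(pattern)
    return index


def neighbours(pattern, index):
    """Return the patterns that differ from `pattern` in exactly one slot."""
    related = set()
    for i in range(len(pattern)):
        key = pattern[:i] + ("*",) + pattern[i + 1:]
        related |= index.get(key, set())
    related.discard(pattern)  # a pattern is not its own neighbour
    return related


index = build_index(PATTERNS)
for step in neighbours(("count", "yourself", "lucky"), index):
    print(" ".join(step))
```

Starting from ‘count yourself lucky’, the only neighbour in this toy list is ‘consider yourself lucky’, and from there the index leads on to ‘consider yourself fortunate’ and ‘consider yourself warned’. A more abstract pattern such as ‘[verb] yourself [adj]’ would need part-of-speech information, which this sketch leaves out.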

Some basic distinctions influence what we can get out of corpora, yet they seem to fly under the radar. Some that we’ve tried to keep in mind in designing StringNet are: (1) token vs type; (2) syntagmatic vs paradigmatic patterns of word behavior; and (3) finding what I am looking for in a corpus vs discovering what I wouldn’t have thought to look for. We’ve published some work elaborating on these (see the reference below), but here’s a thumbnail of just one of them: tokens vs types.

Corpora are collections of tokens, that is, of instances (Saussure’s ‘parole’). What puts off many learners using corpora is that KWIC searches yield tokens, and tokens are of little interest in and of themselves. Tokens are interesting to corpus users mainly as windows onto what they betoken, and what they betoken is types, that is, patterns (in this case, patterns of word use). And the path from tokens in concordance lines to patterns of recurrent word behavior can be a long and winding one. StringNet has tried to distill the recurrent patterns of word use from the British National Corpus. So rather than the list of tokens that a concordancer provides, a response to a StringNet query is a list of types, that is, of the patterns in which the query word participates. Clicking on the ‘example’ icon beside one of those listed patterns, in turn, yields tokens of that pattern used in sentences.
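The token/type contrast can be shown with a toy example. The Python sketch below is only an illustration under stated assumptions: the five sentences are invented (they are not BNC data), the function names `kwic` and `recurrent_ngrams` are hypothetical, and simple counting of repeated three-word sequences stands in for whatever statistics StringNet really uses. The first function returns concordance lines, one per token; the second returns recurrent word sequences, one per type.

```python
from collections import Counter

# Toy corpus, invented for illustration (not BNC data).
SENTENCES = [
    "you should count yourself lucky",
    "consider yourself lucky this time",
    "i count myself lucky to be here",
    "just count yourself lucky",
    "please consider yourself lucky",
]


def kwic(sentences, keyword, width=2):
    """Return one (left, keyword, right) line per occurrence of keyword."""
    lines = []
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word == keyword:
                left = " ".join(words[max(0, i - width):i])
                right = " ".join(words[i + 1:i + 1 + width])
                lines.append((left, word, right))
    return lines


def recurrent_ngrams(sentences, keyword, n=3, min_count=2):
    """Count n-grams containing keyword; keep those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            gram = tuple(words[i:i + n])
            if keyword in gram:
                counts[gram] += 1
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Tokens: every instance of the keyword in its context.
for left, word, right in kwic(SENTENCES, "lucky"):
    print(f"{left:>20} [{word}] {right}")

# Types: the recurrent patterns those instances add up to.
for gram, count in recurrent_ngrams(SENTENCES, "lucky").items():
    print(" ".join(gram), count)
```

With this toy data the concordance has five lines to read through, while the pattern list has just two entries, ‘count yourself lucky’ and ‘consider yourself lucky’, each seen twice. The concordancer leaves the reader to spot the pattern; the second function hands it over directly.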

By the way, click on everything you see in the StringNet query results. The clickability is intended to encourage exploration through the net that is StringNet.

Here’s a chapter where we say more on all this:

David Wible and Nai-Lung Tsao (2011). “Towards a New Generation of Corpus-derived Lexical Resources for Language Learning,” in Meunier F., De Cock S., Gilquin G. and Paquot M. (eds). A Taste for Corpora. Amsterdam: John Benjamins.
