New function: Find similar words

A few months ago we changed the buttons beside our query box. Where there used to be one button, “Search,” there are now two: “Find patterns” and “Find similar words.” The first, “Find patterns,” does exactly what the old “Search” button did. The second, “Find similar words,” is a new function in beta. Let me introduce it.

To try it, enter a single content word (a noun, verb, or adjective for now) in the query box and click “Find similar words.” The result is something like a thesaurus search, showing words with meanings similar to the query word. More accurately, we show words that ‘behave’ like the query word: the words we list are those that occupy the same paradigmatic slot as the query word in multi-word patterns (that is, in our hybrid n-grams). Our inspiration comes from Zellig Harris (1954; 1968), specifically his notion that words derive their meaning(s) from the contexts of their use, so that words with similar distributions (or sets of contexts) will tend to have similar meanings. The recent computational literature that pursues the implications of Harris is referenced in the paper we link to below. When the list of similar words appears, it includes a column on the far right called “shared patterns,” with a number for each similar word listed. That number tells how many patterns the query word shares with the listed word (that is, in how many patterns the two words are attested in the same paradigmatic slot). Crucially, that number is also a link, and an important one. Please read on.
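The counting behind the “shared patterns” column can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy patterns, the `_` slot marker, and the names `SLOT_FILLERS` and `similar_words` are hypothetical and are not taken from our actual hybrid n-gram data or code. The sketch simply counts, for each other word, the patterns in which it fills the same slot as the query word.

```python
from collections import defaultdict

# Hypothetical toy data: each pattern has one open slot ("_") and a set of
# words attested in that slot. This is NOT the site's real n-gram data.
SLOT_FILLERS = {
    "in the _ of the report": {"light", "context", "case"},
    "in _ of": {"light", "context", "view"},
    "shed _ on": {"light"},
    "the _ of day": {"light", "end"},
}

def similar_words(query, slot_fillers):
    """Rank words by how many patterns they share a slot with `query`."""
    shared = defaultdict(int)
    for fillers in slot_fillers.values():
        if query in fillers:
            for word in fillers:
                if word != query:
                    shared[word] += 1
    # Most shared patterns first; ties broken alphabetically.
    return sorted(shared.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = similar_words("light", SLOT_FILLERS)
for word, count in ranking:
    print(f"{word}\t{count}")
```

In this toy data, ‘context’ comes out on top for the query ‘light’ because it shares two patterns with it, while the other words share only one.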

Noteworthy but easily missed: what makes our ‘similar word’ function different from any thesaurus function we know of is that we can show all of the contexts that the query word shares with any of the listed similar words. To see the contexts a listed word shares with the query word, click on the number in the “shared patterns” column. That yields a list of all the patterns the two words share. Most computational approaches to word similarity render a quantitative measure (a score) of similarity between two words. Our approach, however, is based on the hunch that for language learners the key question concerning two words is not “How similar are they?” (answered by a score) but rather “How are they similar?” (which we answer by simply showing all the contexts they share). We say this, and elaborate on it somewhat, in the first two sections of this paper. We think the answer to the “How are they similar?” question can come in the form of a ‘feel’ for the similarity, one that results from encountering the words often enough being used in similar ways.

For example, the top noun we list as similar to the noun ‘light’ is the noun ‘context’. Simply informing someone of this (or giving a similarity score as evidence) may not strike any chords. But encountering both words side by side in the same slot is a different matter: “in the [context/light] of the report” (the report serving as a context that illuminates), or any of the 110 other contexts they share.
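Listing the contexts two words share amounts to a set intersection, which can be sketched as follows. The index, the `_` slot marker, and the helper names `shared_patterns` and `fill` are hypothetical, and the three patterns per word are made-up stand-ins, not the actual contexts that ‘light’ and ‘context’ share in our data.

```python
# Hypothetical toy index: word -> set of patterns in which that word is
# attested in the open slot "_". Illustrative only, not real site data.
PATTERNS_BY_WORD = {
    "light": {"in the _ of the report", "in _ of", "shed _ on"},
    "context": {"in the _ of the report", "in _ of", "out of _"},
}

def shared_patterns(word_a, word_b, patterns_by_word):
    """Return the patterns in which both words are attested, sorted."""
    a = patterns_by_word.get(word_a, set())
    b = patterns_by_word.get(word_b, set())
    return sorted(a & b)

def fill(pattern, word_a, word_b):
    """Show both words side by side in the pattern's slot."""
    return pattern.replace("_", f"[{word_a}/{word_b}]")

common = shared_patterns("context", "light", PATTERNS_BY_WORD)
for pattern in common:
    print(fill(pattern, "context", "light"))
```

The point of the design is that the output is the shared contexts themselves rather than a score, so the reader sees how the two words are similar.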

This is a beta version, and there is plenty of noise in the results, but hopefully also plenty of goodies: chances to encounter the shared contexts of words directly and so get a feel for how they may resemble each other.

We trained this on nouns, so it may give weaker results for other parts of speech.

We presented our approach and initial results last November at the Joint Symposium on Semantic Processing (JSSP) 2013 (Trento, Italy). The published paper is here.


