Custom dictionaries available now

You can now have a custom dictionary automatically created from your texts. Any language, any text type. Just put in a sufficient amount of text and get out a dictionary of the words used there.

What you get

A quantitative dictionary, like the ones in qlaara, documents exactly how words are used in its underlying text corpus. There is no manual labour involved, which of course drastically reduces the time required to compile a dictionary, but also rules out guesswork and human errors.

The dictionary is created using methods of vector semantics, grouping words by their approximate meanings. Current algorithms do not understand the text as a human would. Instead, vector semantics is based on the distributional similarity hypothesis, or the idea that similar words tend to occur in similar contexts. Given a large enough corpus, deep learning and statistical methods are surprisingly good at finding patterns of usage that can be helpful for the dictionary user.
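
To make the distributional idea concrete, here is a minimal sketch in Python. It is not the method used to build qlaara dictionaries: it swaps in plain co-occurrence counts and cosine similarity in place of the deep learning and statistical methods mentioned above. The function names, the window size and the toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy corpus; a real corpus would be many orders of magnitude larger.
toy_corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
```

In this toy corpus "cat" and "dog" appear in nearly the same contexts, so their similarity comes out higher than that of "cat" and "milk", even though "cat" and "milk" occur in the same sentence. With so little text the numbers mean very little, which is why corpus size matters.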

The current default dictionary in qlaara describes English popular science writing. It’s best used on the same text type but can also help in related genres like technology blogs or user manuals. A new dictionary is needed if you are dealing with other languages or entirely different text types like legal documents or fiction. You can also go more specific, creating a dictionary from texts of your own or those of your company, to add to the corporate style guide.

What is needed

To produce meaningful results, we need a sufficient amount of training material. How much exactly depends on how variable the texts are. Good results are typically achieved for words that occur at least hundreds of times in the training corpus, which means total corpus word counts of at least hundreds of millions. If such volumes are not available, you can also try starting smaller, especially in well-defined specialised domains. Plain text format is best.

Rare words, very common words, stopwords, typos, proper nouns, product codes and the like can be filtered out, and there are also several parameters of the preprocessing and training steps to play with. We’ll discuss these with you together with what you expect from the dictionary.
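
As a rough illustration of this kind of filtering, the sketch below drops rare words, very common words, stopwords and tokens that contain non-letters such as product codes. It is only an assumption about what such a step could look like, not the actual qlaara preprocessing; the function name, the thresholds and the sample text are hypothetical. Typos and proper nouns need more than frequency counts and are not handled here.

```python
from collections import Counter


def filter_vocabulary(tokens, min_count=2, max_share=0.2, stopwords=frozenset()):
    """Return {word: count} for the words that pass all filters."""
    counts = Counter(tokens)
    total = sum(counts.values())
    kept = {}
    for word, n in counts.items():
        if n < min_count:           # too rare to describe reliably
            continue
        if n / total > max_share:   # so common that it carries little meaning
            continue
        if word in stopwords:       # explicitly excluded words
            continue
        if not word.isalpha():      # drops product codes such as "x200"
            continue
        kept[word] = n
    return kept


# Hypothetical sample text, already lowercased and split on whitespace.
tokens = (
    "the probe measured the plasma and the probe logged the plasma "
    "density near unit x200 and unit x200"
).split()

kept = filter_vocabulary(tokens, min_count=2, max_share=0.2, stopwords={"and"})
```

Here `kept` contains only "probe", "plasma" and "unit". Sensible thresholds depend on the corpus and on what the dictionary is for.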

How it works

You either upload the texts or point us to where to find them. We don’t need to store the texts themselves, so we are happy to follow any copyright or confidentiality restrictions you may have. After discussing what kind of dictionary you need and what is technically possible, we process the material and produce the dictionary. Depending on the amount of text and the training parameters, this may take from several hours to several weeks of processing on our server. The resulting dictionary can be browsed on qlaara.com, where you can make it private or public, or downloaded in standard JSON format and used any way you want.

To get started, let us know what kind of text corpus you have and what kind of dictionary you would like to get. We’ll then take it from there.

Myth #4: Vocabulary learning stops in adulthood

Any exposure to texts results in vocabulary growth, also in adults. Image © Dreamstime.com

By the time they reach adulthood, people know the words of their mother tongue. This is what allows us to communicate, right?

Not necessarily. While vocabulary size tests do suggest that word learning stops in early adulthood, there are two things about the word frequency distribution that make such testing mathematically impossible: the distribution is skewed and bursty.

Quantification? Definitely. But conceptual graphs? Perhaps not.

Example of an empirically grounded conceptual graph. It appears that the same colour is called blu in Italian and lilla in Estonian, both colour names transparent for English speakers.

A paper just came out about dictionary data structures in the rare situation when we do have access to concepts: How blue is azzurro? Representing probabilistic equivalency of colour terms in a dictionary, by Mari Uusküla and myself. There were two ideas in that paper, only one of which has survived to be used in qlaara.

What do heresy and empathy have in common

Heretics might not be that bad after all.

Our understanding of how things work is based on a well-established belief system that is usually taken for granted. A set of basic assumptions is placed beyond doubt and everything else, including normal science, is built on that. But what if those assumptions are not the most productive ones? Or if they turn out to be internally inconsistent or in conflict with some other theory that we also like to believe?

Too much noise vs lack of information

Humans are good at ignoring noise. Image © Dreamstime.com

Humans are good at ignoring background noise and retrieving only the information that they need. This ability is supported by generic learning mechanisms and improves with age and life experience.

At the same time, humans do benefit from hints. They use every piece of information, even small and ambiguous ones, to reduce their uncertainty about the speaker's intentions, and in many cases the hints are sufficient despite their ambiguity.

Redundancy is useful

https://upload.wikimedia.org/wikipedia/commons/9/90/Bernardino_Pinturicchio_-_Saint_Jerome_in_the_Wilderness_-_Walters_371089.jpg
Saint Jerome in the Wilderness by Bernardino Pinturicchio. St Jerome is the patron saint of translators.

I used to translate. At the time I honestly believed in the fixed code model of communication: that there is meaning in the source text and that the task of the translator is to convey that meaning in the target text as exactly as possible. That view of language also included the belief that it is possible to convey meanings precisely, concisely, economically.