Custom dictionaries available now

You can now have a custom dictionary created automatically from your own texts. Any language, any text type. Supply a sufficient amount of text and get back a dictionary of the words used in it.

What you get

A quantitative dictionary, like the ones in qlaara, documents exactly how words are used in its underlying text corpus. There is no manual labour involved, which of course drastically reduces the time required to compile a dictionary, but also rules out guesswork and human errors.

The dictionary is created using methods of vector semantics, which group words by their approximate meanings. Current algorithms do not understand the text as a human would. Instead, vector semantics rests on the distributional hypothesis: the idea that words with similar meanings tend to occur in similar contexts. Given a large enough corpus, deep learning and statistical methods are surprisingly good at finding patterns of usage that can be helpful for the dictionary user.
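To make the idea of "similar words occur in similar contexts" concrete, here is a minimal sketch in Python. It is not the method qlaara uses, only a toy illustration: each word is represented by counts of its neighbouring words, and two words are compared by cosine similarity. The function names, the window size and the four-sentence corpus are all assumptions made up for this example.

```python
from collections import Counter, defaultdict
import math


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical); a real corpus would be many orders of magnitude larger.
toy_corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
]

vectors = cooccurrence_vectors(toy_corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_milk = cosine(vectors["cat"], vectors["milk"])
print(f"cat/dog: {sim_cat_dog:.2f}, cat/milk: {sim_cat_milk:.2f}")
```

In this toy corpus "cat" and "dog" share almost all of their contexts, so they score much higher than "cat" and "milk", even though "cat" and "milk" appear in the same sentence. With so little text the numbers mean very little, which is exactly why corpus size matters.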

The current default dictionary in qlaara describes English popular science writing. It’s best used on the same text type but can also help in related genres like technology blogs or user manuals. A new dictionary is needed if you are dealing with other languages or entirely different text types like legal documents or fiction. You can also go more specific, creating a dictionary from texts of your own or those of your company, to add to the corporate style guide.

What is needed

To produce meaningful results, we need a sufficient amount of training material. Exactly how much depends on how variable the texts are. Good results are typically achieved for words that occur at least hundreds of times in the training corpus, which in practice means a total corpus size of at least hundreds of millions of words. If such volumes are not available, you can also try starting smaller, especially in well-defined specialised domains. Plain text format is best.

Rare words, very common words, stopwords, typos, proper nouns, product codes and the like can be filtered out, and there are also several parameters of the preprocessing and training steps to play with. We’ll discuss these with you together with what you expect from the dictionary.
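As a rough illustration of what such filtering can look like, the sketch below drops words that are too rare, too common, listed as stopwords, or not purely alphabetic (a crude stand-in for product codes). The function name `filter_vocabulary`, the thresholds and the sample sentence are assumptions for this example, not the actual qlaara preprocessing; proper nouns and typos would need more than this.

```python
from collections import Counter


def filter_vocabulary(tokens, min_count=2, max_share=0.2, stopwords=frozenset()):
    """Return {word: count} for the words that survive some simple filters.

    min_count  -- drop words seen fewer times than this (rare words, many typos)
    max_share  -- drop words making up more than this share of all tokens
    stopwords  -- explicit list of words to drop
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    kept = {}
    for word, n in counts.items():
        if n < min_count:
            continue  # too rare to say anything reliable about
        if n / total > max_share:
            continue  # so common that it carries little meaning
        if word in stopwords:
            continue
        if not word.isalpha():
            continue  # crude filter for product codes such as "x200"
        kept[word] = n
    return kept


# Hypothetical sample text, including a typo ("enzme") and a product code ("x200").
tokens = (
    "the enzyme binds the substrate and the enzyme releases "
    "the product x200 x200 substrate enzme"
).split()

vocabulary = filter_vocabulary(tokens, stopwords=frozenset({"and"}))
print(vocabulary)
```

Here only "enzyme" and "substrate" remain. Which thresholds make sense depends on the corpus and on what the dictionary is for, so they are better treated as settings to experiment with than as fixed values.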

How it works

You either upload the texts or point us to where to find them. We don’t need to store the texts themselves, so we are happy to follow any copyright or confidentiality restrictions you may have. After discussing what kind of dictionary you need and what is technically possible, we process the material and produce the dictionary. Depending on the amount of text and the training parameters, this may take from several hours to several weeks of processing on our server. The resulting dictionary can be browsed on qlaara.com, where you can make it private or public, or downloaded in standard JSON format and used any way you want.
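Once downloaded, the JSON file can be read with any standard JSON library. The sketch below shows one way to do this in Python. Note that the structure used here (each headword mapped to a list of related words with similarity scores) is purely an assumption for illustration; the actual layout of a qlaara download may differ, and the helper `related_words` is hypothetical.

```python
import json

# Hypothetical sample of a downloaded dictionary; the real schema may differ.
sample = """
{
  "enzyme": [
    {"word": "catalyst", "similarity": 0.77},
    {"word": "protein", "similarity": 0.81}
  ],
  "substrate": [
    {"word": "molecule", "similarity": 0.69}
  ]
}
"""


def related_words(dictionary, headword, min_similarity=0.0):
    """List the words related to `headword`, most similar first."""
    entries = dictionary.get(headword, [])
    ranked = sorted(entries, key=lambda e: e["similarity"], reverse=True)
    return [e["word"] for e in ranked if e["similarity"] >= min_similarity]


dictionary = json.loads(sample)
print(related_words(dictionary, "enzyme"))
print(related_words(dictionary, "enzyme", min_similarity=0.8))
```

For a file on disk, `json.load(open_file)` would replace `json.loads(sample)`.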

To get started, let us know what kind of text corpus you have and what kind of dictionary you would like to get. We’ll then take it from there.
