Information on extracting terminology


Terminology is the set of terms that identify a specific topic. Terminology extraction is the process of identifying those terms in a text.

The idea is to compare the frequency of words in a given document with their frequency in the language. Words which appear very frequently in the document but rarely in the language are probably terms.
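As a rough illustration of this idea, the sketch below ranks the words of a document by the ratio between their relative frequency in the document and their relative frequency in the language. It is a minimal example under stated assumptions, not the extractor described on this page: the function names, the `floor` value used for words missing from the reference table, and the toy reference frequencies are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words (a simplification)."""
    return re.findall(r"[a-z]+", text.lower())


def term_candidates(document, reference_freq, floor=1e-6, top=5):
    """Rank words by document frequency divided by language frequency.

    reference_freq maps a word to its relative frequency in a generic corpus.
    Words absent from it get the assumed minimum frequency `floor`.
    """
    words = tokenize(document)
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        doc_freq = count / total
        lang_freq = reference_freq.get(word, floor)
        scores[word] = doc_freq / lang_freq
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy reference frequencies, invented for the example.
reference = {
    "the": 0.06, "and": 0.03, "in": 0.02, "was": 0.01,
    "placed": 0.0005, "held": 0.0005, "artery": 0.00005, "stent": 0.000001,
}
text = "The stent was placed in the artery and the stent held"
print(term_candidates(text, reference, top=2))
```

Common words such as "the" score low because they are frequent everywhere, while a word that is frequent in the document but rare in the language rises to the top.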


The extractor uses Poisson statistics, Maximum Likelihood Estimation and Inverse Document Frequency to compare the frequency of words in a given document with their frequency in a generic corpus of 100 million words per language. It uses a probabilistic part-of-speech tagger to take into account the probability that a particular sequence of words could be a term. It builds n-grams of words by minimizing the relative entropy.
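One way to picture how Poisson statistics and Maximum Likelihood Estimation can be combined is sketched below. It is only an assumption about how such a score might be built, not the actual implementation: the rate of a word is estimated from its count in the generic corpus, scaled to the document length, and the score is the negative log-probability of seeing at least the observed count under a Poisson model. The names `expected_count` and `poisson_surprise`, the 0.5 floor for unseen words, and the truncation of the tail sum are all hypothetical choices.

```python
import math


def expected_count(doc_len, corpus_count, corpus_size):
    """Expected occurrences in a document of doc_len words.

    The per-word rate is the maximum likelihood estimate corpus_count / corpus_size.
    Unseen words get an assumed floor of 0.5 occurrences to avoid a zero rate.
    """
    rate = max(corpus_count, 0.5) / corpus_size
    return doc_len * rate


def poisson_log_pmf(k, lam):
    """Natural log of the Poisson probability of exactly k events with mean lam."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_surprise(k, lam, terms=200):
    """Return -log P(X >= k) for X ~ Poisson(lam); larger means more unexpected.

    The upper tail is summed over `terms` values, which assumes lam is well
    below k + terms so the truncated part is negligible.
    """
    logs = [poisson_log_pmf(i, lam) for i in range(k, k + terms)]
    peak = max(logs)
    log_tail = peak + math.log(sum(math.exp(x - peak) for x in logs))
    return max(0.0, -log_tail)


# A word seen 100 times in a 100-million-word corpus, found 5 times in 1,000 words.
lam = expected_count(1000, 100, 100_000_000)
print(poisson_surprise(5, lam))
```

A word that occurs far more often than its expected count receives a high score, which matches the intuition that it is probably a term.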

Why have we developed this?

Translated has developed this technology to make its translators aware of the difficulties in a document and to simplify the creation of glossaries.

We also use it to improve search results in traditional search engines (e.g. Google) by giving a better estimate of how relevant a keyword is to a document.

I want it!

If you are interested in this technology, please read more on Translated Labs and our services for natural language processing.

I could do better!

If you think you could improve these applications, or if you are passionate about information retrieval, natural language processing, machine learning or artificial intelligence in general, you have come to the right place. Send us your CV.