Working with Named Entity Recognition – Unbabel Community Support

At Unbabel we are deeply committed to supporting our community members in their daily work on the platform, which is why we do our best to regularly provide updates & feedback whenever possible, while also ensuring that quality remains high for our clients.

This is why we want to share some information about Named Entity Recognition (NER) in Unbabel’s tasks, and explain why it is important that you recognize named entities and edit them whenever necessary.

What is Named Entity Recognition (NER)?

NER is a subtask of information extraction that involves detecting and categorizing important pieces of information in a text, known as named entities. Named entities are the key subjects of a text, such as names, locations, companies, events, and products, as well as times, monetary values, and percentages.

The main goal of entity recognition is to improve the quality of the machine translation (MT) output and to avoid MT hallucinations.

Another goal, in some cases, is to localize content according to client or language preferences once Unbabel recognizes a string or chunk of text as an entity. For example, a numeric entity like 1600 in the source can be rendered as 1.600 in the target language if that's the expected format. In other cases, such as company names, recognition may keep the entity untranslated. By underlining entities in the interface, we aim to draw your attention to them.
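To make the 1600 to 1.600 example concrete, here is a minimal Python sketch of that kind of number localization. The function name `localize_integer` and its `group_sep` parameter are made up for illustration; this is not how Unbabel's system is implemented, and the right separator always depends on the client and target language.

```python
def localize_integer(value: int, group_sep: str = ".") -> str:
    """Render an integer with the thousands separator the target locale expects.

    Hypothetical helper for illustration only; group_sep is an assumed parameter.
    """
    # Python's "," format option groups digits in threes using a comma;
    # we then swap the comma for the separator the target language uses.
    return f"{value:,}".replace(",", group_sep)


print(localize_integer(1600))       # 1.600
print(localize_integer(1600, ","))  # 1,600
```

When you review a numeric entity, you are doing the same check by eye: is the separator in the target the one the client and language expect?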

These entities are frequently the most important elements in a segment. If we make a mistake in translating, omitting, or mishandling a named entity, there is a high chance of losing or, even worse, falsifying crucial information.

What is the difference between NER and PII (Personally Identifiable Information, which is anonymized)?

In short, PII refers to personal information that can identify individuals, while NER focuses on identifying and categorizing named entities in text. PII is broader in scope and encompasses various types of personal data, while NER is a specific technique used for information extraction and understanding in text processing. And while PII doesn’t need any intervention from your side, NER sometimes does.

In Unbabel tasks, named entities appear as GRAY UNDERLINED annotations, as in the example below.

Depending on the client, different entities will appear TRANSLATED, TRANSLITERATED, or remain UNTRANSLATED (same as source). However, sometimes they can be MISTRANSLATED.

This is why it’s vital that you check the context carefully (the sidebar can be helpful) and determine whether an entity requires editing, as these gray underlined words are editable.

Note: Green underlined words, however, are client-approved glossary terms that should only be changed to make sure they fit in context when it comes to gender, number, or grammatical case.

What errors may arise and why is it important to check and correct them?

Artificial Intelligence (AI) systems, including our NER system, rely on statistical models rather than fixed rules to make predictions and decisions, so their results are not always predictable or consistent. This means that what is recognized as an entity depends on the context in which it appears: the same brand name may be recognized, and therefore underlined in gray, in one paragraph but not in the next, simply because it appears in a different context.

Because of how the recognition process works, occasional errors are also possible: words that are actually verbs, common nouns, or adjectives (that is, not entities) may be underlined.

Some examples of issues you might encounter are:

some entities may not be recognized and are therefore not underlined (not very frequent)
words that are not entities may be mistakenly recognized as entities and underlined. This happens because the NER system is prone to assigning entity status to words that start with an uppercase letter. In these cases, the words may be left untranslated
partial recognition, which, depending on the case, can leave parts of the text untranslated
a correctly identified entity is handled inappropriately, such as a date that is not localized or a company name that was translated when it shouldn't have been

In all of the above cases, you need to ensure that the translations you submit are correct and follow client specifications.

In the example below, you can see how NER captures “Brand-New Medium SUV” (not an entity, but a common noun phrase) along with the real entity “HAVAL” (the name of an automotive brand). This results in a chunk of untranslated text in the Russian target that needs to be translated.
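To see why capitalized words are easily over-captured, consider the toy Python sketch below. It is a deliberately naive rule (any run of capitalized words counts as an "entity"), not Unbabel's NER system, which is statistical and context-dependent. The function name `naive_entities` and the sample sentence are invented for illustration.

```python
import re

# Toy rule (an assumption for illustration): a run of consecutive words that
# each start with an uppercase ASCII letter is treated as a single "entity".
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*")


def naive_entities(text: str) -> list[str]:
    """Return every run of capitalized words found in the text."""
    return CAPITALIZED_RUN.findall(text)


# Invented sample sentence, modelled on the HAVAL case described above.
print(naive_entities("the Brand-New Medium SUV from HAVAL is here"))
# ['Brand-New Medium SUV', 'HAVAL']
```

The rule finds the real brand name, but it also flags "Brand-New Medium SUV" only because of its capital letters. A phrase flagged like this may be left untranslated, which is why it needs your review.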

In the sidebar, whenever a segment is active, you will see additional context for the recognized terms, which can also contain insight into the expected treatment, so we encourage you to always check it:

In conclusion, Named Entity Recognition plays a vital role in maintaining accuracy and quality in translations. This is why we need you to make sure named entities are handled correctly and to avoid any loss or distortion of important information. Remember, thorough attention to named entities contributes to producing precise and contextually relevant translations.
