At Unbabel, we know how important privacy is and we act accordingly, making sure we never compromise our customers’ sensitive data. That’s why we developed the Anonymization step: it automatically removes personally identifiable data before text is dispatched to our community. Credit card and social security numbers, URLs, dates, email addresses, and other personal details are stripped out and replaced by an anonymized placeholder labeled with the type of content it hides, giving context to our editors while ensuring that private data is never at risk.
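Conceptually, this kind of replacement can be sketched as a pattern-matching pass over the text. The patterns and placeholder labels below are illustrative assumptions, not Unbabel’s actual implementation:

```python
import re

# Illustrative sketch: each pattern maps to a typed placeholder, so the
# editor keeps context (what kind of data was there) without seeing the value.
PATTERNS = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    ("URL", re.compile(r"https?://\S+")),
    ("CREDIT_CARD", re.compile(r"\b(?:\d[ -]?){13,16}\b")),
    ("DATE", re.compile(r"\b\d{4}-\d{2}-\d{2}\b")),
]

def anonymize(text: str) -> str:
    """Replace sensitive spans with placeholders labeled by content type."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Contact jane@example.com by 2024-05-01 via https://example.com"))
# → Contact [EMAIL] by [DATE] via [URL]
```

Production systems typically combine such patterns with named-entity recognition, since regular expressions alone miss context-dependent PII like names and addresses.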
Anonymization is a type of annotation that applies to specific parts of the text and can be labeled with a specific category. Annotations help the machine translation system and the community better understand how to handle that part of the text.
While protecting our customers’ data, we also want to give the person working on the text as much context as possible. With names, Polyglot replaces the identified name with a semantic equivalent (i.e., a name of the same gender), so that our community knows which gender to apply.
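A minimal sketch of this gender-preserving substitution, assuming a simple lookup table of stand-in names (the table and function are hypothetical, for illustration only):

```python
# Hypothetical stand-in names, grouped by gender so grammatical agreement
# (pronouns, adjective endings in gendered languages) stays correct.
STAND_INS = {
    "female": ["Maria", "Ana"],
    "male": ["John", "Paul"],
}

def replace_name(name: str, gender: str, index: int = 0) -> str:
    """Return a same-gender stand-in for a detected personal name."""
    pool = STAND_INS[gender]
    return pool[index % len(pool)]

# A detected female name is swapped for a female stand-in,
# so the editor still knows which gender to apply.
print(replace_name("Alice", "female"))  # → Maria
```

The `index` parameter hints at one design concern: if the same person appears several times in a text, the same stand-in should be reused consistently.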
Below, you’ll see the Unbabel translation pipeline. Note how important anonymization is, and where it fits into our ecosystem:
So why is anonymization important?
- Legal compliance: it ensures regulations such as the General Data Protection Regulation (GDPR) are being followed.
- Ethical considerations: it helps to ensure that the privacy rights of individuals are respected.
- Quality assurance: it improves the quality of MT output by reducing the risk of overfitting*. By removing PII (Personally Identifiable Information), the MT system is forced to focus on the linguistic patterns and structures of the text rather than the specific details of its content.
- Security concerns: it prevents data breaches and cyberattacks by reducing the value of the data to potential attackers. Even seemingly innocuous information can be used to infer sensitive details about individuals or organizations.
*Overfitting occurs when an MT is trained too much on a specific set of data, to the point where it starts to "memorize" that data instead of learning the underlying patterns. This can make the MT perform well on the training data, but poorly on new, unseen data. It's like a student who memorizes answers for a specific test but doesn't actually learn the concepts, so they struggle when they encounter new questions.
In other words, the MT becomes too specialized to the training data and loses its ability to generalize to new situations, which also makes it prone to producing biases.
Next, you can see how they appear in Polyglot:
Sensitive data is automatically replaced in a way that doesn’t affect your work: we protect clients’ information while still giving you enough context. You don’t have to edit the placeholders; just make sure the surrounding text is consistent with the information you see.