9 June 2009
Google Translator Toolkit
Google Translator Toolkit is a new tool being launched today to help translators organize their work and benefit from shared translations, glossaries and translation memories, the Google China Blog reports (English translation by Google).
Evidence that Google was working on a service like this originally surfaced in August 2008 when references to Google Translation Center appeared in Google’s robots.txt file. At the time, the service was only available to Trusted Testers and most of the pages and screenshots were quickly taken offline. Since those screenshots were produced, it’s clear that a lot of changes have been made to the tool.
The Translation Process
The Google Translator Toolkit Workbench, showing side-by-side editing of Wikipedia’s Google article.
For those not familiar with standard translation processes, a professional translator is likely to use a Computer-aided translation (CAT) tool to help identify and extract snippets of text for translation from various file types.
Google Translator Toolkit currently only allows users to upload HTML, Microsoft Word, OpenDocument Text, Rich Text and Plain Text documents up to 1MB for translation. Alternatively, it’s possible to enter the URL of a file on the web, select a Wikipedia article or a Knol for translation.
Once uploaded or selected, files can be translated using the Workbench interface which shows the source text and the target language translations either side-by-side or above and below each other.
Previously translated segments from the translation memory are suggested and can be rated by yourself and others.
One good reason to share translations with others is so that they can be reviewed for consistency and style. Google allows users to rate translated segments, presumably for style and accuracy. Comments can also be added to the target document, which is especially useful when collaborating with other users.
In addition to the global translation memory, users can also create and share their own TMs.
Many CAT tools allow the translator to store their human translations in a database called a translation memory. The memory can then be used to help with future translation projects by checking to see whether a certain word, phrase, sentence or segment has been translated before. Even if it’s not exactly the same phrase, the translation memory can be used to suggest what’s called a fuzzy match, often indicated by a percentage to reflect how similar the text is.
When translating Wikipedia articles and Knols, the translations are stored in a global, shared translation memory that’s available to everyone by default. That means previously translated phrases from these articles are stored and available for use by other translators using the service, so if they ever find themselves translating the same piece of text, Google will automatically populate the interface with the previous translations to help save time.
Google’s support article explains the process:
Pretranslating your documents
When you upload a document into Google Translator Toolkit, we automatically ‘pretranslate’ your document as follows:
- We divide your document into segments, usually sentences, headers, or bullets.
- We search all available translation databases for previous human translations of each segment.
- If any previous human translations of the segment exist, we pick the highest-ranked search result and ‘pretranslate’ the segment with that translation.
- If no previous human translation of the segment exists, we use machine translation to produce an ‘automatic translation’ for the segment, without intervention from human translators.
We realize for some translators, pre-filling with machine translation may actually slow, not speed up, the translation process. In such cases, you can change your settings to pre-fill the segment with the source text, so you can type over the source text instead of making corrections to automatic translation.
Uploaded documents can benefit from using this global TM too, but if users don’t want to share their translations with everyone, they can create their own translation memories and control exactly which users can make additions and rate translations.
Translators already using CAT tools may have translation memories stored in the Translation Memory eXchange (.tmx) open standard XML format. Google allows translations contained in those TMs to be uploaded and added to existing Google Translator Toolkit TMs, providing they’re no larger than 50MB and confirm to TMX 1.0 or higher.
TMs other than the global TM can also be searched for previously translated segments which can then be rated without opening a translation document.
Glossaries are collections of words and phrases with definitions and notes associate with them. They are often used in the translation process to help choose which phrase is most appropriate and to maintain consistency between translations of technical or specialty subjects. Google Translator Toolkit requires CSV format glossaries to be uploaded (it’s not possible to create one from scratch) which will then be automatically searched for terminology in the segments that are currently being translated.
For a really quick overview of some of these features in action, you can watch this YouTube video:
How could this be useful to Google?
A machine translation of the Google China Blog explains, “Google’s mission is to organize the world’s information and make it universally accessible and useful. Translation of information, in our view is the key to access to information.”
Google has been working on a statistical machine translation system for a few years now, which it started to use for Google Translate instead of Systran in October 2007. Since then it’s been slowly integrating translation into many of its services, including Google Toolbar, Google Talk, Google Reader, Gmail, and YouTube. There’s even an AJAX Language API which anyone can use to build upon.
In my opinion, this latest tool has clearly been designed to help improve Google’s translation offerings. One thing on which statistical machine translation relies is aligned translations. In very simple terms, to help train a statistical machine translation system, text in one language is fed into the system alongside the same text in another language. Will enough text, the system can start to learn how certain phrases should be translated. Without aligned translations, there’s no easy way to know exactly which sentence in the source document relates to the translated version. That’s where translation memories are very useful; they contain aligned translations.
There are literally thousands of Wikipedia articles being translated all the time, but the translations aren’t usually maintained in a translation memory. Through using Google Translator Toolkit, translators could benefit from seeing previously translated text from the global translation memory and, in return, Google could clearly benefit from translators using its interface to translate any content that’s then stored as aligned translations in their global TM, which it can ultimately use to enhance its statistical machine translation system and improve the translations that are provided to end-users of any service using Google Translate.
And as the global TM grows, it might even be possible for end-users to get near-to-human-quality for translations of their documents, websites, blog posts, emails and tweets instantly.
Disclaimer: I am an employee of SDL, a translation company that provides translation services and software.
Labels: blogoscoped, google, translation