Blog

9 June 2009

Google Translator Toolkit

Google Translator Toolkit is a new tool being launched today to help translators organize their work and benefit from shared translations, glossaries and translation memories, the Google China Blog reports (English translation by Google).

Evidence that Google was working on a service like this originally surfaced in August 2008 when references to Google Translation Center appeared in Google’s robots.txt file. At the time, the service was only available to Trusted Testers and most of the pages and screenshots were quickly taken offline. Since those screenshots were produced, it’s clear that a lot of changes have been made to the tool.

The Translation Process


The Google Translator Toolkit Workbench, showing side-by-side editing of Wikipedia’s Google article.

For those not familiar with standard translation processes, a professional translator is likely to use a computer-aided translation (CAT) tool to help identify and extract snippets of text for translation from various file types.

Google Translator Toolkit currently only allows users to upload HTML, Microsoft Word, OpenDocument Text, Rich Text and Plain Text documents up to 1MB for translation. Alternatively, it’s possible to enter the URL of a file on the web, or to select a Wikipedia article or a Knol for translation.

Once uploaded or selected, files can be translated using the Workbench interface which shows the source text and the target language translations either side-by-side or above and below each other.


Previously translated segments from the translation memory are suggested and can be rated by yourself and others.

One good reason to share translations with others is so that they can be reviewed for consistency and style. Google allows users to rate translated segments, presumably for style and accuracy. Comments can also be added to the target document, which is especially useful when collaborating with other users.

Translation Memories


In addition to the global translation memory, users can also create and share their own TMs.

Many CAT tools allow the translator to store their human translations in a database called a translation memory. The memory can then be used to help with future translation projects by checking to see whether a certain word, phrase, sentence or segment has been translated before. Even if it’s not exactly the same phrase, the translation memory can be used to suggest what’s called a fuzzy match, often indicated by a percentage to reflect how similar the text is.
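To make the idea of a fuzzy match a bit more concrete, here’s a minimal Python sketch. It assumes a tiny in-memory TM and uses the standard library’s difflib to score similarity; the TM contents, the function name and the 70% threshold are all made up for illustration, and real CAT tools use their own matching algorithms.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: source segment -> stored human translation.
TM = {
    "Click the Save button.": "Cliquez sur le bouton Enregistrer.",
    "Open the File menu.": "Ouvrez le menu Fichier.",
}


def fuzzy_lookup(segment, memory, threshold=0.7):
    """Return (match_percent, source, translation) for the closest TM entry.

    Returns None when nothing in the memory reaches the threshold.
    The threshold is an illustrative assumption, not a standard value.
    """
    best = None
    for source, target in memory.items():
        # ratio() is 1.0 for identical strings and falls towards 0.0
        # as the two strings share less text.
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is None or best[0] < threshold:
        return None
    return (round(best[0] * 100), best[1], best[2])
```

Looking up "Click the Save button." gives a 100% match, whereas "Click the Cancel button." comes back as a lower-scoring fuzzy match against the same entry, which the translator would then correct by hand.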

When translating Wikipedia articles and Knols, the translations are stored in a global, shared translation memory that’s available to everyone by default. That means previously translated phrases from these articles are stored and available for use by other translators using the service, so if they ever find themselves translating the same piece of text, Google will automatically populate the interface with the previous translations to help save time.

Google’s support article explains the process:

Pretranslating your documents

When you upload a document into Google Translator Toolkit, we automatically ‘pretranslate’ your document as follows:

  1. We divide your document into segments, usually sentences, headers, or bullets.
  2. We search all available translation databases for previous human translations of each segment.
  3. If any previous human translations of the segment exist, we pick the highest-ranked search result and ‘pretranslate’ the segment with that translation.
  4. If no previous human translation of the segment exists, we use machine translation to produce an ‘automatic translation’ for the segment, without intervention from human translators.

We realize for some translators, pre-filling with machine translation may actually slow, not speed up, the translation process. In such cases, you can change your settings to pre-fill the segment with the source text, so you can type over the source text instead of making corrections to automatic translation.
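The pretranslation steps quoted above can be sketched roughly as follows. This is only my reading of the support article, not Google’s code: the sentence splitter is deliberately naive, the TM is assumed to be a dictionary of (translation, rating) candidates, and machine_translate is a placeholder rather than a real engine.

```python
import re


def split_segments(text):
    """Step 1: naive segmentation on line breaks and sentence-ending punctuation."""
    segments = []
    for line in text.splitlines():
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                segments.append(part)
    return segments


def machine_translate(segment):
    """Placeholder for a machine translation engine (assumption: just tags the text)."""
    return "[MT] " + segment


def pretranslate(text, memory, prefill_with_source=False):
    """Return a list of (source, prefilled_target, origin) tuples.

    memory maps a source segment to a list of (translation, rating) candidates.
    """
    result = []
    for segment in split_segments(text):
        candidates = memory.get(segment, [])  # step 2: search the TM
        if candidates:
            # Step 3: use the highest-rated previous human translation.
            best = max(candidates, key=lambda candidate: candidate[1])
            result.append((segment, best[0], "human"))
        elif prefill_with_source:
            # The optional setting: pre-fill with the source text instead of MT.
            result.append((segment, segment, "source"))
        else:
            # Step 4: fall back to machine translation.
            result.append((segment, machine_translate(segment), "machine"))
    return result
```

Passing prefill_with_source=True mirrors the setting mentioned at the end of the quote, so the translator types over the source text rather than correcting machine output.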

Uploaded documents can benefit from using this global TM too, but if users don’t want to share their translations with everyone, they can create their own translation memories and control exactly which users can make additions and rate translations.

Translators already using CAT tools may have translation memories stored in the Translation Memory eXchange (.tmx) open standard XML format. Google allows translations contained in those TMs to be uploaded and added to existing Google Translator Toolkit TMs, provided they’re no larger than 50MB and conform to TMX 1.0 or higher.
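For anyone who hasn’t looked inside a TMX file, here’s a small sketch that reads one with Python’s standard library and pulls out the aligned segment pairs. The sample document and the read_tmx function are my own illustration, and I’ve assumed the 50MB limit means 50 × 1024 × 1024 bytes; it doesn’t validate the file against the TMX specification.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
MAX_BYTES = 50 * 1024 * 1024  # assumption: how the 50MB limit is measured

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour le monde.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_tmx(data, source_lang, target_lang):
    """Return a list of (source, target) pairs from TMX bytes.

    Language codes are compared in lowercase.
    """
    if len(data) > MAX_BYTES:
        raise ValueError("TMX file is larger than 50MB")
    root = ET.fromstring(data)
    pairs = []
    for tu in root.iter("tu"):  # each translation unit
        segments = {}
        for tuv in tu.findall("tuv"):  # one variant per language
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang.lower()] = "".join(seg.itertext())
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs
```

Calling read_tmx(SAMPLE, "en", "fr") returns the single English/French pair from the sample.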

TMs other than the global TM can also be searched for previously translated segments which can then be rated without opening a translation document.

Glossaries

Glossaries are collections of words and phrases with definitions and notes associated with them. They are often used in the translation process to help choose which phrase is most appropriate and to maintain consistency between translations of technical or specialty subjects. Google Translator Toolkit requires glossaries to be uploaded in CSV format (it’s not possible to create one from scratch). An uploaded glossary is then automatically searched for terminology that appears in the segments currently being translated.
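As a rough illustration of how a glossary lookup might work, here’s a short Python sketch. The column names (term, translation, notes) are hypothetical and aren’t taken from Google’s documentation, and the matching is a plain case-insensitive substring search, which a real tool would refine.

```python
import csv
import io

# Hypothetical glossary in CSV form; the column layout is an assumption.
GLOSSARY_CSV = """term,translation,notes
glossary,glossaire,Use for terminology lists
fuzzy match,correspondance partielle,Common CAT tool term
"""


def load_glossary(csv_text):
    """Parse the CSV text into a dict keyed by the lowercased source term."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["term"].lower(): row for row in reader}


def find_terms(segment, glossary):
    """Return the glossary entries whose term appears in the segment."""
    lowered = segment.lower()
    return [entry for term, entry in glossary.items() if term in lowered]
```

Running find_terms over the segment being edited gives the entries a workbench could show alongside it, so the translator sees the preferred wording and any notes.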

Learn More

For a really quick overview of some of these features in action, you can watch this YouTube video:

How could this be useful to Google?

A machine translation of the Google China Blog explains, “Google’s mission is to organize the world’s information and make it universally accessible and useful. Translation of information, in our view is the key to access to information.”

Google has been working on a statistical machine translation system for a few years now, which it started to use for Google Translate instead of Systran in October 2007. Since then it’s been slowly integrating translation into many of its services, including Google Toolbar, Google Talk, Google Reader, Gmail, and YouTube. There’s even an AJAX Language API which anyone can use to build upon.

In my opinion, this latest tool has clearly been designed to help improve Google’s translation offerings. One thing on which statistical machine translation relies is aligned translations. In very simple terms, to help train a statistical machine translation system, text in one language is fed into the system alongside the same text in another language. With enough text, the system can start to learn how certain phrases should be translated. Without aligned translations, there’s no easy way to know exactly which sentence in the source document relates to the translated version. That’s where translation memories are very useful; they contain aligned translations.
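Here’s a toy Python sketch of why aligned text matters. It counts which target words tend to appear alongside each source word across a handful of aligned pairs and scores them with a Dice coefficient. This is purely my own illustration with made-up data; it is not how Google’s system works, and real statistical systems use far more sophisticated alignment models.

```python
from collections import Counter
from itertools import product

# Made-up aligned segments, as they might be stored in a translation memory.
ALIGNED = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]


def best_guess(word, pairs):
    """Guess the target word most strongly associated with a source word."""
    cooccur = Counter()   # (source word, target word) -> shared segment count
    src_freq = Counter()  # source word -> number of segments containing it
    tgt_freq = Counter()  # target word -> number of segments containing it
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        for s, t in product(src_words, tgt_words):
            cooccur[(s, t)] += 1
    # Dice coefficient: high when the two words mostly appear together.
    scores = {
        t: 2 * count / (src_freq[s] + tgt_freq[t])
        for (s, t), count in cooccur.items()
        if s == word
    }
    return max(scores, key=scores.get) if scores else None
```

With only three aligned pairs, best_guess("house", ALIGNED) already picks "maison". Without the alignment there would be nothing to count, which is the point of the paragraph above.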

There are literally thousands of Wikipedia articles being translated all the time, but the translations aren’t usually maintained in a translation memory. By using Google Translator Toolkit, translators could benefit from seeing previously translated text from the global translation memory. In return, Google could clearly benefit, because anything translated through its interface is stored as aligned translations in the global TM. Google can ultimately use that data to enhance its statistical machine translation system and improve the translations provided to end-users of any service using Google Translate.

And as the global TM grows, it might even be possible for end-users to get near-human-quality translations of their documents, websites, blog posts, emails and tweets instantly.

[Thanks TOMHTML!]

Disclaimer: I am an employee of SDL, a translation company that provides translation services and software.



17 May 2006

Google Press Day

It was the Google Press Day 2006 last Wednesday. I probably should have blogged about this then.

Anyway, forget about the four new releases. The most exciting part of the webcast was when Sergey Brin (namesake of Google Brin Creator) answered my question about their plans for their statistical machine translation system, around 03:32:45 into the webcast.

Tony Ruscoe asked:

Does Google have any plans in the near future to integrate their statistical machine translation system with services such as Google News, Gmail, Google Talk and even Google Search?

Sergey Brin replied:

We actually – for those of you who haven’t heard – we actually developed a statistical machine translation system that won a number of awards last year and came in first on Chinese/English translation as well as Arabic/English translation. And we’re very excited about it. We’d certainly love to get that launched into our products, and we’re working on it.

It was really developed to be as good as possible in terms of the quality of the translation, not as “productionizable” as possible. So, I know it seems trivial, but it would actually take some work to make that happen. But we’re committed to doing it and I believe we will succeed.

No real surprises there then. Although it possibly sounds like Google may be waiting for Moore’s law to kick in before they integrate their statistical machine translation system with more of their services.

Also see Google Press Day posts elsewhere:



23 January 2006

Free Translation Blog Released

Here I am, banging on about Google and their new stuff all the time, whilst forgetting that the company I work for has got a pretty cool tool of our own. (No, I’m not talking about myself; whilst many people might call me a tool, I’m not sure how many would call me cool.)

FreeTranslation.com has been around since 1999. If you’re between 13 and 21 years old and living in the USA, you’ll probably have used it to translate some insults into a foreign language before emailing them to your friends. (For some reason, most Europeans use another website – but we won’t talk about that here!)

Today we quietly released the FreeTranslation.com Blog. There’s only one post there at the moment, but there are plenty more to come and they should make quite interesting reading. Some people might think that we’re jumping on the blogging bandwagon a few years too late – I think that if you’ve got something to say, it’s better to say it late than never! (Also, trying to justify working on something that isn’t going to make you money is always difficult!)

The fact that I’m posting about something work related like this makes me feel like I’m trying to be to FreeTranslation.com what Matt Cutts is to Google or something! As if... (Well, I can dream can’t I!?!)

Anyway, if you’re interested in translation, add the FreeTranslation.com Blog Feed to your feed reader or aggregator. And if there’s anything you’d like to know, post a comment here or there and we’ll do our best to answer your questions.

BTW, this was going to be an exclusive post but I thought I’d let Chris have the scoop since he’s been so patiently waiting for its release so he could be the first person in the blogosphere to post about it!



28 November 2005

TalkMan: Talking Global with PSP

Whilst my PSP is absolutely brilliant, I can’t help thinking that it must still have loads more to offer than just games, movies and music...

Imagine if I could take my PSP on holiday with me, speak English into it and have it instantly translate what I said into another language and speak it back to me. Well, if I want to translate what I say into Chinese, Korean or Japanese, this review suggests that’s not such a crazy idea after all. TalkMan was recently released in Japan and does exactly that. It’s bundled with a USB microphone that screws into the top of your PSP and uses speech recognition software to try and find a match for what you’ve said in its huge list of common phrases. You can even play games to help you with your language learning and pronunciation.

It’s not quite a Babel fish, but it’s one step closer I guess.

(What else could that USB microphone be used for though? I’m thinking that SingStar for the PSP would certainly keep my fellow tram passengers entertained on the way into work...)

[Via Waxy]
