MT — checklist before you even bother

This is a follow-up to my earlier post on Microsoft Translator Hub (MTH) improvements.

MT has been a buzzword, and sometimes an overhyped one, among early (or by now not-so-early) technology adopters who are translators. Setting up an MT engine and maintaining it on your own, however, can take a lot of time and effort and still produce disappointing results. These simple questions will help you decide whether it’s even worth your while.

Don’t bother training an MT engine if:

  • You don’t have a good-quality TM of at least 1,000 segments.
  • It’s not a recurring project and it contains fewer than 10,000 words.
  • The text reads like a novel (i.e. written by different authors in different styles, with highly variable sentence length and ample use of synonyms).
  • Your CAT tool doesn’t have an API for MT.

Meeting these prerequisites will save you a lot of time, which you can spend with the people you like instead.
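The checklist above can be sketched as a quick go/no-go check. This is just an illustration of the decision logic; the function and parameter names are my own invention, and the thresholds are taken straight from the list above.

```python
# A quick go/no-go check mirroring the checklist above.
# Field names are hypothetical; thresholds come from the post.

def worth_training(tm_segments, project_words, recurring,
                   consistent_style, cat_has_mt_api):
    """Return True only if none of the 'don't bother' conditions apply."""
    if tm_segments < 1000:
        return False                      # TM too small
    if not recurring and project_words < 10000:
        return False                      # one-off, small project
    if not consistent_style:
        return False                      # "written like a novel"
    if not cat_has_mt_api:
        return False                      # no way to plug MT into your CAT tool
    return True

print(worth_training(5000, 20000, True, True, True))   # → True
print(worth_training(800, 20000, True, True, True))    # → False
```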

HT vs. MT


Human translation versus raw machine output. There is a huge difference, of course, and it’s all about quality and utility…

From my perspective, one of the best videos demonstrating the difference was created by Elan Languages. I don’t know who they are (yet), but they certainly hired a great scriptwriter and director. Be sure to check it out yourself; I’m sure you will enjoy it.


MTH raises MT training standards


One of the primary problems with trained MT engines is, of course, poor output. With free solutions, part of the problem lies in the limited room for customization. Microsoft Translator Hub now claims to have overcome the two main bottlenecks on the path to better output: the inability to update a glossary on the fly and the need for a huge number of TM segments to train the engine properly. Here’s an excerpt from their blog:

Dictionary only training: You can now train a custom translation system when you just have a dictionary and no other parallel documents. There is no minimum size to that dictionary. It should have at least one entry. Simply upload the dictionary, which is an Excel file with the language ID as column header, include it in your training set, and hit train. The training completes very quickly, and you are ready to deploy.
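To make the excerpt concrete, here is a minimal sketch of what such a dictionary training set looks like: one column per language, with the language IDs as the column headers. The rows below are placeholder entries of my own; in practice you would save this as an actual Excel sheet (e.g. with a library such as openpyxl), which is an assumption on my part, not something the excerpt prescribes.

```python
# Sketch of a dictionary-only training set for the Hub:
# one column per language, language IDs as the column headers.
# Entries here are placeholders; real ones come from your glossary.

header = ["en", "de"]                 # language IDs as headers
entries = [
    ("torque wrench", "Drehmomentschlüssel"),
    ("ball bearing", "Kugellager"),
]

rows = [header] + [list(pair) for pair in entries]
for row in rows:
    print("\t".join(row))             # tab-separated preview of the sheet
```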

Training with 1000 parallel sentences only: You can now train a custom system with only 1000 parallel sentences. Use 500 sentences for the tuning set and 500 sentences in the test set. The Hub will build a system based on Microsoft models, and will tune the models to your tuning set, giving you a better adjusted system than the generic translation system. You can use in-domain target language documents as part of this training as well. The 1000 sentences must be unique and pass the Hub’s data filtering. (Source)
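The split the excerpt describes — 500 sentences for tuning, 500 for testing, all unique — can be sketched like this. The sentence pairs below are synthetic placeholders; real ones would come from your TM export.

```python
import random

# Sketch: split 1,000 unique parallel sentences into the 500-sentence
# tuning set and the 500-sentence test set the Hub asks for.
# Placeholder pairs stand in for a real TM export.
pairs = [(f"src sentence {i}", f"tgt sentence {i}") for i in range(1000)]

# The Hub requires the 1,000 sentences to be unique, so de-duplicate
# first while preserving order.
unique_pairs = list(dict.fromkeys(pairs))

random.seed(0)                        # reproducible shuffle
random.shuffle(unique_pairs)

tuning_set = unique_pairs[:500]
test_set = unique_pairs[500:1000]
```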

If this really works as described, it is a fascinating breakthrough for individual translators who train MT engines for recurring projects of a technical nature. It’s about time you checked it out yourself.

Photo credit: Crocodile via photopin (license)