The basic principles of machine translation engines


Editor's note: Statsbot data scientist Daniil Korbut offers a concise introduction to the basic building blocks of machine learning translation: RNN, LSTM, BRNN, Seq2Seq, Zero-Shot, and BLEU.

Google Machine Translation

 

Every day we use different technologies without even knowing exactly how they work. In fact, it's not easy to understand engines powered by machine learning. The Statsbot team wants to make machine learning clear by telling data stories on this blog. Today, we've decided to explore machine translators and explain how the Google Translate algorithm works.

……

 

Years ago, it was very time-consuming to translate text from an unknown language. Using simple vocabularies with word-for-word translation was hard for two reasons: 1) the reader had to know the grammar rules, and 2) they needed to keep in mind all language versions while translating the whole sentence.

 

Now we don't need to struggle so much: we can translate phrases, sentences, and even large texts just by putting them into Google Translate. But most people don't actually care how the engine of machine learning translation works. This post is for those who do care.

Deep learning translation problems


 

If the Google Translate engine tried to keep the translations of even short sentences, it wouldn't work, because of the huge number of possible variations. The best idea would be to teach the computer sets of grammar rules and have it translate sentences according to them. If only it were as easy as it sounds.

 

If you have ever tried learning a foreign language, you know that there are always a lot of exceptions to the rules. When we try to capture all these rules, exceptions, and exceptions to the exceptions in a program, the quality of translation breaks down.

 

Modern machine translation systems use a different approach: they derive the rules from text by analyzing a huge set of documents.

 

Creating your own simple machine translator would be a great project for any data science resume.


 

Let's try to investigate what hides in the "black boxes" that we call machine translators. Deep neural networks can achieve excellent results in very complicated tasks (speech and visual object recognition), but despite their flexibility, they can be applied only to tasks where the input and target have fixed dimensionality.

 

Recurrent Neural Networks


 

Here is where Long Short-Term Memory networks (LSTMs) come into play, helping us to work with sequences whose length we can’t know a priori.


 

LSTMs are a special kind of recurrent neural network (RNN), capable of learning long-term dependencies. All RNNs look like a chain of repeating modules.


Unrolled recurrent neural network

So the LSTM transmits data from module to module; for example, to generate H_t we use not only X_t, but all previous input values X. To learn more about the structure and mathematical models of LSTMs, you can read the great article "Understanding LSTM Networks."
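To make the recurrence concrete, here is a minimal sketch of running an LSTM over a sequence (using PyTorch, which the article does not mention; the library choice and all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Minimal LSTM sketch: each step t consumes x_t plus the state carried
# over from step t-1, so h_t depends on all previous inputs x_1..x_t.
input_size, hidden_size = 16, 32              # illustrative sizes
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

x = torch.randn(1, 9, input_size)             # a sequence of 9 "word" vectors
outputs, (h_n, c_n) = lstm(x)                 # outputs: h_1..h_9; h_n: final state

print(outputs.shape)   # torch.Size([1, 9, 32]) -- one hidden vector per step
```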

 

Bidirectional RNNs


 

Our next step is bidirectional recurrent neural networks (BRNNs). What a BRNN does is split the neurons of a regular RNN into two directions. One direction is for positive time, or forward states; the other is for negative time, or backward states. The outputs of these two states are not connected to the inputs of the opposite-direction states.

Bidirectional recurrent neural networks

 

To understand why BRNNs can work better than a simple RNN, imagine that we have a sentence of 9 words and we want to predict the 5th word. We can let the network see either only the first 4 words, or the first 4 words and the last 4 words. Of course, the quality in the second case would be better.
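Here is a hedged sketch of that 9-word example (again illustrative PyTorch, not code from the article): with bidirectional=True, the hidden state at each position combines the words before it and the words after it.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM sketch: the forward pass reads words 1..9 left to
# right, the backward pass reads them right to left, and the two hidden
# states are concatenated at every position.
lstm = nn.LSTM(input_size=16, hidden_size=32,
               batch_first=True, bidirectional=True)

sentence = torch.randn(1, 9, 16)   # 9 word vectors, as in the example
outputs, _ = lstm(sentence)

# Position 5 (index 4) now encodes both its left and right context.
print(outputs.shape)    # torch.Size([1, 9, 64]) -- 32 forward + 32 backward
fifth = outputs[0, 4]   # representation used to predict the 5th word
```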

Sequence to sequence model

 

Sequence to sequence


 

Now we're ready to move on to sequence to sequence models (also called seq2seq). The basic seq2seq model consists of two RNNs: an encoder network that processes the input and a decoder network that generates the output.
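Below is a minimal encoder-decoder sketch (PyTorch again; the class, sizes, and names are illustrative assumptions, not from the article): the encoder compresses the source sentence into its final state, and the decoder starts from that state to emit target tokens.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder sketch: two RNNs joined by the hidden state."""
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # The encoder reads the whole source sentence; its final (h, c)
        # state summarizes the input and seeds the decoder.
        _, state = self.encoder(self.src_emb(src_ids))
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)          # logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (1, 7))      # 7 source tokens
tgt = torch.randint(0, 1000, (1, 5))      # 5 target tokens (teacher forcing)
print(model(src, tgt).shape)              # torch.Size([1, 5, 1000])
```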

 

Finally, we can make our first machine translator!


However, let's think about one trick. Google Translate currently supports 103 languages, so we would need 103 × 102 = 10,506 different models, one for each ordered pair of languages. Of course, the quality of these models varies with the popularity of the languages and the amount of documents available for training them. The best we can do is to make one NN that takes any language as input and translates into any other language.

Google Translate


 

That very idea was realized by Google engineers at the end of 2016. The architecture of the NN was built on the seq2seq model, which we have already studied.

The only exception is that between the encoder and decoder there are 8 layers of LSTM-RNN, with residual connections between layers and some tweaks for accuracy and speed. If you want to go deeper, take a look at the article Google's Neural Machine Translation System.
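As a rough illustration of stacking LSTM layers with residual connections (a toy PyTorch sketch, not Google's actual architecture, which differs in details such as where the residuals begin):

```python
import torch
import torch.nn as nn

class ResidualLSTMStack(nn.Module):
    """Illustrative stack of LSTM layers with residual (skip) connections,
    loosely in the spirit of GNMT's deep encoder; not the real system."""
    def __init__(self, size=64, layers=8):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.LSTM(size, size, batch_first=True) for _ in range(layers))

    def forward(self, x):
        for i, lstm in enumerate(self.layers):
            out, _ = lstm(x)
            # Residual connection: add the layer's input to its output,
            # which eases gradient flow through deep stacks.
            x = out + x if i > 0 else out
        return x

stack = ResidualLSTMStack()
print(stack(torch.randn(1, 7, 64)).shape)   # torch.Size([1, 7, 64])
```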

The main thing about this approach is that the Google Translate algorithm now uses only one system, instead of a huge set with one model for every pair of languages.

The system requires a “token” at the beginning of the input sentence which specifies the language you’re trying to translate the phrase into.

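As a sketch of that preprocessing step (the `<2xx>` token format follows the spirit of Google's multilingual system, but the helper below is hypothetical): translating the same sentence into different languages changes only the prefix.

```python
# Hypothetical preprocessing sketch: one shared model, with an artificial
# token at the start of the source sentence selecting the target language.
def add_target_token(sentence: str, target_lang: str) -> str:
    return f"<2{target_lang}> {sentence}"   # token format is illustrative

print(add_target_token("Hello, how are you?", "es"))
# <2es> Hello, how are you?  -> the model is asked to produce Spanish
print(add_target_token("Hello, how are you?", "ja"))
# <2ja> Hello, how are you?  -> same weights, different target language
```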

 

This improves translation quality and even enables translation between pairs of languages the system has never seen together during training, a method termed "Zero-Shot Translation."

 

What does "better translation" mean?

When we talk about improvements and better results from the Google Translate algorithm, how can we correctly evaluate whether one translation candidate is better than another?

It's not a trivial problem: for some commonly used sentences, we have sets of reference translations from professional translators that, of course, differ somewhat from one another.

There are a lot of approaches that partly solve this problem, but the most popular and effective metric is BLEU (bilingual evaluation understudy). Imagine we have two candidates from machine translators:

Candidate 1: Statsbot makes it easy for companies to closely monitor data from various analytical platforms via natural language.

Candidate 2: Statsbot uses natural language to accurately analyze businesses’ metrics from different analytical platforms.

 

Although they have the same meaning, they differ in quality and have different structures.

 

Let’s look at two human translations:


 

Reference 1: Statsbot helps companies closely monitor their data from different analytical platforms via natural language.

Reference 2: Statsbot allows companies to carefully monitor data from various analytics platforms by using natural language.

 

Obviously, Candidate 1 is better, since it shares more words and phrases with the references than Candidate 2 does. This is the key idea of the simple BLEU approach: we compare n-grams of the candidate with n-grams of the reference translations and count the number of matches (independent of their position). We use only n-gram precisions, because calculating recall is difficult with multiple references; the final score is the geometric average of the n-gram precisions.
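Here is a simplified BLEU sketch in Python, applied to the candidates and references above. It computes clipped n-gram precisions and their geometric mean; the real metric also includes a brevity penalty, omitted here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        # Clip each n-gram's count by its maximum count in any reference.
        max_ref = Counter()
        for ref in refs:
            for gram, c in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], c)
        matches = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        log_precisions.append(math.log(max(matches, 1) / total))
    return math.exp(sum(log_precisions) / max_n)

refs = [
    "Statsbot helps companies closely monitor their data from different analytical platforms via natural language.",
    "Statsbot allows companies to carefully monitor data from various analytics platforms by using natural language.",
]
c1 = "Statsbot makes it easy for companies to closely monitor data from various analytical platforms via natural language."
c2 = "Statsbot uses natural language to accurately analyze businesses' metrics from different analytical platforms."
print(bleu(c1, refs) > bleu(c2, refs))   # True: Candidate 1 scores higher
```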

 

……

Now you can evaluate the complex engine of machine learning translation. Next time you translate something with Google Translate, imagine how many millions of documents it analyzed before giving you the best language version.

 

Original article: https://blog.statsbot.co/machine-learning-translation-96f0ed8f19e4

Chinese translation: https://www.zhuanzhi.ai/document/34f52b75edd1c84aa2d6907f63512a0c