On August 5, Beijing time, at the ACL 2021 conference, the organizers formally announced the Best Paper award: the honor went to the ByteDance AI Lab paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation". The study proposes a new vocabulary learning scheme, VOLT, which achieves excellent results on a variety of translation tasks.

This is only the second time in the 59 years since ACL was founded that a team of Chinese scientists has won its highest award. The ACL conference is hosted by the Association for Computational Linguistics and is the top academic conference in the fields of natural language processing and computational linguistics. Since Baidu first made it into ACL in 2010, more and more Chinese faces have appeared at the conference, and the honors have kept coming:

In 2012, Tsinghua University won the Best Student Paper award;

In 2019, the Institute of Computing Technology of the Chinese Academy of Sciences, Tencent WeChat AI, Huawei Noah's Ark Lab, and other collaborators won the Best Long Paper award, while Nanjing University of Science and Technology and the Hong Kong University of Science and Technology won Outstanding Paper awards;

In 2021, besides the Best Paper honor, a joint paper by the Chinese University of Hong Kong and Tencent AI Lab was also named an Outstanding Paper.

According to official ACL 2021 figures, 3,350 papers were submitted; 21.3% were ultimately accepted to the Main Conference and another 14.9% to the Findings track, for a combined acceptance rate of 36.2%. Why did ByteDance AI Lab's VOLT stand out among those 3,350 papers?

The official ACL review states that ByteDance's VOLT offers an effective and novel solution to an important problem in machine translation: it can significantly reduce the time spent learning and searching for vocabularies, and it is expected not only to have an important impact on the research community but also to hold great potential for industrial applications.

Paper: https://arxiv.org/abs/2012.15671

Code: https://github.com/Jingjing-NLP/VOLT

Blog: https://jingjing-nlp.github.io/volt-blog/

VOLT tackles two fundamental questions in natural language processing (NLP): what is the optimal vocabulary, and how can it be generated?

**What is the optimal vocabulary: defining MUV, a marginal-utility-based vocabulary evaluation metric**

To define the optimal vocabulary, first consider which properties of a vocabulary matter most. The paper gives an example: which of the following is better?

To date, subword-level vocabularies are widely used and have been validated on many tasks, so under current understanding subwords can be taken as the better choice. Compared with traditional word-level vocabularies, subword vocabularies are smaller and do not suffer from sparse tokens, i.e., tokens with very low probability in the language. Compared with character-level vocabularies, subwords avoid the problem of excessively high entropy and indistinguishable semantics. Accordingly, the authors design the evaluation metric mainly around two factors: information entropy and vocabulary size.

【Information entropy】 Information entropy can be understood as the average semantic content carried by each token. Intuitively, the smaller the entropy, the simpler the information each token represents, and the easier it is for a model to learn. The authors evaluate this property with a frequency-based entropy calculation, where v is the vocabulary, i is a token in the vocabulary, and P(i) is the token's frequency in the training set:
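Following the published paper, this character-normalized corpus entropy can be written as below (a reconstruction; taking l_v as the frequency-weighted average token length is an assumption of this sketch):

```latex
\mathcal{H}_{v} \;=\; -\frac{1}{l_v}\sum_{i \in v} P(i)\,\log P(i),
\qquad
l_v \;=\; \sum_{i \in v} P(i)\,\operatorname{len}(i)
```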

【Vocabulary size】 The larger the vocabulary, the higher the chance of sparse tokens. It is well known that machine learning is data-hungry; tokens that occur with low probability need more data to be learned well, so the more sparse tokens there are, the more training data is required. Under a frequency-based method, a smaller vocabulary means fewer sparse tokens and fewer parameters. From the perspective of vocabulary size alone, then, the smaller the better.

In general, low entropy and a small vocabulary cannot both be had: the larger the vocabulary, the more parameters are required and the more sparse tokens appear, but the entropy decreases.

To model this trade-off, the authors introduce the concept of marginal utility. Marginal utility measures how much benefit one unit of cost buys: the greater the marginal utility, the better the input-output ratio. The authors treat information entropy as the benefit and vocabulary size as the cost. As the vocabulary grows, vocabularies of different sizes yield different entropy gains; using marginal utility, the authors define MUV, a metric of vocabulary quality, and observe a correlation between MUV and downstream task performance.
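Concretely, the paper defines MUV as the negative derivative of entropy with respect to vocabulary size; for a vocabulary grown from size k to size k + m, this can be written as (reconstructed from the definition above):

```latex
\mathcal{M}_{v(k+m)} \;=\; -\,\frac{\mathcal{H}_{v(k+m)} - \mathcal{H}_{v(k)}}{m}
```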

**How to generate the optimal vocabulary: casting vocabulary search as an optimal transport problem**

Given the evaluation metric MUV, learning the optimal vocabulary becomes, roughly, the problem of finding the vocabulary with the highest MUV. But the search space of vocabularies is not only enormous but also discrete; how can the corresponding vocabulary be learned efficiently? Here the authors cleverly recast vocabulary search as an optimal transport problem. To make this easier to follow, here is a brief review of optimal transport.

【Optimal transport】 About 250 years ago, the French mathematician Monge gave a rigorous analysis of this kind of problem. A more intuitive example: suppose that during World War II some of our front lines signal that they need reinforcements, while our soldiers are scattered across different rear bases. Different front lines need different numbers of soldiers, the bases hold different numbers of soldiers, and the distances between fronts and bases vary. How should the transfers be arranged to minimize the total cost? That is the question optimal transport answers.

Schematic of the optimal transport problem

【Vocabulary search】 Under this framing, characters play the role of soldiers and token candidates play the role of front lines. Each character's supply is its frequency in the training set, and each token candidate's demand for characters is determined by that token's frequency in the training set. For example, if cat appears 20 times in the training set, then cat needs 20 c's, 20 a's, and 20 t's to be assembled. To rule out invalid moves, the authors set the cost of illegal transport to infinity (for instance, carrying the character e to the token cat is illegal). Because the number of characters is finite, some token candidates will inevitably fail to receive their characters; those tokens are kicked out of the final vocabulary, while the tokens that do receive characters make it in. In other words, every transport plan corresponds to a vocabulary; once the transport cost is aligned with the objective of the vocabulary-learning problem, efficient optimal transport solvers can be applied. To turn vocabulary learning into a transport cost, the authors perform some reformulation:
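The supply/demand construction in the cat example can be sketched in a few lines of Python (`build_ot_marginals` is a hypothetical helper for illustration, not part of the VOLT codebase):

```python
from collections import Counter

def build_ot_marginals(token_freqs):
    """Given candidate-token frequencies, derive the character 'supply'
    (how many copies of each character the corpus provides) and the token
    'demand' (how many characters each candidate needs in total)."""
    char_supply = Counter()
    token_demand = {}
    for token, freq in token_freqs.items():
        # Building all `freq` copies of `token` consumes freq * len(token) characters.
        token_demand[token] = freq * len(token)
        for ch in token:
            char_supply[ch] += freq
    return char_supply, token_demand

# 'cat' occurring 20 times needs 20 c's, 20 a's, and 20 t's:
supply, demand = build_ot_marginals({"cat": 20, "at": 5})
```

Token candidates whose demand cannot be met by the available character supply are the ones dropped from the final vocabulary.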

First, MUV can be understood as the (negative) first derivative of entropy with respect to vocabulary size. Since vocabulary size is discrete, the paper introduces a difference quotient to approximate this derivative:

Here H denotes information entropy; the numerator is the change in entropy, and the denominator is the corresponding change in vocabulary size. S is an incremental sequence of sizes, where each element's candidate vocabularies take the previous element's size as a lower bound. For each step there is one vocabulary with the maximum MUV score, so traversing all steps finds the best vocabulary overall. To make the problem easier to solve, the authors approximate the per-step objective, reducing each step to a simpler optimal formulation, namely:
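Putting the pieces together, the two-level search described above can be written as follows (a reconstruction consistent with the MUV definition; S is the incremental sequence of sizes and v_{t-1} denotes the best vocabulary found at the previous step):

```latex
v^{*} \;=\; \arg\max_{t}\;\; \arg\max_{|v| = S[t]}\;\;
-\,\frac{\mathcal{H}_{v} - \mathcal{H}_{v_{t-1}}}{S[t] - S[t-1]}
```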

Each step then becomes the problem of finding the maximum-entropy vocabulary of the corresponding size. The authors use entropy-regularized optimal transport (the Sinkhorn approach) and pose the transport objective as exactly this max-entropy vocabulary search. In the resulting formulation, the first term being minimized is the entropy of the transport matrix, which can be roughly understood as the entropy of the vocabulary; the remaining terms are regularizers that rule out illegal transport. From there, standard solvers can be applied:
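To make the entropy-regularized solver concrete, here is a minimal pure-Python sketch of the Sinkhorn algorithm (not the authors' implementation; `cost`, `supply`, `demand`, and `reg` are illustrative names). It alternately rescales rows and columns of the kernel matrix until the transport plan matches both marginals:

```python
import math

def sinkhorn(cost, supply, demand, reg=0.1, n_iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations.
    cost[i][j]: cost of moving one unit from supplier i to consumer j.
    supply / demand: marginals (must sum to the same total).
    Returns a transport plan whose row sums ~ supply and column sums ~ demand."""
    n, m = len(supply), len(demand)
    # Gibbs kernel: high-cost routes get exponentially small weight.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match supply, then columns to match demand.
        u = [supply[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [demand[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two suppliers, two consumers; moving along the diagonal is free.
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(cost, supply=[1.0, 1.0], demand=[1.0, 1.0])
```

With a small regularizer, nearly all mass follows the cheap diagonal routes; VOLT's "infinite" cost for illegal character-to-token moves plays the same role, driving those entries of the plan to zero.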

In summary, for each step the authors use the algorithm above to find the maximum-entropy vocabulary and compute the current MUV score; traversing all steps then yields the vocabulary with the optimal MUV. Note that when building the token candidate set, for simplicity, the authors borrow the token merges learned by byte pair encoding (BPE). The pseudocode of the VOLT method is shown below:
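The outer loop of that pseudocode can be sketched roughly as follows, with the caveat that the per-step optimal transport solve is replaced here by a simple frequency-based stand-in and all names are illustrative, not the paper's:

```python
import math

def normalized_entropy(token_freqs):
    """Corpus entropy normalized by the (frequency-weighted) average token length."""
    total = sum(token_freqs.values())
    probs = {t: f / total for t, f in token_freqs.items()}
    avg_len = sum(p * len(t) for t, p in probs.items())
    h = -sum(p * math.log(p) for p in probs.values())
    return h / avg_len

def volt_sketch(candidates, step=2):
    """Sketch of VOLT's outer loop: for each size budget in the incremental
    sequence S, pick a vocabulary of that size (here a frequency-based
    stand-in for the OT-solved max-entropy vocabulary), score the step by
    MUV, and keep the best-scoring timestep."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    best, prev_h, prev_size = None, None, 0
    for size in range(step, len(ranked) + 1, step):
        vocab = ranked[:size]
        h = normalized_entropy({t: candidates[t] for t in vocab})
        if prev_h is not None:
            muv = -(h - prev_h) / (size - prev_size)  # marginal utility of this step
            if best is None or muv > best[0]:
                best = (muv, vocab)
        prev_h, prev_size = h, size
    return best

best_muv, best_vocab = volt_sketch(
    {"the": 50, "and": 40, "cat": 20, "at": 12, "an": 15, "a": 30, "t": 25, "c": 10}
)
```

In the real method, the stand-in selection is replaced by the Sinkhorn solve over character-to-token transport, and the candidate set comes from BPE merges.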

Because the method requires no downstream task training, it is very simple and efficient.

**Experimental results**

Below are the results of VOLT-generated vocabularies on bilingual translation. The vocabularies learned by the new method are far smaller than the commonly used ones, while remaining highly competitive in translation quality.

Below are the results on multilingual translation; overall, the method also performs well on two-thirds of the datasets.

**Further thoughts**

For a long time, machine translation has habitually used the same fixed vocabulary sizes, for example 32,000 in bilingual translation. But the experiments in this paper show that there are many choices better than 32,000, especially in low-resource translation.

The VOLT introduced here provides a better tool for vocabulary learning, and the paper also analyzes the influence of vocabulary size on performance. The authors find that generating a BPE vocabulary at the size searched by VOLT yields similar results, so they also recommend VOLT as a way to choose vocabulary size. In addition, the experiments show that a simple baseline model equipped with a VOLT-generated vocabulary achieves scores matching the best constrained results (those without external resources), which may prompt further reflection on how effective baseline models can be.

**References**

[1] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. 2019. Character-level language modeling with deeper self-attention.

[2] Wenhu Chen, Yu Su, Yilin Shen, Zhiyu Chen, Xifeng Yan, and William Yang Wang. 2019. How large a vocabulary does text classification need?

[3] Marco Cuturi. 2013. Sinkhorn distances: Lightspeed computation of optimal transport.

[4] Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing.

[5] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units.