Paper Review

[NLP] Neural Machine Translation by Jointly Learning to Align and Translate

_Sun_ 2023. 2. 28. 16:39

This paper is the first to introduce Attention, a key concept in NLP; the mechanism is also known as Bahdanau Attention.
Link to the Bahdanau Attention paper

Abstract

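Neural machine translation aims to build a single neural network that can be jointly tuned to maximize translation performance.
The paper proposes extending the basic encoder-decoder architecture with a model that automatically (soft-)searches for the parts of the source sentence relevant to predicting each target word.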

1. Introduction

Types of machine translation

phrase-based translation system (the traditional approach)
- many small sub-components that are tuned separately
Encoder-Decoder
- encoder
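  - encodes the source sentence into a fixed-length vector
- decoder
  - produces the translation from the encoded vector
- jointly trained to maximize the probability of a correct translation given the source sentence
Neural machine translation
- builds and trains a single large neural network that reads a sentence and outputs the correct translation
- most neural machine translation models belong to the encoder-decoder family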

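Limitations of the Encoder-Decoder

The neural network must be able to compress all the necessary information of the source sentence into a fixed-length vector.
The performance of a basic encoder-decoder deteriorates rapidly as the length of the input sentence increases.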

RNNsearch

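An extension of the encoder-decoder model that learns to align and translate jointly.
It does not encode the entire input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and adaptively chooses a subset of these vectors while decoding the translation.
Jointly learning to align and translate yields significantly improved translation performance over the basic encoder-decoder approach.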

2. Background: Neural Machine Translation

Translation from a probabilistic perspective
- Given a source sentence $x$, find the target sentence $y$ that maximizes the conditional probability of $y$ ⇒ $\arg\max_{y}p(y|x)$
Two components of neural machine translation
- the first encodes the source sentence $x$
- the second decodes it into the target sentence $y$

2.1 RNN Encoder - Decoder

encoder

input sentence: a sequence of vectors $x=(x_1,...,x_{T_x})$, read into a vector $c$
hidden state at time $t$: $h_t=f(x_t,h_{t-1})$
$c$ is generated from the hidden states: $c=q(\{h_1,...,h_{T_x}\})$
$f$, $q$: nonlinear functions

decoder

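Predicts the next word $y_{t'}$ given the context vector $c$ and all previously predicted words $\{y_1,...,y_{t'-1}\}$.
In other words, the decoder defines the probability of the translation $y$ by decomposing the joint probability into ordered conditionals: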

$$ p(y)= \prod_{t=1}^Tp(y_t|\{y_1,...,y_{t-1}\},c) $$

Each conditional probability is modeled with an RNN:
$$ p(y_t|\{y_1,...,y_{t-1}\},c)=g(y_{t-1},s_t,c) $$
- $g$ : nonlinear, potentially multi-layered, function that outputs the probability of $y_t$
- $s_t$ : hidden state of the RNN
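The encoder recurrence and the factorized sentence probability above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: `f` is taken to be a simple `tanh` cell, `q` is taken to return the last hidden state, and the names `rnn_step`, `encode`, and `sentence_log_prob` as well as all weight values are hypothetical, not taken from the paper.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(W, U, x_t, h_prev):
    """h_t = f(x_t, h_{t-1}), assuming f = tanh(W x_t + U h_{t-1})."""
    a = matvec(W, x_t)
    b = matvec(U, h_prev)
    return [math.tanh(ai + bi) for ai, bi in zip(a, b)]

def encode(W, U, xs):
    """Run the encoder over xs and return all hidden states plus c."""
    h = [0.0] * len(U)          # initial hidden state h_0 = 0 (assumption)
    hs = []
    for x_t in xs:
        h = rnn_step(W, U, x_t, h)
        hs.append(h)
    c = hs[-1]                  # q({h_1, ..., h_Tx}) = h_Tx (assumption)
    return hs, c

def sentence_log_prob(cond_probs):
    """log p(y) = sum_t log p(y_t | y_1..y_{t-1}, c)."""
    return sum(math.log(p) for p in cond_probs)

# Toy usage: three one-hot "words", hidden size 2, made-up weights.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hs, c = encode(W, U, xs)
print(c, sentence_log_prob([0.5, 0.5]))
```

However long the input is, `c` has the same size here, which is exactly the fixed-length bottleneck that the introduction describes.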

3. Learning to align and translate

3.1 Decoder: General description

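conditional probability

$$ p(y_i|y_1,...,y_{i-1},x)=g(y_{i-1},s_i,c_i) $$

Unlike the basic encoder-decoder in Section 2.1, the probability is conditioned on a distinct context vector $c_i$ for each target word $y_i$.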

$s_i$ : RNN hidden state for time i

$$ s_i=f(s_{i-1},y_{i-1},c_i) $$

$c_i$ : distinct context vector for each target word, which depends on a sequence of annotations $(h_1,...,h_{T_x})$
$$ c_i=\sum_{j=1}^{T_x}\alpha_{ij}h_j $$
$\alpha_{ij}$ : weight of each annotation $h_j$
$$ \alpha_{ij}={\exp(e_{ij}) \over \sum_{k=1}^{T_x}\exp(e_{ik})} $$

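$y_i$ : each target word

$e_{ij}$ : score from the alignment model $a$, measuring how well the inputs around position $j$ and the output at position $i$ match

$$ e_{ij}=a(s_{i-1},h_j) $$

$\alpha_{ij}$ : probability that the target word $y_i$ is aligned to, or translated from, a source word $x_j$

The probability $\alpha_{ij}$, or its associated energy $e_{ij}$, reflects the importance of the annotation $h_j$ with respect to the previous hidden state $s_{i-1}$ in deciding the next state $s_i$ and generating $y_i$.

Unlike in traditional machine translation, the alignment is not treated as a latent variable. The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through it.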
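To make the three formulas for $e_{ij}$, $\alpha_{ij}$, and $c_i$ concrete, here is a minimal sketch of one attention step in plain Python. The alignment model is written as a single-hidden-layer feedforward network, $a(s_{i-1},h_j)=v_a^\top\tanh(W_a s_{i-1}+U_a h_j)$, which is the form described in the paper's appendix; the function names (`additive_score`, `attend`) and every numeric value below are hypothetical and chosen only for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """alpha_ij = exp(e_ij) / sum_k exp(e_ik), shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_score(v_a, W_a, U_a, s_prev, h_j):
    """e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)."""
    ws = matvec(W_a, s_prev)
    uh = matvec(U_a, h_j)
    return sum(v * math.tanh(a + b) for v, a, b in zip(v_a, ws, uh))

def attend(v_a, W_a, U_a, s_prev, annotations):
    """Return the weights alpha_i and the context vector c_i for one decoder step."""
    energies = [additive_score(v_a, W_a, U_a, s_prev, h) for h in annotations]
    alphas = softmax(energies)
    dim = len(annotations[0])
    # c_i = sum_j alpha_ij * h_j
    context = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, context

# Toy usage with made-up values: three annotations of size 2.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_prev = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = attend([1.0, 1.0], identity, identity, s_prev, annotations)
print(alphas, context)
```

Every operation in `attend` is differentiable, which is why the alignment can be trained jointly with the rest of the network instead of being inferred as a latent variable.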

3.2 Encoder : Bidirectional RNN for Annotating sequences

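The paper uses a bidirectional RNN (BiRNN) so that each word's annotation summarizes both the preceding and the following words.

BiRNN structure

forward RNN

$\overrightarrow{f}$ : reads the input sequence in order and computes a sequence of forward hidden states $(\overrightarrow{h}_1,...,\overrightarrow{h}_{T_x})$

backward RNN

$\overleftarrow{f}$ : reads the sequence in the reverse order and computes a sequence of backward hidden states $(\overleftarrow{h}_1,...,\overleftarrow{h}_{T_x})$

The annotation for each word $x_j$ is obtained by concatenating the forward and backward hidden states: $h_j=[\overrightarrow{h}_j^\top;\overleftarrow{h}_j^\top]^\top$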
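The construction of the annotations can be sketched as follows. This is a simplified illustration, not the paper's encoder: it assumes a plain `tanh` recurrent cell in place of the gated units used in the paper, and the helper names (`run_rnn`, `birnn_annotations`) and weights are hypothetical.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def run_rnn(W, U, xs):
    """Return the hidden state after each input, assuming a tanh cell."""
    h = [0.0] * len(U)
    states = []
    for x_t in xs:
        a = matvec(W, x_t)
        b = matvec(U, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        states.append(h)
    return states

def birnn_annotations(fwd_params, bwd_params, xs):
    """h_j = [forward h_j ; backward h_j] for every source position j."""
    forward = run_rnn(*fwd_params, xs)
    # Read the sentence right to left, then flip back so index j lines up.
    backward = run_rnn(*bwd_params, xs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]

# Toy usage: three one-hot "words", hidden size 2 per direction.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fwd = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [[0.1, 0.0], [0.0, 0.1]])
bwd = ([[-0.4, 0.6, 0.2], [0.7, -0.1, 0.9]], [[0.2, 0.0], [0.0, 0.2]])
annotations = birnn_annotations(fwd, bwd, xs)
print(annotations)
```

In this sketch the first half of `annotations[j]` has seen words 1 to j and the second half has seen words j to the end, so each annotation covers the whole sentence while focusing on the words around position j.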

4. Experiment settings

The proposed model is evaluated on English-to-French translation.

4.1 Dataset

WMT’14

Europarl (61M words)
news commentary (5.5M words)
UN (421M words)
two crawled corpora (90M and 272.5M words)

4.2 Models

RNN Encoder-Decoder
RNNsearch : the model proposed in this paper

Experimental Designs

Each model is trained twice
- first with sentences of up to 30 words
- second with sentences of up to 50 words
RNNencdec configuration
- encoder, decoder: 1000 hidden units
RNNsearch configuration
- encoder : forward and backward recurrent neural networks (RNN), 1000 hidden units each
- decoder: 1000 hidden units
Training procedure
- multilayer network with a single maxout hidden layer to compute the conditional probability of each target word
- minibatch stochastic gradient descent (SGD)
  - each SGD update direction is computed from a minibatch of 80 sentences
- once training is complete, beam search is used to find a translation that approximately maximizes the conditional probability.
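The beam search step can be illustrated with a small generic sketch. It is not the decoding code used in the paper: `step_fn` is a hypothetical stand-in for the trained decoder that returns log-probabilities of the next token given a prefix, and `toy_step` is a made-up distribution chosen so that greedy decoding and beam search disagree.

```python
import math

def beam_search(step_fn, beam_size, max_len, eos="</s>"):
    """Keep the beam_size best partial translations at every step."""
    beams = [(0.0, [])]               # (cumulative log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, tokens in beams:
            for tok, log_p in step_fn(tokens).items():
                candidates.append((score + log_p, tokens + [tok]))
        candidates.sort(key=lambda cand: cand[0], reverse=True)
        beams = []
        for score, tokens in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((score, tokens))
            else:
                beams.append((score, tokens))
        if not beams:
            break
    finished.extend(beams)            # hypotheses cut off at max_len
    return max(finished, key=lambda cand: cand[0])

def toy_step(prefix):
    """Hypothetical next-token log-probabilities (illustration only)."""
    if not prefix:
        return {"a": math.log(0.6), "b": math.log(0.4)}
    if prefix[-1] == "a":
        return {"c": math.log(0.5), "d": math.log(0.5)}
    if prefix[-1] == "b":
        return {"c": math.log(0.9), "d": math.log(0.1)}
    return {"</s>": math.log(1.0)}

greedy_score, greedy_tokens = beam_search(toy_step, beam_size=1, max_len=5)
beam_score, beam_tokens = beam_search(toy_step, beam_size=2, max_len=5)
print(greedy_tokens, beam_tokens)
```

With a beam of one the search commits to the locally best first token, while a beam of two keeps the alternative alive long enough to find the sequence with the higher overall probability. Beam search remains an approximation: it is not guaranteed to return the true maximizer.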

5. Results

5.1 Quantitative results

RNNencdec's performance drops sharply as sentence length increases.

RNNsearch holds up better than RNNencdec as sentence length increases.

5.2 Qualitative analysis

5.2.1 Alignment

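The paper provides an intuitive way to inspect the soft-alignment between the words in a generated translation and those in the source sentence.

→ The alignment weights show which positions in the source sentence were considered more important when generating each target word.

The weights on the diagonal of each matrix are usually large (monotonic alignment), but there are also many non-trivial, non-monotonic alignments.

Strengths of soft-alignment
- evident when compared with hard-alignment, as in the example below
- handles source and target phrases of different lengths without a counter-intuitive way of mapping some words to or from nowhere
Weakness of hard-alignment
- Example: [the man] → [l' homme]
- hard-alignment maps the to l' and man to homme separately, but translating the alone cannot decide whether it should be le, la, les or l'.
- soft-alignment lets the model look at both the and man together, which resolves this issue naturally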

5.2.2 Long Sentences

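RNNencdec (the baseline model) translates the sentence well up to the underlined part but fails to translate the underlined part correctly.
The proposed model (RNNsearch) translates long sentences better than the baseline (RNNencdec).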

6. Related Work

6.1 Learning to Align

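In the context of handwriting synthesis, Graves (2013)
- Handwriting synthesis: a task in which the model is asked to generate handwriting for a given sequence of characters
- uses a mixture of Gaussian kernels to compute the weights of the annotations
- the modes of the weights of the annotations move in only one direction
- this is a severe limitation for machine translation, because (long-distance) reordering is often needed to generate a grammatically correct translation

In contrast, the model in this paper computes the annotation weight of every word in the source sentence for each word in the translation.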

6.2 Neural Networks for Machine Translation

Neural probabilistic language model, Bengio et al. (2003)
- uses a neural network to model the conditional probability of a word given a fixed number of preceding words
- since then, however, the role of neural networks has largely been limited to providing a single feature to an existing statistical machine translation system or re-ranking a list of candidate translations provided by that system.
Phrase-based statistical machine translation system, Schwenk (2012)
- uses a feedforward neural network to compute the score of a pair of source and target phrases, and uses that score as an additional feature
Kalchbrenner and Blunsom (2013) and Devlin et al. (2014)
- use neural networks as a sub-component of an existing translation system
Traditionally, a neural network trained as a target-side language model has been used to rescore or rerank a list of candidate translations.

7. Conclusion

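Encoder-decoder approach
- encodes the whole input sentence into a fixed-length vector, from which the decoder produces the translation
- using a fixed-length context vector can be problematic when translating long sentences.
This paper, RNNsearch
- extends the basic encoder-decoder by letting the model (soft-)search over a set of input words (the annotations computed by the encoder) when generating each target word.
- unlike traditional machine translation systems, all parts of the translation system, including the alignment mechanism, are jointly trained towards a better log-probability of producing correct translations.
- shown empirically to outperform the conventional model (RNNencdec) on English-to-French translation.
- translation performance comparable to existing phrase-based statistical machine translation
future work
- better handling of unknown or rare words.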

Personal Thoughts

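The first study to apply the technique now known as Attention (Bahdanau attention) in NLP
- overcomes the limitations of earlier translation models
Since Attention is now widely used in areas such as recommender systems and chatbots, it seems worth understanding the principle precisely and studying how the algorithm is put together in detail.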