[Recommendation] Decoupled Side Information Fusion for Sequential Recommendation

_Sun_ 2023. 5. 25. 16:59

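Presented at SIGIR 2022.

A paper from the Hong Kong University of Science and Technology (HKUST) and Upstage.

The paper addresses session-based recommendation, a branch of sequential recommendation.

Paper link: https://arxiv.org/pdf/2204.11046.pdf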

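Concepts worth knowing before reading the paper

Side Information

User characteristics, item attributes, or any additional information that helps prediction.

What is a session?

The user's history up to the point of actually purchasing a product.

Session-based Recommendation vs. Sequential Recommendation

💡 SBR and SR differ only in how they use the data; the model architectures are almost identical.

SBR

Predicts the next interaction based on the session.
The interaction may be the next item clicked or the next item ordered.

SR

Recommends the next item based on the user's sequential data.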

Abstract

Problems with existing models that use side information:
- a rank bottleneck occurs
- the flexibility of gradients is limited
The paper proposes DIF-SR (Decoupled Side Information Fusion for Sequential Recommendation) to address these problems.

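Source code

Source code GitHub home: https://github.com/AIM-SE/DIF-SR

The DIF-SR architecture is added to RecBole, a PyTorch-based library.

Link to the DIF-SR model code:

https://github.com/AIM-SE/DIF-SR/blob/main/recbole/model/sequential_recommender/sasrecd.py

The source code is provided, but its naming differs from the paper, which can be confusing.
- The paper calls the model DIF-SR, while the source code calls it SASRecD.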

Related Work

Sequential Recommendation

Goal of Sequential Recommendation (SR)

Capture the user's preference from their historical sequential interactions and predict the next item.

Types of sequential recommendation models

Markov Chain assumption, Matrix Factorization methods
CNNs (Convolutional Neural Networks), RNNs (Recurrent Neural Networks), GNNs (Graph Neural Networks), Transformers
Self-attention-based methods: SASRec, BERT4Rec

Shortcomings of existing SR models (early fusion of side information)

Integrating the embeddings before the attention layer causes a rank bottleneck in the attention matrices, which weakens the representational capacity of the attention scores.
Running attention over a compound embedding space can introduce random disturbance, because embeddings mixed from different information sources inevitably carry unrelated information.
Because the integrated embedding cannot be separated again inside any attention block, early integration forces the design of complex, heavy integration solutions and training schemes in order to enable flexible gradients for the various kinds of side information.

Problem with existing SR approaches

They use only item IDs and ignore item attributes, overlooking highly related information that could provide additional supervision signals.

Side Information Fusion for Sequential Recommendation

Side information is widely used in attention-based SR models.
Attention-based SR models that use side information:
- FDSA
- S3-Rec

Problem Formulation

$\mathcal I$ : item set

$\mathcal U$ : user set

User's historical interactions
$\mathcal S_u = [v_1, v_2, \ldots, v_n]$, where $v_i$ is the $i$-th interaction (item)

Side information

Attributes of users or items, or actions that provide additional information for prediction
item-related information : brand, category
behavior-related information : position, rating

Interaction

$v_i = (I_i, f^{(1)}_i, \ldots, f^{(p)}_i)$, where $f^{(j)}_i$ is the $j$-th type of side information of the $i$-th interaction in the sequence
$I_i$ : the item ID of the $i$-th interaction

Goal

Predict the next item $I_{pred} \in \mathcal I$ that user $u$ will interact with.
Criterion: $I_{pred} = I^{(\hat k)}$, where $\hat k = \arg\max_k P(v_{n+1} = (I^{(k)}, \cdot) \mid \mathcal S_u)$
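
As a small illustration of the criterion above, the sketch below picks the item with the highest predicted probability. The item names and probability values are invented for the example, and `predict_next_item` is a hypothetical helper, not something from the paper or its code.

```python
# Hypothetical next-item probabilities P(v_{n+1} = (I^(k), .) | S_u).
# The item IDs and values are invented for illustration only.
next_item_probs = {"item_a": 0.10, "item_b": 0.65, "item_c": 0.25}


def predict_next_item(probs):
    """Return the item ID with the highest predicted probability (the argmax)."""
    return max(probs, key=probs.get)


predicted = predict_next_item(next_item_probs)
print(predicted)  # item_b
```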

Methodology

DIF-SR consists of three modules:

Embedding Module
Decoupled Side Information Fusion Module
Prediction Module with AAP

Embedding Module

The input sequence $\mathcal S_u = [v_1, v_2, \ldots, v_n]$ is fed into the item embedding layer and the attribute embedding layers.

Item ID embedding, $E^{ID}$
$E^{ID} = \mathcal E_{id}([I_1, I_2, \ldots, I_n])$
Item attribute (side information) embeddings
$E^{f_1} = \mathcal E_{f_1}([f^{(1)}_1, f^{(1)}_2, \ldots, f^{(1)}_n])$
…
$E^{f_p} = \mathcal E_{f_p}([f^{(p)}_1, f^{(p)}_2, \ldots, f^{(p)}_n])$

$\mathcal E$ : the corresponding embedding layer that encodes the item and the different item attributes into vectors

look-up embedding matrices
- $M_{id} \in \mathbb{R}^{|\mathcal I| \times d}$
- $M_{f_1} \in \mathbb{R}^{|f_1| \times d_{f_1}}$ , … , $M_{f_p} \in \mathbb{R}^{|f_p| \times d_{f_p}}$
- $|\cdot|$ : the total number of distinct items or of distinct values of each kind of side information
- $d$ , $d_{f_1}, \ldots, d_{f_p}$ : the embedding dimensions of the item and of each kind of side information
output embeddings
- $E^{ID} \in \mathbb{R}^{n \times d}$
- $E^{f_1} \in \mathbb{R}^{n \times d_{f_1}}$, … , $E^{f_p} \in \mathbb{R}^{n \times d_{f_p}}$
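
The look-up step can be sketched roughly as follows. This is a plain-Python toy, not the RecBole implementation: the table sizes, the random initialisation, and the helper names `make_embedding_table` and `embed` are all assumptions made for the example.

```python
import random

random.seed(0)


def make_embedding_table(num_rows, dim):
    """Build a look-up matrix with one randomly initialised row per ID."""
    return [[random.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed(table, ids):
    """Look up one embedding row for each position of the sequence."""
    return [table[i] for i in ids]


# Hypothetical sizes: |I| = 6 items with d = 4, |f_1| = 3 categories with d_f1 = 2.
M_id = make_embedding_table(6, 4)
M_f1 = make_embedding_table(3, 2)

item_ids = [2, 5, 1]      # I_1, I_2, I_3 (sequence length n = 3)
category_ids = [0, 2, 2]  # f^(1)_1, f^(1)_2, f^(1)_3

E_id = embed(M_id, item_ids)      # n x d
E_f1 = embed(M_f1, category_ids)  # n x d_f1
print(len(E_id), len(E_id[0]), len(E_f1), len(E_f1[0]))  # 3 4 3 2
```

Positions that share the same attribute value (the last two here) receive the same attribute embedding row, while the item ID and the attribute keep separate tables and separate dimensions.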

Decoupled Side Information Fusion Module

Layer Structure

The module consists of several stacked blocks, each made of a sequentially combined DIF attention layer and feed-forward layer.
The block structure is identical to SASRec, except that the original multi-head self-attention is replaced with multi-head DIF attention.
Each DIF block takes two kinds of input, the current item representation and the auxiliary side information embeddings, and outputs the updated item representation.

Process of the Decoupled Side Information Fusion Module

$R^{(ID)}_i \in \mathbb{R}^{n \times d}$ : the input item representation of block i
FFN : fully connected feed-forward network
LN : layer normalization
$$
\begin{aligned}
R^{(ID)}_{i+1} &= LN(FFN(DIF(R^{(ID)}_i, E^{f_1}, \ldots, E^{f_p}))), \\
R^{(ID)}_1 &= E^{ID}
\end{aligned}
$$

How the Decoupled Side Information Fusion module works in detail

Compute the attention scores for the item representation.
Compute the attention scores for each attribute.
Fuse all attention matrices with a fusion function (addition, concatenation, gating, and so on can be used). In the experiments, the choice among addition, concatenation, and the others makes little difference in performance.
Concatenate the outputs of all attention heads and feed the result into the feed-forward layer.

$W^i_Q, W^i_K, W^i_V \in \mathbb{R}^{d \times d_h}$ , $i \in [h]$ : the query, key, value projection matrices for the h heads ($d_h = d/h$)
$\mathcal F$ : the fusion function (addition, concatenation, or gating), applied within each head to produce that head's output
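
The steps above can be sketched for a single head as follows. This is a minimal plain-Python sketch under stated assumptions, not the authors' implementation: addition is used as the fusion function, the fused scores are scaled by $\sqrt{d_h}$ as in standard scaled dot-product attention, only the item representation is projected to values, and the causal mask is omitted. All function names and toy weights are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_scores(x, w_q, w_k):
    """Unnormalised scores (x W_Q)(x W_K)^T for one information source."""
    return matmul(matmul(x, w_q), transpose(matmul(x, w_k)))


def dif_head(r_id, side_embs, w_id, w_side, w_v):
    """One DIF head, assuming addition as the fusion function F."""
    fused = attention_scores(r_id, *w_id)          # item-ID attention map
    for e_f, (w_q, w_k) in zip(side_embs, w_side):
        att_f = attention_scores(e_f, w_q, w_k)    # one map per attribute
        fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(fused, att_f)]
    d_h = len(w_v[0])
    scaled = [[v / math.sqrt(d_h) for v in row] for row in fused]  # assumed scaling
    return matmul(softmax_rows(scaled), matmul(r_id, w_v))


# Toy example: n = 2 positions, d = 2, one attribute with d_f = 1, d_h = 1.
r_id = [[1.0, 0.0], [0.0, 1.0]]
e_f1 = [[1.0], [2.0]]
w_id = ([[1.0], [0.0]], [[1.0], [0.0]])  # (W_Q, W_K) for the item ID
w_side = [([[1.0]], [[1.0]])]            # (W_Q, W_K) for attribute f_1
w_v = [[1.0], [2.0]]

head_out = dif_head(r_id, [e_f1], w_id, w_side, w_v)
print(head_out)
```

Each information source gets its own query and key projections, so the attribute embeddings can keep their own dimensions and only influence how the item values are mixed.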

Prior Attention Solutions vs. DIF Attention

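Advantage 1 of DIF attention

Why generating a separate attention map for each attribute is better

rank
- The dimension of the vector space spanned by the columns of a matrix; in other words, the maximum dimension the data can express.
Proof by equations

Figure explanation

The figure is an illustrative example to aid understanding.

💡 Low-Rank Bottleneck in Multi-head Attention Models (ICML 2020) points out that when the fixed per-head dimension is too small, the resulting rank bottleneck can reduce expressive power.

Comparison of the attention-head rank used by each model

Other models suffer a bottleneck: the rank of their attention matrices cannot exceed the multi-head attention dimension.
Using DIF attention can alleviate the rank bottleneck.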
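
A toy example may make the rank argument more concrete. The sketch below assumes a head dimension of $d_h = 1$, so each query and key is a single column, and compares an early-fusion attention map (built from summed queries and keys) with a decoupled map (sum of two separately computed maps). The vectors are invented, and `matrix_rank` is a small exact-arithmetic helper written for this example.

```python
from fractions import Fraction


def matrix_rank(m):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def outer(q, k):
    """Attention scores q k^T when the head dimension is 1."""
    return [[qi * kj for kj in k] for qi in q]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Invented projected queries/keys for n = 3 positions, head dimension 1.
q_id, k_id = [1, 2, 3], [1, 0, 1]  # from the item-ID embedding
q_f, k_f = [1, 0, 1], [0, 1, 1]    # from one attribute embedding

# Early fusion: embeddings are summed first, so queries and keys are summed too.
early = outer([a + b for a, b in zip(q_id, q_f)], [a + b for a, b in zip(k_id, k_f)])
# Decoupled fusion: each source produces its own map, and the maps are added.
decoupled = add(outer(q_id, k_id), outer(q_f, k_f))

print(matrix_rank(early), matrix_rank(decoupled))  # 1 2
```

With early fusion the map is a single outer product, so its rank is capped by the head dimension; summing separately computed maps can exceed that cap.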

Advantage 2 of DIF attention

Flexible gradients: a different gradient can be applied to each attribute, which makes the side information easier to learn.
Other models perform early fusion, so the same gradient is applied to all attributes.

Prediction Module with AAP

Item prediction layer
$$
\hat y = \mathrm{softmax}(M_{id} R^{(ID)}_L[n]^{\top})
$$
- $R^{(ID)}_L$ : the final item representation, which encodes the sequence information with the help of side information
- $\hat y$ : the $|\mathcal I|$-dimensional probability vector
- $M_{id} \in \mathbb{R}^{|\mathcal I| \times d}$ : the item embedding table in the embedding layer
During training, Auxiliary Attribute Predictors (AAP) are applied to the attributes.
Prediction for attribute j
$$
\hat y^{(f_j)} = \sigma(W_{f_j} R^{(ID)}_L[n]^{\top} + b_{f_j})
$$
- $\hat y^{(f_j)}$ : the $|f_j|$-dimensional probability vector
- $W_{f_j} \in \mathbb{R}^{|f_j| \times d}$ , $b_{f_j} \in \mathbb{R}^{|f_j| \times 1}$ : learnable parameters
- $\sigma$ : sigmoid function
Cross entropy is used for the item loss.

$$
L_{id} = -\sum^{|\mathcal I|}_{i=1} y_i \log(\hat y_i)
$$

Binary cross entropy is used for the side information loss (used only during training).

$$
L_{f_j} = -\sum^{|f_j|}_{i=1} \left[ y_i^{(f_j)} \log(\hat y^{(f_j)}_i) + (1 - y^{(f_j)}_i) \log(1 - \hat y^{(f_j)}_i) \right]
$$

Combined loss function

$$
L = L_{id} +\lambda\sum^p_{j=1}L_{fj}
$$
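
The combined objective can be worked through numerically with a short sketch. The label vectors, predicted probabilities, and the value $\lambda = 5$ below are invented for illustration; the function names are hypothetical and this is not the RecBole training code.

```python
import math


def cross_entropy(y_true, y_pred):
    """Item loss: -sum_i y_i * log(y_hat_i), skipping zero-label terms."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def binary_cross_entropy(y_true, y_pred):
    """Attribute loss: -sum_i [y_i log(y_hat_i) + (1 - y_i) log(1 - y_hat_i)]."""
    return -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p) for t, p in zip(y_true, y_pred)
    )


def combined_loss(item_true, item_pred, attr_trues, attr_preds, lam):
    """L = L_id + lambda * sum_j L_fj."""
    l_id = cross_entropy(item_true, item_pred)
    l_side = sum(binary_cross_entropy(t, p) for t, p in zip(attr_trues, attr_preds))
    return l_id + lam * l_side


# Invented example: 3 items (one-hot target) and one attribute with 2 values.
item_true, item_pred = [0, 1, 0], [0.2, 0.7, 0.1]
attr_trues, attr_preds = [[1, 0]], [[0.9, 0.2]]

loss = combined_loss(item_true, item_pred, attr_trues, attr_preds, lam=5)
print(round(loss, 4))
```

Because the attribute terms are multiplied by $\lambda$, a large balance parameter lets the auxiliary attribute predictors dominate the total loss, which is one reason $\lambda$ is tuned per dataset.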

Experiments

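RQ1 : Does DIF-SR outperform other SR models?

RQ2 : Do DIF and AAP really help performance?

RQ3 : What are the effects of the hyperparameters and of the different components in DIF-SR?

RQ4 : Does visualizing the fused attention matrices of DIF provide evidence for the superior performance?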

Experimental Settings

Dataset

Amazon Beauty
Amazon Sports
Amazon Toys
Yelp

Evaluation Metrics

Evaluation

leave-one-out strategy
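
A leave-one-out split can be sketched as below. The post only names the strategy, so the exact convention used here (last interaction for testing, second-to-last for validation, the rest for training) is an assumption based on common practice in sequential recommendation, and the item IDs are invented.

```python
def leave_one_out_split(sequence):
    """Split one user's chronologically ordered item sequence.

    Assumed convention: the last item is the test target, the second-to-last
    is the validation target, and everything before that is training data.
    """
    if len(sequence) < 3:
        raise ValueError("need at least three interactions to split")
    return {"train": sequence[:-2], "valid": sequence[-2], "test": sequence[-1]}


user_sequence = ["i1", "i7", "i3", "i9", "i4"]  # hypothetical item IDs
split = leave_one_out_split(user_sequence)
print(split)
```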

Performance

top-K Recall
top-K Normalized Discounted Cumulative Gain (NDCG)
- A metric that indicates how good the model's ranked recommendation list is relative to the ideal ranked list.
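
Under leave-one-out evaluation each test case has exactly one relevant item, so both metrics take a simple form. The sketch below assumes that single-target setting (the ideal DCG is then 1); the ranked list and target are invented.

```python
import math


def recall_at_k(ranked_items, target, k):
    """1.0 if the held-out target appears in the top-k list, else 0.0."""
    return 1.0 if target in ranked_items[:k] else 0.0


def ndcg_at_k(ranked_items, target, k):
    """NDCG@k with a single relevant item, where the ideal DCG equals 1."""
    top_k = ranked_items[:k]
    if target not in top_k:
        return 0.0
    position = top_k.index(target)  # 0-based rank of the target
    return 1.0 / math.log2(position + 2)


ranked = ["b", "a", "c", "d"]  # hypothetical model ranking, best first
print(recall_at_k(ranked, "a", 3), ndcg_at_k(ranked, "a", 3))
```

Recall@K only checks whether the target is in the list, while NDCG@K also rewards placing it closer to the top. Reported scores are averages of these per-user values.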

Baseline Models : 9 models

GRU4Rec
GRU4Rec_F
Caser
BERT4Rec
SASRec
SASRec_F
S3-Rec
ICAI-SR
NOVA

Implementation Details

Settings

Adam optimizer, 200 epochs
batch size : 2048
learning rate : 1e-4

Other Hyperparameters

grid-search
- attribute_embedding_size {10,32,64,128,256}
- num_heads {2,4,8}
- num_layers {2,3,4}
- balance parameter $\lambda$ {5,10,15,20,25}
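
To get a feel for the size of this search space, the grid can be enumerated as in the sketch below. The dictionary keys simply mirror the labels in the list above and are not claimed to match the configuration keys used in the DIF-SR repository.

```python
from itertools import product

# Search space copied from the list above; key names are illustrative labels.
search_space = {
    "attribute_embedding_size": [10, 32, 64, 128, 256],
    "num_heads": [2, 4, 8],
    "num_layers": [2, 3, 4],
    "lambda": [5, 10, 15, 20, 25],
}


def iter_configs(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 225
```

A full grid therefore means 225 training runs per dataset, before counting repeated seeds.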

Overall Performance (RQ1)

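The Recall and NDCG results show that DIF-SR outperforms the other baselines. (Higher Recall and NDCG mean better performance.)

Enhancement Study (RQ2)

DIF and AAP can be easily applied to self-attention-based SR models and can improve their performance.

To demonstrate this, DIF is applied to SASRec and BERT4Rec.

→ Performance did in fact improve.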

Ablation and Hyper-parameter Study (RQ3)

Effectiveness of Different Components

Difference in Recall per dataset for the DIF-SR model with and without DIF and AAP

DIF is the most effective component.
Using AAP alone does not always improve performance.
AAP further improves model performance when combined with DIF.

Effectiveness of Different Kinds of Side Information

position : a special, basic kind of side information that enables order-aware self-attention

Yelp includes city and category.

Including side information can improve performance.

Impact of Hyperparameters and Fusion Function

Effect of balance parameter $\lambda$

Performance is generally best when $\lambda$ is 5 or 10.

Effect of attribute embedding size $d_f$

Performance is similar regardless of the attribute embedding size.

Therefore, $d_f$ is set small to reduce model complexity (usually smaller than the item embedding dimension).

Effect of fusion function $\mathcal F$

Visualization of Attention Distribution (RQ4)

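To examine the interpretability of DIF-SR, the attention matrices are visualized on test samples from the Yelp dataset.

The decoupled attention matrices of the individual attributes show different preferences when capturing data patterns.
The fused attention matrix appropriately adjusts the contribution of each kind of side information through the decoupled attention matrices and synthesizes the important patterns.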

Conclusion

The paper proposes DIF-SR, which effectively fuses side information into SR.
It proposes a new decoupled side information attention mechanism.

Personal Thoughts

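Apart from changing how attention is applied in SASRec, there do not seem to be major differences.
Because attention is computed separately for the item ID and for each kind of side information, the model would become very complex when there is a lot of side information. It is unclear whether computational efficiency would hold up in that case.
- The performance gap over previous models is very small (about 0.005 in Recall). Taking computational efficiency into account, a simpler model might be the better choice from an engineering standpoint.
Since the model recommends the next item based on the user's item-selection history, recommending for new users with no history would be difficult.
The source code uses the PyTorch-based RecBole library, so using it requires some understanding of both PyTorch and RecBole.

Open questions

How the datasets used in the paper (Amazon, Yelp) are structured.
The paper appears to use users' selection-history data. Could the model be applied using only the order of the games that a Superbrain user has played?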
