[MDM 2024] Spatial-Temporal Large Language Model for Traffic Prediction

Paper: [2401.10134] Spatial-Temporal Large Language Model for Traffic Prediction

Code: GitHub - ChenxiLiu-HNU/ST-LLM: Official implementation of the paper "Spatial-Temporal Large Language Model for Traffic Prediction"

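The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes may be unavoidable; if you spot any, corrections in the comments are welcome. This post is closer to personal notes, so read it with care.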

1. Takeaways

2. Section-by-Section Reading

2.1. Abstract

2.2. Introduction

2.3. Related Work

2.3.1. Large Language Models for Time Series Analysis

2.3.2. Traffic Prediction

2.4. Problem Definition

2.5. Methodology

2.5.1. Overview

2.5.2. Spatial-Temporal Embedding and Fusion

2.5.3. Partially Frozen Attention (PFA) LLM

2.6. Experiments

2.6.1. Datasets

2.6.2. Baselines

2.6.3. Implementations

2.6.4. Evaluation Metrics

2.6.5. Main Results

2.6.6. Performance of ST-LLM and Ablation Studies

2.6.7. Parameter Analysis

2.6.8. Inference Time Analysis

2.6.9. Few-Shot Prediction

2.6.10. Zero-Shot Prediction

2.7. Conclusion

3. Reference

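1. Takeaways

(1) The paper I have to submit in a few days is not even started, yet here I am nibbling biscuits and writing reading notes. Sigh. Everyone moves so fast these days.

(2) Compared with math-heavy papers, LLM papers go well with a cup of milk tea: an easy, pleasant read from start to finish. This one boils down to three separate convolutions → fuse them → LLM (with some modules partially unfrozen) → done.

2. Section-by-Section Reading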

2.1. Abstract

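①They propose the Spatial-Temporal Large Language Model (ST-LLM) for traffic prediction. (Nothing else stood out, so I will not write more: the abstract just introduces the method and says earlier approaches are not accurate enough. See the figures below for the method itself.)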

2.2. Introduction

①Traditional CNNs and RNNs cannot capture complex or long-range spatial and temporal dependencies. GNNs are prone to overfitting, so researchers mainly turn to attention mechanisms.

②Existing traffic prediction methods mainly focus on temporal features rather than spatial ones

③For better long-term prediction, they propose partially frozen attention (PFA)

2.3. Related Work

2.3.1. Large Language Models for Time Series Analysis

①The authors list TEMPO-GPT, TIME-LLM, OFA, TEST, and LLM-TIME, all of which use temporal features only. GATGPT introduces spatial features but ignores temporal dependencies.

imputation n. attribution; ascription (of blame or cause). In time series analysis it refers to filling in missing values

2.3.2. Traffic Prediction

①Filtering is a common and classic way to process traffic data

②The irregular layout of urban networks makes it hard for CNNs to extract spatial features

2.4. Problem Definition

①Input traffic data: $\mathbf{X}\in\mathbb{R}^{T\times N\times C}$ , where $T$ denotes the number of timesteps, $N$ denotes the number of spatial stations, and $C$ denotes the number of features

②Task: given only the historical traffic data $\mathbf{X}_{P}=\{\mathbf{X}_{t-P+1},\mathbf{X}_{t-P+2},\ldots,\mathbf{X}_{t}\}\in\mathbb{R}^{P\times N\times C}$ of the past $P$ timesteps, learn a function $f\left ( \cdot \right )$ with parameters $\theta$ that predicts the next $S$ timesteps $\mathbf{Y}_{S}=\{\mathbf{Y}_{t+1},\mathbf{Y}_{t+2},\ldots,\mathbf{Y}_{t+S}\}\in\mathbb{R}^{S\times N\times C}$ :

$[\mathbf{X}_{t-P+1},\mathbf{X}_{t-P+2},\ldots,\mathbf{X}_{t}]\xrightarrow{f(\cdot)}[\mathbf{Y}_{t+1},\mathbf{Y}_{t+2},\ldots,\mathbf{Y}_{t+S}]$
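To make the problem definition concrete, here is a minimal NumPy sketch of how one long series of shape $(T, N, C)$ could be cut into $(\mathbf{X}_P, \mathbf{Y}_S)$ pairs by sliding $t$ along the time axis. The function name `make_windows` and the toy sizes are my own assumptions for illustration, not taken from the paper or its code.

```python
import numpy as np

def make_windows(X, P, S):
    """Slice a (T, N, C) series into (history, future) pairs.

    Returns X_P with shape (M, P, N, C) and Y_S with shape (M, S, N, C),
    where M = T - P - S + 1 is the number of valid positions of t.
    """
    T = X.shape[0]
    M = T - P - S + 1
    if M <= 0:
        raise ValueError("series is too short for the requested P and S")
    # Window i uses timesteps i .. i+P-1 as history and i+P .. i+P+S-1 as target.
    X_P = np.stack([X[i:i + P] for i in range(M)])
    Y_S = np.stack([X[i + P:i + P + S] for i in range(M)])
    return X_P, Y_S

# Toy data: T=30 timesteps, N=4 stations, C=1 feature (all values hypothetical).
X = np.arange(30 * 4 * 1, dtype=float).reshape(30, 4, 1)
X_P, Y_S = make_windows(X, P=12, S=12)
print(X_P.shape, Y_S.shape)
```

Each pair is one training sample for $f(\cdot)$: the target window starts exactly one step after the history window ends.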

2.5. Methodology

2.5.1. Overview

①Overall framework of ST-LLM:

where the Spatial-Temporal Embedding layer tokenizes the historical $P$ timesteps into a token embedding $\mathbf{E}_{P}\in\mathbb{R}^{N\times D}$ , and also extracts a spatial embedding $\mathbf{E}_{S}\in\mathbb{R}^{N\times D}$ and a temporal embedding $\mathbf{E}_{T}\in\mathbb{R}^{N\times D}$ . The three are then fused into $\mathbf{H}_{F}\in\mathbb{R}^{N\times3D}$ . The PFA LLM freezes its first $F$ layers and unfreezes its last $U$ layers, and outputs $\mathbf{H}^{L}\in\mathbb{R}^{N\times3D}$ with $L=F+U$ . Lastly, a regression convolution converts it to $\widehat{\mathbf{Y}}_{S}\in\mathbb{R}^{S\times N\times C}$ .

2.5.2. Spatial-Temporal Embedding and Fusion

①They get tokens by pointwise convolution:

$\mathbf{E}_{P}=PConv(\mathbf{X}_{P};\theta_{p})$

②Linear layers encode the time-of-day and day-of-week information of the input $\mathbf{X}_P\in\mathbb{R}^{P\times N\times C}$ , represented as $\mathbf{X}_{day}\in\mathbb{R}^{N\times T_{d}}$ and $\mathbf{X}_{week}\in\mathbb{R}^{N\times T_{w}}$ :

$\begin{aligned}\mathbf{E}_T^d &= \mathbf{W}_{day}(\mathbf{X}_{day}),\\ \mathbf{E}_T^w &= \mathbf{W}_{week}(\mathbf{X}_{week}),\\ \mathbf{E}_T &= \mathbf{E}_T^d + \mathbf{E}_T^w.\end{aligned}$

where $\mathbf{W}_{day}\in\mathbb{R}^{T_{d}\times D}$ and $\mathbf{W}_{week}\in\mathbb{R}^{T_{w}\times D}$ are learnable parameters and the output is $\mathbf{E}_{T}\in\mathbb{R}^{N\times D}$

③They extract spatial correlations by:

$\mathbf{E}_S=\sigma(\mathbf{W}_s\cdot\mathbf{X}_{P}+\mathbf{b}_s)$

④Fusion convolution:

$\mathbf{H}_F=FConv(\mathbf{E}_P||\mathbf{E}_S||\mathbf{E}_T;\theta_f)$

where $\mathbf{H}_{F}\in\mathbb{R}^{N\times3D}$
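The shapes in this subsection are easier to follow with a small NumPy walk-through. This is only a sketch under my own assumptions: the pointwise convolution is written as a shared linear map applied to each station, $\mathbf{X}_{day}$ and $\mathbf{X}_{week}$ are treated as one-hot vectors, $\sigma$ is taken to be ReLU, and all sizes and random weights are toy values. It is not the official implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, C, D = 12, 5, 1, 8          # toy sizes, chosen only for illustration
T_d, T_w = 48, 7                  # slots per day and days per week

X_P = rng.normal(size=(P, N, C))                    # historical window
X_flat = X_P.transpose(1, 0, 2).reshape(N, P * C)   # one row per station

# Token embedding: a pointwise (1x1) convolution treats every station
# independently, which equals one shared linear map applied to each row.
W_p = rng.normal(size=(P * C, D))
E_P = X_flat @ W_p                                  # (N, D)

# Temporal embedding: one-hot time-of-day and day-of-week (assumed encoding),
# each passed through its own linear map and then summed.
slot, weekday = 17, 3                               # hypothetical time stamp
X_day = np.zeros((N, T_d))
X_day[:, slot] = 1.0
X_week = np.zeros((N, T_w))
X_week[:, weekday] = 1.0
W_day = rng.normal(size=(T_d, D))
W_week = rng.normal(size=(T_w, D))
E_T = X_day @ W_day + X_week @ W_week               # (N, D)

# Spatial embedding: sigma(W_s X_P + b_s), with ReLU assumed for sigma.
W_s = rng.normal(size=(P * C, D))
b_s = np.zeros(D)
E_S = np.maximum(0.0, X_flat @ W_s + b_s)           # (N, D)

# Fusion: concatenate along the feature axis, then a pointwise map 3D -> 3D.
E_cat = np.concatenate([E_P, E_S, E_T], axis=1)     # (N, 3D)
W_f = rng.normal(size=(3 * D, 3 * D))
H_F = E_cat @ W_f                                   # (N, 3D)
print(E_P.shape, E_S.shape, E_T.shape, H_F.shape)
```

The point to notice is that every station becomes one token of width $3D$, so the LLM sees a sequence of $N$ tokens rather than a sequence of timesteps.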

2.5.3. Partially Frozen Attention (PFA) LLM

①They freeze the first $F$ layers (both the multi-head attention and the feed-forward layers), which contain important information:

$\begin{aligned}\bar{\mathbf{H}}^{i} &= MHA\left(LN\left(\mathbf{H}^{i}\right)\right)+\mathbf{H}^{i},\\ \mathbf{H}^{i+1} &= FFN\left(LN\left(\bar{\mathbf{H}}^{i}\right)\right)+\bar{\mathbf{H}}^{i},\end{aligned}$

where $i \in \left \{ 1,\ldots,F-1 \right \}$ , $\mathbf{H}^{1}=[\mathbf{H}_{F}+\mathbf{PE}]$ , $\mathbf{PE}$ denotes the learnable positional encoding, $\bar{\mathbf{H}}^{i}$ represents the intermediate representation of the $i$ -th layer after applying the first unfrozen layer normalization (LN) and the frozen multi-head attention (MHA), $\mathbf{H}^{i+1}$ symbolizes the final representation after applying the unfrozen LN and the frozen feed-forward network (FFN), and:

$\begin{aligned}LN\left(\mathbf{H}^{i}\right) &= \gamma\odot\frac{\mathbf{H}^{i}-\mu}{\sigma}+\beta,\\ MHA(\tilde{\mathbf{H}}^{i}) &= \mathbf{W}^{O}(\mathrm{head}_{1}^{i}\|\cdots\|\mathrm{head}_{h}^{i}),\\ \mathrm{head}_{k}^{i} &= Attention(\mathbf{W}_{q}^{k}\tilde{\mathbf{H}}^{i},\mathbf{W}_{k}^{k}\tilde{\mathbf{H}}^{i},\mathbf{W}_{v}^{k}\tilde{\mathbf{H}}^{i}),\\ Attention(\tilde{\mathbf{H}}^{i}) &= \operatorname{softmax}\left(\frac{\tilde{\mathbf{H}}^{i}(\tilde{\mathbf{H}}^{i})^{T}}{\sqrt{d_{k}}}\right)\tilde{\mathbf{H}}^{i},\\ FFN(\tilde{\mathbf{H}}^{i}) &= \max\left(0,\mathbf{W}_{1}\tilde{\mathbf{H}}^{i}+\mathbf{b}_{1}\right)\mathbf{W}_{2}+\mathbf{b}_{2}.\end{aligned}$
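For readers who prefer code to formulas, the block structure above (LN before each sublayer, residual after it) can be sketched in NumPy as follows. This is a simplified illustration under my own assumptions: a single attention head, no attention mask, ReLU in the FFN as written in the formula, and random toy weights. It follows the formulas in these notes and is not GPT-2 itself.

```python
import numpy as np

def layer_norm(H, gamma, beta, eps=1e-5):
    # Normalize each token (row) over its feature dimension.
    mu = H.mean(axis=-1, keepdims=True)
    sigma = H.std(axis=-1, keepdims=True)
    return gamma * (H - mu) / (sigma + eps) + beta

def softmax(Z):
    Z = Z - Z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(Z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(H_tilde, W_q, W_k, W_v):
    # Scaled dot-product attention with a single head.
    Q, K, V = H_tilde @ W_q, H_tilde @ W_k, H_tilde @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ffn(H_tilde, W_1, b_1, W_2, b_2):
    return np.maximum(0.0, H_tilde @ W_1 + b_1) @ W_2 + b_2

def pre_ln_block(H, p):
    # H_bar = MHA(LN(H)) + H, then H_next = FFN(LN(H_bar)) + H_bar
    a = attention(layer_norm(H, p["g1"], p["be1"]), p["W_q"], p["W_k"], p["W_v"])
    H_bar = a @ p["W_o"] + H
    f = ffn(layer_norm(H_bar, p["g2"], p["be2"]), p["W_1"], p["b_1"], p["W_2"], p["b_2"])
    return f + H_bar

rng = np.random.default_rng(0)
N, d = 5, 24                                # toy sizes: 5 station tokens, width 24
shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W_o": (d, d),
          "W_1": (d, 4 * d), "W_2": (4 * d, d)}
p = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
p.update(g1=np.ones(d), be1=np.zeros(d), g2=np.ones(d), be2=np.zeros(d),
         b_1=np.zeros(4 * d), b_2=np.zeros(d))

H = rng.normal(size=(N, d))
H_next = pre_ln_block(H, p)
print(H.shape, H_next.shape)
```

Because of the two residual connections, the output keeps the input shape, which is what lets the blocks be stacked $F+U$ times.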

②Unfreezing the last $U$ layers:

$\begin{aligned}\bar{\mathbf{H}}^{F+U-1} &= MHA\left(LN\left(\mathbf{H}^{F+U-1}\right)\right)+\mathbf{H}^{F+U-1},\\ \mathbf{H}^{F+U} &= FFN\left(LN\left(\bar{\mathbf{H}}^{F+U-1}\right)\right)+\bar{\mathbf{H}}^{F+U-1}.\end{aligned}$
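The frozen/unfrozen split can be summarized as a per-layer table of which sublayers receive gradient updates. The sketch below is plain Python with hypothetical names (`trainable_plan`, the sublayer keys). It encodes my reading of the method: LN is tuned everywhere, attention is tuned only in the last $U$ layers, and the FFN stays frozen throughout. The last two points are assumptions on my part, since these notes only say the last $U$ layers are unfrozen. The value `U=2` is also just an example.

```python
def trainable_plan(num_layers, U):
    """Return {layer_index: {sublayer: is_trainable}} for a PFA-style split.

    Layers 0 .. num_layers-U-1 play the role of the first F layers;
    the remaining U layers are the partially unfrozen ones.
    """
    if not 0 <= U <= num_layers:
        raise ValueError("U must be between 0 and num_layers")
    F = num_layers - U
    plan = {}
    for i in range(num_layers):
        in_last_U = i >= F
        plan[i] = {
            "layer_norm": True,        # LN is tuned in every layer
            "attention": in_last_U,    # assumed: MHA is tuned only in the last U layers
            "feed_forward": False,     # assumed: FFN stays frozen throughout
        }
    return plan

plan = trainable_plan(num_layers=6, U=2)
for i, flags in plan.items():
    print(i, flags)
```

In a real framework the same table would be applied by switching gradient tracking on or off for the matching parameters of the pretrained model.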

③The final regression convolution (RConv):

$\hat{\mathbf{Y}}_{S}=RConv(\mathbf{H}^{F+U};\theta_{r})$

④Loss function:

$\mathcal{L}=\left\|\widehat{\mathbf{Y}}_{S}-\mathbf{Y}_{S}\right\|+\lambda\cdot L_{\mathrm{reg}}$

where $\mathbf{Y}_{S}$ is the ground truth and $L_{\mathrm{reg}}$ is a regularization term weighted by $\lambda$
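A tiny numeric example may help read the loss. The sketch below makes two assumptions that the notes do not pin down: the norm is taken as the Euclidean norm over all entries, and the regularizer is the sum of squared parameters. The function name `st_loss` and all numbers are hypothetical.

```python
import numpy as np

def st_loss(Y_hat, Y, params, lam):
    # Fit term: Euclidean norm of the prediction error (assumed norm type).
    fit = np.linalg.norm((Y_hat - Y).ravel())
    # Regularizer: sum of squared parameter values (assumed form).
    reg = sum(float(np.sum(p ** 2)) for p in params)
    return fit + lam * reg

# Toy tensors with shape (S, N, C) = (2, 3, 1).
Y = np.zeros((2, 3, 1))
Y_hat = np.zeros((2, 3, 1))
Y_hat[0, 0, 0] = 3.0
Y_hat[1, 2, 0] = 4.0
params = [np.array([1.0, 2.0])]

loss = st_loss(Y_hat, Y, params, lam=0.1)
print(loss)
```

Here the error norm is $\sqrt{3^2+4^2}=5$ and the regularizer is $1^2+2^2=5$, so the loss is $5+0.1\times5$.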

⑤Algorithm:

2.6. Experiments

2.6.1. Datasets

①Statistics of datasets:

②NYCTaxi: includes 266 virtual stations and 4,368 timesteps (each timestep is half an hour)

③CHBike: includes 250 sites and 4,368 timesteps (also 30 minutes each)

2.6.2. Baselines

①GNN-based baselines: DCRNN, STGCN, GWN, AGCRN, STGNCDE, DGCRN

②Attention-based baselines: ASTGCN, GMAN, ASTGNN

③LLM-based baselines: OFA, GATGPT, GCNGPT, LLAMA2

2.6.3. Implementations

①Data split: 6:2:2

②Historical and future timesteps: $P=12,S=12$

③ $T_w=7,T_d=48$

④Learning rate: 0.001 with the Ranger21 optimizer for the LLM-based models, and 0.001 with Adam for the GCN- and attention-based models

⑤LLM: GPT2 and LLAMA2 7B

⑥Layer: 6 for GPT2 and 8 for LLAMA2

⑦Epoch: 100

⑧Batch size: 64
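The settings above imply some simple arithmetic that is worth checking by hand. The plain-Python sketch below uses the 4,368 half-hour timesteps from the dataset description together with $P=12$, $S=12$, $T_d=48$, $T_w=7$, and the 6:2:2 split. Two things are my assumptions: the split is applied chronologically to sliding windows, and day index 0 simply means the first day in the data, not any particular weekday. The helper names are hypothetical.

```python
def time_indices(t, slots_per_day=48, days_per_week=7):
    """Map a global timestep index to (slot of day, day of week)."""
    return t % slots_per_day, (t // slots_per_day) % days_per_week

def split_sizes(n, ratios=(6, 2, 2)):
    """Split n samples by integer ratios; the remainder goes to the last part."""
    total = sum(ratios)
    train = n * ratios[0] // total
    val = n * ratios[1] // total
    return train, val, n - train - val

T, P, S = 4368, 12, 12
num_days = T // 48             # number of full days at 48 half-hour slots per day
num_samples = T - P - S + 1    # windows with a full history and a full horizon
print(num_days, num_samples, split_sizes(num_samples))
print(time_indices(0), time_indices(47), time_indices(48), time_indices(T - 1))
```

With 48 slots per day, 4,368 timesteps cover 91 days, that is, exactly 13 weeks, so every slot of the day and every day of the week appears equally often.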

2.6.4. Evaluation Metrics

①Metrics: Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), Root Mean Squared Error (RMSE), and Weighted Absolute Percentage Error (WAPE)
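As a quick reference, the four metrics can be computed as in the NumPy sketch below, using their standard definitions. Skipping zero-valued ground truth in MAPE is my own assumption (the ratio is undefined there); the function name `metrics` and the sample numbers are hypothetical.

```python
import numpy as np

def metrics(y_true, y_pred, eps=1e-8):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = np.abs(y_true - y_pred)
    mask = np.abs(y_true) > eps   # MAPE is undefined where the truth is zero
    return {
        "MAE": err.mean(),
        "RMSE": np.sqrt((err ** 2).mean()),
        "MAPE": (err[mask] / np.abs(y_true[mask])).mean(),
        "WAPE": err.sum() / np.abs(y_true).sum(),
    }

result = metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
for name, value in result.items():
    print(name, round(float(value), 4))
```

MAPE averages the per-sample relative errors, while WAPE divides the total absolute error by the total absolute demand, so WAPE is less sensitive to stations or slots with very small counts.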

2.6.5. Main Results

①Performance table:

2.6.6. Performance of ST-LLM and Ablation Studies

①Module ablation:

②Frozen ablation:

2.6.7. Parameter Analysis

①Hyperparameter $U$ ablation:

2.6.8. Inference Time Analysis

①Inference time table:

2.6.9. Few-Shot Prediction

①10% samples few-shot learning:

2.6.10. Zero-Shot Prediction

①Performance:

2.7. Conclusion

3. Reference

@inproceedings{liu2024spatial,
title={Spatial-Temporal Large Language Model for Traffic Prediction},
author={Liu, Chenxi and Yang, Sun and Xu, Qianxiong and Li, Zhishuai and Long, Cheng and Li, Ziyue and Zhao, Rui},
booktitle={MDM},
year={2024}
}