I am a research scientist at Salesforce AI Research. Previously, I was part of the MIN Lab at Shanghai Jiao Tong University (SJTU), advised by Prof. Hongkai Xiong and Prof. Weiyao Lin, and a visiting student in the CCVL Lab, advised by Prof. Alan Yuille. Prior to SJTU, I obtained my B.S. degree from the Chien-Shiung Wu College at Southeast University in 2016.
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Baohao Liao*, Yuhui Xu*, Hanze Dong*, and 5 more authors
@article{liao2025reward,
  title={Reward-Guided Speculative Decoding for Efficient LLM Reasoning},
  author={Liao*, Baohao and Xu*, Yuhui and Dong*, Hanze and Li, Junnan and Monz, Christof and Savarese, Silvio and Sahoo, Doyen and Xiong, Caiming},
  journal={arXiv preprint arXiv:2501.19324},
  year={2025},
  note={* = equal contribution}
}
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, and 6 more authors
International Conference on Learning Representations (Spotlight), 2025
@article{xu2024think,
  title={ThinK: Thinner Key Cache by Query-Driven Pruning},
  author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
  journal={International Conference on Learning Representations (Spotlight)},
  year={2025}
}
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Ke Yi*, Yuhui Xu*, Heng Chang, and 4 more authors
@article{yi2024one,
  title={One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments},
  author={Yi*, Ke and Xu*, Yuhui and Chang, Heng and Tang, Chen and Meng, Yuan and Zhang, Tong and Li, Jia},
  journal={arXiv preprint arXiv:2405.20202},
  year={2024},
  note={* = equal contribution}
}
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
Xudong Lu*, Aojun Zhou*, Yuhui Xu*, and 3 more authors
International Conference on Machine Learning, 2024
@article{lu2024spp,
  author={Lu*, Xudong and Zhou*, Aojun and Xu*, Yuhui and Zhang, Renrui and Gao, Peng and Li, Hongsheng},
  title={SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models},
  journal={International Conference on Machine Learning},
  year={2024},
  note={* = equal contribution}
}
QA-LoRA: Quantization-aware low-rank adaptation of large language models
Yuhui Xu, Lingxi Xie, Xiaotao Gu, and 6 more authors
International Conference on Learning Representations, 2024
Recent years have witnessed rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators, which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
@article{xu2023qa,
  author={Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhensu and Zhang, Xiaopeng and Tian, Qi},
  title={QA-LoRA: Quantization-aware low-rank adaptation of large language models},
  journal={International Conference on Learning Representations},
  year={2024}
}
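The group-wise idea behind QA-LoRA can be illustrated with a short, simplified PyTorch sketch. This is not the official implementation (see https://github.com/yuhuixu1993/qa-lora); the class name, the tiny candidate shapes, and the on-the-fly INT4 de-quantization are illustrative assumptions. The pretrained weight is quantized per group of input channels, and the LoRA input is average-pooled to one value per group, which is what later lets the learned update be folded into the per-group zero points so the merged model stays quantized.

import torch
import torch.nn as nn


class QALoRALinearSketch(nn.Module):
    """Simplified simulation of a group-wise quantized linear layer with a
    group-pooled LoRA branch, in the spirit of QA-LoRA (not the official code)."""

    def __init__(self, in_features, out_features, rank=16, group_size=32, n_bits=4):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        self.n_groups = in_features // group_size

        # Frozen pretrained weight, quantized group-wise along the input dimension.
        w = torch.randn(out_features, in_features)
        w_groups = w.view(out_features, self.n_groups, group_size)
        w_min = w_groups.amin(dim=-1, keepdim=True)
        w_max = w_groups.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)
        self.register_buffer("codes", ((w_groups - w_min) / scale).round())  # INT4 codes in [0, 15]
        self.register_buffer("scale", scale)  # per-group scale
        self.register_buffer("zero", w_min)   # per-group zero point

        # Trainable LoRA factors; A only sees one pooled feature per channel group.
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, self.n_groups))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # De-quantize on the fly for this simulation (a real kernel stays in INT4).
        w = (self.codes * self.scale + self.zero).reshape(self.codes.shape[0], -1)
        base = x @ w.T
        # Average-pool the input over each channel group before LoRA's A matrix.
        x_pooled = x.view(*x.shape[:-1], self.n_groups, self.group_size).mean(dim=-1)
        return base + x_pooled @ self.lora_A.T @ self.lora_B.T


layer = QALoRALinearSketch(in_features=128, out_features=64)
print(layer(torch.randn(2, 128)).shape)  # torch.Size([2, 64])

Because each output's LoRA correction is constant within a channel group, merging amounts to shifting the stored per-group zero points, which keeps the merged model in INT4 without extra post-training quantization loss.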
PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, and 4 more authors
International Conference on Learning Representations (Spotlight), 2020
Differentiable architecture search (DARTS) provided a fast solution in finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely, Partially-Connected DARTS, by sampling a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising the performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part in a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-network caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in the search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at https://github.com/yuhuixu1993/PC-DARTS.
@article{xu2019pc,
  author={Xu, Yuhui and Xie, Lingxi and Zhang, Xiaopeng and Chen, Xin and Qi, Guo-Jun and Tian, Qi and Xiong, Hongkai},
  title={PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search},
  journal={International Conference on Learning Representations (Spotlight)},
  year={2020}
}
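As a rough illustration of the partial channel connections described above, the PyTorch sketch below routes only 1/K of the channels through the weighted candidate operations, sends the remaining channels through a shortcut, and shuffles channels so different subsets are sampled across iterations. This is not the released code (see https://github.com/yuhuixu1993/PC-DARTS); the tiny candidate-operation set and the module names are assumptions for brevity, and edge normalization is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_shuffle(x, groups):
    # Interleave channel groups so the sampled subset changes over iterations.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class PartialChannelMixedOp(nn.Module):
    """Mixed operation over a sampled 1/K fraction of channels (illustrative sketch)."""

    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        c_part = channels // k
        # A small illustrative candidate set (the paper uses the full DARTS space).
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(c_part, c_part, 3, padding=1, bias=False),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # Architecture parameters alpha, one logit per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        c_part = x.shape[1] // self.k
        x_search, x_bypass = x[:, :c_part], x[:, c_part:]
        weights = F.softmax(self.alpha, dim=-1)
        mixed = sum(w * op(x_search) for w, op in zip(weights, self.ops))
        out = torch.cat([mixed, x_bypass], dim=1)  # bypassed channels act as the shortcut
        return channel_shuffle(out, self.k)


op = PartialChannelMixedOp(channels=16, k=4)
print(op(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])

In the full method, edge normalization additionally introduces per-edge parameters that weight each edge when a node aggregates its inputs, damping the edge-selection noise introduced by channel sampling.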