I am a research scientist at Salesforce AI Research. Previously, I was part of the MIN Lab at Shanghai Jiao Tong University (SJTU), advised by Prof. Hongkai Xiong and Prof. Weiyao Lin, and a visiting student in the CCVL Lab, advised by Prof. Alan Yuille. Prior to SJTU, I obtained my B.S. degree from the Chien-Shiung Wu College at Southeast University in 2016.
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Baohao Liao*, Yuhui Xu*, Hanze Dong*, and 5 more authors
@article{liao2025reward,
  title={Reward-Guided Speculative Decoding for Efficient LLM Reasoning},
  author={Liao*, Baohao and Xu*, Yuhui and Dong*, Hanze and Li, Junnan and Monz, Christof and Savarese, Silvio and Sahoo, Doyen and Xiong, Caiming},
  journal={arXiv preprint arXiv:2501.19324},
  year={2025},
  note={* = equal contribution}
}
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu, Zhanming Jie, Hanze Dong, and 6 more authors
International Conference on Learning Representations (Spotlight), 2025
@article{xu2024think,
  title={ThinK: Thinner Key Cache by Query-Driven Pruning},
  author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
  journal={International Conference on Learning Representations (Spotlight)},
  year={2025}
}
One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments
Ke Yi*, Yuhui Xu*, Heng Chang, and 4 more authors
@article{yi2024one,
  title={One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments},
  author={Yi*, Ke and Xu*, Yuhui and Chang, Heng and Tang, Chen and Meng, Yuan and Zhang, Tong and Li, Jia},
  journal={arXiv preprint arXiv:2405.20202},
  year={2024},
  note={* = equal contribution}
}
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
Xudong Lu*, Aojun Zhou*, Yuhui Xu*, and 3 more authors
International Conference on Machine Learning, 2024
@article{lu2024spp,
  author={Lu*, Xudong and Zhou*, Aojun and Xu*, Yuhui and Zhang, Renrui and Gao, Peng and Li, Hongsheng},
  title={SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models},
  journal={International Conference on Machine Learning},
  year={2024},
  note={* = equal contribution}
}
QA-LoRA: Quantization-aware low-rank adaptation of large language models
Yuhui Xu, Lingxi Xie, Xiaotao Gu, and 6 more authors
International Conference on Learning Representations, 2024
Recent years have witnessed rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators, which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
@article{xu2023qa,
  author={Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhensu and Zhang, Xiaopeng and Tian, Qi},
  title={QA-LoRA: Quantization-aware low-rank adaptation of large language models},
  journal={International Conference on Learning Representations},
  year={2024}
}
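The group-wise idea behind QA-LoRA can be illustrated with a short, simplified PyTorch sketch. This is not the official implementation (see https://github.com/yuhuixu1993/qa-lora); the class name, the tiny candidate shapes, and the on-the-fly INT4 de-quantization are illustrative assumptions. The pretrained weight is quantized per group of input channels, and the LoRA input is average-pooled to one value per group, which is what later lets the learned update be folded into the per-group zero points so the merged model stays quantized.

import torch
import torch.nn as nn


class QALoRALinearSketch(nn.Module):
    """Simplified simulation of a group-wise quantized linear layer with a
    group-pooled LoRA branch, in the spirit of QA-LoRA (not the official code)."""

    def __init__(self, in_features, out_features, rank=16, group_size=32, n_bits=4):
        super().__init__()
        assert in_features % group_size == 0
        self.group_size = group_size
        self.n_groups = in_features // group_size

        # Frozen pretrained weight, quantized group-wise along the input dimension.
        w = torch.randn(out_features, in_features)
        w_groups = w.view(out_features, self.n_groups, group_size)
        w_min = w_groups.amin(dim=-1, keepdim=True)
        w_max = w_groups.amax(dim=-1, keepdim=True)
        scale = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)
        self.register_buffer("codes", ((w_groups - w_min) / scale).round())  # INT4 codes in [0, 15]
        self.register_buffer("scale", scale)  # per-group scale
        self.register_buffer("zero", w_min)   # per-group zero point

        # Trainable LoRA factors; A only sees one pooled feature per channel group.
        self.lora_A = nn.Parameter(0.01 * torch.randn(rank, self.n_groups))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # De-quantize on the fly for this simulation (a real kernel stays in INT4).
        w = (self.codes * self.scale + self.zero).reshape(self.codes.shape[0], -1)
        base = x @ w.T
        # Average-pool the input over each channel group before LoRA's A matrix.
        x_pooled = x.view(*x.shape[:-1], self.n_groups, self.group_size).mean(dim=-1)
        return base + x_pooled @ self.lora_A.T @ self.lora_B.T


layer = QALoRALinearSketch(in_features=128, out_features=64)
print(layer(torch.randn(2, 128)).shape)  # torch.Size([2, 64])

Because each output's LoRA correction is constant within a channel group, merging amounts to shifting the stored per-group zero points, which keeps the merged model in INT4 without extra post-training quantization loss.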
PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, and 4 more authors
International Conference on Learning Representations (Spotlight), 2020
Differentiable architecture search (DARTS) provided a fast solution in finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely, Partially-Connected DARTS, by sampling a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising the performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part in a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-network caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in the search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at https://github.com/yuhuixu1993/PC-DARTS.
@article{xu2019pc,
  author={Xu, Yuhui and Xie, Lingxi and Zhang, Xiaopeng and Chen, Xin and Qi, Guo-Jun and Tian, Qi and Xiong, Hongkai},
  title={PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search},
  journal={International Conference on Learning Representations (Spotlight)},
  year={2020}
}
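As a rough illustration of the partial channel connections described above, the PyTorch sketch below routes only 1/K of the channels through the weighted candidate operations, sends the remaining channels through a shortcut, and shuffles channels so different subsets are sampled across iterations. This is not the released code (see https://github.com/yuhuixu1993/PC-DARTS); the tiny candidate-operation set and the module names are assumptions for brevity, and edge normalization is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_shuffle(x, groups):
    # Interleave channel groups so the sampled subset changes over iterations.
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class PartialChannelMixedOp(nn.Module):
    """Mixed operation over a sampled 1/K fraction of channels (illustrative sketch)."""

    def __init__(self, channels, k=4):
        super().__init__()
        self.k = k
        c_part = channels // k
        # A small illustrative candidate set (the paper uses the full DARTS space).
        self.ops = nn.ModuleList([
            nn.Identity(),
            nn.Conv2d(c_part, c_part, 3, padding=1, bias=False),
            nn.MaxPool2d(3, stride=1, padding=1),
        ])
        # Architecture parameters alpha, one logit per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(self.ops)))

    def forward(self, x):
        c_part = x.shape[1] // self.k
        x_search, x_bypass = x[:, :c_part], x[:, c_part:]
        weights = F.softmax(self.alpha, dim=-1)
        mixed = sum(w * op(x_search) for w, op in zip(weights, self.ops))
        out = torch.cat([mixed, x_bypass], dim=1)  # bypassed channels act as the shortcut
        return channel_shuffle(out, self.k)


op = PartialChannelMixedOp(channels=16, k=4)
print(op(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])

In the full method, edge normalization additionally introduces per-edge parameters that weight each edge when a node aggregates its inputs, damping the edge-selection noise introduced by channel sampling.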