I am a research scientist at Google Research, focusing on efficient, large-scale foundation models. Previously, I was a research scientist at Salesforce AI Research. Earlier, I was a member of the MIN Lab, advised by Professors Hongkai Xiong and Weiyao Lin, and a visiting student at the CCVL Lab, advised by Prof. Alan Yuille.
@article{xu2025scalable,title={Scalable Chain of Thoughts via Elastic Reasoning},author={Xu, Yuhui and Dong, Hanze and Wang, Lei and Sahoo, Doyen and Li, Junnan and Xiong, Caiming},journal={International Conference on Learning Representations},year={2026}}
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Baohao Liao* , Yuhui Xu* , Hanze Dong* , and 5 more authors
International Conference on Machine Learning, 2025
@article{liao2025reward,title={Reward-Guided Speculative Decoding for Efficient LLM Reasoning},author={Liao*, Baohao and Xu*, Yuhui and Dong*, Hanze and Li, Junnan and Monz, Christof and Savarese, Silvio and Sahoo, Doyen and Xiong, Caiming},journal={International Conference on Machine Learning},year={2025},note={* = equal contribution}}
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu , Zhanming Jie , Hanze Dong , and 6 more authors
International Conference on Learning Representations (Spotlight), 2025
@article{xu2024think,title={ThinK: Thinner Key Cache by Query-Driven Pruning},author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},journal={International Conference on Learning Representations (Spotlight)},year={2025},}
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Xudong Lu , Qi Liu , Yuhui Xu , and 5 more authors
The 62nd Annual Meeting of the Association for Computational Linguistics, 2024
@article{lu2024not,author={Lu, Xudong and Liu, Qi and Xu, Yuhui and Zhou, Aojun and Huang, Siyuan and Zhang, Bo and Yan, Junchi and Li, Hongsheng},title={Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models},journal={The 62nd Annual Meeting of the Association for Computational Linguistics},year={2024},}
QA-LoRA: Quantization-aware low-rank adaptation of large language models
Yuhui Xu , Lingxi Xie , Xiaotao Gu , and 6 more authors
International Conference on Learning Representations, 2024
Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
@article{xu2023qa,author={Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhensu and Zhang, Xiaopeng and Tian, Qi},title={QA-LoRA: Quantization-aware low-rank adaptation of large language models},journal={International Conference on Learning Representations},year={2024},}
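The group-wise operator is the core of the abstract above: each group of weights gets its own quantization parameters, which is the extra degree of freedom that the pooled LoRA correction can later be folded into. The following is a minimal, hypothetical Python sketch of asymmetric group-wise quantization (one scale and zero-point per group); function names are illustrative, and this is not the released QA-LoRA implementation.

```python
# Hypothetical sketch of group-wise asymmetric quantization, the building
# block behind QA-LoRA's group-wise operators (illustrative, not the paper's code).

def quantize_groupwise(w, group_size, bits=4):
    """Quantize a flat weight list with one (scale, zero) pair per group."""
    qmax = 2 ** bits - 1
    q, params = [], []
    for g in range(0, len(w), group_size):
        grp = w[g:g + group_size]
        lo, hi = min(grp), max(grp)
        scale = (hi - lo) / qmax or 1.0   # avoid zero scale for constant groups
        q.extend(round((x - lo) / scale) for x in grp)
        params.append((scale, lo))        # per-group scale and zero-point
    return q, params

def dequantize_groupwise(q, params, group_size):
    """Recover approximate weights from integers plus per-group parameters."""
    return [qx * params[i // group_size][0] + params[i // group_size][1]
            for i, qx in enumerate(q)]
```

Because each group owns its zero-point, a per-group correction learned during fine-tuning can be absorbed by shifting these zero-points, which is how the adapted model stays in quantized form after merging.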
PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
Yuhui Xu , Lingxi Xie , Xiaopeng Zhang , and 4 more authors
International Conference on Learning Representations (Spotlight), 2020
Differentiable architecture search (DARTS) provided a fast solution for finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely, Partially-Connected DARTS, which samples a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part through a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-network caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at https://github.com/yuhuixu1993/PC-DARTS.
@article{xu2019pc,author={Xu, Yuhui and Xie, Lingxi and Zhang, Xiaopeng and Chen, Xin and Qi, Guo-Jun and Tian, Qi and Xiong, Hongkai},title={PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search},journal={International Conference on Learning Representations (Spotlight)},year={2020},}
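The partial-channel idea in the abstract above can be sketched in a few lines. The toy Python below treats each "channel" as a scalar and each candidate operation as a scalar function, which is a deliberate simplification of the paper's convolutional super-network (illustrative only, not the paper's PyTorch code; edge normalization is omitted): a random 1/K subset of channels passes through the architecture-weighted mixture of operations while the remaining channels bypass through the shortcut.

```python
# Toy sketch of PC-DARTS partial channel connections
# (scalar channels and ops stand in for real feature maps; hypothetical code).
import random

def partial_channel_forward(x, ops, alphas, k=4):
    """Mix candidate ops on 1/k of the channels; shortcut the rest.

    x: list of per-channel scalars (a toy stand-in for a feature map).
    ops: candidate operations, each a function on one channel value.
    alphas: architecture weights (assumed already softmax-normalized).
    """
    idx = list(range(len(x)))
    random.shuffle(idx)                  # random channel sampling
    chosen = set(idx[:len(x) // k])      # the 1/k searched subset
    out = list(x)                        # bypassed channels keep their input (the shortcut)
    for i in chosen:
        out[i] = sum(a * op(x[i]) for a, op in zip(alphas, ops))
    return out
```

Because only len(x) // k channels enter the mixed operation, the memory and compute of the search step shrink by roughly a factor of K, which is what allows the larger batch sizes the abstract mentions.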