I am a research scientist at Google Research, focusing on efficient, large-scale foundation models. Previously, I was a research scientist at Salesforce AI Research. Earlier, I was a member of the MIN Lab, advised by Professors Hongkai Xiong and Weiyao Lin, and a visiting student at the CCVL Lab, advised by Prof. Alan Yuille.
@article{xu2025scalable,title={Scalable Chain of Thoughts via Elastic Reasoning},author={Xu, Yuhui and Dong, Hanze and Wang, Lei and Sahoo, Doyen and Li, Junnan and Xiong, Caiming},journal={International Conference on Learning Representations},year={2026}}
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Baohao Liao* , Yuhui Xu* , Hanze Dong* , and 5 more authors
International Conference on Machine Learning, 2025
@article{liao2025reward,title={Reward-Guided Speculative Decoding for Efficient LLM Reasoning},author={Liao*, Baohao and Xu*, Yuhui and Dong*, Hanze and Li, Junnan and Monz, Christof and Savarese, Silvio and Sahoo, Doyen and Xiong, Caiming},journal={International Conference on Machine Learning},year={2025},note={* = equal contribution}}
ThinK: Thinner Key Cache by Query-Driven Pruning
Yuhui Xu , Zhanming Jie , Hanze Dong , and 6 more authors
International Conference on Learning Representations (Spotlight), 2025
@article{xu2024think,title={ThinK: Thinner Key Cache by Query-Driven Pruning},author={Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},journal={International Conference on Learning Representations (Spotlight)},year={2025},}
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models
Xudong Lu , Qi Liu , Yuhui Xu , and 5 more authors
The 62nd Annual Meeting of the Association for Computational Linguistics, 2024
@article{lu2024not,author={Lu, Xudong and Liu, Qi and Xu, Yuhui and Zhou, Aojun and Huang, Siyuan and Zhang, Bo and Yan, Junchi and Li, Hongsheng},title={Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models},journal={The 62nd Annual Meeting of the Association for Computational Linguistics},year={2024},}
QA-LoRA: Quantization-aware low-rank adaptation of large language models
Yuhui Xu , Lingxi Xie , Xiaotao Gu , and 6 more authors
International Conference on Learning Representations, 2024
Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degree of freedom of quantization while decreasing that of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness on different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
@article{xu2023qa,author={Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhensu and Zhang, Xiaopeng and Tian, Qi},title={QA-LoRA: Quantization-aware low-rank adaptation of large language models},journal={International Conference on Learning Representations},year={2024},}
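The group-wise operator is the core of the abstract above: each group of weights gets its own quantization parameters, which is the extra degree of freedom that the pooled LoRA correction can later be folded into. The following is a minimal, hypothetical Python sketch of asymmetric group-wise quantization (one scale and zero-point per group); function names are illustrative, and this is not the released QA-LoRA implementation.

```python
# Hypothetical sketch of group-wise asymmetric quantization, the building
# block behind QA-LoRA's group-wise operators (illustrative, not the paper's code).

def quantize_groupwise(w, group_size, bits=4):
    """Quantize a flat weight list with one (scale, zero) pair per group."""
    qmax = 2 ** bits - 1
    q, params = [], []
    for g in range(0, len(w), group_size):
        grp = w[g:g + group_size]
        lo, hi = min(grp), max(grp)
        scale = (hi - lo) / qmax or 1.0   # avoid zero scale for constant groups
        q.extend(round((x - lo) / scale) for x in grp)
        params.append((scale, lo))        # per-group scale and zero-point
    return q, params

def dequantize_groupwise(q, params, group_size):
    """Recover approximate weights from integers plus per-group parameters."""
    return [qx * params[i // group_size][0] + params[i // group_size][1]
            for i, qx in enumerate(q)]
```

Because each group owns its zero-point, a per-group correction learned during fine-tuning can be absorbed by shifting these zero-points, which is how the adapted model stays in quantized form after merging.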
PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search
Yuhui Xu , Lingxi Xie , Xiaopeng Zhang , and 4 more authors
International Conference on Learning Representations (Spotlight), 2020
Differentiable architecture search (DARTS) provided a fast solution for finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely, Partially-Connected DARTS, which samples a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part through a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-network caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at https://github.com/yuhuixu1993/PC-DARTS.
@article{xu2019pc,author={Xu, Yuhui and Xie, Lingxi and Zhang, Xiaopeng and Chen, Xin and Qi, Guo-Jun and Tian, Qi and Xiong, Hongkai},title={PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search},journal={International Conference on Learning Representations (Spotlight)},year={2020},}
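The partial-channel idea in the abstract above can be sketched in a few lines. The toy Python below treats each "channel" as a scalar and each candidate operation as a scalar function, which is a deliberate simplification of the paper's convolutional super-network (illustrative only, not the paper's PyTorch code; edge normalization is omitted): a random 1/K subset of channels passes through the architecture-weighted mixture of operations while the remaining channels bypass through the shortcut.

```python
# Toy sketch of PC-DARTS partial channel connections
# (scalar channels and ops stand in for real feature maps; hypothetical code).
import random

def partial_channel_forward(x, ops, alphas, k=4):
    """Mix candidate ops on 1/k of the channels; shortcut the rest.

    x: list of per-channel scalars (a toy stand-in for a feature map).
    ops: candidate operations, each a function on one channel value.
    alphas: architecture weights (assumed already softmax-normalized).
    """
    idx = list(range(len(x)))
    random.shuffle(idx)                  # random channel sampling
    chosen = set(idx[:len(x) // k])      # the 1/k searched subset
    out = list(x)                        # bypassed channels keep their input (the shortcut)
    for i in chosen:
        out[i] = sum(a * op(x[i]) for a, op in zip(alphas, ops))
    return out
```

Because only len(x) // k channels enter the mixed operation, the memory and compute of the search step shrink by roughly a factor of K, which is what allows the larger batch sizes the abstract mentions.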