Yuhui Xu

I am a research scientist at Salesforce AI Research. I was part of the MIN LAB, advised by Prof. Hongkai Xiong and Prof. Weiyao Lin, and I was a visiting student at the CCVL LAB, advised by Prof. Alan Yuille. Prior to SJTU, I obtained my B.S. degree from Chien-Shiung Wu College, Southeast University, in 2016.

News

May 15, 2025	One paper accepted to ACL 2025 (Main Conference, Oral)
May 01, 2025	One paper accepted to ICML 2025
Jan 23, 2025	One paper accepted to ICLR 2025 (Spotlight)
Jan 23, 2025	One paper accepted to WWW 2025
May 16, 2024	One paper accepted to ACL 2024 (Main Conference)

Selected Publications

Fractured Chain-of-Thought Reasoning

Baohao Liao*, Hanze Dong*, Yuhui Xu*, and 4 more authors

arXiv preprint arXiv:2505.12992, 2025

* = equal contribution

@article{liao2025fractured,
  title = {Fractured Chain-of-Thought Reasoning},
  author = {Liao*, Baohao and Dong*, Hanze and Xu*, Yuhui and Sahoo, Doyen and Monz, Christof and Li, Junnan and Xiong, Caiming},
  journal = {arXiv preprint arXiv:2505.12992},
  year = {2025},
  note = {* = equal contribution}
}

Scalable Chain of Thoughts via Elastic Reasoning

Yuhui Xu, Hanze Dong, Lei Wang, and 3 more authors

arXiv preprint arXiv:2505.05315, 2025

@article{xu2025scalable,
  title = {Scalable Chain of Thoughts via Elastic Reasoning},
  author = {Xu, Yuhui and Dong, Hanze and Wang, Lei and Sahoo, Doyen and Li, Junnan and Xiong, Caiming},
  journal = {arXiv preprint arXiv:2505.05315},
  year = {2025}
}

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

Ke Yi*, Yuhui Xu*, Heng Chang, and 4 more authors

The 63rd Annual Meeting of the Association for Computational Linguistics (Oral), 2025

* = equal contribution

@article{yi2024one,
  title = {One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments},
  author = {Yi*, Ke and Xu*, Yuhui and Chang, Heng and Tang, Chen and Meng, Yuan and Zhang, Tong and Li, Jia},
  journal = {The 63rd Annual Meeting of the Association for Computational Linguistics (Oral)},
  year = {2025},
  note = {* = equal contribution}
}

Reward-Guided Speculative Decoding for Efficient LLM Reasoning

Baohao Liao*, Yuhui Xu*, Hanze Dong*, and 5 more authors

International Conference on Machine Learning, 2025

* = equal contribution

@article{liao2025reward,
  title = {Reward-Guided Speculative Decoding for Efficient LLM Reasoning},
  author = {Liao*, Baohao and Xu*, Yuhui and Dong*, Hanze and Li, Junnan and Monz, Christof and Savarese, Silvio and Sahoo, Doyen and Xiong, Caiming},
  journal = {International Conference on Machine Learning},
  year = {2025},
  note = {* = equal contribution}
}

ThinK: Thinner Key Cache by Query-Driven Pruning

Yuhui Xu, Zhanming Jie, Hanze Dong, and 6 more authors

International Conference on Learning Representations (Spotlight), 2025

@article{xu2024think,
  title = {ThinK: Thinner Key Cache by Query-Driven Pruning},
  author = {Xu, Yuhui and Jie, Zhanming and Dong, Hanze and Wang, Lei and Lu, Xudong and Zhou, Aojun and Saha, Amrita and Xiong, Caiming and Sahoo, Doyen},
  journal = {International Conference on Learning Representations (Spotlight)},
  year = {2025},
}

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

Xudong Lu*, Aojun Zhou*, Yuhui Xu*, and 3 more authors

International Conference on Machine Learning, 2024

* = equal contribution

@article{lu2024spp,
  author = {Lu*, Xudong and Zhou*, Aojun and Xu*, Yuhui and Zhang, Renrui and Gao, Peng and Li, Hongsheng},
  title = {SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models},
  journal = {International Conference on Machine Learning},
  year = {2024},
  note = {* = equal contribution}
}

QA-LoRA: Quantization-aware low-rank adaptation of large language models

Yuhui Xu, Lingxi Xie, Xiaotao Gu, and 6 more authors

International Conference on Learning Representations, 2024

Recent years have witnessed a rapid development of large language models (LLMs). Despite their strong ability in many language-understanding tasks, the heavy computational burden largely restricts the application of LLMs, especially when one needs to deploy them onto edge devices. In this paper, we propose a quantization-aware low-rank adaptation (QA-LoRA) algorithm. The motivation lies in the imbalanced degrees of freedom of quantization and adaptation, and the solution is to use group-wise operators which increase the degrees of freedom of quantization while decreasing those of adaptation. QA-LoRA is easily implemented with a few lines of code, and it equips the original LoRA with two-fold abilities: (i) during fine-tuning, the LLM’s weights are quantized (e.g., into INT4) to reduce time and memory usage; (ii) after fine-tuning, the LLM and auxiliary weights are naturally integrated into a quantized model without loss of accuracy. We apply QA-LoRA to the LLaMA and LLaMA2 model families and validate its effectiveness in different fine-tuning datasets and downstream scenarios. The code is made available at https://github.com/yuhuixu1993/qa-lora.
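
The group-wise idea in the abstract can be illustrated with a toy sketch in plain Python. This is not the released QA-LoRA code: the function names, the min-max quantizer, and the flat-list weight layout are all assumptions made for illustration. It shows only the property the abstract relies on: if the adapter contributes one constant offset per quantization group, that offset can be folded into the group's zero point, so the integer codes stay untouched after merging.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Min-max quantize a flat list of weights, one (scale, zero) pair per group."""
    levels = 2 ** bits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        # Integer codes in [0, levels]; the clamp guards against float round-off.
        codes = [min(levels, max(0, round((w - lo) / scale))) for w in group]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Reconstruct approximate weights as zero + scale * code."""
    out = []
    for scale, zero, codes in groups:
        out.extend(zero + scale * c for c in codes)
    return out


def merge_group_offsets(groups, offsets):
    """Fold one adapter offset per group into that group's zero point.

    Assumption for this sketch: the adapter update is constant within each
    group, so merging changes only the zero point, never the integer codes.
    """
    return [(scale, zero + delta, codes)
            for (scale, zero, codes), delta in zip(groups, offsets)]
```

With a group size of 4, eight weights yield two groups, and `merge_group_offsets(groups, [0.5, -0.25])` shifts every reconstructed weight in the first group by 0.5 and in the second by -0.25 while leaving the stored codes as they were.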

@article{xu2023qa,
  author = {Xu, Yuhui and Xie, Lingxi and Gu, Xiaotao and Chen, Xin and Chang, Heng and Zhang, Hengheng and Chen, Zhensu and Zhang, Xiaopeng and Tian, Qi},
  title = {QA-LoRA: Quantization-aware low-rank adaptation of large language models},
  journal = {International Conference on Learning Representations},
  year = {2024},
}

PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search

Yuhui Xu, Lingxi Xie, Xiaopeng Zhang, and 4 more authors

International Conference on Learning Representations (Spotlight), 2020

Differentiable architecture search (DARTS) provided a fast solution in finding effective network architectures, but suffered from large memory and computing overheads in jointly training a super-network and searching for an optimal architecture. In this paper, we present a novel approach, namely, Partially-Connected DARTS, by sampling a small part of the super-network to reduce the redundancy in exploring the network space, thereby performing a more efficient search without compromising the performance. In particular, we perform operation search in a subset of channels while bypassing the held-out part in a shortcut. This strategy may suffer from an undesired inconsistency in selecting the edges of the super-net caused by sampling different channels. We alleviate it using edge normalization, which adds a new set of edge-level parameters to reduce uncertainty in search. Thanks to the reduced memory cost, PC-DARTS can be trained with a larger batch size and, consequently, enjoys both faster speed and higher training stability. Experimental results demonstrate the effectiveness of the proposed method. Specifically, we achieve an error rate of 2.57% on CIFAR10 with merely 0.1 GPU-days for architecture search, and a state-of-the-art top-1 error rate of 24.2% on ImageNet (under the mobile setting) using 3.8 GPU-days for search. Our code has been made available at https://github.com/yuhuixu1993/PC-DARTS.
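
The two mechanisms named in the abstract, partial channel connections and edge normalization, can be sketched in plain Python. This is a toy illustration, not the released PC-DARTS code: each channel is stood in for by a single float, the channel subset is drawn uniformly at random, and the names `partial_channel_mixed_op` and `edge_normalized_sum` are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def partial_channel_mixed_op(channels, ops, alphas, k, rng):
    """Apply the softmax-weighted mix of candidate ops to 1/k of the channels.

    The remaining channels bypass the ops unchanged (the shortcut path).
    """
    n_selected = len(channels) // k
    selected = set(rng.sample(range(len(channels)), n_selected))
    weights = softmax(alphas)
    out = []
    for idx, x in enumerate(channels):
        if idx in selected:
            out.append(sum(w * op(x) for w, op in zip(weights, ops)))
        else:
            out.append(x)  # bypassed channel
    return out


def edge_normalized_sum(edge_outputs, betas):
    """Combine the outputs of incoming edges with softmax(betas) weights.

    betas are the extra edge-level parameters; they are shared across
    channels, so they do not depend on which channels were sampled.
    """
    weights = softmax(betas)
    n_channels = len(edge_outputs[0])
    return [sum(w * edge[c] for w, edge in zip(weights, edge_outputs))
            for c in range(n_channels)]
```

For example, with eight channels and `k=4`, only two channels pass through the candidate operations on each call, which is where the memory saving described above comes from in this simplified picture.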

@article{xu2019pc,
  author = {Xu, Yuhui and Xie, Lingxi and Zhang, Xiaopeng and Chen, Xin and Qi, Guo-Jun and Tian, Qi and Xiong, Hongkai},
  title = {PC-DARTS: Partial Channel Connections for Memory-Efficient Architecture Search},
  journal = {International Conference on Learning Representations (Spotlight)},
  year = {2020},
}