Research Topics
(Updated Jul 2025) Full publications available on Google Scholar, DBLP, arXiv.
Research Summary: My lab's research focuses on the principles and practice of machine intelligence, with an emphasis on generalization and on making machine learning more reliable. Our applied research includes applications to healthcare, biomedical imaging, and cognitive neuroscience.
* indicates equal contribution. ** indicates alphabetical author order. ‡ indicates authors working closely with me.
Efficient and Scalable World Foundation Models
We explore how to design general-purpose algorithms that enable world foundation models to align efficiently across tasks, modalities, and data scales. Our focus is on task-agnostic pretraining, scalable adaptation, and architectures that support transfer and compositionality. We aim to unify learning across heterogeneous domains while minimizing supervision and compute overhead. To achieve this, we develop methods that support dynamic task alignment, modular learning, and cross-domain generalization. These algorithmic advances lay the groundwork for building universal models applicable to science, medicine, and beyond.
Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
Tiansheng Wen*, Yifei Wang*, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You ICML 2025 / Paper / Code / Hugging Face Blog / X (Formerly Twitter) Oral Presentation
Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
(Robust) Machine Learning for Imperfect Data
Developing machine learning models typically requires collecting substantial annotated data, which is especially difficult when labels are scarce. Moreover, large datasets often exhibit long-tailed class distributions or subpopulation shifts, which lead to notable imbalance issues. Consequently, there is growing interest in training machine learning models jointly across imbalanced subpopulation distributions and with limited annotations. We are developing novel algorithmic and computational approaches to ensure the efficiency and robustness of large machine learning models. Our applied research includes applications to healthcare, biomedical imaging, and cognitive neuroimaging.
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Chenyu You*, Yifei Min*, Weicheng Dai*, Jasjeet S Sekhon, Lawrence Staib, James S Duncan CVPR 2024 / Paper / Code
Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels
Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
Bootstrapping Semi-supervised Medical Image Segmentation with Anatomical-aware Contrastive Distillation
SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation
Learning with Theoretical Guarantees
As machine learning methods have become ubiquitous in human decision-making, their reliability and interpretability have become important. This is particularly crucial in domains where decisions carry significant consequences: there, interpretable models can uncover crucial but unexpected patterns that complex models often obscure. We are currently studying provably interpretable modeling with theoretical guarantees. We are also exploring structured sparsity and attention in deep neural networks to enable interpretability.
Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels
Chenyu You*, Weicheng Dai*, Fenglin Liu, Yifei Min, Xiaoxiao Li, David A. Clifton, Lawrence Staib, James S Duncan IEEE TPAMI 2024 / Paper / Code
Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
ACTION++: Improving Semi-supervised Medical Image Segmentation with Adaptive Anatomical Contrast
Class-Aware Adversarial Transformers for Medical Image Segmentation
Learning with Multi-Modality Data
Multi-modality data is ubiquitous in healthcare and science applications. We are pursuing various techniques for modeling such data, primarily using probabilistic graphical models and other statistical analyses, with the main goal of facilitating biomedical research. We are also developing tools to tackle real-world challenges associated with data heterogeneity. Of particular interest are novel methods that address robustness issues, such as confounding, as well as trustworthy computational approaches, with primary applications in healthcare, biomedical imaging, and cognitive neuroscience.
UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You NAACL 2025 / Paper / Code Oral Presentation
End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation
Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
World Foundation Models for Biomedical Data
The development of medical world foundation models often requires massive and diverse biomedical data. To this end, I have developed various world foundation models for biomedical imaging data and explored novel applications of these models. I have also developed novel biomedical AI agents that enable scalable and accurate predictive modeling, particularly under distribution shift.
Uncovering Memorization Effect in the Presence of Spurious Correlations
Chenyu You*, Haocheng Dai*, Yifei Min*, Jasjeet S. Sekhon, Sarang Joshi, James S. Duncan Nature Communications 2025 / Paper / Code
OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport
Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
Segment Anything in Medical Images
Implicit Anatomical Rendering for Medical Image Segmentation with Stochastic Experts