| Research Topics (Updated Jul 2025)
Full publications available on Google Scholar, DBLP, arXiv.

Research Summary: The research of my lab focuses on the principles and practice of machine intelligence, with an emphasis on generalization and on making machine learning more reliable. Our applied research includes applications to healthcare, biomedical imaging, and cognitive neuroscience.

* indicates equal contribution. ** indicates alphabetic author order. ‡ indicates authors working closely with me.

Efficient and Scalable World Foundation Models
We explore how to design general-purpose algorithms that enable world foundation models to align efficiently across tasks, modalities, and data scales. Our focus is on task-agnostic pretraining, scalable adaptation, and architectures that support transfer and compositionality. We aim to unify learning across heterogeneous domains while minimizing supervision and compute overhead. To achieve this, we develop methods that support dynamic task alignment, modular learning, and cross-domain generalization. These algorithmic advances lay the groundwork for building universal models applicable to science, medicine, and beyond.
 
      Beyond Matryoshka: Revisiting Sparse Coding for Adaptive Representation
    
 Tiansheng Wen*, Yifei Wang*, Zequn Zeng, Zhong Peng, Yudi Su, Xinyang Liu, Bo Chen, Hongwei Liu, Stefanie Jegelka, Chenyu You
 ICML 2025 (Oral Presentation) / Paper / Code / Hugging Face Blog / X (Formerly Twitter)
      Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
    
 
      UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
    
 (Robust) Machine Learning for Imperfect Data
 Developing machine learning models typically requires collecting substantial annotated data, which is difficult when labels are scarce. Moreover, massive datasets often exhibit long-tailed class distributions or subpopulation shifts, which lead to notable imbalance issues. As a result, there is growing interest in training machine learning models jointly across imbalanced subpopulation distributions and limited annotations. We are developing novel algorithmic and computational approaches to ensure the efficiency and robustness of large machine learning models. Our applied research includes applications to healthcare, biomedical imaging, and cognitive neuroimaging.
 
      Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
    
 Chenyu You*, Yifei Min*, Weicheng Dai*, Jasjeet S Sekhon, Lawrence Staib, James S Duncan
 CVPR 2024 / Paper / Code
      Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels
    
 
      Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
    
 
      Bootstrapping Semi-supervised Medical Image Segmentation with Anatomical-aware Contrastive Distillation
    
 
      SimCVD: Simple Contrastive Voxel-Wise Representation Distillation for Semi-Supervised Medical Image Segmentation
    
 World Foundation Models for Biomedical Data
 The development of medical world foundation models often requires massive and diverse biomedical data. To this end, I have developed various world foundation models for biomedical imaging data and explored novel applications of these models. I have also developed novel biomedical AI agents that enable scalable and accurate predictive modeling, particularly for distribution shift problems.
 
      Uncovering Memorization Effect in the Presence of Spurious Correlations
    
 Chenyu You*, Haocheng Dai*, Yifei Min*, Jasjeet S. Sekhon, Sarang Joshi, James S. Duncan
 Nature Communications 2025 / Paper / Code
      OTSurv: A Novel Multiple Instance Learning Framework for Survival Prediction with Heterogeneity-aware Optimal Transport
    
 
      Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations
    
 
      Segment Anything in Medical Images
    
 
     Implicit Anatomical Rendering for Medical Image Segmentation with Stochastic Experts
    
 Learning with Theoretical Guarantees
 As machine learning methods have become ubiquitous in human decision-making, their reliability and interpretability have become important. This is particularly crucial in domains where decisions carry significant consequences: there, interpretable models can uncover crucial but unexpected patterns that complex models often obscure. We are currently studying provably interpretable modeling with theoretical guarantees. We are also exploring structured sparsity and attention in deep neural networks to enable interpretability.
 
      Mine yOur owN Anatomy: Revisiting Medical Image Segmentation with Extremely Limited Labels
    
 Chenyu You*, Weicheng Dai*, Fenglin Liu, Yifei Min, Xiaoxiao Li, David A. Clifton, Lawrence Staib, James S Duncan
 IEEE TPAMI 2024 (ESI Top 1% highly cited paper) / Paper / Code
      Rethinking Semi-Supervised Medical Image Segmentation: A Variance-Reduction Perspective
    
 
      ACTION++: Improving Semi-supervised Medical Image Segmentation with Adaptive Anatomical Contrast
    
 
      Class-Aware Adversarial Transformers for Medical Image Segmentation
    
 Learning with Multi-Modality Data
 Multi-modality data is ubiquitous in healthcare and science applications. We are pursuing various techniques for modeling such data, primarily using probabilistic graphical models and other statistical analyses. These tools are primarily used to facilitate biomedical research. We are also developing tools to tackle real-world challenges associated with data heterogeneity. Of particular interest are novel methods that address robustness issues, such as confounding, as well as trustworthy computational approaches, with primary applications in healthcare, biomedical imaging, and cognitive neuroscience.
 
      UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers
    
 Dehai Min, Zhiyang Xu, Guilin Qi, Lifu Huang, Chenyu You
 NAACL 2025 (Oral Presentation) / Paper / Code
      End-to-end Spoken Conversational Question Answering: Task, Dataset and Model
    
 
      Auto-Encoding Knowledge Graph for Unsupervised Medical Report Generation
    
 
      Self-supervised Contrastive Cross-Modality Representation Learning for Spoken Question Answering
    
 |