Pronouns: she/her
My name in Chinese: 施 惟佳
Email: [email protected]
<aside> Google Scholar
</aside>
<aside> X
</aside>
I am Weijia Shi, a PhD student in Computer Science at the University of Washington, advised by Luke Zettlemoyer and Noah A. Smith. I am currently a student researcher at the Allen Institute for AI. During my PhD, I spent two years as a visiting researcher at Meta AI, working closely with Scott Yih and Mike Lewis. Prior to UW, I graduated from UCLA with a B.S. in Computer Science and Applied Mathematics.
<aside> 🌱 What’s NEW
☑️ I will be on the job market for 2026. Please reach out if you think my background and experience could be a good fit for your organization.
☑️ Released 💪**FlexOlmo (NeurIPS 25 Spotlight)**, a mixture-of-experts LM enabling co-development of AI through data collaboration (Video | Blog | Wired Coverage | Tweet | Interest Form)
☑️ **s1: Simple test-time scaling** (GitHub 6.5K 🌟) wins Best Paper Award at an ICLR workshop
☑️ Released 🎨**LMFusion (NeurIPS 25)**, an efficient recipe for building unified multimodal models.
☑️ **Don't Hallucinate, Abstain** wins ACL 2024 Outstanding Paper Award
☑️ 🧑🏫 **Instructor** embedding model reached 10 million downloads
</aside>
My research is at the intersection of natural language processing and machine learning, with a focus on large language models (LMs). I aim to develop LMs that interact with large-scale data in new ways. Current LM development follows a "monolithic" paradigm in which all data is centralized and encoded into a single model during training. This framework makes targeted updates, such as adding knowledge, improving behaviors, or removing data, nearly impossible without (very expensive) retraining. To address these limitations, my work spans three areas:
Please see my Google Scholar for the full list of publications.
(*: equal contribution)
FlexOlmo: Open Language Models for Flexible Data Use
Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
NeurIPS 2025 (Spotlight). [paper][website][code]
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi*, Xiaochuang Han*, Chunting Zhou, Weixin Liang, Xi Victoria Lin, Luke Zettlemoyer, Lili Yu
NeurIPS 2025.
MUSE: Machine Unlearning Six-Way Evaluation for Language Models
Weijia Shi*, Jaechan Lee*, Yangsibo Huang*, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, Chiyuan Zhang
ICLR 2025. [paper][website][code]
Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He*, Yangsibo Huang*, Weijia Shi*, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
ICLR 2025. [paper][website][code]
Evaluating Copyright Takedown Methods for Language Models
Boyi Wei*, Weijia Shi*, Yangsibo Huang*, Noah A. Smith, Chiyuan Zhang, Luke Zettlemoyer, Kai Li, Peter Henderson
NeurIPS 2024. [paper][website][code]
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu*, Weijia Shi*, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, Ranjay Krishna
NeurIPS 2024. [paper][website][code]
Don't Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration
Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Vidhisha Balachandran, Yulia Tsvetkov
ACL 2024 🏆 Outstanding Paper Award. [paper][code]
Knowledge Card: Filling LLMs' Knowledge Gaps with Plug-in Specialized Language Models
Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, Yulia Tsvetkov.
ICLR 2024 (Oral). [paper][code]
In-Context Pretraining: Language Modeling Beyond Document Boundaries
Weijia Shi, Sewon Min, Maria Lomeli, Chunting Zhou, Margaret Li, Xi Victoria Lin, Noah A. Smith, Luke Zettlemoyer, Scott Yih, Mike Lewis
ICLR 2024 (Spotlight). [paper][code]
Detecting Pretraining Data from Large Language Models
Weijia Shi*, Anirudh Ajith*, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
ICLR 2024. [paper][website][code]
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore
Sewon Min*, Suchin Gururangan*, Eric Wallace, Weijia Shi, Hannaneh Hajishirzi, Noah A. Smith, Luke Zettlemoyer.
ICLR 2024 (Spotlight). [paper][code]
Trusting Your Evidence: Hallucinate Less with Context-aware Decoding.
Weijia Shi*, Xiaochuang Han*, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, Scott Wen-tau Yih
REPLUG: Retrieval-Augmented Black-Box Language Models
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, Wen-tau Yih
One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Hongjin Su*, Weijia Shi*, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Scott Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, Tao Yu
ACL 2023. [paper][website][model (🌟 10M downloads on HuggingFace)]
Toward Human Readable Prompt Tuning: Kubrick’s The Shining is a good movie, and a good prompt too?