Hello!
I am a final-year PhD student at the Paul G. Allen School of Computer Science & Engineering,
University of Washington.
I am fortunate to be co-advised by Hannaneh Hajishirzi and Noah Smith.
I am also a student researcher at AI2.
I have previously interned at Meta AI and Microsoft Research Asia.
Prior to UW, I earned a master's degree at Peking University and a bachelor's degree at Shanghai Jiao Tong University.
These days, I am excited about data-centric approaches for understanding and advancing
AI systems.
I believe that data can serve as an effective, sustainable, auditable, and beneficial ground for future human-AI collaboration and mutual improvement.
Here are some topics I have been thinking about recently:
- Data Curation.
Large-scale data curation has powered several AI breakthroughs.
I have led or contributed to several widely used datasets,
including Super-NaturalInstructions for unifying NLP tasks,
MultiModalQA for multimodal question answering,
DROP for machine reasoning,
and DuReader for machine reading comprehension.
I look forward to new paradigms and algorithms for constructing and maintaining high-quality data, particularly for more complex scenarios.
- Synthetic Data.
Generative models have made it easy to produce data at scale.
Our work Self-Instruct pioneered the idea of using model-generated data
to improve models. It also raised many interesting questions, such as: Can models really achieve self-improvement?
Can models generate meaningful information that did not exist before? How can we design better data synthesis pipelines and algorithms?
- Data Attribution.
I envision a future where individuals and organizations can collaboratively contribute data to AI systems,
and benefit in return when their data proves useful.
This requires a new data attribution framework that is fair, transparent, and efficient.
I am interested in developing such a framework and the algorithms it requires, as well as its potential impact on society.
- Open Science of Language Models.
I believe that openness is the scientific, trustworthy, and sustainable way to advance language models.
Throughout my PhD, I have been actively contributing to the development of open language models (OLMo) at AI2.
I also co-led the Tülu series of work (1, 2, and 2.5), exploring post-training techniques.
We release all our artifacts to facilitate open understanding and research of language models.
I am on the job market for Fall 2024!
Please feel free to reach out if you would like to share opportunities, collaborate, or just chat :)
Selected Publications
* indicates equal contribution.
For a full list of publications, please refer to my Google Scholar page.
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
Set the Clock: Temporal Alignment of Pretrained Language Models
Bowen Zhao*, Zander Brumbaugh*, Yizhong Wang*, Hannaneh Hajishirzi, Noah A. Smith
OLMo: Accelerating the Science of Language Models (Best Theme Paper)
Groeneveld et al.
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources (Spotlight)
Yizhong Wang*, Hamish Ivison*, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Yizhong Wang*, Swaroop Mishra*, Pegah Alipoormolabashi, Yeganeh Kordi et al.
Probing Across Time: What Does RoBERTa Know and When?
Leo Z. Liu*, Yizhong Wang*, Jungo Kasai, Hannaneh Hajishirzi, Noah A. Smith
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi
Do Neural NLP Models Know Numbers? Probing Numeracy in Embeddings
Eric Wallace*, Yizhong Wang*, Sujian Li, Sameer Singh, Matt Gardner
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, Matt Gardner
Multi-Passage Machine Reading Comprehension with Cross-Passage Answer Verification
Yizhong Wang, Kai Liu, Jing Liu, Wei He, Yajuan Lyu, Hua Wu, Sujian Li, Haifeng Wang
A Two-Stage Parsing Method for Text-level Discourse Analysis (Outstanding Paper Award)
Yizhong Wang, Sujian Li, Houfeng Wang