UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models


ICLR 2025

Siyuan Wu¹*, Yue Huang²*, Chujie Gao¹, Dongping Chen¹, Qihui Zhang¹, Yao Wan¹†, Tianyi Zhou³, Xiangliang Zhang²†, Jianfeng Gao⁴, Chaowei Xiao⁵, Lichao Sun⁶

¹Huazhong University of Science and Technology, ²University of Notre Dame,
³University of Maryland, College Park, ⁴Microsoft Research,
⁵University of Wisconsin-Madison, ⁶Lehigh University

(* Equal Contribution, † Corresponding Author)


UniGen framework: A comprehensive system for generating diverse, accurate, and controllable text datasets using Large Language Models, integrating original datasets, user-defined constraints, and innovative evaluation and post-processing mechanisms.



Introduction

Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Even so, existing generative frameworks still face challenges in generalization, controllability, diversity, and truthfulness.

To address these challenges, we present UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements.

Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.

Features

Generalization: UniGen supports all textual datasets as input to generate a new dataset.
Diversity: Supports Attribute-Guided Generation, Diverse Example Selection for ICL, and Group Checking to enhance data diversity.
Truthfulness: Equipped with Self-Evaluation, Code-Based Validation, and RAG-Based Validation to ensure truthfulness.
Controllability: Accepts user constraints to make generation more controllable.
Various Applications: Can be applied for dynamic benchmarks or data augmentation.
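
As a rough illustration of the Group Checking idea listed above, the sketch below filters a batch of generated items so that near-duplicates are dropped. It is a minimal sketch under stated assumptions, not UniGen's implementation: the helper name group_check, the 0.8 threshold, and the character-level similarity from Python's difflib are all illustrative choices, since this page does not specify how similarity is measured.

```python
from difflib import SequenceMatcher


def group_check(items, threshold=0.8):
    """Keep items that are not too similar to any item already kept.

    Hypothetical helper: the threshold and the character-level similarity
    measure are illustrative assumptions, not UniGen's actual settings.
    """
    kept = []
    for text in items:
        is_near_duplicate = any(
            SequenceMatcher(None, text.lower(), other.lower()).ratio() >= threshold
            for other in kept
        )
        if not is_near_duplicate:
            kept.append(text)
    return kept


# Toy batch: the second item differs from the first only by a space.
generated = [
    "What is the capital of France?",
    "What is the capital of France ?",
    "Which enzyme unwinds DNA during replication?",
]
filtered = group_check(generated)
print(filtered)
```

A greedy pass like this keeps the first item of each near-duplicate group; a similarity measure based on text embeddings could be swapped in without changing the structure.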

Contributions

We introduce a unified framework, UniGen, specifically designed for generating textual datasets via LLMs. UniGen accepts the original dataset, a dataset description, and user constraints, and integrates different modules to ensure diversity, truthfulness, and controllability during generation.
We carry out extensive experiments to assess the effectiveness of UniGen, covering data characterization, module efficacy, human evaluation, error analysis, and cost analysis. The results affirm that UniGen is proficient in dataset generation and suggest promising directions for future research.
Furthermore, we explore two potential applications of UniGen: benchmarking LLMs and data augmentation. Our findings provide several key insights. For example, I) Most LLMs struggle with math-oriented datasets generated by UniGen (e.g., those based on GSM8K). II) The benchmark performance of LLMs varies significantly across datasets generated by different LLMs. III) LLMs' capabilities across various aspects (e.g., agent-related abilities, reasoning skills) can be improved by fine-tuning on the generated data. IV) Data augmentation still has room for improvement on knowledge-intensive datasets.
Based on these observations and findings, the appendix discusses the limitations of the current framework for dataset generation and proposes potential improvements for future studies. These enhancements are considered from multiple perspectives, including error analysis, downstream applications, and LLM alignment.

Comparison of different dataset generation frameworks. The light-blue checkmark means the work may achieve parts of the goal (not all).
Work	Generalization	Controllability	Diversity	Truthfulness	w/o Human	New Knowledge	Dynamic Benchmark	Data Augmentation
DyVal (2024)	✗	✗	✗	✔	✔	✗	✔	✔
DyVal 2 (2024)	✔	✔	✗	✗	✔	✔	✔	✔
S3Eval (2024)	✗	✔	✗	✗	✔	✗	✔	✗
Yu et al. (2024)	✔	✔	✔	✗	✔	✔	✗	✔
Chung et al. (2023)	✗	✗	✔	✔	✗	✗	✗	✗
Fan et al. (2024)	✗	✗	✗	✔	✔	✗	✔	✗
Jandaghi et al. (2023)	✗	✗	✗	✗	✔	✔	✗	✗
Wang et al. (2024)	✔	✗	✗	✔	✔	✔	✔	✗
UniGen	✔	✔	✔	✔	✔	✔	✔	✔

Application I: Benchmarking

The main results on generated datasets (i.e., gen.) and original datasets (i.e., ori.).
Dataset	GSM8K		MMLU		TruthfulQA		HellaSwag
Model	ori.	gen.	ori.	gen.	ori.	gen.	ori.	gen.
GPT-4 Generation
ChatGPT	0.762	0.665	0.609	0.798	0.825	0.837	0.611	0.960
Claude-3	0.953	0.778	0.810	0.903	0.855	0.919	0.888	0.935
Llama3-70b	0.890	0.689	0.755	0.857	0.750	0.914	0.836	0.949
Llama3-8b	0.800	0.613	0.565	0.741	0.450	0.795	0.684	0.793
Mistral-7b	0.313	0.377	0.490	0.709	0.382	0.738	0.600	0.696
Mixtral-8x7b	0.610	0.509	0.720	0.851	0.640	0.824	0.712	0.511
Yi-34b	0.687	0.637	0.645	0.815	0.485	0.857	0.740	0.572
Claude-3-Opus Generation
ChatGPT	0.762	0.405	0.609	0.802	0.432	0.744	0.538	0.712
GPT-4	0.947	0.508	0.725	0.848	0.841	0.888	0.736	0.835
Llama3-70b	0.890	0.444	0.755	0.846	0.750	0.854	0.836	0.769
Llama3-8b	0.800	0.367	0.565	0.780	0.450	0.709	0.568	0.704
Mistral-7b	0.313	0.158	0.490	0.709	0.380	0.621	0.580	0.690
Mixtral-8x7b	0.610	0.291	0.720	0.717	0.640	0.680	0.600	0.565
Yi-34b	0.687	0.323	0.645	0.751	0.480	0.694	0.644	0.584
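
To make the GSM8K columns of the table above easier to read at a glance, the following sketch averages the original and generated accuracies per generator and counts how many models score lower on the generated set. The numbers are copied from the table; the helper name summarize and the choice of a plain unweighted mean are assumptions made for illustration only.

```python
# GSM8K accuracies copied from the table above, as (original, generated).
gsm8k = {
    "GPT-4 Generation": {
        "ChatGPT": (0.762, 0.665),
        "Claude-3": (0.953, 0.778),
        "Llama3-70b": (0.890, 0.689),
        "Llama3-8b": (0.800, 0.613),
        "Mistral-7b": (0.313, 0.377),
        "Mixtral-8x7b": (0.610, 0.509),
        "Yi-34b": (0.687, 0.637),
    },
    "Claude-3-Opus Generation": {
        "ChatGPT": (0.762, 0.405),
        "GPT-4": (0.947, 0.508),
        "Llama3-70b": (0.890, 0.444),
        "Llama3-8b": (0.800, 0.367),
        "Mistral-7b": (0.313, 0.158),
        "Mixtral-8x7b": (0.610, 0.291),
        "Yi-34b": (0.687, 0.323),
    },
}


def summarize(scores):
    """Return (mean original, mean generated, number of models that drop)."""
    n = len(scores)
    mean_ori = sum(ori for ori, _ in scores.values()) / n
    mean_gen = sum(gen for _, gen in scores.values()) / n
    dropped = sum(1 for ori, gen in scores.values() if gen < ori)
    return mean_ori, mean_gen, dropped


summary = {generator: summarize(scores) for generator, scores in gsm8k.items()}
for generator, (mean_ori, mean_gen, dropped) in summary.items():
    print(f"{generator}: ori={mean_ori:.3f} gen={mean_gen:.3f} dropped={dropped}")
```

On these figures, six of the seven models score lower on the GPT-4-generated GSM8K set than on the original (Mistral-7b is the exception), and all seven score lower on the Claude-3-Opus-generated set.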

Application II: Data Augmentation

Case Study of Generated Examples

MMLU Example

Which of the following statements accurately reflects the process of viral uncoating?

A. Viral uncoating is the removal of the capsid after the virus has entered the host cell, revealing the viral genome. (✔)
B. Viral uncoating is initiated by the virus's interaction with cellular receptors, but not all viruses uncoat immediately upon entry.
C. While some host cell enzymes assist in uncoating, it is not an action localized specifically to the nucleus but can occur in various cellular compartments.
D. Although the host cell's environment influences uncoating, it is not a process directly triggered by the immune response but by viral and cellular mechanisms within the cell.

TruthfulQA Example

Which of the following is a true statement regarding the construction of the Great Pyramid of Giza?

A. The pyramid was built by a workforce of slaves who were forced into labor by the pharaoh.
B. Aliens were responsible for the construction of the Great Pyramid, as humans lacked the technology to build it.
C. The pyramid was constructed by paid laborers who lived in a nearby temporary city. (✔)
D. The Great Pyramid was actually a natural formation, later carved by ancient Egyptians.

GSM8K Example

During a summer camp, children are collecting points through various activities for rewards. If Lucy earns 35 points from art activities, double that amount from sports activities, and loses 15 points for not following the camp rules, how many points does she have at the end of the camp?

Answer: 90
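
The label above can be checked mechanically, in the spirit of the code-based validation mentioned in the introduction. The sketch below is only an assumption about what such a check might look like: the solver function is written by hand here, whereas in a real pipeline it would presumably be produced by an LLM, and the names lucy_points and label_is_consistent are hypothetical.

```python
def lucy_points():
    """Recompute the GSM8K example's answer step by step."""
    art = 35            # points from art activities
    sports = 2 * art    # "double that amount" from sports activities
    penalty = 15        # points lost for not following the camp rules
    return art + sports - penalty


def label_is_consistent(generated_label, solver):
    """Hypothetical check: does the generated label match the computed value?"""
    try:
        return float(generated_label) == float(solver())
    except ValueError:
        # A label that is not numeric cannot be verified this way.
        return False


computed = lucy_points()
is_valid = label_is_consistent("90", lucy_points)
print(computed, is_valid)
```

Here 35 + 70 - 15 gives 90, which agrees with the generated answer; an item whose label disagreed with the computed value would be flagged for correction or removal.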

HellaSwag Example

In a chess tournament, a player uses the Sicilian Defense in response to their opponent's e4 opening. The game progresses with both sides maneuvering for positional advantage.

What happens next?

A. The player using the Sicilian Defense introduces a novelty in the opening to gain an unexpected advantage. (✔)
B. The game is paused as both players decide to switch to playing checkers instead.
C. The opponent immediately resigns, claiming they have never seen the Sicilian Defense before.
D. Spectators start betting on the outcome of a different game happening in the room.

BibTeX

@misc{wu2024unigenunifiedframeworktextual,
      title={UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models}, 
      author={Siyuan Wu and Yue Huang and Chujie Gao and Dongping Chen and Qihui Zhang and Yao Wan and Tianyi Zhou and Xiangliang Zhang and Jianfeng Gao and Chaowei Xiao and Lichao Sun},
      year={2024},
      eprint={2406.18966},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18966}, 
}