UniGen framework: A comprehensive system for generating diverse, accurate, and controllable text datasets using Large Language Models, integrating original datasets, user-defined constraints, and innovative evaluation and post-processing mechanisms.
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Existing generative frameworks nonetheless still face challenges in generalization, controllability, diversity, and truthfulness.
To address these challenges, we present UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements.
Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
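The group checking feature mentioned above is not specified in detail here. As a rough illustration only, the sketch below assumes it means rejecting a newly generated item when it is too similar to one already accepted. The token-level Jaccard similarity, the `group_check` helper, and the `0.8` threshold are all hypothetical stand-ins, not UniGen's actual implementation.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_check(candidates: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a candidate only if it is below the similarity threshold
    against every item kept so far (hypothetical diversity filter)."""
    kept: list[str] = []
    for text in candidates:
        if all(jaccard(text, prior) < threshold for prior in kept):
            kept.append(text)
    return kept


candidates = [
    "What is the capital of France?",
    "What is the capital of france?",   # near-duplicate of the first item
    "How many legs does a spider have?",
]
kept = group_check(candidates)
print(kept)
```

A real pipeline would more plausibly compare sentence embeddings than raw tokens; the greedy keep-or-reject loop stays the same under that substitution.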
Work | Generalization | Controllability | Diversity | Truthfulness | w/o Human. | New Knowledge | Dynamic Benchmarking | Data Augmentation |
---|---|---|---|---|---|---|---|---|
DyVal (2024) | ✗ | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ | ✔ |
DyVal 2 (2024) | ✔ | ✔ | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ |
S3Eval (2024) | ✗ | ✔ | ✗ | ✗ | ✔ | ✗ | ✔ | ✗ |
Yu et al. (2024) | ✔ | ✔ | ✔ | ✗ | ✔ | ✔ | ✗ | ✔ |
Chung et al. (2023) | ✗ | ✗ | ✔ | ✔ | ✗ | ✗ | ✗ | ✗ |
Fan et al. (2024) | ✗ | ✗ | ✗ | ✔ | ✔ | ✗ | ✔ | ✗ |
Jandaghi et al. (2023) | ✗ | ✗ | ✗ | ✗ | ✔ | ✔ | ✗ | ✗ |
Wang et al. (2024) | ✔ | ✗ | ✗ | ✔ | ✔ | ✔ | ✔ | ✗ |
UniGen | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
Model | GSM8K ori. | GSM8K gen. | MMLU ori. | MMLU gen. | TruthfulQA ori. | TruthfulQA gen. | HellaSwag ori. | HellaSwag gen. |
---|---|---|---|---|---|---|---|---|
GPT-4 Generation | | | | | | | | |
ChatGPT | 0.762 | 0.665 | 0.609 | 0.798 | 0.825 | 0.837 | 0.611 | 0.960 |
Claude-3 | 0.953 | 0.778 | 0.810 | 0.903 | 0.855 | 0.919 | 0.888 | 0.935 |
Llama3-70b | 0.890 | 0.689 | 0.755 | 0.857 | 0.750 | 0.914 | 0.836 | 0.949 |
Llama3-8b | 0.800 | 0.613 | 0.565 | 0.741 | 0.450 | 0.795 | 0.684 | 0.793 |
Mistral-7b | 0.313 | 0.377 | 0.490 | 0.709 | 0.382 | 0.738 | 0.600 | 0.696 |
Mixtral-8x7b | 0.610 | 0.509 | 0.720 | 0.851 | 0.640 | 0.824 | 0.712 | 0.511 |
Yi-34b | 0.687 | 0.637 | 0.645 | 0.815 | 0.485 | 0.857 | 0.740 | 0.572 |
Claude-3-Opus Generation | | | | | | | | |
ChatGPT | 0.762 | 0.405 | 0.609 | 0.802 | 0.432 | 0.744 | 0.538 | 0.712 |
GPT-4 | 0.947 | 0.508 | 0.725 | 0.848 | 0.841 | 0.888 | 0.736 | 0.835 |
Llama3-70b | 0.890 | 0.444 | 0.755 | 0.846 | 0.750 | 0.854 | 0.836 | 0.769 |
Llama3-8b | 0.800 | 0.367 | 0.565 | 0.780 | 0.450 | 0.709 | 0.568 | 0.704 |
Mistral-7b | 0.313 | 0.158 | 0.490 | 0.709 | 0.380 | 0.621 | 0.580 | 0.690 |
Mixtral-8x7b | 0.610 | 0.291 | 0.720 | 0.717 | 0.640 | 0.680 | 0.600 | 0.565 |
Yi-34b | 0.687 | 0.323 | 0.645 | 0.751 | 0.480 | 0.694 | 0.644 | 0.584 |
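To make the ori./gen. comparison easier to read, the short sketch below computes the per-model gap between the two GSM8K columns. The numbers are copied from the "GPT-4 Generation" rows of the table above; the variable names are illustrative only.

```python
# (ori., gen.) GSM8K scores copied from the "GPT-4 Generation" rows above.
gsm8k = {
    "ChatGPT": (0.762, 0.665),
    "Claude-3": (0.953, 0.778),
    "Llama3-70b": (0.890, 0.689),
    "Llama3-8b": (0.800, 0.613),
    "Mistral-7b": (0.313, 0.377),
    "Mixtral-8x7b": (0.610, 0.509),
    "Yi-34b": (0.687, 0.637),
}

# Negative delta: the model scores lower on the generated data than on the original.
deltas = {model: round(gen - ori, 3) for model, (ori, gen) in gsm8k.items()}

for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model:<14}{delta:+.3f}")

num_lower = sum(1 for delta in deltas.values() if delta < 0)
print(f"{num_lower} of {len(deltas)} models score lower on the generated data")
```

On these numbers, six of the seven models score lower on the generated GSM8K-style data, with Mistral-7b as the exception.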
Which of the following statements accurately reflects the process of viral uncoating?
Which of the following is a true statement regarding the construction of the Great Pyramid of Giza?
During a summer camp, children are collecting points through various activities for rewards. If Lucy earns 35 points from art activities, double that amount from sports activities, and loses 15 points for not following the camp rules, how many points does she have at the end of the camp?
Answer: 90
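The example above can illustrate the idea behind code-based label verification: recompute the answer programmatically and compare it with the generated label. This is a minimal sketch under the assumption that the check amounts to executing a small program derived from the question; the function `lucy_points` and its parameters are hypothetical and not taken from UniGen.

```python
def lucy_points(art_points: int, penalty: int) -> int:
    """Recompute the answer to the summer-camp question from its stated quantities."""
    sports_points = 2 * art_points          # "double that amount from sports activities"
    return art_points + sports_points - penalty


generated_label = 90                         # the answer attached to the generated question
computed = lucy_points(art_points=35, penalty=15)

# The label is accepted only if the recomputed value matches it.
label_is_consistent = computed == generated_label
print(computed, label_is_consistent)
```

Here 35 + 70 - 15 = 90, so the generated label agrees with the recomputed value.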
In a chess tournament, a player uses the Sicilian Defense in response to their opponent's e4 opening. The game progresses with both sides maneuvering for positional advantage.
What happens next?
@misc{wu2024unigenunifiedframeworktextual,
title={UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models},
author={Siyuan Wu and Yue Huang and Chujie Gao and Dongping Chen and Qihui Zhang and Yao Wan and Tianyi Zhou and Xiangliang Zhang and Jianfeng Gao and Chaowei Xiao and Lichao Sun},
year={2024},
eprint={2406.18966},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2406.18966},
}