Unlocking data synthesis with a conditional generator


Experiments

We conducted experiments on four datasets: three correspond to downstream generative tasks and one to a classification task. Generative tasks are typically more challenging than classification tasks because they are evaluated by next-token prediction accuracy, which requires the synthetic data to preserve fine-grained textual information from the private data. In contrast, classification tasks only require preserving the co-occurrence patterns between labels and words in the private data.

The three generative tasks were chosen to cover a diverse set of practical scenarios: PubMed (medical paper abstracts), Chatbot Arena (human-to-machine interactions), and Multi-Session Chat (human-to-human daily dialogues). To evaluate the quality of the generated synthetic data, we followed the setup of Aug-PE: train a small downstream language model on the synthetic data and then compute its next-token prediction accuracy on the real test data, as sketched below.
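A minimal sketch of this generative-task evaluation, assuming the synthetic training set and real test set are plain lists of strings. The model choice (distilgpt2), sequence length, and training hyperparameters here are illustrative assumptions, not the exact configuration used in the experiments.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)


def encode(texts, max_len=128):
    return tokenizer(texts, truncation=True, max_length=max_len,
                     padding="max_length", return_tensors="pt")


def finetune_on_synthetic(synthetic_texts, epochs=1, batch_size=8, lr=5e-5):
    enc = encode(synthetic_texts)
    loader = DataLoader(list(zip(enc["input_ids"], enc["attention_mask"])),
                        batch_size=batch_size, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask in loader:
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            # Standard causal-LM objective; padding positions are masked out.
            labels = input_ids.masked_fill(attention_mask == 0, -100)
            loss = model(input_ids=input_ids, attention_mask=attention_mask,
                         labels=labels).loss
            loss.backward()
            optim.step()
            optim.zero_grad()


@torch.no_grad()
def next_token_accuracy(real_test_texts, batch_size=8):
    enc = encode(real_test_texts)
    loader = DataLoader(list(zip(enc["input_ids"], enc["attention_mask"])),
                        batch_size=batch_size)
    model.eval()
    correct, total = 0, 0
    for input_ids, attention_mask in loader:
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        preds = logits[:, :-1].argmax(dim=-1)   # prediction for position t+1
        targets = input_ids[:, 1:]              # actual next tokens
        mask = attention_mask[:, 1:].bool()     # ignore padding positions
        correct += (preds[mask] == targets[mask]).sum().item()
        total += mask.sum().item()
    return correct / total
```

The key point is that the downstream model never sees the private training data: it is fine-tuned only on the synthetic texts, and the real data is used solely as a held-out test set.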

The classification task is performed on the OpenReview (academic paper reviews) dataset. To evaluate the quality of the generated synthetic data, we trained a downstream classifier on the synthetic data and computed its classification accuracy on the real test data.
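A minimal sketch of the classification evaluation, assuming (text, label) pairs for both the synthetic training set and the real test set. The TF-IDF plus logistic-regression pipeline below is an illustrative stand-in for whatever downstream classifier is actually used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline


def downstream_classification_accuracy(synthetic_texts, synthetic_labels,
                                        real_test_texts, real_test_labels):
    # Train only on synthetic data; the private training set is never used.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                        LogisticRegression(max_iter=1000))
    clf.fit(synthetic_texts, synthetic_labels)
    # Report accuracy on the held-out real test data.
    return accuracy_score(real_test_labels, clf.predict(real_test_texts))
```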

To address concerns about data contamination, we carefully analyzed the selected datasets and found no overlap between our pre-training data and the downstream datasets.
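The contamination analysis itself is not specified above; the following is a hypothetical sketch of one common approach, flagging a downstream document if any of its character n-grams also appears in the pre-training corpus. The n-gram length of 50 characters is an arbitrary illustrative choice.

```python
def ngrams(text, n=50):
    # All overlapping character n-grams of a document.
    return {text[i:i + n] for i in range(max(0, len(text) - n + 1))}


def overlapping_documents(pretraining_docs, downstream_docs, n=50):
    pretraining_grams = set()
    for doc in pretraining_docs:
        pretraining_grams |= ngrams(doc, n)
    # Downstream documents sharing at least one n-gram with the pre-training data.
    return [doc for doc in downstream_docs
            if ngrams(doc, n) & pretraining_grams]
```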
