Securing private data at scale with differentially private partition selection


Large, user-based datasets are invaluable for advancing AI and machine learning models. They drive innovation that directly benefits users through improved services, more accurate predictions, and personalized experiences. Collaborating on and sharing such datasets can accelerate research, foster new applications, and contribute to the broader scientific community. However, leveraging these powerful datasets also comes with potential data privacy risks.

“Differentially private (DP) partition selection” is the process of identifying, from a vast collection of individual contributions, a specific, meaningful subset of unique items that can be shared safely based on how frequently or prominently they appear (like finding all the common words used across a huge set of documents). Applying differential privacy to partition selection makes it possible to perform that selection in a way that prevents anyone from learning whether any single individual’s data contributed a specific item to the final list. This is done by adding controlled noise to each item’s count and selecting only the items that remain sufficiently common after that noise is included, ensuring individual privacy. DP partition selection is the first step in many important data science and machine learning tasks, including extracting vocabulary (or n-grams) from a large private corpus (a necessary step in many textual analysis and language modeling applications), analyzing data streams in a privacy-preserving way, computing histograms over user data, and improving the efficiency of private model fine-tuning.
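
To make the noise-and-threshold idea concrete, below is a minimal Python sketch of the classic noisy-threshold mechanism for the simplest setting, where each user contributes a single item. The function name, parameters, and toy data are illustrative, not our library’s API; the threshold formula shown is the standard stability-based choice for sensitivity-1 counts.

```python
import math
from collections import defaultdict

import numpy as np


def dp_partition_selection(user_items, epsilon, delta):
    """Release items common enough to share, with (epsilon, delta)-DP.

    Simplified setting: each user contributes exactly one item, so each
    item's count changes by at most 1 when one user is added or removed.
    """
    counts = defaultdict(int)
    for item in user_items:
        counts[item] += 1

    scale = 1.0 / epsilon  # Laplace noise scale for sensitivity-1 counts
    # Stability-based threshold: an item whose noisy count clears it is
    # overwhelmingly unlikely to owe its presence to any single user.
    threshold = 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon

    return [
        item
        for item, count in counts.items()
        if count + np.random.laplace(0.0, scale) > threshold
    ]


# Toy corpus: one word per user. Only "the" is common enough to survive.
users = ["the"] * 20 + ["cat", "dog", "cat"]
print(dp_partition_selection(users, epsilon=1.0, delta=1e-5))
```

When users contribute many items each, their unit of influence must be split across those items, which is exactly where the weighting strategies discussed below come in.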

In the context of massive datasets like user queries, a parallel algorithm is crucial. Instead of processing data one piece at a time, as a sequential algorithm would, a parallel algorithm breaks the problem into many smaller parts that can be computed simultaneously across multiple processors or machines (see the sketch below). This isn’t just an optimization; it’s a fundamental necessity at the scale of modern data. Parallelization allows vast amounts of information to be processed at once, enabling researchers to handle datasets with hundreds of billions of items and to achieve robust privacy guarantees without sacrificing the utility derived from large datasets.
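
As a toy illustration of the map-reduce pattern this relies on, here is a hypothetical sketch using a local process pool: each worker counts its own shard of the data, and the partial counts are merged before the noise-and-threshold step. In practice this would run on a massively parallel framework across many machines rather than a single process pool.

```python
from collections import Counter
from multiprocessing import Pool


def count_shard(shard):
    # Map step: each worker counts its own shard independently.
    return Counter(shard)


def parallel_histogram(shards):
    # Reduce step: merge the per-shard partial counts into one
    # histogram, which the noise-and-threshold step then consumes.
    with Pool() as pool:
        partial_counts = pool.map(count_shard, shards)
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    shards = [["the", "cat", "the"], ["dog", "the"], ["the", "cat"]]
    print(parallel_histogram(shards))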

In our recent publication, “Scalable Private Partition Selection via Adaptive Weighting”, which appeared at ICML 2025, we introduce an efficient parallel algorithm that makes it possible to apply DP partition selection to various data releases. Our algorithm provides the best results across the board among parallel algorithms and scales to datasets with hundreds of billions of items, up to three orders of magnitude larger than those analyzed by prior sequential algorithms. To encourage collaboration and innovation by the research community, we are open-sourcing DP partition selection on GitHub.
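
To convey the intuition behind adaptive weighting, here is a loose, hypothetical two-round sketch: users first split their weight uniformly across their items, then redirect the shares they spent on items that are already saturated (far above the release threshold, hence released regardless) toward their rarer items. Everything here, including the saturation cap, is illustrative; the paper’s actual reweighting rule, number of rounds, and sensitivity analysis differ.

```python
from collections import defaultdict


def adaptive_weights(user_items, threshold):
    """Two-round weighting sketch: uniform first, then reroute excess.

    Hypothetical illustration only; the published algorithm's actual
    reweighting rule and privacy analysis are more careful than this.
    """
    user_items = [items for items in user_items if items]  # skip empty users

    # Round 1: each user splits one unit of weight evenly across its
    # items, keeping each user's total influence bounded by 1.
    uniform = defaultdict(float)
    for items in user_items:
        for item in items:
            uniform[item] += 1.0 / len(items)

    # Round 2: weight far above the threshold is wasted, since those
    # items will be released anyway. Each user redirects the shares it
    # spent on saturated items toward its remaining items.
    cap = 2.0 * threshold  # hypothetical saturation cap
    adapted = defaultdict(float)
    for items in user_items:
        share = 1.0 / len(items)
        needy = [item for item in items if uniform[item] < cap]
        if not needy:
            # All of this user's items are already safe; keep uniform.
            for item in items:
                adapted[item] += share
            continue
        rerouted = share * (len(items) - len(needy))
        for item in needy:
            adapted[item] += share + rerouted / len(needy)
    return adapted
```

Compared with uniform weighting alone, the rerouted weight gives rare-but-shareable items a better chance of clearing the noisy threshold, without increasing any single user’s influence on the outcome.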
