(Image description: First row shows a bird generated by a pretrained model compared to a bird generated with the same prompt after applying the researchers' method to redact the concept of long beaks. Second row shows a bird generated by the pretrained model compared to a bird generated with the same prompt after applying their method to redact the concept of blue birds, demonstrating that their method clearly distinguishes which part should be redacted and which should be kept.)
Fake robocalls during elections. The voice of a public figure appropriated to hawk products. Pictures altered to mislead the public. From social media posts to celebrity voices, the trustworthiness of AI-generated content is under fire. So here’s the burning question: how do we stamp out harmful or undesirable content without dampening innovation?
Computer scientists from the University of California San Diego Jacobs School of Engineering have proposed a novel solution to optimize the tremendous potential of deep generative models while mitigating the production of content that is biased or toxic in nature.
In a paper presented at the 2024 IEEE Conference on Secure and Trustworthy Machine Learning, Data Redaction from Conditional Generative Models, researchers introduced a framework that prevents text-to-image and speech synthesis models from producing undesirable outputs. Their innovative approach earned a Distinguished Paper Award at the conference, held earlier this month at the University of Toronto.
“Modern deep generative models often produce undesirable outputs – such as offensive texts, malicious images, or fabricated speech – and there is no reliable way to control them. This paper is about how to prevent this from happening technically,” said Zhifeng Kong, a Computer Science and Engineering Department PhD student and lead author of the paper. (Sound demos from the project can be heard here.)
“The main contribution of this work is to formalize how to think about this problem and how to frame it properly so that it can be solved,” said computer science Professor Kamalika Chaudhuri.
A New Method to Extinguish Harmful Content
Traditional mitigation methods have taken one of two approaches. The first method is to re-train the model from scratch using a training set that excludes all undesirable samples; the alternative is to apply a classifier that filters undesirable outputs or edits outputs after the content has been generated.
These solutions have significant limitations for most modern, large models. Retraining an industry-scale model from scratch is cost-prohibitive, running into millions of dollars, and both mitigation methods are computationally heavy. There is also no way to control whether third parties will implement available filters or editing tools once they obtain the source code. And they might not even solve the problem: undesirable outputs, such as images with artifacts, sometimes appear even though they are not present in the training data.
Chaudhuri and Kong set out to mitigate undesired content while overcoming each of these hurdles. They designed a formal statistical machine learning framework that is effective, universal, and computationally efficient while retaining high generation quality.
“Our framing gives rise to an efficient and more controllable approach where we can ‘post-edit’ a small part of the model after training with relatively small computational overhead,” said Chaudhuri.
Specifically, the team proposed to post-edit the weights of a pre-trained model, a method they call data redaction. They introduced a series of techniques to redact certain conditionals, or user inputs, that will, with high statistical probability, lead to undesirable content.
Prior work in data redaction focused on unconditional generative models. Those studies considered the problem in the space of outputs, redacting generated samples. That same technique is too unwieldy to apply to conditional generative models, which in effect learn an infinite number of distributions, one for each possible conditional.
Chaudhuri and Kong overcame this challenge by redacting in the conditional space rather than the output space. With text-to-image models they redacted prompts; in text-to-speech models, they redacted voices. In short, they extinguished sparks before they could be fanned into toxic output.
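The sketch below (not the authors' code) illustrates this idea in PyTorch for the text-to-image case: only a small conditioning encoder is post-edited so that redacted prompts are remapped to a safe reference prompt, while all other prompts are distilled from a frozen copy of the pretrained encoder. The module names, the toy embed_text placeholder, and the example prompts are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch of conditional-space redaction: post-edit only the
# conditioning encoder of a pretrained conditional generator. Redacted prompts
# are pulled toward a safe reference conditional; retained prompts are
# distilled from a frozen copy of the original encoder.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in conditioning encoder: prompt embedding -> conditioning vector.
cond_encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
teacher = copy.deepcopy(cond_encoder).eval()  # frozen copy of the pretrained encoder
for p in teacher.parameters():
    p.requires_grad_(False)

_embed_cache = {}
def embed_text(prompts):
    # Placeholder text embedding (a fixed random vector per prompt); a real
    # system would use the model's own tokenizer and text encoder.
    for p in prompts:
        if p not in _embed_cache:
            _embed_cache[p] = torch.randn(64)
    return torch.stack([_embed_cache[p] for p in prompts])

redacted_prompts = ["a bird with a long beak"]   # conditionals to redact (toy example)
reference_prompt = ["a bird"]                    # safe conditional they are mapped to
retained_prompts = ["a blue bird", "a red car"]  # behaviour to preserve

optimizer = torch.optim.Adam(cond_encoder.parameters(), lr=1e-3)
for step in range(200):
    # Redaction loss: embeddings of redacted prompts should match the safe reference.
    z_red = cond_encoder(embed_text(redacted_prompts))
    z_ref = teacher(embed_text(reference_prompt)).expand_as(z_red)
    loss_redact = F.mse_loss(z_red, z_ref)

    # Retention loss: all other prompts should stay close to the pretrained encoder.
    x_keep = embed_text(retained_prompts)
    loss_keep = F.mse_loss(cond_encoder(x_keep), teacher(x_keep))

    loss = loss_redact + loss_keep
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In this sketch only the small conditioning pathway is updated, mirroring the article's point that a modest part of the model can be post-edited with relatively little computation while the rest of the generator is left untouched.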
For example, in the text-to-speech context, they could redact a specific person’s voice, such as a celebrity voice. The model would then generate a generic voice in place of the celebrity voice, making it much more difficult to put words in someone’s mouth.
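A simplified way to picture the speech case, again as an illustration under assumed names rather than the paper's exact procedure: if a multi-speaker text-to-speech model conditions on a table of speaker embeddings, redacting a voice can be sketched as overwriting that speaker's conditioning vector with a generic one.

```python
# Illustrative sketch (hypothetical indices and table): redact a voice by
# post-editing the speaker-embedding weights so requests for the redacted
# speaker are synthesized with a generic voice instead.
import torch
import torch.nn as nn

# Stand-in speaker-embedding table for a multi-speaker text-to-speech model.
speaker_table = nn.Embedding(num_embeddings=100, embedding_dim=256)

GENERIC_ID = 0    # a generic, non-identifiable voice (hypothetical index)
REDACTED_ID = 42  # the voice to redact, e.g. a celebrity (hypothetical index)

with torch.no_grad():
    # Post-edit a tiny part of the model: replace the redacted speaker's
    # conditioning vector with the generic one.
    speaker_table.weight[REDACTED_ID] = speaker_table.weight[GENERIC_ID].clone()
```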
Because their method only needed to load a small fraction of the dataset, the data redaction remained computationally light. It also offered better redaction quality and robustness than baseline methods and retained generation quality similar to that of the pre-trained model.
The researchers note that this work is a small-scale study that nevertheless provides an approach applicable to most types of generative models.
“If this was to be scaled up and applied to much bigger and modern models, then the ultimate broader impact would be a path towards safer generative models,” said Chaudhuri.
This work was supported by the National Science Foundation (1804829) and an Army Research Office MURI award (W911NF2110317).
-By Kimberley Clementi