In machine learning, overfitting happens when a model's output adheres too closely to its training data. Instead of taking the original data and extrapolating new insights, the model simply regurgitates that information, adding little to no value.
“The problem is that, very often, these are really complex models, which are generating highly complex objects and they overfit,” said Computer Science and Engineering (CSE) Associate Professor Kamalika Chaudhuri, who studies overfitting in generative models and unsupervised learning. “If your generative model is just copying out the training data, the whole thing is useless.”
As a result, identifying overfitting is an important problem to solve, one that Chaudhuri and colleagues Casey Meehan, a CSE Ph.D. student, and Sanjoy Dasgupta, a CSE professor, are pursuing aggressively. In a paper published at the Artificial Intelligence and Statistics (AISTATS) 2020 conference in August, “A Non-Parametric Test to Detect Data-Copying in Generative Models,” the researchers explore two kinds of overfitting, overrepresentation and data copying, and how data copying should be addressed.
Overfitting: Two Different Kinds
Overrepresentation and data copying are two different species of overfitting that can produce bad data. In overrepresentation, the training data might include dog, cat and flower images, but the model somehow misses the flowers. As a result, the output is half cats, half dogs and no flowers, overrepresenting cats and dogs.
In data copying, the model simply spits out images of dogs, cats or flowers taken from the training data, without providing any additional insights.
“It is well understood that our models tend to…deftly regurgitating their training data, yet struggle to generalize to unseen examples similar to the training data,” Meehan, a student in Chaudhuri’s lab and first author on the paper, said in a recent blog post.
Chaudhuri notes that these are two completely distinct problems. Unfortunately, they are not always treated differently. After a thorough literature search, the team found that existing detection methods for overrepresentation came up short against data copying. Data scientists needed better tools to detect this form of overfitting.
“A lot of the previous tests designed to catch overrepresentation completely fail with data copying,” said co-author Dasgupta. “This is extremely important because data-copying indeed occurs in contemporary generative models and has significant consequences for user privacy and model generalization. But it is simply not identified by the most prominent generative model tests in the literature today.”
In the paper, the researchers propose a test that compares three samples (the training set, a held-out sample from the target distribution and the model's generated sample) to detect data copying, giving data researchers an additional tool to detect this type of overfitting and produce better data. This work is a key step in an ongoing process to understand different types of overfitting and develop better ways to mitigate them.
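The idea of comparing the three samples can be illustrated with a rough sketch. This is a simplified illustration, not the authors' implementation: it assumes the target distribution is represented by a held-out sample, uses plain Euclidean distance, and applies a single Mann-Whitney-style rank comparison over all of the data. The function names and the toy two-dimensional data are hypothetical. The intuition is that if generated points sit systematically closer to the training set than fresh held-out points do, the model is likely copying.

```python
import math
import random


def nearest_train_distance(point, train):
    """Euclidean distance from `point` to its nearest neighbor in `train`."""
    return min(math.dist(point, t) for t in train)


def data_copying_z(train, held_out, generated):
    """Z-scored Mann-Whitney U statistic on nearest-training-point distances.

    Strongly negative values mean generated points are closer to the
    training set than held-out points are, which suggests data copying.
    (Illustrative simplification; names and details are assumptions.)
    """
    d_gen = [nearest_train_distance(x, train) for x in generated]
    d_test = [nearest_train_distance(x, train) for x in held_out]
    m, n = len(d_gen), len(d_test)
    # U counts (generated, held-out) pairs where the generated point is
    # farther from the training set; ties count as one half.
    u = sum(
        1.0 if g > t else 0.5 if g == t else 0.0
        for g in d_gen
        for t in d_test
    )
    mean_u = m * n / 2
    std_u = math.sqrt(m * n * (m + n + 1) / 12)
    return (u - mean_u) / std_u


# Toy data: every sample comes from a 2-D standard Gaussian.
rng = random.Random(0)


def sample(k):
    return [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(k)]


train = sample(200)
held_out = sample(100)

# A "copying" generator: training points plus a tiny amount of noise.
copied = [
    (x + rng.gauss(0, 0.01), y + rng.gauss(0, 0.01))
    for x, y in rng.sample(train, 100)
]
# A "healthy" generator: fresh draws from the same distribution.
fresh = sample(100)

z_copy = data_copying_z(train, held_out, copied)
z_fresh = data_copying_z(train, held_out, fresh)
print(f"copying generator z = {z_copy:.2f}")
print(f"fresh generator   z = {z_fresh:.2f}")
```

In this toy setup the copying generator should score far below zero, while the fresh generator should land near zero. A global comparison like this one can miss copying that is confined to a small region of the data, which is one reason a practical test needs more care than this sketch.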
“We have proposed this test to detect data copying, and it works okay,” said Chaudhuri. “We need to design better tests, and we also need to understand what other overfitting modes could happen and extend our findings to super complex models.”
By Josh Baxt