Machine unlearning could become an effective remedy for AI models that have been trained on undesirable, misleading, or harmful data.
Over the past two years, we’ve seen how deep learning models have revolutionized AI, enabling a wide range of applications, from new search tools to image generators. But as amazing and efficient as these models are, their ability to remember and accurately replicate training data has become a double-edged sword, posing serious challenges in this emerging field.
AI models like GPT-4o or Llama 3.1 are trained on an incredible amount of data to best serve our needs, but the trouble starts when that data needs to be erased from the models’ memory. For example, suppose your machine learning model was accidentally trained on data that includes personal banking information. How can you erase this specific information without retraining the model?
Fortunately, researchers are now working on this problem in an emerging field called machine unlearning, a young but vital area that major players are already entering.
Join us as we take a closer look at this concept and see if large language models can actually forget what they’ve learned.
How are language models trained?
Even the most powerful generative AI models are not truly intelligent. You can think of them as predictive statistical systems that can generate or supplement words, images, speech, music, video, and other data. These models learn to predict the likelihood of certain data occurrences by analyzing large numbers of examples (such as movies, audio recordings, articles, and the like). They identify patterns and take into account the context surrounding each piece of data.
For example, when an email contains the phrase "Looking forward to…," a model trained to autocomplete messages suggests "hearing back from you," based on patterns it has identified in similar emails. The model is not drawing on real knowledge; it is making an educated guess based on the statistics and patterns of similar text.
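To make that "statistical guess" idea concrete, here is a deliberately tiny Python sketch; the three-sentence corpus and the phrase lookup are invented for illustration and bear no resemblance to how a production model actually stores its statistics:

```python
from collections import Counter

# A toy corpus of email fragments; a real model sees billions of tokens.
corpus = [
    "looking forward to hearing back from you",
    "looking forward to hearing from you soon",
    "looking forward to seeing you next week",
]

# Count which word follows the phrase "forward to" in the corpus.
next_words = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) - 2):
        if words[i] == "forward" and words[i + 1] == "to":
            next_words[words[i + 2]] += 1

# The "suggestion" is simply the most frequent continuation seen so far.
suggestion, count = next_words.most_common(1)[0]
print(suggestion)  # "hearing" (appears in 2 of the 3 examples)
```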
Most models, including flagships like GPT-4o, are trained on data published on public websites and in datasets around the web. Most companies that make money selling subscriptions to chatbots and AI tools argue that collecting data to train models is "fair use" and requires neither attribution to content owners nor licensing fees. Many publishers and artists disagree and are pursuing their rights through legal complaints.
In the pre-training phase, AI models consume a large body of data, called a corpus, and learn a weight for each word or feature that captures how important that feature is and how it relates to other data. This data directly determines what the model will "understand." After pre-training, the model is refined to improve its outputs.
In the case of transformer-based models like ChatGPT, this refinement is often done through RLHF (Reinforcement Learning from Human Feedback), in which humans directly rate the model's responses so it can improve them.
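As a rough illustration of the pre-training step described above, the sketch below runs one optimization step of a toy next-token predictor in PyTorch; the vocabulary size, the two-layer "model," and the random token batch are all stand-ins, and a real transformer would add attention layers and billions of parameters:

```python
import torch
import torch.nn as nn

# Hypothetical tiny setup: vocabulary of 100 tokens, embedding dimension 32.
vocab_size, d_model = 100, 32

# A deliberately minimal "language model": embedding -> linear head.
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One pre-training step: predict each next token from the current one.
tokens = torch.randint(0, vocab_size, (8, 16))   # a random batch of token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # shift by one position
logits = model(inputs)                           # (batch, seq, vocab)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()      # adjust the weights toward better next-token guesses
optimizer.step()
```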
Training AI models requires GPU processing power, which is both expensive and increasingly scarce. The Information recently estimated that ChatGPT’s daily operating costs were $700,000.
What is Machine Unlearning?
Researchers describe the main goal of machine unlearning as removing the "effects" of specific training data from a trained model. In other words, after unlearning, the model should behave the same as a model trained on "the same original data set, minus the unwanted information."
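Put loosely in symbols (the notation here is ours, for illustration only): if A is the training algorithm, D the original dataset, and D_f the data to be forgotten, an unlearning procedure U aims for

```latex
% Illustrative notation, not taken from a specific paper:
%   A   = training algorithm,  D = original dataset,
%   D_f = data to forget,      U = unlearning procedure
U\bigl(A(D),\, D_f\bigr) \;\approx\; A\bigl(D \setminus D_f\bigr)
```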
However, there are some points to consider in this definition:
How can we accurately identify the information that should be forgotten? Is it always possible to access the data a model was actually trained on? Do we always have valid retrained reference models, and if not, how do we actually evaluate unlearning?
Can we actually verify and control unlearning? If a model merely "pretends" to forget, is that enough for a safe and desirable outcome? And finally, when is machine unlearning actually a useful solution?
The Idea and Motivations of Machine Unlearning
The idea of making machine learning models forget dates back to 2014, when the EU's highest court recognized the "right to be forgotten," later codified in Article 17 of the GDPR. At the time, the goal was to let users ask online service providers to delete their data from their systems; however, the "right to be forgotten" was designed for deleting information from regular, structured systems, such as user account data in services like Gmail, not for deep learning models that store data in complex and intertwined forms.
This complexity has led researchers to study methods for data deletion and machine unlearning. Now, in 2024, the motivation for making models forget is no longer limited to privacy. As we build large models trained on diverse datasets that include copyrighted, dangerous, or offensive content, the ability to unlearn parts of this data becomes a necessity.
Broadly speaking, the motivations for machine unlearning can be divided into two categories:
Revocation of access: Forgetting private and copyrighted data
In an ideal world, we could think of data as borrowed information. In that case, machine unlearning would amount to returning those loans to their owners. But because of the complexities of deep learning, data fed into a model is more like something that has been consumed, and it is not easy to return what has already been consumed. Some data, such as personal chat history, is also irreplaceable, and its value differs from person to person.
To see why, consider a simple example: if "Bob ate Alice's cheesecake" is the "data," then "Alice would rather Bob pay her or give her something of equal value" corresponds to compensation or financial rights for the data owner, because actually returning what Bob ate, the analogue of making the machine forget, would be impractical.
In this light, alternatives such as data markets, where data owners are properly paid up front so that data does not need to be unlearned later, could prove very valuable.
Model Correction and Editing: Removing Toxic Content, Biases, and Outdated or Dangerous Knowledge
This type of unlearning is used to correct errors and remove undesirable elements from models. In other words, unlearning can act as a risk-mitigation mechanism for AI systems.
Unlike revoking access, we have more flexibility when modifying models, because the modification or editing is driven mainly by utility rather than legal necessity: for example, improving the model's accuracy at classifying images or reducing the toxicity of the text it generates (although these issues can also cause real harm).
In this case, we do not need a formal guarantee that unlearning worked (although a guarantee would be desirable); many users are already perfectly satisfied with models that are merely judged to be "safe enough."
Types of Machine Unlearning Methods
At first glance, making a machine forget seems simple: just retrain the model without the unwanted data. Imagine you have a large library and you want to remove all books by a particular author. The simplest way is to throw out all the books and rebuild the library without that author's books. This is the equivalent of "full retraining" in machine learning, but researchers are looking for better solutions because retraining is often very expensive, and finding the items to remove from the training data is itself a lot of work (think of finding every reference to Harry Potter in a trillion tokens).
Unlearning techniques essentially seek to reduce or avoid this retraining cost while producing the same or similar results.
Exact unlearning: the unlearned model must be statistically identical to a retrained model.
Unlearning through differential privacy: the goal is to make the model behave in such a way that removing or keeping any specific data point makes little difference.
Empirical unlearning with a known sample space: the model is adjusted in incremental steps to forget specific, known data.
Empirical unlearning with an unknown sample space: the data to be forgotten is not precisely known and exists in the model only as concepts or general knowledge.
Unlearning by direct request: the model is instructed, through prompts or direct commands, to behave as if the data had been forgotten.
Inexact methods are sometimes called "approximate unlearning," meaning that the behavior of the unlearned model is only roughly similar to that of a retrained model.
We’ll take a closer look at each of these methods below.
Exact unlearning
The goal of exact unlearning is to make the new model (after removing data) behave exactly like a model that was never trained on that data in the first place.
This is typically done by dividing the dataset into non-overlapping shards and training a separate sub-model on each shard. If specific data needs to be forgotten, only the sub-model trained on the shard containing that data is retrained.
Returning to the library example, suppose we divide the library into several sections and assign a separate librarian to each. When we want to remove an author's books, we only need to notify the librarians whose sections contain them.
If the dataset is divided into N shards, the computational cost of exact unlearning, that is, retraining only the shard whose data changed, is roughly 1/N of the cost of retraining the entire model. At inference time, the outputs of all the sub-models are combined, for example by voting or averaging.
The most important advantage of exact unlearning is that its modular structure guarantees the deleted data really no longer affects the results; the structure of the algorithm itself proves the deletion is correct. This largely solves the challenge of evaluating models after unlearning. And because the process is transparent, we also gain a better understanding of how each piece of data affects the model's performance.
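Below is a minimal sketch of this shard-and-retrain idea (in the spirit of the SISA approach); the random data, the logistic-regression sub-models, and the majority vote are placeholders chosen only to keep the example self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the real training set (features X, labels y).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 5)), rng.integers(0, 2, size=600)

N_SHARDS = 4
# Disjoint index sets: shard i holds every N-th example starting at i.
shards = [np.arange(i, len(X), N_SHARDS) for i in range(N_SHARDS)]

def train_shard(indices):
    return LogisticRegression().fit(X[indices], y[indices])

# Train one sub-model per shard; together they form the deployed ensemble.
models = [train_shard(idx) for idx in shards]

def predict(x):
    # Combine the sub-models at inference time by majority vote.
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return np.bincount(votes).argmax()

# Exact unlearning of one example: drop it from its shard, retrain only that shard.
forget_index = 42
shard_id = forget_index % N_SHARDS
shards[shard_id] = shards[shard_id][shards[shard_id] != forget_index]
models[shard_id] = train_shard(shards[shard_id])  # cost ~ 1/N of full retraining
```

Because the forgotten example only ever contributed to one sub-model, retraining that sub-model provably removes its influence from the ensemble.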
Unlearning Through Differential Privacy
If the presence or absence of a piece of data does not significantly change a model's behavior, then there is effectively nothing to unlearn for that data.
This idea is the basis of Differential Privacy (DP): the model's sensitivity to every individual data point is reduced to the point where adding or removing a data point does not significantly change the results.
In this technique, the difference between the original model and a model retrained without the data in question is kept small, so the two produce very similar output distributions.
Suppose someone wants their personal data removed from the model. If differential privacy is implemented correctly, removing that data leaves the model behaving just as before, as if it had never learned that data in the first place. There is no need for a special "forgetting" step, because the model is designed so that the effect of any specific data point barely shows.
One common way to implement DP is to add noise during training, which dilutes the influence of each individual example.
As a simple analogy, imagine that whenever the model learns something from a sentence, some irrelevant extra words are mixed into it. If we later remove that sentence, the model barely changes, since the noise has already diluted the sentence's impact.
Technically, we first clip the gradients to bound how much any single example can influence the model, so it cannot suddenly learn too much from one data point. We then add noise to the clipped gradients to mask each example's exact contribution, so that even if an example is removed, its effect is not visible in the final model.
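Here is a rough sketch of that clip-and-add-noise step; the per-example gradients are random placeholders, and the clipping norm and noise multiplier are illustrative values rather than ones calibrated to a real (ε, δ) budget:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder per-example gradients; a real trainer computes these by backpropagation.
per_example_grads = rng.normal(size=(32, 10))  # 32 examples, 10 parameters

CLIP_NORM = 1.0          # bound on each example's influence
NOISE_MULTIPLIER = 1.1   # illustrative; in practice chosen to meet a privacy target

# 1. Clip each example's gradient so no single example can dominate the update.
norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads * np.minimum(1.0, CLIP_NORM / norms)

# 2. Sum the clipped gradients and add Gaussian noise to mask individual contributions.
noisy_sum = clipped.sum(axis=0) + rng.normal(
    scale=NOISE_MULTIPLIER * CLIP_NORM, size=clipped.shape[1]
)

# 3. Average and apply as an ordinary gradient step.
update = noisy_sum / len(per_example_grads)
```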
The strength of a DP guarantee is described by two numbers, epsilon (ε) and delta (δ), which tell us how strong the model's privacy is:
Epsilon bounds how much the model's behavior is allowed to change when a single data point changes. The smaller it is, the less sensitive the model is to any one data point and the stronger the privacy.
Delta is a probabilistic allowance: it bounds the probability that the epsilon guarantee fails. The smaller delta is, the less likely the model is to behave noticeably differently because of a particular data point.
In sum, smaller ε and δ mean stronger privacy and a smaller effect from any specific data point.
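Formally, a randomized training mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single example and for any set S of possible outputs:

```latex
\Pr\bigl[M(D) \in S\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[M(D') \in S\bigr] + \delta
```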
In the next sections, we will explain why adding more noise degrades model performance, but for now, think of noise as putting a mask on everyone's face so that no specific person can be picked out of a crowd. The model can no longer single out the person in question, but it also has a harder time recognizing everyone else.
Empirical Unlearning with a Known Sample Space
In this method, the model unlearns through small, incremental updates. Empirical techniques are more trial-and-error: researchers fine-tune the parameters until the model behaves as desired on the unwanted data.
The key constraint is that this method only works when the sample space, that is, the specific examples to be forgotten, is known.
In simple terms, we take a few calculated optimization steps so that the original model behaves as if it had been trained from scratch without the unwanted data. The model is retrained in a limited way, with carefully chosen settings, until it forgets that data.
For example, in the 2023 NeurIPS unlearning competition, the goal was to use an unlearning algorithm to produce a model that no longer retained the influence of certain data (e.g., facial images) and that behaved like a reference model trained only on the remaining data.
Participants received three main inputs:
A set of images that the original model had been trained on
A starting model that had not yet undergone unlearning
Images that had to be removed from the model
There were also hidden reference models that had been trained only on the "retained" data. Participants had to write an algorithm that produced 512 new, unlearned models with performance similar to these hidden models.
In the end, it turned out that the winners used a combination of several techniques (a minimal sketch of the first two follows the list):
They applied gradient ascent to the data that should be forgotten (in effect telling the model to move away from this data and forget it).
They applied gradient descent to the data that should be remembered (in effect telling the model to keep learning this data and remember it).
They assigned random labels to the forget data so the model became confused and could no longer recall it accurately.
They added noise to the model's weights to make it more forgetful.
They reset some of the weights and pruned others.
They reinitialized the first and last layers of the model and retrained them on the retained images.
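As promised above, here is a minimal sketch of the first two ideas, gradient ascent on the forget set combined with gradient descent on the retain set; the tiny classifier, the random "images," and the hyperparameters are placeholders and not the winners' actual code:

```python
import torch
import torch.nn as nn

def unlearning_step(model, forget_batch, retain_batch, lr=1e-4, forget_weight=0.5):
    """One empirical unlearning step: ascend on forget data, descend on retain data.

    The model, batches, and hyperparameters here are illustrative stand-ins.
    """
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    optimizer.zero_grad()

    x_f, y_f = forget_batch
    x_r, y_r = retain_batch

    # A negative loss term on the forget set pushes the model *away* from that
    # data (gradient ascent); the positive retain term preserves performance.
    loss = loss_fn(model(x_r), y_r) - forget_weight * loss_fn(model(x_f), y_f)
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with a tiny stand-in classifier and random "images".
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
forget_batch = (torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
retain_batch = (torch.randn(8, 3, 32, 32), torch.randint(0, 10, (8,)))
unlearning_step(model, forget_batch, retain_batch)
```

In practice this step would be repeated over many batches and combined with the other tricks in the list, such as label noising and weight resets.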
Empirical methods are popular because they are simpler and faster to implement while still having a clear effect on the model, and their results are easy to observe. In contrast, theoretical methods built on heavy computation are slow, hard to implement in practice, and resource-hungry.
However, one of the main challenges of empirical methods is that we do not know how an ideally unlearned model should behave on the forgotten data; for example, should it classify the deleted images randomly and with low confidence, or in some other way?
This uncertainty means the model's outputs can vary across conditions and scenarios, which makes its effects hard to predict. As a result, it is difficult to prove that the new model is effective and truly similar to a retrained model, because it can produce a range of different outputs once the data has been deleted.
Empirical Unlearning with an Unknown Sample Space
This empirical method is used when the data to be forgotten is not precisely specified and exists in the model only as concepts or general knowledge.
For example, suppose we want a model to forget the concept “Biden is the President of the United States.” But the actual meaning of this sentence is present in the data in various formats, such as articles, public conversations, videos, blog posts, or news texts. So just deleting a few specific examples will not achieve the goal.
Terms such as "model editing," "concept editing," "model surgery," and "knowledge unlearning" are commonly used to refer to this kind of unlearning.
But when the forgetting request is this vague, we have to pay attention to issues such as the scope of the edit and how pieces of information are interrelated.