Now You See Me, Now You Don’t: Adversarial Patching Brings Serious Trust Issues to Machine Learning

There’s a very special T-shirt in William Gibson’s novel Zero History. Its design is crafted in such a way as to render the person wearing it invisible to surveillance algorithms. Here’s what that T-shirt looks like:

Pep, in black cyclist’s pants, wore the largest, ugliest T-shirt she’d ever seen, in a thin, cheap-looking cotton the color of ostomy devices, that same imaginary Caucasian flesh-tone. There were huge features screened across it in dull black halftone, asymmetrical eyes at breast height, a grim mouth at crotch-level. Later she’d be unable to say exactly what had been so ugly about it, except that it was somehow beyond punk, beyond art, and fundamentally, somehow, an affront.

When Gibson’s novel was published in 2010, the idea of tricking computer vision algorithms with an ugly T-shirt seemed like, well, science fiction.

Four years later, a research paper titled Evasion attacks against machine learning at test time started making waves. Two years after that came DeepFool: a simple and accurate method to fool deep neural networks.

The papers demonstrated how diabolically easy it was to corrupt even the most sophisticated machine learning models. The approach came to be known as adversarial patching: the input to a model is deliberately doctored to cause a faulty output.

The first attempts built on manipulating pixels in digital images in a way that was indiscernible to the human eye, while predictably tricking deep learning algorithms into thinking, for example, that they were looking at a rifle rather than a turtle (as demonstrated here).
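To give a feel for the mechanics, here is a minimal sketch in Python. It is not the method from any of the papers above: it is a fast-gradient-sign-style step against a made-up linear classifier with four "pixels". The weights, labels, pixel values, and epsilon are all assumptions chosen for illustration.

```python
# Toy illustration only: a hand-written linear "classifier" over 4 pixel values.
# The weights, bias, labels, and epsilon are made up for this sketch.

WEIGHTS = [0.9, -0.6, 0.4, -0.8]
BIAS = 0.0


def score(pixels):
    """Linear score: positive means 'turtle', otherwise 'rifle'."""
    return sum(w * p for w, p in zip(WEIGHTS, pixels)) + BIAS


def predict(pixels):
    return "turtle" if score(pixels) > 0 else "rifle"


def perturb(pixels, epsilon):
    """Shift every pixel by at most epsilon in the direction that lowers the score.

    For a linear model, the gradient of the score with respect to the input
    is the weight vector itself, so the sign of each weight gives the direction.
    """
    adversarial = []
    for w, p in zip(WEIGHTS, pixels):
        step = -epsilon if w > 0 else epsilon
        adversarial.append(min(1.0, max(0.0, p + step)))  # keep pixels in [0, 1]
    return adversarial


image = [0.50, 0.50, 0.50, 0.40]
adversarial_image = perturb(image, epsilon=0.05)

print(predict(image), "->", predict(adversarial_image))
```

No pixel moves by more than 0.05, yet the label flips. In a real image with hundreds of thousands of pixels, far smaller per-pixel changes can add up to the same effect, which is why the doctored picture can look unchanged to a person.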

People started getting nervous.

Still, the threat seemed somewhat theoretical, since adversarial patching could only be achieved by contaminating a given model’s training data. (That is cold comfort for contemporary machine learning practitioners: models have since grown so data-hungry that it is impossible to train them only on data you control completely yourself.)

That was until the 2017 research paper Robust physical world attacks on deep learning models, which showed that placing a small sticker (a physical adversarial patch) on traffic signs made them invisible to computer vision algorithms in autonomous vehicles.

Then, later that year, some really annoying Dutch white-hat hackers published Fooling automated surveillance cameras: adversarial patches to attack person detection. This time the target was a type of object with a lot more intra-class variety than traffic signs, namely people.

What the Dutch paper showed was that simply holding up a 40x40 cm cardboard printout of a physical adversarial patch would significantly compromise the ability of state-of-the-art machine vision algorithms to identify the person holding the patch as a human being.

A brief seven years after Gibson’s far-fetched futurism, the science fiction had become reality.

Where does this leave us? The aforementioned article where turtles were made to look like rifles was authored by mathematician/programmer Pau Labarta Bajo. According to him, it’s going to be really hard to protect real-world machine learning applications from adversarial patch-based attack vectors. And by real-world he means that the model is deployed in an environment where any user can send inputs and expect an output. Which means just about all models that we use nowadays.

So if the evidence is mounting that it’s getting harder and harder to protect machine learning models, does that mean we’re screwed?

That’s pretty much the way it looked to me, but I wanted to prove myself wrong, so I turned to one of the cleverest people I know in the field. His answer did not provide comfort:

Sadly, adversarial attacks are a fundamental vulnerability of complex AI models. Most of the examples you bring up are cases of so-called evasion attacks, where clean target instances are modified at test time to avoid detection by a classifier.

As if that’s not bad enough, it’s probably even harder to defend a model from adversarial poisoning attacks. These happen at training time and aim to manipulate the performance of a system by inserting carefully constructed poison instances into the training data. It’s a bit like the movie Inception, where you can plant an idea in someone’s brain.
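A toy sketch in Python can make the poisoning idea concrete. It assumes a deliberately simple one-dimensional nearest-centroid classifier; the data points, the labels "benign" and "malicious", and the helper names are all hypothetical and not taken from any real system or paper.

```python
# Toy illustration only: 1-D nearest-centroid classifier with made-up data.

def train(samples):
    """Return the mean feature value (centroid) for each label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, value):
    """Pick the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


clean_data = [
    (1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
    (7.0, "malicious"), (8.0, "malicious"), (9.0, "malicious"),
]

# Poison instances: malicious-looking values that the attacker labeled "benign".
poison = [(9.0, "benign"), (9.5, "benign"), (10.0, "benign")]

clean_model = train(clean_data)
poisoned_model = train(clean_data + poison)

target = 6.0  # the input the attacker wants to slip through
print(predict(clean_model, target), "->", predict(poisoned_model, target))
```

Three mislabeled points are enough to drag the "benign" centroid toward the attacker's target, so the same input that the clean model flags is waved through by the poisoned one. Nothing about the deployed code changed, only the data it learned from.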

The only way to avoid this type of attack is to always train your own model. But even if you do, you cannot exclude the possibility that training data obtained from others has been manipulated to plant some evil pattern. You would really need to train your own model, with only data that you trust. But the massive scale of both data and models makes this practically impossible, so in reality you will likely use data from others, or train a model based on an existing one.

Besides poisoning attacks, there are several other types of adversarial attacks that operate on models trained on “clean” data. While the deployed models are fixed, new attack methods can emerge at any time.