[TextFooler] Is BERT really robust? — A paper summary.

Shreyashee Sinha
4 min read · Sep 25, 2020

Toward the end of 2018, the field of Artificial Intelligence was shaken up when Google open-sourced BERT, a state-of-the-art pretraining technique for NLP. Scoring 93.2% on a question-answering benchmark built from Wikipedia articles, BERT quickly earned a reputation as one of the most robust NLP models around. Recently, however, researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), the University of Hong Kong, and Singapore's Agency for Science, Technology, and Research introduced TextFooler, a simple but strong baseline for adversarial text generation. This article summarizes the paper that presents this framework, with a focus on the core of the system.

TextFooler crafts counterfeit samples that are hard to tell apart from real ones. (Image source — Zenva)

Abstract

TextFooler, working in a black-box setting (with no knowledge of the target model or its architecture), successfully attacked multiple target models on two natural language tasks: text classification and textual entailment.

Introduction

Crafted examples that a human classifies correctly but that fool the target models cast serious doubt on the reliability of ML algorithms. On the other hand, the robustness of these models can be drastically improved by including such examples in the training data, as sketched below.
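
As a rough illustration of that augmentation idea (not the authors' code), the sketch below appends adversarial rewrites of the training samples, keeping their original labels; `generate_adversarial` is a hypothetical stand-in for an attack such as TextFooler.

```python
# A minimal sketch of adversarial data augmentation, assuming a hypothetical
# generate_adversarial(text, label) helper that returns an adversarial rewrite
# of the text (or None if the attack fails).

def augment_with_adversarial(train_texts, train_labels, generate_adversarial):
    aug_texts, aug_labels = list(train_texts), list(train_labels)
    for text, label in zip(train_texts, train_labels):
        adv = generate_adversarial(text, label)
        if adv is not None:
            aug_texts.append(adv)       # adversarial sample, same ground-truth label
            aug_labels.append(label)
    return aug_texts, aug_labels        # retrain the target model on this larger set
```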

Method and Algorithm

The algorithm first identifies the most important words (Word Importance Ranking), i.e., those that most influence the target model's prediction, and then replaces them with synonyms (Word Transformer) that fit the context both grammatically and semantically.
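
As a rough sketch (not the authors' implementation), the two steps can be written as the loop below. `predict_proba(text)` is assumed to return the target model's label probabilities, and `get_synonyms(word)` is a hypothetical candidate generator; the paper additionally filters candidates with counter-fitted word embeddings, part-of-speech checks, and a Universal Sentence Encoder similarity threshold, all omitted here for brevity.

```python
# A minimal sketch of TextFooler's two steps, assuming black-box access to
# predict_proba(text) -> {label: probability} and a hypothetical
# get_synonyms(word) candidate generator.

def word_importance_ranking(words, true_label, predict_proba):
    """Score each word by how much deleting it lowers the true-label probability."""
    base = predict_proba(" ".join(words))[true_label]
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]                  # drop the i-th word
        scores.append((base - predict_proba(" ".join(reduced))[true_label], i))
    return sorted(scores, reverse=True)                      # most important first

def word_transformer(words, true_label, predict_proba, get_synonyms):
    """Greedily replace important words with the most damaging fitting synonym."""
    adv = list(words)
    for _, i in word_importance_ranking(words, true_label, predict_proba):
        best_word = adv[i]
        best_prob = predict_proba(" ".join(adv))[true_label]
        for candidate in get_synonyms(adv[i]):               # POS / similarity filters omitted
            trial = adv[:i] + [candidate] + adv[i + 1:]
            prob = predict_proba(" ".join(trial))[true_label]
            if prob < best_prob:                             # keep the most damaging synonym
                best_word, best_prob = candidate, prob
        adv[i] = best_word
        probs = predict_proba(" ".join(adv))
        if max(probs, key=probs.get) != true_label:          # prediction flipped: attack succeeded
            break
    return " ".join(adv)
```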

Experiments

A. Tasks
The effectiveness of the adversarial attack is studied on two important NLP tasks, text classification and textual entailment.

  1. Text Classification
    AG’s News (AG): Sentence-level classification with regard to four news topics: World, Sports, Business, and Science and Tech.
    Fake News Detection (Fake): Document-level classification on whether a news article is fake or not.
    MR: Sentence-level sentiment classification on positive and negative movie reviews.
    IMDB: Document-level sentiment classification on positive and negative movie reviews.
    Yelp Polarity (Yelp): Document-level sentiment classification on positive and negative reviews.
  2. Textual Entailment
    SNLI: A dataset of 570K sentence pairs derived from image captions.
    MultiNLI: A multi-genre entailment dataset with coverage of transcribed speech, popular fiction, and government reports.

B. Attacking Target Models
For each dataset, three state-of-the-art models are trained on its training set and then attacked as black boxes (a minimal sketch of that interface follows the list):
1. WordCNN (text classification)
2. WordLSTM (text classification)
3. BERT (both tasks)
4. InferSent (textual entailment)
5. ESIM (textual entailment)
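
Since the attack is black-box, the only thing it needs from any of these models is a probability per label. A hypothetical wrapper might look like the sketch below; the names are illustrative, not from the paper's code.

```python
# A hedged sketch of the black-box interface the attack queries: label
# probabilities only, no gradients or weights. `model` stands in for any of the
# trained targets (WordCNN, WordLSTM, BERT, InferSent, ESIM).

class BlackBoxTarget:
    def __init__(self, model, labels):
        self.model = model
        self.labels = labels
        self.query_count = 0             # doubles as the "query number" metric below

    def predict_proba(self, text):
        self.query_count += 1
        probs = self.model(text)         # assumed to return one probability per label
        return dict(zip(self.labels, probs))
```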

Results of Automatic Evaluation

Metrics

Original accuracy: the accuracy of the target models on the original test samples.
After-attack accuracy: the accuracy of the target models on the adversarial samples crafted from those test samples.
The greater the gap between the original and after-attack accuracy, the more successful the attack.

% Perturbed words: the ratio of the number of perturbed words to the text length.
Semantic similarity: the contextual similarity between the original and adversarial texts.
These two metrics evaluate how close the adversarial text stays to the original.

Query number: the number of queries the attack system makes to the target model.
This metric reveals the efficiency of the attack.
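
As an illustration only (the record fields are hypothetical), these metrics could be computed from per-sample attack logs as sketched below; the semantic-similarity score, which the paper measures with the Universal Sentence Encoder, is omitted.

```python
# A minimal sketch of the automatic evaluation metrics, assuming each record
# stores the original and adversarial texts, the gold label, the model's
# predictions before and after the attack, and the number of queries used.

def evaluate_attack(records):
    n = len(records)
    orig_acc = sum(r["orig_pred"] == r["label"] for r in records) / n
    after_acc = sum(r["adv_pred"] == r["label"] for r in records) / n
    perturb_rates = []
    for r in records:
        orig_words = r["orig_text"].split()
        adv_words = r["adv_text"].split()
        changed = sum(o != a for o, a in zip(orig_words, adv_words))
        perturb_rates.append(changed / len(orig_words))
    return {
        "original_accuracy": orig_acc,
        "after_attack_accuracy": after_acc,           # the gap to orig_acc reflects attack success
        "perturbed_words_pct": 100 * sum(perturb_rates) / n,
        "avg_queries": sum(r["queries"] for r in records) / n,
    }
```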

Results

Image source — TextFooler paper
Image source — TextFooler paper [ "m" stands for matched, "mm" for mismatched ]

Observations

  1. The algorithm achieves a high success rate on both tasks while making only a limited number of modifications.
  2. Irrespective of the text length and the target model's original accuracy, TextFooler reduces the accuracy from its high original values to below 15% (except on the Fake dataset).
  3. Models with higher original accuracy are, in general, more difficult to attack.
  4. The after-attack accuracy on the Fake dataset is much higher than on the other classification datasets for all three target models.

Results of Human Evaluation

  1. Task: Humans were asked to judge the grammaticality of a shuffled mixture of original and adversarial texts.
    Observation: The grammaticality scores of the adversarial texts were close to those of the original texts on both datasets.
  2. Task: The human raters were asked to assign labels to a shuffled set of original and adversarial samples.
    Observation: The agreement between the labels assigned to the original and adversarial sentences was high, though not perfect: 92% on MR and 85% on SNLI.
  3. Task: The semantic similarity of the original and adversarial sentences was evaluated by asking humans to judge whether the adversarial sentence is similar (1), ambiguous (0.5), or dissimilar (0) to the source sentence.
    Observation: The perceived differences were small, with similarity scores of 0.91 on MR and 0.86 on SNLI.

Error Analysis

The adversarial samples are susceptible to three types of errors: word sense ambiguity, grammatical errors, and task-sensitive content shift.

For example, the sentences "A man with headphones is biking" and "A man with headphones is motorcycle" differ only in the replacement of "biking" with "motorcycle": "biking" can be both a noun and a verb and is reasonably similar to "motorcycle" in meaning, but substituting the noun for the verb breaks the grammar of the sentence.
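
One simple guard against this class of error, related to the first item under Future Research below, is a part-of-speech consistency check. The sketch below uses NLTK's tagger and is an illustration, not necessarily the paper's exact filter.

```python
# A hedged sketch of a part-of-speech consistency check that would reject the
# "biking" -> "motorcycle" substitution. Uses NLTK's pos_tag (requires the
# 'averaged_perceptron_tagger' resource); not the authors' exact filter.

import nltk

def keeps_pos_in_context(words, index, candidate):
    """Return True if the candidate word keeps the original word's POS tag in context."""
    orig_tag = nltk.pos_tag(words)[index][1]
    trial = words[:index] + [candidate] + words[index + 1:]
    new_tag = nltk.pos_tag(trial)[index][1]
    return orig_tag == new_tag

# Example (illustrative): replacing the verb "biking" (index 5) with the noun
# "motorcycle" changes the tag in context, so the candidate would likely be rejected.
# keeps_pos_in_context("A man with headphones is biking".split(), 5, "motorcycle")
```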

Future Research

  1. Filtering out grammatical errors.
  2. Better handling of task-sensitive content shift.

References

  1. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. Di Jin, Zhijing Jin, Joey Tianyi Zhou, Peter Szolovits, July 2019.
  2. Google open-sources BERT, a state-of-the-art pretraining technique for natural language processing
