— Deep Learning, Keras, NLP, Text Classification, Python — 4 min read
TL;DR Learn how to fine-tune the BERT model for text classification. Train and evaluate it on a small dataset for detecting seven intents. The results might surprise you!
Intent recognition (IR) from text is very useful these days. Usually, you get a short text (a sentence or two) and have to classify it into one (or multiple) categories.
Multiple product support systems (help centers) use IR to reduce the need for a large number of employees who copy-and-paste boring responses to frequently asked questions. Chatbots, automated email responders, and answer recommenders (drawing from a knowledge base of questions and answers) all strive to handle your request without taking up the time of a real person.
This guide will show you how to use a pre-trained NLP model that might solve the (technical) support problem that many business owners have. I mean, BERT is freaky good! It is really easy to use, too!
Run the complete notebook in your browser
The complete project on GitHub
The data contains various user queries categorized into seven intents. It is hosted on GitHub and was first presented in this paper.
Here are the intents:

- SearchCreativeWork
- GetWeather
- BookRestaurant
- PlayMusic
- AddToPlaylist
- RateBook
- SearchScreeningEvent
I’ve done a bit of preprocessing and converted the JSON files into easy to use/load CSVs. Let’s download them:
```bash
!gdown --id 1OlcvGWReJMuyYQuOZm149vHWwPtlboR6 --output train.csv
!gdown --id 1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w --output valid.csv
!gdown --id 1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF --output test.csv
```
We’ll load the data into data frames and expand the training data by merging the training and validation sets:
```python
import pandas as pd

train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

# DataFrame.append is gone in newer pandas, so we merge with pd.concat instead
train = pd.concat([train, valid]).reset_index(drop=True)
```
We have 13,784 training examples and two columns - text and intent. Let’s have a look at the number of texts per intent:
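The original post shows the distribution as a chart; if you just want the raw counts, a one-liner will do (a minimal sketch):

```python
print(train.intent.value_counts())
```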
The number of texts per intent is quite balanced, so we won’t be needing any techniques for handling class imbalance.
The BERT (Bidirectional Encoder Representations from Transformers) model, introduced in the paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, made it possible for the regular ML practitioner to achieve state-of-the-art results on a variety of NLP tasks. And you can do it without having a large dataset! But how is this possible?
BERT is a pre-trained Transformer Encoder stack. It is trained on Wikipedia and the Book Corpus dataset. It has two versions - Base (12 encoders) and Large (24 encoders).
BERT is built on top of multiple clever ideas by the NLP community. Some examples are ELMo, The Transformer, and the OpenAI Transformer.
ELMo introduced contextual word embeddings (one word can have a different meaning based on the words around it). The Transformer uses attention mechanisms to understand the context in which the word is being used. That context is then encoded into a vector representation. In practice, it does a better job with long-term dependencies.
BERT is a bidirectional model (it looks both forward and backward). And best of all, BERT can easily be used as a feature extractor or fine-tuned with small amounts of data. How good is it at recognizing intent from text?
Luckily, the authors of the BERT paper open-sourced their work along with multiple pre-trained models. The original implementation is in TensorFlow, but there are very good PyTorch implementations too!
Let’s start by downloading one of the smaller pre-trained models and unzipping it:
```bash
!wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
!unzip uncased_L-12_H-768_A-12.zip
```
This will unzip a checkpoint, config, and vocabulary, along with other files.
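The code further down refers to the checkpoint, config, and vocabulary by path, so let’s keep those paths in a few variables (a small convenience sketch; it assumes the archive was extracted into the current working directory):

```python
import os

bert_model_name = "uncased_L-12_H-768_A-12"

bert_ckpt_dir = bert_model_name
bert_ckpt_file = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")
```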
Unfortunately, the original implementation is not compatible with TensorFlow 2. The bert-for-tf2 package solves this issue.
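You can install it straight from PyPI (assuming you’re in a notebook, like the one linked above):

```bash
!pip install bert-for-tf2
```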
We need to convert the raw texts into vectors that we can feed into our model. We’ll go through 3 steps:

- tokenize the text
- convert the sequence of tokens into numbers
- pad the sequences, so each one has the same length
Let’s start by creating the BERT tokenizer:
```python
# FullTokenizer ships with the bert-for-tf2 package
from bert.tokenization.bert_tokenization import FullTokenizer

tokenizer = FullTokenizer(
  vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt")
)
```
Let’s take it for a spin:
```python
tokenizer.tokenize("I can't wait to visit Bulgaria again!")
```
```
['i', 'can', "'", 't', 'wait', 'to', 'visit', 'bulgaria', 'again', '!']
```
The tokens are lowercased and the punctuation is preserved. Next, we’ll convert the tokens to numbers. The tokenizer can do this too:
```python
tokens = tokenizer.tokenize("I can't wait to visit Bulgaria again!")
tokenizer.convert_tokens_to_ids(tokens)
```
```
[1045, 2064, 1005, 1056, 3524, 2000, 3942, 8063, 2153, 999]
```
We’ll do the padding part ourselves. You can also use the Keras padding utils for that part.
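For reference, a padding sketch using the Keras utility might look like this (max_seq_len is a placeholder here - the data class below computes the real value):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_seq_len = 128  # placeholder maximum length

token_ids = [
  tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
  for text in train.text
]

# pad (or truncate) every sequence to max_seq_len with trailing zeros
padded_ids = pad_sequences(
  token_ids,
  maxlen=max_seq_len,
  padding="post",
  truncating="post",
  value=0
)
```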
We’ll package the preprocessing into a class that is heavily based on the one from this notebook:
```python
import numpy as np
from tqdm import tqdm


class IntentDetectionData:
  DATA_COLUMN = "text"
  LABEL_COLUMN = "intent"

  def __init__(
    self,
    train,
    test,
    tokenizer: FullTokenizer,
    classes,
    max_seq_len=192
  ):
    self.tokenizer = tokenizer
    self.max_seq_len = 0
    self.classes = classes

    ((self.train_x, self.train_y), (self.test_x, self.test_y)) = \
      map(self._prepare, [train, test])

    print("max seq_len", self.max_seq_len)
    self.max_seq_len = min(self.max_seq_len, max_seq_len)
    self.train_x, self.test_x = map(
      self._pad,
      [self.train_x, self.test_x]
    )

  def _prepare(self, df):
    x, y = [], []

    for _, row in tqdm(df.iterrows()):
      text, label = \
        row[IntentDetectionData.DATA_COLUMN], \
        row[IntentDetectionData.LABEL_COLUMN]
      # tokenize, wrap with the special tokens and convert to ids
      tokens = self.tokenizer.tokenize(text)
      tokens = ["[CLS]"] + tokens + ["[SEP]"]
      token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
      # keep track of the longest sequence seen so far
      self.max_seq_len = max(self.max_seq_len, len(token_ids))
      x.append(token_ids)
      y.append(self.classes.index(label))

    # the sequences still have different lengths here, so keep them as plain lists
    return x, np.array(y)

  def _pad(self, ids):
    x = []
    for input_ids in ids:
      # truncate to the (clipped) maximum length and pad with zeros
      input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
      input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
      x.append(np.array(input_ids))
    return np.array(x)
```
We figure out the padding length by taking the minimum between the longest text and the max sequence length parameter. We also surround the tokens for each text with two special tokens: start with [CLS] and end with [SEP].
Let’s make BERT usable for text classification! We’ll load the model and attach a couple of layers on top of it:
```python
import tensorflow as tf
from tensorflow import keras

from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights


def create_model(max_seq_len, bert_ckpt_file):

  # build the BERT layer from the downloaded config
  # (adapter_size=None means we fine-tune all of BERT's weights)
  with tf.io.gfile.GFile(bert_config_file, "r") as reader:
    bc = StockBertConfig.from_json_string(reader.read())
    bert_params = map_stock_config_to_params(bc)
    bert_params.adapter_size = None
    bert = BertModelLayer.from_params(bert_params, name="bert")

  input_ids = keras.layers.Input(
    shape=(max_seq_len, ),
    dtype='int32',
    name="input_ids"
  )
  bert_output = bert(input_ids)

  print("bert shape", bert_output.shape)

  # take the output for the [CLS] token and add a small classification head
  # (`classes` is the global list of intent names)
  cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
  cls_out = keras.layers.Dropout(0.5)(cls_out)
  logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
  logits = keras.layers.Dropout(0.5)(logits)
  logits = keras.layers.Dense(
    units=len(classes),
    activation="softmax"
  )(logits)

  model = keras.Model(inputs=input_ids, outputs=logits)
  model.build(input_shape=(None, max_seq_len))

  # load the pre-trained BERT weights into the layer
  load_stock_weights(bert, bert_ckpt_file)

  return model
```
We’re fine-tuning the pre-trained BERT model on our data (text and intent). We take the output for the [CLS] token, add Dropout, and stack two fully-connected layers on top. The last layer has a softmax activation function. The number of outputs is equal to the number of intents we have - seven.
You can now use BERT to recognize intents!
It is time to put everything together. We’ll start by creating the data object:
```python
classes = train.intent.unique().tolist()

data = IntentDetectionData(
  train,
  test,
  tokenizer,
  classes,
  max_seq_len=128
)
```
We can now create the model using the maximum sequence length:
```python
model = create_model(data.max_seq_len, bert_ckpt_file)
```
Looking at the model summary:
```python
model.summary()
```
You’ll notice that even this “slim” BERT has almost 110 million parameters. Indeed, your model is HUGE (that’s what she said).
Fine-tuning models like BERT is both an art and a matter of running tons of failed experiments. Fortunately, the authors of the paper made some recommendations for fine-tuning:

- Batch size: 16, 32
- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4

We’ll stick with the recommended batch size, but be a bit more adventurous with the learning rate and the number of epochs:
```python
model.compile(
  optimizer=keras.optimizers.Adam(1e-5),
  # the last layer already applies a softmax, so the outputs are probabilities, not logits
  loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
  metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
)
```
We’ll use Adam with a slightly lower learning rate than recommended (cause we’re badasses) and sparse categorical crossentropy, so we don’t have to one-hot encode our labels.
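To make the point about the labels concrete - IntentDetectionData stores them as plain integer class indices (a quick sanity check, output omitted):

```python
# each label is an integer index into `classes`, not a one-hot vector
print(data.train_y.shape)  # (number of training examples,)
print(data.train_y[:5])    # the first five class indices
```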
Let’s fit the model:
```python
import datetime

log_dir = "log/intent_detection/" + \
  datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir)

model.fit(
  x=data.train_x,
  y=data.train_y,
  validation_split=0.1,
  batch_size=16,
  shuffle=True,
  epochs=5,
  callbacks=[tensorboard_callback]
)
```
We store the training logs, so you can explore the training process in TensorBoard. Let’s have a look:
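If you’re running in a notebook, you can bring TensorBoard up inline (assuming the log directory from above):

```
%load_ext tensorboard
%tensorboard --logdir log/intent_detection
```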
I’ve got to be honest with you. I was impressed with the results. Training on only about 12.5k samples (what remains of the 13,784 after the 10% validation split), we got:
```python
_, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

print("train acc", train_acc)
print("test acc", test_acc)
```
```
train acc 0.9915119
test acc 0.9771429
```
Impressive, right? Let’s have a look at the confusion matrix:
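The original post renders the confusion matrix as a heatmap; here’s one way you might reproduce it yourself (a sketch using scikit-learn and seaborn - both are assumptions, any plotting approach works):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

y_pred = model.predict(data.test_x).argmax(axis=-1)

cm = confusion_matrix(data.test_y, y_pred)

# rows are true intents, columns are predicted intents
ax = sns.heatmap(cm, annot=True, fmt="d", xticklabels=classes, yticklabels=classes)
ax.set_xlabel("Predicted intent")
ax.set_ylabel("True intent")
plt.show()
```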
Finally, let’s use the model to detect intent from some custom sentences:
```python
sentences = [
  "Play our song now",
  "Rate this book as awful"
]

pred_tokens = map(tokenizer.tokenize, sentences)
pred_tokens = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))

pred_token_ids = map(
  lambda tids: tids + [0] * (data.max_seq_len - len(tids)),
  pred_token_ids
)
pred_token_ids = np.array(list(pred_token_ids))

predictions = model.predict(pred_token_ids).argmax(axis=-1)

for text, label in zip(sentences, predictions):
  print("text:", text, "\nintent:", classes[label])
  print()
```
```
text: Play our song now
intent: PlayMusic

text: Rate this book as awful
intent: RateBook
```
Man, that’s (clearly) gangsta! Ok, the examples might not be as diverse as real user queries. But hey, go ahead and try it on your own data!
You now know how to fine-tune a BERT model for text classification. You probably already know that you can use it for a variety of other tasks, too! You just have to fiddle with the layers. EASY!
Run the complete notebook in your browser
The complete project on GitHub
Doing AI/ML feels a lot like having superpowers, right? Thanks to the wonderful NLP community, you can have superpowers, too! What will you use them for?