
Intent Recognition with BERT using Keras and TensorFlow 2

Deep Learning, Keras, NLP, Text Classification, Python · 4 min read


TL;DR Learn how to fine-tune the BERT model for text classification. Train and evaluate it on a small dataset for detecting seven intents. The results might surprise you!

Recognizing intent (IR) from text is very useful these days. Usually, you get a short text (sentence or two) and have to classify it into one (or multiple) categories.

Multiple product support systems (help centers) use IR to reduce the need for a large number of employees that copy-and-paste boring responses to frequently asked questions. Chatbots, automated email responders, answer recommenders (from a knowledge base with questions and answers) strive to not let you take the time of a real person.

This guide will show you how to use a pre-trained NLP model that might solve the (technical) support problem that many business owners have. I mean, BERT is freaky good! It is really easy to use, too!

Run the complete notebook in your browser

The complete project on GitHub


The data contains various user queries categorized into seven intents. It is hosted on GitHub and was first presented in this paper.

Here are the intents:

  • SearchCreativeWork (e.g. Find me the I, Robot television show)
  • GetWeather (e.g. Is it windy in Boston, MA right now?)
  • BookRestaurant (e.g. I want to book a highly rated restaurant for me and my boyfriend tomorrow night)
  • PlayMusic (e.g. Play the last track from Beyoncé off Spotify)
  • AddToPlaylist (e.g. Add Diamonds to my roadtrip playlist)
  • RateBook (e.g. Give 6 stars to Of Mice and Men)
  • SearchScreeningEvent (e.g. Check the showtimes for Wonder Woman in Paris)

I’ve done a bit of preprocessing and converted the JSON files into easy to use/load CSVs. Let’s download them:

!gdown --id 1OlcvGWReJMuyYQuOZm149vHWwPtlboR6 --output train.csv
!gdown --id 1Oi5cRlTybuIF2Fl5Bfsr-KkqrXrdt77w --output valid.csv
!gdown --id 1ep9H6-HvhB4utJRLVcLzieWNUSG3P_uF --output test.csv

We’ll load the data into data frames and expand the training data by merging the training and validation intents:

import pandas as pd

train = pd.read_csv("train.csv")
valid = pd.read_csv("valid.csv")
test = pd.read_csv("test.csv")

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
train = pd.concat([train, valid]).reset_index(drop=True)

We have 13,784 training examples and two columns - text and intent. Let’s have a look at the number of texts per intent:

[Figure: intent distribution]

The number of texts per intent is quite balanced, so we won't need any imbalanced modeling techniques.
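If you prefer numbers over a plot, counting the labels gives the same picture. Here's a minimal sketch that uses a tiny made-up list in place of the real `intent` column (with the actual data you could pass `train.intent` instead); the `imbalance_ratio` name is just something introduced for illustration:

```python
from collections import Counter

# Hypothetical stand-in for train.intent (the real column has 13,784 entries).
intents = [
    "PlayMusic", "GetWeather", "PlayMusic", "RateBook",
    "GetWeather", "RateBook", "BookRestaurant", "BookRestaurant",
]

counts = Counter(intents)

# Ratio of the most to the least frequent class; values near 1 mean "balanced".
imbalance_ratio = max(counts.values()) / min(counts.values())

for intent, n in counts.most_common():
    print(f"{intent:15s} {n}")
print("imbalance ratio:", imbalance_ratio)
```

A ratio far above 1 would be a hint to look into class weights or resampling.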


The BERT (Bidirectional Encoder Representations from Transformers) model, introduced in the BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding paper, made it possible for the regular ML practitioner to achieve state-of-the-art results on a variety of NLP tasks. And you can do it without having a large dataset! But how is this possible?

BERT is a pre-trained Transformer Encoder stack. It is trained on Wikipedia and the Book Corpus dataset. It has two versions - Base (12 encoders) and Large (24 encoders).

BERT is built on top of multiple clever ideas by the NLP community. Some examples are ELMo, The Transformer, and the OpenAI Transformer.

ELMo introduced contextual word embeddings (one word can have a different meaning based on the words around it). The Transformer uses attention mechanisms to understand the context in which the word is being used. That context is then encoded into a vector representation. In practice, it does a better job with long-term dependencies.

BERT is a bidirectional model (looks both forward and backward). And best of all, BERT can be easily used as a feature extractor or fine-tuned with small amounts of data. How good is it at recognizing intent from text?

Intent Recognition with BERT

Luckily, the authors of the BERT paper open-sourced their work along with multiple pre-trained models. The original implementation is in TensorFlow, but there are very good PyTorch implementations too!

Let’s start by downloading one of the simpler pre-trained models and unzip it:


This will unzip a checkpoint, config, and vocabulary, along with other files.
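The code later in this guide refers to `bert_ckpt_dir` and `bert_ckpt_file`, so it helps to define the paths in one place. The sketch below assumes the BERT-Base, Uncased archive from the official release, unzipped into a `model/` folder. The model name, the download URL in the comment, the folder layout, and the `bert_config_file` variable are all assumptions, so adjust them to whatever you actually downloaded:

```python
import os

# Assumed archive: BERT-Base, Uncased. In a notebook it could be fetched with:
#   !wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
#   !unzip uncased_L-12_H-768_A-12.zip -d model
bert_model_name = "uncased_L-12_H-768_A-12"

# Paths to the unzipped checkpoint, config, and (via the folder) vocabulary.
bert_ckpt_dir = os.path.join("model", bert_model_name)
bert_ckpt_file = os.path.join(bert_ckpt_dir, "bert_model.ckpt")
bert_config_file = os.path.join(bert_ckpt_dir, "bert_config.json")

print(bert_ckpt_file)
```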

Unfortunately, the original implementation is not compatible with TensorFlow 2. The bert-for-tf2 package solves this issue.


We need to convert the raw texts into vectors that we can feed into our model. We’ll go through 3 steps:

  • Tokenize the text
  • Convert the sequence of tokens into numbers
  • Pad the sequences so each one has the same length

Let’s start by creating the BERT tokenizer:

from bert.tokenization.bert_tokenization import FullTokenizer

tokenizer = FullTokenizer(
    vocab_file=os.path.join(bert_ckpt_dir, "vocab.txt")
)

Let’s take it for a spin:

tokenizer.tokenize("I can't wait to visit Bulgaria again!")

['i', 'can', "'", 't', 'wait', 'to', 'visit', 'bulgaria', 'again', '!']

The tokens are in lowercase and the punctuation is available. Next, we’ll convert the tokens to numbers. The tokenizer can do this too:

tokens = tokenizer.tokenize("I can't wait to visit Bulgaria again!")
tokenizer.convert_tokens_to_ids(tokens)

[1045, 2064, 1005, 1056, 3524, 2000, 3942, 8063, 2153, 999]
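Where do those numbers come from? Conceptually, each id is just the position of the token in the vocabulary file, which lists one token per line. Here's a toy sketch of that lookup. The tiny vocabulary and the `toy_tokens_to_ids` helper are made up for illustration, and the fallback to `[UNK]` is folded into the lookup for brevity; this is not the library's implementation:

```python
# Toy vocabulary standing in for vocab.txt: one token per line, id = line index.
toy_vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "i", "can", "wait", "!"]
token_to_id = {token: idx for idx, token in enumerate(toy_vocab_lines)}


def toy_tokens_to_ids(tokens):
    """Map each token to its id, falling back to [UNK] for unknown tokens."""
    unk_id = token_to_id["[UNK]"]
    return [token_to_id.get(tok, unk_id) for tok in tokens]


ids = toy_tokens_to_ids(["i", "can", "wait", "bulgaria", "!"])
print(ids)
```

In this toy vocabulary "bulgaria" is unknown, so it ends up with the `[UNK]` id, while the real vocabulary clearly knows the word.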

We’ll do the padding part ourselves. You can also use the Keras padding utils for that part.

We’ll package the preprocessing into a class that is heavily based on the one from this notebook:

import numpy as np
from tqdm import tqdm


class IntentDetectionData:
    DATA_COLUMN = "text"
    LABEL_COLUMN = "intent"

    def __init__(
        self,
        train,
        test,
        tokenizer: FullTokenizer,
        classes,
        max_seq_len=192
    ):
        self.tokenizer = tokenizer
        self.max_seq_len = 0
        self.classes = classes

        ((self.train_x, self.train_y), (self.test_x, self.test_y)) =\
            map(self._prepare, [train, test])

        print("max seq_len", self.max_seq_len)
        self.max_seq_len = min(self.max_seq_len, max_seq_len)
        self.train_x, self.test_x = map(
            self._pad,
            [self.train_x, self.test_x]
        )

    def _prepare(self, df):
        x, y = [], []

        for _, row in tqdm(df.iterrows()):
            text, label =\
                row[IntentDetectionData.DATA_COLUMN], \
                row[IntentDetectionData.LABEL_COLUMN]
            tokens = self.tokenizer.tokenize(text)
            tokens = ["[CLS]"] + tokens + ["[SEP]"]
            token_ids = self.tokenizer.convert_tokens_to_ids(tokens)
            # track the longest sequence seen so far
            self.max_seq_len = max(self.max_seq_len, len(token_ids))
            x.append(token_ids)
            y.append(self.classes.index(label))

        # the id lists differ in length, so x stays a plain list until padding
        return x, np.array(y)

    def _pad(self, ids):
        x = []
        for input_ids in ids:
            input_ids = input_ids[:min(len(input_ids), self.max_seq_len - 2)]
            input_ids = input_ids + [0] * (self.max_seq_len - len(input_ids))
            x.append(np.array(input_ids))
        return np.array(x)

We figure out the padding length by taking the minimum of the longest text's length and the max sequence length parameter. We also surround the tokens for each text with two special tokens: start with [CLS] and end with [SEP].
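To see the length logic in isolation, here's a small sketch on plain lists. The `wrap_and_pad` helper is hypothetical (it isn't part of the class above), and it assumes the ids 101, 102, and 0 for [CLS], [SEP], and padding, as in the uncased BERT vocabulary:

```python
def wrap_and_pad(token_ids, max_seq_len, cls_id=101, sep_id=102, pad_id=0):
    """Add [CLS]/[SEP], truncate to max_seq_len, then right-pad with pad_id."""
    # Reserve two positions for the special tokens before truncating.
    body = token_ids[: max_seq_len - 2]
    wrapped = [cls_id] + body + [sep_id]
    return wrapped + [pad_id] * (max_seq_len - len(wrapped))


batch = [[1045, 2064, 3524], [1045, 2064, 1005, 1056, 3524, 2000, 3942]]

# Longest text including the two special tokens, capped by a maximum of 8.
longest = max(len(ids) + 2 for ids in batch)
max_seq_len = min(longest, 8)

padded = [wrap_and_pad(ids, max_seq_len) for ids in batch]
for row in padded:
    print(row)
```

Note that this variant truncates before adding the special tokens, so [SEP] survives even when a text is too long.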


Let’s make BERT usable for text classification! We’ll load the model and attach a couple of layers to it:

import tensorflow as tf
from tensorflow import keras
from bert import BertModelLayer
from bert.loader import StockBertConfig, map_stock_config_to_params, load_stock_weights


def create_model(max_seq_len, bert_ckpt_file):
    # bert_config_file: path to bert_config.json in the unzipped checkpoint folder
    with tf.io.gfile.GFile(bert_config_file, "r") as reader:
        bc = StockBertConfig.from_json_string(reader.read())
        bert_params = map_stock_config_to_params(bc)
        bert_params.adapter_size = None
        bert = BertModelLayer.from_params(bert_params, name="bert")

    input_ids = keras.layers.Input(
        shape=(max_seq_len, ),
        dtype='int32',
        name="input_ids"
    )
    bert_output = bert(input_ids)

    print("bert shape", bert_output.shape)

    # keep only the output for the first ([CLS]) token of each sequence
    cls_out = keras.layers.Lambda(lambda seq: seq[:, 0, :])(bert_output)
    cls_out = keras.layers.Dropout(0.5)(cls_out)
    logits = keras.layers.Dense(units=768, activation="tanh")(cls_out)
    logits = keras.layers.Dropout(0.5)(logits)
    logits = keras.layers.Dense(
        units=len(classes),
        activation="softmax"
    )(logits)

    model = keras.Model(inputs=input_ids, outputs=logits)
    model.build(input_shape=(None, max_seq_len))

    # load the pre-trained weights into the BERT layer
    load_stock_weights(bert, bert_ckpt_file)

    return model

We’re fine-tuning the pre-trained BERT model using our inputs (text and intent). We keep only the output for the first ([CLS]) token and add Dropout with two Fully-Connected layers. The last layer has a softmax activation function. The number of outputs is equal to the number of intents we have - seven.
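If the head of the model feels abstract, here's a plain-Python sketch of its two key ideas: picking the vector at position 0 and turning seven scores into probabilities with softmax. All numbers are made up, and the shapes are shrunk for readability (the real vectors have 768 features):

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical BERT output for one text: 3 token positions, 4 features each.
sequence_output = [
    [0.2, -1.0, 0.5, 0.1],   # position 0 holds the [CLS] token
    [0.9, 0.3, -0.2, 0.0],
    [0.1, 0.1, 0.1, 0.1],
]
cls_vector = sequence_output[0]  # same idea as seq[:, 0, :] for a single example

# Made-up scores for the seven intents, before the softmax is applied.
scores = [2.0, 0.1, -1.0, 0.3, 0.0, -0.5, 1.0]
probs = softmax(scores)

predicted = probs.index(max(probs))
print(cls_vector, predicted, sum(probs))
```

The probabilities sum to 1, and the predicted intent is simply the index of the largest one.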

You can now use BERT to recognize intents!


It is time to put everything together. We’ll start by creating the data object:

classes = train.intent.unique().tolist()

data = IntentDetectionData(
    train,
    test,
    tokenizer,
    classes,
    max_seq_len=128
)

We can now create the model using the maximum sequence length:

model = create_model(data.max_seq_len, bert_ckpt_file)

Looking at the model summary:

model.summary()


You’ll notice that even this “slim” BERT has almost 110 million parameters. Indeed, your model is HUGE (that’s what she said).

Fine-tuning models like BERT is both art and doing tons of failed experiments. Fortunately, the authors made some recommendations:

  • Batch size: 16, 32
  • Learning rate (Adam): 5e-5, 3e-5, 2e-5
  • Number of epochs: 2, 3, 4

model.compile(
    optimizer=keras.optimizers.Adam(1e-5),
    # the last layer already applies softmax, so the loss receives probabilities
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=False),
    metrics=[keras.metrics.SparseCategoricalAccuracy(name="acc")]
)

We’ll use Adam with a slightly different learning rate (cause we’re badasses) and use sparse categorical crossentropy, so we don’t have to one-hot encode our labels.
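Why does the sparse version save us the one-hot encoding? For a single example both losses boil down to the negative log of the probability assigned to the true class. A quick sketch with made-up probabilities (the two helper functions are illustrative, not the Keras implementations):

```python
import math


def sparse_categorical_crossentropy(label, probs):
    """Loss for one example where the label is an integer class index."""
    return -math.log(probs[label])


def categorical_crossentropy(one_hot, probs):
    """Same loss, but with a one-hot encoded label."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


probs = [0.7, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03]  # made-up prediction, 7 intents
label = 0                                          # integer label
one_hot = [1, 0, 0, 0, 0, 0, 0]                    # the same label, one-hot encoded

sparse_loss = sparse_categorical_crossentropy(label, probs)
dense_loss = categorical_crossentropy(one_hot, probs)
print(sparse_loss, dense_loss)
```

Both values agree, so the integer labels from `classes.index(label)` can be fed to the loss as they are.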

Let’s fit the model:

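import datetime

# one log folder per run, named after the current timestamp
log_dir = "log/intent_detection/" +\
    datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = keras.callbacks.TensorBoard(log_dir=log_dir)

history = model.fit(
    x=data.train_x,
    y=data.train_y,
    validation_split=0.1,
    batch_size=16,
    shuffle=True,
    epochs=5,
    callbacks=[tensorboard_callback]
)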
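Before looking at the logs, here's a quick back-of-the-envelope check of how much data that call actually trains on. It's only arithmetic on the numbers used so far, and it assumes Keras rounds the training share of `validation_split` down to a whole number of rows:

```python
import math

num_examples = 13_784      # training examples reported earlier
validation_split = 0.1
batch_size = 16
epochs = 5

# Assumption: the training share is rounded down to a whole number of rows.
num_train = math.floor(num_examples * (1 - validation_split))
num_val = num_examples - num_train

# Number of weight updates per epoch and over the whole run.
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * epochs

print(num_train, num_val, steps_per_epoch, total_steps)
```

So roughly 12.4k texts are used for the weight updates, and the rest are held out for validation.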

We store the training logs, so you can explore the training process in TensorBoard. Let’s have a look:

[Figures: training loss, training accuracy]


I’ve got to be honest with you. I was impressed with the results. Training on only about 12.5k samples, we got:

_, train_acc = model.evaluate(data.train_x, data.train_y)
_, test_acc = model.evaluate(data.test_x, data.test_y)

print("train acc", train_acc)
print("test acc", test_acc)

train acc 0.9915119
test acc 0.9771429

Impressive, right? Let’s have a look at the confusion matrix:

[Figure: confusion matrix]
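If you'd like to compute such a matrix yourself, the idea fits in a few lines. This is a minimal sketch on made-up labels for three classes; with the real model you could use `data.test_y` as the true labels and `model.predict(data.test_x).argmax(axis=-1)` as the predictions:

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Made-up labels for three classes, purely for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
for row in cm:
    print(row)

# Correct predictions sit on the diagonal.
accuracy = sum(cm[i][i] for i in range(3)) / len(y_true)
print("accuracy:", accuracy)
```

Everything off the diagonal is a mistake, and the row and column tell you which intent was confused with which.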

Finally, let’s use the model to detect intent from some custom sentences:

sentences = [
    "Play our song now",
    "Rate this book as awful"
]

pred_tokens = map(tokenizer.tokenize, sentences)
pred_tokens = map(lambda tok: ["[CLS]"] + tok + ["[SEP]"], pred_tokens)
pred_token_ids = list(map(tokenizer.convert_tokens_to_ids, pred_tokens))

# pad every sentence to the length the model was built with
pred_token_ids = map(
    lambda tids: tids + [0] * (data.max_seq_len - len(tids)),
    pred_token_ids
)
pred_token_ids = np.array(list(pred_token_ids))

predictions = model.predict(pred_token_ids).argmax(axis=-1)

for text, label in zip(sentences, predictions):
    print("text:", text, "\nintent:", classes[label])
    print()

text: Play our song now
intent: PlayMusic

text: Rate this book as awful
intent: RateBook

Man, that’s (clearly) gangsta! Ok, the examples might not be as diverse as real queries might be. But hey, go ahead and try it on your own!


You now know how to fine-tune a BERT model for text classification. You probably already know that you can use it for a variety of other tasks, too! You just have to fiddle with the layers. EASY!

Doing AI/ML feels a lot like having superpowers, right? Thanks to the wonderful NLP community, you can have superpowers, too! What will you use them for?


