Intro to Text Classification with Keras (Part 1)

pre-processing, embeddings and more

Keras provides a simple and flexible API to build and experiment with neural networks. I used it in both python and R, but I decided to write this post in R since there are less examples and tutorials. This series of posts will focus on text classification using keras.

The introductory post will show a minimal example to explain:

  • text pre-processing in keras.

  • how and why to use embeddings.

  • how to build a keras model.

It the next posts, we will use a real dataset from the Toxic Comment Classification Challenge on Kaggle which solves a multi-label classification problem.

Loading Libraries

To use keras in R, you can simply load keras library, and it will deal with python in the background.

## load libraries
library(dplyr)
library(keras)

Minimal dataset

Here we have three (edited) titles from StackOverflow questions, with corresponding random labels.

## training data
sov_questions <- c("RMarkdown with parameters",
                   "How to Render Rmarkdown embedded in Rmarkdown",
                   "YAML header argument in knitr")

## training data labels (randomly assigned for demo purposes)
sov_labels <- c(1, 1, 0)

Text Pre-rocessing

If you already familiar with NLP and worked with text before, you probably know that we usually need to process text in different ways like tokenization, stemming, etc. However, in deep learning applications there are differences in pre-processing and sometimes we even need to preserve certain structure (keeping captilization, referring to it with a certain token, discarding stemming, etc.).

To feed text to any algorithm, we need a way to represent it in numbers. So before starting with the large dataset we have, let’s look at a minimal example to display the results at each step with few words and understand what we will get when we scale this to a dataset with more vocabulary.

Keras tokenizer

We can use keras function text_tokenizer() to initialize a tokenizer. This has the default settings or the ones you specify like:

  • the characters to remove, called the filters and the default are “!"#$%&()*+,-./:;<=>?@[\]^_`{|}~"

  • whether to convert all the characters to lower case or not.

  • and other attributes.

We will keep the default values for simplicity.

Then when you pass the list of questions to fit_text_tokenizer(), it will tokenize the text and fill other lists with the results.

## initialize tokenizer and fit
tokenizer <- text_tokenizer() %>% 
  fit_text_tokenizer(sov_questions)

For instance:

  • tokenizer$word_index holds the value corresponding to each word in the given text. There’s also index_word which holds the data the other way around.

  • tokenizer$word_counts gives the count of each word in the given list of questions.

  • plus other useful values.

## inspect word_index 
tokenizer$word_index 
## $rmarkdown
## [1] 1
## 
## $`in`
## [1] 2
## 
## $with
## [1] 3
## 
## $parameters
## [1] 4
## 
## $how
## [1] 5
## 
## $to
## [1] 6
## 
## $render
## [1] 7
## 
## $embedded
## [1] 8
## 
## $yaml
## [1] 9
## 
## $header
## [1] 10
## 
## $argument
## [1] 11
## 
## $knitr
## [1] 12

From text to numbers

Now with all this info, we can consider the 12 words as 12 features. So we need a way to give a value to each feature per question. One of the easiest ways is to say whether each word/feature exists or not in each question.

In keras, you can use the function texts_to_matrix() to achieve this and the result will be as follows.

## binary mode
texts_to_matrix(tokenizer, sov_questions, mode = "binary")
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    0    1    0    1    1    0    0    0    0     0     0     0     0
## [2,]    0    1    1    0    0    1    1    1    1     0     0     0     0
## [3,]    0    0    1    0    0    0    0    0    0     1     1     1     1

Notice that we specified the mode = "binary", but there are other ways modes like "count", "tfidf and freq.

For example, the second question How to Render Rmarkdown embedded in Rmarkdown, had the word “Rmarkdown” twice. So if you use mode = "count, its count (2) appears in the 2nd position.

## count mode
texts_to_matrix(tokenizer, sov_questions, mode = "count")
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]    0    1    0    1    1    0    0    0    0     0     0     0     0
## [2,]    0    2    1    0    0    1    1    1    1     0     0     0     0
## [3,]    0    0    1    0    0    0    0    0    0     1     1     1     1

So all these are valid ways to represent text, but imagine you had a dataset with thousands of unique words, how wide (or as we call it sparse) would be the previous matrix?. It would be HUGE in terms of fitting into memory or computation. So we go for a different type of representation which is embeddings.

Embeddings

The idea of embeddings in a nutshell is to have a compact representation of words (and sometimes categorical variables) that is learned from the context while training the network. I will not get into more details here but it is good to know that you can:

  • train your network to learn this representation of words (better to have large dataset).
  • use existing pre-trained embeddings (e.g. GloVe, Word2vec).

So let’s see how this works with our minimal example.

First, we can convert each question we have to a sequence of numbers based on the tokenizer$word_index.

## create sequence
sov_seq <- texts_to_sequences(tokenizer, sov_questions)
sov_seq
## [[1]]
## [1] 1 3 4
## 
## [[2]]
## [1] 5 6 7 1 8 2 1
## 
## [[3]]
## [1]  9 10 11  2 12

Since we have variable-length vectors in the resulting list, we will add zeros at the end to unify the length. This is called padding.

As we have short sequences, so we can take the length of the longest question to be the maximum length. But in some cases, with very long sequences, we can truncate the text and take part of it as a sort of tradeoff.

## pad sequence
sov_seq_pad <- pad_sequences(sov_seq, maxlen = 7, padding = "post")
sov_seq_pad
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,]    1    3    4    0    0    0    0
## [2,]    5    6    7    1    8    2    1
## [3,]    9   10   11    2   12    0    0

Demo model

Now we will build a model for demo purposes to see what we can expect when we convert from sequences to embedding vectors.

How to initialize a keras model?

Here we will initialize keras model using keras_model_sequential() as we are working with sequential models. Note that there is another type, which is custom, but it is out of our scope here.

## initialize model
model <- keras_model_sequential()

How to add an embedding layer?

To add an embedding layer, we need to specify:

  • input_dim which is the number of words/features in our sequences.
  • output_dim which is the number of dimensions to use to represent each word. Note that in this example, we are going down from 12 words to 4 dimensions, but in real cases you can see the value when you embed 20000 words in 100 dimensions, which is a significant dimentionality reduction.
## add embedding layer
model %>% 
  layer_embedding(input_dim = 7, output_dim = 4) 

How does this embedding layer look like?

At any stage, you can see the architecture of the model using summary(model). For instance, here you can see that the embedding layer has 28 parameters which we expect given the input and output dimensions (7x4).

## inspect model
summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## embedding_1 (Embedding)          (None, None, 4)               28          
## ===========================================================================
## Total params: 28
## Trainable params: 28
## Non-trainable params: 0
## ___________________________________________________________________________

Till now the model hasn’t been trained, so it hasn’t learned anything and actually this is a minimal example for demo purposes which we wouldn’t train in the first place. But we can see what we’ll get from the embedding layer by looking at the initialized weights.

You can see that each of the seven words got represented in 4 dimensions as we specified.

model$get_weights()
## [[1]]
##               [,1]        [,2]         [,3]          [,4]
## [1,]  0.0289678611 -0.03147485  0.006110154 -0.0253688451
## [2,] -0.0407578349  0.01074579  0.027456593  0.0244452991
## [3,] -0.0408074632  0.02040509 -0.006920170 -0.0105738156
## [4,]  0.0479122140 -0.04127477  0.023985136 -0.0298692472
## [5,] -0.0329879746 -0.01166550 -0.049527586  0.0006767139
## [6,] -0.0098348856  0.04472115 -0.001571715  0.0409932248
## [7,]  0.0005310178 -0.02521396 -0.047884941 -0.0136772506

Now we converted from sequences (1D) to a matrix (2D), how can we go back to (1D)?

One way to do this is pooling, which will take the average or maximum over one dimension and return a vector which length equals the embedding size .

For instance, if we add the layer_global_average_pooling_1d() and look again at the model, we can see its length as expected (4).

## add pooling layer 
model %>% 
  layer_global_average_pooling_1d()

## inspect model
summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## embedding_1 (Embedding)          (None, None, 4)               28          
## ___________________________________________________________________________
## global_average_pooling1d_1 (Glob (None, 4)                     0           
## ===========================================================================
## Total params: 28
## Trainable params: 28
## Non-trainable params: 0
## ___________________________________________________________________________

How to add more layers?

Now we are be ready to add more layers, and the simplest is a fully connected (or a dense layer). We will finally add another layer with one unit to give the output value since we have a single label classification problem here. We will go over more details in the next posts.

## add dense layer
model %>% 
  layer_dense(units = 16, activation = "relu") %>% 
  layer_dense(units = 1, activation = "sigmoid") ## single label classification

Now we can see the final architecture of the model.

## inspect model
summary(model)
## ___________________________________________________________________________
## Layer (type)                     Output Shape                  Param #     
## ===========================================================================
## embedding_1 (Embedding)          (None, None, 4)               28          
## ___________________________________________________________________________
## global_average_pooling1d_1 (Glob (None, 4)                     0           
## ___________________________________________________________________________
## dense_1 (Dense)                  (None, 16)                    80          
## ___________________________________________________________________________
## dense_2 (Dense)                  (None, 1)                     17          
## ===========================================================================
## Total params: 125
## Trainable params: 125
## Non-trainable params: 0
## ___________________________________________________________________________

The next steps would be compiling and training the model, but it would be better to explain it with a real dataset, which we will do in the next post. After all the main objective of this post is to understand how different functions work and view the output with at each step. This would make you more comfortable using them and debugging your code with a larger dataset.

comments powered by Disqus