Words are not enough! Short text
classification using words as well as
entities
Masterarbeit
von
Qingyuan Bie
Studiengang: Informatik
Matrikelnummer: 1795486
Institut für Angewandte Informatik und Formale
Beschreibungsverfahren (AIFB)
KIT-Fakultät für Wirtschaftswissenschaften
Prüfer: Prof. Dr. Ralf Reussner
Zweiter Prüfer: Prof. Dr. Harald Sack
Betreuer: M.Sc. Rima Türker
Eingereicht am: 29. August 2019
Zusammenfassung
Textklassifizierung, auch als Textkategorisierung bezeichnet, ist eine klassische Aufgabe der natürlichen Sprachverarbeitung. Ziel ist es, Textdokumenten eine oder mehrere vordefinierte Klassen oder Kategorien zuzuordnen. Die Textklassifizierung hat eine Vielzahl von Anwendungsszenarien.

Herkömmliche binäre oder Mehrklassen-Textklassifizierungsprobleme wurden in der Forschung zum maschinellen Lernen intensiv untersucht. Wenn jedoch Techniken des maschinellen Lernens auf kurze Texte angewendet werden, haben die meisten Standardansätze der Textklassifizierung Probleme wie Datenspärlichkeit und unzureichende Textlänge. Darüber hinaus sind kurze Texte aufgrund fehlender Kontextinformationen sehr mehrdeutig. Infolgedessen können einfache Textklassifizierungsansätze, die nur auf Wörtern basieren, die entscheidenden Merkmale von Kurztexten nicht richtig darstellen.

In dieser Arbeit wird ein neuartiger, auf neuronalen Netzen basierender Ansatz zur Klassifizierung von Kurztexten vorgeschlagen. Wir bereichern die Textdarstellung, indem wir Wörter zusammen mit Entitäten verwenden, die durch den Inhalt des gegebenen Dokuments repräsentiert werden. Tagme wird zuerst verwendet, um benannte Entitäten aus Dokumenten zu extrahieren. Anschließend verwenden wir das Wikipedia2Vec-Modell, um Entitätsvektoren zu erhalten. Unser Modell wird nicht nur auf vortrainierten Wortvektoren, sondern auch auf vortrainierten Entitätsvektoren trainiert. Wir vergleichen die Leistung des Modells, das nur Wörter verwendet, mit der Leistung des Modells, das Wörter und Entitäten verwendet. Der Einfluss verschiedener Arten von Wortvektoren auf die Klassifizierungsgenauigkeit wird ebenfalls untersucht.
Abstract
Text classification, also known as text categorization, is a classical task in natural language processing. It aims to assign one or more predefined classes or categories to text documents. Text classification has a wide variety of application scenarios.

Traditional binary or multiclass text classification problems have been studied intensively in machine learning research. However, when machine learning techniques are applied to short texts, most standard text classification approaches suffer from problems such as data sparsity and insufficient text length. Moreover, due to the lack of contextual information, short texts are highly ambiguous. As a result, simple text classification approaches based only on words cannot represent the critical features of short texts properly.

In this work, a novel neural-network-based approach to short text classification is proposed. We enrich the text representation by utilizing words together with entities represented by the content of the given document. Tagme is first utilized to extract named entities from documents. We then utilize the Wikipedia2Vec model to obtain entity vectors. Our model is trained not only on top of pre-trained word vectors, but also on top of pre-trained entity vectors. We compare the performance of the model using pure words with that of the model using words as well as entities. The impact of different kinds of word vectors on the classification accuracy is also studied.
Inhaltsverzeichnis
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2 Background 3
2.1 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.2 Text Representation (Feature Engineering) . . . . . . . . . . . . . . . . . . 4
2.3 Text Classification (Classifiers) . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Related Work 11
3.1 Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1.1 Vector Representation of Words . . . . . . . . . . . . . . . . . . . . 11
3.1.2 Convolutional Neural Network Feature Extraction . . . . . . . . . . 13
3.1.3 Pre-training & Multi-task Learning . . . . . . . . . . . . . . . . . . 15
3.1.4 Context Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.1.5 Memory Storage Mechanism . . . . . . . . . . . . . . . . . . . . . . 20
3.1.6 Attention Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Short Text Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.3 Text Classification using Named Entities . . . . . . . . . . . . . . . . . . . 21
4 Wikipedia2Vec and Doc2Vec 23
4.1 Word2vec and Skip-gram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Wikipedia2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.1 Wikipedia Link Graph Model . . . . . . . . . . . . . . . . . . . . . 25
4.2.2 Anchor Context Model . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.3 Doc2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Convolutional Neural Network for Short Text Classification 29
5.1 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 Overall Architecture of EntCNN . . . . . . . . . . . . . . . . . . . . . . . . 31
5.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6 Recurrent Neural Network for Short Text Classification 35
6.1 Vanilla Recurrent Neural Network . . . . . . . . . . . . . . . . . . . . . . . 35
6.2 Long Short-Term Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.3 Recurrent Convolutional Neural Network . . . . . . . . . . . . . . . . . . . 42
7 Experiments 45
7.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
7.4 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
7.5 Entity Linking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
7.6 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.7 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
8 Conclusion 67
8.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
A Feature Statistics 69
B Confusion Matrices 71
Abkürzungsverzeichnis
BERT Deep Bidirectional Encoder Representations from Transformers.
BoW Bag-of-Words.
CBOW Continuous Bag of Words.
CharCNN Character-level Convolutional Neural Network.
CNN Convolutional Neural Network.
DBOW Distributed Bag of Words.
DCNN Dynamic Convolutional Neural Network.
DMN Dynamic Memory Network.
EL Entity Linking.
ELMo Embeddings from Language Models.
EntCNN Convolutional Neural Network for Text Classification with Entities.
FLOPs Floating Point Operations.
FN False Negative.
FP False Positive.
GloVe Global Vectors for Word Representation.
GPT Generative Pre-Training model.
GPU Graphics Processing Unit.
GRU Gated Recurrent Unit.
HAN Hierarchical Attention Network.
ICA Independent Component Analysis.
KB Knowledge Base.
K-NN K-Nearest Neighbors.
LDA Latent Dirichlet Allocation.
LSA Latent Semantic Analysis.
LSTM Long Short-Term Memory.
NED Named Entity Disambiguation.
NEL Named Entity Linking.
NER Named Entity Recognition.
NLP Natural Language Processing.
NMT Neural Machine Translation.
OOV Out-of-vocabulary.
PCA Principal Component Analysis.
RCNN Recurrent Convolutional Neural Network.
ReLU Rectified Linear Unit.
RNN Recurrent Neural Network.
SGD Stochastic Gradient Descent.
sLDA Spatial Latent Dirichlet Allocation.
SVM Support-Vector Machine.
TextCNN Convolutional Neural Network for Text Classification.
TextRNN Recurrent Neural Network for Text Classification.
TF-IDF Term Frequency–Inverse Document Frequency.
TN True Negative.
TP True Positive.
TPU Tensor Processing Unit.
VDCNN Very Deep Convolutional Neural Network.
VSM Vector Space Model.
WSD Word Sense Disambiguation.
Abbildungsverzeichnis
2.1 Text classification process . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
3.1 Architecture of FastText for text classification . . . . . . . . . . . . . . . . 12
3.2 Architecture of convolutional neural network for text classification . . . . . 13
3.3 Architecture of generative pre-training model for text classification . . . . . 18
3.4 Differences in pre-training model architectures . . . . . . . . . . . . . . . . 19
4.1 Word based skip-gram model . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Wikipedia link graph model . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.3 Anchor context model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1 Convolutional neural network with words and entities for short text classification . . . 30
6.1 Unfolded vanilla recurrent neural network architecture . . . . . . . . . . . 35
6.2 Vanishing Gradient problem in recurrent neural network . . . . . . . . . . 36
6.3 Basic recurrent neural network architecture . . . . . . . . . . . . . . . . . . 39
6.4 Cell state of long short-term memory . . . . . . . . . . . . . . . . . . . . . 39
6.5 Input gate of long short-term memory . . . . . . . . . . . . . . . . . . . . . 40
6.6 Forget gate of long short-term memory . . . . . . . . . . . . . . . . . . . . 41
6.7 Output gate of long short-term memory . . . . . . . . . . . . . . . . . . . 42
6.8 Recurrent convolutional neural network for text classification . . . . . . . . 43
7.1 Baseline model (TextCNN) normalized confusion matrix on TREC . . . . . 64
7.2 Our model (EntCNN) normalized confusion matrix on TREC . . . . . . . 64
B.1 Baseline model (TextCNN) normalized confusion matrix on TREC . . . . . 71
B.2 Our model (EntCNN) normalized confusion matrix on TREC . . . . . . . 71
B.3 Baseline model (TextCNN) normalized confusion matrix on Twitter . . . . 72
B.4 Our model (EntCNN) normalized confusion matrix on Twitter . . . . . . . 72
Tabellenverzeichnis
1 Chinese to English translation with different word segmentation methods . 5
2 Chinese to English translation with different word segmentation methods . 5
3 Activation Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4 Summary statistics of datasets . . . . . . . . . . . . . . . . . . . . . . . . . 45
5 TREC class description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Example sentences of TREC dataset . . . . . . . . . . . . . . . . . . . . . 46
7 Example sentences of Twitter dataset . . . . . . . . . . . . . . . . . . . . . 46
8 Example sentences of AG News dataset . . . . . . . . . . . . . . . . . . . . 46
9 Example sentences of Movie Review dataset . . . . . . . . . . . . . . . . . 47
10 Document term matrix with stop words retained . . . . . . . . . . . . . . . 48
11 Document term matrix with stop words removed . . . . . . . . . . . . . . . 48
12 List of some widely used emoticons . . . . . . . . . . . . . . . . . . . . . . 51
13 List of regular expressions used to match emoticons . . . . . . . . . . . . . 52
14 Confusion matrix of binary classification . . . . . . . . . . . . . . . . . . . 53
15 Feature statistics on TREC . . . . . . . . . . . . . . . . . . . . . . . . . . 61
16 Experiment hyperparameters . . . . . . . . . . . . . . . . . . . . . . . . . . 62
17 Classification accuracy of different models on several different datasets . . . 63
18 Comparison between different preprocessing methods on Twitter . . . . . . 65
19 Comparison between different models . . . . . . . . . . . . . . . . . . . . . 66
20 Evaluation of EntCNN on AG News . . . . . . . . . . . . . . . . . . . . . . 66
21 Evaluation of TextCNN on AG News . . . . . . . . . . . . . . . . . . . . . 66
22 Evaluation of EntCNN on TREC . . . . . . . . . . . . . . . . . . . . . . . 66
23 Evaluation of TextCNN on TREC . . . . . . . . . . . . . . . . . . . . . . . 66
24 Feature statistics on Movie Review . . . . . . . . . . . . . . . . . . . . . . 69
25 Feature statistics on AG News . . . . . . . . . . . . . . . . . . . . . . . . . 69
26 Feature statistics on Twitter . . . . . . . . . . . . . . . . . . . . . . . . . . 70
1 Introduction
1.1 Motivation
The advent of the World Wide Web (WWW) and the digitization of all areas of our lives have led to an explosive growth of globally available data in recent decades. Among these resources, text data has the largest share. Automatically classifying, organizing and managing this vast amount of data has therefore become an important research topic. Natural language processing (NLP) refers to the analysis and processing of human natural language by computers; it is an interdisciplinary subject between computer science and linguistics.
Text classification, also known as text categorization, is a classical task in NLP. It aims to assign one or more predefined classes or categories to text documents. Text classification has a wide variety of application scenarios, for instance:
News filtering and organization. News websites contain a large number of articles. Based on their content, the articles need to be classified automatically by subject. For example, news can be divided into politics, economics, military, sports, entertainment, etc.
Sentiment analysis and opinion mining. On e-commerce websites, customers comment on products positively or negatively after completing a transaction. Merchants need to collect this feedback to learn what works well about a product or service and what should be improved to make the experience better.
Email classification and spam filtering. It is often desirable to classify emails automatically, either to determine their subject or to separate junk email from legitimate messages. This can significantly improve the experience of mailbox users.
Document organization and retrieval. The media receives a large number of submissions every day. We rely on text categorization technology to automatically review articles and flag erotic, violent, junk-advertising and other illegal content.
Traditional binary or multiclass text classification problems have been studied intensively in machine learning research[6][29]. However, when machine learning techniques are applied to short texts, most standard text classification approaches suffer from problems such as data sparsity and insufficient text length. Moreover, due to the lack of contextual information, short texts are highly ambiguous. As a result, simple text classification approaches based only on words cannot represent the critical features of short texts properly.
In this work, to overcome the aforementioned shortness and sparsity problems of short texts, we enrich the text representation by utilizing words together with entities represented by the content of the given document. Additionally, we compare the performance of the model using pure words with that of the model using words as well as entities.
1.2 Contribution
In this work a novel neural-network-based approach to short text classification is proposed. We utilize and modify the convolutional neural network for sentence classification proposed by Kim (2014)[28]. Our model is trained not only on top of pre-trained word vectors, but also on top of pre-trained entity vectors. Tagme, implemented by Ferragina et al. (2010)[20], is first used to extract named entities from documents. We then utilize the Wikipedia2Vec model developed by Yamada et al. (2018)[57] to obtain entity vectors. The impact of different kinds of word vectors on the classification accuracy is also studied in this work. Comprehensive experiments have been carried out to explore the classification performance of different neural network architectures.
1.3 Thesis Structure
The next chapter provides background knowledge so that we gain a basic understanding of the text classification process. Chapter 3 introduces previous work on short text classification and on text classification with entities, which is closely related to our work. In particular, various deep learning approaches for text classification and their advantages and disadvantages are discussed. Chapter 4 introduces the Wikipedia2Vec model and its submodels, because Wikipedia2Vec is utilized to obtain entity vectors in our work. Our main contribution is presented in chapter 5: a convolutional neural network for text classification with words and entities. Chapter 6 explains the basics of recurrent neural networks and long short-term memory. Our experiments, as well as the evaluation and analysis of the utilized models, can be found in chapter 7. Finally, chapter 8 concludes by summarizing important aspects of this work and briefly discussing possible future work.
2 Background
2.1 Text Classification
Text classification application
Text classification can be used in a wide variety of real world applications.
Broadly speaking, there are four different types of classification tasks. Binary classification is the task of assigning the texts of a given set to one of two classes; for example, given two labels we just need to figure out whether an email is spam or not, or whether a customer review is positive or negative. In multi-class classification we have a fixed set of classes and need to identify the single class a particular document belongs to, for example whether a news article is about sports, science or politics. Multi-label classification is an extension of the previous type in which a given text can have multiple labels, so it is not restricted to one class but can be part of several classes. Finally, when we try to group a collection of documents whose labels we do not know, we need to learn the structure of the given documents; this is called clustering.
Text classification type
There are three different types of methods that we can use to classify documents. In manual classification, a human decides by hand which category a document should belong to. This method has its merits in that it is quite accurate and its results are consistent, assuming it is done by experts in the domain; the downside is that it is slow and expensive. Rule-based classification is very accurate if the rules suit the task. The shortcoming is that we need to manually build and maintain the rule-based system; if we want to add a new criterion, we need to come up with a new rule for it. The main idea of statistical classification methods is that we automatically learn how to classify documents. This approach scales well, can be accurate if set up properly, and is automatic because it is done by computers. Its weakness is that it requires a lot of training data, namely the documents we want to classify together with the labels and classes assigned to them; such data is sometimes needed in huge amounts, is normally hard to obtain and can therefore be a serious problem.
Text classification definition
In statistical classification we assume that texts are represented by t and that there are corresponding classes c. We want to learn the probability of t being of class c, that is:
P (c|t) (2.1)
We can think of a text classification task as a two-stage process. Text representation means processing the given documents into some (fixed) representation; it is about how we can learn t. Text classification (in the narrow sense) means classifying a document given this representation; it is about how to learn P (c|t).
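These two stages can be made concrete with a small sketch. The example below uses scikit-learn purely for illustration (it is not the toolkit used later in this thesis): a CountVectorizer learns a representation t of each text, and a logistic regression estimates P (c|t).

```python
# Minimal sketch of the two-stage view: text representation + classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = ["cheap pills, buy now", "meeting moved to 3pm"]
train_labels = ["spam", "ham"]

model = make_pipeline(
    CountVectorizer(),      # stage 1: learn a representation t of each text
    LogisticRegression(),   # stage 2: learn P(c | t)
)
model.fit(train_texts, train_labels)
print(model.predict(["buy cheap pills"]))        # predicted class
print(model.predict_proba(["buy cheap pills"]))  # estimate of P(c | t)
```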
Short text classification
Short texts usually refer to texts that are relatively short in length and generally do not
exceed 160 characters, such as tweets, chat information, news topics, opinion comments,
question texts, mobile phone text messages, and document summaries.
2.2 Text Representation (Feature Engineering)
Feature engineering is often not only the most labor-intensive and time-consuming part of machine learning, but also an extremely important one. Unlike classifiers, feature engineering does not generalize well across tasks; it usually needs to be combined with an understanding of the specific task at hand. The NLP domain, to which the text classification problem belongs, naturally has its own ways of dealing with features, and most of the work in traditional classification approaches is carried out at this stage.
As shown in Figure 2.1, text feature engineering can be divided into three parts: text
preprocessing, feature extraction and feature selection.
Figure 2.1: Text Classification Process
Text preprocessing
Most text data in its raw format is not well structured and contains a lot of unnecessary information such as misspelled words, stop words, punctuation, emojis, etc. In particular, word segmentation is considered a very important first step for Chinese NLP tasks because, unlike in English and other Indo-European languages, Chinese words are composed of multiple characters with no spaces between them. Text preprocessing involves a variety of techniques to convert raw text data into cleaned and standardized sequences, which can further improve the performance of classifiers. Detailed preprocessing methods will be discussed in our experiments; a minimal cleaning sketch is also given after the tables below. Tables 1 and 2 show two Chinese-to-English translations obtained with different word segmentation methods.
chinese and foreign scientific masterpiece ✓ (correct segmentation)
mid surgery masterpiece ✗ (incorrect segmentation)
Table 1: Chinese to English translation with different word segmentation methods
Peking undergraduates come to apply for jobs ✓ (correct segmentation)
Peking university while living come to apply for jobs ✗ (incorrect segmentation)
Table 2: Chinese to English translation with different word segmentation methods
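As a rough illustration of basic cleaning steps (lowercasing, punctuation removal, stop-word filtering), consider the following sketch; the stop-word list is a small, purely illustrative sample.

```python
import re
import string

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}  # small illustrative sample

def preprocess(text):
    """Lowercase, strip punctuation and stop words, return a token list."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = re.split(r"\s+", text.strip())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

print(preprocess("Don't like this type of movie!"))
# ['dont', 'like', 'this', 'type', 'movie']
```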
Feature extraction and text representation
The purpose of text representation is to convert preprocessed texts into a form that a computer can process. It is the most important part and largely determines the quality of text classification. Traditional feature extraction techniques include syntactic word representations such as n-grams, weighted words such as bag-of-words (BoW), and semantic word representations (word embeddings) such as word2vec, GloVe, etc. Note that the vector space model (VSM)[49] refers to the data structure used for each document, whereas BoW refers to the kind of information we extract from a document; they are different aspects of characterizing texts.
BoW is a simple but powerful model for representing text documents. All we need to do is convert text documents into vectors. Only uni-gram words are extracted, with no syntax, no semantics and no word position information. Each document is converted into a vector which represents the frequency of all distinct words present in the document vector space.
Here are two simple text documents:
(1) I don’t like this type of movie, but I like this one.
(2) I like this kind of movie, but I don’t like this one.
The uni-grams of these two text documents are:
I, don’t, like, this, type, of, movie, but, I, like, this, one
I, like, this, kind, of, movie, but, I, don’t, like, this, one
Each bag-of-words can be represented as:
BoW1 = {I:2, don’t:1, like:2, this:2, type:1, of:1, movie:1, but:1, one:1}
BoW2 = {I:2, like:2, this:2, kind:1, of:1, movie:1, but:1, don’t:1, one:1}
The document vector space is:
{I, don’t, like, this, type, of, movie, but, one, kind}
The corresponding document vectors are:
[2, 1, 2, 2, 1, 1, 1, 1, 1, 0]
[2, 1, 2, 2, 0, 1, 1, 1, 1, 1]
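A minimal sketch that reproduces the vectors above, using Python's collections.Counter (the vocabulary order is fixed manually to match the listing):

```python
from collections import Counter

vocab = ["I", "don't", "like", "this", "type", "of", "movie", "but", "one", "kind"]

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

doc1 = "I don't like this type of movie , but I like this one".split()
doc2 = "I like this kind of movie , but I don't like this one".split()
print(bow_vector(doc1, vocab))  # [2, 1, 2, 2, 1, 1, 1, 1, 1, 0]
print(bow_vector(doc2, vocab))  # [2, 1, 2, 2, 0, 1, 1, 1, 1, 1]
```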
The biggest shortcoming of this representation is that it ignores textual context. Every word is treated as independent of every other, so the representation cannot capture the semantic information of texts. In the above example the two document vectors are extremely similar, which suggests that the two sentences very likely have the same meaning; however, they mean the exact opposite. Moreover, the lexicon in real-world applications is typically at least one million words in size, so the BoW model has two serious problems: high dimensionality and high sparsity. BoW is the basis of the vector space model, in which we can reduce the dimensionality using feature selection methods and increase the density by calculating feature weighting factors.
Some algorithms are used to reduce dimensionality, such as principal component analysis (PCA), independent component analysis (ICA), linear discriminant analysis (LDA), and latent semantic analysis (LSA), etc. The textual representations obtained with these machine learning methods are generally considered to be deep representations of the documents. In addition, distributed semantic word representations are seen as the foundation of deep learning methods.
Feature selection
When doing text categorization, it is often necessary to extract features from the documents. Normally we train the model using only the subset of features that is useful for learning, rather than all features, which would eventually lead to curse-of-dimensionality[8] problems.
The feature selection process includes the selection of extracted features and the calculation of feature weights.
The basic idea of feature selection is to rank the original features independently according to an evaluation metric; we then select the features with the highest scores and filter out the rest. Commonly used supervised feature selection methods include Relief, Fisher score and Information Gain[35] based methods, etc.[16] Note that feature selection methods should be distinguished from feature extraction techniques: feature extraction creates new features from the original data, whereas feature selection returns a subset of the initial features. Feature weight calculation mainly refers to the classic statistical term frequency–inverse document frequency (TF-IDF) method and its extensions.
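As an illustration of TF-IDF weighting, the following sketch implements one common variant (term frequency scaled by a smoothed inverse document frequency); it is a toy formulation, not the exact weighting scheme used elsewhere in this thesis.

```python
import math

def tf_idf(term, doc, corpus):
    """TF-IDF: term frequency scaled by (smoothed) inverse document frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / (1 + df))     # +1 smoothing avoids division by zero
    return tf * idf

corpus = [
    ["the", "movie", "was", "dull"],
    ["the", "plot", "was", "weak"],
    ["good", "acting", "good", "plot"],
]
print(tf_idf("good", corpus[2], corpus))  # frequent here, rare elsewhere -> higher weight
print(tf_idf("the", corpus[0], corpus))   # appears in most documents -> weight 0 here
```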
2.3 Text Classification (Classifiers)
Discriminative methods which directly estimate a decision boundary, such as logistic regression and support vector machines (SVM); generative models which build a generative statistical model, such as naïve Bayes; and instance-based classifiers which use observations directly, such as k-nearest neighbors (K-NN), are more traditional but still commonly used machine learning classification approaches. Tree-based classification algorithms, such as decision trees and random forests, are fast and accurate for text classification.
Deep learning has had a significant influence in the fields of image processing, speech recognition, and NLP. Its influence on NLP, and especially on text classification, is mainly reflected in the following aspects.
End-to-end training
In the past, when doing statistical natural language processing, experts needed to define various features and required a lot of domain knowledge, and it was sometimes not easy to find a good feature. With end-to-end training, as long as there are pairs of inputs and outputs, all we need to do is label the output corresponding to each input in order to form a training dataset. The learning system can then be established by automatically training neural networks, without hand-crafted settings or carefully designed features. This has changed the development of many NLP technologies and greatly reduced the technical difficulty of NLP. It means that as long as we have enough computational resources (GPU, TPU) and labeled data, we can basically implement an NLP model and get very good results, which has promoted the popularity of NLP technologies.
Semantic representation (word embeddings)
One commonly used semantic representation is the context-free embedding, in which the representation of a word is fixed (a multi-dimensional vector) regardless of its context. The other kind of representation depends on the context: since the meaning of the same word may differ across sentences, its embeddings differ accordingly. Using models like BERT and GPT, we can now train and obtain the dynamic embedding of a word based on its context.
Pre-trained model
We know that neural networks need to be trained with data; they extract information from the data and convert it into corresponding weights. These weights can then be extracted and transferred to other neural networks. Because these learned features are transferred, we do not need to train a new neural network from scratch when we try to solve a similar problem; instead we can start from models that have already been trained on similar problems. By using pre-trained models, which were previously trained with massive amounts of data, we can apply the corresponding structures and weights directly to the problem at hand. This process is called transfer learning. Since the pre-trained models are already well trained, we do not need to modify many weights; we only adjust some parameters, which is called fine-tuning. As a result, a pre-trained model can significantly reduce the training time.
Sentence encoding methods (CNN/RNN/LSTM/GRU/Transformer)
A sentence of variable length can be encoded with a CNN, RNN, LSTM, GRU or Transformer. Although all of these encoding methods are feasible, recently the Transformer has become the preferred choice for encoding sentences. We can apply the encoded sentences in various NLP applications such as machine translation, question answering, language modeling, sentiment analysis, text classification and so on.
Attention
The Encoder-Decoder model[52], or sequence-to-sequence (seq2seq) model, was first introduced in 2014 by Google. In an Encoder-Decoder neural network, one set of RNNs/LSTMs/GRUs encodes an input sequence into a fixed-length internal representation, and another set reads this internal representation and decodes it into an output sequence. The lengths of the input and output sequences may differ. This Encoder-Decoder architecture has achieved excellent results on a wide range of NLP tasks, such as question answering and machine translation. Nevertheless, the approach suffers from the problem that the network is forced to compress all the necessary information of an input sequence into a fixed-length internal vector, which makes it difficult to deal with longer sequences, especially test sequences that are longer than those in the training set.[7] First, the internal semantic vector cannot fully represent the information of the entire input sequence. Second, earlier information may be overwritten by newly arriving inputs.
The attention mechanism is a way to free the Encoder-Decoder structure from fixed-length
internal representations. It maintains the encoder’s intermediate outputs from each step
of the input sequences, and then trains the model to learn how to selectively focus on the
inputs and relate them to the items in the output sequences. In other words, each item
in the output sequence depends on the selected item in the input sequence.[1]
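The core computation behind this mechanism can be sketched in a few lines of NumPy. The example below is a generic dot-product attention over random toy vectors, not the exact additive variant of [7]:

```python
import numpy as np

def attention(query, encoder_states):
    """Dot-product attention: weight encoder outputs by relevance to the query."""
    scores = encoder_states @ query                    # one relevance score per input step
    weights = np.exp(scores) / np.sum(np.exp(scores))  # softmax over input positions
    context = weights @ encoder_states                 # weighted sum of encoder outputs
    return context, weights

encoder_states = np.random.randn(7, 4)  # 7 input steps, hidden size 4
query = np.random.randn(4)              # current decoder state
context, weights = attention(query, encoder_states)
print(weights)        # how much each input position is attended to
print(context.shape)  # (4,) context vector fed to the decoder
```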
Although this mechanism increases the model's computational complexity, it can lead to better, even state-of-the-art, performance. Additionally, the model can show how it attends to the input sequence when predicting the output sequence. Unlike in computer vision, where several approaches for understanding and visualizing CNNs have been developed, it is usually difficult to visualize and understand what neural networks in NLP have learned.[5] This mechanism therefore helps us understand and analyze what the model is focusing on and how much it focuses on a particular input-output pair. Further attention visualization examples can be found in machine translation[7], text summarization[48], speech recognition[13], image descriptions[56], etc.
3 Related Work
A central issue in text classification is feature representation, which is commonly based on the bag-of-words (BoW) model. In addition, methods such as TF-IDF weighting, LSA and sLDA are applied to obtain more discriminative features. The major classifiers of the traditional machine learning era are mainly based on naïve Bayes, maximum entropy, k-nearest neighbors (K-NN), decision trees and support-vector machines (SVM). Traditional machine learning algorithms for text classification are well summarized in [6][29].
3.1 Text Classification
In the era of deep learning, text categorization escapes the process of manual feature design, text similarity comparison and distance definitions, but it still cannot escape the choices of text preprocessing, feature extraction, model selection, parameter optimization, coding considerations (one-hot, n-gram, etc.) and time-cost trade-offs.
In summary, there are six main types of deep learning models:
(1) Vector representation of words: word2vec, GloVe, FastText, BERT, GPT, ELMo, etc.
(2) Convolutional neural network feature extraction: TextCNN, CharCNN, DCNN, VDCNN, etc.
(3) Pre-training & multi-task learning: BERT, GPT, ELMo, etc.
(4) Context mechanism: TextRNN, BiRNN, RCNN, etc.
(5) Memory storage mechanism: EntNet, DMN, etc.
(6) Attention mechanism: HAN, etc.
In the following sections we go into some detail on the above models to get a global picture of the application and development of deep learning in text classification in recent years.
3.1.1 Vector Representation of Words
FastText[25], which is not particularly innovative academically but performs outstandingly well, was proposed by Mikolov's team at Facebook in 2016. FastText (a shallow network) can often achieve accuracy comparable to deep networks in text classification while being many orders of magnitude faster to train. The core idea of FastText is to superimpose the word and n-gram vectors of the whole document to obtain a document vector, which is then used for softmax multi-class classification. Two techniques are involved: character-level n-gram features and hierarchical softmax classification. Hierarchical softmax was first introduced in word2vec, which we discuss in the next chapter, so here we only focus on character-level n-grams.
Word2vec treats each word in the corpus as atomic: it generates one vector per word and ignores the word's internal morphological features. For example, "country" and "countries" have many characters in common, i.e., they have similar internal forms, but in the traditional word2vec model such words lose this internal information once they are converted to different IDs. To overcome this problem, FastText uses character-level n-grams to represent a word. For the word "country", assuming n is 3, its trigrams are as follows:
<co, cou, oun, unt, ntr, try, ry>
where < indicates a prefix and > indicates a suffix. We can use these trigrams to represent the word "country"; furthermore, we can sum the vectors of these seven trigrams to represent it (a short sketch of this extraction follows).
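A minimal sketch of this character-level n-gram extraction with boundary markers, reproducing the seven trigrams above:

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, including boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("country"))
# ['<co', 'cou', 'oun', 'unt', 'ntr', 'try', 'ry>']
```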
This brings two benefits. First, it generates better word vectors for low-frequency words, because their n-grams can be shared with other words. Second, for out-of-vocabulary (OOV) words we can still produce word vectors by superimposing their character-level n-gram vectors, whereas in word2vec OOV words are either zero vectors or randomly initialized.
Figure 3.1: Model architecture of FastText for a sentence with N n-gram features x_1, ..., x_N. The features are embedded and averaged to form the hidden variable.
As we can see in Figure 3.1, like the Continuous Bag of Words (CBOW) model, FastText has three layers: an input layer, a hidden layer and an output layer (hierarchical softmax). They have several things in common: their inputs are words represented by multiple vectors; they have a specific target value; and the hidden layer is the average of the superimposed word vectors. The differences are that the input of CBOW is the context of the target word, whereas the input of FastText is multiple words and their character-level n-gram features, which together represent a single document; the input words of CBOW are one-hot vectors, whereas the inputs of FastText are embedded vectors; and the output of CBOW is the target word, whereas the output of FastText is the class label of the document. In short, FastText is a simple, almost brute-force, but effective classification model. For reasons of space, global vectors for word representation (GloVe) will not be discussed here.
3.1.2 Convolutional Neural Network Feature Extraction
TextCNN[28], which has become a classical deep learning model in text categorization,
was proposed by Kim in 2014.
Figure 3.2: CNN architecture for sentence classification (adopted from Zhang et al. 2015 [63])
As shown in Figure 3.2, the input layer is a 7 × 5 matrix: the word vector dimension is 5 and the sentence length is 7. The second layer uses three sets of convolution kernels with widths 2, 3 and 4, with two kernels per width. Each convolution kernel slides over the whole sentence to generate a feature map; in the figure, no padding is used during sliding, so the kernel of width 4 sliding over the sentence of length 7 yields a feature map of length 4. 1-max pooling is performed over each map, so a single feature is obtained from each of the 6 maps, and these 6 features are concatenated to form one final feature vector. The final output layer receives this feature vector as input and uses it to classify the sentence.
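This architecture can be written down compactly. The following is a minimal Keras sketch using the toy dimensions of the figure (sentence length 7, embedding size 5, filter widths 2/3/4 with two filters each, two output classes); it is illustrative only and does not reflect the hyperparameters used in our experiments.

```python
import tensorflow as tf
from tensorflow.keras import layers

seq_len, emb_dim, num_classes = 7, 5, 2

inputs = layers.Input(shape=(seq_len, emb_dim))       # sentence as a matrix of word vectors
pooled = []
for width in (2, 3, 4):                               # three filter widths, two filters each
    conv = layers.Conv1D(filters=2, kernel_size=width, activation="relu")(inputs)
    pooled.append(layers.GlobalMaxPooling1D()(conv))  # 1-max pooling over each feature map
features = layers.Concatenate()(pooled)               # final feature vector (6 values)
outputs = layers.Dense(num_classes, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.summary()
```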
Over the last several years, deep learning methods, especially convolutional neural networks (CNN), have been shown to outperform previous state-of-the-art machine learning techniques in computer vision. But when CNNs were first used in NLP, different problems arose.
The max pooling operation in a CNN has advantages: it is invariant to the position (and rotation) of a feature, because no matter where the feature appears in a sentence, it can be extracted regardless of its position. Besides, max pooling reduces the number of model parameters, which helps to reduce over-fitting. In NLP, however, this advantage is not necessarily a good thing, because in many NLP applications the location of a feature is very important; for example, negation words appearing in different positions result in completely different sentence meanings:
I don’t like this type of movie, but I like this one.
I like this kind of movie, but I don’t like this one.
So for TextCNN the complete loss of feature location information is its Achilles' heel.
In summary, TextCNN uses convolution to extract key information similar to n-grams. We are able to capture features spanning multiple consecutive words and to share weights and biases when computing the same type of features. The overall performance of TextCNN is good, but there are problems: the local receptive field size (filter size) is fixed and difficult to choose, the ability to model longer sequences is limited, and, most importantly, the location information of features is lost.
Based on CNN, several models have been proposed more recently, in particular to address the above problems.
Kalchbrenner et al. (2014)[27] introduced the dynamic convolutional neural network (DCNN). It uses dynamic k-max pooling, where the result of pooling is not a single maximum value but the k largest values, kept as a subsequence of the original input. This brings several benefits: dynamic k-max pooling better preserves multiple pieces of important information in a sentence; the number of extracted features varies with the length of the sentence; the relative positions of words and thus word position information are preserved; and semantic relations between words that are far apart in a sentence are also taken into account.
Zhang et al. (2015)[63] conducted a variety of thorough comparative experiments on the CNN model of [28] and provide advice on CNN architectures and hyperparameter settings.
Zhang and LeCun (2015)[62] proposed the character-level convolutional neural network (CharCNN), which is also an effective model. The most important point of their experimental conclusions is that this model can classify text without using words, which strongly suggests that human language can also be treated as a signal no different from any other kind of signal.
Conneau et al. (2016)[14] were the first to use very deep convolutional neural networks (VDCNN) in NLP. Prior to their work, typically only very shallow CNNs were used in text categorization. They utilize 29 convolutional layers to improve the accuracy of text classification and, by exploring deep architectures, make some important discoveries: VDCNN performs better on big datasets, and greater network depth can improve model performance.
Zhang and LeCun (2017)[61] performed an interesting experiment which shows that the FastText model yields better results with character-level encoding for Chinese, Japanese, and Korean (CJK) texts, and better results with word-level encoding for English texts.
3.1.3 Pre-training & Multi-task Learning
Pre-training has been a conventional method in computer vision since deep learning became popular. We already know that in a hierarchical CNN, neurons in different layers learn different types of image features. For example, if we train a network for face recognition and then visualize the features learned by each layer of neurons, we will see that the lowest hidden layer has learned features such as colors and lines, the second layer facial parts, and the third layer facial contours. The lower the layer, the more basic the features; the higher the layer, the more task-specific the extracted features. Because of this, the pre-trained network parameters, especially the features extracted by the bottom layers, are more independent of the specific task and more general. This is why we normally initialize new models with the pre-trained parameters of the bottom layers.
What about NLP? The general choice for pre-training in NLP is to use language models. The neural probabilistic language model introduced by Bengio et al. (2003)[9] can not only predict the next word given the last few words, but also yields a by-product, a matrix, which is in fact the word embedding matrix. Ten years later, the word2vec model proposed by Mikolov et al. (2013)[40], which simply aims to obtain word embeddings, became more and more popular.
So what is pre-training in NLP? Given an NLP task, such as text classification or question answering, each word of a sentence is one-hot encoded as input and then multiplied by the learned word embedding matrix, giving us the word's corresponding embedding. This looks like a simple lookup operation, but in fact the word embedding matrix is the network parameter matrix which maps the one-hot layer to the embedding layer. In other words, the network from the one-hot layer to the embedding layer is effectively initialized with this pre-trained embedding matrix.
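The equivalence between multiplying by a one-hot vector and looking up a row of the embedding matrix can be checked with a tiny NumPy example (random toy matrix):

```python
import numpy as np

vocab_size, emb_dim = 5, 3
E = np.random.randn(vocab_size, emb_dim)  # pre-trained word embedding matrix

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying the one-hot vector by E is exactly a row lookup.
print(np.allclose(one_hot @ E, E[word_id]))  # True
```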
However, the most serious problem with these earlier word embedding models is that they cannot deal with polysemy. We know that polysemy is a phenomenon that often occurs in human languages. What is the negative impact of polysemous words on word embeddings? For example:
He jumped in and swam to the opposite bank.
My salary is paid directly into my bank.
As shown above, "bank" has two meanings, but when we encode this word, the word embedding model cannot distinguish between them. Even though the contextual words of these two sentences are different, the model always predicts the same word "bank" during training. The word "bank" occupies a single word embedding (the same row of the matrix), which causes two different kinds of contextual information to be encoded into the same parameter space. As a result, word embeddings cannot distinguish the different senses of polysemous words.
Researchers made astonishing progress in NLP in 2018. Large-scale pre-trained language models like OpenAI GPT and BERT perform very well on various NLP tasks using generic model architectures. The idea is similar to the pre-training in computer vision discussed previously, but even better: the newly proposed approaches do not require labeled data for pre-training. They make use of unlabeled data, which is practically inexhaustible in the real world, allowing us to extract a large amount of linguistic knowledge and encode it into the models. Integrating as much linguistic knowledge as possible naturally enhances the generalization ability of the models, especially when the amount of task data is limited. Introducing prior linguistic knowledge has always been one of the main goals of NLP, especially in deep learning scenarios.
Deep contextualized word representations (ELMo), proposed by Peters et al. (2018)[44], dynamically adjust word embeddings based on the current context. The core idea of ELMo is that we first learn word embeddings by training a language model; at this point polysemous words still cannot be distinguished. But when we actually use the word embeddings, the words already appear in a specific context, so we can adjust each word embedding according to the semantics of that context. The adjusted word embedding better expresses the specific meaning in its context, which cleverly solves the problem of polysemy. ELMo has a typical two-phase process. The first phase is pre-training with a language model, called feature-based pre-training. The second phase extracts the embedding of the corresponding word from each layer of the pre-trained model. Each word in a sentence gets three embeddings: the bottom layer provides the word embedding; the embedding from the first long short-term memory (LSTM) layer captures syntactic information; and the embedding from the second LSTM layer captures semantic information. We then combine the extracted embeddings into a new feature for downstream NLP tasks. ELMo improves performance on six different NLP tasks.
The generative pre-training model (OpenAI GPT) introduced by Radford et al. (2018)[45] also has a two-phase training process, see Figure 3.3. In the first phase the language model is used for pre-training, and in the second phase the model is fine-tuned to solve downstream tasks. Despite the similarity between GPT and ELMo, GPT has two major differences. The first is that in GPT's second phase we fine-tune the same base model for the downstream tasks, requiring only minimal task-specific architecture modifications. The second difference is that GPT uses the Transformer as its feature extractor. The Transformer, introduced by Vaswani et al. (2017)[54] and first used in neural machine translation (NMT), is a novel neural network architecture based on a self-attention mechanism. Why is the Transformer used here? Because it is currently the best feature extractor: it outperforms the recurrent neural network (RNN) because RNNs cannot be parallelized, and it outperforms the CNN because it can capture long-distance features. Additionally, although GPT still uses a language model as the target task in its pre-training step, it adopts a unidirectional language model. That is, one limitation of GPT is that it is only trained on left-to-right context, without right-to-left context. The results of GPT are quite amazing: it achieves the best results in 9 out of 12 NLP tasks.
Figure 3.3: (left) Transformer architecture and training objectives used in GPT. (right) Input transformations for fine-tuning on different tasks. We convert all structured inputs into token sequences to be processed by the pre-trained model, followed by a linear+softmax layer. (adopted from [45])
Now we finally come to the new language representation model, deep bidirectional encoder representations from transformers (BERT), proposed by Devlin et al. (2018)[18], which obtains new state-of-the-art results on 11 NLP tasks including text classification. BERT has the same two-phase process as GPT. It is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. The pre-trained BERT representations can be fine-tuned with just one additional output layer. The authors also propose a masked language model and a next-sentence prediction task in order to pre-train the model in both the forward and backward direction. BERT input representations are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
There is a close relationship between the three models above, see Figure 3.4. If we change GPT to use a bidirectional language model in the pre-training phase, we obtain BERT. If we replace ELMo's LSTM feature extractor with a Transformer, we also arrive at BERT.
Figure 3.4: Differences in pre-training model architectures. BERT uses a bidirectional Transformer. OpenAI GPT uses a left-to-right Transformer. ELMo uses the concatenation of independently trained left-to-right and right-to-left LSTMs to generate features for downstream tasks. Among the three, only BERT representations are jointly conditioned on both left and right context in all layers. (adopted from [18])
In summary, inductive transfer learning consists of two steps: pre-training, in which the model learns a general-purpose representation of its inputs using a specific feature extractor (RNN, LSTM, Transformer), and adaptation, in which the representation is transferred to a new task.[43] As discussed previously, there are two main paradigms for adaptation: feature ensembling (ELMo) and fine-tuning (GPT, BERT). For ELMo, we extract contextual representations of the words from all layers by feeding the input sentence into the pre-trained bidirectional LSTM network. During adaptation, we learn a linear weighted combination of the word embeddings of all layers, which is then used as input to a task-specific architecture for the downstream task. In contrast, for GPT and BERT, after obtaining the pre-trained model we keep the same model during fine-tuning (adaptation) and use part of the training data of the downstream task to adjust the network parameters obtained in the pre-training stage; only a task-specific output layer is needed. So which paradigm is better? Peters et al. (2019)[43] empirically compared feature ensembling with fine-tuning across diverse datasets and conclude that the relative performance of the two approaches depends on the similarity of the pre-training and target tasks. For BERT, therefore, it is usually better to fine-tune the model.
3.1.4 Context Mechanism
We know that the nature of convolution is to extract features, and deep learning is about learning feature representations. But documents and sentences are sequences, that is, the previous input is related to the subsequent input, so using an RNN to process text seems reasonable. The recurrent neural network for text classification (TextRNN) proposed by Liu et al. (2016)[37], whose architecture is similar to TextCNN, is the first RNN model for classifying documents. RNNs and their variants can capture variable-length and bidirectional n-gram information, but they also have disadvantages. In our view this model is not recommended, because the training speed of RNNs is very slow. The RNN is also a biased model, in which later words are more dominant than earlier words.
The recurrent convolutional neural network (RCNN) introduced by Lai et al. (2015)[33] is a combination of CNN and RNN. The authors employ a bidirectional RNN to capture the contexts and learn the word representations, and max-pooling to automatically determine which features play a more important role in text classification. This model runs faster than a plain RNN and achieves good results on three of the four commonly used datasets.
3.1.5 Memory Storage Mechanism
A typical Encoder-Decoder model's performance is limited by the size of its memory, especially when dealing with longer sequences. This limitation can be addressed by the attention mechanism proposed by Bahdanau et al. (2014)[7]. We can also refine the attention mechanism by adding a separate, readable and writable memory component. The dynamic memory network (DMN) proposed by Kumar et al. (2016)[32] has four modules: an input module, a question module, an episodic memory module and an answer module. Here the input is a list of sentences; conditioned on the question, gated units use the attention mechanism to selectively store input information in the episodic memory module, and the answer module then produces the output from the episodic memory. The DMN obtains state-of-the-art results on several types of tasks and datasets, including text classification.
3.1.6 Attention Mechanism
The hierarchical attention network (HAN) proposed by Yang et al. (2016)[59] is another model that introduces the attention mechanism. The basic idea of HAN is to first consider the hierarchical structure of documents: a sentence is a group of words, and a document consists of several sentences. We can therefore divide the document hierarchy into three levels: words, sentences and texts. Two bidirectional LSTMs are used to model the word-to-sentence and sentence-to-document levels. Second, different words and sentences carry different amounts of information and thus cannot be treated equally, so an attention mechanism is introduced at each level to capture longer dependencies. From this we derive the importance of the different words and sentences when constructing sentence and text representations.
The HAN model structure is very consistent with the human understanding process: word → sentence → document. Most importantly, the attention layers can be visualized well while also providing better classification accuracy. At the same time, we find that the attention component has a great influence on the expressive power of the model.
3.2 Short Text Classification
Traditional machine learning methods for short text classification are mainly based on widely used machine learning models such as SVM. Next we focus on deep learning methods for short text classification, which have dominated research in recent years.
Wang et al. (2017)[55] introduce knowledge from a knowledge base and combine words with associated concepts to enrich text features. They also use character-level information to enhance word embeddings and finally build a joint CNN model to classify short texts. Our work is similar to theirs; the difference is that we enrich word embeddings with entities, whereas they use concepts. We also adopt the same datasets used in their experiments.
Zeng et al. (2018)[60] proposed topic memory networks, which jointly explore latent topics and classify documents in an end-to-end system.
3.3 Text Classification using Named Entities
There is not much research on text classification using named entities. Kral (2014)[30] compared five different approaches to integrating named entities, using a naïve Bayes classifier on a Czech corpus. The experimental results show that entity features do not significantly improve the classification performance over the baseline word-based features; the improvement is at most 0.42%.
4 Wikipedia2Vec and Doc2Vec
One of our main concerns is to get good entity vectors. Our entity vectors are generated using Wikipedia2Vec and Doc2Vec. Wikipedia2Vec, implemented by Yamada et al. (2018)[57], is an open-source tool for obtaining vector representations of words and entities from the large Wikipedia corpus. The tool learns word and entity vectors simultaneously and thus places similar words and entities close to each other in a high-dimensional continuous vector space. Compared with existing embedding tools such as Gensim[46], FastText[25] and RDF2Vec[47], it has achieved state-of-the-art results in the named entity disambiguation (NED) task.
4.1 Word2vec and Skip-gram
The word2vec model, proposed by Mikolov et al. (2013)[40], has attracted an enormous amount of attention from researchers in recent years. The core idea of word2vec is to characterize a word through its neighbors, by predicting between every word and its context words. The word vector representations produced by the model have been shown to carry semantic features and can subsequently be used in many natural language processing applications and further research.
Word2vec actually comprises two algorithms: skip-gram, which predicts context words given the target word, and continuous bag of words (CBOW), which predicts the target word from its bag-of-words context. The model also comes with two moderately efficient training methods, hierarchical softmax and negative sampling[39].
Figure 4.1: Word based skip-gram model
We now focus on the skip-gram algorithm. Given a sentence of $T$ words $w_1, w_2, \ldots, w_T$, for each position $t = 1, \ldots, T$ we predict the surrounding words in a window of “radius” $m$ around every word. Here $m$ is the size of the context window, $w_t$ denotes the center word and $w_{t+j}$ is one of its context words. The objective function is to maximize the probability of any context word given the current center word. We want to maximize the following objective function:

$$J'(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\ j \ne 0} P(w_{t+j} \mid w_t; \theta) \qquad (4.1)$$
Since machine learning prefers minimizing, we add a minus sign. The factor $\frac{1}{T}$ normalizes per word, so the objective function does not depend on the length of the sentence:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\ j \ne 0} \log P(w_{t+j} \mid w_t) \qquad (4.2)$$
So we minimize the negative log probability, where $\theta$ represents all the variables we will optimize.
For the conditional probability $P(w_{t+j} \mid w_t)$, the simplest formulation is:

$$P(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \qquad (4.3)$$

where $o$ is the outside (or output) word index, $c$ is the center word index, $v_c$ and $u_o$ are the “center” and “outside” vectors of indices $c$ and $o$, and $V$ is the set containing all the words in the vocabulary. Note that we have two vectors for each word because it makes the math much easier. This softmax function uses word $c$ to obtain the probability of word $o$. We use the softmax function because it is a standard way to turn numbers into a probability distribution.
We now compute the derivative of equation 4.3 in order to work out the minimum.
$$\frac{\partial}{\partial v_c} \log \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} = \frac{\partial}{\partial v_c} \log \exp(u_o^{T} v_c) - \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{T} v_c) \qquad (4.4)$$
For the left part of the formula, we have

$$\frac{\partial}{\partial v_c} \log \exp(u_o^{T} v_c) = \frac{\partial}{\partial v_c} (u_o^{T} v_c) = u_o \qquad (4.5)$$
For the right part of the formula, by using the chain rule we have

$$\begin{aligned}
\frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{T} v_c)
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot \frac{\partial}{\partial v_c} \left( \sum_{x=1}^{V} \exp(u_x^{T} v_c) \right) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot \left( \sum_{x=1}^{V} \frac{\partial}{\partial v_c} \exp(u_x^{T} v_c) \right) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot \left( \sum_{x=1}^{V} \exp(u_x^{T} v_c)\, \frac{\partial}{\partial v_c} (u_x^{T} v_c) \right) \\
&= \frac{1}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot \left( \sum_{x=1}^{V} \exp(u_x^{T} v_c)\, u_x \right)
\end{aligned} \qquad (4.6)$$
Combining formula 4.5 with 4.6 we get the final derivative of the conditional probability $P(w_{t+j} \mid w_t)$, that is

$$\begin{aligned}
\frac{\partial}{\partial v_c} \log P(o \mid c)
&= u_o - \frac{1}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot \left( \sum_{x=1}^{V} \exp(u_x^{T} v_c)\, u_x \right) \\
&= u_o - \sum_{x=1}^{V} \frac{\exp(u_x^{T} v_c)}{\sum_{w=1}^{V} \exp(u_w^{T} v_c)} \cdot u_x \\
&= u_o - \sum_{x=1}^{V} P(x \mid c) \cdot u_x
\end{aligned} \qquad (4.7)$$
Besides, we can observe that the subtracted term in formula 4.7 is an expectation: an average over all the context vectors weighted by their probabilities. The above formula gives only the partial derivatives with respect to the center word vector parameters. Similarly, we also need to compute the partial derivatives with respect to the output word vector parameters. Finally, we have derivatives with respect to all the parameters and we can minimize the cost (loss) function using the gradient descent algorithm.
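To make the gradient in equation 4.7 concrete, here is a minimal NumPy sketch (an illustration only, not the implementation used in this work; the vocabulary size, dimensionality and random vectors are made up):

import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 5                      # toy vocabulary size and vector dimensionality
U = rng.normal(size=(V, d))       # "outside" vectors u_w, one row per word
v_c = rng.normal(size=d)          # "center" vector of the current center word
o = 3                             # index of the observed outside word

# Softmax P(o|c) from equation 4.3
scores = U @ v_c                  # u_w^T v_c for every word w
probs = np.exp(scores - scores.max())
probs /= probs.sum()

# Gradient of -log P(o|c) with respect to v_c, following equation 4.7:
# -(u_o - sum_x P(x|c) u_x)
grad_v_c = probs @ U - U[o]

# One step of gradient descent on the center vector
lr = 0.1
v_c -= lr * grad_v_c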
4.2 Wikipedia2Vec
Wikipedia2Vec iterates over all Wikipedia pages. It extends the conventional word vector model (i.e., the skip-gram model) with two submodels: the Wikipedia link graph model and the anchor context model. It then jointly optimizes these three models.
4.2.1 Wikipedia Link Graph Model
The Wikipedia link graph model[58] learns entity vectors by predicting neighboring entities in Wikipedia's link graph. Wikipedia's link graph is an undirected graph whose nodes represent entities and whose edges represent links between entities in Wikipedia. Entity linking (EL) is the task of recognizing (named entity recognition, NER) and linking (named entity disambiguation, NED) a named-entity mention to an instance in a knowledge base (KB), typically Wikipedia-derived resources like Wikipedia, DBpedia and YAGO.[12]
NED in NLP is the task of determining the identity of entities mentioned in documents. For example, given the sentence “Washington is the capital city of the United States”, we aim to determine that Washington refers to “Washington, D.C.” and not to “George Washington”, the first President of the United States. EL requires a KB that contains entities. Wikipedia2Vec is obviously based on Wikipedia, in which each page is regarded as a named entity. For example, the page of Washington, D.C. in Wikipedia has a link to the page of United States, so there is an edge between this pair of entities. There are also other cases, for example, Berlin has a link to Germany and Germany has a link to Berlin, so both entities are linked to each other.
Figure 4.2: Wikipedia link graph model
Given a set of entities $e_1, e_2, \ldots, e_n$, for each entity $e_i$ we compute the probability of every entity in the set that is linked to this entity $e_i$. Here $E$ denotes the set of all the entities in the KB and $S_e$ is the set of entities which have a link to entity $e$. We want to maximize the probability of any entity that has a link to entity $e$ given this entity $e$, so we minimize the following objective function:

$$J_e = -\sum_{e_i \in E}\ \sum_{e_o \in S_{e_i},\ e_i \ne e_o} \log P(e_o \mid e_i) \qquad (4.8)$$
For the conditional probability $P(e_o \mid e_i)$, we compute it using the following softmax function:

$$P(e_o \mid e_i) = \frac{\exp(u_{e_o}^{T} v_{e_i})}{\sum_{e \in E} \exp(u_{e}^{T} v_{e_i})} \qquad (4.9)$$
4.2.2 Anchor Context Model
So far we have learned entity vectors using entities and word vectors using words. Since these vectors are obtained by training two different models, words and entities are mapped into different vector spaces. We want to map them into the same vector space, so we introduce the anchor context model[58], which learns entity vectors and word vectors simultaneously by predicting the context words given an entity.

In the anchor context model, we want to predict the context words of the entity pointed to by a target anchor.[58] Here $A$ denotes the set of anchors in the KB. Each anchor in $A$ consists of a referent entity $e_i$ and a set of its context words $S$. The previous $c$ and next $c$ words of a referent entity are taken as its context words. The objective function is as follows:
$$J_a = -\sum_{(e_i, S) \in A}\ \sum_{w_o \in S} \log P(w_o \mid e_i) \qquad (4.10)$$
Figure 4.3: Anchor context model
For the conditional probability $P(w_o \mid e_i)$, we compute it using the following softmax function:

$$P(w_o \mid e_i) = \frac{\exp(u_{w_o}^{T} v_{e_i})}{\sum_{w \in W} \exp(u_{w}^{T} v_{e_i})} \qquad (4.11)$$
4.3 Doc2Vec
Doc2Vec was proposed by Quoc Le and Tomas Mikolov in 2014.[34] Doc2Vec, also called Paragraph Vector, is an unsupervised algorithm with which we are able to obtain fixed-length feature representations from variable-length documents. Doc2Vec can be used for text classification and sentiment analysis tasks.

Similar to word2vec, Doc2Vec has two training algorithms: distributed bag of words (DBOW), which is similar to skip-gram in word2vec, and distributed memory (DM), which is similar to continuous bag of words (CBOW) in word2vec. In particular, the input of DBOW is the paragraph vector, and the model is trained to predict the words in a small window. In contrast to DBOW, in DM the paragraph vector and the local context word vectors are concatenated or averaged to predict the next word in a context.

Quoc Le also conducted experiments on Wikipedia, which show that the Paragraph Vector algorithms perform significantly better than other methods and that embedding quality is enhanced.[15] Therefore, in our work we train the Doc2Vec model on a Wikipedia dump and use Doc2Vec to obtain better entity embeddings.
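As a rough sketch of how such paragraph vectors can be trained with the Gensim library (the toy corpus and all hyperparameter values below are placeholders, not the settings used in our experiments; with Gensim 4.x the trained document vectors are accessed via model.dv):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: in our setting each document would be the text of a Wikipedia page,
# tagged with the title of the corresponding entity.
corpus = [
    TaggedDocument(words=["berlin", "is", "the", "capital", "of", "germany"], tags=["Berlin"]),
    TaggedDocument(words=["paris", "is", "the", "capital", "of", "france"], tags=["Paris"]),
]

# dm=0 selects the DBOW training algorithm, dm=1 the DM algorithm.
model = Doc2Vec(corpus, vector_size=50, window=5, min_count=1, dm=0, epochs=40)

# Vector of a training document (entity) and of an unseen text.
berlin_vec = model.dv["Berlin"]
new_vec = model.infer_vector(["munich", "is", "a", "city", "in", "germany"])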
5 Convolutional Neural Network for Short Text Classification
In this section, we modify the traditional convolutional neural network for text classification (TextCNN) and propose a new model called convolutional neural network with entities for text classification (EntCNN), which uses words as well as entities to extract more contextual information from texts. We first introduce the way we obtain entities and entity vectors from texts. Then we describe our model and show how it learns features from words and entities.
5.1 Entity Linking
The first step of our work is to detect named entities in texts. A named entity is, roughly speaking, anything that can be referred to with a proper name: a person, a location or an organization.[26] As mentioned in chapter 4, entity linking (EL) is the task of recognizing and disambiguating named entities to a knowledge base (KB). It is sometimes also known as named entity recognition and disambiguation. Why can we benefit from EL? By linking named entity mentions appearing in texts with their corresponding entities in a KB, we can make use of the KB, which contains rich information about entities, to better understand documents. EL is essential to many NLP tasks, including information extraction, information retrieval, content analysis, question answering and knowledge base population.[50] Shen et al. (2014) wrote an outstanding survey about EL, see [50].
In our work, we utilize TagMe to extract entities and link them to Wikipedia entries. TagMe is an end-to-end system with two major stages. First, it processes a text to extract named entities (i.e. named entity recognition). Then it disambiguates the extracted entities to the correct entry in a given KB (i.e. named entity disambiguation). After having obtained entity entries in a KB, we use Wikipedia2Vec to get entity vectors. As discussed in the previous chapter, Wikipedia2Vec is a disambiguation-only approach: in contrast to an end-to-end approach, this model takes standard named entities directly as input and disambiguates them to the correct entries in a given KB. In our work, we simply query the previously obtained entity entries with Wikipedia2Vec and get entity vectors as output.
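A minimal sketch of this query step with the wikipedia2vec Python package is shown below (the model file name is a placeholder for a pre-trained Wikipedia2Vec model, and we assume a missing entity raises a KeyError):

from wikipedia2vec import Wikipedia2Vec

# Placeholder path: a pre-trained Wikipedia2Vec model file downloaded separately.
MODEL_FILE = "enwiki_20180420_300d.pkl"
wiki2vec = Wikipedia2Vec.load(MODEL_FILE)

# Entity titles as returned by the TagMe annotation step, e.g. for the sentence
# "Berlin is my favorite city in Germany".
entity_titles = ["Berlin", "Germany"]

entity_vectors = {}
for title in entity_titles:
    try:
        entity_vectors[title] = wiki2vec.get_entity_vector(title)
    except KeyError:
        # Not all linked entities have a vector; such entities are zero-padded later.
        pass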
Figure 5.1: Convolutional neural network with words and entities for short text classification
5.2 Overall Architecture of EntCNN
As discussed in chapter 3, CNNs have achieved impressive results on text classification tasks. In our work, we focus on the single layer CNN due to its outstanding empirical performance and use it as our baseline model.

We begin with a preprocessed sentence as Figure 5.1 shows: Berlin is my favorite city in Germany. The whole network consists of an input layer (embedding layer), a convolution and max-pooling layer, a dropout layer and an output layer. The details are described in the following paragraphs.
Embedding Layer
We first convert the preprocessed sentence to an embedding matrix. This word embedding matrix maps vocabulary word indices to low-dimensional vector representations; each row of the matrix is a word vector representation. It is essentially a look-up table. Note that each sentence is zero-padded to the maximum sentence length of the document. We initialize word vectors with publicly available pre-trained word2vec or GloVe models. Vocabulary words which are not present in the set of pre-trained vectors are randomly initialized. Similarly, our entity embedding matrix is initialized using Wikipedia2Vec. Note that the entity embedding matrix must have the same length as the word embedding matrix for TensorFlow implementation reasons. This does not affect the final result, because padded zeros are ignored after the max-pooling step. Note also that not all entities have entity vectors. We adopt the same zero-padding strategy for entities that have no entity embedding, i.e. they are zero-padded in the entity embedding matrix.
Assume the maximum sentence length of a given document is $l$ and the dimensionality of word and entity vectors is denoted as $d$. We obtain the embedding matrix $W$ by concatenating the embedding matrices of words and entities: $W = W_w \oplus W_e$, where $W_w$ and $W_e$ denote the embedding matrices of words and entities respectively and $\oplus$ stands for the concatenation operator. Thus, given a sentence, its embedding matrix can be represented as:

$$W = v_{w_1} \oplus v_{w_2} \oplus \cdots \oplus v_{w_l} \oplus v_{e_1} \oplus v_{e_2} \oplus \cdots \oplus v_{e_l} \qquad (5.1)$$
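The following NumPy sketch illustrates how such a concatenated embedding matrix can be assembled (the word and entity vectors are random placeholders standing in for pre-trained word2vec/GloVe and Wikipedia2Vec look-ups; the real implementation works on index matrices inside TensorFlow):

import numpy as np

d = 300          # dimensionality of word and entity vectors
l = 8            # maximum sentence length of the document (padding length)

rng = np.random.default_rng(0)
# Placeholders for pre-trained look-up tables.
word_vectors = {w: rng.normal(size=d) for w in
                ["berlin", "is", "my", "favorite", "city", "in", "germany"]}
entity_vectors = {"Berlin": rng.normal(size=d), "Germany": rng.normal(size=d)}

def lookup(tokens, vectors, dim, length):
    """Stack the vectors of the given tokens, using zero vectors for missing
    tokens and zero-padding the sequence up to `length` rows."""
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    rows += [np.zeros(dim)] * (length - len(rows))
    return np.stack(rows)                          # shape: (length, dim)

words = ["berlin", "is", "my", "favorite", "city", "in", "germany"]
entities = ["Berlin", "Germany"]

W_w = lookup(words, word_vectors, d, l)            # word part of the input
W_e = lookup(entities, entity_vectors, d, l)       # entity part, same length as W_w
W = np.concatenate([W_w, W_e], axis=0)             # equation 5.1, shape (2*l, d)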
Convolution and Max-Pooling Layer
Unlike convolution operations in computer vision, where the filter width can be one or any other integer, in text classification the filter width should be equal to the dimensionality of the word and entity vectors (i.e. $d$). It does not make sense to operate on only part of a word vector: for the word example, we cannot recognize it when seeing only xa or ampl. Thus we only need to vary the filter height $f$. Given a filter (convolution kernel) of size $f \times d$ and a window of $f$ consecutive word and entity vectors $v_i, v_{i+1}, \ldots, v_{i+f-1}$, denoted as $[v_i : v_{i+f-1}]$, a feature $c_i$ is then generated from this window of words and entities by:
$$c_i = g(w \cdot [v_i : v_{i+f-1}] + b) \qquad (5.2)$$
where $g$ is a non-linear activation function and $b$ is a bias term. Various nonlinear activation functions are used to introduce non-linearity into deep neural networks; historically common choices include the sigmoid, the hyperbolic tangent (tanh) and the rectified linear unit (ReLU). In our work, we choose ReLU because of its simplicity and its non-saturating nonlinearity, which helps to avoid the vanishing gradient problem, and because it has been observed to accelerate the convergence of stochastic gradient descent (SGD).[31] For every possible window of words and entities in a sentence $\{x_{1:f}, x_{2:f+1}, \ldots, x_{n-f+1:n}\}$, we use this filter to produce a feature map:
$$c = [c_1, c_2, \ldots, c_{n-f+1}] \qquad (5.3)$$
A max-over-time pooling operation is then applied to this feature map to create a fixed-length vector. Max-pooling takes the maximum value, reducing the dimensionality while capturing the most important feature. Note that we use filters of different region sizes (heights) and multiple filters per region size to learn different features. Because each filter produces vectors of different shapes, we iterate through them, create a layer for each, apply the same pooling scheme, and then merge the results into one big feature vector.
Dropout Layer
This big feature vector is then passed to a dropout layer. Dropout[51] is perhaps the most popular regularization method for CNNs. The idea is quite simple: a dropout layer stochastically disables a fraction of its neurons during training by setting them to zero, which prevents neurons from co-adapting and forces them to learn individually useful features, thus significantly reducing overfitting. The output of the dropout layer is denoted as $h$.
Softmax Layer
The feature vector from the dropout layer is finally passed to a fully connected layer. We generate label predictions by a matrix multiplication and picking the class with the highest score. We also apply a softmax function to convert the raw scores into a probability distribution over labels.

$$p = \operatorname{softmax}(w \cdot h + b), \qquad \hat{p} = \arg\max(p) \qquad (5.4)$$
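A compact Keras sketch of this architecture is given below; it is an illustration under assumed sizes (sequence length, filter heights, number of feature maps and classes), not the exact TensorFlow implementation used in the experiments:

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical sizes: 2*l rows (words followed by entities), d-dimensional vectors,
# three filter heights with 100 feature maps each, and a 4-class output.
seq_len, d, num_classes = 2 * 23, 300, 4
filter_sizes, num_filters, dropout_rate = (3, 4, 5), 100, 0.5

inputs = layers.Input(shape=(seq_len, d))           # pre-built embedding matrix W

pooled = []
for f in filter_sizes:
    # Each Conv1D kernel spans f rows and the full vector dimensionality d.
    conv = layers.Conv1D(num_filters, kernel_size=f, activation="relu")(inputs)
    pooled.append(layers.GlobalMaxPooling1D()(conv))    # max-over-time pooling

features = layers.Concatenate()(pooled)             # one big feature vector
features = layers.Dropout(dropout_rate)(features)
outputs = layers.Dense(num_classes, activation="softmax")(features)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()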
5.3 Training
Loss Function
The goal of training is to minimize the loss function. The cross-entropy error is the standard loss function for categorization problems; compared with it, the mean squared error (MSE) and the classification error are not good loss functions for classification problems.
In binary classification, where the number of classes $m$ equals 2, the cross-entropy error can be calculated as:

$$L = -(y \log(p) + (1 - y) \log(1 - p)) \qquad (5.5)$$

where $p$ indicates the predicted probability of the positive class and $y$ is the label of the observed sample, 1 for positive and 0 for negative. If $m > 2$ (i.e. multi-class classification), we calculate a separate loss for each class per observation and sum the results:

$$L = -\frac{1}{m} \sum_{i=1}^{m} y_i \log(p_i) \qquad (5.6)$$
where $m$ is the number of classes, $y \in \mathbb{R}^m$ is the one-hot encoded true label, and $p \in \mathbb{R}^m$ is the predicted probability, after the softmax calculation, that the observation belongs to each class $i$.
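As a small numerical illustration of equation 5.6 (the label and the predicted probabilities are made up):

import numpy as np

# One observation in a 4-class problem: the true class is index 2.
y = np.array([0.0, 0.0, 1.0, 0.0])            # one-hot true label
p = np.array([0.1, 0.2, 0.6, 0.1])            # softmax output of the network

m = len(y)
loss = -np.sum(y * np.log(p)) / m             # equation 5.6
print(loss)                                   # -log(0.6)/4, roughly 0.1277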
Regularization
L2 regularization is perhaps the most common regularization method to prevent overfitting. It is implemented by penalizing the squared magnitude of all parameters directly in the objective function. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. The new loss function becomes:

$$L'(w) = L(w) + \frac{1}{2}\lambda w_1^2 + \frac{1}{2}\lambda w_2^2 + \cdots \qquad (5.7)$$
So we have the final loss function:

$$L'(w) = -\frac{1}{m} \sum_{i=1}^{m} y_i \log(p_i) + \lambda \lVert w \rVert_2^2 \qquad (5.8)$$
In [63], it is recommended to set the dropout rate to a small value (0–0.5) and to use a relatively large $\lambda$ value (a commonly used value is 3) for L2 regularization. In practice, we increase the number of feature maps to see whether it helps to improve the performance; if not, we need a larger dropout rate.
Cross Validation
To avoid overfitting, a common practice is cross-validation. We randomly split the training set into $n$ distinct subsets of approximately equal size, called folds. We then train and evaluate the model $n$ times, picking a different fold for evaluation each time and training on the other $n-1$ folds, which yields $n$ scores. The final result of cross-validation is the mean of the $n$ scores. It has been shown empirically that $n$-fold cross-validation with $n = 5$ or $n = 10$ gives a good balance of bias and variance.[24] Because most datasets in our experiments are small, we set $n$ to 10.
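A minimal scikit-learn sketch of 10-fold cross-validation is shown below; the random features, labels and the logistic regression classifier are placeholders standing in for our datasets and neural models:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Placeholder features and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = rng.integers(0, 2, size=100)

# 10-fold cross-validation: train 10 times, each time holding out a different fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())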
6 Recurrent Neural Network for Short Text Classification
We know that fully connected neural networks and CNNs can only process individual inputs: the previous input and the following input are completely unrelated. However, in some tasks we need to process sequences, where the previous and subsequent inputs are related. For example, to understand a sentence it is not enough to understand each word separately; we need to understand a group of words or the entire sequence. So we need another very important neural network in deep learning: the recurrent neural network (RNN).
6.1 Vanilla Recurrent Neural Network
Figure 6.1: Recurrent neural network and the unfolding in time of the computation involved in its forward computation
Figure 6.1 shows a vanilla RNN being unfolded into a full network. By unfolding we mean that we write out the network for the complete sequence. The parameters of the RNN are as follows: $g$ and $f$ are activation functions; $w_t$ is the input at time step $t$; $h_t$ is the hidden layer state at time step $t$, and we can think of the hidden state $h_t$ as the memory of the network; $p_t$ is the output prediction at time step $t$. $W_{hw}$ is the weight matrix from the input layer to the hidden layer: the first subscript $h$ indicates that we compute the hidden layer, and the second subscript indicates that the matrix $W_{hw}$ is multiplied by $w_t$. Accordingly, $W$ maps the hidden layer to the output layer, and $W_{hh}$ is the weight matrix between hidden layers. Note that an RNN shares the same set of parameters ($W_{hw}$, $W_{hh}$ and $W$) across all time steps, thus greatly reducing the total number of parameters we need to learn.[3] At time $t$:

$$h_t = g(W_{hh} h_{t-1} + W_{hw} w_t + c) \qquad (6.1)$$
$$p_t = f(W h_t + b) \qquad (6.2)$$
The above formulas define the forward propagation of the RNN. To simplify the notation, we define $(W_{hh} h_{t-1} + W_{hw} w_t)$ as $V[h_{t-1}, w_t]$: we take the matrices $W_{hh}$ and $W_{hw}$, put them side by side and stack them horizontally, $[W_{hh} \;\vdots\; W_{hw}] = V$, while $[h_{t-1}, w_t]$ indicates that we stack the two vectors vertically as $\begin{bmatrix} h_{t-1} \\ w_t \end{bmatrix}$. To make this clear, say $h$ is 100-dimensional and $w$ is 10,000-dimensional. Then $W_{hh}$ is a $(100, 100)$ matrix and $W_{hw}$ is a $(100, 10000)$ matrix; stacking them together, $V$ is a $(100, 10100)$ matrix. Correspondingly, $\begin{bmatrix} h_{t-1} \\ w_t \end{bmatrix}$ is a 10,100-dimensional vector, so multiplying $[W_{hh} \;\vdots\; W_{hw}]$ by $\begin{bmatrix} h_{t-1} \\ w_t \end{bmatrix}$ gives $(W_{hh} h_{t-1} + W_{hw} w_t)$. We rewrite formulas 6.1 and 6.2 as follows for simplicity:

$$h_t = g(V[h_{t-1}, w_t] + c) \qquad (6.3)$$

$$p_t = f(W h_t + b) \qquad (6.4)$$
Exploding and vanishing gradient problems. We are interested in how the gradient propagates back in time in an RNN:

$$\frac{\partial \text{cost}_N}{\partial h_1} = \frac{\partial \text{cost}_N}{\partial p_N} \frac{\partial p_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \frac{\partial h_n}{\partial h_{n-1}} \right) = \frac{\partial \text{cost}_N}{\partial p_N} \frac{\partial p_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \frac{\partial h_n}{\partial z_n} \frac{\partial z_n}{\partial h_{n-1}} \right) \qquad (6.5)$$
Figure 6.2: Problem of exploding and vanishing gradients in RNN (adopted from [2])
We define $(W_{hh} h_{t-1} + W_{hw} w_t + c)$ in formula 6.1 as $z_n$. Then we get the following derivatives:

$$\frac{\partial h_n}{\partial z_n} = \operatorname{diag}(g'(z_n)) \qquad (6.6)$$

$$\frac{\partial z_n}{\partial h_{n-1}} = W_{hh} \qquad (6.7)$$

We rewrite the partial derivatives in formula 6.5:

$$\frac{\partial h_n}{\partial h_{n-1}} = \frac{\partial h_n}{\partial z_n} \frac{\partial z_n}{\partial h_{n-1}} = \operatorname{diag}(g'(z_n))\, W_{hh} \qquad (6.8)$$

$$\frac{\partial \text{cost}_N}{\partial h_1} = \frac{\partial \text{cost}_N}{\partial p_N} \frac{\partial p_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \frac{\partial h_n}{\partial h_{n-1}} \right) = \frac{\partial \text{cost}_N}{\partial p_N} \frac{\partial p_N}{\partial h_N} \left( \prod_{n \in \{N, \ldots, 2\}} \operatorname{diag}(g'(z_n))\, W_{hh} \right) \qquad (6.9)$$
The core of the recurrent product is the repeated multiplication of $W_{hh}$. If the largest eigenvalue of $W_{hh}$ equals 1, the gradient propagates stably; if it is greater than 1, the product grows exponentially as we multiply more and more times and the gradients explode; if it is less than 1, the product shrinks exponentially and the gradients vanish. Both cases are problems in RNNs: if the gradients explode we cannot train the model because they become infinite; if they vanish we cannot learn long-range dependencies because the gradients become zero after a few steps back in time.
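The following NumPy sketch illustrates this effect by repeatedly multiplying a random matrix rescaled to a chosen spectral radius (the $\operatorname{diag}(g'(z_n))$ factors are ignored; the sizes and the number of steps are arbitrary):

import numpy as np

rng = np.random.default_rng(0)

def repeated_product_norm(radius, steps=50, dim=10):
    """Norm of a product of `steps` copies of W_hh whose spectral radius is ~radius."""
    W = rng.normal(size=(dim, dim))
    W *= radius / np.max(np.abs(np.linalg.eigvals(W)))   # rescale the largest eigenvalue
    prod = np.eye(dim)
    for _ in range(steps):
        prod = prod @ W
    return np.linalg.norm(prod)

for radius in (0.9, 1.0, 1.1):
    print(radius, repeated_product_norm(radius))
# With radius < 1 the product shrinks towards 0 (vanishing gradients),
# with radius > 1 it blows up (exploding gradients).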
Solutions to the exploding and vanishing gradient problems. In general, gradient explosion is easier to handle: the program produces NaN errors when gradients explode. We can also set a threshold and clip gradient values that exceed the preferred range. The vanishing gradient problem is more difficult to handle. We can use other types of RNNs, such as LSTMs and gated recurrent units (GRU), which is the most popular approach and will be discussed in the following section. We can also use the rectified linear unit (ReLU) activation function instead of sigmoid and tanh, see Table 3.
Sigmoid: $g(z) = \frac{1}{1 + e^{-z}}$        Tanh: $g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$        ReLU: $g(z) = \max(0, z)$

Table 3: Commonly used activation functions
6.2 Long Short-Term Memory
When the time interval is large, it’s very difficult for simple RNNs to model long-distance
dependencies, known as long-term dependencies problem.[22][10] Long short-term memo-
ry (LSTM)[23] which was proposed by Hochreiter and Schmidhuber in 1997 has cleverly
solved this problem. LSTM networks usually just called LSTMs, are a special kind of
RNN, capable of learning long-term dependencies.[4]
In the following paragraph we’ll step by step build a basic LSTMs model. Here we have
two time steps and we’re interested in how the hidden layer in the previous time step gets
updated into the next time step. The update equation is as follows:
$$h_n = g(V[w_{n-1}; h_{n-1}] + c) \qquad (6.10)$$
Here $h$ denotes hidden layers, $w$ stands for input words and $p$ is the output prediction. As Figure 6.3 shows, when we go from $h_{n-1}$ to $h_n$ we pass the hidden layer $h_{n-1}$ and the input $w_{n-1}$ together through a linear transformation and a tanh nonlinearity. This is a simple RNN architecture.

$$h_n = \tanh(V[w_{n-1}; h_{n-1}] + c) \qquad (6.11)$$

In the hidden layer of the original RNN there is only one hidden state, which is very sensitive to short-term inputs. The next step is to introduce an extra layer called the cell state $c$; we can think of it as our memory and use it to store long-term state.
Figure 6.3: Basic RNN architecture (adopted from [2])
We then promote the hidden layer $h$ to the upper layer as Figure 6.4 shows. We update the cell state $c_n$ by taking the previous cell state $c_{n-1}$ and adding a contribution from our hidden layer $h_{n-1}$ and our input $w_{n-1}$, which looks just like the previous simple RNN. The difference is that in this network we essentially update cell states, and what really matters is that these updates are additive. In the simple recurrent network, as we go from one hidden layer to the next we are actually multiplying, whereas in this network we do lots of sums. When we differentiate a sum rather than a product, the gradient can flow nicely back through the sum. The key point of LSTMs is that we replace multiplication with summation.

$$c_n = c_{n-1} + \tanh(V[w_{n-1}; h_{n-1}] + b_c) \qquad (6.12)$$

$$h_n = \tanh(V[w_{n-1}; h_{n-1}] + c) \qquad (6.13)$$
Figure 6.4: The cell state of LSTM (adopted from [2])
The next thing we need is called gating. The idea of gating in neural networks is to compute a nonlinearity which is bounded between 0 and 1, and we can think of it like a switch: if it is 1 we turn the connection on, if it is 0 we turn it off. The key point is that it is a continuous function, so we can differentiate it and do backpropagation. The gate is actually a fully connected layer whose input is a vector and whose output is a vector of values between 0 and 1. Suppose $W$ is the weight matrix of the gate and $b$ is the bias term. The gate can be expressed as follows:
$$g(x) = \sigma(W x + b) \qquad (6.14)$$
As Figure 6.5 shows, we introduce the input gate $i$ to gate the contribution from the input and the previous hidden layer. We take a new linear transformation of the input $w_{n-1}$ and the previous hidden layer $h_{n-1}$, apply a sigmoid function to it and obtain a vector of gates. Looking at formula 6.15, we can think of it this way: if the gate is 1 we get a full contribution from the current input; if the gate is 0 we get no contribution from the current input, the cell state $c_n$ is equal to the cell state $c_{n-1}$, so there is no update and the model forgets nothing. If a gate stays closed for many time steps, we obtain long time dependencies. So the input gate gives us the ability to ignore inputs.

$$c_n = c_{n-1} + i_n \odot \tanh(V[w_{n-1}; h_{n-1}] + b_c) \qquad (6.15)$$

$$i_n = \sigma(W_i [w_{n-1}; h_{n-1}] + b_i) \qquad (6.16)$$
Figure 6.5: The input gate of LSTM (adopted from [2])
We now have the ability either to include or to ignore the input, but we do not yet have the ability to forget what is in our cell state. So we add a forget gate $f$ as in Figure 6.6. As formula 6.19 shows, it is a sigmoid function of a new linear transformation of the input $w_{n-1}$ and the hidden layer $h_{n-1}$. The forget gate is then multiplied by the previous cell state $c_{n-1}$ to obtain the new cell state (6.17). Our new memory, or new cell state, is a mixture of the previous cell state $c_{n-1}$ weighted by the forget gate $f_n$ and our new update from the input and the previous hidden layer weighted by the input gate $i_n$. If all the forget gates are set to 0, we forget all the memories; if all the input gates are set to 0, we ignore all the inputs. So the model can choose between propagating information forward perfectly while ignoring the inputs, or forgetting what it has seen and introducing something new.

$$c_n = f_n \odot c_{n-1} + i_n \odot \tanh(V[w_{n-1}; h_{n-1}] + b_c) \qquad (6.17)$$

$$i_n = \sigma(W_i [w_{n-1}; h_{n-1}] + b_i) \qquad (6.18)$$

$$f_n = \sigma(W_f [w_{n-1}; h_{n-1}] + b_f) \qquad (6.19)$$
Figure 6.6: The forget gate of LSTM (adopted from [2])
Finally, in LSTMs we add an output gate $o$ on the output from the cell state $c_n$ to the hidden layer $h_n$, as in Figure 6.7.

$$h_n = o_n \odot \tanh(W_h c_n + b_h) \qquad (6.20)$$

$$c_n = f_n \odot c_{n-1} + i_n \odot \tanh(V[w_{n-1}; h_{n-1}] + b_c) \qquad (6.21)$$

$$i_n = \sigma(W_i [w_{n-1}; h_{n-1}] + b_i) \qquad (6.22)$$

$$f_n = \sigma(W_f [w_{n-1}; h_{n-1}] + b_f) \qquad (6.23)$$

$$o_n = \sigma(W_o [w_{n-1}; h_{n-1}] + b_o) \qquad (6.24)$$
Figure 6.7: The output gate of LSTM (adopted from [2])

In summary, the hidden state $h$ of an RNN stores some historical information, which can be regarded as a kind of memory. In a simple RNN, the hidden state is rewritten at each time
step, so it can be seen as a short-term memory. In contrast, long-term memory in neural
networks can be seen as a network parameter, which is stored in the weight matrices. It
contains the information learned from training data, and its update cycle is much longer
than short-term memory.
In LSTMs, cell state c can capture some critical information and has the ability to keep
it for a certain time. The life cycle of the stored information in cell state c is longer
than short-term memory, but much shorter than long-term memory, so it is called long
short-term memory (LSTM).
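To make the gate equations 6.20-6.24 concrete, here is a minimal NumPy sketch of a single LSTM step (toy dimensions and random parameters; the element-wise products correspond to the gating):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_prev, h_prev, c_prev, params):
    """One LSTM update following equations 6.20-6.24."""
    x = np.concatenate([w_prev, h_prev])                        # [w_{n-1}; h_{n-1}]
    i = sigmoid(params["W_i"] @ x + params["b_i"])              # input gate
    f = sigmoid(params["W_f"] @ x + params["b_f"])              # forget gate
    o = sigmoid(params["W_o"] @ x + params["b_o"])              # output gate
    c = f * c_prev + i * np.tanh(params["V"] @ x + params["b_c"])   # cell state
    h = o * np.tanh(params["W_h"] @ c + params["b_h"])          # hidden state
    return h, c

# Toy dimensions: 4-dimensional input, 3-dimensional hidden/cell state.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
params = {k: rng.normal(size=(d_h, d_in + d_h)) for k in ("W_i", "W_f", "W_o", "V")}
params["W_h"] = rng.normal(size=(d_h, d_h))
params.update({k: np.zeros(d_h) for k in ("b_i", "b_f", "b_o", "b_c", "b_h")})

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)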
6.3 Recurrent Convolutional Neural Network
In our work, we also experiment with the recurrent convolutional neural network for text classification (RCNN) proposed by Lai et al. (2015)[33]. To represent a word's meaning better and more precisely, we combine a word with its context. As shown in Figure 6.8, we use a bidirectional recurrent neural network to capture contexts. As shown in equations 6.25 and 6.26, $c_l(w_i)$ denotes the left context of word $w_i$ and $c_r(w_i)$ denotes the right context of word $w_i$; $e$ is a word embedding. $W^{(l)}$ and $W^{(r)}$ are matrices used to transfer contexts into the next layer, while $W^{(sl)}$ and $W^{(sr)}$ are matrices used to combine the context of the current word with its left/right contexts.
$$c_l(w_i) = f(W^{(l)} c_l(w_{i-1}) + W^{(sl)} e(w_{i-1})) \qquad (6.25)$$

$$c_r(w_i) = f(W^{(r)} c_r(w_{i+1}) + W^{(sr)} e(w_{i+1})) \qquad (6.26)$$
Thus, the representation of word $w_i$ can be defined as the concatenation of the word embedding $e(w_i)$, its left context vector $c_l(w_i)$ and its right context vector $c_r(w_i)$:

$$x_i = [c_l(w_i); e(w_i); c_r(w_i)] \qquad (6.27)$$
Figure 6.8: Recurrent convolutional neural network for text classification (adopted from
[33])
After we calculate the representation of each word $w_i$, we can obtain the whole text representation using equation 6.28:

$$y_i^{(2)} = \tanh(W^{(2)} x_i + b^{(2)}) \qquad (6.28)$$
How do we extract features from texts? We simply apply a max-pooling layer after all the word representations are calculated. Note that the max function is element-wise: the $k$-th element of $y^{(3)}$ is the maximum of the $k$-th elements of all $y_i^{(2)}$. Therefore, we can convert sentences of different lengths into a fixed-length vector while capturing the most important textual information of the document.

$$y^{(3)} = \max_{i=1}^{n} y_i^{(2)} \qquad (6.29)$$
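The following NumPy sketch walks through equations 6.25-6.29 for a single toy sentence (all dimensions and parameters are random placeholders; the real model learns them jointly):

import numpy as np

rng = np.random.default_rng(0)
n, d_e, d_c, d_h = 6, 50, 30, 40      # toy sizes: words, embedding, context, hidden dims

E = rng.normal(size=(n, d_e))         # word embeddings e(w_1), ..., e(w_n)
W_l, W_r = rng.normal(size=(d_c, d_c)), rng.normal(size=(d_c, d_c))
W_sl, W_sr = rng.normal(size=(d_c, d_e)), rng.normal(size=(d_c, d_e))
W2, b2 = rng.normal(size=(d_h, d_c + d_e + d_c)), np.zeros(d_h)
f = np.tanh

# Left and right contexts, equations 6.25 and 6.26.
c_l = np.zeros((n, d_c))
for i in range(1, n):
    c_l[i] = f(W_l @ c_l[i - 1] + W_sl @ E[i - 1])
c_r = np.zeros((n, d_c))
for i in range(n - 2, -1, -1):
    c_r[i] = f(W_r @ c_r[i + 1] + W_sr @ E[i + 1])

# Word representation (6.27), latent semantic vector (6.28), max-pooling (6.29).
X = np.concatenate([c_l, E, c_r], axis=1)        # shape (n, d_c + d_e + d_c)
Y2 = np.tanh(X @ W2.T + b2)                      # shape (n, d_h)
y3 = Y2.max(axis=0)                              # element-wise max over all words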
7 Experiments
7.1 Data
We conduct our experiments on four widely used datasets: TREC, Twitter, AG News and Movie Review. A summary of the datasets is given in Table 4.
Datasets        c    l    m    train/test      |V|      |V_pre|
TREC            6    10   33   5452/500        8602     7223
Twitter         3    19   35   9684/3547       20755    12140
AG News         4    7    23   120000/7600     38916    25310
Movie Review    2    21   59   10662/CV        19897    16394

Table 4: Summary statistics of the datasets. c: number of target classes, l: average sentence length, m: maximum sentence length, train/test: training/test set size, |V|: vocabulary size, |V_pre|: number of words present in the set of pre-trained word embeddings, CV: there is no standard train/test split, thus 10-fold cross-validation is used.
TREC. TREC¹ is a question answering dataset, adopted from [Li et al. 2002][36]. It includes six different kinds of questions. The TREC class descriptions are shown in Table 5 and example sentences are listed in Table 6.
Abbreviation abbreviation or expression abbreviated
Entity food, animal, technique, language, plant, word, sport, etc.
Description definition or description of something, etc.
Human title or description of a person, etc.
Location city, state, country, etc.
Numeric count, date, size, weight, speed, temperature, distance, etc.
Table 5: Class descriptions of TREC dataset
Twitter. Twitter² is a dataset from a sentiment analysis competition. We need to predict three labels: positive, negative and neutral. Because tweets are noisy, they are specially preprocessed in our experiments; details can be found in the preprocessing part of this chapter. Example sentences are listed in Table 7.

¹ http://cogcomp.cs.illinois.edu/Data/QA/QC/
² https://www.cs.york.ac.uk/semeval-2013/task2/
Abbreviation What is the full form of .com?
Entity What’s the second most used vowel in English?
Description What is an annotated bibliography?
Human Who is the founder of Scientology?
Location What is the highest waterfall in the United States?
Numeric How many people in the world speak French?
Table 6: Example sentences of TREC dataset
Positive
Good morning Becky! Thursday is going to be Fantastic!
Taylor Swift is coming with Ed Sheeran June 29th, most perfect news I've heard all night.
Neutral
The heat game may cost a lot, plus I would rather see Austin Rivers play.
November is an odd release date if true, but if it becomes big enough maybe she could sing it at Grammys.
Negative
Why is it so hard to find the @TVGuideMagazine these days? Went to 3 stores for the Castle cover issue. NONE. Will search again tomorrow...
Just found a piece of candy that may have been injected with something.

Table 7: Example sentences of the Twitter dataset
AG News. AG News is a collection of news articles, adopted from [Zhang et al., 2015][62]. The original data consists of a class index (1 to 4), a news title and a news description. We only use the news titles in our short text classification experiments. Example sentences can be seen in Table 8.
Sci/Tech IBM to hire even more new workers.
Sports Capacity crowds at beach volleyball rock the joint.
World US fighter squadron to be deployed in South Korea next month.
Business Staples profit up, to enter china market.
Table 8: Example sentences of AG News dataset
Movie Review. Movie Review³ is a dataset used in sentiment analysis experiments, adopted from [Pang and Lee, 2005][42]. It contains sentences labeled with respect to their overall sentiment polarity (positive or negative). Movie Review does not have a standard test set, so for this dataset we randomly select 10% of the training data as the dev set. Example sentences can be seen in Table 9.

³ http://www.cs.cornell.edu/people/pabo/movie-review-data/
Positive
If this movie were a book, it would be a page-turner, you can't wait to see what happens next.
The film is faithful to what one presumes are the book's twin premises - - that we become who we are on the backs of our parents, but we have no idea who they were at our age; and that time is a fleeting and precious commodity no matter how old you are.
Negative
Too slow for a younger crowd, too shallow for an older one.
The problem is not that it's all derivative, because plenty of funny movies recycle old tropes. The problem is that van wilder does little that is actually funny with the material.

Table 9: Example sentences of the Movie Review dataset
7.2 Preprocessing
Text preprocessing is always the first step of a good NLP system; it provides the system with clean, easy to handle input. The impact of commonly used text preprocessing methods on the performance of traditional machine learning classifiers has been investigated in previous work[53], which shows that appropriate combinations of preprocessing methods, depending on the domain and language, may significantly improve accuracy, whereas inappropriate combinations may decrease it. A similar study with deep learning classifiers has also been conducted recently[11]. Therefore text preprocessing methods should be used and carefully selected.
To understand why text preprocessing really matters, consider the two sentences used earlier:
(1) I do not like this type of movie, but I like this one.
(2) I like this kind of movie, but I do not like this one.
These two snippets clearly have completely opposite meanings. The question is in what ways text preprocessing might affect our sense of how different they are. To begin with, we calculate the cosine similarity of the two sentences, which is 0.90. The corresponding document term matrix, considering only lowercased words, is shown in Table 10.
i do not like this type of movie but one kind
(1) 2 1 1 2 2 1 1 1 1 1 0
(2) 2 1 1 2 2 0 1 1 1 1 1
Table 10: Document term matrix with stop words retained
We then remove stop-words according to the Natural Language Toolkit's (NLTK) list of English stop-words. With those removed, as shown in Table 11, the two snippets look less similar than before; in fact, the cosine similarity between them decreases from 0.90 to 0.64. Thus, when asked how similar the two snippets are, our conclusion differs. This matters because text similarity calculation is the core and foundation of unsupervised learning techniques such as clustering. The above experiment makes it clear that different text preprocessing methods lead to different results (a small sketch reproducing this comparison follows Table 11).
like type movie one kind
(1) 2 1 1 1 0
(2) 2 0 1 1 1
Table 11: Document term matrix with stop words removed
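The comparison can be reproduced roughly with scikit-learn as sketched below; note that scikit-learn's tokenizer and built-in English stop-word list differ from NLTK's, so the numbers need not match 0.90 and 0.64 exactly:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

s1 = "I do not like this type of movie, but I like this one."
s2 = "I like this kind of movie, but I do not like this one."

# With stop-words retained (roughly Table 10).
X = CountVectorizer().fit_transform([s1, s2])
print(cosine_similarity(X)[0, 1])

# With English stop-words removed (roughly Table 11).
X = CountVectorizer(stop_words="english").fit_transform([s1, s2])
print(cosine_similarity(X)[0, 1])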
Next, the widely used text preprocessing techniques including lowercasing, tokenization,
stop-words removal, stemming and lemmatizing are introduced.
Lowercasing
There are two case conversion approaches, lowercasing and uppercasing. Here we only focus on lowercasing, where all letters in all words of the document are converted to lowercase. The reason for doing this is that whether or not the first letter of a word is uppercase typically does not affect its meaning.[17] For example, “Computer” and “computer” both refer to the same type of machine that can be instructed to carry out sequences of arithmetic or logical operations automatically, so it is unnecessary to treat them as two different words in text analysis; not lowercasing would in such cases lead to a decline in accuracy. However, there is no general rule stating that lowercasing always improves performance. Especially when the corpus is in German, it is more appropriate to retain the capitalization of words, since it is an important characteristic of the German language. In English only proper nouns such as the names of people, cities, countries, nationalities and languages are capitalized. From the following example we can see that in German all nouns are capitalized, wherever they appear in a sentence; this is a nearly unique feature among contemporary languages.
EN: A computer is a machine that can be instructed to carry out sequences
of arithmetic or logical operations automatically.
DE: Ein Computer ist eine Maschine, die angewiesen werden kann, Sequenzen
von arithmetischen oder logischen Operationen automatisch auszuführen.
Tokenization
Tokenization comprises sentence tokenization and word tokenization. Sentence tokenization is the process of splitting or segmenting text into individual sentences; it is normally followed by word tokenization, the process of splitting or segmenting a sentence into a list of words. The resulting word tokens can then be used in further processing. Tokenization is important in NLP, especially for operations where each word needs to be dealt with individually, such as stemming and lemmatizing. In our work, we use the NLTK library to tokenize words. Here is a word tokenization example.
Original text:
He jumped in and swam to the opposite bank.
Tokenized text:
[‘He’, ‘jumped’, ‘in’, ‘and’, ‘swam’, ‘to’, ‘the’, ‘opposite’, ‘bank’]
Removing stop-words
Some extremely common words are unlikely to convey much useful information, such as “the”, “and”, etc. These words are called stop-words. They can be removed from texts and excluded from the vocabulary we build. Removing stop-words has its benefit: we do not want stop-words to take up too much space in our database or memory when processing a large-scale corpus. However, it also has downsides, as it might affect accuracy. There is no standard stop-words list; NLTK provides stop-word lists for 16 different languages, and we can also define our own list according to the needs of different domains and languages. In our experiments, stop-words are not removed.
Stemming and lemmatizing
Stemming and lemmatizing aim to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form.[38] Stemming refers to the process of reducing a word to its most basic form. For instance, by removing the affixes of the forms “jumped”, “jumping” and “jumps” of the verb “to jump”, we obtain their common base form: jump.

Lemmatizing, or lemmatization, refers to the process of performing a morphological analysis to correctly return the base form of each word, which we call the lemma. For instance, the verb “to swim” may appear as “swam”, “swum”, “swimming” and “swims”; with lemmatizing we obtain “swim” or “swimming” depending on the context, since “swimming” can be either the base form of a noun (as in “swimming-pool”) or of a verb (as in “I swam in the river”), whereas with stemming we might obtain “swim”. In fact, a stemmer is typically easier to implement and runs faster than a lemmatizer, because a stemmer simply operates on a single word without any knowledge of the context, while a lemmatizer requires a dictionary look-up.
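A small NLTK sketch of tokenization, stemming and lemmatizing is shown below (the nltk.download calls fetch the standard resources; depending on the NLTK version additional resources such as omw-1.4 may be required):

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads.
nltk.download("punkt")
nltk.download("wordnet")

tokens = word_tokenize("He jumped in and swam to the opposite bank.")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])
# Lemmatizing with the verb part of speech: "swam" -> "swim", "jumped" -> "jump".
print([lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens])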
By observing the raw tweets we find that Twitter is a quite noisy corpus. The language used in tweets is very informal, with creative punctuation and spelling, misspellings, slang, new words, genre-specific terminology and abbreviations. Tweets also contain many special characters such as emoticons, URLs, user mentions, etc., as shown below. So we need some additional special preprocessing methods for tweets. In our work, we adopt the methods from a Kaggle twitter sentiment analysis competition⁴.
#NowPlaying Michael Jackson - She’s Out of My Life (Live At Wembley July
16. http://t.co/vxRYuADn L.O.V.E it! #MjTunes
URL
A uniform resource locator (URL), or web address in spoken English, is a reference to a web resource which specifies its location. In our daily lives we love to share content such as URLs with friends or the public via Twitter. However, URLs in tweets do not provide any valuable information for text classification tasks. One feasible approach is to remove all URLs; here we replace every URL in a tweet with the word URL. Note that all links (URLs) posted in tweets are automatically processed and shortened. The regular expression used to match URLs is ((www\.[\S]+)|(https?://[\S]+)).
User mention
Lots of tweets contain the symbol @. For example, we mention our friends with @username to share funny posts or interesting ideas with them, or just to let someone know that we are talking about them. User mentions do not provide any useful information either. We replace all user mentions in tweets with the word USER_MENTION. The regular expression used to match user mentions is @[\S]+.

⁴ https://github.com/abdulfatir/twitter-sentiment-analysis
Hashtag
A hashtag, written with a # symbol, is used to index keywords or topics in tweets, allowing us to easily follow or mention topics we are interested in. We simply replace each hashtag with its word, dropping the # symbol; for instance, #KIT is replaced with KIT. The regular expression used to match hashtags is #(\S+).
Retweet
A retweet is a re-posting of a tweet. Sometimes we type RT at the beginning of a tweet to
indicate that we are re-posting someone else’s tweet. We remove RT from the tweets. The
regular expression used to match retweets is \brt\b.
Emoticon
An emoticon is a sequence of characters used in social media to express emotion or to liven up the atmosphere; emoticons are textual portrayals of our moods or facial expressions. There is a list of commonly used emoticons on Wikipedia⁵. Exhaustively listing and matching all the different emoticons that could possibly appear in tweets is obviously infeasible, so in our work we only match some widely used emoticons. We replace each matched emoticon with EMO_POS, EMO_NEG or EMO_NEU depending on whether it is positive, negative or neutral. The regular expressions used to match emoticons are listed in Table 12 and Table 13.
Type Emoticons
(1) Smiley :), : ), :-), (:, ( :, (-:
(2) Laughing :D, : D, :-D, xD, x-D, XD, X-D, 8D, 8-D, =D, =3
(3) Sad :-(, : (, :(, ):, ) :, )-:
(4) Surprise :-O, :O, :-o, :o
(5) Kiss <3, :*
(6) Wink ;-), ;), ;-D, ;D, (;, (-;
(7) Crying :,(, :’(, :"(
Table 12: List of some widely used emoticons
Besides, we convert repetitions of two or more letters to two letters. For instance, people write tweets like “I loooooove it soooo much!” to emphasize the most significant words and express strong feelings; here we convert this sentence to “I love it so much!”.

⁵ https://en.wikipedia.org/wiki/List_of_emoticons
Type Regex Replacement
(1) Smiley (:\s?\)|:-\)|\(\s?:|\(-:) EMO_POS
(2) Laughing (:\s?D|:-D|x-?D|X-?D|8-?D|=D|=3) EMO_POS
(3) Sad (:\s?\(|:-\(|\)\s?:|\)-:) EMO_NEG
(4) Surprise (:-?O|:-?o) EMO_NEU
(5) Kiss (<3|:\*) EMO_POS
(6) Wink (;-?\)|;-?D|\(-?;) EMO_POS
(7) Crying (:,\(|:\\(|:"\() EMO_NEG
Table 13: List of regular expressions used to match emoticons
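A minimal sketch of these replacements in Python is shown below; it applies the regular expressions quoted above, leaves out the emoticon rules of Table 13 for brevity, and uses a common variant of the letter-repetition rule:

import re

def preprocess_tweet(tweet):
    """Apply the replacements described above (regexes follow the ones given in the text)."""
    t = tweet.lower()
    t = re.sub(r"((www\.[\S]+)|(https?://[\S]+))", "URL", t)
    t = re.sub(r"@[\S]+", "USER_MENTION", t)
    t = re.sub(r"#(\S+)", r"\1", t)                  # keep the hashtag word, drop '#'
    t = re.sub(r"\brt\b", "", t)                     # remove the retweet marker
    t = re.sub(r"(.)\1{2,}", r"\1\1", t)             # 3+ repeated letters -> two letters
    return t

print(preprocess_tweet("@jackseymour11 I may have an Android phone by the time I get back to school! :)"))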
To have a better understanding of the preprocessing part, we select two original and pre-
processed sampled sentences listed below:
Before: @jackseymour11 I may have an Android phone by the time I get back
to school! :)
After: USER_MENTION i may have an android phone by the time i get back
to school EMO_POS
Before: #NowPlaying Michael Jackson - She’s Out of My Life (Live At Wembley
July 16. http://t.co/vxRYuADn L.O.V.E it! #MjTunes
After: nowplaying michael jackson she is out of my life live at wembley
july 16 URL love it mjtunes
7.3 Evaluation Metrics
Binary classification
For a specific task, suppose we have a variety of machine learning or deep learning models at hand. Which model should we choose from all the available models and apply to the task? All machine learning algorithms consist of combinations of just three components: representation, evaluation and optimization.[19] It is therefore important to choose evaluation metrics that match our task. We compute the selected evaluation metric for multiple models and then select the model with the best value.

An evaluation function (also called objective function or scoring function) is used to distinguish good classifiers from bad ones. Error rate and accuracy are the two most commonly used evaluation metrics in binary and multiclass classification tasks. The error rate is the ratio of the number of samples with wrong predictions to the total number of samples; accuracy is the ratio of the number of samples with correct predictions to the total number of samples.
$$\text{Accuracy} = \frac{\#\text{correct predictions}}{\#\text{total instances}} \qquad (7.1)$$

$$\text{Error rate} = \frac{\#\text{wrong predictions}}{\#\text{total instances}} \qquad (7.2)$$
However, these two metrics do not provide accurate information when the dataset has imbalanced classes, especially when it is greatly unbalanced. For instance, suppose a dataset has 1000 randomly selected items in total, of which 990 are of class A and only 10 of class B. We build a classifier to predict the relevant items and see that its accuracy on a test set is 99.0%, which looks amazingly good. For comparison, however, consider a dummy classifier that does not look at the features at all and always blindly predicts the most frequent class (here class A). This dummy classifier's accuracy would also be 99.0%, yet it completely ignores the input data and cannot be used in practice.
For binary classification tasks, according to the combination of a sample's real category and our predicted category, samples can be grouped into four classes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The confusion matrix is shown in Table 14, where each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

                   Predicted Positive    Predicted Negative
Actual Positive            TP                    FN
Actual Negative            FP                    TN

Table 14: Confusion matrix of binary classification
Precision and recall are then defined as:

$$\text{precision} = \frac{TP}{TP + FP} \qquad (7.3)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (7.4)$$
Generally speaking, when precision is higher, recall is often lower, and when precision is lower, recall is often higher. For example, suppose we are in a supermarket buying strawberries. We can get as many good strawberries as possible by increasing the number of picked strawberries; if all strawberries are selected, then all good strawberries are certainly selected, but precision will be low. If we want the proportion of good ones among the selected strawberries to be as high as possible, we can only pick the strawberries we are most confident about, but this inevitably leads to missing many good strawberries, which makes recall lower. So there is always a trade-off between precision and recall. In a recommendation system, in order to disturb users as little as possible, it is more desirable that the recommended content is of interest to the users, so precision is more important. In a system for retrieving information about fugitives, it is more desirable to miss as few fugitives as possible, so recall is crucial. Text classification is a precision-oriented machine learning task, where we care about how many positively predicted documents are really positive.
A measure which combines precision with recall is the traditional F-measure or balanced F-score. It is the harmonic mean of precision and recall:

$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (7.5)$$
It is a special case of the general $F(\beta)$ measure, the weighted harmonic mean of precision and recall:

$$F(\beta) = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}} \qquad (7.6)$$

where $\beta > 0$ measures the relative importance of recall to precision. When $\beta = 1$ it degenerates into the standard $F1$ score; when $\beta > 1$ recall has a greater impact, and when $\beta < 1$ precision has a greater influence.
Multiclass classification
The situation is different in multiclass classification tasks. Each combination of two categories has a corresponding confusion matrix, so the evaluation metric of multiclass classification over $N$ categories can be decomposed into $N$ binary classification metrics.

The final evaluation metrics are averages across classes, and there are two possible ways to average. A straightforward approach is to first calculate the precision and recall scores on each confusion matrix, recorded as $(P_1, R_1), (P_2, R_2), \ldots, (P_n, R_n)$, and then obtain the $F1_{macro}$ score by taking the average:
$$P_{macro} = \frac{1}{n} \sum_{i=1}^{n} P_i \qquad (7.7)$$

$$R_{macro} = \frac{1}{n} \sum_{i=1}^{n} R_i \qquad (7.8)$$

$$F1_{macro} = \frac{2 \times P_{macro} \times R_{macro}}{P_{macro} + R_{macro}} \qquad (7.9)$$
We can also work out the averages of $TP$, $FP$, $TN$, $FN$ over all confusion matrices, recorded as $\overline{TP}$, $\overline{FP}$, $\overline{TN}$, $\overline{FN}$. Based on these averages, we can then calculate $P_{micro}$, $R_{micro}$ and $F1_{micro}$:

$$P_{micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FP}} \qquad (7.10)$$

$$R_{micro} = \frac{\overline{TP}}{\overline{TP} + \overline{FN}} \qquad (7.11)$$

$$F1_{micro} = \frac{2 \times P_{micro} \times R_{micro}}{P_{micro} + R_{micro}} \qquad (7.12)$$
In comparison, each class has equal weight in macro scores, whereas in micro scores each instance has equal weight, so the largest classes have the most influence. For macro scores we compute the metrics within each class and average the resulting metrics across classes; for micro scores we aggregate the outcomes across all classes and compute the metrics on the aggregated outcomes. So how should we choose between macro and micro scores? If the classes have about the same number of instances, macro and micro scores will be about the same. If some classes are much larger than others and we want to weight the metric towards the largest ones, we use micro scores; if towards the smallest ones, macro scores. In our work, we compute confusion matrices and obtain the above evaluation values with the Python machine learning library scikit-learn.
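A minimal scikit-learn sketch of these averaging modes is shown below (the labels and predictions are made up):

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Toy multiclass labels and predictions; in our experiments these come from the classifiers.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 1, 1, 1, 2, 0, 2, 1]

print(confusion_matrix(y_true, y_pred))
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
print(precision_recall_fscore_support(y_true, y_pred, average="micro"))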
7.4 Computational Complexity
Here we consider the computational complexity of the convolutional neural network for text classification.

Time complexity
The time complexity, also called computational complexity, describes the amount of time it takes to run an algorithm. It is normally calculated by counting the number of elementary operations performed by the algorithm. In scientific computing, the computational complexity of deep learning models in frameworks such as TensorFlow or PyTorch is estimated by counting the number of floating point operations (FLOPs) in the forward pass. Here an elementary operation refers to a FLOP, which includes multiplication and accumulation; we ignore negligible costs such as non-linearities, dropout and normalization layers. How do we estimate FLOPs? The dot product of vectors and matrices (matrix multiplication) is one of the most important operations in deep learning, so matrix multiplication and matrix transposition are the basic operations. Matrix multiplication involves the accumulation of products or, put another way, multiplier-accumulator (MAC) operations. The number of OPs/s (operations per second) can be estimated by counting how many MACs can be completed per second. Each multiplication and each accumulation is considered to be 1 OP, so 1 MAC is actually 2 OPs.
To compute the number of FLOPs, we assume the convolution is implemented as a sliding window. For a standard convolutional kernel in a single layer we have:

$$\text{Time} \sim O\!\left( 2 \cdot K^2 \cdot M^2 \cdot C_{in} \cdot C_{out} \right) \qquad (7.13)$$
where $M$ is the height and width of the output feature map, $K$ is the kernel width (assumed to be symmetric), $C_{in}$ is the number of channels of the input feature maps, i.e. the number of output channels of the previous convolutional layer, and $C_{out}$ is the number of output channels, i.e. the number of output feature maps of this convolutional layer.
To illustrate the above equation, we can analyze it as follows. We rewrite formula 7.13 as:

$$\text{Time} \sim O\!\left( (2 \cdot C_{in} \cdot K^2) \cdot M^2 \cdot C_{out} \right) \qquad (7.14)$$
We first calculate the number of operations needed for one pixel of an output feature map; this is then multiplied by $(M^2 \cdot C_{out})$ to generalize to all output feature maps. The part in the parentheses can be split into two terms as shown in equation 7.15, where the first term counts the multiplications and the second the accumulations. Adding 1 to account for the bias term, we finally obtain $(2 \cdot C_{in} \cdot K^2)$ as shown in formula 7.14.

$$(2 \cdot C_{in} \cdot K^2 - 1) = (C_{in} \cdot K^2) + (C_{in} \cdot K^2 - 1) \qquad (7.15)$$
In addition, the size $M$ of the output feature map is determined by the size $X$ of the input feature map, the kernel width $K$, the padding and the stride, and can be estimated as follows:

$$M = (X - K + 2 \cdot \text{Padding}) / \text{Stride} + 1 \qquad (7.16)$$
The total time complexity of all convolutional layers is:

$$\text{Time} \sim O\!\left( 2 \cdot \sum_{l=1}^{D} M_l^2 \cdot K_l^2 \cdot C_{l-1} \cdot C_l \right) \qquad (7.17)$$
where l is the index of a convolutional layer and D is the model depth, i.e. the number
of convolutional layers. M_l is the spatial size of the output feature map, K_l is the spatial
size (length) of the filter, C_l is the number of filters (width) or output channels of the
l-th layer, and C_{l-1} denotes the number of input channels of the l-th layer, i.e. the number of
output filters of the (l-1)-th layer.
The theoretical time complexity given by the above equations is only an estimate for a neural
network, not its actual running time, because the actual running time is sensitive to the
implementation and the hardware [21].
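The following sketch illustrates these formulas for a small, made-up stack of convolutional layers; it only estimates FLOPs according to formulas 7.16 and 7.17 and, as discussed above, ignores non-linearities, dropout and normalization:

```python
def conv_output_size(x, k, padding=0, stride=1):
    # Formula 7.16: spatial size M of the output feature map.
    return (x - k + 2 * padding) // stride + 1

def conv_flops(m, k, c_in, c_out):
    # Per-layer term of formula 7.17: 2 FLOPs (one multiplication, one accumulation) per MAC.
    return 2 * (m ** 2) * (k ** 2) * c_in * c_out

# Hypothetical three-layer network on a 32x32 input with 3 channels: (K, C_in, C_out) per layer.
layers = [(3, 3, 16), (3, 16, 32), (3, 32, 64)]
x, total = 32, 0
for k, c_in, c_out in layers:
    m = conv_output_size(x, k, padding=1, stride=1)
    total += conv_flops(m, k, c_in, c_out)
    x = m
print(f"estimated forward-pass FLOPs: {total:,}")
```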
Space complexity
Space complexity is a measure of the amount of working storage an algorithm needs, i.e. how
much memory is required, in the worst case, at any point during the algorithm. In
practical deep learning, the size of a neural network is often bounded by the available memory:
if the memory cannot hold all model weights and intermediate variables, the program crashes.
It is therefore necessary to know how to estimate how much GPU memory the designed model and
its intermediate variables occupy, so that memory-related problems can be anticipated and avoided.
The space complexity in deep learning includes two parts: the total parameters and the output
feature maps of each layer. The total parameters are the weight parameters of all
layers that contain model parameters (the first term of the formula below). The feature
maps are the output feature maps computed in each layer of the model (the second term
of the formula below).
\text{Space} \sim O\left( \sum_{l=1}^{D} K_{l}^{2} \cdot C_{l-1} \cdot C_{l} + \sum_{l=1}^{D} M_{l}^{2} \cdot C_{l} \right) \qquad (7.18)
As we can see in formula 7.18, the number of parameters is only related to the spatial size of the
kernel K, the number of channels C and the number of convolutional layers D, and is independent
of the input size (weight sharing). In practice, actual memory usage also depends on the
implementation and the platform.
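A corresponding sketch for formula 7.18 (again with made-up layer shapes) counts the weight parameters and the activation values; multiplying their sum by 4 bytes per float32 value gives a rough lower bound on memory that ignores gradients, optimizer state and framework overhead:

```python
def conv_space(layers, m_sizes):
    # layers: list of (K, C_in, C_out); m_sizes: output feature map size M per layer.
    params = sum(k * k * c_in * c_out for k, c_in, c_out in layers)                # first term of 7.18
    activations = sum(m * m * c_out for (_, _, c_out), m in zip(layers, m_sizes))  # second term of 7.18
    return params, activations

layers = [(3, 3, 16), (3, 16, 32), (3, 32, 64)]   # hypothetical network
m_sizes = [32, 32, 32]                            # assuming "same" padding and stride 1
params, acts = conv_space(layers, m_sizes)
print("parameters:", params, "activation values:", acts)
print("approx. memory in MB (float32):", 4 * (params + acts) / 2 ** 20)
```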
7.5 Entity Linking
We use TagMe (https://tagme.d4science.org/tagme/) to extract useful named entities. TagMe[20] is a powerful system
which is able to properly identify meaningful substrings (also called spots) in
a plain text and to link each of them to a closely related Wikipedia (https://en.wikipedia.org/wiki/Main_Page) page quickly and efficiently.
Besides TagMe, there are also other entity linking tools such as Dexter (http://dexter.isti.cnr.it/), Babelfy (http://babelfy.org/),
etc. We choose TagMe in our work because it is reported to be one of the best entity
linking tools in the scientific community. Most importantly, compared with existing tools,
TagMe is able to annotate short texts such as tweets, search-engine results, news titles, etc.
An annotation is a pair (spot, entity), where spot (also mention) is a substring of the
input text, and entity is a reference to the Wikipedia page that represents the meaning
of the spot in that context. In the ideal case, the mention (spot) is not only linked
to the correct named entity, in the sense of named entity linking
(NEL); the annotation also resolves the pervasive semantic problems of
ambiguity and polysemy, in the sense of word sense disambiguation (WSD).
We use the RESTful API provided by TagMe to obtain annotations. We first send a text to
the server; the response of the server includes all the annotations found in
that text. TagMe associates an attribute with each annotation, called ρ. This ρ value,
which estimates the goodness of the annotation with respect to other entities existing in
the input text, should be used as a filter to discard annotations that are below a given
threshold. A reasonable threshold is reported to be between 0.1 and 0.3. Note that ρ does
not indicate the relevance of the entity in the input text, but is rather a confidence score
assigned by TagMe to that annotation.
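A request can be sketched as follows; the endpoint and parameter names follow the public TagMe documentation, and the gcube-token is a placeholder that has to be replaced by a personal D4Science access token:

```python
import requests

TAGME_ENDPOINT = "https://tagme.d4science.org/tagme/tag"
GCUBE_TOKEN = "YOUR-D4SCIENCE-TOKEN"  # placeholder, not a real token

def annotate(text, lang="en"):
    # Send the text to TagMe and return the list of annotations parsed from the JSON response.
    params = {"text": text, "lang": lang, "gcube-token": GCUBE_TOKEN}
    response = requests.get(TAGME_ENDPOINT, params=params)
    response.raise_for_status()
    return response.json().get("annotations", [])

annotations = annotate("On this day 24 years ago Maradona scored his infamous "
                       "Hand of God goal against England in the quarter-final of the 1986.")
for ann in annotations:
    print(ann.get("spot"), "->", ann.get("title"), "rho =", ann.get("rho"))
```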
For instance, consider the following sentence:
On this day 24 years ago Maradona scored his infamous Hand of God goal
against England in the quarter-final of the 1986.
For this short text, the extracted annotations in JSON format are given below:
{
    "test": "5",
    "annotations": [
        {
            "spot": "On this day",
            "link_probability": 0.0014018691144883633,
            "rho": 0.0576556958258152,
            "title": "BBC News Online"
        },
        {
            "spot": "day",
            "link_probability": 0.0030609103851020336,
            "rho": 0.02183143049478531,
            "title": "Martin Day"
        },
        {
            "spot": "24 years",
            "link_probability": 0.0019907099194824696,
            "rho": 0.0009953549597412348,
            "title": "1986 FIBA World Championship"
        },
        {
            "spot": "Maradona",
            "link_probability": 0.15395480394363403,
            "rho": 0.2147175818681717,
            "title": "Diego Maradona"
        },
        {
            "spot": "infamous",
            "link_probability": 0.022769613191485405,
            "rho": 0.011384806595742702,
            "title": "Infamous (Motionless in White album)"
        },
        {
            "spot": "Hand of God goal",
            "link_probability": 0.8918918967247009,
            "rho": 0.582613468170166,
            "title": "Argentina v England (1986 FIFA World Cup)"
        },
        {
            "spot": "England",
            "link_probability": 0.2987603545188904,
            "rho": 0.27085503935813904,
            "title": "England national football team"
        },
        {
            "spot": "quarter-final",
            "link_probability": 0.014687774702906609,
            "rho": 0.1550503820180893,
            "title": "2010 FIFA World Cup knockout stage"
        }
    ]
}
The ρ value of the recognized spot “On this day”, which links to the Wikipedia page “BBC
News Online”, is 0.06. The ρ value of the spot “Maradona”, which links to the entity “Diego
Maradona”, is 0.21. The latter annotation is clearly more reasonable in this sentence and thus
should have, and indeed does have, a higher ρ value. In fact, “Maradona”, referring to the well-known
football player “Diego Maradona”, is a partial name which has to be linked to the
appropriate entity in a KB such as Wikipedia or DBpedia.
In our work, we set the ρ threshold to 0.1. As a result, in this example, “Diego Maradona”, “Argentina
v England (1986 FIFA World Cup)”, “England national football team” and “2010
FIFA World Cup knockout stage” are considered proper named entities and will be
used in the following experimental steps.
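Applying the ρ threshold to such a response is then a simple filter. The sketch below assumes the annotations have already been parsed into a list of dictionaries as returned by the API:

```python
RHO_THRESHOLD = 0.1  # threshold used in our work

def filter_entities(annotations, threshold=RHO_THRESHOLD):
    # Keep only the Wikipedia page titles whose annotation confidence rho reaches the threshold.
    return [ann["title"] for ann in annotations if ann.get("rho", 0.0) >= threshold]

# For the example response above this yields:
# ['Diego Maradona', 'Argentina v England (1986 FIFA World Cup)',
#  'England national football team', '2010 FIFA World Cup knockout stage']
```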
In our experiments, we find that not all sentences yield extracted named entities, and not
all extracted named entities have a corresponding entity embedding in Wikipedia2Vec. To
demonstrate that named entities can be a useful feature, some statistics are listed in Table 15.
We first use TagMe to extract named entities. From Table 15 we can see that about 92% of the
sentences in TREC contain entities. We then query these entities in Wikipedia2Vec to obtain
entity embeddings. About 98% of the extracted named entities have a corresponding entity
embedding in Wikipedia2Vec. Moreover, 92% of the sentences have at least one entity embedding,
which means that only 8% of the sentences have no entity features at all. These statistics make it clear
that the extracted named entities can be regarded as a new feature which will be utilized in
our work. Feature statistics on the other datasets can be found in the Appendix. Entities
which do not have an entity embedding are zero-initialized, as in the sketch below.
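A minimal sketch of this lookup, assuming the wikipedia2vec package and a downloaded pre-trained model (the file name is a placeholder); entities missing from the model simply stay zero-initialized:

```python
import numpy as np
from wikipedia2vec import Wikipedia2Vec

MODEL_FILE = "enwiki_20180420_300d.pkl"  # placeholder path to a pre-trained Wikipedia2Vec model
wiki2vec = Wikipedia2Vec.load(MODEL_FILE)

def entity_matrix(entity_titles, dim=300):
    # Build a (num_entities x dim) matrix; rows of missing entities remain zero.
    matrix = np.zeros((len(entity_titles), dim), dtype=np.float32)
    for i, title in enumerate(entity_titles):
        if wiki2vec.get_entity(title) is not None:
            matrix[i] = wiki2vec.get_entity_vector(title)
    return matrix

embeddings = entity_matrix(["Diego Maradona", "England national football team"])
```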
TREC Training Set Test Set
#total sentences 5452 500
#sentences with entity 5031 460
#sentences without entity 421 40
percentage of sentences with entity 92% 92%
#total entities 12291 874
#average entities per sentence 2.44 1.90
#total entities with entity embedding in Wikipedia2Vec 12020 856
percentage of entities with entity embedding 98% 98%
#sentences which have at least 1 entity embedding 5000 460
percentage of sentences which have at least 1 entity embedding 92% 92%
Table 15: Feature statistics on TREC
7.6 Experimental Setup
Word Embeddings
The Word2Vec (https://code.google.com/archive/p/word2vec/) model takes a document corpus as input and produces word vectors
as output. It is pre-trained on a very large unannotated corpus, so we are able to achieve
better generalization when given a limited amount of training data. The pre-trained vectors
of the word2vec model are trained on part of the Google News dataset, which contains about 100
billion words. The model contains 300-dimensional vectors for 3 million words and phrases.
Words which are not present in the set of pre-trained vectors are randomly initialized
from a uniform distribution over [-0.25, 0.25].
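A sketch of how such an embedding matrix can be built with gensim, assuming the Google News binary file has been downloaded (the path is a placeholder); out-of-vocabulary words are drawn from the uniform distribution over [-0.25, 0.25]:

```python
import numpy as np
from gensim.models import KeyedVectors

W2V_FILE = "GoogleNews-vectors-negative300.bin"  # placeholder path to the pre-trained vectors
word2vec = KeyedVectors.load_word2vec_format(W2V_FILE, binary=True)

def word_matrix(vocabulary, dim=300, seed=0):
    rng = np.random.RandomState(seed)
    matrix = np.zeros((len(vocabulary), dim), dtype=np.float32)
    for i, word in enumerate(vocabulary):
        if word in word2vec:
            matrix[i] = word2vec[word]
        else:
            # Out-of-vocabulary words: uniform initialization in [-0.25, 0.25].
            matrix[i] = rng.uniform(-0.25, 0.25, dim)
    return matrix

embedding_matrix = word_matrix(["maradona", "scored", "goal"])
```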
The Wikipedia2Vec models are trained on an English Wikipedia dump. Pre-trained
models with 100-, 300- and 500-dimensional word and entity vectors are available
on the Wikipedia2Vec official page (https://wikipedia2vec.github.io/wikipedia2vec/pretrained/). We can also train our own model on the latest
English Wikipedia dump (https://dumps.wikimedia.org/enwiki/).
We can also train a Doc2Vec model on the English Wikipedia dump. Detailed training
instructions can be found online (https://markroxor.github.io/gensim/static/notebooks/doc2vec-wikipedia.html). In our work, the model size is set to 300 dimensions and
the number of epochs to 5. We could improve the model accuracy with more training epochs, such as
10 or 20, but that would take much more training time.
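A condensed sketch of such a training run with gensim, in the spirit of the linked notebook (the dump path is a placeholder, and parameter names follow recent gensim versions; depending on the version, the tokens yielded by WikiCorpus may still need decoding):

```python
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

DUMP = "enwiki-latest-pages-articles.xml.bz2"  # placeholder path to an English Wikipedia dump

class TaggedWikiDocument:
    def __init__(self, wiki):
        self.wiki = wiki
        self.wiki.metadata = True  # also yield (page_id, title) for every article
    def __iter__(self):
        for content, (page_id, title) in self.wiki.get_texts():
            # Tag each article with its title so it receives its own document vector.
            yield TaggedDocument(content, [title])

wiki = WikiCorpus(DUMP, dictionary={})  # empty dictionary: skip the vocabulary-building pass
documents = TaggedWikiDocument(wiki)
model = Doc2Vec(documents, vector_size=300, epochs=5, window=8, min_count=20, workers=4)
model.save("doc2vec_enwiki_300d.model")
```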
Hyperparameters
Parameters Values
filter sizes lower:[3, 4, 5], upper:[3, 4, 5, 6]
number of filters lower:16, upper:256
dropout rate 0.5
batch size 64
l2 regularization 3.0
embedding dimension lower:100, upper:300
learning rate 0.005
Table 16: Experiment hyperparameters
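To make the architecture and these hyperparameters concrete, the following is a minimal tf.keras sketch of a TextCNN-style model with an additional entity channel. It is an illustrative reconstruction rather than the exact TensorFlow 1.3 code used in our experiments; the vocabulary sizes, sequence lengths and the way the pooled word and entity features are concatenated are assumptions for this sketch:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers, regularizers

def build_entcnn(seq_len=40, ent_len=10, vocab_size=20000, ent_vocab_size=5000,
                 dim=300, filter_sizes=(3, 4, 5), num_filters=256,
                 dropout=0.5, l2=3.0, num_classes=6):
    words = layers.Input(shape=(seq_len,), name="words")
    ents = layers.Input(shape=(ent_len,), name="entities")
    w = layers.Embedding(vocab_size, dim)(words)      # would be initialized from Word2Vec
    e = layers.Embedding(ent_vocab_size, dim)(ents)   # would be initialized from Wikipedia2Vec

    pooled = []
    for channel in (w, e):
        for k in filter_sizes:
            conv = layers.Conv1D(num_filters, k, activation="relu")(channel)
            pooled.append(layers.GlobalMaxPooling1D()(conv))

    x = layers.Dropout(dropout)(layers.Concatenate()(pooled))
    # The value 3.0 is interpreted here as an l2 penalty; Kim (2014) instead uses
    # an l2 max-norm constraint of 3 on the weight vectors.
    out = layers.Dense(num_classes, activation="softmax",
                       kernel_regularizer=regularizers.l2(l2))(x)
    model = Model(inputs=[words, ents], outputs=out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model
```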
Experimental Environment
In our work, all experiments are carried out on a workstation with a 3.60 GHz 4-core
Intel Xeon CPU and a single NVIDIA GeForce GTX 1060 GPU, using CUDA
10.0.130, cuDNN 7.3.0 and TensorFlow 1.3.
7.7 Results and Analysis
Our experimental results are listed in Table 17. All experiments are repeated three to
five times under the optimal hyperparameters, and we take the average of the results as
the final result. As we can see, our TextCNN implementation achieves about the same or better
performance than our baseline. Our model EntCNN achieves higher accuracy than the baseline
model TextCNN on all datasets, with a maximum improvement of 1.4%. This indicates
that entities do help to improve the performance of text classification.
The accuracy on Twitter (68.1%) is much lower than on the three other datasets. This can be
attributed to the following reasons. First, Twitter, which contains fewer than ten thousand
sentences, is not a large corpus, especially for deep learning. Second, Twitter is a set of
tweets with three labels, and it contains longer, more complicated and noisier
sentences; compared with TREC and AG News, its sentences are clearly harder to
analyze and label. Third, Twitter is a dataset intended for sentiment analysis,
but we do not perform any sentiment-specific analysis in our work. As reported in the original
sentiment analysis competition paper [41], the best result of SemEval-2013 Task 2 subtask
B is 69%, so our result appears to be reasonable.
Model Twitter TREC Movie Review AG News
Baseline(Word2Vec) 57.24 89.33 81.52 86.11
Word(Word2Vec) 68.1 89.4 83.0 87.4
Word(Word2Vec) + Entity(Wikipedia2Vec) 68.2 90.4 83.0 87.9
Word(Wikipedia2Vec) + Entity(Wikipedia2Vec) 68.8 89.8 82.8 87.9
+ 0.7 + 1.0 - + 0.5
Word(Doc2Vec) 69.3 91.6 81.5 87.6
Word(Word2Vec) + Entity(Doc2Vec) 68.0 89.0 82.0 88.1
Word(Doc2Vec) + Entity(Doc2Vec) 69.3 93.0 81.3 87.4
- + 1.4 +0.5 +0.5
Entity(Wikipedia2vec) 46.4 61.4 51.8 79.3
Word(BERT) 69.3 90.2 79.3 87.4
Table 17: Classification results of different models on several datasets. Baseline:
we take the results reported in [55] as our baseline. Word: TextCNN (i.e. text classification
using only words). Word + Entity: EntCNN (i.e. text classification using words and
entities). Entity: text classification using only entities. All words and entities are
initialized from the 300-dimensional pre-trained Word2Vec, Wikipedia2Vec and Doc2Vec models
and from 768-dimensional BERT, respectively.
To verify the hypothesis that we can benefit from entities, we conduct an experiment where
we use only entities to classify documents. The experimental results in Table 17 confirm
that by using only entities, we are already able to achieve an accuracy that is better than
random guessing (50% for binary classification and 33% for three-class classification).
From the confusion matrices shown in Figure 7.1 and Figure 7.2, we can see that our
model EntCNN improves the classification accuracy on the labels HUM from 91% to 95% and ENTY
from 73% to 77%, and has overall higher accuracy than TextCNN. That is,
named entities do help to classify sentences of the labels HUM and ENTY. Another observation
is that the labels ABBR and ENTY have much lower accuracy
than the other labels. One possible explanation is that the labels ABBR and ENTY lack sufficient
training data. Confusion matrices for the other datasets can be found in the Appendix.
We also compare the classification performance when using different pre-trained word
vector models. The entity vectors are initialized with Wikipedia2Vec or Doc2Vec. We
do not use the pre-trained Word2Vec model to initialize entity vectors because it is trained
on Google News articles rather than on a Wikipedia dump, so pre-trained vectors are not
available for many Wikipedia entities. The word vectors of the different models are initialized
with the Word2Vec, Wikipedia2Vec and Doc2Vec models respectively. From the results shown
in Table 17, it is clear that the choice of pre-trained word vector model has an impact on performance.
Figure 7.1: Baseline model (TextCNN) normalized confusion matrix on TREC
Figure 7.2: Our model (EntCNN) normalized confusion matrix on TREC
However, different models perform better on different datasets. Doc2Vec
works better for Twitter and TREC, Word2Vec and Wikipedia2Vec work better for Movie
Review, and all of them perform well on AG News. Thus there is no single optimal
choice for all datasets. In practice, our results suggest experimenting with different pre-trained
word vectors for new datasets.
We also conduct an experiment where we initialize words from the state-of-the-art model
BERT. The word vectors in BERT have 768 dimensions. We obtain word vectors from the
open toolkit bert-as-service (https://github.com/hanxiao/bert-as-service). The results demonstrate that we cannot further improve
the classification performance by simply using word vectors from BERT.
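Obtaining these vectors with bert-as-service can be sketched as follows, assuming a BERT-Base model has been downloaded and a server has been started with bert-serving-start (per-token word vectors require starting the server with -pooling_strategy NONE):

```python
from bert_serving.client import BertClient

# Connects to a locally running bert-serving-start instance.
bc = BertClient()
vectors = bc.encode(["what is the capital of germany ?"])
print(vectors.shape)  # (1, 768) with the default sentence-level pooling
```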
To demonstrate that our preprocessing methods for Twitter do help to improve the
classification accuracy, we conduct an experiment with and without tweet-specific
preprocessing. Comparing the results shown in Table 18, we can clearly see that
our preprocessing methods improve the classification accuracy by 2%-3% on average. This
is generally not a small number, especially considering the fact that our model brings
an average improvement of about 1%. This suggests that preprocessing is always the first step
to consider and is extremely important in NLP. While conducting our experiments, we also find
that there is a performance difference when applying different preprocessing techniques
to the same task.
Model Without With
Word(Word2Vec) 66.1 68.1
Word(Word2Vec) + Entity(Wikipedia2Vec) 66.1 68.2
Word(Wikipedia2Vec) + Entity(Wikipedia2Vec) 65.5 68.8
Table 18: Comparison of classification accuracy on Twitter with and without preprocessing. Without:
classification without tweet-specific preprocessing. With: classification with tweet-specific
preprocessing.
We also compare the performance of different text classification models, see Table 19.
As we can see, TextCNN already achieves good performance; the models
proposed later do not bring much improvement. From the evaluation scores on AG News and
TREC listed in the tables at the end of this chapter, we can see for which labels the
classification accuracy is improved.
To summarize our experimental results, our model EntCNN can improve the classification
performance on the four datasets, no matter whether they are traditional multi-class classification
tasks or sentiment analysis tasks where we want to determine whether a sentence is
positive or negative. The experimental results demonstrate that entities provide more
contextual information and thus help to classify documents.
Model Twitter TREC Movie Review AG News
TextCNN 68.1 89.4 83.0 87.4
RCNN 67.6 90.4 82.3 88.3
Bi-LSTM 63.6 88.4 83.7 88.2
Table 19: Comparison between different models. TextCNN: Convolutional Neural Network
for sentence classification (Kim, 2014). RCNN: Recurrent Convolutional Neural Network
for text classification (Lai, 2015). Bi-LSTM: bidirectional LSTM. All words are initialized
from the 100-dimensional pre-trained GloVe model.
Class Precision Recall F1-Score
World 90.5 87.4 88.9
Sports 90.3 94.5 92.3
Business 84.9 85.2 85.0
Sci/Tech 86.5 85.1 85.8
Table 20: Evaluation of EntCNN on AG
News
Class Precision Recall F1-Score
World 91.5 85.7 88.5
Sports 89.0 95.2 92.0
Business 82.4 86.4 84.4
Sci/Tech 88.8 82.9 85.3
Table 21: Evaluation of TextCNN on AG
News
Class Precision Recall F1-Score
ABBR 100 77.8 87.5
ENTY 90.0 76.6 82.8
DESC 83.4 98.6 90.4
HUM 92.5 95.4 94.0
LOC 92.4 90.1 91.3
NUM 98.1 90.3 94.0
Table 22: Evaluation of EntCNN on TREC
Class Precision Recall F1-Score
ABBR 100 77.8 87.5
ENTY 89.6 73.4 80.7
DESC 80.5 98.6 88.6
HUM 90.8 90.8 90.8
LOC 93.7 91.4 92.5
NUM 99.0 90.3 94.4
Table 23: Evaluation of TextCNN on TREC
8 Conclusion
8.1 Summary
Text classification has always been considered extremely important in the NLP research
and industry communities. However, short text classification has not been an equally popular
research topic. Thus, in our work, we explore the possibility of utilizing named entities in
neural network architectures to classify documents. Our key contribution is that we are
the first to demonstrate that named entities help to improve the performance of short
text classification.
In our work, a novel neural network based approach to short text classification is proposed.
We utilize and modify the convolutional neural network for text classification (TextCNN)
proposed by Kim (2014)[28]. Our model is trained not only on top of pre-trained word
vectors, but also on top of pre-trained entity vectors. Tagme is utilized to extract named
entities from documents, and Wikipedia2Vec and Doc2Vec are utilized to obtain entity vectors.
Our feature statistics on four widely used NLP datasets demonstrate that the extracted
named entities can serve as a feature for the subsequent work. We compare the performance of
our proposed model EntCNN with the baseline model TextCNN. By comparing our
experimental results of different models on different datasets, we conclude that entity
vectors are helpful for multi-class classification tasks. The impact of different kinds of
pre-trained word vector representations on the classification accuracy is also studied in
this work. The relative performance achieved using Word2Vec, Wikipedia2Vec, Doc2Vec
and BERT depends on the dataset; our results suggest experimenting with different
pre-trained word vectors for new NLP tasks. Additionally, the intensive preprocessing
we apply to tweets is shown to considerably improve the classification performance.
Therefore, our experiments suggest that corpus-specific preprocessing should always be
considered a crucial step in NLP.
8.2 Future Work
Our work is based on TextCNN, proposed by Kim in 2014. This model achieves good
classification performance across a wide range of text classification tasks and has since become
a standard baseline for text classification architectures. However, remarkable
progress has been made in NLP since 2018. Large-scale pre-trained language
representation models like BERT, GPT and ELMo, which exploit the enormous amount of
unannotated text on the web, have been proposed. These models can be fine-tuned on NLP tasks
with small amounts of data and have been proven to be state-of-the-art. BERT
is a milestone in NLP and will have a lasting impact on subsequent NLP research and
industrial applications. Therefore, in future work, our efforts should be focused on these
new state-of-the-art models. Generative pre-trained language models together with task-specific
fine-tuning can be considered a general NLP framework.
Due to time and compute resource constraints, we did not explore fine-tuning BERT for text
classification. In particular, according to our literature review, BERT for short text
classification has not been studied yet. How to learn better pre-trained representations will
continue to be a popular research topic for some time. Which unit to pre-train on (word,
sub-word, character, etc.), which kind of language model to train (LSTMs, Transformer, etc.),
and how to adapt pre-trained models to specific tasks are interesting topics that
still need to be explored.
Once BERT is involved in our work, we could explore the possibility of combining BERT
input embeddings with entity embeddings. That is, the BERT input representation would be the
sum of the token embeddings, the segmentation embeddings, the position embeddings and
the entity embeddings.
A Feature Statistics
Table 24: Feature statistics on Movie Review
Movie Review Positive Negative
#total sentences 5331 5331
#sentences with entity 4517 4379
#sentences without entity 814 952
percentage of sentences with entity 85% 82%
#total entities 14171 13014
#average entities per sentence 3.14 2.97
#total entities with entity embedding in Wikipedia2Vec 13827 12677
percentage of entities with entity embedding 98% 97%
#sentences which have at least 1 entity embedding 4455 4328
percentage of sentences which have at least 1 entity embedding 84% 81%
Table 25: Feature statistics on AG News
AG News Training Set Test Set
#total sentences 120000 7600
#sentences with entity 110719 6989
#sentences without entity 9281 611
percentage of sentences with entity 92% 92%
#total entities 323517 20626
#average entities per sentence 2.92 2.95
#total entities with entity embedding in Wikipedia2Vec 316591 20204
percentage of entities with entity embedding 98% 98%
#sentences which have at least 1 entity embedding 110001 6942
percentage of sentences which have at least 1 entity embedding 92% 91%
Twitter Training Set Test Set
#total sentences 9684 3547
#sentences with entity 9219 3382
#sentences without entity 465 165
percentage of sentences with entity 95% 95%
#total entities 30551 10916
#average entities per sentence 3.31 3.23
#total entities with entity embedding in Wikipedia2Vec 28983 10278
percentage of entities with entity embedding 95% 94%
#sentences which have at least 1 entity embedding 9148 3345
percentage of sentences which have at least 1 entity embedding 94% 94%
Table 26: Feature statistics on Twitter
B Confusion Matrices
Figure B.1: Baseline model (TextCNN) normalized confusion matrix on AG News
Figure B.2: Our model (EntCNN) normalized confusion matrix on AG News
Figure B.3: Baseline model (TextCNN) normalized confusion matrix on Twitter
Figure B.4: Our model (EntCNN) normalized confusion matrix on Twitter
Literatur
[1] Attention in long short-term memory recurrent neural networks.
https://machinelearningmastery.com/attention-long-short-term-memory-recurrent-neural-networks/.
Accessed: 2017-06-30.
[2] Deep learning for natural language processing.
https://github.com/oxford-cs-deepnlp-2017/lectures.
[3] Recurrent neural networks tutorial, part 1 – introduction to rnns.
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/.
Accessed: 2015-09-17.
[4] Understanding lstm networks.
https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[5] Visualizing what convnets learn. http://cs231n.github.io/understanding-cnn/.
[6] Aggarwal, C. C., and Zhai, C. A survey of text classification algorithms. In
Mining Text Data, C. C. Aggarwal and C. Zhai, Eds. Springer, 2012, pp. 163–222.
[7] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly
learning to align and translate. CoRR abs/1409.0473 (2014). Accepted at
ICLR 2015 as oral presentation.
[8] Bellman, R., and Collection, K. M. R. Adaptive Control Processes: A Guided
Tour. Princeton Legacy Library. Princeton University Press, 1961.
[9] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic
language model. J. Mach. Learn. Res. 3 (Mar. 2003), 1137–1155.
[10] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies
with gradient descent is difficult. Trans. Neur. Netw. 5, 2 (Mar. 1994), 157–166.
[11] Camacho-Collados, J., and Pilehvar, M. T. On the role of text preprocessing
in neural network architectures: An evaluation study on text categorization and
sentiment analysis. CoRR abs/1707.01780 (2017).
[12] Chang, A., Spitkovsky, V. I., Manning, C. D., and Agirre, E. A comparison
of named-entity disambiguation and word sense disambiguation. In Proceedings of
the Tenth International Conference on Language Resources and Evaluation (LREC
2016) (Paris, France, May 2016), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, S. Goggi,
M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, and S. Piperidis,
Eds., European Language Resources Association (ELRA).
[13] Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y.
Attention-based models for speech recognition. In Advances in Neural Information
Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and
R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 577–585.
[14] Conneau, A., Schwenk, H., Barrault, L., and LeCun, Y. Very deep convolutional
networks for natural language processing. CoRR abs/1606.01781 (2016).
[15] Dai, A. M., Olah, C., and Le, Q. V. Document embedding with paragraph
vectors. In NIPS Deep Learning Workshop (2015).
[16] Dash, M., and Liu, H. Feature selection for classification. Intell. Data Anal. 1, 3
(May 1997), 131–156.
[17] Denny, M. J., and Spirling, A. Text preprocessing for unsupervised learning:
Why it matters, when it misleads, and what to do about it. Political Analysis 26, 2
(2018), 168–189.
[18] Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of
deep bidirectional transformers for language understanding. CoRR abs/1810.04805
(2018).
[19] Domingos, P. A few useful things to know about machine learning. Commun. ACM
55, 10 (Oct. 2012), 78–87.
[20] Ferragina, P., and Scaiella, U. Tagme: On-the-fly annotation of short text
fragments (by wikipedia entities). In Proceedings of the 19th ACM International
Conference on Information and Knowledge Management (New York, NY, USA, 2010),
CIKM ’10, ACM, pp. 1625–1628.
[21] He, K., and Sun, J. Convolutional neural networks at constrained time cost. In
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June
2015), pp. 5353–5360.
[22] Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis,
Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München,
1991.
[23] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Comput.
9, 8 (Nov. 1997), 1735–1780.
[24] James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to
Statistical Learning: With Applications in R. Springer Publishing Company,
Incorporated, 2014.
[25] Joulin, A., Grave, E., Bojanowski, P., and Mikolov, T. Bag of tricks for
efficient text classification. CoRR abs/1607.01759 (2016).
[26] Jurafsky, D., and Martin, J. H. Speech and Language Processing (2Nd Edition).
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.
[27] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. A convolutional
neural network for modelling sentences. CoRR abs/1404.2188 (2014).
[28] Kim, Y. Convolutional neural networks for sentence classification. CoRR
abs/1408.5882 (2014).
[29] Kowsari, K., Meimandi, K. J., Heidarysafa, M., Mendu, S., Barnes, L. E.,
and Brown, D. E. Text classification algorithms: A survey. CoRR abs/1904.08067
(2019).
[30] Král, P. Named entities as new features for czech document classification. In
Proceedings of the 15th International Conference on Computational Linguistics and
Intelligent Text Processing - Volume 8404 (Berlin, Heidelberg, 2014), CICLing 2014,
Springer-Verlag, pp. 417–427.
[31] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with
deep convolutional neural networks. In Advances in Neural Information Processing
Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds.
Curran Associates, Inc., 2012, pp. 1097–1105.
[32] Kumar, A., Irsoy, O., Ondruska, P., Iyyer, M., Bradbury, J., Gulrajani,
I., Zhong, V., Paulus, R., and Socher, R. Ask me anything: Dynamic memory
networks for natural language processing. In Proceedings of The 33rd International
Conference on Machine Learning (New York, New York, USA, 20–22 Jun 2016),
M. F. Balcan and K. Q. Weinberger, Eds., vol. 48 of Proceedings of Machine Learning
Research, PMLR, pp. 1378–1387.
[33] Lai, S., Xu, L., Liu, K., and Zhao, J. Recurrent convolutional neural networks for
text classification. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial
Intelligence (2015), AAAI’15, AAAI Press, pp. 2267–2273.
[34] Le, Q., and Mikolov, T. Distributed representations of sentences and documents.
In Proceedings of the 31st International Conference on International Conference on
Machine Learning - Volume 32 (2014), ICML’14, JMLR.org, pp. II–1188–II–1196.
[35] Lee, C., and Lee, G. G. Information gain and divergence-based feature selection
for machine learning-based text categorization. Inf. Process. Manage. 42, 1 (Jan.
2006), 155–165.
[36] Li, X., and Roth, D. Learning question classifiers. In Proceedings of the 19th
International Conference on Computational Linguistics - Volume 1 (Stroudsburg,
PA, USA, 2002), COLING ’02, Association for Computational Linguistics, pp. 1–7.
[37] Liu, P., Qiu, X., and Huang, X. Recurrent neural network for text classification
with multi-task learning. CoRR abs/1605.05101 (2016).
[38] Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information
Retrieval. Cambridge University Press, New York, NY, USA, 2008.
[39] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of
word representations in vector space. CoRR abs/1301.3781 (2013).
[40] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed
representations of words and phrases and their compositionality. In Advances
in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2013,
pp. 3111–3119.
[41] Nakov, P., Rosenthal, S., Kozareva, Z., Stoyanov, V., Ritter, A., and
Wilson, T. SemEval-2013 task 2: Sentiment analysis in twitter. In Second Joint
Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings
of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
(Atlanta, Georgia, USA, June 2013), Association for Computational Linguistics,
pp. 312–320.
[42] Pang, B., and Lee, L. Seeing stars: Exploiting class relationships for sentiment
categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting
on Association for Computational Linguistics (Stroudsburg, PA, USA, 2005), ACL
’05, Association for Computational Linguistics, pp. 115–124.
[43] Peters, M., Ruder, S., and Smith, N. A. To tune or not to tune? adapting
pretrained representations to diverse tasks. CoRR abs/1903.05987 (2019).
[44] Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee,
K., and Zettlemoyer, L. Deep contextualized word representations. CoRR
abs/1802.05365 (2018).
[45] Radford, A. Improving language understanding by generative pre-training.
[46] Řehůřek, R., and Sojka, P. Software Framework for Topic Modelling with Large
Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP
Frameworks (Valletta, Malta, May 2010), ELRA, pp. 45–50.
http://is.muni.cz/publication/884893/en.
[47] Ristoski, P., and Paulheim, H. Rdf2vec: Rdf graph embeddings for data mining.
In The Semantic Web - ISWC 2016 : 15th International Semantic Web Conference,
Kobe, Japan, October 17-21, 2016, Proceedings, Part I (Cham, 2016), vol. 9981,
Springer International Publishing, pp. 498–514.
[48] Rush, A. M., Chopra, S., and Weston, J. A neural attention model for abstractive
sentence summarization. In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing (Lisbon, Portugal, Sept. 2015), Association
for Computational Linguistics, pp. 379–389.
[49] Salton, G., Wong, A., and Yang, C. S. A vector space model for automatic
indexing. Commun. ACM 18, 11 (Nov. 1975), 613–620.
[50] Shen, W., Wang, J., and Han, J. Entity linking with a knowledge base: Issues,
techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering
27, 2 (Feb 2015), 443–460.
[51] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.
Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research 15 (2014), 1929–1958.
[52] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning
with neural networks. In Advances in Neural Information Processing Systems 27,
Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds.
Curran Associates, Inc., 2014, pp. 3104–3112.
[53] Uysal, A. K., and Gunal, S. The impact of preprocessing on text classification.
Inf. Process. Manage. 50, 1 (Jan. 2014), 104–112.
[54] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. CoRR
abs/1706.03762 (2017).
[55] Wang, J., Wang, Z., Zhang, D., and Yan, J. Combining knowledge with deep
convolutional neural networks for short text classification. In Proceedings of the 26th
International Joint Conference on Artificial Intelligence (2017), IJCAI’17, AAAI
Press, pp. 2915–2921.
[56] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R.,
and Bengio, Y. Show, attend and tell: Neural image caption generation
with visual attention. In Proceedings of the 32nd International Conference on
Machine Learning (Lille, France, 07–09 Jul 2015), F. Bach and D. Blei, Eds., vol. 37 of
Proceedings of Machine Learning Research, PMLR, pp. 2048–2057.
[57] Yamada, I., Asai, A., Shindo, H., Takeda, H., and Takefuji, Y. Wikipedia2vec:
An optimized tool for learning embeddings of words and entities from
wikipedia. CoRR abs/1812.06280 (2018).
[58] Yamada, I., Shindo, H., Takeda, H., and Takefuji, Y. Joint learning of
the embedding of words and entities for named entity disambiguation. CoRR
abs/1601.01343 (2016).
[59] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical
attention networks for document classification. In Proceedings of the 2016 Conference
of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (San Diego, California, June 2016), Association for
Computational Linguistics, pp. 1480–1489.
[60] Zeng, J., Li, J., Song, Y., Gao, C., Lyu, M. R., and King, I. Topic memory
networks for short text classification. In Proceedings of the 2018 Conference on
Empirical Methods in Natural Language Processing (Brussels, Belgium, 2018), Association
for Computational Linguistics, pp. 3120–3131.
[61] Zhang, X., and LeCun, Y. Which encoding is the best for text classification in
chinese, english, japanese and korean? CoRR abs/1708.02657 (2017).
[62] Zhang, X., Zhao, J. J., and LeCun, Y. Character-level convolutional networks
for text classification. CoRR abs/1509.01626 (2015).
[63] Zhang, Y., and Wallace, B. C. A sensitivity analysis of (and practitioners’ guide
to) convolutional neural networks for sentence classification. CoRR abs/1510.03820
(2015).
Erklärung
Ich versichere wahrheitsgemäß, die Arbeit selbstständig verfasst, alle benutzten Hilfsmittel
vollständig und genau angegeben und alles kenntlich gemacht zu haben, was aus Arbeiten
anderer unverändert oder mit Abänderungen entnommen wurde sowie die Satzung des KIT
zur Sicherung guter wissenschaftlicher Praxis in der jeweils gültigen Fassung beachtet zu
haben.
Karlsruhe, 29. August 2019 QINGYUAN BIE