Introduction

In the last blog we gave a brief introduction to Graph Convolutional Networks and examined \textsc{TextGCN}, a model which builds a graph from word occurrences in documents and uses GCNs to classify the documents (the nodes in the constructed graph). Meanwhile, with COVID-19 affecting people all over the world and China being the origin of the pandemic, there has been an increase in the spread of hateful language towards East Asians. The following example shows a tweet by the US president which conflates the coronavirus with a "China Virus." To apply our understanding of GCNs for text classification, we will implement the model in PyTorch and apply it to classifying Twitter data. Specifically, we will use the dataset constructed by Vidgen et al. in Detecting East Asian Prejudice on Social Media. For all the code, check out my github repo!

Dataset Details

The dataset is constructed from tweets posted between January 1st and March 17th, 2020. The authors first build a database of roughly 150,000 unique tweets from that period. They then employ annotators to extract common hashtags which express negativity towards East Asia. Next, they sample 10,000 tweets that used one of the anti-East Asia hashtags and another 10,000 tweets at random; this oversampling of anti-East Asia tweets ensures that whatever model we use can capture the features of these tweets. Finally, annotators individually classify each tweet into the following categories:

  • criticism of an East Asian entity ex. “the CCP hid information relevant to corona virus”
  • hostility towards an East Asian entity (more severe than criticism) ex. “Those oriental devils don’t care about human life”
  • counter speech (tweets that condemn abuse against an East Asian identity) ex. “It isn’t right to blame China!”
  • discussion (tweets that discuss prejudice but do not condemn it) ex. “It’s not racist to call it the Wuhan virus”
  • neutral (tweets that do not fall into any of the above categories)

After annotation, the distribution of tweets across categories is shown below. Due to the infrequency of counter speech tweets and their overlap with discussion of East Asian prejudice, the authors merge these two categories into a single class for classification. Still, we see that most tweets are categorized as neutral.

Moreover, the authors replace specific hashtags with generic ones, since specific hashtags can give the model the wrong signals and would not generalize well to unseen tweets. In particular, hashtags are substituted as follows: #EASTASIA (ex. #wuhan), #VIRUS (ex. #covid19), #EASTASIAVIRUS (ex. #chinavirus), #OTHERCOUNTRYVIRUS (ex. #italycovid), and #HASHTAG for hashtags that do not fall into any of the previous categories.
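As a rough illustration of this substitution (the mapping below is a tiny assumed subset, not the authors' actual hashtag lists, which are much longer), the replacement might look like:

```python
import re

# Illustrative subset only; the real mapping covers many more hashtags
HASHTAG_MAP = {
    "#wuhan": "#EASTASIA",
    "#covid19": "#VIRUS",
    "#chinavirus": "#EASTASIAVIRUS",
    "#italycovid": "#OTHERCOUNTRYVIRUS",
}

def generalize_hashtags(text: str) -> str:
    for tag, generic in HASHTAG_MAP.items():
        text = text.replace(tag, generic)
    # Any remaining hashtag that is not already a generic token
    # becomes the catch-all #HASHTAG
    return re.sub(r"#(?!EASTASIA|VIRUS|OTHERCOUNTRYVIRUS|HASHTAG)\w+", "#HASHTAG", text)
```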

Dataset Preprocessing

In the given dataset, tweets are provided in their raw form, including emojis, punctuation, etc. Below is an example:

😷before you wear n95 masks, you should look into getting a fit test. because unlike surgical masks, one size does not fit all for n95 masks. having best fit n95 for your face will ensure a good face seal for protection. https://t.co/xm2maqsp8w #HASHTAG HASHTAG_EASTASIA+VIRUS https://t.co/iiszmr3wgc

To get words that are useful for our text graph, we would like to keep the generic hashtags and words but remove the links, emojis, and punctuation. We parse our tweet strings with two regex find-and-replace passes: the first removes all URL links, and the second removes all non-alphanumeric characters.
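A minimal sketch of this cleaning (the real code lives in prep_data.py, and its exact rules may differ slightly):

```python
import re

def clean_tweet(text: str) -> str:
    text = text.lower()
    # First pass: remove all URL links
    text = re.sub(r"https?://\S+", "", text)
    # Turn '+' into '_' so tokens like hashtag_eastasia+virus stay one word
    text = text.replace("+", "_")
    # Second pass: remove all non-alphanumeric characters
    # (underscores are kept so the generic hashtag tokens survive)
    text = re.sub(r"[^a-z0-9_ ]", " ", text)
    return " ".join(text.split())
```

After this preprocessing, the example tweet becomes: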

before you wear n95 masks you should look into getting a fit test because unlike surgical masks one size does not fit all for n95 masks having best fit n95 for your face will ensure a good face seal for protection hashtag hashtag_eastasia_virus

We also build a vocabulary of words to use in our text graph. All words which occur more than 5 times in our corpus of tweets are included. The code for cleaning the dataset is available in prep_data.py.
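The vocabulary filter can be as simple as the following sketch (function name is mine; the actual code is in prep_data.py):

```python
from collections import Counter

def build_vocab(tweets, min_count=5):
    # Count every token across the cleaned corpus
    counts = Counter(word for tweet in tweets for word in tweet.split())
    # Keep only words occurring more than `min_count` times
    return sorted(word for word, c in counts.items() if c > min_count)
```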

Building The Text Graph

Given our cleaned documents and vocabulary, we build the text graph using the formulation described in Graph Convolutional Networks for Text Classification and explained in the last blog. A quick recap is shown below.
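The edge weights between nodes (restated from the TextGCN paper) are:

\begin{equation*} \textbf{A}_{ij} = \begin{cases} \text{PMI}(i, j) & i, j \text{ are words and } \text{PMI}(i, j) > 0 \\ \text{TF-IDF}_{ij} & i \text{ is a document, } j \text{ is a word} \\ 1 & i = j \\ 0 & \text{otherwise} \end{cases} \end{equation*}

\begin{equation*} \text{PMI}(i, j) = \log \frac{p(i, j)}{p(i)\, p(j)}, \qquad p(i, j) = \frac{\#W(i, j)}{\#W}, \qquad p(i) = \frac{\#W(i)}{\#W} \end{equation*}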

PMI(i, j) is the pointwise mutual information between two words, #W(i, j) is the number of sliding windows that contain both word i and word j, #W(i) is the number of sliding windows containing word i, and #W is the total number of sliding windows.

We use a window size of 10, as tweets are relatively short (an average of 16 words per tweet). In total, for this Twitter dataset we have 20,000 document nodes, 6,057 word nodes, and 793,081 edges. Details of the code are in build_graph.py. After construction, we obtain an adjacency matrix representing our text graph.
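To make the construction concrete, here is a minimal sketch of how the word-word (PMI) edges can be computed with sliding windows; the function name is mine, and the actual implementation is in build_graph.py:

```python
import math
from collections import Counter
from itertools import combinations

def pmi_edges(tweets, window_size=10):
    window_count = Counter()  # #W(i): number of windows containing word i
    pair_count = Counter()    # #W(i, j): number of windows containing i and j
    n_windows = 0
    for tweet in tweets:
        words = tweet.split()
        # Slide a fixed-size window over the tweet (one window if it is shorter)
        for start in range(max(1, len(words) - window_size + 1)):
            window = set(words[start:start + window_size])
            n_windows += 1
            window_count.update(window)
            pair_count.update(combinations(sorted(window), 2))
    edges = {}
    for (i, j), w_ij in pair_count.items():
        # PMI(i, j) = log( (#W(i,j)/#W) / ((#W(i)/#W) * (#W(j)/#W)) )
        pmi = math.log(w_ij * n_windows / (window_count[i] * window_count[j]))
        if pmi > 0:  # only positive-PMI pairs become edges
            edges[(i, j)] = pmi
    return edges
```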

We can also use Gephi to visualize our graph.

Green nodes are words and red nodes are documents; node sizes correspond to degree. From the visualization we see that the network is dominated by word-word and document-word edges, as shown by the many green edges. Moreover, the nodes with the most neighbors correspond to the hashtags. This makes sense, as we are looking at Twitter data after all.

However, this may make our data noisy. Let’s see what our graph looks like once we remove these hashtag nodes.

Now we can clearly see the main words in our dataset such as “china”, “chinese”, “wuhan” etc.

Implementing in PyTorch

Implementing the model is fairly straightforward, as we can use the PyTorch Geometric library's building blocks for graph neural networks. The PyTorch Geometric GCN module requires a feature matrix \textbf{X} \in \mathbb{R}^{|V| \times D} in which \textbf{X}_i is the feature vector for the i^{th} node and D is the feature dimension. In our case we use one-hot initialization for the node features, so D = |V| = 26{,}057. For such a large matrix, it is important to use a sparse representation to avoid running out of memory. The following is a method in the TextDataset class which generates the one-hot initial embedding as a torch sparse matrix; inds holds the row and column indices which correspond to each value in values.
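A minimal sketch of that method (the method and attribute names here are assumptions, not the exact code from the repo):

```python
import torch

class TextDataset:
    # ... dataset plumbing omitted ...

    def get_one_hot_features(self) -> torch.Tensor:
        """One-hot node features as a sparse |V| x |V| identity matrix."""
        n = self.num_nodes  # 26,057 for our text graph
        # inds holds the row and column indices of the non-zero entries;
        # for one-hot features these are simply the diagonal positions
        inds = torch.stack([torch.arange(n), torch.arange(n)])
        values = torch.ones(n)
        return torch.sparse_coo_tensor(inds, values, (n, n))
```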

However, PyTorch Geometric’s GCN module does not actually accept a sparse feature matrix as input. Therefore, we need to change the weight-and-feature matrix multiplication in the GCN layer (\textbf{X}_t \textbf{W}_{t} in the GCN equation below) to allow sparse torch matrices:

(1)   \begin{equation*} \textbf{X}_{t+1} = f\left(\textbf{D}^{-1/2} \hat{\textbf{A}} \textbf{D}^{-1/2} \textbf{X}_t \textbf{W}_{t}\right) \in \mathbb{R}^{|V| \times D_{out}} \end{equation*}

Specifically, we change the following lines in the forward method of the GCN module to allow sparse matrix multiplication.
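A sketch of the change (the attribute names follow older versions of PyTorch Geometric's GCNConv and may differ in newer releases):

```python
import torch

def forward(self, x, edge_index, edge_weight=None):
    # Originally: x = torch.matmul(x, self.weight)
    # Modified to also accept a sparse feature matrix:
    if x.is_sparse:
        x = torch.sparse.mm(x, self.weight)  # sparse X_t times dense W_t
    else:
        x = torch.matmul(x, self.weight)
    # ... normalization and message propagation continue unchanged ...
```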

For the rest of the details about the model, feel free to take a look at model_text_gnn.py.

Experiments and Results

Classification

Other than the window size of 10 (which performed better than a window size of 20), we keep the same hyperparameters as the original \textsc{TextGCN}. We train a two-layer GCN with an intermediate dimension of 200 and a ReLU activation function, apply a softmax over the final layer for prediction, and use cross entropy as the loss function. We use a learning rate of 0.02 and train for up to 200 epochs, with early stopping if the validation loss does not decrease for 10 consecutive epochs. However, because our dataset is somewhat imbalanced, and real-world applications of such text classification models would prefer higher precision on the non-neutral categories, we also utilize class weights in our cross entropy loss function to emphasize tweets in the less frequent categories.
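One common way to set such weights is inverse class frequency; the sketch below uses placeholder counts, and the exact scheme in our code may differ:

```python
import torch
import torch.nn.functional as F

# Placeholder class counts (hostility, criticism, counter/discussion, neutral);
# these are illustrative numbers, not the real label distribution
counts = torch.tensor([3000.0, 1500.0, 1000.0, 14500.0])
# Inverse-frequency weighting: rarer classes contribute more to the loss
weights = counts.sum() / (len(counts) * counts)

logits = torch.randn(8, 4)          # dummy model outputs for 8 tweets
labels = torch.randint(0, 4, (8,))  # dummy gold labels
loss = F.cross_entropy(logits, labels, weight=weights)
```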

We see the results of the model as well as the effect of using class weights below.

TextGCN            | Accuracy | F1    | Precision | Recall
w/o class weights  | 0.754    | 0.723 | 0.738     | 0.754
w/ class weights   | 0.723    | 0.733 | 0.767     | 0.716
Confusion matrix for the model without class weights.
Confusion matrix for the model with class weights.

We can see (from the diagonals of the confusion matrices) that, with class weights, the model sacrifices correct predictions in the neutral class for better performance in the other classes. However, we can also see that there are many errors between the different classes; for example, 169 neutral tweets are classified as hostile. Some examples of misclassified tweets (drawn from the most erroneous cells of the confusion matrix) are shown below.

misclassified neutral as hostile: @cnbci textbook way of censorship from HASHTAG_EASTASIA the entire world will be the victim. #HASHTAG

misclassified neutral as hostile: rt @yeesaryan: taiwan reacted almost immediately to HASHTAG_EASTASIA+VIRUS now taiwanese are safely guarded by a responsible government. how lucky ar…

misclassified criticism as hostile: “@hawleymo the hong kong gov is putting hk people’s lives behind political considerations, ridiculous! HASHTAG_EASTASIA HASHTAG_EASTASIA+VIRUS”

misclassified hostile as criticism: “you have got to be kidding me. the coronavirus outbreak came from communist china, not the usa or israel. 🧫🇨🇳 #HASHTAG HASHTAG_EASTASIA+VIRUS HASHTAG_VIRUS HASHTAG_VIRUS #HASHTAG #HASHTAG #HASHTAG HASHTAG_EASTASIA+VIRUS HASHTAG_EASTASIA”

Ablation Studies

In addition to the above results, we perform ablation studies to investigate the effect of the edge weights we derived when building the graph and the effect of using different numbers of GCN layers. We want to see which parts of the model influence performance the most and are truly responsible for learning good tweet representations. All models in the ablation studies use class weights in the loss function.

Model           | Accuracy | \Delta_{accuracy} | F1    | \Delta_{F1}
Full Model      | 0.723    | --                | 0.733 | --
1-layer GCN     | 0.623    | 0.100             | 0.654 | 0.079
3-layer GCN     | 0.457    | 0.266             | 0.530 | 0.203
No edge weights | 0.674    | 0.049             | 0.692 | 0.041

Discussion of Results

Classification

From the examples shown, we can see that when neutral tweets are misclassified as criticism or hostility, there is often genuine negativity expressed in the tweet, just not directed at an East Asian entity. Furthermore, the creators of the dataset also acknowledge known annotator errors. In my opinion, these examples also show the difficulty the model has in identifying the nuanced differences between criticism and hostility. For example, the fourth example tweet contains words such as “communist” which the model might have learned to associate with hostile tweets, hence the misclassification. Ultimately, it seems these fine-grained distinctions are too hard for the model, as seen by the low recall and precision on the non-neutral classes.

Ablation Studies

In regards to the ablation studies, we can see that decreasing the number of GCN layers to one leads to a modest drop in performance, while increasing the number of GCN layers to three causes a much larger drop. Let’s look at the toy example below to investigate possible reasons.

Toy example with three document nodes and three word nodes. Edge thickness represents edge weight. Blue edges are word-word edges weighted by PMI; red edges are document-word edges weighted by TF-IDF.

With one layer, each document node only receives information from its immediate neighbors, which by construction are the words appearing directly in the document. Tweet one only receives the one-hot representation vector for “China”, and tweet two receives the one-hot vectors for “China” and “disease”. Thus, document nodes cannot see information from words outside the document (for example, other negative words that may indicate hostility towards an East Asian entity). In this case, tweet one does not receive information about “outbreak” and “disease”, which might help the model. A document also cannot see other documents that have strong relations with its own words (tweet one cannot see the strong relationship between tweet two and “China”). Thus, with one layer, important information may be left out.

With two GCN layers, document nodes can receive information about other words in the corpus that are strongly related to words in the current document (tweet one can see how “disease” and “outbreak” relate to “China”, since that information is passed along during the second GCN layer). Likewise, a document node can also see other documents containing words similar to its own.

On the other hand, with three GCN layers, document nodes can receive information about other nodes via multiple different paths, as loops can form. For example, tweet one receives information from the “disease” node via the “China” node after two GCN layers, but it receives information from the “disease” node again after three layers, via “China” and the tweet two node. The implicit mixing of different edge types (document-word and word-word) and different paths once we have more than two layers appears to add too much noise to the document representations for the model to learn well.

Additionally, we see that removing edge weights (setting each edge weight to 1) also decreases performance slightly. This indicates that the strength of word co-occurrence (captured by PMI) and TF-IDF are useful guides for the model, on top of the mere existence of such relationships.

Comparison with other Natural Language Processing Models

Compared to the reported performance of BERT-based models in Detecting East Asian Prejudice on Social Media, \textsc{TextGCN} performs considerably worse. It also performs slightly worse than the LSTM-based model.

This may be because \textsc{TextGCN} does not directly learn the context of words in a tweet; rather, the context is hand-built (using the sliding windows and the text graph). Moreover, the GCN model does not use any pretraining, which BERT-based models rely on. It also uses fewer parameters than the BERT models: our implementation of \textsc{TextGCN} for this dataset uses roughly 5M parameters, while the smallest BERT model (ALBERT-base) uses 12M. Keep in mind, however, that the number of parameters for the GCN scales with the dimensions of the feature matrix. Each weight matrix has dimension D_{in} \times D_{out}, and if we use one-hot embeddings as we did previously, D_{in} grows with the number of nodes in the network, which in turn grows with the size of our corpus.
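As a rough sanity check on that figure (assuming one-hot inputs and the hidden dimension of 200 from our setup), the first layer's weight matrix alone accounts for

\begin{equation*} D_{in} \times D_{out} = 26{,}057 \times 200 \approx 5.2\text{M} \end{equation*}

parameters, which is essentially the whole model and shows why the parameter count grows with the corpus.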

Performance of different NLP models on the Twitter dataset, as reported by the dataset creators.

Nevertheless, despite \textsc{TextGCN} not being state of the art, it might be useful for smaller datasets and when computing resources are hard to come by.

Final Thoughts

We have shown an application of \textsc{TextGCN} on an interesting real-world dataset. We explored cases in which the model makes errors and compared it with the reported results of other state-of-the-art baselines. Additionally, using this new dataset, we were able to better understand and reason about which parts of the model matter most to performance. Overall, this has been a fun experience applying what I have learned in natural language processing and graph deep learning. Feel free to download the code and datasets and try out the model for yourself!
