Community Prediction with NLP

Text Classification Algorithms Using Reddit Data

Authors: Victor De Lima (vad49), Matt Moriarty (mdm341)

Affiliation: Georgetown University, M.S. Data Science and Analytics

Published: May 5, 2023

All code used in this report is publicly available on GitHub.

Abstract

In this study, we use Natural Language Processing (NLP) to determine whether we can accurately classify Reddit posts into their respective subreddits by analyzing the language contained in each post’s text. We conducted the research by employing Term Frequency methods and supervised learning algorithms. Our findings show that the language individuals use when participating in discussions within a community context provides sufficient information for models to make excellent predictions. We also detail model construction and explore the implications of the results.

Keywords: Natural language processing, community detection, random forests, neural networks

1 Introduction

Natural language is one of the purest and most complex forms of data in the world. Within each of the numerous written languages across the globe are even more intricacies, such as grammatical structures, idioms, and slang terms. These components combine to form a complex yet profoundly intriguing data structure known to us all.

A popular place for such natural language is Reddit. Here, registered users submit content such as links, text posts, images, and videos, which other members then vote up or down. Subreddits are smaller communities within Reddit that house content associated with a specific topic. Since each subreddit’s subject differs, it is intriguing to analyze whether users adopt distinctive language when participating in topic-specific discussions. Our goal is to develop models that take an anonymous Reddit post as input, analyze the natural language comprising that post, and correctly assign it to the subreddit to which it belongs.

Our analysis is motivated by the beneficial applications of Natural Language Processing (NLP) in the real world. For instance, email companies can use NLP to filter out spam emails for their users. Likewise, social networking companies can use NLP to identify commonalities between sub-communities, and marketing companies can use NLP to identify members of a target audience. Our analysis provides a glimpse into the power of NLP tasks and paves the way for more impactful and relevant applications of NLP in the real world.

3 Methodologies

3.1 About the Data

Our data set, obtained from Völske et al. (2017), contains nearly 19 GB of data, comprising almost four million Reddit posts. Alongside each post is a collection of information, such as the author of the post and the subreddit to which the post belongs. As a data reduction technique, we randomly sampled 40,000 posts from each of the five most popular subreddits: AskReddit, leagueoflegends, relationships, timu¹, and trees. This resulted in a total of 200,000 posts, yielding a 300 MB data set.
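As a rough sketch, this sampling step could look like the following, assuming the posts have been loaded into a pandas DataFrame; the file name and the subreddit and content column names are our assumptions, not the original pipeline:

```python
import pandas as pd

# Hypothetical input file and schema; the original corpus is distributed differently.
posts = pd.read_json("reddit_posts.jsonl", lines=True)

TOP_SUBREDDITS = ["AskReddit", "leagueoflegends", "relationships", "timu", "trees"]

# Keep the five most popular subreddits, then draw 40,000 posts from each.
sample = (
    posts[posts["subreddit"].isin(TOP_SUBREDDITS)]
    .groupby("subreddit")
    .sample(n=40_000, random_state=42)
    .reset_index(drop=True)
)
```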

Since the Reddit posts are raw and untouched, we applied a series of cleaning steps to the content of each post, including removing punctuation and special characters and performing stemming and lemmatization. Having cleaned the textual data comprising each Reddit post, we obtained a vocabulary of over 245,000 unique words used across all posts. As a variable selection technique, we therefore kept only the 5,000 most common words. In doing so, we efficiently remove almost 98% of unique words while maintaining nearly 95% of the total words used (see Figure 1 below).
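A minimal sketch of this cleaning and vocabulary-reduction step is shown below, reusing the sample DataFrame from the previous sketch and assuming NLTK for stemming and lemmatization; the exact order of operations is our assumption:

```python
import re
from collections import Counter

from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean(text: str) -> list[str]:
    """Lowercase, strip punctuation and special characters, then lemmatize and stem."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [stemmer.stem(lemmatizer.lemmatize(tok)) for tok in text.split()]

# Count word frequencies across all cleaned posts, then keep the 5,000 most common.
counts = Counter(tok for post in sample["content"] for tok in clean(post))
vocabulary = {word for word, _ in counts.most_common(5_000)}
```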

Figure 1: This figure visualizes the elbow method for reducing vocabulary size. The dashed lines intersect at the threshold for which vocabulary words are kept or removed. Keeping a small portion (5,000) of the most frequent vocabulary words allows for retention of a large portion (95%) of the total words used.

3.2 Binary Classification

First, we explore binary classification. The logistic regression model estimates the probability that the dependent variable Y belongs to a particular category, rather than modeling the response directly as in linear regression. The model is fit by maximum likelihood, which estimates the beta coefficients that maximize the likelihood function (James et al. 2013). We use the scikit-learn implementation of logistic regression (Pedregosa et al. 2012), in which we predict the probability \(\hat{p}(X_i) = \frac{1}{1 + \exp(-X_i w - w_0)}\). We include a Lasso penalty term, yielding the following loss function:

\[ \min_{w} \; C \sum_{i=1}^{n} \left( -y_i \log \hat{p}(X_i) - (1 - y_i) \log\left(1 - \hat{p}(X_i)\right) \right) + \|w\|_1 \tag{1}\]
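In scikit-learn, the model in Equation 1 corresponds to an L1-penalized LogisticRegression. A minimal sketch follows; the term-frequency features and the value of C are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L1 ("Lasso") penalty as in Equation 1; the liblinear solver supports it.
# C is the inverse regularization strength: smaller C means a stronger penalty.
binary_model = make_pipeline(
    CountVectorizer(max_features=5_000),  # term-frequency features
    LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
)

# texts: list of raw post strings; labels: binary subreddit indicator.
# binary_model.fit(texts, labels)
```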

3.3 Multi-Class Classification

3.3.1 Baseline Model

Our first multi-class classification method acts as a baseline against which to compare our other models. This model simply takes a random guess as to which subreddit a post belongs. As such, we can expect this model to be around 20% accurate, correctly classifying approximately one out of five posts. Note that, given its simplicity, this model does not require a training algorithm; it merely makes a random classification for each post in the testing set.
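Such a baseline can be expressed with scikit-learn's DummyClassifier, for example (a sketch, not necessarily how we implemented it):

```python
from sklearn.dummy import DummyClassifier

# Uniform random guessing over the five subreddits; no training signal is used.
baseline = DummyClassifier(strategy="uniform", random_state=0)
# baseline.fit(X_train, y_train)     # "fitting" only records the class labels
# y_pred = baseline.predict(X_test)  # expected accuracy ~ 0.20 with five classes
```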

3.3.2 Tree Models

Our next pair of models consists of a Decision Tree and a Random Forest. A Decision Tree is much like it sounds: a tree that makes successive decisions in order to categorize data. Starting from its “root,” the Decision Tree splits the data based on certain conditions to create the purest splits possible. A Random Forest is an ensemble of simple trees known as weak learners. When constructing the Random Forest, each tree considers only a subset of the variables. This procedure decorrelates the trees, making the resulting trees’ average less variable and more reliable.

Since these models are more complex, we implemented a training algorithm to perform hyperparameter tuning. Splitting the data into training, validation, and testing sets, we used 5-fold cross-validation across a hyperparameter grid search to determine the optimal hyperparameters for each model. Using categorical cross-entropy loss as the key performance metric, we optimized our hyperparameter selection by identifying the configuration such that the validation loss is minimal with little to no overfitting (James et al. 2013). We provide an example of this hyperparameter selection process used across all the models in Figure 2.

(a) Decision Tree Training and Validation Loss

(b) Random Forest Training and Validation Loss

Figure 2: This figure shows the training and validation losses of (a) Decision Tree models and (b) Random Forest models during a 5-fold cross-validation hyperparameter search. As the tree models increase in depth, they begin to overfit to the training data, especially in the case of the Decision Tree.
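A sketch of this tuning procedure for the Decision Tree, using scikit-learn's GridSearchCV, is shown below; the depth grid is illustrative, and the same pattern applies to the Random Forest (e.g., adding n_estimators to the grid):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, 10, 12]},  # illustrative grid
    scoring="neg_log_loss",   # categorical cross-entropy, negated so higher is better
    cv=5,                     # 5-fold cross-validation
    return_train_score=True,  # compare training vs. validation loss for overfitting
)
# search.fit(X_train, y_train)
# best_depth = search.best_params_["max_depth"]
```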

3.3.3 Neural Network

In a basic Neural Network, the model feeds a vector of variables, referred to as the “input layer,” into a second layer that performs non-linear transformations of the data. This process may be repeated for additional layers until reaching the final output layer, which provides the predicted values. The network is optimized iteratively using gradient descent and backpropagation (James et al. 2013).

Since this model is very complex, we again implemented a training algorithm to perform hyperparameter tuning. We implemented a deep feed-forward Neural Network using PyTorch and used a grid search to identify the optimal hyperparameters for the model. As with our tree models, we used a validation set and 5-fold cross-validation, along with categorical cross-entropy loss, to select these hyperparameters.

4 Experiments and Results

4.1 Binary Classification

Before diving into multi-class classification, we began with binary classification. We ran the logistic regression model using a regularization lambda of 0.001 and five-fold Cross-Validation (CV) on the AskReddit vs. timu and leagueoflegends vs. trees pairs. Figure 3 shows the model results. From this model, we can see that the classification performance on posts belonging to AskReddit is lower (70.2%) than on timu (84.2%) or leagueoflegends (97.4%). We expected these results: AskReddit is a more general category, while leagueoflegends is more niche, and its community likely has a more topic-specific vocabulary that the model picks up.

(a) AskReddit vs. timu

(b) leagueoflegends vs. trees

Figure 3: This figure shows the results of a Logistic Regression model performing binary classification on two pairs of subreddits: (a) AskReddit vs. timu and (b) leagueoflegends vs. trees. Each cell in the confusion matrices expresses the model’s classifications as a proportion of classifications in that row. For instance, of all Reddit posts belonging to the AskReddit subreddit, the Logistic Regression model classifies 70.2% of them correctly as belonging to the AskReddit subreddit and 29.8% of them incorrectly as belonging to the timu subreddit.

Although this is relatively good performance, many real-world scenarios are not binary and call for more complex models.

4.2 Multi-Class Classification

We train each of our multi-class classification models with optimally chosen hyperparameters on a large training set. We then assess each implementation by evaluating performance metrics on a held-out test set: the overall accuracy on the classification task, weighted precision and recall measures, and the area under the Receiver Operating Characteristic (ROC) curve. The following sections detail the implementation of each model.
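These metrics can be computed with scikit-learn as in the sketch below; how the multi-class ROC AUC is aggregated (we show one-vs-rest with weighted averaging) is an illustrative choice:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)

def evaluate(y_true, y_pred, y_proba):
    """Held-out test metrics of the kind reported for every model in Table 1."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "weighted_precision": precision_score(y_true, y_pred, average="weighted"),
        "weighted_recall": recall_score(y_true, y_pred, average="weighted"),
        # One-vs-rest AUC over the five classes; y_proba has one column per class.
        "roc_auc": roc_auc_score(y_true, y_proba, multi_class="ovr",
                                 average="weighted"),
    }
```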

4.2.1 Baseline Model

Our baseline model provides us with a reference point to which we can compare our more complex models in order to evaluate their performance. In Figure 4, we can see a confusion matrix representing the performance of the baseline model on a held-out test set containing 40,000 Reddit posts. It is evident that this model performs poorly at the multi-class classification task, correctly classifying Reddit posts only approximately 20% of the time. The consistency of color across the heat map, rather than a striking diagonal line, is an indication that this model does not classify Reddit posts very well.

Figure 4: This figure shows the results of a baseline model performing multi-class classification on five subreddits. Each cell in the confusion matrix expresses the model’s classifications as a proportion of classifications in that row. For instance, of all Reddit posts belonging to the timu subreddit, the baseline model classifies 20.09% of them correctly, 18.98% as belonging to the leagueoflegends subreddit, and so on.

4.2.2 Tree Models

Our tree models, which provide a level of complexity not present in our baseline model, perform well on the held-out test set.

We implemented our Decision Tree using 5-fold CV and a maximum tree depth of 8. We find that the model achieves over 60% accuracy when classifying Reddit posts in the test set. However, as indicated by the presence of color in the first column, this model classifies a large share of test-set posts as belonging to the AskReddit subreddit. As a result, the impressive-looking 79.35% classification accuracy on posts belonging to this subreddit is diminished by how frequently the model predicts that class. Results are shown in Figure 5 (a).

We implemented our Random Forest model in a similar manner, using 5-fold CV and a maximum tree depth of 10. We find that the model achieves over 70% classification accuracy when classifying Reddit posts in the test set. This model shows a weaker tendency than the Decision Tree to over-classify posts as belonging to the AskReddit subreddit. These results are indicative of a stronger model fit. Results are shown in Figure 5 (b).

(a) Decision Tree Results

(b) Random Forest Results

Figure 5: This figure shows the results of (a) a Decision Tree model and (b) a Random Forest model performing multi-class classification on five subreddits. Each cell in the confusion matrices expresses the model’s classifications as a proportion of classifications in that row. For instance, of all Reddit posts belonging to the leagueoflegends subreddit, the Decision Tree model classifies 70.31% correctly, 28.88% as belonging to the AskReddit subreddit, and so on.

The Random Forest, being an ensemble model, also provides us with the most important features of a post across the weak learners in the ensemble, which can be seen in Figure 6. Here, we find that words like ‘relationship’, ‘smoke’, and ‘game’ are important to the model as it aims to classify a post as belonging to a particular subreddit. Understanding the focus of each of the five subreddits in our analysis, we can interpret these words as the strongest indicators that a post belongs to the relationships, trees, or leagueoflegends subreddit, respectively.
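A sketch of how these importances can be extracted is below, where forest is a fitted RandomForestClassifier and feature_names holds the 5,000 vocabulary words (both names are assumptions for illustration):

```python
import numpy as np

# forest: a fitted sklearn RandomForestClassifier
# feature_names: vocabulary words aligned with the model's input columns
importances = forest.feature_importances_
top10 = np.argsort(importances)[::-1][:10]  # indices of the ten largest importances
for idx in top10:
    print(f"{feature_names[idx]:<15s} {importances[idx]:.4f}")
```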

Overall, our two tree models show significant improvement over our baseline model, with some indications of overfitting to certain subreddits in particular, and also provide us with context about which specific words are considered more heavily in each model.

Figure 6: This figure shows the ten words identified as most important in the Random Forest model, arranged in order of decreasing importance. Words such as ‘relationship’, ‘smoke’, and ‘game’ are most indicative of posts belonging to a specific subreddit.

4.2.3 Neural Network

We implemented our Neural Network using a multi-layer perceptron with two hidden layers. The first layer has 128 neurons, and the second layer has 64. The model utilizes the ReLU activation function and the Adam optimizer. We regularized the model using dropout with a rate of 0.5 and Lasso with a regularization strength of 0.0001.
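A minimal PyTorch sketch of this architecture follows; details such as where dropout is applied and how the L1 (Lasso) penalty enters the loss are our assumptions, and the 5,000-unit input matches the term-frequency vocabulary:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(5_000, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 5),               # one logit per subreddit
)
criterion = nn.CrossEntropyLoss()   # categorical cross-entropy on the logits
optimizer = torch.optim.Adam(model.parameters())

def loss_with_l1(logits, targets, strength=1e-4):
    """Cross-entropy plus an explicit L1 penalty on all model parameters."""
    l1 = sum(p.abs().sum() for p in model.parameters())
    return criterion(logits, targets) + strength * l1
```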

It is evident from both Table 1 and Figure 7 that the Neural Network model outperforms all other models across all metrics. Achieving a classification accuracy of almost 85%, this model proves that it can effectively analyze the natural language within a Reddit post to assign it to the correct subreddit. This performance is consistent across both the training and testing sets due to regularization. In Figure 7 specifically, this effectiveness is highlighted by the sharp diagonal line in the confusion matrix, indicating strong classification accuracy, precision, and recall measures for the model.

Figure 7: This figure shows the results of a Neural Network model performing multi-class classification on five subreddits. Each cell in the confusion matrix expresses the model’s classifications as a proportion of classifications in that row. For instance, of all Reddit posts belonging to the relationships subreddit, the Neural Network model classifies 90.16% of them correctly, 2.48% as belonging to the timu subreddit, and so on.

Although the Neural Network outperforms the Decision Tree and Random Forest, the tree models’ vast improvement over the baseline is itself a commendable success.

Table 1: This table shows performance metrics across all models, indicating that the Neural Network is the top performer, followed by the Random Forest. The performance metrics include Accuracy, Weighted Precision, Weighted Recall, and ROC AUC.

Model Type          Accuracy   Weighted Precision   Weighted Recall   ROC AUC
Random Classifier   0.1981     0.1981               0.1981            0.4987
Decision Tree       0.6378     0.7334               0.6378            0.8533
Random Forest       0.7644     0.7697               0.7644            0.9333
Neural Network      0.8457     0.8465               0.8435            0.9712

5 Conclusions

This study shows that the text-based language people use in discussions as part of a community contains enough information to predict that association effectively. Using Term Frequency methods and supervised learning algorithms, our models draw on both the words used and their frequency of use. We find the best-performing model to be the feed-forward Neural Network described in Section 4, followed by the Random Forest, both significantly outperforming the random classifier. This research may prove valuable for companies that identify and appeal to individuals sharing common interests, such as a marketing or social media company seeking to target relevant ads to a specific demographic.

Future research may involve Recurrent Neural Networks, which have also been shown to be effective for text-based classification. Additionally, bi-grams rather than single words may be considered as tokens to remove some of the ambiguity that may be present (for example, the term ‘relationship’ may be more informative with accompanying words than on its own), as in the sketch below. Lastly, unsupervised learning would provide an intriguing extension, revealing whether clusters exist that do not directly correspond to particular subreddits.
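For instance, extending the tokenizer to bi-grams requires only a one-line change in a scikit-learn vectorizer (an illustrative sketch):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Unigrams plus bigrams, so 'relationship' can be distinguished from,
# e.g., 'relationship advice' or 'long relationship'.
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5_000)
# X = bigram_vectorizer.fit_transform(texts)
```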

References

Huang, Jin, and C. X. Ling. 2005. “Using AUC and Accuracy in Evaluating Learning Algorithms.” IEEE Transactions on Knowledge and Data Engineering 17 (3): 299–310. https://doi.org/10.1109/TKDE.2005.50.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 103. Springer Texts in Statistics. New York, NY: Springer New York. https://doi.org/10.1007/978-1-4614-7138-7.
Kowsari, Kamran, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura Barnes, and Donald Brown. 2019. “Text Classification Algorithms: A Survey.” Information 10 (4): 150. https://doi.org/10.3390/info10040150.
LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. 2015. “Deep Learning.” Nature 521 (7553): 436–44. https://doi.org/10.1038/nature14539.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2012. “Scikit-Learn: Machine Learning in Python.” https://doi.org/10.48550/ARXIV.1201.0490.
Shah, Kanish, Henil Patel, Devanshi Sanghvi, and Manan Shah. 2020. “A Comparative Analysis of Logistic Regression, Random Forest and KNN Models for the Text Classification.” Augmented Human Research 5 (1): 12. https://doi.org/10.1007/s41133-020-00032-0.
Singh, Ksh. Nareshkumar, S. Dickeeta Devi, H. Mamata Devi, and Anjana Kakoti Mahanta. 2022. “A Novel Approach for Dimension Reduction Using Word Embedding: An Enhanced Text Classification Approach.” International Journal of Information Management Data Insights 2 (1): 100061. https://doi.org/10.1016/j.jjimei.2022.100061.
Vijayarani, S, and R Janani. 2016. “Text Mining: Open Source Tokenization Tools – An Analysis.” Advanced Computational Intelligence: An International Journal (ACII) 3 (1): 37–47. https://doi.org/10.5121/acii.2016.3104.
Völske, Michael, Martin Potthast, Shahbaz Syed, and Benno Stein. 2017. “TL;DR: Mining Reddit to Learn Automatic Summarization.” In Proceedings of the Workshop on New Frontiers in Summarization, 59–63. Copenhagen, Denmark: Association for Computational Linguistics. https://doi.org/10.18653/v1/W17-4508.
Xu, Baoxun, Xiufeng Guo, Yunming Ye, and Jiefeng Cheng. 2012. “An Improved Random Forest Classifier for Text Categorization.” Journal of Computers 7 (12): 2913–20. https://doi.org/10.4304/jcp.7.12.2913-2920.

Footnotes

  1. The name of this subreddit was adapted from the original tifu.↩︎