Rock The Upvote:
Predicting the Success of a Reddit Post
Neehar Peri, Wilson Orlando
Introduction
Ever heard of Reddit? Maybe you already have; it's the 6th most visited website in the U.S. If this is your first time hearing about it, Reddit is a "social news aggregation forum", which is the official way of saying it gathers content from all over the internet to display in one place. Users submit content to Reddit, like links, text posts, and images, and other users like ("upvote") or dislike ("downvote") that content. Posts are grouped into user-created boards called "subreddits" focused on one topic (like cute animals, world news, or Game of Thrones), and the most upvoted posts across all subreddits are shown on the site's front page. You can read more about it here.
With 330 million users and over 20 thousand daily submissions, it’s hard to predict which posts will become popular enough to hit the front page. If your predictions were good enough to craft viral Reddit posts, though, you could get your ideas out to those millions of users in a matter of hours. Many users don’t realize that Reddit can be (and already is being) manipulated to spread more than cute cats. Large financial service companies use Reddit to boost their online image, and just this month the Reddit team claimed leaked U.S.-U.K. trade documents posted on their site were part of a large-scale misinformation campaign originating from Russia. Beyond earning your own upvotes, then, understanding what makes a Reddit post popular is essential for informed online browsing.
Given the huge amount of data Reddit generates, it’s also an interesting problem for data science: based on existing Reddit posts, can we identify the most important features of a successful post and predict a post’s success before it’s submitted? In this tutorial, our goal is to download and tidy data from Reddit, perform some exploratory analysis to identify important features, then use machine learning models to predict a post’s upvote rating based on those features. For readers unfamiliar with Reddit, we hope our analysis will get you interested in the site and in some popular posts. For more experienced readers, we hope to give you some insight into how your favorite “social news aggregation forum” works, and show you how to get your own posts to the front page of the internet.
Getting Started with the Data
We use Python 3 along with several imported libraries, including torch, pandas, matplotlib, scikit-learn, and more.
from textblob import TextBlob
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import Lasso, LogisticRegression
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from tqdm import tqdm
from pprint import pprint
import numpy as np
import praw
import pdb
import operator
import warnings
warnings.filterwarnings("ignore")
Data Collection
To scrape data from Reddit, we'll be using PRAW (the Python Reddit API Wrapper). For this analysis, we chose to pull data into two datasets:
- Top Posts: The 1000 most upvoted posts in the last year
- Controversial Posts: The 1000 most controversial posts in the last year (i.e. closest to 50% upvote ratio)
# Establish connection using Reddit's API
clientID = 'VpVEXRwC5-nrbA'
clientSecret = 'XMql3JHOqhq2NatetFMFPkhBaCE'
userAgent = 'Reddit WebScraping'
reddit = praw.Reddit(client_id=clientID, client_secret=clientSecret, user_agent=userAgent)
# Retrieve the top (most upvoted) 1k posts in the last year and add them to a Pandas dataframe
topPosts = []
subReddit = reddit.subreddit('all')
for post in tqdm(subReddit.top(time_filter = 'year', limit=1000), total=1000):
topPosts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created_utc, post.author, post.is_self, post.over_18, post.spoiler, post.upvote_ratio])
topPosts = pd.DataFrame(topPosts,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created', 'author', 'is_self', 'over_18', 'spoiler', 'upvote_ratio'])
topPosts.to_csv("top1KPosts.csv", index=False)
# Retrieve the most controversial (closest to a 50% upvote ratio) 1k posts in the last year and add them to a Pandas dataframe
controversialPosts = []
subReddit = reddit.subreddit('all')
for post in tqdm(subReddit.controversial(time_filter = 'year', limit=1000), total=1000):
    controversialPosts.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created_utc, post.author, post.is_self, post.over_18, post.spoiler, post.upvote_ratio])
controversialPosts = pd.DataFrame(controversialPosts, columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created', 'author', 'is_self', 'over_18', 'spoiler', 'upvote_ratio'])
controversialPosts.to_csv("controversial1KPosts.csv", index=False)
top1KPosts = pd.read_csv("top1KPosts.csv")
controversial1KPosts = pd.read_csv("controversial1KPosts.csv")
To avoid calling the Reddit API thousands of times, we have saved the 1000 "Top" posts and the 1000 "Controversial" posts to CSV files and read them back into Pandas dataframes. For all future analysis, we will be reading and modifying these dataframes.
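If you're adapting this workflow, a small caching helper makes the pattern explicit: only call the API when the CSV cache doesn't already exist. Below is a minimal sketch of our own (the helper name load_or_scrape is hypothetical, not part of the original code):
import os
def load_or_scrape(filename, listing):
    # Read cached posts from the CSV if present; otherwise scrape them via PRAW and cache them
    if os.path.exists(filename):
        return pd.read_csv(filename)
    rows = [[post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, post.created_utc, post.author, post.is_self, post.over_18, post.spoiler, post.upvote_ratio] for post in listing]
    frame = pd.DataFrame(rows, columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created', 'author', 'is_self', 'over_18', 'spoiler', 'upvote_ratio'])
    frame.to_csv(filename, index=False)
    return frame
top1KPosts = load_or_scrape("top1KPosts.csv", reddit.subreddit('all').top(time_filter='year', limit=1000))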
Next, let's take a look at some of the subreddits that have broken the top 1000 and controversial 1000 posts:
# First 10 unique subreddits among posts that broke the top 1k
topSubReddit = top1KPosts.subreddit.unique()
topSubReddit[0:10]
# First 10 unique subreddits among posts that broke the most controversial 1k
controversialSubReddits = controversial1KPosts.subreddit.unique()
controversialSubReddits[0:10]
Our sampling of subreddits in the top 1000 is pretty predictable; r/pics, r/aww, and r/AskReddit are some of the biggest subreddits on the site ("r/" denotes a subreddit's name), with AskReddit hosting over 25 million subscribed users. This would indicate that bigger subreddits are more likely to break the top 1000, so a post's subreddit should be a very important feature for its success.
The controversial sample of subreddits shows a few different trends. Of the 74 subreddits that have broken the controversial 1000, many (like r/unpopularopinion and r/TrueOffMyChest) are communities for people to vent intentionally controversial ideas to other people. There's also a sizeable chunk of political subreddits, like r/Libertarian and r/Conservative, which also host a lot of controversial opinions (as is the nature of partisan politics). Surprisingly, though, the vast majority of controversial subreddits have to do with popular entertainment, like r/DestinyTheGame, r/leagueoflegends, and r/FortNiteBR. Part of this is sampling bias; there are inherently a lot of subreddits dedicated to popular entertainment on Reddit, so it makes sense that they'd make up a significant percentage of the controversial dataset. Interestingly, though, many of them reflect legitimate controversies in the past year! r/gameofthrones is present, after many fans were disappointed with the show's final season, and r/Blizzard makes an appearance after the gaming company punished prominent players for supporting the Hong Kong protests.
For our analysis, then, we'd recommend posting to large generic subreddits and avoiding specific potentially-controversial ones, hypothesizing that a post's subreddit is an important feature in its success.
Data Preprocessing
Next, we'll have to clean up the data from PRAW to get it into a more usable format. We'll be combining the top and controversial posts into a single Pandas dataframe, adding a field to differentiate between them, and dropping some fields that could not be controlled by a user at the time of post submission (like url and reddit's internal post id).
for tableRow in top1KPosts.iterrows():
top1KPosts.at[tableRow[0], "post_type"] = 1 #Set topPost as post_type 1
for tableRow in controversial1KPosts.iterrows():
controversial1KPosts.at[tableRow[0], "post_type"] = 0 #Set controversialPost as post_type 0
dataSet = pd.concat([top1KPosts, controversial1KPosts])
dataSet = dataSet.reset_index()
dataSet = dataSet.drop(["index", "id", "url", "created", "author"], axis=1) #Drop Features that cannot be controlled by the user
dataSet.head()
From here, we'll conduct all analysis on this master dataset. Although metrics such as number of comments, post type, and upvote ratio cannot be controlled at the time of submission, we'll be keeping them as potential target metrics for later classification and regression. We still need to perform some data cleaning to make these features usable in machine learning models:
uniqueSubReddits = {"subReddit" : []} #Get unique subReddits.
for tableRow in dataSet.iterrows(): #Iterate through all rows in data set.
title = tableRow[1]["title"]
subReddit = tableRow[1]["subreddit"]
body = tableRow[1]["body"]
originalContent = tableRow[1]["is_self"]
nsfw = tableRow[1]["over_18"]
spoiler = tableRow[1]["spoiler"]
titleBlob = TextBlob(title)
lenTitle = len(title)
titleSentiment = titleBlob.sentiment.polarity #Sentiment score from [-1, 1] -1 -> Negative, 1-> Positive
titleSubjectivity = titleBlob.sentiment.subjectivity #Opinion score from [0, 1] 0 -> Factual, 1 -> Opinion
titleQuestion = 1 if "?" in title else 0 #Is the title a question?
if subReddit not in uniqueSubReddits["subReddit"]:
uniqueSubReddits["subReddit"].append(subReddit)
lenBody = 0
try:
if np.isnan(body): #Setting empty body elements to empty strings to homogenize the data type within the column
body = ""
lenBody = 0
except: # Body is not NAN (Throws Exception)
lenBody = len(body)
#Set cleaned values and additional features
dataSet.at[tableRow[0], "len_title"] = lenTitle
dataSet.at[tableRow[0], "title_question"] = titleQuestion
dataSet.at[tableRow[0], "title_sentiment"] = titleSentiment
dataSet.at[tableRow[0], "title_subjectivity"] = titleSubjectivity
dataSet.at[tableRow[0], "body"] = body
dataSet.at[tableRow[0], "len_body"] = lenBody
#Convert True -> 1 and False -> 0
dataSet.at[tableRow[0], "is_oc"] = 1 if originalContent else 0
dataSet.at[tableRow[0], "is_nsfw"] = 1 if nsfw else 0
dataSet.at[tableRow[0], "is_spoiler"] = 1 if spoiler else 0
dataSet = dataSet.drop(["is_self", "over_18", "spoiler",], axis=1)
subRedditLookUp = pd.DataFrame(uniqueSubReddits) #Create table to iterate through all sub-reddits
subRedditOneHotEncoding = pd.get_dummies(subRedditLookUp) #One hot encoding of categorical feature for input into model
dataSet.head()
The Reddit API makes data cleaning fairly straightforward. There are very few missing values, and all elements in the table are uniformly formatted, making it easier to specify rules to clean them. In addition to the features we are given through the API, we decided to calculate a few other metrics that might be useful in determining the quality of a post. Specifically, we also look at the sentiment of the title, the subjectivity of the title, the length of the title, and the length of the body. For the sake of simplicity, we do not consider the raw text of a post, to avoid dealing with links, subreddit-specific acronyms, and internet slang in general. Using these hand-crafted features, we will attempt both to regress the upvote ratio and to classify whether a post is a top post or a controversial post. Note that we choose to regress the upvote ratio rather than the total score because scores span a very large range of values, which makes them difficult to regress accurately.
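To make the TextBlob scores concrete, here is what they look like on a made-up title (the example string is ours, not a post from the dataset):
exampleTitle = TextBlob("My cat did something adorable, is this the best pet ever?")
print(exampleTitle.sentiment.polarity) #Sentiment in [-1, 1]: -1 -> Negative wording, 1 -> Positive wording
print(exampleTitle.sentiment.subjectivity) #Subjectivity in [0, 1]: 0 -> Factual, 1 -> Opinion
print(1 if "?" in exampleTitle.raw else 0) #Our title_question flag would be 1 for this title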
Now that our cleaning process is complete, it may help to define exactly what each of our features represents:
- title: The title of the post
- score: The post's score on Reddit (roughly upvotes minus downvotes)
- subreddit: The subreddit this post was made on
- num_comments: The number of comments made on this post
- body: The text written in the post's body, if any
- upvote_ratio: The percentage of upvotes to total votes (upvotes + downvotes) on the post.
- post_type: Whether this post came from our top (1) or controversial (0) dataset.
- len_title: The number of characters in the post's title
- title_question: Whether or not the post's title contains the '?' character.
- title_sentiment and title_subjectivity: Sentiment and subjectivity ratings of the title, as given by TextBlob (https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis).
- len_body: Number of characters in the post's body text
- is_oc: 1 if the post is an original text post, 0 otherwise (i.e. it links to an external website with no body text)
- is_nsfw: 1 if the post is "nsfw" (inappropriate for some audiences), 0 otherwise
- is_spoiler: 1 if the post is tagged as a spoiler, 0 otherwise
Exploratory Data Analysis
Now that our data has been assembled, we'd like to take a closer look at the kinds of subreddits in each category. If you're making a post, what kind of subreddit should you post on to ensure its success?
To answer this question, let's take a look at the subreddits that most frequently broke the top 1000 posts in the last year. The following plot shows the subreddits which had at least 10 posts in the top 1000 in the last year:
# Plot subreddits that most frequently broke top 1000
subs = {}
for tableRow in dataSet.iterrows():
# Tally how many times each subreddit appears in dataset
if (tableRow[1]["post_type"] == 1): # Filter only top posts
subreddit = tableRow[1]["subreddit"]
if subreddit in subs:
subs[subreddit] = subs[subreddit] + 1
else:
subs[subreddit] = 1
top_subs = subs
# Filter out subs that broke top 1000 less than 10 times
consistent_subs = {}
for sub in subs:
if subs[sub] >= 10:
consistent_subs[sub] = subs[sub]
# Create bar chart
plt.bar(range(len(consistent_subs)), list(consistent_subs.values()), align='center')
plt.xticks(range(len(consistent_subs)), list(consistent_subs.keys()), rotation='vertical')
plt.title("Most Frequent Subreddits in Top 1000")
plt.ylabel("Number of Posts")
plt.xlabel("Subreddit Name")
plt.show()
As expected, all of these are large, popular subreddits. All of them (except r/Showerthoughts) primarily host links to external sites, like images, gifs, or videos. More notably, all of them have broad, generic themes, suggesting that our earlier hypothesis that more specific subreddits tend to be more controversial has some merit. Next, then, let's plot the subreddits that broke the most controversial 1000 at least 10 times:
# Plot subreddits that most frequently broke controversial 1000
subs = {}
for tableRow in dataSet.iterrows():
# Tally how many times each subreddit appears in dataset
if (tableRow[1]["post_type"] == 0): # Filter only controversial posts
subreddit = tableRow[1]["subreddit"]
if subreddit in subs:
subs[subreddit] = subs[subreddit] + 1
else:
subs[subreddit] = 1
cont_subs = subs
# Filter out subs that broke controversial 1000 less than 10 times
consistent_subs = {}
for sub in subs:
if subs[sub] >= 10:
consistent_subs[sub] = subs[sub]
# Create bar chart
plt.bar(range(len(consistent_subs)), list(consistent_subs.values()), align='center')
plt.xticks(range(len(consistent_subs)), list(consistent_subs.keys()), rotation='vertical')
plt.title("Most Frequent Subreddits in Controversial 1000")
plt.ylabel("Number of Posts")
plt.xlabel("Subreddit Name")
plt.show()
As expected, there's a heavy presence of political subreddits like r/Sino, r/politics, r/Conservative, and r/Libertarian, as well as a popular entertainment subreddit (r/leagueoflegends). Unfortunately, it looks like there's some overlap with our top 1000 subreddits; r/IAmA and r/videos are popular subreddits, so subscriber count may be a poor predictor. As a follow-up, let's try this: what percentage of subreddits appear in either the top 1000 or the controversial 1000, but not both?
# Count how many subreddits appear in either top or controversial, not both
unique_subs = {}
subs_count = len(top_subs)
for key in top_subs:
if key not in cont_subs:
unique_subs[key] = top_subs[key]
for key in cont_subs:
if key not in top_subs:
subs_count = subs_count + 1
unique_subs[key] = cont_subs[key]
len(unique_subs)/subs_count
Aha! This is good; 92% of our subreddits are unique to either the top or controversial lists, so only 8% of subreddits appear on both. Even though subscriber count isn't a good metric, then, it looks like subreddit still is! We'll expect our machine learning models to count subreddit as an important feature, and they should perform decently well.
Hot or Not?
From here, we'd like to put our hypotheses to the test. Are our features good enough to predict a post's success at the time of submission?
Linear Regression
First, we're going to shoot for the moon: let's try predicting the exact upvote ratio that a post will achieve based on its initial conditions. This is a regression problem, so let's try fitting our data to a linear regression model with "upvote_ratio" as our labels.
inputRegressionData = []
outputRegressionData = []
#Creating examples with all features and the corresponding label
for tableRow in dataSet.iterrows():
subReddit = tableRow[1]["subreddit"]
SR_oneHotEncoding = subRedditOneHotEncoding["subReddit_" + subReddit].to_list()
originalContent = tableRow[1]["is_oc"]
nsfw = tableRow[1]["is_nsfw"]
spoiler = tableRow[1]["is_spoiler"]
lenTitle = tableRow[1]["len_title"]
titleQuestion = tableRow[1]["title_question"]
titleSentiment = tableRow[1]["title_sentiment"]
titleSubjectivity = tableRow[1]["title_subjectivity"]
lenBody = tableRow[1]["len_body"]
ratio = tableRow[1]["upvote_ratio"]
inputRegressionData.append(SR_oneHotEncoding + [titleQuestion, lenTitle, titleSentiment, titleSubjectivity, lenBody, originalContent, nsfw, spoiler])
outputRegressionData.append([ratio])
We use a Lasso model to fit a linear model to the data: its L1 penalty shrinks most coefficients to zero, which handles our sparse, mostly one-hot feature vectors better than the traditional least squares method.
trainInput, testInput, trainOutput, testOutput = train_test_split(inputRegressionData, outputRegressionData, test_size=0.1) #Split data
ratioRegression = Lasso().fit(np.array(trainInput), np.array(trainOutput)) # Works better with sparse data
predictedOutput = ratioRegression.predict(np.array(testInput)) #Test the fitted model against unseen data
modelPerformance = []
testOutput = [i[0] for i in testOutput]
predictedOutput = predictedOutput.tolist()
for i, output in enumerate(zip(testOutput, predictedOutput)):
target, prediction = output
modelPerformance.append({"Example" : i, "True Ratio" : target , "Predicted Ratio" : prediction, "Residual" : target - prediction})
modelPerformance = pd.DataFrame(modelPerformance)
MSE = metrics.mean_squared_error(testOutput, predictedOutput) #How close are the predictions to the actual values?
averageResidual = np.mean(modelPerformance["Residual"]) #Gives an idea if the model is consistently under predicting or overpredicting, indicating a poor fit
modelPerformance.head()
print("Mean Squared Error: " + str(MSE))
print("Average Residual: " + str(averageResidual))
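Since Lasso zeroes out most coefficients, it can be informative to peek at which features survive. Below is a minimal sketch of our own (not part of the original analysis); note that position j of the one-hot encoding corresponds to row j of subRedditLookUp, and with the default regularization strength few or no coefficients may remain non-zero:
subRedditNames = ["subReddit_" + name for name in subRedditLookUp["subReddit"]] #One-hot slot j corresponds to row j of subRedditLookUp
featureNames = subRedditNames + ["title_question", "len_title", "title_sentiment", "title_subjectivity", "len_body", "is_oc", "is_nsfw", "is_spoiler"]
coefficients = np.ravel(ratioRegression.coef_)
survivingFeatures = [(name, coefficient) for name, coefficient in zip(featureNames, coefficients) if coefficient != 0]
pprint(sorted(survivingFeatures, key=lambda pair: abs(pair[1]), reverse=True)[:10]) #Ten largest surviving coefficients, if any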
Directly regressing the ratio is extremely difficult since almost all ratios on Reddit fall between 0.5 and 1. In practice, the linear model minimizes its loss by predicting roughly 0.75 for every example, so our data isn't fit well by a standard linear regression.
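To see just how narrow the target range is, here is a quick sanity-check plot of the upvote ratios in our dataset (our own addition, not part of the original pipeline):
plt.hist(dataSet["upvote_ratio"], bins=20) #Almost all posts fall between 0.5 and 1.0
plt.title("Distribution of Upvote Ratios")
plt.xlabel("Upvote Ratio")
plt.ylabel("Number of Posts")
plt.show()
plt.clf()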
Given that this regression model performed badly, we'd like to pursue two different solutions:
- What if we made the problem easier by turning it into a classification problem?
- What if a more complex regression model, like a neural network, would perform better?
Logistic Regression
Let's tackle the classification problem first. If we try to categorize a post as either "top" or "controversial", instead of trying to regress its exact upvote ratio, we may see more success.
inputClassificationData = []
outputClassificationData = []
for tableRow in dataSet.iterrows():
subReddit = tableRow[1]["subreddit"]
SR_oneHotEncoding = subRedditOneHotEncoding["subReddit_" + subReddit].to_list()
originalContent = tableRow[1]["is_oc"]
nsfw = tableRow[1]["is_nsfw"]
spoiler = tableRow[1]["is_spoiler"]
titleQuestion = tableRow[1]["title_question"]
lenTitle = tableRow[1]["len_title"]
titleSentiment = tableRow[1]["title_sentiment"]
titleSubjectivity = tableRow[1]["title_subjectivity"]
lenBody = tableRow[1]["len_body"]
post_type = tableRow[1]["post_type"]
inputClassificationData.append(SR_oneHotEncoding + [titleQuestion, lenTitle, titleSentiment, titleSubjectivity, lenBody, originalContent, nsfw, spoiler])
outputClassificationData.append([post_type])
trainInput, testInput, trainOutput, testOutput = train_test_split(inputClassificationData, outputClassificationData, test_size=0.1)
ratioRegression = LogisticRegression().fit(np.array(trainInput), np.array(trainOutput)) #Binary Classification task
predictedOutput = ratioRegression.predict(np.array(testInput)) # Test model on unseen data
modelPerformance = []
testOutput = [i[0] for i in testOutput]
predictedOutput = predictedOutput.tolist()
for i, output in enumerate(zip(testOutput, predictedOutput)):
target, prediction = output
modelPerformance.append({"Example" : i, "True Class" : target , "Predicted Class" : prediction})
modelPerformance = pd.DataFrame(modelPerformance)
modelPerformance.head()
Accuracy = metrics.accuracy_score(testOutput, predictedOutput)
print("Accuracy: " + str(Accuracy))
Our logistic regression model performs quite well on this limited dataset. Intuitively, it makes sense that sorting posts into two buckets is easier than regressing a continuous variable.
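Accuracy alone can hide which class the model struggles with, so it is worth a quick per-class breakdown. This is a small addition of our own, reusing testOutput and predictedOutput from the cell above:
print(metrics.confusion_matrix(testOutput, predictedOutput)) #Rows are true classes (controversial, top); columns are predicted classes
print(metrics.classification_report(testOutput, predictedOutput, target_names=["controversial", "top"])) #Per-class precision and recall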
Neural Network Regression
Now that we have a working logistic model, let's revisit our regression model from earlier. Can we improve it? Since linear regression seems to lack the required horsepower, let's go a little overboard and try a fully connected neural network. PyTorch (https://pytorch.org/) is great for building neural networks and training deep learning models.
trainInput, testInput, trainOutput, testOutput = train_test_split(inputRegressionData, outputRegressionData, test_size=0.1)
testOutput = [i[0] for i in testOutput]
class PredictRedditPost(nn.Module):
def __init__(self):
super(PredictRedditPost, self).__init__()
self.linearClassifier = nn.Sequential(nn.Linear(353, 64), nn.Dropout(0.5), nn.ReLU(),
nn.Linear(64, 16), nn.ReLU(),
nn.Linear(16, 1))
#Linear fully connected neural network with Dropout regularization to prevent overfitting and ReLU activations for added non-linearity
def forward(self, featureVector):
return self.linearClassifier(featureVector)
LR = 1e-3 #Learning Rate
WEIGHTDECAY = 0.0005 #L2 Penalty which forces smaller weights and simpler models
EPOCH = 10 #Number of iterations through entire dataset
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") #Train on GPU, because training on CPUs is for normies
model = PredictRedditPost()
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHTDECAY) #Gradient descent, of the Adam variety. Less fussy than SGD
MSE = nn.MSELoss() #Loss function minimizes mean squared error
TestingMSE = []
TrainingLoss = []
for STEP in range(1, EPOCH + 1):
epochLoss = 0
model.train() #Turn on Dropout
for batchCount, data in enumerate(zip(trainInput, trainOutput)):
handCraftedFeatures, ratio = data
handCraftedFeatures = torch.tensor(handCraftedFeatures) #Format input to play nice with PyTorch
ratio = torch.tensor(ratio)
handCraftedFeatures = handCraftedFeatures.to(device) #Send data to GPU
ratio = ratio.to(device)
predictedRatio = model(handCraftedFeatures) #Generate predictions
optimizer.zero_grad()
Loss = MSE(predictedRatio, ratio)
epochLoss = epochLoss + Loss.item()
Loss.backward() #Backpropagate loss through all layers of NN
optimizer.step()
    print("Epoch " + str(STEP) + " Loss: " + str(epochLoss / (batchCount + 1))) #Average training loss per example (batchCount is zero-indexed)
    TrainingLoss.append(epochLoss / (batchCount + 1))
model.eval() #Turn off Dropout
testInput = torch.tensor(testInput)
testInput = testInput.to(device)
predictedRatio = model(testInput) #Forward pass with testing samples
    predictedRatio = [float(i[0]) for i in predictedRatio.tolist()] #Keep predictions as floats; the upvote ratio is a continuous target
MSError = metrics.mean_squared_error(testOutput, predictedRatio) #Get MSE of testing samples
print("Epoch " + str(STEP) + " MSE: " + str(MSError) + "\n")
TestingMSE.append(MSError)
sns.lineplot(x=np.array([x for x in range(EPOCH)]), y=np.array(TrainingLoss)) #Plot training loss per epoch
plt.title("Loss Per Epoch")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()
plt.clf()
sns.lineplot(x=np.array([x for x in range(EPOCH)]), y=np.array(TestingMSE)) #Plot MSE on testing set per epoch.
plt.title("MSE Per Epoch")
plt.xlabel("Epoch #")
plt.ylabel("MSE")
plt.show()
plt.clf()
The plots above show that the model overfit to the training data: the training loss decreased toward 0, while the MSE on the test set plateaued. Due to the small sample size, the network is unable to precisely regress the correct output value. Throwing more data at this model would likely help it generalize better, since the upvote ratio is a continuous variable.
modelPerformance = []
for i, output in enumerate(zip(testOutput, predictedRatio)):
target, prediction = output
modelPerformance.append({"Example" : i, "True Ratio" : target , "Predicted Ratio" : prediction})
modelPerformance = pd.DataFrame(modelPerformance)
modelPerformance.head()
trainInput, testInput, trainOutput, testOutput = train_test_split(inputClassificationData, outputClassificationData, test_size=0.1)
testOutput = [i[0] for i in testOutput]
class PredictRedditPost(nn.Module):
def __init__(self):
super(PredictRedditPost, self).__init__()
self.linearClassifier = nn.Sequential(nn.Linear(353, 64), nn.Dropout(0.5), nn.ReLU(),
nn.Linear(64, 16), nn.ReLU(),
nn.Linear(16, 1))
self.sigmoid = nn.Sigmoid()
#Same neural network architecture as above, now new and improved with a sigmoid activation function
def forward(self, featureVector):
return self.sigmoid(self.linearClassifier(featureVector))
LR = 1e-3 #Learning Rate
WEIGHTDECAY=0.0005 #L2 Penalty which forces smaller weights and simpler models.
EPOCH = 10 #Number of iterations through the entire dataset
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = PredictRedditPost()
model.to(device)
optimizer = optim.Adam(model.parameters(), lr=LR, weight_decay=WEIGHTDECAY)
BCEL = nn.BCELoss() #Binary Cross Entropy Loss used for two-class classification tasks
#Same training loop as regression case
TestingAccuracy = []
TrainingLoss = []
for STEP in range(1, EPOCH + 1):
epochLoss = 0
model.train()
for batchCount, data in enumerate(zip(trainInput, trainOutput)):
handCraftedFeatures, postType = data
handCraftedFeatures = torch.tensor(handCraftedFeatures) #Formatting features to play nice with PyTorch
postType = torch.tensor(postType)
handCraftedFeatures = handCraftedFeatures.to(device) #Send to GPU
postType = postType.to(device)
predictedPostType = model(handCraftedFeatures)
optimizer.zero_grad()
Loss = BCEL(predictedPostType, postType)
epochLoss = epochLoss + Loss.item()
Loss.backward()
optimizer.step()
    print("Epoch " + str(STEP) + " Loss: " + str(epochLoss / (batchCount + 1)))
    TrainingLoss.append(epochLoss / (batchCount + 1))
model.eval()
testInput = torch.tensor(testInput)
testInput = testInput.to(device)
predictedPostType = torch.round(model(testInput))
predictedPostType = [int(i[0]) for i in predictedPostType.tolist()]
Accuracy = metrics.accuracy_score(testOutput, predictedPostType)
print("Epoch "+ str(STEP) + " Accuracy: " + str(Accuracy) + "\n")
TestingAccuracy.append(Accuracy)
sns.lineplot(x=np.array([x for x in range(EPOCH)]), y=np.array(TrainingLoss)) #Plot training loss per epoch
plt.title("Loss Per Epoch")
plt.xlabel("Epoch #")
plt.ylabel("Loss")
plt.show()
plt.clf()
sns.lineplot(x=np.array([x for x in range(EPOCH)]), y=np.array(TestingAccuracy)) #Plot classification accuracy per epoch
plt.title("Accuracy Per Epoch")
plt.xlabel("Epoch #")
plt.ylabel("Accuracy")
plt.show()
plt.clf()
Our classification neural network is able to outperform our simple logistic regression model by a few percent. This state-of-the-art result should be published in next year's NeurIPS conference! Despite the apparent performance increase, though, neural networks are far less interpretable than linear models. For the sake of explainability, it makes more sense to analyze the linear models.
The key takeaway from these experiments is that although many state-of-the-art methods use neural networks and deep learning, the size of the dataset plays an important role in deciding whether to stick with traditional machine learning approaches or to try deep learning models. Let's try to understand our features to see what really goes into making a great post. We'll generate a Pearson correlation matrix to visualize how strongly different features are correlated. For more information, read here. Here are the main takeaways:
- Two variables have a significant correlation if the absolute value of their coefficient is at least 0.5
- A variable is "important" for our model if it has a significant correlation with our label (in our case, upvote ratio, shown as "UR" in the last column)
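For reference, the coefficient in question is Pearson's $r$: the covariance of two features scaled by their standard deviations,
$$ r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$
so values near +1 or -1 indicate a strong linear relationship, while values near 0 indicate little to no linear relationship.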
As an important note, Pearson's correlation is only defined for numerical variables, so we can only use numerical and boolean features in this heatmap. That means we'll be omitting a categorical feature, subreddit, which we already know to be significant.
numericalFeatureSelection = []
# Get numerical and boolean features, and abbreviate their labels
for tableRow in dataSet.iterrows():
lenTitle = tableRow[1]["len_title"]
titleSentiment = tableRow[1]["title_sentiment"]
titleSubjectivity = tableRow[1]["title_subjectivity"]
    titleQuestion = tableRow[1]["title_question"]
originalContent = tableRow[1]["is_oc"]
nsfw = tableRow[1]["is_nsfw"]
spoiler = tableRow[1]["is_spoiler"]
lenBody = tableRow[1]["len_body"]
post_type = tableRow[1]["post_type"]
upvote_ratio = tableRow[1]["upvote_ratio"]
numericalFeatureSelection.append({"TQ": titleQuestion, "LT" : lenTitle, "TSEN" : titleSentiment, "TSUB" : titleSubjectivity,
"OC" : originalContent, "NSFW" : nsfw, "SP" : spoiler,
"LB" : lenBody, "PT" : post_type, "UR" : upvote_ratio})
featureSelection = pd.DataFrame(numericalFeatureSelection)
plt.figure(figsize=(16,10))
pearsonCorrelation = featureSelection.corr() # Pearson's correlation coeff.
sns.heatmap(pearsonCorrelation, annot=True, cmap=plt.cm.Reds) # Using Seaborn to generate a heatmap
plt.show()
#Look Up Table for Above Plot
pprint({"TQ": "Title Contains Question", "LT" : "Length of Title", "TSEN" : "Title Sentiment", "TSUB" : "Title Subjectivity",
"OC" : "Original Content", "NSFW" : "Not Safe for Work", "SP" : "Spoiler",
"LB" : "Length of Body", "PT" : "Post Type", "UR" : "Upvote Ratio"})
Throughout this tutorial, we have been trying to predict whether a post will join the ranks of other top posts or be condemned as a controversial post. The correlation matrix above shows that most numerical features of our dataset have little correlation with our targets, which is why they contributed so little to our linear models. Interestingly, both the length of the post body and the use of original content are negatively correlated with both the post type (PT) and upvote ratio (UR). In order to have a hot post, make sure to always repost material and never write well-thought-out content!
Using this information, we can test our logistic regression model using only len_body and is_oc to predict post_type to see if we can get the same level of performance:
inputClassificationData = []
outputClassificationData = []
for tableRow in dataSet.iterrows():
originalContent = tableRow[1]["is_oc"]
lenBody = tableRow[1]["len_body"]
post_type = tableRow[1]["post_type"]
inputClassificationData.append([lenBody, originalContent])
outputClassificationData.append([post_type])
trainInput, testInput, trainOutput, testOutput = train_test_split(inputClassificationData, outputClassificationData, test_size=0.1)
ratioRegression = LogisticRegression().fit(np.array(trainInput), np.array(trainOutput))
predictedOutput = ratioRegression.predict(np.array(testInput))
modelPerformance = []
testOutput = [i[0] for i in testOutput]
predictedOutput = predictedOutput.tolist()
Accuracy = metrics.accuracy_score(testOutput, predictedOutput)
print("Accuracy: " + str(Accuracy))
Although our logistic regression predictor is able to achieve better-than-random accuracy, the performance is still much lower than when we included the other features. A notable feature that we are missing is the subreddit itself.
inputClassificationData = []
outputClassificationData = []
for tableRow in dataSet.iterrows():
subReddit = tableRow[1]["subreddit"]
SR_oneHotEncoding = subRedditOneHotEncoding["subReddit_" + subReddit].to_list()
originalContent = tableRow[1]["is_oc"]
lenBody = tableRow[1]["len_body"]
post_type = tableRow[1]["post_type"]
inputClassificationData.append(SR_oneHotEncoding + [lenBody, originalContent])
outputClassificationData.append([post_type])
trainInput, testInput, trainOutput, testOutput = train_test_split(inputClassificationData, outputClassificationData, test_size=0.1)
ratioRegression = LogisticRegression().fit(np.array(trainInput), np.array(trainOutput))
predictedOutput = ratioRegression.predict(np.array(testInput))
modelPerformance = []
testOutput = [i[0] for i in testOutput]
predictedOutput = predictedOutput.tolist()
Accuracy = metrics.accuracy_score(testOutput, predictedOutput)
print("Accuracy: " + str(Accuracy))
With the subreddit included, our accuracy jumps from 73% to 91%! As suspected, the subreddit that you post in significantly impacts the likelihood of going viral on Reddit.
Conclusion
To review, we scraped thousands of Reddit's best and worst posts, processed their features, performed some exploratory subreddit analysis, and attempted to predict a new post's performance with four different machine learning models. In the end, we have a working logistic regression model that can predict whether a post is more likely to be "top" or "controversial" with over 90% accuracy! If you'd like to make your own viral Reddit post, here are some of our suggestions (a short sketch of scoring a draft post with our final model follows the list):
- Be persistent. Reddit has over 20 thousand daily submissions across all subreddits; it may take a few posts before people catch on to your content.
- Post on a large, generic-topic subreddit. The ones that most frequently broke the top 1000 posts were r/pics, r/aww, r/funny, r/gifs, and r/gaming.
- Post a link to an image or a video instead of original content. Posts with text had a significant negative correlation with upvote_ratio.
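Finally, if you want to sanity-check a draft post before submitting it, here is a minimal sketch of scoring it with our final logistic regression model (subreddit + body length + original content). The helper name score_draft_post and the example draft are hypothetical, and we assume the last ratioRegression model and subRedditOneHotEncoding are still in scope from the cells above:
def score_draft_post(model, subreddit, body, is_oc=0):
    #Build the same feature vector as the final model: subreddit one-hot encoding + body length + is_oc
    features = subRedditOneHotEncoding["subReddit_" + subreddit].to_list() + [len(body), is_oc]
    return model.predict_proba(np.array([features]))[0][1] #Probability of class 1 ("top")
print(score_draft_post(ratioRegression, "aww", body="", is_oc=0)) #Hypothetical link post (empty body) to r/aww, which must be one of the subreddits in our scraped data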