Textual Analysis Of Reddit to Predict Future Stock Returns

Code to Reproduce Analysis

1. Introduction 

The goal of our project is to understand the impact of social media posts on the future prices of individual stocks. We examined the impact of posts made to investment-focused subreddits (forums) on the social media platform reddit.com using textual analysis techniques.

Today, people conduct much of their social activity online using services like Twitter, Reddit, and other similar social media platforms. It has been shown that investing is largely a social activity and that conversations may lead to investment behavior (Shiller et al.). We focused on reddit.com because it has experienced substantial growth over the years and hosts several investment-focused subreddits. 50.3% of the value of U.S. stock is owned by households, and Reddit content is generated solely by individual household users (Rosenthal and Austin). To our knowledge, no prior study has used reddit.com as a source of information to predict stock movement. The rapid growth of Reddit's user base is shown below in Figure 1.

Fig. 1

2. Survey

There have been many attempts to utilize social media content to predict stock returns. Our study is closely related to those utilizing sources such as Twitter and traditional news outlets.

Others have found statistically significant results in predicting future stock returns using their social media data sources of choice (Chen et al.; Bollen et al.; Wijaya et al.; Zhang et al.; Makrehchi et al.; Godbole et al.; Nikfarjam et al.; Gidofalvi and Elkan; Nguyen et al.).

One paper performed sentiment analysis of Indonesian-language tweets using a lexicon-based approach (Wijaya et al.). Another applied NLP to news headlines (Velay et al.). Yet another used the sentiment of Seeking Alpha blog posts and their corresponding comments (Chen et al.). Twitter analysis has also been shown to predict price movements of other assets such as cryptocurrencies (Abraham et al.). Stock prices fluctuate largely on events, such as company announcements or newly shared information, and these events are often discussed online, which further supports the idea that textual analysis can be effective (Kaushik et al.).

A unique aspect of Reddit posts is that they are typically short snippets of a user's thoughts or opinions, not unlike Twitter. However, Reddit also gives users the opportunity to write substantially more if they desire. Analyzing this type of text requires specialized approaches (Kiritchenko et al.). We expanded upon this work to use both post frequency and sentiment.

3. Method of Analysis

This section explains the data gathering and the variables of interest. Our study collected data from reddit.com/r/wallstreetbets as well as stock market data from various financial APIs. The period we sampled is January 2018 through September 2018. In addition, we implemented a web application that leverages our findings on past data for real-time insights.

3.1 Data

The subreddit wallstreetbets is a place where Reddit users discuss day trading, stocks, options, futures, and any other market-related topics. “WallStreetBets is lively, engaged and growing. It was in the top 1% of Reddit’s more than 824,000 subreddits in new-subscriber growth over the past 90 days, according to RedditMetrics.com, and its more than 2 million monthly page views represent more traffic than all of the other stock-related subreddits combined.” (French and Langlois) An average of 236 topic posts are made to this subreddit daily, and each topic post may contain user comments. On average a topic post gathers 21 comments, sometimes reaching into the thousands depending on its popularity. Each post comment contains an average of 342 words.

Since we used Reddit comments to try to predict the direction of stock prices, we also obtained daily stock data for the same period. This data was obtained using the quantmod R package.

Stock data was collected via quantmod for all stock tickers mentioned in wallstreetbets posts, using the end-of-day value for the stock price. We created additional features from this data; for example, percent-change features that show the percent change 1 day in the future, 2 days in the future, and so on out to 90 days in the future. We used these percent-change features as dependent variables in our modeling.
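We built the price data with quantmod in R; purely as an illustration of the feature construction (not our original code), the same forward percent-change features can be sketched in Python with pandas, given a daily close-price series for one ticker:

```python
import pandas as pd

def add_forward_returns(close: pd.Series, horizons=(1, 2, 30, 90)) -> pd.DataFrame:
    """Build forward percent-change features from a daily close-price series."""
    out = pd.DataFrame({"close": close})
    for h in horizons:
        # shift(-h) looks h trading days ahead; the last h rows become NaN
        out["pct_change_%dd" % h] = (close.shift(-h) / close - 1.0) * 100.0
    return out
```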

3.2 Entity Extraction

One of the challenges of extracting useful information from our comment/post dataset was determining which stock ticker or company a comment or post is talking about. To do this we identified all posts that mention a specific company or stock ticker in the title. We gathered all post titles, then cleaned and tokenized each, joined against a stock ticker list to find all tickers present, and tagged each post with the tickers found. We decided to only analyze posts containing a single stock ticker in the title, as it is beyond the scope of our analysis to determine which ticker the corresponding discussion refers to. Another challenge when tagging posts was that many stock tickers are also commonly used English words, such as “play”, “post”, “beat”, and “win”.
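A minimal sketch of this tagging step follows (the tokenization here is deliberately simplistic, and the ticker list is truncated to the tickers we ultimately focused on):

```python
import re

def tag_single_ticker(title, tickers):
    """Return the one ticker mentioned in a post title, or None otherwise."""
    tokens = set(re.findall(r"[a-z]+", title.lower()))
    matches = tokens & tickers
    # Keep only posts whose title mentions exactly one known ticker
    return matches.pop() if len(matches) == 1 else None

tickers = {"mu", "amd", "tsla", "snap", "nvda", "fb", "amzn",
           "baba", "aapl", "ge", "msft", "nflx"}
print(tag_single_ticker("AMD earnings tomorrow", tickers))  # -> "amd"
print(tag_single_ticker("AMD vs NVDA, who wins?", tickers))  # -> None (two tickers)
```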

There are 531 unique stock tickers mentioned in posts during the time frame we examined. We also found that wallstreetbets frequently discussed a small subset of stocks, typically tech-related. Because wallstreetbets talks most about these stocks, we decided to focus our analysis on them, as they give us sufficient volume to derive insights. This also sidesteps the issue of stock tickers doubling as common English words.

The stock tickers we focused on are as follows: mu, amd, tsla, snap, nvda, fb, amzn, baba, aapl, ge, msft, and nflx. We chose these because they show up consistently throughout the period we analyzed and represent 36% of the post content. The tickers also share a common industry, tech, with the exception of ge.

3.3 Measuring Sentiment

The subreddits on reddit.com are collections of posts, each with many comments of varying quality. Sometimes a post gathers very relevant and thoughtful stock discussion, while other posts are conversations of banter with no stock relevance. We decided to use two sentiment lexicons that would be most appropriate for this style of commentary.

We considered the following lexicons: Vader (Hutto and Gilbert) and Loughran (Loughran and McDonald). The Vader sentiment lexicon was tuned specifically for microblog social media posts and performs exceptionally well at this task; our post data closely resembles the style Vader was designed for. We also wanted to incorporate a more domain-specific lexicon, Loughran, which is tuned specifically for financial text.

The Vader lexicon was run against each comment, producing the following outputs: positive, negative, and compound, where compound is the normalized, weighted composite score. Similarly, the Loughran lexicon produced six distinct financial scores: constraining, litigious, positive, negative, superfluous, and uncertainty.
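As an illustrative sketch of the per-comment scoring (the vaderSentiment Python package is assumed for the Vader scores, and the Loughran lexicon is represented as a plain word-list dictionary with a tiny made-up subset; the exact tooling we used is not shown):

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_comment(text, loughran):
    """Score one comment with Vader plus simple Loughran category counts."""
    scores = analyzer.polarity_scores(text)  # keys: neg, neu, pos, compound
    words = text.lower().split()
    # Count how many words fall into each Loughran category
    for category, wordset in loughran.items():
        scores["lm_" + category] = sum(w in wordset for w in words)
    return scores

# Tiny illustrative subset of the Loughran-McDonald word lists
loughran = {"positive": {"achieve", "gain"}, "negative": {"loss", "decline"},
            "uncertainty": {"may", "risk"}}
print(score_comment("Huge gain today but there may be a decline", loughran))
```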

Once the sentiment scores were obtained, we summed each sentiment type across comments up to the post level. For example, a post with 100 comments, each with individual scores for the 9 distinct lexicon metrics, was summed into one row of data consisting of the post title, stock ticker, and sentiment metrics.
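This roll-up can be sketched with a pandas group-by, using illustrative column names:

```python
import pandas as pd

# One row per comment, already scored (column names are illustrative)
comments = pd.DataFrame({
    "post_title": ["AMD earnings", "AMD earnings"],
    "ticker": ["amd", "amd"],
    "compound": [0.6, -0.2],
    "lm_positive": [2, 0],
})

# Sum every sentiment metric up to the (post title, ticker) level
post_level = comments.groupby(["post_title", "ticker"], as_index=False).sum()
print(post_level)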

Fig. 2

3.4 Analysis

We wanted to understand the relationship between comment sentiment and future stock movement. To do this, we regressed future returns on sentiment and per-ticker post volume. The model is roughly as follows:

r_{i,t+1} = β_0 + β_1 S_{i,t} + β_2 D_{i,t} + β_3 V_{i,t} + ε_{i,t}

where r_{i,t+1} is the return of stock ticker i for the next day t+1, S_{i,t} is the sum of a sentiment measure from comments across all posts mentioning company i on day t, D_{i,t} is an indicator variable denoting whether there were any posts and comments discussing company i on day t, and V_{i,t} is the volume for ticker i on day t.

3.5 Experiments and Evaluation

We evaluated the effectiveness of our models in the following ways. We examined the assumptions of the regression model: a linear relationship between the predictor and response variables, little or no multicollinearity among predictors, normally distributed residual errors, and homoscedasticity. The impact of the social sentiment variables was measured with methods such as variable selection, p-values, confidence intervals, and the estimated values of the coefficients. The regression model was evaluated using statistical measures such as the adjusted R-squared value and the root mean squared error. Our assumption was the following: if the model satisfies the assumptions of multiple linear regression, has acceptable adjusted R-squared and root mean squared error values, and the coefficients related to social sentiment are found to be statistically significant, then the impact of the social sentiment variables can be measured by the values of the related estimated regression coefficients.

Multiple linear regression models were explored using the stock price as the response variable. The regression coefficients, R-squared values, and assumptions of multiple linear regression were evaluated for each model. The models were constructed with the stock tickers encoded as binary dummy variables. The predicting variables included sentiment and volume from the Vader analysis, sentiment from the Loughran analysis, and historical stock price data. Initially a model was created using all variables in the dataset. The results from this initial model showed that none of the sentiment predictors were statistically significant at the 95% confidence level, although several of the historical stock price predictors had very small p-values. Stepwise model selection using AIC with backward selection was then used for variable selection from the full model.
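Backward selection by AIC can be sketched as a simple elimination loop; the statsmodels version below is an illustrative equivalent of the procedure rather than our original code, and the DataFrame and column names are assumptions:

```python
import statsmodels.api as sm

def backward_aic(df, response):
    """Greedy backward elimination: drop the predictor whose removal lowers AIC."""
    predictors = [c for c in df.columns if c != response]
    y = df[response]
    best_aic = sm.OLS(y, sm.add_constant(df[predictors])).fit().aic
    improved = True
    while improved and len(predictors) > 1:
        improved = False
        for p in list(predictors):
            trial = [q for q in predictors if q != p]
            aic = sm.OLS(y, sm.add_constant(df[trial])).fit().aic
            if aic < best_aic:  # dropping p improves (lowers) AIC: remove it
                best_aic, predictors, improved = aic, trial, True
                break  # restart the scan with the reduced predictor set
    return predictors, best_aic

# Usage (illustrative): selected, aic = backward_aic(df, response="return_2d")
```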

The initial variable selection chose mostly the historical stock data but also included the positive sentiment and uncertainty variables from the Loughran analysis. The reduced model showed the positive sentiment variable as statistically significant, but the R-squared value was suspiciously high and appeared to be dominated by the historical stock predictors rather than the sentiment variables. The model also violated the regression assumptions of linearity, constant variance, normality, and no multicollinearity. The historical stock predictors were then removed and another model was fit using only the positive sentiment and uncertainty variables, with results indicating the sentiment variables might have some predictive power.

Finally, a model was created using all predicting variables from the Loughran and Vader sentiment analyses, and stepwise model selection using AIC with backward selection was again used for variable selection. The selection returned several variables from both the Vader and Loughran analyses, and a new model was fit using these selected sentiment variables. The positive sentiment from the Loughran analysis and the compound sentiment from the Vader analysis returned very low p-values. The positive sentiment variable from the Vader analysis was statistically significant at the 90% confidence level, and the negative sentiment variables for both Loughran and Vader were significant at the 85% level. Again, the R-squared values for this model were suspiciously high and the linear regression assumptions were not met. These evaluations indicate the sentiment variables may have some promise in predicting stock prices, but the results of the models might be misleading or inaccurate.

In addition to using the stock price as a response variable, we also explored the use of percent returns, creating features for next-day percent return through percent return 90 days in the future. As part of our analysis we wanted to determine how current post sentiment might affect prices over differing future time periods, so a series of experiments was constructed for each percent-return feature, again using AIC with backward selection and all features from the Loughran and Vader sentiment analyses. A problem we noted when regressing on percent returns more than 3 days in the future is the general upward trend the stock market experienced during our period of analysis; we believe the models picked up on this and incorrectly attributed the effect to the independent variables. As a result we settled on a model using the percent return 2 days in the future from the current day.

4. Interactive Implementation

Using Reddit textual analysis to understand the impact on stock prices is an innovation we wanted to make available as a web UI. This is where we combined our work on textual analysis with interactive tools.

4.1 Database Architecture (MongoDB database)

Part of designing the data pipeline included collecting historic and real-time Reddit posts from subreddits such as “/r/wallstreetbets”. A raw format of the posts and comments was stored in MongoDB collections. The extract, transform, and load (ETL) phase enabled us to collect meaningful data to achieve better predictions. The data was organized into 3 MongoDB collections:

  1. submissions_collections: contains all the posts made under the “wallstreetbets” subreddit from 1/1/18 to date. The title and selftext keys are text-indexed in MongoDB so that stock ticker symbols can be searched anywhere in those fields (see the index sketch after this list).
  2. comments_collection: contains the comments made under each post. The body key of the comments is also text-indexed.
  3. sentiments_collection: aggregates the stock price information with the sentiment features and the final prediction to buy/sell the stock.
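As an illustration, the text indexes described above could be created with pymongo roughly as follows (a local MongoDB instance is assumed, and the database name is made up):

```python
from pymongo import MongoClient, TEXT

client = MongoClient("mongodb://localhost:27017")
db = client["wallstreetbets"]  # illustrative database name

# Compound text index so ticker symbols can be searched in title or selftext
db.submissions_collections.create_index([("title", TEXT), ("selftext", TEXT)])
db.comments_collection.create_index([("body", TEXT)])

# Example text search for a ticker symbol anywhere in the indexed fields
hits = db.submissions_collections.find({"$text": {"$search": "amd"}})
```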

4.2 Real-time Daily Data extraction (Reddit and stock pricing)

To enable real-time stock prediction analysis based on Reddit post sentiment, we performed the following.

4.2.1 Reddit Data

PRAW is a Python library that provides ways to set up streams to extract posts made to a subreddit in real time. Using it, we set up a Python script to continuously extract posts made to “wallstreetbets” and load them into our MongoDB submissions_collections collection.
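A minimal sketch of such a streaming loader (Reddit API credentials are assumed, and the field selection is illustrative):

```python
import praw
from pymongo import MongoClient

reddit = praw.Reddit(client_id="...", client_secret="...",
                     user_agent="wsb-sentiment-study")
collection = MongoClient("mongodb://localhost:27017") \
    ["wallstreetbets"]["submissions_collections"]

# stream.submissions() blocks and yields new posts as they are created
for submission in reddit.subreddit("wallstreetbets").stream.submissions():
    collection.insert_one({
        "id": submission.id,
        "title": submission.title,
        "selftext": submission.selftext,
        "created_utc": submission.created_utc,
    })
```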

4.2.2 Stock Market Data

For the proposed models we required end-of-day stock closing prices for the list of stocks used in this project. This data was extracted from “http://alphavantage.co”.
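A sketch of the end-of-day pull using Alpha Vantage's TIME_SERIES_DAILY endpoint (an API key is assumed):

```python
import requests

def daily_closes(symbol, api_key):
    """Fetch daily closing prices for one ticker from Alpha Vantage."""
    resp = requests.get("https://www.alphavantage.co/query", params={
        "function": "TIME_SERIES_DAILY",
        "symbol": symbol,
        "apikey": api_key,
    })
    series = resp.json()["Time Series (Daily)"]
    # Map date string -> closing price as a float
    return {day: float(values["4. close"]) for day, values in series.items()}
```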

4.3 Apply prediction on the extracted Real-Time data

Based on the research and experiments we performed, we applied our chosen sentiment analysis technique to calculate the sentiment score as a nightly job, run over all the posts and comments made under ‘wallstreetbets’ that day. We then used the sentiment score and the stock price as inputs to our linear regression model to calculate the predicted percentage change 2 days out.

4.4 Web UI

To effectively visualize the impact of sentiment data on the stock price, we created a series of visualizations that help the user understand the realistic impact of social media posts on a stock. To build this, we used ReactJS, Recharts (D3 charts implemented in the React framework), and a Python web framework.

The UI is accessible at http://liztd.com

4.4.1 Web Server

A Python-based web server hosts an API over the stock sentiment and post data stored in MongoDB. The API takes a ticker symbol as input and generates the output necessary for rendering the graphs and posts in the UI.
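A minimal sketch of such an endpoint (Flask is an assumption here, since we only specify “a Python web framework”; route and collection names are illustrative):

```python
from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["wallstreetbets"]

@app.route("/api/ticker/<symbol>")
def ticker_data(symbol):
    # Project out _id since ObjectId is not JSON-serializable
    docs = db.sentiments_collection.find({"ticker": symbol.lower()}, {"_id": 0})
    return jsonify(list(docs))

if __name__ == "__main__":
    app.run()
```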

4.4.2 User Interface

The UI highlights the following sections:

Visualizations

1. Historical Stock Price – Price vs Date

2. Sentiment Score – Compound Sentiment Score vs Date

3. Number of Posts – Post Count vs Date

4. Prediction vs Actual – Predicted Percentage Change for 2 Days in Future vs Actual

Sections
1. A drop down menu to select the stock ticker of interest
2. Today’s pricing information
3. Historical stock price in a tabular format
4. Posts made on the ticker which are color coded based on the sentiment they reflect

5. Conclusions and Discussion

We hoped to extract a usable signal from the social media posts on WallStreetBets. However, our models did not produce statistically significant results that would be appropriate for making investing decisions. We ultimately suspect that the posts made to the WallStreetBets forum were not of sufficient quantity or quality.

We believe there is still potential signal to be found within Reddit social media posts, but finding it will require examining other investment-related subforums in an attempt to increase content volume and perhaps find a forum with higher-quality discussion. In addition, it may be worthwhile to combine sentiment metrics from other financial social media outlets with our findings. We hope to expand our web application in the future and discover a more robust model that is both statistically significant and economically viable.

References

Abraham, Jethin, et al. “Cryptocurrency Price Prediction Using Tweet Volumes and Sentiment Analysis.” SMU Data Science Review 1.3 (2018): 1.

Bollen, Johan, Huina Mao, and Xiaojun Zeng. “Twitter mood predicts the stock market.” Journal of computational science 2.1 (2011): 1-8.

Chen, Hailiang, et al. “Wisdom of crowds: The value of stock opinions transmitted through social media.” The Review of Financial Studies 27.5 (2014): 1367-1403.

Fig 1. F. Richter, “Infographic: The Explosive Growth of Reddit’s Community”, Statista Infographics, 2017. [Online]. Available: https://www.statista.com/chart/11882/number-of-subreddits-on-reddit/. [Accessed: 13- Oct- 2018].

Gidofalvi, Gyozo, and Charles Elkan. “Using news articles to predict stock price movements.” Department of Computer Science and Engineering, University of California, San Diego (2001).

Godbole, Namrata, Manja Srinivasaiah, and Steven Skiena. “Large-Scale Sentiment Analysis for News and Blogs.” ICWSM 7.21 (2007): 219-222.

Makrehchi, Masoud, Sameena Shah, and Wenhui Liao. “Stock prediction using event-based sentiment analysis.” Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 01. IEEE Computer Society, 2013.

Nikfarjam, Azadeh, Ehsan Emadzadeh, and Saravanan Muthaiyah. “Text mining approaches for stock market prediction.” Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on. Vol. 4. IEEE, 2010.

Nguyen, Thien Hai, Kiyoaki Shirai, and Julien Velcin. “Sentiment analysis on social media for stock movement prediction.” Expert Systems with Applications 42.24 (2015): 9603-9611.

Kaushik, Vikrant Kumar, et al. “Sentiment Analysis of Event Driven Stock Market Price Prediction.” Journal of Network Communications and Emerging Technologies (JNCET) 8.4 (2018).

Kiritchenko, Svetlana, Xiaodan Zhu, and Saif M. Mohammad. “Sentiment analysis of short informal texts.” Journal of Artificial Intelligence Research 50 (2014): 723-762.

Kucher, Kostiantyn, Carita Paradis, and Andreas Kerren. “The state of the art in sentiment visualization.” Computer Graphics Forum. Vol. 37. No. 1. 2018.

Rosenthal, Steve, and Lydia Austin. “The dwindling taxable share of US corporate stock.” (2016).

Shiller, Robert J., Stanley Fischer, and Benjamin M. Friedman. “Stock prices and social dynamics.” Brookings papers on economic activity 1984.2 (1984): 457-510.

Velay, Marc, and Fabrice Daniel. “Using NLP on news headlines to predict index trends.” arXiv preprint arXiv:1806.09533 (2018).

Wijaya, Viktor, et al. “Automatic mood classification of Indonesian tweets using linguistic approach.” Information Technology and Electrical Engineering (ICITEE), 2013 International Conference on. IEEE, 2013.

Zhang, Xue, Hauke Fuehres, and Peter A. Gloor. “Predicting stock market indicators through twitter “I hope it is not as bad as I fear”.” Procedia-Social and Behavioral Sciences 26 (2011): 55-62.

French, Sally, and Shawn Langlois. “There’s a Loud Corner of Reddit Where Millennials Look to Get Rich or Die Tryin’.” MarketWatch, 5 Apr. 2016, www.marketwatch.com/story/the-millennials-looking-to-get-rich-or-die-tryin-off-one-of-wall-streets-riskiest-oil-plays-2016-03-30.

Nielsen, Finn Årup. “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” arXiv preprint arXiv:1103.2903 (2011).

Loughran, Tim, and Bill McDonald. “When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks.” The Journal of Finance 66.1 (2011): 35-65.

Hutto, C.J., and Eric Gilbert. “VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text.” Eighth International Conference on Weblogs and Social Media (ICWSM-14), 2014.

Analyzing HQ Trivia Data


HQ Trivia is an app and trivia game that was released in August 2017 for both Apple and Android devices. It became very popular near the end of 2017, in part due to people hearing about the large cash prizes the game was offering; sometimes prizes up to 25k USD were given away for answering 12 questions correctly.

Starting in October of 2017, a friend of mine began manually recording the game questions and answers in a spreadsheet, along with some of the stats behind who was winning and how much they were winning. Shortly after, he asked for my help in ramping up data collection efforts. We both love working with data and thought this would be a fun way to do some real-world analysis on an interesting phone app. We wanted to answer questions like: How many people on average answer each question correctly? Which categories were the most difficult to answer? Do people stick around to watch the game after losing? What are people talking about most in the chat?

We set out to build a database storing every aspect of the game so we could answer these questions. At the time, we were the only ones that had this kind of data for the game. When we started, it was still a small app without any of its later viral popularity. This changed very quickly, and in November of 2017 we were contacted by the Washington Post, which was writing a story on the new trivia app and needed our data and analysis for it. This is when we knew this was something bigger than we initially thought.

I am going to share the story of how we collected the data and present some previously unseen findings. To my knowledge this is the only analysis of actual HQ Trivia data of this kind that you will find on the internet. If you know of other analyses using raw HQ Trivia data, let me know in the comments!

Washington Post Story

Washington Post Article

Time.com and Money

Once our data was featured in the Washington Post, we received all sorts of inquiries about using it. We partnered with Money to provide data and analysis for 3 additional stories.

Justin and I met in the Georgia Tech Masters of Analytics program. Once our data was being featured in the news, Georgia Tech wanted to write about our work as well, and even used our story in student recruitment efforts! Our work seemed to have paid off.

Now that you have an idea of what we were able to accomplish, I am going to get into the details of how we collected our data and some analysis you might find interesting.

Data Collection

Manual Collection:

As I stated earlier, my friend Justin started manually recording all of the HQ Trivia game data directly into a spreadsheet. Quite often he would find games that someone had previously recorded as YouTube videos so he could pause and write things down. Needless to say, this was very time consuming and not a lot of fun.

When Justin first mentioned this idea to me, I told him he should look into Amazon’s Mechanical Turk to enlist the help of other people to capture this data. Justin created an HTML form and instructed the workers to capture the required data from the YouTube videos he supplied. This system worked great until we saw the bill; it was simply not sustainable for a hobby project. This is really where I started to get involved. Justin knew that I was already working as a data scientist and was very experienced in this type of work, so he asked me for help in automating the data collection.

What our data looked like initially

Machine Learning and Optical Character Recognition:

The first idea that came to mind was using optical character recognition (OCR) to look at screenshots of the game and parse out the data we were interested in. I leveraged the tesseract library to build a working prototype.

My first prototype worked roughly as follows (a simplified sketch follows the list):

  • Download YouTube videos
  • Convert the videos into a series of images
  • Run a classification algorithm to find the images we want to parse text from
  • Run optical character recognition on the remaining images
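Here is a simplified sketch of the frame-extraction and OCR steps (opencv and pytesseract stand in here, and the classification step is reduced to sampling one frame per second; this is not the original prototype code):

```python
import cv2
import pytesseract

def ocr_video_frames(path, every_n_seconds=1):
    """Sample frames from a video at a fixed interval and OCR each one."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    texts, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % int(fps * every_n_seconds) == 0:
            # OCR tends to work better on grayscale frames
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            texts.append(pytesseract.image_to_string(gray))
        frame_idx += 1
    cap.release()
    return texts
```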

The script worked, but was plagued with inaccuracies that simply were not acceptable for what we wanted to achieve. Around this time I noticed people had started to cheat at the game by building programs that would google answers in real time. Clearly someone had figured out how to parse the questions very quickly, so I started looking at other people’s code to figure out how. This leads into our final technique.

Undocumented API

The scripts I found were hooking into the API of the application, which sends the questions and answers to the phone app. Using the cheating scripts as inspiration, I quickly wrote a script that would instead take the data generated by HQ Trivia and insert it into a database. I set up a server to run this script and capture the afternoon and evening games as they were broadcast.

HQ Insiders Database

We compiled a database of game data from October 2017 through August 2018. It contains roughly 12,000 unique trivia questions and answers, over 300,000 chat messages, player payouts, and the broadcast metrics of 364 unique games.

Analysis

I am going to detail some of the analysis we did to answer the questions that motivated our efforts, and then show how the game went through a surge in popularity before eventually going into decline.

A typical HQ Trivia game lasted around 15 to 18 minutes and consisted of 12-15 trivia questions, each with three answers to choose from. HQ Trivia was a game show and attempted to entertain its users while they played. There were two main play times: an afternoon game at 3:00 PM EST and an evening game at 9:00 PM EST. One of the first questions we wanted to ask was how many people watch the game without actively playing. I will call these viewers.

Viewership by Game Minute

A viewer is someone who is not actively playing. They might have been eliminated or never attempted to answer the first question. Each line represents a unique game.


These charts show that most people stick around for the first 5 to 10 minutes of the game, and then viewership starts to drop off quickly. The evening games that show up as outliers were the games with either a special guest host or a very large payout so people wanted to stick around and watch who wins.

App Downloads and Key Events

We were able to obtain app download estimates for the HQ Trivia app on both the Apple and Android app stores. We plotted this data against some key events to see how they affected downloads, annotating large cash prizes and the Ready Player One sponsored-game announcement. It is clear that the first large prize HQ Trivia offered helped keep downloads high, and the next big catalyst was the Ready Player One sponsorship announcement.

Distribution of Winners Per Game

When someone wins a game of HQ Trivia, the prize money is split between all the winners, so the highest payouts occur in the games with the fewest winners. Unfortunately it was rare to be in a game with very few winners. We present a density plot showing the distribution of winners across all the games we recorded. Note that a limitation of the API we used to collect data cut off at 750 winners per game, so we were not able to accurately record games with more winners than that.

Did the Game Become Easier to Win?

The short answer is yes. While I do not think the difficulty of the questions changed much, cheating became very prevalent as the game increased in popularity. There were live chat groups on Telegram where people would crowdsource answers, and more and more people were figuring out how to pull the questions from the API to quickly look up answers on Google. The chart below shows that the average percentage of players answering correctly increased over the period we collected data.

The missing section of data was during the time we were transitioning our data collection methods.

Rise and Decline of HQ Trivia

I believe that running a trivia app on a phone comes with problems that just cannot be overcome. The main issue is that when you offer a monetary incentive to win a trivia game, players will naturally maximize their efforts to win. This can be as simple as quickly googling questions while playing, crowdsourcing answers in a living room full of friends or in a chat room filled with thousands of players, or, at the extreme, building sophisticated cheating scripts that look up answers automatically. It is fun to win 10 dollars playing a game, and absolutely exhilarating to win $25,000! But when players start to win 10 cents, the game loses its appeal fast. This is exactly what started to happen.

If you track the blue average line you can see a steady decrease in payouts. After May, it became very common to win just a few cents or a couple of dollars. There were of course special events and games that boosted that amount, but the winnings just weren’t as good as they used to be.

If you were to win in November of 2017, the average winning amount was $63.38. Contrast that with winning in July of 2018 where the average winning amount was only $5.72.

You can see that the maximum number of players per game was steadily decreasing after peaking sometime around March or April of 2018. We did not collect data past September, but from other news and social media posts I have read about the game, it continued to decline.

What Were People Saying in the Chat?

HQ Trivia let players chat while in-game. There were so many players at any given time that messages came in fast, with little time to read them; on top of that, the chat was distracting while playing, so a lot of people chose to hide it. I wanted to take a look at the chats we collected to see exactly what was being said. Mostly it was people talking about their birthday, Scott the host, some glitch in the game they experienced, or something they loved. In fact these were the 4 most common words used in the chat.

There were also some memes talked about frequently in chat.

The red “dab” line you can see above marks when Elmo hosted the show. Players asked for Elmo to dab; shortly after, Elmo delivered.

I also generated some word clouds of certain games. Below is a word cloud from the special game with The Voice.

Code to Reproduce

You can find the R code that I used to produce these graphs and analysis here.