Textual Analysis Of Reddit to Predict Future Stock Returns

Code to Reproduce Analysis

1. Introduction 

The goal of our project is to understand the impact of social media posts on the future prices of individual stocks. We used textual analysis techniques to examine the impact of posts made to investment-focused subreddits (forums) on the social media platform reddit.com.

Today, people conduct much of their social activity online through services such as Twitter, Reddit, and similar social media platforms. It has been shown that investing is largely a social activity and that conversations may lead to investment behavior (Shiller et al.). We focused on reddit.com because it has grown substantially over the years and hosts several investment-focused subreddits. 50.3% of the value of U.S. stock is owned by households, and Reddit content is generated solely by individual household users (Rosenthal, Austin). To our knowledge, no prior study has used reddit.com as a source of information to predict stock movement. The rapid growth of Reddit’s user base is shown below in Figure 1.

Fig. 1

2. Survey

There have been many attempts to utilize social media content to predict stock returns. Our study is closely related to those utilizing sources such as Twitter and traditional news outlets.

Others have found statistically significant results in predicting future stock returns using their social media data sources of choice (Chen et al.; Bollen et al.; Wijaya et al.; Zhang et al.; Makrehchi et al.; Godbole et al.; Nikfarjam et al.; Gidofalvi et al.; Nguyen et al.).

One paper performed sentiment analysis using a lexicon approach on Indonesian-language tweets (Wijaya et al.). Another applied NLP to news headlines (Velay et al.). Yet another used the sentiment of Seeking Alpha blog posts and their corresponding comments (Chen et al.). Twitter analysis has also been shown to predict price movements of other assets such as cryptocurrencies (Abraham et al.). We found that stock prices fluctuate largely in response to events, such as company announcements or newly shared information. These events are often discussed online, which further supports the idea that textual analysis can be effective (Kaushik et al.).

A distinctive aspect of Reddit posts is that they are typically short snippets of a user's thoughts or opinions, not unlike Twitter. However, Reddit also lets users write substantially more if they wish. Analyzing this type of text requires specialized approaches (Kiritchenko et al.). We expanded upon this work by using both post frequency and sentiment.

3. Method of Analysis

This section explains the data gathering and the variables of interest. Our study used data collected from reddit.com/r/wallstreetbets as well as stock market data collected from various financial APIs. The sample period is January 2018 through September 2018. In addition, we implemented a web application that applies our findings on past data to provide real-time insights.

3.1 Data

The subreddit wallstreetbets is a place where Reddit users discuss day trading, stocks, options, futures, and other market-related topics. “WallStreetBets is lively, engaged and growing. It was in the top 1% of Reddit’s more than 824,000 subreddits in new-subscriber growth over the past 90 days, according to RedditMetrics.com, and its more than 2 million monthly page views represent more traffic than all of the other stock-related subreddits combined.” (French, Sally, and Shawn Langlois) An average of 236 topic posts are made to this subreddit daily, and each topic post may contain user comments. On average, each topic post gathers 21 comments, though popular posts can reach into the thousands. Each post comment contains an average of 342 words.

Because we used Reddit comments to try to predict the direction of stock prices, we also obtained daily stock data for the same period. This data was obtained using quantmod.

Stock data was collected from quantmod for all stock tickers mentioned in wallstreetbets posts. The end-of-day value was used as the stock price. We created additional features from this data. For example, we created percent change features giving the percent change 1 day in the future, 2 days in the future, and so on up to 90 days in the future. We used these percent change features as dependent variables in our modeling.
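As an illustration of the forward-looking percent change features, the following is a minimal Python sketch, not the code used in the study (which obtained prices through quantmod). The function name, the feature naming scheme, and the sample closing prices are hypothetical.

```python
from typing import Dict, List, Optional


def forward_pct_change(closes: List[float], horizon: int) -> List[Optional[float]]:
    """Percent change from each day's close to the close `horizon` trading days later.

    The last `horizon` days have no future price yet, so they are returned as None.
    """
    if horizon < 1:
        raise ValueError("horizon must be a positive number of days")
    out: List[Optional[float]] = []
    for i, price in enumerate(closes):
        j = i + horizon
        if j < len(closes):
            out.append((closes[j] - price) / price * 100.0)
        else:
            out.append(None)
    return out


# Hypothetical end-of-day closes for one ticker, oldest first.
closes = [100.0, 102.0, 99.0, 110.0]

# One feature column per horizon; the study used horizons from 1 to 90 days.
features: Dict[str, List[Optional[float]]] = {
    f"pct_change_{h}d": forward_pct_change(closes, h) for h in (1, 2)
}
```

Each column is aligned with the day on which the sentiment was observed, so a row pairs that day's predictors with a return that is only known later.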

3.2 Entity Extraction

One of the challenges of extracting useful information from our comment/post dataset was determining which stock ticker or company a comment or post is talking about. To do this, we identified all posts that mention a specific company or stock ticker in the title. We gathered all post titles, then cleaned and tokenized each. We then joined the tokens against a stock ticker list and tagged each post with the tickers found. We decided to analyze only posts that contain a single stock ticker in the title, as it is beyond the scope of our analysis to determine which ticker the discussion in a multi-ticker post refers to. Another challenge we faced when tagging posts was that many stock tickers are also commonly used English words. Examples include “play”, “post”, “beat”, and “win”.

There are 531 unique stock tickers mentioned in posts during the time frame we examined. We also found that wallstreetbets frequently discussed a small subset of stocks, typically tech-related stocks. Because wallstreetbets talks most about these stocks, we decided to focus our analysis on them, as they give us sufficient volume to derive insights. This also resolves the issue of stock tickers that are common English words.

The stock tickers we focused on are: mu, amd, tsla, snap, nvda, fb, amzn, baba, aapl, ge, msft, and nflx. We chose these because they appear consistently throughout the time period we analyzed and represent 36% of the post content. The tickers also share a common industry, tech, with the exception of ge.
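The tagging and single-ticker filtering described in this section can be sketched as follows. This is a simplified illustration rather than the study's code: the tokenizer (lowercase, alphabetic tokens only) and the function names are assumptions, while the ticker set is the focus list given above.

```python
import re
from typing import List, Optional, Set

# The twelve focus tickers listed in this section.
FOCUS_TICKERS: Set[str] = {
    "mu", "amd", "tsla", "snap", "nvda", "fb",
    "amzn", "baba", "aapl", "ge", "msft", "nflx",
}


def tokenize(title: str) -> List[str]:
    """Lowercase a title and split it into alphabetic tokens (so "$TSLA" -> "tsla")."""
    return re.findall(r"[a-z]+", title.lower())


def tickers_in_title(title: str, tickers: Set[str] = FOCUS_TICKERS) -> List[str]:
    """Distinct tickers mentioned in the title, in order of first appearance."""
    found: List[str] = []
    for token in tokenize(title):
        if token in tickers and token not in found:
            found.append(token)
    return found


def single_ticker(title: str) -> Optional[str]:
    """Return the ticker if the title mentions exactly one, otherwise None."""
    found = tickers_in_title(title)
    return found[0] if len(found) == 1 else None
```

Matching whole tokens rather than substrings keeps a word like "large" from being tagged as ge, and restricting the lookup to the focus list means that words such as "play" or "post" are never treated as tickers.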

3.3 Measuring Sentiment

The subreddits on reddit.com are collections of posts, each with many comments of varying quality. Some posts gather very relevant and thoughtful stock discussion, while others consist of banter unrelated to stocks. We decided to use the two sentiment lexicons that we judged most appropriate for this style of commentary.

We used the following lexicons: Vader (Gilbert, CJ Hutto Eric) and Loughran (Loughran, Tim, and Bill McDonald). The Vader sentiment lexicon was tuned specifically for microblog-style social media posts, and our post data closely resembles the style Vader was designed for. We also wanted to incorporate a more domain-specific lexicon, Loughran, which is tuned specifically for financial text.

The Vader lexicon was run against each comment, producing the following outputs: positive, negative, and compound, where compound is the normalized weighted composite score. Similarly, the Loughran lexicon produced six distinct financial scores: constraining, litigious, positive, negative, superfluous, and uncertainty.

Once the sentiment scores were obtained, we summed each sentiment metric across a post's comments to the post level. For example, for a post with 100 comments, each with an individual score for the 9 distinct lexicon metrics, the scores were summed into one row of data consisting of the post title, stock ticker, and sentiment metrics.
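The roll-up from comment level to post level can be sketched as below. The dictionary keys (post_id, title, ticker, and the metric names) are hypothetical and chosen only for illustration; the nine metrics correspond to the three Vader outputs and six Loughran scores described above.

```python
from typing import Dict, Iterable, List

# Three Vader outputs plus six Loughran scores (key names are assumptions).
METRICS = [
    "vader_positive", "vader_negative", "vader_compound",
    "lm_constraining", "lm_litigious", "lm_positive",
    "lm_negative", "lm_superfluous", "lm_uncertainty",
]


def aggregate_to_post_level(comments: Iterable[dict]) -> List[dict]:
    """Sum every sentiment metric over the comments of each post.

    Each input dict describes one scored comment; the output has one row per post
    holding the post title, stock ticker, and the summed metrics.
    """
    totals: Dict[str, dict] = {}
    for comment in comments:
        key = comment["post_id"]
        if key not in totals:
            totals[key] = {
                "post_id": key,
                "title": comment["title"],
                "ticker": comment["ticker"],
                **{metric: 0.0 for metric in METRICS},
            }
        for metric in METRICS:
            # A metric missing from a comment contributes nothing to the sum.
            totals[key][metric] += comment.get(metric, 0.0)
    return list(totals.values())
```

Because the scores are summed rather than averaged, a post with many comments can carry a larger total than a short thread with stronger individual comments.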

Fig. 2

3.4 Analysis

We wanted to understand the relationship between comment sentiment and future stock movement. To do this, we regressed future returns on the sentiment and the stock ticker post volume. The model is roughly as follows:

$$r_{i,t+1} = \beta_0 + \beta_1 S_{i,t} + \beta_2 I_{i,t} + \beta_3 V_{i,t} + \epsilon_{i,t+1}$$

where $r_{i,t+1}$ is the return of stock ticker $i$ for the next day $t+1$, and $S_{i,t}$ is the sum of a sentiment measure from comments across all posts about company $i$ on day $t$. We add an indicator variable, $I_{i,t}$, which denotes whether there were any posts and comments discussing company $i$ on day $t$. Finally, we add $V_{i,t}$, which is the volume of ticker $i$ on day $t$.
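To make the regression setup concrete, here is a small self-contained sketch that fits next-day return on a sentiment sum, a post indicator, and volume by ordinary least squares. It is an illustration under stated assumptions, not the study's code: it solves the normal equations in plain Python, omits the ticker dummy variables, and uses made-up observations generated from known coefficients.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_ols(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares coefficients [intercept, b1, ..., bk] via the normal equations."""
    design = [[1.0, *row] for row in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical (sentiment sum, post indicator, volume) rows for one ticker.
rows = [
    (2.0, 1.0, 5.0), (0.0, 0.0, 0.0), (-1.5, 1.0, 3.0),
    (4.0, 1.0, 12.0), (0.0, 0.0, 0.0), (1.0, 1.0, 2.0),
]
# Noise-free next-day returns built from known coefficients, so the fit is checkable.
true_coefs = [0.1, 0.5, -0.2, 0.05]
returns = [
    true_coefs[0] + sum(c * v for c, v in zip(true_coefs[1:], row)) for row in rows
]
coefs = fit_ols(rows, returns)
```

In practice a statistics package would be the usual choice, since it also reports standard errors and p-values; the hand-rolled solver here only keeps the example dependency-free.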

3.5 Experiments and Evaluation

We evaluated the effectiveness of our models in the following ways. We examined the assumptions of the regression model: a linear relationship between the predictor and response variables, little or no multicollinearity among the predictors, normally distributed residuals, and homoscedasticity. The impact of the social sentiment variables was measured with methods such as variable selection, p-values, confidence intervals, and the estimated values of the coefficients. The regression model was evaluated using the adjusted R-squared value and the root mean squared error. Our reasoning was as follows: if the model satisfies the assumptions of multiple linear regression, has acceptable adjusted R-squared and root mean squared error values, and the coefficients related to social sentiment are statistically significant, then the impact of the social sentiment variables can be measured by the values of the related estimated regression coefficients.

Multiple linear regression models were first explored using the stock price as the response variable. The regression coefficients, R-squared values, and assumptions of multiple linear regression were evaluated for each model. The models were constructed with the stock tickers encoded as binary dummy variables. The predictor variables included sentiment and volume from the Vader analysis, sentiment from the Loughran analysis, and historical stock price data. Initially, a model was created using all variables in the dataset. In this initial model, none of the sentiment predictors were statistically significant at the 95% confidence level, although several of the historical stock price predictors had very small p-values. Backward stepwise selection using AIC was then applied to the full model for variable selection.
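For reference, the two fit statistics named above can be computed as in the following sketch. The function names are hypothetical; the formulas are the standard definitions of root mean squared error and adjusted R-squared for a model with an intercept.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted responses."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def adjusted_r_squared(
    actual: Sequence[float], predicted: Sequence[float], n_predictors: int
) -> float:
    """Adjusted R-squared: R-squared penalized for the number of predictors."""
    n = len(actual)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    mean = sum(actual) / n
    ss_tot = sum((a - mean) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("response has no variance")
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

Unlike plain R-squared, the adjusted value can fall when a predictor that adds little is included, which makes it more suitable for comparing models with different numbers of variables.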

The initial variable selection chose mostly the historical stock data but also included the sentiment variables of positive sentiment and uncertainty from the Loughran analysis. The reduced model showed the positive sentiment variable as statistically significant, but the R-squared value was suspiciously high and seemed to be dominated by the historical stock predictors instead of the sentiment variables. The model also violated the regression assumptions of linearity, constant variance, normality, and no multicollinearity. The historical stock predicting variables were then removed and another model was compiled using the positive sentiment and uncertainty variables only, with results indicating the sentiment variables might have some predictive power.

Finally, a model was created using all predictor variables from the Loughran and Vader sentiment analyses. Backward stepwise selection using AIC was again used for variable selection, and it returned several variables from both the Vader and Loughran analyses. A new model was fit using the selected sentiment variables. The positive sentiment from the Loughran analysis and the compound sentiment from the Vader analysis returned very low p-values. The positive sentiment variable from the Vader analysis was statistically significant at the 90% confidence level. The negative sentiment variables for both Loughran and Vader were significant only at the 85% confidence level. Again, the R-squared values for this model were suspiciously high and the linear regression assumptions were not met. These evaluations indicate that the sentiment variables may have some promise in predicting stock prices, but the results of the models might be misleading or inaccurate.

In addition to using stock price as the response variable, we also explored percent returns. We created features for the next-day percent return through the percent return 90 days in the future. As part of our analysis, we wanted to determine how current post sentiment might affect prices over differing future time periods. A series of experiments was constructed for each percent return feature. We again used backward selection with AIC and all features from the Loughran and Vader sentiment analyses. A problem we noted when regressing on percent returns more than 3 days in the future was the general upward trend the stock market experienced during our period of analysis. We believe the models picked up on this trend and incorrectly attributed it to the independent variables. As a result, we settled on a model using the percent return 2 days in the future from the current day.

4. Interactive Implementation

Using Reddit textual analysis to understand the impact on stock prices is an innovation we wanted to make available through a web UI. Here we combine our textual analysis with interactive tools.

4.1 Database Architecture (MongoDB database)

Designing the data layer included collecting historical and real-time Reddit posts from subreddits such as “/r/wallstreetbets”. The raw posts and comments were stored in MongoDB collections. The extract, transform and load (ETL) phase enabled us to collect meaningful data to achieve better predictions. The data was organized into 3 MongoDB collections.

  1. submissions_collection: contains all posts made under the “wallstreetbets” subreddit from 1/1/18 to date. The title and selftext keys are “text” indexed in MongoDB so that stock ticker symbols can be searched anywhere in those fields.
  2. Comments_collection: contains the comments made under each post. The body key of the comments is “text” indexed.
  3. Sentiments_collection: aggregates the stock price information with the sentiment features and the final prediction to buy/sell the stock.
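The text indexes described in the list above could be created along these lines. This is a sketch that assumes pymongo-style collection objects are passed in; the function names and the database name in the usage comment are hypothetical. In pymongo, a text index is requested by giving the string "text" as the index type for a field.

```python
from typing import Any


def ensure_text_indexes(submissions: Any, comments: Any) -> None:
    """Create the text indexes on two pymongo-style collection objects."""
    # One compound text index covers both fields, so a single $text query
    # searches the title and the selftext together.
    submissions.create_index([("title", "text"), ("selftext", "text")])
    comments.create_index([("body", "text")])


def find_posts_mentioning(submissions: Any, ticker: str) -> Any:
    """Full-text search for a ticker symbol in the indexed submission fields."""
    return submissions.find({"$text": {"$search": ticker}})


# Usage against a real server (assumes pymongo is installed and MongoDB is running;
# the database name "reddit" is made up for this example):
#   from pymongo import MongoClient
#   db = MongoClient()["reddit"]
#   ensure_text_indexes(db["submissions_collection"], db["Comments_collection"])
#   for post in find_posts_mentioning(db["submissions_collection"], "amd"):
#       print(post["title"])
```

MongoDB allows only one text index per collection, which is why the title and selftext fields share a single compound index here.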

4.2 Real-time Daily Data extraction (Reddit and stock pricing)

To enable real-time stock prediction analysis based on reddit post sentiment, we performed the following.

4.2.1 Reddit Data

PRAW is a Python library that provides streams for extracting posts made to a subreddit in real time. Using it, we set up a Python script to continuously extract posts made under “wallstreetbets” and load them into our MongoDB submissions_collection.
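A minimal sketch of such a loader is shown below. It is not the project's script: the stored fields, the function names, and the use of the Reddit submission id as the MongoDB _id are assumptions. The stream and collection are passed in as arguments so the logic can be exercised without network access.

```python
from typing import Any, Iterable, Optional


def submission_to_document(submission: Any) -> dict:
    """Pick the fields to store from a PRAW-style submission object."""
    return {
        "_id": submission.id,  # Reddit's id doubles as the primary key
        "title": submission.title,
        "selftext": submission.selftext,
        "created_utc": submission.created_utc,
    }


def store_submissions(
    stream: Iterable[Any], collection: Any, limit: Optional[int] = None
) -> int:
    """Upsert each submission from the stream into the collection.

    Returns the number of submissions processed. With limit=None the loop runs
    for as long as the stream keeps yielding, which is the real-time case.
    """
    if limit is not None and limit <= 0:
        return 0
    stored = 0
    for submission in stream:
        doc = submission_to_document(submission)
        # Upserting makes restarts safe: a post seen twice is overwritten, not duplicated.
        collection.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        stored += 1
        if limit is not None and stored >= limit:
            break
    return stored


# Wiring it to Reddit and MongoDB (credentials are placeholders you must supply):
#   import praw
#   from pymongo import MongoClient
#   reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="...")
#   stream = reddit.subreddit("wallstreetbets").stream.submissions()
#   store_submissions(stream, MongoClient()["reddit"]["submissions_collection"])
```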

4.2.2 Stock Market Data

For the proposed models we required “end of day” stock closing prices for the list of stocks we were using for this project. This data was extracted from “http://alphavantage.co”.
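The report does not say which Alpha Vantage endpoint was used. As one plausible approach, the sketch below requests the TIME_SERIES_DAILY function and reads the closing price from each daily bar. The function names are hypothetical, and an API key must be supplied by the caller.

```python
import json
from typing import Dict
from urllib.parse import urlencode
from urllib.request import urlopen

SERIES_KEY = "Time Series (Daily)"


def daily_url(symbol: str, api_key: str) -> str:
    """Build the query URL for daily bars of one symbol."""
    query = urlencode(
        {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    )
    return "https://www.alphavantage.co/query?" + query


def parse_daily_closes(payload: dict) -> Dict[str, float]:
    """Map each date (YYYY-MM-DD) to its closing price, oldest date first."""
    series = payload.get(SERIES_KEY)
    if series is None:
        # Error and rate-limit responses come back as JSON without the series key.
        raise ValueError(f"no daily series in response: {list(payload)}")
    return {day: float(bar["4. close"]) for day, bar in sorted(series.items())}


def fetch_daily_closes(symbol: str, api_key: str) -> Dict[str, float]:
    """Download and parse end-of-day closes (performs a network request)."""
    with urlopen(daily_url(symbol, api_key), timeout=30) as response:
        return parse_daily_closes(json.load(response))
```

Keeping the parsing separate from the download makes it possible to check the parsing against a saved response without calling the API.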

4.3 Apply prediction on the extracted Real-Time data

Based on our research and experiments, we ran the sentiment analysis technique we selected as a nightly job to calculate sentiment scores for all posts and comments made under ‘wallstreetbets’ that day. We then used the sentiment score and the stock price as inputs to our linear regression model to calculate the predicted percentage change 2 days ahead.
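Applying a fitted linear model in the nightly job amounts to a weighted sum of that day's inputs. The sketch below illustrates this; the coefficient values, the feature names, and the rule that maps a positive prediction to "buy" are all hypothetical and are not taken from the fitted model.

```python
from typing import Dict


def predict_pct_change(
    features: Dict[str, float], coefficients: Dict[str, float], intercept: float
) -> float:
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    missing = set(coefficients) - set(features)
    if missing:
        raise KeyError(f"missing feature values: {sorted(missing)}")
    return intercept + sum(coefficients[name] * features[name] for name in coefficients)


def to_signal(predicted_pct_change: float) -> str:
    """Turn a predicted 2-day percent change into a label (zero threshold is an assumption)."""
    return "buy" if predicted_pct_change > 0 else "sell"


# Made-up coefficients and one night's aggregated inputs for a single ticker.
COEFFICIENTS = {"vader_compound": 0.02, "lm_positive": 0.01, "close": -0.001}
INTERCEPT = 0.1
tonight = {"vader_compound": 12.5, "lm_positive": 30.0, "close": 25.0}

prediction = predict_pct_change(tonight, COEFFICIENTS, INTERCEPT)
signal = to_signal(prediction)
```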

4.4 Web UI

To visualize the impact of sentiment data on the stock price, we created a series of visualizations that help users understand the impact of social media posts on a stock. To build them, we used ReactJS, Recharts (D3-based charts implemented for the React framework), and a Python web framework.

UI is accessible at http://liztd.com

4.4.1 Web Server

A Python-based web server exposes an API for the stock sentiment and post data stored in MongoDB. The API takes a ticker symbol as input and returns the data needed to render the graphs and posts in the UI.

4.4.2 User Interface

The UI highlights the following charts:

1. Historical Stock Price – Price vs Date
2. Sentiment Score – Compound Sentiment Score vs Date
3. Number of Posts – Post Count vs Date
4. Prediction vs Actual – Predicted Percentage Change for 2 Days in Future vs Actual

It also provides the following:

1. A drop-down menu to select the stock ticker of interest
2. Today’s pricing information
3. Historical stock price in a tabular format
4. Posts made on the ticker, color coded based on the sentiment they reflect

6. Conclusions and Discussion

We hoped to extract a usable signal from the social media posts on WallStreetBets. However, our models did not produce statistically significant results suitable for making investing decisions. We suspect that the posts made to the WallStreetBets forum were not of sufficient quantity or quality.

We believe that there is still some potential signal to be found within Reddit social media posts, but it will require examining other investment related sub forums in an attempt to increase content and perhaps find a sub forum with higher quality discussion. In addition it may be worthwhile to explore a combination of sentiment metrics from other financial social media outlets in addition to our findings. We hope to expand upon our web application in the future and discover a more robust model that is both statistically significant and economically viable.


References

Abraham, Jethin, et al. “Cryptocurrency Price Prediction Using Tweet Volumes and Sentiment Analysis.” SMU Data Science Review 1.3 (2018): 1.

Bollen, Johan, Huina Mao, and Xiaojun Zeng. “Twitter mood predicts the stock market.” Journal of computational science 2.1 (2011): 1-8.

Chen, Hailiang, et al. “Wisdom of crowds: The value of stock opinions transmitted through social media.” The Review of Financial Studies 27.5 (2014): 1367-1403.

Fig 1. F. Richter, “Infographic: The Explosive Growth of Reddit’s Community”, Statista Infographics, 2017. [Online]. Available: https://www.statista.com/chart/11882/number-of-subreddits-on-reddit/. [Accessed: 13- Oct- 2018].

Gidofalvi, Gyozo, and Charles Elkan. “Using news articles to predict stock price movements.” Department of Computer Science and Engineering, University of California, San Diego (2001).

Godbole, Namrata, Manja Srinivasaiah, and Steven Skiena. “Large-Scale Sentiment Analysis for News and Blogs.” ICWSM 7.21 (2007): 219-222.

Makrehchi, Masoud, Sameena Shah, and Wenhui Liao. “Stock prediction using event-based sentiment analysis.” Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT)-Volume 01. IEEE Computer Society, 2013.

Nikfarjam, Azadeh, Ehsan Emadzadeh, and Saravanan Muthaiyah. “Text mining approaches for stock market prediction.” Computer and Automation Engineering (ICCAE), 2010 The 2nd International Conference on. Vol. 4. IEEE, 2010.

Nguyen, Thien Hai, Kiyoaki Shirai, and Julien Velcin. “Sentiment analysis on social media for stock movement prediction.” Expert Systems with Applications 42.24 (2015): 9603-9611.

Kaushik, Vikrant Kumar, et al. “Sentiment Analysis of Event Driven Stock Market Price Prediction.” Journal of Network Communications and Emerging Technologies (JNCET) 8.4 (2018).

Kiritchenko, Svetlana, Xiaodan Zhu, and Saif M. Mohammad. “Sentiment analysis of short informal texts.” Journal of Artificial Intelligence Research 50 (2014): 723-762.

Kucher, Kostiantyn, Carita Paradis, and Andreas Kerren. “The state of the art in sentiment visualization.” Computer Graphics Forum. Vol. 37. No. 1. 2018.

Rosenthal, Steve, and Lydia Austin. “The dwindling taxable share of US corporate stock.” (2016).

Shiller, Robert J., Stanley Fischer, and Benjamin M. Friedman. “Stock prices and social dynamics.” Brookings papers on economic activity 1984.2 (1984): 457-510.

Velay, Marc, and Fabrice Daniel. “Using NLP on news headlines to predict index trends.” arXiv preprint arXiv:1806.09533 (2018).

Wijaya, Viktor, et al. “Automatic mood classification of Indonesian tweets using linguistic approach.” Information Technology and Electrical Engineering (ICITEE), 2013 International Conference on. IEEE, 2013.

Zhang, Xue, Hauke Fuehres, and Peter A. Gloor. “Predicting stock market indicators through twitter “I hope it is not as bad as I fear”.” Procedia-Social and Behavioral Sciences 26 (2011): 55-62.

French, Sally, and Shawn Langlois. “There’s a Loud Corner of Reddit Where Millennials Look to Get Rich or Die Tryin’.” MarketWatch, 5 Apr. 2016, www.marketwatch.com/story/the-millennials-looking-to-get-rich-or-die-tryin-off-one-of-wall-streets-riskiest-oil-plays-2016-03-30.

Nielsen, Finn Årup. “A new ANEW: Evaluation of a word list for sentiment analysis in microblogs.” arXiv preprint arXiv:1103.2903 (2011).

Loughran, Tim, and Bill McDonald. “When is a liability not a liability? Textual analysis, dictionaries, and 10‐Ks.” The Journal of Finance 66.1 (2011): 35-65.

Gilbert, CJ Hutto Eric. “Vader: A parsimonious rule-based model for sentiment analysis of social media text.” Eighth International Conference on Weblogs and Social Media (ICWSM-14). Available at (20/04/16) http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf. 2014.