❖     Sentiment Analysis on Movie Reviews

(data science course, two-person team)

  • Natural language processing (LDA, lemmatization, Word2Vec) and implementation of machine learning models (ensemble)
  • Data analysis and visualization

This is a project for the Spring 2016 Data Science class at Olin College. We chose to work with a dataset of movie reviews from Rotten Tomatoes. Our goal is to predict the sentiment of these reviews using various natural language processing tools. A competition based on this dataset was previously hosted on the data science site Kaggle.

The competition page can be found here

The code repo can be found here

The final report can be found here

Model

We extracted features from movie reviews in a variety of ways for this competition. These include:

  • Term Frequency vectorization
  • Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
  • Latent Dirichlet Allocation
  • Part-of-speech tagging
  • Google's Word2Vec tool
  • TextBlob sentiment analysis
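
As a rough illustration of the first two items, the sketch below computes raw term counts and a plain TF-IDF weighting using only the Python standard library. It is not the project's code: the toy reviews, the function names, and the unsmoothed formula idf(t) = ln(N / df(t)) are all assumptions made for this example. Library implementations typically use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy reviews, not taken from the Rotten Tomatoes dataset.
reviews = [
    "a good movie",
    "a bad movie",
    "good good fun",
]

def term_frequency(text):
    """Count how often each whitespace-separated token occurs in one review."""
    return Counter(text.lower().split())

def inverse_document_frequency(docs):
    """idf(t) = ln(N / df(t)), where df(t) is the number of docs containing t."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(text, idf):
    """Weight each raw term count by that term's idf (0 for unseen terms)."""
    counts = term_frequency(text)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

idf = inverse_document_frequency(reviews)
tf_vec = term_frequency(reviews[2])   # raw counts for "good good fun"
tfidf_vec = tfidf(reviews[2], idf)    # the same review, reweighted by idf
```

In this toy corpus, "fun" appears in only one review, so it receives a higher idf than "good", which appears in two. This is the effect that separates TF-IDF from plain term counts.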

We constructed a separate model for predicting movie review sentiment with each of these methods. The model representations we used most are support vector machines and decision trees. We determined that these were the most effective by running cross validation across several representations. Along with a cross validation score, we were able to get a second performance evaluation from Kaggle. Although the Sentiment Analysis on Movie Reviews competition has ended, Kaggle has kept the test prediction scoring page available.
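
The comparison of representations by cross validation can be sketched as follows. This is a minimal standard-library example and not our actual pipeline: the two candidate models are hypothetical stand-ins (a majority-class baseline and a keyword rule) rather than the support vector machines and decision trees used in the project, and the data is made up.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx

def cross_val_accuracy(fit, X, y, k=3):
    """Average held-out accuracy of the model returned by fit(X_train, y_train)."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k):
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda text: majority

def fit_keyword(X_train, y_train):
    """Hypothetical toy rule (ignores training data): 1 if 'good' is present."""
    return lambda text: 1 if "good" in text.split() else 0

# Made-up reviews with 1 = positive, 0 = negative.
X = ["good film", "bad film", "good plot", "bad plot", "good cast", "bad cast"]
y = [1, 0, 1, 0, 1, 0]

scores = {
    "majority": cross_val_accuracy(fit_majority, X, y),
    "keyword": cross_val_accuracy(fit_keyword, X, y),
}
best = max(scores, key=scores.get)  # candidate with the highest mean accuracy
```

The same loop applies to any model that can be wrapped in a `fit` function returning a predictor, which is what makes the held-out score comparable across representations.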

We also built an ensemble model by combining these individual models. The ensemble uses the predictions of the lower-level models as its input features. By combining the predictions in this way, we are able to achieve higher scores than any individual model.
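
The idea of feeding lower-level predictions into a higher-level model can be sketched as below. This is an assumption-laden toy and not the ensemble we submitted: the three base models are hypothetical hand-written rules, and the meta-model is a simple lookup table that maps each pattern of base predictions to the label most often seen with it in training.

```python
from collections import Counter, defaultdict

# Hypothetical lower-level models: each maps a review to a 0/1 sentiment guess.
def keyword_model(text):
    return 1 if "good" in text.split() else 0

def negation_model(text):
    return 0 if "not" in text.split() else 1

def length_model(text):
    return 1 if len(text.split()) <= 3 else 0

BASE_MODELS = [keyword_model, negation_model, length_model]

def base_predictions(text):
    """The lower-level predictions become the ensemble's feature vector."""
    return tuple(model(text) for model in BASE_MODELS)

def fit_meta(texts, labels):
    """Learn, for each pattern of base predictions, the most common true label.
    Unseen patterns fall back to the overall majority label."""
    by_pattern = defaultdict(Counter)
    for text, label in zip(texts, labels):
        by_pattern[base_predictions(text)][label] += 1
    table = {p: c.most_common(1)[0][0] for p, c in by_pattern.items()}
    default = Counter(labels).most_common(1)[0][0]
    return lambda text: table.get(base_predictions(text), default)

# Made-up training reviews with 1 = positive, 0 = negative.
train_texts = ["good film", "not good film", "bad film", "good fun",
               "not bad at all really"]
train_labels = [1, 0, 0, 1, 1]

ensemble = fit_meta(train_texts, train_labels)
prediction = ensemble("not good")  # keyword_model alone would call this positive
```

Here the meta-model learns that a positive keyword combined with a negation signal tends to mean a negative review, something no single base rule captures. When the base models are themselves trained, the meta-model is usually fit on out-of-fold predictions so that it does not learn from predictions made on data the base models have already seen.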
