Introduction to Data Science: Assignment 1 (Twitter Sentiment Analysis in Python)

This is a fun assignment because we get to work with real data: Twitter’s live stream, that is, real-time tweets. The data is obtained by accessing the Twitter Application Programming Interface (API) using Python, so the assignment requires some Python programming knowledge.

Get Twitter live stream data

First, we have to get the data from Twitter onto our computer; then we will analyze it. The files provided include a script file. Running this script fetches data from Twitter and stores it in a file on your computer.

Before running the script, we need an API key, API secret, access token, and access token secret in order to access the Twitter API. These can be obtained by signing in here, clicking “Create New App”, then opening the “API keys” tab and clicking “Create My Access Token”. After obtaining the key and tokens, fill in these values in your script. We are now ready to get the live stream data by running the script from the command prompt.

This writes the Twitter live stream data to the output.txt file. Press Ctrl + C to stop the live stream when you feel that you have enough data.

Analyze the Twitter live stream data

If you open your output.txt file, you will see many lines of data; each line holds the data for one tweet. Each line is a string in JSON format, a standard way to store information. You can see fields such as created_at, id, text, etc. To understand what these fields mean, refer to the documentation.

Before accessing the JSON string, we need to parse it using the json.loads function (remember to import the json library). After parsing, we can access each piece of information by its key.

Here, parsed is the dictionary that json.loads returns. The “text” key is accessed, and encode('utf-8') is used so that international characters print properly. Since the text consists of many terms (words), we use the split() function to separate the terms, which gives a list of terms.
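A minimal sketch of the parsing step described above. The sample line is made up for illustration (real lines in output.txt carry many more fields), and the sketch is written for Python 3, where the encode call is usually not needed for printing.

```python
import json

# A made-up line in the shape described above, not real stream output.
sample_line = (
    '{"created_at": "Mon Jan 01 00:00:00 +0000 2014", '
    '"id": 1, "text": "I love data science"}'
)

parsed = json.loads(sample_line)   # dictionary holding the tweet's fields
text = parsed.get("text", "")      # fall back to "" if a line has no "text" key
terms = text.split()               # list of whitespace-separated terms

print(terms)
```

Using get with a default is a defensive choice here, on the assumption that not every line in the stream is guaranteed to carry a “text” field.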

Calculating the sentiment score for each tweet

We are supplied with an afinn-111.txt file. Each line of the file consists of a term and its sentiment score, separated by a tab. This hand-made file maps each term to a positive or negative score (from -5 to 5, with 5 the most positive and -5 the most negative). The idea is that, since we have both the text of each tweet and the sentiment scores, we can match each term of a tweet with its sentiment score and add up the scores to get a sentiment score for the tweet. Terms not found in afinn-111.txt are assumed to have a sentiment score of 0.
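The scoring idea can be sketched roughly as follows. The function names and the three sample lines are my own for illustration (they stand in for the contents of afinn-111.txt), and lower-casing the tweet text before lookup is an assumption, not something the assignment prescribes.

```python
def load_scores(lines):
    """Build a {term: score} dictionary from 'term<TAB>score' lines."""
    scores = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        term, score = line.split("\t")
        scores[term] = int(score)
    return scores


def tweet_sentiment(text, scores):
    """Sum the scores of the tweet's terms; unknown terms count as 0."""
    return sum(scores.get(term, 0) for term in text.lower().split())


# Illustrative sample lines; with the real file you would pass an open
# file object instead, e.g. load_scores(open("afinn-111.txt")).
sample_lines = ["good\t3\n", "bad\t-3\n", "awesome\t4\n"]
scores = load_scores(sample_lines)

print(tweet_sentiment("What a good and awesome day", scores))  # 7
```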

Guess the sentiment score of new terms

Some of the terms in tweets may not be found in the afinn-111.txt file. To estimate the sentiment score of a non-afinn-111 term, we divide the sum of the sentiment scores of the tweets that contain the term by the number of tweets that contain the term. For example, among 10 tweets, a non-afinn-111 term “soccer” appears in 3 tweets, and those tweets have sentiment scores of 7, 3, and -1. Using the formula, $\frac{7+3-1}{3} = 3$, so the sentiment score of the term “soccer” is 3. I got this technique from here (slide 12).
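Here is a small sketch of that averaging, reproducing the “soccer” example. The function name and the toy lexicon are hypothetical, and counting each tweet only once per term (even if the term repeats within the tweet) is my reading of the formula.

```python
from collections import defaultdict


def estimate_new_term_scores(tweets, scores):
    """Average the scores of the tweets containing each unknown term."""
    totals = defaultdict(int)   # sum of tweet scores per unknown term
    counts = defaultdict(int)   # number of tweets containing the term
    for text in tweets:
        terms = text.lower().split()
        tweet_score = sum(scores.get(term, 0) for term in terms)
        for term in set(terms):          # each tweet counts once per term
            if term not in scores:
                totals[term] += tweet_score
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}


# Toy lexicon chosen so the three tweets score 7, 3 and -1.
toy_scores = {"great": 4, "win": 3, "good": 3, "meh": -1}
toy_tweets = ["soccer great win", "soccer good", "soccer meh"]

estimates = estimate_new_term_scores(toy_tweets, toy_scores)
print(estimates["soccer"])  # (7 + 3 - 1) / 3 = 3.0
```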

Compute term frequency

This is easy: we divide the number of times a term occurs in all tweets by the total number of terms in all tweets. For example, if the term “I” appears 30 times in 10 tweets, and there are 100 terms in those 10 tweets, the frequency of the term “I” is $\frac{30}{100} = 0.3$.
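As a rough sketch, the frequency computation could look like this. The function name and the sample tweets are made up for illustration.

```python
from collections import Counter


def term_frequencies(tweets):
    """Return {term: occurrences / total number of terms in all tweets}."""
    counts = Counter(term for text in tweets for term in text.split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


sample_tweets = ["I like tea", "I like coffee", "I run"]
freqs = term_frequencies(sample_tweets)

print(freqs["I"])  # 3 occurrences out of 8 terms = 0.375
```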

Which state is happiest?

The parsed JSON also contains information such as “location”. Location is nested inside the user object, so it is accessed through the “user” key first.

I split the location because it may contain more than one word. To keep the problem manageable, we concern ourselves only with the United States. We scan the location for a two-letter abbreviation and check whether it is a state using a Python dictionary of state abbreviations. Then we add up the sentiment scores of the tweets from each state and print the state with the highest total sentiment score to the command prompt.
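The state lookup and the per-state totals might be sketched like this. The STATES dictionary is deliberately truncated to three entries, the helper names are hypothetical, and the tweets are hand-written dictionaries standing in for parsed stream lines. Matching abbreviations case-sensitively is my own choice, so that ordinary words such as “in” are not mistaken for a state.

```python
from collections import defaultdict

# Truncated for illustration; a full version would list every state.
STATES = {"CA": "California", "NY": "New York", "TX": "Texas"}


def find_state(location):
    """Return the first two-letter state abbreviation found, or None."""
    for word in location.split():
        word = word.strip(",.")
        if len(word) == 2 and word in STATES:
            return word
    return None


def happiest_state(tweets, scores):
    """Return the state with the highest total sentiment score, or None."""
    totals = defaultdict(int)
    for tweet in tweets:
        location = (tweet.get("user") or {}).get("location") or ""
        state = find_state(location)
        if state is None:
            continue
        terms = tweet.get("text", "").lower().split()
        totals[state] += sum(scores.get(term, 0) for term in terms)
    return max(totals, key=totals.get) if totals else None


toy_scores = {"good": 3, "bad": -3, "awesome": 4}
toy_tweets = [
    {"text": "good day", "user": {"location": "San Francisco, CA"}},
    {"text": "bad traffic", "user": {"location": "Austin, TX"}},
    {"text": "awesome good", "user": {"location": "CA"}},
    {"text": "good", "user": {"location": None}},
]

print(happiest_state(toy_tweets, toy_scores))  # CA
```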

Top ten hash tags

This is simple: loop through each word of each tweet and find the words that start with “#”. Create a dictionary to store the hash tags and their counts. Then print the ten hash tags that have the most occurrences.
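One possible sketch of the counting step, using collections.Counter as the dictionary. The function name and the sample tweets are invented for illustration, and skipping a bare “#” with nothing after it is my own assumption.

```python
from collections import Counter


def top_hashtags(tweets, n=10):
    """Return the n most frequent hash tags as (tag, count) pairs."""
    counts = Counter(
        word
        for text in tweets
        for word in text.split()
        if word.startswith("#") and len(word) > 1
    )
    return counts.most_common(n)


sample_tweets = ["#python is fun", "learning #python and #data", "#data #python"]

print(top_hashtags(sample_tweets))  # [('#python', 3), ('#data', 2)]
```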

Overall, the concept of the assignment is quite simple; the harder part is the execution of the concept, i.e. the programming. I learnt how to use an API and read documentation, and it also refreshed my memory of Python. Lastly, raw data may be messy: for example, the text may be in a different language, or the location may not be supplied. As a first step in doing data science, we need to clean the data. This can be done by keeping the tweets that we want to analyze and discarding those that cannot be analyzed.

The filter keeps only tweets that are in English (‘en’) and have a non-empty location.
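A minimal sketch of such a cleaning filter, not necessarily the exact code used for the assignment. It assumes the language code is stored under a top-level “lang” key and the location under the nested user object; the helper names and the sample lines are made up for illustration.

```python
import json


def keep_tweet(tweet):
    """True if the tweet is in English and has a non-empty location."""
    user = tweet.get("user") or {}
    return tweet.get("lang") == "en" and bool(user.get("location"))


def load_clean_tweets(lines):
    """Parse JSON lines and keep only the tweets that pass keep_tweet."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip lines that are not valid JSON
        if isinstance(tweet, dict) and keep_tweet(tweet):
            tweets.append(tweet)
    return tweets


sample_lines = [
    '{"lang": "en", "text": "hello", "user": {"location": "Austin, TX"}}',
    '{"lang": "es", "text": "hola", "user": {"location": "Madrid"}}',
    '{"lang": "en", "text": "hi", "user": {"location": ""}}',
    'not json at all',
]

clean = load_clean_tweets(sample_lines)
print(len(clean))  # 1
```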
