First, process OpinionFinder_Lexicon.tff and calculate the sentiment strength and polarity of each word. The sentiment score of a single word can be:
Neutral = ignore ; Weak Positive = 1 ; Weak Negative = -1 ; Strong Positive = 3 ; Strong Negative = -3
# prepare sentiment dictionary from the OpinionFinder lexicon
senti = open('OpinionFinder_Lexicon.tff', 'r').read().splitlines()
sentiment = {}
for line in senti:
    tokens = line.split(' ')
    if tokens[5] == 'priorpolarity=neutral':   # neutral words are ignored
        continue
    term = tokens[2].replace('word1=', '')     # the word itself
    if tokens[0] == 'type=weaksubj':           # weak subjectivity -> strength 1
        score = 1
    elif tokens[0] == 'type=strongsubj':       # strong subjectivity -> strength 3
        score = 3
    else:
        continue                               # skip unexpected entries
    if tokens[5] == 'priorpolarity=negative':
        polarity = -1
    elif tokens[5] == 'priorpolarity=positive':
        polarity = 1
    else:
        continue                               # skip any other polarity value
    sentiment[term] = polarity * score
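As a quick sanity check, you can look up a few words in the finished dictionary. The words below are only illustrative choices; whether they appear, and with what score, depends on the lexicon file.
# sanity check: look up a few example words (illustrative choices;
# actual entries and scores depend on the lexicon file)
for w in ('good', 'bad', 'excellent', 'terrible'):
    print(w, '->', sentiment.get(w, 'not in lexicon'))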
Define a function to calculate the sentiment score of a piece of text. The net sentiment score is the sum of the scores of all words found in the lexicon.
Optional: print out the sentiment score of each matched word.
import re

def calculate_sentiment(d):
    tokens = re.findall(r'\w+', d.lower())   # split the text into lowercase word tokens
    sentiment_score = 0
    for token in tokens:
        if token in sentiment:               # only words found in the lexicon contribute
            sentiment_score = sentiment_score + sentiment[token]
            #print(token, sentiment[token])  # optional: print each matched word and its score
    return sentiment_score
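To see the function in action, you can call it on a short piece of sample text. The sentence below is made up, and the exact score depends on which of its words appear in the lexicon.
# example call on a made-up sentence; the result depends on the lexicon contents
sample = 'The company reported excellent growth despite a terrible quarter.'
print(calculate_sentiment(sample))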
Open up news.json (continuing from News Article Analysis 1.0).
Recall that all_articles is structured as below:
all_articles = [[date, title, content, link], [date, ..., ..., ...], ...]
# Open JSON file
import json
with open('news.json') as f:
    all_articles = json.load(f)
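It is worth confirming that the load worked and that each record follows the [date, title, content, link] layout. A minimal check, assuming the file contains at least one article:
# quick check of the loaded data (assumes at least one article is present)
print(len(all_articles), 'articles loaded')
print(all_articles[0][0], '-', all_articles[0][1])   # date and title of the first article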
First, sort all articles by date, then loop through all_articles and analyse the content of each article.
Gather the date and score and save them as a tuple, date_score.
Append date_score to the date_score_table list.
Optional: print out the first and last 5 titles, scores, dates and URLs for checking.
all_articles.sort()   # sort by date (the date string is the first element of each record)
date_score_table = []
for i in range(len(all_articles)):
    d = all_articles[i][2]                       # content
    sentiment_score = calculate_sentiment(d)     # sentiment score of the content
    if i < 5 or i > (len(all_articles) - 6):     # only print the first and last 5 for review
        print(str(i + 1) + ') ' + all_articles[i][1])                                              # title
        print(' Sentiment Score = ' + str(sentiment_score) + ' -------- ' + all_articles[i][0])    # score and date
        print(all_articles[i][3] + '\n')                                                           # url for checking
    date_score = (all_articles[i][0], sentiment_score)
    date_score_table.append(date_score)
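Optionally, you can summarise how the articles split into positive, negative and neutral before charting. This is just a small convenience sketch over date_score_table:
# optional summary: count positive, negative and neutral articles
pos = sum(1 for _, s in date_score_table if s > 0)
neg = sum(1 for _, s in date_score_table if s < 0)
neu = len(date_score_table) - pos - neg
print('Positive:', pos, ' Negative:', neg, ' Neutral:', neu)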
Load date_score_table into a Pandas DataFrame, assigning the column names date and score.
Save a copy to disk in CSV format.
Optional: print out the first and last 5 date & score results.
import pandas as pd

df = pd.DataFrame(date_score_table, columns=('date', 'score'))
df.to_csv('date_score.csv')   # save a copy to disk in CSV format
print(df.head(5))
print(df.tail(5))
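Optionally, a quick statistical summary of the scores gives a feel for their spread before plotting:
# optional: basic statistics of the sentiment scores
print(df['score'].describe())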
Visualize the sentiment scores in a time-series plot. From the chart, we can analyse the keyword that we searched for and see how often it appears in positive articles compared to negative articles.
Example: the keyword searched here is Petronas. Although the overall sentiment score is positive, we can see the number of negative articles increasing recently.
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import rcParams
rcParams['figure.figsize'] = 16, 8

df = pd.read_csv('date_score.csv', parse_dates=True, index_col=1)   # use the date column as the index
df['score'].plot(style='.')                   # one dot per article
plt.axhline(y=0, color='b', linestyle='-')    # zero line separates positive from negative
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.title('News Sentiment Score')
plt.show()
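To make the recent trend easier to see, you can overlay a rolling mean on the same chart. This is an optional extra; the 10-article window below is an arbitrary choice.
# optional: overlay a rolling mean (a window of 10 articles is an arbitrary choice)
rolling = df['score'].rolling(window=10).mean()
ax = df['score'].plot(style='.', label='score')
rolling.plot(ax=ax, color='r', label='10-article rolling mean')
plt.axhline(y=0, color='b', linestyle='-')
plt.xlabel('Date')
plt.ylabel('Sentiment Score')
plt.legend()
plt.show()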