News Article Analysis 2.0

Natural Language Processing (NLP)

In this step of the text analysis, we count word frequencies and visualize them in a word cloud.

Open News Article data saved earlier

First, we open news.json, continuing from News Article Analysis 1.0.

Recall that all_articles is structured as follows:

all_articles = [[date, title, content, link],[date, ..., ..., ...],....,....]
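
As a quick illustration of this structure, the sketch below uses two made-up articles (the dates, titles, and links are hypothetical placeholders, not real data). It shows how each inner list unpacks into its four fields and how an entry with missing content can be skipped. That some entries lack content is an assumption here, suggested by the TypeError guard used later in this notebook.

```python
# Hypothetical sample in the same [date, title, content, link] layout
sample_articles = [
    ["2018-01-01", "Title A", "Oil prices rose.", "http://example.com/a"],
    ["2018-01-02", "Title B", None, "http://example.com/b"],  # content missing
]

contents = []
for date, title, content, link in sample_articles:
    # Keep only articles whose content is actually a string
    if isinstance(content, str):
        contents.append(content)

print(contents)  # ['Oil prices rose.']
```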

In [2]:
# Open JSON file
import json
with open('news.json') as f:
    all_articles = json.load(f)

Regular expressions (re) - used for searching and matching text.

Counter (from collections) - counts word frequencies.

stopwords - the list of stopwords to remove from the article content.

text = '' - holds the content of every article as a single string.

tokens - the tokenized words, excluding stopwords and numbers.

len(tokens) - the total number of words.

Counter(tokens).most_common(100) - the 100 most common words.

In [4]:
import re
from collections import Counter

# Load the stopwords into a set for fast membership tests
with open('stopwords.txt', 'r') as f:
    stopwords = set(f.read().splitlines())

text = ''
for article in all_articles:
    try:
        text += article[2] + ' '    # join all article content into one space-separated string
    except TypeError:               # skip articles whose content is not a string
        continue

tokens = re.findall(r'\w+', text)   # returns a list of strings, one per word
tokens = [t.lower() for t in tokens if t.lower() not in stopwords]  # lowercase and remove stopwords

# Remove all pure numbers (e.g. 2018, 1, 10, 100)
def is_number(word):
    try:
        int(word)
        return True
    except ValueError:
        return False

tokens = [t for t in tokens if not is_number(t)]

print('\n' + 'Total number of words = ' + str(len(tokens)) + '\n')
print(Counter(tokens).most_common(100))
Total number of words = 54943

[('petronas', 1579), ('oil', 844), ('gas', 529), ('government', 368), ('malaysia', 346), ('project', 297), ('lng', 286), ('sarawak', 260), ('prices', 248), ('industry', 222), ('crude', 216), ('wan', 165), ('lumpur', 161), ('added', 152), ('production', 147), ('development', 143), ('tax', 140), ('energy', 140), ('nasional', 138), ('petroleum', 134), ('lower', 133), ('kuala', 127), ('companies', 127), ('national', 124), ('rose', 123), ('growth', 121), ('day', 120), ('chief', 119), ('fell', 117), ('aramco', 115), ('rm1', 115), ('datuk', 114), ('petroliam', 112), ('president', 112), ('contract', 112), ('revenue', 111), ('petrochemical', 111), ('research', 110), ('federal', 109), ('cost', 109), ('refinery', 109), ('minister', 106), ('executive', 106), ('natural', 103), ('activities', 99), ('zulkiflee', 99), ('supply', 99), ('global', 98), ('investment', 98), ('rapid', 96), ('statement', 94), ('increase', 94), ('rm5', 94), ('earnings', 93), ('barrel', 92), ('bank', 91), ('country', 90), ('due', 89), ('upstream', 89), ('profit', 89), ('projects', 88), ('dr', 88), ('klci', 87), ('budget', 86), ('continue', 86), ('canada', 86), ('demand', 86), ('time', 86), ('services', 84), ('exploration', 84), ('term', 83), ('sabah', 83), ('capital', 82), ('lost', 81), ('international', 80), ('decision', 79), ('compared', 78), ('resources', 78), ('world', 78), ('fuel', 77), ('malaysian', 75), ('integrated', 75), ('complex', 75), ('officer', 74), ('local', 74), ('products', 74), ('saudi', 73), ('pengerang', 73), ('stations', 73), ('rm2', 73), ('cash', 72), ('june', 72), ('sector', 72), ('including', 72), ('rm4', 72), ('dagangan', 72), ('dividend', 70), ('billion', 69), ('asia', 69), ('top', 68)]
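
Tokens such as rm1 and rm5 in this list are probably fragments of ringgit amounts: \w+ splits on the decimal point, so a string like RM1.5 would yield rm1 and 5, and the number filter then drops the 5. The sketch below reproduces this on a made-up sentence. The sentence and the tiny stopword set are illustrative assumptions, and str.isdigit() stands in for the number check.

```python
import re
from collections import Counter

# Made-up sentence, used only to show how amounts are split
sample = "Petronas posted RM1.5 billion profit in 2018, and RM1.2 billion in 2017."
sample_stopwords = {"in", "and"}  # tiny illustrative stopword set

# Tokenize and lowercase, then drop stopwords and pure numbers
words = [w.lower() for w in re.findall(r"\w+", sample)]
words = [w for w in words if w not in sample_stopwords and not w.isdigit()]

print(words)  # ['petronas', 'posted', 'rm1', 'billion', 'profit', 'rm1', 'billion']
print(Counter(words).most_common(2))
```

If such fragments are not wanted, they can be removed in the optional filtering step that follows.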

Optional step: Manually remove non-meaningful words

filter_word - stores the words to remove, typed in manually and separated by spaces.

Then count the total number of words again and show the 100 most common.

In [5]:
# Exclude words you do not find meaningful

print('Key in the words you wish to exclude from the list')
filter_word = input()
filter_word = set(filter_word.split())

# Keep only the tokens that are not in the exclusion set
tokens = [t for t in tokens if t not in filter_word]

print('\n' + 'Total number of words = ' + str(len(tokens)) + '\n')
print(Counter(tokens).most_common(100))
Key in the words you wish to exclude from the list
billion rm4 including rm2 june compared decision term dr due rm5 statement rm1 prices

Total number of words = 53617

[('petronas', 1579), ('oil', 844), ('gas', 529), ('government', 368), ('malaysia', 346), ('project', 297), ('lng', 286), ('sarawak', 260), ('industry', 222), ('crude', 216), ('wan', 165), ('lumpur', 161), ('added', 152), ('production', 147), ('development', 143), ('tax', 140), ('energy', 140), ('nasional', 138), ('petroleum', 134), ('lower', 133), ('kuala', 127), ('companies', 127), ('national', 124), ('rose', 123), ('growth', 121), ('day', 120), ('chief', 119), ('fell', 117), ('aramco', 115), ('datuk', 114), ('petroliam', 112), ('president', 112), ('contract', 112), ('revenue', 111), ('petrochemical', 111), ('research', 110), ('federal', 109), ('cost', 109), ('refinery', 109), ('minister', 106), ('executive', 106), ('natural', 103), ('activities', 99), ('zulkiflee', 99), ('supply', 99), ('global', 98), ('investment', 98), ('rapid', 96), ('increase', 94), ('earnings', 93), ('barrel', 92), ('bank', 91), ('country', 90), ('upstream', 89), ('profit', 89), ('projects', 88), ('klci', 87), ('budget', 86), ('continue', 86), ('canada', 86), ('demand', 86), ('time', 86), ('services', 84), ('exploration', 84), ('sabah', 83), ('capital', 82), ('lost', 81), ('international', 80), ('resources', 78), ('world', 78), ('fuel', 77), ('malaysian', 75), ('integrated', 75), ('complex', 75), ('officer', 74), ('local', 74), ('products', 74), ('saudi', 73), ('pengerang', 73), ('stations', 73), ('cash', 72), ('sector', 72), ('dagangan', 72), ('dividend', 70), ('asia', 69), ('top', 68), ('downstream', 68), ('public', 67), ('programme', 67), ('dollar', 67), ('set', 66), ('month', 66), ('reported', 66), ('carigali', 66), ('rm3', 66), ('partners', 66), ('china', 65), ('total', 64), ('subsidiary', 64), ('financial', 63)]

Optional step: Remove more words if needed

Referring to the result above, exclude more non-meaningful words.

In [6]:
# Referring to the result above, exclude more words you do not find meaningful

print('Key in the words you wish to exclude from the list')
filter_word = input()
filter_word = set(filter_word.split())

# Keep only the tokens that are not in the exclusion set
tokens = [t for t in tokens if t not in filter_word]

print('\n' + 'Total number of words = ' + str(len(tokens)) + '\n')
print(Counter(tokens).most_common(100))
Key in the words you wish to exclude from the list
rm3 month set total

Total number of words = 53355

[('petronas', 1579), ('oil', 844), ('gas', 529), ('government', 368), ('malaysia', 346), ('project', 297), ('lng', 286), ('sarawak', 260), ('industry', 222), ('crude', 216), ('wan', 165), ('lumpur', 161), ('added', 152), ('production', 147), ('development', 143), ('tax', 140), ('energy', 140), ('nasional', 138), ('petroleum', 134), ('lower', 133), ('kuala', 127), ('companies', 127), ('national', 124), ('rose', 123), ('growth', 121), ('day', 120), ('chief', 119), ('fell', 117), ('aramco', 115), ('datuk', 114), ('petroliam', 112), ('president', 112), ('contract', 112), ('revenue', 111), ('petrochemical', 111), ('research', 110), ('federal', 109), ('cost', 109), ('refinery', 109), ('minister', 106), ('executive', 106), ('natural', 103), ('activities', 99), ('zulkiflee', 99), ('supply', 99), ('global', 98), ('investment', 98), ('rapid', 96), ('increase', 94), ('earnings', 93), ('barrel', 92), ('bank', 91), ('country', 90), ('upstream', 89), ('profit', 89), ('projects', 88), ('klci', 87), ('budget', 86), ('continue', 86), ('canada', 86), ('demand', 86), ('time', 86), ('services', 84), ('exploration', 84), ('sabah', 83), ('capital', 82), ('lost', 81), ('international', 80), ('resources', 78), ('world', 78), ('fuel', 77), ('malaysian', 75), ('integrated', 75), ('complex', 75), ('officer', 74), ('local', 74), ('products', 74), ('saudi', 73), ('pengerang', 73), ('stations', 73), ('cash', 72), ('sector', 72), ('dagangan', 72), ('dividend', 70), ('asia', 69), ('top', 68), ('downstream', 68), ('public', 67), ('programme', 67), ('dollar', 67), ('reported', 66), ('carigali', 66), ('partners', 66), ('china', 65), ('subsidiary', 64), ('financial', 63), ('operations', 63), ('stake', 62), ('offshore', 62), ('costs', 62)]
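
The two filtering cells repeat the same steps, so they could be wrapped in a small helper that is called as many times as needed. This is only a sketch: remove_words and demo_tokens are hypothetical names, not part of the original notebook, and the demo list is made up.

```python
from collections import Counter

def remove_words(tokens, unwanted):
    """Return a new token list without any word listed in `unwanted`."""
    unwanted = set(unwanted)  # a set makes each membership test fast
    return [t for t in tokens if t not in unwanted]

# Made-up tokens standing in for the real list
demo_tokens = ["oil", "rm3", "gas", "month", "oil", "total"]
demo_tokens = remove_words(demo_tokens, "rm3 month set total".split())

print(len(demo_tokens))                     # 3
print(Counter(demo_tokens).most_common(2))  # [('oil', 2), ('gas', 1)]
```

Because the helper returns a new list instead of modifying the input, the unfiltered tokens stay available under their original name if they are needed again.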

Word Cloud

Use the wordcloud library and Matplotlib to visualize the word frequencies.

In [8]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join the tokens back into one space-separated string for WordCloud
all_words = ' '.join(tokens)

# Generate the word cloud image
wordcloud = WordCloud(background_color="white", width=1600, height=800).generate(all_words)

# Plot the generated image
plt.figure(figsize=(16, 8))
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
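
WordCloud.generate() tokenizes the joined string again and, with its default settings, may also count common two-word phrases, so the word sizes in the image need not match the Counter results exactly. If the cloud should reflect the counts computed earlier, one option is to pass them directly through generate_from_frequencies. A minimal sketch, assuming the wordcloud package is installed; build_cloud and demo_tokens are illustrative names, not part of the original notebook.

```python
from collections import Counter

def build_cloud(tokens, max_words=100):
    """Build a WordCloud from token counts (requires the wordcloud package)."""
    from wordcloud import WordCloud  # imported here so the counting step works without it
    frequencies = dict(Counter(tokens).most_common(max_words))
    cloud = WordCloud(background_color="white", width=1600, height=800)
    return cloud.generate_from_frequencies(frequencies)

# Hypothetical stand-in for the real tokens list
demo_tokens = ["oil", "gas", "oil", "lng"]
demo_frequencies = dict(Counter(demo_tokens).most_common(100))
print(demo_frequencies["oil"])  # 2
```

The object returned by build_cloud(tokens) can be passed to plt.imshow() in the same way as the wordcloud variable in the cell above.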