Naive Bayes
This section builds on the last two tutorials: we choose an algorithm, separate the data into training and testing sets, and set it running.
The algorithm in this example is the Naive Bayes classifier.
But first the data needs to be split into training and test sets for supervised machine learning. In essence, we show the machine some data and tell it “this data is positive” or “this data is negative.” Then, after the training is done, we show the machine some new data and ask it which category it thinks the new data belongs to.
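The idea of holding back part of the labelled data can be sketched in a few lines of plain Python. The toy `labelled` list and the `split_data` helper are hypothetical, made up purely for illustration; the tutorial's real split is done on `featuresets` in the listing that follows.

```python
import random

def split_data(items, n_train, seed=None):
    """Shuffle a copy of items and split it into (training, testing) lists."""
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded only to make the demo repeatable
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical labelled examples: (text, category)
labelled = [
    ("a wonderful, moving film", "pos"),
    ("great cast and a clever script", "pos"),
    ("one of the best this year", "pos"),
    ("dull, shoddy and unimaginative", "neg"),
    ("the plot is idiotic", "neg"),
    ("I was bored throughout", "neg"),
]

training, testing = split_data(labelled, n_train=4, seed=42)
print(len(training), "training examples,", len(testing), "testing examples")
# prints: 4 training examples, 2 testing examples
```

The machine only ever sees the labels of the training part; the testing part is kept aside so we can check its guesses against labels it has not seen.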
import nltk
import random
from nltk.corpus import movie_reviews

# Build a list of (word list, category) pairs, one per review
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())  # normalise everything to lower case and append
all_words = nltk.FreqDist(all_words)  # convert to an NLTK frequency distribution

# Keep 3000 words (keys) as features. Note: in NLTK 3, keys() is not sorted by
# frequency; for the 3000 most frequent words use
# [w for w, _ in all_words.most_common(3000)] instead.
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)  # the unique words in the document (duplicates removed)
    features = {}  # declare an empty dictionary
    for w in word_features:
        # True if this feature word is present in the passed 'document', else False
        features[w] = (w in words)
    return features

# print(find_features(movie_reviews.words('neg/cv000_29416.txt')))

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# Split the featuresets into two separate groups: one to train, the other to test
training_set = featuresets[:1900]
testing_set = featuresets[1900:]

## Naive Bayes algorithm
classifier = nltk.NaiveBayesClassifier.train(training_set)  # train on the training data
print("Naive Bayes Algo accuracy:", (nltk.classify.accuracy(classifier, testing_set))*100)
# Show the 15 features that most strongly separate pos from neg
classifier.show_most_informative_features(15)
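To get a feel for what a Naive Bayes classifier does with true/false feature dictionaries like the ones above, here is a minimal from-scratch sketch. It is not NLTK's implementation: the function names `train_naive_bayes` and `classify` are hypothetical, the add-one smoothing is an assumption chosen for simplicity, and the toy data is invented. The idea is to score each label by its prior probability multiplied by the probability of every feature value given that label, and pick the label with the highest score.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labelled_featuresets):
    """Count labels and, per label, how often each feature is True."""
    label_counts = Counter()
    true_counts = defaultdict(Counter)   # true_counts[label][feature]
    for features, label in labelled_featuresets:
        label_counts[label] += 1
        for name, present in features.items():
            if present:
                true_counts[label][name] += 1
    return label_counts, true_counts

def classify(features, label_counts, true_counts):
    """Return the label with the highest log-probability score."""
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in label_counts.items():
        score = math.log(n / total)      # log prior for this label
        for name, present in features.items():
            # Add-one smoothing (an assumption here) avoids log(0)
            p_true = (true_counts[label][name] + 1) / (n + 2)
            score += math.log(p_true if present else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data in the same (features, category) shape
toy_training = [
    ({"great": True, "sucks": False}, "pos"),
    ({"great": True, "sucks": False}, "pos"),
    ({"great": False, "sucks": True}, "neg"),
    ({"great": False, "sucks": True}, "neg"),
]
label_counts, true_counts = train_naive_bayes(toy_training)
print(classify({"great": True, "sucks": False}, label_counts, true_counts))
# prints: pos
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature probabilities can simply be multiplied (added, in log space).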
Output (the accuracy and the feature list vary between runs because the documents are shuffled):
galiquis@raspberrypi: $ python3 ./nltk_tutorial13.py
Naive Bayes Algo accuracy: 80.0
Most Informative Features
                  annual = True              pos : neg    =      9.6 : 1.0
                   sucks = True              neg : pos    =      9.1 : 1.0
                bothered = True              neg : pos    =      9.1 : 1.0
                 frances = True              pos : neg    =      8.9 : 1.0
                 idiotic = True              neg : pos    =      8.8 : 1.0
           unimaginative = True              neg : pos    =      8.4 : 1.0
             silverstone = True              neg : pos    =      7.7 : 1.0
                  shoddy = True              neg : pos    =      7.1 : 1.0
                  suvari = True              neg : pos    =      7.1 : 1.0
                    mena = True              neg : pos    =      7.1 : 1.0
                  sexist = True              neg : pos    =      7.1 : 1.0
                  regard = True              pos : neg    =      6.9 : 1.0
              schumacher = True              neg : pos    =      6.7 : 1.0
              uninspired = True              neg : pos    =      6.6 : 1.0
                 kidding = True              neg : pos    =      6.4 : 1.0
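Each line of that table compares how likely a feature value is under one category versus the other. As a rough sketch of the ratio, with made-up review counts and a hypothetical `likelihood_ratio` helper (NLTK's own probability estimates are smoothed, so its figures will not match a raw count ratio exactly):

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature is True | a) to P(feature is True | b)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: the word appears in 27 of 950 negative training
# reviews and in 3 of 950 positive training reviews.
ratio = likelihood_ratio(27, 950, 3, 950)
print(f"neg : pos = {ratio:.1f} : 1.0")
# prints: neg : pos = 9.0 : 1.0
```

So a line such as `sucks = True  neg : pos = 9.1 : 1.0` reads as: in the training data, "sucks" being present was about nine times as likely in a negative review as in a positive one.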