End-To-End Amazon Product Rating Using Multinomial Naive Bayes Algorithm and CountVectorizer
Application of Machine Learning and Natural Language Processing | Maximilien Kpizingui | Submission DPhi October 2021 | Deep Learning & IoT Engineer
In this pandemic era, when everybody has to cover their nose and turn to online shopping, your product is not what you say about it; it is what Google says about it. Let me ask you a question as a consumer of an end product: when you go online to purchase a product, what is the first thing you do? Do you buy an item outright, even though you cannot interact with it physically, or do you take your time to read through the comments and experiences other end users have shared about that same product before purchasing? If you belong to the second category, you are on the winning side in this digital age. With this in mind, we are going to use a mathematical model to predict the rating of a product out of 5, based on sample product information collected from end consumers on Amazon, using machine learning and natural language processing with RMSE as the evaluation metric.
Contents
1. What is a product review
2. Importance of product reviews
3. Top product review platforms
4. What is the multinomial Naive Bayes algorithm
5. What is CountVectorizer
6. Code implementation of CountVectorizer
7. Implementation of product rating using the multinomial Naive Bayes algorithm
1. What is a product review
In electronic commerce, a product review is a section on a shopping website that gives customers the opportunity to rate and comment on a product they have purchased, which other end consumers then read to decide whether to purchase the same item for their own needs.
2. Importance of product reviews
Drive sales
Build trust
Aid customer decision making
Credibility and social proof
3. Top product review platforms
Bazaarvoice
Yotpo
Trustpilot
PowerReviews
Reevoo
Feefo
4. What is the multinomial Naive Bayes algorithm
Multinomial Naive Bayes is a probabilistic algorithm based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features. It is designed to handle text corpora, using word counts to compute the probability given by:
P(c|x) = P(x|c) * P(c) / P(x)
Where c is one of the possible outcome classes and x is the given instance to be classified, represented by its features.
If you want to go deep into the mathematics, I refer you to Nagesh Singh Chauhan's post on the DPhi platform.
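To make the word-count formulation concrete, here is a minimal sketch (with toy counts and labels made up for illustration, not from the article's dataset) of how scikit-learn's MultinomialNB classifies documents represented as count vectors:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts over a tiny 3-word vocabulary
X = np.array([[2, 1, 0],   # e.g. "good good product"
              [0, 1, 2],   # e.g. "product bad bad"
              [3, 0, 0]])  # e.g. "good good good"
y = [1, 0, 1]              # hypothetical labels: 1 = positive, 0 = negative

clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict([[1, 1, 0]]))        # most likely class for a new document
print(clf.predict_proba([[1, 1, 0]]))  # P(c|x) for each class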
5. What is CountVectorizer
CountVectorizer is a great tool provided by the scikit-learn library in Python. It converts a given text document into a numerical vector based on the frequency of each word occurring in the text. This transformation is paramount at the early stage of a machine learning pipeline: it produces the feature representation of the raw text for machine learning tasks such as text classification and clustering, because machine learning algorithms can only operate on numerical features, irrespective of the input data fed into the model.
6. Code implementation of CountVectorizer
Consider a few sample texts from my IoT startup's corpus, given as elements of a list:
corpus= ["maxtek helps startup", "maxtek is into computer vision and IoT", "maxtek provides Deep learning and IoT solutions"]
CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the corpus by a row. The value of each cell is simply the count of that word in that particular text sample, as shown below:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["maxtek helps startup",
"maxtek is into computer vision and IoT",
"maxtek provides Deep learning and IoT solutions"]
# Create a Vectorizer Object
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)
# Encode the corpus
vector = vectorizer.transform(corpus)
# Summarizing the Encoded word in the corpus
print("Encoded corpus is:")
print(vector.toarray())
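Running this snippet should produce output along these lines (the vocabulary dict maps each word to its column index, with indices assigned alphabetically):
Encoded corpus is:
[[0 0 0 1 0 0 0 0 1 0 0 1 0]
 [1 1 0 0 1 1 1 0 1 0 0 0 1]
 [1 0 1 0 0 1 0 1 1 1 1 0 0]]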
Because most cells are zero, this way of representing text is known as a sparse matrix; toarray() converts it to a dense array for display.
Key observations:
There are 13 unique words in the corpus forming the vocabulary, represented as columns of the table.
There are 3 sentences in the corpus each represented as rows of the table.
Every cell contains a number that represents the count of the word in that particular text.
All words have been converted to lowercase.
The words in columns have been arranged alphabetically.
7. Implementation of product rating using multinomial Naive Bayes algorithm
- Importing all the libraries required to run this code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize, pos_tag
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import contractions
import nltk
import re
In case you run into a "package not found" error after running the code above, you can install the missing library as shown below:
!pip3 install name_of_the_library
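Similarly, several of the NLTK helpers used below (the stopword list, the tokenizers, the lemmatizer) depend on data packages that are not always installed by default. If you hit a LookupError, downloading them once should fix it:
import nltk
nltk.download('stopwords')  # used by cleanText() below
nltk.download('punkt')      # used by sent_tokenize / word_tokenize
nltk.download('wordnet')    # used by WordNetLemmatizer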
- Loading train and test dataset
train_df = pd.read_csv("Train_Data.csv")
test_df = pd.read_csv('Test_Data.csv')
- Displaying the first 5 rows of train_df
train_df.head()
- Displaying information about the feature set of the train dataframe
train_df.info()
- Visualizing the distribution of average_review_rating
plt.figure(figsize=(12,8))
train_df['average_review_rating'].value_counts().sort_index().plot(kind='bar')
plt.title('Distribution of average_review_rating')
plt.xlabel('average_review_rating')
plt.ylabel('Count')
- Visualizing the distribution of reviews for the top 50 products
products = train_df["product_name"].value_counts()
plt.figure(figsize=(12,8))
products[:50].plot(kind='bar')
plt.title("Number of Reviews for Top 50 products")
- Visualizing the distribution of reviews for the top 50 manufacturers
brands = train_df['manufacturer'].value_counts()
plt.figure(figsize=(12,8))
brands[:50].plot(kind='bar')
plt.title("Number of Reviews for Top 50 manufacturers")
- Visualizing the distribution of review lengths
review_length = train_df["customer_reviews"].dropna().map(lambda x: len(x))
plt.figure(figsize=(12,8))
review_length.loc[review_length < 1500].hist()
plt.title("Distribution of customer review Length")
plt.xlabel('Review length (number of characters)')
plt.ylabel('Count')
- Checking for NaN value in the train dataframe
train_df.isnull().sum()
- As shown above, we have many NaN values in the dataframe, so we are going to remove them using the dropna() method; if we kept them, they would introduce noise into our data.
train_df.dropna(inplace=True)
- Displaying customer_reviews in the dataframe, because we will be using it as input to our model. Recall that the system rates products based on customer reviews.
train_df['customer_reviews']
- Checking that no NaN values remain in the feature
train_df['customer_reviews'].isnull().sum()
That looks good. Let's proceed further!
- Defining a function cleanText() to remove special characters, HTML tags, etc. from our feature
def cleanText(raw_text, remove_stopwords=False, stemming=False, split_text=False):
    '''
    Convert a raw review to a cleaned review
    '''
    text = BeautifulSoup(raw_text, 'lxml').get_text()  # remove HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", text)      # remove non-letter characters
    words = letters_only.lower().split()               # convert to lower case and tokenize
    if remove_stopwords:                               # remove stopwords
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    if stemming:                                       # apply stemming
        # stemmer = PorterStemmer()
        stemmer = SnowballStemmer('english')
        words = [stemmer.stem(w) for w in words]
    if split_text:                                     # return a list of tokens
        return words
    return " ".join(words)
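As a quick sanity check, here is what cleanText() produces on a made-up review string (the text is illustrative, not from the dataset):
sample = "<p>I LOVED this toy!!! My 3-year-old plays with it every day.</p>"
print(cleanText(sample))
# -> i loved this toy my year old plays with it every day
print(cleanText(sample, remove_stopwords=True, stemming=True))
# -> a shorter, stemmed version along the lines of "love toy year old play ... day"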
- Splitting the dataset into a training set and a validation set, using the train_test_split() method with the feature and the target variable as arguments, to train and evaluate the performance of our model.
X_train, X_test, y_train, y_test = train_test_split(train_df['customer_reviews'], train_df['average_review_rating'], test_size=0.2, random_state=42)
- Cleaning the training feature set.
X_train_cleaned = []
X_test_cleaned = []
for d in X_train:
X_train_cleaned.append(cleanText(d))
print('Show a cleaned review in the training set : \n', X_train_cleaned[10])
for d in X_test:
X_test_cleaned.append(cleanText(d))
- Applying CountVectorizer to X_train_cleaned (the vectorizer must be created and fitted before we can inspect its vocabulary)
countVect = CountVectorizer()
X_train_countVect = countVect.fit_transform(X_train_cleaned)
- Printing the identified unique words along with their indices, plus a sample of feature names
print("Vocabulary: ", countVect.vocabulary_)
print("Number of features : %d \n" % len(countVect.get_feature_names())) #6378; on scikit-learn >= 1.0, use get_feature_names_out()
print("Show some feature names : \n", countVect.get_feature_names()[::100])
- Creating an object of the LabelEncoder class. Our target variable is of float data type, while the class labels fed to our model must be integers.
lb=preprocessing.LabelEncoder()
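To illustrate what LabelEncoder does, here is a minimal sketch on toy rating values (made up for illustration, not from the dataset):
enc = preprocessing.LabelEncoder()
ratings = [4.5, 3.0, 4.5, 5.0]
print(enc.fit_transform(ratings))     # [1 0 1 2]: each unique rating becomes an integer class
print(enc.classes_)                   # [3.  4.5 5. ]: the sorted unique original values
print(enc.inverse_transform([2, 0]))  # [5. 3.]: map class indices back to ratings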
- Displaying y_train before encoding
print(y_train)
- Encoding y_train
y_train_encoded=lb.fit_transform(y_train)
- Displaying the encoded y_train
print(y_train_encoded)
- Encoding y_test
y_test_encoded = lb.transform(y_test)
y_test_encoded
- Defining our model and fitting with X_train_countVect and y_train_encoded
mnb = MultinomialNB()
mnb.fit(X_train_countVect, y_train_encoded)
- Predicting on the unseen validation features
predictions = mnb.predict(countVect.transform(X_test_cleaned))
- Displaying the predicted values.
print(predictions)
We notice that the predicted values are outside the rating scale of the problem statement, simply because we encoded our target variable before fitting the model. We therefore need to inverse-transform the predictions to get the actual ratings, as shown below:
prediction = lb.inverse_transform(predictions)
- Defining a function to evaluate the model using RMSE (Root Mean Square Error) as the metric.
def modelEvaluation(y_tst, pred):
    '''
    Print the RMSE between the true and the predicted values
    '''
    print("\nRMSE on validation set: {:.4f}".format(np.sqrt(mean_squared_error(y_tst, pred))))
- Evaluating the model performance (both arguments must be on the same scale; here we compare the encoded labels)
modelEvaluation(y_test_encoded, predictions)
RMSE on validation set: 3.1841
A lower RMSE means the model performs better. Keep in mind this score is computed on the encoded class labels, not on the original 5-point rating scale (on which the conclusion reports a correspondingly smaller error). We have come halfway to our final destination; let us now process the test dataframe and run predictions on its unseen feature set.
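If you prefer the error on the original rating scale, evaluate the decoded values instead; this reuses y_test (still on the original scale) and the inverse-transformed prediction from above:
modelEvaluation(y_test, prediction)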
- Displaying the first 5 rows of the test dataframe
test_df.head()
- Displaying information about the test dataframe
test_df.info()
- Checking if there are null values in the test dataframe
test_df.customer_reviews.isnull().sum()
- Cleaning test_df['customer_reviews']
test_df_cleaned = []
for d in test_df['customer_reviews']:
    test_df_cleaned.append(cleanText(d))
print('Show a cleaned review in the test set : \n', test_df_cleaned[10])
- Predicting on the unseen cleaned customer reviews
predictions_test_df = mnb.predict(countVect.transform(test_df_cleaned))
predictions_test = lb.inverse_transform(predictions_test_df)
- Displaying the predicted values
print(predictions_test)
- Putting the predicted values into a dataframe
predictions_test = pd.DataFrame(predictions_test)
- Displaying the prediction
print(predictions_test)
- Converting the dataframe to CSV format for submission.
predictions_test.columns = ["prediction"]
predictions_test.to_csv("submission.csv", index=False)
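As an optional check, you can read the file back to verify it was written correctly:
print(pd.read_csv("submission.csv").head())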
Check the project folder in your Jupyter Notebook; you should now have a file named submission.csv.
Conclusion:
We were able to rate Amazon products from customer reviews with an RMSE score of 0.318 using the multinomial Naive Bayes algorithm and CountVectorizer. We could achieve a lower RMSE with TF-IDF, which often performs better than CountVectorizer, or by exploring other machine learning algorithms such as support vector machines, or deep learning models like RNNs and LSTMs.
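As a starting point for that improvement, swapping in TF-IDF is nearly a one-line change; here is a sketch reusing the variable names from this article:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVect = TfidfVectorizer()
X_train_tfidf = tfidfVect.fit_transform(X_train_cleaned)
# then fit the same model on the TF-IDF features instead of the raw counts:
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_train_tfidf, y_train_encoded)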
Please let me know if you find any errors. You can reach out to me on any of the Matrix decentralized servers; my Element messenger ID is @maximilien:matrix.org
If you are on LinkedIn, you can reach me here
Warm regards,
Maximilien.