End-To-End Amazon Product Rating Using Multinomial Naive Bayes Algorithm and CountVectorizer
Application of Machine Learning and Natural Language Processing | Maximilien Kpizingui | Submission DPhi October 2021 | Deep Learning & IoT Engineer
In this pandemic era, when everybody has to cover their nose and turn to online shopping, your product is not what you say about it; it is what Google says about it. Let me ask you a question as a consumer of an end product: when you go online to purchase a product, what is the first thing you do? Do you buy an item outright, even though you cannot interact with it physically, or do you take your time to read through the comments and experiences other end users have shared about that same product before purchasing? If you belong to the second category, you are on the winning side in this digital age. With this in mind, we are going to use a mathematical model to predict the rating of a product out of 5, based on sample product information collected from end consumers on Amazon, using machine learning and natural language processing with RMSE as the evaluation metric.
Contents
1. What is a product review
2. Importance of product reviews
3. Top product review platforms
4. What is the multinomial Naive Bayes algorithm
5. What is CountVectorizer
6. Code implementation of CountVectorizer
7. Implementation of product rating using the multinomial Naive Bayes algorithm
1. What is a product review
In electronic commerce, a product review is a section on a shopping website that gives customers the opportunity to rate and comment on a product they have purchased, which other end consumers then read to decide whether to purchase the same item for their own needs.
2. Importance of product reviews
Drive sales
Build trust
Aid customer decision making
Credibility and social proof
3. Top product review platforms
Bazaarvoice
Yotpo
Trustpilot
PowerReviews
Reevoo
Feefo
4. What is the multinomial Naive Bayes algorithm
Multinomial Naive Bayes is a probabilistic algorithm based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features. It is designed to handle text corpora, using word counts to compute the probability given by:
P(c|x) = P(x|c) * P(c) / P(x)
Where c is one of the possible outcome classes and x is the given instance to be classified, represented by its features.
If you want to go deep into the mathematics, I refer you to Nagesh Singh Chauhan's post on the DPhi platform.
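To make the word-count formulation concrete, here is a minimal sketch (with toy counts and labels made up for illustration, not from the article's dataset) of how scikit-learn's MultinomialNB classifies documents represented as count vectors:
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are documents, columns are word counts over a tiny 3-word vocabulary
X = np.array([[2, 1, 0],   # e.g. "good good product"
              [0, 1, 2],   # e.g. "product bad bad"
              [3, 0, 0]])  # e.g. "good good good"
y = [1, 0, 1]              # hypothetical labels: 1 = positive, 0 = negative

clf = MultinomialNB()
clf.fit(X, y)
print(clf.predict([[1, 1, 0]]))        # most likely class for a new document
print(clf.predict_proba([[1, 1, 0]]))  # P(c|x) for each class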
5. What is CountVectorizer
CountVectorizer is a great tool provided by the scikit-learn library in Python. It converts a given text document into a numerical vector based on the frequency of each word occurring in the text. This transformation is paramount at the early stage of a machine learning pipeline: it produces the feature representation of the raw text for machine learning tasks such as text classification and clustering, because machine learning algorithms can only operate on numerical features, irrespective of the input data fed into the model.
6. Code implementation of CountVectorizer
Consider a few sample texts from my IoT startup's corpus, given as elements of a list:
corpus= ["maxtek helps startup", "maxtek is into computer vision and IoT", "maxtek provides Deep learning and IoT solutions"]
CountVectorizer creates a matrix in which each unique word is represented by a column and each text sample from the corpus by a row. The value of each cell is simply the count of that word in that particular text sample, as shown below:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ["maxtek helps startup",
"maxtek is into computer vision and IoT",
"maxtek provides Deep learning and IoT solutions"]
# Create a Vectorizer Object
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
# Printing the identified Unique words along with their indices
print("Vocabulary: ", vectorizer.vocabulary_)
# Encode the corpus
vector = vectorizer.transform(corpus)
# Summarizing the Encoded word in the corpus
print("Encoded corpus is:")
print(vector.toarray())
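Running this snippet should produce output along these lines (the vocabulary dict maps each word to its column index, with indices assigned alphabetically):
Encoded corpus is:
[[0 0 0 1 0 0 0 0 1 0 0 1 0]
 [1 1 0 0 1 1 1 0 1 0 0 0 1]
 [1 0 1 0 0 1 0 1 1 1 1 0 0]]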
Because most cells are zero, this way of representing text is known as a sparse matrix; toarray() converts it to a dense array for display.
Key observations:
There are 13 unique words in the corpus forming the vocabulary, represented as columns of the table.
There are 3 sentences in the corpus each represented as rows of the table.
Every cell contains a number that represents the count of the word in that particular text.
All words have been converted to lowercase.
The words in columns have been arranged alphabetically.
7. Implementation of product rating using multinomial Naive Bayes algorithm
- Importing all the libraries required to run this code.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mean_squared_error
from sklearn.naive_bayes import MultinomialNB
from sklearn import preprocessing
from sklearn import metrics
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import sent_tokenize, word_tokenize, pos_tag
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import contractions
import nltk
import re
In case you run into a "package not found" error after running the code above, you can install the missing library as shown below:
!pip3 install name_of_the_library
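Similarly, several of the NLTK helpers used below (the stopword list, the tokenizers, the lemmatizer) depend on data packages that are not always installed by default. If you hit a LookupError, downloading them once should fix it:
import nltk
nltk.download('stopwords')  # used by cleanText() below
nltk.download('punkt')      # used by sent_tokenize / word_tokenize
nltk.download('wordnet')    # used by WordNetLemmatizer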
- Loading train and test dataset
train_df = pd.read_csv("Train_Data.csv")
test_df = pd.read_csv('Test_Data.csv')
- Displaying the first 5 rows of train_df
train_df.head()
- Displaying information about the feature set of the train dataframe
train_df.info()
- Visualizing the distribution of average_review_rating
plt.figure(figsize=(12,8))
train_df['average_review_rating'].value_counts().sort_index().plot(kind='bar')
plt.title('Distribution of average_review_rating')
plt.xlabel('average_review_rating')
plt.ylabel('Count')
- Visualizing the distribution of reviews for the top 50 products
products = train_df["product_name"].value_counts()
plt.figure(figsize=(12,8))
products[:50].plot(kind='bar')
plt.title("Number of Reviews for Top 50 products")
- Visualizing the distribution of reviews for the top 50 manufacturers
brands = train_df['manufacturer'].value_counts()
plt.figure(figsize=(12,8))
brands[:50].plot(kind='bar')
plt.title("Number of Reviews for Top 50 manufacturers")
- Visualizing the distribution of review lengths
review_length = train_df["customer_reviews"].dropna().map(lambda x: len(x))
plt.figure(figsize=(12,8))
review_length.loc[review_length < 1500].hist()
plt.title("Distribution of customer review Length")
plt.xlabel('Review length (number of characters)')
plt.ylabel('Count')
- Checking for NaN value in the train dataframe
train_df.isnull().sum()
- As shown above, we have many NaN values in the dataframe, so we are going to remove them using the dropna() method; if we kept them, they would introduce noise into our data.
train_df.dropna(inplace=True)
- Displaying customer_reviews in the dataframe, because we will be using it as input to our model. Recall that the system rates products based on customer reviews.
train_df['customer_reviews']
- Checking that no NaN values remain in the feature
train_df['customer_reviews'].isnull().sum()
That looks good. Let's proceed further!
- Defining a function cleanText() to remove special characters, HTML tags, etc. from our feature
def cleanText(raw_text, remove_stopwords=False, stemming=False, split_text=False):
    '''
    Convert a raw review to a cleaned review
    '''
    text = BeautifulSoup(raw_text, 'lxml').get_text()  # remove HTML tags
    letters_only = re.sub("[^a-zA-Z]", " ", text)      # remove non-letter characters
    words = letters_only.lower().split()               # convert to lower case and tokenize
    if remove_stopwords:                               # remove stopwords
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    if stemming:                                       # apply stemming
        # stemmer = PorterStemmer()
        stemmer = SnowballStemmer('english')
        words = [stemmer.stem(w) for w in words]
    if split_text:                                     # return a list of tokens
        return words
    return " ".join(words)
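As a quick sanity check, here is what cleanText() produces on a made-up review string (the text is illustrative, not from the dataset):
sample = "<p>I LOVED this toy!!! My 3-year-old plays with it every day.</p>"
print(cleanText(sample))
# -> i loved this toy my year old plays with it every day
print(cleanText(sample, remove_stopwords=True, stemming=True))
# -> a shorter, stemmed version along the lines of "love toy year old play ... day"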
- Splitting the dataset into a training set and a validation set, using the train_test_split() method with the feature and the target variable as arguments, to train and evaluate the performance of our model.
X_train, X_test, y_train, y_test = train_test_split(train_df['customer_reviews'], train_df['average_review_rating'], test_size=0.2, random_state=42)
- Cleaning the training feature set.
X_train_cleaned = []
X_test_cleaned = []
for d in X_train:
X_train_cleaned.append(cleanText(d))
print('Show a cleaned review in the training set : \n', X_train_cleaned[10])
for d in X_test:
X_test_cleaned.append(cleanText(d))
- Applying CountVectorizer to X_train_cleaned (the vectorizer must be created and fitted before we can inspect its vocabulary)
countVect = CountVectorizer()
X_train_countVect = countVect.fit_transform(X_train_cleaned)
- Printing the identified unique words along with their indices, plus a sample of feature names
print("Vocabulary: ", countVect.vocabulary_)
print("Number of features : %d \n" % len(countVect.get_feature_names())) #6378; on scikit-learn >= 1.0, use get_feature_names_out()
print("Show some feature names : \n", countVect.get_feature_names()[::100])
- Creating an object of the LabelEncoder class. Our target variable is of float data type, while the class labels fed to our model must be integers.
lb=preprocessing.LabelEncoder()
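To illustrate what LabelEncoder does, here is a minimal sketch on toy rating values (made up for illustration, not from the dataset):
enc = preprocessing.LabelEncoder()
ratings = [4.5, 3.0, 4.5, 5.0]
print(enc.fit_transform(ratings))     # [1 0 1 2]: each unique rating becomes an integer class
print(enc.classes_)                   # [3.  4.5 5. ]: the sorted unique original values
print(enc.inverse_transform([2, 0]))  # [5. 3.]: map class indices back to ratings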
- Displaying y_train before encoding
print(y_train)
- Encoding y_train
y_train_encoded=lb.fit_transform(y_train)
- Displaying the encoded y_train
print(y_train_encoded)
- Encoding y_test
y_test_encoded = lb.transform(y_test)
y_test_encoded
- Defining our model and fitting with X_train_countVect and y_train_encoded
mnb = MultinomialNB()
mnb.fit(X_train_countVect, y_train_encoded)
- Predicting on the unseen validation features
predictions = mnb.predict(countVect.transform(X_test_cleaned))
- Displaying the predicted values.
print(predictions)
We notice that the predicted values are outside the rating scale of the problem statement, simply because we encoded our target variable before fitting the model. We therefore need to inverse-transform the predictions to get the actual ratings, as shown below:
prediction = lb.inverse_transform(predictions)
- Defining a function to evaluate the model using RMSE (Root Mean Square Error) as the metric.
def modelEvaluation(y_tst, pred):
    '''
    Print the RMSE between the true and the predicted values
    '''
    print("\nRMSE on validation set: {:.4f}".format(np.sqrt(mean_squared_error(y_tst, pred))))
- Evaluating the model performance (both arguments must be on the same scale; here we compare the encoded labels)
modelEvaluation(y_test_encoded, predictions)
RMSE on validation set: 3.1841
A lower RMSE means the model performs better. Keep in mind this score is computed on the encoded class labels, not on the original 5-point rating scale (on which the conclusion reports a correspondingly smaller error). We have come halfway to our final destination; let us now process the test dataframe and run predictions on its unseen feature set.
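If you prefer the error on the original rating scale, evaluate the decoded values instead; this reuses y_test (still on the original scale) and the inverse-transformed prediction from above:
modelEvaluation(y_test, prediction)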
- Displaying the first 5 rows of the test dataframe
test_df.head()
- Displaying information about the test dataframe
test_df.info()
- Checking if there are null values in the test dataframe
test_df.customer_reviews.isnull().sum()
- Cleaning test_df['customer_reviews']
test_df_cleaned = []
for d in test_df['customer_reviews']:
    test_df_cleaned.append(cleanText(d))
print('Show a cleaned review in the test set : \n', test_df_cleaned[10])
- Predicting on the unseen cleaned customer reviews
predictions_test_df = mnb.predict(countVect.transform(test_df_cleaned))
predictions_test = lb.inverse_transform(predictions_test_df)
- Displaying the predicted values
print(predictions_test)
- Putting the predicted values into a dataframe
predictions_test = pd.DataFrame(predictions_test)
- Displaying the prediction
print(predictions_test)
- Converting the dataframe to CSV format for submission.
predictions_test.columns = ["prediction"]
predictions_test.to_csv("submission.csv", index=False)
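As an optional check, you can read the file back to verify it was written correctly:
print(pd.read_csv("submission.csv").head())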
Check the project folder in your Jupyter Notebook; you should now have a file named submission.csv.
Conclusion:
We were able to rate Amazon products from customer reviews with an RMSE score of 0.318 using the multinomial Naive Bayes algorithm and CountVectorizer. We could achieve a lower RMSE with TF-IDF, which often performs better than CountVectorizer, or by exploring other machine learning algorithms such as support vector machines, or deep learning models like RNNs and LSTMs.
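As a starting point for that improvement, swapping in TF-IDF is nearly a one-line change; here is a sketch reusing the variable names from this article:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfVect = TfidfVectorizer()
X_train_tfidf = tfidfVect.fit_transform(X_train_cleaned)
# then fit the same model on the TF-IDF features instead of the raw counts:
mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(X_train_tfidf, y_train_encoded)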
Please let me know if you find any errors. You can reach out to me on any of the Matrix decentralized servers; my Element messenger ID is @maximilien:matrix.org
If you are on LinkedIn, you can reach me here
Warm regards,
Maximilien.