Maximiien Kpizingui - Machine Learning & IoT Engineer

Welcome back to this blog post in advanced data analytics nanodegree scholarship sponsored by Udacity. The project II WeRateDogs, a twitter account which has 9.3 million followers across the world at the point this blog is published contains three datasets namely twitter-archive-enhanced.csv, image_predictions.tsv and tweet_json.txt which we have to programmatically download from three different sources. Two from website URLs and the last one requires the learner to sign up for twitter developer account to collect additional tweet through twitter API. The objective of this project is to challenge the learner to wrangle the three datasets and to combine all the three clean datasets into a twitter master archive by providing at least four insights and two visualizations. Without delay let's get into the data wrangling.

Table of Contents

Loading the libraries
Data gathering
Reading twitter-archive-enhanced dataset into a dataframe
Downloading image prediction dataframe using request
Reading image prediction into a dataframe
Query tweet_json dataset using twitter API
Reading tweet_json into a dataframe
Assessing data
Objectives
Methodology
Visual assessment of twitter_archive_df
Programmatic assessment of twitter_achive_df
Visual assessment of image_predictions_df
Programmatic assessment of image_predictions_df
Visual assessment of tweet_status
Programmatic assessment of tweet_status dataframe
Project scope
Quality issues
Tidiness issues
Cleaning data
Making the copy of the dataframes
Quality issue #8: Convert tweet_id column in image_predictions_clean from int to str
Quality Issue #6: Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns
Quality Issue #7: Removing duplicated values img_url of image_prediction_df
Quality Issue #4: Incorrect values and incorrect type for timestamp
Quality Issue #1: invalid and missing dog name in the column name
Tidiness Issue #1 : Column source contain HTML tag and hyperlinks
quality Issue #10: Removing RT @ in column text of twitter_archive_clean
Quality Issue #5 : dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp
Tidiness Issue #2: Merging columns doggo, floofer,pupper, and puppo
Tidiness Issue #4: There are two information in a single column text url link and text
Quality Issue #2 and 3#: Invalid ratings. Value varies from 1776 to 0. data type must be converted from int to float. Invalid denominator, I expected a fixed base. data Strucutre must be converted from int to float.
Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean
Quality Issue #9: Incorrect data type in column id in tweet_status_clean
Tidiness isssue #6: Creating a dataframe with columns: id, favorite_count, retweet_count
Tidiness isssue #7: Merge tweet_status_clean and twitter_achive_merged
Storing the master data into csv file
ANALYSING AND VISUALIZING THE DATA
Reading the twitter_archive_master.csv into a dataframe
Insight about the clean master dataset
Visualizing the hidden pattern in the dataset
Function to plot the average count of tweets
Visualizing the distribution of average favorite count of tweet using based on the dog category
Visualizing the distribution of average retweet count based on the dog category
Visualizing Likes vs Retweets
Visualizing the most used devices by WeRateDogs users
Visualizing the most popular name of the dogs
Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs
Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs
Visualizing the dog category with the highest score
Visualizing the dog category with maximum favorite count
Visualizing the dog category with minimun retweet count
Visualizing Total Tweets made by WeRateDogs per month between 2015 to 2017
Visualizing some of the dogs image prediction p1

1. Loading the required libraries

from timeit import default_timer as timer
from IPython.display import Image
from tweepy import OAuthHandler
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import pandas as pd
import numpy as np
%matplotlib inline
import requests
import tweepy
import json
import re

2. Data Gathering

Reading the twitter-archive-enhanced.csv

twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')

Use the Requests library to download the tweet image prediction (image_predictions.tsv)

# Use Requests library to programmatically download tsv file from a website
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# Save tsv to file
with open('image_predictions.tsv', mode='wb') as file:
 file.write(response.content)

Reading image_predictions.tsv in dataframe

image_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t')

Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt) One of the Project requirements is to access the Twitter API to create the tweet_json.txt completing some missing/wrong values of the tweet-json file. I will use the tweepy package (a client code) to access the Twitter API.
Authentication Details: load personal API keys (replaced with placeholders)

consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
# variables for Twitter API connection
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True)

Collecting tweet data using API

tweet_ids = twitter_archive_df.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except Exception as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)

Reading tweet JSON content as pandas dataframe

tweet_status_df = pd.read_json('tweet_json.txt', lines = True,encoding='utf-8')

Assessing Data

Objectives
In this section, I detect and I document at least eight (8) quality issues and two (2) tidiness issue by using visual assessment and programmatic assessement The issues could be defined into two types:

Quality issues or dirty data: missing, duplicated, or incorrect data
Tidiness issues: messy data or unstructural data.

Methodology
I use visual assessment on each dataframe to detect the issues and document it.

Visual assessment of twitter_archive_df

Displaying the first five raw of the twitter_archive_df

twitter_archive_df.sample(3)

We notice quality issue in column name cause it contains invalid name and missing name which are not accurate
We notice tidiness issue in columns source HTML tags, URL, and content in a single column.

Programmatic assessment of twitter_archive_df

Through visual assessment we found invalid name in column name of the dataframe.

Displaying 20 name of dogs

twitter_archive_df['name'][:20]

Counting the occurence of the unique name of the dogs

twitter_archive_df['name'].value_counts()[:20]

We notice 745 dogs in the dataframe have the name None and 55 dogs have the name a
Displaying the unique occurance of the denominator rating value

twitter_archive_df['rating_denominator'].value_counts()[:20]

We notice incorrect value in the rating_denominator column cause it must have a same denomination which is 10 leading to accuracy quality issue
Displaying the rating_numerator column

twitter_archive_df['rating_numerator']

Screenshot (6).png

twitter_archive_df['rating_numerator'].value_counts()[:20]

The numerator values must be between 0 to 10 which is not the case leading to accuracy quality issue

Displaying descriptive information about the twitter_archive_df

twitter_archive_df.info()

Screenshot (8).png

From the programmatic assessment we notice some quality issues
- timestamp column need to be converted to datetime

twitter_archive_df.pupper

Screenshot (9).png

twitter_archive_df.doggo

Screenshot (10).png

twitter_archive_df.floofer

Screenshot (11).png

twitter_archive_df.puppo

Screenshot (12).png

From programmatic assessment we find a tidiness issue on columns doggo, floofer,pupper, and puppo they all have the same name and can be merge into 1 column

Checking for null values

twitter_archive_df.isnull().sum()

Screenshot (15).png

in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp columns have more null values which will bring bring about a quality issue we need to drop them

twitter_archive_df.text[:20]

Screenshot (16).png

Checking the occurence of RT in the text column as it must be removed as specified in the project requirement

display(twitter_archive_df[twitter_archive_df['text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(twitter_archive_df['text'].str.contains('RT @')))

Screenshot (17).png

the number of RT in the text is: 181
Programmatically this is a quality issue cause text columns contains text and Url andv 181 retweets

Visual assessment of image_predictions_df

image_predictions_df.sample(4)

Screenshot (18).png

From visual assessment we notice a quality issue on colums p1, p2 and p3 inconsistancy in the spelling of the name using upper case letter, lower case and underscore

Programmatic assessment of image_prediction dataframe

Displaying the description of the image_predictions.tsv

image_predictions_df.info()

Screenshot (19).png

From the descriptive information of the image_predictions dataframe, tweet_id need to be converted to a String data type cause it does not affect our analysis . So we raise a quality issue on the validity of the tweet_id data type

Checking for duplicated value

sum(image_predictions_df['tweet_id'].duplicated())

Screenshot (20).png

sum(image_predictions_df.jpg_url.duplicated())

Screenshot (20).png

They are 66 duplicated jpg_url double entry bringing about a quality issue on the validity of the data

Visual assessment of tweet_status

tweet_status_df.sample(4)

Screenshot (22).png

We notice tidy issue in column in_reply_to_status_id, retweeted_status, quoted_status_id, quoted_status_id_str, quoted_status having NaN value
We notice tidiness issue in columns source, entities, extended_entities having HTML tags, URL full_text contains RT @ which need to be remove per the project requirement

Programmatic assessment of tweet_status_df

tweet_status_df.isnull().sum()

Screenshot (23).png

we have quality issue on the validity of the data entry of the columns listed below having high number of NaN values therefore we need to drop
in_reply_to_status_id
in_reply_to_status_id_str
in_reply_to_user_id
in_reply_to_user_id_str
in_reply_to_screen_name
geo
coordinates
place
contributors
retweeted_status
quoted_status_id
quoted_status_id_str
quoted_status

Project scope

Based on the tidiness concept, twitter_archive_enhanced.csv,tweet_status_df, image_predictions.tsv should be merged using the tweet_id as a mapping key

Quality issues

invalid and missing dog name in the column name of twitter_achive_df
Invalid ratings in column rating_numerator of twitter_achive_df value varies from 1776 to 0 and data type need to be converted from int to float
Invalid denominator in column rating_denominator of twitter_achive_df. It should be a fixed base 10 and data type need to be converted from int to float.
Timestamp column in twitter_achive_df needs to be converted into datetime data type
in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp have missing data
Spelling inconsistency of upper case letter in columns p1, p2 and p3 of image_predictions_df
jpg_url columns in image_predictions_df has 66 duplicated images and for the accurary of the data we need to drop them
tweet_id in image_predictions_df needs to be converted to string
column id in tweet_status_df need to be converted from in to string
column id need to be converted in tweet_id

Tidiness issues

HTML tags, URL, and content in source column
columns doggo, floofer,pupper, and puppo they all have the same name and can be merge into 1 column
- twitter_archive_df and image_predictions_df and tweet_status_df can be marged There is two information in a single column text, ampasand and //n
3 useful columns: id, favorite_count, retweet_count in tweet_status_df
Retweet need to be removed in text column and in full_text columns

Cleaning Data

In this section, we clean all of the issues I listed out above while assessing the data.

Making copies of the original dataframes

twitter_archive_clean = twitter_archive_df.copy()
image_predictions_clean = image_predictions_df.copy()
tweet_status_clean = tweet_status_df.copy()

Quality Issue #8:

Convert tweet_id column in image_predictions_clean from int to str

Define:

Converting tweet_id column from int to str

Code

image_predictions_clean.tweet_id = image_predictions_clean.tweet_id.astype(str)

Test

type(image_predictions_clean.tweet_id[0])

Screenshot (25).png

Quality Issue #6:

Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns

Define:

Replacing underscores ('_') with spaces and capitalize first letter.

Code

image_predictions_clean.p1 = image_predictions_clean.p1.str.replace('_', " ").str.capitalize()
image_predictions_clean.p2 = image_predictions_clean.p2.str.replace('_', " ").str.capitalize()
image_predictions_clean.p3 = image_predictions_clean.p3.str.replace('_', " ").str.capitalize()

Test

image_predictions_clean[['p1','p2','p3']].sample(6)

Screenshot (26).png

Quality Issue #7

Removing duplicated values img_url of image_prediction_df

Define:

Indexing all the duplicated value in jpg_url columns ;selecting the the non duplicating values and assigning the non duplicated jpg_url to image_predictions_clean

Code

indexing = image_predictions_df.jpg_url.duplicated()
indexing = np.logical_not(indexing)
image_predictions_clean= image_predictions_clean[indexing]

Test

print("Before cleaning: {} rows.\nAfter cleaning: {} rows.".format(image_predictions_df.shape[0],image_predictions_clean.shape[0]))

Screenshot (27).png

print("{} duplicated.".format(sum(image_predictions_clean.jpg_url.duplicated())))

Screenshot (28).png

Quality Issue #4:

Incorrect values and incorrect type for timestamp

Define:

Converting timestamp column from object to datetime series

Code

twitter_archive_clean['timestamp'] = twitter_archive_clean['timestamp'].astype('datetime64[ns]')

Test

twitter_archive_clean.timestamp[0]

Screenshot (29).png

Quality Issue #1

invalid and missing dog name in the column name

Define:

Rename non-standard names to "None"

Code

invalid_names=list(twitter_archive_clean[twitter_archive_clean.name.
                                           str.contains('^[a-z].*')].
                   name.value_counts().index) + ['None']
twitter_archive_clean.loc[twitter_archive_clean.name.apply(lambda x: x in invalid_names),'name']=None

Test

(twitter_archive_clean.name=='None').sum()

Screenshot (20).png

(twitter_archive_clean.name=='a').sum()

Screenshot (20).png

(twitter_archive_clean.name.apply(lambda x: x in invalid_names)).sum()

Screenshot (20).png

Quality Issue #10 :

twitter_data_archived contains retweets RT @ in column text

Define

As per project specification, we only want original dog ratings. We need to remove retweets text column starting with RT @. In order to fix this quality issue, we are going to create vectors to index all the non-retweet and we exclude the subset from the retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.

Code

indexing_retweeted_status_id =twitter_archive_clean.retweeted_status_id.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_id]
indexing_retweeted_status_user_id = twitter_archive_clean.retweeted_status_user_id.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_user_id]
indexing_retweeted_status_timestamp = twitter_archive_clean.retweeted_status_timestamp.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_timestamp ]

Test

print("Number of rows with true in retweeted_status_id:", sum(twitter_archive_clean.retweeted_status_id.isnull()))
print("Number of rows with true in retweeted_status_timestamp:", sum(twitter_archive_clean.retweeted_status_timestamp.isnull()))
print("Number of rows with true in retweeted_status_user_id:", sum(twitter_archive_clean.retweeted_status_user_id.isnull()))
print("Number of rows of twitter_archive_clean:",twitter_archive_clean.shape[0])

Screenshot (32).png

display(twitter_archive_clean[twitter_archive_df['text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(twitter_archive_clean['text'].str.contains('RT @')))

Screenshot (33).png

twitter_archive_clean['text'].sample(45)

Screenshot (34).png

Tidiness Issue #1

Column source contain HTML tag and hyperlinks

Define:

Extracting the content between opening and closing tag using regular expressions. Extracting the link.
Replacing source variable in the dataset with just the source name
Creating additional table with source link that we could use as a lookup table.

Code

source_link = twitter_archive_clean.source.str.extract(r'<a href="(.+)" .+>', expand=False)
source = twitter_archive_clean.source.str.extract(r'>([A-z -]+)<', expand=False)
twitter_archive_clean.source = source

sources = pd.DataFrame({'source': source, 'source_link': source_link})
sources.drop_duplicates(inplace=True)

sources

Screenshot (35).png

Test

twitter_archive_clean.source.value_counts()

Screenshot (37).png

Quality issue

Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp and expanded_urls

Define:

Removing the columns in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp

Code

columns_to_remove = ['in_reply_to_user_id', 'in_reply_to_status_id',
                    'retweeted_status_id', 'retweeted_status_user_id',
                    'retweeted_status_timestamp','expanded_urls']
twitter_archive_clean.drop(columns_to_remove, axis=1, inplace=True)

Test

twitter_archive_clean.columns

Screenshot (38).png

Tidiness Issue #2

Merging columns doggo, floofer,pupper, and puppo

Define

Columns doggo, floofer,pupper, and puppo have the same values therefore can be merged into one feature

Code

dog_cols = twitter_archive_clean[['doggo','floofer','pupper','puppo']]
dog_cols = dog_cols.replace('None', '') 
dog_category = np.array(dog_cols['doggo']) + np.array(dog_cols['floofer']) + np.array(dog_cols['pupper']) + np.array(dog_cols['puppo'])
pd.DataFrame(dog_category, columns = ['dog_category']).dog_category.value_counts()

Screenshot (39).png

Appending this new column called dog_category to the twtitter_archive_clean

twitter_archive_clean.reset_index(drop=True, inplace=True)
twitter_archive_clean = pd.concat([twitter_archive_clean, pd.DataFrame(dog_category, columns = ['dog_category'])], axis = 1)

Dropping the individual columns after merging them

columns_to_remove = ['doggo','floofer','pupper','puppo']
twitter_archive_clean.drop(columns_to_remove, axis=1, inplace=True)

Test

twitter_archive_clean.dog_category.value_counts()

Screenshot (40).png

twitter_archive_clean.columns

Screenshot (42).png

Tidiness Issue #4

There are two information in a single column text.

Define

Remove the URL in the end of the text column, ampasamd and /n

Code

twitter_archive_clean['text'] = twitter_archive_clean.text.str.replace(r'[(https://.+),(\&amp;)|(\n)]','',regex=True)

Test

twitter_archive_clean['text'].sample(45)

Screenshot (43).png

Quality Issue #2 and 3 :

Invalid ratings. Value varies from 1776 to 0. data type must be converted from int to float. Invalid denominator, I expected a fixed base. data Strucutre must be converted from int to float.

Define

Convert rating_numerator and rating_denominator to float because @dog_rates uses float rating number.
Remove the extreme values (1776, 420, etc.) of rating_numerator, and;
Remove non expected value of denominator, anything different of 10.

Code

Converting the rating_numerator and rating_denominator to float.

twitter_archive_clean.rating_numerator = twitter_archive_clean.rating_numerator.astype(float)
twitter_archive_clean.rating_denominator = twitter_archive_clean.rating_denominator.astype(float)

From visual assessment of the rating_numerator done earlier on, we need to drop the numerator with invalid rating 1776, 420, 204.

invalid_ratings_1776_420_204= twitter_archive_clean[(twitter_archive_clean.rating_numerator==1776)|(twitter_archive_clean.rating_numerator==204)|(twitter_archive_clean.rating_numerator==420)].index
twitter_archive_clean.drop(invalid_ratings_1776_420_204, inplace=True)

Removing tweet_id with 0/10
Listing the tweet_id having 0 as numerator and removing these tweet_id from the dataframe

list(twitter_archive_clean.query('rating_numerator == 0').tweet_id)

Screenshot (44).png

rm_list = list(twitter_archive_clean.query('rating_numerator == 0').tweet_id)
# Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.
indexing = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
# Updating the twitter_archive_clean data frame.
twitter_archive_clean = twitter_archive_clean[indexing]

Test

twitter_archive_clean.query('rating_numerator == 0')

Screenshot (45).png

Quering denominator other than 10

list(twitter_archive_clean.query('rating_denominator !=10').tweet_id)

Screenshot (46).png

Remove those three tweet_id's

rm_list = list(twitter_archive_clean.query('rating_denominator !=10').tweet_id)
# Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.
indexing = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
# Updating the twitter_archive_clean data frame.
twitter_archive_clean = twitter_archive_clean[indexing]

Test

twitter_archive_clean.query('rating_denominator < 10')

Screenshot (48).png

twitter_archive_clean.query('rating_denominator > 10')

Screenshot (48).png

twitter_archive_clean.rating_denominator.value_counts()

Screenshot (49).png

twitter_archive_clean.tweet_id =twitter_archive_clean.tweet_id.astype(str)

Merging twitter_archive_clean and image_predictions_clean as stating in the tidiness issue

twitter_archive_merged= pd.DataFrame()
twitter_archive_merged = twitter_archive_clean.copy()
twitter_archive_merged_df  = twitter_archive_merged.merge(image_predictions_clean, on='tweet_id', how='inner')

print("Shape df_twitter_archive_clean: " + str(twitter_archive_merged.shape))
print("Shape image_predictions_clean: " + str(image_predictions_clean.shape))
print("Shape df_twitter_combined " + str(twitter_archive_merged_df.shape) + " after joining df_twitter_archive_cleaned, df_tweet_performance and df_image_predictions_cleaned.")

Screenshot (50).png

Test

twitter_archive_merged_df.info()

Screenshot (51).png

Quality Issue #11:

Incorrect data type in column id in tweet_status_clean

Define:

Convert id from int to str

Code

tweet_status_clean.id = tweet_status_clean.id.astype('str')
tweet_status_clean.rename(columns={'id':'tweet_id'}, inplace=True)

Test

type(tweet_status_clean.tweet_id[0])

Screenshot (52).png

Tidiness Issue #11:

Removing RT @ in full_text column

Define

Removing RT @ in full_text column as required in the project requirement

Code

display(tweet_status_clean[tweet_status_clean['full_text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(tweet_status_clean['full_text'].str.contains('RT @')))

Screenshot (54).png

Screenshot (55).png

Removing the 160 retweet in full_text column

tweet_status_clean['full_text']= tweet_status_clean['full_text'].apply(lambda x: re.sub(r'\bRT @\b','',x).strip())

Test

sum(tweet_status_clean.full_text.str.contains('RT @'))

Screenshot (20).png

All the rows have been removed

Tidiness isssue #6

Creating a dataframe with columns: id, favorite_count, retweet_count

Define Reassign the useful columns to the dataframe

Code

tweet_status_clean = tweet_status_clean[['tweet_id', 'favorite_count', 'retweet_count' ]]

Test

tweet_status_clean.sample(3)

Screenshot (56).png

Tidiness issue #7

Merge tweet_status_clean and twitter_achive_merged

Define

Merging tweet_status_clean and twitter_achive_merged

Code

twitter_archive_master = twitter_archive_merged_df.merge(tweet_status_clean, on='tweet_id', how='inner')

Test

twitter_archive_master.info()

Screenshot (57).png

Storing Data

Saving the clean merged dataframe into "twitter_archive_master.csv".

twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)

Analyzing and Visualizing Data

In this section, we analyze the twitter archiv master dataset and visualizing data

-Reading and assessing the twitter_archive_master.csv into a dataframe

master_df =pd.read_csv("twitter_archive_master.csv")

master_df.sample(3)

Screenshot (58).png

Insights: In this section, we are interesting to find the hidden pattern in the clean twitter archive master dataset.

. Visualizing the distribution of dog category based on the favorite tweet count by their users on twitter

. Visualizing the distribution of dog category based on the retweet count by their users on twitter

. Visualizing the most used devices by WeRateDogs users

. Visualizing the most common name of dog

. Line plot of like and retweet by twitter users on WeRateDogs account

.Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs

. Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs

. Visualizing the dog category with the highest rating score

. Visualizing the dog category with maximum favorite count

. Visualizing the dog category with minimum retweet count

. Visualizing Total Tweets made by WeRateDogs per month

. Visualizing some of the dog names in image prediction p1

Visualization

Define a function to plot the average count of tweets

def distributionPlot(feature,ylabel,title):
    stage_favorite_counts = master_df[['dog_category',feature]]
    stage_favorite_counts_agg = stage_favorite_counts.groupby('dog_category', as_index=False).mean()
    print(stage_favorite_counts_agg )
    f, ax = plt.subplots(1, 1, figsize=(12, 4));
    sns.barplot(data=stage_favorite_counts_agg,x='dog_category',y=feature,color='green', ax=ax);
    ax.set_ylabel(ylabel);
    ax.set_xlabel("Dog category");
    ax.set_title(title)

Plotting the distribution of average favorite count of tweet using based on the dog category

distributionPlot('favorite_count','average favorite count',"Distribution of average favorite count of tweet using based on the dog category¶")
plt.savefig('av_favcounte.png')

Screenshot (59).png

Among the dog category:

doggo has 17599.225806 tweet favorite counts the most
followed by floofer with 11223.857143 tweet favorite counts
multiclass has 15008.909091 tweet favorite counts
pupper has 6204.975369 tweet favorite counts
puppo has 19573.545455 tweet favorite counts the least

Plotting the distribution of average retweet count based on the dog category

distributionPlot('retweet_count','average retweet count',' Distribution of average retweet count based on the dog category')
plt.savefig('av_retweet_count.png')

Screenshot (60).png

Among the dog category:

doggo has 5972.709677 average retweet counts the most
followed by puppo with 5325.318182 average retweet counts
multiclass has 4548.272727 average retweet countss
floofer has 1909.453202 average retweet counts
pupper has 5325.318182 average retweet counts the least

Plotting likes vs retweets

likes = pd.Series(data=master_df['favorite_count'].values, index=master_df['timestamp'])
retweets = pd.Series(data=master_df['retweet_count'].values, index=master_df['timestamp'])

likes.plot(figsize=(16,4), label='Favorites', color='orange', legend=True)
retweets.plot(figsize=(16,4), label='Retweets', color='maroon', legend=True);
plt.title('No. of Favorites and Retweets Over Time')
plt.show()
plt.savefig('retvs.png')

Screenshot (61).png

From the plot of likes vs retwets we notice that WeRateDogs twitter account started in December,11,2015 with zero likes and retweets. In June,23,2016 over 65k users retweeted post of the account and 140k people like the post as their favorites. In February,16, 2017, the account got more popularity over 100k people like the post on the account as their favorites and 25k people retweeted the posts about WeRateDogs.In August 1,2017 few people retweets about their dogs and only 58k people like the post of dogs as their favorites.

Visualizing the most used devices by WeRateDogs users

print(twitter_archive_clean.source.value_counts())
twitter_archive_clean.source.value_counts().plot(kind='bar', figsize=(11,5), title='Most used Twitter source').set_xlabel("Number of Tweets")
plt.savefig('twitter_source')
plt.savefig('twitter_source.png')

Screenshot (62).png

From the bar plot above, most of the WeRateDogs users used Twitter for iPhone

From the bar plot above, most of the WeRateDogs users used Twitter for iPhone

Displaying the most common name among the dogs

master_df['name'].value_counts()[1:13]

Screenshot (63).png

Visualizing the most popular name of the dogs

sorted_order = master_df['name'].value_counts()[1:13].head(13).index
display(sorted_order)
plt.figure(figsize = (10,4))
values = np.count_nonzero(master_df['name'])
fclrs = ['grey' if (x > min(sorted_order)) else 'red' for x in sorted_order]
sns.countplot(data = master_df, x = 'name', order = sorted_order, orient = 'h')
plt.xlabel('Name', fontsize =16)
plt.ylabel('Count', fontsize =16);
plt.title('Most popular name of the dogs ')
plt.savefig('popular_dog_name.png')

Screenshot (64).png

From the above plot the most common name among the dogs are Charlie, Oliver, Lucy, Tucker, Penny, Winston, Sadie,Toby, Daisy, Lola, Koda and Bo

Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs

breeds= master_df.p2.value_counts().head(20)
display(breeds)
plt.barh(breeds.index , breeds,color='red')
plt.xlabel('Count')
plt.ylabel('Dog Breed')
plt.title('Top 20 Dog Breeds prediction p2 on tweet')
plt.gca().invert_yaxis()
plt.show()
plt.savefig('20p2_dog_pred.png')

From the above plot, the most predicted dog image by twitter users on WeRateDogs is Labrador retriever

Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs

breeds= master_df.p3.value_counts().head(20)
display(breeds)
plt.barh(breeds.index , breeds,color='purple')
plt.xlabel('Count')
plt.ylabel('Dog Breed')
plt.title('Top 20 Dog Breeds predction on tweet')
plt.gca().invert_yaxis()
plt.show()
plt.savefig('20d0gs_p3.png')

Screenshot (67).png

Screenshot (68).png

From the above plot, the most predicted dog image by twitter users on WeRateDogs is Labrador retriver

Visualizing the dog category with the highest score

Creating a new column called rating in master_df

master_df['rating'] = master_df['rating_numerator']/master_df['rating_denominator']

dog_rating=master_df.groupby('dog_category')['rating'].max()
print(dog_rating)
dog_rating.plot.bar(figsize=(10,5))
plt.ylim(top=10)
plt.title("Dog category with the highest rating score")
plt.xlabel("Dog category")
plt.legend([" Rating"])
plt.savefig('rating.png')

Screenshot (69).png

From the plot above, pupper is the dog with the high rating score

Visualizing the dog category with maximun favorite count

dog_favorite_count=master_df.groupby('dog_category')['favorite_count'].max()
display(dog_favorite_count)
dog_favorite_count.plot.bar(figsize=(10,5))
plt.title("Dog category favorite count")
plt.xlabel("Dog category")
plt.legend(["favorite_count"])
plt.savefig('favorite.png')

Screenshot (70).png

From the above plot doggo is the dog with maximum twitter favorite count

Visualizing the dog category with minimun retweet count

dog_retweet_count=master_df.groupby('dog_category')['retweet_count'].min()
display(dog_retweet_count)
dog_retweet_count.plot.bar(figsize=(10,5))
plt.title("Dog category with minimun retweet count")
plt.xlabel("Dog category")
plt.legend(["favorite_count"])
plt.savefig('retweet.png')

Screenshot (71).png

From the above plot pupper has the least retweet count

Visualizing Total Tweets made by WeRateDogs per month between 2015 to 2017

twitter_archive_master['year_month_date'] = twitter_archive_master['timestamp'].dt.year.astype(str) + '-' + \
                                            twitter_archive_master['timestamp'].dt.month.astype(str).str.pad(2, fillchar='0')
twitter_archive_master['is_tweet'] = np.where(twitter_archive_master.tweet_id.notnull(), 1, 0)
twitter_archive_monthly_tweets = twitter_archive_master.groupby('year_month_date').is_tweet.sum().reset_index()
plt.xticks(rotation=45)
ax = sns.lineplot(x='year_month_date', y='is_tweet', data=twitter_archive_monthly_tweets)
ax.set_title('Total Tweets made by WeRateDogs per month between 2015 to 2017')
ax.set_xlabel('Date')
ax.set_ylabel('Number of Tweets')

Screenshot (72).png

From the plot above, we notice in November,2015, 350 people tweeted on WeRateDogs. However, the number gradually decreased in August 2017, the plot revealed that the number of tweets had dropped to 0 tweet.

Visualizing some of the dogs image prediction p1

twitter_archive_master[['jpg_url','p1']]

Screenshot (73).png

Visualizing Miniature_pinscher

Image("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")

Programmatically download Miniature_pinscher image from the url

response = requests.get("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")
file = open("Miniature_pinscher.png", "wb")
file.write(response.content)
file.close()

Visualizing Rhodesian ridgeback

Image('https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg')

Rhodesian ridgeback.png

Programmatically download Rhodesian ridgeback image from the url

response = requests.get("https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg")
file = open("Rhodesian ridgeback.png", "wb")
file.write(response.content)
file.close()

Visualizing German Sheperd

Image('https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg')

Programmatically download german_sheperd image from the url

response = requests.get("https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg")
file = open("german_sheperd.png", "wb")
file.write(response.content)
file.close()

Conclusion
We have reached the end of this data wrangling and visualization journey. Through our analysis , we found out that:

Among the dog category we found out that:

tweet favorite counts

doggo has 17599.225806 the most
followed by floofer with 11223.857143
multiclass has 15008.909091
pupper has 6204.975369
puppo has 19573.545455

retweet count

doggo has 5972.709677 the most
followed by puppo with 5325.318182
multiclass has 4548.272727
floofer has 1909.453202
pupper has 5325.318182

Most devices used by WeRaDogs users

Twitter for iPhone: 2016
Vine - Make a Scene: 91
Twitter Web Client: 31
TweetDeck : 10

** Dog category with the highest rating score

pupper 2.7
doggo 1.4
puppo 1.4
floofer 1.3
multiclass 1.3

dog category with maximun favorite count

doggo 144774
puppo 124103
pupper 108900
multiclass 49401
floofer 28112

Finaly, People preferred to favorite a dog over a retweet and both actions had decreased over time from 2015 to 2017.

The github repository to download the datasets, the wrangling report and the jupyter notebook can be found here.

If you want to contribute or you find any errors in this article please do leave me a comment.

You can reach me out on any of the matrix decentralized servers. My element messenger ID is @maximilien:matrix.org

If you are in one of the mastodon decentralized servers, here is my ID @maximilien@qoto.org

If you are on linkedIn, you can reach me here

If you want to contact me via email for freelance maximilien@tutanota.de

If you want to hire me to work on machine learning, data science,IoT and AI related project, please reach out to me here

Warm regards,
Maximilien.

@WeRateDogs Data Wrangling Analysis And Visualization

Udacity Scholarship |Advanced Data Analytics| Submission July 11,2022| Project II |Maximilien Kpizingui | Machine Learning & IoT Engineer

@WeRateDogs Data Wrangling Analysis And Visualization

Udacity Scholarship |Advanced Data Analytics| Submission July 11,2022| Project II |Maximilien Kpizingui | Machine Learning & IoT Engineer

Did you find this article valuable?