@WeRateDogs Data Wrangling  Analysis And Visualization

@WeRateDogs Data Wrangling Analysis And Visualization

Udacity Scholarship |Advanced Data Analytics| Submission July 11,2022| Project II |Maximilien Kpizingui | Machine Learning & IoT Engineer

·

20 min read

Welcome back to this blog post in advanced data analytics nanodegree scholarship sponsored by Udacity. The project II WeRateDogs, a twitter account which has 9.3 million followers across the world at the point this blog is published contains three datasets namely twitter-archive-enhanced.csv, image_predictions.tsv and tweet_json.txt which we have to programmatically download from three different sources. Two from website URLs and the last one requires the learner to sign up for twitter developer account to collect additional tweet through twitter API. The objective of this project is to challenge the learner to wrangle the three datasets and to combine all the three clean datasets into a twitter master archive by providing at least four insights and two visualizations. Without delay let's get into the data wrangling.


Table of Contents

  1. Loading the libraries
  2. Data gathering
  3. Reading twitter-archive-enhanced dataset into a dataframe
  4. Downloading image prediction dataframe using request
  5. Reading image prediction into a dataframe
  6. Query tweet_json dataset using twitter API
  7. Reading tweet_json into a dataframe
  8. Assessing data
  9. Objectives
  10. Methodology
  11. Visual assessment of twitter_archive_df
  12. Programmatic assessment of twitter_achive_df
  13. Visual assessment of image_predictions_df
  14. Programmatic assessment of image_predictions_df
  15. Visual assessment of tweet_status
  16. Programmatic assessment of tweet_status dataframe
  17. Project scope
  18. Quality issues
  19. Tidiness issues
  20. Cleaning data
  21. Making the copy of the dataframes
  22. Quality issue #8: Convert tweet_id column in image_predictions_clean from int to str
  23. Quality Issue #6: Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns
  24. Quality Issue #7: Removing duplicated values img_url of image_prediction_df
  25. Quality Issue #4: Incorrect values and incorrect type for timestamp
  26. Quality Issue #1: invalid and missing dog name in the column name
  27. Tidiness Issue #1 : Column source contain HTML tag and hyperlinks
  28. quality Issue #10: Removing RT @ in column text of twitter_archive_clean
  29. Quality Issue #5 : dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp
  30. Tidiness Issue #2: Merging columns doggo, floofer,pupper, and puppo
  31. Tidiness Issue #4: There are two information in a single column text url link and text
  32. Quality Issue #2 and 3#: Invalid ratings. Value varies from 1776 to 0. data type must be converted from int to float. Invalid denominator, I expected a fixed base. data Strucutre must be converted from int to float.
  33. Tidiness Issue #5: Merging twitter_archive_clean and image_predictions_clean
  34. Quality Issue #9: Incorrect data type in column id in tweet_status_clean
  35. Tidiness isssue #6: Creating a dataframe with columns: id, favorite_count, retweet_count
  36. Tidiness isssue #7: Merge tweet_status_clean and twitter_achive_merged
  37. Storing the master data into csv file
  38. ANALYSING AND VISUALIZING THE DATA
  39. Reading the twitter_archive_master.csv into a dataframe
  40. Insight about the clean master dataset
  41. Visualizing the hidden pattern in the dataset
  42. Function to plot the average count of tweets
  43. Visualizing the distribution of average favorite count of tweet using based on the dog category
  44. Visualizing the distribution of average retweet count based on the dog category
  45. Visualizing Likes vs Retweets
  46. Visualizing the most used devices by WeRateDogs users
  47. Visualizing the most popular name of the dogs
  48. Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs
  49. Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs
  50. Visualizing the dog category with the highest score
  51. Visualizing the dog category with maximum favorite count

  52. Visualizing the dog category with minimun retweet count
  53. Visualizing Total Tweets made by WeRateDogs per month between 2015 to 2017
  54. Visualizing some of the dogs image prediction p1

1. Loading the required libraries

from timeit import default_timer as timer
from IPython.display import Image
from tweepy import OAuthHandler
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import pandas as pd
import numpy as np
%matplotlib inline
import requests
import tweepy
import json
import re

2. Data Gathering

  • Reading the twitter-archive-enhanced.csv
twitter_archive_df = pd.read_csv('twitter-archive-enhanced.csv')
  • Use the Requests library to download the tweet image prediction (image_predictions.tsv)
# Use Requests library to programmatically download tsv file from a website
url = 'https://d17h27t6h515a5.cloudfront.net/topher/2017/August/599fd2ad_image-predictions/image-predictions.tsv'
response = requests.get(url)
# Save tsv to file
with open('image_predictions.tsv', mode='wb') as file:
 file.write(response.content)
  • Reading image_predictions.tsv in dataframe
image_predictions_df = pd.read_csv('image_predictions.tsv', sep='\t')
  • Use the Tweepy library to query additional data via the Twitter API (tweet_json.txt) One of the Project requirements is to access the Twitter API to create the tweet_json.txt completing some missing/wrong values of the tweet-json file. I will use the tweepy package (a client code) to access the Twitter API.
  • Authentication Details: load personal API keys (replaced with placeholders)
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''
# variables for Twitter API connection
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit = True)
  • Collecting tweet data using API
tweet_ids = twitter_archive_df.tweet_id.values
len(tweet_ids)

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive
count = 0
fails_dict = {}
start = timer()
# Save each tweet's returned JSON as a new line in a .txt file
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        count += 1
        print(str(count) + ": " + str(tweet_id))
        try:
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            json.dump(tweet._json, outfile)
            outfile.write('\n')
        except Exception as e:
            print("Fail")
            fails_dict[tweet_id] = e
            pass
end = timer()
print(end - start)
print(fails_dict)
  • Reading tweet JSON content as pandas dataframe
tweet_status_df = pd.read_json('tweet_json.txt', lines = True,encoding='utf-8')

Assessing Data

Objectives
In this section, I detect and I document at least eight (8) quality issues and two (2) tidiness issue by using visual assessment and programmatic assessement The issues could be defined into two types:

  • Quality issues or dirty data: missing, duplicated, or incorrect data
  • Tidiness issues: messy data or unstructural data.

Methodology
I use visual assessment on each dataframe to detect the issues and document it.

Visual assessment of twitter_archive_df

  • Displaying the first five raw of the twitter_archive_df
twitter_archive_df.sample(3)

sam1.png

  • We notice quality issue in column name cause it contains invalid name and missing name which are not accurate
  • We notice tidiness issue in columns source HTML tags, URL, and content in a single column.

Programmatic assessment of twitter_archive_df

Through visual assessment we found invalid name in column name of the dataframe.

  • Displaying 20 name of dogs
twitter_archive_df['name'][:20]

sam1.png

  • Counting the occurence of the unique name of the dogs
twitter_archive_df['name'].value_counts()[:20]

sam1.png

  • We notice 745 dogs in the dataframe have the name None and 55 dogs have the name a

  • Displaying the unique occurance of the denominator rating value

twitter_archive_df['rating_denominator'].value_counts()[:20]

sam1.png

  • We notice incorrect value in the rating_denominator column cause it must have a same denomination which is 10 leading to accuracy quality issue

  • Displaying the rating_numerator column

twitter_archive_df['rating_numerator']

Screenshot (6).png

twitter_archive_df['rating_numerator'].value_counts()[:20]

Screenshot7.png

  • The numerator values must be between 0 to 10 which is not the case leading to accuracy quality issue

Displaying descriptive information about the twitter_archive_df

twitter_archive_df.info()

Screenshot (8).png

From the programmatic assessment we notice some quality issues
- timestamp column need to be converted to datetime

twitter_archive_df.pupper

Screenshot (9).png

twitter_archive_df.doggo

Screenshot (10).png

twitter_archive_df.floofer

Screenshot (11).png

twitter_archive_df.puppo

Screenshot (12).png

  • From programmatic assessment we find a tidiness issue on columns doggo, floofer,pupper, and puppo they all have the same name and can be merge into 1 column

Checking for null values

twitter_archive_df.isnull().sum()

Screenshot (15).png

  • in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp columns have more null values which will bring bring about a quality issue we need to drop them
twitter_archive_df.text[:20]

Screenshot (16).png

  • Checking the occurence of RT in the text column as it must be removed as specified in the project requirement
display(twitter_archive_df[twitter_archive_df['text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(twitter_archive_df['text'].str.contains('RT @')))

Screenshot (17).png

  • the number of RT in the text is: 181

  • Programmatically this is a quality issue cause text columns contains text and Url andv 181 retweets

Visual assessment of image_predictions_df

image_predictions_df.sample(4)

Screenshot (18).png

  • From visual assessment we notice a quality issue on colums p1, p2 and p3 inconsistancy in the spelling of the name using upper case letter, lower case and underscore

Programmatic assessment of image_prediction dataframe

  • Displaying the description of the image_predictions.tsv
image_predictions_df.info()

Screenshot (19).png

  • From the descriptive information of the image_predictions dataframe, tweet_id need to be converted to a String data type cause it does not affect our analysis . So we raise a quality issue on the validity of the tweet_id data type

Checking for duplicated value

sum(image_predictions_df['tweet_id'].duplicated())

Screenshot (20).png

sum(image_predictions_df.jpg_url.duplicated())

Screenshot (20).png

  • They are 66 duplicated jpg_url double entry bringing about a quality issue on the validity of the data

Visual assessment of tweet_status

tweet_status_df.sample(4)

Screenshot (22).png

  • We notice tidy issue in column in_reply_to_status_id, retweeted_status, quoted_status_id, quoted_status_id_str, quoted_status having NaN value
  • We notice tidiness issue in columns source, entities, extended_entities having HTML tags, URL full_text contains RT @ which need to be remove per the project requirement

Programmatic assessment of tweet_status_df

tweet_status_df.isnull().sum()

Screenshot (23).png

  • we have quality issue on the validity of the data entry of the columns listed below having high number of NaN values therefore we need to drop
    in_reply_to_status_id
    in_reply_to_status_id_str
    in_reply_to_user_id
    in_reply_to_user_id_str
    in_reply_to_screen_name
    geo
    coordinates
    place
    contributors
    retweeted_status
    quoted_status_id
    quoted_status_id_str
    quoted_status

Project scope

Based on the tidiness concept, twitter_archive_enhanced.csv,tweet_status_df, image_predictions.tsv should be merged using the tweet_id as a mapping key

Quality issues

  • invalid and missing dog name in the column name of twitter_achive_df

  • Invalid ratings in column rating_numerator of twitter_achive_df value varies from 1776 to 0 and data type need to be converted from int to float

  • Invalid denominator in column rating_denominator of twitter_achive_df. It should be a fixed base 10 and data type need to be converted from int to float.

  • Timestamp column in twitter_achive_df needs to be converted into datetime data type

  • in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp have missing data

  • Spelling inconsistency of upper case letter in columns p1, p2 and p3 of image_predictions_df

  • jpg_url columns in image_predictions_df has 66 duplicated images and for the accurary of the data we need to drop them

  • tweet_id in image_predictions_df needs to be converted to string

  • column id in tweet_status_df need to be converted from in to string

  • column id need to be converted in tweet_id

Tidiness issues

  • HTML tags, URL, and content in source column
  • columns doggo, floofer,pupper, and puppo they all have the same name and can be merge into 1 column
    • twitter_archive_df and image_predictions_df and tweet_status_df can be marged There is two information in a single column text, ampasand and //n
  • 3 useful columns: id, favorite_count, retweet_count in tweet_status_df
  • Retweet need to be removed in text column and in full_text columns

Cleaning Data

In this section, we clean all of the issues I listed out above while assessing the data.

Making copies of the original dataframes

twitter_archive_clean = twitter_archive_df.copy()
image_predictions_clean = image_predictions_df.copy()
tweet_status_clean = tweet_status_df.copy()

Quality Issue #8:

Convert tweet_id column in image_predictions_clean from int to str

Define:

Converting tweet_id column from int to str

Code

image_predictions_clean.tweet_id = image_predictions_clean.tweet_id.astype(str)

Test

type(image_predictions_clean.tweet_id[0])

Screenshot (25).png

Quality Issue #6:

Inconsistent use of lowercase and uppercase and underscores in p1, p2,p3 columns

Define:

Replacing underscores ('_') with spaces and capitalize first letter.

Code

image_predictions_clean.p1 = image_predictions_clean.p1.str.replace('_', " ").str.capitalize()
image_predictions_clean.p2 = image_predictions_clean.p2.str.replace('_', " ").str.capitalize()
image_predictions_clean.p3 = image_predictions_clean.p3.str.replace('_', " ").str.capitalize()

Test

image_predictions_clean[['p1','p2','p3']].sample(6)

Screenshot (26).png

Quality Issue #7

Removing duplicated values img_url of image_prediction_df

Define:

Indexing all the duplicated value in jpg_url columns ;selecting the the non duplicating values and assigning the non duplicated jpg_url to image_predictions_clean

Code

indexing = image_predictions_df.jpg_url.duplicated()
indexing = np.logical_not(indexing)
image_predictions_clean= image_predictions_clean[indexing]

Test

print("Before cleaning: {} rows.\nAfter cleaning: {} rows.".format(image_predictions_df.shape[0],image_predictions_clean.shape[0]))

Screenshot (27).png

print("{} duplicated.".format(sum(image_predictions_clean.jpg_url.duplicated())))

Screenshot (28).png

Quality Issue #4:

Incorrect values and incorrect type for timestamp

Define:

Converting timestamp column from object to datetime series

Code

twitter_archive_clean['timestamp'] = twitter_archive_clean['timestamp'].astype('datetime64[ns]')

Test

twitter_archive_clean.timestamp[0]

Screenshot (29).png

Quality Issue #1

invalid and missing dog name in the column name

Define:

Rename non-standard names to "None"

Code

invalid_names=list(twitter_archive_clean[twitter_archive_clean.name.
                                           str.contains('^[a-z].*')].
                   name.value_counts().index) + ['None']
twitter_archive_clean.loc[twitter_archive_clean.name.apply(lambda x: x in invalid_names),'name']=None

Test

(twitter_archive_clean.name=='None').sum()

Screenshot (20).png

(twitter_archive_clean.name=='a').sum()

Screenshot (20).png

(twitter_archive_clean.name.apply(lambda x: x in invalid_names)).sum()

Screenshot (20).png

Quality Issue #10 :

twitter_data_archived contains retweets RT @ in column text

Define

As per project specification, we only want original dog ratings. We need to remove retweets text column starting with RT @. In order to fix this quality issue, we are going to create vectors to index all the non-retweet and we exclude the subset from the retweeted_status_id, retweeted_status_user_id, and retweeted_status_timestamp.

Code

indexing_retweeted_status_id =twitter_archive_clean.retweeted_status_id.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_id]
indexing_retweeted_status_user_id = twitter_archive_clean.retweeted_status_user_id.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_user_id]
indexing_retweeted_status_timestamp = twitter_archive_clean.retweeted_status_timestamp.isnull()
twitter_archive_clean = twitter_archive_clean[indexing_retweeted_status_timestamp ]

Test

print("Number of rows with true in retweeted_status_id:", sum(twitter_archive_clean.retweeted_status_id.isnull()))
print("Number of rows with true in retweeted_status_timestamp:", sum(twitter_archive_clean.retweeted_status_timestamp.isnull()))
print("Number of rows with true in retweeted_status_user_id:", sum(twitter_archive_clean.retweeted_status_user_id.isnull()))
print("Number of rows of twitter_archive_clean:",twitter_archive_clean.shape[0])

Screenshot (32).png

display(twitter_archive_clean[twitter_archive_df['text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(twitter_archive_clean['text'].str.contains('RT @')))

Screenshot (33).png

twitter_archive_clean['text'].sample(45)

Screenshot (34).png

Tidiness Issue #1

Column source contain HTML tag and hyperlinks

Define:

  • Extracting the content between opening and closing tag using regular expressions. Extracting the link.

  • Replacing source variable in the dataset with just the source name

  • Creating additional table with source link that we could use as a lookup table.

Code

source_link = twitter_archive_clean.source.str.extract(r'<a href="(.+)" .+>', expand=False)
source = twitter_archive_clean.source.str.extract(r'>([A-z -]+)<', expand=False)
twitter_archive_clean.source = source
sources = pd.DataFrame({'source': source, 'source_link': source_link})
sources.drop_duplicates(inplace=True)
sources

Screenshot (35).png

Test

twitter_archive_clean.source.value_counts()

Screenshot (37).png

Quality issue

Dropping columns in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp and expanded_urls

Define:

Removing the columns in_reply_to_status_id, in_reply_to_user_id ,retweeted_status_id,retweeted_status_user_id, retweeted_status_timestamp

Code

columns_to_remove = ['in_reply_to_user_id', 'in_reply_to_status_id',
                    'retweeted_status_id', 'retweeted_status_user_id',
                    'retweeted_status_timestamp','expanded_urls']
twitter_archive_clean.drop(columns_to_remove, axis=1, inplace=True)

Test

twitter_archive_clean.columns

Screenshot (38).png

Tidiness Issue #2

Merging columns doggo, floofer,pupper, and puppo

Define

Columns doggo, floofer,pupper, and puppo have the same values therefore can be merged into one feature

Code

dog_cols = twitter_archive_clean[['doggo','floofer','pupper','puppo']]
dog_cols = dog_cols.replace('None', '') 
dog_category = np.array(dog_cols['doggo']) + np.array(dog_cols['floofer']) + np.array(dog_cols['pupper']) + np.array(dog_cols['puppo'])
pd.DataFrame(dog_category, columns = ['dog_category']).dog_category.value_counts()

Screenshot (39).png

  • Appending this new column called dog_category to the twtitter_archive_clean
twitter_archive_clean.reset_index(drop=True, inplace=True)
twitter_archive_clean = pd.concat([twitter_archive_clean, pd.DataFrame(dog_category, columns = ['dog_category'])], axis = 1)
  • Dropping the individual columns after merging them
columns_to_remove = ['doggo','floofer','pupper','puppo']
twitter_archive_clean.drop(columns_to_remove, axis=1, inplace=True)

Test

twitter_archive_clean.dog_category.value_counts()

Screenshot (40).png

twitter_archive_clean.columns

Screenshot (42).png

Tidiness Issue #4

There are two information in a single column text.

Define

Remove the URL in the end of the text column, ampasamd and /n

Code

twitter_archive_clean['text'] = twitter_archive_clean.text.str.replace(r'[(https://.+),(\&amp;)|(\n)]','',regex=True)

Test

twitter_archive_clean['text'].sample(45)

Screenshot (43).png

Quality Issue #2 and 3 :

Invalid ratings. Value varies from 1776 to 0. data type must be converted from int to float. Invalid denominator, I expected a fixed base. data Strucutre must be converted from int to float.

Define

  • Convert rating_numerator and rating_denominator to float because @dog_rates uses float rating number.
  • Remove the extreme values (1776, 420, etc.) of rating_numerator, and;
  • Remove non expected value of denominator, anything different of 10.

Code

  • Converting the rating_numerator and rating_denominator to float.
twitter_archive_clean.rating_numerator = twitter_archive_clean.rating_numerator.astype(float)
twitter_archive_clean.rating_denominator = twitter_archive_clean.rating_denominator.astype(float)
  • From visual assessment of the rating_numerator done earlier on, we need to drop the numerator with invalid rating 1776, 420, 204.
invalid_ratings_1776_420_204= twitter_archive_clean[(twitter_archive_clean.rating_numerator==1776)|(twitter_archive_clean.rating_numerator==204)|(twitter_archive_clean.rating_numerator==420)].index
twitter_archive_clean.drop(invalid_ratings_1776_420_204, inplace=True)
  • Removing tweet_id with 0/10
  • Listing the tweet_id having 0 as numerator and removing these tweet_id from the dataframe
list(twitter_archive_clean.query('rating_numerator == 0').tweet_id)

Screenshot (44).png

rm_list = list(twitter_archive_clean.query('rating_numerator == 0').tweet_id)
# Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.
indexing = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
# Updating the twitter_archive_clean data frame.
twitter_archive_clean = twitter_archive_clean[indexing]

Test

twitter_archive_clean.query('rating_numerator == 0')

Screenshot (45).png

  • Quering denominator other than 10
list(twitter_archive_clean.query('rating_denominator !=10').tweet_id)

Screenshot (46).png

  • Remove those three tweet_id's
rm_list = list(twitter_archive_clean.query('rating_denominator !=10').tweet_id)
# Creating a vector to subset twitter_archive_clean and remove the tweet_id from the rm_list.
indexing = np.logical_not(twitter_archive_clean.tweet_id.isin(rm_list))
# Updating the twitter_archive_clean data frame.
twitter_archive_clean = twitter_archive_clean[indexing]

Test

twitter_archive_clean.query('rating_denominator < 10')

Screenshot (48).png

twitter_archive_clean.query('rating_denominator > 10')

Screenshot (48).png

twitter_archive_clean.rating_denominator.value_counts()

Screenshot (49).png

twitter_archive_clean.tweet_id =twitter_archive_clean.tweet_id.astype(str)

Merging twitter_archive_clean and image_predictions_clean as stating in the tidiness issue

twitter_archive_merged= pd.DataFrame()
twitter_archive_merged = twitter_archive_clean.copy()
twitter_archive_merged_df  = twitter_archive_merged.merge(image_predictions_clean, on='tweet_id', how='inner')
print("Shape df_twitter_archive_clean: " + str(twitter_archive_merged.shape))
print("Shape image_predictions_clean: " + str(image_predictions_clean.shape))
print("Shape df_twitter_combined " + str(twitter_archive_merged_df.shape) + " after joining df_twitter_archive_cleaned, df_tweet_performance and df_image_predictions_cleaned.")

Screenshot (50).png

Test

twitter_archive_merged_df.info()

Screenshot (51).png

Quality Issue #11:

Incorrect data type in column id in tweet_status_clean

Define:

Convert id from int to str

Code

tweet_status_clean.id = tweet_status_clean.id.astype('str')
tweet_status_clean.rename(columns={'id':'tweet_id'}, inplace=True)

Test

type(tweet_status_clean.tweet_id[0])

Screenshot (52).png

Tidiness Issue #11:

Removing RT @ in full_text column

Define

Removing RT @ in full_text column as required in the project requirement

Code

display(tweet_status_clean[tweet_status_clean['full_text'].str.contains('RT @')])
print('the number of RT in the text is:', sum(tweet_status_clean['full_text'].str.contains('RT @')))

Screenshot (54).png

Screenshot (55).png

  • Removing the 160 retweet in full_text column
tweet_status_clean['full_text']= tweet_status_clean['full_text'].apply(lambda x: re.sub(r'\bRT @\b','',x).strip())

Test

sum(tweet_status_clean.full_text.str.contains('RT @'))

Screenshot (20).png

  • All the rows have been removed

Tidiness isssue #6

Creating a dataframe with columns: id, favorite_count, retweet_count

Define Reassign the useful columns to the dataframe

Code

tweet_status_clean = tweet_status_clean[['tweet_id', 'favorite_count', 'retweet_count' ]]

Test

tweet_status_clean.sample(3)

Screenshot (56).png

Tidiness issue #7

Merge tweet_status_clean and twitter_achive_merged

Define

Merging tweet_status_clean and twitter_achive_merged

Code

twitter_archive_master = twitter_archive_merged_df.merge(tweet_status_clean, on='tweet_id', how='inner')

Test

twitter_archive_master.info()

Screenshot (57).png

Storing Data

Saving the clean merged dataframe into "twitter_archive_master.csv".

twitter_archive_master.to_csv('twitter_archive_master.csv', index=False)

Analyzing and Visualizing Data

In this section, we analyze the twitter archiv master dataset and visualizing data

-Reading and assessing the twitter_archive_master.csv into a dataframe

master_df =pd.read_csv("twitter_archive_master.csv")
master_df.sample(3)

Screenshot (58).png

Insights: In this section, we are interesting to find the hidden pattern in the clean twitter archive master dataset.

. Visualizing the distribution of dog category based on the favorite tweet count by their users on twitter

. Visualizing the distribution of dog category based on the retweet count by their users on twitter

. Visualizing the most used devices by WeRateDogs users

. Visualizing the most common name of dog

. Line plot of like and retweet by twitter users on WeRateDogs account

.Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs

. Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs

. Visualizing the dog category with the highest rating score

. Visualizing the dog category with maximum favorite count

. Visualizing the dog category with minimum retweet count

. Visualizing Total Tweets made by WeRateDogs per month

. Visualizing some of the dog names in image prediction p1

Visualization

Define a function to plot the average count of tweets

def distributionPlot(feature,ylabel,title):
    stage_favorite_counts = master_df[['dog_category',feature]]
    stage_favorite_counts_agg = stage_favorite_counts.groupby('dog_category', as_index=False).mean()
    print(stage_favorite_counts_agg )
    f, ax = plt.subplots(1, 1, figsize=(12, 4));
    sns.barplot(data=stage_favorite_counts_agg,x='dog_category',y=feature,color='green', ax=ax);
    ax.set_ylabel(ylabel);
    ax.set_xlabel("Dog category");
    ax.set_title(title)

Plotting the distribution of average favorite count of tweet using based on the dog category

distributionPlot('favorite_count','average favorite count',"Distribution of average favorite count of tweet using based on the dog category¶")
plt.savefig('av_favcounte.png')

Screenshot (59).png

Among the dog category:

  • doggo has 17599.225806 tweet favorite counts the most
  • followed by floofer with 11223.857143 tweet favorite counts
  • multiclass has 15008.909091 tweet favorite counts
  • pupper has 6204.975369 tweet favorite counts
  • puppo has 19573.545455 tweet favorite counts the least

Plotting the distribution of average retweet count based on the dog category

distributionPlot('retweet_count','average retweet count',' Distribution of average retweet count based on the dog category')
plt.savefig('av_retweet_count.png')

Screenshot (60).png

Among the dog category:

  • doggo has 5972.709677 average retweet counts the most
  • followed by puppo with 5325.318182 average retweet counts
  • multiclass has 4548.272727 average retweet countss
  • floofer has 1909.453202 average retweet counts
  • pupper has 5325.318182 average retweet counts the least

Plotting likes vs retweets

likes = pd.Series(data=master_df['favorite_count'].values, index=master_df['timestamp'])
retweets = pd.Series(data=master_df['retweet_count'].values, index=master_df['timestamp'])
likes.plot(figsize=(16,4), label='Favorites', color='orange', legend=True)
retweets.plot(figsize=(16,4), label='Retweets', color='maroon', legend=True);
plt.title('No. of Favorites and Retweets Over Time')
plt.show()
plt.savefig('retvs.png')

Screenshot (61).png

From the plot of likes vs retwets we notice that WeRateDogs twitter account started in December,11,2015 with zero likes and retweets. In June,23,2016 over 65k users retweeted post of the account and 140k people like the post as their favorites. In February,16, 2017, the account got more popularity over 100k people like the post on the account as their favorites and 25k people retweeted the posts about WeRateDogs.In August 1,2017 few people retweets about their dogs and only 58k people like the post of dogs as their favorites.

Visualizing the most used devices by WeRateDogs users

print(twitter_archive_clean.source.value_counts())
twitter_archive_clean.source.value_counts().plot(kind='bar', figsize=(11,5), title='Most used Twitter source').set_xlabel("Number of Tweets")
plt.savefig('twitter_source')
plt.savefig('twitter_source.png')

Screenshot (62).png

  • From the bar plot above, most of the WeRateDogs users used Twitter for iPhone

From the bar plot above, most of the WeRateDogs users used Twitter for iPhone

Displaying the most common name among the dogs

master_df['name'].value_counts()[1:13]

Screenshot (63).png

Visualizing the most popular name of the dogs

sorted_order = master_df['name'].value_counts()[1:13].head(13).index
display(sorted_order)
plt.figure(figsize = (10,4))
values = np.count_nonzero(master_df['name'])
fclrs = ['grey' if (x > min(sorted_order)) else 'red' for x in sorted_order]
sns.countplot(data = master_df, x = 'name', order = sorted_order, orient = 'h')
plt.xlabel('Name', fontsize =16)
plt.ylabel('Count', fontsize =16);
plt.title('Most popular name of the dogs ')
plt.savefig('popular_dog_name.png')

Screenshot (64).png

  • From the above plot the most common name among the dogs are Charlie, Oliver, Lucy, Tucker, Penny, Winston, Sadie,Toby, Daisy, Lola, Koda and Bo

Visualizing 20 dogs breed P2 predicted by twitter user on WeRateDogs

breeds= master_df.p2.value_counts().head(20)
display(breeds)
plt.barh(breeds.index , breeds,color='red')
plt.xlabel('Count')
plt.ylabel('Dog Breed')
plt.title('Top 20 Dog Breeds prediction p2 on tweet')
plt.gca().invert_yaxis()
plt.show()
plt.savefig('20p2_dog_pred.png')

update.png

  • From the above plot, the most predicted dog image by twitter users on WeRateDogs is Labrador retriever

Visualizing 20 dogs breed P3 predicted by twitter user on WeRateDogs

breeds= master_df.p3.value_counts().head(20)
display(breeds)
plt.barh(breeds.index , breeds,color='purple')
plt.xlabel('Count')
plt.ylabel('Dog Breed')
plt.title('Top 20 Dog Breeds predction on tweet')
plt.gca().invert_yaxis()
plt.show()
plt.savefig('20d0gs_p3.png')

Screenshot (67).png

Screenshot (68).png

  • From the above plot, the most predicted dog image by twitter users on WeRateDogs is Labrador retriver

Visualizing the dog category with the highest score

  • Creating a new column called rating in master_df
master_df['rating'] = master_df['rating_numerator']/master_df['rating_denominator']
dog_rating=master_df.groupby('dog_category')['rating'].max()
print(dog_rating)
dog_rating.plot.bar(figsize=(10,5))
plt.ylim(top=10)
plt.title("Dog category with the highest rating score")
plt.xlabel("Dog category")
plt.legend([" Rating"])
plt.savefig('rating.png')

Screenshot (69).png

  • From the plot above, pupper is the dog with the high rating score

Visualizing the dog category with maximun favorite count

dog_favorite_count=master_df.groupby('dog_category')['favorite_count'].max()
display(dog_favorite_count)
dog_favorite_count.plot.bar(figsize=(10,5))
plt.title("Dog category favorite count")
plt.xlabel("Dog category")
plt.legend(["favorite_count"])
plt.savefig('favorite.png')

Screenshot (70).png

  • From the above plot doggo is the dog with maximum twitter favorite count

Visualizing the dog category with minimun retweet count

dog_retweet_count=master_df.groupby('dog_category')['retweet_count'].min()
display(dog_retweet_count)
dog_retweet_count.plot.bar(figsize=(10,5))
plt.title("Dog category with minimun retweet count")
plt.xlabel("Dog category")
plt.legend(["favorite_count"])
plt.savefig('retweet.png')

Screenshot (71).png

  • From the above plot pupper has the least retweet count

Visualizing Total Tweets made by WeRateDogs per month between 2015 to 2017

twitter_archive_master['year_month_date'] = twitter_archive_master['timestamp'].dt.year.astype(str) + '-' + \
                                            twitter_archive_master['timestamp'].dt.month.astype(str).str.pad(2, fillchar='0')
twitter_archive_master['is_tweet'] = np.where(twitter_archive_master.tweet_id.notnull(), 1, 0)
twitter_archive_monthly_tweets = twitter_archive_master.groupby('year_month_date').is_tweet.sum().reset_index()
plt.xticks(rotation=45)
ax = sns.lineplot(x='year_month_date', y='is_tweet', data=twitter_archive_monthly_tweets)
ax.set_title('Total Tweets made by WeRateDogs per month between 2015 to 2017')
ax.set_xlabel('Date')
ax.set_ylabel('Number of Tweets')

Screenshot (72).png

  • From the plot above, we notice in November,2015, 350 people tweeted on WeRateDogs. However, the number gradually decreased in August 2017, the plot revealed that the number of tweets had dropped to 0 tweet.

Visualizing some of the dogs image prediction p1

twitter_archive_master[['jpg_url','p1']]

Screenshot (73).png

Visualizing Miniature_pinscher

Image("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")

Miniature_pinscher.png

  • Programmatically download Miniature_pinscher image from the url
response = requests.get("https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg")
file = open("Miniature_pinscher.png", "wb")
file.write(response.content)
file.close()

Visualizing Rhodesian ridgeback

Image('https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg')

Rhodesian ridgeback.png

  • Programmatically download Rhodesian ridgeback image from the url
response = requests.get("https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg")
file = open("Rhodesian ridgeback.png", "wb")
file.write(response.content)
file.close()

Visualizing German Sheperd

Image('https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg')

german_sheperd.png

  • Programmatically download german_sheperd image from the url
response = requests.get("https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg")
file = open("german_sheperd.png", "wb")
file.write(response.content)
file.close()

Conclusion
We have reached the end of this data wrangling and visualization journey. Through our analysis , we found out that:

Among the dog category we found out that:

tweet favorite counts

  • doggo has 17599.225806 the most
  • followed by floofer with 11223.857143
  • multiclass has 15008.909091
  • pupper has 6204.975369
  • puppo has 19573.545455

retweet count

  • doggo has 5972.709677 the most
  • followed by puppo with 5325.318182
  • multiclass has 4548.272727
  • floofer has 1909.453202
  • pupper has 5325.318182

Most devices used by WeRaDogs users

  • Twitter for iPhone: 2016
  • Vine - Make a Scene: 91
  • Twitter Web Client: 31
  • TweetDeck : 10

** Dog category with the highest rating score

  • pupper 2.7
  • doggo 1.4
  • puppo 1.4
  • floofer 1.3
  • multiclass 1.3

dog category with maximun favorite count

  • doggo 144774
  • puppo 124103
  • pupper 108900
  • multiclass 49401
  • floofer 28112

Finaly, People preferred to favorite a dog over a retweet and both actions had decreased over time from 2015 to 2017.

The github repository to download the datasets, the wrangling report and the jupyter notebook can be found here.


If you want to contribute or you find any errors in this article please do leave me a comment.

You can reach me out on any of the matrix decentralized servers. My element messenger ID is @maximilien:matrix.org

If you are in one of the mastodon decentralized servers, here is my ID @maximilien@qoto.org

If you are on linkedIn, you can reach me here

If you want to contact me via email for freelance

If you want to hire me to work on machine learning, data science,IoT and AI related project, please reach out to me here

Warm regards,
Maximilien.

Did you find this article valuable?

Support Maximilien by becoming a sponsor. Any amount is appreciated!