Exploratory data analysis (EDA) is a process where you figure out the general features of your data. It’s a close and casual conversation between you and the data, and a very fun process! You make histograms, scatter plots or bar plots to see what your data looks like. This can give you a general idea about your data, help you discover interesting facts about it, and finally guide you in the right direction towards your goal. I always go back and forth between applying models/algorithms and EDA for these reasons.
Before jumping into the analysis, I first want to check whether the website is still active. The histogram below shows the number of themes uploaded by all users per year, from Nov 2006 to June 2015. Themes have been uploaded consistently every year. The bump around 2014 suggests that, for some reason, the website became more popular then. Since this is based on only the top 5000 themes, the actual number of uploads is possibly much larger if we count all themes.
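As a rough sketch of how the per-year counts behind such a histogram could be computed, the snippet below tallies uploads by year using only the Python standard library. The dates are made up for illustration, and the name `themes_per_year` is hypothetical; the real input would be the upload dates of the scraped themes.

```python
from collections import Counter
from datetime import date

# Made-up upload dates for illustration only (not the real data).
upload_dates = [
    date(2006, 11, 3),
    date(2006, 12, 20),
    date(2010, 5, 14),
    date(2014, 2, 1),
    date(2014, 7, 9),
    date(2015, 6, 30),
]


def themes_per_year(dates):
    """Return a dict mapping upload year to number of themes, sorted by year."""
    counts = Counter(d.year for d in dates)
    return dict(sorted(counts.items()))


yearly_counts = themes_per_year(upload_dates)
for year, n in yearly_counts.items():
    # Crude text histogram: one '#' per theme uploaded that year.
    print(year, "#" * n)
```

Keep in mind that the first and last years only cover part of the calendar year (Nov to Dec 2006 and Jan to June 2015), so their bars are not directly comparable with full years.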
There are multiple measures of user preference for themes. We have the following four:
- Number of likes (“Likes”)
- Number of views (“Views”)
- Number of reviews (“Review Count”)
- Average review rating (“ReviewAvr”)
Review Counts and Ratings
There doesn’t seem to be a significant correlation between the two variables. More importantly, based on the histogram on top, you can see that a large portion of themes have an average rating of zero. This does not necessarily mean these themes have no reviews; it could also be that all of their reviews gave a rating of zero.
However, this figure clearly shows that a lot of themes actually have no reviews. Plus, we can see that with a very low number of reviews, the review average can vary significantly. This is very similar to using Yelp: you want to find a good restaurant with good reviews, but you don’t want to stumble upon highly rated ones with very few reviews. Plus, the average review rating is only a summary of all review ratings, whose distribution may vary across themes. Thus, I decided to use neither the number of reviews nor the average review rating.
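The distinction between "no reviews" and "reviewed, but rated zero" can be checked directly rather than read off the plot. Here is a minimal sketch, assuming each theme is a dict keyed by the “Review Count” and “ReviewAvr” fields listed above; the rows themselves and the function name `split_zero_ratings` are invented for illustration.

```python
# Invented rows for illustration; only the field names come from the data set.
themes = [
    {"Review Count": 0, "ReviewAvr": 0.0},
    {"Review Count": 0, "ReviewAvr": 0.0},
    {"Review Count": 1, "ReviewAvr": 5.0},
    {"Review Count": 2, "ReviewAvr": 0.0},
    {"Review Count": 40, "ReviewAvr": 4.2},
]


def split_zero_ratings(rows):
    """Count themes with no reviews vs. themes reviewed but averaging zero."""
    no_reviews = sum(1 for r in rows if r["Review Count"] == 0)
    rated_zero = sum(
        1 for r in rows if r["Review Count"] > 0 and r["ReviewAvr"] == 0
    )
    return no_reviews, rated_zero


no_reviews, rated_zero = split_zero_ratings(themes)
print("themes with no reviews:", no_reviews)
print("themes reviewed but averaging zero:", rated_zero)
```

In this toy data, the theme with a single 5.0 review outranks the one with 40 reviews averaging 4.2, which is exactly the small-sample problem described above.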
Likes and Views
This figure shows a positive correlation between the number of likes and the number of views. Then which is the better measure of the two? Since users have to press the thumbs-up button to give a like to a theme, I would say a like represents a more active form of opinion. Thus, I decided to use the number of likes as a measure of preference. Based on the histogram of the number of likes above, you can see that the distribution has a very long tail. The median number of likes was in fact only 108.0. Also, you can see that some themes are much more popular than the others. The one at the top right has almost 14000 likes. One could consider removing these outliers when using likes as a measure of preference in the analysis.
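To put numbers on the correlation and the long tail, one option is to compute the Pearson coefficient between likes and views and compare the median with the mean. The sketch below does this by hand with the standard library; the values are made up (chosen only so that the median and the largest value echo the figures quoted above), and `pearson` is a hypothetical helper, not something from the original analysis.

```python
import math
from statistics import median

# Made-up values for illustration only; not the real data set.
likes = [12, 50, 108, 300, 14000]
views = [400, 1500, 3000, 9000, 250000]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(likes, views)
median_likes = median(likes)
mean_likes = sum(likes) / len(likes)

print("correlation between likes and views:", round(r, 3))
# In a long-tailed distribution the mean sits far above the median.
print("median likes:", median_likes, "mean likes:", mean_likes)
```

A single very popular theme can dominate both the mean and the correlation, which is one reason to consider setting outliers aside or looking at the data on a log scale.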
Below, I made a bar plot of the average number of likes across time. It is not clear exactly when the website first launched. In 2006, which covers only two months (Nov and Dec), the average stands out; the website might have been so popular at the time that many users gave likes to certain themes. Apart from that, you can see that the average number of likes per year does not vary much across time, and also within the same year.
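The yearly averages behind a bar plot like this amount to a group-by-year followed by a mean. Here is a minimal sketch with invented `(year, likes)` pairs; the function name `mean_likes_by_year` is hypothetical.

```python
from collections import defaultdict

# Invented (upload year, likes) pairs for illustration only.
records = [
    (2006, 900), (2006, 1100),
    (2010, 100), (2010, 140),
    (2014, 90), (2014, 150), (2014, 120),
]


def mean_likes_by_year(rows):
    """Return a dict mapping year to the mean number of likes, sorted by year."""
    totals = defaultdict(lambda: [0, 0])  # year -> [sum of likes, theme count]
    for year, likes in rows:
        totals[year][0] += likes
        totals[year][1] += 1
    return {year: s / n for year, (s, n) in sorted(totals.items())}


yearly_mean = mean_likes_by_year(records)
for year, avg in yearly_mean.items():
    print(year, avg)
```

Because the distribution of likes has a long tail, a per-year median could be a more robust alternative to the mean here.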