Filtering the data
Before we have a clean dataset, let’s take a look at an example JSON response we scraped from the website. This is a JSON response from the first theme.
As Python’s data-type dictionary, it seems to be a set of unordered key:value arrays. Here is the summary of the keys and their meaning:
- author: author Id and name
- comment: number of comments
- createAt: date created
- tags: text description of the theme
- description
- harmony: information about creating process of the theme (e.g., rule)
- href: url
- originalTheme: if the theme is created based on an already-existing theme, its value is the id of that parent theme
- access
- like: number of likes
- view: number of views
- rating: review count and average rating
- iccProfiles: ?
- swatches: color information in RGB, hex, and colorIndex (?)
- name: name of the theme
- id: theme id
- editedAt: latest date edited
As you can see, each theme has more than a dozen different kinds of information. Bold entries are the ones that I decide to include for preliminary analysis. There are some information that I would like to consider in the future, and I am going to talk about in a different post.
After webscraping, I saved all JSON responses in a text file (.txt). Using Pandas, I can easy convert this JSON format into a DataFrame using read_json function.
Authors
Authors refer to the creator of a color theme. Pandas apply function comes in handy in this case. You can pass an input from each row and generate a new column. Since we’re going to do this for many different columns, building a function is useful: “func” which gives an output of a value based on its input: DataFrame column and and id.
Time
For a time-series analysis, the CreateAt column should be converted to a datetime format. We can use a function “strptime” in the datatime package.
from datetime import datetime def convert_time(x): k = x[:-6] k = datetime.strptime(k, '%Y-%m-%dT%H:%M:%S') return pd.Series(k) df['Time']=df['createdAt'].apply((lambda x: convert_time(x)))Color
The color column (“swatches”) also needs processing because it has a nested data format. For a theme, “swatches” has information about all 5 colors. Each color is represented in various color formats (RGB code, hex code, etc.) I decided to extract the hex code because it’s a single string value which can be easily transformed into an RGB code later: the hex code is simply a hexadecimal code for RGB. Again, I used a function to pass inputs from “swatches” to create a new column. Here, I am creating a column for each color in a theme. So we end up having 5 new columns.
def func_extract_color(x,colorIdx): k = x[colorIdx]["hex"] # here I read key values return pd.Series(k) df['C1']=df['swatches'].apply((lambda x: func_extract_color(x,0)))Now, I can remove the original columns that are used for extraction and creating new columns. For convenience, I rearranged the columns. This gives a DataFrame that looks like this: