Preprocessing the scraped Kuler data

Filtering the data

Before we have a clean dataset, let’s take a look at an example JSON response we scraped from the website. This is a JSON response from the first theme.

As Python’s data-type dictionary, it seems to be a set of unordered key:value arrays. Here is the summary of the keys and their meaning:

  • author: author Id and name
  • comment: number of comments
  • createAt: date created
  • tags: text description of the theme
  • description
  • harmony: information about creating process of the theme (e.g., rule)
  • href: url
  • originalTheme: if the theme is created based on an already-existing theme, its value is the id of that parent theme
  • access
  • like: number of likes
  • view: number of views
  • rating: review count and average rating
  • iccProfiles: ?
  • swatches: color information in RGB, hex, and colorIndex (?)
  • name: name of the theme
  • id: theme id
  • editedAt: latest date edited

As you can see, each theme has more than a dozen different kinds of information. Bold entries are the ones that I decide to include for preliminary analysis. There are some information that I would like to consider in the future, and I am going to talk about in a different post.

After webscraping, I saved all JSON responses in a text file (.txt). Using Pandas, I can easy convert this JSON format into a DataFrame using read_json function.

Authors

Authors refer to the creator of a color theme. Pandas apply function comes in handy in this case. You can pass an input from each row and generate a new column. Since we’re going to do this for many different columns, building a function is useful: “func” which gives an output of a value based on its input: DataFrame column and and id.

Time

For a time-series analysis, the CreateAt column should be converted to a datetime format. We can use a function “strptime” in the datatime package.

from datetime import datetime
def convert_time(x):
    k = x[:-6]
    k = datetime.strptime(k, '%Y-%m-%dT%H:%M:%S')
    return pd.Series(k)
df['Time']=df['createdAt'].apply((lambda x: convert_time(x)))

Color

The color column (“swatches”) also needs processing because it has a nested data format. For a theme, “swatches” has information about all 5 colors. Each color is represented in various color formats (RGB code, hex code, etc.) I decided to extract the hex code because it’s a single string value which can be easily transformed into an RGB code later: the hex code is simply a hexadecimal code for RGB. Again, I used a function to pass inputs from “swatches” to create a new column. Here, I am creating a column for each color in a theme. So we end up having 5 new columns.

def func_extract_color(x,colorIdx):
    k = x[colorIdx]["hex"] # here I read key values
    return pd.Series(k)
df['C1']=df['swatches'].apply((lambda x: func_extract_color(x,0)))

Now, I can remove the original columns that are used for extraction and creating new columns. For convenience, I rearranged the columns. This gives a DataFrame that looks like this:

1435239262067

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s