Webscraping: XML and JSON

When you click a theme on the Kuler website, it shows the theme’s page, where you can see an enlarged image of the theme and other information. On the right side are “Action” and “Info” frames. The latter contains the following information:

  • Author of the theme (“Created By”): nominal
  • Date created (“Created”): ordinal
  • Number of views (“Viewed”): quantitative
  • Rating: quantitative (shown in number of stars)
  • Number of likes (“Appreciated By”): quantitative
  • Tags: nominal

Color information only appears when you click “Edit Copy” in the “Action” frame. After clicking it, you’ll see a hexcode and an RGB code. A hexcode is simply a hexadecimal representation of an RGB code: for instance, when the first two digits of a hexcode are converted into a decimal number, you get the R value of the RGB code.
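A minimal sketch of this conversion in Python (the helper name is my own):

```python
def hex_to_rgb(hexcode):
    """Convert a hexcode like '#E6E2AF' to an (R, G, B) tuple."""
    hexcode = hexcode.lstrip('#')
    # each pair of hex digits is one channel, converted with base 16
    return tuple(int(hexcode[i:i + 2], 16) for i in (0, 2, 4))

print(hex_to_rgb('#E6E2AF'))  # (230, 226, 175)
```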

Now you may wonder whether, in order to scrape this information, you have to click every single theme. Luckily, you don’t. The “Explore” page, where you can see multiple themes at once, contains all of the above-mentioned information. When you right-click this page and select the “inspect element” option (in Chrome), you will see the structured representation of the webpage, i.e., the hierarchical structure of the Document Object Model (DOM).

One of the nice things about the inspecting tool is that it highlights the part of the code corresponding to wherever your mouse cursor is located, and vice versa. This way, I found that the information for all themes is stored in an HTML division with the class name “collection-assets.” Under this class, there are multiple “collection-assets-item” classes, each of which corresponds to one theme on the page. And here, finally, you can see the hexcode of each color in a theme. The example below is from the theme “sandy stone beach ocean diver”.

<div class="collection-assets-item">
<div class="content" aria-haspopup="true">
<div class="frame ctooltip">
<div style="background: #E6E2AF"></div>
<div style="background: #A7A37E"></div>
<div style="background: #EFECCA"></div>
<div style="background: #046380"></div>
<div style="background: #002F2F"></div>
</div>
</div>
</div>
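A quick sketch of how the five hexcodes could be pulled out of that markup with a regular expression (using a hard-coded copy of the snippet; in practice the HTML would come from the scraper):

```python
import re

# A copy of the "frame ctooltip" division shown above
html = '''<div class="frame ctooltip">
<div style="background: #E6E2AF"></div>
<div style="background: #A7A37E"></div>
<div style="background: #EFECCA"></div>
<div style="background: #046380"></div>
<div style="background: #002F2F"></div>
</div>'''

# capture the 6-digit hexcode after each "background:" declaration
hexcodes = re.findall(r'background: (#[0-9A-Fa-f]{6})', html)
print(hexcodes)  # ['#E6E2AF', '#A7A37E', '#EFECCA', '#046380', '#002F2F']
```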

In the same class, you will also be able to find other information such as the name, the number of likes, and so on. The <ul> tag means the elements are in the form of an unordered list. As you can see below, the quantitative data for the number of views (“views-count”) and likes (“likes-count”) are approximated, which is not ideal.
<ul class="assets-item-meta">
	<li class="name">
        <a class="ctooltip" href="/sandy-stone-beach-ocean-diver-color-theme-15325/">sandy stone beach ocean diver
        </a></li>
	<li class="info">
<ul>
	<li class="views"></li>
	<li class="views-count">9K+</li>
	<li class="likes"></li>
	<li class="likes-count">9K+</li>
	<li class="comments"></li>
	<li class="comments-count">339</li>
</ul>
</li>
</ul>

How to scrape data from a webpage

Browser Automation

We now know where to find the data. It’s time to extract it from the DOM tree. There are many tools out there for web scraping, but here I used Selenium. As the website simply puts it, “Selenium automates browsers.” With Selenium you can open and close the browser, load a website, click an element, and finally scrape the data, and you can automate the whole process with your code. After installing the Selenium package, we can open a browser, load the Explore page, and scroll down repeatedly to load more themes in Python:

import time
from selenium import webdriver

driver = webdriver.Chrome()                    # requires ChromeDriver on your PATH
driver.get('https://color.adobe.com/explore')  # the Explore page

reloads = 5  # each scroll loads another batch of themes
pause = 3    # seconds to wait for the new themes to load
for _ in range(reloads):
	driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
	time.sleep(pause)

XML

We know the location of the data, but calling it is a different matter. Selenium WebDriver can call an element based on its XPath. XPath, or XML Path Language, is a language for navigating elements in an XML document. In the inspector, you can simply right-click a line and copy its XPath. For instance, the XPath of the first color of the first theme is

//*[@id="content"]/div/div/div[1]/div/div/div[1]

Finally, asking Selenium for this element’s style attribute gives us the RGB value of the 1st color of the 1st theme:

u'background: rgb(230, 226, 175) none repeat scroll 0% 0%;'
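A sketch of that lookup (assuming `driver` is already on the Explore page; the parsing helper is my own):

```python
import re

# The Selenium call (commented out so the sketch runs standalone):
# xpath = '//*[@id="content"]/div/div/div[1]/div/div/div[1]'
# style = driver.find_element_by_xpath(xpath).get_attribute('style')
style = 'background: rgb(230, 226, 175) none repeat scroll 0% 0%;'

def parse_rgb(style):
    """Pull the (R, G, B) tuple out of an inline style string."""
    m = re.search(r'rgb\((\d+),\s*(\d+),\s*(\d+)\)', style)
    return tuple(int(v) for v in m.groups()) if m else None

print(parse_rgb(style))  # (230, 226, 175)
```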

JSON

The main drawback of webscraping through XML was that I had to loop over all 5 colors in a theme, and again across all themes. Plus, each page loads only 36 themes at a time, so I had to scroll down to the bottom of the page to load the next 36 themes, which adds to the computation time. Moreover, the preference information (likes and ratings) is approximated (e.g., 9K+), and in order to get the actual numbers, I had to click each theme.

An alternative is to scrape web data as JSON. The main difference between XML and JSON is that JSON is far less verbose, which makes it lighter and easier to scrape. Besides, the structure of a JSON object is very similar to that of a Python dictionary. There are many packages one can use to read JSON objects (BeautifulSoup, Requests, Urllib2/Urllib3, json, etc.). Here, I used Requests.
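Because a JSON object maps directly onto a Python dictionary, reading one is a one-liner with the standard json module. A sketch (the field names here are illustrative, not the exact Kuler API schema):

```python
import json

# A trimmed, made-up response for one theme (not the real API schema)
raw = '{"name": "sandy stone beach ocean diver", "like_count": 9002}'

theme = json.loads(raw)  # JSON object -> Python dict
print(theme['name'])     # fields are accessed like dictionary keys
```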

I found the JSON objects on the website using the Chrome Inspector as well. JSON objects can be accessed through “XHR” under the “Network” tab. Make sure you refresh the page after you go to the XHR section. Once the page is loaded, the JSON objects will appear. For each object, you can access its preview, headers, and response; the information we want is stored in the JSON response. Unlike the XML route, this JSON response has 1) information on multiple themes in one response, 2) raw (not approximated) data for each theme, and 3) richer information about each theme, such as tags. However, you still have to scroll down to the bottom of the page to load more themes.

JSON responses and errors

So, I used Selenium to load about 5,000 themes in total. The next step is to call the JSON response of each object. To do this, you need the JSON address, which can easily be obtained by right-clicking a JSON object in the Inspector. For example, the first response’s address is https://color.adobe.com/api/v2/themes?filter=public&startIndex=0&maxNumber=36&sort=like_count&time=all.

The next JSON object’s address differs only by its startIndex, which is simply 36*(n-1) for the nth JSON object. Thus, one can generate the address for each JSON object in a for loop. Now, one can load a JSON response using Requests:

import requests
url = 'https://color.adobe.com/api/v2/themes?filter=public&startIndex=0&maxNumber=36&sort=like_count&time=all'
r = requests.get(url)
r.json()
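For the record, the startIndex pattern above can be wrapped in a small helper for generating each page’s address (the function name is my own):

```python
def theme_api_url(n, page_size=36):
    """Address of the nth JSON object; startIndex is 36 * (n - 1)."""
    return ('https://color.adobe.com/api/v2/themes?filter=public'
            '&startIndex={}&maxNumber={}&sort=like_count&time=all'
            .format(page_size * (n - 1), page_size))

# the first three pages' addresses
urls = [theme_api_url(n) for n in range(1, 4)]
```

Looping requests.get over these URLs would, in principle, walk through every page of themes.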

The only problem was… that the request did not work. It gave me the following output:

{u'message': u'', u'reason': u''}

This is different from what I see in the Inspector (see the figure above). So I used a different package to read the same JSON response:

import urllib2
url = 'https://color.adobe.com/api/v2/themes?filter=public&startIndex=0&maxNumber=36&sort=like_count&time=all'
request = urllib2.Request(url)
content = urllib2.urlopen(request).read()

However, this gives me “HTTPError: HTTP Error 403: Forbidden”. I did some research on this issue, and the closest clue I found was that the website recognizes my program as a bot and blocks the scraping. One way to bypass this is to add headers so that the request looks like it comes from a human agent. I tried this method, but it did not work. It is embarrassing to admit, but because I could not solve this problem, I had to copy and paste the JSON responses manually. I am still working on figuring out how to automate this part of the scraping process. But for now, I finally have a text file with about 5,000 themes.
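For reference, this is roughly the header trick I attempted (the User-Agent string below is just an example of a browser-like value); it did not work for me, but it may be a starting point for others:

```python
# Send a browser-like User-Agent so the request does not look like a bot.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/45.0 Safari/537.36'),
}

url = ('https://color.adobe.com/api/v2/themes?filter=public'
       '&startIndex=0&maxNumber=36&sort=like_count&time=all')

# import requests
# r = requests.get(url, headers=headers)  # in my case this still failed
```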
