This was a toy project that I worked on in 2015 when I was applying for the Insight Data Science Fellowship. It was my first data science project (and is still unfinished).
Nowadays, companies hiring data scientists may want to see whether a candidate has some hands-on experience “playing” with data. I have heard from many recently hired data scientists that it is great to have a small project on your resume. It does not have to be complete or super awesome. It just needs to show your problem-solving skills and how you approach certain types of data. So, I have decided to work on a small project that is somewhat related to my PhD work and actually quite interesting.
During my PhD, I ran many psychophysics experiments on human participants. Basically, an experiment consists of briefly flashing an image of shapes (stimuli) and asking the participants to answer a certain type of question. Most of the time, the stimuli looked quite dull: gray ellipses. However, some of my experiments involved multiple colored objects. The color of each object was independently drawn from a uniform distribution, and trust me, at certain times, some combinations of colors actually looked pretty. It was not just my opinion. My participants also liked the color experiments better simply because they were prettier.
As a mild skeptic of contemporary visual art, I have always wondered what defines beauty. Based on my experience as an ecologist, I know that in the animal kingdom, individuals with more visual symmetry are often preferred as mating partners because symmetry is frequently related to fitness. But what about humans?
Adobe Kuler is a website where you can make your own 5-color theme, either on your own or based on the website’s “color rules.” For instance, if you choose the “analogous” rule and pick a color for your theme, the remaining four colors are automatically selected based on the rule; in this case, they are four similar colors.
Once you decide the colors for your theme, you can share it with other users. You can also browse other users’ color themes. You can cast a vote for your favorite themes, and if you want to use someone’s theme, you can download the RGB codes of its five colors. This 5-color-theme information is quite useful for choosing the right colors for a visualization or presentation. Besides, when you create a theme, you can add text tags to it, which may describe its mood.
What is interesting here is the color rules. There are 5 colors in each theme, which means there are 7 possible patterns of group sizes when clustering the 5 colors of a theme. Some of these clustering patterns match specific color rules. For instance, the “triad” rule finds three color groups that are rather far away from each other. It then picks one color from one group and two colors from each of the other two. Here, the clustering pattern is 1 color, 2 colors, and 2 colors (or (1, 2, 2)). However, the current color rules cover only 3 of the 7 cluster patterns. More importantly, some of the popular themes in fact do not follow the rules that the website suggests.
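The count of 7 can be checked by listing every way to split 5 into group sizes, ignoring order. Below is a minimal sketch; the function name `partitions` is my own choice for illustration and is not part of the project's code.

```python
def partitions(n, max_part=None):
    """Yield the ways to write n as a sum of positive integers.

    Each result is a tuple of group sizes in non-increasing order,
    so (2, 2, 1) and (1, 2, 2) count as the same pattern.
    """
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    # Choose the largest group first, then split what is left using
    # groups that are no larger than the one just chosen.
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


patterns = list(partitions(5))
for pattern in patterns:
    print(pattern)
print(len(patterns))  # 7
```

The tuples come out largest group first, so the triad pattern written as (1, 2, 2) in the text appears here as `(2, 2, 1)`.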
Thus, based on users’ preferences, I expect I can first find which pattern of color clusters is popular. For instance, (1, 4), or one very different color and four similar colors, might be more popular than (2, 3), or two similar colors and three similar colors that are very different from the first two.
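One possible way to assign a theme to a pattern such as (1, 4) is to group its colors by hue. The sketch below is only an illustration under my own assumptions: it compares hue alone (ignoring saturation and brightness, so grays are handled poorly), it treats two colors as similar when their hues are within 30 degrees, and the function names and the example theme are made up.

```python
import colorsys


def hue_degrees(rgb):
    """Hue of an (R, G, B) color with 0-255 channels, in degrees."""
    r, g, b = (channel / 255.0 for channel in rgb)
    hue, _, _ = colorsys.rgb_to_hsv(r, g, b)
    return hue * 360.0


def hue_gap(a, b):
    """Shortest distance between two hues on the color wheel."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def cluster_pattern(theme, threshold=30.0):
    """Group sizes of a theme, smallest first, e.g. (1, 4).

    Colors are linked when their hues differ by at most `threshold`
    degrees (an assumed cutoff); linked colors share a group.
    """
    groups = []
    for hue in (hue_degrees(color) for color in theme):
        near, far = [], []
        for group in groups:
            is_near = any(hue_gap(hue, other) <= threshold for other in group)
            (near if is_near else far).append(group)
        # Merge the new hue with every group it is close to.
        merged = [hue] + [other for group in near for other in group]
        groups = far + [merged]
    return tuple(sorted(len(group) for group in groups))


# Hypothetical theme: one red and four greens.
example_theme = [(220, 40, 40), (40, 160, 60), (60, 180, 80),
                 (30, 140, 70), (80, 200, 90)]
print(cluster_pattern(example_theme))  # (1, 4)
```

Counting how often each pattern occurs among highly voted themes would then give a first answer to which patterns are popular. A distance in a perceptual color space would probably be a better choice than raw hue.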
Then, based on each cluster’s structure, it’ll be interesting to understand the relationships between the clusters. For instance, in the case of (1, 4), if you set 4 greenish colors, is the remaining color more likely to be red or yellow?
Once I figure out how to make sense of the data, I can build a recommender system based on the analysis. Right now, the website seems to have information only about the author of a theme, not about which user liked which themes, which makes building a recommender engine challenging. However, even without this information, it is still possible to build a simple version of a color recommender system. Thus, the ultimate goal will be that once a user picks a color, the engine suggests different patterns of color themes that feature the user’s choice.
The main goal of the project is to find relationships between colors in preferred color themes, and eventually to build a recommender system based on this information. The project consists of multiple steps, from collecting data to building a web application. In this post, I would like to summarize each step.
Each color theme has 5 colors. Users can view each theme and download information about each color in a theme. To gather color information from the themes, we should know how to scrape data from the website. More specifically, this involves answering the following questions:
- Which information do I need?
- Where is the data stored?
- How can I scrape the data efficiently?
- How can I convert the data into a convenient format for analysis?
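For the last question, one convenient format is a flat row of numeric RGB values per theme. Here is a minimal sketch, assuming each scraped theme arrives as a list of five hex color strings; that input format, the helper names, and the column names are my own assumptions, not something the website guarantees.

```python
def hex_to_rgb(code):
    """Convert a hex color such as '#FF8000' to an (R, G, B) tuple."""
    code = code.lstrip("#")
    if len(code) != 6:
        raise ValueError(f"expected 6 hex digits, got {code!r}")
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))


def theme_to_row(theme_id, hex_colors):
    """Flatten one theme into a dict with columns r1, g1, b1, ..., b5."""
    row = {"theme_id": theme_id}
    for position, code in enumerate(hex_colors, start=1):
        r, g, b = hex_to_rgb(code)
        row[f"r{position}"] = r
        row[f"g{position}"] = g
        row[f"b{position}"] = b
    return row


# Hypothetical theme used only to show the output shape.
row = theme_to_row("demo", ["#FF8000", "#003366", "#FFFFFF", "#000000", "#7FBF3F"])
print(row)
```

A list of such rows can be loaded directly into a table for the analysis steps that follow.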
Exploratory Data Analysis
Once I have the data in a manageable format, it’s time to play with the data! In the exploratory data analysis (EDA) step, we use summary statistics to get a glimpse of the data, and visualize the data to understand it intuitively. This ranges from plotting histograms or bar plots to visualizing the colors of each theme in a proper color space.
Once we understand the relationships among the themes, a possible data product we can build is a recommender engine. Here’s the basic idea. A user wants to find a 5-color theme. The user picks a color, and the recommender engine finds the rest of the colors that match well with the user’s color.
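The simplest version of this idea is a lookup: find existing themes that contain a color close to the one the user picked, and prefer themes with more votes. The sketch below is a toy illustration, not the engine itself. The theme names, colors, and vote counts are made up, and plain Euclidean distance in RGB is an assumption chosen for brevity (it does not match perceived color difference well).

```python
def rgb_distance(a, b):
    """Euclidean distance between two (R, G, B) colors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def recommend(picked, themes, top_n=3):
    """Rank themes by how close their nearest color is to `picked`.

    Ties in distance are broken in favor of themes with more likes.
    """
    scored = []
    for theme in themes:
        closest = min(rgb_distance(picked, color) for color in theme["colors"])
        scored.append((closest, -theme["likes"], theme["name"]))
    scored.sort()
    return [name for _, _, name in scored[:top_n]]


# Made-up themes for illustration only.
toy_themes = [
    {"name": "warm", "likes": 120,
     "colors": [(210, 40, 40), (230, 120, 40), (240, 200, 80),
                (250, 230, 180), (120, 40, 30)]},
    {"name": "cool", "likes": 300,
     "colors": [(30, 60, 200), (40, 120, 220), (80, 180, 230),
                (200, 230, 250), (20, 30, 90)]},
    {"name": "forest", "likes": 80,
     "colors": [(30, 100, 40), (60, 140, 70), (120, 180, 90),
                (200, 220, 160), (90, 60, 30)]},
]

print(recommend((200, 30, 30), toy_themes, top_n=1))  # ['warm']
```

The remaining four colors of each recommended theme are then the suggested matches for the user's color.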
This is my first “data-sciency” project, and it is very likely that there will be a lot of trial and error as we proceed. I am going to write blog posts as soon as I make even a little progress, to motivate myself and readers. :) In the next post, I will discuss pre-processing data.