Over the past several posts, I gave a brief overview of the project and wrote about each step, from data wrangling to machine learning. There is a lot of room for improvement, from automating the web scraping and tweaking the machine learning algorithm to taking other interesting variables into consideration. Still, I am very glad I could show that we can get a reasonably decent clustering result using the color coordinates as features. In this post, I want to discuss future directions for this project and introduce some interesting ideas.
One way to build a recommender system is to take a user’s choice of color as the input. Based on that color, we can complete a five-color theme by suggesting the remaining 4 colors.
Once we figure out all the clusters and their color clouds, it is possible to estimate the probability of a color belonging to a certain cloud. By comparing the conditional probability of the color given each cloud, we can estimate which color cloud is more likely to contain it.
Based on user preference, we can also estimate the probability of a cluster being preferred (= receiving a like from a user). This way, we can show a list of color themes, each generated from a different cluster, and let the user explore to some degree.
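To make the comparison concrete, here is a minimal sketch of scoring a color against two clouds. It assumes each cloud can be approximated by an axis-aligned Gaussian in Lab space, which is my simplification and not something the data has confirmed. The function names and the Lab points are made up for illustration.

```python
import math

def fit_cloud(colors):
    """Return per-channel means and variances of a list of (L, a, b) tuples."""
    n = len(colors)
    means = [sum(c[i] for c in colors) / n for i in range(3)]
    # A small floor keeps the variance strictly positive for very tight clouds.
    variances = [max(sum((c[i] - means[i]) ** 2 for c in colors) / n, 1e-6)
                 for i in range(3)]
    return means, variances

def log_likelihood(color, cloud):
    """Log p(color | cloud) under an axis-aligned Gaussian (an assumption)."""
    means, variances = cloud
    return sum(-0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
               for x, m, v in zip(color, means, variances))

def most_likely_cloud(color, clouds):
    """Index of the cloud under which this color is most probable."""
    return max(range(len(clouds)),
               key=lambda k: log_likelihood(color, clouds[k]))

# Made-up Lab points: one reddish cloud and one bluish cloud.
reddish = [(50, 60, 40), (55, 65, 45), (45, 55, 35), (52, 62, 38)]
bluish = [(40, 10, -50), (45, 15, -55), (35, 5, -45), (42, 12, -48)]
clouds = [fit_cloud(reddish), fit_cloud(bluish)]

print(most_likely_cloud((51, 61, 41), clouds))
```

Comparing log-likelihoods rather than raw probabilities avoids numerical underflow when a color is far from a cloud.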
Something about the first color
However, we still need to figure out what the user’s color choice really represents. Does the user want this color to appear as the first color or the middle (the 3rd) color?
In fact, when I plotted the first color of every theme in one figure, it was clearly visible that users’ selections of the first color fall into roughly three clusters. I ran a k-means clustering analysis (k=3) and colored each cluster with the color of its centroid. Now the clouds are easier to see.
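For readers who want to try this, below is a small self-contained sketch of k-means with k=3 on Lab coordinates. It is not the exact code I ran: the data points are made up, and I use a deterministic farthest-first initialization (my choice here, so the example is reproducible) instead of the random initialization most libraries default to.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two color tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def farthest_first_init(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids)))
    return centroids

def kmeans(points, k=3, max_iterations=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append(tuple(sum(axis) / len(members)
                                     for axis in zip(*members)))
            else:
                updated.append(centroids[j])  # keep an empty cluster in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Made-up first colors (Lab) forming three loose groups.
first_colors = [
    (30, 50, 30), (32, 52, 28), (28, 48, 33),
    (80, 0, 70), (82, -2, 72), (78, 3, 68),
    (40, 10, -50), (42, 12, -48), (38, 8, -52),
]
centroids, labels = kmeans(first_colors, k=3)
print(labels)
```

Each centroid can then be converted back to a displayable color to paint its cluster, which is how the centroid-colored figure was made.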
But, let’s be honest. These colors are dull and ugly. So this time, I used actual color for each data point.
Now you can see that the colors in a cloud are actually not even close to the color of its centroid (always visualize!). But still, why do users show this tendency to go for certain colors over others?
What is even more interesting is that this clustering pattern is weaker in the 2nd, 3rd, and 4th colors (or “middle” colors), but becomes stronger again in the 5th color. The slideshow below shows how these color distributions change from the first to the fifth color.
This might mean that, because the colors in a theme are arranged in a linear, ordered sequence, the first and the last colors act as anchors, and the middle colors are more like bridging colors, which is SUPER INTERESTING! But of course, incorporating this idea is a different issue. We can give more weight to the first and last colors to reflect their significance, or we can find a smarter way to implement this.
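One simple way to try the weighting idea is a position-weighted distance between two themes. This is only a sketch: the weights below are hypothetical placeholders, not values I have tuned, and the function name is mine.

```python
import math

# Hypothetical weights: the first and last colors count twice as much.
POSITION_WEIGHTS = (2.0, 1.0, 1.0, 1.0, 2.0)

def weighted_theme_distance(theme_a, theme_b, weights=POSITION_WEIGHTS):
    """Weighted Euclidean distance between two themes, where each theme is
    a sequence of five (L, a, b) tuples in display order."""
    total = 0.0
    for weight, color_a, color_b in zip(weights, theme_a, theme_b):
        total += weight * sum((x - y) ** 2 for x, y in zip(color_a, color_b))
    return math.sqrt(total)

base = [(50, 0, 0)] * 5
first_changed = [(53, 0, 0)] + [(50, 0, 0)] * 4
middle_changed = [(50, 0, 0)] * 2 + [(53, 0, 0)] + [(50, 0, 0)] * 2

# The same shift costs more at an anchor position than in the middle.
print(weighted_theme_distance(base, first_changed))
print(weighted_theme_distance(base, middle_changed))
```

Multiplying each color's coordinates by the square root of its weight before clustering has the same effect for algorithms that only accept plain Euclidean distance.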
Another thing to consider regarding a user’s color choice is how dominant this color is in a theme. Let’s say a user chooses red. A recommender engine can find and suggest these two themes that contain red. However, Seawolf (left) has only one red, while Circus III has three colors with a red/orange hue. So red/orange is more dominant in the latter than in the former. And it’s difficult for us to know what a user meant when he/she chose red to be in the theme. Again, this can be partially solved by showing multiple clusters with different degrees of color dominance, especially since we start cold.
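Dominance could be quantified as the fraction of a theme's colors whose hue falls near the chosen hue. The sketch below is one possible definition, not an established measure: the tolerance and saturation thresholds are guesses, and the RGB values are invented stand-ins rather than the actual Seawolf or Circus III colors.

```python
import colorsys

def hue_distance(h1, h2):
    """Shortest angular distance between two hues, in degrees."""
    d = abs(h1 - h2) % 360.0
    return min(d, 360.0 - d)

def color_dominance(theme_rgb, target_hue, tolerance=30.0, min_saturation=0.2):
    """Fraction of colors within `tolerance` degrees of `target_hue`.
    Near-gray colors are skipped because their hue is not meaningful."""
    matches = 0
    for r, g, b in theme_rgb:
        h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s >= min_saturation and hue_distance(h * 360.0, target_hue) <= tolerance:
            matches += 1
    return matches / len(theme_rgb)

# Invented themes: one with a single red, one dominated by red/orange.
one_red = [(200, 30, 30), (30, 60, 90), (40, 80, 120),
           (200, 200, 200), (20, 20, 20)]
mostly_red = [(220, 40, 30), (240, 120, 20), (230, 70, 40),
              (60, 20, 90), (250, 220, 60)]

RED_HUE = 0.0
print(color_dominance(one_red, RED_HUE))
print(color_dominance(mostly_red, RED_HUE))
```

A recommender could then show themes from both the low-dominance and high-dominance ends and let the user's reaction tell us which one was meant.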
Do absolute colors matter?
Now, this makes me rethink my choice of features: a 15D vector built from the raw Lab coordinates. Let’s say we have four colors, namely A, B, C, and D. The perceptual distance between A and B is the same as that between C and D. However, A and C are very far apart in the color space. If absolute color values matter, then these two pairs should be treated differently. But if not, then we can group them together. To test this idea, I can build a 15D vector of Lab coordinates, but translated so that the new mean of all 5 colors is the origin (to be more accurate, neutral gray [50, 0, 0], since L ranges from 0 to 100). Then we can compare the results of the two.
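The translation itself is simple. Here is a minimal sketch, with function names of my own choosing and made-up Lab values, showing that two themes with the same internal layout but different absolute colors end up with the same translated 15D vector.

```python
NEUTRAL_GRAY = (50.0, 0.0, 0.0)

def center_theme(theme_lab, target=NEUTRAL_GRAY):
    """Translate a theme so the mean of its colors lands on `target`.
    Relative positions (and so pairwise distances) are unchanged."""
    n = len(theme_lab)
    mean = [sum(color[i] for color in theme_lab) / n for i in range(3)]
    shift = [t - m for t, m in zip(target, mean)]
    return [tuple(x + s for x, s in zip(color, shift)) for color in theme_lab]

def to_feature_vector(theme_lab):
    """Flatten five (L, a, b) tuples into one 15D list."""
    return [x for color in theme_lab for x in color]

# Two made-up themes: theme_b is theme_a shifted by (20, -30, 40) in Lab.
theme_a = [(20, 10, 10), (30, 10, 10), (40, 10, 10), (50, 10, 10), (60, 10, 10)]
theme_b = [(L + 20, a - 30, b + 40) for L, a, b in theme_a]

vec_a = to_feature_vector(center_theme(theme_a))
vec_b = to_feature_vector(center_theme(theme_b))
print(vec_a)
print(vec_b)
```

One caveat: translated coordinates may fall outside the range of displayable colors, so they are best treated as features for clustering only, not as colors to render.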
Better “color rules”
The color rules provided by the website are not very versatile. They do not consider absolute color values, and color distance is defined in the HSV color space. A clustering analysis with the translated 15D vector may be able to provide better and more sophisticated color rules.
The first main problem that I need to solve is automating the web scraping. I collected the top 5000 most popular themes in order of number of likes, and even the last data point in the dataset still had about 90 likes. Considering that the median number of likes is about 100, it is definitely possible to scrape more themes from the website. Automation is in dire need!
User data: number of likes
Yes, I mentioned in the EDA post that I would use the number of likes, but I actually didn’t use this information at all in the clustering analyses. One simple way to incorporate this variable is to replicate color themes in proportion to their number of likes. For instance, if theme A has 50 times more likes than theme B, we can simply replicate theme A 50 times and then run a clustering analysis. To do this, we should be able to rule out some outliers, such as the extremely popular themes.
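A rough sketch of the replication step is below. Instead of dropping outliers entirely, it caps the number of copies per theme, which is an alternative I am assuming here for illustration. The `base_likes` and `max_copies` values and the theme labels are placeholders.

```python
def replicate_by_likes(themes, likes, base_likes, max_copies=10):
    """Repeat each theme roughly (likes / base_likes) times, with at least
    one copy and at most `max_copies`, so extremely popular themes cannot
    swamp the clustering."""
    replicated = []
    for theme, n_likes in zip(themes, likes):
        copies = max(1, round(n_likes / base_likes))
        copies = min(copies, max_copies)
        replicated.extend([theme] * copies)
    return replicated

# Placeholder labels standing in for 15D theme vectors.
themes = ["theme_a", "theme_b", "theme_c"]
likes = [100, 500, 100000]

weighted = replicate_by_likes(themes, likes, base_likes=100, max_copies=10)
print(len(weighted))
print(weighted.count("theme_c"))
```

If the clustering routine accepts per-sample weights, passing the capped copy counts as weights should give a similar effect without inflating the dataset.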
Looking for a formula
Once we have settled on decent clustering results, we can characterize the relationship between colors within each cluster. We already saw that once we collapse all themes from a cluster, we get a cloud of dots. We can approximate this cloud by assuming that it follows a multivariate Gaussian distribution. We can parameterize each cloud using the mean and standard deviation along each dimension and, by using an optimization tool (e.g., a genetic algorithm), approximate the cloud. Once we have this formula, we can create a new theme from a cluster through a probabilistic process. Depending on the fit and noisiness of the cloud, it is also possible to use only a central volume around the mean.
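As a starting point, here is a sketch of the simplest version of this idea. It estimates the mean and standard deviation of each dimension directly from the data rather than with a genetic algorithm, treats the dimensions as independent (an axis-aligned Gaussian, which is my simplification), and rejects samples that fall outside a chosen number of standard deviations to keep only the central volume. All names and numbers are illustrative.

```python
import random
import statistics

def fit_axis_aligned_gaussian(points):
    """Per-dimension mean and (population) standard deviation of a cloud."""
    dims = list(zip(*points))
    means = [statistics.mean(d) for d in dims]
    stds = [statistics.pstdev(d) for d in dims]
    return means, stds

def normalized_sq_distance(sample, means, stds):
    """Sum of squared z-scores, skipping dimensions with zero spread."""
    return sum(((x - m) / s) ** 2
               for x, m, s in zip(sample, means, stds) if s > 0)

def sample_color(means, stds, max_sigma=2.0, rng=random):
    """Draw one color, rejecting draws outside the central volume."""
    while True:
        sample = [rng.gauss(m, s) for m, s in zip(means, stds)]
        if normalized_sq_distance(sample, means, stds) <= max_sigma ** 2:
            return tuple(sample)

# Made-up cloud of Lab points for one color position in one cluster.
cloud = [(50, 0, 0), (60, 10, -10), (40, -10, 10)]
means, stds = fit_axis_aligned_gaussian(cloud)

rng = random.Random(0)  # fixed seed so the draw is repeatable
new_color = sample_color(means, stds, max_sigma=2.0, rng=rng)
print(new_color)
```

A full theme would come from drawing one color from each of the five position clouds of a cluster. Sampling the five positions independently ignores correlations between them, so a full covariance matrix would be the natural next refinement.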
Interactive color clouds
One great thing about this project is that any data visualization can give you beautiful, colorful figures! Creating interactive figures would be even more helpful for understanding the data and would make the figures more aesthetically pleasing. First, we can show a clustering analysis as interactive color clouds. This will be similar to the color clouds I already talked about (see the previous post), but when the mouse cursor hovers over a certain data point, namely color X, the other colors that belong to the same theme as X pop up, along with more information about the theme such as number of likes, author, creation date, etc.
We can also visualize some of the features I haven’t mentioned yet. As we discussed, Adobe Kuler provides color rules to guide users in creating a theme. If a user created a theme using one of these rules, this information is stored in the theme. Also, users can create their own themes by editing existing themes. If this is the case, this information is also stored as the id of the parent theme. I haven’t done any analyses, but it’s possible that users have drawn heavily on a small collection of themes to create their own because those parent themes were exceptionally popular.
Once we find a formula and are able to create themes based on our algorithm, it is even possible to build a recommender engine. Yes, it sounds a bit too ambitious at this point, but why not dream big?
For any developers who build a recommender engine, the biggest nightmare is a cold start. Let’s say you joined Netflix for the first time, and Netflix wants to recommend movies and shows for you. Since your past history does not exist (=cold start), Netflix has to make some random guesses first. Then based on your response, it will slowly learn your preference.
Unlike Facebook, in the Adobe Kuler website, we don’t know who likes what. At least in the JSON response I scraped, I couldn’t find this information. So if I am to build a recommender engine, it will have to be item-based.
Some themes have tags, which convey their general mood. For example, “Circus III” has the following tags: fire, fish, flower, marine, orange, purple, red, and yellow. Some tags designate objects or concepts, and others are simply categorical colors. However, there are themes that don’t have any tags, like “Seawolf”.
We haven’t used these tags for our analyses. To approach this issue, we will have to learn about natural language processing (NLP), and how to categorize words into groups and link them to color clusters. But if we manage to build an algorithm, we will be able to provide a better recommender system where users have a better understanding of what kind of themes they want.
Potentially, this can be combined with recommending themes based on colors. For example, a user says he/she wants to have a theme that has red in it, but wants the theme to feature summer.
As you can see, this project has a lot of potential. The more I dig in, the more interesting things I find, and the more I learn about data science. And above all, who doesn’t like playing with colors? :) I am very happy that I started this journey, and I feel hopeful and excited about what will come next!