Now that we know each color theme can be represented as a 5-point spatial pattern in 3D space, we can use an unsupervised learning algorithm, specifically a clustering algorithm, to cluster a certain number of themes that have similar patterns into groups.
First approach: Hierarchical Clustering
My first idea was to run a clustering analysis for each theme to label it. I came up with this idea because I wanted to simply the analysis. If we ignore the absolute location of each color and the order of the colors, there are only a finite ways to cluster a group of 5 items. For instance, each color belongs to a different cluster (1, 1, 1, 1, 1), or we can group two of them and the rest (2, 3). This way, we end up having only 7 combinations.
- 1, 1, 1, 1, 1
- 1, 1, 1, 2
- 1, 1, 3
- 1, 2, 2
- 1, 4
- 2, 3
However, a major problem of this approach is that we have an extremely small sample size, which is 5. It is possible to use an algorithm called hierarchical clustering in this case. In this method, one computes (Euclidean) distance between each pair of all individuals, and group a pair that has the minimum distance. Once you group individuals, you can draw a tree to bind them. You repeat this process until you end up having a single group, and a single complete tree. Now, you use certain criteria to cut the tree at some point, which gives you the clustering result. This method can be handy because we only have 5 colors, meaning that it will computationally not so challenging because we’ll have only 10 distance pairs. Nonetheless, still the problem is how we decide where to cut the tree.
Second Approach: K-Means Clustering with distance
If we can’t run a clustering analysis per theme, we should do it for the entire dataset. A simple clustering analysis method is k-means clustering. In this method, we first decide how many clusters (k or n_cluster) the data might have. Then we randomly assign the initial mean (centroid), and computes distance between a data point and a centroid. We do this for every data point and centroid combination. Then, we assign each data point to a cluster based on the distance between the data point and a centroid. For instance, let’s say there are 3 centroids (A, B, and C). The data point X has the minimum distance between B among the three. Then X is assigned to cluster B. Once the assignment process is over, we compute the new means (new centroids), and repeat the process until convergence. In k-means clustering is guaranteed.
However, the downside is that we cannot optimize k automatically and the location of initial centroids affect the final results. There are methods that we can use to evaluate the clustering results across k’s. We can plot variance, compute information criteria (BIC, etc.), or compare silhouette scores.
This time, I decided to use a distance between each pair of colors in a theme. I keep the order of distances for every possible pairs. This gives us 10D vector. After generating the distance vector for each theme, I run k-means clustering using scikit-learn package in Python.
from sklearn import cluster def run_Kmeans(data,n_clusters,init='k-means++', n_init=10, max_iter=300): kmeans = cluster.KMeans(n_clusters=n_clusters,init=init,n_init=n_init,max_iter=max_iter) kmeans.fit(data) labels = kmeans.labels_ centroids = kmeans.cluster_centers_ return labels, centroids, kmeans
I used various k’s and compared the results. Expectedly, the results largely depended on the k value. Some of the clusters have the themes where all colors are very similar to each other. A quite hopeful outcome! However, not every theme in this cluster showed this consistent pattern. Plus, some clusters seemed much noisier than this case, suggesting that we can still improve the clustering analysis.
I found that having the 10D vector as our feature can give us somewhat successful results. However, this feature still oversimplifies the spatial pattern of a theme. First, 3D aspect of the spatial pattern is completely ignored. Second, absolute color values are ignored as well.
Third Approach: K-Means Clustering with coordinates
To truly reflect the spatial pattern of a theme, having L, a, and b coordinates from each color as a our feature is the most straightforward way. This gives us 15D vector. I used a kmeans clustering analysis here again. The following figure is an example (k=12).
These themes all belong to a same cluster. Remarkably, they look very similar. They all start from a dark blue color and move towards brighter yellowish colors in the middle, and end their journey with orange/red colors.
When we collapse all themes that belong to the same cluster (as the above 5 themes I just mentioned), we end up having a cloud of colors for nth color in themes. For example, the themes belong to this cluster start (triangle) at somewhat dark color, have brighter colors in the middle, and have red/orange colors as their 5th color (inverse triangle).
Room for Improvement
Having the coordinates as features definitely improved the results. However, we still need to find a better way to improve the algorithm. Here, we can use evaluating tools to optimize the k. Also, we can try a different clustering analysis such as mean shift. Another approach is to first reduce dimensionality (e.g., PCA) and run a clustering analysis.