As a data scientist working in industry, I frequently witness the impact a machine learning application can make. This impact often has a cascade of downstream effects that a data scientist without enough domain knowledge cannot foresee. Nevertheless, under the widespread tech industry motto, "move fast and break things," ML practitioners tend to … Continue reading Reflection on Tech Policy Workshop at the Center for Applied Data Ethics at USF
Last week, I attended the NIPS (Neural Information Processing Systems) 2017 conference in Long Beach, CA. This was my first time attending NIPS. What is NIPS? NIPS is one of the largest machine learning (ML) / artificial intelligence (AI) conferences in the world. The NIPS conference consists of three programs: tutorials, the main event (which includes the symposium) and … Continue reading NIPS 2017 symposium and workshop: interpretable and Bayesian machine learning
I've been working as a data scientist at Ayata, a small startup in North Austin. Our firm does B2B consulting, and our major clients are in the oil and gas industry. A few weeks ago, Amrita Sen, a colleague of mine and a geoscientist at our company, and I gave a talk before Dr. Mary Wheeler's group at … Continue reading Data science in oil and gas industry
Non-Internet-based, "traditional" clients often have very messy data that exists in multiple places and in different forms. Sometimes the data is dumped as numerous files on an FTP server. Here, I discuss how to efficiently navigate such a server using Python and download the files of our choice.
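As a rough illustration of the idea, here is a minimal sketch using Python's standard `ftplib`. The helper names `list_matching` and `download_files`, the host, the credentials, and the directory are all hypothetical and not taken from the post itself; the sketch also assumes the server returns bare file names from `nlst()`, which not every server does.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def list_matching(ftp, remote_dir, pattern):
    """Change into remote_dir and return the sorted names matching a glob pattern.

    `ftp` is expected to behave like a connected ftplib.FTP object.
    """
    ftp.cwd(remote_dir)
    return sorted(name for name in ftp.nlst() if fnmatchcase(name, pattern))


def download_files(ftp, names, local_dir):
    """Download each named file from the current remote directory into local_dir."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in names:
        target = local_dir / name
        with open(target, "wb") as fh:
            # retrbinary streams the file in chunks and hands each chunk to fh.write
            ftp.retrbinary(f"RETR {name}", fh.write)
        saved.append(target)
    return saved


# Hypothetical usage (host, login, and paths are placeholders):
#
#   from ftplib import FTP
#   with FTP("ftp.example.com") as ftp:
#       ftp.login("user", "password")
#       names = list_matching(ftp, "/dumps/2017", "*.csv")
#       download_files(ftp, names, "raw_data")
```

Keeping the listing step separate from the download step makes it easy to inspect what would be fetched before transferring anything.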
Over the past several posts, I have given a brief overview of the project and written about each step, from data wrangling to machine learning. There is a lot of room for improvement in this project, from automating the web scraping and tweaking the machine learning algorithm to taking other interesting variables into consideration. However, I am very glad that I could show that we can obtain a reasonably decent clustering result using the color coordinates as features. In this post, I want to discuss future directions of this project and introduce some interesting ideas.
Now that we know each color theme can be represented as a 5-point spatial pattern in 3D space, we can use an unsupervised learning algorithm, specifically a clustering algorithm, to group themes that have similar patterns.
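To make the idea concrete, here is a minimal sketch that flattens each theme (five colors, three coordinates each) into a 15-number feature vector and groups the vectors with a plain k-means loop. The choice of k-means, the helper names, and the toy themes are assumptions for illustration only; the excerpt does not say which clustering algorithm was actually used.

```python
import random


def flatten_theme(theme):
    """Turn a theme (five 3D color points) into one 15-number feature vector."""
    return [coord for color in theme for coord in color]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, n_iter=50, seed=0):
    """Plain k-means: returns (labels, centroids). Illustrative, not optimized."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid
        new_labels = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Made-up toy themes: two dark ones and two bright ones
themes = [
    [(20, 0, 0)] * 5,
    [(22, 1, -1)] * 5,
    [(80, 5, 5)] * 5,
    [(82, 4, 6)] * 5,
]
labels, _ = kmeans([flatten_theme(t) for t in themes], k=2)
```

In this toy case the two dark themes end up in one group and the two bright themes in the other. Note that flattening makes the result depend on the order of the five colors within a theme, which a real analysis would need to think about.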
Before analyzing the color data, we should first know that the RGB color space does not represent the nonlinearity of human color perception well. A color space that is considered far more perceptually uniform is the CIELab (aka Lab) color space. In this space, each color is represented by three coordinates: L (brightness), a (red-greenness), and b (yellow-blueness). L ranges from 0 to 100, and a higher L value means a brighter color. The range of a and b depends on the device, but it is normally [-128, 128]. Positive values mean "warm" colors: the more positive a (or b) is, the redder (or more yellow) the color.
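For reference, the conversion from RGB to Lab can be sketched in a few lines of plain Python. This sketch assumes the input is 8-bit sRGB and uses the D65 reference white; the function name `srgb_to_lab` is hypothetical, and the post may well have used a library routine instead.

```python
def srgb_to_lab(red, green, blue):
    """Convert 8-bit sRGB values (0-255) to CIELab (L, a, b), assuming a D65 white point."""

    # 1. Undo the sRGB gamma encoding to get linear-light values in [0, 1]
    def to_linear(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(red), to_linear(green), to_linear(blue)

    # 2. Linear RGB -> CIE XYZ (sRGB primaries, D65)
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # 3. XYZ -> Lab, relative to the D65 reference white
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        delta = 6 / 29
        if t > delta ** 3:
            return t ** (1 / 3)
        return t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    lightness = 116 * fy - 16
    a = 500 * (fx - fy)
    b = 200 * (fy - fz)
    return lightness, a, b
```

Pure white maps to roughly L = 100 with a and b near zero, pure black maps to L = 0, and a saturated red gives positive a and b, matching the "warm" reading of positive values described above.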
Exploratory data analysis (EDA) is a process where you figure out the general features of your data. It's a close and casual conversation between you and your data, and a very fun process! You make histograms, scatter plots, or bar plots to see what your data looks like. This can give you a general idea about your data, help you discover interesting facts about it, and finally guide you in the right direction towards your goal. I always go back and forth between applying models/algorithms and EDA for these reasons.
Before we have a clean dataset, let's take a look at an example JSON response we scraped from the website. This is a JSON response from the first theme.
When you click a theme on the Kuler website, it shows the theme's page, where you can see an enlarged image of the theme and other information. On the right side, you can see the "Action" and "Info" frames. The latter has the following information:
Author of the theme ("Created By"): nominal
Date created ("Created"): ordinal
Number of views ("Viewed"): quantitative
Rating: quantitative (shown in number of stars)
Number of likes ("Appreciated By"): quantitative