Last week, I attended the NIPS (Neural Information Processing Systems) 2017 conference in Long Beach, CA. This was my first time attending NIPS. What is NIPS? NIPS is one of the largest machine learning (ML) / artificial intelligence (AI) conferences in the world. The NIPS conference consists of three programs: tutorials, the main event (which includes the symposium) and … Continue reading NIPS 2017 symposium and workshop: interpretable and Bayesian machine learning
I've been working as a data scientist at Ayata, a small startup in North Austin. Our firm does B2B consulting, and our major clients are in the oil and gas industry. A few weeks ago, Amrita Sen, a geoscientist at our company, and I gave a talk before Dr. Mary Wheeler's group at … Continue reading Data science in oil and gas industry
Non-Internet-based, "traditional" clients have very messy data that exists in multiple places and in different forms. Sometimes the data is dumped as numerous files on an FTP server. Here, I discuss how to efficiently navigate such a server using Python and download the files of our choice.
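As a rough illustration of the idea, here is a minimal sketch using Python's standard `ftplib` module. The helper names (`list_matching`, `download`, `fetch`), the glob-pattern filtering, and the directory layout are my own assumptions for this example, not necessarily what the full post does.

```python
import os
import posixpath
from fnmatch import fnmatch
from ftplib import FTP


def list_matching(ftp, directory, pattern):
    """Change into `directory` on the server and return the names matching a glob pattern."""
    ftp.cwd(directory)
    return [name for name in ftp.nlst() if fnmatch(name, pattern)]


def download(ftp, remote_name, local_dir):
    """Download one file from the server's current directory into `local_dir`."""
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, posixpath.basename(remote_name))
    with open(local_path, "wb") as fh:
        # retrbinary streams the file in chunks and hands each chunk to fh.write
        ftp.retrbinary("RETR " + remote_name, fh.write)
    return local_path


def fetch(host, user, password, directory, pattern, local_dir):
    """Hypothetical convenience wrapper: log in, filter the listing, download the matches."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        names = list_matching(ftp, directory, pattern)
        return [download(ftp, name, local_dir) for name in names]
```

A call such as `fetch("ftp.example.com", "user", "secret", "/dumps", "*.csv", "data")` would then pull every CSV file from one directory. The host, credentials, and paths here are placeholders.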
Over the past several posts, I have given a brief overview of the project and written about each step, from data wrangling to machine learning. There is a lot of room for improvement in this project, from automating the web scraping and tweaking the machine learning algorithm to taking other interesting variables into consideration. However, I am very glad that I could show that we can get a reasonably decent clustering result using the color coordinates as features. In this post, I want to discuss future directions of this project and introduce some interesting ideas.
Now that we know each color theme can be represented as a 5-point spatial pattern in 3D space, we can use an unsupervised learning algorithm, specifically a clustering algorithm, to group themes that have similar patterns into clusters.
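To make this concrete, here is a minimal sketch of one possible approach: flatten each theme into a 15-number vector and run a plain k-means. The choice of k-means, the sorting of colors by lightness, and the function names are all assumptions made for illustration; the excerpt does not say which clustering algorithm the project actually uses.

```python
import random


def flatten(theme):
    """Turn a theme of five (L, a, b) colors into one 15-number feature vector."""
    # Assumption: sort colors by lightness so the point order is comparable across themes.
    return [value for color in sorted(theme) for value in color]


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means: returns (labels, centers) after a fixed number of iterations."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy data: two dark themes and two bright themes, five colors each.
themes = [
    [(10, 0, 0)] * 5,
    [(12, 1, 1)] * 5,
    [(90, 0, 0)] * 5,
    [(88, -1, 2)] * 5,
]
labels, centers = kmeans([flatten(t) for t in themes], k=2)
```

On this toy data the dark themes and the bright themes end up in separate clusters. Real themes would need a more careful choice of the number of clusters and of how to order the five colors.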
Before analyzing the color data, we should first note that the RGB color space does not reflect the nonlinearity of human color perception. A color space that is considered more perceptually uniform is the CIELab (aka Lab) color space. In this space, each color is represented by three coordinates: L (lightness), a (red-greenness), and b (yellow-blueness). L ranges from 0 to 100, and a higher L value means a brighter color. The range of a and b depends on the device, but it is normally about [-128, 128]. Positive values correspond to "warm" colors: the more positive a (or b) is, the redder (or the more yellow) the color.
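For reference, here is a minimal sketch of the conversion from 8-bit RGB to Lab. It assumes the input is sRGB and uses the D65 reference white, which are common defaults but are my assumptions, not something stated in this post; the function name `rgb_to_lab` is also hypothetical.

```python
def rgb_to_lab(r, g, b):
    """Convert 8-bit sRGB values to CIELab (assuming a D65 reference white)."""

    def to_linear(c):
        # Undo the sRGB gamma encoding.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB -> CIE XYZ
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    def f(t):
        # Cube root with a linear segment near zero, as in the CIELab definition.
        delta = 6.0 / 29.0
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    # Normalize by the D65 white point.
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)

    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

Pure white, `rgb_to_lab(255, 255, 255)`, comes out at L close to 100 with a and b close to 0, while pure red has clearly positive a and b, matching the "warm" reading above.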
Exploratory data analysis (EDA) is a process where you figure out the general features of your data. It's a close and casual conversation between you and your data, and a very fun process! You make histograms, scatter plots, or bar plots to see what your data looks like. This can give you a general idea about your data, help you discover interesting facts about it, and finally guide you in the right direction towards your goal. I always go back and forth between applying models/algorithms and EDA for these reasons.