I've been doing data science consulting at Ayata for oil and gas industry clients. The challenges these clients face are a bit different from those faced by typical Internet-based companies. Based on a talk I recently gave at UT Austin, I will explain some of the main data science challenges that much of the oil and gas industry faces.
Non-Internet-based, "traditional" clients have very messy data that exists in multiple places and in different forms. Sometimes the data is dumped as numerous files on an FTP server. Here, I discuss how to efficiently navigate such a server using Python and download the files of our choice.
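As a rough illustration of the idea, here is a minimal sketch using Python's standard `ftplib`. The function names, the shell-style filename pattern, and the directory layout are assumptions made for this example, not the approach from the post itself. It also assumes the server's `NLST` listing returns plain file names for the directory.

```python
import fnmatch
import os
from ftplib import FTP


def download_matching(ftp, remote_dir, pattern, local_dir):
    """Download files in remote_dir whose names match a shell-style pattern.

    `ftp` is a connected, logged-in ftplib.FTP object (or anything that
    offers the same cwd/nlst/retrbinary methods). Returns the matched names.
    """
    ftp.cwd(remote_dir)
    # Some servers return full paths from NLST, so keep only the base name.
    names = [os.path.basename(name) for name in ftp.nlst()]
    matches = sorted(fnmatch.filter(names, pattern))
    os.makedirs(local_dir, exist_ok=True)
    for name in matches:
        with open(os.path.join(local_dir, name), "wb") as handle:
            # retrbinary calls handle.write once per received block.
            ftp.retrbinary("RETR " + name, handle.write)
    return matches


def fetch(host, user, password, remote_dir, pattern, local_dir):
    """Connect, log in, and download matching files (hypothetical helper)."""
    with FTP(host) as ftp:
        ftp.login(user=user, passwd=password)
        return download_matching(ftp, remote_dir, pattern, local_dir)
```

A call such as `fetch("ftp.example.com", "anonymous", "", "/data", "*.csv", "downloads")` would then pull every CSV file from a hypothetical `/data` directory; the host and paths here are placeholders.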
Over the past several posts, I have given a brief overview of the project and written about each step, from data wrangling to machine learning. There is a lot of room for improvement in this project, from automating the web scraping and tweaking the machine learning algorithm to taking other interesting variables into consideration. However, I am… Continue reading Future directions for the Kuler project
Now that we know each color theme can be represented as a 5-point spatial pattern in 3D space, we can use an unsupervised learning algorithm, specifically a clustering algorithm, to cluster a certain number of themes that have similar patterns into groups. First approach: hierarchical clustering. My first idea was to run a clustering analysis for… Continue reading Clustering colors in a 3D space
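To make the idea of clustering 5-point patterns concrete, here is a small pure-Python sketch of agglomerative (bottom-up hierarchical) clustering. The distance measure (Euclidean distance over the five colors compared in their stored order), the average-linkage rule, and the toy themes are all assumptions chosen for illustration; the post does not say which distance or linkage was actually used.

```python
import math
from itertools import combinations


def theme_distance(theme_a, theme_b):
    """Euclidean distance between two themes of five 3D points each.

    Assumes the colors are compared position by position.
    """
    return math.sqrt(sum((x - y) ** 2
                         for p, q in zip(theme_a, theme_b)
                         for x, y in zip(p, q)))


def agglomerative(themes, n_clusters):
    """Naive average-linkage clustering; returns lists of theme indices."""
    clusters = [[i] for i in range(len(themes))]

    def linkage(c1, c2):
        # Average pairwise distance between members of the two clusters.
        total = sum(theme_distance(themes[i], themes[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Four made-up themes: two dark, two bright (coordinates are illustrative).
themes = [
    [(10, 0, 0)] * 5,
    [(12, 1, -1)] * 5,
    [(90, 5, 5)] * 5,
    [(88, 4, 6)] * 5,
]
groups = agglomerative(themes, n_clusters=2)
```

This brute-force version recomputes every linkage on each pass, so it only suits small examples; a library implementation would be the practical choice for a real dataset.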
Before analyzing the color data, we should first note that the RGB color space does not represent the nonlinearity of color perception well. A color space that is considered the most perceptually uniform is the CIELab (aka Lab) color space. In this space, each color is represented by three coordinates: L (lightness), a (red-greenness), and b (yellow-blueness). L ranges… Continue reading Color perception and color-space conversion
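For readers who want to see what such a conversion involves, here is a minimal sketch of the standard route from RGB to Lab by way of CIE XYZ. It assumes 8-bit sRGB input and the D65 reference white, which the excerpt does not specify; the function name is made up for this example.

```python
def srgb_to_lab(r, g, b):
    """Convert 8-bit sRGB values to CIELab (L, a, b), assuming a D65 white."""

    def to_linear(channel):
        # Undo the sRGB gamma encoding.
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    # D65 reference white.
    xn, yn, zn = 0.95047, 1.0, 1.08883

    def f(t):
        # Cube root with a linear segment near zero.
        delta = 6 / 29
        return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

With these assumptions, pure white maps to roughly L = 100 with a and b near zero, and pure black maps to L = 0.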
Exploratory data analysis (EDA) is a process in which you figure out the general features of your data. It's a close and casual conversation between you and the data, and a very fun process! You make histograms, scatter plots, or bar plots to see what your data looks like. This can give you a general idea about your… Continue reading Analyzing user activity of Adobe Kuler
Filtering the data. Before we have a clean dataset, let's take a look at an example JSON response we scraped from the website. This is the JSON response for the first theme. Read as a Python dictionary, it is an unordered set of key:value pairs. Here is the summary of the keys and their meaning:… Continue reading Preprocessing the scraped Kuler data
When you click a theme on the Kuler website, it shows the theme's page, where you can see an enlarged image of the theme and other information. On the right side, you can see the "Action" and "Info" frames. The latter has the following information: author of the theme ("Created By"): nominal; date created ("Created"): ordinal; number of… Continue reading Webscraping: XML and JSON
This is an introduction to a toy project that I worked on in 2015 when I was applying for the Insight Data Science Fellowship. It was my first data science project (still unfinished), and it uses unsupervised learning to cluster popular color themes in Adobe Kuler. I will talk about the important steps of the project in the following posts.
One of the projects I have been working on is measuring one's irrelevant memory. Wait, what? Yes, irrelevant memory. Let's say there is an object with a color and an orientation, like a colored ellipse. I ask you to memorize its orientation. Now orientation is relevant and color is irrelevant. If I want to test whether the color information has been automatically registered (or encoded) in your memory, what should I do? There is one way to test this. Let's say we have an experiment with 100 trials. For the first 99 trials, I ask you to memorize only the orientation from a display. At the end of each trial, I test your orientation memory. But on the 100th and last trial, I ask you to recall the color. And yes, you did not see this coming. That is the most important part: you should not know about the last trial!
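The trial structure described above can be sketched in a few lines of Python. The function name and the dictionary fields are hypothetical and only illustrate the design: every trial probes orientation except the final, unannounced one, which probes color.

```python
def build_trial_sequence(n_trials=100):
    """Return a list of trials where only the last one is a surprise color probe."""
    # Trials 1..n_trials-1 test the relevant feature (orientation).
    trials = [
        {"trial": number, "probe": "orientation", "surprise": False}
        for number in range(1, n_trials)
    ]
    # The final trial unexpectedly tests the irrelevant feature (color).
    trials.append({"trial": n_trials, "probe": "color", "surprise": True})
    return trials


sequence = build_trial_sequence()
```

Because the surprise works only once per participant, the design yields a single color-memory measurement per person.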