I’ve been working as a data scientist at Ayata, a small startup in North Austin. Our firm does B2B consulting, and our major clients are in the oil and gas industry. A few weeks ago, Amrita Sen, a geoscientist colleague of mine, and I gave a talk to Dr. Mary Wheeler’s group at the Institute for Computational Engineering and Sciences (ICES) at the University of Texas at Austin. Our talk was about real-world data science applications in the oil and gas industry: we wanted to give the academics an idea of how we complete a project from start to finish and to introduce the interesting challenges we face. I’ve always wanted to write about my experience as a data scientist working with a traditional industry. In this post, I will cover the general data science challenges in the oil industry that we discussed during the talk.
Shale oil challenges
Most of our clients focus on shale oil production. Compared to offshore wells, onshore wells are cheaper to build, so the number of companies participating in the business is much larger. Shale oil wells typically refer to horizontal wells in which you create fractures in the rock to extract the oil trapped inside it. The work involves multiple complicated processes that require experts from different disciplines: engineers, geoscientists, business analysts, and so on. On top of this complexity, shale oil faces several challenges. First, compared to a traditional vertical well, a shale well produces less oil and its reservoir depletes more quickly. Second, the fracturing process, aka the “frac job,” is expensive because it requires massive amounts of sand, water, and chemicals.
Shale oil production comprises three processes: drilling, completion, and production. Drilling involves creating a wellbore and establishing the horizontal laterals. Completion involves fracking and finishing a well to get it ready for production. A well’s location determines its geological and geophysical (G&G) properties, which affect production just as much as how the well is completed, a question of great interest to many shale oil companies. A horizontal lateral consists of multiple stages, and there are many variables one can tweak: stage length, spacing between stages, the fracture recipe (how much sand, water, and chemicals to add, and in what proportions), the order of frac jobs, and so on.
Even though the G&G variables tend to be the major predictors of production, well locations are roughly pre-determined by the land leases the company already holds, and once you drill a well, there is not much you can do about its G&G features. This is why shale oil companies show great interest in improving their completion recipe: these are variables they can actually act on. Solving this problem is often called **completion optimization**.
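To make the distinction concrete, here is a minimal sketch in Python (all field names and values are my own illustrative choices, not actual client data) of the split between actionable completion variables and fixed G&G variables:

```python
from dataclasses import dataclass

@dataclass
class CompletionDesign:
    """Actionable completion variables for one well (names illustrative)."""
    stage_length_ft: float    # length of each frac stage
    stage_spacing_ft: float   # spacing between stages
    sand_lbs_per_ft: float    # proppant loading in the fracture recipe
    water_bbl_per_ft: float   # fluid loading in the fracture recipe
    chemical_pct: float       # chemical additive fraction

@dataclass(frozen=True)
class GeoProperties:
    """G&G variables: effectively fixed once the well location is chosen."""
    porosity: float
    permeability_md: float
    depth_ft: float

# Completion optimization searches over CompletionDesign,
# while GeoProperties stays fixed for a drilled well.
design = CompletionDesign(200.0, 50.0, 1500.0, 30.0, 0.5)
geo = GeoProperties(porosity=0.08, permeability_md=0.05, depth_ft=9000.0)
```

Freezing the G&G dataclass mirrors the point above: once the well is drilled, only the completion design remains on the table.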
Data science challenges in oil and gas
When I started working on datasets from our clients, a few major roadblocks stood out. Although my experience is limited to oil and gas, I suspect these are common data science challenges for traditional (non-Internet-based) companies.
First, the data is usually extremely messy and scattered all over the place. I suspect this is partly because oil production requires many teams involved in different processes. Some of the data lives in relational databases and some in files, both structured (MS Excel) and unstructured (PDFs, images). To make matters worse, very few companies have centralized database management, and even when they do, data quality is rarely examined. Most clients seem paranoid about data security, yet in practice they are poor at data consumption and lack an efficient ETL process.
Traditional and scattered analysis
Second, their analytical approach seems rather simple and fragmented. Most of their analyses were based on univariate correlation and multiple regression, and even when they use a sophisticated algorithm, they tend to stick to neural network models only. I did some research to find academic papers on machine learning in this domain, but studies with a data-driven approach were rare. Besides, due to the lack of a centralized data-integration effort, analyses have been conducted at the individual-team level, rarely at a bigger-picture level.
Physics-based vs. data-driven models
I had a long discussion with Amrita about traditional approaches in geophysics and petroleum engineering, and I realized that physics-based models still dominate data-driven ones. Physics-based models rest on physical assumptions about G&G features and are built by domain experts (e.g., geophysicists). Since these models are analytical solutions, once built they are reliable and require very little data. However, they have a typical “too many experts” problem: the assumptions of the model depend on whoever created it. Besides, there is no guarantee that the model is always accurate, and man-made variables are usually not included. Data-driven models, on the other hand, make fewer assumptions but require a lot of data to capture complex patterns. This is a unique problem for a data scientist, because you are, in a sense, competing against analytically derived models built by domain experts. Personally, I think the best strategy is to use physics-based models in feature engineering so that domain knowledge helps build better data-driven models. Dr. Wheeler was also interested in narrowing the gap between the two and mentioned that academics are slowly opening up to data-driven models.
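As a toy illustration of that strategy, the sketch below compares a plain linear model on raw synthetic G&G measurements against one given a single physics-motivated feature. The variable names, ranges, and the Darcy-style deliverability proxy `k*h/mu` are my own illustrative choices, not our actual workflow:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40

# Hypothetical raw G&G measurements for n wells (all names illustrative).
perm_md = rng.uniform(0.01, 1.0, n)      # permeability (millidarcy)
thickness_ft = rng.uniform(50, 300, n)   # pay-zone thickness (feet)
viscosity_cp = rng.uniform(0.5, 5.0, n)  # oil viscosity (centipoise)

# Physics-motivated feature: Darcy-style deliverability proxy k*h/mu.
# Encoding this ratio directly hands the model a domain-derived pattern
# it would otherwise have to learn from data alone.
kh_over_mu = perm_md * thickness_ft / viscosity_cp

# Toy production target driven by the physics proxy plus noise.
production = 2.0 * kh_over_mu + rng.normal(0, 5, n)

# Fit two ordinary-least-squares models: raw features vs. physics feature.
X_raw = np.column_stack([np.ones(n), perm_md, thickness_ft, viscosity_cp])
X_phys = np.column_stack([np.ones(n), kh_over_mu])

def r_squared(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

print(f"raw features  R^2: {r_squared(X_raw, production):.3f}")
print(f"physics proxy R^2: {r_squared(X_phys, production):.3f}")
```

On this synthetic data, the single domain-derived ratio explains the target far better than the raw inputs fed to a linear model, which is the essence of letting physics guide feature engineering.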
Interpretability and feature engineering
Speaking of models, as in many industry problems, clients generally want interpretable models, not black-box ones. As I mentioned, the G&G variables can’t really be changed, especially once a well is drilled, but there are still many actionable variables during the completion and production processes. Clients want to know how to change these variables to optimize their production. This means the predictive model should not have tons of features and should use interpretable algorithms such as linear models or decision trees, even if they have lower performance. The situation is usually aggravated by the fact that it’s often a p >> N case (far more features than wells), which means brute-force exhaustive search across different machine learning algorithms will not do much (for instance, a random forest will almost certainly overfit here without proper feature engineering). It’s a data science challenge that requires extra caution.
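A hedged sketch of one way to handle this: with many more candidate features than wells, an L1-regularized linear model (lasso) keeps the coefficient vector sparse and therefore interpretable. The data below is synthetic, and the feature count, coefficients, and regularization strength are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n_wells, n_features = 60, 300   # p >> N: far more variables than wells

# Synthetic well data: most features are noise, only 4 truly matter.
X = rng.normal(size=(n_wells, n_features))
true_coef = np.zeros(n_features)
true_coef[:4] = [3.0, -2.0, 1.5, 1.0]   # hypothetical drivers of production
y = X @ true_coef + rng.normal(0, 0.5, n_wells)

# L1 penalty zeroes out most coefficients, yielding a sparse,
# interpretable linear model even though p >> N.
model = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(model.coef_)
print(f"features kept: {selected.size} of {n_features}")
```

The surviving coefficients give a short, signed list of variables a client can act on, which is exactly the kind of answer an interpretable model needs to produce.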
Finally, testing our hypotheses is almost impossible, a problem that is probably universal across non-Internet industries. Drilling a well costs a few million dollars, and there are too many variables to tweak and control.
Despite these technical challenges, many shale companies are growing more interested in completion optimization. Mostly this is because oil prices have been low for the past few years, but they also want to reduce the cost of testing in the field. Plus, tweaking the completion recipe is relatively easier than changing other processes in oil production.
In the next post, I will explain more about the actual challenges I faced during my most recent project (completion optimization).