
I was originally skeptical that data science is unique and should be considered its own field/discipline.

After exploring the literature on this topic, especially some work by the National Academy and the National Science Foundation, I gained a different perspective. My new perspective came from better understanding the data life cycle.

The data life cycle is discussed in several publications and is also mentioned in ABET accreditation materials for student outcomes for Data Science programs. A major proponent of the idea of the data life cycle is Jeannette Wing of Columbia University. The data life cycle encompasses genesis to interpretation:

  • Generation
  • Collection
  • Processing
  • Storage
  • Management
  • Analysis
  • Visualization
  • Interpretation

I believe that “Data science is the study of extracting value from data.” It sounds cliche, but it is all about the data for a data scientist.

Is data science a “made up field” consisting of a bunch of buzzwords? Let’s look at other disciplines and how they relate to data science. Since the fields under discussion have many facets that cannot be fully covered here, I am going to simplify the discussion. So please forgive me if you are actively working in these fields! Since I am an engineer, let’s start with engineers and what they do.

What does an engineer do? Fundamentally, the aspect of an engineer’s work that I believe sets engineering apart from other disciplines is that engineers design systems. Design is fundamental to engineering. Chemical engineers design chemical processes. Mechanical engineers design mechanical systems.  Electrical engineers design electrical systems. Industrial engineers design work processes, and so on. Naturally, engineered systems are not isolated and when examined from a more macro perspective, these systems often involve many different engineering disciplines as well as disciplines in science, business, etc. Simply because engineered systems are embedded in the physical world, it should be clear that these systems generate data and in some cases, vast quantities of data.

Data is often a very important component of the operation and control of an engineered system. However, I would contend that data is often not the primary focus of an engineer during the design process. For example, is the primary concern of a mechanical engineer who is designing a gear the data generated from the gear’s operation, or is it ensuring that the gear functions according to its mechanical design specifications?

In today’s society, there is a growing recognition that the data aspects of engineered systems cannot be ignored, either at the design stage or during operations. Who should be thinking about the data? My contention is that it is the data scientist’s role to focus on the data and its life cycle, not necessarily the engineer within a specific discipline.

So far, I have deliberately not mentioned computer science. Years ago, computer science went through the same debate about whether it was a separate field. I think that almost everyone now agrees that computer science is justified as being its own field/discipline. Computer scientists perform important and exciting work. However, how does computer science relate to data science?

There is no doubt that the fields of computer science and data science are intimately related. To be clear, I see computer engineering as a separate field from computer science. In my opinion, computer science is somewhat misnamed. I think that it should have been called computing science. While computer scientists study many things, at the heart of the discipline is computing. How to compute faster, more efficiently, and more reliably is the essence of computer science. In my opinion, the main focus of a computer scientist is how to compute better.

There is no argument that data is an important aspect of computation. But is a computer scientist’s main concern the data and its life cycle, or is it how to compute with the data? I think that there is a useful separation of concerns. It is certainly true that many of the tools and techniques that computer scientists use are also used by data scientists. And many argue that data science is simply a sub-discipline of computer science. However, I contend that the life cycle of data and its importance to science, engineering, and business require a concentration of focus that is not central to computer science.

What about statistics? There is no doubt that statisticians use data and that data science has deep connections to this field. Are they different? Are they the same thing under different names?

I believe that data scientists are not statisticians and vice versa. In my opinion, the focus of statisticians has been on the later stages of the data life cycle. From my perspective, the focus of a statistician is on the most efficient and effective extraction of reliable information from data. The design and proper use of the most appropriate and effective statistical techniques to make reliable decisions is the primary focus of statisticians. Data scientists’ concerns are broader than this.

All the fields that I have mentioned are important (engineering, computer science, statistics). We must not forget about the other sciences (biology, chemistry, sociology, etc.) and the many business-related fields (accounting, marketing, etc.). What has become clear, and what was prevalent within the materials that I reviewed, is that all these fields are becoming increasingly dependent upon the data life cycle. There is no doubt in my mind that the data life cycle is important for all these fields to function better and meet their goals of improving lives. Thus, I believe that knowledge of data science is essential and that its reach is broader than any one specific discipline.

I have personally concluded that we need basic literacy in data science, just like we need basic literacy in reading, mathematics, and computing. The other disciplines have their focus on their domains. But who will be responsible for caring about the research and application of the tools and techniques that are needed throughout the entire data life cycle? I think that data scientists should be the people whose main goal is caring about the data life cycle.

Data scientists cannot do this job in isolation. Therefore, for data scientists to be successful, the field of data science must be interdisciplinary. It is an exciting time in this emerging field.

I hope that I have provided you with some insights into why I believe that data science is its own field/discipline. What do you think?