The term “Data Science” was coined in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data being amassed at the time. (At the time, there was no way of predicting the truly massive volumes of data that would accumulate over the following fifty years.) Data Science continues to evolve as a discipline that uses computer science and statistical methodology to make useful predictions and gain insights across a wide range of fields. While Data Science is used in areas such as astronomy and medicine, it is also used in business to help make smarter decisions.
Statistics, and the use of statistical models, are deeply rooted within the field of Data Science. Data Science started with statistics and has evolved to include concepts and practices such as artificial intelligence, machine learning, and the Internet of Things, to name a few. As more and more data became available, first by way of recorded shopping behaviors and trends, businesses began collecting and storing it in ever greater amounts. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information, or big data. Once businesses seeking to increase profits and drive better decision-making opened the doors, big data began to be applied to other fields, such as medicine, engineering, and the social sciences.
A functional data scientist, as opposed to a general statistician, has a good understanding of software architecture and knows multiple programming languages. The data scientist defines the problem, identifies the key sources of information, and designs the framework for collecting and screening the needed data. Software typically handles the collecting, processing, and modeling of the data. Data scientists use the principles of Data Science, and all of the related sub-fields and practices it encompasses, to gain deeper insight into the data assets under review.
There are many different dates and timelines that can be used to trace the slow growth of Data Science and its current impact on the Data Management industry; some of the more significant ones are outlined below.
From the 1960s to the Present
In 1962, John Tukey wrote a paper titled The Future of Data Analysis and described a shift in the world of statistics, saying, “… as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt… I have come to feel that my central interest is in data analysis…” Tukey was referring to the merging of statistics and computers, at a time when computers were first being used to solve mathematical problems and work with statistics, rather than doing that work by hand.
In 1974, Peter Naur authored the Concise Survey of Computer Methods, using the term “Data Science” repeatedly. Naur presented his own convoluted definition of the new concept:
“The usefulness of data and data processes derives from their application in building and handling models of reality.”
In 1977, the International Association for Statistical Computing (IASC) was formed. The first phrase of its mission statement reads, “It is the mission of the IASC to link traditional statistical methodology, modern computer technology, and the knowledge of domain experts in order to convert data into information and knowledge.”
In 1977, Tukey published a second major work, Exploratory Data Analysis, arguing for the importance of using data in selecting “which” hypotheses to test, and for confirmatory data analysis and exploratory data analysis working hand in hand.
In 1989, the first Knowledge Discovery in Databases (KDD) workshop was organized; it would mature into the annual ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
In 1994, Business Week ran the cover story Database Marketing, revealing the ominous news that companies had started gathering large amounts of personal information, with plans to use it for strange new marketing campaigns. The flood of data was, at best, confusing to many company managers, who were trying to decide what to do with so much disconnected information.
In 1999, in Mining Data for Nuggets of Knowledge, Jacob Zahavi pointed out the need for new tools to handle the massive, and continuously growing, amounts of data available to businesses. He wrote:
“Scalability is a huge issue in data mining… Conventional statistical methods work well with small data sets. Today’s databases, however, can involve millions of rows and scores of columns of data… Another technical challenge is developing models that can do a better job analyzing data, detecting non-linear relationships and interaction between elements… Special data mining tools may have to be developed to address web-site decisions.”
In 2001, Software-as-a-Service (SaaS) was created. This was the precursor to using cloud-based applications.
In 2001, William S. Cleveland laid out plans for training data scientists to meet the needs of the future. He presented an action plan titled Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. It described how to increase the technical experience and range of data analysts and specified six areas of study for university departments. It promoted developing specific resources for research in each of the six areas. His plan also applies to government and corporate research.
In 2002, the International Council for Science’s Committee on Data for Science and Technology (CODATA) began publishing the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications, and legal issues. Articles for the Data Science Journal are accepted by its editors and must follow specific guidelines.
In 2006, Hadoop 0.1.0, an open-source framework for distributed storage and processing, was released. Hadoop grew out of Nutch, an open-source web search engine project. Two problems with big data are storing huge amounts of data and then processing that stored data, tasks that traditional relational database management systems (RDBMS) handle poorly at that scale. Hadoop addressed both by spreading storage and computation across clusters of commodity machines. Apache Hadoop is now a widely used open-source software library for working with big data.
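To make the "split the work, then combine the results" idea behind Hadoop concrete, here is a minimal, illustrative sketch of the map-and-reduce pattern in plain Python. It is not Hadoop's actual Java API, and the sample documents are hypothetical; in a real cluster the data would live in HDFS and the map tasks would run on many machines in parallel.

```python
from collections import defaultdict

# Hypothetical documents; in Hadoop these would be blocks of a file
# distributed across many machines (HDFS), not an in-memory list.
documents = [
    "big data needs distributed storage",
    "distributed processing makes big data useful",
]

def map_phase(doc):
    """Emit (word, 1) pairs for every word in one document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Sum the counts for each word across all mapped pairs."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

# The map step can run independently on each document (and each machine);
# the reduce step combines the partial results into the final answer.
mapped = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(mapped))  # e.g. {'big': 2, 'data': 2, 'distributed': 2, ...}
```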
In 2008, the title “data scientist” became a buzzword, and eventually a part of the language. DJ Patil and Jeff Hammerbacher, of LinkedIn and Facebook, are given credit for initiating its use as a buzzword. (In 2012, the Harvard Business Review declared that data scientists had the sexiest job of the twenty-first century.)
In 2009, the term NoSQL was reintroduced by Johan Oskarsson (a variation had been used since 1998), when he organized a discussion on “open-source, non-relational databases.”
In 2011, job listings for data scientists increased by 15,000%. There was also an increase in seminars and conferences devoted specifically to Data Science and big data. Data Science had proven itself to be a source of profits and had become a part of corporate culture. Also in 2011, James Dixon, CTO of Pentaho, promoted the concept of data lakes, rather than data warehouses. Dixon stated that the difference between a data warehouse and a data lake is that the data warehouse pre-categorizes the data at the point of entry, wasting time and energy, while a data lake accepts the information using a non-relational (NoSQL) database and does not categorize the data, but simply stores it.
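Dixon's distinction is often described as schema-on-write versus schema-on-read. The following minimal sketch, using a hypothetical purchase event and made-up field names, contrasts the two approaches; it is only an illustration of the idea, not any particular warehouse or lake product.

```python
import json

# Hypothetical raw event, as it might arrive from a web application.
raw_event = '{"user": "u123", "action": "purchase", "amount": "19.99", "coupon": null}'

# Warehouse style ("schema-on-write"): the record is validated and shaped
# to a fixed schema before it is stored; anything that does not fit must
# be dealt with at load time.
def load_into_warehouse(event_json):
    event = json.loads(event_json)
    return {
        "user_id": str(event["user"]),
        "action": str(event["action"]),
        "amount_usd": float(event["amount"]),
    }

# Data-lake style ("schema-on-read"): the raw record is stored as-is,
# and structure is imposed only when someone reads and analyzes it.
def store_in_lake(event_json, lake):
    lake.append(event_json)  # no categorization at the point of entry

lake = []
store_in_lake(raw_event, lake)
print(load_into_warehouse(raw_event))     # structured at write time
print(json.loads(lake[0])["action"])      # structure applied at read time
```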
In 2013, IBM shared statistics showing that 90 percent of the data in the world had been created within the previous two years.
In 2015, using deep learning techniques, Google’s speech recognition service, Google Voice, experienced a dramatic performance jump of 49 percent.
In 2015, Bloomberg’s Jack Clark wrote that it had been a landmark year for artificial intelligence (AI). Within Google, the number of software projects using AI increased from “sporadic usage” to more than 2,700 projects over the year.
Data Science Today
In the past 30 years, Data Science has quietly grown to include businesses and organizations worldwide. It is now being used by governments, geneticists, engineers, and even astronomers. During its evolution, Data Science’s use of big data was not simply a “scaling up” of the data, but a shift to new systems for processing data and new ways of studying and analyzing it.
Data Science has become an important part of business and academic research. Technically, this includes machine translation, robotics, speech recognition, the digital economy, and search engines. In terms of research areas, Data Science has expanded to include the biological sciences, health care, medical informatics, the humanities, and social sciences. Data Science now influences economics, governments, and business and finance.
One curious, and potentially negative, result of the Data Science revolution has been a gradual shift toward increasingly conservative programming. It has been found that data scientists can put too much time and energy into developing unnecessarily complex algorithms when simpler ones work more effectively. As a consequence, dramatic “innovative” changes happen less and less often. Many data scientists now think wholesale revisions are simply too risky, and instead try to break ideas into smaller parts. Each part gets tested, and is then cautiously phased into the data flow. While more conservative programming is faster and more efficient, it also minimizes experimentation and limits new, “outside-of-the-box” thinking and discoveries.
Though this play-it-safe philosophy may save companies time and money, and avoid major gaffes, it risks confining work within very narrow constraints and forgoing the pursuit of true breakthroughs. Scott Huffman, of Google, said:
“One thing we spend a lot of time talking about is how we can guard against incrementalism when bigger changes are needed. It’s tough, because these testing tools can really motivate the engineering team, but they also can wind up giving them huge incentives to try only small changes. We do want those little improvements, but we also want the jumps outside the box.”