A definition of Data Science
Much of my time is spent talking to organisations looking to build a data science capability, or generally looking to use analytics to drive better decision making. As part of this, I’m often asked to present on a range of topics around data science. The two topics I’m asked to present on most are: ‘What is Data science?’ and ‘What is a Data Scientist?’. I thought I’d share how we at Mango define what Data science is, along with the reasoning behind our definition.
Where did the term Data science come from?
Professor Jeff Wu, the Coca-Cola Chair in Engineering Statistics at Georgia Institute of Technology, popularised the term ‘data science’ during a talk in 1997. Before this, the term statistician was widely used instead. Professor Wu felt that the title ‘Statistician’ no longer covered the array of work being done by statisticians, and that ‘Data Scientist’ better encapsulated the multi-faceted role.
So, surely defining what a Data Scientist is and what they do should be a simple task – just bring up an image of Professor Wu, reference his 1997 lecture and ask for questions. However, the original definition has evolved since then and, in fact, most data scientists I meet are unfamiliar with Professor Wu.
What does Data Science mean today?
As mentioned, what ‘Data science’ meant originally and what it means today are two very different things. To develop Mango’s own definition of Data science, we looked to the wider community to see what they were saying.
Twitter has given us some great definitions, such as:
One early definition of a data scientist comes from Josh Wills, current Director of Data Engineering at Slack. Back in 2012, Josh described a data scientist as follows:
This speaks more directly to the data scientist being a ‘merging’ of different skillsets – a mix of a ‘statistician’ and ‘software engineer’.
Drew Conway, now CEO of Alluvium, took this concept further with a heavily used venn diagram:
Beyond these definitions, I’ve heard a range of blunt comments about what a Data Scientist is and isn’t.
For example, at a recent data science event a speaker announced that “if you haven’t got a PhD then you’re not a data scientist”, which, of course, caused a fair amount of upset across the room of non-PhD-data-scientists!
Our interest at Mango in defining and understanding what a Data Scientist is stems from the need to hire new talent. How do we describe the job? What skills must they have? Are our expectations too high?
We’ve seen some unrealistic job descriptions that say a data scientist should be able to:
- Understand every analytic algorithm from the statistical or computer science world, including machine learning, deep learning and whatever other algorithm the hiring company has just read about in a blog post
- Be an expert in a range of technologies including R, Python, Spark, Julia and a veritable zoo-ful of Apache projects
- Be equally comfortable discussing complex error structures or speaking to the chief execs about analytic strategy
These people just don’t exist.
To me, the trouble is that most definitions of a data scientist seem detached from an agreed definition of data science. If a data scientist is someone who does data science, then surely we need to agree on what that is before understanding the skills needed to perform it successfully?
As per my earlier statement, it is clear that today data science has come to represent a lot more than Professor Wu’s original definition. At Mango, after countless arguments (or rather, heated discussions), we arrived at the following (very carefully worded) definition:
Data Science is…the proactive use of data and advanced analytics to drive better decision making.
The four key parts
‘Data’
I might be stating the obvious here, but we can’t do data science without the data. What’s interesting is that data science is often associated with the extremes of Doug Laney’s famous ‘3 V’s’:
- Volume – the size of data to be analysed, driving data science’s ongoing association with the world of ‘big data’
- Variety – with algorithms focused on analysing a range of structured and unstructured data types (e.g. image, text, video) being developed faster perhaps than the business cases are understood
- Velocity – the speed at which new data is created and speed of decision therefore required, leading to stream analytics and increased usage of machine learning approaches
However, data science is equally applicable to small, rectangular, static datasets in my mind.
‘Advanced analytics’
Generally, analytics can be thought of in four categories:
- Descriptive Analytics: the study of ‘what happened?’ This is largely concerned with the reporting of results and summaries via static or interactive reports (e.g. dashboards) and is more commonly referred to as ‘Business Intelligence’
- Diagnostic Analytics: a study of why something happened. This typically involves feature engineering, model development etc.
- Predictive Analytics: the modelling of what might happen under different circumstances. This is a mechanism for understanding possible outcomes and the certainty (or lack of) with which we can make predictions
- Prescriptive Analytics: the analysis of ‘optimum’ ways to behave in order to ‘minimise’ or ‘maximise’ a desired outcome.
As we progress through these categories, the complexity increases, and hopefully the value added to the business as well. But this isn’t a list of steps – you could jump straight to predictive or prescriptive analytics without touching on either descriptive or diagnostic.
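To make the four categories a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the price and sales figures are invented toy data, the straight-line model is simply the easiest thing to fit by hand, and names such as `fit_line` and `predict_units` are hypothetical helpers rather than anything we use at Mango.

```python
from statistics import mean

# Hypothetical toy data (invented for illustration): weekly price and units sold.
prices = [10.0, 12.0, 14.0, 16.0, 18.0]
units = [100.0, 90.0, 80.0, 70.0, 60.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of ys = intercept + slope * xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Descriptive: what happened? Summarise the sales we observed.
average_units = mean(units)

# Diagnostic: why did it happen? Estimate how sales move with price.
intercept, slope = fit_line(prices, units)


# Predictive: what might happen at a price we have not tried yet?
def predict_units(price):
    return intercept + slope * price


# Prescriptive: which candidate price maximises predicted revenue?
candidate_prices = [p / 2 for p in range(10, 51)]  # 5.0 to 25.0 in steps of 0.5
best_price = max(candidate_prices, key=lambda p: p * predict_units(p))

print(f"Average units sold: {average_units}")
print(f"Change in units per unit of price: {slope}")
print(f"Predicted units at a price of 20: {predict_units(20.0)}")
print(f"Revenue-maximising candidate price: {best_price}")
```

Note that the prescriptive step leans on the predictive model, which in turn leans on the diagnostic fit. In a real project each stage would involve far more care (uncertainty, validation, business constraints), but the shape of the questions is the same.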
It’s important to note that data science is focused on advanced analytics. Using the above definitions, this means dealing with everything beyond descriptive analytics.
‘Proactive’
‘Proactive’ was included to distinguish data science from the more traditional ‘statistical analysis’. In my experience, when I started my career as a statistician in industry, an organisation’s analytic function was a largely ‘reactive’ practice. Modern data science needs to be an active part of the business function and look for ways to improve the business.
‘To drive better decision making’
I think the last part of the definition is the most important part. If we ignore this, then there’s a danger of doing the expensive cool stuff and not actually adding any value. With organisations investing heavily in data science as an industry, we need to deliver – otherwise we may be in a situation where data science as a phrase becomes associated with high-cost initiatives that never truly add value.
We need to be very clear about something: we can use the best tech, leverage the most clever algorithms, and apply them to the cleanest data, but unless we change the way something is done then we’re not adding value. To move the needle with data science, we need to positively impact the way the business does something.
So, what is a Data Scientist?
Each part of our definition hints at a particular skill that’s needed:
- Data: ability to manipulate data across a number of dimensions (volume, variety, velocity)
- Advanced analytics: understanding of a range of analytic approaches
- Proactive: communication skills that allow us to interact with the business
- Decision making: the ability to turn analytic thinking (e.g. models) into production code so they can be embedded in systems that deliver insight or action
If data science, as a proactive pursuit, is concerned with meeting a range of business challenges, then a data scientist must understand (at least the possibilities of) a wide range of analytic approaches.
So… we just need to hire Unicorns?
From what I’ve said earlier it sounds like you just need to hire people who understand every analytic technique, code in every language, etc.
I’ve been interviewing prospective Data Scientists for more than 15 years and I can safely say that data science ‘unicorns’ don’t exist (unless you know one, and they’re interested in a role – in which case, please contact me!).
The fact that unicorns don’t exist leads to a very important part of data science: Data Science is a Team Sport!
While we can’t hire people with all the skills required, we can hire data scientists with some of the required skills, and then create a team of complementary skillsets. This way we can create a team that, as a collective, contains all of the skills required for data science. How to successfully hire this team is a whole other blog post (keep your eyes peeled)!
Do you know where you currently sit with your skills and knowledge? Take our Data Science Radar quiz to find out!