What is a Data Scientist?

Data science is like a triathlon, and data scientists are experts in three endurance disciplines. Cycling, running and swimming are each Olympic sports in their own right, and there always will be specialists in each, but the need for super athletes who can do all three is growing. Data scientists are the new triathletes, combining programming, business and statistics to look for patterns in data that promise to create winning teams and add millions, even billions, to the bottom line. But are data scientists heroes, or is it all just a tad extreme? How can this new hybrid super-athlete data scientist, who does it all on his or her own, possibly work efficiently?

Let’s suppose that programmers are the cyclists. Cycling is by far the most demanding discipline. It requires hardware and software, and there is no good way to learn competitive cycling except to spend many, many hours in an uncomfortable seat, bent over in an unnatural position, getting the details right.

  • Yes, to ride competitively you will have to ride with cleats and change tires. Yes, you will need sunscreen and a helmet, and you will probably fall off your bike a few times. Data preparation is a hands-on process and it's not glamorous. It's best to get some music going and just get on with it.
  • Many new cyclists ask: R or Python? The answer is to pick one and start training. A good cyclist can quickly adapt their riding style to all kinds of bikes in all kinds of conditions.
  • Think of your computer as your bicycle. Some computers are better than others, but ultimately we all want speed and efficiency. Even with the best bike, it's your cycling cadence and gear changes that make the difference.
  • Cycling tip: there's no sense in spending an extra $10 000 on a bicycle that’s 6 pounds lighter if you're carrying a 3 kg beer belly with you everywhere. Instead, get in shape and train! You can spend a lot of money on the best computer and take paid programming courses, or you can work with what you have and teach yourself. You can access R and Python through your internet browser at http://datascientistworkbench.com; just register, there's no installation required. Take free courses online in R, Python, Scala, SQL and more; you can find some great ones at Big Data University. The vast majority of us can become good cyclists with an entry-level bike; just pick one and start training.

Runners are the domain experts who are able to define a problem and the potential data science hypothesis to solve it. Runners bring industry context and, most importantly, are able to communicate the whole picture, including the technical parts, to various audiences, including the C-suite folk who typically sponsor projects. Sometimes we call runners 'translators' because they can speak the language of whichever department they need to engage with, and they help different departments speak to each other. Think of runners as business professionals, health care professionals, HR analytics people, mining analytics people or scientists. Runners will think about mining data to find the characteristics of one disease that can be incorporated into a vaccine for another disease, or about how to link the retail banking database to the mortgages database and use the merged data set to predict and reduce the rate of loan defaults. They must think cross-functionally and they must often take an interdisciplinary approach. Runners can be from any industry, and if they don’t want to pick up cycling and swimming themselves, they tend to work very closely with the cyclists and swimmers in an organization.

Swimmers are number crunchers with an intimate knowledge of mathematics, statistics and modeling. They are good at stochastic thinking, linear and non-linear thinking, and they frame the world in terms of probabilities and confidence intervals. They know what the algorithm is doing and they know the rules for when different approaches are appropriate. Swimmers are statisticians or mathematicians who know when to use linear methods and when to use clustering methods. In the old days, statistics was very theoretical; we didn't have data. Now that we have data, real data, we must adapt our thinking and apply those statistical concepts to real-life examples. Laps in a gym swimming pool can only take you so far. Triathlon swimming can be grueling, and you have to be prepared for the real-life conditions on race day.

There will be no shortage of those testing their limits at the Cape Argus, the gorgeous Two Oceans Marathon or the Midmar Mile as separate events, but it takes someone special to think about doing all three. Are they heroes, or is it all a tad extreme? After all, how can data scientists possibly work efficiently? Heaps of literature on the merits of 'division of labor' are being thrown out the window as demand grows for this new hybrid super-athlete data scientist who does it all on his or her own.

Well, we know that athletes who are brilliant in one discipline, who work hard, and who are fit and focused can learn the other two disciplines and ultimately succeed in triathlon. No matter your core discipline, in data science there will be running, swimming and cycling, but that's just the beginning; only a handful will go on to be good enough to compete in an Ironman. Many will work in data science; few will be data scientists.