
After the success of rstudio::conf 2017, this year's conference was back, bigger and better than ever, with 1000+ attendees in sunny San Diego. Since the conference, my colleagues and I have been putting the techniques we learned into practice (which is totally why you're only seeing this blog post now!).

Day 1 – Shiny stole the show

The first stream was all things Shiny. With all the hype surrounding Shiny in the past few years, it didn't disappoint. Joe Cheng spoke at the EARL London Conference in September last year about an exciting new feature that lets users take advantage of asynchronous programming within Shiny applications through the promises package. It was great to see a live demo of how this feature can be used to scale Shiny apps and reduce wait times. The JavaScript-inspired promises are not Shiny-specific, and Joe is hoping to release the package on CRAN soon. In the meantime you can check out the package here.
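To give a flavour of the pattern, here's a minimal sketch (not taken from Joe's demo) of an async Shiny output using promises together with the future package; the five-second sleep stands in for a genuinely slow computation:

```r
# A minimal async Shiny sketch using promises + future (illustrative only)
library(shiny)
library(promises)
library(future)
plan(multisession)  # run expensive work in separate background R processes

ui <- fluidPage(textOutput("slow"))

server <- function(input, output, session) {
  output$slow <- renderText({
    future({
      Sys.sleep(5)        # stand-in for a long-running computation
      "expensive result"
    }) %...>%             # promise pipe: runs once the future resolves
      paste("Got:", .)
  })
}

shinyApp(ui, server)
```

While the future runs in a background process, the main R process stays free to serve other users, which is where the scaling benefit comes from.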

At Mango we're already excited to start streamlining existing and future customer applications using promises. From a business point of view, it's going to allow us to build more efficient and complex applications.

Straight after Joe was RStudio's Winston Chang. Winston gave another great demo, this time showing the new features of the shinytest package. As well as improved user interaction compared to previous shinytest versions, Winston demonstrated the latest snapshot-comparison feature, which allows users to compare snapshots side by side when re-running tests and to drag images interactively to spot differences between them.

This is another potentially exciting breakthrough in the world of Shiny. Testing the user interface components of a Shiny app has historically been a manual process, so formalising it with shinytest will hopefully provide the framework to take proof-of-concept applications into a validated, production-ready state. You can check out the latest version here.
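For an idea of the workflow, here's a hedged sketch of a shinytest script; the app path, input name, and snapshot name are all hypothetical:

```r
# A sketch of a recorded shinytest script; scripts usually live in the
# app's tests/ directory and can be generated with shinytest::recordTest()
library(shinytest)

app <- ShinyDriver$new("../../")  # launch the app under test
app$snapshotInit("basic-usage")   # name this set of snapshots

app$setInputs(bins = 20)          # drive the UI programmatically
app$snapshot()                    # capture inputs/outputs for comparison

# shinytest::testApp("path/to/app") re-runs the scripts and flags any
# snapshot differences for side-by-side review
```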

We were also excited to hear that RStudio have built their own load-testing tools, which they'll be making available as well. Traditional load-testing tools are often incompatible with Shiny apps, so RStudio's main goals were to create something that's easy to use, can simulate large numbers of users, and works well with Shiny. The workflow covers recording, playback, and result analysis, and we envisage it enabling our customers to get really in-depth metrics on their Shiny apps.

Day 2 – machine learning

Aside from Shiny, a main theme of the conference was undoubtedly machine learning.

Day 2 kicked off with a keynote from J.J. Allaire, RStudio's CEO. J.J.'s presentation, "Machine Learning with R and TensorFlow", was a fantastic insight into how RStudio have been busy over the past year making TensorFlow's numerical computing library available to the R community. The keras package opens up the whole of TensorFlow's functionality for easy use in R, without the need to learn Python. It was great to hear TensorFlow explained in such a clear way, and it has already sparked interest and demand at Mango for our new "Deep Learning with keras in R" course (which you can attend if you sign up for the EARL London Conference – hint hint).
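As a taste of the API, the sketch below defines and compiles a small binary classifier in R; the architecture is chosen purely for illustration, and x_train/y_train are hypothetical data:

```r
# A minimal keras-in-R sketch; the network itself is illustrative
library(keras)

model <- keras_model_sequential() %>%
  layer_dense(units = 32, activation = "relu", input_shape = c(10)) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "accuracy"
)

# model %>% fit(x_train, y_train, epochs = 10)  # x_train/y_train: your data
```

Note how the pipe-friendly interface keeps the model definition readable for R users with no Python experience.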

The interop stream gave us an insight into the leading technologies integrating with and exciting the world of R. With TensorFlow and Keras being machine learning buzzwords at the moment, Javier Luraschi explained how to deploy TensorFlow models for fast evaluation and export using the tfdeploy package (https://github.com/riga/tfdeploy). He also highlighted integration with other technologies, such as cloudml and rsconnect.
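A typical flow looks something like the sketch below, where "savedmodel" is a hypothetical export directory and model is a previously trained TensorFlow or keras model:

```r
# A hedged sketch of exporting and serving a TensorFlow model from R
library(tfdeploy)

# export_savedmodel() is provided by the tensorflow/keras packages and
# writes the model in TensorFlow's SavedModel format
export_savedmodel(model, "savedmodel")

# serve the exported model over a local REST API for fast evaluation
serve_savedmodel("savedmodel")
```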

Next year's conference has already been announced: it will run in Austin, Texas. Workshop materials and slides from this year's conference can be found here.


Data Science has come to represent the proactive use of data and advanced analytics to drive better decision making. While there is broad agreement around this, the skillsets of a Data Scientist are still something that generates debate (and endless venn-diagram-filled blog posts).

A common element of this debate is the frequent exclusion criteria placed on the role. Something like, "if someone has this skill/qualification then they are not a Data Scientist", which is typically stated confidently by a self-identified Data Scientist who has — surprise, surprise — exactly the skill/qualification in question. Some recent examples of this that I've experienced include:

  • If you’re not a statistician you’re not a Data Scientist
  • If you can’t build a Recommender Engine you’re not a Data Scientist
  • If you don’t have a PhD you’re not a Data Scientist

For the record, I know some fantastic Data Scientists who:

  • Wouldn't self-identify as a Statistician (e.g. they come from a machine learning background)
  • Have never needed to build a Recommender Engine (maybe because the area they work in has never required that)
  • Don't have a PhD (or an MSc)

Now, I’m not saying “everyone is a Data Scientist” and I do think there’s an inherent danger in not defining some sort of criteria. However, with the money to be made in the world of Data Science, it’s no wonder that we’re in a situation where consultants with any sort of data skills are re-badging themselves as Data Scientists and increasing their day rates.

The concern here, of course, is that organisations will invest in non-Data-Sciencey-Data-Scientists (we’re getting pretty technical here), but not see the value they expected. This could ultimately have a negative impact on the world of Data Science in the same way that the Big Data world has been tainted by examples of over-investment in Big Data tech (people tend to be saying “hey, let’s build a data lake” a little more sheepishly than a few years ago).

So, what makes a Data Scientist a Data Scientist? Without specifying technologies you must use, algorithms you must know, or qualifications you must have, there appears to be some consensus around ‘minimum skills’ (although please let me know if you disagree):

Advanced Analytics

The word 'Analytics' is incredibly broad, encompassing everything from adding up a few numbers to fitting advanced mathematical models. I feel a Data Scientist is someone who applies advanced analytic techniques, such as predictive or prescriptive analysis based on statistical or machine learning methods.

While Business Intelligence is vital, I think that someone who spends their time building dashboards but not modelling would not be a Data Scientist.

Broad vs Deep Methodology

Many ‘statistical’ roles in the last few decades were largely reactive, in that their remit was narrow and long-established. This meant that the range of analytic techniques would likely also be narrow and statisticians ended up with a deep knowledge in a particular methodology rather than a broad understanding of analytic approaches.

For example, in my first role I almost exclusively used linear models, whereas in my next role it was all about survival models. As a Data Scientist is being asked to proactively solve a wider range of problems, they at least need an appreciation of the broader possibilities and the ability, to some extent, to consume and apply a new methodology (once the assumptions of those methods are understood).

Coding, not scripting

There’s a significant difference between someone who ‘writes scripts’ and someone who can really code. In the early 2000s, I spent a great deal of time as a statistician writing SAS code, where my primary output was ‘insight’ and the code I wrote was more of a by-product of what I did as opposed to a deliverable. For what it’s worth, I wouldn’t class that earlier version of me as a ‘Data Scientist’.

I think a Data Scientist is more of a programmer, where the code they write is part of what they deliver, and therefore needs to be scalable and written with formal development practices in mind. I’m not saying that every Data Scientist needs to be a master of DevOps (although that would be nice!), but some element of coding rigour should be essential.

Doing Science

The last thing that, for me, sets Data Scientists apart is the way they approach a challenge. When hiring, I'm looking for someone who fundamentally sees data as an opportunity and has an inherent curiosity about what insight that data will contain. Beyond that, a Data Scientist's approach is fundamentally a 'scientific' one, in which hypotheses are formed and then tested using the data and the available analytic methodologies.

While exclusion criteria such as the ones I've written above may feel unfair, I think they are necessary to be able to delineate the career of a Data Scientist and to distinguish it from the variety of other data roles. Ultimately, the criteria will vary because each organisation will need different types of Data Scientists to achieve their goals. However, establishing a base for what would be considered a 'Minimally Viable Data Scientist' feels vital to the success of Data Science as a whole.

Where to from here?

If you’re looking to become a Data Scientist, then I hope this helps you to understand the skills needed.

If you’re looking to hire a Data Scientist, then make sure you know what you are trying to achieve and check potential hires have the skills required to deliver the value you’re expecting.

How Mango can help

We’ve been helping companies with their data analysis since 2002. We now also work with organisations to build their data science capability and develop their data science strategies. Talk to us about how we can help you make the most of your data to strengthen your business: info@mango-solutions.com


We told you EARL 2018 was going to be awesome!

We’re excited to announce that Hadley Wickham will be the Keynote Speaker at our EARL Houston event on 9 November 2018.

Technically, we think Hadley needs no introduction, but just in case…

Hadley is Chief Scientist at RStudio, the company that created the most-used IDE for businesses and individuals using R around the world. He is interested in building computational and cognitive tools that make data ingest, manipulation, visualisation and analysis easier, particularly via the more than 30 R packages he has developed. He also leads the team that creates and maintains the widely used ‘tidyverse’, which contains some of the most popular packages in the R community.

An encouraging and supportive member of the R community, Hadley is well-known for his deep insight and willingness to answer questions and share his knowledge, authoring a number of books and online resources. While the topic of his talk will be a surprise, we know delegates will come away from his session with plenty to think about.

Take the stage with Hadley

Abstract submissions are open for both the US Roadshow in November and London in September. If you would like to share the R successes in your organisation, you could be on the agenda with Hadley in Houston as one of our speakers.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:

  • London: 11-13 September
  • Seattle: 7 November
  • Houston: 9 November
  • Boston: 13 November