
For the last week we’ve been talking on the blog and Twitter about some of the functionality in Shiny and how you can learn it. But, if you haven’t already made the leap and started using Shiny, why should you?

What is the challenge to be solved?

At Mango we define data science as the proactive use of data and advanced analytics to drive better decision making.

We all know about the power of R for solving analytic challenges. It is, without a doubt, one of the most powerful analytic tools available to us as data scientists, providing the ability to solve modelling challenges using a range of traditional and modern analytic approaches.

However, the reality is that we can fit the best models and write the best code, but unless someone in the business is able to use the insight we generate to make a better decision our teams won’t add any value.

So, how do we solve this? How can we share the insight with the decision makers? How can we actually drive decision making with the analytics we have performed? If we’re not putting the results of our analysis into the hands of the decision makers it’s completely useless.

This is where Shiny comes in!

What is Shiny?

Shiny is a web application framework for R. In a nutshell, this means that anyone who knows some R can start to build applications that sit in a web browser. An application could be as simple as displaying some graphics and tables, or a fully interactive dashboard. The important part is that it is all done with R; there is no requirement for web developers to get involved.

Also, Shiny allows us to create true ‘data products’ that go beyond standard Business Intelligence dashboards. We can define intuitive interfaces that allow business users to perform what-if analysis, manipulating parameters that enable them to see the impact of different approaches on business outcomes.

What can it do?

Once your Shiny app is built it’s basically an interface to R – meaning your Shiny application can do whatever R can do (if you allow it to). So you can create Shiny applications that do anything from ‘add some numbers together’ to ‘fit sophisticated models across large data sources and simulate a variety of outputs’.
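To give a sense of how little code is involved, here is a minimal sketch of a Shiny app (the input name `n` and the histogram are arbitrary choices for the example, not from a particular project): the user picks a sample size, R draws the sample and renders a plot.

```r
library(shiny)

# UI: a slider input and a plot output, defined entirely in R
ui <- fluidPage(
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot(hist(rnorm(input$n), main = "Random sample"))
}

shinyApp(ui, server) # returns the app object; launches it when run interactively
```

Running this in an R session opens the app in a browser, with no HTML, CSS or JavaScript written by hand.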

There are more use cases for Shiny than we could possibly list here and I would strongly recommend checking out the Shiny user showcase for more examples.

Share Insights

When it comes to Shiny for sharing insights some of the most common uses that we see include:

  • Presenting results of analysis to end users in the form of graphics and tables, allowing limited interaction such as selecting sub-groups of the data
  • Displaying current status and presenting recommended next actions based on R models
  • Automated production of common reports, letting users upload their own data that can be viewed in a standard way

Day-to-Day Data Tasks

Sharing insights is by no means the only way in which Shiny can be used. At Mango we are regularly asked by our customers to provide applications that allow non-R users to perform standard data manipulation and visualisation tasks, or to run standard analysis based on supplied data or data extracted from a database. Essentially, this allows the day-to-day tasks to move away from the data scientists or core R users, who can then focus on new business challenges.

Check out this case study for an example of how we helped Pfizer with an application to simplify their data processing.


Shiny is also a great tool for prototyping. Whilst it can be, and is, used widely in production environments, some businesses may prefer to use other tools for business critical applications.

But allowing the data scientists in the team to generate prototypes in Shiny makes it much easier to understand if investment in the full system will add value, whilst also providing an interim solution.

The possibilities really are endless – in fact a question you may need to consider is: when should we move from Shiny to a formal web development framework?

But the decision makers don’t use R

The best thing about Shiny is that it produces a web application that can be deployed centrally and shared as a URL, just like any other web page. There are a whole host of tools that allow you to do this easily.

My personal favourite is RStudio Connect, as I can deploy a new application quickly and easily without having to spend time negotiating with the IT team. But there are other options and I would recommend checking out the RStudio website for a great resource comparing some of the most popular ones.

How can we get started with Shiny?

There are a number of ways that you can get started understanding whether Shiny could add value in your business: from Shiny training courses to developing a prototype.

Get in touch with the team at Mango who will be happy to talk through your current business requirements and advise on the next best steps for putting the power of Shiny into your decision making process.

Why do we love Shiny?

Shiny allows R users to put data insights into the hands of the decision makers. It’s a really simple framework that doesn’t require any additional toolsets and allows all of the advanced analytics of R to be made available to the people who will be making the decisions.

Shiny Training at Mango

This month we have launched our newly updated Shiny training programme. The three one-day courses go from getting started right through to best practices for putting Shiny into production environments.

Importantly, all of these courses are taught by data science consultants who have hands-on experience building and deploying applications for commercial use. These consultants are supported by platform experts who can advise on the best approaches for getting an application out to end users so that you can see the benefits of using Shiny as quickly as possible.

If you want to know more about the Shiny training that we offer, take a look at our training page. If you are based in the UK we will be running public Shiny courses in London (see below for the currently scheduled dates). We will also be offering a snapshot of the materials for intermediate Shiny users at London EARL in September.

Public course dates:
  • Introduction to Shiny: 17th July
  • Intermediate Shiny: 18th July, 5th September
  • Advanced Shiny: 6th September

If you would like more information or to register for our Shiny courses, please contact our Training Team.


Finding the Use Cases

So you’ve gathered the data, maybe hired some data scientists, and you’re looking to make a big impact.

The next step is to look for some business problems that can be solved with analytics – after all, without solving some real business challenges you’re not going to add much value to your organisation!

As you start to look for analytics use cases to work on, you may soon find yourself inundated with a range of possible projects. But which ones should you work on? How do you prioritise?

Mango have spent a lot of time over the last few years helping organisations to identify, evaluate and prioritise Analytic Use Cases. Picking the right projects, particularly early on in your data-driven adventure, will have a significant impact on the success of your analytic initiative. This article is based on some of the ways in which we coach companies around the building of Analytic Portfolios and what to look for in projects.

Evaluating Analytic Use Cases

The prioritisation of analytic use cases will be largely driven by the reason your data initiative was created and what ‘success’ for your team really looks like.

However, for this post, I’m going to assume the aim of your initiative is ultimately to add value to the organisation, where success is measured in financial terms (either saving money or adding revenue).

Generally, you’ll probably want a mixture of tactical and strategic initiatives – get some quick wins under your belt while you’re working on those bigger, longer-term challenges. However, when you’re looking at projects to work on you should consider a number of aspects:

  1. The Problem is Worth Solving

This might sound obvious, but a big factor in assessing an analytic use case is the potential value it could add. Delivering a multi-million pound project that decides what colour to paint the boardroom isn’t going to win many fans.

Ensure you understand:

  • How delivering this project would add value to your organisation
  • Exactly how that value will be measured
  2. The Building Blocks are in place

Understanding the ‘readiness’ (or otherwise) of a project to be delivered is a major factor in determining whether to prioritise it. Key aspects to consider include:

  • Data – is there enough data of sufficient quality to solve this challenge?
  • Platform – is the technical platform in place to enable insight to be derived?
  • Skills – do you have the skills required to implement the solution?
  • Delivery – is there a mechanism in place to deliver any insight to decision makers?
  3. The Analytic Use Case is Solvable

The world of analytics is awash with marketing right now, promising silver-bullet solutions based on Machine Learning, AI or Cognitive Computing. However, the simplicity or otherwise of a potential solution should be considered when prioritising a use case. You don’t want to end up with a portfolio of projects whose solutions are at the periphery of what’s currently possible.

  4. The Business is Ready to Change

This is, without doubt, the primary factor in the success (or otherwise) of an analytic project. You could have the best data, write the best code and implement the best algorithm – but if the business users don’t behave differently once the solution is implemented, the value you’re seeking won’t be realised.

Before you build, make sure the business is willing to change their behaviour.

Evaluating possible projects in this way can help you to build a portfolio of Analytic Use Cases that will add significant, measurable value to your organisation. Moreover, making the right decisions early can help you build momentum around data-driven change, leading to a more-engaged business community ready for change.

Mango Solutions can help you navigate this process successfully. Based on insight and experience gained over 15 years working with the world’s leading companies, we have developed 3 workshops to help overcome some of the common challenges and roadblocks at different stages of your journey.

Find out which of the three workshops would be valuable to your organisation here.


In this blog post I explore the purrr package (a member of the tidyverse collection) and its use within a data scientist’s code. I aim to present the case for using the purrr functions and, through the use of examples, compare them with base R functionality. To do this, we will concentrate on two typical coding scenarios in base R: 1) loops and 2) the suite of apply functions, and then compare them with their relevant counterpart map functions in the purrr package.

However, before I start, I wanted to make it clear that I do sympathise with those of you whose first reaction to purrr is “but I can do all this stuff in base R”. Putting that aside, the obvious first obstacle for us to overcome is to lose the notion of “if it’s not broken why change it” and open our ‘coding’ minds to change. At least, I hope you agree with me that the silver lining of this kind of exercise is to satisfy one’s curiosity about the purrr package and maybe learn something new!

Let us first briefly describe the concept of functional programming (FP) in case you are not familiar with it.

Functional programming (FP)

R is a functional programming language, which means that a user of R has the necessary tools to create and manipulate functions. There is no need to go into too much depth here, but it suffices to know that FP is the process of writing code in a structured way, using functions to remove code duplication and redundancy. In effect, computations or evaluations are treated as mathematical functions and the output of a function depends only on the values of its inputs – known as arguments. FP ensures that side-effects such as changes in state do not affect the expected output, such that if you call the same function twice with the same arguments the function returns the same output.
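As a toy illustration of this property (the function name scale01 is made up for the example), a pure function returns the same output for the same input no matter how many times it is called:

```r
# A pure function: the result depends only on the argument x,
# not on any external state
scale01 <- function(x) (x - min(x)) / (max(x) - min(x))

v <- c(2, 4, 6, 10)
identical(scale01(v), scale01(v)) # TRUE - repeated calls agree
```

Contrast this with a function that reads or modifies a global variable, where the same call could return different results depending on when it is made.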

For those who are interested in finding out more, I suggest reading Hadley Wickham’s Functional Programming chapter in the “Advanced R” book; a companion website for the book is also freely available online.

The purrr package, which forms part of the tidyverse ecosystem of packages, further enhances the functional programming aspect of R. It allows the user to write functional code with less friction in a complete and consistent manner. The purrr functions can be used, among other things, to replace loops and the suite of apply functions.

Let’s talk about loops

The motivation behind the examples we are going to look at involve iterating in R for various scenarios. For example, iterate over elements of a vector or list, iterate over rows or columns of a matrix … the list (pun intended) can go on and on!

One of the first things that one gets very excited to ‘play’ with when learning to use R – at least that was the case for me – is loops! Lots of loops; elaborate, complex… dare I say never-ending infinite loops (cue hysterical laughter). Joking aside, a loop is usually the default answer to a problem that involves iteration of some sort, as I demonstrate below.

# Create a vector of the mean values of all the columns of the mtcars dataset
# The long repetitive way
mean_vec <- c(mean(mtcars$mpg), mean(mtcars$cyl), mean(mtcars$disp), mean(mtcars$hp),
              mean(mtcars$drat), mean(mtcars$wt), mean(mtcars$qsec), mean(mtcars$vs),
              mean(mtcars$am), mean(mtcars$gear), mean(mtcars$carb))
mean_vec
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# The loop way
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

The resulting vectors are the same and the difference in speed (milliseconds) is negligible. I hope that we can all agree that the long way is definitely not advised and actually is bad coding practice, let alone the frustration (and error-prone task) of copy/pasting. Having said that, I am sure there are other ways to do this – I demonstrate this later using lapply – but my aim was to show the benefit of using a for loop in base R for an iteration problem.

Now imagine if in the above example I wanted to calculate the variance of each column as well…

# Create two vectors of the mean and variance of all the columns of the mtcars dataset

# For mean
mean_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# For variance
var_vec_loop <- vector("double", ncol(mtcars))
for (i in seq_along(mtcars)) {
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
var_vec_loop
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

# Or combine both calculations in one loop
for (i in seq_along(mtcars)) {
  mean_vec_loop[[i]] <- mean(mtcars[[i]])
  var_vec_loop[[i]] <- var(mtcars[[i]])
}
mean_vec_loop
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500
var_vec_loop
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Now let us assume that we want to create these vectors not just for the mtcars dataset but for other datasets as well. We could in theory copy/paste the for loops and just change the dataset we supply in the loop, but one should agree that this action is repetitive and could result in mistakes. Instead we can generalise this into functions. This is where FP comes into play.

# Create two functions that returns the mean and variance of the columns of a dataset

# For mean
col_mean <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- mean(df[[i]])
  }
  output
}
col_mean(mtcars)
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500

# For variance
col_variance <- function(df) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- var(df[[i]])
  }
  output
}
col_variance(mtcars)
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Why not take this one step further and take full advantage of R’s functional programming tools by creating a function that takes as an argument a function! Yes, you read it correctly… a function within a function!

Why do we want to do that? Well, the code for the two functions above, as clean as it might look, is still repetitive and the only real difference between col_mean and col_variance is the mathematical function that we are calling. So why not generalise this further?

# Create a function that returns a computational value (such as mean or variance)
# for a given dataset

col_calculation <- function(df, fun) {
  output <- vector("double", length(df))
  for (i in seq_along(df)) {
    output[[i]] <- fun(df[[i]])
  }
  output
}
col_calculation(mtcars, mean)
 [1]  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250
 [7]  17.848750   0.437500   0.406250   3.687500   2.812500
col_calculation(mtcars, var)
 [1] 3.632410e+01 3.189516e+00 1.536080e+04 4.700867e+03 2.858814e-01
 [6] 9.573790e-01 3.193166e+00 2.540323e-01 2.489919e-01 5.443548e-01
[11] 2.608871e+00

Did someone say apply?

I mentioned earlier that an alternative way to solve the problem is to use the apply function (or the suite of apply functions such as lapply, sapply, vapply, etc). In fact, these functions are what we call Higher Order Functions. Similar to what we did earlier, these are functions that can take other functions as an argument.

The benefit of using higher order functions instead of a for loop is that they allow us to think about what code we are executing at a higher level. Think of it as: “apply this to that” rather than “take the first item, do this, take the next item, do this…”

I must admit that at first it might take a little while to get used to, but there is definitely a sense of pride when you can improve your code by eliminating for loops and replacing them with apply-type functions.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
lapply(mtcars,mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725
sapply(mtcars,mean) %>% head # Returns a vector
       mpg        cyl       disp         hp       drat         wt 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250

Once again, speed of execution is not the issue and neither is the common misconception about loops being slow compared to apply functions. As a matter of fact the main argument in favour of using lapply or any of the purrr functions as we will see later is the pure simplicity and readability of the code. Full stop.

Enter the purrr

The best place to start when exploring the purrr package is the map function. The reader will notice that the map functions are utilised in a very similar way to the apply family of functions. The subtle difference is that the purrr functions are consistent and the user can be assured of the output – as opposed to some cases when using, for example, sapply, as I demonstrate later on.

# Create a list/vector of the mean values of all the columns of the mtcars dataset
map(mtcars,mean) %>% head # Returns a list
$mpg
[1] 20.09062

$cyl
[1] 6.1875

$disp
[1] 230.7219

$hp
[1] 146.6875

$drat
[1] 3.596563

$wt
[1] 3.21725
map_dbl(mtcars,mean) %>% head # Returns a vector - of class double
       mpg        cyl       disp         hp       drat         wt 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250

Let us introduce the iris dataset with a slight modification in order to demonstrate the inconsistency that sometimes can occur when using the sapply function. This can often cause issues with the code and introduce mystery bugs that are hard to spot.

# Modify iris dataset
iris_mod <- iris
iris_mod$Species <- ordered(iris_mod$Species) # Ordered factor levels

class(iris_mod$Species) # Note: The ordered function changes the class
[1] "ordered" "factor"

# Extract class of every column in iris_mod
sapply(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"
sapply(iris_mod[1:3], class) %>% str # Returns a character vector!?!? - Note: inconsistent object type
 Named chr [1:3] "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:3] "Sepal.Length" "Sepal.Width" "Petal.Length"

Since by default map returns a list, one can be sure that an object of the same class is returned, without any unexpected (and unwanted) surprises. This is in line with FP consistency.

# Extract class of every column in iris_mod
map(iris_mod, class) %>% str # Returns a list of the results
List of 5
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"
 $ Petal.Width : chr "numeric"
 $ Species     : chr [1:2] "ordered" "factor"
map(iris_mod[1:3], class) %>% str # Returns a list of the results
List of 3
 $ Sepal.Length: chr "numeric"
 $ Sepal.Width : chr "numeric"
 $ Petal.Length: chr "numeric"

To further demonstrate the consistency of the purrr package in this type of setting, the map_*() functions (see below) can be used to return a vector of the expected type, otherwise you get an informative error.

  • map_lgl() makes a logical vector.
  • map_int() makes an integer vector.
  • map_dbl() makes a double vector.
  • map_chr() makes a character vector.
# Extract class of every column in iris_mod
map_chr(iris_mod[1:4], class) %>% str # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
map_chr(iris_mod, class) %>% str # Returns a meaningful error
Error: Result 5 is not a length 1 atomic vector

# As opposed to the equivalent base R function vapply
vapply(iris_mod[1:4], class, character(1)) %>% str  # Returns a character vector
 Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
vapply(iris_mod, class, character(1)) %>% str  # Returns a possibly harder to understand error
Error in vapply(iris_mod, class, character(1)): values must be length 1,
 but FUN(X[[5]]) result is length 2

It is worth noting that if the user does not wish to rely on tidyverse dependencies they can always use base R functions but need to be extra careful of the potential inconsistencies that might arise.

Multiple arguments and neat tricks

In case we want to apply a function to multiple vector arguments, we have the option of mapply from base R or map2 from purrr.

# Create random normal values from a list of means and a list of standard deviations
mu <- list(10, 100, -100)
sigma <- list(0.01, 1, 10)

mapply(rnorm, n=5, mu, sigma, SIMPLIFY = FALSE) # I need SIMPLIFY = FALSE because otherwise I get a matrix
[[1]]
[1] 10.002750 10.001843  9.998684 10.008720  9.994432

[[2]]
[1] 100.54979  99.64918 100.00214 102.98765  98.49432

[[3]]
[1]  -82.98467  -99.05069  -95.48636  -97.43427 -110.02194

map2(mu, sigma, rnorm, n = 5)
[[1]]
[1] 10.00658 10.00005 10.00921 10.02296 10.00840

[[2]]
[1]  98.92438 100.86043 100.20079  97.02832  99.88593

[[3]]
[1] -113.32003  -94.37817  -86.16424  -97.80301 -105.86208

The map2 function can easily extend to further arguments – not just two as in the example above – and that is where the pmap function comes in.
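A small sketch of how pmap handles three argument lists (the extra sample-size values here are arbitrary additions for the example): the names of the lists match the argument names of rnorm, so each call receives one element from each list.

```r
library(purrr)

# One named list per argument of rnorm()
params <- list(
  n    = list(3, 5, 7),
  mean = list(10, 100, -100),
  sd   = list(0.01, 1, 10)
)

# pmap() iterates over all the lists in parallel,
# calling rnorm(n = 3, mean = 10, sd = 0.01), and so on
samples <- pmap(params, rnorm)
lengths(samples) # 3 5 7 - one draw per parameter combination
```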

I also thought of sharing a couple of neat tricks that one can use with the map function.

  1. Say you want to fit a linear model for every cylinder type in the mtcars dataset. You can avoid code duplication and do it as follows:
# Split mtcars dataset by cylinder values and then fit a simple lm
models <- mtcars %>% 
  split(.$cyl) %>% # Split by cylinder into 3 lists
  map(function(df) lm(mpg ~ wt, data = df)) # Fit linear model for each list
  2. Say we are using a function, such as sqrt (calculate square root), on a list that contains a non-numeric element. The base R function lapply throws an error and execution stops, without telling us which element caused the error. The safely function from purrr completes execution and lets the user identify what caused the error.
x <- list(1, 2, 3, "e", 5)

# Base R
lapply(x, sqrt)
Error in FUN(X[[i]], ...): non-numeric argument to mathematical function

# purrr package
safe_sqrt <- safely(sqrt)
safe_result_list <- map(x, safe_sqrt) %>% transpose
safe_result_list$result
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
NULL

[[5]]
[1] 2.236068

Overall, I think it is fair to say that using higher order functions in R is a great way to improve one’s code. With that in mind, my closing remark for this blog post is to simply re-iterate the benefits of using the purrr package. That is:

  • The output is consistent.
  • The code is easier to read and write.

If you enjoyed learning about purrr, then you can join us at our purrr workshop at this year’s EARL London – early bird tickets are available now!


We are excited to announce that the goodpractice package is now available on CRAN. The package gives advice about good practices when building R packages. Advice includes functions and syntax to avoid, package structure, code complexity, code formatting, etc.

You can install the CRAN version via:

install.packages("goodpractice")

Building R packages

Building an R package is a great way of encapsulating code, documentation and data in a single testable and easily distributable unit.

For a package to be distributed via CRAN, it needs to pass a set of checks implemented in R CMD check, such as: Is there minimal documentation, e.g., are all arguments of exported functions documented? Are all dependencies declared?

These checks are helpful in developing a solid R package, but they don’t check for several other good practices. For example, a package does not need to contain any tests, but it is good practice to include some. Following a coding standard helps readability. Avoiding overly complex functions reduces the risk of bugs. Including a URL for bug reports lets people more easily report bugs if they find any.

What the goodpractice package does

Tools for automatically checking several of the above mentioned aspects already exist. The goodpractice package bundles the checks from rcmdcheck with code coverage through the covr package, source code linting via the lintr package and cyclomatic complexity via the cyclocomp package, and augments them with some further checks on good practice for R package development, such as avoiding T and F in favour of TRUE and FALSE. It provides advice on which practices to follow and which to avoid.

You can use goodpractice checks as a reminder for you and your colleagues – and if you have custom checks to run, you can make goodpractice run those as well!

How to use goodpractice

The main function goodpractice() (and its alias gp()) takes the path to the source code of a package as its first argument. The goodpractice package contains the source for a simple package which violates some good practices. We’ll use this for the examples.


# get path to example package
pkg_path <- system.file("bad1", package = "goodpractice")

# run gp() on it
g <- gp(pkg_path)
#> Preparing: covr
#> Warning in MYPREPS[[prep]](state, quiet = quiet): Prep step for test
#> coverage failed.
#> Preparing: cyclocomp
#> Skipping 2 packages ahead of CRAN: callr, remotes
#> Installing 1 packages: stringr
#>   There is a binary version available but the source version is
#>   later:
#>         binary source needs_compilation
#> stringr  1.3.0  1.3.1             FALSE
#> installing the source package 'stringr'
#> Preparing: description
#> Preparing: lintr
#> Preparing: namespace
#> Preparing: rcmdcheck

# show the result
g
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> It is good practice to
#>   ✖ not use "Depends" in DESCRIPTION, as it can cause name
#>     clashes, and poor interaction with other packages. Use
#>     "Imports" instead.
#>   ✖ omit "Date" in DESCRIPTION. It is not required and it gets
#>     invalid quite often. A build date will be added to the package
#>     when you perform `R CMD build` on it.
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#>   ✖ add a "BugReports" field to DESCRIPTION, and point it to a bug
#>     tracker. Many online code hosting services provide bug
#>     trackers for free,
#>     etc.
#>   ✖ omit trailing semicolons from code lines. They are not needed
#>     and most R coding standards forbid them
#>     R/semicolons.R:4:30
#>     R/semicolons.R:5:29
#>     R/semicolons.R:9:38
#>   ✖ not import packages as a whole, as this can cause name clashes
#>     between the imported packages. Instead, import only the
#>     specific functions you need.
#>   ✖ fix this R CMD check ERROR: VignetteBuilder package not
#>     declared: ‘knitr’ See section ‘The DESCRIPTION file’ in the
#>     ‘Writing R Extensions’ manual.
#>   ✖ avoid 'T' and 'F', as they are just variables which are set to
#>     the logicals 'TRUE' and 'FALSE' by default, but are not
#>     reserved words and hence can be overwritten by the user.
#>     Hence, one should always use 'TRUE' and 'FALSE' for the
#>     logicals.
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     R/tf.R:NA:NA
#>     ... and 4 more lines
#> ───────────────────────────────────────────────────────────────────────────

So with this package, we’ve done a few things in the DESCRIPTION file that are best avoided, left unnecessary trailing semicolons in the code and used T and F instead of TRUE and FALSE. The output of gp() will tell you what isn’t considered good practice out of what you have already written. If that is in the R code itself, it will also point you to the location of your faux pas. In general, the messages are supposed not only to point out what you might want to avoid but also why.

Custom checks

The above example tries to run all 230 checks available; to see the full list use all_checks(). You can customise the set of checks run by selecting only those default checks you are interested in and by adding your own checks.

If you only want to run a subset of the checks, e.g., just the check on the URL field in the DESCRIPTION, you can specify the checks by name:

# what is the name of the check?
grep("url", all_checks(), value = TRUE)
#> [1] "description_url"

# run only this check
gp(pkg_path, checks = "description_url")
#> Preparing: description
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> It is good practice to
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#> ───────────────────────────────────────────────────────────────────────────

Additional checks can be used in gp() via the extra_checks argument. This should be a named list of check objects as returned by the make_check() function.

# make a simple version of the T/F check
check_simple_tf <- make_check(
  description = "TRUE and FALSE is used, not T and F",
  gp = "avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead.",
  check = function(state) {
    length(tools::checkTnF(dir = state$path)) == 0
  }
)

gp(pkg_path, checks = c("description_url", "simple_tf"),
   extra_checks = list(simple_tf = check_simple_tf))
#> Preparing: description
#> ── GP badpackage ──────────────────────────────────────────────────────────
#> It is good practice to
#>   ✖ add a "URL" field to DESCRIPTION. It helps users find
#>     information about your package online. If your package does
#>     not have a homepage, add an URL to GitHub, or the CRAN package
#>     package page.
#>   ✖ avoid 'T' and 'F', use 'TRUE' and 'FALSE' instead.
#> ───────────────────────────────────────────────────────────────────────────

For more details on creating custom checks, please see the vignette Custom Checks.


This package was written by Gábor Csárdi with contributions by Noam Ross, Neal Fultz, Douglas Ashton, Marcel Ramos, Joseph Stachelek, and myself. Special thanks for the input and feedback to the rOpenSci leadership team and community, as well as everybody who opened issues!


If you have any feedback, please consider opening an issue on GitHub.


(Or, how to write a Shiny app.R file that only contains a single line of code)

This post is long overdue. The information contained herein has been built up over years of deploying and hosting Shiny apps, particularly in production environments, and mainly where those Shiny apps are very large and contain a lot of code.

Last year, during some of my conference talks, I told the story of Mango’s early adoption of Shiny and how it wasn’t always an easy path to production for us. In this post I’d like to fill in some of the technical background and provide some information about Shiny app publishing and packaging that is hopefully useful to a wider audience.

I’ve figured out some of this for myself, but the most pivotal piece of information came from Shiny creator, Joe Cheng. Joe told me some time ago, that all you really need in an app.R file is a function that returns a Shiny application object. When he told me this, I was heavily embedded in the publication side and I didn’t immediately understand the implications.

Over time though I came to understand the power and flexibility that this model provides and, to a large extent, that’s what this post is about.

What is Shiny?

Hopefully if you’re reading this you already know, but Shiny is a web application framework for R. It allows R users to develop powerful web applications entirely in R without having to understand HTML, CSS and JavaScript. It also allows us to embed the statistical power of R directly into those web applications.

Shiny apps generally consist of either a ui.R and a server.R (containing user interface and server-side logic respectively) or a single app.R which contains both.

Why package a Shiny app anyway?

If your app is small enough to fit comfortably in a single file, then packaging your application is unlikely to be worth it. As with any R script though, when it gets too large to be comfortably worked with as a single file, it can be useful to break it up into discrete components.

Publishing a packaged app will be more difficult, but to some extent that will depend on the infrastructure you have available to you.

Pros of packaging

Packaging is one of the many great features of the R language. Packages are fairly straightforward, quick to create and you can build them with a host of useful features like built-in documentation and unit tests.
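If you haven’t built a package before, the basic scaffolding can be sketched with the usethis and devtools packages (the path and test name below are illustrative, not part of the original post):

```r
# create a new package skeleton (path is an example)
usethis::create_package("~/projects/shinyAppDemo")

# add testthat infrastructure and a first unit test file
usethis::use_testthat()
usethis::use_test("launchApp")

# build documentation from roxygen comments, then run R CMD check
devtools::document()
devtools::check()
```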

They also integrate really nicely into Continuous Integration (CI) pipelines and are supported by tools like Travis, and you can generate test coverage reports for your package as part of the same process.

They’re also really easy to share. Even if you don’t publish your package to CRAN, you can still share it on GitHub and have people install it with devtools, or build the package and share that around, or publish the package on a CRAN-like system within your organisation’s firewall.

Cons of packaging

Before you get all excited and start to package your Shiny applications, you should be aware that — depending on your publishing environment — packaging a Shiny application may make it difficult or even impossible to publish to a system like Shiny Server or RStudio Connect, without first unpacking it again.

* Since the time of writing, this information has become outdated; more recent guidance is available on deploying packaged Shiny apps to Shiny Server and rsconnect.

A little bit of Mango history

This is where Mango were in the early days of our Shiny use. We had a significant disconnect between our data scientists writing the Shiny apps and the IT team tasked with supporting the infrastructure they used. This was before we’d committed to having an engineering team that could sit in the middle and provide a bridge between the two.

When our data scientists would write apps that got a little large, or that they wanted robust tests and documentation for, they would stick them in packages and send them over to me to publish to our original Shiny Server. The problem was: R packages didn’t really mean anything to me at the time. I knew how to install them, but that was about as far as it went. I knew from the Shiny docs that a Shiny app needs certain files (a server.R and ui.R, or an app.R), but that wasn’t what I got, so I’d send it back to the data science team and tell them that I needed those files or I wouldn’t be able to publish it.

More than once I got back a response along the lines of, “but you just need to load it up and then do runApp()”. But, that’s just not how Shiny Server works. Over time, we’ve evolved a set of best practices around when and how to package a Shiny application.

The first step was taking the leap into understanding Shiny and R packages better. It was here that I started to work in the space between data science and IT.

How to package a Shiny application

If you’ve seen the simple app you get when you choose to create a new Shiny application in RStudio, you’ll be familiar with the basic structure of a Shiny application. You need to have a UI object and a server function.

If you have a look inside the UI object you’ll see that it contains the html that will be used for building your user interface. It’s not everything that will get served to the user when they access the web application — some of that is added by the Shiny framework when it runs the application — but it covers off the elements you’ve defined yourself.

The server function defines the server-side logic that will be executed for your application. This includes code to handle your inputs and produce outputs in response.

The great thing about Shiny is that you can create something awesome quite quickly, but once you’ve mastered the basics, the only limit is your imagination.

For our purposes here, we’re going to stick with the ‘geyser’ application that RStudio gives you when you click to create a new Shiny Web Application. If you open up RStudio, and create a new Shiny app — choosing the single file app.R version — you’ll be able to see what we’re talking about. The small size of the geyser app makes it ideal for further study.

If you look through the code you’ll see that there are essentially three components: the UI object, the server function, and the shinyApp() function that actually runs the app.
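For reference, the single-file geyser app looks roughly like this (condensed from the template RStudio generates; the three components are marked in comments):

```r
library(shiny)

# 1. the UI object: the HTML for the page
ui <- fluidPage(
  titlePanel("Old Faithful Geyser Data"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 1, max = 50, value = 30)
    ),
    mainPanel(plotOutput("distPlot"))
  )
)

# 2. the server function: logic that reacts to inputs
server <- function(input, output) {
  output$distPlot <- renderPlot({
    x    <- faithful[, 2]
    bins <- seq(min(x), max(x), length.out = input$bins + 1)
    hist(x, breaks = bins, col = "darkgray", border = "white")
  })
}

# 3. shinyApp() constructs and runs the application
shinyApp(ui = ui, server = server)
```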

Building an R package of just those three components is a case of breaking them out into the constituent parts and inserting them into a blank package structure. We have a version of this up on GitHub that you can check out if you want.

The directory layout of the demo project looks like this:

|-- R
|   |-- launchApp.R
|   |-- shinyAppServer.R
|   `-- shinyAppUI.R
|-- inst
|   `-- shinyApp
|       `-- app.R
|-- man
|   |-- launchApp.Rd
|   |-- shinyAppServer.Rd
|   `-- shinyAppUI.Rd
`-- shinyAppDemo.Rproj

Once the app has been adapted to sit within the standard R package structure we’re almost done. The UI object and server function don’t really need to be exported, and we’ve just put a really thin wrapper function around shinyApp() — I’ve called it launchApp() — which we’ll actually use to launch the app. If you install the package from GitHub with devtools, you can see it in action.
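For illustration, the wrapper in R/launchApp.R can be as thin as this (assuming the UI object and server function are named shinyAppUI and shinyAppServer, matching the file names in the listing above):

```r
#' Launch the Shiny application
#'
#' A thin wrapper around shinyApp() so the package can start
#' the app with a single exported function.
#'
#' @export
launchApp <- function() {
  shiny::shinyApp(ui = shinyAppUI, server = shinyAppServer)
}
```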


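The install-and-run step might look like this (the GitHub repo path is an assumption for illustration):

```r
# install the demo package from GitHub (repo path is illustrative)
devtools::install_github("MangoTheCat/shinyAppDemo")

# launch the packaged Shiny app
shinyAppDemo::launchApp()
```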
This will start the Shiny application running locally.

The approach outlined here also works fine with Shiny Modules, either in the same package, or called from a separate package.

And that’s almost it! The only thing remaining is how we might deploy this app to Shiny server (including Shiny Server Pro) or RStudio Connect.

Publishing your packaged Shiny app

We already know that Shiny Server and RStudio Connect expect either a ui.R and a server.R or an app.R file. We’re running our application out of a package with none of this, so we won’t be able to publish it until we fix this problem.

The solution we’ve arrived at is to create a directory called ‘shinyApp’ inside the inst directory of the package. For those of you who are new to R packaging, the contents of the ‘inst’ directory are left alone by the package build process and copied into the top level of the installed package, so it’s an ideal place to put little extras like this.

The name ‘shinyApp’ was chosen for consistency with Shiny Server, which uses a ‘ShinyApps’ directory if a user is allowed to serve applications from their home directory.

Inside this directory we create a single ‘app.R’ file with the following line in it:
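Assuming the package and wrapper names used in this post, that one line would be:

```r
shinyAppDemo::launchApp()
```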


And that really is it. This one file will allow us to publish our packaged application under some circumstances, which we’ll discuss shortly.

Here’s where having a packaged Shiny app can get tricky, so we’re going to talk you through the options and do what we can to point out the pitfalls.

Shiny Server and Shiny Server Pro

Perhaps surprisingly, given that Shiny Server is the oldest method of Shiny app publication, it’s also the easiest one to use with these sorts of packaged Shiny apps. There are basically two ways to publish on Shiny Server: from your home directory on the server (also known as self-publishing), or from a central location, usually the directory ‘/srv/shiny-server’.

The central benefit of this approach is the ability to update the application just by installing a newer version of the package. Sadly though, it’s not always an easy approach to take.

Apps served from home directory (AKA self-publishing)

The first publication method is from a user’s home directory. This is generally used in conjunction with RStudio Server. In the self-publishing model, Shiny Server (and Pro) expect apps to be found in a directory called ‘ShinyApps’ within the user’s home directory. This means that if we install a Shiny app in a package, the final location of the app directory will be inside the installed package, not in the ShinyApps directory. To work around this, we create a link from where the app is expected to be to where it actually is within the installed package structure.

So in the example of our package, we’d do something like this in a terminal session:

# make sure we’re in our home directory
cd ~
# change into the ShinyApps directory
cd ShinyApps
# create a link from our app directory inside the package
ln -s /home/sellorm/R/x86_64-pc-linux-gnu-library/3.4/shinyAppDemo/shinyApp ./testApp

Note: The path you will find your libraries in will differ from the above. Check by running .libPaths()[1] and then dir(.libPaths()[1]) to see if that’s where your packages are installed.

Once this is done, the app should be available at ‘http://<server-address>:3838/<username>/testApp’ and can be updated by updating the installed version of the package: update the package, and the changes will be published via Shiny Server straight away.

Apps served from a central location (usually /srv/shiny-server)

This is essentially the same as above, but the task of publishing the application generally falls to an administrator of some sort.

Since they would have to transfer files to the server and log in anyway, it shouldn’t be too much of an additional burden to install a package while they’re there. Especially if that makes life easier from then on.

The admin would need to transfer the package to the server, install it and then create a link — just like in the example above — from the expected location, to the installed location.

The great thing with this approach is that when updates are due to be installed the admin only has to update the installed package and not any other files.

RStudio Connect

Connect is the next generation Shiny Server. In terms of features and performance, it’s far superior to its predecessor. One of the best features is the ability to push Shiny app code directly from the RStudio IDE. For the vast majority of users, this is a huge productivity boost, since you no longer have to wait for an administrator to publish your app for you.

Since publishing doesn’t require anyone to log into the server directly as part of the publishing process, there aren’t really any straightforward opportunities to install a custom package. This means that, in general, publishing a packaged Shiny application isn’t really possible.

There’s only one real workaround for this situation that I’m aware of. If you have an internal CRAN-like repository for your custom packages, you should be able to use that to update Connect, with a little work.

You’d need to have your dev environment and Connect hooked up to the same repo, with the updated app package available in that repo and installed in your dev environment. You could then publish, and subsequently re-publish, the single-line app.R for each successive version of the package.

Connect uses packrat under the hood, so when you publish the app.R the packrat manifest will also be sent to the server. Connect will use the manifest to decide which packages are required to run your app. If you’re using a custom package this would get picked up and installed or updated during deployment.

It’s not currently possible to publish a packaged application to shinyapps.io. You’d need to make sure your app followed the accepted conventions for creating Shiny apps and used only the standard files, rather than any custom packages.


Packaging Shiny apps can be a real productivity boon for you and your team. In situations where you can integrate that process into other processes, such as automatically running your unit tests or automated publishing, it can also help you adopt devops-style workflows.

However, in some instances, the practice can actually make things worse and really slow you down. It’s essential to understand what the publishing workflow is in your organisation before embarking on any significant Shiny packaging project as this will help steer you towards the best course of action.

If you would like to find out how we can help you with Shiny, get in touch with us:


A definition of Data Science

Much of my time is spent talking to organisations looking to build a data science capability, or generally looking to use analytics to drive better decision making. As part of this, I’m often asked to present on a range of topics around data science. The two topics I’m asked to present on most are: ‘What is Data science?’ and ‘What is a Data Scientist?’. I thought I’d share how we at Mango define what Data science is, along with the reasoning behind our definition.

Where did the term Data science come from?

Professor Jeff Wu, the Coca-Cola Chair in Engineering Statistics at the Georgia Institute of Technology, popularised the term ‘data science’ during a talk in 1997. Before this, the term ‘statistician’ was widely used instead. Professor Wu felt that the title ‘statistician’ no longer covered the array of work being done by statisticians, and that ‘data scientist’ better encapsulated the multi-faceted role.

So, surely defining what a Data Scientist is and does should be a simple task: just bring up an image of Professor Wu, reference his 1997 lecture and ask for questions. However, the original definition has evolved since then and, in fact, most data scientists I meet are unfamiliar with Professor Wu.

What does Data Science mean today?

As mentioned, what ‘data science’ meant originally and what it means today are two very different things. To develop Mango’s definition of data science, we looked to the wider community to see what they were saying.

Twitter has given us some great definitions.

One early definition of the data scientist comes from Josh Wills, current Director of Data Engineering at Slack. Back in 2012, Josh described a data scientist as a “person who is better at statistics than any software engineer and better at software engineering than any statistician”.

This speaks more directly to the data scientist being a ‘merging’ of different skillsets – a mix of a ‘statistician’ and ‘software engineer’.

Drew Conway, now CEO of Alluvium, took this concept further with his heavily used Venn diagram of data science skills.

Beyond these definitions, I’ve heard a range of blunt comments about what a Data Scientist is and isn’t.

For example, at a recent data science event a speaker announced that “if you haven’t got a PhD then you’re not a data scientist”, which, of course, caused a fair amount of upset across the room of non-PhD-data-scientists!

Our interest at Mango in defining and understanding what a Data Scientist is, stems from the need to hire new talent. How do we describe the job? What skills must they have? Are our expectations too high?

We’ve seen some unrealistic job descriptions that say a data scientist should be able to:

  • Understand every analytic algorithm from the statistical or computer science world, including machine learning, deep learning and whatever other algorithm the hiring company has just read about in a blog post
  • Be an expert in a range of technologies including R, Python, Spark, Julia and a veritable zoo-ful of Apache projects
  • Be equally comfortable discussing complex error structures or speaking to the chief execs about analytic strategy

These people just don’t exist.

To me, most definitions of a data scientist seem detached from an agreed definition of data science. If a data scientist is someone who does data science, then surely we need to agree on what that is before we can understand the skills needed to perform it successfully?

Drumroll please…

As per my earlier statement, it is clear that data science has come to represent a lot more than Professor Wu’s original definition. At Mango, after countless arguments (sorry, heated discussions), we arrived at the following (very carefully worded) definition:

Data Science is…the proactive use of data and advanced analytics to drive better decision making.

The four key parts


‘Data’

I might be stating the obvious here, but we can’t do data science without the data. What’s interesting is that data science is often associated with the extremes of Doug Laney’s famous ‘3 V’s’:

  • Volume – the size of data to be analysed, driving data science’s ongoing association with the world of ‘big data’
  • Variety – with algorithms focused on analysing a range of structured and unstructured data types (e.g. image, text, video) being developed faster perhaps than the business cases are understood
  • Velocity – the speed at which new data is created and speed of decision therefore required, leading to stream analytics and increased usage of machine learning approaches

However, to my mind data science is equally applicable to small, rectangular, static datasets.

‘Advanced analytics’

Generally, analytics can be thought of in four categories:

  • Descriptive Analytics: the study of ‘what happened?’ This is largely concerned with the reporting of results and summaries, via static reports or interactive dashboards, and is more commonly referred to as ‘Business Intelligence’
  • Diagnostic Analytics: a study of why something happened. This typically involves feature engineering, model development etc.
  • Predictive Analytics: the modelling of what might happen under different circumstances. This is a mechanism for understanding possible outcomes and the certainty (or lack of) with which we can make predictions
  • Prescriptive Analytics: the analysis of the ‘optimum’ ways to act in order to minimise or maximise a desired outcome.

As we progress through these categories, the complexity increases, and hopefully the value added to the business as well. But this isn’t a list of steps – you could jump straight to predictive or prescriptive analytics without touching on either descriptive or diagnostic.

It’s important to note that data science is focused on advanced analytics; using the above definitions, this means everything beyond descriptive analytics.


‘Proactive’

‘Proactive’ was included to distinguish data science from the more traditional ‘statistical analysis’. In my experience, when I started my career as a statistician in industry, an organisation’s analytic function was a largely ‘reactive’ practice. Modern data science needs to be an active part of the business function and look for ways to improve the business.

‘To drive better decision making’

I think the last part of the definition is the most important. If we ignore it, there’s a danger of doing the expensive cool stuff without actually adding any value. With organisations investing heavily in data science, we as an industry need to deliver; otherwise data science as a phrase may become associated with high-cost initiatives that never truly add value.

We need to be very clear about something: we can use the best tech, leverage the most clever algorithms, and apply them to the cleanest data, but unless we change the way something is done then we’re not adding value. To move the needle with data science, we need to positively impact the way the business does something.

So, what is a Data Scientist?

Each part of our definition hints at a particular skill that’s needed:

  • Data: ability to manipulate data across a number of dimensions (volume, variety, velocity)
  • Advanced analytics: understanding of a range of analytic approaches
  • Proactive: communication skills that allow us to interact with the business
  • Decision making: the ability to turn analytic thinking (e.g. models) into production code so they can be embedded in systems that deliver insight or action

If data science, as a proactive pursuit, is concerned with meeting a range of business challenges, then a data scientist must understand, at least at the level of what’s possible, a wider range of analytic approaches.

So… we just need to hire Unicorns?

From what I’ve said so far, it may sound like you just need to hire people who understand every analytic technique, can code in every language, and so on.

I’ve been interviewing prospective Data Scientists for more than 15 years and I can safely say that data science ‘unicorns’ don’t exist (unless you know one, and they’re interested in a role – in which case, please contact me!).

The fact that unicorns don’t exist leads to a very important part of data science: Data Science is a Team Sport!

While we can’t hire people with all the skills required, we can hire data scientists with some of the required skills, and then create a team of complementary skillsets. This way we can create a team that, as a collective, contains all of the skills required for data science. How to successfully hire this team is a whole other blog post (keep your eyes peeled)!

Do you know where you currently sit with your skills and knowledge? Take our Data Science Radar quiz to find out!

If you’re looking at building your company’s data science capabilities, the Mango team have helped organisations across a range of industries around the world build theirs. The key is having the right team and the right guidance to ensure your analytics are in line with your objectives. That’s where we come in, so contact us today:

We are excited to announce the speakers for this year’s EARL London Conference!

Every year, we receive an immense number of excellent abstracts and this year was no different – in fact, it’s getting harder to decide. We spent a lot of time deliberating and had to make some tough choices. We would like to thank everyone who submitted a talk – we appreciate the time taken to write and submit; if we could accept every talk, we would.

This year, we have a brilliant lineup, including speakers from Auto Trader, Marks and Spencer, Aviva, Google, Ministry of Defence and KPMG. Take a look below at our illustrious list of speakers:

Full length talks
Abigail Lebrecht, Abigail Lebrecht Consulting
Alex Lewis, Africa’s Voices Foundation
Alexis Iglauer, PartnerRe
Amanda Lee, Merkle Aquila
Andrie de Vries, RStudio
Catherine Leigh, Auto Trader
Catherine Gamble, Marks and Spencer
Chris Chapman, Google
Chris Billingham, N Brown PLC
Christian Moroy, Edge Health
Christoph Bodner, Austrian Post
Dan Erben, Dyson
David Smith, Microsoft
Douglas Ashton, Mango Solutions
Dzidas Martinaitis, Amazon Web Services
Emil Lykke Jensen, MediaLytic
Gavin Jackson, Screwfix
Ian Jacob, HCD Economics
James Lawrence, The Behavioural Insights Team
Jeremy Horne, MC&C Media
Jobst Löffler, Bayer Business Services GmbH
Jo-fai Chow
Jonathan Ng, HSBC
Kasia Kulma, Aviva
Leanne Fitzpatrick, Hello Soda
Lydon Palmer, Investec
Matt Dray, Department for Education
Michael Maguire, Tusk Therapeutics
Omayma Said, WUZZUF
Paul Swiontkowski, Microsoft
Sam Tazzyman, Ministry of Justice
Scott Finnie, Hymans Robertson
Sean Lopp, RStudio
Sima Reichenbach, KPMG
Steffen Bank, Ekstra Bladet
Taisiya Merkulova, Photobox
Tim Paulden, ATASS Sports
Tomas Westlake, Ministry Of Defence
Victory Idowu, Aviva
Willem Ligtenberg, CZ

Lightning Talks
Agnes Salanki
Andreas Wittmann, MAN Truck & Bus AG
Ansgar Wenzel, Qbiz UK
George Cushen, Shop Direct
Jasmine Pengelly, DAZN
Matthias Trampisch, Boehringer Ingelheim
Mike K Smith, Pfizer
Patrik Punco, NOZ Medien
Robin Penfold, Willis Towers Watson

Some numbers

We thought we would share some stats from this year’s submission process:

This is based on a combination of titles, photos and pronouns.


We’re still putting the agenda together, so keep an eye out for that announcement!


Early bird tickets are available until 31 July 2018 – get yours now.


Data Science has come to represent the proactive use of data and advanced analytics to drive better decision making. While there is broad agreement around this, the skillsets of a Data Scientist are still something that generates debate (and endless venn-diagram-filled blog posts).

A common element around this debate is the frequent exclusion criteria placed on the role. Something like, “if someone has this skill/qualification then they are not a Data Scientist”, which is typically stated confidently by a self-identified Data Scientist who has — surprise, surprise — exactly the skill/qualification in question. Some recent examples of this that I’ve experienced, include:

  • If you’re not a statistician you’re not a Data Scientist
  • If you can’t build a Recommender Engine you’re not a Data Scientist
  • If you don’t have a PhD you’re not a Data Scientist

For the record, I know some fantastic Data Scientists who:

  • Wouldn’t self-identify as a Statistician (e.g. they come from machine learning background)
  • Have never needed to build a Recommender Engine (maybe because the area they work in has never required that)
  • Don’t have a PhD (or an MSc)

Now, I’m not saying “everyone is a Data Scientist” and I do think there’s an inherent danger in not defining some sort of criteria. However, with the money to be made in the world of Data Science, it’s no wonder that we’re in a situation where consultants with any sort of data skills are re-badging themselves as Data Scientists and increasing their day rates.

The concern here, of course, is that organisations will invest in non-Data-Sciencey-Data-Scientists (we’re getting pretty technical here), but not see the value they expected. This could ultimately have a negative impact on the world of Data Science in the same way that the Big Data world has been tainted by examples of over-investment in Big Data tech (people tend to be saying “hey, let’s build a data lake” a little more sheepishly than a few years ago).

So, what makes a Data Scientist a Data Scientist? Without specifying technologies you must use, algorithms you must know, or qualifications you must have, there appears to be some consensus around ‘minimum skills’ (although please let me know if you disagree):

Advanced Analytics

The word ‘analytics’ is incredibly broad and encompasses everything from adding up a few numbers to fitting advanced mathematical models. I feel a Data Scientist is someone who applies advanced analytic techniques, such as predictive or prescriptive approaches based on statistics or machine learning.

While Business Intelligence is vital, I think that someone who spends their time building dashboards but not modelling would not be a Data Scientist.

Broad vs Deep Methodology

Many ‘statistical’ roles in the last few decades were largely reactive, in that their remit was narrow and long-established. This meant that the range of analytic techniques would likely also be narrow and statisticians ended up with a deep knowledge in a particular methodology rather than a broad understanding of analytic approaches.

For example, in my first role I almost exclusively used linear models, whereas in my next role it was all about survival models. As a Data Scientist is being asked to proactively solve a wider range of problems, they at least need an appreciation of the broader possibilities and the ability (to some extent) to be able to consume and apply a new methodology (once assumptions of those methods are understood).

Coding, not scripting

There’s a significant difference between someone who ‘writes scripts’ and someone who can really code. In the early 2000s, I spent a great deal of time as a statistician writing SAS code, where my primary output was ‘insight’ and the code I wrote was more of a by-product of what I did as opposed to a deliverable. For what it’s worth, I wouldn’t class that earlier version of me as a ‘Data Scientist’.

I think a Data Scientist is more of a programmer, where the code they write is part of what they deliver, and therefore needs to be scalable and written with formal development practices in mind. I’m not saying that every Data Scientist needs to be a master of DevOps (although that would be nice!), but some element of coding rigour should be essential.

Doing Science

The last thing that, for me, sets Data Scientists apart is the way they approach a challenge. When hiring, I’m looking for someone who fundamentally sees data as an opportunity and has an inherent curiosity about what insight that data will contain. Beyond that, a Data Scientist’s approach is fundamentally a ‘scientific’ one, where assumptions are created and tested using the data and available analytic methodologies.

Exclusion criteria such as the ones above may feel unfair, but I think they are necessary if we are to delineate the career of a Data Scientist and distinguish it from the variety of other data roles. Ultimately, the criteria will vary because each organisation will need different types of Data Scientists to achieve their goals. However, establishing a base for what would be considered a ‘Minimally Viable Data Scientist’ feels vital to the success of Data Science as a whole.

Where to from here?

If you’re looking to become a Data Scientist, then I hope this helps you to understand the skills needed.

If you’re looking to hire a Data Scientist, then make sure you know what you are trying to achieve and check potential hires have the skills required to deliver the value you’re expecting.

How Mango can help

We’ve been helping companies with their data analysis since 2002. We now also work with organisations to build their data science capability and develop their data science strategies. Talk to us about how we can help you make the most of your data to strengthen your business:

Effective Data Analytics In Manufacturing

Data analytics is rapidly changing the face of manufacturing as we know it. At Mango, we’re seeing companies using their data effectively to gain an advantage over competitors.

These companies are using data science to properly set up and control manufacturing – for example, automatically adjusting parameters for specific parts and production lines to decrease wastage and meet demand. Research has shown that 68% of manufacturers were already investing in data science to achieve a range of improvements. That means around a third of manufacturers still haven’t adopted a data-driven approach, and are therefore missing out on leaner, smarter operations, improved yields and reduced costs for an increased bottom line.

We know that manufacturing is an asset-intensive industry and companies need to move fast, be more innovative and work smart in order to be competitive. To remain ahead of the game, manufacturers need to adopt a different way of thinking when it comes to data. However, any transition from the industrial to the digital age can be both daunting and a minefield.

Too much data

One of the main problems for many companies – especially within the manufacturing sector – is the speed at which they are collecting massive amounts of real-time data, making it hard to work out what data is actually important. This is even harder without the right tools.

A solution – building a data science capability

To understand their data better, many organisations have started to build teams of Data Scientists. The Data Scientist is becoming an increasingly valuable asset within any organisation looking to make the most of its data.

The aim of building a data science capability is to harvest and analyse the data being collected to drive business change. However, many companies struggle to get the right skillsets into their team. In response to this need, we developed the Data Science Radar. The Radar is a conceptual framework exploring character traits; it is a visual aid that helps our customers build and shape a data science team, identify gaps in skillsets and monitor learning needs. The application has been such a success that we provide it free to help companies start their data-driven journey. Take a look at the Data Science Radar here:

Choosing the right tools for the job

Data science requires tools that go beyond the capabilities of spreadsheet programs like Excel, which is still often the tool used for data analysis in manufacturing. It is a common but false belief that the only alternatives are expensive off-the-shelf software packages, which can differ greatly in terms of cost, usability, data capacity and visualisation capabilities.

While we use a range of cutting-edge tools for our projects, we often recommend one used around the world by thousands of analysts: the open source R language. From computational science to marketing analytics, R is one of the most popular analytic languages in the world today and a fundamental analytic tool within a range of industries. The growth and popularity of R has been helping data-driven organisations succeed for years.

Our knowledge, experience and passion for Data Science means we have engaged in some truly amazing analytic projects. We understand the challenges faced by the manufacturing industry and have worked with companies all over the world to lower product development and operating costs, increase production quality, improve customer experience, and improve manufacturing yields – all using the power of R!

Analytics for non-technical stakeholders

Visualisation tools communicate the results of analytics clearly and precisely. You may have heard Shiny mentioned in discussions between your data analysts, or noticed it in some of the case studies below. But what is Shiny?

Shiny combines the computational power of R with the interactivity of the modern web. It is a powerful and popular web framework that lets R programmers elevate the way people, both technical and non-technical decision makers, consume analytics.

R allows data scientists to analyse large amounts of real-time data effectively, while Shiny presents the outputs so that non-technical stakeholders can easily review and filter them. The resulting applications can then be hosted on a client’s own servers or via RStudio’s hosting service.
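To illustrate just how little code a basic application needs, here is a minimal sketch of a Shiny app. The slider, random data and histogram are illustrative choices for this example only, not taken from any of the case studies below:

```r
library(shiny)

# UI: a slider for the sample size and a placeholder for the plot
ui <- fluidPage(
  titlePanel("What-if analysis: sample size"),
  sliderInput("n", "Sample size", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# Server: re-draws the histogram whenever the slider moves
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("n =", input$n), xlab = "Value")
  })
}

shinyApp(ui, server)  # launches the app in the browser
```

Everything here is plain R: no HTML, CSS or JavaScript is required, although all three can be layered on for finer control.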

Here are just a few examples of our successful projects:

Mango delivered a large SAS to R migration project with a global semiconductor manufacturer. A complex Shiny application was created to replace the expensive SAS application software already in use, making it possible to exit a costly SAS licence and adopt modern analytic techniques. This resulted in improved production yields, reduced costs and production teams enthused by a modern production infrastructure.

Mondelez were using a SAS-based Roast Coffee Blend Generator. Mango used advanced prescriptive analytics to migrate the client to R, resulting in an optimised coffee recipe, improved yield quality and reduced production costs.

Mango helped a global agrochemical company by providing an in-depth code review of their Shiny application, including modification of code to improve performance. A pack of Shiny coding best practices was also developed by Mango for the client to reference in their future developments, thus helping them improve performance and yields.

Campden BRI have a large Consumer and Sensory Science department who perform comprehensive analysis on sensory and consumer data. Due to years of adding features to their existing database, the internal systems had come to rely on a restrictive ‘jigsaw of legacy code’. Using R, Mango helped rationalise the workflows and processes to provide a more robust solution, resulting in a neat application that users could operate intuitively. The team have streamlined their work and their use of software packages, saving time, money and effort.

Names have been removed where required; more case examples can be found on our website.

Why Mango?

Mango Solutions have been long-term trusted partners with companies in a wide range of industries, including Manufacturing, Pharmaceutical, Retail, Travel, Automotive, Finance, Energy and Government since 2002. Our team of Data Scientists, Data Engineers, Technical Architects and Software Developers deliver independent, forward thinking, critical, predictive and prescriptive analytical solutions.

Mango have helped hundreds of companies reap the business gains that come from effective data science, because our unique mix of technical and commercial real-world experience ensures best-practice approaches.

Are you ready to become data-driven? Please contact Christina Halliday today for an obligation-free conversation:

*RStudio, the creators of Shiny and its commercial products, is a partner of Mango Solutions.

ANNOUNCEMENT: EARL London 2018 + abstract submissions open!

14 February 2018

Mango Solutions are delighted to announce that loyalty programme pioneer and data science innovator, Edwina Dunn, will keynote at the 2018 Enterprise Applications of the R Language (EARL) Conference in London on 11-13 September.

Mango Solutions’ Chief Data Scientist, Richard Pugh, has said that it is a privilege to have Ms Dunn address Conference delegates.

“Edwina helped to change the data landscape on a global scale while at dunnhumby; Tesco’s Clubcard, My Kroger Plus and other loyalty programmes have paved the way for data-driven decision making in retail,” Mr Pugh said.

“Having Edwina at EARL this year is a win for delegates, who attend the Conference to find inspiration in their use of analytics and data science using the R Language.

“In this centenary year of the 1918 Suffrage Act, Edwina’s participation is especially appropriate, as she is the founder of The Female Lead, a non-profit organisation dedicated to giving women a platform to share their inspirational stories,” he said.

Ms Dunn is currently CEO at Starcount, a consumer insights company that combines the science of purchase and intent, bringing the voice of the customer into the boardroom.

The EARL Conference is a cross-sector conference focusing on the commercial use of the R programming language with presentations from some of the world’s leading practitioners.

More information and tickets are available on the EARL Conference website:


For more information, please contact:
Karis Bouher, Marketing Manager: or +44 (0)1249 705 450