Blogs home Featured Image

Our next interviewee is Patrik Punco, Marketing Analyst at German media company, NOZ Medien. Patrik is presenting a lighting talk ‘Subscription Analytics with focus on Churn Pattern Recognition in a German News Company’ at EARL London.

Ruth Thomson, Mango’s Practice Lead for Strategic Advice chatted to Patrik about the business need for his project, what value it created for the business and any learnings and recommendations he had for other businesses interested in using predictive analytics and machine learning to reduce churn.

Firstly and most importantly, thanks Patrik for this interview. We’d love to know what was the business need or opportunity that prompted your project?

There is a structural change happening in media companies in Germany and also globally.

There is a decline in sales of print products with an increase in digital products. As with many media companies, our print products still provide significant income and we want to reduce the decline of print customer numbers with churn prevention strategies.

If we can identify which customers are likely to churn and why, we can put in place targeted customer loyalty activities to both improve customer experience and reduce churn. We saw an opportunity to use predictive modelling and machine learning to achieve these goals.

What were the key elements of your project?

Most importantly, we were mindful of our customers privacy and made sure customers were fully informed about how their data would be used and had the appropriate permissions in place.

We started with a strong understanding of our business and our current churn reduction strategies. This understanding informed the 115 variables we identified as important in the customer lifecycle. We combined data from our SAP system with data from external sources such as delivery data. Next we tested different models to find the one which delivered the best results.

As a result, we were able to choose the 1% of customers most likely to churn to include in our customer loyalty activities.

We ran A/B tests to measure the impact of our work.

What was the impact? How did you measure value?

Using our A/B tests, we were able to quantify the reduction in churn and the reduction has been significant and financially valuable to the company.

Overall the project has been a success so much so that we are extending and building on the work.

What would you say were the critical elements that made the project a success?

R ecosystem helped us not only to implement predictive modelling but also to work much faster and efficiently. For example using data.table and Rcpp package reduced the aggregation of customer tenures runtime from over 30 minutes to less than 1 second.

Our Data Mining Methodology: We had a complex set of data to prepare for this project, it wasn’t simple or easy. I think the methodology we used was critical to the success of this project. We focused on maximisation of churn lift values that we obtain from the model compared to the overall performance of the print segment. On this basis, we were able to build the A/B testing strategies.

Understanding the Business: A key element of success is the understanding of our business. What type of business are we? Who are our customers? What are the policies that need to be factored in? Without that there is the risk that the project would not have been structured appropriately and not deliver the expected value. An example of what I mean is the exact understanding of the churn policy followed by the company which had a direct impact on definition and encoding of the response variable.

What other businesses could benefit from this type of use case?

Subscription businesses in all sectors would benefit from using advanced analytics to reduce churn. But the applicability of this use case is even wider than that, any business that has a churn reduction strategy could make it more effective with advanced analytics.

Thank you Patrik!

To hear Patrik’s talk in September and others like his, get your EARL London tickets now – early bird ticket sales end 31 July.


Blogs home Featured Image

The year 2017 has completely turned the film industry upside down. The allegations of harassment and sexual assault against Harvey Weinstein have raised the issue of sexism and misogyny in this industry to the eyes of the general public. In addition, it has helped raise awareness of the poor gender diversity and under-representation of women in Hollywood. One of the main problems posed by the low presence of women behind the camera is that this is then reflected in the fictional characters on screen: lots of movies portray women in an incomplete, stereotyped and biased way.

This post focuses on some key behind-the-camera roles to measure the evolution of gender diversity in the last decade – from 2007 until 2017. The roles I studied were: directorswritersproducerssound teamsmusic teamsart teamsmakeup teams and costume teams.

The whole code to reproduce the following results is available on GitHub.

Data frame creation – Web scraping

What I needed first was a list which gathered the names connected to film job roles for 50 movies. For each year between 2007 and 2017, I gathered the information about the 50 most profitable movies of the year from the IMDb website.

As a first step, I built data frames which contained the titles of these movies, their gross profit and their IMDb crew links – which shows the names and roles of the whole movie crew. The following code is aimed at building the corresponding data frame for the 50 most profitable movies of 2017.


url <- ",2017-12-31&sort=boxoffice_gross_us,desc"
page <- read_html(url)

# Movies details
movie_nodes <- html_nodes(page, '.lister-item-header a') 
movie_link <- sapply(html_attrs(movie_nodes),`[[`,'href')
movie_link <- paste0("", movie_link)
movie_crewlink <- gsub("[?]", "fullcredits?", movie_link) #Full crew links
movie_name <- html_text(movie_nodes)
movie_year <- rep(2017, 50)
movie_gross <- html_nodes(page, '.sort-num_votes-visible span:nth-child(5)') %>%

# CREATE DATAFRAME: TOP 2017 ----------------------------------------------

top_2017 <- data.frame(movie_name, movie_year, movie_gross, movie_crewlink, stringsAsFactors = FALSE)

Let’s have a look at the top_2017 data frame:

##                                movie_name movie_year movie_gross
## 1 Star Wars: Episode VIII - The Last Jedi       2017    $620.18M
## 2                    Beauty and the Beast       2017    $504.01M
## 3                            Wonder Woman       2017    $412.56M
## 4          Jumanji: Welcome to the Jungle       2017    $404.26M
## 5         Guardians of the Galaxy: Vol. 2       2017    $389.81M
## 6                   Spider-Man Homecoming       2017    $334.20M
##                                                   movie_crewlink
## 1
## 2
## 3
## 4
## 5
## 6

I adapted the previous code in order to build equivalent data frames for the past 10 years. I then had 11 data frames: top2017top2016, …, top2007, which gathered the names, years, gross profit and crew links of the 50 most profitable movies of each year.

I combined these 11 data frames into one data frame called top_movies.

List creation – Web scraping

After that, I had a data frame with 550 rows, and I next needed to build a list which gathered:

  • the years from 2007 to 2017
  • for each year, the names of the top 50 grossing movies corresponding
  • for each movie, the names of the people whose job was included in one of the categories I listed above (director, writer, costume teams)

In order to build this list, I navigated through all the IMDb full crew web pages stored in our top_movies data frame, and did some web scraping again to gather the information listed above.

movies_list <- list()

for (r in seq_len(nrow(top_movies))) {
  # FOCUS ON EACH MOVIE -----------------------------------------------------------------
  movie_name <- top_movies[r, "movie_name"]
  movie_year <- as.character(top_movies[r, "movie_year"])
  page <- read_html(as.character(top_movies[r, "movie_crewlink"]))
  # GATHER THE CREW NAMES FOR THIS MOVIE ------------------------------------------------
  movie_allcrew <- html_nodes(page, '.name , .dataHeaderWithBorder') %>%
  movie_allcrew <- gsub("[\n]", "", movie_allcrew) %>%
    trimws() #Remove white spaces 
  # SPLIT THE CREW NAMES BY CATEGORY ----------------------------------------------------
  movie_categories <- html_nodes(page, '.dataHeaderWithBorder') %>%
  movie_categories <- gsub("[\n]", "", movie_categories) %>%
    trimws() #Remove white spaces
  ## MUSIC DEPARTMENT -------------------------------------------------------------------
  movie_music <- c()
  for (i in 1:(length(movie_allcrew)-1)){
    if (grepl("Music by", movie_allcrew[i])){
      j <- 1
      while (! grepl(movie_allcrew[i], movie_categories[j])){
        j <- j+1
      k <- i+1
      while (! grepl(movie_categories[j+1], movie_allcrew[k])){
        movie_music <- c(movie_music, movie_allcrew[k])
        k <- k+1
  for (i in 1:(length(movie_allcrew)-1)){
    if (grepl("Music Department", movie_allcrew[i])){
      j <- 1
      while (! grepl(movie_allcrew[i], movie_categories[j])){
        j <- j+1
      k <- i+1
      while (! grepl(movie_categories[j+1], movie_allcrew[k])){
        movie_music <- c(movie_music, movie_allcrew[k])
        k <- k+1
  if (length(movie_music) == 0){
    movie_music <- c("")
  ## IDEM FOR OTHER CATEGORIES ---------------------------------------------------------
  movie_info <- list()
  movie_info$directors <- movie_directors
  movie_info$writers <- movie_writers
  movie_info$producers <- movie_producers
  movie_info$sound <- movie_sound
  movie_info$music <- movie_music
  movie_info$art <- movie_art
  movie_info$makeup <- movie_makeup
  movie_info$costume <- movie_costume
  movies_list[[movie_year]][[movie_name]] <- movie_info


Here are some of the names I collected:

## - Star Wars VIII 2017, Director:
## Rian Johnson
## - Sweeney Todd 2007, Costume team:
## Colleen Atwood, Natasha Bailey, Sean Barrett, Emma Brown, Charlotte Child, Charlie Copson, Steve Gell, Liberty Kelly, Colleen Kelsall, Linda Lashley, Rachel Lilley, Cavita Luchmun, Ann Maskrey, Ciara McArdle, Sarah Moore, Jacqueline Mulligan, Adam Roach, Sunny Rowley, Jessica Scott-Reed, Marcia Smith, Sophia Spink, Nancy Thompson, Suzi Turnbull, Dominic Young, Deborah Ambrosino, David Bethell, Mariana Bujoi, Mauricio Carneiro, Sacha Chandisingh, Lisa Robinson

Gender determination

All of the names I needed to measure the gender diversity of were now gathered in the list movies_list. Then, I had to determine the gender of almost 275,000 names. This is what the R package GenderizeR does: “The genderizeR package uses API to predict gender from first names”. At the moment, the database contains 216286 distinct names across 79 countries and 89 languages. The data is collected from social networks from all over the world, which ensure the diversity of origins.

However, I am aware that determining genders based on names is not an ideal solution: some names are unisex, some people do not recognise themselves as male or female, and some transitioning transgender people still have their former name. But this solution was the only option I had, and as I worked on about 275,000 names, I assumed that the error induced by the cases listed above was not going to have a big impact on my results.

With this in mind, I used the GenderizeR package and applied its main function on the lists of names I gathered earlier in movies_list. The function genderizeAPI checks if the names tested are included in the database and returns:

  • the gender associated with the first name tested
  • the counts of this first name in database
  • the probability of gender given the first name tested.

The attribute I was interested in was obviously the first one, the gender associated with the first name tested.

The aim was to focus on every category of jobs, and to count the number of males and females by category, film and year. With the script below, here is the information I added to each object movies_list$year$film:

  • the number of male directors
  • the number of female directors
  • the number of male producers
  • the number of female producers
  • the number of males in costume team
  • the number of females in costume team

The following code shows how I determined the gender of the directors’ names for every film in the movie_list. The code is similar for all the other categories.

# for each year
for (y in seq_along(movies_list)){ 
  # for each movie
  for (i in seq_along(movies_list[[y]])){
# Genderize directors -----------------------------------------------------
    directors <- movies_list[[y]][[i]]$directors
    if (directors == ""){
      directors_gender <- list()
      directors_gender$male <- 0
      directors_gender$female <- 0
      movies_list[[y]][[i]]$directors_gender <- directors_gender
      # Split the firstnames and the lastnames
      # Keep the firstnames
      directors <- strsplit(directors, " ")
      l <- c()
      for (j in seq_along(directors)){
      l <- c(l, directors[[j]][1])
      directors <- l
      movie_directors_male <- 0
      movie_directors_female <- 0
      # Genderize every firstname and count the number of males and females 
      for (p in seq_along(directors)){
        directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
        gender <- directors_gender$response$gender
        if (length(gender)>0 && gender == "male"){
          movie_directors_male <- movie_directors_male + 1
        if (length(gender)>0 && gender == "female"){
          movie_directors_female <- movie_directors_female + 1
      # Put the number of males and females in movies_list
      directors_gender <- list()
      directors_gender$male <- movie_directors_male
      directors_gender$female <- movie_directors_female
      movies_list[[y]][[i]]$directors_gender <- directors_gender
# Idem for the 7 other categories -----------------------------------------------------    


Here are some examples of the number of male and female names I collected:

## - Star Wars VIII 2017 
##  Number of male directors: 1 
##  Number of female directors: 0
## - Sweeney Todd 2007 
##  Number of male in costume team: 9 
##  Number of female in costume team: 20

Percentages calculation

Once I had all the gender information listed above, the next step was to calculate percentages by year. I then went through the whole list movies_list and created a data frame called percentages which gathered the percentages of women in each job category for each year.

Let’s have a look at the percentages data frame:

##    year women_directors women_writers women_producers women_sound
## 1  2017        3.571429      9.386282        23.03030    14.17497
## 2  2016        3.174603      9.174312        19.04762    14.02918
## 3  2015        6.000000     12.432432        21.19914    15.69061
## 4  2014        1.785714      8.041958        23.12634    14.89028
## 5  2013        1.886792     10.769231        22.86282    13.54005
## 6  2012        5.357143     10.227273        24.06542    12.33696
## 7  2011        3.846154      9.523810        19.73392    15.08410
## 8  2010        0.000000     10.526316        17.40088    16.06700
## 9  2009        7.407407     13.157895        21.24711    15.30185
## 10 2008        7.547170      9.756098        18.67612    14.70588
## 11 2007        3.333333      9.047619        17.42243    16.13904
##    year women_music women_art women_makeup women_costume
## 1  2017    22.46998  26.87484     68.22204      69.89796
## 2  2016    25.84896  25.04481     67.54386      69.44655
## 3  2015    20.46163  24.90697     68.83117      70.83333
## 4  2014    22.86967  22.31998     67.29508      67.47430
## 5  2013    20.46482  22.45546     63.88697      69.79495
## 6  2012    21.62819  20.90395     66.95402      68.83539
## 7  2011    18.09816  20.22792     70.09482      67.44548
## 8  2010    20.90137  22.38199     65.81118      68.72082
## 9  2009    19.15734  22.14386     61.15619      70.25948
## 10 2008    19.82984  21.80974     60.87768      71.20253
## 11 2007    19.64385  20.21891     59.23310      67.36035

Visualisation – gender diversity in 2017

I was then able to visualise these percentages. For example, here is the code I used to visualise the gender diversity in 2017.

# Formating our dataframe
percentages_t <- data.frame(t(percentages), stringsAsFactors = FALSE)
colnames(percentages_t) <- percentages_t[1, ]
percentages_t <- percentages_t[-1, ]
rownames(percentages_t) <- c("directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")

# Ploting our barplot
percentages_2017 <- percentages_t$`2017`
y <- as.matrix(percentages_2017)

p <- ggplot(percentages_t, aes(x = rownames(percentages_t),
                               y = percentages_2017, 
                               fill = rownames(percentages_t))) + 
  geom_bar(stat = "identity") +
  coord_flip() + # Horizontal bar plot
  geom_text(aes(label=format(y, digits = 2)), hjust=-0.1, size=3.5) + # pecentages next to bars
        plot.title = element_text(hjust = 0.5)) + # center the title
  labs(title = "Percentages of women in the film industry in 2017") +
  guides(fill = guide_legend(reverse=TRUE)) + # reverse the order of the legend
  scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxs

As we can see, in 2017, the behind-the-camera roles of both directors and writers show the most limited women occupation: less than 10% for writers and less than 4% for directors. This is really worrying considering that these are key roles which determine the way women are portrayed in front of the camera. Some studies have already shown that the more these roles are diversified in terms of gender, the more gender diversity is shown on screen.

Let’s go back to our barplot. Women are also under-represented in sound teams (14%), music teams (22.5%), producer roles (23%) and art teams (27%). The only jobs which seem open to women are the stereotyped female jobs of make-up artists and costume designers, among which almost 70% of the roles are taken by women.

Visualisation – gender diversity evolution through the last decade

Even if the 2017 results are not exciting, I wanted to know whether there had been an improvement through the last decade. The evolution I managed to visualise is as follows.

# From wide to long dataframe
colnames(percentages) <- c("year", "directors", "writers","producers", "sound",    
                           "music", "art", "makeup", "costume")
percentages_long <- percentages %>%
  gather(key = category, value = percentage, -year)
percentages_long$year <- ymd(percentages_long$year, truncated = 2L) # year as date 

# line plot
evolution_10 <- ggplot(percentages_long, aes(x = year,
                                             y = percentage,
                                             group = category,
                                             colour = category)) +
  geom_line(size = 2) +
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women in the film industry from 2007 to 2017",
       x = "",
       y = "Percentages")

The first thing I noticed is that the representativeness gap between the roles of make-up artists and costume designers and the other ones has not decreased in a flagrant way since 2007.

In addition, the roles that women are really under-represented – directors, writers and jobs related to sound, no improvement has been achieved.

If we focus on directors, we do not see any trend. Figures vary depending on the year we consider. For example in 2010, we notice that there are not any female directors among the 50 most profitable movies, and for other years it never goes beyond 7.5%. What is interesting for the role of director, the best levels of female representation were reached in 2008 and 2009. After these years the number of female directors has declined and never reached more than 6%. The percentage of women directors reached in 2017 is practically the same as the percentage reached in 2007.

We then notice an evenness in the number of female sound teams and writers: women consistently represent around 10% of writers and 15% of sound teams in the last decade. But there is no sign of improvement.

Only a slight improvement of 3-5% is notable among producers, music and art teams. But nothing astonishing.

Visualisation – gender diversity forecasting in 2018

The last step of our study was to forecast, at a basic level, these percentages for 2018. I used the forecast package and its function forecast, and then applied it to the data I collected between 2007 and 2017, in order to get this prediction:

# Time series
ts <- ts(percentages, start = 2007, end = 2017, frequency = 1)

# Auto forecast directors 2018
arma_fit_director <- auto.arima(ts[ ,2])
arma_forecast_director <- forecast(arma_fit_director, h = 1)
dir_2018 <- arma_forecast_director$fitted[1] # value predicted

# Idem for writers, producers, sound, music, art, makeup and costume

# Create a data frame for 2018 fitted values
percentages_2018 <- data.frame(year = ymd(2018, truncated = 2L), 
                               women_directors = dir_2018, 
                               women_writers = writ_2018, 
                               women_producers = prod_2018, 
                               women_sound = sound_2018,
                               women_music = music_2018,
                               women_art = art_2018,
                               women_makeup = makeup_2018,
                               women_costume = costu_2018, 
                               stringsAsFactors = FALSE)

# Values from 2007 to 2017 + 2018 fitted values
percentages_fitted_2018 <- bind_rows(percentages, percentages_2018)
# From wide to long dataframe
colnames(percentages_fitted_2018) <- c("year", "directors", "writers","producers", "sound",    
                                      "music", "art", "makeup", "costume")
percentages_long_f2018 <- percentages_fitted_2018 %>%
  gather(key = category, value = percentage, -year)
percentages_long_f2018$year <- ymd(percentages_long_f2018$year, truncated = 2L) # year as date

# Forecast plot for 2018 
forecast_2018 <- ggplot(percentages_long_f2018, aes(x = year,
                                                    y = percentage,
                                                    group = category,
                                                    colour = category)) +
  geom_line(size = 2)+
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women in the film industry from 2007 to 2017\n Fitted values for 2018",
       x = "",
       y = "Percentages")

The predicted values I got for 2018 are approximately the same as the ones I calculated for 2017. However, it is a basic forecast, and it does not take into consideration the upheaval which happened in the film industry in 2017. This will surely have an impact on the gender diversity in the film industry. But to what extent? Has general awareness been sufficiently increased to truly achieve change?

In any case, I sincerely hope that our forecasting is wrong and that a constant improvement will be seen in the next couple of years, so that female characters on cinema screens will become more interesting and complex.

Blogs home Featured Image

For today’s interview, Ruth Thomson, Practice Lead for Strategic Advice spoke to Willem Ligtenberg, Data Scientist at CZ, a health insurance company in the Netherlands.

Willem is presenting “Developing a Shiny application to predict work load” at EARL London and we got the chance to get a preview of some of the areas he will be covering.

Thanks Willem for taking the time for this interview. What was the business need or opportunity that led to this project?

We are a healthcare insurance company and in one specific department, it often took longer than 3 working days to process a claim.

We knew that if we could process a claim within 3 working days, customer satisfaction would increase, regardless of the outcome.

So, we wanted to improve customer satisfaction by processing claims quickly. However, this department often had a backlog of claims as it was hard to predict how many claims would be received and therefore the number of staff members needed to process those claims.

The processing of these specific claims is difficult to automate as the documents that are submitted are not standardised, an individual is required to review it.

How did you go about solving this problem?

The most important first step was understanding what we wanted to predict. It sounds simple but this detail was important. We realised we didn’t want to predict when a claim would arrive from the post but instead when the claim was ready for the department to process. This difference is very important.

Once we had clarified this important point, we had to prepare the data to ensure it was in the right format. We then tested different predictive models and finally chose Prophet, a forecasting model which is developed by Facebook. Next, we evaluated different models until we were happy with the result. To allow the business to generate their own forecasts we created a Shiny app.

The result of our work has been that claims are now processed by the department within 1 working day and the department is able to maintain optimum staffing levels to process those claims.

What value did your project deliver?

The most important value has been in customer satisfaction. Customer satisfaction increases from a 7.5 to an 8 when claims are processed within 3 days. This may not sound like much but it is a significant increase in this context. As a business, we highly value our customer satisfaction.

There has also been a reduction in the need for employing short-term temporary staff which has reduced costs.

Interestingly, we have also found that, by processing claims within 1 day, productivity has increased. We think that there is something interesting in the psychology behind being able to complete all your work in one day which might lead to people going the extra mile. For me, increased productivity was an unexpected benefit.

What would you say were the elements that made your project a success?

Getting the right data – like many insurance companies, we have a lot of data. The critical thing is choosing the right data to use for this specific use case.

The right model – we spent time finding the right model for the project to get the best result.

The user interface – the Shiny App was the ideal user interface because it allows the user to interact with the results by, for example, changing the date range. We also made sure that the users could export the results for use in the existing planning tools to maximize the value from the results.

What other businesses do you think would benefit from this use case?

Any business which has a need to predict or make forecasts! It will be most of value to help make decisions on staffing levels in bigger teams, say 30+ people and where there is a variability or seasonality, the Prophet model is really good for that.

More EARL interviews

We will be sharing some more EARL interviews with our speakers leading up to 11 September. We look forward to sharing speaker insights and more of what we can all expect September!

Early bird ticket sales will end 31st July – get yours now.

Blogs home Featured Image

As an R user who is building models and analysing data, one of the key challenges is how to make those results available to those who need it. After all, data science is about making better decisions, and your results need to get into the hands of the people who make those decisions.

For reporting, there are many options from writing Excel files to rmarkdown documents and shiny apps. Many businesses already have great reporting with a business intelligence (BI) tool. For them, it is preferable that you present your results alongside a number of other critical business metrics. Moreover, your results need to be refreshed daily. In this situation, you might be working with SQL developers to integrate your work. The question is, what is the best way to deliver R code to the BI team?

In this blog post, we will be looking at the specific case of deploying a predictive model, written in R, to a Microsoft SQL Server database for consumption by a BI tool. We’ll look at some of the different options to integrate R, from in-database R services, to pushing with ODBC or picking up flat files with SSIS.

The Problem

Flight delay planning

To demonstrate we’ll use the familiar flights dataset from the nycflights13 package to imagine that we are airport planners and we want to test various scenarios related to flight delays. Our data contains the departure delay of all flights leaving the New York airports: JFK, LGA, and EWR in 2013. I’m running this code on my Windows 10 laptop, where I have a local SQL Server 17 instance running, with a database called ml. If you want to reproduce the code you’ll need to have your own SQL Server setup (you can install it locally) and push the flights table there. Here’s a selection of columns:

SELECT TOP(5) flight, origin, dest, sched_dep_time, carrier, time_hour, dep_delay
FROM flights
flight origin dest sched_dep_time carrier time_hour dep_delay
1545 EWR IAH 515 UA 2013-01-01 05:00:00 2
1714 LGA IAH 529 UA 2013-01-01 05:00:00 4
1141 JFK MIA 540 AA 2013-01-01 05:00:00 2
725 JFK BQN 545 B6 2013-01-01 05:00:00 -1
461 LGA ATL 600 DL 2013-01-01 06:00:00 -6

We’ll fit a statistical model for the departure delay, and run simulations for the delay of future flights. We want to capture the natural variation from day to day so a useful approach here is a mixed-effects model where each day is a random effect.

model <- lme4::lmer(
    dep_delay ~ 1 +
      (1 | date:origin) +
      carrier +
      origin +
      sched_dep_time +
      distance +
    data = data_train

This is a simple model for demonstration purposes. For example, it doesn’t capture big delays (extreme values) well, but it will serve our purpose. The full model code and data prep is available at mangothecat/dbloadss so we won’t go through every line here.

To simulate delays on future flights we can call the simulate function. Here we’ll run 10 different scenarios:

sim_delays <- simulate(model, nsim = 10, newdata = data_test)

The reason we’re using simulate rather than predict is that we don’t just want the most likely value for each delay, we want to sample from likely scenarios. That way we can report any aspect of the result that seems relevant. A data scientist will ask: “how can I predict dep_delay as accurately as possible?”. An airport manager will want to know “how often will the last flight of the day leave after midnight?”, or another question that you haven’t thought of. This type of output lets the BI team address these sorts of question.

Package it up

At Mango, we believe that the basic unit of work is a package. A well-written package will be self-documenting, have a familiar structure, and unit tests. All behind-the-scenes code can be written into unexported functions, and user facing code lives in a small number (often one) of exported functions. This single entry point should be designed for someone who is not an experienced R user to run the code, and if anything goes wrong, be as informative as possible. R is particularly friendly for building packages, with the excellent devtools automating most of it, and the wonderfully short R packages book by Hadley guiding you through it all.

The code for this blog post lives in the dbloadss package available on GitHub. For the flights model, a single function is exported simulate_departure_delays, which is documented to explain exactly what it expects as input, and what it will output. The entire model runs with the single line:

output_data <- simulate_departure_delays(input_data, nsim = 20)

where the input_data is prepared from the database and output_data will be pushed/pulled back to the database.

Connecting to the Database

Push, Pull, or Pickup?

Once the model has been packaged and the interface decided, it remains to decide how to actually run the code. With SQL Server there are three options:

  1. Run the model from R and push the results to SQL Server using an ODBC connection.
  2. Call the model from SQL Server using a stored procedure to run an R script using R Services and pull the results back.
  3. Invoke an Rscript from SSIS and pickup flat files (csv).

Which you choose will depend on a number of factors. We’ll take some time to look at each one.

The Push (SQL from R)

The best way to talk to a database from R is to use the DBI database interface package. The DBI project has been around for a while but received a boost with R Consortium funding. It provides a common interface to many databases integrating specific backend packages to each separate database type. For SQL Server we’re going to use the odbc backend. It has great documentation and since Microsoft released ODBC drivers for Linux it’s a cinch to setup from most operating systems.

Let’s get the flights data from SQL Server:

con <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server = "localhost\\SQL17ML",
                 database = "ml")

flights <- dbReadTable(con, Id(schema="dbo", name="flights"))

I’ve included the the explicit schema argument because it’s a recent addition to DBI and it can be a sticking point for complicated database structures.

Now we run the model as above

output_data <- simulate_departure_delays(flights, nsim = 20, split_date = "2013-07-01")
## [1] 3412360       3

So for 20 simulations, we have about 3.5 million rows of output! It’s just a flight ID (for joining back to the source), a simulation ID, and a delay.

##      id sim_id dep_delay
## 1 27005      1 -49.25918
## 2 27006      1  23.09865
## 3 27007      1  28.84683
## 4 27008      1 -43.68340
## 5 27009      1 -44.98220
## 6 27010      1  37.62463

We’ll do all further processing in the database so let’s push it back.

# Workaround for known issue
dbRemoveTable(con, name = Id(schema = "dbo", name = "flightdelays"))

odbctime <- system.time({
               name = Id(schema = "dbo", name = "flightdelays"),
               value = output_data,
               overwrite = TRUE)
##    user  system elapsed 
##   10.08    0.47  114.60

That took under 2 minutes. This post started life as a benchmark of write times from odbc vs RODBC, an alternative way to talk to SQL Server. The results are on the dbloadss README and suggest this would take several hours! RODBC is usually fine for reads but we recommend switching to odbc where possible.

It is relatively straightforward to push from R and this could run as a scheduled job from a server running R.

The Pull (R from SQL)

An alternative approach is to use the new features in SQL Server 17 (and 16) for calling out to R scripts from SQL. This is done via the sp_execute_external_script command, which we will wrap in a stored procedure. This method is great for SQL developers because they don’t need to go outside their normal tool and they can have greater control about exactly what is returned and where it goes.

A word of warning before we continue. There’s a line in the Microsoft docs:

Do not install R Services on a failover cluster. The security mechanism used for isolating R processes is not compatible with a Windows Server failover cluster environment.

that somewhat limits the ultimate use of this technique in production settings as a failover cluster is a very common configuration. Assuming this might get fixed, or perhaps it’s not an issue for you, let’s see how it works.

I’m running SQL Server 2017 with Machine Learning Services. This installs its own version of R that you can access. To do so you have to enable “the execution of scripts with certain remote language extensions”. Then you need to install the dbloadss package somewhere that this version of R can see. This can require admin privileges, or alternatively, you can set the .libPaths() somewhere in the stored proc.

The following stored procedure is what we’ll add for our flight delays model:

use [ml];

DROP PROC IF EXISTS r_simulate_departure_delays;
CREATE PROC r_simulate_departure_delays(
    @nsim int = 20,
    @split_date date = "2013-07-01")
 EXEC sp_execute_external_script
     @language = N'R'  
   , @script = N'
    output_data <- simulate_departure_delays(input_data, nsim = nsim_r,
                                             split_date = split_date_r)
   , @input_data_1 = N' SELECT * FROM [dbo].[flights];'
   , @input_data_1_name = N'input_data'
   , @output_data_1_name = N'output_data'
   , @params = N'@nsim_r int, @split_date_r date'
   , @nsim_r = @nsim
   , @split_date_r = @split_date
        "id" int not null,   
        "sim_id" int not null,  
        "dep_delay" float not null)); 

The query that goes into @input_data_1 becomes a data frame in your R session. The main things to note are that you can pass in as many parameters as you like, but only one data frame. Your R script assigns the results to a nominated output data frame and this is picked up and returned to SQL server.

I believe it’s very important that the R script that is inside the stored procedure does not get too complicated. Much better to use your single entry point and put the complex code in a package where it can be unit tested and documented.

We then call the stored procedure with another query:

INSERT INTO [dbo].[flightdelays]
EXEC [dbo].[r_simulate_departure_delays] @nsim = 20

The performance of this method seems to be good. For write speeds in our tests it was faster even than pushing with odbc, although it’s harder to benchmark in the flights example because it includes running the simulation.

Overall, were it not for the issue with failover clusters I would be recommending this as the best way to integrate R with SQL Server. As it stands you’ll have to evaluate on your setup.

The Pickup (R from SSIS)

The final method is to use SSIS to treat running the R model as an ETL process. To keep things simple we use SSIS to output the input data as a flat file (csv), kick-off an R process to run the job, and pickup the results from another csv. This means that we’ll be making our R code run as a command line tool and using a csv “air gap”.

Running R from the command line is relatively straightforward. To handle parameters we’ve found the best way is to use argparser, also honourable mention to optparse. Checkout Mark’s blog post series on building R command line applications. After you’ve parsed the arguments everything is essentially the same as pushing straight to the database, except that you write to csv at the end. SSIS then picks up the csv file and loads it into the database. Performance is generally not as good as the other methods but in our experience it was close enough — especially for a batch job.

An example of what this script might look like is on GitHub. We can run this by doing (from the commandline):

> Rscript blog/flight_delay.R -n 10 -d '2017-07-01' -i flights.csv
Loading required package: methods
Running delay simulations:
   = FALSE
  help = FALSE
  verbose = FALSE
  opts = NA
  nsim = 10
  split_date = 2013-07-01
  input_path = flights.csv
  output_path = simulated_delays.csv
Reading... Parsed with column specification:
  .default = col_integer(),
  carrier = col_character(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  time_hour = col_datetime(format = "")
See spec(...) for full column specifications.
Read  336776  rows
Running simulations...
Writing  1706180  rows

When doing this from SSIS it can directly call Rscript and the arguments can be variables.

The SSIS solution has some great advantages in that it is controlled by the SQL developers, it has the greatest separation of technologies, and it’s easy to test the R process in isolation. Downsides are it’s unlikely that going via csv will be the fastest, and you need to be a little more careful about data types when reading the csv into R.

The Results

If everything goes well the results of your delays simulation will land in the database every night, and every morning reports can be built and dashboards updated. The results look something like this:

SELECT TOP(5) * FROM flightdelays
id sim_id dep_delay
27005 1 -27.48316
27006 1 46.50636
27007 1 65.94304
27008 1 -78.93091
27009 1 -17.86126

Your work is not totally done. There is one cost of getting the dashboards setup. The simulation results are not always the easiest to get your head around so it helps if you can setup the BI team with a few queries just to get started. For example: To generate a daily average delay for the airline UA, they would need something like the following:

WITH gp AS (
SELECT sim_id
      ,avg(fd.dep_delay) as mean_delay
  FROM dbo.flightdelays fd
  LEFT JOIN (SELECT *, convert(date, time_hour) as day_date FROM fs on =
  WHERE fs.carrier='UA'
  GROUP BY sim_id, origin, day_date

SELECT day_date, origin
     , avg(mean_delay) as mean_delay
     , min(mean_delay) as min_delay
     , max(mean_delay) as max_delay
GROUP BY origin, day_date
ORDER BY day_date, origin

So first you aggregate over each simulation, then you aggregate across simulations. Your BI team’s SQL is better than yours (and definitely mine) so this can be a useful way to get feedback on your code and it’s also a good way to explain what your model does to them (code speaks louder than words). Loading up a table like this into a BI tool (for example Power BI) you can get all the plots you’re used to.

Power BI Screenshot of Results

It turns out the daily variation means you don’t learn much from this plot. Maybe it’s time to go and look at that manager’s question about late flights after all.


After spending a bit of time delivering R models for business reporting here are my main takeaways:

  1. Data Science does not exist in a vacuum. Think about how to get your results into the hands of decision-makers from the beginning. Think about their tools and workflow and work back to how you can deliver.
  2. Always serve your code from a package with as few exported functions as possible. Make it as simple as possible for a non-R (or Python) user to run the code. Document everything that goes in and out and defend your code like crazy from bad input. The rest is your domain.
  3. Push as much processing as possible back to the database. It gives maximum flexibility for the BI team to produce all the reports that they might be asked for.
  4. When it works I really like the in-database R method. It gives the power to the database team and performs really well. Unfortunately right now the issue with failover clusters has been a deal-breaker for me and so falling back to SSIS has worked just fine.
Blogs home Featured Image

Ruth Thomson, Strategic Innovation & Strategic Advice

On Monday, I am very much looking forward to joining the panel discussion at the Insurance Data Science Conference at the Cass Business School alongside Kristen Dardia from Verisk Analytics and Hadrien Dykiel from R Studio.

If you’re reading this on the 16th July, and currently at the conference after seeing me make some very intelligent comments on the panel, do come and say hello at our stand or feel free to tweet us @mangothecat using #mangosolutions.

Returning to the current day, in preparation for the conference I have been enjoying reminding myself of the breadth of interesting work Mango Solutions has been doing to help the leading insurance companies harness the power of advanced analytics whilst also reviewing the current opportunities for advanced analytics and some of the challenges in the use of data science in the insurance industry.

With only 30 minutes available for the panel discussion, there is no possibility that I will ever be able to share at least a fraction of the insight I have been gathering and I hate a wasted opportunity so here is a short blog covering a couple of emerging themes, with input from the other companies represented on the panel, Verisk Analytics and R Studio.

Breaking Down The Silos

I can imagine that some actuaries might look at the hype around advanced analytics and data science and be quite amused that the rest of the world has finally recognised what actuarial science has known since the 17th century. The conference programme is a testament to the breadth and depth of advanced analytics being used across the industry.

There are of course, still exciting developments and opportunities to be made in actuarial science and now R has been added to the 2019 actuarial exam this will ensure that the new generation will formalise the skills they need to continue to evolve within the profession.

However, it fascinates me that, within a business, one area can be so advanced in their use of analytics whilst other areas of the business could still be making subjective decisions and working from excel spreadsheets.

Hadrien Dykiel from RStudio, summarised an opportunity that we see for our customers at Mango Solutions to use data science and advanced analytics for not just insurance but the business of insurance.

“Adopting new and emerging technologies can help organizations to not only take advantage of new opportunities, but also increase the efficiency of day-to-day operations”

We are finding there are significant gains to be made in both increasing revenue and reducing costs across our customer’s businesses from customer service to HR.

People and Culture

If you have ever worked with me, you’ll know I am endlessly fascinated and will often spend too much time exploring applications of data science, projects such as one using machine learning to improve flood risk assessment.

However, I am also frustrated to see how, in some cases the potential of these applications and use cases is not being fully realised and this is often due to the culture and readiness of the people within the business to embrace and finally adopt them.

Michael Regier, Lead Data Scientist at Verisk Analytics touches on this

“It comes down to building trusting relationships across business units so that there is more ease in accessing both the technology and data required.”

It is becoming clear to me, and many of our customers, how important is it for data scientists and analysts to have strong communication, influencing and collaboration skills to prevent their projects getting stuck or at worst ignored by the business users.

We have an Education Practice Lead at Mango Solutions who focuses not on just the core analytic skills but also the communication skills required to be successful.

At an enterprise level, we are working with financial service companies to help build career frameworks, capability assessments and upskilling programmes for data scientists.

It’s not just individual data scientists and analysts having the skills they need. In conversation with our insurance and financial services customers and other sectors, we are seeing the growing awareness of the need to build data-driven cultures from the very top of the business. We have been delivering a number of projects working at board and executive levels, to help build the cultures where advanced analytics can thrive.

I am very much looking forward to discussing some of these and more on the panel on Monday. If you are at the conference, come and say hello. If you can’t be there, we’d still love to hear about the opportunities and challenges you have in your business and explore how our consultants could help.

If you’d like to find out more, please get in touch.

Blogs home Featured Image

In case you hadn’t spotted it already, a major new release of ggplot2 hit CRAN last week. Among the many new features, the one that caught my attention was support for simple features. Or in other words, it’s now really easy to plot spatial data in ggplot2.

On Friday afternoon I spent a couple of hours playing with this new functionality as well as mapview, a package I hadn’t come across before but discovered on Friday. This blog is a quick walkthrough to show you how simple spatial visualisation in R really is.

In case you want to follow along, the full list of packages used were:


The Data

Whenever I start to play with new functions I always try to use my own data. In this case, thanks to a recent project, I happened to have a shape file for Italy to hand (You can download it for yourself from Istat).

Initially, I just plotted the region level data with random colours (as you may have already seen on Twitter) but to reflect my usual workflow I found some data on the amount of wine produced by region – it was Friday afternoon after all!

I won’t go into all the detail of the web scraping and data manipulation, but the code is included throughout so that you can replicate this example.

wine <- read_html("")
wine %<>% 
  html_table() %>% 
  extract2(1) %>%
wine[,-1] %<>% map_df(parse_number)

This gave me an interesting dataset to play with that wasn’t actually too dissimilar to what I often end up trying to plot i.e. change over time.

1 Abruzzo 2443 2728 2273 2936 2937
2 Basilicata 189 178 102 87 93
3 Calabria 400 370 314 404 391
4 Campania 1542 1644 1183 1614 1286
5 Emilia Romagna 6273 7396 6958 6752 7039
6 Friuli-Venezia Giulia 1281 1073 1367 1872 1856

Importing Shape Files

Now for the interesting part.

To import the spatial data I used the sf package. This package makes reading the data very simple and stores it in a way that is much easier to reason about, especially for data that is intended to provide regions.

To import the shape file we can use the st_read function, in this case pointing to the file location, although you can also point to a database. Once the data is imported we have a single row for each region. The data includes a column geometry. This is where the information related to the grouping is defined. In this case the geometry stores (as a list-column) all of the information relating to polygon of the region. But, this could also be a single point or a whole host of other information. Check out the vignettes for more details.

italy <- st_read("./Limiti_2016_ED50_g/Reg2016_ED50_g/Reg2016_ED50_g.shp")
1 1 Piemonte 1235685.5 25394259406 <S3: sfc_MULTIPOLYGON>
2 2 Valle D’Aosta 311141.5 3258954558 <S3: sfc_MULTIPOLYGON>
3 3 Lombardia 1410619.8 23862504702 <S3: sfc_MULTIPOLYGON>
4 5 Veneto 1058657.7 18406022144 <S3: sfc_MULTIPOLYGON>
5 4 Trentino-Alto Adige 800899.4 13607742436 <S3: sfc_MULTIPOLYGON>
6 7 Liguria 825474.4 5415107538 <S3: sfc_MULTIPOLYGON>

As this is just a simple data frame, it can be manipulated in the same way as any other dataframe. In this case that includes a little change to the region names to make them match exactly and joining the data so I have everything I want in one data set, ready to start plotting.

levels(italy$REGIONE) <- wine$Region
WineLocation <- full_join(italy, wine, by = c("REGIONE" = "Region"))

New Features of ggplot2

Now for the new features of ggplot2. There are a lot of them this time around. Many related to the specification of variables for programming purposes but I was most interested in how easy it was to plot my Italy data. And it turns out that the answer was extremely.

Point to the data, tell it which column to use as the fill and add the new geom_sf layer. Just to try it out, I have added the new viridis scale layer. But that is all I needed to do. All of the projections have been managed for me, the colours look much more impressive than the defaults and all of the axis labels and titles have been handled the way I would want them, everything has just worked without me having to do any thinking.

ggplot(data = WineLocation, aes(fill = `2016`)) + 
  geom_sf() + 

Taking this even further, I happened to pick a data set with multiple years in it. A little manipulation to get the data into a better format to plot and an additional line in my plotting code and I have facetted maps. Again, something I didn’t need to do but I have tried out the new vars specification for facetting. Don’t worry, the formula interface is still there, and this is one of the features that will make programming with ggplot2 much easier. But this alternative is certainly very simple and may even be easier for some ggplot2 novices.

wineLong <- gather(wine, Year, Quantity, -Region)
fullWineLong <- full_join(italy, wineLong, by = c("REGIONE" = "Region"))

ggplot(data = fullWineLong, aes(fill = Quantity)) + 
  geom_sf() + 
  facet_wrap(vars(Year)) +

I tried this with all the shape files I could get my hands on and I am pleased to report it was just as easy. As it’s ggplot2 I was able to add layers as I wanted them, including additional shape files and all the projections were handled for me – although plotting the US and Italy on one map did look a little odd by default!

A quick look at mapview

Whilst I was reviewing the vignettes for sf I came across an example of using mapview. Despite this package having been on CRAN for a few years I had never come across it before, but I am very glad I have now.

In a nutshell, it is a wrapper to leaflet. The good thing is that it works very easily with simple features data. There are a huge number of options but to recreate an interactive version of my first map I simply needed to define the data, the column that defines the colouring (zcol) and as an extra the data that gives labels when I hover on the map, in this case the region names.

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE)

You can take a look at the map created on RPubs.

There are a huge number of options to mapview and I certainly haven’t tried them all out yet, but some of the notable features include:

  • Adding additional shape files by simply adding two mapview objects together
  • Adding leaflet layers, using the usual leaflet functions
  • Including multiple layers of data
  • Defining the pop-up functionality

This last one is particularly interesting as you can not only pop-up tables of data but also graphics. As an example, suppose that we want to be able to click on the region of interest and see the change in wine production over time. All we have to do is create a list object that includes all of the graphics that we need (in the same order as the data) and pass this to the popup option using the popupGraph function.

plotByRegion <- function(region){
  fullWineLong %>%
    filter(REGIONE == region) %>%
  ggplot() +
    geom_line(aes(Year, Quantity, group = REGIONE)) +
    labs(title = region)

p <- lapply(WineLocation$REGIONE, plotByRegion)

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE, 
          popup = popupGraph(p))

The documentation for mapview does state that it is aimed at quick visualisation for exploratory purposes rather than for production-ready graphics and it really suited that purpose for me. Personally, I can see my usage becoming a combination of both mapview and leaflet for interactive graphics and ggplot2 for static.

Right now I am very happy that I have an incredibly quick and easy workflow for getting my spatial data into R and visualising it without having to worry about map projections and little details that distract me from analysis.

Blogs home Featured Image

Linux containers, of which Docker is the most well known, can be a really great way to improve reproducibility on your data projects (for more info see here), and create portable, reusable applications. But how would we manage the deployment of multiple containerised applications?

Kubernetes is an open source container management platform that automates the core operations for you. It allows you to automatically deploy and scale containerised applications and removes the manual steps that would otherwise be involved. Essentially, you cluster together groups of hosts running Linux containers, and Kubernetes helps you easily and efficiently manage those clusters. This is especially effective in cloud based environments.

Why use kubernetes in your data stack?

Since Kubernetes orchestrates containers and since containers are a great way to bundle up your applications with their dependencies — thus improving reproducibility — Kubernetes is a natural fit if you’re aiming for high levels of automation in your stack.

Kubernetes allows you to manage containerised apps that span multiple containers as well as scale and schedule the containers as necessary across the cluster.

For instance, if you’re building stateless microservices in flask (Python) and plumber (R) it’s easy to initially treat running them in containers as though they were running in a simple virtual machine. However, once these containers are in a production environment and scale becomes much more important, you’ll likely need to run multiple instances of the containers and Kubernetes can take care of that for you.

Automation is a key driver

When container deployments are small it can be tempting to try to manage them by hand. Starting and stopping the containers that are required to service your application. But this approach is very inflexible and beyond the smallest of deployments such an approach is not really practical. Kubernetes is designed to manage the complexity of looking after production scale container deployments. This takes away the complexity of trying to manage such systems by hand as they can quickly reach a size and level of complexity that does not lend itself to error-prone manual management.

Scheduling is another often overlooked feature of Kubernetes in data processing pipelines, as you could, for example, schedule refreshes of models in order to keep them fresh. Such processes could be scheduled for times when you know the cluster will be otherwise quiet (such as overnight, or on weekends), with the refreshed model being published automatically.

The Case for Kubernetes in your data stack

More broadly, it helps you fully implement and rely on a container-based infrastructure in production environments. This is especially beneficial when you’re trying to reduce infrastructure costs as it allows you to keep your cluster size at the bare minimum required to run your applications, which in turn saves you money on wasted compute resource.

The features of Kubernetes are too long to list here, but the key things to take away is that it can be used to run containerised apps across multiple hosts, can scale applications on the fly, can auto-restart applications that have fallen over and help automate deployments.

The wider Kubernetes ecosystem relies on many other projects to deliver these fully orchestrated services. These additional projects provide such additional features as registry services for your containers, networking, security and so on.

Kubernetes offers a rich toolset to manage complex application stacks and with data science, engineering and operations becoming increasingly large scale, automation is a key driver for many new projects. If you’re not containerising your apps yet, jumping into Kubernetes can seem daunting, but if you start small by building out some simple containerised applications to start with, the benefits of this approach should become clear pretty quickly.

For an in-depth technical look at running Kubernetes, this post by Mark Edmondson offers an excellent primer.