rstudioconf

We’re delighted to be sponsoring the rstudio::global conference this year. It’s a credit to the community that such events (including our own flagship EARL conference) have responded so readily to the pandemic lockdown and transformed into fully virtual events, providing inspirational talks on all things R.

We are excited to see John Burn-Murdoch, senior data-visualisation journalist at the Financial Times, amongst the keynotes. John led the FT’s data-driven coverage of the pandemic, amassing an enormous following on social media for his incisive reporting. It will be fascinating to hear the lessons he has learned from his reporting and visualisations, and how he addressed the challenges of communicating often complex findings to the population at large. It was our pleasure to have had John speak at the EARL conference in both 2014 and 2016 – so we know that the rstudio::global audience can expect a riveting presentation.

With a packed itinerary and 24-hour streaming for accessibility all over the world, there will be some extremely useful presentations and stimulating conversations to be had for the 10,000 expected data professionals.  

As a sponsor of this event and as RStudio’s longest-serving Full Service Certified Partner, we would like to take this opportunity to invite attendees to meet us in our virtual booth. Whether you are scaling the use of R in your organisation and require technical advice on setup or configuration, lack the internal IT support to maintain RStudio products, or have reservations about validating open-source packages from a security or malware perspective, we can help.

Some of the services that we offer include: 

  • A Managed Service providing a scalable RStudio environment which can be effectively built up, run in the cloud and fully maintained by Mango, minimising the responsibility and burden on your in-house IT teams.
  • An On-Premise solution designed to address current in-house service gaps; following an Installation, Accelerate and Healthcheck review, this service offers the full installation, configuration and maintenance of RStudio products. 
  • A new validation service, ValidR®, which provides a validated collection of 150 of the most popular, industry-leading R packages (such as those within the tidyverse) and can be deployed with RStudio Package Manager (RSPM) to mitigate the uncertainty of using open-source software, with guaranteed reproducibility for any data science team.

We’re very much looking forward to seeing you at the event on 21st January – don’t forget to sign up now if you haven’t already.

#Rstudioconf2021 #rstats #RStudiofullservicepartner 

 

      

Centralised collaboration to future-proof your data science team

The data science industry, like much of the wider IT and tech sector, has one of the largest shares of remote employees, with teams based all over the world. Yet, as in other teams, projects are still implemented effectively, collaboratively and on time.

With the rapidly evolving COVID-19 crisis, companies have been forced to adopt working-from-home policies, and our technology and digital infrastructure has never been more important. Newly formed remote data science teams need to maintain productivity and continue to drive effective stakeholder communication and business value, and the only way to achieve this is through appropriate infrastructure and well-defined ways of working.

Whether your workforce works remotely or not, centralising platforms and enabling a cloud-based infrastructure for data science will lead to more opportunities for collaboration. It may even reduce IT spend on equipment and maintenance overhead, future-proofing your data science infrastructure for the long run.

So when it comes to implementing a long-lived platform, here are some things to keep in mind:

Collaboration through a centralised data & analytics platform

A centralized platform, such as RStudio Server Pro, means all your data scientists will have access to an appropriate platform and be working within the same environment. Working in this way means that a package written by one developer will work with a minimum of effort in all your developers’ environments, allowing simpler collaboration. There are other ways of achieving this with technologies such as virtualenv for Python, but these require each project to set up its own environment, increasing overhead. Centralizing this effort ensures that there is a well-understood way of creating projects and that each developer is working in the same way.

Best practices when using a centralized platform

  1. Version control. If you are writing code of any kind, even just scripts, it should be versioned religiously and have clear commit messages. This ensures that users can see every change made to the scripts if anything breaks, and can reproduce your results on their own.
  2. Packages. Whether you are working in Python or R, code should be packaged and treated like the valuable commodity it is. At Mango Solutions, a frequent challenge we address with our clients is debugging legacy code where a single ‘expert’ in a particular technology has written a piece of process that has become mission critical and then left the business. There is then no way to support, develop, or otherwise change this process without the whole business grinding to a halt. Packaging code and workflows helps to document and enforce dependencies, which can make legacy code easier to manage. These packages can then be maintained with RStudio Package Manager or Artifactory.
  3. Reusability. By putting your code in packages and managing your environments with renv (see the sketch after this list), you make your data science reusable. Creating this institutional knowledge means you can avoid a data scientist becoming a single point of failure, and with the job market still incredibly buoyant in the data sector, when a data scientist does leave, you won’t be left with a model that nobody understands or can run. As Lou Bajuk explained in his blog post, Driving Real, Lasting Value with Serious Data Science, durable code is a significant criterion for future-proofing your data science organization.
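
A minimal sketch of how renv captures an environment so a project can be rebuilt elsewhere (the package choice here is purely illustrative):

# Run inside the project directory; assumes renv is installed
renv::init()               # create a project-local library and an renv.lock file
install.packages("dplyr")  # install dependencies into the project library
renv::snapshot()           # record the exact package versions in renv.lock

# On another machine (or a fresh checkout), recreate the same environment:
renv::restore()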

Enabling a Cloud-based environment

In addition to this institutional knowledge benefit, running the data science platform on a cloud instance allows you to scale it up easily. With the ability to deploy to Kubernetes, you can scale your deployment as your data science team grows, while only paying for what you need, when you need it.

This move to the cloud also comes with some tangential benefits which are often overlooked. Providing your data science team with a cloud-based environment has several advantages:

  1. The cost of hardware for your data science staff can be reduced to low-cost laptops rather than expensive, high-end on-premise machines.
  2. By providing a centralized development platform, you enable remote and mobile working, which is a key differentiator when hiring the best talent.
  3. By enhancing flexibility, you are better positioned to remain productive in unforeseen circumstances.

This last point cannot be overstated. At the beginning of the Covid-19 lockdown, a nationwide company whose data team was tied to desktops found itself struggling to provide enough equipment to continue working through the lockdown. As a result, its data science team could not function and was unable to provide insights that would have been invaluable through these changing times. By contrast, here at Mango, our data science platform strategy allowed us to switch seamlessly to remote working, add value to our partners and deliver insights when they were needed most.

Building agility into your basic ways of working means that you are well placed to adapt to unexpected events and adopt new platforms which are easier to update as technology moves on.

Once you have a centralized analytics platform and cloud-based infrastructure in place, how are you going to convince the business to use it? This is where the worlds of Business Intelligence and software DevOps come to the rescue.

Analytics-backed dashboards using technologies like Shiny and RStudio Connect, or Plotly and Dash for Python, mean you can quickly and easily create front ends for business users to access results from your models. You can also easily expose APIs that allow your websites to be backed by scalable models, potentially creating new ways for customers to engage with your business.
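
As a sketch of the API route, a minimal plumber endpoint might look like this (the route name and the trivial doubling "model" are stand-ins for your own scoring function):

# plumber.R -- a minimal scoring endpoint
library(plumber)

#* Return a 'prediction' for a single numeric input
#* @param x A numeric input value
#* @get /predict
function(x) {
  x <- as.numeric(x)
  list(prediction = x * 2)  # stand-in for calling your real model here
}

# Started separately with, for example:
# plumber::plumb("plumber.R")$run(port = 8000)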

A word of caution here: doing this without considering how you are going to maintain and update what have now become software products can be dangerous. Models may go out of date, functionality can become irrelevant, and the business can become disillusioned. Fortunately, these are solved problems in the web world, and solutions such as containers and Kubernetes, alongside CI/CD tools, make this a simpler challenge. As a consultancy, we have tried-and-tested solutions that expose APIs from R or Python to back high-throughput websites across a number of sectors for our customers.

Collaborative forms of communications

The last piece of the puzzle for your data science team to be productive has nothing to do with data science but is instead about empathy and communication. Your data science team may create insights from your data, but they are like a rudderless ship without input from the business. Understanding business problems and what has value to the wider enterprise requires good communication. This means that your data scientists have to partner with people who understand the sales and marketing strategy. And if you are to embrace the ethos of flexibility as protection against the future, then good video-conferencing and other technological communications are essential.

Written by Dean Wood, Principal Data Science Consultant at Mango Solutions; also published as a guest blog on our partner RStudio’s website.

 

 

RStudio::Conf 2020

Dude: Where are my Cats?  RStudio::Conf 2020

It may not have been the start to the conference that we had planned as RStudio Full Service Certified Partners – did you see the lonely guy on social media? Yes, that was me, and I’m here to tell the tale…

Eventful as it was at the time, I have to say this was the first RStudio Conference I have had the pleasure to attend since joining Mango Solutions. What really stood out for me was the event’s ubiquitous, well-thought-through inclusivity and how fantastically well organised it was for the nearly 2,400 R users attending from around the world. Here’s a summary of our time in San Francisco, what it had to offer and why we are immensely proud partners of RStudio.

Cat rehoming 10:41am San Francisco time

With our exhibition stand, materials and conference goodies – the famous Mango cats – held up in customs, the conference started without them. I remain ever thankful to the whole #rstats community who, despite this little hiccup, took pity on us and came to visit us anyway. What I was quickly able to grasp was that this is a community that is always ready to support others, offering a forum to share ideas, learn how to solve problems and, in particular, learn how others are benefiting from using R.

Public Benefit Corporation

A vital and impressive moment of the conference was the standing ovation for J.J. Allaire after his announcement that RStudio had become a Public Benefit Corporation. You could feel the appreciation in the room for RStudio’s innovation and how it had pushed the R community forward. He also discussed RStudio’s future plans, which provide growth opportunities for the community.

From a content perspective, RStudio::conf was a great event, filled with informative and well organised workshops and talks. As hard as it is to pick out one particular talk, the highlight was probably Jenny Bryan’s “Object of type ‘closure’ is not subsettable”; this was all about debugging in R – the best approaches, available tools and hints on how to write more informative error messages in your own functions. It was engaging, informative and witty, and it was relevant to pretty much every single R developer on the planet, let alone those present in the room.

Amongst other things, the Mango team of Data Scientists really appreciated these packages which the RStudio team featured as part of their workshops:

  • The best ways to scale up your API using the plumber package
  • Custom styling of Shiny apps using the bootstraplib package
  • Effective parallelisation of R code using the future and furrr packages
  • Load testing using the loadtest package

 

Inclusivity all round

Inclusivity was felt not only in the RLadies breakfast, but also in the prayer rooms, quiet rooms for neurodiverse attendees, gender-neutral bathrooms, diversity scholarships and frequent reminders of the event’s code of conduct, which revolves heavily around inclusivity and tolerance. Great organisation was evident not only in the choice of venue, but also in the effort that went into keeping queues for food and buses short, leaving enough time to change rooms between talks, and providing great entertainment and perks throughout the event.

Endless networking opportunities

RStudio::conf 2020 was a fantastic place to meet and connect with other people in the industry and gain insight into how other companies, data science teams and individuals are using R and the underlying infrastructure that supports it. For Ben, our Data Platform Consultant, it was interesting and exciting to hear, from a platform perspective, about the needs of data science teams and how we could potentially solve the challenges they are facing. A recurring issue seemed to be scaling R in a production environment and the best way to do this. Ben found the renv talk interesting and hopes to be using it more this year in place of Packrat.

For Mango, it was a real pleasure to discuss the wealth of opportunity presented by ValidR, our validated, production-ready collection of R packages.

A huge thank you to everyone at RStudio for supporting my first conference with RStudio. It was truly a pleasure to meet the team in person, and it has given Mango and RStudio the opportunity to take our partnership to the next level.

 

Author: Rich Adams, Solutions Specialist


This summer Mango took on three summer interns, Chris, Ellena and Lizzi, all maths students at different stages of their university careers. To provide insight into what it’s like to work on a data science project, Mango set up a three-day mini-project. The brief was to analyse data from the 2018 Cost of Edinburgh survey.

The Cost of Edinburgh project was founded in 2017 by director and producer Claire Stone. The survey was designed in collaboration with, and with support from, the Fringe Society. It ran on SurveyMonkey in 2018 with the goal of collecting 100-150 responses from the wide range of people involved with performing at the Edinburgh Fringe Festival that year. There were three elements to the scope of the survey:

  • Collect demographic data in order to explore the impact of the costs of attending the Fringe on diversity;
  • Collect data on production income versus costs over multiple years, in terms of venue costs, accommodation and travel;
  • Obtain qualitative responses on the financial and wellbeing costs of attending the Fringe.

The survey aimed to determine which performers attend the Fringe, what stops people from attending, and whether it is becoming more expensive to perform at the festival. 368 people responded to the survey, which had 22 questions in three main sections: demographics, quantitative questions on costs and income, and qualitative questions on costs and wellbeing. In this post, Chris and Lizzi share their experiences of their first data science project.

——————————————————————————————————————————————————————————

As is usually the case when real-world data are involved, they weren’t ready to be analysed out of the box. On first look we saw that we didn’t have tidy data in which each column contained the answer to a different question. There were lots of gaps in the spreadsheet, but not all due to missing data.

  • Questions that required the respondent to choose one answer but included an ‘Other’ category corresponded to two columns in the spreadsheet; one containing the chosen answer, and one containing free text if the chosen answer was ‘Other’ and empty otherwise.
  • Questions for which respondents could choose multiple answers had one column per answer. Each column contained the answer if chosen and was empty if not.
  • For quantitative cost and income questions respondents could fill in details for up to ten productions, or years of attending the Fringe. If a question asked for five pieces of information it corresponded to 50 columns per subject, many of which were empty.

 

Chris’s thoughts

After being told that Nick had had a “quick, preliminary look at the data” and discussing his findings, we decided to split the data into two sections, Demographics and Financial, with the idea that any spare time at the end would be spent on the more qualitative feedback questions. Since the financial questions contained much more complicated data, it was decided that both Ellena and I would tackle them: Ellena would take the “costs” and I would take the “income” questions.

Now that the jobs were split into manageable chunks, we could start deciding which questions we wanted to answer with the data. Looking more carefully at the data, it was clear that a lot of the answers were categorical, and hence bar graphs seemed like an obvious option. It would have been really nice to have continuous data, but I can understand why people would be uncomfortable answering a survey with that level of personal detail. Having the opportunity to see the project evolve from the beginning to the point where we had specific questions to answer was a really positive experience. By this point, I felt as if I’d learnt so much already. Here is a histogram of the annual income of Edinburgh Fringe performers from their work in the arts.

 

 

We were to use an Agile development methodology, with a creative, sped-up version of the scrum method. The scrum method is a series of short bursts of work with defined targets, called sprints, interspersed with short meetings called stand-ups (named because you’re not allowed to sit down). This was my first introduction to a professional workflow, and it gave me an insight into how companies manage work. Sprints are usually days or weeks in length, but because of our three-day deadline we had to adapt the strategy, splitting each day into two parts and having two stand-ups per day.

We spent the next two days transforming the data into a usable form and creating graphs that showed the headline patterns clearly. However, for a lot of the financial data we didn’t have a large enough sample size to perform statistical tests with a high level of certainty, which left the output feeling more descriptive than inferential. This didn’t stop me from learning a huge amount though. I was introduced to the tidyverse, a collection of packages designed to work together, and then it was just a matter of coaxing me out of my `for` loop ways and using group_by instead. There was a lot of coding to do and I feel that this project has really developed my R skills. I mainly code in Python and, before this, my only history with R was one year’s worth of academic use at university. This was a whole new experience, both in level of exposure and in the impromptu lessons every few hours – my favourite being enthusiastically introduced to regex by Nick, who taught me that any problem can be solved with regex.
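
For example, where I might previously have looped over each group and accumulated results by hand, a dplyr version is just a few lines (the column names and values here are invented for illustration, not the survey data):

library(dplyr)

# Invented example data: one row per response, with a category and a cost
costs <- tibble(
  category = c("venue", "venue", "travel", "travel", "accommodation"),
  cost     = c(1200, 950, 300, 280, 700)
)

# Instead of a for loop over categories, group and summarise in one pipeline
costs %>%
  group_by(category) %>%
  summarise(mean_cost = mean(cost), n_responses = n())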

This project made me appreciate the need for good quality data in data science, and how much of a project is spent cleaning and pre-processing compared to actually performing statistical tests and generating suave ggplots. There were a lot of firsts for me, like using ggplot2, dplyr and git in a professional environment.

 

Lizzi’s thoughts

Three days isn’t very long to take a piece of work from start to finish and, in particular, it doesn’t allow for much thinking time. We had to decide which questions we could realistically address in the time frame and divide the tasks so that we weren’t all working on the same thing. My job was to look at the demographic data. I didn’t produce any ground-breaking research, but I was able to produce a bunch of pretty pictures, discovering some new plotting packages and practising some useful skills along the way.

Firstly, I learnt what a waffle plot is: a way of displaying categorical data to show parts-to-whole contributions. Plots like the example below, which represents the self-reported annual income of Edinburgh Fringe performers from their work in the arts, can be easily created in R using the waffle package. The most time-consuming task required to create such a plot is ensuring the factor levels are in the desired order; the forcats package came in useful for this.
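
A minimal sketch of the sort of call involved, assuming the waffle package is installed (the income bands and counts below are invented, not the survey figures):

library(waffle)

# Invented counts of respondents per income band (one square per respondent)
income <- c("Under £10k" = 18, "£10k-£20k" = 12, "£20k-£30k" = 7, "Over £30k" = 3)

waffle(income, rows = 5, title = "Annual income from work in the arts")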

 

 

The leaflet package enables you to easily create zoomable maps, using images from OpenStreetMap by default. Once again, the most time-consuming part was getting the data into the right format. Location data was a mix of UK post codes, US zip codes, towns, countries and sets of four numbers that seemed like they might be Australian post codes. Using a Google API through the mapsapi package, and a bit of a helping hand so that Bristol didn’t turn into a NASCAR track in Tennessee, I could convert these data into longitude and latitude coordinates for plotting. This package can also calculate distances between places, but it only worked for locations in Great Britain as it uses Google Maps driving directions. Instead, to determine how far performers travel to get to the Fringe, I resorted to calculating the geodesic distance between pairs of longitudes and latitudes using the gdist function from the Imap package.
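
A rough sketch of those two steps, assuming the geocoding has already produced longitude/latitude pairs (the coordinates below are approximate and purely illustrative):

library(leaflet)
library(Imap)

# Approximate coordinates for two illustrative locations
bristol   <- c(lon = -2.5879, lat = 51.4545)
edinburgh <- c(lon = -3.1883, lat = 55.9533)

# Zoomable map with one marker per location
leaflet() %>%
  addTiles() %>%
  addMarkers(lng = c(bristol["lon"], edinburgh["lon"]),
             lat = c(bristol["lat"], edinburgh["lat"]))

# Geodesic distance between the two points, in kilometres
gdist(lon.1 = bristol["lon"], lat.1 = bristol["lat"],
      lon.2 = edinburgh["lon"], lat.2 = edinburgh["lat"],
      units = "km")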

 

This project was also a good opportunity to practise using the tidyverse to manipulate data. I learnt base R at university and only came across the %>% pipe on a placement last year when working with somebody else’s code. Currently, data manipulation in base R is second nature to me and doing something the tidyverse way requires much more thought, so this project was a step towards becoming less old-fashioned and shifting towards the tidyverse.

This was my first experience of using version control properly for a collaborative project. I use version control for my PhD work but purely in a way that means I don’t lose everything if my laptop gets lost, stolen or broken. I’m sure this will send shivers down some spines but my commits are along the lines of “all changes on laptop” or “all changes from desktop” as I update between working at home or at uni, often after a few days. I’ve learnt that proper use of Git means making a branch on which to carry out a particular piece of the work without interfering with the master branch. It also means committing regularly, with informative messages. Once work on a branch is finished, you submit a merge request (aka pull request). This prompts somebody else to review the code and if they’re satisfied with it, press the big green button to merge the branch into the master branch. It was also important to make the repository tidy and well-structured so that it made sense to others and not just to me.

The output of the project was an R Markdown document rendered to html. We brought our work together by making a skeleton Markdown document and importing each person’s work as a child document. Once we worked out which packages each of us had installed and made the sourcing of files and reading in of data consistent, the document knitted together smoothly.
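
In practice the skeleton looked something like this (the file names are hypothetical):

---
title: "Cost of Edinburgh Survey Analysis"
output: html_document
---

```{r demographics, child = "demographics.Rmd"}
```

```{r financial-costs, child = "costs.Rmd"}
```

```{r financial-income, child = "income.Rmd"}
```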

As well as the coding side of things, I also learnt a bit about project management methodology. In the initial messages about the project we were told it would be carried out using scrum methodology. A quick google reassured me that no rugby was going to be involved. As Chris mentioned, we had 15-minute stand-ups (nothing to do with comedians) each morning and afternoon. The purpose of these meetings was to quickly catch each other up on what we had been working on, what we were going to do next and whether there were any blockers to getting the work done. The latter was particularly important given the small time frame.

 

In summary

In summary, this project resembled an exploratory phase of a full project and was perhaps a bit too short to produce a completely polished deliverable. However, we all learnt something, improved our R skills and had an enjoyable and interesting experience of working on a data science project.

Cost of Edinburgh Survey Analysis


What’s in a data community? One of the UK’s top retailers, Sainsbury’s, knows just how much value can be derived from a community which, when brought together in a data science context, can add immeasurable business benefit throughout the organisation.

In fact, Sainsbury’s firm belief in the power of data science and community has led to the establishment of its own Data Community, which collaborates with teams across the business to find new ways of working with data and learn what’s possible. Part of this activity includes biannual Group Data Conferences, which provide an opportunity for Sainsbury’s 800-strong community to come together to listen to inspiring ideas, connect as a community and get involved with all that the industry has to offer.

As big believers in data analysis that delivers, the Mango team was thrilled to be involved with the company’s most recent Group Data Conference, delivering two R-focused workshops: a high-level Introduction to R for analysts not yet familiar with this popular statistical programming language, showcasing why R is one of the leading data science tools; and a Package Building in R workshop for more advanced users, focused on getting started with building packages, understanding the benefits of package-building best practices and being able to implement them.

“We were delighted to have Data Science experts Mango Solutions participating at our internal LOVE Data conference – an event for the 800 data professionals across our Group,” said Naomi Sayers, Sainsbury’s Group Senior Manager of ADS Community & Capability Group Data. “Mango provided training workshops, supporting our aim of inspiring and connecting our community.”

Mango has supported, and continues to support, various large companies and organisations that are keen to set up – or have already set up – their own internal data science communities. It’s an excellent way for companies to promote data science culture and methodologies, upskill employees and encourage collaboration across different teams, sites or departments.

If you or your company is keen to find out more about how Mango can help your data science journey, get in touch with us or take a look through more information here.


It’s easy to get stuck in the day-to-day at the office and there’s never time to upskill or even think about career development. However, to really grow and develop your organisation, it’s important to grow and develop your team.

While there are many ways to develop teams, including training and providing time to complete personal (and relevant) projects, conferences provide a range of benefits.

Spark innovation
Some of the best in the business present their projects, ideas and solutions at EARL each year. It’s the perfect opportunity to see what’s trending and what’s really working. Topics at EARL Conferences have included best-practice SAS-to-R migration, Shiny applications, using social media data and web scraping, plus presentations on R in marketing, healthcare, finance, insurance and transport.

A cross-sector conference like EARL can help your organisation think outside the box because learnings are transferable, regardless of industry.

Imbue knowledge
This brings us to knowledge. Learning from the best in the business will help employees expand their knowledge base. This can keep them motivated and engaged in what they’re doing, and a wider knowledge base can also inform their everyday tasks, enabling them to advance the way they do their job.

When employees feel like you want to invest in them, they stay engaged and are more likely to remain in the same organisation for longer.

Encourage networking
EARL attracts R users from all levels and industries – and not just as speakers. The agenda offers plenty of opportunities to network with some of the industry’s most engaged R users. This is beneficial for a number of reasons, including knowledge exchange and sharing your organisation’s values.

Boost inspiration
We often see delegates who have come to an EARL Conference with a specific business challenge in mind. By attending, they get access to the current innovations, knowledge and networking mentioned above, and can return to their team post-conference with renewed vigour to solve those problems using their new-found knowledge.

Making the most out of attending EARL

After all of that, the next step is making sure your organisation makes the most out of attending EARL. We recommend:

Setting goals
Do you have a specific challenge you’re trying to solve in your organisation? Going with a set challenge in mind means your team can plan which sessions to sit in and who they should talk to during the networking sessions.

De-briefing
This is two-fold:
1) Writing a post-conference report will help your team put what they have learnt at EARL into action.
2) Not everyone can attend, so those who do can share their new-found knowledge with their peers who can learn second-hand from their colleague’s experience.

Following up
We’re all guilty of going to a conference, coming back inspired and then getting lost in the day-to-day. Assuming you’ve set goals and de-briefed, it should be easy to develop a follow-up plan.

You can make the most of inspired team members to put in place new strategies, technologies and innovations through further training, contact follow-ups and new procedure development.

The EARL Conference offers a group deal for organisations looking to send more than five delegates.

Buy tickets now


The year 2017 completely turned the film industry upside down. The allegations of harassment and sexual assault against Harvey Weinstein brought the issue of sexism and misogyny in this industry to the eyes of the general public. In addition, it helped raise awareness of the poor gender diversity and under-representation of women in Hollywood. One of the main problems posed by the low presence of women behind the camera is that it is then reflected in the fictional characters on screen: many movies portray women in an incomplete, stereotyped and biased way.

This post focuses on some key behind-the-camera roles to measure the evolution of gender diversity over the last decade – from 2007 until 2017. The roles I studied were: directors, writers, producers, sound teams, music teams, art teams, makeup teams and costume teams.

The whole code to reproduce the following results is available on GitHub.

Data frame creation – Web scraping

What I needed first was a list which gathered the names connected to film job roles for 50 movies. For each year between 2007 and 2017, I gathered the information about the 50 most profitable movies of the year from the IMDb website.

As a first step, I built data frames which contained the titles of these movies, their gross profit and their IMDb crew links – which shows the names and roles of the whole movie crew. The following code is aimed at building the corresponding data frame for the 50 most profitable movies of 2017.

# IMDB TOP US GROSSING 2017: 50 MOST PROFITABLE MOVIES OF 2017 -------------

library(rvest) # read_html(), html_nodes(), html_attrs(), html_text() and the %>% pipe

url <- "https://www.imdb.com/search/title?release_date=2017-01-01,2017-12-31&sort=boxoffice_gross_us,desc"
page <- read_html(url)

# Movies details
movie_nodes <- html_nodes(page, '.lister-item-header a') 
movie_link <- sapply(html_attrs(movie_nodes),`[[`,'href')
movie_link <- paste0("http://www.imdb.com", movie_link)
movie_crewlink <- gsub("[?]", "fullcredits?", movie_link) #Full crew links
movie_name <- html_text(movie_nodes)
movie_year <- rep(2017, 50)
movie_gross <- html_nodes(page, '.sort-num_votes-visible span:nth-child(5)') %>%
  html_text()

# CREATE DATAFRAME: TOP 2017 ----------------------------------------------

top_2017 <- data.frame(movie_name, movie_year, movie_gross, movie_crewlink, stringsAsFactors = FALSE)

Let’s have a look at the top_2017 data frame:

##                                movie_name movie_year movie_gross
## 1 Star Wars: Episode VIII - The Last Jedi       2017    $620.18M
## 2                    Beauty and the Beast       2017    $504.01M
## 3                            Wonder Woman       2017    $412.56M
## 4          Jumanji: Welcome to the Jungle       2017    $404.26M
## 5         Guardians of the Galaxy: Vol. 2       2017    $389.81M
## 6                   Spider-Man Homecoming       2017    $334.20M
##                                                   movie_crewlink
## 1 http://www.imdb.com/title/tt2527336/fullcredits?ref_=adv_li_tt
## 2 http://www.imdb.com/title/tt2771200/fullcredits?ref_=adv_li_tt
## 3 http://www.imdb.com/title/tt0451279/fullcredits?ref_=adv_li_tt
## 4 http://www.imdb.com/title/tt2283362/fullcredits?ref_=adv_li_tt
## 5 http://www.imdb.com/title/tt3896198/fullcredits?ref_=adv_li_tt
## 6 http://www.imdb.com/title/tt2250912/fullcredits?ref_=adv_li_tt

I adapted the previous code in order to build equivalent data frames for the past 10 years. I then had 11 data frames: top_2017, top_2016, …, top_2007, which gathered the names, years, gross profit and crew links of the 50 most profitable movies of each year.

I combined these 11 data frames into one data frame called top_movies.
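
Stacking them is a one-liner with dplyr (a sketch, assuming the eleven yearly data frames all share the same columns):

library(dplyr)

# Stack the eleven yearly data frames into a single 550-row data frame
top_movies <- bind_rows(top_2007, top_2008, top_2009, top_2010, top_2011,
                        top_2012, top_2013, top_2014, top_2015, top_2016,
                        top_2017)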

List creation – Web scraping

After that, I had a data frame with 550 rows, and I next needed to build a list which gathered:

  • the years from 2007 to 2017
  • for each year, the names of the corresponding top 50 grossing movies
  • for each movie, the names of the people whose job fell into one of the categories listed above (director, writer, producer, sound, music, art, makeup and costume teams)

In order to build this list, I navigated through all the IMDb full crew web pages stored in our top_movies data frame, and did some web scraping again to gather the information listed above.

movies_list <- list()

for (r in seq_len(nrow(top_movies))) {
  
  # FOCUS ON EACH MOVIE -----------------------------------------------------------------
  movie_name <- top_movies[r, "movie_name"]
  movie_year <- as.character(top_movies[r, "movie_year"])
  page <- read_html(as.character(top_movies[r, "movie_crewlink"]))
  
  # GATHER THE CREW NAMES FOR THIS MOVIE ------------------------------------------------
  movie_allcrew <- html_nodes(page, '.name , .dataHeaderWithBorder') %>%
    html_text()
  movie_allcrew <- gsub("[\n]", "", movie_allcrew) %>%
    trimws() #Remove white spaces 
  
  # SPLIT THE CREW NAMES BY CATEGORY ----------------------------------------------------
  movie_categories <- html_nodes(page, '.dataHeaderWithBorder') %>%
    html_text()
  movie_categories <- gsub("[\n]", "", movie_categories) %>%
    trimws() #Remove white spaces
    
  ## MUSIC DEPARTMENT -------------------------------------------------------------------
  movie_music <- c()
  for (i in 1:(length(movie_allcrew)-1)){
    if (grepl("Music by", movie_allcrew[i])){
      j <- 1
      while (! grepl(movie_allcrew[i], movie_categories[j])){
        j <- j+1
      }
      k <- i+1
      while (! grepl(movie_categories[j+1], movie_allcrew[k])){
        movie_music <- c(movie_music, movie_allcrew[k])
        k <- k+1
      }
    }
  }
  for (i in 1:(length(movie_allcrew)-1)){
    if (grepl("Music Department", movie_allcrew[i])){
      j <- 1
      while (! grepl(movie_allcrew[i], movie_categories[j])){
        j <- j+1
      }
      k <- i+1
      while (! grepl(movie_categories[j+1], movie_allcrew[k])){
        movie_music <- c(movie_music, movie_allcrew[k])
        k <- k+1
      }
    }
  }
  if (length(movie_music) == 0){
    movie_music <- c("")
  }
    
  ## IDEM FOR OTHER CATEGORIES ---------------------------------------------------------
    
  ## MOVIE_INFO CONTAINS THE MOVIE CREW NAMES ORDERED BY CATEGORY ----------------------
  movie_info <- list()
  movie_info$directors <- movie_directors
  movie_info$writers <- movie_writers
  movie_info$producers <- movie_producers
  movie_info$sound <- movie_sound
  movie_info$music <- movie_music
  movie_info$art <- movie_art
  movie_info$makeup <- movie_makeup
  movie_info$costume <- movie_costume
    
  ## MOVIES_LIST GATHERS THE INFORMATION FOR EVERY YEAR AND EVERY MOVIE ----------------
  movies_list[[movie_year]][[movie_name]] <- movie_info

}

Here are some of the names I collected:

## - Star Wars VIII 2017, Director:
## Rian Johnson
## - Sweeney Todd 2007, Costume team:
## Colleen Atwood, Natasha Bailey, Sean Barrett, Emma Brown, Charlotte Child, Charlie Copson, Steve Gell, Liberty Kelly, Colleen Kelsall, Linda Lashley, Rachel Lilley, Cavita Luchmun, Ann Maskrey, Ciara McArdle, Sarah Moore, Jacqueline Mulligan, Adam Roach, Sunny Rowley, Jessica Scott-Reed, Marcia Smith, Sophia Spink, Nancy Thompson, Suzi Turnbull, Dominic Young, Deborah Ambrosino, David Bethell, Mariana Bujoi, Mauricio Carneiro, Sacha Chandisingh, Lisa Robinson

Gender determination

All of the names I needed to measure the gender diversity of were now gathered in the list movies_list. Then, I had to determine the gender of almost 275,000 names. This is what the R package genderizeR does: “The genderizeR package uses the genderize.io API to predict gender from first names”. At the moment, the genderize.io database contains 216,286 distinct names across 79 countries and 89 languages. The data is collected from social networks all over the world, which ensures a diversity of origins.

However, I am aware that determining genders based on names is not an ideal solution: some names are unisex, some people do not recognise themselves as male or female, and some transitioning transgender people still have their former name. But this solution was the only option I had, and as I worked on about 275,000 names, I assumed that the error induced by the cases listed above was not going to have a big impact on my results.

With this in mind, I used the genderizeR package and applied its main function to the lists of names I gathered earlier in movies_list. The function genderizeAPI checks whether the names tested are included in the genderize.io database and returns:

  • the gender associated with the first name tested
  • the count of this first name in the database
  • the probability of gender given the first name tested.

The attribute I was interested in was obviously the first one, the gender associated with the first name tested.
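
For a single name, the call looks roughly like this (a sketch: "Rian" is just an example first name, "YOUR_API_KEY" is a placeholder, and the exact shape of the returned object may differ between package versions):

library(genderizeR)

result <- genderizeAPI(x = "Rian", apikey = "YOUR_API_KEY")

result$response$gender       # predicted gender for the name
result$response$count        # how often the name appears in the genderize.io database
result$response$probability  # probability associated with the predicted gender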

The aim was to focus on every category of jobs, and to count the number of males and females by category, film and year. With the script below, here is the information I added to each object movies_list$year$film:

  • the number of male directors
  • the number of female directors
  • the number of male producers
  • the number of female producers
  • the number of males in costume team
  • the number of females in costume team

The following code shows how I determined the gender of the directors’ names for every film in the movie_list. The code is similar for all the other categories.

# for each year
for (y in seq_along(movies_list)){ 
  
  # for each movie
  for (i in seq_along(movies_list[[y]])){
    
# Genderize directors -----------------------------------------------------
    directors <- movies_list[[y]][[i]]$directors
    
    if (directors == ""){
      directors_gender <- list()
      directors_gender$male <- 0
      directors_gender$female <- 0
      movies_list[[y]][[i]]$directors_gender <- directors_gender
    }
    
    else{
      # Split the firstnames and the lastnames
      # Keep the firstnames
      directors <- strsplit(directors, " ")
      l <- c()
      for (j in seq_along(directors)){
        l <- c(l, directors[[j]][1])
      }
  
      directors <- l
      movie_directors_male <- 0
      movie_directors_female <- 0
  
      # Genderize every firstname and count the number of males and females 
      for (p in seq_along(directors)){
        directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
        gender <- directors_gender$response$gender
        if (length(gender)>0 && gender == "male"){
          movie_directors_male <- movie_directors_male + 1
        }
        if (length(gender)>0 && gender == "female"){
          movie_directors_female <- movie_directors_female + 1
        }
      }
  
      # Put the number of males and females in movies_list
      directors_gender <- list()
      directors_gender$male <- movie_directors_male
      directors_gender$female <- movie_directors_female
      movies_list[[y]][[i]]$directors_gender <- directors_gender
    }  
    
# Idem for the 7 other categories -----------------------------------------------------    

  }
}

Here are some examples of the number of male and female names I collected:

## - Star Wars VIII 2017 
##  Number of male directors: 1 
##  Number of female directors: 0
## - Sweeney Todd 2007 
##  Number of male in costume team: 9 
##  Number of female in costume team: 20

Percentages calculation

Once I had all the gender information listed above, the next step was to calculate percentages by year. I then went through the whole list movies_list and created a data frame called percentages which gathered the percentages of women in each job category for each year.
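
As a rough sketch of that calculation for a single year and category (the real code loops over every year and all eight categories):

# Percentage of women among directors for one year, using the counts
# stored earlier in movies_list[[year]][[movie]]$directors_gender
year <- "2017"
males   <- sum(sapply(movies_list[[year]], function(movie) movie$directors_gender$male))
females <- sum(sapply(movies_list[[year]], function(movie) movie$directors_gender$female))

women_directors_2017 <- 100 * females / (males + females)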

Let’s have a look at the percentages data frame:

##    year women_directors women_writers women_producers women_sound
## 1  2017        3.571429      9.386282        23.03030    14.17497
## 2  2016        3.174603      9.174312        19.04762    14.02918
## 3  2015        6.000000     12.432432        21.19914    15.69061
## 4  2014        1.785714      8.041958        23.12634    14.89028
## 5  2013        1.886792     10.769231        22.86282    13.54005
## 6  2012        5.357143     10.227273        24.06542    12.33696
## 7  2011        3.846154      9.523810        19.73392    15.08410
## 8  2010        0.000000     10.526316        17.40088    16.06700
## 9  2009        7.407407     13.157895        21.24711    15.30185
## 10 2008        7.547170      9.756098        18.67612    14.70588
## 11 2007        3.333333      9.047619        17.42243    16.13904
##    year women_music women_art women_makeup women_costume
## 1  2017    22.46998  26.87484     68.22204      69.89796
## 2  2016    25.84896  25.04481     67.54386      69.44655
## 3  2015    20.46163  24.90697     68.83117      70.83333
## 4  2014    22.86967  22.31998     67.29508      67.47430
## 5  2013    20.46482  22.45546     63.88697      69.79495
## 6  2012    21.62819  20.90395     66.95402      68.83539
## 7  2011    18.09816  20.22792     70.09482      67.44548
## 8  2010    20.90137  22.38199     65.81118      68.72082
## 9  2009    19.15734  22.14386     61.15619      70.25948
## 10 2008    19.82984  21.80974     60.87768      71.20253
## 11 2007    19.64385  20.21891     59.23310      67.36035

Visualisation – gender diversity in 2017

I was then able to visualise these percentages. For example, here is the code I used to visualise the gender diversity in 2017.

# Formatting our data frame
library(ggplot2)
library(RColorBrewer) # brewer.pal() colour palettes
percentages_t <- data.frame(t(percentages), stringsAsFactors = FALSE)
colnames(percentages_t) <- percentages_t[1, ]
percentages_t <- percentages_t[-1, ]
rownames(percentages_t) <- c("directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")

# Plotting our bar plot
percentages_2017 <- percentages_t$`2017`
y <- as.matrix(percentages_2017)

p <- ggplot(percentages_t, aes(x = rownames(percentages_t),
                               y = percentages_2017, 
                               fill = rownames(percentages_t))) + 
  geom_bar(stat = "identity") +
  coord_flip() + # Horizontal bar plot
  geom_text(aes(label=format(y, digits = 2)), hjust=-0.1, size=3.5) + # percentages next to bars
  theme(axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank(),
        legend.title=element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  labs(title = "Percentages of women in the film industry in 2017") +
  guides(fill = guide_legend(reverse=TRUE)) + # reverse the order of the legend
  scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxes

As we can see, in 2017 the behind-the-camera roles of director and writer show the lowest representation of women: less than 10% for writers and less than 4% for directors. This is really worrying considering that these are key roles which determine the way women are portrayed in front of the camera. Some studies have already shown that the more gender-diverse these roles are, the more gender diversity is shown on screen.

Let’s go back to our bar plot. Women are also under-represented in sound teams (14%), music teams (22.5%), producer roles (23%) and art teams (27%). The only jobs which seem open to women are the stereotypically female ones of make-up artist and costume designer, where almost 70% of the roles are held by women.

Visualisation – gender diversity evolution through the last decade

Even if the 2017 results are not encouraging, I wanted to know whether there had been any improvement over the last decade. Here is how I visualised the evolution.

# From wide to long data frame
library(dplyr)     # %>% pipe
library(tidyr)     # gather()
library(lubridate) # ymd()

colnames(percentages) <- c("year", "directors", "writers", "producers", "sound",
                           "music", "art", "makeup", "costume")
percentages_long <- percentages %>%
  gather(key = category, value = percentage, -year)
percentages_long$year <- ymd(percentages_long$year, truncated = 2L) # year as date

# line plot
evolution_10 <- ggplot(percentages_long, aes(x = year,
                                             y = percentage,
                                             group = category,
                                             colour = category)) +
  geom_line(size = 2) +
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women in the film industry from 2007 to 2017",
       x = "",
       y = "Percentages")

The first thing I noticed is that the gap in representation between make-up artists and costume designers on the one hand, and all the other roles on the other, has not noticeably decreased since 2007.

In addition, for the roles in which women are most under-represented – directors, writers and sound-related jobs – no improvement has been achieved.

If we focus on directors, we do not see any clear trend; the figures vary depending on the year we consider. In 2010, for example, there were no female directors at all among the 50 most profitable movies, and in other years the share never goes beyond 7.5%. Interestingly, the best levels of female representation among directors were reached in 2008 and 2009. Since then, the share of female directors has declined and never exceeded 6%, and the percentage of women directors in 2017 is practically the same as in 2007.

We also notice how flat the lines for writers and sound teams are: women have consistently represented around 10% of writers and 15% of sound teams over the last decade, with no sign of improvement.

Only a slight improvement of 3-5% is notable among producers, music and art teams. But nothing astonishing.

Visualisation – gender diversity forecasting in 2018

The last step of the study was to forecast, at a basic level, these percentages for 2018. I used the forecast package, fitting a model to the data collected between 2007 and 2017 with auto.arima and predicting one step ahead with forecast:

library(forecast) # auto.arima() and forecast()

# Time series
ts <- ts(percentages, start = 2007, end = 2017, frequency = 1)

# Auto forecast directors 2018
arma_fit_director <- auto.arima(ts[ ,2])
arma_forecast_director <- forecast(arma_fit_director, h = 1)
dir_2018 <- arma_forecast_director$mean[1] # one-step-ahead forecast for 2018

# Idem for writers, producers, sound, music, art, makeup and costume

# Create a data frame for 2018 fitted values
percentages_2018 <- data.frame(year = ymd(2018, truncated = 2L), 
                               women_directors = dir_2018, 
                               women_writers = writ_2018, 
                               women_producers = prod_2018, 
                               women_sound = sound_2018,
                               women_music = music_2018,
                               women_art = art_2018,
                               women_makeup = makeup_2018,
                               women_costume = costu_2018, 
                               stringsAsFactors = FALSE)

# Values from 2007 to 2017 + 2018 fitted values
percentages_fitted_2018 <- bind_rows(percentages, percentages_2018)
# From wide to long dataframe
colnames(percentages_fitted_2018) <- c("year", "directors", "writers","producers", "sound",    
                                      "music", "art", "makeup", "costume")
percentages_long_f2018 <- percentages_fitted_2018 %>%
  gather(key = category, value = percentage, -year)
percentages_long_f2018$year <- ymd(percentages_long_f2018$year, truncated = 2L) # year as date

# Forecast plot for 2018 
forecast_2018 <- ggplot(percentages_long_f2018, aes(x = year,
                                                    y = percentage,
                                                    group = category,
                                                    colour = category)) +
  geom_line(size = 2)+
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women in the film industry from 2007 to 2017\n Fitted values for 2018",
       x = "",
       y = "Percentages")

The predicted values I got for 2018 are approximately the same as the ones I calculated for 2017. However, it is a basic forecast, and it does not take into consideration the upheaval which happened in the film industry in 2017. This will surely have an impact on the gender diversity in the film industry. But to what extent? Has general awareness been sufficiently increased to truly achieve change?

In any case, I sincerely hope that our forecasting is wrong and that a constant improvement will be seen in the next couple of years, so that female characters on cinema screens will become more interesting and complex.


As an R user who is building models and analysing data, one of the key challenges is how to make those results available to the people who need them. After all, data science is about making better decisions, and your results need to get into the hands of the people who make those decisions.

For reporting, there are many options, from writing Excel files to R Markdown documents and Shiny apps. Many businesses already have great reporting with a business intelligence (BI) tool. For them, it is preferable that you present your results alongside a number of other critical business metrics. Moreover, your results need to be refreshed daily. In this situation, you might be working with SQL developers to integrate your work. The question is: what is the best way to deliver R code to the BI team?

In this blog post, we will be looking at the specific case of deploying a predictive model, written in R, to a Microsoft SQL Server database for consumption by a BI tool. We’ll look at some of the different options to integrate R, from in-database R services, to pushing with ODBC or picking up flat files with SSIS.

The Problem

Flight delay planning

To demonstrate, we’ll use the familiar flights dataset from the nycflights13 package and imagine that we are airport planners wanting to test various scenarios related to flight delays. Our data contains the departure delay of all flights leaving the New York airports JFK, LGA and EWR in 2013. I’m running this code on my Windows 10 laptop, where I have a local SQL Server 2017 instance running, with a database called ml. If you want to reproduce the code you’ll need your own SQL Server setup (you can install it locally) and to push the flights table there. Here’s a selection of columns:

SELECT TOP(5) flight, origin, dest, sched_dep_time, carrier, time_hour, dep_delay
FROM flights
flight origin dest sched_dep_time carrier time_hour dep_delay
1545 EWR IAH 515 UA 2013-01-01 05:00:00 2
1714 LGA IAH 529 UA 2013-01-01 05:00:00 4
1141 JFK MIA 540 AA 2013-01-01 05:00:00 2
725 JFK BQN 545 B6 2013-01-01 05:00:00 -1
461 LGA ATL 600 DL 2013-01-01 06:00:00 -6

We’ll fit a statistical model for the departure delay, and run simulations for the delay of future flights. We want to capture the natural variation from day to day so a useful approach here is a mixed-effects model where each day is a random effect.

model <- lme4::lmer(
    dep_delay ~ 1 +
      (1 | date:origin) +
      carrier +
      origin +
      sched_dep_time +
      distance +
      week_day,
    data = data_train
  )

This is a simple model for demonstration purposes. For example, it doesn’t capture big delays (extreme values) well, but it will serve our purpose. The full model code and data prep is available at mangothecat/dbloadss so we won’t go through every line here.

To simulate delays on future flights we can call the simulate function. Here we’ll run 10 different scenarios:

sim_delays <- simulate(model, nsim = 10, newdata = data_test)

The reason we’re using simulate rather than predict is that we don’t just want the most likely value for each delay, we want to sample from likely scenarios. That way we can report any aspect of the result that seems relevant. A data scientist will ask: “how can I predict dep_delay as accurately as possible?”. An airport manager will want to know “how often will the last flight of the day leave after midnight?”, or another question that you haven’t thought of. This type of output lets the BI team address these sorts of questions.
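
Questions like these can then be answered by summarising across the simulated scenarios. A sketch of one such summary (the one-hour threshold is purely illustrative, and the reshaping assumes sim_delays has one column per scenario as returned by simulate):

library(dplyr)
library(tidyr)

# One row per flight, one column per simulated scenario; reshape to long format
delays_long <- sim_delays %>%
  mutate(flight_row = row_number()) %>%
  pivot_longer(-flight_row, names_to = "sim_id", values_to = "dep_delay")

# For each simulated scenario, the share of flights delayed by more than an hour
delays_long %>%
  group_by(sim_id) %>%
  summarise(prop_over_hour = mean(dep_delay > 60))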

Package it up

At Mango, we believe that the basic unit of work is a package. A well-written package will be self-documenting, have a familiar structure, and include unit tests. All behind-the-scenes code can be written into unexported functions, and user-facing code lives in a small number (often one) of exported functions. This single entry point should be designed so that someone who is not an experienced R user can run the code and, if anything goes wrong, get a message that is as informative as possible. R is particularly friendly for building packages, with the excellent devtools automating most of it, and the wonderfully short R Packages book by Hadley guiding you through it all.
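
A minimal sketch of that workflow with usethis and devtools (generic steps, not the exact commands used to build dbloadss):

library(usethis)
library(devtools)

create_package("dbloadss")             # scaffold DESCRIPTION, NAMESPACE and R/
use_r("simulate_departure_delays")     # create R/simulate_departure_delays.R
use_test("simulate_departure_delays")  # set up testthat and a matching test file
document()                             # generate help files from roxygen comments
test()                                 # run the unit tests
check()                                # run R CMD check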

The code for this blog post lives in the dbloadss package available on GitHub. For the flights model, a single function, simulate_departure_delays, is exported; it is documented to explain exactly what it expects as input and what it will output. The entire model runs with the single line:

output_data <- simulate_departure_delays(input_data, nsim = 20)

where the input_data is prepared from the database and output_data will be pushed/pulled back to the database.

Connecting to the Database

Push, Pull, or Pickup?

Once the model has been packaged and the interface decided, it remains to decide how to actually run the code. With SQL Server there are three options:

  1. Run the model from R and push the results to SQL Server using an ODBC connection.
  2. Call the model from SQL Server using a stored procedure to run an R script using R Services and pull the results back.
  3. Invoke an Rscript from SSIS and pick up flat files (CSV).

Which you choose will depend on a number of factors. We’ll take some time to look at each one.

The Push (SQL from R)

The best way to talk to a database from R is to use the DBI database interface package. The DBI project has been around for a while but received a boost with R Consortium funding. It provides a common interface to many databases, integrating specific backend packages for each separate database type. For SQL Server we’re going to use the odbc backend. It has great documentation, and since Microsoft released ODBC drivers for Linux it’s a cinch to set up on most operating systems.

Let’s get the flights data from SQL Server:

library(DBI)
con <- dbConnect(odbc::odbc(),
                 driver = "SQL Server",
                 server = "localhost\\SQL17ML",
                 database = "ml")

flights <- dbReadTable(con, Id(schema="dbo", name="flights"))

I’ve included the explicit schema argument because it’s a recent addition to DBI and it can be a sticking point for complicated database structures.
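
If you’re unsure how the schemas are laid out, a quick check before reading anything in can save some head-scratching:

# should return TRUE if the dbo.flights table is visible to this login
dbExistsTable(con, Id(schema = "dbo", name = "flights"))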

Now we run the model as above:

library(dbloadss)
output_data <- simulate_departure_delays(flights, nsim = 20, split_date = "2013-07-01")
dim(output_data)
## [1] 3412360       3

So for 20 simulations, we have about 3.4 million rows of output! It’s just a flight ID (for joining back to the source), a simulation ID, and a delay.

head(output_data)
##      id sim_id dep_delay
## 1 27005      1 -49.25918
## 2 27006      1  23.09865
## 3 27007      1  28.84683
## 4 27008      1 -43.68340
## 5 27009      1 -44.98220
## 6 27010      1  37.62463

We’ll do all further processing in the database so let’s push it back.

# Workaround for known issue https://github.com/r-dbi/odbc/issues/175
dbRemoveTable(con, name = Id(schema = "dbo", name = "flightdelays"))

odbctime <- system.time({
  dbWriteTable(con,
               name = Id(schema = "dbo", name = "flightdelays"),
               value = output_data,
               overwrite = TRUE)
})
odbctime
##    user  system elapsed 
##   10.08    0.47  114.60

That took under 2 minutes. This post started life as a benchmark of write times for odbc vs RODBC, an alternative way to talk to SQL Server. The results are on the dbloadss README and suggest the same write would take several hours with RODBC! RODBC is usually fine for reads, but we recommend switching to odbc where possible.
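
For reference, the RODBC side of that comparison looked something like the sketch below (the connection string and table name are illustrative); the full benchmark is in the dbloadss README.

library(RODBC)

rodbc_con <- odbcDriverConnect(
  "driver={SQL Server};server=localhost\\SQL17ML;database=ml;trusted_connection=true"
)

# same output_data as above, written via RODBC for comparison
rodbctime <- system.time({
  sqlSave(rodbc_con, output_data, tablename = "flightdelays_rodbc",
          rownames = FALSE, fast = TRUE)
})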

It is relatively straightforward to push from R and this could run as a scheduled job from a server running R.
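
On a Linux server, for example, the cronR package is one convenient way to register such a job; the script path here is a placeholder for whatever wraps the pull, simulate and push steps.

library(cronR)

# build the command line for a script that pulls flights, runs the
# simulations and writes the results back via odbc (hypothetical path)
cmd <- cron_rscript("/opt/models/push_flight_delays.R")

# run it every night at 2am, when the database is quiet
cron_add(cmd, frequency = "daily", at = "2AM",
         id = "flight_delay_simulations",
         description = "Nightly departure delay simulation")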

The Pull (R from SQL)

An alternative approach is to use the new features in SQL Server 2017 (and 2016) for calling out to R scripts from SQL. This is done via the sp_execute_external_script command, which we will wrap in a stored procedure. This method is great for SQL developers because they don’t need to go outside their normal tool, and they have greater control over exactly what is returned and where it goes.

A word of warning before we continue. There’s a line in the Microsoft docs:

Do not install R Services on a failover cluster. The security mechanism used for isolating R processes is not compatible with a Windows Server failover cluster environment.

that somewhat limits the ultimate use of this technique in production settings, as a failover cluster is a very common configuration. Assuming this might get fixed, or perhaps it’s not an issue for you, let’s see how it works.

I’m running SQL Server 2017 with Machine Learning Services. This installs its own version of R that you can access. To do so you have to enable “the execution of scripts with certain remote language extensions”. Then you need to install the dbloadss package somewhere that this version of R can see. This can require admin privileges, or alternatively, you can set the .libPaths() somewhere in the stored proc.
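
In practice that means either installing into the ML Services library as an administrator, or pointing .libPaths() at a library you control from inside the stored procedure. Something like the following, where the paths are illustrative rather than prescriptive:

# Option 1: from the R that ships with ML Services, install a local build
# of the package into its default library
install.packages(
  "C:/temp/dbloadss_0.0.1.tar.gz",  # illustrative path to a source build
  repos = NULL, type = "source"
)

# Option 2: at the top of the @script block in the stored procedure,
# point that R session at a library you control
.libPaths(c("C:/RLibraries", .libPaths()))
library(dbloadss)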

The following stored procedure is what we’ll add for our flight delays model:

use [ml];

DROP PROC IF EXISTS r_simulate_departure_delays;
GO
CREATE PROC r_simulate_departure_delays(
    @nsim int = 20,
    @split_date date = '2013-07-01')
AS
BEGIN
 EXEC sp_execute_external_script
     @language = N'R'  
   , @script = N'
    library(dbloadss)
    output_data <- simulate_departure_delays(input_data, nsim = nsim_r,
                                             split_date = split_date_r)
' 
   , @input_data_1 = N' SELECT * FROM [dbo].[flights];'
   , @input_data_1_name = N'input_data'
   , @output_data_1_name = N'output_data'
   , @params = N'@nsim_r int, @split_date_r date'
   , @nsim_r = @nsim
   , @split_date_r = @split_date
    WITH RESULT SETS ((
        "id" int not null,   
        "sim_id" int not null,  
        "dep_delay" float not null)); 
END;
GO

The query that goes into @input_data_1 becomes a data frame in your R session. The main things to note are that you can pass in as many parameters as you like, but only one data frame. Your R script assigns the results to a nominated output data frame, and this is picked up and returned to SQL Server.

I believe it’s very important that the R script inside the stored procedure does not get too complicated. It’s much better to use your single entry point and keep the complex code in a package, where it can be unit tested and documented.

We then call the stored procedure with another query:

INSERT INTO [dbo].[flightdelays]
EXEC [dbo].[r_simulate_departure_delays] @nsim = 20

The performance of this method seems to be good. For write speeds, in our tests it was even faster than pushing with odbc, although it’s harder to benchmark in the flights example because the timing includes running the simulation.

Overall, were it not for the issue with failover clusters, I would be recommending this as the best way to integrate R with SQL Server. As it stands, you’ll have to evaluate it against your own setup.

The Pickup (R from SSIS)

The final method is to use SSIS to treat running the R model as an ETL process. To keep things simple we use SSIS to output the input data as a flat file (csv), kick off an R process to run the job, and pick up the results from another csv. This means that we’ll be making our R code run as a command line tool and using a csv “air gap”.

Running R from the command line is relatively straightforward. To handle parameters we’ve found the best way is to use argparser, with an honourable mention to optparse. Check out Mark’s blog post series on building R command line applications. After you’ve parsed the arguments, everything is essentially the same as pushing straight to the database, except that you write to csv at the end. SSIS then picks up the csv file and loads it into the database. Performance is generally not as good as the other methods, but in our experience it was close enough, especially for a batch job.
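
Here’s a stripped-down sketch of what such a wrapper can look like with argparser; the flags match the ones used in the run below, but the real script, linked below, does a little more.

library(argparser)
library(readr)
library(dbloadss)

p <- arg_parser("Run departure delay simulations")
p <- add_argument(p, "--nsim", short = "-n", default = 10,
                  help = "number of simulations to run")
p <- add_argument(p, "--split_date", short = "-d", default = "2013-07-01",
                  help = "train on flights before this date, simulate the rest")
p <- add_argument(p, "--input_path", short = "-i", default = "flights.csv",
                  help = "csv of flights dropped by SSIS")
p <- add_argument(p, "--output_path", short = "-p", default = "simulated_delays.csv",
                  help = "where to write the simulated delays")
argv <- parse_args(p)

flights <- read_csv(argv$input_path)   # the real script pins down column types
delays  <- simulate_departure_delays(flights, nsim = argv$nsim,
                                     split_date = argv$split_date)
write_csv(delays, argv$output_path)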

A full example of this script is on GitHub. We can run it from the command line like this:

> Rscript blog/flight_delay.R -n 10 -d '2013-07-01' -i flights.csv
Loading required package: methods
Running delay simulations:
   = FALSE
  help = FALSE
  verbose = FALSE
  opts = NA
  nsim = 10
  split_date = 2013-07-01
  input_path = flights.csv
  output_path = simulated_delays.csv
Reading... Parsed with column specification:
cols(
  .default = col_integer(),
  carrier = col_character(),
  tailnum = col_character(),
  origin = col_character(),
  dest = col_character(),
  time_hour = col_datetime(format = "")
)
See spec(...) for full column specifications.
Read  336776  rows
Running simulations...
Writing  1706180  rows
Done.

When running this from SSIS, it can call Rscript directly and the arguments can be supplied as SSIS variables.

The SSIS solution has some great advantages in that it is controlled by the SQL developers, it has the greatest separation of technologies, and it’s easy to test the R process in isolation. The downsides are that going via csv is unlikely to be the fastest route, and you need to be a little more careful about data types when reading the csv into R, as sketched below.
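
On the data types point, being explicit when reading the file back in avoids surprises from readr’s type guessing; the column specification below simply mirrors the one printed in the run above.

library(readr)

flights <- read_csv(
  "flights.csv",
  col_types = cols(
    .default  = col_integer(),
    carrier   = col_character(),
    tailnum   = col_character(),
    origin    = col_character(),
    dest      = col_character(),
    time_hour = col_datetime(format = "")
  )
)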

The Results

If everything goes well the results of your delays simulation will land in the database every night, and every morning reports can be built and dashboards updated. The results look something like this:

SELECT TOP(5) * FROM flightdelays

id     sim_id  dep_delay
27005       1  -27.48316
27006       1   46.50636
27007       1   65.94304
27008       1  -78.93091
27009       1  -17.86126

Your work is not totally done; there is still the cost of getting the dashboards set up. The simulation results are not always the easiest to get your head around, so it helps if you can set the BI team up with a few queries to get started. For example, to generate a daily average delay for the airline UA, they would need something like the following:

WITH gp AS (
SELECT sim_id
      ,origin
      ,day_date
      ,avg(fd.dep_delay) as mean_delay
  FROM dbo.flightdelays fd
  LEFT JOIN (SELECT *, convert(date, time_hour) as day_date FROM dbo.flights) fs on fd.id = fs.id
  WHERE fs.carrier='UA'
  GROUP BY sim_id, origin, day_date
)

SELECT day_date, origin
     , avg(mean_delay) as mean_delay
     , min(mean_delay) as min_delay
     , max(mean_delay) as max_delay
FROM gp
GROUP BY origin, day_date
ORDER BY day_date, origin

So first you aggregate within each simulation, then you aggregate across simulations. Your BI team’s SQL is better than yours (and definitely mine), so this can be a useful way to get feedback on your code, and it’s also a good way to explain to them what your model does (code speaks louder than words). Loading a table like this into a BI tool (for example Power BI) gives you all the plots you’re used to.
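
If you want to sanity-check the SQL against the R side, the same two-stage aggregation in dplyr looks roughly like this, assuming flights and output_data are still in memory and that flights carries the same id column used in the join above:

library(dplyr)

output_data %>%
  inner_join(
    flights %>%
      mutate(day_date = as.Date(time_hour)) %>%
      select(id, carrier, origin, day_date),
    by = "id"
  ) %>%
  filter(carrier == "UA") %>%
  # first aggregate within each simulation...
  group_by(sim_id, origin, day_date) %>%
  summarise(mean_delay = mean(dep_delay), .groups = "drop") %>%
  # ...then summarise across simulations
  group_by(origin, day_date) %>%
  summarise(
    min_delay  = min(mean_delay),
    max_delay  = max(mean_delay),
    mean_delay = mean(mean_delay),
    .groups    = "drop"
  ) %>%
  arrange(day_date, origin)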

Power BI Screenshot of Results

It turns out the daily variation means you don’t learn much from this plot. Maybe it’s time to go and look at that manager’s question about late flights after all.

Conclusion

After spending a bit of time delivering R models for business reporting here are my main takeaways:

  1. Data Science does not exist in a vacuum. Think about how to get your results into the hands of decision-makers from the beginning. Think about their tools and workflow and work back to how you can deliver.
  2. Always serve your code from a package with as few exported functions as possible. Make it as simple as possible for a non-R (or Python) user to run the code. Document everything that goes in and out and defend your code like crazy from bad input. The rest is your domain.
  3. Push as much processing as possible back to the database. It gives maximum flexibility for the BI team to produce all the reports that they might be asked for.
  4. When it works, I really like the in-database R method. It gives the power to the database team and performs really well. Unfortunately, right now the issue with failover clusters is a deal-breaker for me, so falling back to SSIS has worked just fine.

In case you hadn’t spotted it already, a major new release of ggplot2 hit CRAN last week. Among the many new features, the one that caught my attention was support for simple features. Or in other words, it’s now really easy to plot spatial data in ggplot2.

On Friday afternoon I spent a couple of hours playing with this new functionality as well as mapview, a package I hadn’t come across before but discovered on Friday. This blog is a quick walkthrough to show you how simple spatial visualisation in R really is.

In case you want to follow along, the full list of packages used were:

library(tidyverse)
library(ggplot2)
library(sf)
library(mapview)
library(rvest)
library(magrittr)

The Data

Whenever I start to play with new functions I always try to use my own data. In this case, thanks to a recent project, I happened to have a shape file for Italy to hand (you can download it for yourself from Istat).

Initially, I just plotted the region level data with random colours (as you may have already seen on Twitter) but to reflect my usual workflow I found some data on the amount of wine produced by region – it was Friday afternoon after all!

I won’t go into all the detail of the web scraping and data manipulation, but the code is included throughout so that you can replicate this example.

wine <- read_html("https://italianwinecentral.com/wine-production-in-italy-by-region/")
wine %<>% 
  html_table() %>% 
  extract2(1) %>%
  slice(-21)
wine[,-1] %<>% map_df(parse_number)

This gave me an interesting dataset to play with that wasn’t actually too dissimilar to what I often end up trying to plot, i.e. change over time.

head(wine)

  Region                  2012  2013  2014  2015  2016
  <chr>                  <dbl> <dbl> <dbl> <dbl> <dbl>
1 Abruzzo                 2443  2728  2273  2936  2937
2 Basilicata               189   178   102    87    93
3 Calabria                 400   370   314   404   391
4 Campania                1542  1644  1183  1614  1286
5 Emilia Romagna          6273  7396  6958  6752  7039
6 Friuli-Venezia Giulia   1281  1073  1367  1872  1856

Importing Shape Files

Now for the interesting part.

To import the spatial data I used the sf package. This package makes reading the data very simple and stores it in a way that is much easier to reason about, especially for data that describes regions.

To import the shape file we can use the st_read function, in this case pointing to the file location, although you can also point to a database. Once the data is imported we have a single row for each region. The data includes a geometry column, which is where the spatial information is defined. In this case the geometry stores (as a list-column) all of the information relating to the polygon of the region, but it could also be a single point or a whole host of other information. Check out the vignettes for more details.

italy <- st_read("./Limiti_2016_ED50_g/Reg2016_ED50_g/Reg2016_ED50_g.shp")
head(italy)

  COD_REG REGIONE              SHAPE_Leng  SHAPE_Area               geometry
    <int> <fctr>                    <dbl>       <dbl> <S3: sfc_MULTIPOLYGON>
1       1 Piemonte              1235685.5 25394259406 <S3: sfc_MULTIPOLYGON>
2       2 Valle D’Aosta          311141.5  3258954558 <S3: sfc_MULTIPOLYGON>
3       3 Lombardia             1410619.8 23862504702 <S3: sfc_MULTIPOLYGON>
4       5 Veneto                1058657.7 18406022144 <S3: sfc_MULTIPOLYGON>
5       4 Trentino-Alto Adige    800899.4 13607742436 <S3: sfc_MULTIPOLYGON>
6       7 Liguria                825474.4  5415107538 <S3: sfc_MULTIPOLYGON>

As this is just a simple data frame, it can be manipulated in the same way as any other data frame. In this case that means a small change to the region names so that they match exactly, and a join so that I have everything I want in one data set, ready to start plotting.

levels(italy$REGIONE) <- wine$Region
WineLocation <- full_join(italy, wine, by = c("REGIONE" = "Region"))

New Features of ggplot2

Now for the new features of ggplot2. There are a lot of them this time around, many related to the specification of variables for programming purposes, but I was most interested in how easy it would be to plot my Italy data. It turns out that the answer is: extremely easy.

Point to the data, tell it which column to use as the fill and add the new geom_sf layer. Just to try it out, I have added the new viridis scale layer, but that is all I needed to do. All of the projections have been managed for me, the colours look much more impressive than the defaults, and all of the axis labels and titles have been handled the way I would want them. Everything has just worked without me having to do any thinking.

ggplot(data = WineLocation, aes(fill = `2016`)) + 
  geom_sf() + 
  scale_fill_viridis_c()

Taking this even further, I happened to pick a data set with multiple years in it. A little manipulation to get the data into a better format for plotting, an additional line in my plotting code, and I have facetted maps. Again, just to try it out, I have used the new vars() specification for facetting. Don’t worry, the formula interface is still there, and this is one of the features that will make programming with ggplot2 much easier. But this alternative is certainly very simple and may even be easier for some ggplot2 novices.

wineLong <- gather(wine, Year, Quantity, -Region)
fullWineLong <- full_join(italy, wineLong, by = c("REGIONE" = "Region"))

ggplot(data = fullWineLong, aes(fill = Quantity)) + 
  geom_sf() + 
  facet_wrap(vars(Year)) +
  scale_fill_viridis_c() 

I tried this with all the shape files I could get my hands on and I am pleased to report it was just as easy. As it’s ggplot2, I was able to add layers as I wanted them, including additional shape files, and all the projections were handled for me – although plotting the US and Italy on one map did look a little odd by default!
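
Layering works just as you’d expect. The sketch below assumes a second, hypothetical shape file; coord_sf takes care of putting both layers onto a common projection (by default the CRS of the first layer):

# a second, hypothetical shape file; the path is a placeholder
usa <- st_read("./us_states/us_states.shp")

ggplot() +
  geom_sf(data = italy, fill = "grey85") +
  geom_sf(data = usa,   fill = "steelblue", alpha = 0.4)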

A quick look at mapview

Whilst I was reviewing the vignettes for sf I came across an example of using mapview. Despite this package having been on CRAN for a few years I had never come across it before, but I am very glad I have now.

In a nutshell, it is a wrapper around leaflet. The good thing is that it works very easily with simple features data. There are a huge number of options, but to recreate an interactive version of my first map I simply needed to define the data, the column that defines the colouring (zcol) and, as an extra, the column that provides labels when I hover over the map, in this case the region names.

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE)

You can take a look at the map created on RPubs.

There are a huge number of options to mapview and I certainly haven’t tried them all out yet, but some of the notable features include:

  • Adding additional shape files by simply adding two mapview objects together
  • Adding leaflet layers, using the usual leaflet functions
  • Including multiple layers of data
  • Defining the pop-up functionality

This last one is particularly interesting as you can pop up not only tables of data but also graphics. As an example, suppose that we want to be able to click on a region of interest and see the change in wine production over time. All we have to do is create a list object that includes all of the graphics that we need (in the same order as the data) and pass this to the popup option using the popupGraph function.

plotByRegion <- function(region){
  fullWineLong %>%
    filter(REGIONE == region) %>%
  ggplot() +
    geom_line(aes(Year, Quantity, group = REGIONE)) +
    labs(title = region)
}

p <- lapply(WineLocation$REGIONE, plotByRegion)

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE, 
          popup = popupGraph(p))

The documentation for mapview does state that it is aimed at quick visualisation for exploratory purposes rather than for production-ready graphics and it really suited that purpose for me. Personally, I can see my usage becoming a combination of both mapview and leaflet for interactive graphics and ggplot2 for static.

Right now I am very happy that I have an incredibly quick and easy workflow for getting my spatial data into R and visualising it without having to worry about map projections and little details that distract me from analysis.


Linux containers, of which Docker is the most well known, can be a really great way to improve reproducibility on your data projects (for more info see here), and create portable, reusable applications. But how would we manage the deployment of multiple containerised applications?

Kubernetes is an open source container management platform that automates the core operations for you. It allows you to automatically deploy and scale containerised applications and removes the manual steps that would otherwise be involved. Essentially, you cluster together groups of hosts running Linux containers, and Kubernetes helps you easily and efficiently manage those clusters. This is especially effective in cloud based environments.

Why use Kubernetes in your data stack?

Since Kubernetes orchestrates containers and since containers are a great way to bundle up your applications with their dependencies — thus improving reproducibility — Kubernetes is a natural fit if you’re aiming for high levels of automation in your stack.

Kubernetes allows you to manage containerised apps that span multiple containers as well as scale and schedule the containers as necessary across the cluster.

For instance, if you’re building stateless microservices in flask (Python) and plumber (R) it’s easy to initially treat running them in containers as though they were running in a simple virtual machine. However, once these containers are in a production environment and scale becomes much more important, you’ll likely need to run multiple instances of the containers and Kubernetes can take care of that for you.
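
For context, a plumber service of that kind can be as small as the sketch below; bake it into an image and Kubernetes can run as many replicas as demand requires (the endpoints and model call are purely illustrative):

# plumber.R -- a minimal stateless API of the kind you might containerise

#* Health check, handy for Kubernetes liveness/readiness probes
#* @get /health
function() {
  list(status = "ok")
}

#* Score a single observation (the model call here is a stand-in)
#* @param x a numeric input
#* @post /predict
function(x) {
  list(prediction = as.numeric(x) * 2)
}

Each replica started with plumber::plumb("plumber.R")$run(host = "0.0.0.0", port = 8000) inside its container is identical and disposable, which is exactly the kind of workload Kubernetes is designed to manage.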

Automation is a key driver

When container deployments are small it can be tempting to try to manage them by hand, starting and stopping the containers that are required to service your application. But this approach is very inflexible, and beyond the smallest of deployments it is not really practical. Kubernetes is designed to manage the complexity of looking after production scale container deployments, which can quickly reach a size and level of complexity where error-prone manual management is no longer realistic.

Scheduling is another often overlooked feature of Kubernetes in data processing pipelines: you could, for example, schedule model refreshes to keep them up to date. Such processes could be scheduled for times when you know the cluster will otherwise be quiet (such as overnight, or at weekends), with the refreshed model being published automatically.

The Case for Kubernetes in your data stack

More broadly, it helps you fully implement and rely on a container-based infrastructure in production environments. This is especially beneficial when you’re trying to reduce infrastructure costs as it allows you to keep your cluster size at the bare minimum required to run your applications, which in turn saves you money on wasted compute resource.

The full list of Kubernetes features is too long to cover here, but the key things to take away are that it can run containerised apps across multiple hosts, scale applications on the fly, auto-restart applications that have fallen over, and help automate deployments.

The wider Kubernetes ecosystem relies on many other projects to deliver these fully orchestrated services. These projects provide additional features such as registry services for your containers, networking, security and so on.

Kubernetes offers a rich toolset to manage complex application stacks and, with data science, engineering and operations becoming increasingly large scale, automation is a key driver for many new projects. If you’re not containerising your apps yet, jumping into Kubernetes can seem daunting, but if you start small by building out some simple containerised applications, the benefits of this approach should become clear pretty quickly.

For an in-depth technical look at running Kubernetes, this post by Mark Edmondson offers an excellent primer.