
Just after starting at Mango, I made the decision to start learning Italian. I have always been interested in learning languages and I was really keen to go back to Italy, so I thought it would be something fun to do outside of work. It turned out to have a much greater impact on my work than I expected – and not just because of project work based in Italy.

In the early stages of my learning, I read “Fluent in 3 Months” by Benny Lewis. If you have never heard the name before, Benny Lewis is a polyglot (someone who speaks multiple languages) who left school unable to speak anything other than English. After six months living in Spain and failing to learn the language, he switched his approach. In a matter of weeks, he was speaking Spanish with natives. Now he can speak a dozen languages to varying degrees. And this really got me thinking. Could his approach be applied to R and Python? Could we get more people engaged in the languages more quickly?

The “Language Hacking” Approach

To start, let’s consider the approach that Benny Lewis advocates for learning languages.

Part of the approach is to speak the language from the start, not a few days or weeks in, but on the first day. And to continue to speak the language every day. It’s a simple but powerful idea. You arrange to have a conversation with a native speaker of the language and prepare a few sentences to have a basic conversation. It doesn’t have to be long, it can literally be three or four sentences, but you are using the language from the start.

Combined with speaking from the start is the idea of “language hacking”, and this is really what makes the technique powerful: the idea that you don’t need to know everything to be able to use a language. Think about that conversation you are having on day one. You won’t know how to conjugate verbs, all the rules of sentence structure or all the vocabulary, but you can certainly use a phrase book to find out how to ask “how are you?” and “what is your name?”, and respond “I am well” and “my name is Aimee”.

The fundamental concept of the approach that Benny Lewis proposes is to learn what you need to communicate right away. Don’t spend months learning grammar and rules and hope that this will be enough to get by, just start to speak.

There are of course challenges to this approach, and the biggest is the knowledge that you will make mistakes. Fear of making a mistake is typically the main blocker to language learning, but the reality is that nothing bad will happen if you try and get it wrong. Generally the people you are talking to will helpfully point you in the right direction and you will learn from it. Once you get over this fear you can very quickly learn a language.

Language Hacking for R and Python

At the time that I read Benny Lewis’ book, I was just starting to teach more and I was interested in whether it would be possible to teach R (and Python) this way. But what does language hacking mean for programming for data science?

The answer is simple: it means the same thing. If you want to “hack” learning R and Python for data science, you focus on learning the code that you need to do what you need to do. Don’t worry about the details of programming, put aside the ins and outs of functional or object-oriented programming, forget the technical language. Just focus on getting things done.

For data scientists that typically means starting with a basic workflow. Your first “conversation” will typically be more along the lines of loading some data and generating some summaries. Let’s think about that example for a moment.

Suppose that I am going to read the iris data from a csv file and find the mean sepal length. How much code does that take? Three or four lines. Do we need to spend hours or possibly days studying the rules of the language first, or can we simply jump in with those lines? We can, and should, jump straight in and teach those three or four lines right at the start. Put yourself in the shoes of the learner. If after just minutes of learning you can see a result that is meaningful and useful, the chances are you will keep going and you will want to learn more. What’s more, you will start to experiment with that code. You will see how you can make changes to the code to find the mean of a different column, or maybe you will think about finding other summaries. You are no longer just learning rules of a language to be implemented, you are actively living the language.
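To make that first “conversation” concrete, here is a minimal sketch in R, assuming the iris data has been saved locally as iris.csv with its standard column names:

library(readr)
library(dplyr)

# Read the data, then find the mean sepal length
iris_data <- read_csv("iris.csv")

iris_data %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

A handful of lines, and the learner already has a meaningful result to tinker with.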

You can very quickly build these “conversations” up to include grouping, performing common manipulation tasks and creating visualisations. In no time at all, you will be doing analytics, modelling and machine learning.
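Sticking with the same hypothetical iris example, those follow-up “conversations” need only a few more lines. A sketch, assuming dplyr and ggplot2 are available:

library(ggplot2)

# Grouped summary: mean sepal length per species
iris_data %>%
  group_by(Species) %>%
  summarise(mean_sepal_length = mean(Sepal.Length))

# A first visualisation of the same data
ggplot(iris_data, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot()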

At Mango, we switched our R training to this approach around three years ago and we haven’t looked back. Our trainers no longer teach programming the traditional way: they all teach the hacking approach, and they all come back from teaching with the same success stories. It took a little longer to convince our Python team to make the same change, but it is now the approach we take with all of our training. From a personal perspective, after years of avoiding Python because I didn’t want to spend weeks learning to program, I was the first tester for the Python version of our training. Now I am comfortable running some of my common analyses in Python, and whilst I still make mistakes and it takes me a bit longer than when I write R, I finally have the confidence to consider Python as a solution as well as R, and I can talk more confidently to my Python-using colleagues.

In practical terms, I would strongly recommend focusing on the tidyverse for R and pandas for Python, with seaborn for graphics. These packages have been designed to make the tasks that we perform regularly with data easy and accessible, so if we are trying to hack our approach to learning and be able to use the languages quickly, why would we use anything else?

But What About Grammar?

You can get a long way in a language without learning lots of grammar. Think about how you learned your native language: I don’t remember being taught grammar when I started to speak, but I could still communicate effectively. My friends are not actively teaching their pre-school aged children grammar, but those children can communicate, and whilst it is not always in the best way, they can get their message across. But eventually, to really master a language, you do need to get to grips with the grammar.

So those of you who are passionate about the detail of R or Python, who like the “best” way to do things, who want to promote programming paradigms and philosophies: don’t worry. There is still a place for this. It just doesn’t come first, and it isn’t necessary for everyone.

If all I want to do with Python is import data, and run some analytics, then I really don’t need to worry about more than what I have achieved through language hacking. If I want to master it and be able to produce tools that are used by a wider community, then I do need to know more. The good news is that this is much easier for a programming language than a spoken one. There are rarely exceptions to rules for a start, and you don’t have to learn an endless stream of tenses!

But we can do more to make even this accessible to learners. We can help them to understand the practical applications. We can focus on immediate needs rather than eventualities. We can provide constructive feedback that helps learners to develop their skills.

By making even the detail of a language interesting and accessible we ultimately end up with greater numbers of people who can speak the language and contribute to its success. But we must start with practical code that achieves a specific goal and leave the grammar for later.

Why build a recommender system?

The most wonderful and most frustrating characteristic of the Internet is its excessive supply of content. As a result, many of today’s commercial giants are not content providers, but content distributors. The success of companies such as Amazon, Netflix, YouTube and Spotify relies on their ability to effectively deliver relevant and novel content to users. However, with such a vast array of content at their fingertips, the search space becomes near impossible to navigate with traditional search methods. It is therefore essential for businesses to exploit the data at their disposal to find similarities between products and user behaviours, in order to make relevant recommendations to users.

The importance of this is further emphasised by phenomena such as the Long Tail, a term popularised by Chris Anderson’s iconic 2004 blog post. This refers to the fact that a large percentage of online distributors’ revenue comes from the sale of less popular items, for which they are able to find a market thanks to their recommendation engines. “If the Amazon statistics are any guide, the market for books that are not even sold in the average bookstore is larger than the market for those that are”.

Another interesting example is Spotify, a company which invests heavily in recommendation, since one of its selling points is its ability to build perfectly curated playlists for individual users. A lesser-known ulterior motive of Spotify’s recommendations is the need to reduce licensing costs, which are currently growing at a faster rate than revenue. By recommending relevant songs by emerging artists, which attract lower licensing fees, the company can reduce its average cost per listen. Similarly, any business with a large product range might find a recommendation engine useful for identifying which products to push to certain customers.

Figure 1 – Anatomy of the Long Tail (https://wired.com/2004/10/tail/)

Now, I hear you cry, “what considerations does one have to make when building a recommender system?” Well, I’m glad you asked!

Business Challenges

What metric are we optimising for?

First and foremost, we are trying to solve a business need. Even if an algorithm perfectly predicts a user’s movie rating, this might not translate to a higher-level business metric. What are we optimising for? User retention? An increase in sales? How many items has the average user bookmarked or purchased? How many recommended items have users clicked on? These goals will vary for different business contexts. Even with a well-defined business objective, these are only observations we can make after a model has been trained and deployed, and many successive iterations of A/B testing will be required to establish the usefulness of a model.

Different user profiles

To add further complexity, it is possible that some users respond better to one type of model over another. Then the question arises: do we use varying algorithms for different user profiles, and how do we identify those profiles? This is where a weighted hybrid recommender system might come in. A more adventurous user might prefer more exploratory recommendations, whereas a conservative user may only respond to recommendations which closely relate to their browsing history. How do we balance customer satisfaction with the need to push new content on them? A content distributor may be satisfied with suiting the needs of their customer, whereas a content provider may also want to increase the sale of their less popular items, alongside increasing their customer retention.
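As a sketch of how a weighted hybrid might combine two recommenders, consider the toy R snippet below. The item names, scores and weight are all invented for illustration; in practice the two score vectors would come from real content-based and collaborative models:

# Hypothetical per-item scores from two different recommenders
content_scores <- c(item1 = 0.9, item2 = 0.2, item3 = 0.5)
collab_scores  <- c(item1 = 0.4, item2 = 0.8, item3 = 0.6)

# The weight could vary by user profile: a conservative user might get
# a higher w (history-driven), an adventurous user a lower one (exploratory)
w <- 0.7
hybrid_scores <- w * content_scores + (1 - w) * collab_scores
sort(hybrid_scores, decreasing = TRUE)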

Operational costs & algorithm selection

Furthermore, we need to determine whether the operational costs of developing and maintaining an advanced recommender system are worth the potentially marginal improvements in content suggestions. Aside from the cost of hiring researchers and engineers, there can also be large costs associated with training an advanced recommendation engine in the cloud at the scale of Amazon’s or Spotify’s. As the size of the user base and item database increases, so will the operational costs. An algorithm which has to compare an item to the whole user database to perform a recommendation (such as memory-based collaborative filtering) is not as scalable as one which uses item properties and metadata to identify similar items (such as content-based recommendation). However, it might also be that a more complex algorithm, such as the matrix factorization approach popularised by the Netflix Prize, would be able to extract better features from the data, with the caveat of requiring much more time and compute power to train.

This highlights the importance of clearly defining business goals and evaluation metrics before launching into a venture such as this, since A/B testing might reveal that an expensive recommender system which offers marginally better recommendations might not have any better impact on the bottom line than a simple one.
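To make the scalability point concrete, here is a toy sketch of the item-item comparison at the heart of memory-based collaborative filtering. The ratings matrix is invented and tiny; a real system would hold millions of users in a sparse structure:

# Toy user x item ratings matrix (NA = not yet rated)
ratings <- matrix(c(5,  3, NA,
                    4, NA,  2,
                    1,  5,  4),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("user", 1:3), paste0("item", 1:3)))

# Cosine similarity computed over co-rated entries only
cosine_sim <- function(a, b) {
  keep <- !is.na(a) & !is.na(b)
  sum(a[keep] * b[keep]) / (sqrt(sum(a[keep]^2)) * sqrt(sum(b[keep]^2)))
}

# Each pairwise comparison touches the full user dimension, which is
# what drives the cost up as the user base grows
cosine_sim(ratings[, "item1"], ratings[, "item2"])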

 

Technical Challenges

Data Availability / Sparsity

Figure 2 – Data Sparsity (https://ebaytech.berlin/deep-learning-for-recommender-systems-48c786a20e1a)

The second major challenge we face is the sparsity of our dataset. The average user’s activity only provides a limited amount of data about their likes and dislikes. The biggest mistake we can make is to assume that a user who has not clicked on or rated an item necessarily dislikes it; the more likely explanation is that the user has not yet discovered it. As a result, missing values need to be ignored, rather than treated as dislikes or 0 ratings. However, this results in a very sparse dataset in which users have only interacted with a fraction of the available items. This leads to a few issues: can we guarantee that we have a full picture of a user? And how do we make predictions for a new user, for whom we have no data at all? This is known as the “cold start” problem. Potential solutions include recommending the most popular items (as on the YouTube and Amazon home pages), user-inductions which request information from a new user (as on Reddit or Quora), or extracting metadata from items to compare them, as Spotify does.
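A two-line R example, on invented ratings, shows why treating missing values as zeros is so damaging:

user_ratings <- c(5, NA, 4, NA, NA, 3)

mean(user_ratings, na.rm = TRUE)                    # 4: the user's actual average
mean(replace(user_ratings, is.na(user_ratings), 0)) # 2: dragged down by items never seen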

Implicit Feedback

For this reason, implicit feedback is oftentimes preferred. This refers to the use of data such as number of clicks, shares and streaming time. The advantage of this over explicit feedback is that it allows businesses to collect more data on their users, who may otherwise be unwilling to give explicit ratings. It also removes any potential bias towards users who may be particularly expressive of their opinions but do not represent the majority.

However, implicit feedback brings its own set of problems. Whereas a 5-star rating has a predetermined scale, which allows us to adjust for any bias towards users who are more critical or complimentary than average, implicit feedback is more difficult to deal with. How do we determine the relative value of a click, a like or a sale? In addition, how do we deal with data from a user who may have listened to their favourite song 99 times, but also has a special place in their heart for that song they only listen to once a month?

Some algorithms simply ignore values such as play counts and transform them into binary 1s and 0s, whereas others use them as a confidence metric for how much a user likes an item. This may be part of the reason why YouTube and Netflix have switched to a like/dislike system rather than 5-star ratings: likes may see more use than 5-star ratings, and oftentimes convey just as much information to a recommender system.
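Sketched in R on invented play counts, those two treatments look like this; the confidence formula is one common formulation from the implicit-feedback literature (Hu, Koren & Volinsky), not the only option:

play_counts <- c(song_a = 99, song_b = 1, song_c = 0)

# Option 1: binarise - any interaction at all is a positive signal
preference <- as.integer(play_counts > 0)

# Option 2: confidence weighting - higher counts mean higher confidence,
# while a single play still carries some signal
alpha <- 40
confidence <- 1 + alpha * play_counts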

In follow-up posts, I will explore the different types of recommender systems, followed by an implementation of these using recent technologies such as PyTorch.


The language used by data scientists can be confusing to anyone encountering it for the first time. Ever-changing best practices and constantly evolving technologies and methodologies have given rise to a range of nuanced terms used throughout casual data conversation. Unfamiliarity with these terms often leads to disconnected expectations across different parts of a business when undertaking projects involving data and analytics. To make the most of any data science project, it is important that participants have a shared vocabulary and an understanding of key terms at the level required by their role.

Mango Solutions is regularly involved in data science projects spanning different levels of a business. Below, we’ve outlined the most common data science terms that act as communication barriers in such projects:

The full table of terms is available to view as a PDF.

Mango Solutions can help you build a shared language around data science in your organisation. Based on our experience working with the world’s leading companies, we have developed three workshops to build a common language.

Find out which of the three workshops would be valuable to your organisation.

In a previous post, I studied gender diversity in the film industry by focusing on some key behind-the-camera roles and measuring the evolution of gender diversity over the last decade. The conclusion was not great: women are under-represented, especially in the most important roles of directors and writers, as these key roles determine the way women are portrayed in front of the camera.

I was curious about the TV series industry too: as it is faster paced than the movie industry, might it be more open to women? I decided to have a look.

In this post, as in the film industry post, the behind-the-camera roles I studied were: directors, writers, producers, sound teams, music teams, art teams, makeup teams and costume teams.

The full code to reproduce the following results is available on GitHub.

Data Frame Creation – Web Scraping

All the data I used was gathered from the IMDb website: I went through the 100 Most Popular TV Shows (according to the IMDb ratings) and collected some useful information about these 100 series. I built a data frame which contains the titles of these series, their years of release and their IMDb episode links – the link where we can find all the episodes of a series.

# IMDb 100 most popular TV shows ------------------------------

library(rvest)  # read_html(), html_nodes()
library(dplyr)  # provides the %>% pipe and data manipulation verbs

url <- "https://www.imdb.com/chart/tvmeter?sort=us,desc&mode=simple&page=1"
page <- read_html(url)

serie_nodes <- html_nodes(page, '.titleColumn') %>%
  as_list()

# Series details
serie_name <- c()
serie_link <- c()
serie_year <- c()
for (i in seq_along(serie_nodes)){
  serie_name <- c(serie_name, serie_nodes[[i]]$a[[1]])
  serie_link <- c(serie_link, attr(serie_nodes[[i]]$a, "href"))
  serie_year <- c(serie_year, serie_nodes[[i]]$span[[1]])
}
serie_link <- paste0("http://www.imdb.com",serie_link)
serie_year <- gsub("[()]", "", serie_year)
serie_episodelist <- sapply(strsplit(serie_link, split='?', fixed=TRUE),
                            function(x) (x[1])) %>%
  paste0("episodes?ref_=tt_eps_yr_mr")


# Create dataframe ----------------------------------------------
top_series <- data.frame(serie_name, serie_year, serie_episodelist, stringsAsFactors = FALSE)


# series_year was the date of 1st release but we needed the years of release for all the episodes
# I did not manage to gather this information by doing some web scraping.
# I added it manually as it is available on the IMDb episodes links (column serie_episodelist)
top_series[20:30, ]
##                        serie_name       serie_year
## 20                         Legion             2017
## 21 A Series of Unfortunate Events       2017, 2018
## 22                       Timeless 2016, 2017, 2018
## 23                      Westworld       2016, 2018
## 24                      Luke Cage             2016
## 25                       MacGyver 2016, 2017, 2018
## 26                  Lethal Weapon 2016, 2017, 2018
## 27            Designated Survivor 2016, 2017, 2018
## 28                           Bull 2016, 2017, 2018
## 29                     This Is Us 2016, 2017, 2018
## 30                        Atlanta       2016, 2018
##                                                 serie_episodelist
## 20 http://www.imdb.com/title/tt5114356/episodes?ref_=tt_eps_yr_mr
## 21 http://www.imdb.com/title/tt4834206/episodes?ref_=tt_eps_yr_mr
## 22 http://www.imdb.com/title/tt5511582/episodes?ref_=tt_eps_yr_mr
## 23 http://www.imdb.com/title/tt0475784/episodes?ref_=tt_eps_yr_mr
## 24 http://www.imdb.com/title/tt3322314/episodes?ref_=tt_eps_yr_mr
## 25 http://www.imdb.com/title/tt1399045/episodes?ref_=tt_eps_yr_mr
## 26 http://www.imdb.com/title/tt5164196/episodes?ref_=tt_eps_yr_mr
## 27 http://www.imdb.com/title/tt5296406/episodes?ref_=tt_eps_yr_mr
## 28 http://www.imdb.com/title/tt5827228/episodes?ref_=tt_eps_yr_mr
## 29 http://www.imdb.com/title/tt5555260/episodes?ref_=tt_eps_yr_mr
## 30 http://www.imdb.com/title/tt4288182/episodes?ref_=tt_eps_yr_mr

The serie_year column often contains several years. For example, for the series called “This Is Us”, it means that episodes were released in 2016, 2017 and 2018. This column will allow me to split the episodes by year of release, and then visualise the gender diversity of the crew for each year.

List Creation – Web Scraping

At this stage, I just had some global information on the 100 series. The next step was to go through the IMDb links gathered in the serie_episodelist column of my top_series data frame, which give me access to all the episodes of a series split by year of release. I did some web scraping on these links and built a list which gathered:

  • the names of the 100 most popular TV shows
  • for each series, the different years of release
  • for each year, the names of the episodes which have been released
  • for each episode, the names of the people whose job was included in one of the categories I listed above (directors, writers, …, costume teams)
### Create series list

series_list <- list()

# FOCUS ON EACH SERIES -----------------------------------------------------------------
for (r in seq_len(nrow(top_series))) { 
  
  serie_name <- top_series[r, "serie_name"]
  print(serie_name)
  
  # Years of release for each serie
  list_serieyear <- as.list(strsplit(top_series[r, "serie_year"], split = ", ")[[1]]) 
  # List of IMDb links where we find all the episodes per year of release
  link_episodelist_peryear <- list() 
  
  episodes_list_peryear <- list()
  
  # FOCUS ON EACH YEAR OF RELEASE FOR THIS SERIE -------------------------------------
  for (u in seq_along(list_serieyear)){ 

    year <- list_serieyear[[u]]
    print(year)
    
    link_episodelist_yeari <- strsplit(top_series[r, "serie_episodelist"], split='?', fixed=TRUE)[[1]][1] %>%
      paste0("?year=", year, collapse = "")
    link_episodelist_peryear[[u]] <- link_episodelist_yeari
    
    # FOCUS ON EACH EPISODE FOR THIS YEAR OF RELEASE ----------------------------------
    for (l in seq_along(link_episodelist_peryear)){ 
      
      page <- read_html(link_episodelist_peryear[[l]]) 
      episodes_nodes <- html_nodes(page, '.info') %>%
        as_list()
      
      episode_name <- c()
      episode_link <- c()
      
      for (t in seq_along(episodes_nodes)){
        episode_name <- c(episode_name, episodes_nodes[[t]]$strong$a[[1]])
        episode_link <- c(episode_link, attr(episodes_nodes[[t]]$strong$a, "href"))
      }
      
      episode_link <- paste0("http://www.imdb.com",episode_link)
      episode_link <- sapply(strsplit(episode_link, split='?', fixed=TRUE), 
                             function(x) (x[1])) %>%
        paste0("fullcredits?ref_=tt_ql_1")
      
      episode_name <- sapply(episode_name, 
                             function(x) (gsub(pattern = "\\#", replacement = "", x)))  %>% # some names = "Episode #1.1"
        as.character()
      
      # GATHER THE NAME OF THE EPISODE, ITS YEAR OF RELEASE AND ITS FULL CREW LINK ----
      episodes_details_peryear <- data.frame(year = year,
                                             episode_name = episode_name,
                                             episode_link = episode_link,
                                             stringsAsFactors = FALSE)
    }
    
    # FOCUS ON EACH FULL CREW LINK ----------------------------------------------------
    for (e in seq_len(nrow(episodes_details_peryear))){
      
      print(episodes_details_peryear[e, "episode_link"])
      
      episode_page <- read_html(episodes_details_peryear[e, "episode_link"])
      episode_name <- episodes_details_peryear[e, "episode_name"]
      
      # GATHER ALL THE CREW NAMES FOR THIS EPISODE -------------------------------------
      episode_allcrew <- html_nodes(episode_page, '.name , .dataHeaderWithBorder') %>%
        html_text()
      episode_allcrew <- gsub("[\n]", "", episode_allcrew) %>%
        trimws() #Remove white spaces 
      
      # SPLIT ALL THE CREW NAMES BY CATEGORY -------------------------------------------
      episode_categories <- html_nodes(episode_page, '.dataHeaderWithBorder') %>%
        html_text()
      episode_categories <- gsub("[\n]", "", episode_categories) %>%
        trimws() #Remove white spaces
      
      ## MUSIC DEPT -----------------------------------------------------------------------
      episode_music <- c()
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music by", episode_allcrew[i])){
          j <- 1
          while (! grepl(episode_allcrew[i], episode_categories[j])){
            j <- j+1
          }
          k <- i+1
          while (! grepl(episode_categories[j+1], episode_allcrew[k])){
            episode_music <- c(episode_music, episode_allcrew[k])
            k <- k+1
          }
        }
      }
      for (i in 1:(length(episode_allcrew)-1)){
        if (grepl("Music Department", episode_allcrew[i])){
          # Sometimes music dept is last category
          if (grepl ("Music Department", episode_categories[length(episode_categories)])){ 
            first <- i+1
            for (p in first:length(episode_allcrew)) {
              episode_music <- c(episode_music, episode_allcrew[p])
            }
          } else {
            j <- 1
            while (! grepl(episode_allcrew[i], episode_categories[j])){
              j <- j+1
            }
            k <- i+1
            while (! grepl(episode_categories[j+1], episode_allcrew[k])){
              episode_music <- c(episode_music, episode_allcrew[k])
              k <- k+1
            }
          }
        }
      }
      if (length(episode_music) == 0){
        episode_music <- c("")
      }
      
      ## IDEM FOR OTHER CATEGORIES ----------------------------------------------------------
      
      ## EPISODE_INFO CONTAINS THE EPISODE CREW NAMES ORDERED BY CATEGORY -------------------
      episode_info <- list()
      episode_info$directors <- episode_directors
      episode_info$writers <- episode_writers
      episode_info$producers <- episode_producers
      episode_info$sound <- episode_sound
      episode_info$music <- episode_music
      episode_info$art <- episode_art
      episode_info$makeup <- episode_makeup
      episode_info$costume <- episode_costume
      
      ## EPISODES_LIST_PER_YEAR GATHERS THE INFORMATION FOR EVERY EPISODE OF THE SERIE-------
      ## SPLIT BY YEAR OF RELEASE --------------------------------------------------------
      episodes_list_peryear[[year]][[episode_name]] <- episode_info
    }
    
    ## SERIES_LIST GATHERS THE INFORMATION FOR EVERY YEAR AND EVERY SERIE -------------------
    series_list[[serie_name]] <- episodes_list_peryear
  } 
}

Let’s have a look at the information gathered in series_list. Here are some of the names I collected:

## - Black Mirror, 2011
##  Episode: The National Anthem 
##  Director: Otto Bathurst
## - Black Mirror, 2017
##  Episode: Black Museum 
##  Director: Colm McCarthy
## - Game of Thrones, 2011
##  Episode: Winter Is Coming 
##  Music team: Ramin Djawadi, Evyen Klean, David Klotz, Robin Whittaker, Michael K. Bauer, Brandon Campbell, Stephen Coleman, Janet Lopez, Julie Pearce, Joe Rubel, Bobby Tahouri
## - Game of Thrones, 2017
##  Episode: Dragonstone 
##  Music team: Ramin Djawadi, Omer Benyamin, Evyen Klean, David Klotz, William Marriott, Douglas Parker, Stephen Coleman

What we can see is that for the same series the crew changes depending on the episode we consider.

Gender Determination

Now that I had all the names gathered in the series_list, I needed to determine the genders. I used the same package as in my previous post on the film industry: GenderizeR, which “uses genderize.io API to predict gender from first names”. More details on this package and the reasons why I decided to use it are available in my previous post.

With this R package, I was able to determine for each episode the number of males and females in each category of jobs:

  • the number of male directors,
  • the number of female directors,
  • the number of male producers,
  • the number of female producers,
  • the number of males in the costume team,
  • the number of females in the costume team.

Here is the code I wrote:

### Genderize our lists of names

library(genderizeR)  # genderizeAPI()

# for each serie
for (s in seq_along(series_list) ){  
  print(names(series_list[s])) # print serie name
  
  # for each year
  for (y in seq_along(series_list[[s]])){ 
    print(names(series_list[[s]][y])) # print serie year
    
    # for each episode
    for (i in seq_along(series_list[[s]][[y]])){ 
      print(names(series_list[[s]][[y]][i])) # print serie episode
      
      # Genderize directors -----------------------------------------------------
      directors <- series_list[[s]][[y]][[i]]$directors
      
      # identical() keeps this check valid when several directors are listed
      if (identical(directors, "")){
        directors_gender <- list()
        directors_gender$male <- 0
        directors_gender$female <- 0
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      }
      
      else{
        # Split the firstnames and the lastnames
        # Keep the firstnames
        directors <- strsplit(directors, " ")
        l <- c()
        for (j in seq_along(directors)){
          l <- c(l, directors[[j]][1])
        }
        
        directors <- l
        serie_directors_male <- 0
        serie_directors_female <- 0
        
        # Genderize every firstname and count the number of males and females 
        for (p in seq_along(directors)){
          directors_gender <- genderizeAPI(x = directors[p], apikey = "233b284134ae754d9fc56717fec4164e")
          gender <- directors_gender$response$gender
          if (length(gender)>0 && gender == "male"){
            serie_directors_male <- serie_directors_male + 1
          }
          if (length(gender)>0 && gender == "female"){
            serie_directors_female <- serie_directors_female + 1
          }
        }
        
        # Put the number of males and females in series_list
        directors_gender <- list()
        directors_gender$male <- serie_directors_male
        directors_gender$female <- serie_directors_female
        series_list[[s]][[y]][[i]]$directors_gender <- directors_gender
      }  
      
      # Same code for the 7 other categories -----------------------------------

    }
  }
}

Here are some examples of the numbers of males and females I collected:

## Black Mirror, 2011
##  Episode: The National Anthem 
##  Number of male directors: 1 
##  Number of female directors: 0 
## 
## Black Mirror, 2017
##  Episode: Black Museum 
##  Number of male directors: 1 
##  Number of female directors: 0 
## 
## Game of Thrones, 2011
##  Episode: Winter Is Coming 
##  Number of male in music team: 8 
##  Number of female in music team: 3 
## 
## Game of Thrones, 2017
##  Episode: Dragonstone 
##  Number of male in music team: 7 
##  Number of female in music team: 0 
## 

Percentages Calculation

With these numbers gathered in my list, I then calculated the percentages of women in each job category, for each year between 2007 and 2018. I gathered these figures in a data frame called percentages:

##    year directors  writers producers     sound    music      art   makeup
## 1  2018  22.69693 25.06514  27.87217 12.247212 23.25581 36.93275 73.10795
## 2  2017  20.51948 28.20016  27.28932 10.864631 25.46912 29.90641 71.41831
## 3  2016  17.13456 24.51189  27.93240 11.553444 25.03117 30.98003 71.74965
## 4  2015  16.14764 19.42845  26.43828 11.214310 22.16505 29.83354 69.50787
## 5  2014  18.38624 20.88644  27.59163 10.406150 22.21016 30.11341 69.97544
## 6  2013  14.94413 19.60432  28.15726 10.504896 23.29693 29.01968 69.01683
## 7  2012  15.60694 19.82235  29.66566 10.685681 21.45378 26.74160 67.47677
## 8  2011  13.95349 17.60722  26.73747 11.296882 17.11185 25.61805 64.81795
## 9  2010  15.95745 17.05882  27.38841 11.264644 16.51376 24.14815 65.33004
## 10 2009  16.49123 18.90496  28.79557  8.498350 21.72285 26.11128 68.15961
## 11 2008  17.87440 16.62088  29.05844  7.594264 18.74405 23.46251 68.39827
## 12 2007  21.15385 21.78771  30.12798  9.090909 19.23077 21.66124 63.03502
##     costume
## 1  77.24853
## 2  81.34648
## 3  79.35358
## 4  76.48649
## 5  76.62972
## 6  74.74791
## 7  77.35247
## 8  77.46315
## 9  77.67380
## 10 79.56332
## 11 80.53191
## 12 79.24720

Gender Diversity in 2017: TV Series Industry VS Film Industry

Based on this data frame, I created some bar plots to visualise the gender diversity of each job category for each year. Here is the code I wrote to create the bar plot for 2017, which compares the TV series industry to the film industry.

### Barplot 2017

library(tidyr)        # gather()
library(ggplot2)
library(RColorBrewer) # brewer.pal()

# Data manipulation -------------------------------------------------------------

# Import our movies dataset
percentages_movies <- read.csv("percentages_movies.csv")
percentages_movies <- percentages_movies[ , -1]

# Change column names for movie and serie dataframes
colnames(percentages_movies) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")
colnames(percentages) <- c("year", "directors", "writers", "producers", "sound", "music", "art", "makeup", "costume")

# From wide to long dataframes
percentages_movies_long <- percentages_movies %>%
  gather(key = category, value = percentage, -year)
percentages_long <- percentages %>%
  gather(key = category, value = percentage, -year)

# Add a column to these dataframes: film industry or series industry?
percentages_movies_long$industry <- rep("Film industry", 88)
percentages_long$industry <- rep("Series industry", 96)

# Combine these 2 long dataframes
percentages_movies_series <- bind_rows(percentages_long, percentages_movies_long)

# Filter with year=2017
percentages_movies_series_2017 <- percentages_movies_series %>%
  filter(year == 2017)


# Data visualisation -------------------------------------------------------------

percentages_movies_series_2017$percentage <- as.numeric(format(percentages_movies_series_2017$percentage, 
                                                        digits = 2))

bar_2017 <- ggplot(percentages_movies_series_2017, aes(x = category,
                                                       y = percentage,
                                                       group = category,
                                                       fill = category)) +
  geom_bar(stat = "identity") +
  facet_wrap(~industry) +
  coord_flip() + # Horizontal bar plot
  geom_text(aes(label = percentage), hjust=-0.1, size=3) +
  theme(axis.text.x=element_blank(),
        axis.ticks.x=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(hjust = 0.5), # center the title
        legend.title=element_blank()) +  
  labs(title = paste("Percentages of women in 2017"),
       x = "",
       y = "Percentages") +
  guides(fill = guide_legend(reverse=TRUE)) + # reverse the order of the legend
  scale_fill_manual(values = brewer.pal(8, "Spectral")) # palette used to fill the bars and legend boxes

I have built a simple shiny app which gives access to the bar plots for each year between 2007 and 2017.

Let’s analyse the graph for 2017. If we focus only on the TV series figures, we see that sound teams show the lowest female representation, at less than 11%. They are followed by the role of director, at 20.5%. Then, between 25% and 30% of the roles of writers, producers, music teams and art teams are taken by women. Thus, women are still under-represented in the TV series industry. However, even if the series figures show little gender diversity in the above job categories, they are better than the film industry ones, especially for the key roles of directors, writers and producers, where the percentages are respectively 5.7, 3 and 1.2 times higher for the series industry than for the film industry. The last thing to notice is that, as in the film industry, the series graph shows a representativeness gap between the above roles and the jobs of make-up artists and costume designers, in which more than 70% of the roles are taken by women.

Evolution of the Gender Diversity: TV Series Industry VS Film Industry

Let’s have a look at the evolution of the gender diversity in these two industries in the last decade.

### Evolution plot

library(lubridate)  # ymd()

# year as date
percentages_movies_series_ymd <- percentages_movies_series %>%
  subset(year != 2018)
percentages_movies_series_ymd$year <- ymd(percentages_movies_series_ymd$year, truncated = 2L)

# Data visualisation
evolution <- ggplot(percentages_movies_series_ymd, aes(x = year,
                                                       y = percentage,
                                                       group = category,
                                                       colour = category)) +
  geom_line(size = 2) +
  facet_wrap(~industry) +
  theme(panel.grid.minor.x = element_blank(),
        plot.title = element_text(hjust = 0.5)) + # center the title
  scale_x_date(date_breaks = "2 year", date_labels = "%Y") +
  scale_color_manual(values = brewer.pal(8, "Set1")) +
  labs(title = "Percentages of women from 2007 to 2017\n Film industry VS serie industry",
       x = "",
       y = "Percentages")

The first thing I noticed is that for both the film and series industries, the representation gap between the roles of make-up artists and costume designers and the other ones had not decreased since 2007.

The fact that the roles of directors, writers and producers are more open to women in the TV series industry than in the film one is easy to visualise with this graph, and we can see that it has been the case at least since 2007 (and probably before). Besides, since 2007 the series industry has been more diversified in terms of gender for all the categories I studied, except for the sound roles.

I also noticed that since 2010/2011, in the TV series industry, almost all the categories have tended to become more diversified in terms of gender. The only exceptions are the roles of producers (percentages have generally been decreasing slightly since 2007), sound teams (no improvement has been achieved since 2010) and costume teams (the trend has only been positive since 2013). Apart from that, there is a positive trend for the TV series industry, which is not the case for the film industry.

This trend is significant for some roles: the percentages for writers, music teams, art teams and make-up teams in the series industry have increased by 5 to 10 percentage points over the last decade. For directors, the percentage of women has also increased by 5 points since 2011, but the level reached in 2017 is essentially the same as in 2007, just as for the film industry. Let’s hope that the trend seen since 2011 for directors will continue.

Conclusion

This study has definitely shown that the TV series industry is more diversified in terms of gender than the film industry, especially for the key roles of directors and writers.

However, even if the series percentages are better than the film ones, women are still under-represented in the TV series industry, and the same regrettable pattern is echoed: the only jobs which seem open to women are the stereotypically female jobs of make-up artists and costume designers. In all the other categories, the percentage of women in the series industry never reaches more than 30%.

But contrary to the film industry, the TV series industry is actually evolving in the right direction: since 2011, a positive trend has been underway for directors and writers. This evolution is encouraging for the future and suggests that more powerful female characters, such as Daenerys Targaryen from Game of Thrones, are coming to TV screens.


For today’s interview, Ruth Thomson, Practice Lead for Strategic Advice, spoke to Catherine Gamble, Data Scientist at Marks and Spencer.

Catherine is presenting “Using R to Drive Revenue for your Online Business” at EARL London and we got the chance to get a preview of the use case she’ll be presenting.

Thanks Catherine for this interview. What was the business need or opportunity that led to this project?

As an online retailer, we know that the actions we take, for example, any changes we make to our website, have an impact on our financial results. However, when multiple changes are being made or campaigns are being run at the same time, it can be hard to separate which action led to the desired result.

From a strategy and planning perspective, we knew it would be valuable to be able to predict the direct impact of any actions we took, before we made them.

How did you go about solving this problem?

I developed a predictive model to explore the relationships between actions and results, and with it I was able to identify which actions would have an impact on our KPIs.

What value did your project deliver?

We now have clear insight which is fed into our strategic decision making. As a result, we have had a positive impact on our KPIs and there has been a positive financial impact.

What would you say were the elements that made your project a success?

Support from the Team – one of the key drivers of success in this project was the time I was given to explore different techniques and models and to learn.

Curiosity – this project came about because I was curious about the patterns in the data and wanted to explore some questions around things we were seeing.

What other businesses do you think would benefit from this use case?

Any online retailer that runs multiple sales, marketing and development events and campaigns at the same time.

It would also be useful for businesses that have a sales funnel and want to explore how the actions they take affect their results.

To hear Catherine’s full talk and others like it – join us at EARL London this September!


One of the few remaining hurdles when working with R in the enterprise is consistent access to CRAN. Often desktop class systems will have unrestricted access while server systems might not have any access at all.

This inconsistency often stems from security concerns about allowing servers access to the internet. There have been many different approaches to solving this problem, with some organisations reluctantly allowing outbound access to CRAN, some rolling their own internal CRAN-like repositories, and others installing a fixed set of packages and leaving it at that.

Fig 1. Access to public CRAN from multiple sources can be a security and compliance headache

Fortunately, this problem may now be a thing of the past. Yesterday RStudio announced a new software tool called “Package Manager” that provides a single, on-premise, CRAN-like interface which can provide access to CRAN, your organisation’s own internal packages, or a combination of the two, all in a unified system.

RStudio Package Manager (RSPM) removes the need for IT teams to whitelist external access to CRAN from all of their R servers. Now, just a single system requires external access to a carefully managed CRAN mirror built specifically for this purpose. Internal systems can now connect to this single, internal package repository instead of to ad-hoc mirrors. Desktop and laptop users can connect to it too, providing a unified package management experience.

Fig 2. RStudio Package Manager simplifies CRAN access and reduces risk
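Client configuration is then a small change. A minimal sketch, assuming a hypothetical internal URL that your administrator would supply:

# Point this R session (or a site-wide Rprofile) at the internal repository
options(repos = c(RSPM = "https://packages.example.com/cran/latest"))

# Packages now install from the internal repository rather than public CRAN
install.packages("dplyr")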

Further, RSPM can be used to publish internal packages as well, and even supports hosting multiple repositories, which can be useful for different groups within the business.

Mango have been using RSPM and providing feedback on it since the earliest private beta stage, and have already provided support around it to a small number of other beta customers. That, combined with our long R heritage and deep roots in the R and enterprise ecosystems, means we’re well placed to help others on their enterprise R journey.

To schedule a call to discuss the options, contact sales@mango-solutions.com.