data science & star wars

 

Not many people know this, but the First Order (the bad guys from the latest Star Wars films) once created a Data Science team.  It all ended very badly, but based on some intel smuggled by a chippy R2 unit, we were able to piece together the story …

 

 

Analytics: Expectation vs Reality

Now, of course this is just (data) science fiction, but the basic plot will be familiar to many of you.

The marketing hype around AI and Data Science over the last few years has really raised the stakes in the analytics world.  It’s easy to see why – if you’re a salesperson selling AI software for £1m, then you’re going to need to be bullish about how many millions it is going to make/save the customer.

The reality though is that Data Science can add enormous value to an organisation, but:

  • It isn’t magic
  • It won’t happen overnight
  • It’s very difficult if the building blocks aren’t in place
  • It’s more about culture and change than algorithms and tech

So, how do we deal with a situation where leaders (whether they be evil Sith overlords or just impatient executives) have inflated expectations about what is possible (and have possibly over-invested on that basis)?

 

Education is key

With so much buzz and hype around analytics, it’s unsurprising that leadership are bombarded with an array of confusing terminology and unrealistic promises.  To counter that, it is important that Data Science teams look to educate the business and leadership on what these terms really mean.  In particular, we need to educate the business on the “practical” application of data science, what the possibilities are, and the potential barriers to success that exist.

 

Create a repeatable process

Once we’ve educated the business about the possibilities of analytics, we need to create a repeatable delivery process that is understood from both analytic AND business perspectives.  This moves the practice of analytics away from “moments of magic” producing anecdotal success to a process that is understandable, repeatable and produces consistent success.  Within this, we can establish shared understanding about how we will prioritise effort, measure success, and overcome the barriers to delivering initiatives (e.g. data, people, change).

 

Be consistent

Having established the above, we must engage with the business and leadership using our new consistent language and approach.  This ensures the business understands the steps being carried out and the likelihood of success or failure.  After all, if there’s no signal in your data you can’t conjure accuracy from nowhere – ensuring that your stakeholders understand this (without getting into the detail of accuracy measures) is an important enabler for engaging with them effectively.

 

Summary

Being in a situation where the value and possibilities of data science have been significantly over-estimated can be very challenging.  The important thing is to educate the business, create a repeatable process for successful delivery and be consistent and clear about the realities and practicalities of applying data science.

Then again, if your executive sponsor starts wielding a Lightsaber – I’d get out quickly.

 


 

Following on from the success of our recent graduate intake, we are already looking to find three more graduates and one year-long placement to join us in September 2020.  Our placements and interns have been an integral part of Mango for several years now, and we’re proud to say that every single intern has come back once they’ve finished university and joined us as a permanent employee.

Mango hosted our very first graduate assessment day recently.  We thought that an assessment day would give us a better chance to really get to know the applicants, and to really show them what life at Mango is like – and it certainly did just that!

As wonderful as our current graduate intake is, I have to admit that all four of them are male.  As signatories of the Tech Talent Charter, and supporters of Women in Data, we were determined to change that statistic this year.  I’m pleased to say that of the eight candidates at the assessment day, four were male and four were female.  Mango is also justifiably proud of the diversity of backgrounds within our data science team – and this cohort was similarly diverse, with representatives from five different subjects and four different universities.

Following the recent Data Science Skills Survey – created in partnership with Women In Data UK and Datatech – which highlighted a national data science skills shortage, we were delighted to receive over 60 applicants for the three graduate roles.  We have already whittled these down to the top six candidates, who will move forward to the next stage of the application process to become a Mango graduate.

The next part of the process is about assessing skills and we do this by defining what we call a Minimally Viable Data Scientist – this is what we expect our graduates to achieve by the end of the graduate program.  We put exercises in place throughout the day to assess current skills as well as potential skills.

The more ‘technical’ skills were assessed at interview, whilst the softer skills, which are essential for our consultancy projects, were tested in individual and group exercises. We tasked the candidates with imagining a new project with Bath Cats and Dogs home, thinking about how that might play out.

We’re proud of some of the feedback that we received at the end of the day.  We consciously set out for this day to be two-way – we wanted the candidates to want to work for Mango just as much as we wanted to employ them. Some candidates’ feedback revealed that the day was “refreshingly open”, “actually enjoyable” and “not as daunting as I’d thought an assessment day would be”.

We’ve now got the incredibly difficult decision of which of the brilliant candidates to make offers to!

are you on a data-driven coddiwomple?

If, like me, you attend data conferences, then there’s one word you will hear time and time again: “journey”.  It’s an incredibly popular word in the data-driven transformation world, and it’s common to hear a speaker talking about the “journey” their business is on.  But, for me, I often struggle with that word.

 

Journey

The Oxford Dictionary defines a “journey” as follows:

Journey:
the act of travelling from one place to another

So to be on a journey, I feel we need to have a very clear understanding of (1) where we are travelling from and (2) what our destination is.

For example, as I’m writing this, I’m on a “journey” – I am travelling by train from my home (my starting point) to our London office (my destination).  Knowing my starting point and destination allows me to select the most appropriate route, estimate the time I need to invest to reach my destination, and allows me to understand my current progress along that route.  And, if I encounter any difficulties or delays on my journey (surely not on British trains!) then I know how to adjust and reset my expectations to ensure my route is appropriate and understood.

If we compare this to the use of the word “journey” in the context of data-driven transformation, I’m not entirely sure it fits.  When I speak with data and analytics leaders who are on a data-driven journey, it is surprising how often there is a lack of clarity over the destination, or their current position, which makes it very difficult to plan and measure progress.

But I see how the word journey has become so common – it conjures a sense of momentum and change which really fits the world of data-driven transformation.

 

Coddiwomple

However, I recently came across this incredible word, which I think may be more fitting. The origins of the word are unknown, but it is defined as follows:

Coddiwomple: 
to travel in a purposeful manner towards a vague destination

Despite being a lovely word to use, I think it is a far more appropriate description of many data-driven “journeys” I have encountered.

Know your destination

So if you’re currently on a “data-driven coddiwomple” and want to be on a “data-driven journey”, then you need only decide on a destination – in other words, what does a “data-driven” version of your current business look like?  In my experience, this can vary significantly – I’ve worked with organisations who see the destination as everything from a fully autonomous company to a place with highly disruptive business models.

Once this is decided, then you can build data-driven maturity models to measure your value and inform downstream investments – in the meantime, “Happy Coddiwompling!!”

 

Python

I have been asked this tricky question many times in my career – “Python or R?”. In my experience, the answer depends entirely on the purpose at hand, and it is a question that many aspiring data scientists, business leaders and organisations are still pondering.

It is important to have the right tools when providing the desired answers to the many business questions within the data science space – which isn’t as simple as it sounds. Whether you are considering Data Analytics, Data Science, Data Strategic Planning or developing a Data Science team, deciding which language to start with can be a major blocker.

Python has become the de facto language of choice for organisations seeking to build or upscale their skills seamlessly, and its influence is evident in the cloud computing environment. The fact of the matter is, according to the 20th annual KDnuggets Software Poll, Python is still the leader – top tech companies like Alphabet’s Google and Facebook continue to use Python at the core of their frameworks.

Some of Python’s essential benefits are its fluency and natural readability. It is easy to learn, and it provides a great deal of flexibility in terms of scalability and productionisation. There are many libraries and packages that have been created for a wide range of purposes.

Data is everywhere

Data is everywhere, big or small, and loads of companies have it but are not harnessing the capabilities of this great asset. Of course, the availability of data without the “algorithms” will not add any business value. That is why it is important for companies and business leaders to move quickly and adopt the tools that help transform their data into the economic benefits they desire. By choosing Python, companies will be able to utilize the potential of their data.

Deployment and Cloud Capability

Python’s capability is broad and its impact is felt in Machine Learning, Computer Vision, Natural Language Processing and many other areas. Its robustness and growing ecosystem have made deployment and integration straightforward across many tools. If you use Google Cloud Platform (GCP), Amazon Web Services (AWS) or Microsoft’s Azure, you will find it convenient to use and integrate with Python. Cloud technologies are growing at a rapid pace, and Python drives many of the applications running in the cloud.

Concluding Remarks

Taking a broad perspective, you might doubt whether there is any real question of supremacy between Python and R (or even SQL), but needs and versatility vary widely. Python has become a kingpin thanks to its user-friendliness, scalability, interoperability and extensive ecosystem of libraries. Many popular Python libraries support the development and evolution of Artificial Intelligence (AI), and many organisations are beginning to see the value of upskilling and taking advantage of Python in their AI-driven decisions.

Mango Solutions

There is a big drive within Mango to support the use of Python as an essential tool, benefiting our consultants and clients in many ways. Many projects have had Python at their core when it comes to execution. Our consultants have also delivered training courses to organisations in both the public and private sectors across the globe, helping them harness the potential of Python in their data-driven decisions, assert business value and shape their data journey.

Author: Dayo Oguntoyinbo, Data Scientist

50 shades of R

 

I’ve been joking about R’s “200 shades of grey” on training courses for a long time. The popularity of the book “50 Shades of Grey” has changed the meaning of this statement somewhat. As the film is due to be released on Valentine’s Day I thought this might be worth a quick blog post.

Firstly, where did I get “200 shades of grey” from? This statement was originally derived from the 200 available named colours that contain either “grey” or “gray” in the vector generated by the colours function. As you will see there are in fact 224 shades of grey in R.

greys <- grep("gr[ea]y", colours(), value = TRUE)

length(greys)

[1] 224

 

This is because there are also colours such as slategrey, darkgrey and even dimgrey! So let’s now remove anything that is more than just “grey” or “gray”.

 

greys <- grep("^gr[ea]y", colours(), value = TRUE)

length(greys)

[1] 204

 

So in fact there are 204 colours classified as “grey” or “gray”. If we take a closer look, though, it’s clear that there are not 204 unique shades of grey in R: we are doubling up so that both the British spelling, “grey”, and the US spelling, “gray”, can be used. This is really useful for R users, who don’t have to remember to change the way they usually spell grey/gray (you might also notice that I have used the function colours rather than colors), but when it comes to counting unique greys it means we have to be a little more specific in our search pattern. So, stripping back to just shades of “grey”:

 

greys <- grep("^grey", colours(), value = TRUE)

length(greys)

[1] 102

 

we find we are actually down to just 102. Interestingly, we don’t double up on all grey/gray colours: slategrey4 doesn’t exist but slategray4 does!

So really we have 102 shades of grey in R. Of course, this is only using the named colours; if we define the colour using rgb we can make use of all 256 grey levels!
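
For example, we can generate all 256 grey levels directly with rgb by giving the red, green and blue channels the same value (greys256 is just an example variable name):

# Equal red, green and blue values give a pure grey for each level 0-255
greys256 <- rgb(0:255, 0:255, 0:255, maxColorValue = 255)

length(greys256)

[1] 256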

 

 

So how can we get 50 shades of grey? Well the colorRampPalette function can help us out by allowing us to generate new colour palettes based on colours we give it. So a palette that goes from grey0 (black) to grey100 (white) can easily be generated.

 

shadesOfGrey <- colorRampPalette(c("grey0", "grey100"))

shadesOfGrey(2)

[1] "#000000" "#FFFFFF"

 

And 50 shades of grey?

 

[Figure: 50 shades of grey in R, produced by the code below]

 

fiftyGreys <- shadesOfGrey(50)

mat <- matrix(rep(1:50, each = 50))

image(mat, axes = FALSE, col = fiftyGreys)

box()

 

I hear the film is not as “graphic” as the book – but I hope this fits the bill!

 

Author: Andy Nicholls, Data Scientist

 

FIFA World Cup 2018 predictions

Given that the UEFA Champions League final a few weeks ago between Real Madrid and Liverpool is the only match I’ve watched properly in over ten years, how dare I presume I can guess that Brazil is going to lift the trophy in the 2018 FIFA World Cup? Well, here goes…

By the way, if you find what follows dry to read, it is because of my limited natural language on the subject matter… data science tricks to the rescue!

The idea is that in each simulated run of the tournament we record the winner, runner-up, third and fourth place, and so on. Repeating the simulation N times (e.g. 10,000) then gives us a list of teams ranked by their probability of finishing top.

library(tidyverse)
library(magrittr)
devtools::load_all("worldcup")

normalgoals <- params$normalgoals 
nsim <- params$nsim

data(team_data) 
data(group_match_data) 
data(wcmatches_train)

Apart from the winner question, this post seeks to answer which team will be top scorer and how many goals will they score. After following Claus’s analysis rmarkdown file, I collected new data, put functions in a package and tried another modelling approach. Whilst the model is too simplistic to be correct, it captures the trend and is a fair starting point to add complex layers on top.

Initialization

To begin with, we load packages, including the accompanying R package worldcup where my utility functions reside. A package is a convenient way to share code, seal utility functions and speed up iteration. The global parameters normalgoals (the average number of goals scored in a World Cup match) and nsim (the number of simulations) are declared in the YAML section at the top of the RMarkdown document.

Next we load three datasets that have been tidied up from open-source resources or updated from their original versions. Plenty of time was spent on gathering data, aligning team names and cleaning up features.

  • team_data contains features associated with team
  • group_match_data is match schedule, public
  • wcmatches_train is a match dataset available from this Kaggle competition and can be used as a training set to estimate the parameter lambda, i.e. the average number of goals scored in a match by a single team. Records from 1994 up to 2014 are kept in the training set.

Play game

Claus proposed three working models to calculate a single match outcome. The first is based on two independent Poisson distributions, where the two teams are treated as equal, so the result is random regardless of their actual skills and talent. The second assumes the scoring events in a match are two Poisson events; the difference of two Poisson variables follows a Skellam distribution. The results turn out to be much more reliable as the parameters are estimated from actual betting odds. The third is based on the World Football ELO Ratings rules: from the current ELO ratings we calculate the expected result for one side in a match, which can be seen as the probability of success in a binomial distribution. This approach seems to overlook draws due to the binary nature of the binomial distribution.
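
To make the first model concrete, here is a minimal sketch of the independent-Poisson idea (play_fun_simplest_sketch is an illustrative stand-in, not the function shipped in the worldcup package, and it simply splits the match average evenly between the two sides):

# Both teams draw their goals from the same Poisson distribution, so skill
# plays no part; normalgoals is the average goals per match loaded earlier
play_fun_simplest_sketch <- function(normalgoals) {
  rpois(2, lambda = normalgoals / 2)
}

play_fun_simplest_sketch(normalgoals)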

The fourth model presented here is my first attempt. To spell it out: we assume two independent Poisson events, with lambdas predicted from a trained Poisson regression model. The predicted goals are then simulated with rpois.

Each model candidate has its own function, which is specified via the play_fun parameter and passed to the higher-level wrapper function play_game.

# Specify team Spain and Portugal
play_game(play_fun = "play_fun_simplest", 
          team1 = 7, team2 = 8, 
          musthavewinner=FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      0      1
play_game(team_data = team_data, play_fun = "play_fun_skellam", 
          team1 = 7, team2 = 8, 
          musthavewinner=FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      1      4
play_game(team_data = team_data, play_fun = "play_fun_elo", 
          team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      0      1
play_game(team_data = team_data, train_data = wcmatches_train, 
          play_fun = "play_fun_double_poisson", 
          team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      2      2

Estimate poisson mean from training

Let’s have a quick look at the core of my training function. The target variable in the glm call is the number of goals a team scored in a match. The predictors are the FIFA and ELO ratings at a point before the 2014 tournament started. Both are popular ranking systems – the difference being that the FIFA rating is official while the ELO rating is in the wild, adapted from the chess ranking methodology.

mod <- glm(goals ~ elo + fifa_start, family = poisson(link = log), data = wcmatches_train)
broom::tidy(mod)
##          term      estimate    std.error  statistic      p.value
## 1 (Intercept) -3.5673415298 0.7934373236 -4.4960596 6.922433e-06
## 2         elo  0.0021479463 0.0005609247  3.8292949 1.285109e-04
## 3  fifa_start -0.0002296051 0.0003288228 -0.6982638 4.850123e-01

From the model summary, the ELO rating is statistically significant whereas the FIFA rating is not. More interesting is that the estimate for the FIFA ratings variable is negative, implying a multiplicative effect of 0.9997704 on the expected goal count relative to the average. Overall, the FIFA rating appears to be less predictive of the goals a team may score than the ELO rating. One possible reason is that only the 2014 ratings were collected; it may be worth going further back into history in future. Challenges to the FIFA ratings’ predictive power are not new, after all.
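
Because the model uses a log link, exponentiating the coefficients gives these multiplicative effects directly – a quick check of where the 0.9997704 figure comes from:

# Rate ratios: multiplicative change in expected goals per one-unit increase
# in each predictor (fifa_start comes out at roughly 0.99977, i.e. negligible)
exp(coef(mod))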

The training set wcmatches_train has a home column, representing whether team X in match Y is the home team. However, when matches are played in a third country it’s hard to say whether home/away status makes as much difference as it does in league competitions. Also, I didn’t find an explicit home/away split for the Russian World Cup. In a future model iteration we could derive a similar feature – host advantage, indicating the host nation or continent. For the time being, home advantage doesn’t make the cut.
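
Pulling this section together, a rough sketch of the double-Poisson game logic might look like the code below. It is only an illustration – the packaged play_fun_double_poisson handles the team feature lookup and other details – and it assumes newdata contains one row per team with the elo and fifa_start columns used to fit mod.

# Predict each team's scoring rate from the trained Poisson regression,
# then simulate the goals with rpois()
play_fun_double_poisson_sketch <- function(mod, newdata) {
  lambdas <- predict(mod, newdata = newdata, type = "response")
  rpois(2, lambda = lambdas)
}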

Group and knockout stages

Presented below are examples showing how to find winners at various stages – from the group stage to the round of 16, quarter-finals, semi-finals and final.

find_group_winners(team_data = team_data, 
                   group_match_data = group_match_data, 
                   play_fun = "play_fun_double_poisson",
                   train_data = wcmatches_train)$goals %>% 
  filter(groupRank %in% c(1,2)) %>% collect()
## Warning: package 'bindrcpp' was built under R version 3.4.4

## # A tibble: 16 x 11
##    number name         group  rating   elo fifa_start points goalsFore
##                               
##  1      2 Russia       A       41.0   1685        493   7.00         5
##  2      3 Saudi Arabia A     1001     1582        462   5.00         4
##  3      7 Portugal     B       26.0   1975       1306   7.00         6
##  4      6 Morocco      B      501     1711        681   4.00         2
##  5     12 Peru         C      201     1906       1106   5.00         3
##  6     11 France       C        7.50  1984       1166   5.00         6
##  7     13 Argentina    D       10.0   1985       1254   9.00         8
##  8     15 Iceland      D      201     1787        930   6.00         4
##  9     17 Brazil       E        5.00  2131       1384   7.00         8
## 10     20 Serbia       E      201     1770        732   6.00         4
## 11     21 Germany      F        5.50  2092       1544   6.00         8
## 12     24 Sweden       F      151     1796        889   6.00         5
## 13     27 Panama       G     1001     1669        574   5.00         3
## 14     25 Belgium      G       12.0   1931       1346   5.00         4
## 15     31 Poland       H       51.0   1831       1128   4.00         2
## 16     29 Colombia     H       41.0   1935        989   4.00         1
## # ... with 3 more variables: goalsAgainst , goalsDifference ,
## #   groupRank 
find_knockout_winners(team_data = team_data, 
                      match_data = structure(c(3L, 8L, 10L, 13L), .Dim = c(2L, 2L)), 
                      play_fun = "play_fun_double_poisson",
                      train_data = wcmatches_train)$goals
##   team1 team2 goals1 goals2
## 1     3    10      2      2
## 2     8    13      1      2

Run the tournament

Here comes the most exciting part. We made a function – simulate_one() – to play the tournament once and then replicate() it (literally) many, many times. To run an ideal number of simulations, for example 10k, you might want to turn on parallel processing. I am staying at 1,000 for simplicity.

Finally, simulate_tournament() is the ultimate wrapper for all of the above. Each returned resultX object is a 32-by-nsim matrix holding the predicted rankings from every simulation. set.seed() is here to ensure the results of this blog post are reproducible.
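
Conceptually, the wrapper boils down to something like the sketch below (the real simulate_tournament() takes more arguments and does extra bookkeeping; the exact signature of simulate_one() is assumed here):

# Play the whole tournament nsim times, collecting one set of rankings per run
simulate_tournament_sketch <- function(nsim, ...) {
  replicate(nsim, simulate_one(...))
}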

# Run nsim number of times world cup tournament
set.seed(000)
result <- simulate_tournament(nsim = nsim, play_fun = "play_fun_simplest") 
result2 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_skellam")
result3 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_elo")
result4 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_double_poisson", train_data = wcmatches_train)

Get winner list

get_winner() reports a winner list showing which teams have the highest probability of lifting the trophy.  Apart from the random Poisson model, Brazil is clearly the winner in the three other models.  The top two teams are Brazil and Germany.  With different seeds, the third and fourth places (in darker blue) in my model are more likely to change.  Variance might be an interesting point to look at.
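
Under the hood, these probabilities are simply counts over the simulations. A rough sketch (not the packaged get_winner(), and assuming the first row of the result matrix holds each simulation’s champion as a team number) would be:

# Share of simulations won by each team
winner_share <- table(team_data$name[result[1, ]]) / ncol(result)

head(sort(winner_share, decreasing = TRUE))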

get_winner(result) %>% plot_winner()

get_winner(result2) %>% plot_winner()

get_winner(result3) %>% plot_winner()

get_winner(result4) %>% plot_winner()

Who will be top scoring team?

The Skellam model seems more reliable; my double-Poisson model gives systematically lower scoring frequencies than the probable actuals. They both favour Brazil, though.

get_top_scorer(nsim = nsim, result_data = result2) %>% plot_top_scorer()

get_top_scorer(nsim = nsim, result_data = result4) %>% plot_top_scorer()

Conclusion

The framework is pretty clear: all you need to do is customise the game-playing function passed to play_game, such as play_fun_simplest, play_fun_skellam and play_fun_elo.

Tick-tock… Don’t hesitate to send a pull request to ekstroem/socceR2018 on GitHub. Who is winning the guess-who-wins-worldcup2018 game?

If you like this post, please leave your star, fork, issue or banana on the GitHub repository of the post, including all code (https://github.com/MangoTheCat/blog_worldcup2018). The analysis couldn’t have been done without help from Rich, Doug, Adnan and all others who have kindly shared ideas. I have passed on your knowledge to the algorithm.

Notes

  1. Data collection. I didn’t get to feed the models with the most up-to-date betting odds and ELO ratings in the team_data dataset. If you would like to, they are available from three sources: the FIFA ratings are the easiest and can be scraped with rvest in the usual way; the ELO ratings and betting odds tables seem to be rendered by JavaScript and I haven’t found a working solution; for betting information, Betfair, an online betting exchange, has an API and an R package, abettor, which helps to pull those odds – definitely interesting for anyone who is after strategy beyond prediction.
  2. Model enhancement. This is probably where it matters most. For example, previous research has suggested various bivariate Poisson models for football predictions.
  3. Feature engineering. Economic factors such as national GDP, market information like total player value or insurance value, and player injury data may be useful to improve accuracy.
  4. Model evaluation. One way to understand whether our model has good predictive capability is to evaluate the predictions against actual outcomes after 15 July 2018. Current odds from bookmakers can also be referred to. It is not impossible to run the whole thing on historical data, e.g. the 2014 tournament, and perform model selection and tuning.
  5. Functions and the package could be better parameterised, and the code tidied up.

Author: Ava Yang, Data Scientist at Mango Solutions