ModSpace: collaboration software for data science teams

The ability to collaborate and share resources and information across distributed, and often siloed, data science teams is a common goal for today’s managers and team leads. To maximise efficiency, a single centralised location that facilitates collaboration across projects is the ultimate aim for team success and best practice: no more multiple workflows, incomplete activity or locally held models and datasets, and no more hours of wasted time and unreproducible effort.

Collaboration is key across data science teams

Mango’s team of consultants share the frustrations of in-house data science teams, and in answer to this common business challenge Mango developed ModSpace: a single repository that allows data science teams – modellers and analysts – to work with non-technical staff, streamlining collaboration by creating a virtual workspace where teams can meet, work and prioritise deadlines.

A single computational environment accessible by all

Providing a single meeting point that integrates analytic development environments with traditional desktop applications, ModSpace adds a powerful search engine that makes this valuable information quickly accessible to all.

The result is the ability to manage complex tasks across a team, ensuring projects with multiple needs and requirements are taken care of.

Mango Solutions’ Product Manager, Richard Kaye, believes ModSpace is the future of project collaboration: “It has often been said that data is the new oil.  A company’s analytic IP – code and models that unlock the insight from their data – is a highly valuable asset. The management of this asset is of increasing importance as organisations embark on a journey  to leverage their data to make more informed, data-driven decisions.  ModSpace was specifically built to support this journey – offering controlled storage of information and powerful search facilities in a single computational environment accessible by all.”

A proven platform

ModSpace offers teams a centralised environment for their data science projects:

  • A safe centralised repository for all file types, from scripts and datasets to MS Office documents
  • An industry-leading text search engine powered by strong text parsing capabilities which helps team members find valuable resources efficiently
  • The ability to categorise and describe projects and content using configurable metadata, thus enabling easy location of related and useful work
  • A newsfeed to enhance visibility of project activity, along with public and private feedback channels
  • Control over who has access to what, to keep sensitive information private
  • Access to industry-standard transparent version control, which assists with compliance and integrity
  • Compatibility with data analysis and modelling tools, such as R, Python, MATLAB, SAS and NONMEM, allowing users to interact easily with all files held in the repository

Richard continues, “Some of the world’s biggest pharmaceutical companies currently use ModSpace due to its flexibility to allow them to store everything in one place. Mango’s own Consultants use it as an internal repository – storing and collaborating on projects. It streamlines our processes and allows our teams to work on projects, no matter where they are, and without concern for having work overwritten or lost.”

For a product demonstration, or a discussion about how ModSpace can help improve the way your data science team works – powering innovation, collaboration and efficiency – get in touch.

data scientist or data engineer - what's the difference?

Author: Principal Consultant, Dean Wood

When it was floated that I should write this article, I approached it with trepidation. There is no better way to start an argument in the world of data than by trying to define what a Data Scientist is or isn’t – and by adding in the complication of the relatively new role of Data Engineer, there is no way this is not going to end in supposition and a lot of sentences starting with “I reckon”.

Nevertheless, understanding the value of these two vital roles is important to any organisation that seeks to unlock the value found in its data – without being able to describe these roles, it’s next to impossible to recruit appropriately. With that in mind, here is what I reckon.

‘By thinking in terms of rigidly defined boxes we are missing the point. A Data team should be thought of as covering a spectrum of the range of skills you need for effective data management and analytics. Simple boxes like Data Scientist and Data Engineer are useful, but should not be too rigidly defined.’

Reams have been written attempting to define what a Data Scientist is. The data science community has careered from expecting a Data Scientist to know everything from DevOps to statistics, to insisting that a Data Scientist needs a PhD – leading large institutions to give up and simply rebrand their BI professionals as Data Scientists. All of this misses the point.

Then arises the Data Engineer. No longer is your IT department the custodian of the data. The role has become too specialist and too critical to the business to be left to those who have worked really hard to understand traditional IT systems but think 3rd Normal Form is something from gymnastics and Hadoop is a noise you make after eating a kebab. Completely understandably, data has outgrown your average IT professional – but what do you need to make sure your data is corralled properly? Can’t we just throw a Data Scientist at it and get them to look after the data? Again, I think this misses the point.

Human beings are good at putting things into boxes and categories. It is how we manage the world and it is largely how we are trained to manage our businesses. Our management accountants take care of the finances and our HR department takes care of our employees. However, by putting people in these boxes with fairly rigid boundaries, there is a risk that necessary skills are missed and you end up with a team across your organisation that cannot provide what the business needs.

This is particularly true when we come to think of Data Scientists and Data Engineers. Rather than thinking of people in terms of the box to put them in, when building your data team it is preferable to think of a spectrum of skills that you need to cover. These skills can be broadly grouped into the boxes of Data Scientist and Data Engineer, but the crossover between the two can be high.

In your Data Engineering team you will need individuals with a leaning towards the world of DevOps, and you will need team members who are close to Machine Learning engineers. Likewise, in your Data Science team you will need members who are virtually statisticians, and team members who know something about deploying a model in a production environment. Making sure your team as a whole, and your individual project teams, cover this skill mix can be a real challenge.

So in summary, I reckon that we need to stop thinking about the boxes we put people in quite so much, and start looking at the skills we actually need in our teams to make our projects a success. Understanding the Data Scientist/Data Engineer job roles as a spectrum of skills where you may need Data Engineer-like Data Scientists, and Data Scientist-like Data Engineers, will give you more success when it comes to building your data team and delivering value from your data.

Data for Good

Good triumphing over evil in the end is the stuff of every good fairy tale or Hollywood storyline, but in real life, as we all know, it’s usually the tales of political doom and gloom across the world that dominate our screens with stories of good remaining well away from the spotlight.  Good, it seems, does not make for high viewing figures.

And stories about data are no exception to this rule.

Barely a day goes by without a story alarming the general public about their privacy and how their information is being used.  Think about the investigative documentary about the Cambridge Analytica Scandal, The Great Hack.  Think about banking information leaks or how Facebook is using your personal details and preferences.

It’s easy to forget that data science is also shaping the way we live, improving lives for the better and providing services we could only have dreamed of decades ago.

And it’s for this reason, that we at Mango decided to celebrate #Data4Good week, showcasing all of the different ways data science and analytics can be used for good in the world.

When The Economist declared in 2017 that data was more valuable than oil, few people truly understood its power and how this was possible.  Fast-forward to the current day, and the picture of data usage is becoming clearer.

In September, we were fortunate enough to secure some incredible speakers at our annual EARL Conference in London, who shared stories of how data science has benefited everything from local communities to healthcare, and has even helped progress peace talks in war-stricken areas. We have shared some of these stories via Twitter and I would urge you to take a look.

R in the NHS

 When it comes to healthcare, analysis and prediction can be used to better inform decisions about healthcare provision, streamline and automate tasks, dig into complex problems, and predict changes in the healthcare the NHS provides to its patients. Because of this, many non-profit organisations want to harness the power of data science.

During EARL, Edward Watkinson, an analyst, economist and data scientist currently working for the Royal Free London Hospital Group, took to the stage to explain how they have adopted R as the core tool on the analytical workbench that helps to run their hospitals, and to show how useful it has been in the cash-strapped NHS.

You can watch Edward’s ten-minute lightning talk here.

Helping local communities

Another great use case for data science being used for good is how it can help local communities. David Baker, research and evaluation residential volunteer worker at Toynbee Hall, took to the stage at EARL to explain how Toynbee Hall has adopted R as a tool within the charity sector.

By way of background, since its inception in 1884, evidence-based research has been central to Toynbee Hall’s mission as a charity serving East London communities. It had a hand in creating some of the first data visualisations for public good, publishing a series of poverty maps, and it regularly engages with the local community to solve problems.

R has allowed for rapid analyses of data across a diversity of projects at Toynbee Hall. Additionally, Baker explained how embracing open source software allowed the team to host a series of data hackathons, which let them recruit freelance data scientists to help analyse publicly available datasets, contributing to materials they use in their policy advocacy campaigns.

You can catch-up on David’s talk here.

Some amazing work has been done through the power of data science and analytics, and it’s continually changing the world around us. These stories won’t ever make the news, but it’s reassuring to remind ourselves that sometimes, good really can triumph over evil in the real world as well as in the movies.  We hope that people are beginning to see that data science is more than just a buzzword – it’s a new hope for good.

integrating Python and R

For a conference about the R language, the EARL Conference sees a surprising number of discussions about Python. I like to think that at least some of these are to do with the fact that we have run 3-hour workshops outlining various strategies for integrating Python and R. In this series of posts we will:

  • outline the basic strategy for integrating Python and R;
  • run through the different steps involved in this process; and
  • give a real example of how and why you would want to do this.

This post kicks everything off by:

  • covering the reasons why you may want to include both languages in a pipeline;
  • introducing ways of running R and Python from the command line; and
  • showing how you can accept inputs as arguments and write outputs to various file formats.

Why “And” not “Or”?

From a quick internet search for articles about “R Python”, of the top 10 results, only 2 discuss the merits of using both R and Python rather than pitting them against each other. This is understandable; from their inception, both have had very distinctive strengths and weaknesses. Historically, though, the split has been one of educational background: statisticians have preferred the approach that R takes, whereas programmers have made Python their language of choice. However, with the growing breed of data scientists, this distinction blurs:

Data Scientist (n.): Person who is better at statistics than any software engineer and better at software engineering than any statistician. — twitter @josh_wills

With the wealth of distinct library resources provided by each language, there is a growing need for data scientists to be able to leverage their relative strengths. For example: Python tends to outperform R in such areas as:

  • Web scraping and crawling: though rvest has simplified web scraping and crawling within R, Python’s BeautifulSoup and Scrapy are more mature and deliver more functionality.
  • Database connections: though R has a large number of options for connecting to databases, Python’s SQLAlchemy offers this in a single package and is widely used in production environments.

Whereas R outperforms Python in such areas as:

  • Statistical analysis options: though Python’s combination of SciPy, pandas and statsmodels offers a great set of statistical analysis tools, R is built specifically around statistical analysis applications and so provides a much larger collection of such tools.
  • Interactive graphics/dashboards: bokeh, plotly and intuitics have all recently extended the use of Python graphics onto web browsers, but getting an example up and running using shiny and shinydashboard in R is faster, and often requires less code.

Further, as data science teams now have a relatively wide range of skills, the language of choice for any application may come down to prior knowledge and experience. For some applications – especially in prototyping and development – it is faster for people to use the tool that they already know.

Flat File “Air Gap” Strategy

In this series of posts we are going to consider the simplest strategy for integrating the two languages, and step through it with some examples. Using a flat file as an air gap between the two languages requires the following steps.

  1. Refactor your R and Python scripts to be executable from the command line and accept command line arguments.
  2. Output the shared data to a common file format.
  3. Execute one language from the other, passing in arguments as required.

Pros

  • Simplest method, so commonly the quickest
  • Can view the intermediate outputs easily
  • Parsers already exist for many common file formats: CSV, JSON, YAML

Cons

  • Need to agree upfront on a common schema or file format
  • Can become cumbersome to manage intermediate outputs and paths if the pipeline grows.
  • Reading and writing to disk can become a bottleneck if data becomes large.

Command Line Scripting

Running scripts from the command line via a Windows or Linux terminal environment is similar in both R and Python. The command to be run is broken down into the following parts:

<command_to_run> <path_to_script> <any_additional_arguments>

where:

  • <command_to_run> is the executable to run (Rscript for R code and python for Python code);
  • <path_to_script> is the full or relative file path to the script being executed. Note that if there are any spaces in the path name, the whole file path must be enclosed in double quotes;
  • <any_additional_arguments> is a list of space-delimited arguments passed to the script itself. Note that these will be passed in as strings.

So for example, an R script is executed by opening up a terminal environment and running the following:

Rscript path/to/myscript.R arg1 arg2 arg3

A Few Gotchas

  • For the commands Rscript and python to be found, these executables must already be on your path; otherwise the full path to their location on your file system must be supplied.
  • Path names with spaces create problems, especially on Windows, and so must be enclosed in double quotes so they are recognised as a single file path.

Accessing Command Line Arguments in R

In the above example, arg1, arg2 and arg3 are the arguments passed to the R script being executed; these are accessible using the commandArgs function.

## myscript.R

# Fetch command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)

# myArgs is a character vector of all arguments
print(myArgs)
print(class(myArgs))

By setting trailingOnly = TRUE, the vector myArgs contains only the arguments that you added on the command line. If left as FALSE (the default), the vector will also include other elements, such as the path to the script that was just executed.
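
Since everything arrives as a character vector, a script will typically convert its arguments straight away. A minimal sketch (the argument meanings here are invented for illustration):

# myscript.R, called as: Rscript myscript.R 10 0.5 output.csv
myArgs <- commandArgs(trailingOnly = TRUE)

n_sims   <- as.integer(myArgs[1])  # "10"  -> 10
rate     <- as.numeric(myArgs[2])  # "0.5" -> 0.5
out_file <- myArgs[3]              # file paths can stay as strings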

Accessing Command Line Arguments in Python

For a Python script executed by running the following on the command line

python path/to/myscript.py arg1 arg2 arg3

the arguments arg1, arg2 and arg3 can be accessed from within the Python script by first importing the sys module. This module holds parameters and functions that are system specific; here we are only interested in the argv attribute, which is a list of all the arguments passed to the script currently being executed. The first element in this list is always the path to the script being executed.

# myscript.py
import sys

# Fetch command line arguments
my_args = sys.argv

# my_args is a list where the first element is the file executed.
print(type(my_args))
print(my_args)

If you only wish to keep the arguments passed into the script, you can use list slicing to select all but the first element.

# Using a slice, selects all but the first element
my_args = sys.argv[1:]

As with the above example for R, recall that all arguments are passed in as strings, and so will need converting to the expected types as necessary.

Writing Outputs to a File

You have a few options when sharing data between R and Python via an intermediate file. In general for flat files, CSVs are a good format for tabular data, while JSON or YAML are best if you are dealing with more unstructured data (or metadata), which could contain a variable number of fields or more nested data structures. All of these are very common data serialisation formats, and parsers already exist in both languages. In R, the jsonlite and yaml packages cover JSON and YAML, while CSVs can be handled in base R or with readr; in Python, the csv and json modules are part of the standard library, distributed with Python itself, whereas PyYAML will need installing separately. The R packages will also need installing in the usual way.
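
As a rough illustration of the R side of such a handover (a sketch only, assuming the jsonlite package is installed; the file names are arbitrary):

library(jsonlite)

results <- data.frame(id = 1:3, score = c(0.2, 0.5, 0.9))

# Tabular data: write a CSV for the Python step to pick up
write.csv(results, "results.csv", row.names = FALSE)

# Nested metadata: write JSON
meta <- list(model = "example", created = as.character(Sys.Date()), n = nrow(results))
write_json(meta, "metadata.json", auto_unbox = TRUE)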

Summary

So passing data between R and Python (and vice-versa) can be done in a single pipeline by:

  • using the command line to transfer arguments, and
  • transferring data through a commonly-structured flat file.

However, in some instances, having to use a flat file as an intermediate data store can be both cumbersome and detrimental to performance.
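
To give a feel for that final step of executing one language from the other, here is a rough sketch from the R side, using system2 to run a hypothetical Python script and then reading back the flat file it produces (the script name, arguments and file names are invented):

# Run a Python script, passing arguments on the command line
system2("python", args = c("path/to/process_data.py", "results.csv", "100"))

# Read back the intermediate file written by the Python step
processed <- read.csv("processed_results.csv")
head(processed)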

Authors: Chris Musselle and Kate Ross-Smith


At Mango, we talk a lot about going on a ‘data-driven journey’ with your business. We’re passionate about data and getting the best use out of it. But for now, instead of looking at business journeys, I wanted to talk to the Mango team and find out how they started on their own ‘data journey’ – what attracted them to a career in data science and what they enjoy about their day-to-day work. (It’s not just typing in random numbers?! What?!)

We are hugely fortunate to have a wonderful team of data scientists who are always generous in sharing their skills and don’t mind teaching R to non data scientists and uber beginners. So let’s see what they have to say on becoming a Mango…

Jack Talboys

Jack joined us last year as a year-long placement student.

“I actually had no idea what Data Science was until I discovered Mango about a year and a half ago. I was at the university careers fair, not really impressed by the prospect of working in finance or as a statistician for a large company, when I stumbled across Liz Matthews and Owen Jones, who were there representing Mango. Drawn in by the title “Data Science”, we started talking. Data Science seemed to tick all of my boxes, letting me use my knowledge of statistics and probability while doing lots of coding in R.

I’m now 6 months in at Mango and it couldn’t be going better. I’ve greatly improved my proficiency in R, alongside learning new skills like Git, SQL and Python. I’ve been given a great deal of responsibility, with assisting in delivering training to a client and attending the EARL 2018 conference making up some of my highlights. There have also been opportunities for me to be client-facing, giving me a deeper understanding of what it takes to be a Data Science Consultant.

Working at Mango hasn’t just developed my technical skills, however; without really noticing, I’ve found that I have become a better communicator. Whether organising tasks with the other members of the ValidR team or talking to clients, I have discovered a new sense of confidence and trust in myself. Even as a relative newbie I can see that Data Science as an industry is growing massively – and I’m excited to be part of this growth and make the most of the exciting opportunities it presents with Mango.”

Beth Ashlee, Data Scientist

“I got into data science after applying for a summer internship at Mango. I didn’t really know much about the data science community previously, but spent the next few weeks learning more technical and practical skills than I had in 3 years at university.

I’ve been working as a Data Science Consultant for nearly 3 years and, thanks to the wide variety of projects, I’ve never had a dull moment. I have had amazing opportunities to travel worldwide, teaching training courses and interacting with customers from all industries. The variety is my favourite part of the job: you could be building a Shiny application to help a pharmaceutical company visualise their assay data one week, and teaching a training course at the head offices of a large company such as Screwfix the next.”

Owen Jones, Data Scientist

“To be honest, it rarely feels like work… since we’re a consultancy, there’s always a wide variety of projects on the go, and you can get yourself involved in the areas you find most interesting! Plus, you have the opportunity to visit new places, and you’re always meeting and working with new people – which means new conversations, new ideas and new understanding. I love it.”

Nick Howlett, Data Scientist

Nick is currently working on a client project in Italy.

“During my time creating simulations in academic contexts I found myself more motivated by meeting my supervisor’s requirements than by pursuing niche research topics. Towards the end of my studies I discovered data science and realised that the client-consultant relationship was a very similar situation.

Working at Mango has allowed me to develop personal relationships with clients across many sectors – and get to know their motivations and individual data requirements. Mango has also given me the opportunity to travel on both short term training projects and more long term projects abroad.”


Python

I have been asked this tricky question many times in my career: “Python or R?”. Based on my experience, the answer depends entirely on the purpose at hand, and it is still a question that many aspiring data scientists, business leaders and organisations are pondering.

It is important to have the right tools when providing the desired answers to the many business questions within the data science space – which isn’t as simple as it sounds. Whether you are considering data analytics, data science, strategic data planning or building a data science team, deciding which language to start with can be a major blocker.

Python has become the de facto language of choice for organisations seeking to build or upscale skills seamlessly, and its influence is evident in the cloud computing environment. According to the 20th annual KDnuggets Software Poll, Python is still the leader, and top tech companies like Alphabet’s Google and Facebook continue to use Python at the core of their frameworks.

Some of Python’s essential benefits are its fluency and natural readability: it is easy to learn, and it provides a great deal of flexibility in terms of scalability and productionisation. There are also many purpose-built libraries and packages.

Data is everywhere

Data is everywhere, big or small, and plenty of companies have it but are not harnessing the capabilities of this great asset. Of course, the availability of data without the “algorithms” will not add any business value. That is why it is important for companies and business leaders to move fast and adopt the tools that help to transform their data into the economic benefits they desire. By choosing Python, companies will be able to utilise the potential of their data.

Deployment and Cloud Capability

Python’s capability is broad, and its impact is felt in Machine Learning, Computer Vision, Natural Language Processing and many other areas. Its robustness and growing ecosystem have made it a natural fit for many deployment and integration tools. If you use Google Cloud Platform (GCP), Amazon Web Services (AWS) or Microsoft’s Azure, you will find it convenient to use and integrate with Python. Cloud technologies are growing at a rapid pace, and Python drives a large share of the applications running on them.

Concluding Remarks

Taking a broad perspective, you might doubt whether there is any real question of supremacy between Python and R (or even SQL), but needs and versatility vary widely. Python has become a kingpin thanks to its user-friendliness, scalability, extensive ecosystem of libraries and interoperability. Popular Python libraries support the development and evolution of Artificial Intelligence (AI), and many organisations are beginning to see the value of upskilling and taking advantage of Python in their AI-driven decisions.

Mango Solutions

There is a big drive within Mango to support the use of Python as an essential tool, benefiting our consultants and clients in many ways. Many projects have had Python at their core when it comes to project execution. Our consultants have also delivered several training courses to organisations in both the public and private sectors across the globe, helping them harness the potential of Python in their data-driven decisions, adding business value and shaping their data journey.

Author: Dayo Oguntoyinbo, Data Scientist


This summer Mango took on three summer interns, Chris, Ellena and Lizzi, all maths students at different stages of their university careers. To provide insight into what it’s like to work on a data science project, Mango set up a three day mini-project. The brief was to analyse data from the 2018 Cost of Edinburgh survey.

The Cost of Edinburgh project was founded in 2017 by director and producer Claire Stone. The survey was designed in collaboration with, and with support from, the Fringe Society. It ran on SurveyMonkey in 2018 with the goal of collecting 100-150 responses from the wide range of people involved with performing at the Edinburgh Fringe Festival that year. There were three elements to the scope of the survey:

  • Collect demographic data in order to explore the impact of the costs of attending the Fringe on diversity;
  • Collect data on production income versus costs over multiple years, in terms of venue costs, accommodation and travel;
  • Obtain qualitative responses on the financial and wellbeing costs of attending the Fringe.

The survey aimed to determine which performers attend the Fringe, ascertain what stops people from attending, and establish whether it is becoming more expensive to perform at the festival. 368 people responded to the survey, which had 22 questions in three main sections: demographics, quantitative questions on costs and income, and qualitative questions on costs and wellbeing. In this post, Chris and Lizzi share their experiences of their first data science project.

——————————————————————————————————————————————————————————

As is usually the case when real-world data are involved, they weren’t ready to be analysed out of the box. On first look we saw that we didn’t have tidy data in which each column contained the answer to a different question. There were lots of gaps in the spreadsheet, but not all due to missing data.

  • Questions that required the respondent to choose one answer but included an ‘Other’ category corresponded to two columns in the spreadsheet; one containing the chosen answer, and one containing free text if the chosen answer was ‘Other’ and empty otherwise.
  • Questions for which respondents could choose multiple answers had one column per answer. Each column contained the answer if chosen and was empty if not.
  • For quantitative cost and income questions respondents could fill in details for up to ten productions, or years of attending the Fringe. If a question asked for five pieces of information it corresponded to 50 columns per subject, many of which were empty.

 

Chris’s thoughts

After being told that Nick had had a “quick, preliminary look at the data” and discussing his findings, we decided to split the data into two sections, Demographics and Financial, with the idea that any spare time at the end would be spent on the more qualitative feedback questions. Since there was a lot more complicated data in the financial questions, it was decided that both Ellena and I would tackle them: Ellena would take the “costs” questions and I would take the “income” questions.

Now that the jobs were split into manageable chunks, we could start appraising what questions we wanted to answer with the data. Looking more carefully at the data we had been given, it was clear that a lot of the answers were categorical, so bar graphs seemed like an obvious option. It would have been really nice to have continuous data, but I can understand why people would be uncomfortable answering a survey with that level of personal detail. Having the opportunity to see the project evolve from the beginning to this point, where we had specific questions to answer, was a really positive experience. By this point, I felt as if I’d learnt so much already. Here is a histogram of the annual income of Edinburgh Fringe performers from their work in the arts.

 

 

We were to use an Agile development methodology, with a creative, sped-up version of the scrum method. The scrum method is a series of short bursts of work with defined targets, called sprints, interspersed with short meetings called stand-ups (named because you’re not allowed to sit down). This was my first introduction to a professional workflow, and it’s given me an insight into how companies might manage work. These sprints are meant to be days or weeks in length, but because of our 3-day deadline we had to adapt the strategy, splitting each day into two parts and having two stand-ups per day.

We spent the next 2 days transforming the data into a usable form and creating some graphs that certainly showed some things clearly. However, for a lot of the financial data we didn’t have a large enough sample size to perform statistical tests with a high level of certainty, which left the output feeling very categorical. This didn’t stop me from learning a huge amount though. I was introduced to the tidyverse, a collection of packages designed to work together, and then it was just a matter of coaxing me out of my `for` loop ways and into using group_by instead. There was a lot of coding to do and I feel that this project has really developed my R skills. I mainly code in Python and, before this, my only history with R was one year’s worth of academic use at university. This was a whole new experience, both in level of exposure and in the impromptu lessons every few hours – my favourite being enthusiastically introduced to regex by Nick, who taught me that any problem can be solved by regex.
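
To give a flavour of that last point, a dplyr version of a typical for-loop aggregation might look something like the snippet below: a sketch only, assuming a data frame called costs with columns cost_type and amount (both invented for illustration).

library(dplyr)

# Instead of looping over categories and accumulating totals by hand...
costs %>%
  group_by(cost_type) %>%
  summarise(
    n_responses = n(),
    total_cost  = sum(amount, na.rm = TRUE),
    mean_cost   = mean(amount, na.rm = TRUE)
  )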

This project made me appreciate the need for good quality data in data science, and how much of a project is spent cleaning and pre-processing compared to actually performing statistical tests and generating suave ggplots. There were a lot of firsts for me, like using ggplot2, dplyr and git in a professional environment.

 

Lizzi’s thoughts

Three days isn’t very long to take a piece of work from start to finish and in particular, it doesn’t allow for much thinking time. We had to decide which questions it was achievable to address in the time frame and divide the tasks so that we weren’t all working on the same thing. My job was to look at the demographic data. I didn’t produce any ground-breaking research, but I was able to produce a bunch of pretty pictures, discovering some new plotting packages and practising some useful skills along the way.

Firstly, I learnt what a waffle plot is: a way of displaying categorical data to show parts-to-whole contributions. Plots like the example below, which represents the self-reported annual income of Edinburgh Fringe performers from their work in the arts, can be easily created in R using the waffle package. The most time-consuming task required to create such a plot is ensuring the factor levels are in the desired order; the forcats package came in useful for this.
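
Something roughly along these lines, with made-up income bands and counts, and assuming the waffle and forcats packages are installed:

library(waffle)
library(forcats)

# Made-up survey responses: one income band per performer
bands <- sample(c("Under £10k", "£10k-£20k", "Over £20k"), 40, replace = TRUE)

# Put the levels in the desired (non-alphabetical) order with forcats
bands <- fct_relevel(bands, "Under £10k", "£10k-£20k", "Over £20k")

# waffle() takes a named vector of counts, one square per response
waffle(c(table(bands)), rows = 5, title = "Annual income from work in the arts")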

 

 

The leaflet package enables you to easily create zoomable maps, using images from OpenStreetMap by default. Once again, the most time consuming part was getting the data in the right format. Location data was a mix of UK post codes, US zip codes, towns, countries and sets of four numbers that seemed like they might be Australian post codes. Using a Google API through the mapsapi package, and a bit of a helping hand so that Bristol didn’t turn into a NASCAR track in Tennessee, I could convert these data into longitude and latitude coordinates for plotting. This package can also calculate distances between places, but it only worked for locations in Great Britain as it uses Google maps driving directions. Instead, to determine how far performers travel to get to the Fringe I resorted to calculating the geodesic distance between pairs of longitudes and latitudes using the gdist function from Imap.
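
As a rough idea of the leaflet side of this (the coordinates below are invented):

library(leaflet)

# A few made-up performer locations (longitude/latitude)
locations <- data.frame(
  lng = c(-2.36, -3.19, 0.13),
  lat = c(51.38, 55.95, 52.20)
)

leaflet(locations) %>%
  addTiles() %>%                          # OpenStreetMap tiles by default
  addCircleMarkers(~lng, ~lat, radius = 4)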

 

This project was also a good opportunity to practise using the tidyverse to manipulate data. I learnt base R at university and only came across the %>% pipe on a placement last year when working with somebody else’s code. Currently data manipulation in base R is second nature to me and doing something in the tidyverse way requires much more thought, so this project was a step towards becoming less old fashioned and shifting towards the tidyverse.

This was my first experience of using version control properly for a collaborative project. I use version control for my PhD work but purely in a way that means I don’t lose everything if my laptop gets lost, stolen or broken. I’m sure this will send shivers down some spines but my commits are along the lines of “all changes on laptop” or “all changes from desktop” as I update between working at home or at uni, often after a few days. I’ve learnt that proper use of Git means making a branch on which to carry out a particular piece of the work without interfering with the master branch. It also means committing regularly, with informative messages. Once work on a branch is finished, you submit a merge request (aka pull request). This prompts somebody else to review the code and if they’re satisfied with it, press the big green button to merge the branch into the master branch. It was also important to make the repository tidy and well-structured so that it made sense to others and not just to me.

The output of the project was an R Markdown document rendered to HTML. We brought our work together by making a skeleton Markdown document and importing each person’s work as a child document. Once we had worked out which packages each of us had installed, and made the sourcing of files and reading in of data consistent, the document knitted together smoothly.
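
For anyone who hasn’t used child documents, the skeleton can be as simple as a single chunk (with the chunk option results = "asis") that knits each person’s file in turn; the file names here are hypothetical:

# Inside a chunk of the skeleton R Markdown document
children <- c("chris_income.Rmd", "ellena_costs.Rmd", "lizzi_demographics.Rmd")
rendered <- lapply(children, knitr::knit_child, quiet = TRUE)
cat(unlist(rendered), sep = "\n")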

As well as the coding side of things, I also learnt a bit about project management methodology. In the initial messages about the project we were told it would be carried out using scrum methodology; a quick Google reassured me that no rugby was going to be involved. As Chris mentioned, we had 15-minute stand-ups (nothing to do with comedians) each morning and afternoon. The purpose of these meetings was to quickly catch each other up on what we had been working on, what we were going to do next and whether there were any blockers to getting the work done – the latter being particularly important given the small time frame.

 

In summary

In summary, this project resembled an exploratory phase of a full project and was perhaps a bit too short to produce a completely polished deliverable. However, we all learnt something, improved our R skills and had an enjoyable and interesting experience of working on a data science project.


gambling and gaming industry

One of my favourite movies of all time is Rain Man, in which Dustin Hoffman plays Raymond Babbit, an autistic savant whose ability to count hundreds of cards at once leads to significant wins at the Las Vegas casino tables. Fortunately for gambling and gaming companies, Raymond’s counting abilities extend far beyond the normal range of human subitising and, aside from the occasional winning streak, the vast majority of us will be net losers. But in the gambling and gaming industry, it’s not just the customer who needs the statistical insight at their fingertips in order to succeed, it’s the vendors. And this insight comes in the form of data, AI/ML and the correct cloud infrastructure in order to help vendors win as big as their customers.

On the surface of it, the odds are stacked in favour of the vendors. With 2 billion gamers across the world, and the current size of the global gambling market – almost $46 billion – forecast to double in the coming years, it’s clear that the ‘have to be in it to win it’ player mentality is proving lucrative for vendors. Each of those players will participate in multiple actions and interactions, leaving a trail of data as they go and allowing vendors to act on this information to really understand their customers. But scratch beneath the surface and a mine of potential complexities unfolds. How do they prevent the loss of high value customers? How do they optimise marketing spend, or predict and prevent problem playing and gambling? How do they ensure they are identifying and developing the right products, forecasting sales accurately, predicting churn, or preventing and detecting fraud?

When Mango begins a new engagement with companies operating in this sector, before any data science can take place, it’s crucial to ensure that the right questions are being asked and challenges being addressed.

Let’s take a look at some of those common challenges facing the gaming & gambling industry:

Preventing the loss of high value customers

Solving the issue of customer churn is one of the biggest challenges for any online business, but the effective use of data science can help by better forecasting when customers – and particularly high value customers – look likely to leave. The more accurately you can forecast churn, the more effective you can be at preventing customer loss. Using data science to segment the customer base by any attribute, such as age, location or date they joined, means approaches can be developed that are personalized – and therefore relevant – to that customer base. For example, analytics could show a list of customers who are approaching the end of their contract or detect less activity on an account than is ‘normal’ according to historical patterns, or perhaps that new, heavily featured games are not being played. In all of these instances, data driven decisions can be made on the most effective intervention tactics and appropriate incentives to retain these customers, such as loyalty points, or reduced price play.

Optimising marketing spend

Contrary to popular opinion, marketing funds are not bottomless pits, and adopting a ‘spray and pray’ approach will likely result in little, if any, return on investment. With customer data being captured with every online transaction, however, vendors can gather huge volumes of structured and unstructured data about each individual in order to offer targeted, personalised marketing. The key word here is ‘personalised.’ Just because 100 customers might fall into the same broad segment, it doesn’t mean they should be targeted in the same way. Individuals within these groups have individual preferences, and algorithms can help determine which communication or marketing channel would have the most impact with a particular customer and deliver the highest response rate, thus optimising marketing spend.

Maximising cross & upsell through great customer experience

Ensuring that customers are happy enables vendors to cross and upsell and data science can unlock insights which help win and retain customers. These insights can help online businesses ‘know their customer’ and therefore make tailored improvements to the customer experience and measure immediate impact. Did increased spend on customer retention contribute to increased revenue? If so, by how much? What is the impact of multiple offers and communications to customers on sales, unsubscribes and retention? Excellence in customer service can be achieved by adopting a data driven, 360 degree approach which offers a thorough understanding of the audience and a means to deliver the service desired at the right time and via the most appropriate channels.

Predicting & preventing problem gambling

As the gambling and gaming industry grows, so does the problem of gambling addiction. According to a recent BBC article, there are about 430,000 people experiencing problems with gambling and, as we all know, this can impact anyone. There is no typical ‘problem gambler’ – it’s an issue that transcends all social and demographic groups. The Gambling Commission recently launched its new three-year National Strategy which focuses on prevention, education, treatment and support for problem gamblers, which is a significant step in the right direction. And data science can help this process by identifying potential candidates for such support. Using data collected on the betting patterns of every customer, including the time of day, frequency and size of bets placed, a picture can be built up of an individual’s typical behavioural patterns, so that any gradual change or deviation from this pattern can signal the onset of a potential problem. At this point, the company can decide to apply intervention strategies, such as the temporary stop of an account, or refer the player for online help.
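
To make that idea concrete, a deliberately simplified sketch of flagging a shift in betting behaviour might look like this. The data and threshold below are made up, and a real safer-gambling model would be far richer:

# Made-up daily stake totals for one player: a stable baseline, then a jump
set.seed(42)
daily_stake <- c(rnorm(60, mean = 20, sd = 5), rnorm(14, mean = 55, sd = 10))

baseline <- daily_stake[1:60]
recent   <- daily_stake[61:74]

# Flag the account if recent stakes sit far outside the player's own baseline
z <- (mean(recent) - mean(baseline)) / sd(baseline)
if (z > 3) message("Review account: betting pattern has shifted markedly")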

Predicting & preventing fraud

In this sector, the list of potential pitfalls is, sadly, long and sobering for customers and vendors alike: frequent, often large, credit card payments; free credit offered by companies as incentives to play, encouraging the creation of fake accounts; stolen credit card details; accounts being hacked… I could go on. Fortunately, advanced analytics can be used to help create a picture of ‘normal’ account activity for individual players, and so flag any abnormalities at the earliest opportunity. With this early detection system in place, an effective monitoring programme can help protect the organisation and the individual.

The possibilities for data science to help your business win are far-reaching and, if you’re wondering how you can find out more, I’m delighted to say that next Wednesday evening I’ll be presenting alongside Rackspace and Google to share some of the possibilities with attendees. We’ll be showing how to build a successful data science capability on Google Cloud, aligning key challenges with the ‘Art of the Possible’. By answering the right questions through advanced analytics, we can help you create predictive models – including churn, demand and customer lifetime value – to enable effective decision making.

Don’t leave your fortune to Lady Luck. Organisations that win with data science will do so by answering the best business questions, not creating the best answers to data questions.

50 shades of R

 

I’ve been joking about R’s “200 shades of grey” on training courses for a long time. The popularity of the book “50 Shades of Grey” has changed the meaning of this statement somewhat. As the film is due to be released on Valentine’s Day I thought this might be worth a quick blog post.

Firstly, where did I get “200 shades of grey” from? This statement was originally derived from the roughly 200 named colours containing either “grey” or “gray” in the vector generated by the colours function. As you will see, there are in fact 224 shades of grey in R.

greys <- grep("gr[ea]y", colours(), value = TRUE)

length(greys)

[1] 224

 

This is because there are also colours such as slategrey, darkgrey and even dimgrey! So let’s now remove anything that is more than just “grey” or “gray”.

 

greys <- grep("^gr[ea]y", colours(), value = TRUE)

length(greys)

[1] 204

 

So in fact there are 204 that are classified as “grey” or “gray”. If we take a closer look, though, it’s clear that there are not 204 unique shades of grey in R: we are doubling up so that we can use both the British “grey” and the US “gray”. This is really useful for R users, who don’t have to remember to change the way they usually spell grey/gray (you might also notice that I have used the function colours rather than colors), but when it comes to unique greys it means we have to be a little more specific in our search pattern. So stripping back to just shades of “grey”:

 

greys <- grep("^grey", colours(), value = TRUE)

length(greys)

[1] 102

 

we find we are actually down to just 102. Interestingly, we don’t double up on all grey/gray colours: slategrey4 doesn’t exist but slategray4 does!

So really we have 102 shades of grey in R. Of course, this is only using the named colours; if we define the colour using rgb we can make use of all 256 grey levels!
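
For instance, something like this (not from the original post, just a quick sketch) generates every one of those levels:

# 256 greys, from black (#000000) to white (#FFFFFF)
greys256 <- rgb(0:255, 0:255, 0:255, maxColorValue = 255)

length(unique(greys256))

[1] 256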

 

 

So how can we get 50 shades of grey? Well the colorRampPalette function can help us out by allowing us to generate new colour palettes based on colours we give it. So a palette that goes from grey0 (black) to grey100 (white) can easily be generated.

 

shadesOfGrey <- colorRampPalette(c("grey0", "grey100"))

shadesOfGrey(2)

[1] "#000000" "#FFFFFF"

 

And 50 shades of grey?

 


 

fiftyGreys <- shadesOfGrey(50)

mat <- matrix(rep(1:50, each = 50))

image(mat, axes = FALSE, col = fiftyGreys)

box()

 

I hear the film is not as “graphic” as the book – but I hope this fits the bill!

 

Author: Andy Nicholls, Data Scientist

 

FIFA World Cup 2018 predictions

Given that the UEFA Champions League final a few weeks ago between Real Madrid and Liverpool is the only match I’ve watched properly in over ten years, how dare I presume I can guess that Brazil is going to lift the trophy at the 2018 FIFA World Cup? Well, here goes…

By the way, if you find what follows dry to read, it is because of my limited natural language on the subject matter… data science tricks to the rescue!

The idea is that in each simulation run of the tournament we find the winner, runners-up, third place, fourth place and so on. Repeating the simulation many times, e.g. 10k runs, returns the list of teams with the highest probability of being ranked top.

library(tidyverse)
library(magrittr)
devtools::load_all("worldcup")

normalgoals <- params$normalgoals 
nsim <- params$nsim

data(team_data) 
data(group_match_data) 
data(wcmatches_train)

Apart from the winner question, this post seeks to answer which team will be the top scorer and how many goals they will score. After following Claus’s analysis R Markdown file, I collected new data, put functions into a package and tried another modelling approach. Whilst the model is too simplistic to be correct, it captures the trend and is a fair starting point to add more complex layers on top.

Initialization

To begin with, we load packages, including the accompanying R package worldcup where my utility functions reside. A package is a convenient way to share code, seal utility functions and speed up iteration. The global parameters normalgoals (the average number of goals scored in a world cup match) and nsim (the number of simulations) are declared in the YAML section at the top of the R Markdown document.

Next we load three datasets that have been tidied up from open source resource or updated from original version. Plenty of time was spent on gathering data, aligning team names and cleaning up features.

  • team_data contains features associated with team
  • group_match_data is match schedule, public
  • wcmatches_train is a match dataset available from this Kaggle competition and can be used as a training set to estimate the parameter lambda, i.e. the average number of goals scored in a match by a single team. Records from 1994 up to 2014 are kept in the training set.

Play game

Claus proposed three working models to calculate the outcome of a single match. The first is based on two independent Poisson distributions, where the two teams are treated as equal and so the result is random regardless of their actual skills and talent. The second assumes the scoring events in a match are two Poisson processes; the difference between two Poisson variables is believed to follow a Skellam distribution. The result turns out to be much more reliable, as the parameters are estimated from actual betting odds. The third is based on the World Football ELO Ratings rules: from the current ELO ratings we calculate the expected result of one side in a match, which can be seen as the probability of success in a binomial distribution. This approach seems to overlook draws, due to the binary nature of the binomial distribution.

The fourth model presented here is my first attempt. To spell it out: we assume two independent Poisson events, with lambdas predicted from a trained Poisson regression model. The predicted goals are then simulated with rpois.
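
In outline, and assuming a fitted Poisson glm mod (as trained below) plus a hypothetical row of features for each team, the core of the idea is roughly:

# Expected goals for each side, from the fitted Poisson regression
lambda1 <- predict(mod, newdata = team1_features, type = "response")
lambda2 <- predict(mod, newdata = team2_features, type = "response")

# Simulate one match by drawing the two goal counts independently
goals <- c(rpois(1, lambda1), rpois(1, lambda2))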

Each model candidate has its own function, which is specified via the play_fun parameter and passed to the higher-level wrapper function play_game.

# Specify team Spain and Portugal
play_game(play_fun = "play_fun_simplest", 
          team1 = 7, team2 = 8, 
          musthavewinner=FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      0      1
play_game(team_data = team_data, play_fun = "play_fun_skellam", 
          team1 = 7, team2 = 8, 
          musthavewinner=FALSE, normalgoals = normalgoals)
##      Agoals Bgoals
## [1,]      1      4
play_game(team_data = team_data, play_fun = "play_fun_elo", 
          team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      0      1
play_game(team_data = team_data, train_data = wcmatches_train, 
          play_fun = "play_fun_double_poisson", 
          team1 = 7, team2 = 8)
##      Agoals Bgoals
## [1,]      2      2

Estimate poisson mean from training

Let’s have a quick look at the core of my training function. The target variable in the glm call is the number of goals a team obtained in a match. The predictors are the FIFA and ELO ratings at a point just before the 2014 tournament started. Both are popular ranking systems – the difference being that the FIFA rating is official and the latter is in the wild, adapted from the chess ranking methodology.

mod <- glm(goals ~ elo + fifa_start, family = poisson(link = log), data = wcmatches_train)
broom::tidy(mod)
##          term      estimate    std.error  statistic      p.value
## 1 (Intercept) -3.5673415298 0.7934373236 -4.4960596 6.922433e-06
## 2         elo  0.0021479463 0.0005609247  3.8292949 1.285109e-04
## 3  fifa_start -0.0002296051 0.0003288228 -0.6982638 4.850123e-01

From the model summary, the ELO rating is statistically significant whereas the FIFA rating is not. More interesting is that the estimate for the FIFA ratings variable is negative, implying a multiplicative effect of 0.9997704 on the expected goal count for each additional rating point. Overall, the FIFA rating appears to be less predictive of the goals a team may score than the ELO rating. One possible reason is that only the 2014 ratings were collected, and it may be worth future effort to go further back into history. Challenges to the FIFA ratings’ predictive power are not new, after all.
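
For clarity, that 0.9997704 figure is simply the exponentiated fifa_start estimate:

# Multiplicative effect on expected goals of a one-point increase in FIFA rating
exp(-0.0002296051)
## [1] 0.9997704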

The training set wcmatches_train has a home column, representing whether team X in match Y was the home team. However, it’s hard to say whether home/away status in a third country makes as much difference as it does in league competitions, and I didn’t find an explicit home/away split for the Russian World Cup. We could derive a similar feature – host advantage, indicating the host nation or continent – in a future model iteration. Home advantage is left out for the time being.

Group and kickout stages

Presented below are examples showing how to find winners at various stages – from group to round 16, quarter-finals, semi-finals and final.

find_group_winners(team_data = team_data, 
                   group_match_data = group_match_data, 
                   play_fun = "play_fun_double_poisson",
                   train_data = wcmatches_train)$goals %>% 
  filter(groupRank %in% c(1,2)) %>% collect()
## Warning: package 'bindrcpp' was built under R version 3.4.4

## # A tibble: 16 x 11
##    number name         group  rating   elo fifa_start points goalsFore
##                               
##  1      2 Russia       A       41.0   1685        493   7.00         5
##  2      3 Saudi Arabia A     1001     1582        462   5.00         4
##  3      7 Portugal     B       26.0   1975       1306   7.00         6
##  4      6 Morocco      B      501     1711        681   4.00         2
##  5     12 Peru         C      201     1906       1106   5.00         3
##  6     11 France       C        7.50  1984       1166   5.00         6
##  7     13 Argentina    D       10.0   1985       1254   9.00         8
##  8     15 Iceland      D      201     1787        930   6.00         4
##  9     17 Brazil       E        5.00  2131       1384   7.00         8
## 10     20 Serbia       E      201     1770        732   6.00         4
## 11     21 Germany      F        5.50  2092       1544   6.00         8
## 12     24 Sweden       F      151     1796        889   6.00         5
## 13     27 Panama       G     1001     1669        574   5.00         3
## 14     25 Belgium      G       12.0   1931       1346   5.00         4
## 15     31 Poland       H       51.0   1831       1128   4.00         2
## 16     29 Colombia     H       41.0   1935        989   4.00         1
## # ... with 3 more variables: goalsAgainst , goalsDifference ,
## #   groupRank 
find_knockout_winners(team_data = team_data, 
                      match_data = structure(c(3L, 8L, 10L, 13L), .Dim = c(2L, 2L)), 
                      play_fun = "play_fun_double_poisson",
                      train_data = wcmatches_train)$goals
##   team1 team2 goals1 goals2
## 1     3    10      2      2
## 2     8    13      1      2

Run the tournament

Here comes the most exciting part. We made a function, simulate_one(), to play the tournament once, and then replicate() it (literally) many, many times. To run an ideal number of simulations, for example 10k, you might want to turn on parallelisation; I am staying at 1000 for simplicity.

Finally, simulate_tournament() is the ultimate wrapper for all the above bullet points. The returned result objects are 32 by nsim matrices containing the predicted rankings from each simulation. set.seed() is here to ensure the results of this blogpost are reproducible.

# Run nsim number of times world cup tournament
set.seed(000)
result <- simulate_tournament(nsim = nsim, play_fun = "play_fun_simplest") 
result2 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_skellam")
result3 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_elo")
result4 <- simulate_tournament(nsim = nsim, play_fun = "play_fun_double_poisson", train_data = wcmatches_train)

Get winner list

get_winner() reports a winner list showing who has the highest probability. Apart from the random Poisson model, Brazil is clearly the winner in the three other models. The top two teams are Brazil and Germany. With different seeds, the third and fourth places (in darker blue) in my model are more likely to change; the variance might be an interesting point to look at.

get_winner(result) %>% plot_winner()

get_winner(result2) %>% plot_winner()

get_winner(result3) %>% plot_winner()

get_winner(result4) %>% plot_winner()

Who will be top scoring team?

The Skellam model seems more reliable; my double Poisson model gives a systematically lower scoring frequency than the probable actuals. They both favour Brazil, though.

get_top_scorer(nsim = nsim, result_data = result2) %>% plot_top_scorer()

get_top_scorer(nsim = nsim, result_data = result4) %>% plot_top_scorer()

Conclusion

The framework is pretty clear: all you need to do is customise the play_game function via play_fun, such as play_fun_simplest, play_fun_skellam or play_fun_elo.

Tick-tock… Don’t hesitate to send a pull request to ekstroem/socceR2018 on GitHub. Who is winning the guess-who-wins-worldcup2018 game?

If you like this post, please leave your star, fork, issue or banana on the GitHub repository for the post, which includes all the code (https://github.com/MangoTheCat/blog_worldcup2018). The analysis couldn’t have been done without help from Rich, Doug, Adnan and all the others who kindly shared ideas. I have passed your knowledge on to the algorithm.

Notes

  1. Data collection. I didn’t get to feed the models with the most up-to-date betting odds and ELO ratings in the team_data dataset. If you would like to, they are available from three sources. The FIFA rating is the easiest and can be scraped by rvest in the usual way. The ELO ratings and betting odds tables seem to be rendered by JavaScript and I haven’t found a working solution. For betting information, Betfair, an online betting exchange, has an API and an R package, abettor, which helps to pull those odds – definitely interesting for anyone who is after strategy beyond prediction.
  2. Model enhancement. This is probably where it matters most. For example, previous research has suggested various bivariate Poisson models for football predictions.
  3. Feature engineering. Economic factors such as national GDP, market information like total player value or insurance value, and player injury data may be useful to improve accuracy.
  4. Model evaluation. One way to understand whether our model has good predictive capability is to evaluate the predictions against actual outcomes after 15 July 2018. Current odds from bookies can also be referred to. It is not impossible to run the whole thing on historical data, e.g. 2014, and perform model selection and tuning.
  5. Functions and the package could be better parameterised; code to be tidied up.

Author: Ava Yang, Data Scientist at Mango Solutions