
Can you tell us about your upcoming keynote at EARL and what the key take-home messages will be for delegates?

I’m going to talk about functional programming, which I think is one of the most important programming techniques used with R. It’s not something you need on day 1 as a data scientist, but it gives you some really powerful tools for repeating the same action again and again with code. It takes a little while to get your head around, but recently, because I’ve been working on the second edition of Advanced R, I’ve prepared some diagrams that make it easier to understand. So the take-home message will be to use more functional programming, because it will make your life easier!
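As a minimal sketch of the idea (this example is ours, not from the talk, and uses the purrr package from the tidyverse), "repeating the same action again and again" means passing a function to a function instead of writing a loop:

```r
library(purrr)

# Apply the same action to every element without writing a loop:
# one mean per column of mtcars, returned as a named numeric vector.
col_means <- map_dbl(mtcars, mean)

# The same idea scales up to bigger actions: fit one model per group.
models <- map(split(mtcars, mtcars$cyl),
              function(df) lm(mpg ~ wt, data = df))
```

Base R's `lapply()` and `vapply()` express the same pattern; purrr adds type-stable variants like `map_dbl()` that guarantee the shape of the output.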

Writer, developer, analyst, educator or speaker – what do you enjoy most? Why?

One of the two things that motivate me most in the work that I do is the intellectual joy of understanding how you can take a big problem and break it up into small pieces that can be combined together in different ways – for example, the algebras and grammars of tools like dplyr and ggplot2. I’ve been working a lot in Advanced R to understand how the bits and pieces of R fit together, which I find really enjoyable. The other thing that I really enjoy is hearing from people who have done cool stuff with the things that I’ve worked on – stuff that has made their lives easier – whether that’s on Twitter or in person. Those are the two things that I really enjoy, and pretty much everything else comes out of that. Educating, for example, is just helping other people understand how I’ve broken down a problem, and sharing it in ways that they can understand too.

What is your preferred industry or sector for data, analysis and applying the tools that you have developed?

I don’t do much data analysis myself anymore so when I do it, it’s normally data related to me in some way or, for example, data on RStudio packages. I do enjoy challenges like figuring out how to get data from a web API and turning it into something useful but the domain for my analysis is very broadly on data science topics.

When developing what are your go-to resources?

I still use StackOverflow quite a bit and Google in general. I also do quite a bit of comparative reading to understand different programming languages, seeing what’s going on across different languages, the techniques being used and learning about the evolving practices in other languages which is very helpful.

Is there anything that has surprised you about how any of the tools you’ve created has been used by others?

I used to be, but I’ve lost my capacity for surprise now, just because the diversity of uses is crazy. I guess the most notable cases now are when someone uses one of my tools to commit academic fraud – there have been some well-publicised examples. Otherwise, people are using R and data to understand pretty much every aspect of the world, which is really neat.

What are the biggest changes that you see between data science now and when you started?

I think the biggest difference is that there’s a term for it – data science. I think it’s been useful to have that term, rather than just Applied Statistician or Data Analyst, because data science is becoming different from what those roles have traditionally been. It’s different from data analysis because data science uses programming heavily, and it’s different from statistics because there’s a much greater emphasis on correct data import and data engineering, and the goal may be to eventually turn the data analysis into a product, web app or something other than a standard report.

Where do you currently perceive the biggest bottlenecks in data science to be?

I think there are still a lot of bottlenecks in getting high-quality data and that’s what most people currently struggle with. I think another bottleneck is how to help people learn about all the great tools that are available, understand what their options are, where all the tools are and what they should be learning. I think there are still plenty of smaller things to improve with data manipulation, data visualization and tidying but by and large it feels to me like all the big pieces are there. Now it’s more about getting everything polished and working together really well. But still, getting data to a place to even start an analysis can be really frustrating so a major bottleneck is the whole pipeline that occurs before arriving in R.

What topic would you like to be presenting on in a data science conference a year from now?

I think one thing I’m going to be talking more about next year is this vctrs package that I’ve been working on. The package provides tools for handling object types in R and managing the types of inputs and outputs that a function expects and produces. My motivation for this is partly because there are a lot of inconsistencies in the tidyverse and base R that vctrs aims to fix. I think of this as part of my mental model because when I read R code, there’s a simplified R interpreter in my head which mainly focuses on the types of objects and predicts whether some code is going to work at all or if it’s going to fail. So part of the motivation behind this package is me thinking about how to get stuff out of my head and into the heads of other people so they can write well-functioning and predictable R code.
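To make the kind of inconsistency concrete (our own illustration; vctrs was still young at the time of this interview, so treat the API as a sketch), compare base R's silent coercion with vctrs' explicit common-type rules:

```r
# Base R's coercion rules are implicit and sometimes surprising:
mixed <- c(FALSE, "yes")   # the logical is silently turned into a character
mixed
#> [1] "FALSE" "yes"

# vctrs makes the common type explicit and refuses unsafe combinations:
library(vctrs)
vec_c(TRUE, 1L)        # logical and integer share a common type: integer
# vec_c(FALSE, "yes")  # errors -- no common type for <logical> and <character>
```

That stricter "simplified interpreter" of types is exactly the mental model the package aims to externalise.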

What do you hope to see out of RStudio in the next 5 years?

Generally, I want RStudio to be continually strengthening the connections between R and other programming languages. There’s an RStudio version 1.2 coming out which has a bunch of features to make it easy to use SQL, Stan and Python from RStudio. Also, the collaborative work we do with Wes McKinney and Ursa Labs – I think we’re just going to see more and more of that because data scientists are working in bigger and bigger teams on a fundamentally collaborative activity so making it as easy as possible to get data in and out of R is a big win for everyone.

I’m also excited to see the work that Max Kuhn has been doing on tidy modelling. I think the idea is really appealing because it gives modelling an API that is very similar to the tidyverse. But I think the thing that’s really neat about this work is that it takes inspiration from dplyr to separate the expression of a model from its computation, so you can express the model once and fit it in R, Spark, TensorFlow, Stan or whatever. The R language is really well suited to exploit this type of tool, where the computation is easily described in R and executed somewhere more suitable for high performance.



Dr. Gentleman’s work at 23andMe focuses on the exploration of how human genetic and trait data in the 23andMe database can be used to identify new therapies for disease. Dr. Gentleman is also recognised as one of the originators of the R programming language and has been awarded the Benjamin Franklin Award, a recognition for Open Access in the Life Sciences presented by the Bioinformatics Organisation. His keynote will focus on the History of R and some thoughts on data science.

Dr Robert Gentleman, it is an honour to have you as a keynote speaker at our EARL conference in Houston. We are intrigued to hear more about your career to date and how your work around open access to scientific data has helped shape valuable research worldwide…

Amongst your significant achievements to date has been the development of the R programming language, alongside fellow statistician Ross Ihaka, at the University of Auckland in the mid-1990s. What prompted you to develop a new statistical programming language?

Ross and I had a lot of familiarity with S (from Bell Labs) and at that time (my recollection is) there were a lot more languages for Statistics around.  We were interested in how languages were used for data analysis. Both Ross and I had some experience with Lisp and Scheme and at that time some of the work in computer science was showing how one could easily write interpreters for different types of languages, largely based on simple Scheme prototypes. We liked a lot about S, but there were a few places where we thought that different design decisions might provide improved functionality. So we wrote a simple Scheme interpreter and then gradually modified it into the core of the R language. As we went forward we added all sorts of different capabilities and found a large number of great collaborators.

As we made some progress we found that there were others around the world who were also interested in developing a system like R. And luckily, at just about that time, the internet became reliable and tools evolved that really helped support a distributed software development process. That group of collaborators became R Core and then later formed the nucleus for the R Foundation.

Probably the most important development was CRAN and some of the important tools that were developed to support the widespread creation of packages. Really a very large part of the success of R is due to the ability of any scientist to write a package containing code to carry out an analysis and to share that.

In 2008 you were awarded the Benjamin Franklin Award for your contribution to open access research. What areas of your research contributed towards being awarded this prestigious accolade?

I believe that my work on R was important, but perhaps more important for that award was the creation of the Bioconductor Project, together with a number of really great colleagues. Our paper in Genome Biology describes both those involved and what we did.

In your opinion how has the application of open source R for big data analysis, predictive modelling, data science and visualisation evolved since its inception?

In too many ways for me to do a good job of really describing. As I said above, there is now a vast number of packages on CRAN (and in the Bioconductor package repository), so useRs can easily get a package to try out just about any idea. Those packages are not always well written, or well supported, but they provide a simple, fast way to try ideas out. And mostly the packages are of high quality, and the developers are often very happy to discuss ideas with users. The community aspect is important. And R has become increasingly performant.

Your work today involves the fascinating work of combining bioinformatics and computational drug discovery at 23andMe. What led to your transition to drug discovery or was it a natural progression?

I worked in two cancer centers: the Dana-Farber in Boston and the Fred Hutchinson in Seattle. When I was on the faculty at Harvard, and then at the Fred Hutchinson, I was developing a computational biology department. The science is fantastic at both institutions, and I learned a lot about cancer and how we could begin to use computational methods to explore and understand some of the computational molecular biology that is important.

But I also became convinced that making new drugs was something that would happen in a drug company. I wanted to see how computational methods could help lead to better and faster target discovery. When I was approached by Genentech it seemed like a great opportunity – and it was. Genentech is a fantastic company; I spent almost six years there and learned a huge amount.

As things progressed, I became convinced that using human genetics to do drug discovery was likely to be more effective than any other strategy that I was aware of. And when the opportunity came to join 23andMe I took it. And 23andMe is also a great company. We are in the early stages of drug discovery still, but I am very excited about the progress we are making and the team I am working with.

How is data science being used to improve health and accelerate the discovery of therapies?

If we are using a very broad definition of data science (by training I am a statistician, and I still think that careful hypothesis-driven research is essential to much discovery) – it is essential. Better models, more data and careful analysis have yielded many breakthroughs.

Perhaps a different question is ‘where are the problems’? And for me, the biggest problem is that I am not sure there is as much appreciation of the impact of bias as there should be. Big data is great – but it really only addresses the variance problem. Bias is different: it is harder to discover, and its effects can be substantial. Put another way: just how generalizable are the results?

In addition to the development of the R programming language, what have been your proudest career achievements that you’d like to share?

The Bioconductor Project, working with my graduate students and post-docs and pretty much anytime I gave someone good advice.

Can you tell us about what to expect from your keynote talk and what might be the key take-home messages for our EARL delegates?

I hope an appreciation of why it is important to get involved in developing software systems and tools. And I hope some things to think about when approaching large-scale data analysis projects.

Inspired by the work of Dr Robert Gentleman, what questions would you like to ask? Tickets to EARL Houston are still available. Find out more and get tickets here.





Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

Lovely to speak to you Julia, could you give us a bit of a background around the work that you do? 

The open source work I do focuses on building a bridge between the tidyverse ecosystem of tools and the real world text data that so many of us need to use in our organizations, so we can use powerful, well-designed tidy tools with text data. In my day job, I work at Stack Overflow, using statistics and machine learning to make our site the best place for people who code to learn and share knowledge online, and to help our clients who want to engage with developers be successful.

What led to your career path?

My academic background is in physics and astronomy, where I was an observational astronomer who spent my time “in the trenches” with real-life data. Also, I’ve been heavily involved in education in various forms for a long time, whether speaking, teaching, writing, or otherwise. All of this together informs how I do data science, because a huge part of what I do is communicate with people about what a complex data analysis means. The fact that I analyze some dataset or train some machine learning model is great, but if I can’t explain it to my business partners, then we can’t make decisions.

Could you tell us what to expect from the content of your talk? And are there any key takeaway advice or tips that delegates will come away with?

Many R users working in fields from healthcare to finance to tech deal with messy text data (this includes me at Stack Overflow!); my talk focuses on a practical, flexible approach to use this text data to gain insight and make better decisions.

Can you give an example?

Folks at EARL can expect my talk to start with the fundamentals of exploratory data analysis for text. EDA is a fruitful and important part of the data science process, and in my own work, I know how much bang for the buck I get when I am deliberate about EDA strategies. We won’t stop there, though! We will also cover how to use tidy data principles for supervised and unsupervised machine learning for text.
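The one-token-per-row workflow that underpins this approach looks roughly like the following (a minimal sketch using the tidytext package; the toy sentences are invented for illustration):

```r
library(dplyr)
library(tidytext)

# A tiny corpus standing in for real-world messy text.
docs <- tibble(doc = 1:2,
               text = c("Text mining with tidy tools is a joy",
                        "tidy tools make text mining practical"))

# One token per row, stop words removed, then a simple word count --
# a typical starting point for exploratory data analysis of text.
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
```

Because the result is an ordinary tidy data frame, the whole dplyr/ggplot2 toolbox applies to it directly.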

What inspired you to write your book Text Mining with R – A Tidy Approach?

The book that my collaborator Dave and I wrote together grew organically out of the work we were doing in this space. We started by developing long-form documentation for our R package, invested more time in laying out best practices in workflows through blog posts, and eventually brought a book’s worth of content together in one cohesive, organized place.

Tell us about the type of work you get involved with on a day to day basis.

In my day job at Stack Overflow, I work on two main categories of questions. The first is centered on the ways that we directly generate revenue, through partnering with clients who want to hire, engage with, and enable the world’s developers. The second (which is of course connected to the first) is centered on the public Q&A community of Stack Overflow and the other Stack Exchange sites; I work on questions around how technologies are related to each other and changing, how to scaffold question askers to success, and how to make Stack Overflow more welcoming and inclusive.

What work do you do with the wider data science community and how do you see it evolving?

In my open source work, I maintain my own R packages, blog and speak about data analysis practices and share resources about data science and tech via social media. I have some ideas for new work I am excited about pursuing soon! I would love to evolve my data science work to more fully support best practices in machine learning for text. Another area that I want to continue to invest energy in, both in my day job and community work, is moving data science and tech toward more just and inclusive practices.

Come and see Julia and be inspired by her love of text mining and tidyverse applications at EARL Seattle on 7th November. We are really looking forward to the conference programme in Seattle, Houston and Boston.

Tickets can still be purchased here.


Two weeks ago we held our most successful EARL London conference in its 5-year history, and I had the pleasure of attending both days of talks. Now I must admit, as a Python user, I did feel a little bit like I was being dragged along to an event where everyone would be talking about the latest R packages for customising RMarkdown and Shiny applications (… and there was a little bit of that – I’m pretty sure I heard someone joke that it should be called the Shiny conference).

However, I was pleasantly surprised to find a diverse forum of passionate and inspiring data scientists from a wide range of specialisations (and countries!), each with unique personal insights to share. Although the conference was R focused, the concepts that were discussed are universally applicable across the Data Science profession, and I learned a great deal from attending these talks. If you weren’t fortunate enough to attend, or would like a refresher, here are my top 5 takeaways from the conference (you can find the slides for all the talks here – click on the speaker image):

1. Business decisions should lead Data Science

Steven Wilkins, Edwina Dunn, Rich Pugh

For data to have a positive impact within an organisation, data science projects need to be defined according to the challenges impacting the business and those important decisions that the business needs to make. There’s no use building a model to describe past behaviour or predict future sales if this can’t be translated into action. I’ve heard this from Rich a thousand times since I’ve been at Mango Solutions, but hearing Steven Wilkins describe how this allowed Hiscox to successfully deliver business value from analytics really drove the point home for me. Similarly, Edwina Dunn demonstrated that those organisations which take the world by storm (e.g. Netflix, Amazon, Uber and AirBnB) are those which first and foremost are able to identify customer needs and then use data to meet those needs.

2. Communication drives change within organisations

Rich Pugh, Edwina Dunn, Leanne Fitzpatrick, Steven Wilkins

However, even the best run analytics projects won’t have any impact if the organisation does not value the insights they deliver. People are at the heart of the business, and organisations need to undergo a cultural shift if they want data to drive their decision making. An organisation can only become truly data-driven if all of its members can see the value of making decisions based on data and not intuition. Obviously, an important part of data science is the ability to communicate insights to external stakeholders, by means of storytelling and visualisations. However, even within an organisation, communication is just as important to instil this much needed cultural change.

3. Setting up frameworks streamlines productivity

Leanne Fitzpatrick, Steven Wilkins, Garrett Grolemund, Scott Finnie & Nick Forrester, George Cushen

Taking the time to set up frameworks ensures that company vision can be translated into day to day productivity. In reference to point 1, setting up a framework for prototyping of data science projects allows rapid evaluation of their potential impact to the business. Similarly, a consistent framework should be applied to communication within organisations, such as establishing how to educate the business to promote cultural change, or in the form of documentation and code reviews for developers.

On the technical side, pre-defined frameworks should also be used to bridge the gap between modelling and deployment. Leanne Fitzpatrick’s presentation demonstrated how the use of Docker images, YAML, project templates and engineer-defined test frameworks minimises unnecessary back and forth between data scientists and data engineers and therefore can streamline productivity. To enable this, however, it is important to teach modellers the importance of keeping production in mind during development, and to teach model requirements to data engineers, which hugely improved collaboration at Hymans according to Scott Finnie & Nick Forrester.

In the same vein, I was really intrigued by the flexibility of RMarkdown for creating re-usable templates. Garrett Grolemund from RStudio mentioned that we are currently experiencing a reproducibility crisis, in which the validity of scientific studies is called into question by the fact that most of their results are not reproducible. Using a tool such as RMarkdown to publish the code used in statistical studies makes sharing and reviewing code much simpler, and minimises the risk of oversight. Similarly, RMarkdown seems to be a valuable tool for documentation, and can even become a simple way of creating project websites when combined with R packages such as George Cushen’s Kickstart-R.
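To illustrate why this helps reproducibility (a generic skeleton of ours, not one of the conference templates): an RMarkdown document is a plain-text file that interleaves prose with executable R chunks, so the published numbers are recomputed from the code every time the document is knitted and can never drift out of sync with it.

````markdown
---
title: "A reproducible analysis"
output: html_document
---

The summary below is recomputed from the raw data on every knit,
so the reported figures always match the code that produced them.

```{r}
summary(mtcars$mpg)
```
````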

4. Interpretability beats complexity (sometimes)

Kasia Kulma, Wojtek Kostelecki, Jeremy Horne, Jo-fai Chow

Stakeholders might not always be willing to trust models, and might prefer to fall back on their own experience. Therefore, being able to clearly interpret modelling results is essential to engage people and drive decision-making. One way of addressing this concern is to use simple models such as linear regression or logistic regression for time-series econometrics and market attribution, as demonstrated by Wojtek Kostelecki. The advantage of these is that we can assess the individual contribution of variables to the model, and therefore clearly quantify their impact on the business.
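As a minimal illustration of that interpretability (using a built-in dataset rather than a real econometrics or attribution problem):

```r
# Each coefficient of a linear model quantifies one variable's
# contribution while the others are held fixed, so the "why" behind
# a prediction can be read straight off the fitted model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)
# The wt coefficient is the expected change in mpg per additional
# 1000 lbs of weight, holding horsepower constant.
```

That per-variable decomposition is precisely what lets analysts quantify each driver's impact on the business.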

However, there are some cases where a more sophisticated model should be favoured over a simple one. Jeremy Horne’s example of customer segmentation proved that we aren’t always able to implement geo-demographic rules to help identify which customers are likely to engage with the business. “This is the reason why we use sophisticated machine learning models”, since they are better able to distinguish between different people from the same socio-demographic group, for example. This links back to Edwina Dunn’s mention of how customers should no longer be categorised by their profession or geo-demographics, but by their passions and interests.

Nevertheless, ‘trusting the model’ is a double-edged sword, and there are some serious ethical issues to consider, especially when dealing with sensitive personal information. I’m also pretty sure I heard the word ‘GDPR’ mentioned at every talk I attended. But fear not, here comes LIME to the rescue! Kasia Kulma explained how Local Interpretable Model-Agnostic Explanations (say that 5 times fast) allow modellers to sanity-check their models by giving interpretable explanations as to why a model predicted a certain result. By extension, this can help prevent bias and discrimination, and help avoid exploitative marketing.

5. R and Python can learn from each other

David Smith (during the panellist debate)

Now comes the fiery debate. Python or R? Call me controversial but, how about both? This was one of the more intriguing concepts that I heard, which came as the result of a question during the engaging panellist debate about the R and data science community. What this conference has demonstrated to me is that R is undergoing a massive transformation from being the simple statistical tool it once was, to a fully-fledged programming language which even has tools for production! Not only this, but it has the advantage of being a domain-specific language, which results in a very tight-knit community – which seemed to be the general consensus amongst the panel.

However, there are still a few things R can learn from Python, namely its vast array of tools for transitioning from modelling to deployment. It does seem like R is making steady progress in this regard, with tools such as Plumber to create REST APIs, Shiny Server for serving Shiny web apps online and RStudio Connect to tie these all together with RMarkdown and dashboards. Similarly, machine learning frameworks and cloud services which were more Python focused are now available in R. Keras, for example, provides a nice way to use TensorFlow from R, and there are many R packages available for deploying those models to production servers, as mentioned by Andrie de Vries.
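To give a flavour of how small the modelling-to-deployment gap has become (the file name and endpoint here are hypothetical; the `#*` annotations are Plumber's documented syntax):

```r
# plumber.R -- Plumber turns an annotated R function into a REST endpoint.

#* Predict fuel efficiency from weight
#* @param wt Car weight, in 1000s of lbs
#* @get /predict
predict_mpg <- function(wt) {
  # A toy model fitted on a built-in dataset, standing in for a real one.
  fit <- lm(mpg ~ wt, data = mtcars)
  predict(fit, newdata = data.frame(wt = as.numeric(wt)))
}
```

Running `plumber::plumb("plumber.R")$run(port = 8000)` then serves the model, so `GET /predict?wt=3` returns a prediction as JSON.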

Conversely, Python could learn from R in its approach to data analysis. David Smith remarked that there is a tendency within the Python world to have a model-centric approach to data science. This is also something that I have personally noticed. Whereas R is historically embedded in statistics, and therefore brings many tools for exploratory data analysis, this seems to take a back seat in the Python world. This tendency is exacerbated by popular Python machine learning frameworks such as scikit-learn and TensorFlow, which seem to encourage throwing whole datasets into the model and expecting the algorithm to select significant features for us. Python could learn from R tools such as ggplot2, Shiny and the tidyverse, which make it easier to interactively explore datasets.
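The kind of exploratory step that R makes routine is a couple of lines of ggplot2 (a generic sketch with a built-in dataset):

```r
library(ggplot2)

# Explore a relationship and a grouping before committing to any model:
# points coloured by cylinder count, with a per-group linear trend.
p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p  # printing the object renders the plot
```

Ten seconds of plotting like this often reveals structure (or problems) that a model-first workflow would only surface much later.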

Another part of the conference I really enjoyed were the lightning talks, which proved how challenging it can be to effectively pitch an idea within a single 10 minute presentation! As a result here are my…

Lightning takeaways!

  • “Companies should focus on what data they need, not the data they have.” (Edwina Dunn – Starcount)
  • “Don’t give in to the hype” (Andrie de Vries – RStudio)
  • “Trust the model” (Jeremy Horne – MC&C Media)
  • “h2o + Spark = hot” (Paul Swiontkowski – Microsoft)
  • “Shiny dashboards are cool” (Literally everyone at EARL)

I’m sorry to all the speakers who I haven’t mentioned. I heard great things about all the talks, but this is all I could attend!

Finally, my personal highlight of the conference was the unlimited free drinks – er I mean, getting the opportunity to talk to so many knowledgeable and approachable people from such a wide range of fields! It really was a pleasure meeting and learning from all of you.

If you enjoyed this post, be sure to join us at LondonR at Ball’s Brothers on Tuesday 25th September, where other Mangoes will share their experience of the conference, in addition to the usual workshops, talks and networking drinks.

If you live in the US, or happen to be visiting this November, then come join us at one of our EARL 2018 US Roadshow events: EARL Seattle (WA) on 7th November, EARL Houston (TX) on 9th November, and EARL Boston (MA) on 13th November. Our highlights of the EARL Conference London will be online soon.



With EARL just next week, we have just one more speaker interview to share!

In today’s interview, Ruth Thomson, Practice Lead for Strategic Advice spoke to Jasmine Pengelly, whose career includes teaching Data Analysis and Data Science at General Assembly and permanent positions as a Data Analyst at Stack Overflow and DAZN.

Jasmine will be presenting a lightning talk “Putting the R in Bar” where she will show how businesses can make data-driven decisions using the example of a Cocktail Bar.

Thanks Jasmine for taking the time for this interview. Where did the idea for this project come from?

The idea came to me organically. My fiance owns a cocktail bar and it was clear to me how they could improve their business using advanced analytics even with limited technical expertise.

I started asking, what insight would be valuable to the decision makers in that business?

So where did you start?

I identified two datasets to work with: customer reviews, which were spread over four separate websites, and cocktail sales information.

The cocktail sales information led me to consider the choices on the menu. The decision of which cocktails to put on the menu had previously been made successfully using intuition, but there had been no data-driven decisions up until that point.

My approach was to use exploratory data analysis to build the best menu. I also started experimenting with regression models and I’ll be touching on my findings in this area in my talk.

For the other areas, I used text mining and natural language processing and I’m looking forward to sharing more detail about these two use cases at EARL soon.

What other businesses do you think would benefit from these examples?

The beauty of predictive analytics is that any business that provides a service to customers would benefit from using insight to make better decisions. It’s even more important for service-based businesses who also benefit from word of mouth marketing and referrals.

For many small and medium-sized businesses, analytics could be seen as difficult and complex. However, it doesn’t need to be.

We hope you’ve enjoyed our series of speaker interviews leading up to EARL London, we can’t wait to hear the talks in full.

There’s still time to get tickets.



For today’s interview, Ruth Thomson, Practice Lead for Strategic Advice spoke to Catherine Gamble, Data Scientist at Marks and Spencer.

Catherine is presenting “Using R to Drive Revenue for your Online Business” at EARL London and we got the chance to get a preview of the use case she’ll be presenting.

Thanks Catherine for this interview. What was the business need or opportunity that led to this project?

As an online retailer, we know that the actions we take, for example, any changes we make to our website, have an impact on our financial results. However, when multiple changes are being made or campaigns are being run at the same time, it can be hard to separate which action led to the desired result.

From a strategy and planning perspective, we knew it would be valuable to be able to predict the direct impact of any actions we took, before we made them.

How did you go about solving this problem?

I developed a predictive model to explore the relationships between actions and results, which allowed me to identify which actions would have an impact on our KPIs.

What value did your project deliver?

We now have clear insight which is fed into our strategic decision making. As a result, we have had a positive impact on our KPIs and there has been a positive financial impact.

What would you say were the elements that made your project a success?

Support from the Team – one of the key drivers of success in this project was the time I was given to explore different techniques and models and to learn.

Curiosity – this project came about because I was curious about the patterns in the data and wanted to explore some questions around things we were seeing.

What other businesses do you think would benefit from this use case?

Any online retailer that runs multiple sales, marketing and development events and campaigns at the same time.

It would also be useful for businesses that have a sales funnel and want to explore how the actions they take impact their results.

To hear Catherine’s full talk and others like it – join us at EARL London this September!


Our next interviewee is Patrik Punco, Marketing Analyst at German media company NOZ Medien. Patrik is presenting a lightning talk, ‘Subscription Analytics with focus on Churn Pattern Recognition in a German News Company’, at EARL London.

Ruth Thomson, Mango’s Practice Lead for Strategic Advice chatted to Patrik about the business need for his project, what value it created for the business and any learnings and recommendations he had for other businesses interested in using predictive analytics and machine learning to reduce churn.

Firstly and most importantly, thanks Patrik for this interview. We’d love to know what was the business need or opportunity that prompted your project?

There is a structural change happening in media companies in Germany and also globally.

There is a decline in sales of print products with an increase in digital products. As with many media companies, our print products still provide significant income and we want to reduce the decline of print customer numbers with churn prevention strategies.

If we can identify which customers are likely to churn and why, we can put in place targeted customer loyalty activities to both improve customer experience and reduce churn. We saw an opportunity to use predictive modelling and machine learning to achieve these goals.

What were the key elements of your project?

Most importantly, we were mindful of our customers’ privacy and made sure customers were fully informed about how their data would be used, with the appropriate permissions in place.

We started with a strong understanding of our business and our current churn reduction strategies. This understanding informed the 115 variables we identified as important in the customer lifecycle. We combined data from our SAP system with data from external sources such as delivery data. Next we tested different models to find the one which delivered the best results.

As a result, we were able to choose the 1% of customers most likely to churn to include in our customer loyalty activities.

We ran A/B tests to measure the impact of our work.
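Patrik doesn’t spell out how the A/B test results were analysed, but a minimal base R sketch of quantifying a churn reduction might look like the following. The group sizes and churn counts here are purely illustrative, not NOZ Medien’s figures.

```r
# Illustrative counts only: customers who churned in each group
churned <- c(control = 300, treatment = 240)
total   <- c(control = 5000, treatment = 5000)

# Two-sample test of equal churn proportions (base R stats)
res <- prop.test(x = churned, n = total)

churn_rate <- churned / total
churn_rate    # control 6.0% vs treatment 4.8%
res$p.value   # a small p-value suggests the reduction is unlikely to be chance
```

Scaling the estimated reduction in churn rate by the average customer lifetime value is one way to translate such a test into the financial impact Patrik describes.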

What was the impact? How did you measure value?

Using our A/B tests, we were able to quantify the reduction in churn and the reduction has been significant and financially valuable to the company.

Overall the project has been a success so much so that we are extending and building on the work.

What would you say were the critical elements that made the project a success?

The R ecosystem helped us not only to implement predictive modelling but also to work much faster and more efficiently. For example, using the data.table and Rcpp packages reduced the runtime of aggregating customer tenures from over 30 minutes to less than 1 second.
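Patrik’s actual aggregation code isn’t shown, but the data.table grouping idiom he is referring to might be sketched as below. The subscription data here is hypothetical; the point is that data.table’s grouped aggregation scales to millions of rows where an equivalent loop-based approach would be very slow.

```r
library(data.table)

# Hypothetical subscription history: one row per customer per billing period
subs <- data.table(
  customer_id = c(1, 1, 1, 2, 2, 3),
  start_date  = as.IDate(c("2017-01-01", "2017-02-01", "2017-03-01",
                           "2017-01-15", "2017-02-15", "2017-03-10")),
  end_date    = as.IDate(c("2017-01-31", "2017-02-28", "2017-03-31",
                           "2017-02-14", "2017-03-14", "2017-04-09"))
)

# Grouped aggregation: tenure in days per customer, computed in one pass
tenure <- subs[, .(tenure_days = as.integer(max(end_date) - min(start_date))),
               by = customer_id]
```

Hot spots that can’t be expressed as a vectorised aggregation are where Rcpp comes in, by moving the inner loop into compiled C++.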

Our Data Mining Methodology: We had a complex set of data to prepare for this project, it wasn’t simple or easy. I think the methodology we used was critical to the success of this project. We focused on maximisation of churn lift values that we obtain from the model compared to the overall performance of the print segment. On this basis, we were able to build the A/B testing strategies.

Understanding the Business: A key element of success is the understanding of our business. What type of business are we? Who are our customers? What are the policies that need to be factored in? Without that there is the risk that the project would not have been structured appropriately and not deliver the expected value. An example of what I mean is the exact understanding of the churn policy followed by the company which had a direct impact on definition and encoding of the response variable.

What other businesses could benefit from this type of use case?

Subscription businesses in all sectors would benefit from using advanced analytics to reduce churn. But the applicability of this use case is even wider than that, any business that has a churn reduction strategy could make it more effective with advanced analytics.

Thank you Patrik!

To hear Patrik’s talk in September and others like his, get your EARL London tickets now – early bird ticket sales end 31 July.



For today’s interview, Ruth Thomson, Practice Lead for Strategic Advice spoke to Willem Ligtenberg, Data Scientist at CZ, a health insurance company in the Netherlands.

Willem is presenting “Developing a Shiny application to predict work load” at EARL London and we got the chance to get a preview of some of the areas he will be covering.

Thanks Willem for taking the time for this interview. What was the business need or opportunity that led to this project?

We are a healthcare insurance company and in one specific department, it often took longer than 3 working days to process a claim.

We knew that if we could process a claim within 3 working days, customer satisfaction would increase, regardless of the outcome.

So, we wanted to improve customer satisfaction by processing claims quickly. However, this department often had a backlog of claims as it was hard to predict how many claims would be received and therefore the number of staff members needed to process those claims.

The processing of these specific claims is difficult to automate because the submitted documents are not standardised, so an individual is required to review each one.

How did you go about solving this problem?

The most important first step was understanding what we wanted to predict. It sounds simple but this detail was important. We realised we didn’t want to predict when a claim would arrive from the post but instead when the claim was ready for the department to process. This difference is very important.

Once we had clarified this important point, we had to prepare the data to ensure it was in the right format. We then tested different predictive models and chose Prophet, a forecasting model developed by Facebook, evaluating the candidates until we were happy with the results. To allow the business to generate its own forecasts, we created a Shiny app.
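CZ’s model itself isn’t shown, but a minimal Prophet fit in R follows a fixed pattern: a data frame with columns `ds` (date) and `y` (the value to forecast). The daily claim counts below are synthetic, for illustration only.

```r
library(prophet)

set.seed(42)
# Hypothetical daily counts of claims ready for the department to process
history <- data.frame(
  ds = seq(as.Date("2017-01-01"), as.Date("2018-06-30"), by = "day")
)
# Synthetic workload with a weekly cycle plus noise
history$y <- 100 + 20 * sin(2 * pi * as.numeric(history$ds) / 7) +
  rnorm(nrow(history), sd = 5)

m <- prophet(history)                          # fits trend + seasonality
future <- make_future_dataframe(m, periods = 14)
forecast <- predict(m, future)                 # yhat with uncertainty bounds
tail(forecast[, c("ds", "yhat", "yhat_lower", "yhat_upper")])
```

The `yhat_lower`/`yhat_upper` interval is useful for staffing decisions, since planning to the upper bound protects against a backlog on busy days.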

The result of our work has been that claims are now processed by the department within 1 working day and the department is able to maintain optimum staffing levels to process those claims.

What value did your project deliver?

The most important value has been in customer satisfaction. Customer satisfaction increases from a 7.5 to an 8 when claims are processed within 3 days. This may not sound like much but it is a significant increase in this context. As a business, we highly value our customer satisfaction.

There has also been a reduction in the need for employing short-term temporary staff which has reduced costs.

Interestingly, we have also found that, by processing claims within 1 day, productivity has increased. We think that there is something interesting in the psychology behind being able to complete all your work in one day which might lead to people going the extra mile. For me, increased productivity was an unexpected benefit.

What would you say were the elements that made your project a success?

Getting the right data – like many insurance companies, we have a lot of data. The critical thing is choosing the right data to use for this specific use case.

The right model – we spent time finding the right model for the project to get the best result.

The user interface – the Shiny App was the ideal user interface because it allows the user to interact with the results by, for example, changing the date range. We also made sure that the users could export the results for use in the existing planning tools to maximize the value from the results.
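The CZ app isn’t public, but the two features Willem highlights, a date-range control and an export for existing planning tools, can be sketched as a minimal Shiny app. The forecast data here is a hypothetical stand-in for the real model output.

```r
library(shiny)

# Hypothetical forecast data standing in for the real model output
forecast <- data.frame(
  date = seq(Sys.Date(), by = "day", length.out = 30),
  predicted_claims = round(runif(30, 80, 120))
)

ui <- fluidPage(
  dateRangeInput("range", "Forecast period",
                 start = min(forecast$date), end = max(forecast$date)),
  plotOutput("plot"),
  downloadButton("export", "Export for planning tools")
)

server <- function(input, output) {
  # Restrict the forecast to the user's chosen date range
  selected <- reactive(
    subset(forecast, date >= input$range[1] & date <= input$range[2])
  )
  output$plot <- renderPlot(
    plot(selected()$date, selected()$predicted_claims, type = "l",
         xlab = "Date", ylab = "Predicted claims")
  )
  # CSV export so results can feed the existing planning tools
  output$export <- downloadHandler(
    filename = function() "forecast.csv",
    content  = function(file) write.csv(selected(), file, row.names = FALSE)
  )
}

app <- shinyApp(ui, server)  # launch interactively with runApp(app)
```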

What other businesses do you think would benefit from this use case?

Any business which needs to predict or make forecasts! It will be of most value for decisions on staffing levels in bigger teams, say 30+ people, where there is variability or seasonality; the Prophet model is really good for that.

More EARL interviews

We will be sharing more EARL interviews with our speakers in the lead-up to 11 September. We look forward to sharing speaker insights and more of what we can all expect in September!

Early bird ticket sales will end 31st July – get yours now.


We’re delighted to announce RStudio’s Garrett Grolemund as one of our Keynote Speakers at this year’s EARL London.

He will join Starcount’s Edwina Dunn and a whole host of brilliant speakers for the 5th EARL London Conference on 11-13 September at The Tower Hotel.

Garrett specialises in teaching people how to use R, which is why you’ll see his name on some brilliant resources, including video courses on O’Reilly Media, his series of popular R cheat sheets distributed by RStudio, and the books R for Data Science and Hands-On Programming with R, which he co-authored. He also wrote the lubridate R package and works for RStudio as an advocate who trains engineers to do data science with R and the Tidyverse.

He earned his PhD in Statistics from Rice University in 2012 under the guidance of Hadley Wickham. Before that, he earned a Bachelor’s degree in Psychology from Harvard University and briefly attended law school before wising up.

Garrett is one of the foremost promoters of Shiny, R Markdown, and the Tidyverse, so we’re really looking forward to his keynote.

Don’t miss out on early bird tickets

Early bird tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November


We are excited to announce the speakers for this year’s EARL London Conference!

Every year, we receive an immense number of excellent abstracts and this year was no different – in fact, it’s getting harder to decide. We spent a lot of time deliberating and had to make some tough choices. We would like to thank everyone who submitted a talk – we appreciate the time taken to write and submit; if we could accept every talk, we would.

This year, we have a brilliant lineup, including speakers from Auto Trader, Marks and Spencer, Aviva, Google, Ministry of Defence and KPMG. Take a look below at our illustrious list of speakers:

Full length talks
Abigail Lebrecht, Abigail Lebrecht Consulting
Alex Lewis, Africa’s Voices Foundation
Alexis Iglauer, PartnerRe
Amanda Lee, Merkle Aquila
Andrie de Vries, RStudio
Catherine Leigh, Auto Trader
Catherine Gamble, Marks and Spencer
Chris Chapman, Google
Chris Billingham, N Brown PLC
Christian Moroy, Edge Health
Christoph Bodner, Austrian Post
Dan Erben, Dyson
David Smith, Microsoft
Douglas Ashton, Mango Solutions
Dzidas Martinaitis, Amazon Web Services
Emil Lykke Jensen, MediaLytic
Gavin Jackson, Screwfix
Ian Jacob, HCD Economics
James Lawrence, The Behavioural Insights Team
Jeremy Horne, MC&C Media
Jobst Löffler, Bayer Business Services GmbH
Jo-fai Chow,
Jonathan Ng, HSBC
Kasia Kulma, Aviva
Leanne Fitzpatrick, Hello Soda
Lydon Palmer, Investec
Matt Dray, Department for Education
Michael Maguire, Tusk Therapeutics
Omayma Said, WUZZUF
Paul Swiontkowski, Microsoft
Sam Tazzyman, Ministry of Justice
Scott Finnie, Hymans Robertson
Sean Lopp, RStudio
Sima Reichenbach, KPMG
Steffen Bank, Ekstra Bladet
Taisiya Merkulova, Photobox
Tim Paulden, ATASS Sports
Tomas Westlake, Ministry Of Defence
Victory Idowu, Aviva
Willem Ligtenberg, CZ

Lightning Talks
Agnes Salanki,
Andreas Wittmann, MAN Truck & Bus AG
Ansgar Wenzel, Qbiz UK
George Cushen, Shop Direct
Jasmine Pengelly, DAZN
Matthias Trampisch, Boehringer Ingelheim
Mike K Smith, Pfizer
Patrik Punco, NOZ Medien
Robin Penfold, Willis Towers Watson

Some numbers

We thought we would share some stats from this year’s submission process; the breakdown is based on a combination of titles, photos and pronouns.


We’re still putting the agenda together, so keep an eye out for that announcement!


Early bird tickets are available until 31 July 2018, so get yours now.