
Two weeks ago was our most successful EARL London conference in its 5-year history, which I had the pleasure of attending for both days of talks. Now I must admit, as a Python user, I did feel a little bit like I was being dragged along to an event where everyone would be talking about the latest R packages for customising RMarkdown and Shiny applications (… and there was a little bit of that – I’m pretty sure I heard someone joke that it should be called the Shiny conference).

However, I was pleasantly surprised to find a diverse forum of passionate and inspiring data scientists from a wide range of specialisations (and countries!), each with unique personal insights to share. Although the conference was R-focused, the concepts discussed are universally applicable across the Data Science profession, and I learned a great deal from attending these talks. If you weren’t fortunate enough to attend, or would just like a refresher, here are my top 5 takeaways from the conference (you can find the slides for all the talks here; click on a speaker’s image to access their slides):

1. Business decisions should lead Data Science

Steven Wilkins, Edwina Dunn, Rich Pugh

For data to have a positive impact within an organisation, data science projects need to be defined according to the challenges impacting the business and those important decisions that the business needs to make. There’s no use building a model to describe past behaviour or predict future sales if this can’t be translated into action. I’ve heard this from Rich a thousand times since I’ve been at Mango Solutions, but hearing Steven Wilkins describe how this allowed Hiscox to successfully deliver business value from analytics really drove the point home for me. Similarly, Edwina Dunn demonstrated that those organisations which take the world by storm (e.g. Netflix, Amazon, Uber and AirBnB) are those which first and foremost are able to identify customer needs and then use data to meet those needs.

2. Communication drives change within organisations

Rich Pugh, Edwina Dunn, Leanne Fitzpatrick, Steven Wilkins

However, even the best-run analytics projects won’t have any impact if the organisation does not value the insights they deliver. People are at the heart of the business, and organisations need to undergo a cultural shift if they want data to drive their decision making. An organisation can only become truly data-driven if all of its members can see the value of making decisions based on data rather than intuition. Obviously, an important part of data science is the ability to communicate insights to external stakeholders by means of storytelling and visualisations. However, even within an organisation, communication is just as important to instil this much-needed cultural change.

3. Setting up frameworks streamlines productivity

Leanne Fitzpatrick, Steven Wilkins, Garrett Grolemund, Scott Finnie & Nick Forrester, George Cushen

Taking the time to set up frameworks ensures that company vision can be translated into day-to-day productivity. In reference to point 1, setting up a framework for prototyping data science projects allows rapid evaluation of their potential impact on the business. Similarly, a consistent framework should be applied to communication within organisations, whether that means establishing how to educate the business to promote cultural change, or putting documentation and code reviews in place for developers.

On the technical side, pre-defined frameworks should also be used to bridge the gap between modelling and deployment. Leanne Fitzpatrick’s presentation demonstrated how the use of Docker images, YAML, project templates and engineer-defined test frameworks minimises unnecessary back and forth between data scientists and data engineers and therefore can streamline productivity. To enable this, however, it is important to teach modellers the importance of keeping production in mind during development, and to teach model requirements to data engineers, which hugely improved collaboration at Hymans according to Scott Finnie & Nick Forrester.

In the same vein, I was really intrigued by the flexibility of RMarkdown for creating re-usable templates. Garrett Grolemund from RStudio mentioned that we are currently experiencing a reproducibility crisis, in which the validity of scientific studies is called into question by the fact that most of their results cannot be reproduced. Using a tool such as RMarkdown to publish the code used in statistical studies makes sharing and reviewing that code much simpler, and minimises the risk of oversight. Similarly, RMarkdown seems to be a valuable tool for documentation and can even become a simple way of creating project websites when combined with R packages such as George Cushen’s Kickstart-R.

4. Interpretability beats complexity (sometimes)

Kasia Kulma, Wojtek Kostelecki, Jeremy Horne, Jo-fai Chow

Stakeholders might not always be willing to trust models, and might prefer to fall back on their own experience. Therefore, being able to clearly interpret modelling results is essential to engage people and drive decision-making. One way of addressing this concern is to use simple models such as linear regression or logistic regression for time-series econometrics and market attribution, as demonstrated by Wojtek Kostelecki. The advantage of these is that we can assess the individual contribution of variables to the model, and therefore clearly quantify their impact on the business.

However, there are some cases where a more sophisticated model should be favoured over a simple one. Jeremy Horne’s example of customer segmentation proved that we aren’t always able to implement geo-demographic rules to help identify which customers are likely to engage with the business. “This is the reason why we use sophisticated machine learning models”, since they are better able to distinguish between different people from the same socio-demographic group, for example. This links back to Edwina Dunn’s mention of how customers should no longer be categorised by their profession or geo-demographics, but by their passions and interests.

Nevertheless, ‘trusting the model’ is a double-edged sword, and there are some serious ethical issues to consider, especially when dealing with sensitive personal information. I’m also pretty sure I heard the word ‘GDPR’ mentioned at every talk I attended. But fear not, here comes LIME to the rescue! Kasia Kulma explained how Local Interpretable Model-Agnostic Explanations (say that 5 times fast) allow modellers to sanity check their models by giving interpretable explanations as to why a model predicted a certain result. By extension, this can help prevent bias and discrimination, and help avoid exploitative marketing.
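For a flavour of what that looks like in practice, here is a minimal sketch using the lime R package (the model and data below are purely illustrative, not taken from Kasia’s talk):

library(caret)
library(lime)

# fit any "black box" classifier, in this case a random forest via caret
model <- train(Species ~ ., data = iris, method = "rf")

# build an explainer from the training predictors and the fitted model
explainer <- lime(iris[, -5], model)

# explain a handful of predictions: which features drove each result?
explanation <- explain(iris[1:4, -5], explainer, n_labels = 1, n_features = 2)
plot_features(explanation)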

5. R and Python can learn from each other

David Smith (during the panellist debate)

Now comes the fiery debate. Python or R? Call me controversial, but how about both? This was one of the more intriguing concepts I heard, which came as the result of a question during the engaging panellist debate about the R and data science community. What this conference has demonstrated to me is that R is undergoing a massive transformation from the simple statistical tool it once was into a fully-fledged programming language which even has tools for production! Not only that, but it has the advantage of being a domain-specific language, which results in a very tight-knit community – a point on which the panel seemed to agree.

However, there are still a few things R can learn from Python, namely its vast array of tools for transitioning from modelling to deployment. R does seem to be making steady progress in this regard, with tools such as Plumber for creating REST APIs, Shiny Server for serving Shiny web apps online, and RStudio Connect to tie these all together with RMarkdown and dashboards. Similarly, machine learning frameworks and cloud services which were previously more Python-focused are now available in R. Keras, for example, provides a nice way to use TensorFlow from R, and there are many R packages available for deploying those models to production servers, as mentioned by Andrie de Vries.
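To give a flavour of the Plumber approach (a toy sketch of my own, not an example from the talks), a few special comments are enough to turn an R function into a REST endpoint:

# plumber.R: expose a simple model as a REST API
model <- lm(dist ~ speed, data = cars)

#* Predict stopping distance for a given speed
#* @param speed numeric speed value
#* @get /predict
function(speed) {
  predict(model, newdata = data.frame(speed = as.numeric(speed)))
}

Running plumber::plumb("plumber.R")$run(port = 8000) then serves predictions at /predict?speed=20.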

Conversely, Python could learn from R in its approach to data analysis. David Smith remarked that there is a tendency within the Python world to take a model-centric approach to data science. This is also something that I have noticed personally. Whereas R is historically embedded in statistics, and therefore brings many tools for exploratory data analysis, this seems to take a back seat in the Python world. The tendency is exacerbated by popular Python machine learning frameworks such as scikit-learn and TensorFlow, which seem to encourage throwing whole datasets into the model and expecting the algorithm to select significant features for us. Python needs to learn from R tools such as ggplot2, Shiny and the tidyverse, which make it easier to interactively explore datasets.

Another part of the conference I really enjoyed was the lightning talks, which proved how challenging it can be to pitch an idea effectively within a single 10-minute presentation! As a result, here are my…

Lightning takeaways!

  • “Companies should focus on what data they need, not the data they have.” (Edwina Dunn – Starcount)
  • “Don’t give in to the hype” (Andrie de Vries – RStudio)
  • “Trust the model” (Jeremy Horne – MC&C Media)
  • “h2o + Spark = hot” (Paul Swiontkowski – Microsoft)
  • “Shiny dashboards are cool” (Literally everyone at EARL)

I’m sorry to all the speakers who I haven’t mentioned. I heard great things about all the talks, but this is all I could attend!

Finally, my personal highlight of the conference was the unlimited free drinks – er I mean, getting the opportunity to talk to so many knowledgeable and approachable people from such a wide range of fields! It really was a pleasure meeting and learning from all of you.

If you enjoyed this post, be sure to join us at LondonR at Ball’s Brothers on Tuesday 25th September, where other Mangoes will share their experience of the conference, in addition to the usual workshops, talks and networking drinks.

If you live in the US, or happen to be visiting this November, then come join us at one of our EARL 2018 US Roadshow events: EARL Seattle (WA) on 7th November, EARL Houston (TX) on 9th November, and EARL Boston (MA) on 13th November. Our highlights from the EARL London Conference will be online soon.

 


Nowadays, whenever I do my work in R, there is a constant nagging voice in the back of my head telling me “you should do this in Python”. And when I do my work in Python, it’s telling me “you can do this faster in R”. So when the reticulate package came out I was overjoyed, and in this blogpost I will explain why.

re-tic-u-late (rĭ-tĭkˈyə-lĭt, -lātˌ)

So what exactly does reticulate do? Its goal is to facilitate interoperability between Python and R. It does this by embedding a Python session within the R session, which enables you to call Python functionality from within R. I’m not going to go into the nitty gritty of how the package works here; RStudio have done a great job in providing some excellent documentation and a webinar. Instead I’ll show a few examples of the main functionality.

Just like R, the House of Python was built upon packages. Except in Python you don’t load functionality from a package through a call to library() but instead you import a module. reticulate mimics this behaviour and opens up all the goodness from the module that is imported.

library(reticulate)
np <- import("numpy")
# the Kronecker product is my favourite matrix operation
np$kron(c(1,2,3), c(4,5,6))
## [1]  4  5  6  8 10 12 12 15 18

In the above code I import the numpy module, which is a powerful package for all sorts of numerical computations. reticulate then gives us an interface to all the functions (and objects) from the numpy module. I can call these functions just like any other R function and pass in R objects; reticulate will make sure the R objects are converted to the appropriate Python objects.

You can also run Python code through source_python if it’s an entire script or py_eval/py_run_string if it’s a single line of code. Any objects (functions or data) created by the script are loaded into your R environment. Below is an example of using py_eval.

data("mtcars")
py_eval("r.mtcars.sum(axis=0)")
## mpg      642.900
## cyl      198.000
## disp    7383.100
## hp      4694.000
## drat     115.090
## wt       102.952
## qsec     571.160
## vs        14.000
## am        13.000
## gear     118.000
## carb      90.000
## dtype: float64

Notice the use of the r. prefix in front of the mtcars object in the Python code. The r object exposes the R environment to the Python session; its equivalent in the R session is the py object. The mtcars data.frame is converted to a pandas DataFrame, on which I then call the sum function for each column.
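To see the py object in action, here is a minimal sketch going in the other direction:

# run a line of Python; any objects it creates live in the Python session
py_run_string("greeting = 'Hello from Python'")

# the py object exposes that Python session back to R
py$greeting
## [1] "Hello from Python"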

Clearly RStudio have put in a lot of effort to ensure a smooth interface to Python, from the easy conversion of objects to the IDE integration. Not only will reticulate enable R users to benefit from the wealth of functionality in Python, but I believe it will also enable more collaboration and increased sharing of knowledge.

Enter mailman

So what is it exactly that you can do with Python that you can’t with R? I asked myself the same question until I came across the following use case.

While helping a colleague out with a blogpost it was suggested that I should publish it on a Tuesday. No rationale was given, so naturally I wondered if I could provide one using data. The data would have to come from R-bloggers. This is a great resource for reading blogposts about R (and related topics), and they also provide a daily newsletter with a link to the blogposts from that day. At the time, the newsletter seemed the easiest way to collect the data[1]. All I needed to do now was extract the data from my Gmail account.

Therein lies the problem, as I wanted to avoid querying the Gmail server (that wouldn’t be easy to reproduce). Fortunately, Google have made it easy to download your data (thanks to the Google Data Liberation Front) through Google Takeout. Unfortunately, all the e-mails are exported in the mbox format. Although this is a plain-text-based format, it would take some effort to write a parser in R, something I wasn’t willing to do. And then along came Python, which has a built-in mbox parser in the mailbox module.

Using reticulate I extracted the necessary information from each e-mail.

# import the module
mailbox <- import("mailbox")
# use the mbox function to open a file connection
cnx <- mailbox$mbox("rblogs_box.mbox")

# the messages are stored as key/value pairs
# in this case they are indexed by an integer id
message <- cnx$get_message(10L)
# each message has a number of fields with meta-data
message$get("Date")
## [1] "Mon, 12 Dec 2016 23:56:19 +0000"
message$get("Subject")
## [1] "[R-bloggers] Building Shiny App exercises part 1 (and 7 more aRticles)"

And there we have it! I just read an e-mail from an mbox file with very little effort. Of course, I will need to do this for all messages, so I wrote a function to help me. And because we’re living in the Age of R, I placed this function in an R package. You can find it on the MangoTheCat GitHub repo; it is called mailman.

To publish or not to publish?

I have yet to provide a rationale for publishing a blogpost on a particular day so let’s quickly get to it. With the package all sorted I can now call the function mailman::read_messages to get a tibble with everything I need.

We can extract the number of blogposts on a particular date from the subject of each e-mail. Aggregating that by day of the week then gives a good overview of which days are popular.

library(dplyr)
library(mailman)
library(lubridate)
library(stringr)

messages <- read_messages("rblogs_box.mbox", type="mbox") %>% 
  mutate(Date = as.POSIXct(Date, format="%a, %d %b %Y %H:%M:%S %z"),
         Day_of_Week = wday(Date, label=TRUE, abbr=TRUE),
         Number_Articles = str_extract(Subject, "[0-9](?=[\\n]* more aRticles)"), 
         # Whenever a regex works you feel like a superhero!
         Number_Articles = as.numeric(Number_Articles) + 1,
         # Ok, sometimes it doesn't work but you're still a hero for trying!
         Number_Articles = ifelse(is.na(Number_Articles), 1, Number_Articles)) %>% 
  select(Date, Day_of_Week, Number_Articles)
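The plot itself isn’t reproduced here, but a quick sketch of the aggregation behind it (using ggplot2 on the messages tibble built above) would look something like this:

library(ggplot2)

# average number of R-bloggers posts per day of the week
messages %>% 
  group_by(Day_of_Week) %>% 
  summarise(Average_Articles = mean(Number_Articles)) %>% 
  ggplot(aes(x = Day_of_Week, y = Average_Articles)) +
  geom_col() +
  labs(x = "Day of the week", y = "Average number of posts")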

Judging by the graph, weekends would be a good time to publish a blogpost as there is less competition. Then again, not many people might read blogposts at the weekend. The next best candidate would then be Monday, which has the lowest average among the weekdays. Coming back to my original quest, I can conclude that publishing on a Tuesday is not the best option.

In summary

In my opinion, the reticulate package is a ground-breaking development. It allows me to combine the good parts of R with the good parts of Python (it’s already in use by the tensorflow and keras packages). Also, it allows the data science community to collaborate more easily and focus our energy on getting things done. This is the future, this is R and Python (Rython? PRython? PyR?).


  [1] After I had collected all the data, Bob Rudis wrote about the Feedly API and released a dataset of blogposts over a longer time period. I would say his solution is preferable, even though my results are slightly different due to the more recent time horizon.

With EARL just a week away, we have one more speaker interview to share!

In today’s interview, Ruth Thomson, Practice Lead for Strategic Advice spoke to Jasmine Pengelly, whose career includes teaching Data Analysis and Data Science at General Assembly and permanent positions as a Data Analyst at Stack Overflow and DAZN.


Jasmine will be presenting a lightning talk “Putting the R in Bar” where she will show how businesses can make data-driven decisions using the example of a Cocktail Bar.


Thanks Jasmine for taking the time for this interview. Where did the idea for this project come from?

The idea came to me organically. My fiancé owns a cocktail bar and it was clear to me how they could improve their business using advanced analytics, even with limited technical expertise.

I started asking, what insight would be valuable to the decision makers in that business?

So where did you start?


I identified that I had two datasets to work with: customer reviews, which were spread across four separate websites, and cocktail sales information.

The cocktail sales information led me to consider the choices on the menu. The decision of which cocktails to put on the menu had previously been made successfully using intuition, but there had been no data-driven decisions up until that point.

My approach was to use exploratory data analysis to build the best menu. I also started experimenting with regression models and I’ll be touching on my findings in this area in my talk.

For the other areas, I used text mining and natural language processing and I’m looking forward to sharing more detail about these two use cases at EARL soon.

What other businesses do you think would benefit from these examples?

The beauty of predictive analytics is that any business that provides a service to customers would benefit from using insight to make better decisions. It’s even more important for service-based businesses, which also benefit from word-of-mouth marketing and referrals.

For many small and medium-sized businesses, analytics could be seen as difficult to use and complex. However, it doesn’t need to be.

We hope you’ve enjoyed our series of speaker interviews leading up to EARL London, we can’t wait to hear the talks in full.

There’s still time to get tickets.

 


When our Chief Data Scientist, Rich Pugh, offered to write a blogpost about the DiagrammeR package, he asked what format we’d like it in.  “A poem?” we retorted.

He didn’t disappoint….

 

Gather ye and listen to my tale of derring-do,

Of heroes, villains, diagrams – and a chunk of R code too,

Wherefore my tale involves the noble Knight, Sir DiagrammeR,

Who brought the power of Knights of old – our own true Excalibur!

 

My tale begins in the office towers of a client in the city,

Where the coffee’s really strong, and the banter’s rather witty,

A place where we’ve delivered lots of data-driven gold,

With the fighting led by Mango Knights, heroic, brave and bold,

 

Whence our discussions quickly led to Data Science skill,

And the time required to cross-train a group of Analysts, who will

Become good Sirs and Ladies of a new Data Science team,

But how to design and communicate, this audacious training scheme?

 

Up went the cry, “A diagram is really what we need,

To explain the likely investment and time needed to lead

These Analysts who wish to learn to wield an Analytic sword,

A diagram would help us communicate this to the board.”

 

In burst the good Sir Powerpoint, upon his paper-clip-shaped mount,

“Did someone want a diagram?” this Gentleman did shout,

“For I have pre-built charts and things that really look quite neat,

And as a creator of diagrams I really can’t be beat!”

 

“Huzzah!” they said “that surely is the answer to our plight”,

But the skillful Mango Knights demurred, “this course may not be right,

Sir Powerpoint may have armour and a flashy, prancing steed,

But he certainly won’t provide us with the capabilities we need”

 

“We really want to simulate this cross-train exercise,

With a presentation of results to verily delight the eyes,

So members of the board will clearly understand the way,

To build a team with DS skills so they can win the day!”

 

All agreed Sir Powerpoint was really not the answer,

So with a nod we all dismissed this brave and brilliant lancer,

Instead those expert Mango Knights immediately did clamour,

To summon that brave hero, the good Sir DiagrammeR!

 

In walked the gent, who truly was the answer to our prayers,

He helped us build a diagram of lines and text and squares,

And since it was produced in R (that really is quite swell),

We could proceed to populate it with simulated scenarios as well,

 

A create_node_df call meant we could define the nodes,

We also used this function to set attributes (of which there are loads),

Such as labels, colour and shape – oh, the things we could control!

We could even define a bespoke layout to help achieve our goal.

 

Next we used create_edge_df to define the transition,

Of the skills of their DS team throughout their training mission,

We also annotated edges with some labels to convey,

The estimated time to upskill folks to B from A,

 

A simple call to create_graph then allowed us to combine,

Both edge and node dfs into an object we did assign,

Then render_graph created the spectacular show,

With options for the layout (we just chose “neato”),

 

It really was quite easy to produce the final chart,

With fine control the diagram looked like a work of art,

The use of simple data frames meant creation was a breeze,

And decent online docs meant we could build the thing with ease,

 

The output was fantastic – and thus the day was won,

The finished docs and examples confirmed our work was done,

In such a timely manner too – we’d worked at such great speed,

We retired early to the tavern to celebrate with ales and mead.

 

And thus our tale is ended, and the hero of the hour,

Was certainly Sir DiagrammeR and his chart-creation power,

It really is a top package with well-written documentation,

We heartily recommend it for diagram creation!
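(For anyone who prefers prose to poetry: the workflow Sir DiagrammeR describes boils down to something like the sketch below, with made-up roles and timings.)

library(DiagrammeR)

# define the nodes of the upskilling journey, with labels, shape and colour
nodes <- create_node_df(
  n     = 3,
  label = c("Analyst", "Junior Data Scientist", "Senior Data Scientist"),
  shape = "rectangle",
  color = "steelblue"
)

# define the transitions between roles, annotated with estimated upskilling time
edges <- create_edge_df(
  from  = c(1, 2),
  to    = c(2, 3),
  label = c("6 months", "12 months")
)

# combine the node and edge data frames into a graph object and render it
graph <- create_graph(nodes_df = nodes, edges_df = edges)
render_graph(graph, layout = "neato")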