
The Enterprise Applications of the R Language Conference (EARL) is the place to be for anyone using R in their organisation. You’ll be joined by R users from all over the data world, presenting their real-world projects, use cases, ideas and solutions.

The conference is run by Mango Solutions, as part of our commitment to the data community.

EARL 2019’s speaker lineup is something special indeed, with Sainsbury’s Group Chief Data Officer, Helen Hunter, delivering the opening keynote and taking us through Sainsbury’s ongoing data analytics journey into an increasingly digitally-led retail future.

Joining Helen in the opening session is none other than Stack Overflow’s Data Scientist Julia Silge. You probably already know Julia from her huge Twitter following, from GitHub, or – obviously – from her prolific posting on Stack Overflow itself.

At EARL 2019, Julia will be keynoting on the Stack Overflow Developer Survey – the world’s biggest and most comprehensive survey of people who code. With 90,000 respondents in 2019, Julia’s had a lot of data to comb through, and she’ll be taking us through how and why she used R to analyse that survey.

Julia says: “We are working with a complex dataset on a tight schedule and the R ecosystem provides the fluent data analysis tools we need to deliver compelling results on time.”

Also speaking is Hasnain Mahmood, Senior Quantitative Associate at clearing house LCH, where he’s working in the Change and Innovation stream in the In-Business Risk Management team.

In his session, you can learn how Hasnain and his team have been using an R-focused technology stack to carry out quantitative research to identify unique factors driving counterparty trading behaviour.

Hasnain says: “Put simply, we make financial markets safer.”

It’s no secret that Heathrow Airport is undergoing huge development, with plans to increase passenger movement through it from 80 million to 150 million as a result of the planned third runway.

Mitchell Stirling is the Capacity and Modelling Manager handling a critical element of this uptick in passengers – their baggage. Mitchell will be explaining how, working with Mango Solutions, he and his team converted a legacy PERL script into an R package to cut down manual intervention, flag errors earlier and generally stabilise the process – hopefully meaning fewer suitcases will show up in Seattle when you’re in New York.

Mitchell says: “For the past 20 years, growth at the airport has been constrained by its existing assets – but there is much to do to use them to their maximum potential.”

Does using R “spark joy”? RStudio’s Kelly O’Briant thinks it does, and as a Solutions Engineer, she’s passionate about bridging the gaps between development and production in data science projects. Covering how CI/CD tools can enhance reproducibility for R and data science, she’ll be showcasing practical examples in testing and deployment.

Kelly says: “Once you’ve embraced some basic development best practices in data science, what comes next? What does it take to feel confident that our data products will make it to production?”

Data journalism is now very much ‘a thing’, and that’s why we’ll be hearing from the BBC’s Nassos Stylianou.

The Senior Data Journalist will take us through how R is now being used to extract, wrangle and analyse data for major BBC stories, including the use of ggplot2 to create production-ready charts for a global audience, as well as how the team has made the case for R’s value across the wider BBC.

Nassos says: “The transition to R let us spread its use to other members of the Data and Visual Journalism team who had no prior knowledge of R.”

Join us at EARL 2019 to see these speakers and more, as well as our hugely popular workshops covering a range of topics, from package development in R to producing explainable, non-black-box machine learning models.
EARL Conference runs 10-12 September 2019 at the Tower Hotel, London, E1W 1LD.


Entering UCL’s packed Darwin lecture theatre on Monday evening, knowing you held a golden ticket so coveted that 400 names remained on the waitlist, the uninitiated could be forgiven for thinking this was the most popular meetup on the planet.

And perhaps, on this occasion at least, LondonR was the focal point of the R universe. Because after all, it’s not every day that one has the opportunity to attend an in-person presentation by the great Hadley Wickham.

If you’ve been living under a rock for the past ten years: Hadley Wickham is Chief Scientist at RStudio, as well as an adjunct professor of statistics at the University of Auckland, Stanford University and Rice University. In short, he builds tools (computational and cognitive) that make data science easier, faster and more fun.

As soon as the title slide: “Tidyverse: The Greatest Hits” was revealed, murmurs began to echo around the lecture theatre.  And these murmurs increased in intensity when Hadley dismissed the title as somewhat misleading and instead promised to talk about the “biggest mistakes” that have been made since the tidyverse came into being.

The hushed predictions of many were confirmed when a giant ggplot2 sticker appeared on the screen and the presentation that followed was enlightening, entertaining and introspective in equal measure.

Mistakes Part I

Beginning with his eternal remorse over discovering the magrittr pipe only after the first major wave of ggplot2 uptake, and via a sheepish admission that masking stats::filter() was perhaps overly callous, it wasn’t too long before we arrived at the topic of tidyeval.
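A quick aside before the tidyeval story: attaching {dplyr} masks stats::filter() and stats::lag(), which is exactly the behaviour Hadley was reflecting on. A minimal illustration of both the pipe and the masking (our own snippet, not from the talk):

    library(dplyr)  # attaching dplyr masks stats::filter() and stats::lag()

    # The magrittr pipe passes the left-hand side as the first argument,
    # so it is dplyr::filter() that runs here
    mtcars %>%
      filter(cyl == 6) %>%
      summarise(mean_mpg = mean(mpg))

    # The masked base function is still reachable via its namespace
    stats::filter(1:10, rep(1/3, 3))  # a simple moving average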

Tidyeval was where Hadley was able to expand for the first time on one of the core messages of his talk. Borrowing a quote from software development coach GeePaw Hill, he advocated the benefits of making as many mistakes as possible as quickly as possible, and described how, over the course of several years, he had experienced several “false epiphanies” which ultimately resulted in the creation of lazyeval and, subsequently, the tidyeval we all know and tolerate today.

Of course I jest; tidyeval is incredibly powerful, and Hadley was unwavering in his conviction that it will be the source of much future progress in R development, highlighting among other uses its fundamental role in innovative interface packages such as dbplyr and dtplyr.

He went on to acknowledge, however, that the number of people who share his passion for the underlying theory of tidyeval is rather small, and reflected on the decision to reveal it to the world while still in its relative infancy. Although initially disappointed by the slowness of the community to warm to the new concepts, he has come to terms with the fact that not everyone will immediately jump onto the quosure bandwagon. Consequently, there have been recent efforts to increase the accessibility of tidyeval, and we were proudly shown one of the latest developments: the “interpolation”, or “curly-curly”, or (Jenny Bryan’s wonderful coinage) “embrace” operator, {{ }}.
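To give a flavour of what the embrace operator buys you, here is a minimal sketch (the helper mean_by() is our own invention, not an example from the talk):

    library(dplyr)

    # {{ }} "embraces" a function argument so it is evaluated inside the
    # data frame, with the tidyeval machinery hidden from view
    # (requires rlang >= 0.4.0)
    mean_by <- function(data, by, var) {
      data %>%
        group_by({{ by }}) %>%
        summarise(mean = mean({{ var }}, na.rm = TRUE))
    }

    mean_by(mtcars, cyl, mpg)  # mean mpg for each value of cyl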

The trend towards user-friendliness and, in particular, self-explanatory functionality, is set to continue: we can look forward to the imminent release of tidyr 1.0.0, where the introduction of pivot_longer() and pivot_wider() is sure to delight those of us who never wrapped our heads around gather() and spread(), and to delight the rest of us, who DID eventually get to grips with them but still had to look up the syntax of both every time we wanted to use either.
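If you haven’t seen the new functions yet, here’s a quick sketch of old against new on toy data (our own example; the function and argument names are those of the pivot interface itself):

    library(tidyr)

    # toy data: one row per country, one column per year
    wide <- tibble::tibble(
      country = c("A", "B"),
      `2018`  = c(1.1, 2.0),
      `2019`  = c(1.3, 2.4)
    )

    # the old way: gather() -- which argument was key and which was value?
    long_old <- gather(wide, key = "year", value = "value", `2018`, `2019`)

    # the new way: pivot_longer() says what it does
    long <- pivot_longer(wide, cols = c(`2018`, `2019`),
                         names_to = "year", values_to = "value")

    # and pivot_wider() reverses it
    pivot_wider(long, names_from = year, values_from = value)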

But what about Python?

We couldn’t claim to have hosted a top R event if we hadn’t had some mention of Python from the audience. Hadley took the mandatory “R vs Python” question in his stride, perhaps unsurprisingly given the frequency with which he must face it.

In order to use Python, he argued, one must necessarily learn at least a small amount of programming – enough that someone coming from a purely data science perspective might be discouraged from continuing beyond the earliest stages of learning.

It’s possible to do useful data science work in R without learning any programming at all, and then as greater complexity is required, one can start to learn more about programming and about the language itself. Once someone has reached that point though, it is more a question of what is most suitable for the task at hand, in the context at hand.  And here Hadley animatedly encouraged us to “use Python!” if that was the sensible option.

Mistakes Part II

This sense of unity was a common theme throughout the presentation. Approaching the conclusion, Hadley expressed some regret for, in his view, one of the largest mistakes of all: the decision to denominate a certain group of packages as “the tidyverse”.

The intention, he elaborated, was never to provide a complete-but-isolated paradigm. Putting aside our human tendency to see conflict where there are merely options, there is no “base vs tidyverse” turf war. A tidyverse package can be used – is designed to be used – in exactly the same way as any other R package, i.e. in whichever context works best, with whatever other packages work best.

Hadley cited specific examples such as the effective combination of data.table and ggplot2, praising the utility and speed of the former in conjunction with the visualisation power of the latter. The name “tidyverse” is a blessing and a curse, he concluded: powerful as a label for the concepts it represents, but overly evocative of completeness and correctness.

Love for the R Community

In response to a question about how the R community has developed over the years, Hadley described how, at every stage of the community’s slow transition from the original R-help mailing list, through Stack Overflow, and most recently to Twitter and the RStudio Community forums, “asking for help” has gradually become a much easier thing to do.

The openness and friendliness of the community is one of the major strengths of R, and Hadley was quick to praise the community at large, giving R-Ladies a special mention for the work they have been doing in recent years.

After concluding with some enticing hints about where his efforts might be focused in the near future (look out for a new and improved vctrs, maybe…), our time was up and Hadley had to leave for his next engagement – but not before hinting at the possibility of another visit when his “travel budget” allows. Before you rush to put your name down for the next event, though, breathe steady, folks: he’s a busy man and it might be a little while before he’s back over here on the wrong side of the world.

It was refreshing to hear someone like Hadley acknowledge that innovation isn’t a straight line, and that forks and dead ends are essential parts of the process. Speaking to attendees afterwards, it was clear this message was highly prized, and many seemed to leave with increased confidence to go out and try things without fear of failure.

For those of you who were not lucky enough to get a golden ticket this time, don’t worry, all is not lost.  You can see a recording of Hadley’s presentation here.

And if this inspires you to find out more about the R community, rest assured that spaces aren’t usually quite so keenly fought over. We’re a friendly lot and you’ll probably find you even get a seat at the next event!



We sent Johannes Tang Kristensen from Arla Foods a few questions about his upcoming talk at EARL London – ‘How much milk do our cows produce? Lessons learned from putting our first R model into production’.

How did the need for your project come about?

The project started out as part of a larger initiative in Arla with the goal of proving the need for, and value of, advanced analytics. In this particular case, our global planning team asked us whether we could have a look at their current forecasting approach and see if we could improve it. An interesting aspect of the challenge was that the performance of their current approach was already very high, so they did not necessarily expect us to come up with a model that could beat their forecasts – although they of course wouldn’t mind if we did. Instead, what they wanted was a model that could help them develop a more systematic approach to creating their forecasts: one where the underlying assumptions would be clearer, where they would be less dependent on the knowledge of individuals in the team, and where they wouldn’t have to spend days creating a forecast in Excel.

Where did you start with your project?

The project started with a proof-of-concept, where we were given a data set compiled by our global planning department containing the variables they believed could be relevant. Using that data set, we were able to build a model that in the end outperformed their current forecasting approach. To present the results in an interactive way, we supplemented the model with a Shiny dashboard where our stakeholders could visualise the forecasting performance of the model in different cases and at different points in time. Based on this, the project was approved and upgraded from proof-of-concept to an actual IT development project, which meant we had to figure out how to actually put such a model into production.

How did you communicate the value of your work to the rest of the business?

The communication was mostly driven by our business stakeholders as they were the ones that best understood the actual value of the forecast improvements and time-savings provided by the model. However, we have of course also used it to showcase what we can deliver whenever possible.

Thank you to Johannes for answering our questions. Please take a look at the other brilliant speakers we have at EARL – we are now counting down the days! There is still time to get a ticket to EARL – our workshop spaces are filling up quickly, so don’t miss out.


Robert Duff (Transport for London) and Rahulan Chandrasekaran (Department for Transport)

Robert and Rahulan are doing a joint presentation titled ‘Let me in! Let me on! Quantifying highly frustrating events on the Underground’ on 11 September at EARL London. We dropped Robert an email to find out more around the subject of his and Rahulan’s talk.

How do you think technology is shaping modern transport?

Incredibly. It definitely feels like we are in something of a reboot stage at the moment. The challenge is staying relevant and positioning yourself to be flexible enough to adapt. The next advancement could be just around the corner. From the noticeable increase in ride-hailing services, electric vehicles and autonomous vehicle trials, it’s clear that technology is already shaping transport offerings as well as defining how users interact with them. The quality and quantity of information available to both public transport users and road users has really advanced in recent years. And of course, whilst we go on this journey it’s paramount to have safety at the forefront of our minds and to always be on the lookout for opportunities to encourage trips on more sustainable modes.

What challenges do organisations face in helping to shape modern transport?

Although it’s been mentioned a few times recently in various blogs that I’ve read, I’m just going to re-iterate one of the key challenges here. Organisations can do their best to keep up to date with technological trends, and have vast amounts of data and the right mix of data scientists and engineers, but the ability to shape really starts with making in-roads towards an organisational culture where data and openness are at the core.

To unlock the benefits of technological advancements you need to have the ability to influence and have decision-makers who are confident when talking about data.

It would be great, for example, if everyone knew what machine learning is, but we know this won’t happen overnight, and part of the challenge is explaining such topics so that everyone has a chance of grasping what they mean. It also helps when everyone is upfront and honest, happily stating when they don’t quite understand – there’s absolutely no shame in asking someone to repeat themselves, but in a slightly different way 😊. Organisations with strong analytical communities that fall naturally into the habit of sharing knowledge, learning from each other and unearthing best practices are the ones in prime position to face this difficult challenge.

Why did you pick R for this project?

Wrangling and visualising data make up a big part of this work, so R was a very good fit in that respect. Particularly important for Rahulan (my co-presenter) and me was the ability to put the data into our stakeholders’ hands to interact with – and we found some fantastic packages for that.

What are you planning after this project?

The direction of travel for this project is pretty exciting. Since the project has got going, and from what we’re going to present at EARL, we’re now in a position where we have more data than before and in considerable quantities. We can now complement our ticketing and train movement data with WiFi data from within our stations. This gives us an extra dimension as we can begin to think about applying more advanced techniques to our problem, possibly taking a trip into predictive analytics territory with the aim of improving the customer experience.

Thanks to Robert for this interview – please take a look at the other speakers that we have presenting. It’s going to be 3 days of jam-packed R goodness!

There are only 4 weeks left until EARL, you can get your tickets here.


There are so many wonderful EARL talks happening this year – it’s hard to highlight them all! But we thought we’d share some that the Mango team are really looking forward to:

Ana Henriques, PartnerRe

Using R in Production at PartnerRe

Ana Henriques is the Analytics Tool Lead in PartnerRe’s Life & Health Department. Ana is now focused on business-side delivery of platforms and tools to support data science and related functions. Her talk will focus on the open source infrastructure supporting this process: version control, continuous integration, containerisation and container deployment and orchestration.

Kevin Kuo, RStudio

Towards open collaboration in insurance analytics

Kevin is a software engineer at RStudio and the founder of Kasa AI, a not-for-profit community initiative for open research and software development in insurance analytics. Kevin will be introducing Kasa AI and, inspired by rOpenSci and Bioconductor, his team hopes to bring together the insurance community to solve its most impactful problems.

Charlotte Wise, Essence

Beyond the average: a Bayesian approach for setting media targets

Charlotte manages a small team of analysts at Essence, a global media agency and part of GroupM, WPP. Her talk will cover how the team at Essence overcame the issue of reporting ROI on marketing campaigns by using a hierarchical Bayesian model.

Kasia Kulma, Mango Solutions

Integrating empathy in the Data Science process

Kasia Kulma is a Data Scientist at Mango Solutions and holds a PhD in evolutionary biology from Uppsala University. Kasia’s talk will demonstrate how empathy has a clearly defined role at every step of the Data Science process: from pitching project ideas and gathering requirements, to implementing solutions, informing and influencing stakeholders, and gauging the impact of the product.

Mitchell Stirling, Heathrow Airport

Understanding Airport Baggage Demand through R modelling 

Mitchell is a Senior Analyst at Heathrow Airport with seven years’ experience working in Operations, Commercial and Strategic positions. Heathrow Airport is entering a new phase of growth, and the team there wanted to look at potential scenarios for occupancy and use of infrastructure, to maximise existing assets and reduce the need for expensive capital works early in the programme. To explore how these scenarios would impact the demand on baggage systems, Heathrow has worked with Mango to convert a legacy PERL script into an R package and make a number of improvements that cut down manual intervention, flag errors earlier, stabilise the process and allow for greater variation in key inputs.

There are plenty more speakers on the agenda for you to take a look at, so why not join us in September for 3 days of R, learning, inspiration and fun!

Tickets available now.



I recently had the pleasure of attending the second annual Insurance Data Science Conference in Zurich, Switzerland, and it may not come as a surprise to you to learn that AI and ML in insurance topped the list of discussion items. But what may surprise you is the context within which they were discussed: autonomous cars. Yes, the reality of autonomous vehicles – AVs – and even connected and autonomous vehicles – CAVs – is upon us, and the implications for the insurance industry are significant. Even trucks are getting a look-in thanks to Tesla and Daimler’s foray into semi-autonomous big rigs. In short, the technology is ready, but what effect will the uptake of these autonomous vehicles have on insurers?

Insurers are having to ask – and find answers to – some serious questions, particularly if the government’s target to allow CAVs onto UK roads by 2021 is to be realised. Resolving questions such as how risk should be priced, or deciding upon best practice when it comes to managing claims, are crucial to the industry, but will also help address uncertainty amongst consumers, who are equally apprehensive at the prospect of CAVs taking to the roads in the next five to 10 years.

Technology has forever altered the transportation landscape, and with it the way we look at insurance for transportation. As a data scientist, the question I would then ask is: can data science help to answer some of these questions? In my view, from a business perspective it certainly can, on the assumption that insurers regard data as a strategic asset that can inform better decision-making – which requires four steps:

  1. Identifying the decisions that have the biggest business impact, in order to create an analytics framework
  2. Ensuring that the internal and external data being collected and stored is of good quality to support those decisions
  3. Providing expertise to identify ways to improve these decisions and to run analytics regularly
  4. Creating the right technology infrastructure to support data and analytics

These steps create the right foundation for the application of data science, which really comes to life in the right hands – the AI engineers, data analysts and research specialists being hired in increasing numbers by insurers as they look to understand the safety and financial-risk implications of CAVs, and seek to develop products around them. One US-based company is a great example of this – it had 74 job openings related to AI engineers and data scientists out of almost 1,000 open listings in the US between September and December 2018.

It’s an interesting time to be in the insurance sector, albeit uncertain as the impact of CAVs is not yet that clear. The best way for insurers to meet the challenge head on is to put data at the heart of the business, instilling the kind of data-driven culture that will help insurers stay agile and relevant, whether it be products and services for self-driving cars or the soon-to-be ‘old-fashioned’ people-driven kind!


Earlier this month I, together with two other Mangoes, made my way to France for the 2019 edition of useR!.

useR! brings together users and developers from both academia and industry. This year it was hosted in Toulouse and, together with its side-events, covered the second week of July. It was my first time attending useR!, so I didn’t really know what to expect.

I enjoyed the good food, geeking out over the cool talks on the latest R developments, and making a bunch of new friends. Mango was also one of the conference sponsors, our way of showing appreciation and support for the community.

Tidyverse developer day

Things kicked off on Monday the 8th of July with a side-event, the Tidyverse Developer Day, an initiative aimed at encouraging users to contribute to the {tidyverse} by solving some curated issues. I found it very rewarding to be able to chip into a beloved suite of packages largely aimed at making our lives easier and, thus, lowering the barrier to entry in the R world.

I really loved working alongside fellow R users and having access to some of the most experienced R developers in the world. I improved the documentation for one of the {withr} functions, as a warm-up task, and then worked on a more complex {usethis} issue. Both pull requests were accepted and I can’t wait for the {usethis} feature I helped add to make its way to the CRAN release.

The Conference

A couple of themes emerged for me from the sea of talks and tutorials: better science and better workflow. I think there is a lot of overlap between the two and the distinction is not as clear as my classification might make it sound. An obvious caveat – this interpretation is largely biased by the topics I’m interested in and the talks I chose to go to. Nevertheless, there were quite a few talks touching on the topic of better science, which is also a testament to R’s (and the conference’s) academic heritage.

Better science

In the grand theme of making better science, I would include talks touching on improving reproducibility, bridging the skills gap or encouraging good practices.

  • Julia Stewart Lowndes’s awesome keynote on using R to fill the rift between environmental data and data science (R for better science in less time).
  • Joe Cheng’s keynote on solving the reproducibility problem of Shiny apps. His commitment to give a talk on this topic was the catalyst for the development of {shinymeta} – an R package that captures Shiny logic and exposes it as R code (Shiny’s Holy Grail: Interactivity with reproducibility).
  • Julie Josse’s keynote on missing data and modelling in R. She is also one of the maintainers of the CRAN Task View on missing data (A missing value tour in R).
  • I will put Davis Vaughan’s tutorial on {hardhat} in this bucket too. {hardhat} is a developer focused package aimed at standardising modelling packages to follow best-practices (Design For Humans! A Toolkit For Creating Intuitive Modeling Packages).
  • Our own Hannah Frick took attendees on a lightning tour of {goodpractice} – a chatty package that analyses your code and gives you advice on best practices when building R packages (goodpractice – A Tool for Good Package Development). Hannah’s topic touches on both workflow and better scientific practice making it a good bridge between what I consider the two main themes of the conference.
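If you fancy trying {goodpractice} on one of your own packages, usage is essentially a one-liner – a minimal sketch, with a hypothetical path:

    library(goodpractice)

    # Run the battery of checks against a package source directory
    # and print the resulting advice
    g <- gp("path/to/your/package")
    g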

Better workflow

Under workflow I grouped tools aimed at making your life easier (mostly by reducing cognitive load), be it when building and deploying Shiny apps, reshaping your data or building R packages. Talks that stood out for me under this umbrella were:

  • Vincent Guyader’s talk on {golem} – a package developed by the crew at ThinkR, aimed at abstracting away some of the complexities of developing production-ready Shiny apps (Golem: A Framework for Building Robust & Production Ready Shiny Apps).
  • Jenny Bryan’s talk on {usethis} and its conscious uncoupling from {devtools}. {usethis} implements the DRY – don’t repeat yourself – concept and aims to facilitate key steps of the package development workflow (DRY out your workflow with the usethis package); there’s a short sketch after this list. Hint: we love {usethis} and we’re going to make heavy use of it during our own EARL package-building workshop.
  • Hadley Wickham’s talk on Enhancements to data tidying. Following feedback from users who found {tidyr}’s gather() and spread() sometimes confusing, two new, more intuitive functions were introduced: pivot_longer() and pivot_wider(). More info in the new vignette.
  • Romain François’s talks on new {dplyr} functionality (n() cool #dplyr things) and on {dance}, an experimental package looking to implement {dplyr}-like functionality, but supported by the latest low-level developments in R, such as the {rlang} and {vctrs} packages.
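As a taster of the {usethis} workflow Jenny described, here is a minimal, hypothetical package-creation session (the function names are real {usethis} exports; the package and file names are ours):

    library(usethis)

    create_package("~/dev/mypackage")  # scaffold a new package and switch to it
    use_git()                          # initialise version control
    use_mit_license("Jane Doe")        # add a LICENSE and update DESCRIPTION
    use_r("my_function")               # create R/my_function.R
    use_test("my_function")            # matching testthat file
    use_readme_md()                    # README skeleton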

I spent most of the rest of my time in the programming and performance tracks where, in addition to those named above, there were some other remarkable talks:

  • Colin Gillespie’s – probably the funniest talk of the conference (R and security). He highlighted a few instances in which access to a user’s system can be gained via relatively inconspicuous R tasks.
  • Jim Hester’s talk on {vroom} – “the fastest delimited reader for R” – a new file import package that uses the ALTREP framework introduced in R 3.5 (Real-time file import with the vroom package); there’s a short sketch after this list.
  • Gábor Csárdi spoke about {pak}, which proposes a new way to install packages – an alternative to install.packages() – that’s fast, safe and convenient (pak: a fresh approach to package installation).
  • Lionel Henry talked about writing {tidyverse}-like functions that make use of non-standard evaluation (Reusing tidyverse code, the easy way).
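Trying {vroom} requires almost no ceremony – a minimal sketch, with hypothetical file names:

    library(vroom)

    # vroom() guesses the delimiter and column types, and reads lazily
    # via ALTREP, so even large files appear to load almost instantly
    flights <- vroom("flights.csv")

    # several files of the same shape can be read into one data frame
    both_years <- vroom(c("2018.csv", "2019.csv"))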

For those who missed out, the keynotes are available on the RConsortium YouTube channel. Slides for the presentations can be found alongside the programme – https://user2019.r-project.org/talk_schedule/ – and Suthira Owlarn has collated more conference materials at https://github.com/sowla/useR2019-materials.

So I wish you interesting R adventures, and I hope to see you soon at LondonR, at EARL or on Twitter!