
When technical capabilities and company culture combine, IoT-fed data lakes become a powerful brain at the heart of the business

Internet-enabled devices have led to an explosion in the growth of data. On its own, this data has some value; however, the only way to unlock its full potential is to combine it with other data that businesses already hold.

Together, pre-existing data and newly-minted IoT data can provide a full picture of specific insights around a single consumer. It is paramount, however, that companies don’t prioritise innovation at the expense of ethics. Sourcing and analytics must be done correctly – with the right context that respects consumer privacy and wishes around data usage.

The insights gained from successfully blending these two different data sources also unlock secondary benefits including new product development, possible upsells or the ability to build customer goodwill through advice-driven service delivery.

It’s a winning combination, but the challenge is how to actually merge device data with regular customer information.

No easy fit

This problem arises from the fact that IoT device data is a different “shape” to data in traditional customer records.

If you think of a customer record in a sales database as one long row of information, IoT collected information is more like an entire column of time series information, with a supporting web of additional detail. Trying to directly join the two is near impossible, and it is likely that some valuable semantic information could end up lost in the process.

But if IoT information fundamentally resists structure, and existing business databases are built on rigid structures, how do you find an environment that works for both? The answer is a data lake.

Pooling insight

A data lake is a more “fluid” approach to storing and connecting data. It is a central repository where data can be stored in the form it’s generated, whether that is in a relational database format or entirely unstructured. Analytics can then be applied over the top to connect different pieces of information and derive useful business insights.

However, there is more complexity involved in setting up a data lake than just combining all of an organisation’s data and hoping for the best. If you do that, you’ll likely end up with a data swamp – a disorganised, underperforming mess of data that lacks the necessary context to make it useful.

This can be avoided using the expertise of dedicated data engineers. These are the masterminds who build the framework for a data lake and manage the process of extracting data from its source, before transforming it into a usable format and then loading it into the data lake environment. Done properly, this will ensure data provenance, with appropriate metadata to guide users on allowable use cases and analysis.


This sounds like a significant undertaking, and there’s no getting around the fact that doing data lakes right takes time and effort, but it is possible to take a staged approach. Many organisations start with a data “puddle” – a small collection of computers hosting a limited amount of data – and then slowly add to this, increasing the number of computers over time to form the full data lake.

A question of culture

In addition, technical considerations are just one side of the coin. The other side is culture. At the core of the problem is the fact that businesses will not succeed in commercialising their IoT data if users are either unaware of, or distrustful of, the data lake and its potential.

While investment in big data continues to grow, a recent NewVantage Partners survey on Big Data and AI found that just 31 percent of organisations consider themselves data driven — the second year in a row that the number has fallen. Data lake technology has been around for several years now, and should be more than capable of enabling these types of organisations, but without the right culture in place, its benefits are seldom felt.

How do you create a culture that centres on being data-driven? As any management team knows, culture shifts are never easy, but a data-driven culture boils down to improving collaboration, communication and understanding between data professionals and business functions.

With a successful technical implementation of a data lake, you then need data professionals to advocate its benefits, and liaise with business departments to understand the types of insights that would be most useful to inform strategic decisions.

This then reinforces business confidence in the data function, and allows the data teams to expand their contributions to the business and be recognised for their hard work. When supported by senior buy-in, this positive feedback loop generates a growing culture of data savviness and data-driven approaches within the organisation.

Brain of the organisation

When technical capabilities and company culture combine, data lakes can become a powerful brain at the heart of the business. With the right analytics tools layered over the top, data lakes can reduce the time to insight and surface powerful information. These insights serve business needs better and faster, and are an outright win for any organisation. In short, they are well worth the time and investment.

Author: Dean Wood, Principal Data Scientist


We are excited to announce the speakers for this year’s EARL London Conference!

Every year, we receive an immense number of excellent abstracts and this year was no different – in fact, it’s getting harder to decide. We spent a lot of time deliberating and had to make some tough choices. We would like to thank everyone who submitted a talk – we appreciate the time taken to write and submit; if we could accept every talk, we would.

This year, we have a brilliant lineup, including speakers from Auto Trader, Marks and Spencer, Aviva, Hotels.com, Google, Ministry of Defence and KPMG. Take a look below at our illustrious list of speakers:

Full length talks
Abigail Lebrecht, Abigail Lebrecht Consulting
Alex Lewis, Africa’s Voices Foundation
Alexis Iglauer, PartnerRe
Amanda Lee, Merkle Aquila
Andrie de Vries, RStudio
Catherine Leigh, Auto Trader
Catherine Gamble, Marks and Spencer
Chris Chapman, Google
Chris Billingham, N Brown PLC
Christian Moroy, Edge Health
Christoph Bodner, Austrian Post
Dan Erben, Dyson
David Smith, Microsoft
Douglas Ashton, Mango Solutions
Dzidas Martinaitis, Amazon Web Services
Emil Lykke Jensen, MediaLytic
Gavin Jackson, Screwfix
Ian Jacob, HCD Economics
James Lawrence, The Behavioural Insights Team
Jeremy Horne, MC&C Media
Jobst Löffler, Bayer Business Services GmbH
Jo-fai Chow, H2O.ai
Jonathan Ng, HSBC
Kasia Kulma, Aviva
Leanne Fitzpatrick, Hello Soda
Lydon Palmer, Investec
Matt Dray, Department for Education
Michael Maguire, Tusk Therapeutics
Omayma Said, WUZZUF
Paul Swiontkowski, Microsoft
Sam Tazzyman, Ministry of Justice
Scott Finnie, Hymans Robertson
Sean Lopp, RStudio
Sima Reichenbach, KPMG
Steffen Bank, Ekstra Bladet
Taisiya Merkulova, Photobox
Tim Paulden, ATASS Sports
Tomas Westlake, Ministry Of Defence
Victory Idowu, Aviva
Willem Ligtenberg, CZ

Lightning Talks
Agnes Salanki, Hotels.com
Andreas Wittmann, MAN Truck & Bus AG
Ansgar Wenzel, Qbiz UK
George Cushen, Shop Direct
Jasmine Pengelly, DAZN
Matthias Trampisch, Boehringer Ingelheim
Mike K Smith, Pfizer
Patrik Punco, NOZ Medien
Robin Penfold, Willis Towers Watson

Some numbers

We thought we would share some stats from this year’s submission process:


[Chart: submission statistics, based on a combination of titles, photos and pronouns.]

Agenda

We’re still putting the agenda together, so keep an eye out for that announcement!

Tickets

Early bird tickets are available until 31 July 2018 – get yours now.

Field Guide to the R Ecosystem
Mark Sellors, Head of Data Engineering

I started working with R about five years ago. Parts of the R world have changed substantially over that time, while other parts remain largely the same. One thing that hasn’t changed, however, is that there has never been a simple, high-level text to introduce newcomers to the ecosystem. I believe this is especially important now that the ecosystem has grown so much. It’s no longer enough to just know about R itself. Those working with, or even around, R must now understand the ecosystem as a whole in order to best manage and support its use.

Hopefully the Field Guide to the R Ecosystem goes some way towards filling this gap.

The field guide aims to provide a high-level introduction to the R ecosystem. It is designed for those approaching the language for the first time, for managers, for ops staff, and for anyone who just needs to get up to speed with the R ecosystem quickly.

This is not a programming guide and contains no information about the language itself, so it’s very definitely not aimed at those already developing with R. However, it is hoped that the guide will be useful to the people around those R users, whether that’s their managers, who’d just like to understand the ecosystem better, or ops staff tasked with supporting R in an enterprise who don’t know where to start.

Perhaps, you’re a hobbyist R user, who’d like to provide more information to your company in order to make a case for adopting R? Maybe you’re part of a support team who’ll be building out infrastructure to support R in your business, but don’t know the first thing about R. You might be a manager or executive keen to support the development of an advanced analytics capability within your organisation. In all of these cases, the field guide should be useful to you.

It’s relatively brief and no prior knowledge is assumed beyond a general technical awareness. The topics covered include R, packages and CRAN, IDEs, R in databases, commercial versions of R, web apps and APIs, publishing, and the community.

I really hope you, or someone around you, finds the guide useful. If you have any feedback, find me on Twitter and let me know. If you’d like to propose changes to the guide itself, you’ll find instructions in the first chapter and the bookdown source on GitHub. Remember, the guide is intentionally high-level and is intended to provide an overview of the ecosystem only, rather than any deep-dive technical discussions. There are already plenty of great guides for that stuff!

I’d also like to say a huge thanks to everyone who has taken time out of their day to proof read this for me and provide invaluable feedback, suggestions and corrections. The community is undoubtedly one of R’s greatest assets.

Originally posted on Mark’s blog, here.

Nic Crane, Data Scientist

At Mango, we’re seeing more and more clients making the decision to modernise their analytics process, moving away from SAS and on to R, Python and other technologies. There are a variety of reasons for this, including SAS licence costs, the increasing number of recent graduates with R and Python skills, SAS becoming increasingly uncommon, and the need for flexible technologies with the capability for advanced analytics and quality graphics output.

While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process.

So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation?

1. Analyse procedure calls to inform technology choice


Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job?

A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL – SAS procedures which reorder data and execute SQL commands for interacting with databases or tables of data. As there are no statistics-related procs, we may decide, if we migrate this application away from SAS, to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R.
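
As a rough illustration of this first step, a minimal sketch in R of counting procedure calls across a directory of SAS scripts might look like the following. The directory path and the regex-based approach are assumptions for the example, not a description of the actual health check tooling:

library(dplyr)
library(stringr)

# Assumed location of the SAS scripts for this example
sas_files <- list.files("sas_code", pattern = "\\.sas$",
                        recursive = TRUE, full.names = TRUE)

proc_counts <- purrr::map_df(sas_files, function(f) {
  code <- toupper(readLines(f, warn = FALSE))
  # pull the procedure name out of every "PROC <NAME>" statement
  procs <- str_match(code, "\\bPROC\\s+([A-Z0-9_]+)")[, 2]
  procs <- procs[!is.na(procs)]
  data.frame(file = rep(f, length(procs)), proc = procs,
             stringsAsFactors = FALSE)
}) %>%
  count(proc, sort = TRUE)

proc_counts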

2. Use macro analysis to find the most and least important components of an application

Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used?

We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once.

Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality.

On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this to avoid wasting time and resource examining it.
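
A similar sketch for counting macro definitions and calls (again regex-based and illustrative only, not a full SAS parser):

library(stringr)

# Rough count of how often each user-defined macro is called.
# Calls without parentheses (e.g. "%transfer_data;") are not
# picked up by this simple pattern.
count_macro_usage <- function(code) {
  code <- toupper(code)
  # macro definitions: %MACRO name ...
  defs <- str_match(code, "%MACRO\\s+([A-Z0-9_]+)")[, 2]
  defs <- unique(defs[!is.na(defs)])
  # count invocations of each defined macro, e.g. "%TRANSFER_DATA("
  calls <- vapply(defs,
                  function(m) sum(str_count(code, fixed(paste0("%", m, "(")))),
                  numeric(1))
  data.frame(macro = defs, calls = calls, row.names = NULL,
             stringsAsFactors = FALSE)
}

For a codebase like the one above, the output would be expected to show transfer_data with many calls and load_other with none, matching the orphaned-code check described earlier.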

3. Looking at the interrelated components to understand process flow

When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code?

The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase and is scaled by the size of that script, while the nodes are coloured by complexity. Looking at the diagram above, we can instantly see that the create_metadata script is both large and complex, so we might choose to assign it to a more experienced developer, or look to restructure it first.

4. Examine code complexity to assess what needs redeveloping and redesigning

Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively?

Thankfully, we can use ‘cyclomatic complexity’, a measure of how complex the code is. The results of this analysis help determine whether the code needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers.
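
As a back-of-the-envelope illustration only (not the metric used in the actual health check), a crude complexity proxy can be computed by counting decision points in a script:

library(stringr)

# Very rough proxy for cyclomatic complexity: one plus the number of
# decision points found. Comments and string literals are not handled,
# so treat the result as indicative only.
estimate_complexity <- function(code) {
  code <- toupper(code)
  decision_points <- c("\\bIF\\b",           # IF, %IF and ELSE IF
                       "\\bDO\\s+WHILE\\b",  # DO WHILE loops
                       "\\bDO\\s+UNTIL\\b",  # DO UNTIL loops
                       "\\bWHEN\\b")         # WHEN branches in SELECT
  1 + sum(vapply(decision_points,
                 function(p) sum(str_count(code, p)),
                 numeric(1)))
}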

5. Use the high level overview to get an informed perspective which ties into your strategic objectives

Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives?

The overall summary can be used to answer questions around the relative size and complexity of multiple applications, making it possible to estimate more accurately the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions.

For example, if your key priority is to consolidate your ETL process, you might first focus on the apps which have a high number of calls to PROC SQL. Or you might have a goal of improving the quality of your graphics, and so you’ll focus on the applications which produce a large number of plots. Either way, a high-level summary like the one below collects all the information you need in one place and simplifies the decision-making process.

SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money.

If you’re thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with our team today via sales@mango-solutions.com or +44 (0)1249 705 450.

Ten reasons to join the Mango Solutions team

We think Mango is a pretty great place to work and we could give you a list of reasons that every company comes up with, but we decided to talk to our team to find out why they love working here. There were definitely more than ten reasons, so we’ve picked the very best.

1. “We have great customers!”
Working in consultancy means every day is different. Sometimes we work with finance analysts, sometimes with retailers. Each has fascinating domain expertise, and yet all can benefit from your experience in other sectors.

2. “We get to travel.”
Our customer base is global and sometimes training or a design sprint work better face-to-face. We also keep up with advances in our sectors by attending conferences. So, from San Francisco to Seoul we teach, collaborate and network. We also encourage our team to present and run workshops at meetups, conferences and other industry events.

3. “We support the Data Science community.”
A well-connected community brings people together. We support the growing community by hosting a range of data science meetups and provide both commercial and free training. We make a lot of our internal R packages available on GitHub, and contribute to many other public projects as well.

4. “We work with awesome people!”
In an ever-changing field, supporting your colleagues is important. We’re proud to say we have a team of incredibly supportive people that are happy to give advice. We work together on challenges to successfully achieve our goals. There is plenty of friendly banter and warmth in the office as well.

5. “We get to work on a huge variety of projects.”
Consulting work gives you the opportunity to work with a huge variety of clients and industries. Sometimes we’re building traditional statistical models; sometimes it’s machine learning and cutting-edge algorithms. No two projects are the same, and with such an exciting array of work you never feel like you’re stuck in a rut.

6. “Never stop learning”
Mango operates at the cutting-edge of open source technology. All of our consultants are encouraged to learn the latest methods and tools and, through regular internal seminars, encouraged to share that knowledge around the team. We have an internal training programme and prioritise giving people the space to investigate new technology.

7. “No idea is a bad idea…”
Okay, not all ideas are great ideas, but we love creative solutions and initiative at Mango. We are always open to new ideas when it comes to addressing challenges. As most projects require different ways of thinking and new problems to solve, your fresh ideas would always be welcome.

8. Free Food Fridays 
Who doesn’t love free food? One popular Mango perk is ‘free food Fridays’. Exactly as it suggests, we put on a free lunch once a month. It’s always great to get the whole team together over some food. We also provide free fruit for the whole office, so you can reach your 5-a-day!

9. EARL Conference
Becoming a member of the Mango Solutions team gives you the opportunity to attend EARL Conferences, which are held in some of the coolest cities in the world. These conferences have a really exciting vibe; almost every session could be showcasing ideas that could be directly applicable in your projects. EARL attendees are passionate about finding new ways to improve their business area and are great company, good at networking and fun to hang out with after sessions.

10. Diversity
Mango Solutions recognises that our people are our most valuable asset. For us to achieve and succeed we look to attract and retain the right skills and the best minds – and this means diversity. Diversity is a key driver of innovation and a diverse team harbours creative thinking, and allows us to work more effectively within a diverse marketplace.

These are just the top reasons why you should join the Mango team. There are many more, but to find out for yourself take a look at what roles we’re hiring for and talk to us.


Adnan Fiaz

With two out of three EARL conferences now part of R history, we’re really excited about the next EARL conference in Boston (only 1 week away!). This calls for an(other) EARL conference analysis, this time with Twitter data. Twitter is an amazingly rich data source and a great starting point for any data analysis (I feel there should be an awesome-twitter-blogposts list somewhere).

I was planning on using the wonderful rtweet package by Michael Kearney (as advertised by Bob Rudis), but unfortunately the Twitter API doesn’t provide a full history of tweets. Instead I had to resort to a Python package (gasp) called GetOldTweets. I strongly recommend using the official Twitter API first before going down this path.

The Data

# I have used the Exporter script with the hashtags #EARLConf2017, #EARLConf and #EARL2017
tweets_df <- purrr::map_df(list.files('data/tweets', full.names = TRUE), 
                           ~ readr::read_delim(.x, delim = ";", quote = "")) %>% 
  # filter out company accounts
  filter(username != "earlconf", username != "MangoTheCat") %>% 
  mutate(shorttext = stringr::str_sub(text, end = 50))

tweets_df %>% 
  select(username, date, shorttext) %>% 
  head() %>% 
  knitr::kable()
username date shorttext
AlanHoKT 2017-10-02 02:15:00 “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
johnon2 2017-09-23 16:02:00 “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
AndySugs 2017-09-21 22:19:00 “RT: LearnRinaDay: EARL London 2017 ? That’s a wra
LearnRinaDay 2017-09-21 22:17:00 “EARL London 2017 ? That’s a wrap! https://www. r-
LouBajuk 2017-09-20 23:15:00 “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
pjevrard 2017-09-20 13:02:00 “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o

First things first, let’s get a timeline up:
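
A timeline along these lines can be produced with something like the following (a sketch using the tweets_df data frame above, not necessarily the original plotting code):

tweets_df %>% 
  mutate(day = lubridate::date(date)) %>% 
  count(day) %>% 
  ggplot(aes(x = day, y = n)) + 
  geom_line() + 
  labs(x = 'Date', y = 'Number of tweets', title = 'EARL-related tweets over time') +
  theme_classic()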

 

The hashtags I used to search tweets were generic so the results include tweets from last year’s conferences. Let’s zoom in on this year’s conferences: EARL San Francisco (5-7 June) and EARL London (12-14 September). They clearly explain the large peaks in the above graph.

 

I’ve tried to highlight the period when the conferences were on but I don’t quite like the result. Let’s see if it works better with a bar chart.

earlconf_sf_dates <- lubridate::interval("2017-06-05", "2017-06-08")
earlconf_lon_dates <- lubridate::interval("2017-09-12", "2017-09-15")
tweets_df %>% 
  filter(date > "2017-05-01") %>% 
  mutate(day = lubridate::date(date)) %>% 
  count(day) %>% 
  mutate(conference = case_when(day %within% earlconf_sf_dates ~ "SF",
                                day %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>% 
  ggplot(aes(x = day, y = n)) + 
  geom_bar(stat = "identity", aes(fill = conference)) +
  scale_fill_manual(guide = FALSE, values = c("#F8766D", "black", "#619CFF")) +
  labs(x = 'Date', y = 'Number of tweets', title = 'Number of EARL-related tweets by day') +
  # date_format() comes from the scales package
  scale_x_date(date_breaks = "1 months", labels = scales::date_format('%b-%y')) +
  theme_classic()

 

Now that’s a lot better. The tweet counts in black surrounding the conferences look like small buildings which make the conference tweet counts look like giant skyscrapers (I was a failed art critic in a previous life).

Activity during conferences

I’ve been to my fair share of conferences/presentations and I’ve always wondered how people tweet so fast during a talk. It could be just my ancient phone or I may lack the necessary skills. Either way it would be interesting to analyse the tweets at the talk level. First I will need to link the tweets to specific talks. I’ve translated the published agenda into a nicer format by hand and read it in below.

earl_agenda <- map_df(c("EARL_SF", "EARL_LON"), 
                      ~ readxl::read_xlsx('data/earl_agenda.xlsx', sheet = .x))
earl_agenda %>% 
  select(StartTime, EndTime, Title, Presenter) %>% 
  head() %>% 
  knitr::kable()
StartTime EndTime Title Presenter
2017-06-06 11:00:00 2017-06-06 11:30:00 R’s role in Data Science Joe Cheng
2017-06-06 11:30:00 2017-06-06 12:00:00 ‘Full Stack’ Data Science with R: production data science and engineering with open source tools Gabriela de Queiroz
2017-06-06 12:00:00 2017-06-06 12:30:00 R Operating Model Mark Sellors
2017-06-06 11:00:00 2017-06-06 11:30:00 Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
2017-06-06 11:30:00 2017-06-06 12:00:00 Using data to identify risky prescribing habits in physicians Aaron Hamming
2017-06-06 12:00:00 2017-06-06 12:30:00 How we built a Shiny App for 700 users Filip Stachura

Before I merge the tweets with the agenda it’s a good idea to zoom in on the conference tweets (who doesn’t like a facetted plot).

conference_tweets <- tweets_df %>% 
  mutate(conference = case_when(date %within% earlconf_sf_dates ~ "SF",
                                date %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>% 
  filter(conference != "NONE")

ggplot(conference_tweets, aes(x = date)) +
  geom_histogram() +
  facet_wrap(~ conference, scales = 'free_x')

 

Nothing odd in the pattern of tweets: there are no talks on the first day so barely any tweets; the number of tweets spikes at the beginning of the other two days and then declines as the day progresses. There is something odd about the timing of the tweets though. I didn’t notice it before, but when I compared the position of the bars on the x-axis, the San Francisco tweets looked shifted. And then my lack of travel experience hit me: time zones! The tweets were recorded in UTC time but the talks obviously weren’t in the evening in San Francisco.

After correcting for time zones I can finally merge the tweets with the agenda.

selection <- conference_tweets$conference == 'SF'
conference_tweets[selection, 'date'] <- conference_tweets[selection, 'date'] - 8*60*60
# I intended to use a fuzzy join here and check if the tweet timestamp falls within the [start, end) of a talk
# unfortunately I couldn't get it to work with datetime objects
# so I resort to determining the cartesian product and simply filtering the relevant records
tweets_and_talks <- conference_tweets %>% 
  mutate(dummy = 1) %>% 
  left_join(earl_agenda %>% mutate(dummy = 1)) %>% 
  filter(date >= StartTime, date < EndTime)

tweets_and_talks %>% 
  select(username, date, shorttext, Title, Presenter) %>% 
  tail() %>% 
  knitr::kable()
username date shorttext Title Presenter
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL R’s role in Data Science Joe Cheng
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R Large-scale reproducible simulation pipelines in R using Docker Mike Gahan

You ever have that feeling that you’re forgetting something and then you’re at the airport without your passport? From the above table it’s obvious I’ve forgotten that talks are organised in parallel. So matching on time only will create duplicates. However, you may notice that some tweets also mention the presenter (that is considered good tweetiquette). We can use that information to further improve the matching.

talks_and_tweets <- tweets_and_talks %>% 
  # calculate various scores based on what is said in the tweet text
  mutate(presenter_score = ifelse(!is.na(mentions) & !is.na(TwitterHandle), stringr::str_detect(mentions, TwitterHandle), 0),
         # check if the presenter's name is mentioned
         presenter_score2 = stringr::str_detect(text, Presenter),
         # check if the company name is mentioned
         company_score = stringr::str_detect(text, Company),
         # check if what is mentioned has any overlap with the title (description would've been better)
         # stringsim() comes from the stringdist package
         overall_score = stringdist::stringsim(text, Title),
         # sum all the scores
         score = overall_score + presenter_score + presenter_score2 + company_score) %>% 
  select(-presenter_score, -presenter_score2, -company_score, -overall_score) %>% 
  # now select the highest scoring match
  group_by(username, date) %>% 
  top_n(1, score) %>% 
  ungroup()

talks_and_tweets %>% 
  select(username, date, shorttext, Title, Presenter) %>% 
  tail() %>% 
  knitr::kable()
username date shorttext Title Presenter
Madhuraraju 2017-06-06 11:39:00 @aj2z @gdequeiroz from @SelfScore talking about u ‘Full Stack’ Data Science with R: production data science and engineering with open source tools Gabriela de Queiroz
hspter 2017-06-06 11:22:00 “#rstats is great for achieving”flow” while doing R’s role in Data Science Joe Cheng
RLadiesGlobal 2017-06-06 11:20:00 @RStudioJoe showing the #RLadies logo and a big m R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R R’s role in Data Science Joe Cheng

That looks better, but I am disappointed by the number of tweets (263) during talks. Maybe attendees are too busy listening to the talks to tweet, which is a good thing, I suppose. Nevertheless, I can still try to create some interesting visualisations with this data.

tweets_by_presenter <- talks_and_tweets %>% 
  count(conference, Title, Presenter) %>% 
  ungroup() %>% 
  arrange(conference, n)

tweets_by_presenter$Presenter <- factor(tweets_by_presenter$Presenter, levels = tweets_by_presenter$Presenter)
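
A chart along these lines can be produced with something like the following (a sketch based on tweets_by_presenter, not necessarily the original plotting code):

ggplot(tweets_by_presenter, aes(x = n, y = Presenter)) +
  # horizontal bars, one facet per conference (ggplot2 >= 3.3)
  geom_col(aes(fill = conference)) +
  facet_wrap(~ conference, scales = "free_y") +
  theme(legend.position = "none") +
  labs(x = 'Number of tweets during talk', y = NULL)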

 

The visualisation doesn’t really work for the large number of presenters although I don’t really see another way to add the information about a talk. I also tried to sort the levels of the factor so they appear sorted in the plot but for some reason the SF facet doesn’t want to cooperate. There are a number of talks vying for the top spot in San Francisco but the differences aren’t that large. I’m of course assuming my matching heuristic worked perfectly but one or two mismatches and the results could look completely different. The same applies to EARL London but here Joe Cheng clearly takes the crown.

Follow the leader…

Let’s go down a somewhat more creepy road and see what talks people go to.

tweeters <- talks_and_tweets %>% 
  group_by(username) %>% 
  mutate(num_tweets = n()) %>% 
  ungroup() %>% 
  filter(num_tweets > 4) %>% 
  mutate(day = ifelse(conference == "SF", (Session > 6) + 1, (Session > 9) + 1),
         day = ifelse(day == 1, "Day 1", "Day 2")) %>% 
  select(username, conference, StartTime, Stream, day)

Each line is a twitter user (twitterer? tweeter? tweep?) and each observation represents a tweet during a presentation. My expectation was that by drawing a line between the observations you could see how people switch (or don’t switch) between talks. That has clearly failed as the tweeting behaviour isn’t consistent or numerous enough to actually see that. I’m quite glad it’s not possible since tracking people isn’t what Twitter is for.
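
A plot of this kind can be sketched from the tweeters data frame with something like the following (again, not necessarily the original plotting code):

ggplot(tweeters, aes(x = StartTime, y = Stream, group = username, colour = username)) +
  # one line per Twitter user, tracing which stream they tweeted from over time
  geom_line(alpha = 0.5) +
  geom_point() +
  facet_wrap(conference ~ day, scales = "free") +
  theme(legend.position = "none")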

The code and data for this blogpost are available on GitHub so feel free to play around with it yourself. Do let us know if you create any awesome visualisations or if we can improve on any of the above. If you also want to tweet at conferences, EARL Boston is happening on 1-3 November and tickets are still available. I promise we won’t track you!

Putting the cat in scatterplot
Clara Schartner, Data Scientist

It will come as no surprise that cats and ggplot are among our favourite things here at Mango, luckily there is an easy way to combine both.

Using the function annotation_custom in the popular ggplot2 package, it is possible to display images on a plot, for example as the points of a scatterplot. This way data can be displayed in a more fun, creative way.

In keeping with the cat theme I have chosen a data set about cats and a cat icon based on Mango the cat. The MASS package provides a data set called cats which contains the body weight, heart weight and sex of adult cats.

library(MASS)
data(cats)
head(cats)
set.seed(1234)
cats <- cats[sample(1:144, size = 40),]

First a normal scatterplot is defined on which the images will be plotted later:

library(ggplot2)
sCATter <- ggplot(data = cats, aes(x = Bwt, y = Hwt)) +
  geom_point(size = 0, aes(group = Sex, colour = Sex)) +
  theme_classic() +
  xlab("Body weight") +
  ylab("Heart weight") +
  ggtitle("sCATterplot") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # create a legend
  scale_color_manual(
    values = c("#999999", "#b35900"),
    name = "Cat",
    labels = c("Male cat", "Female cat")
  ) +
  guides(colour = guide_legend(override.aes = list(size = 10)))

Any PNG image can be used for the plot; however, images with a transparent background are preferable.

library(png)
library(grid)
mCat <- readPNG("MaleCat.png")
feCat<- readPNG("FemaleCat.png")

In the last step the cats are iteratively plotted onto the plot using annotation_custom.

for (i in 1:nrow(cats)) {
  # distinguishing the sex of the cat
  if (cats$Sex[i] == "F") {
    image <- feCat
  } else {
    image <- mCat
  }
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(image),
      xmin = cats$Bwt[i] - 0.6,
      xmax = cats$Bwt[i] + 0.6,
      ymin = cats$Hwt[i] - 0.6,
      ymax = cats$Hwt[i] + 0.6
    )
}

The cat’s paw trail displays a linear regression of heart weight on body weight. This can easily be added by computing a linear regression, defining a grid on which to calculate the expected values, and plotting paw prints on top of this data.

LmCat <- lm(Hwt ~ Bwt, data = cats)

steps <- 20
Reg <- data.frame(Bwt = seq(from = min(cats$Bwt), 
                            to = max(cats$Bwt), 
                            length.out = steps))
Reg$Hwt <- predict(LmCat, newdata = Reg)
sCATter <- sCATter + 
  geom_point(data = Reg, aes(Bwt, Hwt), size = 0)

paw <- readPNG("paw.png")
for (i in 1:nrow(Reg)) {
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(paw),
      xmin = Reg$Bwt[i] - 0.6,
      xmax = Reg$Bwt[i] + 0.6,
      ymin = Reg$Hwt[i] - 0.6,
      ymax = Reg$Hwt[i] + 0.6
    )
}
sCATter

I hope you have as much fun with the ggplot2 package as I did!


We have been working within the pharmaceutical sector for over a decade. Our expertise, knowledge of the industry, and presence in the R community mean we are used to delivering services within a GxP environment and advising on best practice.

We are excited to be at three great events in the coming months. Find us at:

• 15th Annual Pharmaceutical IT Congress, 27-28 September – London, England
• Pharmaceutical Users Software Exchange (PhUSE), 8-11 October 2017 – Edinburgh, Scotland
• American Conference on Pharmacometrics (ACoP8), 15-18 October – Fort Lauderdale, USA

Our dedicated Pharma team will be at the events to address any of your questions or concerns around data science and using R within your organisation.

How Mango can help with your Data Science needs

A validated version of R

Because the use of R is growing in the pharmaceutical sector, a validated version of R is one of the enquiries we receive most often at Mango, so we’d love to talk to you about how you can use it in your organisation.

We know that a major concern for using R within the Pharma sector is its open source nature, especially when using R for regulatory submissions.

R contains many capabilities specifically aimed at helping users perform their day-to-day activities, but given the concerns over meeting compliance, some companies are understandably hesitant to make the move.

To eliminate risk, we’ve created ValidR – a series of scripts and services that deliver a validated build of R to an organisation.

For each validated package, we apply our ISO9001 accredited software quality process of identifying requirements, performing code review, testing that the requirements have been met and installing the application in a controlled and reproducible manner. We have helped major organisations adopt R in compliance with FDA 21 CFR Part 11 guidelines on open source software.

Consultancy

We have helped our clients adopt or migrate to R by providing a range of consultancy services from our unique mix of Mango Data Scientists who have both extensive technical and real-world experience.

Our team of consultants have been deployed globally on projects including SAS-to-R migration, Shiny application development, script validation and much more. Our team also provides premier R training, with courses designed specifically for the pharmaceutical sector.

Products

Organisations today are not only looking at how they can validate their R code, but also at how that information is retained, shared and stored across teams globally. Our dedicated validation team includes a specialised mix of software developers who build rich analytic web and desktop applications using technologies such as Java, .NET and JavaScript.

Our applications, ModSpace and Navigator, have been deployed within pharma organisations globally. Both help organisations maintain best practice and achieve a ‘validated working environment’.

Why Mango?

All of our work, including the support of open source software such as R, is governed by our Quality Management System, which is regularly audited by Pharmaceutical companies in order to ensure compliance with industry best practices and regulatory guidelines.

Make sure you stop by our stand and talk to us about how we can help you make the most of your data!