
In case you hadn’t spotted it already, a major new release of ggplot2 hit CRAN last week. Among the many new features, the one that caught my attention was support for simple features. Or in other words, it’s now really easy to plot spatial data in ggplot2.

On Friday afternoon I spent a couple of hours playing with this new functionality, as well as with mapview, a package I had never come across before. This blog is a quick walkthrough to show you how simple spatial visualisation in R really is.

In case you want to follow along, the full list of packages used is:

library(tidyverse)
library(ggplot2)
library(sf)
library(mapview)
library(rvest)
library(magrittr)

The Data

Whenever I start to play with new functions I always try to use my own data. In this case, thanks to a recent project, I happened to have a shape file for Italy to hand (You can download it for yourself from Istat).

Initially, I just plotted the region level data with random colours (as you may have already seen on Twitter) but to reflect my usual workflow I found some data on the amount of wine produced by region – it was Friday afternoon after all!

I won’t go into all the detail of the web scraping and data manipulation, but the code is included throughout so that you can replicate this example.

wine <- read_html("https://italianwinecentral.com/wine-production-in-italy-by-region/")
wine %<>% 
  html_table() %>% 
  extract2(1) %>%   # take the first table on the page
  slice(-21)        # drop the 21st row so only the 20 regions remain
wine[,-1] %<>% map_df(parse_number)   # convert the year columns from text to numbers

This gave me an interesting dataset to play with that wasn’t actually too dissimilar to what I often end up trying to plot i.e. change over time.

head(wine)
  Region                 2012  2013  2014  2015  2016
1 Abruzzo                2443  2728  2273  2936  2937
2 Basilicata              189   178   102    87    93
3 Calabria                400   370   314   404   391
4 Campania               1542  1644  1183  1614  1286
5 Emilia Romagna         6273  7396  6958  6752  7039
6 Friuli-Venezia Giulia  1281  1073  1367  1872  1856

Importing Shape Files

Now for the interesting part.

To import the spatial data I used the sf package. This package makes reading the data very simple and stores it in a way that is much easier to reason about, especially for data that describes regions.

To import the shape file we can use the st_read function, in this case pointing to the file location, although you can also point to a database. Once the data is imported we have a single row for each region. The data includes a geometry column, which is where the spatial information for each feature is stored. In this case the geometry column stores (as a list-column) all of the information defining the polygon of each region, but it could equally hold a single point or a whole host of other geometry types. Check out the vignettes for more details.

italy <- st_read("./Limiti_2016_ED50_g/Reg2016_ED50_g/Reg2016_ED50_g.shp")
head(italy)
  COD_REG REGIONE              SHAPE_Leng  SHAPE_Area   geometry
1       1 Piemonte              1235685.5  25394259406  <S3: sfc_MULTIPOLYGON>
2       2 Valle D’Aosta          311141.5   3258954558  <S3: sfc_MULTIPOLYGON>
3       3 Lombardia             1410619.8  23862504702  <S3: sfc_MULTIPOLYGON>
4       5 Veneto                1058657.7  18406022144  <S3: sfc_MULTIPOLYGON>
5       4 Trentino-Alto Adige    800899.4  13607742436  <S3: sfc_MULTIPOLYGON>
6       7 Liguria                825474.4   5415107538  <S3: sfc_MULTIPOLYGON>
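
If you want a quick look at the spatial metadata before plotting, sf provides a few simple helpers. A minimal sketch (standard sf functions, not part of the original post):

st_crs(italy)            # the coordinate reference system read from the shape file
st_geometry_type(italy)  # the geometry type of each feature (MULTIPOLYGON here)
st_bbox(italy)           # the bounding box covering all of the regions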

As this is just a simple data frame, it can be manipulated in the same way as any other data frame. In this case that includes a small change to the region names to make them match exactly, and a join so that I have everything I want in one data set, ready to start plotting.

# rename the region levels so they match the names used in the wine data
levels(italy$REGIONE) <- wine$Region
WineLocation <- full_join(italy, wine, by = c("REGIONE" = "Region"))

New Features of ggplot2

Now for the new features of ggplot2. There are a lot of them this time around; many relate to the specification of variables for programming purposes, but I was most interested in how easy it would be to plot my Italy data. It turns out the answer is: extremely easy.

Point to the data, tell it which column to use as the fill and add the new geom_sf layer. Just to try it out, I have also added the new viridis scale. But that is all I needed to do. All of the projections have been managed for me, the colours look much more impressive than the defaults, and the axis labels and titles have been handled the way I would want them. Everything just worked without me having to do any thinking.

ggplot(data = WineLocation, aes(fill = `2016`)) + 
  geom_sf() + 
  scale_fill_viridis_c()

Taking this even further, I happened to pick a data set with multiple years in it. A little manipulation to get the data into a longer format, plus one additional line in my plotting code, and I have facetted maps. Again, it wasn't strictly necessary, but I have tried out the new vars() specification for facetting. Don't worry, the formula interface is still there, and vars() is one of the features that will make programming with ggplot2 much easier. But this alternative is certainly very simple and may even be easier for some ggplot2 novices.

wineLong <- gather(wine, Year, Quantity, -Region)
fullWineLong <- full_join(italy, wineLong, by = c("REGIONE" = "Region"))

ggplot(data = fullWineLong, aes(fill = Quantity)) + 
  geom_sf() + 
  facet_wrap(vars(Year)) +
  scale_fill_viridis_c() 
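
For comparison, the familiar formula interface produces exactly the same plot, so the following is interchangeable with the vars() version above:

ggplot(data = fullWineLong, aes(fill = Quantity)) + 
  geom_sf() + 
  facet_wrap(~ Year) +
  scale_fill_viridis_c()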

I tried this with all the shape files I could get my hands on and I am pleased to report it was just as easy. As it's ggplot2, I was able to add layers as I wanted them, including additional shape files, and all the projections were handled for me – although plotting the US and Italy on one map did look a little odd by default!
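
The layering itself is just repeated geom_sf() calls. A minimal sketch, assuming a second sf object has been read in (here called usa purely for illustration):

usa <- st_read("path/to/us_states.shp")   # hypothetical second shape file
ggplot() +
  geom_sf(data = italy) +
  geom_sf(data = usa, fill = "grey80")
# geom_sf adds coord_sf() automatically, which transforms all layers to a common CRS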

A quick look at mapview

Whilst I was reviewing the vignettes for sf I came across an example of using mapview. Despite this package having been on CRAN for a few years I had never come across it before, but I am very glad I have now.

In a nutshell, it is a wrapper around leaflet. The good thing is that it works very easily with simple features data. There are a huge number of options, but to recreate an interactive version of my first map I simply needed to define the data, the column that defines the colouring (zcol) and, as an extra, the column that provides the labels shown when hovering over the map – in this case the region names.

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE)

You can take a look at the map created on RPubs.

There are a huge number of options to mapview and I certainly haven’t tried them all out yet, but some of the notable features include:

  • Adding additional shape files by simply adding two mapview objects together (see the sketch after this list)
  • Adding leaflet layers, using the usual leaflet functions
  • Including multiple layers of data
  • Defining the pop-up functionality
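
The first of these really is that simple. A minimal sketch, assuming a second sf object (here called lakes, purely for illustration) has been read in:

m1 <- mapview(italy)
m2 <- mapview(lakes)
m1 + m2   # both layers are shown on the same interactive map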

This last one is particularly interesting as you can not only pop-up tables of data but also graphics. As an example, suppose that we want to be able to click on the region of interest and see the change in wine production over time. All we have to do is create a list object that includes all of the graphics that we need (in the same order as the data) and pass this to the popup option using the popupGraph function.

plotByRegion <- function(region) {
  fullWineLong %>%
    filter(REGIONE == region) %>%
    ggplot() +
    geom_line(aes(Year, Quantity, group = REGIONE)) +
    labs(title = region)
}

p <- lapply(WineLocation$REGIONE, plotByRegion)

fullWineLong %>% 
  filter(Year == 2016) %>%
  mapview(zcol = "Quantity", label = .$REGIONE, 
          popup = popupGraph(p))

The documentation for mapview does state that it is aimed at quick visualisation for exploratory purposes rather than for production-ready graphics and it really suited that purpose for me. Personally, I can see my usage becoming a combination of both mapview and leaflet for interactive graphics and ggplot2 for static.

Right now I am very happy that I have an incredibly quick and easy workflow for getting my spatial data into R and visualising it without having to worry about map projections and little details that distract me from analysis.


Adnan Fiaz

With two out of three EARL conferences now part of R history we’re really excited about the next EARL conference in Boston (only 1 week away!). This calls for an(other) EARL conference analysis, this time with Twitter data. Twitter is an amazingly rich data source and a great starting point for any data analysis (I feel there should be an awesome-twitter-blogposts list somewhere).

I was planning on using the wonderful rtweet package by Michael Kearney (as advertised by Bob Rudis), but unfortunately the Twitter API doesn’t provide a full history of tweets. Instead I had to resort to a Python package (gasp) called GetOldTweets. I strongly recommend trying the official Twitter API first before going down this path.
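
For anyone who only needs recent tweets, the rtweet route is much more convenient; a sketch (search_tweets only reaches back roughly a week, which is exactly why it didn't work here):

library(rtweet)
earl_tweets <- search_tweets("#EARLConf2017 OR #EARLConf OR #EARL2017", 
                             n = 5000, include_rts = FALSE)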

The Data

# I have used the Exporter script with the hashtags #EARLConf2017, #EARLConf and #EARL2017
tweets_df <- purrr::map_df(list.files('data/tweets', full.names = TRUE), 
                           ~ readr::read_delim(.x, delim = ";", quote = "")) %>% 
  # filter out company accounts
  filter(username != "earlconf", username != "MangoTheCat") %>% 
  mutate(shorttext = stringr::str_sub(text, end = 50))

tweets_df %>% 
  select(username, date, shorttext) %>% 
  head() %>% 
  knitr::kable()
username     | date                | shorttext
AlanHoKT     | 2017-10-02 02:15:00 | “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
johnon2      | 2017-09-23 16:02:00 | “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
AndySugs     | 2017-09-21 22:19:00 | “RT: LearnRinaDay: EARL London 2017 ? That?s a wra
LearnRinaDay | 2017-09-21 22:17:00 | “EARL London 2017 ? That?s a wrap! https://www. r-
LouBajuk     | 2017-09-20 23:15:00 | “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
pjevrard     | 2017-09-20 13:02:00 | “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o

First things first, let’s get a timeline up:

[Plot: timeline of EARL-related tweets]

The hashtags I used to search tweets were generic so the results include tweets from last year’s conferences. Let’s zoom in on this year’s conferences: EARL San Francisco (5-7 June) and EARL London (12-14 September). They clearly explain the large peaks in the above graph.
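
The code for the zoomed-in, highlighted plot below isn't included in the post, but a rough sketch of the idea (not the original code) is to shade the conference windows with annotate("rect"):

tweets_df %>% 
  filter(date > "2017-05-01") %>% 
  mutate(day = lubridate::date(date)) %>% 
  count(day) %>% 
  ggplot(aes(x = day, y = n)) +
  geom_line() +
  annotate("rect", xmin = as.Date("2017-06-05"), xmax = as.Date("2017-06-08"),
           ymin = -Inf, ymax = Inf, alpha = 0.2) +
  annotate("rect", xmin = as.Date("2017-09-12"), xmax = as.Date("2017-09-15"),
           ymin = -Inf, ymax = Inf, alpha = 0.2)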

[Plot: tweets per day around EARL San Francisco (5-7 June) and EARL London (12-14 September), with the conference periods highlighted]

I’ve tried to highlight the period when the conferences were on but I don’t quite like the result. Let’s see if it works better with a bar chart.

earlconf_sf_dates <- lubridate::interval("2017-06-05", "2017-06-08")
earlconf_lon_dates <- lubridate::interval("2017-09-12", "2017-09-15")
tweets_df %>% 
  filter(date > "2017-05-01") %>% 
  mutate(day = lubridate::date(date)) %>% 
  count(day) %>% 
  # %within% is lubridate's interval operator
  mutate(conference = case_when(day %within% earlconf_sf_dates ~ "SF",
                                day %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>% 
  ggplot(aes(x = day, y = n)) + 
  geom_bar(stat = "identity", aes(fill = conference)) +
  scale_fill_manual(guide = FALSE, values = c("#F8766D", "black", "#619CFF")) +
  labs(x = 'Date', y = 'Number of tweets', title = 'Number of EARL-related tweets by day') +
  scale_x_date(date_breaks = "1 months", labels = scales::date_format('%b-%y')) +
  theme_classic()

[Plot: bar chart of EARL-related tweets by day, with conference days coloured]

Now that’s a lot better. The tweet counts in black surrounding the conferences look like small buildings which make the conference tweet counts look like giant skyscrapers (I was a failed art critic in a previous life).

Activity during conferences

I’ve been to my fair share of conferences/presentations and I’ve always wondered how people tweet so fast during a talk. It could just be my ancient phone, or I may lack the necessary skills. Either way, it would be interesting to analyse the tweets at the talk level. First I will need to link the tweets to specific talks. I’ve translated the published agenda into a nicer format by hand and read it in below.

earl_agenda <- map_df(c("EARL_SF", "EARL_LON"), 
                      ~ readxl::read_xlsx('data/earl_agenda.xlsx', sheet = .x))
earl_agenda %>% 
  select(StartTime, EndTime, Title, Presenter) %>% 
  head() %>% 
  knitr::kable()
StartTime | EndTime | Title | Presenter
2017-06-06 11:00:00 | 2017-06-06 11:30:00 | R?s role in Data Science | Joe Cheng
2017-06-06 11:30:00 | 2017-06-06 12:00:00 | ‘Full Stack’ Data Science with R: production data science and engineering with open source tools | Gabriela de Queiroz
2017-06-06 12:00:00 | 2017-06-06 12:30:00 | R Operating Model | Mark Sellors
2017-06-06 11:00:00 | 2017-06-06 11:30:00 | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan
2017-06-06 11:30:00 | 2017-06-06 12:00:00 | Using data to identify risky prescribing habits in physicians | Aaron Hamming
2017-06-06 12:00:00 | 2017-06-06 12:30:00 | How we built a Shiny App for 700 users | Filip Stachura

Before I merge the tweets with the agenda it’s a good idea to zoom in on the conference tweets (who doesn’t like a facetted plot?).

conference_tweets <- tweets_df %>% 
  mutate(conference = case_when(date %within% earlconf_sf_dates ~ "SF",
                                date %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>% 
  filter(conference != "NONE")

ggplot(conference_tweets, aes(x = date)) +
  geom_histogram() +
  facet_wrap(~ conference, scales = 'free_x')

[Plot: histograms of tweet times, facetted by conference]

Nothing odd in the pattern of tweets: there are no talks on the first day so barely any tweets; the number of tweets spikes at the beginning of the other two days and then declines as the day progresses. There is something odd about the timing of the tweets though. I didn’t notice it before, but when I compared the position of the bars on the x-axis the San Francisco tweets looked shifted. And then my lack of travel experience hit me: time zones! The tweets were recorded in UTC time but the talks obviously weren’t in the evening in San Francisco.

After correcting for time zones I can finally merge the tweets with the agenda.

selection <- conference_tweets$conference == 'SF'
conference_tweets[selection, 'date'] <- conference_tweets[selection, 'date'] - 8*60*60
# I intended to use a fuzzy join here and check if the tweet timestamp falls within the [start, end) of a talk
# unfortunately I couldn't get it to work with datetime objects
# so I resort to determining the cartesian product and simply filtering the relevant records
tweets_and_talks <- conference_tweets %>% 
  mutate(dummy = 1) %>% 
  left_join(earl_agenda %>% mutate(dummy = 1)) %>% 
  filter(date >= StartTime, date < EndTime)

tweets_and_talks %>% 
  select(username, date, shorttext, Title, Presenter) %>% 
  tail() %>% 
  knitr::kable()
username | date | shorttext | Title | Presenter
hspter | 2017-06-06 11:17:00 | “Nice shout out to @rOpenSci as prodigious package | R?s role in Data Science | Joe Cheng
hspter | 2017-06-06 11:17:00 | “Nice shout out to @rOpenSci as prodigious package | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan
RLadiesGlobal | 2017-06-06 11:14:00 | “#RLadies @b23kellytalking about #rstats at #EARL | R?s role in Data Science | Joe Cheng
RLadiesGlobal | 2017-06-06 11:14:00 | “#RLadies @b23kellytalking about #rstats at #EARL | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan
hspter | 2017-06-06 11:14:00 | “I’m digging the postmodern data scientist from @R | R?s role in Data Science | Joe Cheng
hspter | 2017-06-06 11:14:00 | “I’m digging the postmodern data scientist from @R | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan

Ever had that feeling that you’re forgetting something, and then found yourself at the airport without your passport? From the above table it’s obvious I’ve forgotten that talks are organised in parallel, so matching on time only will create duplicates. However, you may notice that some tweets also mention the presenter (that is considered good tweetiquette). We can use that information to further improve the matching.

talks_and_tweets <- tweets_and_talks %>% 
  # calculate various scores based on what is said in the tweet text
  mutate(presenter_score = ifelse(!is.na(mentions) & !is.na(TwitterHandle), stringr::str_detect(mentions, TwitterHandle), 0),
         # check if the presenter's name is mentioned
         presenter_score2 = stringr::str_detect(text, Presenter),
         # check if the company name is mentioned
         company_score = stringr::str_detect(text, Company),
         # check if what is mentioned has any overlap with the title (description would've been better)
         overall_score = stringdist::stringsim(text, Title),
         # sum all the scores
         score = overall_score + presenter_score + presenter_score2 + company_score) %>% 
  select(-presenter_score, -presenter_score2, -company_score, -overall_score) %>% 
  # now select the highest scoring match
  group_by(username, date) %>% 
  top_n(1, score) %>% 
  ungroup()

talks_and_tweets %>% 
  select(username, date, shorttext, Title, Presenter) %>% 
  tail() %>% 
  knitr::kable()
username | date | shorttext | Title | Presenter
Madhuraraju | 2017-06-06 11:39:00 | @aj2z @gdequeiroz from @SelfScore talking about u | ‘Full Stack’ Data Science with R: production data science and engineering with open source tools | Gabriela de Queiroz
hspter | 2017-06-06 11:22:00 | “#rstats is great for achieving”flow” while doing | R?s role in Data Science | Joe Cheng
RLadiesGlobal | 2017-06-06 11:20:00 | @RStudioJoe showing the #RLadies logo and a big m | R?s role in Data Science | Joe Cheng
hspter | 2017-06-06 11:17:00 | “Nice shout out to @rOpenSci as prodigious package | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan
RLadiesGlobal | 2017-06-06 11:14:00 | “#RLadies @b23kellytalking about #rstats at #EARL | Large-scale reproducible simulation pipelines in R using Docker | Mike Gahan
hspter | 2017-06-06 11:14:00 | “I’m digging the postmodern data scientist from @R | R?s role in Data Science | Joe Cheng

That looks better, but I am disappointed by the number of tweets (263) during talks. Maybe attendees are too busy listening to the talk instead of tweeting, which is a good thing, I suppose. Nevertheless, I can still try to create some interesting visualisations with this data.

tweets_by_presenter <- talks_and_tweets %>% 
  count(conference, Title, Presenter) %>% 
  ungroup() %>% 
  arrange(conference, n)

tweets_by_presenter$Presenter <- factor(tweets_by_presenter$Presenter, levels = tweets_by_presenter$Presenter)
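
The plotting code itself isn't shown in the post; my guess at the kind of chart described below (a sketch, not the original code) would be something like:

ggplot(tweets_by_presenter, aes(x = Presenter, y = n)) +
  geom_col() +
  facet_wrap(~ conference, scales = "free_x") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
  labs(x = NULL, y = "Number of tweets during the talk")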

[Plot: number of tweets per talk, facetted by conference]

The visualisation doesn’t really work for the large number of presenters, although I don’t really see another way to add the information about a talk. I also tried to sort the levels of the factor so they appear sorted in the plot, but the SF facet doesn’t want to cooperate (the factor levels are shared across facets, so a single ordering can’t be right for both conferences). There are a number of talks vying for the top spot in San Francisco but the differences aren’t that large. I’m of course assuming my matching heuristic worked perfectly, but one or two mismatches and the results could look completely different. The same applies to EARL London, but here Joe Cheng clearly takes the crown.

Follow the leader…

Let’s go down a somewhat more creepy road and see what talks people go to.

tweeters <- talks_and_tweets %>% 
  group_by(username) %>% 
  mutate(num_tweets = n()) %>% 
  ungroup() %>% 
  filter(num_tweets > 4) %>% 
  mutate(day = ifelse(conference == "SF", (Session > 6) + 1, (Session > 9) + 1),
         day = ifelse(day == 1, "Day 1", "Day 2")) %>% 
  select(username, conference, StartTime, Stream, day)
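
The resulting plot isn't reproduced here; a sketch of the idea described below (one point per tweet, one line per user, not the original code) could look like:

ggplot(tweeters, aes(x = StartTime, y = Stream, group = username, colour = username)) +
  geom_point() +
  geom_line() +
  facet_wrap(~ conference + day, scales = "free_x") +
  theme(legend.position = "none")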

Each line is a twitter user (twitterer? tweeter? tweep?) and each observation represents a tweet during a presentation. My expectation was that by drawing a line between the observations you could see how people switch (or don’t switch) between talks. That has clearly failed as the tweeting behaviour isn’t consistent or numerous enough to actually see that. I’m quite glad it’s not possible since tracking people isn’t what Twitter is for.

The code and data for this blogpost are available on GitHub so feel free to play around with it yourself. Do let us know if you create any awesome visualisations or if we can improve on any of the above. If you also want to tweet at conferences, EARL Boston is happening on 1-3 November and tickets are still available. I promise we won’t track you!

Putting the cat in scatterplot
Clara Schartner, Data Scientist

It will come as no surprise that cats and ggplot are among our favourite things here at Mango, luckily there is an easy way to combine both.

Using the function annotation_custom in the popular ggplot2 package, it is possible to display images on a plot, for example as the points of a scatterplot. This way data can be displayed in a more fun and creative way.

In keeping with the cat theme I have chosen a data set about cats and a cat icon based on Mango the cat. The MASS package provides a data set called cats which contains the body weight, heart weight and sex of adult cats.

library(MASS)
data(cats)
head(cats)
set.seed(1234)
# take a random sample of 40 of the 144 cats
cats <- cats[sample(1:144, size = 40), ]

First a normal scatterplot is defined on which the images will be plotted later:

library(ggplot2)
sCATter <- ggplot(data = cats, aes(x = Bwt, y = Hwt)) +
  geom_point(size = 0, aes(group = Sex, colour = Sex)) +
  theme_classic() +
  xlab("Body weight") +
  ylab("Heart weight") +
  ggtitle("sCATterplot") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # create a legend (the levels of Sex are F then M)
  scale_color_manual(
    values = c("#999999", "#b35900"),
    name = "Cat",
    labels = c("Female cat", "Male cat")
  ) +
  guides(colour = guide_legend(override.aes = list(size = 10)))

Any PNG image can be used for the plot; however, images with a transparent background are preferable.

library(png)
library(grid)
mCat <- readPNG("MaleCat.png")
feCat<- readPNG("FemaleCat.png")

In the last step the cat icons are added to the plot one by one using annotation_custom.

for (i in 1:nrow(cats)) {
  # distinguishing the sex of the cat
  if (cats$Sex[i] == "F") {
    image <- feCat
  } else {
    image <- mCat
  }
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(image),
      xmin = cats$Bwt[i] - 0.6,
      xmax = cats$Bwt[i] + 0.6,
      ymin = cats$Hwt[i] - 0.6,
      ymax = cats$Hwt[i] + 0.6
    )
}

The cat's paw trail displays a linear regression of heart weight on body weight. This can easily be added by fitting a linear regression, defining a grid over which to calculate the predicted values, and plotting paw prints on top of this data.

LmCat <- lm(Hwt ~ Bwt, data = cats)

steps <- 20
Reg <- data.frame(Bwt = seq(from = min(cats$Bwt), 
                            to = max(cats$Bwt), 
                            length.out = steps))
Reg$Hwt <- predict(LmCat, newdata = Reg)
sCATter <- sCATter + 
  geom_point(data = Reg, aes(Bwt, Hwt), size = 0)

paw <- readPNG("paw.png")
for (i in 1:nrow(Reg)) {
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(paw),
      xmin = Reg$Bwt[i] - 0.6,
      xmax = Reg$Bwt[i] + 0.6,
      ymin = Reg$Hwt[i] - 0.6,
      ymax = Reg$Hwt[i] + 0.6
    )
}
sCATter

I hope you have as much fun with this ggplot2 trick as I did!