Love Machine: Automating the romantic songwriting process
Owen Jones, Placement Student

Songwriting is a very mysterious process. It feels like creating something from nothing. It’s something I don’t feel like I really control.

— Tracy Chapman

It is February. The shortest, coldest, wettest, miserablest month of the British year.

Only two things happen in Britain during February. For a single evening, the people refrain from dipping all their food in batter and deep-frying it, and instead save some time by pouring the batter straight into a frying pan and eating it by itself; and for an entire day, the exchange of modest indications of affection between consenting adults is permitted, although the government advises against significant deviation from the actions specified in the state-issued Approved Romantic Gestures Handbook.

In Section 8.4 (Guidelines for Pre-Marital Communication) the following suggestion is made:

"Written expressions of emotion should be avoided where possible. Should it become absolutely necessary to express emotion in a written format, it should be limited to a 'popular' form of romantic lyricism. Examples of such 'popular' forms include 'love poem' and 'love song'.

Thankfully, for those who have not achieved at least a master’s degree in a related field, writing a poem or song is a virtually impossible task. And following the sustained and highly successful effort to persuade the British youth that a career in the arts is a fast-track to unemployment, the number of applications to study non-STEM subjects at British universities has been falling consistently since the turn of the decade. This ensures that only the very best and most talented songwriters, producing the most creatively ingenious work, are able to achieve widespread recognition, and therefore the British public are only exposed to high-quality creative influences.

But to us scientists, the lack of method is disturbing. This “creativity” must have a rational explanation. There must be some pattern.

This is unquestionably a problem which can be solved by machine learning, so let’s take the most obvious approach we can: we’ll train a recurrent neural network to generate song lyrics character by character.

You write down a paragraph or two describing several different subjects creating a kind of story ingredients-list, I suppose, and then cut the sentences into four or five-word sections; mix ’em up and reconnect them. You can get some pretty interesting idea combinations like this. You can use them as is or, if you have a craven need to not lose control, bounce off these ideas and write whole new sections.

— David Bowie

To build our neural network I’m going to be using the Keras machine learning interface (which we’re very excited about here at Mango right now – keep an eye out for workshops in the near future!). I’ve largely followed the steps in this example from the Keras for R website, and I’m going to stick to a high-level description of what’s going on, but if you’re the sort of person who would rather dive head-first into the code, don’t feel like you have to hang around here – go ahead and have a play! And if you want to read more about RNNs, this excellent post by Andrej Karpathy is at least as entertaining and significantly more informative than the one you’re currently reading.

We start by scraping as many love song lyrics as possible from the web – these will form our training material. Here’s the sort of thing we’re talking about:

Well… that’s how they look to us. Actually, after a bit of preprocessing, the computer sees something more like this:

All line breaks are represented by the pair of characters “\n”, and so all the lyrics from all the songs are squashed down into one big long string.

Then we use this string to train the network. We show the network a section of the string, and tell it what comes next.

So the network gradually learns which characters tend to follow a given fixed-length “sentence”. The more of these what-comes-next examples it sees, the better it gets at correctly guessing what should follow any sentence we feed in.
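To make that a little more concrete, here is a minimal sketch of how those what-comes-next training examples can be built. It loosely follows the Keras for R example’s approach of slicing the big lyrics string every few characters; the two-line lyrics string here is made up purely for illustration.

# Toy illustration only: the real input is the full scraped-lyrics string.
lyrics <- paste("I just called to say I love you",
                "I just called to say how much I care", sep = "\n")

maxlen <- 20  # length of each fixed-length "sentence" shown to the network
starts <- seq(1, nchar(lyrics) - maxlen, by = 3)

sentences  <- substring(lyrics, starts, starts + maxlen - 1)       # what the network sees
next_chars <- substring(lyrics, starts + maxlen, starts + maxlen)  # what it must guess

head(data.frame(sentence = sentences, next_char = next_chars))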

At this point, our network is like a loyal student of a great artist, dutifully copying every brushstroke in minuscule detail and receiving a slap on the wrist and a barked correction every time it slips up. Via this process it appears to have done two things.

Firstly, it seems to have developed an understanding of the “rules” of writing a song. These rules are complex and multi-levelled; the network first had to learn the rules of English spelling and grammar, before it could start to make decisions about when to move to a new line or which rhyming pattern to use.

(Of course, it hasn’t actually “developed an understanding” of these rules. It has no idea what a “word” is, or a “new line”. It just knows that every few characters it should guess " ", and then sometimes it should put in a "\", and whenever it puts in a "\" then it’s got to follow that up with a "n" and then immediately a capital letter. Easy peasy.)

Secondly, and in exactly the same way, the network will have picked up some of the style of the work it is copying. If we were training it on the songs of one specific artist, it would have learned to imitate the style of that particular artist – but we’ve gone one better than that and trained it on all the love songs we could find. So effectively, it’s learned how everyone else writes love songs.

But no-one gets famous by writing songs which have already been written. What we need now is some creativity, some passion, a little bit of je ne sais quoi.

Let’s stop telling our network what comes next. Let’s give it the freedom to write whatever it likes.

I don’t think you can ever do your best. Doing your best is a process of trying to do your best.

— Townes van Zandt

It’s interesting to look at the songwriting attempts of the network in the very early stages of training. At first, it is guessing more or less at random what character should come next, so we end up with semi-structured gobbledegook:

fameliawmalYaws. Boflyi, methabeethirts yt
play3mppioty2=ytrnfuunuiYs blllstis
Byyovcecrowth andtpazo's youltpuduc,s Ijd"a]bemob8b>fiume,;Co
Bliovlkfrenuyione (ju'te,'ve ru t Kis
go arLUUs,k'CaufkfR )s'xCvectdvoldes

4So
Avanrvous Ist'dyMe Dolriri

But notice that even in that example, which was taken from a very early training stage, the network has already nailed the “\n” newline combo and has even started to pick up on other consistent structural patterns like closing a “(” with a “)”. Actually, the jumbled nonsense becomes coherent English (or English-esque) ramblings quite quickly.

There is one interesting parameter to adjust when we ask the model to produce some output: the “diversity” parameter, which determines how adventurous the network should be in its choice of character. The higher we set this parameter, the more the network will favour slightly-less-probable characters over the most obvious choice at each point.
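Under the hood this is the usual temperature-sampling trick used in the Keras text-generation example: the predicted probabilities are rescaled before a character is sampled. A small sketch, with an invented three-character probability vector standing in for the network’s real output:

# Rescale next-character probabilities by a "diversity" value, then sample.
sample_next_char <- function(preds, diversity = 1.0) {
  rescaled <- exp(log(preds) / diversity)
  rescaled <- rescaled / sum(rescaled)
  sample(names(preds), size = 1, prob = rescaled)
}

preds <- c(a = 0.7, b = 0.2, c = 0.1)     # made-up network output
sample_next_char(preds, diversity = 0.2)  # almost always "a"
sample_next_char(preds, diversity = 1.5)  # "b" and "c" turn up far more often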

If we set the diversity parameter too low, we often degenerate into uncontrolled bursts of la-ing:

la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
(... lots more "la"s)

But set it too high and the network decides dictionary English is too limiting.

Oh, this younan every, drock on
Scridh's tty'
Is go only ealled
You could have like the one don'm I dope
Love me
And woment while you all that
Was it statiinc. I living you must?
We dirls anythor

It’s difficult to find the right balance between syllabic repetition and progressive vocabulary, and there’s a surprisingly fine line between the two – this will probably prove to be a fruitful area for further academic research.

I think that identifying the optimal diversity parameter is probably the key to good songwriting.

Songwriting is like editing. You write down all this stuff – all this bad, stupid stuff – and then you have to get rid of everything except the very best.

— Juliana Hatfield

So, that’s what I did.

Here are some particularly “beautiful” passages taken from the huge amount of (largely poor) material the model produced. I haven’t done any editing other than to isolate a few consecutive lines at a time and, in the last few examples, to start the network off with certain sentences.

Automated love

I know your eyes in the morning sun
I feel the name of love
Love is a picked the sun
All my life I can make me wanna be with you
I just give up in your head
And I can stay that you want a life
I’ve stay the more than I do

How long will I love you
As long as there is that songs
All the things that you want to find you
I could say me true
I want to fall in love with you
I want my life
And you’re so sweet
When I see you wanted to that for you
I can see you and thing, baby
I wanna be alone

Oh yeah I tell you somethin’
I think you’ll understand
When I say that somethin’
I thought the dartion hyand
I want me way to hear
All the things what you do

Wise men say
Only fools rush in
But I can hear your love
And I don’t wanna be alone

If I should stay
I would only be in your head
I wanna know that I hope I see the sun
I want a best there for me too
I just see that I can have beautiful
So hold me to you

Wishing you a Happy Valentine’s Day! (And I don’t recommend reciting this to your loved one – they might run away.)

Blogs home

Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our data science radar.

Although visualisation of data is important in order to communicate the results of an analysis to stakeholders, it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.

The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.

Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can lead to the violation of the assumptions of the model being fit, and so decrease the accuracy and usefulness of any conclusions to be drawn later.

This point was demonstrated in a beautifully simplified way by statistician Francis Anscombe, who in 1973 designed a set of small datasets, each showing a distinct pattern of results. Whilst each of the four datasets comprising Anscombe’s Quartet has identical or near-identical means, variances, correlations between variables, and linear regression lines, the four look strikingly different when plotted – together they highlight the inadequacy of relying on simple summary statistics alone in exploratory data analysis.
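If you want to check the numbers for yourself, the quartet ships with base R as the anscombe data frame; a quick sketch (separate from the Shiny app) that reproduces the matching summary statistics:

# Anscombe's Quartet is built into R: columns x1..x4 and y1..y4.
data(anscombe)

sapply(anscombe, mean)  # near-identical means
sapply(anscombe, var)   # near-identical variances

# near-identical correlations and regression lines for each x-y pair
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))
lapply(1:4, function(i) coef(lm(reformulate(paste0("x", i), paste0("y", i)), data = anscombe)))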

The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly change between each dataset to really get an in-depth understanding of their similarities and differences.

The code for the Shiny app is available on GitHub.

EARL Seattle Keynote Speaker Announcement: Julia Silge

We’re delighted to announce that Julia Silge will be joining us on 7 November in Seattle as our Keynote speaker.

Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

We can’t wait to see what Julia has to say in November.

Submit an abstract

Abstract submissions are open for both the US Roadshow in November and London in September. If you would like to share the R successes in your organisation, you could be on the agenda in Seattle as one of our speakers alongside Julia.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November

In Between A Rock And A Conditional Join

Joining two datasets is a common action we perform in our analyses. Almost all languages have a solution for this task: R has the built-in merge function or the family of join functions in the dplyr package, SQL has the JOIN operation and Python has the merge function from the pandas package. And without a doubt these cover a variety of use cases but there’s always that one exception, that one use case that isn’t covered by the obvious way of doing things.

In my case this is to join two datasets based on a conditional statement. So instead of requiring specific columns in both datasets to be equal to each other, I am looking to compare based on something other than equality (e.g. larger than). The following example should hopefully make things clearer.

myData <- data.frame(Record = seq(5), SomeValue = c(10, 8, 14, 6, 2))
myData
##   Record SomeValue
## 1      1        10
## 2      2         8
## 3      3        14
## 4      4         6
## 5      5         2

The above dataset, myData, is the dataset to which I want to add values from the following dataset:

linkTable <- data.frame(ValueOfInterest = letters[1:3], LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))
linkTable
##   ValueOfInterest LowerBound UpperBound
## 1               a          1          3
## 2               b          4          5
## 3               c         10         16

This second dataset, linkTable, is the dataset containing the information to be added to myData. You may notice the two datasets have no columns in common. That is because I want to join the data based on the condition that SomeValue is between LowerBound and UpperBound. This may seem like an artificial (and perhaps trivial) example but just imagine SomeValue to be a date or zip code. Then imagine LowerBound and UpperBound to be bounds on a specific time period or geographical region respectively.

In Mango’s R training courses one of the most important lessons we teach our participants is that the answer is just as important as how you obtain the answer. So I’ll try to convey that here too instead of just giving you the answer.

Helping you help yourself

So the first step in finding the answer is to explore R’s comprehensive help system and documentation. Since we’re talking about joins, it’s only natural to look at the documentation of the merge function or the join functions from the dplyr package. Unfortunately, both only have the option to supply columns that are compared to each other based on equality. However, the documentation for the merge function does mention that when no columns are given, the function performs a Cartesian product. That’s just a seriously cool way of saying every row from myData is joined with every row from linkTable. It might not solve the task but it does give me the following idea:

# Attempt #1: Do a Cartesian product and then filter the relevant rows
library(dplyr)
merge(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(-LowerBound, -UpperBound)
##   Record SomeValue ValueOfInterest
## 1      5         2               a
## 2      1        10               c
## 3      3        14               c

You can do the above in dplyr as well – one possible version is sketched below. The more important question is: what is wrong with the above answer? You may notice that we’re missing records 2 and 4. That’s because these didn’t satisfy the filtering condition. If we wanted to add them back in we would have to do another join. Something that you won’t notice with these small example datasets is that a Cartesian product is an expensive operation: combining all the records of two datasets can result in an explosion of values.
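Here is that dplyr sketch (one way of doing it, not necessarily the tidiest): a cross join via a dummy key, followed by the filter and an extra join back onto myData so that records 2 and 4 are kept.

# A dplyr version of the Cartesian-product-and-filter approach.
# The right_join at the end restores the records that matched nothing.
library(dplyr)

myData %>%
  mutate(dummy = 1) %>%
  inner_join(mutate(linkTable, dummy = 1), by = "dummy") %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(Record, ValueOfInterest) %>%
  right_join(myData, by = "Record") %>%
  select(Record, SomeValue, ValueOfInterest)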

(Sometimes) a SQL is better than the original

When neither the built-in functions nor the functions from packages you know solve the problem, the next step is to expand the search. You can directly resort to your favourite search engine (which will inevitably redirect you to Stack Overflow) but it helps to first narrow the search by thinking about any possible clues. For me that clue was that joins are an important part of SQL, so I searched for a SQL solution that works in R.

The above search directed me to the excellent sqldf package. This package allows you to write SQL queries and execute them using data.frames instead of tables in a database. I can thus write a SQL JOIN query with a BETWEEN clause and apply it to my two tables.

library(sqldf)
# Attempt #2: Execute a SQL query
sqldf('SELECT Record, SomeValue, ValueOfInterest
       FROM myData
       LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound AND UpperBound')
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Marvellous! That gives me exactly the result I want and with little to no extra effort. The sqldf package takes the data.frames and creates corresponding tables in a temporary database (SQLite by default). It then executes the query and returns a data.frame. Even though the package isn’t built for performance it handles itself quite well, even with large datasets. The only disadvantage I can think of is that you must know a bit of SQL.

So now that I have found the answer I can continue with the next step in the analysis. That would’ve been the right thing to do but then curiosity got the better of me and I continued to find other solutions. For completeness I have listed some of these solutions below.

Fuzzy wuzzy join

If you widen the search for a solution you will (eventually, via various GitHub issues and Stack Overflow questions) come across the fuzzyjoin package. If you’re looking for flexible ways to join two data.frames then look no further. The package has a few ready-to-use solutions for a number of use cases: matching on equality with a tolerance (difference_inner_join), string matching (stringdist_inner_join), matching on Euclidean distance (distance_inner_join) and many more. For my use case I will use the more generic fuzzy_left_join which allows for one or more matching functions.

library(fuzzyjoin)
# Attempt #3: use the fuzzyjoin package
fuzzy_left_join(myData, linkTable,
                by = c("SomeValue" = "LowerBound", "SomeValue" = "UpperBound"),
                match_fun = list(`>=`, `<=`)) %>%
  select(Record, SomeValue, ValueOfInterest)
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Again, this is exactly what we’re looking for. Compared to the SQL alternative it takes a little more time to figure out what is going on, but that is a minor disadvantage. On the other hand, there is now no need to go back and forth with a database backend. I haven’t checked what the performance differences are, as that is a little out of scope for this post.

If not dplyr then data.table

I know it can be slightly annoying when someone answers your question about dplyr by saying it can be done in data.table but it’s always good to keep an open mind. Especially when one solves a task the other can’t (yet). It doesn’t take much effort to convert from a data.frame to a data.table. From there we can use the foverlaps function to do a non-equi join (as it is referred to in data.table-speak).

library(data.table)
# Attempt #4: Use the data.table package
myDataDT <- data.table(myData)
myDataDT[, SomeValueHelp := SomeValue]
linkTableDT <- data.table(linkTable)
setkey(linkTableDT, LowerBound, UpperBound)

result <- foverlaps(myDataDT, linkTableDT, by.x = c('SomeValue', 'SomeValueHelp'),
                    by.y = c('LowerBound', 'UpperBound'))
result[, .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

Ok so I’m not very well versed in the data.table way of doing things. I’m sure there is a less verbose way but this will do for now. If you know the magical spell please let me know (through the links provided at the end).

Update 6-Feb-2018
Stefan Fritsch provided the following (less verbose) way of doing it with data.table:

linkTableDT[myDataDT, on = .(LowerBound <= SomeValue, UpperBound >= SomeValue),
            .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

The pythonic way

Now that we’ve wandered off the tidyverse path, we might as well go all the way. During my search I also encountered a Python solution that looked interesting. It involves using pandas and some matrix multiplication, and works as follows (yes, you can run Python code in an R Markdown document).

import pandas as pd
# Attempt #5: Use Python and the pandas package
# create the pandas DataFrames (kind of like R data.frames)
myDataDF = pd.DataFrame({'Record': range(1, 6), 'SomeValue': [10, 8, 14, 6, 2]})
linkTableDF = pd.DataFrame({'ValueOfInterest': ['a', 'b', 'c'], 'LowerBound': [1, 4, 10],
                            'UpperBound': [3, 5, 16]})
# set the index of the linkTable (kind of like setting row names)
linkTableDF = linkTableDF.set_index('ValueOfInterest')
# now apply a function to each row of the linkTable
# this function checks if any of the values in myData are between the upper
# and lower bound of a specific row, thus returning 5 values (length of myData)
mask = linkTableDF.apply(lambda r: myDataDF.SomeValue.between(r['LowerBound'],
                                                              r['UpperBound']), axis=1)
# mask is a 3 (length of linkTable) by 5 matrix of True/False values
# by transposing it we get the row names (the ValueOfInterest) as the column names
mask = mask.T
# we can then matrix multiply mask with its column names
myDataDF['ValueOfInterest'] = mask.dot(mask.columns)
print(myDataDF)
##    Record  SomeValue ValueOfInterest
## 0       1         10               c
## 1       2          8
## 2       3         14               c
## 3       4          6
## 4       5          2               a

This is a nice way of doing it in Python but it’s definitely not as readable as the sqldf or fuzzyjoin alternatives. I for one had to blink at it a couple of times before I understood this witchcraft. I didn’t search extensively for a solution in Python so this may actually not be the right way of doing it. If you know of a better solution let me know via the links below.

Have no fear, the tidyverse is here

As you search for solutions to your own tasks you will undoubtedly come across many Stack Overflow questions and GitHub issues. Hopefully, they will provide the answer to your question or at least guide you to one. When they do, don’t forget to upvote or leave a friendly comment. When they don’t, do not despair but see it as a challenge to contribute your own solution. In my case the issue had already been reported and the dplyr developers are on it. I look forward to trying out their solution in the near future.
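For readers on a recent version of dplyr (1.1.0 or later, released long after this post was written), that solution has since arrived in the form of join_by(), which supports non-equi joins directly; a minimal sketch, assuming that version is installed:

# Non-equi join with dplyr 1.1.0+'s join_by() helper.
library(dplyr)

left_join(myData, linkTable,
          by = join_by(between(SomeValue, LowerBound, UpperBound))) %>%
  select(Record, SomeValue, ValueOfInterest)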

The code for this post is available on GitHub. I welcome any feedback – please let me know via Twitter or GitHub.

The EARLy career scholarship

At Mango, we’re passionate about R and promoting its use in enterprise – it’s why we created the EARL Conferences. We understand the importance of sharing knowledge to generate new ideas and change the way organisations use R for the better.

This year we are on a mission to actively encourage the attendance of R users who are either in a very early stage of their career or are finishing their academic studies and looking at employment options.

We’re offering EARLy career R users a chance to come to EARL – we have a number of 2-day conference passes for EARL London and tickets for each 1-day event in the US. This year’s dates are:
London, 12-13 September
Seattle, 7 November
Houston, 9 November
Boston, 13 November

Who can apply?

  • Anyone in their first year of employment
  • Anyone doing an internship or work placement
  • Anyone who has recently finished – or will soon be finishing – their academic studies and is actively pursuing a career in Analytics

To apply for a free EARLy Career ticket, tell us why you would like to attend an EARL Conference and how attending will help you advance your knowledge and your career.

(Minimum 200 words, maximum 500 words)

Submit your response here.

Terms and conditions: ‘Winners’ will receive tickets for any EARL Conference of their choice. This does not include travel or accommodation. The tickets are non-transferable. The tickets cannot be exchanged for cash.

Join Us For Some R And Data Science Knowledge Sharing In 2018

We’re proud to be part of the Data Science and R communities.

We recognise the importance of knowledge sharing across industries, helping people with their personal and professional development, networking, and collaboration in improving and growing the community. This is why we run a number of events and participate in many others.

Each year, we host and sponsor events across the UK, Europe and the US. Each event is open to everyone – experienced or curious – and aims to help people share and gain knowledge about Data Science and to get them involved with the wider community. To get you started, we’ve put together a list of our events you can attend over the next 12 months:

Free community events

LondonR

We host LondonR in central London every two months. At each meet up we have three brilliant R presentations followed by networking drinks – which are on us. Where possible we also offer free workshops about a range of R topics, including Shiny, ggplot2 and the Tidyverse.

The next event is on 27 March at UCL, you can sign up to our mailing list to hear about future events.

Manchester R

Manchester R takes place four times a year. Following the same format as LondonR, you will get three presentations followed by networking drinks on us. We also offer free workshops before the main meeting so you can stay up-to-date with the latest tools.

Our next event is on 6 February where the R-Ladies are taking over for the night. For more information visit the Manchester R website.

Bristol Data Scientists

Our Bristol Data Science events have a wider focus, but they follow the same format as our R user groups – three great presentations from the community and then drinks on us. If you’re interested in Data Science, happen to be a Data Scientist or work with data in some way then you are welcome to join us.

This year, we’re introducing free Data Science workshops before the meeting, so please tell us what you’d like to hear more about.

The Bristol meetup takes place four times a year at the Watershed in central Bristol. If you’d like to come we recommend joining the meetup group to stay in the loop.

BaselR

This meet up is a little further afield, but if you’re based in or near Basel, you’ll catch us twice a year running this R user group. Visit the BaselR website for details on upcoming events.

OxfordR

As you may have guessed, we love R, so we try to support the community where we can. We’ve partnered up with OxfordR this year to bring you pizza and wine while you network after the main presentation. OxfordR is held on the first Monday of every month; you can find details on their website.

BirminghamR

BirminghamR is under new management and we are helping them get started. Their first event for 2018 is coming up on 25 January; for more information check out their meetup page.

Data Engineering London

One of our newest meetup groups focuses on Data Engineering. We hold two events a year that give Data Engineers in London the opportunity to listen to talks on the latest technology, network with fellow engineers and have a drink or two on us. The next event will be announced in the coming months. To stay up-to-date please visit the meetup group.

Speaking opportunities

As well as attending our free events, you can let us know if you’d like to present a talk. If you have something you’d like to share just get in touch with the team by emailing us.

EARL Conferences

Our EARL Conferences were developed on the success of our R User Groups and the rapid growth of R in enterprise. R users in organisations around the country were looking for a place to share, learn and find inspiration. The enterprise focus of EARL makes it ideal for people to come and get some ideas to implement in the workplace. Every year delegates walk away feeling inspired and ready to work R magic in their organisations.

This year our EARL Conference dates are:
London: 11-13 September at The Tower Hotel
Seattle: 7 November at Loews Hotel 1000
Houston: 9 November at Hotel Derek
Boston: 13 November at The Charles Hotel

Speak at EARL

If you’re doing exciting things with R in your organisation, submit an abstract so others can learn from your wins. Accepted speakers get a free ticket for the day they are speaking.

Catch us at…

As well as hosting duties we are proud to sponsor some great community events, including PyData London in April and eRum in May.

Plus, you’ll find members of the Mango team speaking at Data Science events around the country. If you’d love to have one of them present at your event, please do get in touch.

Wherever you’re based we hope we will see you soon.

Field Guide to the R Ecosystem
Mark Sellors, Head of Data Engineering

I started working with R around about 5 years ago. Parts of the R world have changed substantially over that time, while other parts remain largely the same. One thing that hasn’t changed however, is that there has never been a simple, high-level text to introduce newcomers to the ecosystem. I believe this is especially important now that the ecosystem has grown so much. It’s no longer enough to just know about R itself. Those working with, or even around R, must now understand the ecosystem as a whole in order to best manage and support its use.

Hopefully the Field Guide to the R Ecosystem goes some way towards filling this gap.

The field guide aims to provide a high-level introduction to the R ecosystem. It is designed for those approaching the language for the first time, managers, ops staff, and anyone who just needs to get up to speed with the R ecosystem quickly.

This is not a programming guide and contains no information about the language itself, so it’s very definitely not aimed at those already developing with R. However, it is hoped that the guide will be useful to the people around those R users, whether that’s their managers, who’d just like to understand the ecosystem better, or ops staff tasked with supporting R in an enterprise but who don’t know where to start.

Perhaps you’re a hobbyist R user who’d like to provide more information to your company in order to make a case for adopting R? Maybe you’re part of a support team who’ll be building out infrastructure to support R in your business, but don’t know the first thing about R. You might be a manager or executive keen to support the development of an advanced analytics capability within your organisation. In all of these cases, the field guide should be useful to you.

It’s relatively brief and no prior knowledge is assumed, beyond a general technical awareness. The topics covered include R itself, packages and CRAN, IDEs, R in databases, commercial versions of R, web apps and APIs, publishing, and the community.

I really hope you, or someone around you, finds the guide useful. If you have any feedback, find me on Twitter and let me know. If you’d like to propose changes to the guide itself, you’ll find instructions in the first chapter and the bookdown source on GitHub. Remember, the guide is intentionally high-level and is intended to provide an overview of the ecosystem only, rather than any deep-dive technical discussions. There are already plenty of great guides for that stuff!

I’d also like to say a huge thanks to everyone who has taken time out of their day to proof read this for me and provide invaluable feedback, suggestions and corrections. The community is undoubtedly one of R’s greatest assets.

Originally posted on Mark’s blog, here.

Blogs home Featured Image
Nic Crane, Data Scientist

At Mango, we’re seeing more and more clients making the decision to modernise their analytics process, moving away from SAS and on to R, Python, and other technologies. There are a variety of reasons for this, including SAS license costs, the growing number of recent graduates with R and Python skills (and the increasing scarcity of SAS skills), and the need for flexible technologies capable of advanced analytics and quality graphics output.

While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process.

So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation?

1. Analyse procedure calls to inform technology choice


Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job?

A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL – SAS procedures which reorder data and execute SQL commands used for interacting with databases or tables of data. As there are no statistics-related procs, we may decide – if we migrate this application away from SAS – to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES, and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R.
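As a rough illustration of the idea (this is not Mango’s actual tooling, and the sas_code directory is hypothetical), a first pass at counting procedure calls can be as simple as a regular expression over the codebase:

# Crude sketch: tally the procedure name that follows each "proc" keyword
# across every .sas file under a (hypothetical) sas_code directory.
library(stringr)

sas_files <- list.files("sas_code", pattern = "\\.sas$", full.names = TRUE, recursive = TRUE)

proc_calls <- unlist(lapply(sas_files, function(f) {
  code <- tolower(readLines(f, warn = FALSE))
  str_match(code, "\\bproc\\s+(\\w+)")[, 2]
}))

sort(table(proc_calls), decreasing = TRUE)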

2. Use macro analysis to find the most and least important components of an application

Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used?

We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once.

Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality.

On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this to avoid wasting time and resource examining it.
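A similarly rough sketch of the macro analysis, again illustrative only – real SAS macro calls don’t always use parentheses, so a production tool has to be far more careful than this:

# Compare macro definitions with macro invocations to spot orphaned code.
library(stringr)

sas_files <- list.files("sas_code", pattern = "\\.sas$", full.names = TRUE, recursive = TRUE)
code <- tolower(unlist(lapply(sas_files, readLines, warn = FALSE)))

defined <- na.omit(str_match(code, "%macro\\s+(\\w+)")[, 2])
called  <- unlist(lapply(str_match_all(code, "%(\\w+)\\s*\\("), function(m) m[, 2]))

# macros that appear with a count of zero are candidates for orphaned code
sort(table(factor(called[called %in% defined], levels = unique(defined))))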

3. Looking at the interrelated components to understand process flow

When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code?

The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase and is scaled by the script’s size; the nodes are coloured by complexity. Looking at the diagram, we can instantly see that the create_metadata script is both large and complex, so we might choose to assign it to a more experienced developer, or look to restructure it first.

4. Examine code complexity to assess what needs redeveloping and redesigning

Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively?

Thankfully, we can use ‘cyclomatic complexity’, which assesses how complex the code is. The results of this analysis help determine whether the code needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers.
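As a very crude proxy for what that analysis does – cyclomatic complexity is essentially the number of decision points in the code plus one – something like the following gives the flavour, although real tools parse the code properly rather than grepping it (the file path here is hypothetical):

# Rough per-script complexity proxy: count branching keywords and add one.
library(stringr)

complexity_proxy <- function(file) {
  code <- tolower(readLines(file, warn = FALSE))
  decisions <- sum(str_count(code, "\\b(if|when|do while|do until)\\b"))
  decisions + 1
}

complexity_proxy("sas_code/create_metadata.sas")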

5. Use the high level overview to get an informed perspective which ties into your strategic objectives

Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives?

The overall summary can be used to answer questions around the relative size and complexity of multiple applications, making it possible to estimate more accurately the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions.

For example, if your key priority is to consolidate your ETL process, you might first focus on the apps which have a high number of calls to PROC SQL. Or you might have a goal of improving the quality of your graphics, and so you’ll focus on the applications which produce a large number of plots. Either way, a high-level summary like the one below collects all the information you need in one place and simplifies the decision-making process.

SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money.

If you’re thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with our team today via sales@mango-solutions.com or +44 (0)1249 705 450.

Blogs home Featured Image

Adnan Fiaz

With two out of three EARL conferences now part of R history, we’re really excited about the next EARL conference in Boston (only 1 week away!). This calls for an(other) EARL conference analysis, this time with Twitter data. Twitter is an amazingly rich data source and a great starting point for any data analysis (I feel there should be an awesome-twitter-blogposts list somewhere).

I was planning on using the wonderful rtweet package by Michael Kearney (as advertised by Bob Rudis) but unfortunately the Twitter API doesn’t provide a full history of tweets. Instead I had to resort to a Python package (gasp) called GetOldTweets. I strongly recommend using the official Twitter API first before going down this path.

The Data

# I have used the Exporter script with the hashtags #EARLConf2017, #EARLConf and #EARL2017
library(dplyr)
tweets_df <- purrr::map_df(list.files('data/tweets', full.names = TRUE),
                           ~ readr::read_delim(.x, delim = ";", quote = "")) %>%
  # filter out company accounts
  filter(username != "earlconf", username != "MangoTheCat") %>%
  mutate(shorttext = stringr::str_sub(text, end = 50))

tweets_df %>%
  select(username, date, shorttext) %>%
  head() %>%
  knitr::kable()
username      date                 shorttext
AlanHoKT      2017-10-02 02:15:00  “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
johnon2       2017-09-23 16:02:00  “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
AndySugs      2017-09-21 22:19:00  “RT: LearnRinaDay: EARL London 2017 – That’s a wra
LearnRinaDay  2017-09-21 22:17:00  “EARL London 2017 – That’s a wrap! https://www. r-
LouBajuk      2017-09-20 23:15:00  “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o
pjevrard      2017-09-20 13:02:00  “. @TIBCO ’s @LouBajuk spoke at #EARL2017 London o

First things first, let’s get a timeline up:

 

The hashtags I used to search tweets were generic so the results include tweets from last year’s conferences. Let’s zoom in on this year’s conferences: EARL San Francisco (5-7 June) and EARL London (12-14 September). They clearly explain the large peaks in the above graph.

 

I’ve tried to highlight the period when the conferences were on but I don’t quite like the result. Let’s see if it works better with a bar chart.

library(lubridate)  # for interval() and %within%
library(ggplot2)
library(scales)     # for date_format()

earlconf_sf_dates <- lubridate::interval("2017-06-05", "2017-06-08")
earlconf_lon_dates <- lubridate::interval("2017-09-12", "2017-09-15")
tweets_df %>%
  filter(date > "2017-05-01") %>%
  mutate(day = lubridate::date(date)) %>%
  count(day) %>%
  mutate(conference = case_when(day %within% earlconf_sf_dates ~ "SF",
                                day %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>%
  ggplot(aes(x = day, y = n)) +
  geom_bar(stat = "identity", aes(fill = conference)) +
  scale_fill_manual(guide = FALSE, values = c("#F8766D", "black", "#619CFF")) +
  labs(x = 'Date', y = 'Number of tweets', title = 'Number of EARL-related tweets by day') +
  scale_x_date(date_breaks = "1 months", labels = date_format('%b-%y')) +
  theme_classic()

 

Now that’s a lot better. The tweet counts in black surrounding the conferences look like small buildings which make the conference tweet counts look like giant skyscrapers (I was a failed art critic in a previous life).

Activity during conferences

I’ve been to my fair share of conferences/presentations and I’ve always wondered how people tweet so fast during a talk. It could be just my ancient phone or I may lack the necessary skills. Either way it would be interesting to analyse the tweets at the talk level. First I will need to link the tweets to specific talks. I’ve translated the published agenda into a nicer format by hand and read it in below.

library(purrr)
earl_agenda <- map_df(c("EARL_SF", "EARL_LON"),
                      ~ readxl::read_xlsx('data/earl_agenda.xlsx', sheet = .x))
earl_agenda %>%
  select(StartTime, EndTime, Title, Presenter) %>%
  head() %>%
  knitr::kable()
StartTime EndTime Title Presenter
2017-06-06 11:00:00 2017-06-06 11:30:00 R’s role in Data Science Joe Cheng
2017-06-06 11:30:00 2017-06-06 12:00:00 ‘Full Stack’ Data Science with R: production data science and engineering with open source tools Gabriela de Queiroz
2017-06-06 12:00:00 2017-06-06 12:30:00 R Operating Model Mark Sellors
2017-06-06 11:00:00 2017-06-06 11:30:00 Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
2017-06-06 11:30:00 2017-06-06 12:00:00 Using data to identify risky prescribing habits in physicians Aaron Hamming
2017-06-06 12:00:00 2017-06-06 12:30:00 How we built a Shiny App for 700 users Filip Stachura

Before I merge the tweets with the agenda it’s a good idea to zoom in on the conference tweets (who doesn’t like a facetted plot).

conference_tweets <- tweets_df %>%
  mutate(conference = case_when(date %within% earlconf_sf_dates ~ "SF",
                                date %within% earlconf_lon_dates ~ "LON",
                                TRUE ~ "NONE")) %>%
  filter(conference != "NONE")

ggplot(conference_tweets, aes(x = date)) +
  geom_histogram() +
  facet_wrap(~ conference, scales = 'free_x')

 

Nothing odd in the pattern of tweets: there are no talks on the first day so barely any tweets; the number of tweets spikes at the beginning of the other two days and then declines as the day progresses. There is something odd about the timing of the tweets though. I didn’t notice it before but when I compared the position of the bars on the x-axis the San Francisco tweets looked shifted. And then my lack of travel experience hit me: time zones! The tweets were recorded in UTC time but the talks obviously weren’t in the evening in San Francisco.
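As an aside, an alternative to shifting by a fixed number of hours is to convert the UTC timestamps to the conference’s local time zone with lubridate, which also takes care of daylight saving; a small sketch with a single made-up timestamp:

# Convert a UTC timestamp to San Francisco local time.
library(lubridate)

utc_time <- ymd_hms("2017-06-06 19:00:00", tz = "UTC")
with_tz(utc_time, "America/Los_Angeles")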

After correcting for time zones I can finally merge the tweets with the agenda.

selection <- conference_tweets$conference == 'SF'
conference_tweets[selection, 'date'] <- conference_tweets[selection, 'date'] - 8*60*60
# I intended to use a fuzzy join here and check if the tweet timestamp falls within the [start, end) of a talk
# unfortunately I couldn't get it to work with datetime objects
# so I resort to determining the cartesian product and simply filtering the relevant records
tweets_and_talks <- conference_tweets %>%
  mutate(dummy = 1) %>%
  left_join(earl_agenda %>% mutate(dummy = 1)) %>%
  filter(date >= StartTime, date < EndTime)

tweets_and_talks %>%
  select(username, date, shorttext, Title, Presenter) %>%
  tail() %>%
  knitr::kable()
knitr::kable()
username date shorttext Title Presenter
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL R’s role in Data Science Joe Cheng
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R Large-scale reproducible simulation pipelines in R using Docker Mike Gahan

You ever have that feeling that you’re forgetting something and then you’re at the airport without your passport? From the above table it’s obvious I’ve forgotten that talks are organised in parallel. So matching on time only will create duplicates. However, you may notice that some tweets also mention the presenter (that is considered good tweetiquette). We can use that information to further improve the matching.

library(stringdist)  # for stringsim()

talks_and_tweets <- tweets_and_talks %>%
  # calculate various scores based on what is said in the tweet text
  mutate(presenter_score = ifelse(!is.na(mentions) & !is.na(TwitterHandle), stringr::str_detect(mentions, TwitterHandle), 0),
         # check if the presenter's name is mentioned
         presenter_score2 = stringr::str_detect(text, Presenter),
         # check if the company name is mentioned
         company_score = stringr::str_detect(text, Company),
         # check if what is mentioned has any overlap with the title (description would've been better)
         overall_score = stringsim(text, Title),
         # sum all the scores
         score = overall_score + presenter_score + presenter_score2 + company_score) %>%
  select(-presenter_score, -presenter_score2, -company_score, -overall_score) %>%
  # now select the highest scoring match
  group_by(username, date) %>%
  top_n(1, score) %>%
  ungroup()

talks_and_tweets %>%
  select(username, date, shorttext, Title, Presenter) %>%
  tail() %>%
  knitr::kable()
username date shorttext Title Presenter
Madhuraraju 2017-06-06 11:39:00 @aj2z @gdequeiroz from @SelfScore talking about u ‘Full Stack’ Data Science with R: production data science and engineering with open source tools Gabriela de Queiroz
hspter 2017-06-06 11:22:00 “#rstats is great for achieving “flow” while doing R’s role in Data Science Joe Cheng
RLadiesGlobal 2017-06-06 11:20:00 @RStudioJoe showing the #RLadies logo and a big m R’s role in Data Science Joe Cheng
hspter 2017-06-06 11:17:00 “Nice shout out to @rOpenSci as prodigious package Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
RLadiesGlobal 2017-06-06 11:14:00 “#RLadies @b23kellytalking about #rstats at #EARL Large-scale reproducible simulation pipelines in R using Docker Mike Gahan
hspter 2017-06-06 11:14:00 “I’m digging the postmodern data scientist from @R R’s role in Data Science Joe Cheng

That looks better but I am disappointed at the number of tweets (263) during talks. Maybe attendees are too busy listening to the talk instead of tweeting, which is a good thing I suppose. Nevertheless I can still try to create some interesting visualisations with this data.

tweets_by_presenter <- talks_and_tweets %>%
  count(conference, Title, Presenter) %>%
  ungroup() %>%
  arrange(conference, n)

tweets_by_presenter$Presenter <- factor(tweets_by_presenter$Presenter, levels = tweets_by_presenter$Presenter)

 

The visualisation doesn’t really work for the large number of presenters although I don’t really see another way to add the information about a talk. I also tried to sort the levels of the factor so they appear sorted in the plot but for some reason the SF facet doesn’t want to cooperate. There are a number of talks vying for the top spot in San Francisco but the differences aren’t that large. I’m of course assuming my matching heuristic worked perfectly but one or two mismatches and the results could look completely different. The same applies to EARL London but here Joe Cheng clearly takes the crown.

Follow the leader…

Let’s go down a somewhat more creepy road and see what talks people go to.

tweeters <- talks_and_tweets %>%
  group_by(username) %>%
  mutate(num_tweets = n()) %>%
  ungroup() %>%
  filter(num_tweets > 4) %>%
  mutate(day = ifelse(conference == "SF", (Session > 6) + 1, (Session > 9) + 1),
         day = ifelse(day == 1, "Day 1", "Day 2")) %>%
  select(username, conference, StartTime, Stream, day)

Each line is a Twitter user (twitterer? tweeter? tweep?) and each observation represents a tweet during a presentation. My expectation was that by drawing a line between the observations you could see how people switch (or don’t switch) between talks. That has clearly failed as the tweeting behaviour isn’t consistent or numerous enough to actually see that. I’m quite glad it’s not possible since tracking people isn’t what Twitter is for.

The code and data for this blogpost are available on GitHub so feel free to play around with it yourself. Do let us know if you create any awesome visualisations or if we can improve on any of the above. If you also want to tweet at conferences, EARL Boston is happening on 1-3 November and tickets are still available. I promise we won’t track you!

Putting the cat in scatterplot
Clara Schartner, Data Scientist

It will come as no surprise that cats and ggplot are among our favourite things here at Mango, luckily there is an easy way to combine both.

Using the function annotation_custom in the popular ggplot2 package it is possible to display images on a plot, e.g. at the points of a scatterplot. This way data can be displayed in a more fun, creative way.

In keeping with the cat theme I have chosen a data set about cats and a cat icon based on Mango the cat. The MASS package provides a data set called cats which contains the body weight, heart weight and sex of adult cats.

library(MASS)
data(cats)
head(cats)
set.seed(1234)
cats <- cats[sample(1:144, size = 40),]

First a normal scatterplot is defined on which the images will be plotted later:

library(ggplot2)
sCATter <- ggplot(data = cats, aes(x = Bwt, y = Hwt)) +
  geom_point(size = 0, aes(group = Sex, colour = Sex)) +
  theme_classic() +
  xlab("Body weight") +
  ylab("Heart weight") +
  ggtitle("sCATterplot") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # create a legend (labels follow the factor level order of Sex: "F" then "M")
  scale_color_manual(
    values = c("#999999", "#b35900"),
    name = "Cat",
    labels = c("Female cat", "Male cat")
  ) +
  guides(colour = guide_legend(override.aes = list(size = 10)))

Any png image can be used for the plot, however images with a transparent background are preferable.

library(png)
library(grid)
mCat <- readPNG("MaleCat.png")
feCat<- readPNG("FemaleCat.png")

In the last step the cats are iteratively plotted onto the plot using annotation_custom.

for (i in 1:nrow(cats)) {
  # distinguishing the sex of the cat
  if (cats$Sex[i] == "F") {
    image <- feCat
  } else {
    image <- mCat
  }
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(image),
      xmin = cats$Bwt[i] - 0.6,
      xmax = cats$Bwt[i] + 0.6,
      ymin = cats$Hwt[i] - 0.6,
      ymax = cats$Hwt[i] + 0.6
    )
}

The cat’s paw trail displays a linear regression of heart weight on body weight. This can easily be added by fitting a linear regression, defining a grid on which to calculate the expected values, and plotting paws on top of this data.

LmCat <- lm(Hwt ~ Bwt, data = cats)

steps <- 20
Reg <- data.frame(Bwt = seq(from = min(cats$Bwt),
                            to = max(cats$Bwt),
                            length.out = steps))
Reg$Hwt <- predict(LmCat, newdata = Reg)
sCATter <- sCATter +
  geom_point(data = Reg, aes(Bwt, Hwt), size = 0)

paw <- readPNG("paw.png")
for (i in 1:nrow(Reg)) {
  sCATter <- sCATter +
    annotation_custom(
      rasterGrob(paw),
      xmin = Reg$Bwt[i] - 0.6,
      xmax = Reg$Bwt[i] + 0.6,
      ymin = Reg$Hwt[i] - 0.6,
      ymax = Reg$Hwt[i] + 0.6
    )
}
sCATter
sCATter

I hope you have as much fun with this ggplot2 trick as I did!