
Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our data science radar.

Although visualisation of data is important in order to communicate the results of an analysis to stakeholders, it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.

The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.

Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can lead to the violation of the assumptions of the model being fit, and so decrease the accuracy and usefulness of any conclusions to be drawn later.

This point was demonstrated in a beautifully simple way by statistician Francis Anscombe, who in 1973 designed a set of small datasets, each showing a distinct pattern of results. The four datasets comprising Anscombe’s Quartet have identical or near-identical means, variances, correlations between variables, and linear regression lines, yet each looks completely different when plotted, which highlights the inadequacy of relying on simple summary statistics alone in exploratory data analysis.
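For anyone who wants to check this for themselves, R ships with the quartet as the built-in anscombe dataset; here is a minimal sketch of the summary statistics (no Shiny required):

# the anscombe data frame has columns x1-x4 and y1-y4, one pair per dataset
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), var_x = var(x),
    mean_y = mean(y), var_y = var(y),
    cor_xy = cor(x, y),
    slope  = coef(lm(y ~ x))[[2]])
})
# all four columns agree to around two decimal places, yet the four
# scatterplots look completely different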

The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly change between each dataset to really get an in-depth understanding of their similarities and differences.

The code for the Shiny app is available on GitHub.

EARL Seattle Keynote Speaker Announcement: Julia Silge

We’re delighted to announce that Julia Silge will be joining us on 7 November in Seattle as our Keynote speaker.

Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

We can’t wait to see what Julia has to say in November.

Submit an abstract

Abstract submissions are open for both the US Roadshow in November and London in September. If you would like to share the R successes in your organisation, you could be on the agenda in Seattle as one of our speakers, alongside Julia.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November

In Between A Rock And A Conditional Join

Joining two datasets is a common action we perform in our analyses. Almost all languages have a solution for this task: R has the built-in merge function or the family of join functions in the dplyr package, SQL has the JOIN operation and Python has the merge function from the pandas package. And without a doubt these cover a variety of use cases but there’s always that one exception, that one use case that isn’t covered by the obvious way of doing things.

In my case, that exception is joining two datasets based on a conditional statement: instead of requiring specific columns in both datasets to be equal to each other, I want to compare them on something other than equality (e.g. larger than). The following example should hopefully make things clearer.

myData <- data.frame(Record = seq(5), SomeValue = c(10, 8, 14, 6, 2))
myData
##   Record SomeValue
## 1      1        10
## 2      2         8
## 3      3        14
## 4      4         6
## 5      5         2

The above dataset, myData, is the dataset to which I want to add values from the following dataset:

linkTable <- data.frame(ValueOfInterest = letters[1:3], LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))
linkTable
##   ValueOfInterest LowerBound UpperBound
## 1               a          1          3
## 2               b          4          5
## 3               c         10         16

This second dataset, linkTable, is the dataset containing the information to be added to myData. You may notice the two datasets have no columns in common. That is because I want to join the data based on the condition that SomeValue is between LowerBound and UpperBound. This may seem like an artificial (and perhaps trivial) example but just imagine SomeValue to be a date or zip code. Then imagine LowerBound and UpperBound to be bounds on a specific time period or geographical region respectively.

In Mango’s R training courses, one of the most important lessons we teach our participants is that how you obtain the answer is just as important as the answer itself. So I’ll try to convey that here too, instead of just giving you the answer.

Helping you help yourself

So the first step in finding the answer is to explore R’s comprehensive help system and documentation. Since we’re talking about joins, it’s only natural to look at the documentation of the merge function or the join functions from the dplyr package. Unfortunately, both only have the option to supply columns that are compared to each other based on equality. However, the documentation for the merge function does mention that when no columns are given the function performs a Cartesian product. That’s just a seriously cool way of saying every row from myData is joined with every row from linkTable. It might not solve the task but it does give me the following idea:

# Attempt #1: Do a Cartesian product and then filter the relevant rows
library(dplyr)
merge(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(-LowerBound, -UpperBound)
##   Record SomeValue ValueOfInterest
## 1      5         2               a
## 2      1        10               c
## 3      3        14               c

You can do the above in dplyr as well but I’ll leave that as an exercise. The more important question is: what is wrong with the above answer? You may notice that we’re missing records 2 and 4. That’s because these didn’t satisfy the filtering condition. If we wanted to add them back in we would have to do another join. Something that you won’t notice with these small example datasets is that a Cartesian product is an expensive operation: combining all the records of two datasets can result in an explosion of values.
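For those who want to check their answer to that exercise, here is one minimal dplyr-only sketch. It assumes dplyr 1.1 or later for cross_join() (older versions can keep using merge() as above) and uses a second join to bring records 2 and 4 back in:

library(dplyr)

myData %>%
  cross_join(linkTable) %>%                              # Cartesian product
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(-LowerBound, -UpperBound) %>%
  right_join(myData, by = c("Record", "SomeValue"))      # restore unmatched records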

(Sometimes) a SQL is better than the original

When neither the built-in functions nor the functions from packages you know solve the problem, the next step is to expand the search. You can resort directly to your favourite search engine (which will inevitably redirect you to Stack Overflow) but it helps to first narrow the search by thinking about any possible clues. For me that clue was that joins are an important part of SQL, so I searched for a SQL solution that works in R.

The above search directed me to the excellent sqldf package. This package allows you to write SQL queries and execute them using data.frames instead of tables in a database. I can thus write a SQL JOIN query with a BETWEEN clause and apply it to my two tables.

library(sqldf)
# Attempt #2: Execute a SQL query
sqldf('SELECT Record, SomeValue, ValueOfInterest
       FROM myData
       LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound AND UpperBound')
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Marvellous! That gives me exactly the result I want and with little to no extra effort. The sqldf package takes the data.frames and creates corresponding tables in a temporary database (SQLite by default). It then executes the query and returns a data.frame. Even though the package isn’t built for performance it handles itself quite well, even with large datasets. The only disadvantage I can think of is that you must know a bit of SQL.

So now that I have found the answer I can continue with the next step in the analysis. That would’ve been the right thing to do but then curiosity got the better of me and I continued to find other solutions. For completeness I have listed some of these solutions below.

Fuzzy wuzzy join

If you widen the search for a solution you will (eventually, via various GitHub issues and Stack Overflow questions) come across the fuzzyjoin package. If you’re looking for flexible ways to join two data.frames then look no further. The package has a few ready-to-use solutions for a number of use cases: matching on equality with a tolerance (difference_inner_join), string matching (stringdist_inner_join), matching on Euclidean distance (distance_inner_join) and many more. For my use case I will use the more generic fuzzy_left_join, which allows for one or more matching functions.

library(fuzzyjoin)
# Attempt #3: Use the fuzzyjoin package
fuzzy_left_join(myData, linkTable,
                by = c("SomeValue" = "LowerBound", "SomeValue" = "UpperBound"),
                match_fun = list(`>=`, `<=`)) %>%
  select(Record, SomeValue, ValueOfInterest)
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Again, this is exactly what we’re looking for. Compared to the SQL alternative it takes a little more time to figure out what is going on, but that is a minor disadvantage. On the other hand, there is now no need to go back and forth with a database backend. I haven’t checked what the performance differences are, as that is a little out of scope for this post.

If not dplyr then data.table

I know it can be slightly annoying when someone answers your question about dplyr by saying it can be done in data.table but it’s always good to keep an open mind. Especially when one solves a task the other can’t (yet). It doesn’t take much effort to convert from a data.frame to a data.table. From there we can use the foverlaps function to do a non-equi join (as it is referred to in data.table-speak).

library(data.table)
# Attempt #4: Use the data.table package
myDataDT <- data.table(myData)
myDataDT[, SomeValueHelp := SomeValue]
linkTableDT <- data.table(linkTable)
setkey(linkTableDT, LowerBound, UpperBound)

result <- foverlaps(myDataDT, linkTableDT, by.x = c('SomeValue', 'SomeValueHelp'),
                    by.y = c('LowerBound', 'UpperBound'))
result[, .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

Ok so I’m not very well versed in the data.table way of doing things. I’m sure there is a less verbose way but this will do for now. If you know the magical spell please let me know (through the links provided at the end).

Update 6-Feb-2018
Stefan Fritsch provided the following (less verbose) way of doing it with data.table:

linkTableDT[myDataDT, on = .(LowerBound <= SomeValue, UpperBound >= SomeValue),
            .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

The pythonic way

Now that we’ve wandered outside the tidyverse, we might as well go all the way. During my search I also encountered a Python solution that looked interesting. It involves using pandas and some matrix multiplication, and works as follows (yes, you can run Python code in an R Markdown document).

import pandas as pd
# Attempt #5: Use Python and the pandas package
# create the pandas DataFrames (kind of like R data.frame)
myDataDF = pd.DataFrame({'Record': range(1, 6), 'SomeValue': [10, 8, 14, 6, 2]})
linkTableDF = pd.DataFrame({'ValueOfInterest': ['a', 'b', 'c'], 'LowerBound': [1, 4, 10],
                            'UpperBound': [3, 5, 16]})
# set the index of the linkTable (kind of like setting row names)
linkTableDF = linkTableDF.set_index('ValueOfInterest')
# now apply a function to each row of the linkTable
# this function checks if any of the values in myData are between the upper
# and lower bound of a specific row thus returning 5 values (length of myData)
mask = linkTableDF.apply(lambda r: myDataDF.SomeValue.between(r['LowerBound'],
                                                              r['UpperBound']), axis=1)
# mask is a 3 (length of linkTable) by 5 matrix of True/False values
# by transposing it we get the row names (the ValueOfInterest) as the column names
mask = mask.T
# we can then matrix multiply mask with its column names
myDataDF['ValueOfInterest'] = mask.dot(mask.columns)
print(myDataDF)
##    Record  SomeValue ValueOfInterest
## 0       1         10               c
## 1       2          8
## 2       3         14               c
## 3       4          6
## 4       5          2               a

This is a nice way of doing it in Python but it’s definitely not as readable as the sqldf or fuzzyjoin alternatives. I for one had to blink at it a couple of times before I understood this witchcraft. I didn’t search extensively for a solution in Python so this may actually not be the right way of doing it. If you know of a better solution let me know via the links below.

Have no fear, the tidyverse is here

As you search for solutions to your own tasks you will undoubtedly come across many Stack Overflow questions and GitHub issues. Hopefully, they will provide the answer to your question or at least guide you to one. When they do, don’t forget to upvote or leave a friendly comment. When they don’t, do not despair but see it as a challenge to contribute your own solution. In my case the issue had already been reported and the dplyr developers are on it. I look forward to trying out their solution in the near future.
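That solution has since arrived: current versions of dplyr (1.1.0 and later) support non-equi joins via join_by(), so the conditional join can now be written roughly like this:

library(dplyr)  # requires dplyr >= 1.1.0

left_join(myData, linkTable,
          by = join_by(between(SomeValue, LowerBound, UpperBound))) %>%
  select(Record, SomeValue, ValueOfInterest)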

The code for this post is available on GitHub. I welcome any feedback; please let me know via Twitter or GitHub.

The EARLy career scholarship

At Mango, we’re passionate about R and promoting its use in enterprise – it’s why we created the EARL Conferences. We understand the importance of sharing knowledge to generate new ideas and change the way organisations use R for the better.

This year we are on a mission to actively encourage the attendance of R users who are either in a very early stage of their career or are finishing their academic studies and looking at employment options.

We’re offering EARLy career R users a chance to come to EARL – we have a number of 2-day conference passes for EARL London and tickets for each 1-day event in the US. This year’s dates are:
London, 12-13 September
Seattle, 7 November
Houston, 9 November
Boston, 13 November

Who can apply?

  • Anyone in their first year of employment
  • Anyone doing an internship or work placement
  • Anyone who has recently finished – or will soon be finishing – their academic studies and is actively pursuing a career in Analytics

To apply for a free EARLy Career ticket, tell us why you would like to attend an EARL Conference and how attending will help you advance your knowledge and your career.

(Minimum 200 words, maximum 500 words)

Submit your response here.

Terms and conditions: ‘Winners’ will receive tickets for any EARL Conference of their choice. This does not include travel or accommodation. The tickets are non-transferable. The tickets cannot be exchanged for cash.

Join Us For Some R And Data Science Knowledge Sharing In 2018

We’re proud to be part of the Data Science and R communities.

We recognise the importance of knowledge sharing across industries, helping people with their personal and professional development, networking, and collaboration in improving and growing the community. This is why we run a number of events and participate in many others.

Each year, we host and sponsor events across the UK, Europe and the US. Each event is open to everyone, experienced or curious, and aims to help people share and gain knowledge about Data Science and get involved with the wider community. To get you started, we’ve put together a list of our events you can attend over the next 12 months:

Free community events

LondonR

We host LondonR in central London every two months. At each meet up we have three brilliant R presentations followed by networking drinks – which are on us. Where possible we also offer free workshops about a range of R topics, including Shiny, ggplot2 and the Tidyverse.

The next event is on 27 March at UCL, you can sign up to our mailing list to hear about future events.

Manchester R

Manchester R takes place four times a year. Following the same format as LondonR, you will get three presentations followed by networking drinks on us. We also offer free workshops before the main meeting so you can stay up-to-date with the latest tools.

Our next event is on 6 February where the R-Ladies are taking over for the night. For more information visit the Manchester R website.

Bristol Data Scientists

Our Bristol Data Science events have a wider focus, but they follow the same format as our R user groups – three great presentations from the community and then drinks on us. If you’re interested in Data Science, happen to be a Data Scientist or work with data in some way then you are welcome to join us.

This year, we’re introducing free Data Science workshops before the meeting, so please tell us what you’d like to hear more about.

The Bristol meetup takes place four times a year at the Watershed in central Bristol. If you’d like to come we recommend joining the meetup group to stay in the loop.

BaselR

This meetup is a little further afield, but if you’re based in or near Basel, you’ll catch us twice a year running this R user group. Visit the BaselR website for details on upcoming events.

OxfordR

As you may have guessed, we love R, so we try to support the community where we can. We’ve partnered with OxfordR this year to bring you pizza and wine while you network after the main presentation. OxfordR is held on the first Monday of every month; you can find details on their website.

BirminghamR

BirminghamR is under new management and we are helping them get started. Their first event for 2018 is coming up on 25 January; for more information check out their meetup page.

Data Engineering London

One of our newest meetup groups focuses on Data Engineering. We hold two events a year that give Data Engineers in London the opportunity to listen to talks on the latest technology, network with fellow engineers and have a drink or two on us. The next event will be announced in the coming months. To stay up-to-date please visit the meetup group.

Speaking opportunities

As well as attending our free events, you can let us know if you’d like to present a talk. If you have something you’d like to share just get in touch with the team by emailing us.

EARL Conferences

Our EARL Conferences were developed on the success of our R User Groups and the rapid growth of R in enterprise. R users in organisations around the country were looking for a place to share, learn and find inspiration. The enterprise focus of EARL makes it ideal for people to come and get some ideas to implement in the workplace. Every year delegates walk away feeling inspired and ready to work R magic in their organisations.

This year our EARL Conference dates are:
London: 11-13 September at The Tower Hotel
Seattle: 7 November at Loews Hotel 1000
Houston: 9 November at Hotel Derek
Boston: 13 November at The Charles Hotel

Speak at EARL

If you’re doing exciting things with R in your organisation, submit an abstract so others can learn from your wins. Accepted speakers get a free ticket for the day they are speaking.

Catch us at…

As well as hosting duties we are proud to sponsor some great community events, including PyData London in April and eRum in May.

Plus, you’ll find members of the Mango team speaking at Data Science events around the country. If you’d love to have one of them present at your event, please do get in touch.

Wherever you’re based we hope we will see you soon.

Field Guide to the R Ecosystem
Mark Sellors, Head of Data Engineering

I started working with R around five years ago. Parts of the R world have changed substantially over that time, while other parts remain largely the same. One thing that hasn’t changed, however, is that there has never been a simple, high-level text to introduce newcomers to the ecosystem. I believe this is especially important now that the ecosystem has grown so much. It’s no longer enough to just know about R itself. Those working with, or even around, R must now understand the ecosystem as a whole in order to best manage and support its use.

Hopefully the Field Guide to the R Ecosystem goes some way towards filling this gap.

The field guide aims to provide a high-level introduction to the R ecosystem. It is designed for those approaching the language for the first time, managers, ops staff, and anyone who just needs to get up to speed with the R ecosystem quickly.

This is not a programming guide and contains no information about the language itself, so it’s very definitely not aimed at those already developing with R. However, it is hoped that the guide will be useful to the people around those R users, whether that’s their managers, who’d just like to understand the ecosystem better, or ops staff tasked with supporting R in an enterprise who don’t know where to start.

Perhaps, you’re a hobbyist R user, who’d like to provide more information to your company in order to make a case for adopting R? Maybe you’re part of a support team who’ll be building out infrastructure to support R in your business, but don’t know the first thing about R. You might be a manager or executive keen to support the development of an advanced analytics capability within your organisation. In all of these cases, the field guide should be useful to you.

It’s relatively brief and no prior knowledge is assumed, beyond a general technical awareness. The topics covered include R, packages and CRAN, IDEs, R in databases, commercial versions of R, web apps and APIs, publishing, and the community.

I really hope you, or someone around you, finds the guide useful. If you have any feedback, find me on Twitter and let me know. If you’d like to propose changes to the guide itself, you’ll find instructions in the first chapter and the bookdown source on GitHub. Remember, the guide is intentionally high-level and is intended to provide an overview of the ecosystem only, rather than any deep-dive technical discussions. There are already plenty of great guides for that stuff!

I’d also like to say a huge thanks to everyone who has taken time out of their day to proof read this for me and provide invaluable feedback, suggestions and corrections. The community is undoubtedly one of R’s greatest assets.

Originally posted on Mark’s blog, here.


Prelude

Maybe you’re looking for a change of scene. Maybe you’re looking for your first job. Maybe you’re stuck in conversation with a relative who you haven’t spoken to since last Christmas and who has astonishingly strong opinions on whether cells ought to be merged or not in Excel spreadsheets.

The fact of the matter is that you have just encountered the term “data science” for the first time, and it sounds like it might be interesting but you don’t have a clue what it is. Something to do with computers? Should you bring a lab coat, or a VR headset? Or both? What is a data and how does one science it?

Fear not. I am here to offer subjective, questionable and most importantly FREE advice from the perspective of someone who was in that very position not such a long time ago. Read on at your peril.

I. Adagio: Hear about data science

This is the hard bit. It’s surprisingly difficult to stumble upon data science unless someone tells you about it.

But the good news is that you’re reading this, so you’ve already done it. Possibly a while ago, or possibly just now; either way, put a big tick next to Step 1. Congratulations!

(By the way, you’ll remember the person who told you about data science. When you grow in confidence yourself, be someone else’s “person who told me about data science”. It’s a great thing to share. But all in good time…)

II. Andante: Find out more

But what actually is data science?

To be honest, it’s a fairly loosely-defined term. There are plenty of articles out there that try to give an overview, but most descend into extended discussions about the existence of unicorns or resort to arranging countless combinations of potentially relevant acronyms in hideous indecipherable Venn diagrams.

You’re much better off finding examples of people “doing” data science. Find some blogs (here are a few awesome ones to get you started) and read about what people are up to in the real world.

Don’t be afraid to narrow down and focus on a specific topic that interests you – there’s so much variety out there that you’re bound to find something that inspires you to keep reading and learning. But equally, explore as many new areas as you can, because the more context you can get about the sector the better your understanding will be and you’ll start to see how different subjects and different roles relate to each other.

Believe it or not, one of the best tools for keeping up to date with the latest developments in the field is Twitter. If you follow all your blog-writing heroes, not only will you be informed whenever they publish a new article but you’ll also get an invaluable glimpse into their day-to-day jobs and working habits, as well as all the cool industry-related stuff they share. Even if you never tweet anything yourself you’ll be exposed to much more than you’d be able to find on your own. If you want to get involved there’s no need to be original – you could just use it to share content you’ve found interesting yourself.

If you’re super keen, you might even want to get yourself some data science books tackling a particular topic. Keep an eye out for free online/ebook versions too!

III. Allegretto: Get hands-on

Observing is great, but it will only get you so far.

Imagine that you’ve just heard about an amazing new thing called “piano”. It sounds great. No, it sounds INCREDIBLE. It’s the sort of thing you really want to be good at.

So you get online and read more about it. Descriptions, analyses, painstaking breakdowns of manual anatomy and contrapuntal textures. You watch videos of people playing pianos, talking about pianos, setting pianos on fire and hurling them across dark fields. You download reams of free sheet music and maybe even buy a book of pieces you really want to learn.

But at some point… you need to play a piano.

The good news is that with data science, you don’t need to buy a piano, or find somewhere to keep it, or worry about bothering your family/friends/neighbours/pets with your late-night composing sessions.

Online interactive coding tutorials are a great place to start if you want to learn a new programming language. Sites like DataCamp and Codecademy offer a number of free courses to get yourself started with data science languages like R and Python. If you are feeling brave enough, take the plunge and run things on your own machine! (I’d strongly recommend using R with RStudio and using Anaconda for Python.) Language-specific “native-format” resources such as swirl for R or this set of Jupyter notebooks for Python are a great way to learn more advanced skills. Take advantage of the exercises in any books you have – don’t just skip them all!

Data science is more than just coding though – it’s all about taking a problem, understanding it, solving it and then communicating those ideas to other people. So Part 1 of my Number One Two-Part Top Tip for you today is:

  1. Pick a project and write about it

How does one “pick a project”? Well, find something that interests you. For me it was neural networks (and later, car parks…) but it could be literally anything, so long as you’re going to be able to find some data to work with. Maybe have a look at some of the competitions hosted on Kaggle or see if there’s a group in your area which publishes open data.

Then once you’ve picked something, go for it! Try out that cool package you saw someone else using. Figure out why there are so many missing values in that dataset. Take risks, explore, try new things and push yourself out of your comfort zone. And don’t be afraid to take inspiration from something that someone else has already done: regardless of whether you follow the same process or reach the same outcome, your take on it is going to be different to theirs.

By writing about that project – which is often easier than deciding on one in the first place – you’re developing your skills as a communicator by presenting your work in a coherent manner, rather than as a patchwork of dodgy scripts interspersed with the occasional hasty comment. And even if you don’t want to make your writing public, you’ll be amazed how often you go back and read something you wrote before because it’s come up again in something else you’re working on and you’ve forgotten how to do it.

I’d really encourage you to get your work out there though. Which brings us smoothly to…

IV. Allegro maestoso: Get yourself out there

If you never play the piano for anyone else, no-one’s ever going to find out how good you are! So Part 2 of my Number One Two-Part Top Tip is:

  1. Start a blog

It’s pretty easy to get going with WordPress or similar, and it takes your writing to the next level because now you’re writing for an audience. It may not be a very big audience, but if someone, somewhere finds your writing interesting or useful then surely it’s worth it. And if you know you’re potentially writing for someone other than yourself then you need to explain everything properly, which means you need to understand everything properly. I often learn more when I’m writing up a project than when I’m playing around with the code in the first place.

Also, a blog is a really good thing to have on your CV and to talk about at interviews, because it gives you some physical (well, virtual) evidence which you can point at as you say “look at this thing wot I’ve done”.

(Don’t actually say those exact words. Remember that you’re a Good Communicator.)

If you’re feeling brave you can even put that Twitter account to good use and start shouting about all the amazing things you’re doing. You’ll build up a loyal following amazingly quickly. Yes, half of them will probably be bots, but half of them will be real people who enjoy reading your work and who can give you valuable feedback.

Speaking of real people…

  1. Get involved in the community

Yes, that was indeed Part 3 of my Number One Two-Part Top Tip, but it’s so important that it needs to be in there.

The online data science community is one of the best out there. The R community in particular is super friendly and supportive (check out forums like RStudio Community, community groups like R4DS, and the #rstats tag on Twitter). Get involved in conversations, learn from people already working in the sector, share your own knowledge and make friends.

Want to go one better than online?

Get a Meetup account, sign up to some local groups and go out to some events. It might be difficult to force yourself to go for the first time, but pluck up the courage and do it. Believe me when I say there’s no substitute for meeting up and chatting to people. Many good friends are people I met for the first time at meetups. And of course, it’s the perfect opportunity to network – I met 5 or 6 of my current colleagues through BathML before I even knew about Mango! (If you’re in or near Bristol or London, Bristol Data Scientists and LondonR are both hosted by Mango and new members are always welcome!)

Postlude

Of course, everything I’ve just said is coming from my point of view and is entirely based on my own experiences.

For example, I’ve talked about coding quite a lot because I personally code quite a lot; and I code quite a lot because I enjoy it. That might not be the case for you. That’s fine. In fact it’s more than “fine”; the huge diversity in people’s backgrounds and interests is what makes data science such a fantastic field to be working in right now.

Maybe you’re interested in data visualisation. Maybe you’re into webscraping. Or stats. Or fintech, or NLP, or AI, or BI, or CI. Maybe you are the relative at Christmas dinner who won’t stop banging on about why you should NEVER, under ANY circumstances, merge cells in an Excel spreadsheet (UNLESS it is PURELY for purposes of presentation).

Oh, why not:

  1. Find the parts of data science that you enjoy and arrange them so that they work for you.

Nic Crane, Data Scientist

At Mango, we’re seeing more and more clients making the decision to modernise their analytics process; moving away from SAS and on to R, Python, and other technologies. There are a variety of reasons for this, including SAS license costs, the increase of recent graduates with R and Python skills, SAS becoming increasingly uncommon, or the need for flexible technologies which have the capability for advanced analytics and quality graphics output.

While such transitions are typically about much more than just technology migration, the code accounts for a significant degree of the complexity. So, in order to support our clients, we have developed a software suite to analyse the existing SAS code and simplify this process.

So how can a SAS Code Health Check help you decide on how to tackle this kind of transformation?

1. Analyse procedure calls to inform technology choice


Using the right technology for the right job is important if we want to create code which is easy to maintain for years, saving us time and resources. But how can we determine the best tool for the job?

A key part of any SAS code analysis involves looking at the procedure calls in the SAS codebase to get a quick view of the key functionality. For example, we can see from the analysis above that this codebase mainly consists of calls to PROC SORT and PROC SQL – SAS procedures which reorder data and execute SQL commands used for interacting with databases or tables of data. As there are no statistics-related procs, if we migrate this application away from SAS we may decide to move this functionality directly into the database. The second graph shows an application which has a high degree of statistical functionality, using the FORECAST, TIMESERIES, and ARIMA procedures to fit complex predictive time series models. As R has sophisticated time series modelling packages, we might decide to move this application to R.
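As a rough illustration of the idea (a hypothetical sketch, not Mango’s actual health check tooling), counting procedure calls across a directory of SAS scripts in R could look something like this, assuming the scripts sit under a sas_src/ folder:

library(stringr)

sas_files <- list.files("sas_src", pattern = "\\.sas$",
                        recursive = TRUE, full.names = TRUE)

proc_calls <- unlist(lapply(sas_files, function(f) {
  code <- toupper(readLines(f, warn = FALSE))
  # pull out the procedure name following each PROC keyword
  matches <- str_match_all(code, "\\bPROC\\s+([A-Z0-9_]+)")
  unlist(lapply(matches, function(m) m[, 2]))
}))

# frequency table of procedures, most-used first
sort(table(proc_calls), decreasing = TRUE)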

2. Use macro analysis to find the most and least important components of an application

Looking at the raw source code doesn’t give us any context about what the most important components of our codebase are. How do we know which code is most important and needs to be prioritised? And how can we avoid spending time redeveloping code which has been written, but is never actually used?

We can answer these questions by taking a look at the analysis of the macros and how often they’re used in the code. Macros are like user-defined functions which can combine multiple data steps, proc steps, and logic, and are useful for grouping commands we want to call more than once.

Looking at the plot above, we can see that the transfer_data macro is called 17 times, so we know it’s important to our codebase. When redeveloping the code, we might want to pay extra attention to this macro as it’s crucial to the application’s functionality.

On the other hand, looking at load_other, we can see that it’s never called – this is known as ‘orphaned code’ and is common in large legacy codebases. With this knowledge, we can automatically exclude this code to avoid wasting time and resources examining it.
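In the same hypothetical spirit, a simplified sketch of the macro analysis: find each %macro definition and count how often its name is invoked elsewhere, flagging the ones that are never called.

library(stringr)

sas_files <- list.files("sas_src", pattern = "\\.sas$",
                        recursive = TRUE, full.names = TRUE)
sas_code  <- toupper(unlist(lapply(sas_files, readLines, warn = FALSE)))

# macro names declared via %MACRO name
defined <- unique(unlist(lapply(
  str_match_all(sas_code, "%MACRO\\s+([A-Z0-9_]+)"),
  function(m) m[, 2]
)))

# number of times each macro is invoked as %name (definition lines don't match this pattern)
calls <- sapply(defined, function(m) sum(str_count(sas_code, fixed(paste0("%", m)))))

sort(calls)  # macros with a count of zero are candidates for orphaned code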

3. Look at the interrelated components to understand process flow

When redeveloping individual applications, planning the project and allocating resources requires an understanding of how the different components fit together and which parts are more complex than others. How do we gain this understanding without spending hours reading every line of code?

The process flow diagram above allows us to see which scripts are linked to other scripts. Each node represents a script in the codebase, and is scaled by size. The nodes are coloured by complexity. Looking at the diagram above, we can instantly see that the create_metadata script is both large and complex, and so we might choose to assign this to a more experienced developer, or look to restructure it first.
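Again purely as an illustration, one way to sketch such a diagram yourself is to parse %include statements and hand the resulting edges to igraph (same hypothetical sas_src/ layout as above):

library(stringr)
library(igraph)

sas_files <- list.files("sas_src", pattern = "\\.sas$",
                        recursive = TRUE, full.names = TRUE)

# one edge per %include: the including script points at the included script
edges <- do.call(rbind, lapply(sas_files, function(f) {
  code <- toupper(readLines(f, warn = FALSE))
  incs <- unlist(lapply(str_match_all(code, "%INCLUDE\\s+['\"]?([A-Z0-9_./]+)"),
                        function(m) m[, 2]))
  if (length(incs) == 0) return(NULL)
  data.frame(from = basename(f), to = basename(tolower(incs)))
}))

g <- graph_from_data_frame(edges)
plot(g, vertex.size = 8, edge.arrow.size = 0.3)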

4. Examine code complexity to assess what needs redeveloping and redesigning

Even with organisational best practice guidelines, there can still be discrepancies in the quality and style of code produced when it was first created. How do we know which code is fit for purpose, and which code needs restructuring so we can allocate resources more effectively?

Thankfully, we can measure ‘cyclomatic complexity’, which assesses how complex the code is. The results of this analysis will determine whether the code needs to be broken down into smaller chunks, how much testing is needed, and which code needs to be assigned to more experienced developers.
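For a feel of what the metric measures, the cyclocomp package computes the same statistic for R functions; here is a tiny, hypothetical illustration (the SAS-side measurement itself is part of the health check tooling):

library(cyclocomp)

straight_through <- function(x) x + 1

branchy <- function(x) {
  if (x > 0) {
    for (i in seq_len(x)) {
      if (i %% 2 == 0) message(i)
    }
  } else if (x < 0) {
    warning("negative input")
  }
  x
}

cyclocomp(straight_through)  # one path through the code: low complexity
cyclocomp(branchy)           # every branch and loop adds a decision point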

5. Use the high level overview to get an informed perspective which ties into your strategic objectives

Analytics modernisation programs can be large and complex projects, and the focus of a SAS Code Health Check is to allow people to make well-informed decisions by reducing the number of unknowns. So, how do we prioritise our applications in a way that ties into our strategic objectives?

The overall summary can be used to answer questions around the relative size and complexity of multiple applications, making it possible to estimate more accurately the time and effort required for redevelopment. Custom comparison metrics can be created on the basis of strategic decisions.

For example, if your key priority is to consolidate your ETL process, you might first focus on the apps which have a high number of calls to PROC SQL. Or you might have a goal of improving the quality of your graphics, and so you’ll focus on the applications which produce a large number of plots. Either way, a high level summary like the one below collects all the information you need in one place and simplifies the decision-making process.

SAS conversion projects tend to be large and complicated, and require deep expertise to ensure their success. A SAS Health Check can help reduce uncertainty, guide your decisions and save you time, resources and, ultimately, money.

If you’re thinking of reducing or completely redeveloping your SAS estate, and want to know more about how Mango Solutions can help, get in touch with our team today via sales@mango-solutions.com or +44 (0)1249 705 450.


We have been working within the Pharmaceutical sector for over a decade. Our expertise, knowledge of the industry, and presence in the R community mean we are used to providing services within a GxP environment and providing best practice.

We are excited to be at three great events in the coming months. Find us at:

• 15th Annual Pharmaceutical IT Congress, 27-28 September – London, England
• Pharmaceutical Users Software Exchange (PhUSE), 8-11 October 2017 – Edinburgh, Scotland
• American Conference on Pharmacometrics (ACoP8), 15-18 October – Fort Lauderdale, USA

Our dedicated Pharma team will be at the events to address any of your questions or concerns around data science and using R within your organisation.

How Mango can help with your Data Science needs

A validated version of R

The use of R is growing in the pharmaceutical sector, and it’s one of the enquiries we get most often at Mango, so we’d love to talk to you about how you can use it in your organisation.

We know that a major concern for using R within the Pharma sector is its open source nature, especially when using R for regulatory submissions.

R contains many capabilities specifically aimed at helping users perform their day-to-day activities, but with the concerns over meeting compliance, understandably some companies are hesitant to make the move.

To eliminate risk, we’ve created ValidR – a series of scripts and services that deliver a validated build of R to an organisation.

For each validated package, we apply our ISO9001 accredited software quality process of identifying requirements, performing code review, testing that the requirements have been met and installing the application in a controlled and reproducible manner. We have helped major organisations adopt R in compliance with FDA 21 CFR Part 11 guidelines on open source software.

Consultancy

We have helped our clients adopt or migrate to R by providing a range of consultancy services from our unique mix of Mango Data Scientists who have both extensive technical and real-world experience.

Our team of consultants have been deployed globally in projects including SAS to R migration, Shiny application development, script validation and much more. Our team also provide premier R training, with courses designed specifically for the Pharmaceutical sector.

Products

Organisations today are looking not only at how they can validate their R code but also at how that information is retained, shared and stored across teams globally. Our dedicated validation team has a specialised mix of software developers who build rich analytic web and desktop applications using technologies such as Java, .NET and JavaScript.

Our applications, ModSpace and Navigator, have been deployed within Pharma organisations globally. Both help organisations maintain best practice and achieve a ‘validated working environment’.

Why Mango?

All of our work, including the support of open source software such as R, is governed by our Quality Management System, which is regularly audited by Pharmaceutical companies in order to ensure compliance with industry best practices and regulatory guidelines.

Make sure you stop by our stand and talk to us about how we can help you make the most of your data!