
ABOUT THE BOOK:

With the open source R programming language and its immense library of packages, you can perform virtually any data analysis task. Now, in just 24 lessons of one hour or less, you can learn all the skills and techniques you'll need to import, manipulate, summarize, model, and plot data with R; formalize analytical code; and build powerful R packages using current best practices.

Each short, easy lesson builds on all that’s come before: you’ll learn all of R’s essentials as you create real R solutions.

R in 24 Hours, Sams Teach Yourself covers the entire data analysis workflow from the viewpoint of professionals whose code must be efficient, reproducible, and suitable for sharing with others.

 

WHAT YOU’LL LEARN:

You’ll learn all this, and much more:

  • Installing and configuring the R environment
  • Creating single-mode and multi-mode data structures
  • Working with dates, times, and factors
  • Using common R functions, and writing your own
  • Importing, exporting, manipulating, and transforming data
  • Handling data more efficiently, and writing more efficient R code
  • Plotting data with ggplot2 and lattice graphics
  • Building the most common types of R models
  • Building high-quality packages, both simple and complex – complete with data and documentation
  • Writing R classes: S3, S4, and beyond
  • Using R to generate automated reports
  • Building web applications with Shiny

Step-by-step instructions walk you through common questions, issues, and tasks; Q&As, quizzes, and exercises build and test your knowledge; "Did You Know?" tips offer insider advice and shortcuts; and "Watch Out!" alerts help you avoid pitfalls.

By the time you’re finished, you’ll be comfortable going beyond the book to solve a wide spectrum of analytical and statistical problems with R.

If you find that you have some time on your hands and would like to enhance your skills, why not teach yourself R in 24 hours?

The data and scripts that accompany the book can be accessed on GitHub, and the accompanying mangoTraining package can be installed from CRAN by running the following in R: install.packages("mangoTraining")
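
For example, a minimal R session to get set up might look like this (the dataset listing you see will depend on the package version installed):

    # Install the companion package from CRAN (only needed once)
    install.packages("mangoTraining")

    # Load the package and list the datasets it ships with
    library(mangoTraining)
    data(package = "mangoTraining")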

 

ORDERING A COPY OF THIS BOOK:

If you’d like to order a copy use the following ISBN codes:

ISBN-13: 978-0-672-33848-9

ISBN-10: 0-672-33848-3

Authors: Andy Nicholls, Richard Pugh and Aimee Gott.

the world of data in 2020

 

With 2020 almost upon us, those of us involved in the production, management and utilisation of data are set for yet another fascinating year. The volume of data produced will continue its inexorable rise, meaning that those who understand how to harness the commercial value of data will only become more important to organisations. So I look to the future with optimism! But as I do so, there is one thing I hope those involved with data will talk about less next year, and one important area I hope they will talk about more.

 

Let's start with the topic I want to hear less about in 2020: data governance. Sure, governance is essential to creating an organisation that uses data to power its business, but I view a governance strategy as 'table stakes'. To give some context to my point, Caroline Carruthers and Peter Jackson's fantastic book "The Chief Data Officer's Playbook" does a great job of defining three generations of Chief Data Officers (CDOs):

  • First Generation CDOs (FCDOs) – those who put in place the building blocks for future success, with a focus on governance, architecture, engagement and delivering quick wins
  • Second Generation CDOs (SCDOs) – use the foundations created by the FCDOs to “make the vehicle sing”, producing repeatable value for the organisation to show how data can be used at the core of business
  • Third Generation CDOs (TCDOs) – support the transition of a data-first approach into a “business as usual” state across the organisation

 

Now, as someone who works in the data industry and advises organisations on their data and analytics strategy, I go to a lot of "data" conferences, and I've noticed something peculiar: most "data leader" conferences these days are badged as having primarily second- or third-generation themes (you know, the exciting stuff!), but when you get there, they are just talking about governance… again. The conversation hasn't really moved on from first-generation topics yet, and I believe the industry is crying out to focus on subjects such as generating value through AI rather than another rehash of data governance strategies. So, whilst I agree that data governance is important, I'd really like to talk about something else in 2020!

 

One of the many areas I'd like to hear talked about MORE in 2020 is Analytic Ethics. To explain: earlier this month I read an excellent post by Ryan den Rooijen on this subject, and I'm absolutely in agreement with him that we need to start talking about Analytic Ethics, because by not doing so there is a real chance we could do serious harm to the perception of data and to its inherent value.

For me, there are three "ethical" considerations when using data and analytics to drive decision making:

  • Data Ethics – a mature conversation, aligned to recent regulations, about our responsible handling of data and the limitations on its use
  • Analytic Ethics – the ethics that guide the conversion of data into the wisdom that informs a decision
  • Business Ethics – the framework that governs our business practices

Ryan's post talks, quite rightly, about the Transparency, Accountability and Morality of analytics. However, I'm also concerned about the impact that analytics will have on business ethics, and I'm just not sure we're ready for it. Let me use a simple example to illustrate (loosely based on a real-life scenario that was, thankfully, narrowly avoided). Imagine a company wants to use its data to market to a large group of people, and assume it absolutely has the rights to use the data in this way. The idea is that staff will call people on this huge list and offer them a product. However, having read an article, the Head of Customer Experience realises that analytics could be used to prioritise the list (good old propensity modelling, lift charts and so on). So their Junior Data Scientist goes away, fits a model to the data, and creates an algorithm that is neatly embedded in the call centre's system. The call centre staff are prompted to phone people in a certain order, which produces a significant lift over calling people at random. Everyone is happy – another great use of analytics.
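
As an aside, the modelling step in this story needn't be anything exotic – a minimal sketch of the kind of propensity scoring described, with an entirely hypothetical data frame and column names, might be:

    # Hypothetical historical data: 'purchased' is a 0/1 outcome from a past campaign
    model <- glm(purchased ~ age + tenure + previous_spend,
                 data = past_campaign, family = binomial)

    # Score the new call list and prompt staff to phone the highest-propensity people first
    call_list$propensity <- predict(model, newdata = call_list, type = "response")
    call_list <- call_list[order(-call_list$propensity), ]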

However, what really happened is that the list contained a number of vulnerable people who, when contacted, would buy anything out of fear, potentially getting themselves into financial difficulty and suffering a great deal of stress.

In this example, how could this outcome have been avoided? Where does the "liability" lie? With the data scientist, who doesn't yet have the experience to understand why their model prioritised this subgroup? Or with the business, which doesn't understand modelling and just sees this as a "better" way of working? It's a thorny and complex topic (as much moral as technical in scope), and one that we, as an industry, have to talk about in 2020. It'll be a challenge, I'm sure, but I'm confident that, with the right conversations, we'll come to the right conclusions.

Looking forward again to 2020, I return to my sense of optimism about the future; the possibilities for driving value from data are only going to increase, and our ability to do so will be limited only by our intellect and imagination. Here's to a fantastic next 12 months!

data science & star wars

 

Not many people know this, but the First Order (the bad guys from the latest Star Wars films) once created a Data Science team. It all ended very badly, but based on some intel smuggled out by a chippy R2 unit, we were able to piece together the story…

 

 

Analytics: Expectation vs Reality

Now, of course this is just (data) science fiction, but the basic plot will be familiar to many of you.

The marketing hype around AI and Data Science over the last few years has really raised the stakes in the analytics world.  It’s easy to see why – if you’re a salesperson selling AI software for £1m, then you’re going to need to be bullish about how many millions it is going to make/save the customer.

The reality though is that Data Science can add enormous value to an organisation, but:

  • It isn’t magic
  • It won’t happen overnight
  • It’s very difficult if the building blocks aren’t in place
  • It’s more about culture and change than algorithms and tech

So, how do we deal with a situation where leaders (whether they be evil Sith overlords or just impatient executives) have inflated expectations about what is possible (and have possibly over-invested on that basis)?

 

Education is key

With so much buzz and hype around analytics, it’s unsurprising that leadership are bombarded with an array of confusing terminology and unrealistic promises.  To counter that, it is important that Data Science teams look to educate the business and leadership on what these terms really mean.  In particular, we need to educate the business on the “practical” application of data science, what the possibilities are, and the potential barriers to success that exist.

 

Create a repeatable process

Once we’ve educated the business about the possibilities of analytics, we need to create a repeatable delivery process that is understood from both analytic AND business perspectives.  This moves the practice of analytics away from “moments of magic” producing anecdotal success to a process that is understandable, repeatable and produces consistent success.  Within this, we can establish shared understanding about how we will prioritise effort, measure success, and overcome the barriers to delivering initiatives (e.g. data, people, change).

 

Be consistent

Having established the above, we must engage with the business and leadership using our new consistent language and approach. This will ensure the business understands the steps being carried out and the likelihood of success or failure. After all, if there's no signal in your data, you can't conjure accuracy from nowhere – making sure your stakeholders understand this (without getting into the detail of accuracy measures) is an important enabler of engaging effectively with them.

 

Summary

Being in a situation where the value and possibilities of data science have been significantly over-estimated can be very challenging.  The important thing is to educate the business, create a repeatable process for successful delivery and be consistent and clear about the realities and practicalities of applying data science.

Then again, if your executive sponsor starts wielding a Lightsaber – I’d get out quickly.

 

our graduate assessment day

 

Following on from the success of our recent graduate intake, we are already looking to find three more graduates and one year-long placement to join us in September 2020. Our placements and interns have been an integral part of Mango for several years now, and we're proud to say that every single intern has come back once they've finished university and joined us as a permanent employee.

Mango recently hosted our very first graduate assessment day. We thought that an assessment day would give us a better chance to really get to know the applicants, and to show them what life at Mango is like – and it certainly did just that!

As wonderful as our current graduate intake is, I have to admit that all four of them are male. As signatories of the Tech Talent Charter, and supporters of Women in Data, we were determined to change that statistic this year. I'm pleased to say that of the eight candidates at the assessment day, four were male and four were female. Mango is also justifiably proud of the diversity of backgrounds in our data science team – and this cohort was similarly diverse, with representatives from five different subjects and four different universities.

Following the recent Data Science Skills Survey – created in partnership with Women In Data UK and Datatech – which highlighted a national data science skills shortage, we were delighted to receive over 60 applications for the three graduate roles. We have already whittled these down to the top six candidates, who will move forward to the next stage of the application process to become a Mango graduate.

The next part of the process is about assessing skills, which we do by defining what we call a Minimally Viable Data Scientist – what we expect our graduates to achieve by the end of the graduate programme. We put exercises in place throughout the day to assess current skills as well as potential.

The more 'technical' skills were assessed at interview, whilst the softer skills, which are essential for our consultancy projects, were tested in individual and group exercises. We tasked the candidates with imagining a new project with Bath Cats and Dogs Home and thinking about how that might play out.

We're proud of the feedback we received at the end of the day. We consciously set out for the day to be two-way – we wanted the candidates to want to work for Mango just as much as we wanted to employ them. Candidates' feedback described the day as "refreshingly open", "actually enjoyable" and "not as daunting as I'd thought an assessment day would be".

We’ve now got the incredibly difficult decision of which of the brilliant candidates to make offers to!

ODSC Europe 2019

Thanks to the University of Bath and Mango, I got the chance to volunteer at and attend the Open Data Science Conference (ODSC) Europe 2019. This year the event saw around 1,000 data-driven individuals flock to London for four days filled with workshops, tutorials and talks. My time there gave me plenty to think about, so here are some of my personal highlights!

Michael Wooldridge, PhD – The Future is Multiagent!

Having researched Artificial Intelligence for 30 years, the last seven of those as head of the University of Oxford's Computer Science department, Michael Wooldridge knows what he's talking about. He shared with us his vision for the future of AI: multiagency! Think Siri talking to Siri, or Tesla talking to Tesla. With self-interested computational systems becoming more common, it will eventually become vital that they can figure out not just what's best for themselves, but what's best for their whole ecosystem. The example Michael gave involved two supermarket stock-taking robots that needed to cross paths to complete their respective tasks. Without knowledge of each other, the two robots would impede each other simply by getting in each other's way – much as we communicate to complete tasks, these machines need the ability to communicate and figure out an optimal solution that suits them both. It's easy to see how this small problem could be scaled up, for example to autonomous cars on a road, or to an automated meeting scheduler (which Michael is very keen for someone to develop). This is definitely an area I look forward to seeing advance, and maybe even becoming part of our everyday lives!

Cassie Kozyrkov, PhD – Making Data Science Useful

Cassie tried to make our lives as Data Scientists easier by splitting the job into smaller, better-defined roles (e.g. Analyst, Statistician, Machine Learning Engineer) and pointing out that excelling at all of them is near enough impossible! These roles require vastly different skill sets, and one of the biggest differences is the way of thinking. For example, analysts are there to explore the data and formulate questions, which the statisticians are then there to rigorously test and answer. This leads to the revelation that "Data Science is a team sport!" I personally couldn't agree more, especially as someone just starting out in the field. Having the opportunity to work alongside other Data Scientists during my time at Mango has helped me develop and has taught me so much more than working independently would have. Cassie's main message was that you don't have to know everything: as long as you've got a passion for data and the drive to let that data inform better decisions, you can go a long way.

Ian Ozsvald – Tools for High Performance Python

Imagine being able to run code in 30 seconds that had been taking 90 minutes! Ian taught us how to do just that using some Python cleverness (and it didn't even seem that hard!). His first tip: find out where your code is slow – otherwise how will you know where to speed it up? Profilers are a great tool for this – check out Robert Kern's line_profiler for line-by-line profiling.

Once you've figured out where to focus your effort, it's time to speed things up. Ian talked us through his approach really nicely, starting with simple changes such as swapping your for loops for Pandas' apply, then using the raw argument. Before you know it, you'll be utilising the multiple cores on your device with Swifter and Dask. But beware: there's a trade-off between the time spent refactoring code and the gains you actually make, which for me was definitely the take-home message from Ian's talk. As someone with limited experience in Python, there were still plenty of technical takeaways, both in terms of technology for when I do pick up Python (Dask and Numba) and techniques I can try next time my R code takes forever.
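
On that note, the same first step applies in R: profile before you optimise. A minimal sketch using R's built-in sampling profiler (my_slow_function here is just a placeholder for whatever is dragging):

    # Turn on R's built-in sampling profiler (no packages required)
    Rprof("profile.out")
    result <- my_slow_function(big_data)  # placeholder for the slow code
    Rprof(NULL)

    # Summarise time spent by function; attack the top of this list first
    summaryRprof("profile.out")$by.self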

Rishabh Mehrotra, PhD – Multi-stakeholder Machine Learning for Marketplaces

Rishabh, a Senior Research Scientist at Spotify, gave a very interesting talk detailing why it's important to consider all of your stakeholders when using methods like Machine Learning within your company. Take a company like Just Eat: they could easily optimise just for their app users (the people receiving the food). However, there are two other key stakeholders: the delivery workers and the restaurants themselves. If they optimise only for the end user, this may put too much pressure on delivery drivers and restaurants, causing them to leave the platform – reducing the options available to the customer, which may lead to them choosing another service. The first step in this process is realising who your stakeholders are, then thinking about how they interact with each other. Traditional recommender engines may not meet all stakeholders' needs; in Spotify's case there is a trade-off between relevance, fairness and satisfaction. Showing all users the most relevant artists may keep the majority of streamers happy, but is this fair to new artists trying to make an impact? Yet if we present too many of these less relevant artists, will this have a negative effect on the user's satisfaction? Spotify has certainly got a lot of interesting problems, and if you want more information on Rishabh's work, check out his website.
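
As a toy illustration of that trade-off (all of the scores and weights below are invented for the example), a multi-stakeholder recommender might rank items on a blend of objectives rather than on relevance alone:

    # Invented per-track scores on a 0-1 scale
    tracks <- data.frame(
      track        = c("A", "B", "C"),
      relevance    = c(0.9, 0.6, 0.4),  # match to the listener's taste
      fairness     = c(0.1, 0.7, 0.9),  # exposure for newer artists
      satisfaction = c(0.8, 0.7, 0.5)   # predicted listener satisfaction
    )

    # Blend the objectives with hypothetical business-chosen weights
    weights <- c(relevance = 0.5, fairness = 0.2, satisfaction = 0.3)
    tracks$blended <- as.numeric(as.matrix(tracks[, names(weights)]) %*% weights)

    # Rank on the blended score instead of relevance alone
    tracks[order(-tracks$blended), ]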

Sudha Subramanian – Identifying Heart Disease Risk Factors from Clinical Notes

Natural Language Processing (NLP) was a frequent topic at this year's ODSC Europe, and I found Sudha's case study to be a really good example of NLP in practice. Her problem: she wanted to use clinical notes scribbled by practitioners to help identify the presence of heart disease risk factors in patients. Of course, notes from a GP come in widely differing forms – for example, different GPs will use different abbreviations to mean the same thing. BERT was a common theme through all of the NLP talks: a new method of pre-training language representations that has produced some very impressive results. It uses dynamic word embeddings, so a word is converted from text to a vector representation that takes into account the context the word is being used in by looking at the surrounding text. In Sudha's case BERT worked really well, outperforming human annotators when it came to identifying the common risk factors for heart disease. This work reminded me of some of the 'Data for Good' lightning talks we saw at EARL back in September. Sudha's work will allow common risk factors for heart disease to be identified sooner after an appointment, so they can be acted upon faster, giving patients the best possible chance of changing their lifestyle and preventing the disease.

Of course, these weren’t the only talks that I got to see, and choosing my favourites to mention was an incredibly tricky task. If you want to check out any of the talks then take a look at the ODSC Europe 2019 website.

Finally another thank you to Mango, the University of Bath and the ODSC conference for the chance to attend and help out!

Author: Jack Talboys, Data Scientist, Mango Solutions

If you have any requirements regarding the use or implementation of R or Python for your business, then contact us at info@mango-solutions.com or go to our contact us page.