
Both R and distributed programming rank highly on my list of “good things”, so imagine my delight when two new packages used for distributed programming in R were released:

ddR (https://github.com/vertica/ddR) and

multidplyr (https://github.com/hadley/multidplyr)

 

Distributed programming is normally taken up for a variety of reasons:

  • To speed up a process or piece of code
  • To scale up an interface or application for multiple users

There has been a huge appetite for this in the R community for a long time, so my first thought was: “Why now? Why not before?”

From a quick look at CRAN’s High Performance Computing page, we can see the mass of packages already available for related problems. None of them has quite the same focus as ddR and multidplyr, though. Let me explain. R has many features that make it unique and great: it is high-level, it is interactive and, most importantly, it has a huge number of packages. It would be a huge shame to lose these features, or to be unable to use those packages, when writing R code to be run on a cluster.

Traditionally, distributed programming has sat in contrast with these principles, with much more focus on low-level infrastructure, such as communication between nodes on a cluster. Popular R packages that dealt with this in the past are the now-deprecated snow and multicore (released on CRAN in 2003 and 2009 respectively). However, working with the low-level functionality of a cluster can detract from analysis work because it requires a slightly different skill set.

In addition, the needs of R users are changing, in part because of big data. Data scientists now need to run experiments on, analyse and explore much larger data sets, where computations can be time-consuming. Given the fluid nature of exploratory analysis, that waiting time can be a huge hindrance. For the same reason, there is a need to be able to write parallelised code without having to think too hard about low-level considerations, code that is fast to write as well as easy to read. My point is that fast, parallelised code should not just be for production; the answer is an interactive scripting language that can be run on a cluster.

The package written to replace snow and multicore is the parallel package, which includes modified versions of both. It starts to bridge the gap between R and more low-level work by providing a unified interface to cluster management systems. The big advantage of this is that your R code stays the same regardless of which protocol is used to communicate with the cluster under the covers.
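
As a toy illustration of that unified interface, here is a minimal sketch (my own example, not taken from either package) that starts a small local cluster and splits a trivial computation across its workers:

# A minimal sketch of the parallel package's unified interface
# (runs on a local machine; the same code works against other
# cluster types supported by makeCluster())
library(parallel)

cl <- makeCluster(4)                              # start 4 local worker processes
squares <- parLapply(cl, 1:8, function(x) x^2)    # split the calls across the workers
stopCluster(cl)                                   # always release the workers when done

str(squares)   # a list of 8 results, computed in parallel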

Another huge advantage of the parallel package is the “apply”-type functions provided through this unified interface. This is an obvious but powerful way to extend R with parallelism, because any call to an “apply” function with, say, FUN = foo can be split into multiple calls to foo, executed at the same time. The recently released packages ddR and multidplyr build on the functionality provided by the parallel package. They are similar in many ways, the most significant being that both introduce new data types designed specifically for parallel computing. Functions on these data types “partition” the data, describing how work can be split amongst multiple nodes, and a collect function gathers the pieces of work and combines them into a final result.

ddR also reimplements many base functions on the distributed data types, for example rbind and tail. ddR is written by the Vertica Analytics group, owned by HP, and is designed to work with HP’s distributedR, which provides a platform for distributed computing with R.

Hadley Wickham’s package, multidplyr, also works with distributedR, in addition to snow and parallel. Where multidplyr differs from ddR is that it is written to be used with the dplyr package. All methods provided in the dplyr package are overloaded to work with the data types provided by multidplyr, furthering Hadley’s ecosystem of R packages.

After a quick play with the two packages, many more differences emerge.

The package multidplyr seems more suited to data-wrangling, much like its single-threaded equivalent, dplyr.

The partition() function can be given a series of vectors that describe how the data should be partitioned, very much like the group_by() function:

# Extract of code that uses the multidplyr package
library(dplyr)
library(multidplyr)
library(nycflights13)
planes %>% partition() %>% group_by(type) %>% summarize(n())
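
To pull the summarised result back into the local session, the pipeline can be finished with dplyr’s collect() generic, for which multidplyr provides a method for its partitioned data frames. This is a sketch continuing the snippet above, rather than code from the package’s documentation:

# Hypothetical continuation: collect() brings the per-partition summaries
# back from the workers into the local R session
planes %>% partition() %>% group_by(type) %>% summarize(n = n()) %>% collect()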

However, ddR has a very different “flavour”, with a stronger algorithmic focus, as can be seen from the example packages implemented with ddR: randomForest.ddR, kmeans.ddR and glm.ddR. As the code snippet below shows, certain algorithms, such as random forests, can be parallelised very naturally. Unlike multidplyr, the partition() function does not give the user control over how the data is split. However, the collect() function provides an index argument, which gives the user control over which workers to collect results from. The list returned by collect() can then be fed into do.call() to aggregate the results, for example using randomForest::combine().

# Skeleton code for implementing very primitive version of random forests using ddR
library(ddR)
library(randomForest)
multipleRF <- dlapply(1:4, function(n) {
  # each worker fits an independent forest on the same (small) data set
  randomForest::randomForest(Ozone ~ Wind + Temp + Month,
                             data = airquality,
                             na.action = na.omit)
})

# gather the four forests back from the workers
# (collect(multipleRF, index = 1) would fetch just the first worker's forest)
listRF <- collect(multipleRF)

# combine them into a single random forest object
res <- do.call(randomForest::combine, listRF)

To summarise, distributed programming in R has been evolving slowly for a long time, but now, in response to high demand, many tools are being developed to suit the needs of R users who want to run different types of analysis on a cluster. The prominent themes are as follows:

  • Parallel programming in R should be high-level.
  • Writing parallelised R code should be fast and easy, and not require too much planning.
  • Users should still be able to access the same libraries that they usually use.

Of course, some of the packages mentioned in this post are very young. However, due to the need for such tools, they are rapidly maturing, and I look forward to seeing where they go in the very near future.

Author: Paulin Shek


As more and more Data Science moves from individuals working alone, with small data sets on their laptops, to more productionised, or analytically mature settings, an increasing number of restrictions are being placed on Data Scientists in the workplace.

Perhaps your organisation has standardised on a particular version of Python or R, or perhaps you’re using a limited subset of all available big data tools. This sort of standardisation can be incredibly empowering for the business. It ensures all analysts are working with a common set of tools and allows analyses to be run anywhere across the organisation. It doesn’t matter whether it’s a laptop, a server, or a large-scale cluster: Data Scientists, and the wider business, can be safe in the knowledge that the versions of your analytic tools are the same in each environment.

While incredibly useful for the business, this can, at times, feel very restricting for the individual Data Scientist. Maybe you want to try a new package that isn’t available for your ‘official’ version of R, or you want to try a new tool or technique that hasn’t made it into your officially supported environment yet. In all of these instances, a Data Science Lab or Analytic Lab environment can prove invaluable for keeping pace with the fast-moving data science world outside your organisation.

An effective lab environment should be designed from the ground up to support innovation, both with new tools and with new techniques and approaches. For the most part, it’s rare that any two labs would be the same from one organisation to the next; however, the principles behind their implementation and operation are universal. The lab should provide a sandbox of sorts, where Data Scientists can work to improve what they do currently, as well as prepare for the challenges of tomorrow. A well-implemented lab can be a source of immense value to its users, as it can be a space for continual professional development. The benefits to the business, however, can be even greater. By giving your Data Scientists the opportunity to be a part of driving requirements for your future analytic solutions, and by basing those solutions on solid foundations derived from experiments and testing performed in the lab, the business can achieve and maintain true analytic maturity and meet new analytic challenges head-on.

In order to successfully implement a lab in your business, you must first establish the need. If your Data Scientists are using whatever tools are handy and nobody has a decent grasp on what tools are used, with what additional libraries, and at what versions, then you have bigger fish to fry right now and should come back when that’s sorted out!

If your business analytic landscape is well understood and documented, the first task is to identify and distil your existing tool set into a set of core tools. As these tools constitute the day-to-day analytic workhorses of your business, they will form the backbone of the lab. In a lot of cases, this may be a particular Hadoop distribution and version, or perhaps a particular version of Python with scikit-learn and NumPy, or a combination.

The next step can often be the most challenging, as it typically requires moving outside of the Data Science or Advanced Analytics team and working closely with your IT department to provision the environments upon which the lab will be based. Naturally, if you’re lucky enough to have a suitable Data Engineer or DataOps professional on your team, then you may avoid this requirement. A lot of that is going to depend on the agility model of your business and how reliant on strict silos it is.

Ideally, any environments provisioned at this stage should be capable of being rapidly re-provisioned and re-purposed as needs arise, so working with a modern infrastructure is a high priority. It’s often wise at this stage to consider some form of image management for containers or VMs, to speed deployment and ensure environments are properly managed. You need to be able to adapt the environment to the changing needs of the user base with the minimum of effort and fuss.

Once you have rapidly deployable environments at your disposal, you’re ready to start work. What form that work takes should be left largely up to your Data Science team, but broadly speaking they should be free to use and evaluate new tools or approaches. Remember, the lab is not a place where production work is done with ad hoc tools, it’s a safe space for experimentation and innovation, just like a real laboratory environment. Using the knowledge gained from running tests or trials in the lab however, can and should inform the evolution of your production tools and techniques.

A final word of warning for the business: A successful lab environment can’t be achieved through lip-service. The business must set aside time for Analysts or Data Scientists to develop the future analytic solutions that are increasingly becoming central to the success of the modern business.

For more information, or to get help building out an Analytics Lab of your own, or even if you’re just starting your journey on the path to analytic maturity, contact info@mango-solutions.com

Author:  Mark Sellors, Mango Solutions


Since we first demoed it at our really successful trip to Strata London last year, a few people have asked us how we made the awesome looking Data Science Radar app that we were running on the tablets we had with us. In this post we’ll take a look at how we did it, and hopefully show you how easy it is to do yourself.

Mango is primarily known for its work with the R language, so it should come as no surprise that this is the secret sauce used in the creation of the app. More specifically, we used a Shiny app written by one of our Senior Data Scientists, Aimee Gott. The app uses the radarchart package, which you can find on GitHub.
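
To give a flavour of what that looks like, here is a stripped-down sketch of a Shiny app built around radarchart. This is an illustrative example of ours making, not the actual Data Science Radar app; it assumes the package’s standard htmlwidgets Shiny bindings, chartJSRadarOutput() and renderChartJSRadar(), and uses made-up scores:

# Minimal sketch of a radar-chart Shiny app using the radarchart package
library(shiny)
library(radarchart)

# example axes and made-up scores for a single person
labs <- c("Communicator", "Data Wrangler", "Modeller",
          "Programmer", "Technologist", "Visualiser")
scores <- list("Example person" = c(8, 7, 5, 9, 6, 7))

ui <- fluidPage(
  chartJSRadarOutput("radar", height = "600")     # assumed htmlwidgets output binding
)

server <- function(input, output, session) {
  output$radar <- renderChartJSRadar({            # assumed htmlwidgets render binding
    chartJSRadar(scores = scores, labs = labs, maxScale = 10)
  })
}

shinyApp(ui, server)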

I think the fact that it was written with Shiny has actually surprised a few people, largely because of how it looks and the way that we run it.

The tablets in question are cheap Windows 10 devices, nothing special, but we had to come up with a way of running the application that would be simple enough for the non-R-using members of the team. This meant that anything from the everyday world of R had to be abstracted away or made as simple to use as possible. In turn, this meant not starting RStudio, and not having to type anything in to start the app.

With R and the required packages installed on the tablets, we were ready to start the configuration that would allow the whole Mango team to use them in the high-pressure, high-visibility setting of a stand at an extremely busy conference.

We wrote a simple batch file that would start the app. This only got us part of the way though, because the default browser on Windows 10, Microsoft’s Edge, doesn’t have a full-screen mode, which makes the app look less slick. We therefore changed the default browser to Microsoft’s IE and put it in full-screen mode (with F11) when it first opened. The good news here is that IE remembers that it was in full-screen mode when you close and re-open it, so that’s another problem solved. The app now opens automatically and covers the full screen.

The code for the batch file is a simple one-liner and looks like this:

"C:\Program Files\R\R-3.3.0\bin\Rscript.exe" -e "shiny::runApp('/Users/training2/Documents/dsRadar/app.R', launch.browser = TRUE)"

Next, it was necessary to set the rotation lock on the tablets, to avoid the display flipping round to portrait mode while in use on the stand. This is more cosmetic than anything else, and we did find that the Win10 rotation lock is slightly buggy in that it doesn’t always seem to remember which way round the lock is, so it occasionally needs to be reset between uses. Remember, our app was written specifically for this device, so the layout is optimised for the resolution and landscape orientation; you may want to approach that differently if you try this yourself.

We also found that the on-screen keyboard wasn’t enabled by default with our devices (which have a detachable keyboard), so we had to turn that on in the settings as well.

Having the application start via a Windows batch file isn’t the prettiest way of launching an app, as it opens the Windows command prompt before the application itself. The prompt is hidden behind the application once it’s fully started, but it just doesn’t look good enough. This problem can be avoided with a small amount of VBScript, which runs the contents of the batch file without displaying the command prompt. Unfortunately, the VBScript icon you end up with is pretty horrid; the easiest way to change it is to create a shortcut to the VBScript file and change the icon of the shortcut instead.

Here’s the VBScript:

' Run the batch file in a hidden window (0) and wait for it to finish (True)
Set objShell = WScript.CreateObject("WScript.Shell")
objShell.Run "C:\Users\training2\Desktop\dsRadar.bat", 0, True

Check out the video below to see it in action. We hope you agree that it looks really good, and that you find this simple method of turning a Shiny application into a tablet or desktop app as useful as we do!

 

Author: Mark Sellors

 


Spotlight on Beth Ashlee – Senior Data Scientist

 

Name: Beth Ashlee

Job title: Data Science Consultant

Qualification(s): BSc Biomedical Science

Time in current role:  4 years

Beth Ashlee joined Mango initially as an intern whilst studying Biomedical Science. Four years on, she has recently been promoted to the position of Senior Data Scientist. During this time, she has experienced many diverse opportunities and pathways that have accelerated her analytical competency.

In addition to having been exposed to a myriad of technical scenarios through her delivery of client training in R and Python, Beth spends much of her time collaborating on a variety of projects such as Shiny app development, data exploration and productionising models. One of Beth’s passions is her team lead responsibility for Mango’s graduate recruitment programme, where she actively trains and mentors her team on both professional and personal development.

Beth is a master communicator, which is reflected in the shape of her Data Science Radar – a tool used to assess core Data Science competencies. Soft skills in data science are essential for establishing meaningful relationships and for translating business value across an organisation, an area where Beth certainly excels. Outside of work, Beth enjoys travelling to new places and attending music festivals.

 

Beth’s Top 3 traits: 

  • Programmer 
  • Communicator 
  • Data Wrangler

Beth scores high in both Visualisation and Programming, which ties in with the types of projects she has been working on most recently.

As would be expected given her role as a Consultant and Trainer, Beth scores strongly as a Communicator. During a recent Government project, which required significant stakeholder engagement, these skills proved essential for helping to mobilise teams around the possibilities of advanced analytics.

Beth has identified that modelling is something she needs to work on to become a more well-rounded data scientist. To support this development, she has recently been doing more self-learning and is now working on a client facing modelling project.

Having a thorough understanding of the team’s capabilities and skill levels, mapped against core competencies like these, can help guide and shape the data science project team best suited to the task. The result is a significantly more engaged workforce with a set of skills that the business understands and needs in order to deliver data-driven value. For more information on Data Science Radar, check out our Building a Winning Data Science Team page.

Would you like to join our award-winning 2020 Data IQ Best Data and Analytics Team? Mango are currently recruiting.

 

Related blogs:

Spotlight on a Data Consultant: Karina Marks

Spotlight on a Junior Data Scientist: Joe Russell

EARL Conference 2020

The countdown to VirtuEARL has begun! The online Enterprise Applications of the R Language conference starts on Thursday 8th October with the first part of Max Kuhn’s tidymodels workshop. 

If you can’t attend the live event(s), you don’t need to miss out: ticket holders will be sent a recording of every session they register for, so they can catch up even after the live sessions have finished. Whatever commitments you have or time zone you’re in, you can still enjoy VirtuEARL.

VirtuEARL agenda

  • 8th & 9th October, 2pm-5pm: A quick introduction to Tidymodels workshop – £180
  • 13th October, 2pm-5pm: The Science (and Art) of Data Visualisation in {ggplot2} workshop – £90
  • 14th October, 2pm-5pm: Text Analysis workshop – £90
  • 15th October, 2pm-5pm: Good practices for {shiny} development using {golem} – An Introduction workshop – £90
  • 16th October, 9am-5pm: VirtuEARL presentations and keynote session – £9.99

Profits made from ticket sales will be donated to ‘Data for Black Lives’.

Buy tickets

 

dataIQ award winners

Mango are delighted to have been awarded the 2020 Data IQ Best Data & Analytics Team (Enabler) award as part of the People category. The virtual awards ceremony took place early yesterday evening, with Pete Scott, Mango’s Client Services Director, accepting the award on behalf of the team.

As is sadly the case with virtual ceremonies, there wasn’t a cocktail or DJ in sight; nonetheless, the shortlist comprised very strong competition. Pete Scott said, “It really is fantastic to be recognised for such a prestigious award, designed to showcase the best of the data and analytics industry. Mango’s astonishing team of Data Science Consultants focus on solving real challenges through data and are dedicated to delivering customer-centric, data-driven value.  As a team they deliver expertise and innovative solutions in strategic advice, data and analytic project delivery, through to building analytic team capability.”

The consulting team, consisting of 35 data scientists and engineers with more than 200 years’ combined expertise between them, demonstrates exemplary technical excellence, collaborative working practices and processes, best-practice frameworks and a commitment to proactive stakeholder engagement. The award entry demonstrated these commitments in abundance; in addition, it was the team’s external engagements, notably the community and outreach activities showcasing Mango at the heart of an innovative data science community, that were no doubt recognised by the screening panel.

Mango would like to acknowledge the support of their key stakeholder partnerships, where the benefits of true collaborative relationships are realised. Working through the restrictions imposed by the COVID-19 pandemic has certainly shown the benefit of Mango’s ‘agile’ project management practices, an approach that has allowed for reactive changes in accordance with rapidly changing conditions.

“This is an amazing achievement for Mango”, concluded Pete, “and reflects not only on the brilliance of the consulting team, but also on the support they receive from all areas of our 70-people consultancy. We all celebrate this win.”

Congratulations to all of this year’s worthy award winners and shortlisted teams; we are very proud to have been amongst such stiff competition!