Data Science Competency for a post-COVID Future

Written by Rich Pugh, Chief Data Scientist, Mango Solutions (an Ascent company)

The COVID-19 pandemic disrupted supply chains and markets and brought economies around the globe to a standstill.

Overnight, governments, public sector agencies, healthcare providers and businesses needed access to timely and accurate data like never before. As a result, demand for data analytics skyrocketed as organisations strived to navigate an uncertain future.

We recently surveyed data scientists working in a variety of UK industry sectors, asking them about:

  • how their organisation’s reliance on data changed during the pandemic
  • how their teams are having to realign their skill sets to deliver the intelligence that’s needed; and
  • top trends on the horizon as organisations pursue a data-driven post-COVID recovery.

What they told us offers some interesting insights into the fast-evolving world of data science.

Decision intelligence gets real

Our findings highlight how the sudden disruption of the COVID-19 pandemic brought the importance of data analytics sharply into focus for business leaders and decision makers across the enterprise.

Almost two-thirds (65%) of those surveyed said that demand for data analytics rose across their organisation. The top request areas for problem-solving and enabling informed strategic decisions included:

  • Immediate crisis response (51%) – risk modelling, digital scaling and strategy as organisations looked to make near-term decisions to address key operational challenges.
  • Informing financial/cost-efficiency decisions (33%).
  • Logistics/supply chain (26%).

As reliance on data became mission-critical, data scientists in some industry sectors were at the nerve centre of COVID-19 response efforts as organisations looked to solve real-life problems fast.

Data scientists are adapting their skill sets quickly

As organisations beef up their data strategies to better prepare for future disruptive events and to survive and thrive in the new normal, data scientists are having to adjust to new ways of working and adapt their skill sets fast. Indeed, 49% of data scientists say their organisation is now investing in building internal capabilities through learning and development programmes, with 38% actively recruiting to fill gaps.

Now part and parcel of the enterprise decision-making team, data scientists confirm they are having to hone their business and communication skills to better support business leaders across the organisation. An impressive 34% identified working more effectively with business stakeholders as a top priority. With data now being used more broadly across the organisation, one-third (33%) of data scientists confirmed that they plan to boost their own communication and business skills so they can interact more cohesively with business leaders – and collectively identify the right problems to solve for their organisation.

Top data trends for 2021

As organisations continue to push ahead with operationalising their data and analytics infrastructures to handle complex business realities, data scientists are scaling up their deployment of machine learning algorithms to automate their analytical models.

According to our poll, improving their machine learning (ML) skills was identified as the #1 priority by 45% of data scientists as they look to accelerate their AI and ML computations and workloads and better align decisions throughout the organisation.

Similarly, big data analytical technologies (such as Spark, Storm and Flink) were a top priority for 39% of UK data science teams, as was getting to grips with deep learning (39%), as analytics teams look to jointly leverage data and analytics ecosystems to deliver coherent stacks that facilitate the rapid contextualisation decision-makers need.

Finally, with more people across the organisation becoming increasingly dependent on data-driven decision making, data scientists are having to find new ways to present data in ways that business teams will understand.

In a bid to democratise data and support faster decision making on the front line, they’re working on increasing their skills in areas like data visualisation (27%) and modelling (23%) so they can tease out trends, opportunities and risks in a digestible way that makes it easy for decision-makers to consume and engage.

New opportunities on the horizon

In a post-COVID world, organisations are looking to tap into an increasing number of data sources for the critical insights they’ll need to tackle emerging challenges. In response, data scientists are having to extract and analyse data quickly – even in real time – and in the right way, integrating data-driven insights into the decision-making process.

At the same time, data scientists are having to upgrade their technical and business skills as organisations look for efficient and innovative ways to use the big data at their disposal.

In summary, the research highlights how important it is both to align central data communities in order to boost and demonstrate value across the business, and to ensure that investment in L&D programmes is fully aligned with developing trends and business objectives.

NHS-R Community

The NHS is one of the UK’s most valued institutions and serves as the healthcare infrastructure for millions of people. Mango has had the pleasure of supporting their internal NHS-R community over the last few years, supporting the initiative from its inception and sharing our knowledge and expertise at their events as they seek to promote the wider usage and adoption of R and develop best practice solutions to NHS problems.

According to a recent survey by Udemy, 62% of organisations are focusing on closing skills gaps, essential to keeping teams competitive, up to date and armed with the relevant skills to adapt to future challenges. For many institutions, an important first step is connecting their analytics teams and data professionals to encourage collaboration and the sharing of knowledge. With data literacy fast becoming the new computer literacy, organisations are realising the strength and value of strong data skills across the whole workforce.

As the UK’s largest employer, comprising 207 clinical commissioning groups, 135 acute non-specialist trusts and 17 acute specialist trusts in England alone, the NHS faces a particularly daunting task when it comes to connecting their data professionals, a vast group which includes clinicians as well as performance, information and health analysts.

The NHS-R community was the brainchild of Professor Mohammed Mohammed, Principal Consultant (Strategy Unit), Professor of Healthcare, Quality & Effectiveness at the University of Bradford. He argues: “I’m pretty sure there is enough brain power in the NHS to tackle any analytical challenge, but what we have to do is harness that power, promoting R as the incredible tool that it is, and one that can enable the growing NHS analytics community to work collaboratively, rather than in silos”.

Three years in and the NHS-R Community has begun to address that issue, bringing together once disparate groups and individuals to create a community, sharing insights, use cases, best practices and approaches, designed to create better outputs across the NHS with a key aim of improving patient outcomes.  Having delivered workshops at previous NHS-R conferences, Mango consultants were pleased to support the most recent virtual conference with two workshops – An Introduction to the Tidyverse and Text Analysis in R. These courses proved to be a popular choice with the conference attendees, attracting feedback such as “The workshop has developed my confidence for using R in advanced analysis” and “An easy to follow and clear introduction to the topic.”

Liz Mathews, Mango’s Head of Community, has worked with Professor Mohammed from the beginning, sharing information and learnings from our own R community work and experience.  Professor Mohammed commented:

“The NHS-R community has, from its very first conference, enjoyed support from Mango who have a wealth of experience in using R for government sector work and great insight in how to develop and support R based communities. Mango hosts the annual R in Industry conference (EARL) to which NHS-R Community members are invited and from which we have learned so much. We see Mango as a friend and a champion for the NHS-R Community.”

EARL 2020 goes virtual

In 2020 the EARL conference was held virtually due to the restrictions imposed by COVID-19. Although this removed the valuable networking element of the conference, the ‘VirtuEARL’ virtual approach meant we reached a geographically wider audience and ensured a successful conference. Thought leaders from academia and industry logged in to discover how R can be used in business, and over 300 data science professionals convened to join workshops or hear presenters share their novel and interesting applications of R. The flexibility of scheduling allowed talks to be picked according to personal or team interests.

The conference kicked off with workshops delivered by Mango data scientists and guest presenters, Max Kuhn of RStudio and Colin Fay from ThinkR, with topics including data visualisation, text analysis and modelling. The presentation day both began and finished with keynote presentations: Annarita Roscino from Zurich spoke about her journey from data practitioner to data & analytics leader – sharing key insights from her role as a Head of Predictive Analytics, and Max Kuhn from RStudio used his keynote to introduce tidymodels – a collection of packages for modelling and machine learning using tidyverse principles.

Between these great keynotes, EARL offered a further 11 presentations from across a range of industry sectors and topics. A snapshot of these shows just some of the ways that R is being used commercially: Eryk Walczak from the Bank of England revealed his use of text analysis in R to study financial regulations, Joe Fallon and Gavin Thompson from HMRC presented their impressive work behind the Self Employment Income Support Scheme launched by the Government in response to the COVID-19 outbreak, Dr Lisa Clarke from Virgin Media gave an insightful and inspiring talk on how to maximise an analytics team’s productivity, whilst Dave Goody, lead data scientist from the Department for Education, presented on using R Shiny apps at scale across a team of 100 to drive operational decision making.

Long time EARL friend and aficionado, Jeremy Horne of DataCove, demonstrated how to build an engaging marketing campaign using R, and Dr Adriana De Palma from the Natural History Museum showed her use of R to predict biodiversity loss.

Charity donation 

Due to the reduced overheads of delivering the conference remotely, the Mango team decided to donate the profits of EARL 2020 to Data for Black Lives, a non-profit organisation dedicated to using data science to create concrete and measurable improvements in the lives of Black people. They aim to use data science to fight bias, promote civic engagement and build progressive movements. We are thrilled to have donated just over £12,000 to this brilliant charity.

Although EARL 2020 was our first virtual event, the conference was highly successful. Attendees described it as an “unintimidating and friendly conference” with “high-quality presentations from experts in their respective fields”, and were delighted to see how R and data science in general are being used commercially. One attendee summed up the conference best: “EARL goes beyond introducing new packages and educates attendees on how R is being used around the world to make difficult decisions”.

If you’d like to learn more about EARL 2020 or see the conference presentations in full, click here.

Mango's success - a data conversation

As we approach the new year, it seems an appropriate time to look back at how Mango’s 18-year history has reflected the evolving landscape of the data industry. It’s hard to believe that founders Matt and Rich have been sharing the data story since 2002, long before the term ‘data science’ gained popularity, and well before most organisations had begun to recognise the value of their data. Matt and Rich have borne witness to this data revolution, through the big data era to the current day, where data is recognised as a new class of economic asset, universities routinely offer data science courses and Government departments have adopted algorithmic decision making.

Championing transition projects that focus on productivity through data science and a move towards repeatable and scalable models, Mango has placed its emphasis on ingraining data as part of a company’s DNA and supporting the creation of a data-driven culture.

It’s easy to see how the co-founders have remained at the forefront of the industry for so long, delivering data science projects to some of the world’s best-known companies. They credit their longevity to their open, honest and outcome-focused way of doing business, and to their deliberate shift from analytics as a reactive tool to a source of value and insight that drives decision making.

Asked about the most notable transition they had seen over the years, Matt referenced how the world has changed: “You can’t have barriers of data within organisations. Siloed data and analytics teams were once the norm, but these create structural, cultural and technological obstacles, wasting resource and inhibiting productivity. Many of the biggest challenges associated with data are not so much analytic problems, but fundamental information integration issues. Technology has moved at a huge pace in the past decade and that continuum between software advances and a recognition of the importance of data grows ever closer.”

Secrets of Success

There have been many secrets to Mango’s success, starting with its name.  “We considered lots of options incorporating ‘Statistics’ or ‘Analytics’ but they all seemed rather dull or dry and, in retrospect, would have dated very quickly,” remembers Rich. “Whilst ‘Stats Entertainment’ was just one of Matt’s inspired suggestions, our decision to name the company Mango, after his cat, has allowed us to continue to evolve and stay relevant through all the technological changes of the past 18 years.”

The name aside, it’s the founders’ approach that has been the real secret of their success. “Data for us has always been a way of doing business”, says Matt. “Looking back, we were right to place the emphasis on using analytics to empower end users. Our business has always been about making sense of data science, building out the capability by finding the experience, looking for knowledge and focussing on skills transfer and developing autonomy and support.  We’ve always believed in making data science easier for organisations, working alongside them and helping to broaden the scope and skills of the inhouse teams”.

Matt and Rich are unanimous that a vital element in Mango’s success, has been its people. “We’ve been lucky enough to attract extremely talented people, whilst also having a very successful internal graduate programme,” confirms Matt.  “My father’s advice was always to surround yourself with the best people and that’s exactly what we’ve managed to achieve. It was a proud moment to see that this year’s DataIQ list of Top 100 data professionals featured not only Rich, but also two of our former colleagues.”

Highlights

There have been many highlights along the way, but for Matt and Rich there have been some standout memories and high points over the past eighteen years.  “Standing on the platform at Zurich train station celebrating our first major contract win was a very memorable moment,” recalls Matt. “It was the point when we realised that we really were onto something new, securing a big customer who’d been won over by our style and attitude.”

A particular Mango achievement is their work in the R Community, including the creation of EARL (Enterprise Applications of the R Language), the first commercially focused R conference. The first EARL conference was delivered in 2014 and is now a firm annual fixture for R users across the UK and Europe.  Previous iterations have also seen EARL conferences delivered across the US. The original idea for the conference came from Rich, and the event is entirely organised and run by Mango staff. “The culture and openness displayed at EARL is fantastic, with companies keen to share their knowledge and use cases and talk frankly about their R journeys” remarks Rich. “Our work within the R community and the recognition that Mango has received for our R user groups and EARL is something we are particularly proud of.”

Lessons learned

Mango’s initial work was primarily within the life sciences and financial sectors. “A lot of our early work was in highly regulated industries and the rigour of working in those environments was massively valuable”, recalls Rich. “Everything we learned in those regulated industries we now deploy across industry, ensuring a robust approach and the delivery of best data science practices and real practical advice. Whilst much of our early work was in SAS, S-PLUS and R, Mango has always been agnostic about tech, working within whichever language best meets our clients’ requirements and objectives; these days much of our work is in Python.”

A phrase that resonates with Mango is ‘Give a man a fish and you feed him for one day; teach a man to fish and you feed him for a lifetime’. “We work alongside our clients, mentoring and helping to upskill their teams, leaving them able to operate independently at the end of our involvement,” states Rich. “This approach is greatly valued by our customers, who recognise the value we add irrespective of where they are in their own digital transformation journey.”

Teamwork is at the heart of Mango’s work, whether it’s working in internal teams or as part of a client’s team. The introduction of the Belbin framework has been enormously useful in creating a team structure and awareness of individuals’ behavioural strengths, fostering more effective communication. “We started by employing the right people”, said Rich, “but the Belbin framework and our own Trusted Consultant programme has cemented a really productive team ethos.”

“Looking back, if there was one thing that we wished we’d done earlier, it would have been to introduce a marketing presence,” mulls Matt. “We were fortunate to grow organically and benefit from recommendations and repeat business, but in the past couple of years, the work undertaken by our marketing team to promote Mango to a wider audience has resulted in awards and recognition that have really amplified our presence and message.”

Looking ahead

“We are extremely proud of the company that we have built,” attests Rich, “and today Mango is focused on facilitating the sorts of conversations that we recognised as needing to be had some 18 years ago when we first founded.  We urge businesses to embrace methodical and pragmatic data processes before they dive in at AI/ML-level but are grateful, at least, that these latter tools have finally provoked the data conversation”.

where does digital value lie?

Alex James, CTO at Ascent, helps us track down value in an increasingly digital age.

“Companies that don’t or can’t keep up in this age of digital transformation are going to get left behind by their competition,” says everyone, all the time.

But what does that mean, and where and how should companies be investing to actually drive digital transformation going into 2021?

We are currently seeing high levels of investment in a few key areas:

  • Predictive data & analytics
  • Data warehousing & reporting
  • Smart buildings
  • Industry 4.0 smart machinery
  • Artificial Intelligence
  • Smart asset monitoring

And in some areas – such as blockchain, drones and virtual reality – we have seen a downturn in interest.

Interconnected

Two things become very clear when looking at the list: investments are being made in areas where software and data meet, and those areas have high levels of interdependency. For example, smart buildings are used for asset monitoring purposes and contribute to data warehouses, which can be used for predictive purposes to drive efficiency gains. Each link in the chain incrementally increases value.

This interconnectedness is where we see the most value creation for organisations who get it right. Looking at investments in isolation, it’s hard to see the ROI against implementation costs and risks. When you look at these investments as an interconnected web (each node driving inputs and outputs with value generation at the centre), that picture changes. Technology areas which are seen as stand-alone drivers such as VR are struggling to attract the same broad levels of investment.

However, digital transformation is a journey, not a destination, and simply investing in a web of technologies doesn’t guarantee success, or even any kind of return. CIOs need to think further ahead and carefully balance current pain with future anticipated needs. Organisations are understandably rarely in the position to take giant leaps away from models and processes that have made them successful today, so evolutionary roadmaps that typically span 3-5 years are a common approach. A strong roadmap constantly evolves, actively acknowledging obsolescence, technical debt, and the operational pain of change – balancing these against technology’s ability and responsibility to deliver radical organisational improvement.

Dependent

An organisation’s ability to deliver successful change therefore depends upon the ability to execute both the technology roadmap and change management activities in sync. New capabilities in IoT or AI for example will only ever deliver value as part of a cohesive web of solutions – they are not the standalone ‘silver bullets’ some businesses expect them to be.

This is a bit of a move away from some ideologies of the past. Lean practices have proven very successful in start-up technology companies and have spilled over into larger organisations. However, this approach may not be well suited to modern digital transformation projects. A focus on short-term ROI and individual projects is typically embedded in change-resistant organisations, leading to piecemeal investments without a strong roadmap and vision. This in turn leads to poor returns, as valuable data and information stay locked in silos, unable to drive or consume value from the rest of the organisation.

Standardised

One of the main obstacles to overcome in forming a strong digital strategy and not falling into this trap is the acknowledgment of pace of change and obsolescence. In the world of IoT, capital investments have often been written off long before they’ve even been depreciated off the balance sheet. Why is this? Unreasonably high ROI requirements, lack of flexibility in the original solution and lack of interoperability are all key culprits.

These experiences tend to make CIOs more cautious and pessimistic in their outlooks. The landscape right now is changing with regard to data sensitivity regulations, growing data sizes, the cost of staff with the skills to maintain systems and a lack of interoperability between solutions. All of these, if ignored, can cripple a solution and turn the return negative over time as they layer on increased cost and complexity.

However, all of these challenges can be overcome with a strong strategy. Capability-driven models that outsource much of the heavy lifting to SaaS providers and place cloud-based capabilities like Azure at the centre of their architectures remove much of the risk around data management. Similarly, carefully planned integration architectures and service-oriented designs with comprehensive APIs allow for changes down the road and a plug-and-play type approach to expanding services.

Evolving

While just a handful of years ago traditional hardware-centric IT skills may have been sufficient to maintain business operations, most organisations are finding themselves in a place where access to modern programming and software engineering skills is table stakes to keep their strategy on track. Over the next few years, skills such as data engineering and data science will start to move to the top of that list. So, another large part of setting a successful digital strategy is talent-focussed – not only in training and upskilling but in understanding, balancing and forecasting external vs internal expertise requirements: which capabilities belong in-house and which should be rented as a service and consumed as an Opex item.

Digital transformation doesn’t just change the technologies people use, but also how people work. We are moving into an age where cultural and process change needs to happen in step with technology change, and where an organisation’s technology proposition needs to be thought of as an interconnected web that creates value. The cost of implementing just one node or solution may not seem to create enough value in isolation, but as part of the whole value-producing web, it becomes an absolute necessity.

In summary…

Leaders and CIOs need to remember that internal capabilities are only part of the solution – limiting your ability to execute to your own domain of expertise will ultimately be restrictive. The fastest solution isn’t always the best, but well-paced solutions that take into consideration all other transformation vectors will always win in the long run. And roadmaps that directly deal with and allow for the realities of change, obsolescence and technical debt tend to be the most successful.

So, the answer to ‘where does digital value lie?’ is, counterintuitively, not in any particular area of investment or technology, but in the interconnected web of data, action and insight that lies between those investments, driven by a strong overarching digital strategy.

Value at the Intersection of Data and Software

For the last 18 years, Mango have been helping customers deliver on the potential of data and analytics.

When we started Mango back in 2002, the wider world of data and analytics was mostly reactive, with workflows conducted by individuals who produced reports as ‘one time’ outputs. As such, while data professionals wrote code, it could largely be considered a by-product of what they did. The advent of data science, together with the increasing need for just-in-time intelligence, has driven more proactive analytic workflows underpinned by open-source technologies such as Python and R.

Working at the forefront of data science, Mango understands the vital role of technology: to allow data to be transformed into wisdom in a repeatable way and deployed to business users at the right time, to support informed decision making.

There is a clear learning here for modern technology initiatives:

Every data project is a software project, and every software project is a data project.

To realise business value, it is vital that we balance both data and software elements of technical projects around a common and clear purpose.

Every data project is a software project.

Back in 2012, Josh Wills described a data scientist as someone who is “better at statistics than any software engineer and better at software engineering than any statistician”. While modern data science incorporates a broader range of analytic approaches than statistical modelling alone, Josh’s description of data science at the intersection of analytics and software engineering still holds today.

The changing role of data and analytics from a reactive practice to a strategic approach has driven the need for advanced analytics to be combined effectively with software engineering. If analytics is now an always-on capability, we need to codify the intelligence in systems that can be properly deployed and scaled within a business.

A ‘local’ alternative is just not practical – you can’t become a true data-driven business if analytics is run by experts on their laptops. We can’t stop making intelligent decisions if a data scientist is on leave. If a consumer purchases a product on Amazon, they will not wait hours or days until a statistician crunches the data to come up with other recommended products.

To positively impact a business with data, an end-to-end analytic workflow needs to be implemented using software engineering approaches. This encompasses everything from the creation of data pipelines, the deployment of models, and the creation of user interfaces and applications that can convey insight in the right way, linked directly to operational systems to action and process outcomes.
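As a hedged illustration of what codifying an analytic workflow can look like, the sketch below wraps a pre-trained model in an HTTP endpoint using the plumber package; the package choice, file names and inputs are our own assumptions rather than anything prescribed here.

# plumber_api.R – a minimal sketch of deploying a model as a web service
library(plumber)

# Assume the model was trained elsewhere in the pipeline and saved to disk
model <- readRDS("model.rds")

#* Score a single observation and return the prediction
#* @param wind The wind measurement for the observation
#* @param temp The temperature measurement for the observation
#* @get /predict
function(wind, temp) {
  # Query parameters arrive as strings, so coerce them before predicting
  newdata <- data.frame(Wind = as.numeric(wind), Temp = as.numeric(temp))
  list(prediction = predict(model, newdata = newdata))
}

Running plumber::plumb("plumber_api.R")$run(port = 8000) would then expose the model to operational systems and user interfaces as a simple web service.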

Every software project is a data project.

Increasingly, digitalisation and regulation have driven more focus on requirements regarding the role of data in software systems. We can consider three types of requirement regarding the treatment of data:

  • User – requirements relating to users and preferences to provide a more personalised experience
  • Governance – requirements relating to the way in which data is managed in a secure fashion to conform with data regulations and protect confidential data
  • Provenance – requirements relating to historical system actions to provide an audit trail, or to enable rolling back to, or understanding of, previous actions

Beyond this, the most important consideration in the design of modern systems is the ability to leverage advances in data and analytics to create richer, more useful experiences and applications. A growing understanding of the possibilities offered by analytics allows us to strive to ask better questions – to build software tools that are truly aligned to a user’s objectives.

For example, imagine we are building a software application to be used by call centre staff when speaking with customers. Traditionally, we may have built a system that combined data from various sources to give the user a single view of the customer. Perhaps this included data on previous orders, previous interactions, demographic data etc.

With data science, we could extend the functionality for the user – perhaps to include an understanding of likely customer churn linked to suggested retention actions, or a suggested ‘next best offer’ for the customer, or suggestions around the ways in which to talk to the customer. Perhaps when the customer calls the call centre they can be allocated to exactly the right person to talk to, as opposed to being randomly allocated to the next available agent.
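To make this concrete, below is a small, hypothetical sketch of the kind of churn scoring that could sit behind such a screen; the model, threshold and retention actions are illustrative assumptions, not a description of any real system.

# Hypothetical sketch: scoring a customer's churn risk inside the call-centre app.
# Assume churn_model is a classification model (e.g. a logistic regression)
# trained elsewhere on historical customer data.
score_customer <- function(churn_model, customer) {
  # Probability that this customer will churn, given their data
  churn_prob <- predict(churn_model, newdata = customer, type = "response")
  # Translate the score into a suggested retention action for the agent
  action <- if (churn_prob > 0.7) "offer retention incentive" else "standard service"
  list(churn_probability = unname(churn_prob), suggested_action = action)
}

The application would call something like score_customer() as the call is routed, so the agent sees the suggestion alongside the traditional single view of the customer.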

The use of data and analytics in software can have a transformative effect on the quality and usefulness of our software systems.

In summary…

Helping customers build capabilities at the intersection of data and software is the most effective way to unlock value in an increasingly digital economy. Technology businesses like ours who want to be part of that customer journey need to be ambidextrous in their approach to data and software, agile in their execution and above all empathetic to each customer’s unique context.

We’re excited to apply our passion for data science to a wider market as we join forces with Ascent – increasing our combined ability to design and deliver ‘the big picture’ for customers that helps them compete and flourish.

Author: Rich Pugh, Chief Data Scientist

Distributed programming in R

Both R and distributed programming rank highly on my list of “good things”, so imagine my delight when two new packages used for distributed programming in R were released:

  • ddR (https://github.com/vertica/ddR)
  • multidplyr (https://github.com/hadley/multidplyr)

Distributed programming is normally taken up for a variety of reasons:

  • To speed up a process or piece of code
  • To scale up an interface or application for multiple users

There has been a huge appetite for this in the R community for a long time so my first thought was “Why now? Why not before?”.

From a quick look at CRAN’s High Performance Computing page, we can see the mass of packages already available for related problems. None of them have quite the same focus as ddR and multidplyr, though. Let me explain. R has many features that make it unique and great. It is high-level, interactive and, most importantly, it has a huge number of packages. It would be a huge shame not to be able to use these packages, or to lose these features, when writing R code to be run on a cluster.

Traditionally, distributed programming has contrasted with these principles, with much more focus on low-level infrastructure, such as communications between nodes on a cluster. Popular R packages that dealt with this in the past are the now-deprecated snow and multicore (released on CRAN in 2003 and 2009 respectively). However, working with the low-level functionality of a cluster can detract from analysis work because it requires a slightly different skill set.

In addition, the needs of R users are changing, and this is, in part, due to big data. Data scientists now need to be able to run experiments on, and analyse and explore, much larger data sets, where running computations can be time-consuming. Due to the fluid nature of exploratory analysis, this can be a huge hindrance. For the same reason, there is a need to be able to write parallelised code without having to think too hard about low-level considerations, and for it to be fast to write as well as easy to read. My point is that fast parallelised code should not just be for production code. The answer to this is an interactive scripting language that can be run on a cluster.

The package written to replace snow and multicore is the parallel package, which includes modified versions of snow and multicore. It starts to bridge the gap between R and more low-level work by providing a unified interface to cluster management systems. The big advantage to this is that R code will be the same, regardless of what protocol for communicating with the cluster is being used under the covers.

Another huge advantage of the parallel package is the “apply” type functions that are provided through this unified interface. This is an obvious but powerful way to extend R with parallelism, because any call to an “apply” function with, say, FUN = foo can be split into multiple calls to foo, executed at the same time, as the sketch below illustrates. The recently released packages ddR and multidplyr extend the functionality provided by the parallel package. They are similar in many ways; indeed, the most significant similarity is that both introduce new data types designed specifically for parallel computing. Functions on these data types “partition” the data, describing how work can be split amongst multiple nodes, and later collect the results, combining them to produce a final output.
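As a minimal sketch of that “apply” pattern using the parallel package directly (the cluster size and toy function here are our own illustrative choices):

# Create a local cluster of four workers; the same code runs regardless of
# the communication protocol used under the covers
library(parallel)
cl <- makeCluster(4)

# Each element of 1:100 is handed to the function on a worker, with the
# calls executed at the same time; results are returned as a list
squares <- parLapply(cl, 1:100, function(x) x^2)

stopCluster(cl)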

Beyond partition and collect, ddR also reimplements a lot of base functions on the distributed data types, for example rbind and tail. ddR is written by the Vertica Analytics group, owned by HP, and is designed to work with HP’s distributedR, which provides a platform for distributed computing with R.

Hadley Wickham’s package, multidplyr, also works with distributedR, in addition to snow and parallel. Where multidplyr differs from ddR is that it is written to be used with the dplyr package. All methods provided in the dplyr package are overloaded to work with the data types provided by multidplyr, furthering Hadley’s ecosystem of R packages.

After a quick play with the two packages, many more differences emerge.

The package multidplyr seems more suited to data-wrangling, much like its single-threaded equivalent, dplyr.

The partition() function can be given a series of vectors which describe how the data should be partitioned, very much like the group_by() function:

# Extract of code that uses the multidplyr package
library(dplyr)
library(multidplyr)
library(nycflights13)

# partition() splits the rows across workers by the given variable,
# much like group_by(); collect() gathers the results back together
planes %>%
  partition(type) %>%
  summarise(n = n()) %>%
  collect()

However, ddR has a very different “flavour”, with a stronger algorithmic focus, as can be seen from the example packages implemented with ddR: randomForest.ddR, kmeans.ddR and glm.ddR. As the code snippet below shows, certain algorithms such as random forests can be parallelised very naturally. Unlike multidplyr, the partition() function does not give the user control over how the data is split. However, the collect() function provides an index argument, which gives the user control over which workers to collect results from. Also, the list returned by collect() can then be fed into a do.call() to aggregate the results, for example using randomForest::combine().

# Skeleton code for implementing a very primitive version of random forests using ddR
library(ddR)
library(randomForest)

# Fit one forest on each of four workers
multipleRF <- dlapply(1:4,
                      function(n) {
                        randomForest::randomForest(Ozone ~ Wind + Temp + Month,
                                                   data = airquality,
                                                   na.action = na.omit)
                      })

# Gather the fitted forests from the workers and combine them into one ensemble
listRF <- collect(multipleRF)
res <- do.call(randomForest::combine, listRF)

To summarise, distributed programming in R has been slowly evolving for a long time, but now, in response to high demand, many tools are being developed to suit the needs of R users who want to be able to run different types of analysis on a cluster. The prominent themes are as follows:

  • Parallel programming in R should be high-level.
  • Writing parallelised R code should be fast and easy, and not require too much planning.
  • Users should still be able to access the same libraries that they usually use.

Of course, some of the packages mentioned in this post are very young. However, due to the need for such tools, they are rapidly maturing, and I look forward to seeing where they go in the very near future.

Author: Paulin Shek

data team

As more and more Data Science moves from individuals working alone with small data sets on their laptops to more productionised, analytically mature settings, an increasing number of restrictions are being placed on Data Scientists in the workplace.

Perhaps your organisation has standardised on a particular version of Python or R, or perhaps you’re using a limited subset of all available big data tools. This sort of standardisation can be incredibly empowering for the business. It ensures all analysts are working with a common set of tools and allows analyses to be run anywhere across the organisation. It doesn’t matter if it’s a laptop, a server or a large-scale cluster: Data Scientists, and the wider business, can be safe in the knowledge that the versions of your analytic tools are the same in each environment.

While incredibly useful for the business, this can, at times, feel very restricting for the individual Data Scientist. Maybe you want to try a new package that isn’t available for your ‘official’ version of R, or you want to try a new tool or technique that hasn’t made it into your officially supported environment yet. In all of these instances, a Data Science Lab or Analytics Lab environment can prove invaluable for keeping pace with the fast-paced data science world outside your organisation.

An effective lab environment should be designed from the ground up to support innovation, both with new tools and with new techniques and approaches. For the most part, it’s rare that any two labs would be the same from one organisation to the next; however, the principles behind their implementation and operation are universal. The lab should provide a sandbox of sorts, where Data Scientists can work to improve what they do currently, as well as prepare for the challenges of tomorrow. A well-implemented lab can be a source of immense value to its users, as it can be a space for continual professional development. The benefits to the business, however, can be even greater. By giving your Data Scientists the opportunity to be a part of driving requirements for your future analytic solutions, and with those solutions based on solid foundations derived from experiments and testing performed in the lab, the business can achieve and maintain true analytic maturity and meet new analytic challenges head-on.

In order to successfully implement a lab in your business, you must first establish the need. If your Data Scientists are using whatever tools are handy and nobody has a decent grasp on what tools are used, with what additional libraries, and at what versions, then you have bigger fish to fry right now and should come back when that’s sorted out!

If your business analytic landscape is well understood and documented, the next task is to identify and distil your existing tool set into a set of core tools. As these tools constitute the day-to-day analytic workhorses of your business, they will form the backbone of the lab. In a lot of cases, this may be a particular Hadoop distribution and version, or perhaps a particular version of Python with scikit-learn and numpy, or a combination.

The next step can often be the most challenging, as it usually requires moving outside of the Data Science or Advanced Analytics team and working closely with your IT department to provision the environments upon which the lab will be based. Naturally, if you’re lucky enough to have a suitable Data Engineer or DataOps professional on your team, then you may avoid this requirement. A lot of that is going to depend on the agility model of your business and how reliant on strict silos it is.

Ideally, any environments provisioned at this stage should be capable of being rapidly re-provisioned and re-purposed as needs arise, so working with a modern infrastructure is a high priority. It’s often wise at this stage to consider some form of image management for containers or VMs, to speed deployment and ensure environments are properly managed. You need to be able to adapt the environment to the changing needs of the user base with the minimum of effort and fuss.

Once you have rapidly deployable environments at your disposal, you’re ready to start work. What form that work takes should be left largely up to your Data Science team, but broadly speaking they should be free to use and evaluate new tools or approaches. Remember, the lab is not a place where production work is done with ad hoc tools; it’s a safe space for experimentation and innovation, just like a real laboratory environment. The knowledge gained from running tests or trials in the lab, however, can and should inform the evolution of your production tools and techniques.

A final word of warning for the business: A successful lab environment can’t be achieved through lip-service. The business must set aside time for Analysts or Data Scientists to develop the future analytic solutions that are increasingly becoming central to the success of the modern business.

For more information, or to get help building out an Analytics Lab of your own, or even if you’re just starting your journey on the path to analytic maturity, contact info@mango-solutions.com

Author:  Mark Sellors, Mango Solutions

bath data science radar

Spotlight on Beth Ashlee – Senior Data Scientist

Name: Beth Ashlee

Job title: Data Science Consultant

Qualification(s): BSc Biomedical Science

Time in current role:  4 years

Beth Ashlee joined Mango initially as an intern whilst studying Biomedical Science. Four years on, she has recently been promoted to Senior Data Scientist. During this time, she has experienced many diverse opportunities and pathways that have accelerated her analytical competency.

In addition to being exposed to a myriad of technical scenarios through her delivery of client training in R and Python, Beth spends much of her time collaborating on a variety of projects such as Shiny app development, data exploration and productionising models. One of Beth’s passions is her team-lead responsibility for Mango’s graduate recruitment programme, where she actively trains and mentors her team on both professional and personal development.

Beth is a master communicator, which is reflected in the shape of her Data Science Radar – a tool used to assess core data science competencies. Soft skills in data science are essential for establishing meaningful relationships, alongside the ability to translate business value across an organisation – an area where Beth certainly excels. Outside of work, Beth enjoys travelling to new places and attending music festivals.

Beth’s Top 3 traits: 

  • Programmer 
  • Communicator 
  • Data Wrangler

Beth scores high in both Visualisation and Programming, which ties in with the types of projects she has been working on most recently.

As would be expected given her role as a Consultant and Trainer, Beth scores strongly as a Communicator. During a recent Government project, which required significant stakeholder engagement, these skills proved essential for helping to mobilise teams around the possibilities of advanced analytics.

Beth has identified that modelling is something she needs to work on to become a more well-rounded data scientist. To support this development, she has recently been doing more self-learning and is now working on a client facing modelling project.

Having a thorough understanding of capabilities and skill levels mapped against core competencies like these can help guide and shape the data science project team best suited to the task. The result is a significantly more engaged workforce with a set of skills that the business understands and needs to deliver data-driven value. For more information on Data Science Radar, check out our Building a Winning Data Science Team page.

Would you like to join our award-winning 2020 Data IQ Best Data and Analytics Team? Mango are currently recruiting.

Related blogs:

Spotlight on a Data Consultant: Karina Marks

Spotlight on a Junior Data Scientist: Joe Russell

dataIQ award winners

Mango are delighted to have been awarded the 2020 Data IQ Best Data & Analytics Team (Enabler) award as part of the people category. The virtual awards ceremony took place early yesterday evening, with Pete Scott, Mango’s Client Services Director, accepting the award on behalf of the team.

As is sadly the case with virtual ceremonies, there wasn’t a cocktail or DJ in sight; nonetheless, the shortlist comprised very strong competition. Pete Scott said, “It really is fantastic to be recognised for such a prestigious award, designed to showcase the best of the data and analytics industry. Mango’s astonishing team of Data Science Consultants focus on solving real challenges through data and are dedicated to delivering customer-centric, data-driven value.  As a team they deliver expertise and innovative solutions in strategic advice, data and analytic project delivery, through to building analytic team capability.”

The consulting team, consisting of 35 data scientists and engineers with more than 200 years’ combined expertise between them, demonstrate exemplary technical excellence, collaborative working practices and processes, best practice frameworks and a commitment to proactive stakeholder engagement. The award entry demonstrated these commitments in abundance and in addition, it was their external engagements, notably their community and outreach activities showcasing Mango as being at the heart of an innovative data science community, which were no doubt recognised by the screening panel.

Mango would like to acknowledge the support of their key stakeholder partnerships, where the benefits of true collaborative relationships are realised. Working through the restrictions imposed by the COVID-19 pandemic has certainly shown the benefit of Mango’s ‘agile’ project management practices, an approach that has allowed for reactive changes in accordance with rapidly changing conditions.

“This is an amazing achievement for Mango”, concluded Pete, “and reflects not only on the brilliance of the consulting team, but also on the support they receive from all areas of our 70-people consultancy. We all celebrate this win.”

Congratulations to all of the worthy award winners and shortlisted entrants this year; we are very proud to have been amongst such stiff competition!