Dockerisation

Best practice in data science can lead to long-lived business results. A structure that encourages repeatable processes for generating value from data leads to a productive team and reproducible results, time and time again. When this approach is ingrained in a company's culture, and the business and data teams are working in harmony towards shared goals, the value of data can be realised through a centre of excellence and a shared language for best practice.

A shared language of best practice

Layers of operational best practice allow a standard way of working to be adopted – ensuring the best possible outcome from your data science investment. For a data science team, best practices could relate to developing models, structuring analysis, quality standards or how a project is delivered. They could even extend to the selection of your data and analysis tools, as these can easily affect the success of your project.

With data science teams coming from a diverse range of backgrounds and experiences, what is obvious to one person can be a novelty to another. A shared language of best practice allows collaborators to focus on the all-important value generated. A workflow that adheres to best practice ensures quality, whether that is the business value of insights or the accuracy of models. Best practices take the guesswork out, minimise mistakes and create a platform for future success.

Four best practices every data delivery team should focus on:

  • Reproducibility – whatever the task, if your results can't be repeated, is it really done?
  • Robustness – results and the quality of analysis can have a huge impact; best practices that build in checks and balances lead to better quality
  • Collaboration – what use are your results if they are difficult to share? Standards for collaboration mean business value can actually be attained
  • Automation – it is easy to fall into manual, one-off work; frameworks for automation help accelerate teams

Best practice in Dockerisation

My talk at the Big Data London Meet Up, 'How Docker can help you become more reproducible', takes one element of best practice in data science, focusing on Dockerisation. Docker is proving to be a powerful tool – one that is already turning established team practices on their head. It allows teams to collaborate far more easily, to be much more reproducible, and to automate workflows in an impressive way. Yet it has not had as much adoption within data science as it has within software engineering. My talk will explore just how Docker can supercharge your workflow and your valuable use cases.

This talk will be of interest to any data scientist who has had trouble deploying work, collaborating with engineering teams, or reproducing colleagues' analysis. It will also be of interest to anyone wanting to know how Docker can help scale a team – making it less intimidating and arming practitioners with the tools to give it a go.

I look forward to seeing you at Mango's Big Data London Meet Up, 22nd September 6-8pm, Olympia AI & MLOPS Theatre. You can sign up here.

Kapil Patel is one of Mango’s Data Science Consultants.

 


It’s mostly preaching to the converted to say that ‘open-source is changing enterprises’. The 2020 Open Source Security and Risk Analysis (OSSRA) Report found that 99 per cent of enterprise codebases audited in 2019 contained open-source components, and that 70 per cent of all audited codebases were entirely open-source.

Hearts and minds have most certainly been won, then, but there are still a surprising number of enterprise outliers when it comes to adopting open-source tools and methods. It’s no surprise that regulated industries are one such open-source averse group.

It’s still difficult to shake off the reputation open-source resources can have for being badly-built, experimental, or put together by communities with less recognisable credentials than big players in software. When your industry exists on trust in your methods – be it protecting client finances in banking, or the health of your patients in pharma – it’s often easier just to make do, and plan something more adventurous ‘tomorrow’.

This approach made a certain amount of sense in years past, when embracing open-source was more a question of saving capex with ‘free’ software, and taking the risk.

Then along comes something like Covid-19, and the CEO of Pfizer – now among those leading the way to a usable vaccine – was singing the praises of open-source approaches back in March 2020. Months down the line, AstraZeneca and Oxford University's 70-per-cent-efficacy Covid-19 vaccine emerged. AstraZeneca is having a public conversation around how it's "embracing data science and AI across [the] organisation" while it continues to "push the boundaries of science to deliver life-changing medicines".

Maybe tomorrow has finally arrived.

At Mango, our primary interest is in data science and analytics, but when it comes to statistical programming we have a particular interest in the open-source language R. We're not attached to R for any reason other than that we find it hugely effective in overcoming the obstacles the pharmaceutical industry recognises implicitly – accessing better capabilities, and faster.

With a growing number of pharmaceutical companies starting to move towards R for clinical submissions, we thought it would be useful to find out why. Asking experts from Janssen, Roche, Bayer and more, we collected first-hand use cases, experiences and stories of challenges overcome, as well as finding out how these companies are breaking the deadlock of open-source’s reputation versus its huge potential for good in a world where everything needs to move faster, while performing exceptionally. Watch the full round table recording here.

If you’d like to find out more, please get in touch and we’d be happy to continue the conversation.

Author: Rich Pugh, Chief Data Scientist at Mango


Here at Mango, we are often asked to come and help companies who are in a mess with their data. They have huge technical debt, they can't link all their data sources, and the number of reports they have has ballooned beyond control. Everyone has their own version of the truth, and business units are involved in 'data wars' where their data is right and everyone else's is wrong. How does this happen? Put quite simply, hiring focuses on the 'shiny', interesting aspects of data science where it is easy for the business to see the value of the hire – business intelligence (BI), management information (MI), or Data Scientists. This ignores the more technical, less exciting but essential pillar of delivering business value: data management and data engineering, which underpin any data-driven business.

The thing is, you may have the best data team who can programme, model, visualise and report with data but without well-managed, curated data, over the longer term your systems and processes will be thrown into chaos and your data will become unmanageable. This isn’t because these analytical professionals aren’t doing their job, it’s because their job is extracting value from insight, not making sure the machine behind it all is ticking over smoothly. In F1, the driver would be useless without a whole range of engineers and mechanics. If your business only has BI and MI analysts or Data Scientists, you are asking the driver to win an F1 race with a Morris Minor – you need a Data Engineer.

 

Turning data into wisdom – the role of the Data Engineer

Why does this happen? Quite simply, organisations often look at the price of hiring a senior, experienced head of data/data engineering, or of building a data management function, and decide they don't need one – hiring a significantly cheaper BI resource instead and expecting that person to do it all. The role of a head of data/data engineering has changed massively since the advent of advanced analytics, and now requires both specialist and strategic knowledge to build reliable systems that collect, transform, store and provision data for analytics or other complex purposes. The right technical infrastructure, able to turn data into wisdom in a repeatable manner, bridges the gap between strategy and execution.

A proliferation of data silos and hard-to-maintain "legacy" data processing systems are just some of the common challenges. And while, with modern platforms, data warehouses are a more collaborative affair than ever before, many of the same principles still hold. A data engineer understands the data modelling techniques needed to build data warehouses that can be trusted and maintained, and that deliver exactly what analysts need.

It's a false economy to overlook the critical engineering needs that a data-driven business has. There is also a cost in fiscal terms. With poorly designed systems that don't perform, we have seen the costs of cloud transformation projects double purely because of poor data management. Add to that the cost of constantly upgrading database servers so they can keep up with an ever-increasing workload, and lifetime costs get even higher. This ignores the harder-to-quantify opportunity cost of not being able to leverage your data, or the cultural impact of business units arguing because they have different data-driven views of the business.

It's essential to look at the investment in an appropriate data function holistically: the long-term gain comes through increased opportunities to leverage data and make better decisions, a more efficient cost base for your technology, and an easier transformation pathway when you need to evolve as a business. Without taking that long-term view, it can be hard to see how a data management function adds value. However, without one, the opportunity for improved insight – and the cultural benefit of happier staff who understand how to leverage data in a way that is sustainable and beneficial to all – will be lost.

 

The Key to Extracting Value from your Data

Organisations need a good data engineering function to access the right data, at the right time, and with sufficient quality to empower analytics. But what is the definition of a data engineer’s role and why is this function so crucial to bridging the gap between strategy and execution when it comes to delivering a data science project?

As data experts, we know what companies need to do to become data-driven. If you are struggling to see how a data function fits in your business or don’t know how to move to the data-driven nirvana, we can help guide you on your whole journey, from first steps through to decisions being made from a ‘data first’ mindset.

Author: Dean Wood, Principal Data Scientist


Pure Planet Placement

Climate change and the rise of machine learning are two dominating paradigm shifts in today's business environment. Pure Planet sits at the intersection of the two – it is a data-driven, app-based, renewable energy supplier, providing clean, renewable energy in a new, technology-focused way.

Pure Planet are further developing their data science capability. With a hoard of data from their automated chat bot 'WattBot', among other sources, they are positioning themselves to gain real value from plumbing this data into business decisions to better support their customers. Mango have been working with Pure Planet and their data science team to build up this capability, developing the infrastructure (and knowledge) to get this data into the hands of those that need it – be it the marketing department, finance, or the operations teams – so they all have access to the insights produced.

Thanks to this great relationship, Mango and Pure Planet were able to organise a Graduate Placement and I was able to spend a month integrated into their data science team in Bath.

Consumer Energy is Very Price Sensitive

To a lot of consumers, energy is the same whoever supplies it (provided it is green…), and so price becomes one of the dominating factors in whether a customer switches to, or from, Pure Planet.

With the rise of price comparison websites, evaluating the market and subsequently switching is becoming easier than ever for consumers, and consequently the rate at which customers switch is increasing. Ofgem, the UK energy regulator, states: 'The total number of switches in 2019 for both gas and electricity was the highest recorded since 2003.' – https://www.ofgem.gov.uk/data-portal/retail-market-indicators

Pure Planet knows this, and regularly reviews its price position with respect to the market, but the current process is too manual, not customer specific, and hard to integrate into the data warehouse. Ideally, competitor tariff data could be digested and easily pushed to various parts of the business, such as in finance to assess Pure Planet’s market position for our specific customer base, or to operations as an input in a predictive churn model to assess each customer’s risk of switching.

It is clear just how valuable this data is to making good strategic decisions – it is just a matter of getting it to where it needs to be.

Can We Extract Competitor Quotes?

Market data on prices from all the energy providers in the UK is available to Pure Planet from a third-party supplier, making it possible to get data on the whole market. Currently, however, it is only possible to manually fetch discrete High/Medium/Low usage quotes – average usage levels defined by Ofgem.

An alternative was found by accessing the underlying data itself and re-building quotes. This would allow us to reconstruct quotes for the whole market for any given customer usage – far more useful when looking at our real position in the market for our customers.

The data exists in two tables: tariff data and discount data. From this it should be possible to reconstruct any quote from any supplier.

An Introduction to the Tariff Data

The two data files consist of the tariff data and the discount data.

The tariff data gives the standing charge and per-kilowatt-hour cost of a given fuel, for a given region, for each tariff plan. This is further filtered by meter type (Standard, or Economy 7), single/dual fuel (whether both gas and electricity are supplied, or just one), and payment method (monthly direct debit, on receipt of bill, etc.). Tariff data is further complicated by the inclusion of Economy 7 night rates, and multi-tiered tariffs.

The discount data describes the value of a given discount, and on what tariffs the discount applies. This is typically broken down into a unique ID containing the company and tariff plan, along with the same filters as above.

Most quotes rely on discounts to both entice customers in, and to offer competitive rates. As a result, they are key to generating competitor quotes. However, joining the discount and tariff data correctly, to align a discount with the correct tariff it applies to, presented a significant challenge during this project.

The way the discounts had been encoded meant that it was impossible for a machine to join them to the tariffs without some help. To solve this problem a function had to be developed that captured all the possible scenarios and transformed the discounts into a more standard data structure.
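
As a rough sketch of the kind of transformation involved – the column names and the shape of the discount ID below are hypothetical illustrations, not Pure Planet's actual schema – the aim is to reshape each discount record so that it shares join keys with the tariff table:

```python
import pandas as pd

# Hypothetical, simplified tables -- the real schema is far richer than this.
tariffs = pd.DataFrame({
    "supplier": ["SupplierA", "SupplierA", "SupplierB"],
    "plan": ["Fixed12", "Variable", "Green24"],
    "region": ["South West", "South West", "South West"],
    "standing_charge_p_day": [24.0, 26.5, 23.1],
    "unit_rate_p_kwh": [14.2, 15.8, 13.9],
})

discounts = pd.DataFrame({
    "discount_id": ["SupplierA|Fixed12", "SupplierB|Green24"],
    "annual_discount_gbp": [30.0, 15.0],
})


def normalise_discounts(discounts: pd.DataFrame) -> pd.DataFrame:
    """Split the combined supplier|plan ID into the join keys the tariff table uses."""
    parts = discounts["discount_id"].str.split("|", expand=True)
    out = discounts.copy()
    out["supplier"], out["plan"] = parts[0], parts[1]
    return out.drop(columns="discount_id")


# With the discounts in a standard shape, a plain merge lines each discount
# up against the tariff it applies to; tariffs with no discount get zero.
quotable = tariffs.merge(normalise_discounts(discounts),
                         on=["supplier", "plan"], how="left")
quotable["annual_discount_gbp"] = quotable["annual_discount_gbp"].fillna(0.0)
print(quotable)
```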

The Two Deliverables

After an initial investigation phase, two key deliverables were determined. The first was a Python package to help users process the discounts data into a form that could easily and accurately join onto the tariff data. The second was a robust understanding of how quotes can be generated from the data. The idea was that the package would be used in the ETL stage to process the data before storing it in the database, and that the quote logic would be mapped from Python to SQL and applied when fetching a quote in other processes.

Although most tariffs and discounts were straightforward, the few remaining ones came with several complications. As ever in life, it was these tricky ones that were the most interesting from a commercial perspective – hence the need to get this right!
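
For a flavour of the underlying arithmetic, a basic single-fuel annual quote works out roughly as standing charge times days, plus unit rate times consumption, less any discounts. The figures and simplifications below are illustrative only – the real logic also has to handle night rates, tiers and the other complications mentioned above:

```python
def annual_quote_gbp(standing_charge_p_day: float,
                     unit_rate_p_kwh: float,
                     annual_kwh: float,
                     annual_discount_gbp: float = 0.0) -> float:
    """Rough single-fuel annual quote: standing charge plus usage, less discounts.

    Illustrative only -- real tariffs also involve night rates, tiers, VAT, etc.
    """
    pence = standing_charge_p_day * 365 + unit_rate_p_kwh * annual_kwh
    return pence / 100 - annual_discount_gbp


# An Ofgem-style "medium" electricity usage of roughly 2,900 kWh a year
# (an assumed figure) on the hypothetical tariff from the earlier sketch.
print(round(annual_quote_gbp(24.0, 14.2, 2900, annual_discount_gbp=30.0), 2))
```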

The Methodology

Investigation and package development were undertaken in Jupyter notebooks, written in Python, primarily using the `pandas` package. Here, functions were developed to process the discounts data into the preferred form. During development, tests were written with the `pytest` framework to check the functions implemented the logic as intended, with each test covering a specific piece of logic as it was added. This was a true blessing, as on more than one occasion the whole function needed rewriting when new edge cases were found, proving initial assumptions wrong. The new function was simply run through all the previous tests to check it still worked, saving vast amounts of time and ensuring robustness for future development and deployment.
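
A minimal sketch of that test-first pattern is shown below. The function under test is the hypothetical discount normaliser from the earlier sketch, repeated here so the test file runs standalone; in the real project it would be imported from the package, and the names are invented rather than Pure Planet's own.

```python
# test_discounts.py -- run with `pytest test_discounts.py`
# Names are illustrative; the real package and its tests are Pure Planet's own.
import pandas as pd


def normalise_discounts(discounts: pd.DataFrame) -> pd.DataFrame:
    """Split the combined supplier|plan ID into the join keys the tariff table uses."""
    parts = discounts["discount_id"].str.split("|", expand=True)
    out = discounts.copy()
    out["supplier"], out["plan"] = parts[0], parts[1]
    return out.drop(columns="discount_id")


def test_discount_id_splits_into_supplier_and_plan():
    raw = pd.DataFrame({"discount_id": ["SupplierA|Fixed12"],
                        "annual_discount_gbp": [30.0]})
    tidy = normalise_discounts(raw)
    assert tidy.loc[0, "supplier"] == "SupplierA"
    assert tidy.loc[0, "plan"] == "Fixed12"


def test_original_id_column_is_dropped():
    raw = pd.DataFrame({"discount_id": ["SupplierB|Green24"],
                        "annual_discount_gbp": [15.0]})
    assert "discount_id" not in normalise_discounts(raw).columns
```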

Once developed, the functions (along with their tests) were structured into a Python package. Clear documentation was written to describe both the function logic and higher-level how-to guides to enable use of the package. All development was under version control using git and pushed to Bitbucket for sharing with the Pure Planet data team.

Pure Planet uses Amazon Web Services for their cloud infrastructure, and as a result I became much more aware of this technology and what it can do – for example, using the AWS client libraries to access data stored in shared S3 buckets. It was great to see how their data pipeline was set up, and just how effective it was.
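
For anyone who hasn't used it, reading a file out of a shared bucket with boto3 and pandas takes only a few lines. The bucket and key names below are placeholders, and credentials are assumed to come from the usual AWS configuration:

```python
import io

import boto3
import pandas as pd

# Placeholder bucket and key -- substitute your own; credentials are picked up
# from the standard AWS config, environment variables or an IAM role.
s3 = boto3.client("s3")
obj = s3.get_object(Bucket="example-shared-bucket", Key="tariffs/latest.csv")
tariffs = pd.read_csv(io.BytesIO(obj["Body"].read()))
print(tariffs.head())
```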

To prove the understanding of how quotes were built up, a notebook was written to validate generated quotes by comparing them to the quote data fetched manually. This incorporated the newly developed package to process the discount data and join it to the tariff data, followed by implementing the quote logic in pandas to generate quotes. It was then possible to compare the generated quotes to the manual quotes to prove the success of the project.
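
The comparison itself can be as simple as joining the two sets of quotes and flagging any that disagree beyond a small tolerance. A hedged sketch, with made-up figures standing in for the real quote data:

```python
import pandas as pd

# Hypothetical frames: quotes rebuilt from the tariff and discount data,
# and quotes fetched manually from the third-party tool.
generated = pd.DataFrame({"supplier": ["SupplierA", "SupplierB"],
                          "plan": ["Fixed12", "Green24"],
                          "annual_cost_gbp": [469.40, 452.17]})
manual = pd.DataFrame({"supplier": ["SupplierA", "SupplierB"],
                       "plan": ["Fixed12", "Green24"],
                       "annual_cost_gbp": [469.40, 452.22]})

check = generated.merge(manual, on=["supplier", "plan"],
                        suffixes=("_generated", "_manual"))
check["diff_gbp"] = (check["annual_cost_gbp_generated"]
                     - check["annual_cost_gbp_manual"]).abs()

# Anything disagreeing by more than a penny needs investigating.
print(check[check["diff_gbp"] > 0.01])
```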

And Finally…

Big thanks to Doug Ashton and the rest of the data science team at Pure Planet for making my time there so enjoyable. I really felt part of the team from day one. I would also like to extend my thanks to those at Mango and Pure Planet who made this graduate placement opportunity possible.

Author: Duncan Leng, Graduate Data Scientist at Mango


When technical capabilities and company culture combine, IoT-fed data lakes become a powerful brain at the heart of the business

Internet-enabled devices have led to an explosion in the growth of data. On its own, this data has some value, however, the only way to unlock its full potential is by combining it with other data that businesses already hold.

Together, pre-existing data and newly-minted IoT data can provide a full picture of specific insights around a single consumer. It is paramount, however, that companies don’t prioritise innovation at the expense of ethics. Sourcing and analytics must be done correctly – with the right context that respects consumer privacy and wishes around data usage.

The insights gained from successfully blending these two different data sources also unlock secondary benefits including new product development, possible upsells or the ability to build customer goodwill through advice-driven service delivery.

It’s a winning combination, but the challenge is how to actually merge device data with regular customer information.

No easy fit

This problem arises from the fact that IoT device data is a different “shape” to data in traditional customer records.

If you think of a customer record in a sales database as one long row of information, IoT collected information is more like an entire column of time series information, with a supporting web of additional detail. Trying to directly join the two is near impossible, and it is likely that some valuable semantic information could end up lost in the process.

But if IoT information fundamentally resists structure, and existing business databases are built on rigid structures, how do you find an environment that works for both? The answer is a data lake.

Pooling insight

A data lake is a more “fluid” approach to storing and connecting data. It is a central repository where data can be stored in the form it’s generated, whether that is in a relational database format or entirely unstructured. Analytics can then be applied over the top to connect different pieces of information and derive useful business insights.

However, there is more complexity involved in setting up a data lake than just combining all of an organisation’s data and hoping for the best. If you do that, you’ll likely end up with a data swamp – a disorganised, underperforming mess of data that lacks the necessary context to make it useful.

This can be avoided using the expertise of dedicated data engineers. These are the masterminds who build the framework for a data lake and manage the process of extracting data from its source, before transforming it into a usable format and then loading it into the data lake environment. Done properly, this will ensure data provenance, with appropriate metadata to guide users on allowable use cases and analysis.

“If you do that, you’ll likely end up with a data swamp – a disorganised, underperforming mess of data that lacks the necessary context to make it useful”

This sounds like a significant undertaking, and there’s no getting around the fact that doing data lakes right does take time and effort, but it is possible to take a staged approach. Many organisations start with a data “puddle” – a small collection of computers hosting a limited amount of data — and then slowly add to this, increasing the number of computers over time to form the full data lake.

A question of culture

In addition, technical considerations are just one side of the coin. The other side is one of culture. At the core of the problem is that businesses will not succeed with commercialising their IoT data if users are either unaware of, or distrusting of, the data lake and its potential.

While investment in big data continues to grow, a recent NewVantage Partners survey on Big Data and AI found that just 31 percent of organisations consider themselves data driven — the second year in a row that the number has fallen. Data lake technology has been around for several years now, and should be more than capable of enabling these types of organisations, but without the right culture in place, its benefits are seldom felt.

How do you create a culture that centres on being data-driven? As any management team knows, culture shifts are never easy, but a data-driven culture boils down to improving collaboration, communication and understanding between data professionals and business functions.

With a successful technical implementation of a data lake, you then need data professionals to advocate its benefits, and liaise with business departments to understand the types of insights that would be most useful to inform strategic decisions.

This then reinforces business confidence in the data function, and allows the data teams to expand their contributions to the business and be recognised for their hard work. When supported by senior buy-in, this positive feedback loop generates a growing culture of data savviness and data-driven approaches within the organisation.

Brain of the organisation

When technical capabilities and company culture combine, data lakes can become a powerful brain at the heart of the business. With the right analytics tools layered over the top, data lakes can reduce the time to finding insights and surface powerful information. These insights can serve business needs better and faster and are an outright win for any organisation. In short, they are well worth the time and investment.

Author: Dean Wood, Principal Data Scientist


We were thrilled to host Hadley Wickham, who delivered, as ever, a funny and engaging talk to a packed house at LondonR in August. In fact, to give you an idea of just how anticipated this event was, tickets to see Hadley sold out in under two hours!

It's always fascinating for those of us elder members of the R community who remember the good old days to witness the move from academic tools through to commercial adoption and engagement. For many years, R was proposed and rejected by many organisations due to the environment and architecture that existed at the time; we used to spend time trying to work out data sizes and whether things would even work.

I remember talking to Hadley at the first EARL in the US about creating toolsets that would allow organisations who didn't "love" R to use it and deploy it internally, comfortably. Hadley's work, and latterly his team's, has allowed the ecosystem around R to develop from introspection to a wide view of the analytic landscape, and his talk, I felt, reflected on some of these shifts.

Hadley's insight into the mistakes he has made rang very true when considering the scale of the user base today compared to when he started developing packages. That moment of clarity, when you realise you need to prepare things so that people you don't know can pick them up and use them efficiently, lies at the heart of good programming practice but is sometimes easily forgotten. It has driven Hadley on to create better and easier codebases that serve as central platforms, while also sparking others' thoughts and developments.

It was great to hear someone like Hadley acknowledge that innovation isn’t a straight line and that forking and dead ends are essential parts of the process. Speaking to attendees afterwards, this message was highly prized and it felt as though there was an increased confidence with many attendees to go out and try things without the fear of failure.

All in all, a fantastic evening that reinforced just how great the R community is.

If you’d like to view Hadley’s LondonR presentation, you can download it here.

50 shades of R

 

I’ve been joking about R’s “200 shades of grey” on training courses for a long time. The popularity of the book “50 Shades of Grey” has changed the meaning of this statement somewhat. As the film is due to be released on Valentine’s Day I thought this might be worth a quick blog post.

Firstly, where did I get “200 shades of grey” from? This statement was originally derived from the 200 available named colours that contain either “grey” or “gray” in the vector generated by the colours function. As you will see there are in fact 224 shades of grey in R.

```r
greys <- grep("gr[ea]y", colours(), value = TRUE)
length(greys)
#> [1] 224
```

 

This is because there are also colours such as slategrey, darkgrey and even dimgrey! So let's now remove anything that is more than just "grey" or "gray".

 

```r
greys <- grep("^gr[ea]y", colours(), value = TRUE)
length(greys)
#> [1] 204
```

 

So in fact there are 204 that are classified as "grey" or "gray". If we take a closer look though, it's clear that there are not 204 unique shades of grey in R, as we are doubling up so that both the British "grey" and the US "gray" spellings can be used. This is really useful – R users don't have to remember to change the way they usually spell grey/gray (you might also notice that I have used the function colours rather than colors) – but when it comes to counting unique greys it means we have to be a little more specific in our search pattern. So, stripping back to just shades of "grey":

 

```r
greys <- grep("^grey", colours(), value = TRUE)
length(greys)
#> [1] 102
```

 

we find we are actually down to just 102. Interestingly, we don't double up on all grey/gray colours: slategrey4 doesn't exist but slategray4 does!

So really we have 102 shades of grey in R. Of course, this is only using the named colours; if we were to define the colour using rgb we could make use of all 256 grey levels!

 

 

So how can we get 50 shades of grey? Well the colorRampPalette function can help us out by allowing us to generate new colour palettes based on colours we give it. So a palette that goes from grey0 (black) to grey100 (white) can easily be generated.

 

```r
shadesOfGrey <- colorRampPalette(c("grey0", "grey100"))
shadesOfGrey(2)
#> [1] "#000000" "#FFFFFF"
```

 

And 50 shades of grey?

 

[Image: the 50-step grey gradient produced by the code below]

 

```r
fiftyGreys <- shadesOfGrey(50)
mat <- matrix(rep(1:50, each = 50))
image(mat, axes = FALSE, col = fiftyGreys)
box()
```

 

I hear the film is not as "graphic" as the book – but I hope this fits the bill!

 

Author: Andy Nicholls, Data Scientist