Virtual Training

“Education is the passport to the future, for tomorrow belongs to those who prepare for it today.” – Malcolm X

As Nelson Mandela acknowledged, education is truly the most powerful weapon which can be used to change the world. This year especially, we’ve all experienced the need to be agile, to adapt to changing circumstances and to recognise how upskilling and learning are essential to expanding our capability to meet an ever-changing environment.

It is upon foundations such as these that Mango has built its education solutions, working with data science teams to build future-proof skills aligned to strategic objectives.

A recent survey by Udemy noted the recent and specific emphasis on upskilling and reskilling as a result of the pandemic, with 62% of organisations aspiring to close their skills gap. A key starting point for businesses is assessing the skills needed to meet their strategic aims and business objectives. In addition, they need to consider the diverse capabilities across their teams to truly embrace the right learning culture. Team building and upskilling is an integral part of embracing change for a data-driven future. According to the Udemy report, data literacy is the new computer literacy.

Workforces with strong data skills across the organisation, not just limited to the analytics team, can help embrace these positive changes. Quite often, it’s the business stakeholders who own the targets and processes who are most empowered by engaging in the data & analytics conversation.

As educators in data science and advanced analytics, we’d like to share some of the most effective strategies when looking to upskill teams and business stakeholders appropriately:

  • Ensure a dynamic, facilitated learning environment – This year, like no other before, saw providers going virtual. Whilst nothing replaces face-to-face contact, any method that brings your teams together virtually in a real classroom setting is the next best thing. Applying learning to a workflow requires a mentoring-based approach to help build lasting, best-in-class capability. Self-learning or pre-recorded lectures have their place but lack the interactive ‘in person trainer’ approach, where wider questions can be answered.
  • Apply a unique learning experience tailored to the market or industry – the ability to adapt to the needs of a diverse group is a core skill for a trainer. Practice exercises designed to help beyond the classroom will allow newly acquired skills to be incorporated into a daily workflow.
  • Ensure training partners have real world experience from industry – the ability to showcase relevant examples and not just the theory, can really help bring a programme to life. In our experience, the ability to share and provide value via real-world applications, combined with practical, proven approaches and best practice advice, is key.
  • Choose courses as an integral part of a learning pathway – courses for individuals and teams should be chosen as part of learning pathways to fill any capability gaps. Processes which invest in capability, with ongoing development of skills, are proven to help staff retention. There should also be processes in place to retest these new skills.
  • Demystify data science – the ability to establish a common language between all functions of an organisation is essential to a collaborative partnership. This depth of understanding will then ensure close alignment between analytics and strategy, helping to remove any barriers to change.

Following the delivery of a recent training programme to AstraZeneca, Gabriella Rustici, Data Science Learning Director, commented:

“Having worked with Mango previously on a training project, we reached out to them as a trusted partner to assist us with a data science training initiative which involved helping cohorts from our R&D data science function embark on their R journey.  We were also looking for a workshop to help our scientists ‘demystify’ data science and understand the terminology – establishing a common language between scientists and data scientists.

Mango helped us create a remote virtual classroom R training program, which included support surgeries designed to enable participants to really absorb what they had learnt from the program. Feedback received from course participants was excellent, with comments such as: “The instructor was great, really patient”, “The instructor was very enthusiastic, clearly knew their topic and the learning material was great”, “I got a lot from the course” and “I’m keen to learn more R, hopefully with Mango”. The workshop was well received and has certainly given us a good start to increasing awareness of what data science can do.”

About Mango Training

Whether you’re seeking R training courses, Python training courses or more, our comprehensive training programmes are specially designed to guide practising data scientists and data engineers from breakthrough to mastery level in R programming, Python, Shiny, AI/ML and more.



We were thrilled to host Hadley Wickham, who delivered, as ever, a funny and engaging talk to a packed house at LondonR in August. In fact, to give you an idea of how highly anticipated this event was, tickets to see Hadley sold out in under two hours!

It’s always fascinating for those of us elder members of the R community who remember the good old days to witness the move from academic tool to commercial adoption and engagement. For many years, R was proposed and rejected by many organisations due to the environments and architectures that existed; we used to spend time trying to work out data sizes and whether the infrastructure would cope.

I remember talking to Hadley at the first EARL in the US about creating toolsets that allowed organisations who didn’t “love” R to use it and deploy it internally, comfortably. Hadley’s work, and latterly that of his team, has allowed the ecosystem around R to develop from introspection to a wide view of the analytic landscape, and his talk, I felt, reflected on some of these shifts.

Hadley’s insight into the mistakes he has made rang very true when considering the scale of the user base today compared to when he started developing packages. That moment of clarity, when you realise you need to prepare things so that people you don’t know can pick them up and use them efficiently, lies at the heart of good programming practice but is sometimes easily forgotten. This has driven Hadley on to create better and easier codebases that serve as central platforms while also inspiring others’ ideas and developments.

It was great to hear someone like Hadley acknowledge that innovation isn’t a straight line and that forking and dead ends are essential parts of the process. Speaking to attendees afterwards, this message was highly prized, and it felt as though many attendees left with increased confidence to go out and try things without the fear of failure.

All in all a fantastic evening that reinforced just how great the R community is.

If you’d like to view Hadley’s LondonR presentation, you can download it here.


It’s easy to get stuck in the day-to-day at the office and there’s never time to upskill or even think about career development. However, to really grow and develop your organisation, it’s important to grow and develop your team.

While there are many ways to develop teams, including training and providing time to complete personal (and relevant) projects, conferences provide a range of benefits.

Spark innovation
Some of the best in the business present their projects, ideas and solutions at EARL each year. It’s the perfect opportunity to see what’s trending and what’s really working. Topics at EARL Conferences have included best-practice SAS to R migration, Shiny applications, using social media data and web scraping, plus presentations on R in marketing, healthcare, finance, insurance and transport.

A cross-sector conference like EARL can help your organisation think outside the box because learnings are transferable, regardless of industry.

Imbue knowledge
This brings us to knowledge. Learning from the best in the business will help employees expand their knowledge base. This can keep them motivated and engaged in what they’re doing, and a wider knowledge base can also inform their everyday tasks enabling them to advance the way they do their job.

When employees feel like you want to invest in them, they stay engaged and are more likely to remain in the same organisation for longer.

Encourage networking
EARL attracts R users from all levels and industries and not just to speak. The agenda offers plenty of opportunities to network with some of the industry’s most engaged R users. This is beneficial for a number of reasons, including knowledge exchange and sharing your organisation’s values.

Boost inspiration
We often see delegates who have come to an EARL Conference with a specific business challenge in mind. By attending, they get access to the current innovations, knowledge and networking mentioned above, and can return to their team post-conference with a renewed vigour to solve those problems using their new-found knowledge.

Making the most out of attending EARL

After all of that, the next step is making sure your organisation makes the most out of attending EARL. We recommend:

Setting goals
Do you have a specific challenge you’re trying to solve in your organisation? Going with a set challenge in mind means your team can plan which sessions to sit in and who they should talk to during the networking sessions.

De-briefing
This is two-fold:
1) Writing a post-conference report will help your team put what they have learnt at EARL into action.
2) Not everyone can attend, so those who do can share their new-found knowledge with their peers who can learn second-hand from their colleague’s experience.

Following up
We’re all guilty of going to a conference, coming back inspired and then getting lost in the day-to-day. Assuming you’ve set goals and de-briefed, it should be easy to develop a follow-up plan.

You can make the most of inspired team members to put in place new strategies, technologies and innovations through further training, contact follow-ups and new procedure development.

EARL Conference can offer a deal for organisations looking to send more than 5 delegates.

Buy tickets now


We told you EARL 2018 was going to be awesome!

We’re excited to announce that Hadley Wickham will be the Keynote Speaker at our EARL Houston event on 9 November 2018.

Technically, we think Hadley needs no introduction, but just in case…

Hadley is Chief Scientist at RStudio, the company that created the most-used IDE for businesses and individuals using R around the world. He is interested in building computational and cognitive tools that make data ingest, manipulation, visualisation and analysis easier, particularly via the more than 30 R packages he has developed. He also leads the team that creates and maintains the widely used ‘tidyverse’, which contains some of the most popular packages in the R community.

An encouraging and supportive member of the R community, Hadley is well-known for his deep insight and willingness to answer questions and share his knowledge, authoring a number of books and online resources. While the topic of his talk will be a surprise, we know delegates will come away from his session with plenty to think about.

Take the stage with Hadley

Abstract submissions are open for both the US Roadshow in November and London in September. You could be on the agenda with Hadley in Houston as one of our speakers if you would like to share the R successes in your organisation.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November

Effective Data Analytics In Manufacturing

Data analytics is rapidly changing the face of manufacturing as we know it. At Mango, we’re seeing companies using their data effectively to gain an advantage over competitors.

These companies are using data science to properly set up and control manufacturing, for example, automatically adjusting parameters for specific parts and production lines to decrease wastage and meet demand. Research has shown that 68% of manufacturers are already investing in data science to achieve a range of improvements. This means that more than 30% of manufacturers still haven’t adopted a data-driven approach and are therefore not yet working leaner or smarter, improving yields, or reducing costs for an increased bottom line.

We know that manufacturing is an asset-intensive industry and companies need to move fast, be more innovative and work smart in order to be competitive. To remain ahead of the game, manufacturers need to adopt a different way of thinking when it comes to data. However, any transition from the industrial to the digital age can be both daunting and a minefield.

Too much data

One of the main problems for many companies – especially within the manufacturing sector – is the speed at which they are collecting massive amounts of real-time data, making it hard to work out what data is actually important. This is even harder without the right tools.

A solution – building a data science capability

To understand their data better, many organisations have started to build teams of Data Scientists. The Data Scientist is becoming an increasingly valuable asset within any organisation looking to make the most of its data.

The aim of building a data science capability is to harvest and analyse the data being collected to drive business change. However, many companies struggle to get the right skillsets in their team. In response to this need, we developed the Data Science Radar. The Radar is a conceptual framework that explores character traits and acts as a visual aid, supporting our customers in building and shaping an existing data science team, identifying gaps in skillsets and monitoring learning needs. The application has been such a success that we provide it free to help companies start their data-driven journey. Take a look at the Data Science Radar here:

Choosing the right tools for the job

Data science requires tools that go beyond the capabilities of spreadsheet programs like Excel, which is still often the tool used for data analysis in manufacturing. It is a common, but false, belief that the only alternatives are expensive off-the-shelf software packages, which can differ greatly in terms of cost, usability, data capacity and visualisation capabilities.

While we use a range of cutting-edge tools for our projects, we often recommend one being used around the world by thousands of analysts: the open-source R language. From computational science to marketing analytics, R is the most popular analytic language in the world today and a fundamental analytic tool within a range of industries. The growth and popularity of R programming has been helping data-driven organisations succeed for years.

Our knowledge, experience and passion for Data Science means we have engaged in some truly amazing analytic projects. We understand the challenges faced by the manufacturing industry and have worked with companies all over the world to lower product development and operating costs, increase production quality, improve customer experience, and improve manufacturing yields – all using the power of R!

Analytics for non-technical stakeholders

Visualisation tools communicate the results of analytics in a clear and precise manner. You may have overheard Shiny mentioned in discussions between your data analysts, or noticed it in some of the case studies below. But what is Shiny?

Shiny combines the computational power of R with the interactivity of the modern web. It is a powerful and popular web framework that lets R programmers elevate the way people, both technical and non-technical decision makers, consume analytics.

R allows data scientists to analyse large amounts of real-time data effectively, while Shiny presents that data visually, producing outputs that non-technical stakeholders can easily review and filter. Outputs can then be hosted on a client’s own servers or via RStudio’s hosting service.
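As an illustration of how little code a Shiny app requires, here is a minimal, hypothetical sketch (not one of the client applications described below): a slider filters the built-in mtcars dataset and the table re-renders reactively in the browser.

```r
library(shiny)

# UI: a slider input and a table output laid out on a page
ui <- fluidPage(
  titlePanel("mtcars explorer"),
  sliderInput("minMpg", "Minimum miles per gallon:",
              min = 10, max = 34, value = 20),
  tableOutput("filtered")
)

# Server: re-filter the data whenever the slider moves
server <- function(input, output) {
  output$filtered <- renderTable({
    mtcars[mtcars$mpg >= input$minMpg, ]
  })
}

# shinyApp() bundles UI and server; running this script
# (e.g. with runApp()) serves the app locally in a browser
app <- shinyApp(ui = ui, server = server)
```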

Here are just a few examples of our successful projects:

Mango delivered a large SAS to R migration project with a global semiconductor manufacturer. A complex Shiny application was created to replace the expensive SAS application software already in use. This made it possible to exit an expensive SAS license and adopt modern analytic techniques. This has resulted in improved production yields, reduced costs and enthused production teams with a modern production infrastructure.

Mondelez were using a SAS Roast Coffee Blend Generator. Mango used advanced prescriptive analytics to migrate the client to R, resulting in optimization of their coffee recipe, improved yield qualities and reduced production costs.

Mango helped a global agrochemical company by providing an in-depth code review of their Shiny application, including modification of code to improve performance. A pack of Shiny coding best practices was also developed by Mango for the client to reference in their future developments, thus helping them improve performance and yields.

Campden BRI have a large Consumer and Sensory Science department who perform comprehensive analysis on sensory and consumer data. After years of adding features to their existing database, the internal systems had come to rely on a restrictive ‘jigsaw of legacy code’. Using R, Mango helped rationalise the workflows and processes to provide a more robust solution, resulting in a neat application that users could operate intuitively. The team have streamlined their work and their use of software packages, saving time, money and effort.

Names have been removed where required; more case examples can be found on our website.

Why Mango?

Mango Solutions have been a long-term trusted partner to companies in a wide range of industries, including Manufacturing, Pharmaceutical, Retail, Travel, Automotive, Finance, Energy and Government, since 2002. Our team of Data Scientists, Data Engineers, Technical Architects and Software Developers deliver independent, forward-thinking, critical, predictive and prescriptive analytical solutions.

Mango have assisted hundreds of companies in reaping the business gains that come from effective data science, because our unique mix of technical and commercial real-world experience ensures best-practice approaches.

Are you ready to become data-driven? Please contact us for an obligation-free conversation today with Christina Halliday: 

*RStudio is a partner of Mango Solutions and the creators of Shiny and Shiny commercial products.

ANNOUNCEMENT: EARL London 2018 + abstract submissions open!

14 February 2018

Mango Solutions are delighted to announce that loyalty programme pioneer and data science innovator, Edwina Dunn, will keynote at the 2018 Enterprise Applications of the R Language (EARL) Conference in London on 11-13 September.

Mango Solutions’ Chief Data Scientist, Richard Pugh, has said that it is a privilege to have Ms Dunn address Conference delegates.

“Edwina helped to change the data landscape on a global scale while at dunnhumby; Tesco’s Clubcard, My Kroger Plus and other loyalty programmes have paved the way for data-driven decision making in retail,” Mr Pugh said.

“Having Edwina at EARL this year is a win for delegates, who attend the Conference to find inspiration in their use of analytics and data science using the R Language.

“In this centenary year of the 1918 Suffrage Act, Edwina’s participation is especially appropriate, as she is the founder of The Female Lead, a non-profit organisation dedicated to giving women a platform to share their inspirational stories,” he said.

Ms Dunn is currently CEO at Starcount, a consumer insights company that combines the science of purchase and intent and brings the voice of the customer into the boardroom.

The EARL Conference is a cross-sector conference focusing on the commercial use of the R programming language with presentations from some of the world’s leading practitioners.

More information and tickets are available on the EARL Conference website:


For more information, please contact:
Karis Bouher, Marketing Manager: or +44 (0)1249 705 450


Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our Data Science Radar.

Although visualisation of data is important in order to communicate the results of an analysis to stakeholders, it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.

The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.

Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can lead to the violation of the assumptions of the model being fit, and so decrease the accuracy and usefulness of any conclusions to be drawn later.

This point was demonstrated in a beautifully simple way by the statistician Francis Anscombe, who in 1973 designed a set of small datasets, each showing a distinct pattern of results. Each of the four datasets comprising Anscombe’s Quartet has identical or near-identical means, variances, correlations between variables and linear regression lines, yet when plotted they reveal four very different relationships, highlighting the inadequacy of relying on simple summary statistics alone in exploratory data analysis.
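Because the quartet ships with base R as the built-in anscombe data frame, you can verify these matching statistics directly:

```r
# Anscombe's Quartet is built into base R as the `anscombe` data frame;
# the column pairs (x1, y1) ... (x4, y4) are the four datasets.
for (i in 1:4) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  cat(sprintf("Dataset %d: mean(x) = %.2f, mean(y) = %.2f, var(x) = %.2f, cor = %.3f\n",
              i, mean(x), mean(y), var(x), cor(x, y)))
}
# Every dataset reports mean(x) = 9, mean(y) ~ 7.50, var(x) = 11 and
# cor ~ 0.816, yet plot(x, y) looks completely different for each one.
```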

The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly change between each dataset to really get an in-depth understanding of their similarities and differences.

The code for the Shiny app is available on GitHub.

EARL Seattle Keynote Speaker Announcement: Julia Silge

We’re delighted to announce that Julia Silge will be joining us on 7 November in Seattle as our Keynote speaker.

Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

We can’t wait to see what Julia has to say in November.

Submit an abstract

Abstract submissions are open for both the US Roadshow in November and London in September. You could be on the agenda with Julia in Seattle as one of our speakers if you would like to share the R successes in your organisation.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November

In Between A Rock And A Conditional Join

Joining two datasets is a common action we perform in our analyses. Almost all languages have a solution for this task: R has the built-in merge function or the family of join functions in the dplyr package, SQL has the JOIN operation and Python has the merge function from the pandas package. And without a doubt these cover a variety of use cases but there’s always that one exception, that one use case that isn’t covered by the obvious way of doing things.

In my case, the task is to join two datasets based on a conditional statement: instead of requiring specific columns in both datasets to be equal, I want to compare based on something other than equality (e.g. larger than). The following example should hopefully make things clearer.

myData <- data.frame(Record = seq(5), SomeValue = c(10, 8, 14, 6, 2))
##   Record SomeValue
## 1      1        10
## 2      2         8
## 3      3        14
## 4      4         6
## 5      5         2

The above dataset, myData, is the dataset to which I want to add values from the following dataset:

linkTable <- data.frame(ValueOfInterest = letters[1:3], LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))
##   ValueOfInterest LowerBound UpperBound
## 1               a          1          3
## 2               b          4          5
## 3               c         10         16

This second dataset, linkTable, contains the information to be added to myData. You may notice the two datasets have no columns in common. That is because I want to join the data based on the condition that SomeValue is between LowerBound and UpperBound. This may seem like an artificial (and perhaps trivial) example, but just imagine SomeValue to be a date or zip code. Then imagine LowerBound and UpperBound to be the bounds on a specific time period or geographical region, respectively.

In Mango’s R training courses, one of the most important lessons we teach our participants is that the answer is just as important as how you obtain it. So I’ll try to convey that here too, instead of just giving you the answer.

Helping you help yourself

The first step in finding the answer is to explore R’s comprehensive help system and documentation. Since we’re talking about joins, it’s only natural to look at the documentation for the merge function or the join functions from the dplyr package. Unfortunately, both only allow you to supply columns that are compared based on equality. However, the documentation for the merge function does mention that when no columns are given, the function performs a Cartesian product. That’s just a seriously cool way of saying that every row from myData is joined with every row from linkTable. It might not solve the task by itself, but it does give me the following idea:

# Attempt #1: do a Cartesian product, then filter the relevant rows
library(dplyr)  # for filter(), select() and the %>% pipe

merge(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(-LowerBound, -UpperBound)
##   Record SomeValue ValueOfInterest
## 1      5         2               a
## 2      1        10               c
## 3      3        14               c

You can do the above entirely in dplyr as well, but I’ll leave that as an exercise. The more important question is: what is wrong with the above answer? You may notice that we’re missing records 2 and 4, because they didn’t satisfy the filtering condition. If we wanted to add them back in, we would have to do another join. Something you won’t notice with these small example datasets is that a Cartesian product is an expensive operation: combining all the records of two datasets can result in an explosion of values.
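As a sketch of that repair step (assuming dplyr 1.1.0 or later, which provides cross_join() for the Cartesian product), the filtered result can be joined back onto myData so records 2 and 4 reappear as NA:

```r
library(dplyr)

myData <- data.frame(Record = seq(5), SomeValue = c(10, 8, 14, 6, 2))
linkTable <- data.frame(ValueOfInterest = letters[1:3],
                        LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))

# Cartesian product, then keep only the rows satisfying the condition
matched <- cross_join(myData, linkTable) %>%
  filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>%
  select(Record, SomeValue, ValueOfInterest)

# A second (left) join restores the records that matched nothing
result <- left_join(myData, matched, by = c("Record", "SomeValue"))
result
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a
```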

(Sometimes) a SQL is better than the original

When neither the built-in functions nor the functions from packages you know solve the problem, the next step is to expand the search. You can resort directly to your favourite search engine (which will inevitably redirect you to Stack Overflow), but it helps to first narrow the search by thinking about any possible clues. For me, that clue was that joins are an important part of SQL, so I searched for a SQL solution that works in R.

The above search directed me to the excellent sqldf package. This package allows you to write SQL queries and execute them using data.frames instead of tables in a database. I can thus write a SQL JOIN query with a BETWEEN clause and apply it to my two tables.

# Attempt #2: execute a SQL query
library(sqldf)

sqldf("SELECT Record, SomeValue, ValueOfInterest
       FROM myData
       LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound AND UpperBound")
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Marvellous! That gives me exactly the result I want and with little to no extra effort. The sqldf package takes the data.frames and creates corresponding tables in a temporary database (SQLite by default). It then executes the query and returns a data.frame. Even though the package isn’t built for performance it handles itself quite well, even with large datasets. The only disadvantage I can think of is that you must know a bit of SQL.

So now that I have found the answer I can continue with the next step in the analysis. That would’ve been the right thing to do but then curiosity got the better of me and I continued to find other solutions. For completeness I have listed some of these solutions below.

Fuzzy wuzzy join

If you widen the search for a solution you will (eventually, via various GitHub issues and Stack Overflow questions) come across the fuzzyjoin package. If you’re looking for flexible ways to join two data.frames, look no further. The package has ready-to-use solutions for a number of use cases: matching on equality with a tolerance (difference_inner_join), string matching (stringdist_inner_join), matching on Euclidean distance (distance_inner_join) and many more. For my use case I will use the more generic fuzzy_left_join, which allows for one or more matching functions.

# Attempt #3: use the fuzzyjoin package
library(fuzzyjoin)
library(dplyr)  # for select() and the %>% pipe

fuzzy_left_join(myData, linkTable,
                by = c("SomeValue" = "LowerBound", "SomeValue" = "UpperBound"),
                match_fun = list(`>=`, `<=`)) %>%
  select(Record, SomeValue, ValueOfInterest)
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a

Again, this is exactly what we’re looking for. Compared to the SQL alternative it takes a little more time to figure out what is going on but that is a minor disadvantage. On the other hand, now there is no need to go back and forth with a database backend. I haven’t checked what the performance differences are, that is a little out of scope for this post.

If not dplyr then data.table

I know it can be slightly annoying when someone answers your question about dplyr by saying it can be done in data.table, but it’s always good to keep an open mind, especially when one solves a task the other can’t (yet). It doesn’t take much effort to convert a data.frame to a data.table. From there we can use the foverlaps function to do a non-equi join (as it is referred to in data.table-speak).

# Attempt #4: use the data.table package
library(data.table)

myDataDT <- data.table(myData)
# foverlaps() matches intervals, so duplicate SomeValue to form a zero-width interval
myDataDT[, SomeValueHelp := SomeValue]
linkTableDT <- data.table(linkTable)
setkey(linkTableDT, LowerBound, UpperBound)

result <- foverlaps(myDataDT, linkTableDT, by.x = c("SomeValue", "SomeValueHelp"),
                    by.y = c("LowerBound", "UpperBound"))
result[, .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

Ok so I’m not very well versed in the data.table way of doing things. I’m sure there is a less verbose way but this will do for now. If you know the magical spell please let me know (through the links provided at the end).

Update 6-Feb-2018
Stefan Fritsch provided the following (less verbose) way of doing it with data.table:

linkTableDT[myDataDT, on = .(LowerBound <= SomeValue, UpperBound >= SomeValue),
            .(Record, SomeValue, ValueOfInterest)]
##    Record SomeValue ValueOfInterest
## 1:      1        10               c
## 2:      2         8              NA
## 3:      3        14               c
## 4:      4         6              NA
## 5:      5         2               a

The pythonic way

Now that we’re off the tidyverse-reservoir, we might as well go all the way. During my search I also encountered a Python solution that looked interesting. It involves using pandas and some matrix multiplication, and works as follows (yes, you can run Python code in an R Markdown document).

import pandas as pd

# Attempt #5: use Python and the pandas package
# create the pandas DataFrames (kind of like R data.frames)
myDataDF = pd.DataFrame({'Record': range(1, 6), 'SomeValue': [10, 8, 14, 6, 2]})
linkTableDF = pd.DataFrame({'ValueOfInterest': ['a', 'b', 'c'],
                            'LowerBound': [1, 4, 10],
                            'UpperBound': [3, 5, 16]})
# set the index of the linkTable (kind of like setting row names)
linkTableDF = linkTableDF.set_index('ValueOfInterest')
# now apply a function to each row of the linkTable;
# this function checks which of the values in myData are between the lower
# and upper bound of a specific row, returning 5 values (the length of myData)
mask = linkTableDF.apply(lambda r: myDataDF.SomeValue.between(r['LowerBound'],
                                                              r['UpperBound']), axis=1)
# mask is a 3 (length of linkTable) by 5 matrix of True/False values;
# by transposing it we get the row names (the ValueOfInterest) as the column names
mask = mask.T
# we can then matrix multiply mask with its column names
myDataDF['ValueOfInterest'] =
myDataDF
##    Record  SomeValue ValueOfInterest
## 0       1         10               c
## 1       2          8
## 2       3         14               c
## 3       4          6
## 4       5          2               a

This is a nice way of doing it in Python but it’s definitely not as readable as the sqldf or fuzzyjoin alternatives. I for one had to blink at it a couple of times before I understood this witchcraft. I didn’t search extensively for a solution in Python so this may actually not be the right way of doing it. If you know of a better solution let me know via the links below.

Have no fear, the tidyverse is here

As you search for solutions to your own tasks you will undoubtedly come across many Stack Overflow questions and Github Issues. Hopefully, they will provide the answer to your question or at least guide you to one. When they do, don’t forget to upvote or leave a friendly comment. When they don’t, do not despair but see it as a challenge to contribute your own solution. In my case the issue had already been reported and the dplyr developers are on it. I look forward to trying out their solution in the near future.
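For readers on a recent dplyr (1.1.0 or later), that solution has since arrived in the form of join_by(), whose between() helper expresses this exact condition; a minimal sketch:

```r
library(dplyr)  # join_by() requires dplyr >= 1.1.0

myData <- data.frame(Record = seq(5), SomeValue = c(10, 8, 14, 6, 2))
linkTable <- data.frame(ValueOfInterest = letters[1:3],
                        LowerBound = c(1, 4, 10),
                        UpperBound = c(3, 5, 16))

# between(SomeValue, LowerBound, UpperBound) states the non-equi join
# condition directly; unmatched records come back as NA
result <- left_join(myData, linkTable,
                    by = join_by(between(SomeValue, LowerBound, UpperBound))) %>%
  select(Record, SomeValue, ValueOfInterest)
result
##   Record SomeValue ValueOfInterest
## 1      1        10               c
## 2      2         8            <NA>
## 3      3        14               c
## 4      4         6            <NA>
## 5      5         2               a
```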

The code for this post is available on GitHub. I welcome any feedback; please let me know via Twitter or GitHub.

The EARLy career scholarship

At Mango, we’re passionate about R and promoting its use in enterprise – it’s why we created the EARL Conferences. We understand the importance of sharing knowledge to generate new ideas and change the way organisations use R for the better.

This year we are on a mission to actively encourage the attendance of R users who are either in a very early stage of their career or are finishing their academic studies and looking at employment options.

We’re offering EARLy career R users a chance to come to EARL – we have a number of 2-day conference passes for EARL London and tickets for each 1-day event in the US. This year’s dates are:
London, 12-13 September
Seattle, 7 November
Houston, 9 November
Boston, 13 November

Who can apply?

  • Anyone in their first year of employment
  • Anyone doing an internship or work placement
  • Anyone who has recently finished – or will soon be finishing – their academic studies and is actively pursuing a career in Analytics

To apply for a free EARLy Career ticket, tell us why you would like to attend an EARL Conference and how attending will help you advance your knowledge and your career.

(Minimum 200 words, maximum 500 words)

Submit your response here.

Terms and conditions: ‘Winners’ will receive tickets for any EARL Conference of their choice. This does not include travel or accommodation. The tickets are non-transferable. The tickets cannot be exchanged for cash.