Effective Data Analytics In Manufacturing
Blogs home Featured Image

Data analytics is rapidly changing the face of manufacturing as we know it. At Mango, we’re seeing companies using their data effectively to gain an advantage over competitors.

These companies are using data science to properly set up and control manufacturing. For example, automatically adjusting parameters for specific parts/production lines to decrease wastage and meet demand. Research has shown that 68% of manufacturers were already investing in data science to achieve a range of improvements. This means that more than 30% of manufacturers still haven’t adopted a data-driven approach and are therefore not yet working leaner, smarter, improving yields and reducing costs for an increased bottom line.

We know that manufacturing is an asset-intensive industry and companies need to move fast, be more innovative and work smart in order to be competitive. To remain ahead of the game, manufacturers need to adopt a different way of thinking when it comes to data. However, any transition from the industrial to the digital age can be both daunting and a minefield.

Too much data

One of the main problems for many companies – especially within the manufacturing sector – is the speed at which they are collecting massive amounts of real-time data, making it hard to work out what data is actually important. This is even harder without the right tools.

A solution – building a data science capability

To understand their data better, many organisations have started to build teams of Data Scientists. The Data Scientist is indeed becoming an increasingly valuable asset within any organization looking to make the most of their data.

The aim of building a data science capability is to harvest and analyse the data being collected to drive business change. However, many companies struggle to get the right skillsets in their team. In response to this need, we developed the Data Science Radar. The Radar is a conceptual framework that explores character traits and it is a visual aid to support our customers to build and shape an existing data science team, identifying gaps in skillsets and monitoring learning needs. The application has been such a success we provide it free to help companies start their data-driven journey. Take a look at the Data Science Radar here: www.dsradar.com

Choosing the right tools for the job

Data science requires tools that go beyond the capabilities of spreadsheet programs like Excel —which is still often the tool used for data analysis in manufacturing. It is a common, but false, belief that the only alternatives to this are expensive off-the-shelf software packages, which can differ greatly in terms of cost, usability data capacity and visualisation capabilities.

While we use a range of cutting edge tools for our projects, we often recommend one being used around the world by thousands of analysts – the open source R language. From computational science to extensive marketing techniques, R is the most popular analytic language in the world today and a fundamental analytic tool within a range of industries. The growth and popularity of R programming has been helping data-driven organisations succeed for years.

Our knowledge, experience and passion for Data Science means we have engaged in some truly amazing analytic projects. We understand the challenges faced by the manufacturing industry and have worked with companies all over the world to lower product development and operating costs, increase production quality, improve customer experience, and improve manufacturing yields – all using the power of R!

Analytics for non-technical stakeholders

Visualization tools communicate the results of analytics in a clear and precise manner. It’s possible you may have overheard Shiny in discussions between your data analysts and noted it in some of the below case studies. But what is Shiny?

Shiny combines the computational power of R with the interactivity of the modern web. It is a powerful and popular web framework for R programmers to elevate the way people —both technical and non-technical decision makers— consume analytics.

R allows data scientists to effectively analyse large amounts of real-time data but Shiny visualises that data effectively and easily to present outputs for non-analysts, allowing non-technical stakeholders to easily review and filter the data. Outputs can then be hosted on a client’s own servers or via RStudio’s hosting service.

Here are just a few examples of our successful projects:

Mango delivered a large SAS to R migration project with a global semiconductor manufacturer. A complex Shiny application was created to replace the expensive SAS application software already in use. This made it possible to exit an expensive SAS license and adopt modern analytic techniques. This has resulted in improved production yields, reduced costs and enthused production teams with a modern production infrastructure.

Mondelez were using a SAS Roast Coffee Blend Generator. Mango used advanced prescriptive analytics to migrate the client to R, resulting in optimization of their coffee recipe, improved yield qualities and reduced production costs.

Mango helped a global agrochemical company by providing an in-depth code review of their Shiny application, including modification of code to improve performance. A pack of Shiny coding best practices was also developed by Mango for the client to reference in their future developments, thus helping them improve performance and yields.

Campden BRI have a large Consumer and Sensory Science department who perform comprehensive analysis on sensory and consumer data. Due to years of adding additional features to their exisiting database, the internal systems had come to rely on a restrictive ‘jigsaw of legacy code’. Using R, Mango helped rationalise the work flows and processes to provide a more robust solution, which resulted in a neat application that users could use intuitively. The team have streamlined their work and their use of software packages, saving time, money and effort.

Names have been removed where required, more case examples can be found on our website.

Why Mango?

Mango Solutions have been long-term trusted partners with companies in a wide range of industries, including Manufacturing, Pharmaceutical, Retail, Travel, Automotive, Finance, Energy and Government since 2002. Our team of Data Scientists, Data Engineers, Technical Architects and Software Developers deliver independent, forward thinking, critical, predictive and prescriptive analytical solutions.

Mango have assisted hundreds of companies reap the business gains that come from effective data science because our unique mix of both technical and commercial real-world experience ensures best practice approaches.

Are you ready to become data-driven? Please contact us for an obligation-free conversation today with Christina Halliday: challiday@mango-solutions.com 

*RStudio is a partner of Mango Solutions and the creators of Shiny and Shiny commercial products.

ANNOUNCEMENT: EARL London 2018 + abstract submissions open!
Blogs home Featured Image

14 February 2018

Mango Solutions are delighted to announce that loyalty programme pioneer and data science innovator, Edwina Dunn, will keynote at the 2018 Enterprise Applications of the R Language (EARL) Conference in London on 11-13 September.

Mango Solutions’ Chief Data Scientist, Richard Pugh, has said that it is a privilege to have Ms Dunn address Conference delegates.

“Edwina helped to change the data landscape on a global scale while at dunnhumby; Tesco’s Clubcard, My Kroger Plus and other loyalty programmes have paved the way for data-driven decision making in retail,” Mr Pugh said.

“Having Edwina at EARL this year is a win for delegates, who attend the Conference to find inspiration in their use of analytics and data science using the R Language.

“In this centenary year of the 1918 Suffrage act, Edwina’s participation is especially appropriate, as she is the founder of The Female Lead, a non-profit organization dedicated to giving women a platform to share their inspirational stories,” he said.

Ms Dunn is currently CEO at Starcount, a consumer insights company that combines the science of purchase and intent and brings the voice of the customer into the boardroom.

The EARL Conference is a cross-sector conference focusing on the commercial use of the R programming language with presentations from some of the world’s leading practitioners.

More information and tickets are available on the EARL Conference website: earlconf.com


For more information, please contact:
Karis Bouher, Marketing Manager: marketing@mango-solutions.com or +44 (0)1249 705 450

Love Machine: Automating the romantic songwriting process
Blogs home Featured Image
Owen Jones, Placement Student

Songwriting is a very mysterious process. It feels like creating something from nothing. It’s something I don’t feel like I really control.

— Tracy Chapman

It is February. The shortest, coldest, wettest, miserablest month of the British year.

Only two things happen in Britain during February. For a single evening, the people refrain from dipping all their food in batter and deep-frying it, and instead save some time by pouring the batter straight into a frying pan and eating it by itself; and for an entire day, the exchange of modest indications of affection between consenting adults is permitted, although the government advises against significant deviation from the actions specified in the state-issued Approved Romantic Gestures Handbook.

In Section 8.4 (Guidelines for Pre-Marital Communication) the following suggestion is made:

"Written expressions of emotion should be avoided where possible. Should it become absolutely necessary to express emotion in a written format, it should be limited to a 'popular' form of romantic lyricism. Examples of such 'popular' forms include 'love poem' and 'love song'.

Thankfully, for those who have not achieved at least a master’s degree in a related field, writing a poem or song is a virtually impossible task. And following the sustained and highly successful effort to persuade the British youth that a career in the arts is a fast-track to unemployment, the number of applications to study non-STEM subjects at British universities has been falling consistently since the turn of the decade. This ensures that only the very best and most talented songwriters, producing the most creatively ingenuous work, are able to achieve widespread recognition, and therefore the British public are only exposed to high-quality creative influences.

But to us scientists, the lack of method is disturbing. This “creativity” must have a rational explanation. There must be some pattern.

This is unquestionably a problem which can be solved by machine learning, so let’s take the most obvious approach we can: we’ll train a recurrent neural network to generate song lyrics character by character.

You write down a paragraph or two describing several different subjects creating a kind of story ingredients-list, I suppose, and then cut the sentences into four or five-word sections; mix ’em up and reconnect them. You can get some pretty interesting idea combinations like this. You can use them as is or, if you have a craven need to not lose control, bounce off these ideas and write whole new sections.

— David Bowie

To build our neural network I’m going to be using the Keras machine learning interface (which we’re very excited about here at Mango right now – keep an eye out for workshops in the near future!). I’ve largely followed the steps in this example from the Keras for R website, and I’m going to stick to a high-level description of what’s going on, but if you’re the sort of person who would rather dive head-first into the code, don’t feel like you have to hang around here – go ahead and have a play! And if you want to read more about RNNs, this excellent post by Andrej Kaparthy is at least as entertaining and significantly more informative than the one you’re currently reading.

We start by scraping as many love song lyrics as possible from the web – these will form our training material. Here’s the sort of thing we’re talking about:

Well… that’s how they look to us. Actually, after a bit of preprocessing, the computer sees something more like this:

All line breaks are represented by the pair of characters “\n”, and so all the lyrics from all the songs are squashed down into one big long string.

Then we use this string to train the network. We show the network a section of the string, and tell it what comes next.

So the network gradually learns which characters tend to follow a given fixed-length “sentence”. The more of these what-comes-next examples it sees, the better it gets at correctly guessing what should follow any sentence we feed in.

At this point, our network is like a loyal student of a great artist, dutifully copying every brushstroke in minuscule detail and receiving a slap on the wrist and a barked correction every time it slips up. Via this process it appears to have done two things.

Firstly, it seems to have developed an understanding of the “rules” of writing a song. These rules are complex and multi-levelled; the network first had to learn the rules of English spelling and grammar, before it could start to make decisions about when to move to a new line or which rhyming pattern to use.

(Of course, it hasn’t actually “developed an understanding” of these rules. It has no idea what a “word” is, or a “new line”. It just knows that every few characters it should guess " ", and then sometimes it should put in a "\", and whenever it puts in a "\" then it’s got to follow that up with a "n" and then immediately a capital letter. Easy peasy.)

Secondly, and in exactly the same way, the network will have picked up some of the style of the work it is copying. If we were training it on the songs one specific artist, it would have learned to imitate the style of that particular artist – but we’ve gone one better than that and trained it on all the love songs we could find. So effectively, it’s learned how everyone else writes love songs.

But no-one gets famous by writing songs which have already been written. What we need now is some creativity, some passion, a little bit of je ne sais quoi.

Let’s stop telling our network what comes next. Let’s give it the freedom to write whatever it likes.

I don’t think you can ever do your best. Doing your best is a process of trying to do your best.

— Townes van Zandt

It’s interesting to look at the songwriting attempts of the network in the very early stages of training. At first, it is guessing more or less at random what character should come next, so we end up with semi-structured gobbledegook:

fameliawmalYaws. Boflyi, methabeethirts yt
play3mppioty2=ytrnfuunuiYs blllstis
Byyovcecrowth andtpazo's youltpuduc,s Ijd"a]bemob8b>fiume,;Co
Bliovlkfrenuyione (ju'te,'ve ru t Kis
go arLUUs,k'CaufkfR )s'xCvectdvoldes

Avanrvous Ist'dyMe Dolriri

But notice that even in that example, which was taken from a very early training stage, the network has already nailed the “\n” newline combo and has even started to pick up on other consistent structural patterns like closing a “(” with a “)”. Actually, the jumbled nonsense becomes coherent English (or English-esque) ramblings quite quickly.

There is one interesting parameter to adjust when we ask the model to produce some output: the “diversity” parameter, which determines how adventurous the network should be in its choice of character. The higher we set this parameter, the more the network will favour slightly-less-probable characters over the most obvious choice at each point.

If we set the diversity parameter too low, we often degenerate into uncontrolled bursts of la-ing:

la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la la
(... lots more "la"s)

But set it too high and the network decides dictionary English is too limiting.

Oh, this younan every, drock on
Scridh's tty'
Is go only ealled
You could have like the one don'm I dope
Love me
And woment while you all that
Was it statiinc. I living you must?
We dirls anythor

It’s difficult to find the right balance between syllabic repetition and progressive vocabulary, and there’s a surprisingly fine line between the two – this will probably prove to be a fruitful area for further academic research.

I think that identifying the optimal diversity parameter is probably the key to good songwriting.

Songwriting is like editing. You write down all this stuff – all this bad, stupid stuff – and then you have to get rid of everything except the very best.

— Juliana Hatfield

So, that’s what I did.

Here are some particularly “beautiful” passages taken from the huge amount of (largely poor) material the model produced. I haven’t done any editing other than to isolate a few consecutive lines at a time and in the last few examples, to start the network off with certain sentences.

Automated love

I know your eyes in the morning sun
I feel the name of love
Love is a picked the sun
All my life I can make me wanna be with you
I just give up in your head
And I can stay that you want a life
I’ve stay the more than I do

How long will I love you
As long as there is that songs
All the things that you want to find you
I could say me true
I want to fall in love with you
I want my life
And you’re so sweet
When I see you wanted to that for you
I can see you and thing, baby
I wanna be alone

Oh yeah I tell you somethin’
I think you’ll understand
When I say that somethin’
I thought the dartion hyand
I want me way to hear
All the things what you do

Wise men say
Only fools rush in
But I can hear your love
And I don’t wanna be alone

If I should stay
I would only be in your head
I wanna know that I hope I see the sun
I want a best there for me too
I just see that I can have beautiful
So hold me to you

Wishing you a Happy Valentine’s Day! (And, I don’t recommend reciting this to your loved one, they might run away.)

Blogs home

Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our data science radar.

Although visualisation of data is important in order to communicate the results of an analysis to stakeholders, it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.

The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.

Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can lead to the violation of the assumptions of the model being fit, and so decrease the accuracy and usefulness of any conclusions to be drawn later.

This point was demonstrated in a beautifully simplified way by statistician Francis Anscombe, who in 1973 designed a set of small datasets, each showing a distinct pattern of results. Whilst each of the four datasets comprising Anscombe’s Quartet have identical or near identical means, variances, correlations between variables, and linear regression lines, they all highlight the inadequacy of using simple summary statistics in exploratory data analysis.

The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly change between each dataset to really get an in-depth understanding of their similarities and differences.

The code for the Shiny app is available on github.

EARL Seattle Keynote Speaker Announcement: Julia Silge
Blogs home Featured Image

We’re delighted to announce that Julia Silge will be joining us on 7 November in Seattle as our Keynote speaker.

Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

We can’t wait to see what Julia has to say in November.

Submit an abstract

Abstract submissions are open for both the US Roadshow in November and London in September. You could be on the agenda with Julia in Seattle as one of our speakers if you would like to share the R successes in your organisation.

Submit your abstract here.

Early bird tickets now available

Tickets for all EARL Conferences are now available:
London: 11-13 September
Seattle: 7 November
Houston: 9 November
Boston: 13 November

In Between A Rock And A Conditional Join
Blogs home Featured Image

Joining two datasets is a common action we perform in our analyses. Almost all languages have a solution for this task: R has the built-in merge function or the family of join functions in the dplyr package, SQL has the JOIN operation and Python has the merge function from the pandas package. And without a doubt these cover a variety of use cases but there’s always that one exception, that one use case that isn’t covered by the obvious way of doing things.

In my case this is to join two datasets based on a conditional statement. So instead of there being specific columns in both datasets that should be equal to each other I am looking to compare based on something else than equality (e.g. larger than). The following example should hopefully make things clearer.

myData <- data.frame(Record = seq(5), SomeValue=c(10, 8, 14, 6, 2))
## Record SomeValue
## 1 1 10
## 2 2 8
## 3 3 14
## 4 4 6
## 5 5 2

The above dataset, myData, is the dataset to which I want to add values from the following dataset:

linkTable <- data.frame(ValueOfInterest = letters[1:3], LowerBound = c(1, 4, 10),
UpperBound = c(3, 5, 16))
## ValueOfInterest LowerBound UpperBound
## 1 a 1 3
## 2 b 4 5
## 3 c 10 16

This second dataset, linkTable, is the dataset containing the information to be added to myData. You may notice the two dataset have no columns in common. That is because I want to join the data based on the condition that SomeValue is between LowerBound and UpperBound. This may seem like an artificial (and perhaps trivial) example but just imagine SomeValue to be a date or zip code. Then imagine the LowerBoundand UpperBound to be bounds on a specific time period or geographical region respectively.

In Mango’s R training courses one of the most important lessons we teach our participants is that the answer is just as important as how you obtain the answer. So i’ll try to convey that here too instead of just giving you the answer.

Helping you help yourself

So the first step in finding the answer is to explore R’s comprehensive help system and documentation. Since we’re talking about joins its only natural to look at the documentation of the merge function or the join functions from the dplyr package. Unfortunately both only have the option to supply columns that are compared to each other based on equality. However the documentation for the merge functions does mention that when no columns are given the function performs a Cartesian product. That’s just a seriously cool way of saying every row from myData is joined with every row from linkTable. It might not solve the task but it does give me the following idea:

# Attempt #1: Do a cartesian product and then filter the relevant rows
merge(myData, linkTable) %>% 
filter(SomeValue >= LowerBound, SomeValue <= UpperBound) %>% 
select(-LowerBound, -UpperBound)
## Record SomeValue ValueOfInterest
## 1 5 2 a
## 2 1 10 c
## 3 3 14 c

You can do the above in dplyr as well but I’ll leave that as an exercise. The more important question is: what is wrong with the above answer? You may notice that we’re missing records 2 and 4. That’s because these didn’t satisfy the filtering condition. If we wanted to add them back in we would have to do another join. Something that you won’t notice with these small example datasets is that a Cartesian product is an expensive operation, combining all the records of two datasets can result in an explosion of values.

(Sometimes) a SQL is better than the original

When neither of the built-in functions or functions from packages you know solve the problem, the next step is to expand the search. You can directly resort to your favourite search engine (which will inevitably redirect you to Stack Overflow) but it helps to first narrow the search by thinking about any possible clues. For me that clue was that joins are an important part of SQL so I searched for a SQL solution that works in R.

The above search directed me to the excellent sqldf package. This package allows you to write SQL queries and execute them using data.frames instead of tables in a database. I can thus write a SQL JOIN query with a BETWEEN clause and apply it to my two tables.

# Attempt #2: Execute a SQL query
sqldf('SELECT Record, SomeValue, ValueOfInterest 
FROM myData 
LEFT JOIN linkTable ON SomeValue BETWEEN LowerBound and UpperBound')
## Record SomeValue ValueOfInterest
## 1 1 10 c
## 2 2 8 <NA>
## 3 3 14 c
## 4 4 6 <NA>
## 5 5 2 a

Marvellous! That gives me exactly the result I want and with little to no extra effort. The sqldf package takes the data.frames and creates corresponding tables in a temporary database (SQLite by default). It then executes the query and returns a data.frame. Even though the package isn’t built for performance it handles itself quite well, even with large datasets. The only disadvantage I can think of is that you must know a bit of SQL.

So now that I have found the answer I can continue with the next step in the analysis. That would’ve been the right thing to do but then curiosity got the better of me and I continued to find other solutions. For completeness I have listed some of these solutions below.

Fuzzy wuzzy join

If you widen the search for a solution you will (eventually, via various GitHub issues and StackOverflow questions) come across the fuzzyjoin package. If you’re looking for flexible ways to join two data.frames then look no further. The package has a few ready-to-use solutions for a number of use cases: matching on equality with a tolerance (difference_inner_join), string matching (stringdist_inner_join), matching on euclidean distance (distance_inner_join) and many more. For my usecase I will use the more generic fuzzy_left_join which allows for one or more matching functions.

# Attempt #3: use the fuzzyjoin package
fuzzy_left_join(myData, linkTable, 
by=c("SomeValue"="LowerBound", "SomeValue"="UpperBound"),
match_fun=list(`>=`, `<=`)) %>% 
select(Record, SomeValue, ValueOfInterest)
## Record SomeValue ValueOfInterest
## 1 1 10 c
## 2 2 8 <NA>
## 3 3 14 c
## 4 4 6 <NA>
## 5 5 2 a

Again, this is exactly what we’re looking for. Compared to the SQL alternative it takes a little more time to figure out what is going on but that is a minor disadvantage. On the other hand, now there is no need to go back and forth with a database backend. I haven’t checked what the performance differences are, that is a little out of scope for this post.

If not dplyr then data.table

I know it can be slightly annoying when someone answers your question about dplyr by saying it can be done in data.table but it’s always good to keep an open mind. Especially when one solves a task the other can’t (yet). It doesn’t take much effort to convert from a data.frame to a data.table. From there we can use the foverlaps function to do a non-equi join (as it is referred to in data.table-speak).

# Attempt #4: Use the data.table package
myDataDT <- data.table(myData)
myDataDT[, SomeValueHelp := SomeValue]
linkTableDT <- data.table(linkTable)
setkey(linkTableDT, LowerBound, UpperBound)

result <- foverlaps(myDataDT, linkTableDT, by.x=c('SomeValue', 'SomeValueHelp'), 
by.y=c('LowerBound', 'UpperBound'))
result[, .(Record, SomeValue, ValueOfInterest)]
## Record SomeValue ValueOfInterest
## 1: 1 10 c
## 2: 2 8 NA
## 3: 3 14 c
## 4: 4 6 NA
## 5: 5 2 a

Ok so I’m not very well versed in the data.table way of doing things. I’m sure there is a less verbose way but this will do for now. If you know the magical spell please let me know (through the links provided at the end).

Update 6-Feb-2018
Stefan Fritsch provided the following (less verbose) way of doing it with data.table:

linkTableDT[myDataDT, on = .(LowerBound <= SomeValue, UpperBound >= SomeValue),
.(Record, SomeValue, ValueOfInterest)]
## Record SomeValue ValueOfInterest
## 1: 1 10 c
## 2: 2 8 NA
## 3: 3 14 c
## 4: 4 6 NA
## 5: 5 2 a

The pythonic way

Now that we’re off the tidyverse-reservoir, we might as well go all the way. During my search I also encountered a Python solution that looked interesting. It involves using pandas and some matrix multiplication and works as follows (yes, you can run Python code in a RMarkdown document).

import pandas as pd
# Attempt #5: Use python and the pandas package
# create the pandas Data Frames (kind of like R data.frame)
myDataDF = pd.DataFrame({'Record':range(1,6), 'SomeValue':[10, 8, 14, 6, 2]})
linkTableDF = pd.DataFrame({'ValueOfInterest':['a', 'b', 'c'], 'LowerBound': [1, 4, 10],
'UpperBound':[3, 5, 16]})
# set the index of the linkTable (kind of like setting row names) 
linkTableDF = linkTableDF.set_index('ValueOfInterest')
# now apply a function to each row of the linkTable
# this function checks if any of the values in myData are between the upper
# and lower bound of a specific row thus returning 5 values (length of myData)
mask = linkTableDF.apply(lambda r: myDataDF.SomeValue.between(r['LowerBound'], 
r['UpperBound']), axis=1)
# mask is a 3 (length of linkTable) by 5 matrix of True/False values
# by transposing it we get the row names (the ValueOfInterest) as the column names
mask = mask.T
# we can then matrix multiply mask with its column names
myDataDF['ValueOfInterest'] = mask.dot(mask.columns)
## Record SomeValue ValueOfInterest
## 0 1 10 c
## 1 2 8 
## 2 3 14 c
## 3 4 6 
## 4 5 2 a

This is a nice way of doing it in Python but it’s definitely not as readable as the sqldf or fuzzyjoin alternatives. I for one had to blink at it a couple of times before I understood this witchcraft. I didn’t search extensively for a solution in Python so this may actually not be the right way of doing it. If you know of a better solution let me know via the links below.

Have no fear, the tidyverse is here

As you search for solutions to your own tasks you will undoubtedly come across many Stack Overflow questions and Github Issues. Hopefully, they will provide the answer to your question or at least guide you to one. When they do, don’t forget to upvote or leave a friendly comment. When they don’t, do not despair but see it as a challenge to contribute your own solution. In my case the issue had already been reported and the dplyr developers are on it. I look forward to trying out their solution in the near future.

The code for this post is available on Github. I welcome any feedback, please let me know via Twitter or Github.