
Julia Silge is joining us as one of our keynote speakers at EARL London 2019. We can’t wait to hear Julia’s full keynote, but until then she kindly answered a few questions. Julia shared with us what we can expect from her address – which will focus on how Stack Overflow uses R and their recent developer survey.

Hi Julia! Tell us about the Stack Overflow Developer Survey and your role at Stack Overflow

The Stack Overflow Developer Survey is the largest and most comprehensive survey of people who code around the world each year. This year, we had almost 90,000 respondents who shared their opinions on topics including their favourite technologies, their priorities in looking for a job, and what music they listen to while coding. I am the data scientist who works on this survey, and I am involved throughout the process from initial design to writing copy about results. We have an amazing team who works together on this project, including a project manager, designers, community managers, marketers, and developers.

My role focuses on data analysis. Before the survey was fielded, I worked with one of our UX researchers on question writing, so that our expectations for data analysis were aligned, as well as using data from previous years’ surveys and our site to choose which technologies to include this year. After the survey was fielded, I cleaned and analyzed the data, created data visualizations, and wrote the text for both our developer-facing and business-facing reports.

Why did you use R to analyse the survey?

All of our data science tooling at Stack Overflow is R-centric, but specifically, with our annual survey, we are working with a complex dataset on a tight schedule and the R ecosystem provides the fluent data analysis tools we need to deliver compelling results on time. From munging complicated raw data to creating beautiful visualizations to delivering data deliverables via an API, R is the right tool for the job for us.

Were there results from the survey this year that came as a surprise?

This is such a rich dataset to get to work with, full of interesting things to notice! One result this year that I didn’t expect ahead of time was with our question about whether a respondent eventually wanted to move from technical work into people management. We found that younger, less experienced respondents were more likely to say that they wanted to make the switch! Once I thought about it more carefully, I came to think that those more experienced folks with an interest in managing probably had already shifted careers and were not there to answer that question anymore. Another result that was a surprise to me was just how many different kinds of metal people listen to, more than I even knew existed!

Do you see the gender imbalance improving?

Although our annual survey has a broad capacity for informing useful and actionable conclusions, including about gender, our results don’t represent everyone in the developer community evenly. We know that people from marginalized groups and underrepresented groups in tech participate on Stack Overflow at lower rates than they participate in the software workforce. This means that we undersample such groups in our survey (because of how we invite respondents to the survey, mostly on our site itself). Over the past few years, we have seen incremental improvement in the proportion of responses that are from marginalized or underindexed groups such as minority genders or minority racial/ethnic groups; we are so happy to see this because we want to hear from everyone who codes, everywhere. We believe the biggest driver of this kind of positive change is and will continue to be improving the balance of who participates on Stack Overflow itself, and we are committed to making Stack Overflow a more welcoming and inclusive platform. This kind of work can be difficult and slow, but we are in it for the long haul.

What future trends might you be able to predict from the survey?

One trend we’ve seen over the past several years that I expect to continue is the normalization of salaries for data work. Several years ago, people who worked as data scientists were extreme outliers in salary. Salaries for data scientists have started to move toward the norm for software engineering work, especially if you control for education (for example, comparing a data scientist with a master’s degree to a software engineer with a master’s degree). I don’t see this as entirely bad news, because it is associated with some standardization of data science as a role and increased industry agreement about what a data scientist is, what a data engineer is, how to hire for these roles, and what career paths might look like.

Given Python’s rise again this year, do you see this continuing? How will this affect the use of R?

Python has exhibited a meteoric rise over the past several years and is the fastest-growing major programming language in the world. Python has been climbing in the ranks of our survey over the past several years, edging past first PHP, then C#, then Java this year. It currently sits just below SQL in the ranking. I have a hard time imagining that next year more developers will say they use Python than say they use SQL! You can dig this interview up next year and point out my prediction failure if I am wrong.

In terms of R and R’s future, it’s important to note that R’s use has also been growing dramatically on Stack Overflow, both absolutely and relatively. R is now a top 10 to top 15 programming language (both in questions asked and traffic). Data technologies are in general growing a lot, and there are many factors that go into an individual or an organization deciding to embrace R, or Python, or both.

Thanks Julia! 

You can catch Julia and a whole host of other brilliant speakers at EARL London on 10-12 September at The Tower Hotel London.

We have discounted early bird tickets available for a limited time – please visit the EARL site for more information. We hope to see you there!


While we’re aiming to fit in as many EARL talks as we can, we know it’s impossible to see them all! We’ve asked some of the Mango team to let us know who they’re looking forward to seeing speak. First up is Alfie Smith, one of our Data Scientists – more team picks to follow!

Alfie Smith

Avision Ho’s “Why a Nobel Prize algorithm is not always optimal for business” looks to be an interesting presentation on the problems that come with translating academic research into commercial applications. As data science consultants, we have to be able to tell our clients the risks of rushing to the newest algorithm; particularly when every new research paper creates a hype-bubble.

I’m excited to hear “Promoting the use of R in the NHS – progress and challenges” by Professor Mohammed A Mohammed. As the brother of an NHS doctor, I’ve heard many stories of the NHS’ dependence on archaic tech and the bottlenecks it creates. I’m fascinated to hear whether R is solving some of these problems and whether my R skills could be of value to the UK’s health service.

Lastly, I’m very intrigued by Theo Boutaris’ “Deep Milk: The Quest of identifying Milk-Related Instagram Posts using Keras”. At EARL, we’re going to hear lots of stories of R solving huge business problems. However, it’s often the smaller, wackier, stories that I remember long after the event. I’m hoping Theo’s presentation will give me an anecdote to talk about at the next Bristol Data Science meet-up.

If any of these talks sound interesting please take a look at who else is speaking – we also have early bird ticket prices available for a limited time.


We were thrilled with the overall quality and number of abstracts we received for this year’s EARL London Conference – which made our job of selecting the final speakers even more difficult!

We are pleased to share with you the speakers for EARL London 2019 – we will be interviewing some of our speakers over the next few months, so you can find out what to expect from their talks. As you can see we have a wide range of topics and industries covered – so there will be something for everyone.

The final agenda with times will be released in the next few weeks – in the meantime, take a look at who’s talking and make the most of our early bird ticket offers.


We are thrilled to announce that Julia Silge from Stack Overflow will join us at EARL London (10-12 September) as a keynote speaker. After wowing us all at EARL Seattle last year we knew we had to get Julia over to London!

About Julia Silge

Julia Silge is a data scientist at Stack Overflow, with a PhD in astrophysics and an abiding love for Jane Austen. She is both an international keynote speaker and a real-world practitioner focusing on data analysis and machine learning practice. She is the author of Text Mining with R, written with coauthor David Robinson. She loves making beautiful charts and communicating about technical topics with diverse audiences.

We will be shortly interviewing Julia to find out her views on all things R and what she is looking forward to at this year’s EARL.

If you’d like to join Julia as a speaker you have until 8 April to submit your abstract!


Amid stronger business competition than ever before, companies need to do more than simply embrace buzzwords or trends. It’s something we see all the time when out in the field talking to customers, or speaking at events. When it comes to the role of data, the emphasis should instead be on instilling transformation into the very DNA of an organisation.

Quick fixes are not the order of the day and, while the utilisation of tools such as Artificial Intelligence (AI) and Machine Learning (ML) may reap initial rewards, focus needs to switch to a longer term, more all-encompassing cultural shift surrounding data analytics.

This is, and has been, Mango’s view over the past 16 years, and is one that’s expanded on in detail by Rich Pugh, Mango’s chief data scientist and co-founder, and CEO Matt Aldridge, in the Future of Data Report, recently published in The Times. According to Rich, the notion that ideas like AI or ML can just be plugged in and the company then watches as money pours out of their servers is dangerous. But at least it’s opened the door to having the conversation about how companies can become data driven. “Our organisation is focused on facilitating these conversations that we believe should have been occurring 16 years ago, so we can help companies avoid quick buzzword-led reactions and instead strive for a cultural transformation based on data. The question for all comes back to ‘where are you on your data-driven journey and what’s the best way forward for your company?’”

Download the Future of Data Report, as seen in the Raconteur in The Times, to read Rich and Matt’s full article. Other data-focused topics covered in this comprehensive 16-page report include the ‘data versus humans conflict’, the new discipline of ‘infonomics’, the use of AI for creating value from unstructured data, the future Data Scientist, and an infographic that tracks the volume of data generated in a single day. We’ll be sharing Mango’s views on some of these very topical themes, so watch this space.

In the meantime, get in touch with us if you’d like to find out how to transform your business model using the power of data.


At Mango, we talk a lot about going on a ‘data-driven journey’ with your business. We’re passionate about data and getting the best use out of it. But for now, instead of looking at business journeys, I wanted to talk to the Mango team and find out how they started on their own ‘data journey’ – what attracted them to a career in data science and what they enjoy about their day-to-day work. (It’s not just typing in random numbers?! What?!)

We are hugely fortunate to have a wonderful team of data scientists who are always generous in sharing their skills or don’t mind teaching the Marketing and Events Coordinator (me) R for uber beginners. So let’s see what they have to say on becoming a Mango…

Jack Talboys

Jack joined us last year as a year-long placement student.

“I actually had no idea what Data Science was until I discovered Mango about a year and a half ago. I was at the university career fair – not really impressed by the prospect of working in finance or as a statistician for a large company. I stumbled across Liz Matthews and Owen Jones, who were there representing Mango; drawn in by the title “Data Science”, we started talking. Data Science seemed to tick all of my boxes, being able to use my knowledge of statistics and probability while doing lots of coding in R.

I’m now 6 months in at Mango and it couldn’t be going better. I’ve greatly improved my proficiency in R, alongside learning new skills like Git, SQL and Python. I’ve been given a great deal of responsibility, assisting in delivering training to a client and attending the EARL 2018 conference making up some of my highlights. There have also been opportunities for me to be client-facing, giving me a deeper understanding of what it takes to be a Data Science Consultant.

Working at Mango hasn’t just developed my technical skills, however; without really noticing, I’ve found that I have become a better communicator. Whether organising tasks with the other members of the ValidR team or talking to clients, I have discovered a new sense of confidence and trust in myself. Even as a relative newbie I can see that Data Science as an industry is growing massively – and I’m excited to be part of this growth and make the most of the exciting opportunities it presents with Mango.”

Beth Ashlee, Data Scientist

“I got into data science after applying for a summer internship at Mango. I didn’t really know much about the data science community previously, but spent the next few weeks learning more technical and practical skills than I had in 3 years at university.

I’ve been working as a Data Science Consultant for nearly 3 years and, due to the wide variety of projects, I’ve never had a dull moment. I have had amazing opportunities to travel worldwide teaching training courses and interacting with customers from all industries. The variety is my favourite part of the job: you could be building a Shiny application to help a pharmaceutical company visualise their assay data one week, and the next teaching a training course at the head offices of a large company such as Screwfix.”

Owen Jones, Data Scientist

“To be honest, it rarely feels like work… since we’re a consultancy, there’s always a wide variety of projects on the go, and you can get yourself involved in the areas you find most interesting! Plus, you have the opportunity to visit new places, and you’re always meeting and working with new people – which means new conversations, new ideas and new understanding. I love it.”

Nick Howlett, Data Scientist

Nick is currently working on a client project in Italy.

“During my time creating simulations in academic contexts, I found myself more motivated by meeting my supervisor’s requirements than by pursuing niche research topics. Towards the end of my studies, I discovered data science and realised that the client-consultant relationship was a very similar situation.

Working at Mango has allowed me to develop personal relationships with clients across many sectors – and get to know their motivations and individual data requirements. Mango has also given me the opportunity to travel on both short term training projects and more long term projects abroad.”

Karina Marks, Data Scientist

If you’d like to join the Mango team, take a look at the positions we have currently.


Today at 12pm Mango’s Chief Data Scientist, Rich Pugh, will be presenting at the Corporate IT Forum. His talk will cover how companies can transform themselves into effective data-driven businesses.

Almost all companies are investing in some sort of data project – data analytics, big data, AI, and data science. It’s important that data science delivers real value, rather than becoming a phrase associated with expensive initiatives that don’t make a meaningful business impact.

With careful planning, organisations are delivering value out of their data, but what they are attempting isn’t really about digital first and foremost. It’s actually about the necessity of transforming their business model.

So what is the jargon of data science, what does it really mean, and why should business leaders care?

Rich Pugh says, ‘I’ll look at some effective tips for businesses that have the best chance of succeeding in their mission to become data-driven – and therefore in their wider digital transformation strategy as a whole’.

If your business is ready to find out more – take a look at our leadership level courses designed to help you start on your data-driven journey.


Can you tell us about your upcoming keynote at EARL and what the key take-home messages will be for delegates?

I’m going to talk about functional programming which I think is one of the most important programming techniques used with R. It’s not something you need on day 1 as a data scientist but it gives you some really powerful tools for repeating the same action again and again with code. It takes a little while to get your head around it but recently, because I’ve been working writing the second edition of Advanced R, I’ve prepared some diagrams that make it easier to understand. So the take-home message will be to use more functional programming because it will make your life easier!
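As a small illustration of that style (my own toy example, not one taken from the talk itself): applying one function to every element of a collection replaces writing the same code again and again.

```r
# Repeat the same action for many inputs without copy-pasting code:
# apply one function to every element and collect the results.
squares <- lapply(1:4, function(x) x^2)
unlist(squares)  # 1 4 9 16

# The tidyverse equivalent, if purrr is installed:
# purrr::map_dbl(1:4, ~ .x^2)
```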

Writer, developer, analyst, educator or speaker – what do you enjoy most? Why?

Two things motivate me most in the work that I do. The first is the intellectual joy of understanding how you can take a big problem and break it up into small pieces that can be combined together in different ways – for example, the algebras and grammars of tools like dplyr and ggplot2. I’ve been working a lot on Advanced R to understand how the bits and pieces of R fit together, which I find really enjoyable. The other is hearing from people who have done cool stuff with the things that I’ve worked on, stuff that has made their life easier – whether that’s on Twitter or in person. Pretty much everything else comes out of those two things. Educating, for example, is just helping other people understand how I’ve broken down a problem, and sharing it in ways that they can understand too.

What is your preferred industry or sector for data, analysis and applying the tools that you have developed?

I don’t do much data analysis myself anymore so when I do it, it’s normally data related to me in some way or, for example, data on RStudio packages. I do enjoy challenges like figuring out how to get data from a web API and turning it into something useful but the domain for my analysis is very broadly on data science topics.

When developing what are your go-to resources?

I still use Stack Overflow quite a bit and Google in general. I also do quite a bit of comparative reading to understand different programming languages, seeing what’s going on across different languages, the techniques being used and learning about the evolving practices in other languages, which is very helpful.

Is there anything that has surprised you about how any of the tools you’ve created has been used by others?

I used to be, but I’ve lost my capacity to be surprised now, just because the diversity of uses is crazy. I guess the most notable cases now are when someone uses any of my tools to commit academic fraud (there have been some well-publicised examples). Otherwise, people are using R and data to understand pretty much every aspect of the world, which is really neat.

What are the biggest changes that you see between data science now and when you started?

I think the biggest difference is that there’s a term for it – data science. I think it’s been useful to have that term rather than just Applied Statistician or Data Analyst because I think Data Science is becoming different to what these roles have been traditionally. It’s different from data analysis because data science uses programming heavily, and it’s different from statistics since there’s a much greater emphasis on correct data import and data engineering, and the goal may be to eventually turn the data analysis into a product, web app or something other than a standard report.

Where do you currently perceive the biggest bottlenecks in data science to be?

I think there are still a lot of bottlenecks in getting high-quality data and that’s what most people currently struggle with. I think another bottleneck is how to help people learn about all the great tools that are available, understand what their options are, where all the tools are and what they should be learning. I think there are still plenty of smaller things to improve with data manipulation, data visualization and tidying but by and large it feels to me like all the big pieces are there. Now it’s more about getting everything polished and working together really well. But still, getting data to a place to even start an analysis can be really frustrating so a major bottleneck is the whole pipeline that occurs before arriving in R.

What topic would you like to be presenting on in a data science conference a year from now?

I think one thing I’m going to be talking more about next year is this vctrs package that I’ve been working on. The package provides tools for handling object types in R and managing the types of inputs and outputs that a function expects and produces. My motivation for this is partly because there are a lot of inconsistencies in the tidyverse and base R that vctrs aims to fix. I think of this as part of my mental model because when I read R code, there’s a simplified R interpreter in my head which mainly focuses on the types of objects and predicts whether some code is going to work at all or if it’s going to fail. So part of the motivation behind this package is me thinking about how to get stuff out of my head and into the heads of other people so they can write well-functioning and predictable R code.
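To make the idea concrete, here are a few of base R's implicit coercion rules – the sort of behaviour a type-focused package like vctrs sets out to make explicit and consistent. (These are my own plain base-R illustrations, not vctrs code.)

```r
# When vectors of different types are combined, base R silently
# coerces to a common type -- sometimes in surprising ways.
class(c(1L, 2.5))    # integer + double  -> "numeric"
class(c(1, "a"))     # number  + string  -> "character"
class(c(TRUE, 0L))   # logical + integer -> "integer"
```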

What do you hope to see out of RStudio in the next 5 years?

Generally, I want RStudio to be continually strengthening the connections between R and other programming languages. There’s an RStudio version 1.2 coming out which has a bunch of features to make it easy to use SQL, Stan and Python from RStudio. Also, the collaborative work we do with Wes McKinney and Ursa Labs – I think we’re just going to see more and more of that because data scientists are working in bigger and bigger teams on a fundamentally collaborative activity so making it as easy as possible to get data in and out of R is a big win for everyone.

I’m also excited to see the work that Max Kuhn has been doing on tidy modelling. I think the idea is really appealing because it gives modelling an API that is very similar to the tidyverse. But I think the thing that’s really neat about this work is that it takes inspiration from dplyr to separate the expression of a model from its computation so you can express the model once and fit it in R, Spark, tensorflow, Stan or whatever. The R language is really well suited to exploit this type of tool where the computation is easily described in R and executed somewhere that is more suitable for high performance.



Becoming a data-driven business is high on the agenda of most companies but it can be difficult to know how to get started or know what the opportunities could be.

Mango Solutions has a bespoke workshop, ‘Art of The Possible’, to help senior leadership teams see the potential and overcome some of the common challenges.

Find out more in our new report developed in partnership with IBM.


The challenge

Recently I saw that Stack Overflow had released their survey data and that it had been posted on Kaggle. The data came with the following context: “Want to dive into the results yourself and see what you can learn about salaries or machine learning or diversity in tech?” Given that June is indeed shiny-appreciation month, I thought this would be a fun opportunity to combine an interesting public dataset, some exploratory analysis in shiny and, uhhh… my first blog.

Talking about this with a colleague at work, we both decided to independently have a crack and then compare our approaches. You can check out his blog here.

The data and idea

The data looked very rich, with almost 100k rows and 129 variables – a feast for any data addict. Feeling gluttonous, I wanted to take a bite out of every variable but without writing a tonne of code. So… iteration!
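As a toy sketch of that iteration (with an invented two-column data frame standing in for the survey), one pass over the columns can decide how each variable should be visualised:

```r
# Invented stand-in for the survey data.
toy <- data.frame(
  Salary  = c(30000, 45000, 52000, 61000),
  Country = c("UK", "US", "UK", "DE"),
  stringsAsFactors = FALSE
)

# One pass over every column: numeric columns get a histogram,
# everything else a barplot.
plot_type <- vapply(
  toy,
  function(col) if (is.numeric(col)) "histogram" else "barplot",
  character(1)
)
plot_type  # Salary = "histogram", Country = "barplot"
```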

Exploration through iteration

Typically, the apps I have written in the past have consisted of a small number of well-defined outputs in a fixed layout. Whilst this is appropriate for presenting a prescribed analysis, it is a bit cumbersome and slow for exploration. For this app, I wanted a different approach: fast exploration, a dynamic layout and, hopefully, a small amount of code.

To do this, I would need a more programmatic way of creating UI elements and their corresponding outputs. The only non-reactive UI element that I needed was a simple selectInput (for the user to select variables); all other UI elements and outputs would then react to this selection. To allow for reactive UI elements, I needed to stick in a UI placeholder, uiOutput, which would then be defined server-side with renderUI.

ui <- fluidPage(
  # User selection
  selectInput("cols",
              label = "Select columns to visualise",
              choices = colnames(survey_data),
              multiple = TRUE,
              width = '800px'),
  
  # UI placeholder
  uiOutput("p")
)

Having finished the basic structure on the UI side, I made a start on a programmatic way of creating the UI outputs (in this case, using plotOutput). The basic idea here is to create an output placeholder for each selection the user makes. Each of these outputs will have an associated id of the selected variable. To make it work, we need to return a list of these outputs inside the renderUI expression – this can be achieved by using tagList (the shiny equivalent of a list) and assigning elements iteratively using a for loop.

server <- function(input, output) {
  
  # Define our UI placeholder using the user selection
  output$p <- renderUI({
    
    # Define an empty tagList
    plots <- tagList()
    
    # Iterate over input column selection
    for (col in input$cols) {
      plots[[col]] <- plotOutput(col)
    }
    
    plots
    
  })
  
}

At this point, we have created a list of output placeholders but haven’t defined what these outputs will look like.

Before showing the code, I want to address some of the parts that need to come together to make this work. Firstly, we need our output to render reactively depending on the selection made by the user. In our case, we have a reactive input called input$cols. For each element of this input, we want to render a plot, so the looping needs to happen outside of the render statement (unlike the UI creation, which looped inside a render statement). But as we are looping over a reactive object, we require a reactive context – in this case an observe statement fits perfectly.

server <- function(input, output) {
  
  # Define our UI placeholder using the user selection
  output$p <- renderUI({
    
    # Define an empty tagList (shiny equivalent of a list)
    plots <- tagList()
    
    # Iterate over input column selection
    for (col in input$cols) {
      plots[[col]] <- plotOutput(col)
    }
    
    plots
    
  })
  
  # Define created outputs
  observe({
    
    # Iterate over input column selection
    lapply(input$cols, function(col) {
      
      # Render output for each associated output id
      output[[col]] <- renderPlot({
        
        hist(rnorm(10), main = col)
        
      })
      
    })
    
  })
  
}

Back to the Stack Overflow survey

In the previous sections, I introduced how to construct ui and server creation by way of iteration. Now I want to combine it with the StackOverflow survey to show how this kind of app can be effective for quick exploratory analysis.

There are many ways that we can explore a dataset but I have gone for pure simplicity – plotting a univariate distribution for each variable (histogram for continuous, barplot for categorical). The only other notable shiny addition is the use of validate, firstly to have some control over which variables are plotted, but also to give the user an informative message about why a plot isn’t shown.

You can test out the app (with a sample of the total data) here or the full code can be found below:

# Packages
library(shiny)
library(readr)
library(dplyr)
library(ggplot2)

# Data (change to appropriate file path)
survey_data <- read_csv("../data/survey_results_public.csv")

# Defining the UI part of the app
ui <- fluidPage(
  
  column(
    width = 6,
    offset = 3,
    fluidRow(
      selectInput("cols",
                  label = "Select columns to visualise",
                  choices = colnames(survey_data),
                  multiple = TRUE,
                  width = '800px')
    )
  ),
  
  # Main output
  column(
    width = 8,
    align = "center",
    offset = 2,
    uiOutput("p")
  )
  
)

# Defining the server part of the app
server <- function(input, output) {
  
  # Define our UI placeholder using the user selection
  output$p <- renderUI({
    
    # Define an empty tagList (shiny equivalent of a list)
    plots <- tagList()
    
    # Iterate over input column selection
    for (col in input$cols) {
      plots[[col]] <- plotOutput(col)
    }
    
    plots
    
  })
  
  # Trigger every time the input columns are changed
  observe({
    
    lapply(input$cols, function(col) {
      
      # Remove NA entries
      d <- survey_data[!is.na(survey_data[[col]]), ]
      
      # Column class
      cc <- class(d[[col]])
      
      output[[col]] <- renderPlot({
        
        # Plots only defined for character, numeric & integer
        validate(need(cc %in% c("character", "numeric", "integer"),
                      paste("Plots undefined for class", cc)))
        
        if (cc == "character") {
          
          # Only show barplot if we have fewer than 20 bars
          validate(need(length(unique(d[[col]])) < 20,
                        paste(col, "contains more than 20 levels")))
          
          # Basic ggplot barplot
          ggplot(d, aes_string(x = col)) +
            geom_bar() +
            coord_flip() +
            theme_bw() +
            ggtitle(col)
          
        } else {
          
          # Basic ggplot histogram
          ggplot(d, aes_string(x = col)) +
            geom_histogram() +
            theme_bw() +
            ggtitle(col)
          
        }
        
      })
      
    })
    
  })
  
}

shinyApp(ui = ui, server = server)

Conclusion

For me, Shiny’s great strength is its versatility; analysis can come in many shapes and sizes, so it is only fitting that Shiny, a framework for sharing analysis, can also be written in a multitude of ways. In this blog post, the app wasn’t particularly snazzy (so probs not to be shared with the senior managers or as an ‘inspirational’ LinkedIn post) but it does what I want: quick exploration, dynamic layout & less than 100 lines of code!

My colleague Graham opted for a totally different app, one which presented the user with a well-thought-out and defined model – again demonstrating the many use cases Shiny can have in a project. Not only can it be used at the start as an exploratory tool, but also at the end to present findings, predictions and other project outcomes.

Aside from that, I had a lot of fun exploring the survey data, especially the variables relating to the health and lifestyle of developers – I’m no physician but, devs, try to do at least some exercise… Wishing you all a happy end to shiny-appreciation month, keep writing ’em apps, and see ya next time 🙂