
If you read my series of posts on writing command line utilities in R, but were wondering how to do the same thing in Python, you’ve come to the right place.

I learned a lot with that original series of posts though, so this time we’re going to switch things up a bit, dive right into a complete working example, and cover it all off in a single post.

So let’s roll up our sleeves and get started!

Recap – What are command line utilities?

Command line utilities are tools that you can run on the command line of a computer. We most often see these on Linux and macOS computers using the ‘bash’ shell, but Windows users have options like CMD, Git Bash and PowerShell too.

These tools allow you to instruct the computer to do things using text alone. You can also start chaining the commands together, which gives you a really powerful way to get computers to do things for you. You’ve possibly already used command line tools like ls and cd before, to ‘list’ a directory’s contents and ‘change directory’ respectively, but by writing our own tools we really unlock the power of the command line.
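
To give a tiny taste of that chaining, here’s a one-liner (just standard commands, nothing specific to this post) that pipes the output of ls into wc to count how many ‘wav’ files are sitting in the current directory:

$ ls *.wav | wc -l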

Imagine you’re an entomologist studying the sounds that cicadas make. You have field equipment set up that records audio overnight and sends it to a Linux computer in your lab. Every morning you come into the lab, see what files you have and then begin to process them.

First, you might list the files in the directory with ls. You notice there’s lots of other stuff in that directory as well as your ‘wav’ audio files, so you can do ls *.wav to just list the files you’re interested in. Then you have to run a pre-processing command on each file to turn the audio into the data you need. Finally you need to generate some preliminary plots from that data. And that’s all before you even start to do any analysis!

Wouldn’t it be better if you could get the computer to do all that for you before you’ve even arrived at the lab? Using the command line and writing our own tools for it, we can do just that. In the example above we’d want to do something like the following pseudo-code (which is mostly standard bash syntax)…

# process each wav file using the fictional 'audio-to-data' command line 
# tool, which generates a csv file for each input file
for wavfile in *.wav
  do
    ./audio-to-data "${wavfile}"
  done

# process each data file to create preliminary plots using the fictional 
# 'data-to-plot' command line tool, which outputs a png file for each input file
for datafile in *.csv
  do
    ./data-to-plot "${datafile}"
  done
  
# now we can tidy up

## move all the raw audio files to the 'raw-audio' subdirectory
mv *.wav ./raw-audio/

## move all the csv files to a 'data' subdirectory
mv *.csv ./data/

## move all the preliminary plots to a 'plots' subdirectory
mv *.png ./plots/

Now that we’ve written out the entomologist’s morning routine like this, it makes sense to get the computer to run it automatically. We can then use the scheduling tools built into the operating system (a thing called ‘cron’ in this instance) to run this as a script each morning at 6am. That way, all the work is already done by the time our entomologist arrives at the lab, and they can get on with the job of actually analysing the data and plots.
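
As a rough sketch of what that scheduling looks like (the script name and path here are placeholders, not something defined in this post), a crontab entry for a 6am daily run, added via crontab -e, might be:

# run the overnight-audio processing script at 06:00 every day
0 6 * * * /home/lab/process-overnight-audio.sh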

This is all well and good, but I cheated! Some of the commands in my example were fictional! The operating system sometimes doesn’t have a built-in command that can help you – for instance there’s no built-in command to detect cicada sounds! – and that’s why we write our own command line utilities: to fill a specific need that isn’t already met by your operating system.

A Python Sorting Hat

In this post we’re not going to try anything quite so ambitious as an audio-file-to-csv converter, but instead we’ll take a look at an example which provides some good foundations that you can build on yourself.

Below is the Python code for a command line sorting hat. If you did follow along on the R version of this you should recognise it. If you didn’t, it’s basically a small, text-based program that takes a name as input and then tells you which Hogwarts house that person has been sorted into.

$ ./sortinghat.py Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.py Hermione
Hello Hermione, you can join Ravenclaw!

The code for the R version appeared in the sixth installment of the R series. Here’s the output of that one:

$ ./sortinghat.R Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.R Hermione
Hello Hermione, you can join Ravenclaw!

Exact same thing. Poor Hermione!

Here’s the full code for the Python version.

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""
import argparse
import hashlib
PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")
ARGV = PARSER.parse_args()
def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))
debug_msg("Debug option is set")
debug_msg("Your name is - ", ARGV.name)
HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }
NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()
debug_msg("The name_hash is - ", NAME_HASH)
HOUSE_KEY = NAME_HASH[0]
debug_msg("The house_key is - ", HOUSE_KEY)
HOUSE = HOUSES[HOUSE_KEY]
if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

In order to actually run this thing, you can either type it out yourself, or just copy and paste it into a file called ‘sortinghat.py’.

We could just run this with python sortinghat.py, but that doesn’t make our utility feel like it’s a proper command line tool. In order for Linux and macOS shells (and Windows Subsystem for Linux and Git Bash) to treat the file as ‘executable’, we must mark it as such by changing the ‘mode’ of the file.

Make sure you’re in the same directory as your file and run:

$ chmod +x ./sortinghat.py

Now you can just run ./sortinghat.py to run the command.

Breaking things down

shebang and docstring

Next we’re going to go through each section in turn and look at its functionality.

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""

That very first line is referred to as a ‘shebang’ and it tells your command line shell (of which there are many, but ‘bash’ is the most common) which program to use to execute everything that follows. In this case we’re using a command called env to tell bash where to find python.

Note: I’m using python 3 for this example. On some systems that have both python 2 and 3, 3 is referred to as python3, not just python. If that’s the case for you, you’ll need to modify this script to reflect that.
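
In that case the only change needed is the shebang line itself; the rest of the script stays the same:

#!/usr/bin/env python3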

After the shebang is a standard python docstring, just telling you what the app is all about.

import

import argparse
import hashlib

Next we import the external modules we’re going to use. Lucky for us, python has an extensive and varied standard library of modules that ship with it, so we don’t need to install anything extra.

‘argparse’ will parse command line arguments for us. If you think of a command line tool like ls, arguments are things you can put after it to modify its behaviour. For example, ls -l has -l as the argument and causes ls to print ‘longer’ output with more information than the standard output. For ls *.wav, the shell expands the *.wav pattern to the matching filenames, so ls only lists the files you’re interested in.

‘hashlib’ is a module that implements various hash and message digest algorithms, which we’ll need later on for the sorting part of the utility.
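
As a quick illustration of what hashlib gives us (this snippet isn’t part of the script, it’s just the same call run interactively in a Python session), hashing the lowercased name ‘mark’ produces the hex digest we’ll meet again later in the debug output:

>>> import hashlib
>>> hashlib.sha1("mark".encode("utf-8")).hexdigest()
'f1b5a91d4d6ad523f2610114591c007e75d15084'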

Handling arguments

PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")
ARGV = PARSER.parse_args()

This block sets up a new argument parser for us and adds some arguments to it. Arguments that don’t start with -- are ‘positional’, which basically means they’re mandatory. If you define multiple positional arguments they must be specified at run-time in the order they are defined.

In our case, if we don’t specify the ‘name’, then we’ll get an error:

$ ./sortinghat.py
usage: sortinghat.py [-h] [-d] [-s] name
sortinghat.py: error: the following arguments are required: name

We didn’t have to create this error message, argparse did that for us because it knows that ‘name’ is a required argument.

The other arguments are ‘flags’, which means we can turn things on and off with them. Flags are specified with -- for the long form and - for the short form; you don’t have to have both, but this has developed into something of a convention over the years. Specifying them separately like this is also useful as it gives you full control over how the short options relate to the longer ones.

If, for instance, you wanted two arguments in your application called --force and --file, the convention would be to use -f as the short form, but you can’t use it for both. Explicitly assigning the short form version allows you to decide what you want to use instead. Maybe you’d go for -i for an input file or -o for an output file or something like that.

These arguments are flags because we set action="store_true" in them, which stores True if they’re set and False if they’re not.

If you omit the action="store_true", you get an optional argument. This could be something like --file /path/to/file, where you must specify something immediately after the argument. You can use these for specifying additional parameters for your scripts. We’re not really covering that in this script though, so here are a few quick examples to get you thinking:

  • --config /path/to/config_file – specify an alternate config file to use instead of the default
  • --environment production – run against production data rather than test data
  • --algo algorithm_name – use a different algorithm instead of the default
  • --period weekly – change the default calculation period of your utility
  • --options /path/to/options/file – provide options for an analysis from an external file
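
To make that concrete, here’s a minimal, standalone sketch of a value-taking option. The --config name and its default are made up for illustration; they’re not part of sortinghat.py.

import argparse
PARSER = argparse.ArgumentParser()
# an optional argument that expects a value after it; 'default' is used
# when the flag isn't given on the command line
PARSER.add_argument("--config", help="path to an alternate config file",
                    default="default-config.ini")
ARGV = PARSER.parse_args()
# ARGV.config holds whatever followed --config, or the default if it was omitted
print("Using config file: {}".format(ARGV.config))

Running this with --config /some/other/file.ini prints that path, while running it with no arguments falls back to the default.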

Another freebie we get from argparse is -h and --help. These are built-in and print nicely formatted help output for your users or future-self!

$ ./sortinghat.py -h
usage: sortinghat.py [-h] [-d] [-s] name

positional arguments:
  name         name of the person to sort

optional arguments:
  -h, --help   show this help message and exit
  -d, --debug  enable debug mode
  -s, --short  output only the house

Lastly for this section, we use parse_args() to assign the arguments that have been constructed to a new namespace called ARGV so we can use them later. Arguments stored in ARGV are retrievable using the long version of the argument name, so in this example: ARGV.name, ARGV.debug and ARGV.short.

Everything from this point onward is largely to do with the functionality of the utility, not the command line execution of it, so we’ll go through it quite quickly.

Printing debug messages

I didn’t want to get bogged down using a proper logging library for this small tool, so this function takes care of our very basic needs for us.

def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))

Essentially, it will only print a message if ARGV.debug is True and that will only be true if we set the -d flag when we run the tool on the command line.

We can then put messages like debug_msg("Debug option is set") in our code and they’ll do nothing unless that -d flag is set. If it is set, you’ll get output like:

$ ./sortinghat.py -d Mark
DEBUG: Debug option is set
DEBUG: Your name is - Mark
DEBUG: The name_hash is - f1b5a91d4d6ad523f2610114591c007e75d15084
DEBUG: The house_key is - f
Hello Mark, you can join Slytherin!

Using a technique like this – or perhaps a --verbose flag – can help to provide additional information about what’s going on inside your utility at run time that could be helpful to others or your future-self if they encounter any difficulties with it.

The debug_msg() function is used in this way throughout the rest of the program.

Figuring out the house

To figure out what house to assign someone to we use the same approach that we did for the R version. We calculate the hash of the input name and store the hexadecimal representation. Since hex uses the numbers 0-9 and the characters a-f, we can assign the four Hogwarts houses to these 16 symbols evenly in a Python dictionary.

We can then use the first character of the input name hash as the key when retrieving the value from the dictionary.

HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }
NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()
HOUSE_KEY = NAME_HASH[0]
HOUSE = HOUSES[HOUSE_KEY]

We also make sure that the input name is converted to lower case first to prevent us from running into any discrepancies between, for example, ‘Mark’ and ‘mark’.
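
For example, thanks to that lower-casing, any capitalisation of the same name should come out with the same house:

$ ./sortinghat.py -s mark
Slytherin
$ ./sortinghat.py -s MARK
Slytherin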

Printing output

Here in the final section, we use the value of ARGV.short to decide whether to print the long output or the short output. Flags are False by default with argparse, so we can test if it’s been set to True (by specifying the -s flag on the command line) and print accordingly.

if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

Using the -s flag on the command line results in the following short output:

$ ./sortinghat.py -s Mark
Slytherin

Since the flags are optional you can combine them if you need to, so something like ./sortinghat.py -s -d Mark will produce the expected output – debug info with the short version of the final message.
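
That combination should look like this:

$ ./sortinghat.py -s -d Mark
DEBUG: Debug option is set
DEBUG: Your name is - Mark
DEBUG: The name_hash is - f1b5a91d4d6ad523f2610114591c007e75d15084
DEBUG: The house_key is - f
Slytherin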

That’s it for now

I hope you found this post useful and that you have some great ideas for things in your workflows that could be automated with command line utilities. If you do end up writing your own tool, find me on Twitter and let me know about it. I love hearing about all the awesome ways people are using these techniques to solve real world problems.

Originally posted on Mark’s blog, here.


Ruth Thomson, Interim Director of Strategic Innovation, sat down with Jelena, one of Mango’s machine learning experts.

Thanks, Jelena, for your time. It is an absolute pleasure to have this opportunity to discuss machine learning today. Tell me about your background with machine learning.

I’ve been using machine learning for many years. I first started using machine learning as part of my PhD in Physics over 12 years ago, using a neural net mainly for pattern recognition in particle physics. And I’ve been using it ever since in projects and for clients when it is the best tool for the job.

Because I’ve been using it for so long, I find it amusing that machine learning is now being marketed as a new thing. The algorithms and approaches have been around since the 1970s. The exciting thing is that we now have computers powerful enough to enable the greater use of machine learning to help improve decision making.

Now, as a Senior Data Scientist at Mango, I use machine learning in our consultancy projects and I also train data scientists around the world, at the world’s largest companies, in how to use machine learning. Most recently, for example, I have been training analysts and data scientists at one of the UK’s largest banks.

One thing I love is understanding how tools like machine learning can be used to drive business value. How has Mango helped clients use machine learning recently?

We’ve been using machine learning in a range of different companies recently. For example, one way has been to help our customers reduce the costs resulting from late payment of invoices, and another is using machine learning to create better sales forecasts. It is a powerful tool that, for the right problem, can be very effective.

What advice would you give to organisations who want to gain value from machine learning?

Recognise that machine learning is one of a suite of advanced analytics tools you can use to drive value. The most important thing to do is to define the business problem or opportunity you have and then use machine learning if it is the most appropriate tool.

That is so interesting. It can be really easy to get drawn into the hype around machine learning. As someone who has used machine learning extensively over many years. What’s your opinion?

From my experience, the most dangerous misconception is that machine learning is an ultimate oracle. Businesses see all the things that have been achieved and think that machine learning is the right tool for every situation. In reality, it is a useful tool but it needs to be applied in a smart way.

I totally agree. We have seen and heard of so many projects where machine learning was used, to great cost and investment, when a simpler and better solution could have been applied. This has happened where organisations have started with the answer – machine learning, rather than with the question – what business problem are we trying to solve?

Exactly. And another danger is that machine learning is being sold as a tool that you can use out of the box. Just press a button and the answer will appear. In reality that is so far from the truth and some businesses have had to find that out the hard way.

Is it fair to say that the businesses who are going to drive real business value are ones who have a clear focus on the question they are trying to answer with machine learning?

Yes. A smart business will have a clear business case and a clear question that is being answered with machine learning. Then they will look at whether machine learning is the appropriate tool to answer that question, bearing in mind that setting up a machine learning environment is not a cheap exercise and no business wants to waste money on tools that are not needed.

Another important area for businesses to consider is the data they have available. In machine learning, having the right data is critical. In many businesses, far more attention needs to be paid to the data available, the data quality and preparation to enable machine learning.

I feel we could talk about this topic for hours! In summary, if you could share one message with businesses considering using machine learning, what would it be?

Machine learning is a powerful tool but it is only one of many tools you can use. It is also not a tool, yet, that can fully replace the data analysis process. Use it as part of an advanced analytics programme focused on driving business value.

Thanks Jelena.

If you’re an organisation considering using machine learning or improving your current use of machine learning, get in touch with us.

 


Can you tell us about your upcoming keynote at EARL and what the key take-home messages will be for delegates?

I’m going to talk about functional programming which I think is one of the most important programming techniques used with R. It’s not something you need on day 1 as a data scientist but it gives you some really powerful tools for repeating the same action again and again with code. It takes a little while to get your head around it but recently, because I’ve been working writing the second edition of Advanced R, I’ve prepared some diagrams that make it easier to understand. So the take-home message will be to use more functional programming because it will make your life easier!

Writer, developer, analyst, educator or speaker – what do you enjoy most? Why?

The two things that motivate me most in the work that I do are the intellectual joy of understanding how you can take a big problem and break it up into small pieces that can be combined together in different ways – for example, the algebras and grammars of tools like dplyr and ggplot2. I’ve been working a lot on Advanced R to understand how the bits and pieces of R fit together, which I find really enjoyable. The other thing that I really enjoy is hearing from people who have done cool stuff with the things that I’ve worked on which has made their life easier – whether that’s on Twitter or in person. Those are the two things that I really enjoy and pretty much everything else comes out of that. Educating, for example, is just helping other people understand how I’ve broken down a problem and sharing it in ways that they can understand too.

What is your preferred industry or sector for data, analysis and applying the tools that you have developed?

I don’t do much data analysis myself anymore so when I do it, it’s normally data related to me in some way or, for example, data on RStudio packages. I do enjoy challenges like figuring out how to get data from a web API and turning it into something useful but the domain for my analysis is very broadly on data science topics.

When developing what are your go-to resources?

I still use StackOverflow quite a bit and Google in general. I also do quite a bit of comparative reading to understand different programming languages, seeing what’s going on across different languages, the techniques being used and learning about the evolving practices in other languages which is very helpful.

Is there anything that has surprised you about how any of the tools you’ve created has been used by others?

I used to, but I’ve lost my capacity to be surprised now just because the diversity of uses is crazy. I guess the most notable thing now is when someone uses any of my tools to commit academic fraud (there have sometimes been well-publicised examples). Otherwise, people are using R and data to understand pretty much every aspect of the world, which is really neat.

What are the biggest changes that you see between data science now and when you started?

I think the biggest difference is that there’s a term for it – data science. I think it’s been useful to have that term rather than just Applied Statistician or Data Analyst because I think Data Science is becoming different to what these roles have been traditionally. It’s different from data analysis because data science uses programming heavily, and it’s different from statistics since there’s a much greater emphasis on correct data import and data engineering, and the goal may be to eventually turn the data analysis into a product, web app or something other than a standard report.

Where do you currently perceive the biggest bottlenecks in data science to be?

I think there are still a lot of bottlenecks in getting high-quality data and that’s what most people currently struggle with. I think another bottleneck is how to help people learn about all the great tools that are available, understand what their options are, where all the tools are and what they should be learning. I think there are still plenty of smaller things to improve with data manipulation, data visualization and tidying but by and large it feels to me like all the big pieces are there. Now it’s more about getting everything polished and working together really well. But still, getting data to a place to even start an analysis can be really frustrating so a major bottleneck is the whole pipeline that occurs before arriving in R.

What topic would you like to be presenting on in a data science conference a year from now?

I think one thing I’m going to be talking more about next year is this vctrs package that I’ve been working on. The package provides tools for handling object types in R and managing the types of inputs and outputs that a function expects and produces. My motivation for this is partly because there are a lot of inconsistencies in the tidyverse and base R that vctrs aims to fix. I think of this as part of my mental model because when I read R code, there’s a simplified R interpreter in my head which mainly focuses on the types of objects and predicts whether some code is going to work at all or if it’s going to fail. So part of the motivation behind this package is me thinking about how to get stuff out of my head and into the heads of other people so they can write well-functioning and predictable R code.

What do you hope to see out of RStudio in the next 5 years?

Generally, I want RStudio to be continually strengthening the connections between R and other programming languages. There’s an RStudio version 1.2 coming out which has a bunch of features to make it easy to use SQL, Stan and Python from RStudio. Also, the collaborative work we do with Wes McKinney and Ursa Labs – I think we’re just going to see more and more of that because data scientists are working in bigger and bigger teams on a fundamentally collaborative activity so making it as easy as possible to get data in and out of R is a big win for everyone.

I’m also excited to see the work that Max Kuhn has been doing on tidy modelling. I think the idea is really appealing because it gives modelling an API that is very similar to the tidyverse. But I think the thing that’s really neat about this work is that it takes inspiration from dplyr to separate the expression of a model from its computation so you can express the model once and fit it in R, Spark, tensorflow, Stan or whatever. The R language is really well suited to exploit this type of tool where the computation is easily described in R and executed somewhere that is more suitable for high performance.

 


Dr. Gentleman’s work at 23andMe focuses on the exploration of how human genetic and trait data in the 23andMe database can be used to identify new therapies for disease. Dr. Gentleman is also recognised as one of the originators of the R programming language and has been awarded the Benjamin Franklin Award, a recognition for Open Access in the Life Sciences presented by the Bioinformatics Organisation. His keynote will focus on the History of R and some thoughts on data science.

Dr Robert Gentleman, it is an honour to have you as a keynote speaker at our EARL conference in Houston. We are intrigued to hear more about your career to date and how your work around open access to scientific data has helped shape valuable research worldwide…

Amongst your significant achievements to date has been the development of the R programming language alongside fellow statistician Ross Ihaka at the University of Auckland in the mid-1990s. What prompted you to develop a new statistical programming language?

Ross and I had a lot of familiarity with S (from Bell Labs) and at that time (my recollection is) there were a lot more languages for Statistics around.  We were interested in how languages were used for data analysis. Both Ross and I had some experience with Lisp and Scheme and at that time some of the work in computer science was showing how one could easily write interpreters for different types of languages, largely based on simple Scheme prototypes. We liked a lot about S, but there were a few places where we thought that different design decisions might provide improved functionality. So we wrote a simple Scheme interpreter and then gradually modified it into the core of the R language. As we went forward we added all sorts of different capabilities and found a large number of great collaborators.

As we made some progress we found that there were others around the world who were also interested in developing a system like R. And luckily, at just about that time, the internet became reliable and tools evolved that really helped support a distributed software development process. That group of collaborators became R Core and then later formed the nucleus for the R Foundation.

Probably the most important development was CRAN and some of the important tools that were developed to support the widespread creation of packages. Really a very large part of the success of R is due to the ability of any scientist to write a package containing code to carry out an analysis and to share that.

In 2008 you were awarded the Benjamin Franklin Award for your contribution to open access research. What areas of your research contributed towards being awarded this prestigious accolade?

I believe that my work on R was important, but perhaps more important for that award was the creation of the Bioconductor Project, together with a number of really great colleagues. Our paper in Genome Biology describes both those involved and what we did.

In your opinion how has the application of open source R for big data analysis, predictive modelling, data science and visualisation evolved since its inception?

In too many ways for me to do a good job of really describing. As I said above, there is the existence of CRAN (and the Bioconductor package repository), where there is a vast number of packages. UseRs can easily get a package to try out just about any idea. Those packages are not always well written, or well supported, but they provide a simple, fast way to try ideas out. And mostly the packages are of high quality and the developers are often very happy to discuss ideas with users. The community aspect is important. And R has become increasingly performant.

Your work today involves the fascinating combination of bioinformatics and computational drug discovery at 23andMe. What led to your transition to drug discovery, or was it a natural progression?

I had worked in two cancer centers including The Dana Farber in Boston and The Fred Hutchinson in Seattle. When I was on the faculty at Harvard, and then the Fred Hutchinson in Seattle, I was developing a computational biology department. The science is fantastic at both institutions and I learned a lot about cancer and how we could begin to use computational methods to begin exploring and understanding some of the computational molecular biology that is important.

But I also became convinced that making new drugs was something that would happen in a drug company. I wanted to see how computational methods could help lead to better and faster target discovery. When I was approached by Genentech it seemed like a great opportunity – and it was. Genentech is a fantastic company; I spent almost six years there and learned a huge amount.

As things progressed, I became convinced that using human genetics to do drug discovery was likely to be more effective than any other strategy that I was aware of. And when the opportunity came to join 23andMe I took it. And 23andMe is also a great company. We are in the early stages of drug discovery still, but I am very excited about the progress we are making and the team I am working with.

How is data science being used to improve health and accelerate the discovery of therapies?

If we are using a very broad definition (by training I am a statistician, and still think that careful hypothesis-driven research is essential to much discovery) of data science – it is essential. Better models, more data and careful analysis have yielded many breakthroughs.

Perhaps a different question is ‘where are the problems’? And for me, the biggest problem is that I am not sure there is as much appreciation of the impact of bias as there should be. Big data is great – but it really only addresses the variance problem. Bias is different, it is harder to discover and its effects can be substantial. Put another way, the question is just how generalizable the results are.

In addition to the development of the R programming language, what have been your proudest career achievements that you’d like to share?

The Bioconductor Project, working with my graduate students and post-docs and pretty much anytime I gave someone good advice.

Can you tell us about what to expect from your keynote talk and what might be the key take-home messages for our EARL delegates?

I hope an appreciation of why it is important to get involved in developing software systems and tools. And I hope some things to think about when approaching large-scale data analysis projects.

Inspired by the work of Dr Robert Gentleman, what questions would you like to ask? Tickets to EARL Houston are still available. Find out more and get tickets here.