Integrating Python and R into a Data Analysis Pipeline – Part 1

By Chris Musselle and Kate Ross-Smith

For a conference in the R language, EARL London 2015 saw a surprising number of discussions about Python. I like to think that at least some of this was to do with the fact that the day before the conference, we ran a 3-hour workshop outlining various strategies for integrating Python and R.

This is the first in a series of three blog posts that:

  • outline the basic strategy for integrating Python and R;
  • run through the different steps involved in this process; and
  • give a real example of how and why you would want to do this.

This post kicks everything off by:

  • covering the reasons why you may want to include both languages in a pipeline;
  • introducing ways of running R and Python from the command line; and
  • showing how you can accept inputs as arguments and write outputs to various file formats.

Why “And” not “Or”?

From a quick internet search for articles about “R Python”, of the top 10 results, only 2 discuss the merits of using both R and Python rather than pitting them against each other. This is understandable; from their inception, both have had very distinctive strengths and weaknesses. Historically, though, the split has been one of educational background: statisticians have preferred the approach that R takes, whereas programmers have made Python their language of choice. However, with the growing breed of data scientists, this distinction blurs:

Data Scientist (n.): Person who is better at statistics than any
software engineer and better at software engineering than any
statistician. — twitter @josh_wills

With the wealth of distinct library resources provided by each language, there is a growing need for data scientists to be able to leverage their relative strengths. For example:

Python tends to outperform R in such areas as:

  • Web scraping and crawling: though rvest has simplified web scraping and crawling within R, Python’s beautifulsoup and Scrapy are more mature and deliver more functionality.
  • Database connections: though R has a large number of options for connecting to databases, Python’s sqlachemy offers this in a single package and is widely used in production environments.

Whereas R outperforms Python in such areas as:

  • Statistical analysis options: though Python’s combination of Scipy, Pandas and statsmodels offer a great set of statistical analysis tools, R is built specifically around statistical analysis applications and so provides a much larger collection of such tools.
  • Interactive graphics/dashboards: bokeh, plotly and intuitics have all recently extended the use of Python graphics onto web browsers, but getting an example up and running using shiny and shiny dashboard in R is faster, and often requires less code.

Further, as data science teams now have a relatively wide range of skills, the language of choice for any application may come down to prior knowledge and experience. For some applications – especially in prototyping and development – it is faster for people to use the tool that they already know.

Flat File “Air Gap” Strategy

In this series of posts we are going to consider the simplest strategy for integrating the two languages, and step though it with some examples. Using a flat file as an air gap between the two languages requires you to do the following steps.

  1. Refactor your R and Python scripts to be executable from the command line and accept command line arguments.
  2. Output the shared data to a common file format.
  3. Execute one language from the other, passing in arguments as required.

Pros

  • Simplest method, so commonly the quickest
  • Can view the intermediate outputs easily
  • Parsers already exist for many common file formats: CSV, JSON, YAML

Cons

  • Need to agree upfront on a common schema or file format
  • Can become cumbersome to manage intermediate outputs and paths if the pipeline grows.
  • Reading and writing to disk can become a bottleneck if data becomes large.

Command Line Scripting

Running scripts from the command line via a Windows/Linux-like terminal environment is similar in both R and Python. The command to be run is broken down into the following parts,

<command_to_run> <path_to_script> <any_additional_arguments>

where:

  • <command> is the executable to run (Rscript for R code and Python for Python code),
  • <path_to_script> is the full or relative file path to the script being executed. Note that if there are any spaces in the path name, the whole file path must me enclosed in double quotes.
  • <any_additional_arguments> This is a list of space delimited arguments parsed to the script itself. Note that these will be passed in as strings.

So for example, an R script is executed by opening up a terminal environment and running the following:

Rscript path/to/myscript.R arg1 arg2 arg3

A Few Gotchas

  • For the commands Rscript and Python to be found, these executables must already be on your path. Otherwise the full path to their location on your file system must be supplied.
  • Path names with spaces create problems, especially on Windows, and so must be enclosed in double quotes so they are recognised as a single file path.

Accessing Command Line Arguments in R

In the above example where arg1, arg2 and arg3 are the arguments parsed to the R script being executed, these are accessible using the commandArgs function.

## myscript.R

# Fetch command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)

# myArgs is a character vector of all arguments
print(myArgs)
print(class(myArgs))

By setting trailingOnly = TRUE, the vector myArgs only contains arguments that you added on the command line. If left as FALSE (by default), there will be other arguments included in the vector, such as the path to the script that was just executed.

Accessing Command Line Arguments in Python

For a Python script executed by running the following on the command line

python path/to/myscript.py arg1 arg2 arg3

the arguments arg1, arg2 and arg3 can be accessed from within the Python script by first importing the sys module. This module holds parameters and functions that are system specific, however we are only interested here in the argv attribute. This argv attribute is a list of all the arguments passed to the script currently being executed. The first element in this list is always the full file path to the script being executed.

# myscript.py
import sys

# Fetch command line arguments
my_args = sys.argv

# my_args is a list where the first element is the file executed.
print(type(my_args))
print(my_args)

If you only wished to keep the arguments parsed into the script, you can use list slicing to select all but the first element.

# Using a slice, selects all but the first element
my_args = sys.argv[1:]

As with the above example for R, recall that all arguments are parsed in as strings, and so will need converting to the expected types as necessary.

Writing Outputs to a File

You have a few options when sharing data between R and Python via an intermediate file. In general for flat files, CSVs are a good format for tabular data, while JSON or YAML are best if you are dealing with more unstructured data (or metadata), which could contain a variable number of fields or more nested data structures.

All these are very common data serialisation formats, and parsers already exist in both languages. In R the following packages are recommended for each format:

And in Python:

The csv and json modules are part of the Python standard library, distributed with Python itself, whereas PyYAML will need installing separately. All R packages will also need installing in the usual way.

Summary

So passing data between R and Python (and vice-versa) can be done in a single pipeline by:

  • using the command line to transfer arguments, and
  • transferring data through a commonly-structured flat file.

However, in some instances, having to use a flat file as an intermediate data store can be both cumbersome and detrimental to performance. In the next post, we will look at how R and Python can directly call each other and return the output in memory.

 

facebooktwittergoogle_plusredditpinterestlinkedinmail