
If you read my series of posts on writing command line utilities in R, but were wondering how to do the same thing in Python, you’ve come to the right place.

I learned a lot with that original series of posts though, so this time we’re going to switch things up a bit, dive right into a complete working example, and cover it all off in a single post.

So let’s roll up our sleeves and get started!

Recap – What are command line utilities?

Command line utilities are tools that you can run on the command line of a computer. We most often see these on Linux and macOS machines using the 'bash' shell, but Windows users have options like CMD, git-bash and PowerShell too.

These tools allow you to instruct the computer to do things using text alone. You can also chain commands together, which gives you a really powerful way to get computers to do work for you. You've probably already used command line tools like ls and cd before, to 'list' a directory's contents and 'change directory' respectively, but by writing our own tools we really unlock the power of the command line.
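
For example, here's a small illustrative chain that counts how many 'wav' audio files are in the current directory by piping the output of ls into the wc (word count) tool:

$ ls *.wav | wc -l

Here wc -l simply counts the lines that ls produces, one per matching file.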

Imagine you're an entomologist studying the sounds that cicadas make. You have field equipment set up that records audio overnight and sends it to a Linux computer in your lab. Every morning you come into the lab, see what files you have and then begin to process them.

First, you might list the files in the directory with ls. You notice there’s lots of other stuff in that directory as well as your ‘wav’ audio files, so you can do ls *.wav to just list the files you’re interested in. Then you have to run a pre-processing command on each file to turn the audio into the data you need. Finally you need to generate some preliminary plots from that data. And that’s all before you even start to do any analysis!

Wouldn’t it be better if you could get the computer to do all that for you before you’ve even arrived at the lab? Using the command line and writing our own tools for it, we can do just that. In the example above we’d want to do something like the following pseudo-code (which is mostly standard bash syntax)…

# process each wav file using the fictional 'audio-to-data' command line 
# tool, which generates a csv file for each input file
for wavfile in *.wav
  do
    ./audio-to-data ${wavfile}
  done

# process each data file to create preliminary plots using the fictional 
# 'data-to-plot' command line tool, which outputs a png file for each input file
for datafile in *.csv
  do
    ./data-to-plot ${datafile}
  done
  
# now we can tidy up

## move all the raw audio files to the 'raw-audio' subdirectory
mv *.wav ./raw-audio/

## move all the csv files to a 'data' subdirectory
mv *.csv ./data/

## move all the preliminary plots to a 'plots' subdirectory
mv *.png ./plots/

Now that we've written out the entomologist's morning routine like this, it makes sense to get the computer to run it automatically. We can use the scheduling tools built into the operating system (a tool called 'cron' in this instance) to run this as a script each morning at 6am. That means all this work is already done by the time our entomologist arrives at the lab, and they can get on with the job of actually analysing the data and plots.
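
As a rough sketch, assuming the commands above were saved into an executable script (the script name and path below are just placeholders), the crontab entry to run it at 6am every day would look something like this:

# run the audio processing script at 06:00 every day
0 6 * * * /home/lab/process-audio.sh

You can add an entry like this by running crontab -e and pasting in the line above.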

This is all well and good, but I cheated! Some of the commands in my example were fictional! The operating system sometimes doesn't have a built-in command that can help you – for instance, there's no built-in command to detect cicada sounds! – and that's why we write our own command line utilities: to fill a specific need that isn't already met by your operating system.

A Python Sorting Hat

In this post we’re not going to try anything quite so ambitious as an audio-file-to-csv converter, but instead we’ll take a look at an example which provides some good foundations that you can build on yourself.

Below is the Python code for a command line sorting hat. If you followed along with the R version of this you should recognise it. If you didn't, it's basically a small, text-based program that takes a name as input and then tells you which Hogwarts house that person has been sorted into.

$ ./sortinghat.py Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.py Hermione
Hello Hermione, you can join Ravenclaw!

The code for the R version appeared in the sixth installment of the R series. Here’s the output of that one:

$ ./sortinghat.R Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.R Hermione
Hello Hermione, you can join Ravenclaw!

Exact same thing. Poor Hermione!

Here’s the full code for the Python version.

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""
import argparse
import hashlib
PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")
ARGV = PARSER.parse_args()
def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))
debug_msg("Debug option is set")
debug_msg("Your name is - ", ARGV.name)
HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }
NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()
debug_msg("The name_hash is - ", NAME_HASH)
HOUSE_KEY = NAME_HASH[0]
debug_msg("The house_key is - ", HOUSE_KEY)
HOUSE = HOUSES[HOUSE_KEY]
if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

In order to actually run this thing, you can either type it out yourself, or just copy and paste it into a file called ‘sortinghat.py’.

We could just run this with python sortinghat.py, but that doesn’t make our utility feel like it’s a proper command line tool. In order for Linux and MacOS shells (and Windows Subsystem for Linux and git-bash) to treat the file as ‘executable’ we must mark it as such by changing the ‘mode’ of the file.

Make sure you’re in the same directory as your file and run:

$ chmod +x ./sortinghat.py

Now you can just run ./sortinghat.py to run the command.

Breaking things down

shebang and docstring

Next we're going to go through each section in turn and look at its functionality.

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""

That very first line is referred to as a ‘shebang’ and it tells your command line shell (of which there are many, but ‘bash’ is the most common) which program to use to execute everything that follows. In this case we’re using a command called env to tell bash where to find python.

Note: I'm using Python 3 for this example. On some systems that have both Python 2 and 3, version 3 is invoked as python3, not just python. If that's the case for you, you'll need to modify the shebang to reflect that.
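
In that case, the first line of the script simply becomes:

#!/usr/bin/env python3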

After the shebang is a standard python docstring, just telling you what the app is all about.

import

import argparse
import hashlib

Next we import the modules we're going to use. Luckily for us, Python has an extensive and varied standard library that ships with it, so we don't need to install anything extra.

'argparse' will parse command line arguments for us. If you think of a command line tool like ls, arguments are things you can put after it to modify its behaviour. For example ls -l has -l as the argument and causes ls to print 'longer' output with more information than the standard output. For ls *.wav, the *.wav argument is a pattern which causes ls to only emit files that match that pattern.

'hashlib' is a module that implements various hash and message digest algorithms, which we'll need later on for the sorting part of the utility.

Handling arguments

PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")
ARGV = PARSER.parse_args()

This block sets up a new argument parser for us and adds some arguments to it. Arguments whose names don't start with a dash are 'positional', which basically means they are mandatory. If you define multiple positional arguments, they must be specified at run time in the order they are defined.

In our case, if we don’t specify the ‘name’, then we’ll get an error:

$ ./sortinghat.py
usage: sortinghat.py [-h] [-d] [-s] name
sortinghat.py: error: the following arguments are required: name

We didn’t have to create this error message, argparse did that for us because it knows that ‘name’ is a required argument.

The other arguments are 'flags', which means we can turn things on and off with them. Flags are specified with -- for the long form and - for the short form. You don't have to provide both, but doing so has developed into something of a convention over the years. Specifying them separately like this is also useful as it gives you full control over how the short options relate to the longer ones.

If, for instance, you wanted two arguments in your application called --force and --file, the convention would be to use -f as the short form, but you can’t use it for both. Explicitly assigning the short form version allows you to decide what you want to use instead. Maybe you’d go for -i for an input file or -o for an output file or something like that.
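
As a quick sketch (these arguments aren't part of the sorting hat script, they're just for illustration), that might look like this:

import argparse

parser = argparse.ArgumentParser()
# --force takes the conventional -f short form...
parser.add_argument("-f", "--force", action="store_true",
                    help="overwrite existing output")
# ...so --file gets -o instead, avoiding the clash
parser.add_argument("-o", "--file", help="output file to write")
args = parser.parse_args()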

These arguments are flags because we set action="store_true" in them, which stores True if they’re set and False if they’re not.

If you omit the action="store_true", you get an optional argument. This could be something like --file /path/to/file, where you must specify something immediately after the argument. You can use these for specifying additional parameters for your scripts. We’re not really covering that in this script though, so here are a few quick examples to get you thinking:

  • --config /path/to/config_file – specify an alternate config file to use instead of the default
  • --environment production – run against product data rather than test data
  • --algo algorithm_name – use a different algorithm instead of the default
  • --period weekly – change the default calculation period of your utility
  • --options /path/to/options/file – provide options for an analysis from an external file
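
To make one of these concrete, here's a minimal, hypothetical sketch of a value-taking option; the --config name and its default are made up for illustration:

import argparse

parser = argparse.ArgumentParser()
# without action="store_true" the argument expects a value after it
parser.add_argument("--config", help="alternate config file to use",
                    default="config.ini")
args = parser.parse_args()
print(args.config)  # whatever followed --config, or "config.ini" by default

Running a script like this with --config /path/to/other.ini then makes that path available as args.config.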

Another freebie we get from argparse is -h and --help. These are built-in and print nicely formatted help output for your users or future-self!

$ ./sortinghat.py -h
usage: sortinghat.py [-h] [-d] [-s] name

positional arguments:
  name         name of the person to sort

optional arguments:
  -h, --help   show this help message and exit
  -d, --debug  enable debug mode
  -s, --short  output only the house

Lastly for this section, we use parse_args() to assign the arguments that have been constructed to a new namespace called ARGV so we can use them later. Arguments stored in ARGV are retrievable using the long version of the argument name, so in this example: ARGV.name, ARGV.debug and ARGV.short.

Everything from this point onward is largely to do with the functionality of the utility, not the command line execution of it, so we’ll go through it quite quickly.

Printing debug messages

I didn’t want to get bogged down using a proper logging library for this small tool, so this function takes care of our very basic needs for us.

def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))

Essentially, it will only print a message if ARGV.debug is True and that will only be true if we set the -d flag when we run the tool on the command line.

We can then put messages like debug_msg("Debug option is set") in our code and they’ll do nothing unless that -d flag is set. If it is set, you’ll get output like:

$ ./sortinghat.py -d Mark
DEBUG: Debug option is set
DEBUG: Your name is - Mark
DEBUG: The name_hash is - f1b5a91d4d6ad523f2610114591c007e75d15084
DEBUG: The house_key is - f
Hello Mark, you can join Slytherin!

Using a technique like this – or perhaps a --verbose flag – can help to provide additional information about what’s going on inside your utility at run time that could be helpful to others or your future-self if they encounter any difficulties with it.

The debug_msg() function is used in this way throughout the rest of the program.

Figuring out the house

To figure out what house to assign someone to we use the same approach that we did for the R version. We calculate the hash of the input name and store the hexadecimal representation. Since hex uses the numbers 0-9 and the characters a-f, we can assign the four Hogwarts houses to these 16 symbols evenly in a Python dictionary.

We can then use the first character of the input name hash as the key when retrieving the value from the dictionary.

HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }
NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()
HOUSE_KEY = NAME_HASH[0]
HOUSE = HOUSES[HOUSE_KEY]

We also make sure that the input name is converted to lower case first to prevent us from running into any discrepancies between, for example, ‘Mark’ and ‘mark’.

Printing output

Here in the final section, we use the value of ARGV.short to decide whether to print the long output or the short output. Flags are False by default with argparse, so we can test if it’s been set to True (by specifying the -s flag on the command line) and print accordingly.

if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

Using the -s flag on the command line results in the following short output:

$ ./sortinghat.py -s Mark
Slytherin

Since the flags are optional you can combine them if you need to, so something like ./sortinghat.py -s -d Mark will produce the expected output – debug info with the short version of the final message.

That’s it for now

I hope you found this post useful and that you have some great ideas for things in your workflows that could be automated with command line utilities. If you do end up writing your own tool find me on twitter and let me know about it. I love hearing about all the awesome ways people are using these techniques to solve real world problems.

Originally posted on Mark’s blog, here.


In this post we're going to model the prices of Airbnb apartments in London. In other words, the aim is to build our own price suggestion model. We will be using data from http://insideairbnb.com/, which was collected in April 2018. This work is inspired by the Airbnb price prediction model built by Dino Rodriguez, Chase Davis and Ayomide Opeyemi. Normally we would do this in R, but we thought we'd try our hand at Python for a change.

We present a shortened version here, but the full version is available on our GitHub.

Data Preprocessing

First, we import the listings gathered in the csv file.

import pandas as pd
listings_file_path = 'listings.csv.gz' 
listings = pd.read_csv(listings_file_path, compression="gzip", low_memory=False)
listings.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')

The data has 95 columns or features. Our first step is to perform feature selection to reduce this number.

Feature selection

Selection on Missing Data

Features that have a high number of missing values aren’t useful for our model so we should remove them.

import matplotlib.pyplot as plt
%matplotlib inline

percentage_missing_data = listings.isnull().sum() / listings.shape[0]
ax = percentage_missing_data.plot(kind = 'bar', color='#E35A5C', figsize = (16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent Empty / NaN')
ax.set_title('Feature Emptiness')
plt.show()

As we can see, the features neighbourhood_group_cleansed, square_feet, has_availability, license and jurisdiction_names mostly have missing values. The features neighbourhood, cleaning_fee and security_deposit are more than 30% empty which is too much in our opinion. The zipcode feature also has some missing values but we can either remove these values or impute them within reasonable accuracy.

useless = ['neighbourhood', 'neighbourhood_group_cleansed', 'square_feet', 'security_deposit', 'cleaning_fee', 
           'has_availability', 'license', 'jurisdiction_names']
listings.drop(useless, axis=1, inplace=True)

Selection on Sparse Categorical Features

Let’s have a look at the categorical data to see the number of unique values.

categories = listings.columns[listings.dtypes == 'object']
percentage_unique = listings[categories].nunique() / listings.shape[0]

ax = percentage_unique.plot(kind = 'bar', color='#E35A5C', figsize = (16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent # Unique')
ax.set_title('Feature Uniqueness')
plt.show()

We can see that the street and amenities features have a large number of unique values. It would require some natural language processing to properly wrangle these into useful features. We believe we have enough location information with neighbourhood_cleansed and zipcode, so we'll remove street. We also remove the amenities, calendar_updated and calendar_last_scraped features as these are too complicated to process for the moment.

to_drop = ['street', 'amenities', 'calendar_last_scraped', 'calendar_updated']
listings.drop(to_drop, axis=1, inplace=True)

Now, let’s have a look at the zipcode feature. The above visualisation shows us that there are lots of different postcodes, maybe too many?

print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 24774

Indeed, there are too many zipcodes. If we leave this feature as is it might cause overfitting. Instead, we can regroup the postcodes. At the moment, they are separated as in the following example: KT1 1PE. We’ll keep the first part of the zipcode (e.g. KT1) and accept that this gives us some less precise location information.

listings['zipcode'] = listings['zipcode'].str.slice(0,3)
listings['zipcode'] = listings['zipcode'].fillna("OTHER")
print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 461

A lot of zipcodes contain fewer than 100 apartments, while a few zipcodes contain most of the apartments. Let's keep only those well-populated zipcodes.

# Count listings per zipcode and keep only zipcodes with more than 100 listings
count_per_zipcode = listings['zipcode'].value_counts()
relevant_zipcodes = count_per_zipcode[count_per_zipcode > 100].index
listings_zip_filtered = listings[listings['zipcode'].isin(relevant_zipcodes)]

# Plot new zipcodes distribution
count_per_zipcode = listings_zip_filtered['zipcode'].value_counts()
ax = count_per_zipcode.plot(kind='bar', figsize = (22,4), color = '#E35A5C', alpha = 0.85)
ax.set_title("Zipcodes by Number of Listings")
ax.set_xlabel("Zipcode")
ax.set_ylabel("# of Listings")

plt.show()

print('Number of entries removed: ', listings.shape[0] - listings_zip_filtered.shape[0])
Number of entries removed:  5484

This distribution is much better, and we only removed 5,484 rows from our dataframe, which originally contained around 53,904 rows.

Selection on Correlated Features

Next, we look at correlations.

import numpy as np
from sklearn import preprocessing

# Function to label encode categorical variables.
# Input: array (array of values)
# Output: array (array of encoded values)
def encode_categorical(array):
    if not array.dtype == np.dtype('float64'):
        return preprocessing.LabelEncoder().fit_transform(array) 
    else:
        return array
    
# Temporary dataframe (using the zipcode-filtered listings from above)
temp_data = listings_zip_filtered.copy()

# Delete additional entries with NaN values
temp_data = temp_data.dropna(axis=0)

# Encode categorical data
temp_data = temp_data.apply(encode_categorical)
# Compute matrix of correlation coefficients
corr_matrix = temp_data.corr()
# Display heat map 
plt.figure(figsize=(7, 7))
plt.pcolor(corr_matrix, cmap='RdBu')
plt.xlabel('Predictor Index')
plt.ylabel('Predictor Index')
plt.title('Heatmap of Correlation Matrix')
plt.colorbar()

plt.show()

This reveals that calculated_host_listings_count is highly correlated with host_total_listings_count, so we'll keep only the latter. We also see that the availability_* variables are correlated with each other; we'll keep availability_365 as it is the least correlated with the other variables. Finally, we drop requires_license, which has an odd correlation result of NAs and will not be useful in our model.

useless = ['calculated_host_listings_count', 'availability_30', 'availability_60', 'availability_90', 'requires_license']
listings_processed = listings_zip_filtered.drop(useless, axis=1)

Data Splitting: Features / labels – Training set / testing set

Now we split into features and labels and training and testing sets. We also convert the train and test dataframe into numpy arrays so that they can be used to train and test the models.

# Shuffle the data to ensure a good distribution for the training and testing sets
from sklearn.utils import shuffle
listings_processed = shuffle(listings_processed)

# Extract features and labels
y = listings_processed['price']
X = listings_processed.drop('price', axis = 1)

# Training and Testing Sets
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)

train_X = np.array(train_X)
test_X = np.array(test_X)
train_y = np.array(train_y)
test_y = np.array(test_y)

train_X.shape, test_X.shape
((36185, 170), (12062, 170))

Modelling

Now that the data preprocessing is over, we can start the second part of this work: applying different Machine Learning models. We decided to apply 3 different models:

  • Random Forest, with the RandomForestRegressor from the Scikit-learn library
  • Gradient Boosting method, with the XGBRegressor from the XGBoost library
  • Neural Network, with the MLPRegressor from the Scikit-learn library.

Each time, we applied the model with its default hyperparameters and then tuned it to find the best hyperparameters. The metric we use to evaluate the models is the median absolute error, chosen because of the presence of extreme outliers and skewness in the data set.
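
As a quick illustration (with made-up numbers) of why the median is the safer choice here, a single extreme error barely moves the median absolute error but dominates the mean:

import numpy as np

abs_errors = np.array([5, 7, 6, 8, 500])  # one extreme outlier
print(np.mean(abs_errors))    # 105.2 -- dragged up by the outlier
print(np.median(abs_errors))  # 7.0 -- barely affected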

We only show the code for the Random Forest here; for the rest of the code, please see the full version of this blogpost on our GitHub.

Application of the Random Forest Regressor

Let’s start with the Random Forest model.

With default hyperparameters

We first create a pipeline that imputes the missing values then scales the data and finally applies the model. We then fit this pipeline to the training set.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

# Create the pipeline (imputer + scaler + regressor)
my_pipeline_RF = make_pipeline(Imputer(), StandardScaler(),
                               RandomForestRegressor(random_state=42))

# Fit the model
my_pipeline_RF.fit(train_X, train_y)

We evaluate this model on the test set, using the median absolute error to measure the performance of the model. We’ll also include the root-mean-square error (RMSE) for completeness. Since we’ll be doing this repeatedly it is good practice to create a function.

from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error
from math import sqrt

def evaluate_model(model, predict_set, evaluate_set):
    predictions = model.predict(predict_set)
    print("Median Absolute Error: " + str(round(median_absolute_error(predictions, evaluate_set), 2))) 
    RMSE = round(sqrt(mean_squared_error(predictions, evaluate_set)), 2)
    print("RMSE: " + str(RMSE)) 
evaluate_model(my_pipeline_RF, test_X, test_y)
Median Absolute Error: 14.2
RMSE: 126.16

Hyperparameter tuning

We had some good results with the default hyperparameters of the Random Forest regressor. But we can improve the results with some hyperparameter tuning. There are two main methods available for this:

  • Random search
  • Grid search

You have to provide a parameter grid to these methods, and both then try different combinations of parameters from within that grid. The difference is that random search tries only a limited number of combinations, whereas grid search tries every possible combination in the grid you provided.

We started with a random search to roughly evaluate a good combination of parameters. Once this is complete, we use the grid search to get more precise results.

Randomized Search with Cross Validation

import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 1000, num = 11)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_features': max_features,
               'randomforestregressor__max_depth': max_depth,
               'randomforestregressor__min_samples_split': min_samples_split,
               'randomforestregressor__min_samples_leaf': min_samples_leaf,
               'randomforestregressor__bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
from sklearn.model_selection import RandomizedSearchCV

# Random search of parameters, using 2 fold cross validation, 
# search across 50 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = my_pipeline_RF, 
                               param_distributions = random_grid, 
                               n_iter = 50, cv = 2, verbose=2,
                               random_state = 42, n_jobs = -1, 
                               scoring = 'neg_median_absolute_error')
# Fit our model
rf_random.fit(train_X, train_y)

rf_random.best_params_
{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 35,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 5,
 'randomforestregressor__n_estimators': 1000}

Grid Search with Cross Validation

from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'randomforestregressor__bootstrap': [True],
    'randomforestregressor__max_depth': [30, 35, 40], 
    'randomforestregressor__max_features': ['auto'],
    'randomforestregressor__min_samples_leaf': [2],
    'randomforestregressor__min_samples_split': [4, 5, 6],
    'randomforestregressor__n_estimators': [950, 1000, 1050] 
}

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = my_pipeline_RF, 
                           param_grid = param_grid, 
                           cv = 3, n_jobs = -1, verbose = 2, 
                           scoring = 'neg_median_absolute_error')

# Fit the grid search to the data
grid_search.fit(train_X, train_y)

grid_search.best_params_
{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 30,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 4,
 'randomforestregressor__n_estimators': 1050}

Final Model

# Create the pipeline (imputer + scaler + regressor)
my_pipeline_RF_grid = make_pipeline(Imputer(), StandardScaler(),
                                      RandomForestRegressor(random_state=42,
                                                            bootstrap = True,
                                                            max_depth = 30,
                                                            max_features = 'auto',
                                                            min_samples_leaf = 2,
                                                            min_samples_split = 4,
                                                            n_estimators = 1050))

# Fit the model
my_pipeline_RF_grid.fit(train_X, train_y)

evaluate_model(my_pipeline_RF_grid, test_X, test_y)
Median Absolute Error: 13.57
RMSE: 125.04

We get better results with the tuned model than with the default hyperparameters, but the improvement in the median absolute error is fairly small. Maybe we will get better precision with another model.

Visualisation of all models’ performance

The tuned Random Forest and XGBoost gave the best results on the test set. Surprisingly, the Multi Layer Perceptron with default parameters gave the highest median absolute error, and the tuned one did not even beat the default Random Forest. This is unusual; maybe the Multi Layer Perceptron needs more data to perform well, or it might need more tuning of important hyperparameters such as hidden_layer_sizes.
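
As a hedged sketch only (not something we ran for this post), tuning hidden_layer_sizes could follow the same pipeline-plus-grid-search pattern we used for the Random Forest; the candidate layer sizes below are purely illustrative:

from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler

# pipeline mirroring the Random Forest one: impute, scale, then the MLP
my_pipeline_MLP = make_pipeline(Imputer(), StandardScaler(),
                                MLPRegressor(random_state=42, max_iter=500))

# candidate architectures to try
param_grid_MLP = {'mlpregressor__hidden_layer_sizes': [(50,), (100,), (100, 50)]}

mlp_search = GridSearchCV(estimator=my_pipeline_MLP,
                          param_grid=param_grid_MLP,
                          cv=3, n_jobs=-1,
                          scoring='neg_median_absolute_error')
# mlp_search.fit(train_X, train_y)
# mlp_search.best_params_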

Conclusion

In this post, we modelled Airbnb apartment prices using descriptive data from the Airbnb website. First, we preprocessed the data to remove any redundant features and reduce the sparsity of the data. Then we applied three different algorithms, initially with default parameters which we then tuned. In our results the tuned Random Forest and tuned XGBoost performed best.

To further improve our models we could include more feature engineering, for example, time-based features. We could also try more extensive hyperparameter tuning. If you would like to give it a go yourself, the code and data for this post can be found on GitHub.


Linux containers, of which Docker is the most well known, can be a really great way to improve reproducibility in your data projects (for more info see here) and to create portable, reusable applications. But how do we manage the deployment of multiple containerised applications?

Kubernetes is an open source container management platform that automates the core operations for you. It allows you to automatically deploy and scale containerised applications and removes the manual steps that would otherwise be involved. Essentially, you cluster together groups of hosts running Linux containers, and Kubernetes helps you easily and efficiently manage those clusters. This is especially effective in cloud based environments.

Why use Kubernetes in your data stack?

Since Kubernetes orchestrates containers and since containers are a great way to bundle up your applications with their dependencies — thus improving reproducibility — Kubernetes is a natural fit if you’re aiming for high levels of automation in your stack.

Kubernetes allows you to manage containerised apps that span multiple containers as well as scale and schedule the containers as necessary across the cluster.

For instance, if you’re building stateless microservices in flask (Python) and plumber (R) it’s easy to initially treat running them in containers as though they were running in a simple virtual machine. However, once these containers are in a production environment and scale becomes much more important, you’ll likely need to run multiple instances of the containers and Kubernetes can take care of that for you.

Automation is a key driver

When container deployments are small it can be tempting to manage them by hand, starting and stopping the containers required to service your application. But this approach is very inflexible, and beyond the smallest of deployments it is not really practical. Kubernetes is designed to manage the complexity of looking after production-scale container deployments, which can quickly reach a size and level of complexity that does not lend itself to error-prone manual management.

Scheduling is another often overlooked feature of Kubernetes in data processing pipelines. You could, for example, schedule refreshes of models to keep them up to date. Such processes can be scheduled for times when you know the cluster will otherwise be quiet (such as overnight or at weekends), with the refreshed model being published automatically.

The Case for Kubernetes in your data stack

More broadly, it helps you fully implement and rely on a container-based infrastructure in production environments. This is especially beneficial when you’re trying to reduce infrastructure costs as it allows you to keep your cluster size at the bare minimum required to run your applications, which in turn saves you money on wasted compute resource.

The features of Kubernetes are too numerous to list here, but the key things to take away are that it can run containerised apps across multiple hosts, scale applications on the fly, auto-restart applications that have fallen over and help automate deployments.

The wider Kubernetes ecosystem relies on many other projects to deliver these fully orchestrated services. These additional projects provide features such as registry services for your containers, networking, security and so on.

Kubernetes offers a rich toolset to manage complex application stacks and with data science, engineering and operations becoming increasingly large scale, automation is a key driver for many new projects. If you’re not containerising your apps yet, jumping into Kubernetes can seem daunting, but if you start small by building out some simple containerised applications to start with, the benefits of this approach should become clear pretty quickly.

For an in-depth technical look at running Kubernetes, this post by Mark Edmondson offers an excellent primer.