
Julia is a Data Scientist at Stack Overflow, has a PhD in astrophysics and an abiding love for Jane Austen (which we totally understand!). Before moving into Data Science and discovering R, Julia worked in academia and ed tech, and was a NASA Datanaut. She enjoys making beautiful charts, programming in R, text mining, and communicating about technical topics with diverse audiences. In fact, she loves R and text mining so much, she literally wrote the book on it: Text Mining with R: A Tidy Approach!

Lovely to speak to you, Julia. Could you give us a bit of background about the work that you do?

The open source work I do focuses on building a bridge between the tidyverse ecosystem of tools and the real world text data that so many of us need to use in our organizations, so we can use powerful, well-designed tidy tools with text data. In my day job, I work at Stack Overflow, using statistics and machine learning to make our site the best place for people who code to learn and share knowledge online, and to help our clients who want to engage with developers be successful.

What led to your career path?

My academic background is in physics and astronomy, where I was an observational astronomer who spent my time “in the trenches” with real-life data. Also, I’ve been heavily involved in education in various forms for a long time, whether speaking, teaching, writing, or otherwise. All of this together informs how I do data science, because a huge part of what I do is communicate with people about what a complex data analysis means. The fact that I analyze some dataset or train some machine learning model is great, but if I can’t explain it to my business partners, then we can’t make decisions.

Could you tell us what to expect from the content of your talk? And are there any key pieces of advice or tips that delegates will come away with?

Many R users working in fields from healthcare to finance to tech deal with messy text data (this includes me at Stack Overflow!); my talk focuses on a practical, flexible approach to use this text data to gain insight and make better decisions.

Can you give an example?

Folks at EARL can expect my talk to start with the fundamentals of exploratory data analysis for text. EDA is a fruitful and important part of the data science process, and in my own work, I know how much bang for the buck I get when I am deliberate about EDA strategies. We won’t stop there, though! We will also cover how to use tidy data principles for supervised and unsupervised machine learning for text.

What inspired you to write your book Text Mining with R – A Tidy Approach?

The book that my collaborator Dave and I wrote together grew organically out of the work we were doing in this space. We started by developing long-form documentation for our R package, invested more time in laying out best practices in workflows through blog posts, and eventually brought a book’s worth of content together in one cohesive, organized place.

Tell us about the type of work you get involved with on a day to day basis.

In my day job at Stack Overflow, I work on two main categories of questions. The first is centered on the ways that we directly generate revenue, through partnering with clients who want to hire, engage with, and enable the world’s developers. The second (which is of course connected to the first) is centered on the public Q&A community of Stack Overflow and the other Stack Exchange sites; I work on questions around how technologies are related to each other and changing, how to scaffold question askers to success, and how to make Stack Overflow more welcoming and inclusive.

What work do you do with the wider data science community and how do you see it evolving?

In my open source work, I maintain my own R packages, blog and speak about data analysis practices and share resources about data science and tech via social media. I have some ideas for new work I am excited about pursuing soon! I would love to evolve my data science work to more fully support best practices in machine learning for text. Another area that I want to continue to invest energy in, both in my day job and community work, is moving data science and tech toward more just and inclusive practices.

Come and see Julia at EARL Seattle on 7th November and be inspired by her love for text mining and tidyverse applications. We are really looking forward to the conference programme in Seattle, Houston and Boston.

Tickets can still be purchased here.


In this post we’re going to model the prices of Airbnb apartments in London. In other words, the aim is to build our own price suggestion model. We will be using data from http://insideairbnb.com/ which was collected in April 2018. This work is inspired by the Airbnb price prediction model built by Dino Rodriguez, Chase Davis, and Ayomide Opeyemi. Normally we would be doing this in R, but we thought we’d try our hand at Python for a change.

We present a shortened version here, but the full version is available on our GitHub.

Data Preprocessing

First, we import the listings from the CSV file.

import pandas as pd
listings_file_path = 'listings.csv.gz' 
listings = pd.read_csv(listings_file_path, compression="gzip", low_memory=False)
listings.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'access', 'interaction', 'house_rules',
       'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url',
       'host_id', 'host_url', 'host_name', 'host_since', 'host_location',
       'host_about', 'host_response_time', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url',
       'host_picture_url', 'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'street',
       'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')

The data has 95 columns or features. Our first step is to perform feature selection to reduce this number.

Feature selection

Selection on Missing Data

Features that have a high number of missing values aren’t useful for our model so we should remove them.

import matplotlib.pyplot as plt
%matplotlib inline

percentage_missing_data = listings.isnull().sum() / listings.shape[0]
ax = percentage_missing_data.plot(kind = 'bar', color='#E35A5C', figsize = (16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent Empty / NaN')
ax.set_title('Feature Emptiness')
plt.show()

As we can see, the features neighbourhood_group_cleansed, square_feet, has_availability, license and jurisdiction_names mostly consist of missing values. The features neighbourhood, cleaning_fee and security_deposit are more than 30% empty, which is too much in our opinion. The zipcode feature also has some missing values, but we can either remove those rows or impute them with reasonable accuracy.

useless = ['neighbourhood', 'neighbourhood_group_cleansed', 'square_feet', 'security_deposit', 'cleaning_fee', 
           'has_availability', 'license', 'jurisdiction_names']
listings.drop(useless, axis=1, inplace=True)

Selection on Sparse Categorical Features

Let’s have a look at the categorical data to see the number of unique values.

categories = listings.columns[listings.dtypes == 'object']
percentage_unique = listings[categories].nunique() / listings.shape[0]

ax = percentage_unique.plot(kind = 'bar', color='#E35A5C', figsize = (16, 5))
ax.set_xlabel('Feature')
ax.set_ylabel('Percent # Unique')
ax.set_title('Feature Uniqueness')
plt.show()

We can see that the street and amenities features have a large number of unique values. It would require some natural language processing to properly wrangle these into useful features. We believe we have enough location information with neighbourhood_cleansed and zipcode, so we’ll remove street. We also remove the amenities, calendar_updated and calendar_last_scraped features as these are too complicated to process for the moment.

to_drop = ['street', 'amenities', 'calendar_last_scraped', 'calendar_updated']
listings.drop(to_drop, axis=1, inplace=True)

Now, let’s have a look at the zipcode feature. The above visualisation shows us that there are lots of different postcodes, maybe too many?

print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 24774

Indeed, there are too many zipcodes. If we leave this feature as is it might cause overfitting. Instead, we can regroup the postcodes. At the moment, they are separated as in the following example: KT1 1PE. We’ll keep the first part of the zipcode (e.g. KT1) and accept that this gives us some less precise location information.

listings['zipcode'] = listings['zipcode'].str.slice(0,3)
listings['zipcode'] = listings['zipcode'].fillna("OTHER")
print("Number of Zipcodes:", listings['zipcode'].nunique())
Number of Zipcodes: 461

Most zipcodes contain fewer than 100 apartments, while a few zipcodes contain most of the apartments. Let’s keep only the zipcodes with more than 100 listings.

# Count listings per zipcode and keep only zipcodes with more than 100 listings
count_per_zipcode = listings['zipcode'].value_counts()
relevant_zipcodes = count_per_zipcode[count_per_zipcode > 100].index
listings_zip_filtered = listings[listings['zipcode'].isin(relevant_zipcodes)]

# Plot new zipcodes distribution
count_per_zipcode = listings_zip_filtered['zipcode'].value_counts()
ax = count_per_zipcode.plot(kind='bar', figsize = (22,4), color = '#E35A5C', alpha = 0.85)
ax.set_title("Zipcodes by Number of Listings")
ax.set_xlabel("Zipcode")
ax.set_ylabel("# of Listings")

plt.show()

print('Number of entries removed: ', listings.shape[0] - listings_zip_filtered.shape[0])
Number of entries removed:  5484

This distribution is much better, and we only removed 5,484 rows from our dataframe, which contained about 53,904 rows.

Selection on Correlated Features

Next, we look at correlations.

import numpy as np
from sklearn import preprocessing

# Function to label encode categorical variables.
# Input: array (array of values)
# Output: array (array of encoded values)
def encode_categorical(array):
    if not array.dtype == np.dtype('float64'):
        return preprocessing.LabelEncoder().fit_transform(array) 
    else:
        return array
    
# Temporary dataframe
temp_data = listings_zip_filtered.copy()

# Delete additional entries with NaN values
temp_data = temp_data.dropna(axis=0)

# Encode categorical data
temp_data = temp_data.apply(encode_categorical)
# Compute matrix of correlation coefficients
corr_matrix = temp_data.corr()
# Display heat map 
plt.figure(figsize=(7, 7))
plt.pcolor(corr_matrix, cmap='RdBu')
plt.xlabel('Predictor Index')
plt.ylabel('Predictor Index')
plt.title('Heatmap of Correlation Matrix')
plt.colorbar()

plt.show()

This reveals that calculated_host_listings_count is highly correlated with host_total_listings_count, so we’ll keep only the latter. We also see that the availability_* variables are correlated with each other; we’ll keep availability_365, which is the least correlated with the other variables. Finally, we drop requires_license, whose correlations come out as NaN, so it will not be useful in our model.

useless = ['calculated_host_listings_count', 'availability_30', 'availability_60', 'availability_90', 'requires_license']
listings_processed = listings_zip_filtered.drop(useless, axis=1)

Data Splitting: Features / labels – Training set / testing set

Now we split into features and labels and training and testing sets. We also convert the train and test dataframe into numpy arrays so that they can be used to train and test the models.

# Shuffle the data to ensure a good distribution for the training and testing sets
from sklearn.utils import shuffle
listings_processed = shuffle(listings_processed)

# Extract features and labels
y = listings_processed['price']
X = listings_processed.drop('price', axis = 1)

# Training and Testing Sets
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state = 0)

train_X = np.array(train_X)
test_X = np.array(test_X)
train_y = np.array(train_y)
test_y = np.array(test_y)

train_X.shape, test_X.shape
((36185, 170), (12062, 170))

Modelling

Now that the data preprocessing is over, we can start the second part of this work: applying different Machine Learning models. We decided to apply 3 different models:

  • Random Forest, with the RandomForestRegressor from the Scikit-learn library
  • Gradient Boosting method, with the XGBRegressor from the XGBoost library
  • Neural Network, with the MLPRegressor from the Scikit-learn library.

Each time, we first applied the model with its default hyperparameters and then tuned it to find the best hyperparameters. The metric we use to evaluate the models is the median absolute error, chosen because of the extreme outliers and skewness in the data set.
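To see why the median absolute error is the more robust choice here, consider a small illustrative example (toy numbers, not taken from the dataset): a single extreme listing barely moves the median absolute error but dominates the RMSE.

import numpy as np
from sklearn.metrics import median_absolute_error, mean_squared_error

# Toy example: predictions are off by 10 everywhere, except for one extreme listing
y_true = np.array([50, 80, 100, 120, 2000])   # one luxury outlier
y_pred = np.array([60, 90, 110, 130, 500])    # the outlier is badly under-predicted

print(median_absolute_error(y_true, y_pred))        # 10.0 -- unaffected by the outlier
print(np.sqrt(mean_squared_error(y_true, y_pred)))  # ~670.9 -- dominated by the outlier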

We only show the code for the Random Forest here; for the rest of the code, please see the full version of this blog post on our GitHub.
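As a rough indication only (a sketch, not the exact code from the full post), the other two pipelines can be set up along the same lines, reusing the same preprocessing steps:

from xgboost import XGBRegressor            # assumes the xgboost package is installed
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer, StandardScaler

# Gradient boosting pipeline (imputer + scaler + XGBoost regressor)
my_pipeline_XGB = make_pipeline(Imputer(), StandardScaler(),
                                XGBRegressor(random_state=42))

# Neural network pipeline (imputer + scaler + multi-layer perceptron)
my_pipeline_MLP = make_pipeline(Imputer(), StandardScaler(),
                                MLPRegressor(random_state=42))

# Both are fitted and evaluated in the same way as the Random Forest pipeline below,
# e.g. my_pipeline_XGB.fit(train_X, train_y)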

Application of the Random Forest Regressor

Let’s start with the Random Forest model.

With default hyperparameters

We first create a pipeline that imputes the missing values then scales the data and finally applies the model. We then fit this pipeline to the training set.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import StandardScaler

# Create the pipeline (imputer + scaler + regressor)
my_pipeline_RF = make_pipeline(Imputer(), StandardScaler(),
                               RandomForestRegressor(random_state=42))

# Fit the model
my_pipeline_RF.fit(train_X, train_y)

We evaluate this model on the test set, using the median absolute error to measure the performance of the model. We’ll also include the root-mean-square error (RMSE) for completeness. Since we’ll be doing this repeatedly it is good practice to create a function.

from sklearn.metrics import median_absolute_error
from sklearn.metrics import mean_squared_error
from math import sqrt

def evaluate_model(model, predict_set, evaluate_set):
    predictions = model.predict(predict_set)
    print("Median Absolute Error: " + str(round(median_absolute_error(predictions, evaluate_set), 2))) 
    RMSE = round(sqrt(mean_squared_error(predictions, evaluate_set)), 2)
    print("RMSE: " + str(RMSE)) 
evaluate_model(my_pipeline_RF, test_X, test_y)
Median Absolute Error: 14.2
RMSE: 126.16

Hyperparameter tuning

We had some good results with the default hyperparameters of the Random Forest regressor. But we can improve the results with some hyperparameter tuning. There are two main methods available for this:

  • Random search
  • Grid search

You have to provide a parameter grid to these methods. Both then try different combinations of parameters within the grid you provide, but random search tries only a limited number of combinations, whereas grid search tries every possible combination in the grid.

We started with a random search to roughly identify a good combination of parameters. Once this was complete, we used a grid search to get more precise results.

Randomized Search with Cross Validation
import numpy as np

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 1000, num = 11)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 5)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'randomforestregressor__n_estimators': n_estimators,
               'randomforestregressor__max_features': max_features,
               'randomforestregressor__max_depth': max_depth,
               'randomforestregressor__min_samples_split': min_samples_split,
               'randomforestregressor__min_samples_leaf': min_samples_leaf,
               'randomforestregressor__bootstrap': bootstrap}
# Use the random grid to search for best hyperparameters
from sklearn.model_selection import RandomizedSearchCV

# Random search of parameters, using 2-fold cross validation,
# searching across 50 different combinations and using all available cores
rf_random = RandomizedSearchCV(estimator = my_pipeline_RF, 
                               param_distributions = random_grid, 
                               n_iter = 50, cv = 2, verbose=2,
                               random_state = 42, n_jobs = -1, 
                               scoring = 'neg_median_absolute_error')
# Fit our model
rf_random.fit(train_X, train_y)

rf_random.best_params_
{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 35,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 5,
 'randomforestregressor__n_estimators': 1000}
Grid Search with Cross Validation
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search 
param_grid = {
    'randomforestregressor__bootstrap': [True],
    'randomforestregressor__max_depth': [30, 35, 40], 
    'randomforestregressor__max_features': ['auto'],
    'randomforestregressor__min_samples_leaf': [2],
    'randomforestregressor__min_samples_split': [4, 5, 6],
    'randomforestregressor__n_estimators': [950, 1000, 1050] 
}

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = my_pipeline_RF, 
                           param_grid = param_grid, 
                           cv = 3, n_jobs = -1, verbose = 2, 
                           scoring = 'neg_median_absolute_error')

# Fit the grid search to the data
grid_search.fit(train_X, train_y)

grid_search.best_params_
{'randomforestregressor__bootstrap': True,
 'randomforestregressor__max_depth': 30,
 'randomforestregressor__max_features': 'auto',
 'randomforestregressor__min_samples_leaf': 2,
 'randomforestregressor__min_samples_split': 4,
 'randomforestregressor__n_estimators': 1050}
Final Model
# Create the pipeline (imputer + scaler + regressor)
my_pipeline_RF_grid = make_pipeline(Imputer(), StandardScaler(),
                                      RandomForestRegressor(random_state=42,
                                                            bootstrap = True,
                                                            max_depth = 30,
                                                            max_features = 'auto',
                                                            min_samples_leaf = 2,
                                                            min_samples_split = 4,
                                                            n_estimators = 1050))

# Fit the model
my_pipeline_RF_grid.fit(train_X, train_y)

evaluate_model(my_pipeline_RF_grid, test_X, test_y)
Median Absolute Error: 13.57
RMSE: 125.04

We get better results with the tuned model than with the default hyperparameters, but the improvement in median absolute error is modest. Perhaps another model will do better.

Visualisation of all models’ performance

The tuned Random Forest and XGBoost gave the best results on the test set. Surprisingly, the Multi-Layer Perceptron with default parameters gave the highest median absolute error, and even the tuned one did not beat the default Random Forest. This is unusual; maybe the Multi-Layer Perceptron needs more data to perform well, or more tuning of important hyperparameters such as hidden_layer_sizes.
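As an illustration only (we did not run this for the post), extending the search to cover the network architecture might look like the following, assuming an MLP pipeline called my_pipeline_MLP built with make_pipeline in the same way as the Random Forest one; the candidate values are placeholders, not recommendations:

from sklearn.model_selection import GridSearchCV

# Sketch: also search over the architecture and regularisation of the MLP
mlp_param_grid = {
    'mlpregressor__hidden_layer_sizes': [(50,), (100,), (100, 50)],
    'mlpregressor__alpha': [0.0001, 0.001, 0.01],
}

grid_search_MLP = GridSearchCV(estimator = my_pipeline_MLP,
                               param_grid = mlp_param_grid,
                               cv = 3, n_jobs = -1,
                               scoring = 'neg_median_absolute_error')
# grid_search_MLP.fit(train_X, train_y)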

Conclusion

In this post, we modelled Airbnb apartment prices using descriptive data from the Airbnb website. First, we preprocessed the data to remove any redundant features and reduce the sparsity of the data. Then we applied three different algorithms, initially with default parameters which we then tuned. In our results the tuned Random Forest and tuned XGBoost performed best.

To further improve our models we could do more feature engineering, for example creating time-based features (one possibility is sketched below). We could also try more extensive hyperparameter tuning. If you would like to give it a go yourself, the code and data for this post can be found on GitHub.
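As one purely illustrative example of a time-based feature, we could measure how long each host has been active, assuming we work from the raw listings dataframe before encoding and take the scrape date (April 2018) as the reference point:

import pandas as pd

# Sketch: days since the host joined Airbnb, relative to the scrape date
listings['host_since'] = pd.to_datetime(listings['host_since'])
listings['host_days_active'] = (pd.Timestamp('2018-04-01') - listings['host_since']).dt.days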


A year before I sat down here and started writing this sentence, I was about three months into a year-long work placement at Mango. I loved what I was doing, I loved the people I was doing it with, and I was generally having a great time.

But at some point last spring, people started to ask me when I was leaving. I couldn’t tell whether it was because they knew that at some point I’d have to go back to university to complete my course, or because they’d had enough of me already and wanted me to go away. Either way, whenever I mentioned that I was thinking about taking advantage of what will probably be the last long summer holiday of my life, everyone told me the same thing: I’d be a fool not to.

Therefore, after months of meticulous planning, early one mid-July morning a friend and I – summoning as much 18th-century spirit as possible – set off to complete a Grand Tour of the continent.

Over the course of 55 days, we visited 22 countries, we covered over 6000 miles of European highway, and due to the fact that I was very busy being on holiday, I wrote precisely 0 lines of code.

So I’m not writing about some cool project I’ve done, or some amazing new tech I’ve been researching, or really anything at all to do with code or a computer. Sorry. I suppose this piece should really be called “some things I learned which definitely have completely nothing at all to do with my job”.

Some things I learned which definitely have completely nothing at all to do with my job

1. Head for high ground

This is probably what you learn on day one in Army Commander School.

You’re in charge. The furious battle is raging on all sides. Suddenly, you realise that you are in serious danger of being completely overwhelmed. This is a good time to employ a tactic commonly referred to as “running away”.

But if this is a battle you want (or need!) to win, it’s probably not a good idea to run away forever. Your opponent isn’t going to hang around for a while wondering where you’ve gone, and then just decide “actually, yeah, fair enough, we lost, never mind”.

Instead, you should run away to somewhere nearby, but as high up as possible. This gives you a chance to widen your view: you can see where you’ve been, where you want to go, what’s going on right now, and how those three things relate to each other. After assessing the situation from this elevated position, it is much easier to see what needs to be done and to refocus your efforts accordingly.

If you want to be a top-level Army Commander one day, you can learn more from this. On your next conquest (or, in my case, unfamiliar city) find that high ground and go on a quick reconnaissance mission as early as possible: identify your goals, think about the best way to get to them, and scan the horizon for any threats (or scary grey clouds) which might be on their way towards you.

Once you have the high ground, it’s over

2. Record what you’ve done

On the whole, we humans are pretty smart. We’re good at figuring out how to do stuff, and once we’ve figured out what to do, we’re good at actually doing it.

Having said that, the same is true of other primates. And crows. And dolphins. And beavers. And so on, and so on.

The reason why we are smarter is that we have an awesome thing called “language”. Language lets us share our ideas and our experiences with other humans, so that they don’t have to come up with the same ideas or go through the same experiences in order to have the same knowledge.

Even better: at some point a few thousand years ago, someone figured out how to convert language into something physical. As a result, those of us who are alive right now have access to virtually all the knowledge developed by all of humankind since that point.

SO WHY YOU NO USE IT? Write down everything! Write down what you’ve done, and how you’ve done it, and why you’ve done it, and why you’ve done it like that, and everything that went wrong before you got it right, and everything you think it could lead to.

Do it for yourself, in anticipation of the moment when in six months’ time you realise you’ve forgotten where you were or who you were with or what the name of that street was.

Do it for other people, so that they don’t have to drive round eastern Prague four times trying to find the car park which was marked in the wrong place on the map.

Do it for the people who will stumble across your hastily scrawled notes years from now and, with a sudden flash of inspiration, will use them as the foundation to build myriad new and wonderful things.

My memory is terrible, but I wrote down all the embarrassing stories so that they’ll never be forgotten

3. Respect experience

Asking questions is a really really good thing to do. It’s one of the best ways to learn about things and you should never be afraid to ask about something you don’t understand.

However, it’s important to remember one thing: “always ask” is not the same as “always ask right now”.

If someone with more experience than you tells you to do something, and if you know that there is almost certainly a good reason, then even if you don’t know what that reason is… you should probably do the thing.

Wait until the pressure has eased a bit before demanding an explanation. You should still ask for one, but perhaps when everyone’s a little bit less stressed.

4. Call a spade a spade

Names can be controversial.

Pavement or sidewalk? Biscuit or cookie? Dinner or tea or supper? Bun or bap or roll? GIF or GIF?

But there are some names that virtually everyone agrees on. In particular, this tends to happen if it is important that everyone agrees on the name.

For example, “police” is an important concept: it represents protection, order, assistance, and a bunch of other useful words. Pretty much all European languages have almost exactly the same spelling and pronunciation for “police” as English does.

How to say “police” in the 18 different European languages which we came across during our trip

This means that if you speak any one of these languages, you can travel to any place where they speak any one of the others; and even if you’re in an unfamiliar environment where your understanding is limited, you aren’t completely on your own. If you need help, you can yell “POLICE!”, and someone in a uniform will probably come running.

Unless you’re in Hungary, because Hungarian is very strange.

… actually, someone will come running even in Hungary, because virtually everyone there speaks English as a second language. They have to, because very few people choose to learn Hungarian as a second language – it is only really spoken in Hungary, and as previously mentioned, it really is very strange. Nevertheless, it is the first language of around 12 million people, so there’s a reasonable chance that at some point you’ll need to find a friendly Hungarian to do some translation for you.

I suppose there are two points to take from this little section. Firstly, if you call things by more or less the same name as everyone else does, then this will usually help to improve shared understanding and will aid communication. Secondly, people who can speak multiple languages – and especially less widely-spoken languages – are super super valuable!

5. Call a spade a spade, but that doesn’t mean you should assume/demand that everyone else is going to call every single item in their toolshed by exactly the same names as you call all the things which you have in YOUR toolshed

Just to add an important caveat to the previous section: sure, it’s helpful if someone speaks the same language as you, and even more exciting if you realise they speak it with the same accent. But once you’ve traded your initial stories, that gets boring quite quickly.

Plus, you’re definitely going to struggle to make new friends if you go around loudly insisting that everyone speaks to you in your language, and getting angry or patronising people if they get something “wrong”. Socialise, compromise, learn.

6. New is often exciting, but exciting doesn’t have to be new

Humans have been around for a while now, which means we’ve already gone to most places. If you want to go somewhere no-one else has been before then your options are already fairly limited. If you add the complication of getting there in the first place, and the fairly high probability that you won’t find anything particularly interesting there anyway, then it begins to look like a bit of a daunting prospect.

However.

You don’t have to go somewhere no-one else has been before. You can go to the same places and do the same things that someone else has already done, and as long as you’re enjoying yourself, it really doesn’t matter that someone has been there and done it before. There’s always a slightly different route to the next place, or a slightly different angle to view something from, or something to take inspiration from when you’re planning your next adventure.

Maybe one day in the future, you’ll decide that you want a bigger challenge. Then you can dust off your old maps and start thinking about making that expedition out into the middle of nowhere. But there are plenty of other wonderful places to go and things to do first – and honestly, some of those places really are well worth a visit.

If you see an awesome thing that someone else has already done, don’t be afraid to recreate it yourself (or to take photos of your friend recreating it)

7. Get out there and do stuff

There is so much out there.

No really, there is SO MUCH out there.

Go to places. Meet people. Talk to those people, then find more people. Read stuff, write stuff, look at things, show your friends, share opinions, debate stuff, be creative, demand feedback, ask questions, learn things, challenge yourself, pass on your passion, and while you’re busy doing all that never let anyone take away the thing that makes you you.

Go right now and carry on being awesome.