AlmaBetter Student at almaBetter
Learn different techniques of Exploratory Data Analysis using the NYC Airbnb 2019 dataset
I’m just starting out in this field, and every day I keep learning and exploring bits and pieces of the randomness in data. I don’t shy away from finding out what significance it holds and what can be achieved with it. So far, I have spent most of my days on Stack Overflow and various data science blogs.
My journey so far has been amazing, and I still have many things to learn along the way. This is just the beginning of my exploratory data analysis work, and more content is coming. I hope this detailed analysis helps, and serves as a stepping stone for data enthusiasts looking to break into this field.
There are no rules of thumb for data exploration. If you believe machine learning alone can carry you through every data storm, trust me, it won’t. At some point you’ll find yourself struggling to improve a model’s accuracy, and that is when data exploration techniques come to your rescue. If you’re new to data exploration and want an overall idea of how to do it step by step, stay with this detailed exploration of the Airbnb 2019 dataset, which covers listings in NYC!
Airbnb is a San Francisco-based startup that offers you someone’s home as a place to stay instead of a hotel. You might think of another unicorn, OYO Hotels, which has a somewhat relatable business model, but Airbnb lets you host anyone, anywhere, with the rooms or beds available in your personal space. OYO Rooms and Airbnb are by no means similar; in fact, they are almost as opposite as the sky and the sea. With that said, let’s dive into why we are actually here. Scroll down and have a feel!
This dataset has **48,895** observations with 16 columns, and it is a mix of categorical and numeric values. I have kept this detailed analysis as simple as needed to give a basic understanding, even to someone very new to this.
Since 2008, guests and hosts have used Airbnb to expand their travel possibilities and to find a more unique, personalized way of experiencing the world. Today, Airbnb has become a one-of-a-kind service used around the world. Data analysts have become crucial for a company that provides millions of listings. These listings generate a lot of data that can be analyzed and used for security, business decisions, understanding customers’ and providers’ behavior on the platform, implementing innovative additional services, guiding marketing initiatives, and much more.
df.info() gives the most basic information about the dataset.
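Here is a minimal sketch of that first look. It does not load the real file; the toy frame below borrows a few of the dataset's column names, and every row in it is made up for illustration.

```python
import io
import pandas as pd

# Made-up rows that only mimic a few of the dataset's columns.
toy = pd.DataFrame({
    "name": ["Cozy loft", "Sunny room", "Quiet studio"],
    "host_name": ["Ana", "Ben", "Ana"],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
    "price": [150, 89, 60],
    "last_review": ["2019-05-21", None, "2019-07-05"],
})

# info() prints to stdout by default; passing a buffer lets us keep
# the report as a string as well.
buf = io.StringIO()
toy.info(buf=buf)
summary = buf.getvalue()
print(summary)

# A per-column count of missing values complements info().
missing = toy.isna().sum()
print(missing)
```

The report lists each column with its non-null count and dtype, which is usually enough to tell categorical columns from numeric ones and to spot columns with missing values.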
On basic inspection, a particular property name maps to one host name, but a host name can have multiple properties in an area. So host_name is a categorical variable here. neighbourhood_group (Manhattan, Brooklyn, Queens, Bronx, Staten Island), neighbourhood and room_type (Private room, Shared room, Entire home/apt) also fall into this category.
id, latitude, longitude, price, minimum_nights, number_of_reviews, reviews_per_month, calculated_host_listings_count and availability_365 are numerical variables, while last_review holds the date of the latest review.
A host on Airbnb can hold multiple properties in a neighbourhood group (a borough of NYC) under different host ids, while a host with a particular listing in a particular neighbourhood keeps the same host id. This is not strict: a few hosts have a different id for each listing in a neighbourhood. The data also suggests cases where a host has co-hosted someone else’s listing in a neighbourhood on Airbnb.
Don’t worry, I have provided my GitHub code repo at the end. You’re free to explore it and get the essence of the stories depicted here.
I was curious about the distribution of price over the entire dataset. After looking at the five-number summary of the data, I found something like this.
I used seaborn’s distplot to plot this distribution curve (newer seaborn releases deprecate distplot in favour of histplot and displot).
The distribution is positively skewed, with a long tail at the extreme right. The skewness is 19.118939 and the kurtosis is 585.672879. A skewness well above 1 together with such a high kurtosis indicates a good number of outliers. We will come back to this when we handle outliers!
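As a rough sketch of how such figures can be obtained, assuming pandas' `skew()` and `kurt()` are used, the snippet below computes both on a made-up price series with one extreme value. The numbers it prints belong to the toy data, not to the Airbnb dataset.

```python
import pandas as pd

# Made-up prices: nine ordinary values and one extreme one.
prices = pd.Series([50, 60, 65, 70, 80, 90, 100, 120, 150, 10000])

skewness = prices.skew()   # sample skewness (bias-corrected)
kurtosis = prices.kurt()   # excess kurtosis: 0 for a normal distribution

print(f"skewness: {skewness:.3f}")
print(f"kurtosis: {kurtosis:.3f}")

# describe() gives min, quartiles and max, i.e. the five-number summary
# plus count, mean and standard deviation.
print(prices.describe())
```

Note that `kurt()` reports excess kurtosis, so a value far above 0 already signals heavy tails.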
The famous price vs the minimum_nights!
We’ll look for the relationship between these two numerical variables using a seaborn scatter plot, as below:
There’s an interesting takeaway from the scatter plot above. What do you see? Many data points are clustered around a price of 0: some listings have a minimum number of nights but a price of 0, which looks like an anomaly in price. Let’s look at the box plot of the price column to get a feel for the outliers. Don’t worry, I’ll be handling these outlier values!
Let’s also check the correlation matrix to understand how the features relate to each other. I plotted it with a seaborn heatmap to see the strength of the relationships between the variables.
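A minimal sketch of the matrix behind such a heatmap, assuming the default Pearson correlation from pandas. The rows are invented, so the coefficients are only illustrative.

```python
import pandas as pd

# Invented rows using a few of the dataset's numeric column names.
df = pd.DataFrame({
    "price": [150, 89, 200, 60, 120, 75],
    "minimum_nights": [1, 3, 2, 30, 5, 2],
    "number_of_reviews": [45, 0, 270, 9, 74, 12],
    "reviews_per_month": [0.38, 0.0, 4.64, 0.10, 0.59, 0.40],
    "availability_365": [365, 355, 194, 0, 129, 220],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr.round(2))

# With seaborn installed, the matrix can be drawn as a heatmap:
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```

The coefficients discussed next come from the full dataset, not from these toy rows.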
There’s correlation between host_id and both reviews_per_month & availability_365 (a sequential colour bar maps value to colour). There’s also noticeable correlation between minimum_nights, calculated_host_listings_count & availability_365. Price also shows some correlation with availability_365 & calculated_host_listings_count.
number_of_reviews and reviews_per_month give almost the same information, so we can carry out the analysis with either of the two variables. Also, number_of_reviews is correlated with availability_365!
I have done most of the common data pre-processing steps, like treating missing values (many machine learning algorithms do not support them) and checking for duplicate records. But here comes a very important part of EDA that many overlook before fitting an ML model: removing outliers, so that extreme values do not distort the model.
Well, an outlier is a data point that lies outside the overall pattern of a distribution. Say we’re trying to understand how people’s income changes between the start and the end of a project. We might measure the income levels of our sample group at the start and at the end. Imagine our results follow a linear trend and look like this:
I think by now you’ve got the idea: what’s that point outside the distribution doing here? You guessed right, that’s an outlier in this distribution of data. To put it more simply, suppose there’s a list of data points like: [5,7,8,7,8,10,11,11,15,10,10000]
By now you should have the intuition for which value in this list is the outlier. I hope you guessed correctly. With that in mind, where do these outlier values come from?
Outliers can fall at the low end or the high end, and you can find both with the IQR (interquartile range) approach. A commonly used rule says that a data point is an outlier if it lies more than 1.5 × IQR above the third quartile or below the first quartile. Said differently, low outliers are below Q1 - 1.5*IQR and high outliers are above Q3 + 1.5*IQR.
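To make the rule concrete, here is a small sketch that applies it to the list of data points from earlier. It assumes numpy's default (linear interpolation) definition of quartiles; other quartile conventions can give slightly different bounds.

```python
import numpy as np

data = np.array([5, 7, 8, 7, 8, 10, 11, 11, 15, 10, 10000])

# First and third quartiles, then the interquartile range.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# The 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = data[(data < lower) | (data > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("lower bound:", lower, "upper bound:", upper)
print("outliers:", outliers.tolist())  # [10000]
```

Only 10000 falls outside the fences. The value 15 is the largest of the ordinary points, but it still sits below the upper bound, so it is kept.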
So, let’s come to the main objective here (you can of course search online to learn more about these). Another way to handle outliers is to use the numpy percentile() function to compute the 25th and 75th percentiles (Q1 and Q3) and then subset the dataframe to the resulting lower and upper bounds for the variable containing the outliers.
In this Airbnb dataset, I used the IQR approach to handle the outliers, as it performed best at removing almost all of the extreme values.
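The subsetting step could look roughly like this. The helper name `remove_outliers_iqr`, its `k` parameter and the toy prices are my own for illustration; this is a sketch of the general technique, not the exact code from the repo.

```python
import numpy as np
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep only rows whose `column` value lies within the k * IQR fences."""
    q1, q3 = np.percentile(df[column], [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] >= lower) & (df[column] <= upper)]

# Made-up prices with one extreme value and one zero-price row.
listings = pd.DataFrame({"price": [0, 60, 75, 89, 120, 150, 200, 10000]})

cleaned = remove_outliers_iqr(listings, "price")
print(cleaned["price"].tolist())
```

In this toy example the extreme price is dropped, but the zero-price row survives because the lower fence is negative. Anomalies like a price of 0 may therefore need their own explicit filter.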
Now, let’s check the box plot once again, with the outliers removed.
See how good the box plot looks? The whiskers are now clearly visible; before, they were squeezed flat by the upper-bound outliers.
Curious to see the distribution of price after removing the outliers? Let’s check it.
Now, let’s do some univariate analysis (working with single variables).
Next, I’ll show plots of the count of Airbnb listings in the different neighbourhood groups and neighbourhoods of New York City. From the plots, we can easily see where the maximum number of houses or apartments are listed on Airbnb.
By now you can easily tell which neighbourhood group has the highest number of listings in NYC.
Let’s also get an idea of the count of room types in NYC based on the listings in the different neighbourhood groups.
Manhattan has the most Entire home/apt listings, around 27% of all listed properties, followed by Brooklyn with around 19.6%.
Private rooms are most common in Brooklyn, at 20.7% of all listed properties, followed by Manhattan with 16.3% and Queens with 6.9%. Very few listings are shared rooms, and shared rooms are negligible in Staten Island and the Bronx.
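Percentages like these are shares of the total number of listings. One way to get such a table, assuming `pd.crosstab` with `normalize="all"`, is sketched below on a handful of invented rows.

```python
import pandas as pd

# Invented listings; only the two categorical columns matter here.
df = pd.DataFrame({
    "neighbourhood_group": [
        "Manhattan", "Manhattan", "Manhattan", "Brooklyn",
        "Brooklyn", "Brooklyn", "Queens", "Bronx",
    ],
    "room_type": [
        "Entire home/apt", "Entire home/apt", "Private room", "Private room",
        "Private room", "Entire home/apt", "Private room", "Shared room",
    ],
})

# normalize="all" divides every cell by the grand total, so each value
# is that combination's share of all listings (here scaled to percent).
share = pd.crosstab(
    df["neighbourhood_group"], df["room_type"], normalize="all"
) * 100
print(share.round(1))
```

Because every cell is divided by the grand total, the whole table sums to 100%, which matches the "percent of total listed properties" reading used above.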
We can infer that Brooklyn, Queens and the Bronx have more **private room** listings, while Manhattan, which has the highest number of listings in all of NYC, has more Entire home/apt listings. Now let’s look at some bivariate/multivariate analysis (working with more than one variable).
Let’s now check the distribution of price across Manhattan, Brooklyn, Queens, the Bronx & Staten Island. Instead of checking the distribution for each category one by one, we can use a violin plot to get the overall statistics for each group, including the median price per neighbourhood group.
As expected, Manhattan is the costliest place to stay, with a price of more than 140 USD, followed by Brooklyn at around 80 USD on average for its listings.
Queens and Staten Island are on the same page when it comes to listing prices.
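The numbers a violin plot summarises can also be read straight from a groupby. This sketch uses made-up prices and reports both the median and the mean per neighbourhood group, since the two can differ noticeably on skewed data.

```python
import pandas as pd

# Made-up prices per neighbourhood group.
df = pd.DataFrame({
    "neighbourhood_group": ["Manhattan"] * 3 + ["Brooklyn"] * 3 + ["Queens"] * 2,
    "price": [120, 150, 200, 70, 80, 95, 60, 70],
})

# Median and mean price per group, highest median first.
summary = (
    df.groupby("neighbourhood_group")["price"]
    .agg(["median", "mean"])
    .sort_values("median", ascending=False)
)
print(summary)
```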
The bar plot above shows the neighbourhoods with the highest average listing price per day in each neighbourhood group of NYC.
Among the top neighbourhoods in each neighbourhood group, the top two are Fort Wadsworth and Sea Gate, from Staten Island and Brooklyn respectively.
I plotted the seaborn FacetGrid above to visualize two categorical variables against a numerical variable (price) and to compare price ranges across the neighbourhood groups in NYC.
It looks like listings with Entire home/apt as room_type win the show in NYC, followed by private rooms.
Manhattan has the highest price for Entire home/apt listings, reaching nearly 240 USD/night, followed by Private room at 110 USD/night. That is no surprise, given it is the most expensive place to live in!
3. On average, how many nights did people stay in each room type?
It clearly indicates that people mostly prefer an entire home/apt, with an average stay of more than 8 nights, followed by guests in shared rooms, where the average stay is 6–7 nights.
4. How do monthly reviews vary with room type in each neighbourhood group?
Seaborn’s stripplot function always treats one of the variables as categorical and draws data at ordinal positions (0, 1, … n) on the relevant axis, even when the data has a numeric or date type. So what do we conclude from this other kind of scatter plot?
Private rooms received the most reviews per month, with Manhattan recording the highest figure for Private rooms at more than 50 reviews/month.
Manhattan & Queens got the most reviews for the Entire home/apt room type.
Shared rooms received fewer reviews than the other room types, and those came from **Staten Island** followed by the Bronx.
Row NYC holds the title of most reviewed host, with more than 40 reviews/month on average. For other insights on hosts, please check my GitHub repo (link at the bottom).
Looking at the categorical box plot above, we can infer that listings in Staten Island tend to be more available throughout the year, up to more than 300 days. On average, these listings are available for around 210 days every year, followed by the Bronx, where listings are available for 150 days a year on average.
**Now, let’s check the distribution of room types across all neighbourhood groups of NYC!**
From the two scatter plots of latitude vs longitude, we can infer that there are very few shared rooms throughout NYC compared with private rooms and Entire home/apt. 95% of the listings on Airbnb are either Private room or Entire home/apt. Very few guests opted for shared rooms on Airbnb.
Also, guests mostly prefer these room types when looking to rent on Airbnb, as we found earlier in our analysis.
**Let’s get an idea of the price variations as well, across these coordinates in a clearer way.**
This scatter plot shows the price variation across these coordinates more faithfully by drawing it over the original NYC boroughs map: I saved the map image in my local directory and read it with the cv2 imread function. We can infer that Manhattan has the highest range of prices, followed by Brooklyn and Queens, making them the costliest places to stay in NYC.
Listing availability over a year throughout NYC?
I’ve plotted a scatter plot depicting the availability of listings throughout NYC over a year, using hues and marker sizes based on the availability ranges.
The Bronx & Staten Island have listings that are mostly available throughout the year, perhaps because they are less costly than other boroughs such as Manhattan, Brooklyn & Queens.
I’ve almost reached the end of the analysis. There may be a few more analyses worth doing, but every story has an ending!
Through this exploratory data analysis and visualization, we gained several interesting insights into the Airbnb rental market. This Airbnb dataset for 2019 appeared to be a very rich dataset, with a variety of columns that allowed us to do deep data exploration on each significant column. After that, we analyzed borough and neighbourhood listing densities, which areas were more popular than others, their price variations, and their availability by room type. We also highlighted key findings such as room types and the stays guests preferred, and the top reviewed hosts and their listings. Next, we put the latitude and longitude columns to good use to create a geographical heatmap color-coded by the price of listings.
I have used Seaborn and Matplotlib for creating all the visualizations. This is just a glimpse of EDA on the Airbnb dataset, and no predictions are involved. Also, here’s my GitHub repo for the full code reference. Please do shower some claps on this if you like it.
Thanks a lot for reading. Feel free to give any feedback!