“Too good to be true” posts on a house rental website with Python & Beautiful Soup

Ricardo Monteiro
5 min read · Apr 30, 2021

I’ve been pondering the possibility of moving. As a result, I frequently browse house rental posts looking for potential houses at an affordable price. However, browsing these websites can be tedious and even risky: I’ve replied to some posts that turned out to be fraudulent. These posts usually offer well-located houses at a price cheaper than normal, which serves as bait for the inexperienced. Facing these issues, I decided to take action!

Could we extract the cheapest houses of a given type (number of rooms) from all the posts on a website?

Using Python, I wrote a simple script that searches for the cheapest houses, clustered by municipality and house type (based on the number of rooms, excluding kitchens, bathrooms and pantries).

For this article, I’ll use scraped data from the popular Portuguese house rental website imovirtual. We will deal only with rental posts for apartments, excluding homes for sale, stores and other types of real estate.

To start, we will need the following packages:

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from iteration_utilities import deepflatten

The next step is to look at the webpage structure and identify the tags that hold the information we need. This is required to extract the data with Beautiful Soup’s find_all method. According to the documentation we could actually omit find_all() and call the soup object directly, but I prefer to call it explicitly for clarity.
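As a quick illustration of that shorthand (a minimal sketch with made-up HTML, not part of the scraper itself), calling the soup object directly is equivalent to calling find_all:

from bs4 import BeautifulSoup

html = '<ul><li class="offer-item-price">500 €</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# these two calls return the same list of tags
explicit = soup.find_all('li', class_='offer-item-price')
shorthand = soup('li', class_='offer-item-price')
print(explicit == shorthand)  # True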

The following code extracts the data we need from the page:

# The site has several pages; we need to find how many by scraping the first page
# make an http request to the website
imo = requests.get('https://www.imovirtual.com/arrendar/apartamento/')
raw_html = imo.text  # get the webpage source as raw text
soup = bs(raw_html)  # soupify page text
# find the number of result pages from the pager element
for i in soup.find_all('ul', class_='pager'):
    pages = int(i.text.split()[-1])

# initialize url list with the first page for subsequent appending
urls = ['https://www.imovirtual.com/arrendar/apartamento/']
# append the links of the remaining pages
for page in range(2, pages + 1):
    urls.append('https://www.imovirtual.com/arrendar/apartamento/?page=' + str(page))

# initialize lists for house prices, types, locations and sizes. We'll also store links
prices = []
types = []
location = []
sizes = []
links = []
for u in urls:
    imo = requests.get(u)  # request each results page
    raw_html = imo.text  # get the webpage source as raw text
    soup = bs(raw_html)  # soupify page text
    # Find the desired data: price, location, typology and size (m2)
    # price
    for price in soup.find_all('li', class_="offer-item-price"):
        prices.append(price.string.split('€')[0].replace(' ', '').strip())
    # type
    for ty in soup.find_all('li', class_="offer-item-rooms hidden-xs"):
        types.append(ty.string)
    # location
    for loc in soup.find_all('p', class_="text-nowrap"):
        location.append(loc.text.split('Apartamento para arrendar: ')[1])
    # size (m2)
    for size in soup.find_all('li', class_="hidden-xs offer-item-area"):
        sizes.append(size.next.split(' ')[0])
    # link to the post
    for link in soup.find_all('header', class_="offer-item-header"):
        links.append(link.a['href'])

Next, we’ll clean and organize the data into a final data frame. We’ll split the location into municipality and district, since the idea is to group houses by type and municipality. The final dataframe with the data from all the posts is built with the following code:

all_houses = []  # list where all houses will be stored
all_houses.append([prices, types, location, sizes, links])
columns = ['price', 'type', 'location', 'size', 'link']  # column names for the dataframe
all_data = pd.DataFrame(all_houses[0])  # create dataframe from the list created above
all_data = all_data.transpose()  # transpose dataframe so each row is one post
all_data.columns = columns  # rename dataframe columns
all_data = all_data.drop_duplicates()  # drop duplicate rows
all_data = all_data[all_data['price'] != 'Preçosobconsulta']  # drop houses with negotiable prices
# replace house types with numeric values
type_dict = {'T2': 2, 'T3': 3, 'T1': 1, 'T4': 4, 'T0': 0, 'T5': 5, 'T6': 6, 'T8': 8, 'T7': 7, 'T10 ou superior': 10, 'T9': 9}
all_data['type'] = all_data['type'].replace(type_dict)
# split location into municipality and district columns
municipality = []
district = []
for i in all_data.location.str.split(', '):
    try:
        municipality.append(i[-2])
    except IndexError:
        municipality.append(i[-1])
    try:
        district.append(i[-1])
    except IndexError:
        district.append('')
all_data['municipality'] = municipality
all_data['district'] = district
all_data = all_data.drop(columns='location', axis=1)
# change size column to float and price to int
all_data['size'] = all_data['size'].replace(',', '.', regex=True).astype('float')
all_data['price'] = all_data['price'].replace(',', '.', regex=True).astype('float').round().astype('int')
# reset index to match the number of posts
all_data = all_data.reset_index()
# save data to a local csv file
all_data.to_csv('house_data.csv', index=False)
all_data  # show dataframe
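Before plotting, it can be useful to check how many posts end up in each municipality/type group, since sparse groups will give less reliable outlier estimates later. A quick sanity-check sketch, assuming the dataframe was saved as above:

import pandas as pd

all_data = pd.read_csv('house_data.csv')  # reload the saved dataframe

# number of posts per municipality and house type
group_counts = all_data.groupby(['municipality', 'type']).size()
print(group_counts.sort_values(ascending=False).head(10))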

To take a look at the data, we’ll build scatter plots and histograms:

data = pd.read_csv('house_data.csv', index_col=0)  # load data

# histograms of prices in each municipality for each house type
for mun in data.municipality.unique():
    for t in data.type.unique():
        data[(data.municipality == mun) & (data.type == t)].hist(column='price', bins=25, grid=False)
        plt.title(mun + ' T' + str(t))
        plt.show()

# scatter plots of size vs price for each district and house type
for dis in data.district.unique():
    for t in data.type.unique():
        data[(data.district == dis) & (data.type == t) & (data.price < 50000)].plot(x='size', y='price', kind='scatter', grid=False)
        plt.title(dis + ' T' + str(t))
        plt.show()

For big municipalities, we can see that the distribution approximates a Poisson distribution. To keep things simple, however, we can extract outliers based on NumPy’s percentile method with the following function:

# function that extracts the low-price outliers from a list of prices
def detect_outliers(data):
    # find the q1 and q3 values
    q1, q3 = np.percentile(sorted(data), [25, 75])
    # compute the IQR
    iqr = q3 - q1
    # find the lower bound; prices below it are "too good to be true"
    lower_bound = q1 - (1.5 * iqr)
    outliers = [x for x in data if x <= lower_bound]
    return outliers
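To see how the function behaves, here is a small example with made-up prices (the numbers are illustrative only): a single listing far below the rest of its group gets flagged.

# illustrative prices for one municipality/type group (made-up values)
sample_prices = [650, 700, 720, 750, 780, 800, 820, 850, 200]

print(detect_outliers(sample_prices))  # [200]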

Finally, we pick the cheapest houses in each municipality and for each house type:

house_data = pd.read_csv('house_data.csv', index_col=0)  # reload the data saved above
outlier_indexes = []  # indexes of the posts flagged as outliers
for mun in house_data.municipality.unique():
    for t in house_data.type.unique():

        prices = house_data.price[(house_data.municipality == mun) & (house_data.type == t)].to_list()

        # skip if there are no houses for rent for this municipality and type
        if prices != []:
            data_outliers = detect_outliers(prices)
        else:
            continue

        # skip if there are no outliers in the prices
        if data_outliers == []:
            continue
        else:
            for o in data_outliers:
                indexes = house_data.index[(house_data.municipality == mun) & (house_data.type == t) & (house_data.price == o)].to_list()
                outlier_indexes.append(indexes)

outlier_indexes = list(deepflatten(outlier_indexes, depth=1))
outlier_links = pd.DataFrame(house_data.link[outlier_indexes])
outlier_links = outlier_links.reset_index(drop=True)
outlier_links.to_csv('outlier_links.csv')
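With the flagged posts saved, a quick way to browse them (assuming the file name used above) is simply to reload the csv and print the links:

outlier_links = pd.read_csv('outlier_links.csv', index_col=0)
for url in outlier_links['link']:
    print(url)  # open these posts manually and judge for yourself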

The final output of the code is a csv file with the links to the houses with abnormally low prices, meaning they are either good finds, mistakes or potential frauds. It should help narrow down the search.

The goal of this post is to demonstrate how Python and Beautiful Soup can help filter data from websites. Although it’s applied to a specific website, the approach could be adapted to others. We used a simple outlier extraction method based on quartiles; more complex algorithms could be used, but this approach should suffice for a simple application like this one. Additional data, such as the condition of the apartment (new, renovated or used), the presence or absence of an elevator and other features, could help extract outliers more accurately.

All the code is available on my GitHub: https://github.com/ricmonteiro/imofind

