GitHub Repository: better-data-science/TensorFlow
Path: blob/main/002_TensorFlow_Regression.ipynb
Kernel: Python 3 (ipykernel)

Dataset import and exploration

import numpy as np
import pandas as pd

df = pd.read_csv('data/data.csv')
df.sample(5)
(4600, 18)
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
dtype: int64

Drop columns we won't need

to_drop = ['date', 'street', 'statezip', 'country']
df = df.drop(to_drop, axis=1)
df.head()

Feature engineering

  • Houses that weren't renovated have yr_renovated = 0

  • Here's how to get the earliest renovation year

df[df['yr_renovated'] != 0]['yr_renovated'].min()
  • Let's create a couple of features:

    • House age

    • Was the house renovated?

    • Was the renovation recent? (10 years or less)

    • Was the renovation not that recent? (more than 10 years ago, but no more than 30)

  • We'll then drop the original features

# How old is the house?
df['house_age'] = [2021 - yr_built for yr_built in df['yr_built']]

# Was the house renovated and was the renovation recent?
df['was_renovated'] = [1 if yr_renovated != 0 else 0 for yr_renovated in df['yr_renovated']]
df['was_renovated_10_yrs'] = [1 if (2021 - yr_renovated) <= 10 else 0 for yr_renovated in df['yr_renovated']]
df['was_renovated_30_yrs'] = [1 if 10 < (2021 - yr_renovated) <= 30 else 0 for yr_renovated in df['yr_renovated']]

# Drop original columns
df = df.drop(['yr_built', 'yr_renovated'], axis=1)
df.head()
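The list comprehensions above can also be written as vectorized pandas expressions. The following is a minimal sketch on a made-up three-row frame; the toy values and the `REFERENCE_YEAR` name are illustrative and not part of the notebook.

```python
import pandas as pd

REFERENCE_YEAR = 2021  # same reference year the notebook hard-codes

# Hypothetical rows: never renovated, renovated 6 years ago, renovated 26 years ago
toy = pd.DataFrame({
    "yr_built": [1955, 2005, 1990],
    "yr_renovated": [0, 2015, 1995],
})

years_since_reno = REFERENCE_YEAR - toy["yr_renovated"]

features = pd.DataFrame({
    "house_age": REFERENCE_YEAR - toy["yr_built"],
    # yr_renovated == 0 means the house was never renovated
    "was_renovated": (toy["yr_renovated"] != 0).astype(int),
    # For non-renovated houses the difference is 2021, so both flags stay 0
    "was_renovated_10_yrs": (years_since_reno <= 10).astype(int),
    "was_renovated_30_yrs": ((years_since_reno > 10) & (years_since_reno <= 30)).astype(int),
})

print(features)
```

Both approaches should produce the same columns; the vectorized form mainly avoids repeating the Python-level loop for each feature.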
  • The city column has a lot of distinct values:

Seattle                1573
Renton                  293
Bellevue                286
Redmond                 235
Issaquah                187
Kirkland                187
Kent                    185
Auburn                  176
Sammamish               175
Federal Way             148
Shoreline               123
Woodinville             115
Maple Valley             96
Mercer Island            86
Burien                   74
Snoqualmie               71
Kenmore                  66
Des Moines               58
North Bend               50
Covington                43
Duvall                   42
Lake Forest Park         36
Bothell                  33
Newcastle                33
SeaTac                   29
Tukwila                  29
Vashon                   29
Enumclaw                 28
Carnation                22
Normandy Park            18
Clyde Hill               11
Medina                   11
Fall City                11
Black Diamond             9
Ravensdale                7
Pacific                   6
Algona                    5
Yarrow Point              4
Skykomish                 3
Preston                   2
Milton                    2
Inglewood-Finn Hill       1
Snoqualmie Pass           1
Beaux Arts Village        1
Name: city, dtype: int64
  • Let's declare a function that will get rid of all city values that don't occur often

  • The original value will be replaced with 'Rare':

def remap_location(data: pd.DataFrame, location: str, threshold: int = 50) -> str:
    if len(data[data['city'] == location]) < threshold:
        return 'Rare'
    return location
  • Test:

remap_location(data=df, location='Seattle')
remap_location(data=df, location='Fall City')
df['city'] = df['city'].apply(lambda x: remap_location(data=df, location=x))
df.sample(10)
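Calling `remap_location` through `apply` filters the whole DataFrame once per row. A possible alternative, sketched below under the assumption that the same "fewer than `threshold` occurrences" rule is wanted, counts each value once with `value_counts`. The `remap_rare` helper and the toy city counts are hypothetical and not from the notebook.

```python
import pandas as pd

def remap_rare(values: pd.Series, threshold: int = 50) -> pd.Series:
    """Replace values occurring fewer than `threshold` times with 'Rare'."""
    # Map every row to the frequency of its own value
    counts = values.map(values.value_counts())
    # Keep frequent values, overwrite the rest
    return values.where(counts >= threshold, "Rare")

# Toy data: one frequent city and two infrequent ones
cities = pd.Series(["Seattle"] * 60 + ["Fall City"] * 11 + ["Preston"] * 2)
remapped = remap_rare(cities)

print(remapped.value_counts())
```

On the notebook's data this would be used as `df['city'] = remap_rare(df['city'])`, assuming `df` still holds the original city names.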

Target variable visualization

import matplotlib.pyplot as plt
from matplotlib import rcParams

rcParams['figure.figsize'] = (16, 6)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
plt.hist(df['price'], bins=100);
[Figure: histogram of price, 100 bins]
  • The distribution is highly skewed, so let's calculate Z-scores and remove outliers (assuming the distribution is otherwise normal)

from scipy import stats

df['price_z'] = np.abs(stats.zscore(df['price']))
df.head()
df = df[df['price_z'] <= 3]
df.shape
(4566, 17)
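To make the filtering rule concrete, here is a small NumPy-only sketch of the same idea: compute absolute Z-scores and keep rows within three standard deviations. It uses the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`. The price values are made up for illustration.

```python
import numpy as np

# 20 hypothetical "ordinary" prices plus one extreme value
prices = np.concatenate([np.linspace(200_000, 600_000, 20), [10_000_000]])

# Absolute Z-score: distance from the mean in units of standard deviation
z_scores = np.abs((prices - prices.mean()) / prices.std(ddof=0))

# Keep only observations within 3 standard deviations of the mean
kept = prices[z_scores <= 3]

print(len(prices), len(kept))
```

Note that extreme values inflate both the mean and the standard deviation, so on heavily skewed data a single pass may leave some skew behind, as the next histogram in the notebook shows.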
plt.hist(df['price'], bins=100);
[Figure: histogram of price after removing Z-score outliers, 100 bins]
  • Still a bit of skew present

  • There seem to be houses selling for $0

    • Let's remove them:

df[df['price'] == 0]
df = df[df['price'] != 0]
plt.hist(df['price'], bins=100);
[Figure: histogram of price after removing $0 sales, 100 bins]
df = df.drop('price_z', axis=1)

Data preparation for ML

  • We'll MinMaxScale the numerical features and one-hot encode the categorical ones

  • The features waterfront, was_renovated, was_renovated_10_yrs and was_renovated_30_yrs are left out of the transformer, since they're already binary (0 or 1)

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

transformer = make_column_transformer(
    (MinMaxScaler(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'house_age']),
    (OneHotEncoder(handle_unknown='ignore'), ['bedrooms', 'bathrooms', 'floors', 'view', 'condition'])
)
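One detail worth checking: `make_column_transformer` uses `remainder='drop'` by default, so columns that are not listed in any transformer do not appear in the output at all. The sketch below, on a made-up three-column frame, compares the default with `remainder='passthrough'`. Whether the notebook intends the binary columns to reach the model is not stated, so treat this as an illustration of the option rather than a correction.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical miniature of the housing frame
toy = pd.DataFrame({
    "sqft_living": [1000, 2000, 3000],
    "floors": [1.0, 2.0, 1.0],
    "waterfront": [0, 1, 0],  # already binary, not listed in any transformer
})

# Default behaviour: unlisted columns are dropped
dropping = make_column_transformer(
    (MinMaxScaler(), ["sqft_living"]),
    (OneHotEncoder(handle_unknown="ignore"), ["floors"]),
)

# Same transformer, but unlisted columns are appended unchanged
keeping = make_column_transformer(
    (MinMaxScaler(), ["sqft_living"]),
    (OneHotEncoder(handle_unknown="ignore"), ["floors"]),
    remainder="passthrough",
)

dropped_shape = dropping.fit_transform(toy).shape  # 1 scaled + 2 one-hot columns
kept_shape = keeping.fit_transform(toy).shape      # plus the waterfront column

print(dropped_shape, kept_shape)
```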
  • Train/test split - 80:20:

from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
((3613, 15), (904, 15))
  • Let's apply the transformations:

# Fit on the train set
transformer.fit(X_train)

# Apply the transformation
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
X_train.shape, X_test.shape
((3613, 53), (904, 53))
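Fitting on the train set only means the scaler learns its minimum and maximum from training rows, so test values outside that range can land outside [0, 1]. A minimal sketch with made-up square-footage values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature splits
train = np.array([[1000.0], [2000.0], [3000.0]])
test = np.array([[500.0], [4000.0]])

# Learn min (1000) and max (3000) from the training rows only
scaler = MinMaxScaler().fit(train)

scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)  # reuses the training min/max

print(scaled_train.ravel(), scaled_test.ravel())
```

This is expected behaviour and not a bug: the test set is treated as unseen data, so no information from it leaks into the preprocessing step.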
  • Sparse array format:

<3613x53 sparse matrix of type '<class 'numpy.float64'>' with 33918 stored elements in Compressed Sparse Row format>
  • Convert to array:

array([[0.21438849, 0.33897196, 0.21438849, ..., 1. , 0. , 0. ],
       [0.26043165, 0.00742988, 0.10503597, ..., 0. , 0. , 1. ],
       [0.55251799, 0.02588045, 0.55251799, ..., 1. , 0. , 0. ],
       ...,
       [0.27194245, 0.01478794, 0.27194245, ..., 1. , 0. , 0. ],
       [0.56115108, 0.00799192, 0.4028777 , ..., 0. , 0. , 1. ],
       [0.21007194, 0.01236491, 0.21007194, ..., 1. , 0. , 0. ]])
X_train = X_train.toarray()
X_test = X_test.toarray()

Model training

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
  • RMSE is a convenient metric here, as the error is expressed in the same units as the target variable

def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))
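To see what the metric computes, here is a NumPy-only equivalent applied to two made-up price predictions. The `rmse_np` name and the sample values are illustrative and not part of the notebook.

```python
import numpy as np

def rmse_np(y_true, y_pred) -> float:
    """Root mean squared error: sqrt(mean((y_pred - y_true)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(np.square(y_pred - y_true))))

# Errors of +10,000 and -30,000 dollars
value = rmse_np([300_000, 500_000], [310_000, 470_000])

print(round(value, 2))
```

The squared errors are 1e8 and 9e8, their mean is 5e8, and the square root is roughly 22,361, which reads directly as a dollar amount.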
  • Really simple model:

tf.random.set_seed(42)

model = Sequential([
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1)
])

model.compile(
    loss=rmse,
    optimizer=Adam(),
    metrics=[rmse]
)

model.fit(X_train, y_train, epochs=100)
Metal device set to: Apple M1
Epoch 1/100
12/113 [==>...........................] - ETA: 0s - loss: 631261.2500 - rmse: 631261.2500
<tensorflow.python.keras.callbacks.History at 0x299dc4610>

  • Predict on the test set:

predictions = model.predict(X_test)
array([[ 500118.97],
       [ 597861.2 ],
       [1233606.4 ],
       [ 277795.9 ],
       [ 320446.3 ]], dtype=float32)
  • Convert to a 1D array before visualization:

predictions = np.ravel(predictions)
predictions[:5]
array([ 500118.97, 597861.2 , 1233606.4 , 277795.9 , 320446.3 ], dtype=float32)
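Flattening can matter beyond plotting. With NumPy broadcasting, subtracting a 1D target of shape `(n,)` from predictions of shape `(n, 1)` yields an `(n, n)` matrix of pairwise differences instead of `n` element-wise errors. A small sketch with made-up numbers:

```python
import numpy as np

# Model output has shape (n, 1); targets usually have shape (n,)
preds_2d = np.array([[1.0], [2.0], [3.0]])
targets = np.array([1.0, 2.0, 3.0])

# Broadcasting (3, 1) against (3,) silently produces a (3, 3) matrix
broadcast_diff = preds_2d - targets

# After flattening, the subtraction is element-wise as intended
flat_diff = np.ravel(preds_2d) - targets

print(broadcast_diff.shape, flat_diff.shape)
```

Here the predictions are perfect, so `flat_diff` is all zeros, while `broadcast_diff` contains non-zero off-diagonal entries that would inflate any error metric computed from it.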
rmse(y_test, predictions).numpy()