
GitHub Repository: better-data-science/TensorFlow
Path: blob/main/002_TensorFlow_Regression.ipynb
Kernel: Python 3 (ipykernel)

Dataset import and exploration

import numpy as np
import pandas as pd

df = pd.read_csv('data/data.csv')
df.sample(5)
df.shape
(4600, 18)
df.isnull().sum()
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
dtype: int64

Drop columns we won't need

to_drop = ['date', 'street', 'statezip', 'country']
df = df.drop(to_drop, axis=1)
df.head()

Feature engineering

  • Houses that weren't renovated have yr_renovated = 0

  • Here's how to get the earliest renovation year:

df[df['yr_renovated'] != 0]['yr_renovated'].min()
1912
  • Let's create a couple of features:

    • House age

    • Was the house renovated?

    • Was the renovation recent? (10 years or less)

    • Was the renovation not that recent? (more than 10 years but less than 30 ago)

  • We'll then drop the original features

# How old is the house?
df['house_age'] = [2021 - yr_built for yr_built in df['yr_built']]

# Was the house renovated and was the renovation recent?
df['was_renovated'] = [1 if yr_renovated != 0 else 0 for yr_renovated in df['yr_renovated']]
df['was_renovated_10_yrs'] = [1 if (2021 - yr_renovated) <= 10 else 0 for yr_renovated in df['yr_renovated']]
df['was_renovated_30_yrs'] = [1 if 10 < (2021 - yr_renovated) <= 30 else 0 for yr_renovated in df['yr_renovated']]

# Drop original columns
df = df.drop(['yr_built', 'yr_renovated'], axis=1)
df.head()
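  • The list comprehensions above work, but the same features can be built with vectorized pandas operations, which are typically faster on large frames. A minimal sketch, equivalent in behavior to the cell above (run it instead of that cell, since that cell drops the source columns at the end):

df['house_age'] = 2021 - df['yr_built']
# 1 if the house was ever renovated, 0 otherwise
df['was_renovated'] = (df['yr_renovated'] != 0).astype(int)
years_since = 2021 - df['yr_renovated']
df['was_renovated_10_yrs'] = (years_since <= 10).astype(int)
df['was_renovated_30_yrs'] = ((years_since > 10) & (years_since <= 30)).astype(int)
df = df.drop(['yr_built', 'yr_renovated'], axis=1)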
  • There are a lot of distinct city values:

df['city'].value_counts()
Seattle                1573
Renton                  293
Bellevue                286
Redmond                 235
Issaquah                187
Kirkland                187
Kent                    185
Auburn                  176
Sammamish               175
Federal Way             148
Shoreline               123
Woodinville             115
Maple Valley             96
Mercer Island            86
Burien                   74
Snoqualmie               71
Kenmore                  66
Des Moines               58
North Bend               50
Covington                43
Duvall                   42
Lake Forest Park         36
Bothell                  33
Newcastle                33
SeaTac                   29
Tukwila                  29
Vashon                   29
Enumclaw                 28
Carnation                22
Normandy Park            18
Clyde Hill               11
Medina                   11
Fall City                11
Black Diamond             9
Ravensdale                7
Pacific                   6
Algona                    5
Yarrow Point              4
Skykomish                 3
Preston                   2
Milton                    2
Inglewood-Finn Hill       1
Snoqualmie Pass           1
Beaux Arts Village        1
Name: city, dtype: int64
  • Let's declare a function that will get rid of all city values that don't occur often

  • The original value will be replaced with 'Rare':

def remap_location(data: pd.DataFrame, location: str, threshold: int = 50) -> str:
    if len(data[data['city'] == location]) < threshold:
        return 'Rare'
    return location
  • Test:

remap_location(data=df, location='Seattle')
'Seattle'
remap_location(data=df, location='Fall City')
'Rare'
df['city'] = df['city'].apply(lambda x: remap_location(data=df, location=x))
df.sample(10)
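  • Note that remap_location rescans the whole DataFrame once per row, which is quadratic in the number of rows. A hedged alternative that computes the counts once with value_counts() (run instead of the apply above):

counts = df['city'].value_counts()
rare_cities = counts[counts < 50].index
# Replace infrequent cities with 'Rare' in one vectorized pass
df['city'] = df['city'].where(~df['city'].isin(rare_cities), 'Rare')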

Target variable visualization

import matplotlib.pyplot as plt
from matplotlib import rcParams

rcParams['figure.figsize'] = (16, 6)
rcParams['axes.spines.top'] = False
rcParams['axes.spines.right'] = False
plt.hist(df['price'], bins=100);
[Histogram of the price distribution]
  • The distribution is highly skewed, so let's calculate Z-scores and remove the outliers (assuming the distribution is otherwise normal)

from scipy import stats

df['price_z'] = np.abs(stats.zscore(df['price']))
df.head()
df = df[df['price_z'] <= 3]
df.shape
(4566, 17)
plt.hist(df['price'], bins=100);
[Histogram of price after removing outliers]
  • Still a bit of skew present

  • There seem to be houses selling for $0

    • Let's remove them:

df[df['price'] == 0]
df = df[df['price'] != 0]
plt.hist(df['price'], bins=100);
[Histogram of price after removing $0 sales]
df = df.drop('price_z', axis=1)
df.head()
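  • As an aside, a common alternative to z-score filtering (not used in this notebook) is the IQR rule; a minimal sketch:

# Keep prices within 1.5 * IQR of the quartiles
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
df_iqr = df[df['price'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]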

Data preparation for ML

  • We'll MinMaxScale the numerical features and one-hot encode the categorical ones

  • The features waterfront, was_renovated, was_renovated_10_yrs and was_renovated_30_yrs aren't transformed, since they're already binary (0/1); note, though, that make_column_transformer drops any column it isn't given (remainder='drop' by default), so these columns, and city, won't reach the model unless passed through explicitly

from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

transformer = make_column_transformer(
    (MinMaxScaler(), ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'house_age']),
    (OneHotEncoder(handle_unknown='ignore'), ['bedrooms', 'bathrooms', 'floors', 'view', 'condition'])
)
  • Train/test split - 80:20:

from sklearn.model_selection import train_test_split

X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_test.shape
((3613, 15), (904, 15))
  • Let's apply the transformations:

# Fit on the train set
transformer.fit(X_train)

# Apply the transformation
X_train = transformer.transform(X_train)
X_test = transformer.transform(X_test)
X_train.shape, X_test.shape
((3613, 53), (904, 53))
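  • The jump from 15 to 53 columns happens because each one-hot encoded feature expands into one indicator column per distinct value, while the unlisted remainder columns are dropped. On scikit-learn 1.0+ you can inspect the generated columns (a sketch, assuming get_feature_names_out() is available in your version):

# Requires scikit-learn >= 1.0 and a fitted transformer
feature_names = transformer.get_feature_names_out()
print(len(feature_names))    # 53
print(feature_names[:5])     # the scaled numerical columns come first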
  • Sparse array format:

X_train
<3613x53 sparse matrix of type '<class 'numpy.float64'>' with 33918 stored elements in Compressed Sparse Row format>
  • Convert to a dense NumPy array:

X_train.toarray()
array([[0.21438849, 0.33897196, 0.21438849, ..., 1.        , 0.        , 0.        ],
       [0.26043165, 0.00742988, 0.10503597, ..., 0.        , 0.        , 1.        ],
       [0.55251799, 0.02588045, 0.55251799, ..., 1.        , 0.        , 0.        ],
       ...,
       [0.27194245, 0.01478794, 0.27194245, ..., 1.        , 0.        , 0.        ],
       [0.56115108, 0.00799192, 0.4028777 , ..., 0.        , 0.        , 1.        ],
       [0.21007194, 0.01236491, 0.21007194, ..., 1.        , 0.        , 0.        ]])
X_train = X_train.toarray()
X_test = X_test.toarray()

Model training

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import backend as K
Init Plugin
Init Graph Optimizer
Init Kernel
  • RMSE is a convenient metric here, as the error is expressed in the same units as the target variable

def rmse(y_true, y_pred):
    return K.sqrt(K.mean(K.square(y_pred - y_true)))
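  • A quick sanity check of the metric on toy tensors (values chosen for illustration):

# Errors are [0, 0, 1], so RMSE = sqrt(1/3) ≈ 0.5774
y_true = tf.constant([1.0, 2.0, 3.0])
y_pred = tf.constant([1.0, 2.0, 4.0])
print(rmse(y_true, y_pred).numpy())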
  • Really simple model:

tf.random.set_seed(42)

model = Sequential([
    Dense(256, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(1)
])

model.compile(
    loss=rmse,
    optimizer=Adam(),
    metrics=[rmse]
)

model.fit(X_train, y_train, epochs=100)
Metal device set to: Apple M1

2021-10-04 13:07:52.210577: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2021-10-04 13:07:52.211252: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2021-10-04 13:07:52.281322: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
2021-10-04 13:07:52.439835: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.

Epoch 1/100
113/113 [==============================] - 1s 5ms/step - loss: 606316.2500 - rmse: 606331.7500
Epoch 2/100
113/113 [==============================] - 0s 4ms/step - loss: 412443.6562 - rmse: 412272.9062
Epoch 3/100
113/113 [==============================] - 1s 5ms/step - loss: 266340.6250 - rmse: 266316.5000
...
Epoch 99/100
113/113 [==============================] - 0s 4ms/step - loss: 193601.1094 - rmse: 193566.3281
Epoch 100/100
113/113 [==============================] - 0s 4ms/step - loss: 192845.9688 - rmse: 192796.1875
<tensorflow.python.keras.callbacks.History at 0x299dc4610>
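  • You could also score the held-out set in one call with the standard Keras API (a sketch, not part of the original notebook):

# Returns the loss and each compiled metric on the test set
loss, test_rmse = model.evaluate(X_test, y_test, verbose=0)
print(test_rmse)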

  • Predict on the test set:

predictions = model.predict(X_test)
2021-10-04 13:08:42.823668: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.
predictions[:5]
array([[ 500118.97],
       [ 597861.2 ],
       [1233606.4 ],
       [ 277795.9 ],
       [ 320446.3 ]], dtype=float32)
  • Convert the predictions to a 1D array before evaluating:

predictions = np.ravel(predictions)
predictions[:5]
array([ 500118.97, 597861.2 , 1233606.4 , 277795.9 , 320446.3 ], dtype=float32)
rmse(y_test, predictions).numpy()
191119.78088467862
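  • As a cross-check, the same figure can be computed with scikit-learn (a sketch, assuming y_test and predictions from above):

from sklearn.metrics import mean_squared_error

# RMSE = sqrt(MSE); should match the Keras metric up to float precision
print(np.sqrt(mean_squared_error(y_test, predictions)))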