CoCalc provides the best real-time collaborative environment for Jupyter Notebooks, LaTeX documents, and SageMath, scalable from individual users to large groups and classes!
Path: blob/main/002_TensorFlow_Regression.ipynb
Dataset import and exploration
Drop columns we won't need
Feature engineering
Houses that weren't renovated have yr_renovated = 0
Here's how to get the first renovation year
Let's create a couple of features:
House age
Was the house renovated?
Was the renovation recent? (10 years or less)
Was the renovation not that recent? (more than 10 years ago but less than 30)
We'll then drop the original features
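The features listed above could be derived roughly as in the sketch below. Only yr_renovated and the three was_renovated* names come from this notebook; the yr_built column, the house_age name, the toy rows, and the REFERENCE_YEAR constant are assumptions made for illustration.

```python
import pandas as pd

# Toy rows; `yr_built` and REFERENCE_YEAR are assumptions, not taken from the dataset.
df = pd.DataFrame({
    "yr_built": [1955, 1990, 2005, 1970],
    "yr_renovated": [0, 2010, 0, 1992],
})
REFERENCE_YEAR = 2014  # hypothetical "current" year used to compute ages

# House age in years.
df["house_age"] = REFERENCE_YEAR - df["yr_built"]

# yr_renovated == 0 marks a house that was never renovated.
df["was_renovated"] = (df["yr_renovated"] != 0).astype(int)

# Years since the renovation; only meaningful for renovated houses.
years_since = REFERENCE_YEAR - df["yr_renovated"]
renovated = df["was_renovated"] == 1

# Recent renovation: 10 years ago or less.
df["was_renovated_10_yrs"] = (renovated & (years_since <= 10)).astype(int)

# Older renovation: more than 10 but less than 30 years ago.
df["was_renovated_30_yrs"] = (
    renovated & (years_since > 10) & (years_since < 30)
).astype(int)

# Drop the original columns once the derived features exist.
df = df.drop(columns=["yr_built", "yr_renovated"])
print(df)
```

Masking with the renovated flag matters: without it, a house with yr_renovated = 0 would produce a meaningless "years since renovation" value.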
The city column has a lot of distinct values
Let's declare a function that will get rid of all city values that don't occur often
The original value will be replaced with 'Rare':
Test:
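One possible shape for such a function is sketched below. The name remap_rare, the threshold value, and the toy city list are hypothetical; the notebook's own function may differ in signature and cutoff.

```python
import pandas as pd

def remap_rare(values: pd.Series, threshold: int) -> pd.Series:
    """Replace values occurring fewer than `threshold` times with 'Rare'."""
    # Map every entry to the number of times its value occurs in the column.
    counts = values.map(values.value_counts())
    # Keep frequent values unchanged; overwrite the rest with 'Rare'.
    return values.where(counts >= threshold, "Rare")

# Toy data: 'Yarrow Point' occurs only once, so it falls below the threshold.
cities = pd.Series(["Seattle"] * 5 + ["Bellevue"] * 3 + ["Yarrow Point"])
remapped = remap_rare(cities, threshold=3)
print(remapped.value_counts())
```

Grouping infrequent cities this way keeps the later one-hot encoding from producing a column for every city that appears only a handful of times.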
Target variable visualization
The distribution is highly skewed, so let's calculate Z-scores and remove outliers (assuming the distribution is otherwise normal)
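A minimal sketch of Z-score filtering follows. The cutoff of 3 standard deviations is a common convention and an assumption here, as are the toy prices; the notebook may use a different threshold.

```python
import pandas as pd

# Toy prices: 20 ordinary values plus one extreme value.
prices = pd.Series([300_000 + 5_000 * i for i in range(20)] + [5_000_000])

# Z-score: distance from the mean measured in standard deviations.
z_scores = (prices - prices.mean()) / prices.std()

# Keep rows within 3 standard deviations of the mean (assumed cutoff).
Z_CUTOFF = 3
kept = prices[z_scores.abs() <= Z_CUTOFF]

print(f"{len(prices) - len(kept)} outlier(s) removed, {len(kept)} rows kept")
```

Because the mean and standard deviation are themselves pulled by the outliers, this filter is only a rough first pass, which fits the residual skew noted next.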
Still a bit of skew present
There seem to be houses selling for $0
Let's remove them:
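Dropping those rows can be done with a boolean mask, as in this small sketch. The column name price and the toy rows are assumptions.

```python
import pandas as pd

# Toy frame with two $0 sales; `price` is an assumed column name.
df = pd.DataFrame({
    "price": [0, 250_000, 0, 410_000],
    "bedrooms": [3, 2, 4, 3],
})

# Keep only rows with a strictly positive price, then renumber the index.
df = df[df["price"] > 0].reset_index(drop=True)
print(df)
```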
Data preparation for ML
We'll MinMaxScale the numerical features and one-hot encode the categorical ones
The features waterfront, was_renovated, was_renovated_10_yrs, and was_renovated_30_yrs are left untransformed, since they already hold only the values 0 and 1
Train/test split - 80:20:
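An 80:20 split can be sketched with scikit-learn's train_test_split. The toy columns and the random_state value are assumptions chosen only to make the example reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data; `sqft_living` and `price` are assumed column names.
df = pd.DataFrame({
    "sqft_living": [900, 1200, 1500, 1800, 2100, 2400, 2700, 3000, 3300, 3600],
    "price": [200_000, 250_000, 300_000, 340_000, 400_000,
              450_000, 500_000, 560_000, 610_000, 680_000],
})

X = df.drop(columns=["price"])  # features
y = df["price"]                 # target

# test_size=0.2 reserves 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With 10 rows this leaves 8 for training and 2 for testing.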
Let's apply the transformations:
Sparse array format:
Convert to array:
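The scaling, encoding, and dense conversion steps might look like the sketch below. The sqft_living column, the toy rows, the use of remainder="passthrough" for the 0/1 column, and the to_dense helper are all assumptions for illustration; the notebook's column lists may differ.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy train/test frames; column choices are illustrative.
X_train = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 3000],
    "city": ["Seattle", "Bellevue", "Rare", "Seattle"],
    "waterfront": [0, 1, 0, 0],
})
X_test = pd.DataFrame({
    "sqft_living": [2500],
    "city": ["Kirkland"],  # category not seen during fitting
    "waterfront": [1],
})

transformer = make_column_transformer(
    (MinMaxScaler(), ["sqft_living"]),                   # scale to [0, 1]
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),  # unseen city -> all zeros
    remainder="passthrough",                             # keep 0/1 columns as they are
)

# Fit on the training data only, then apply to both sets.
transformer.fit(X_train)
X_train_t = transformer.transform(X_train)
X_test_t = transformer.transform(X_test)

def to_dense(matrix):
    """Hypothetical helper: return a dense NumPy array for sparse or dense input."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

X_train_d = to_dense(X_train_t)
X_test_d = to_dense(X_test_t)
print(X_train_d.shape, X_test_d.shape)
```

The one-hot encoder produces sparse output by default, and the combined result may or may not be sparse depending on its density, which is why the helper checks for a toarray method before converting.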
Model training
RMSE is a convenient metric here, as the error is expressed in the same units as the target variable
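For reference, RMSE is the square root of the mean squared difference between actual and predicted values. The NumPy sketch below uses made-up prices; the function name rmse is illustrative and not part of any library.

```python
import numpy as np

def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up actual and predicted prices, in dollars.
y_true = [300_000, 450_000, 500_000]
y_pred = [310_000, 430_000, 500_000]

error = rmse(y_true, y_pred)
print(f"RMSE: ${error:,.2f}")
```

Since the squared errors are square-rooted at the end, the result is again in dollars, so it can be read directly against typical house prices.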
Really simple model:
Predict on the test set:
Convert to a 1D array before visualization:
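A single-output Keras model returns predictions with shape (n_samples, 1), so flattening is needed before plotting or comparing against a 1D target. The sketch below uses a hand-written array in place of real model output.

```python
import numpy as np

# Stand-in for model.predict(...) output: one column, one row per sample.
predictions = np.array([[310_000.0], [430_000.0], [500_000.0]])
print(predictions.shape)  # 2D: (3, 1)

# np.ravel flattens the array to one dimension.
flat = np.ravel(predictions)
print(flat.shape)  # 1D: (3,)
```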