The goal is to fit a simple linear regression model that estimates the natural logarithm of average MPG for a vehicle from the vehicle's weight. The data set contains the following values for each vehicle model:
• The weight of the vehicle, measured in pounds.
• The average miles per gallon (MPG) for the model.
• The natural logarithm of average MPG for the model.
import pandas as pd
import matplotlib.pyplot as plt
The first task is to import and view the data.
df = pd.read_table(filepath_or_buffer='auto_data.txt', sep='\t')
weight = list(df.wt)
mpg = list(df.mpg)
ln_mpg = list(df.ln_mpg)
Each list should contain all 398 values. We confirm this by printing the length of each list.
print('Length of list weight : ' + str(len(weight)))
print('Length of list mpg : ' + str(len(mpg)))
print('Length of list ln_mpg : ' + str(len(ln_mpg)))
We will now display the values for the first 10 vehicles in the data.
print("{:>6}{:>8}{:>10}".format("Weight", "MPG", "LN_MPG"))
print("{:-^24}".format(""))
for i in range(0,10):
    print("{:>6}{:>8}{:>10}".format(weight[i], mpg[i], ln_mpg[i]))
We will now create two scatter plots: one of MPG against weight, and one of log-MPG against weight.
plt.figure(figsize=[12,4])

plt.subplot(1,2,1)
plt.scatter(weight, mpg, c='skyblue', edgecolor='k')
plt.xlabel('Weight (in Pounds)')
plt.ylabel('MPG')
plt.title('Plot of MPG against Weight')

plt.subplot(1,2,2)
plt.scatter(weight, ln_mpg, c='skyblue', edgecolor='k')
plt.xlabel('Weight (in Pounds)')
plt.ylabel('Natural Logarithm of MPG')
plt.title('Plot of Log-MPG against Weight')

plt.show()
Observations
Notice that the relationship between MPG and weight in the first scatter plot seems to have a slight bend or curve, whereas the relationship between log-MPG and weight appears to be mostly linear. Since we will be constructing a linear model, we will use log-MPG as the response variable in our model.
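To illustrate why a log transform can straighten a curved relationship, here is a minimal sketch on made-up data (not the auto data set). It assumes that MPG decays exponentially with weight, with hypothetical constants `a` and `b` chosen only for illustration. Under that assumption, ln(MPG) changes by the same amount for every equal step in weight, while MPG itself does not.

```python
import math

# Hypothetical constants, chosen only for illustration (not fitted to auto_data.txt).
a, b = 4.0, -0.0003

weights = [2000, 3000, 4000, 5000]
mpg_demo = [math.exp(a + b * w) for w in weights]
ln_mpg_demo = [math.log(m) for m in mpg_demo]

# Differences between consecutive values for each equal 1000-pound step.
mpg_steps = [mpg_demo[i + 1] - mpg_demo[i] for i in range(len(weights) - 1)]
ln_steps = [ln_mpg_demo[i + 1] - ln_mpg_demo[i] for i in range(len(weights) - 1)]

print("MPG steps:    ", [round(s, 2) for s in mpg_steps])   # sizes shrink: curved
print("ln(MPG) steps:", [round(s, 4) for s in ln_steps])    # constant: linear
```

Real data will not follow an exact exponential curve, so the log-scale plot should be expected to look only approximately linear.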
We will now split the data into training and test sets, using the first 300 observations for training and the remaining 98 for testing.
x_train = weight[:300]
x_test = weight[300:]
y_train = ln_mpg[:300]
y_test = ln_mpg[300:]
mpg_train = mpg[:300]
mpg_test = mpg[300:]

n_train = len(x_train)
n_test = len(x_test)

print('Training Set Size:' + ' ' + str(n_train))
print('Test Set Size:' + ' ' + str(n_test))
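Taking the first 300 rows for training relies on the row order of the file being unrelated to weight or MPG. If that is in doubt, one option is to shuffle the row indices before splitting. The sketch below uses made-up lists and a hypothetical helper named `shuffled_split`; the seed value is arbitrary.

```python
import random

def shuffled_split(x, y, n_train, seed=1):
    """Shuffle paired observations, then split them into training and test parts."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    x_tr = [x[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    x_te = [x[i] for i in test_idx]
    y_te = [y[i] for i in test_idx]
    return x_tr, x_te, y_tr, y_te

# Made-up data for illustration only.
demo_x = [1800, 2200, 2600, 3000, 3400, 3800, 4200, 4600]
demo_y = [3.5, 3.4, 3.3, 3.2, 3.0, 2.9, 2.8, 2.7]

x_tr, x_te, y_tr, y_te = shuffled_split(demo_x, demo_y, n_train=6)
print(len(x_tr), len(x_te))
```

Each x value stays paired with its y value because both lists are indexed by the same shuffled positions.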
We will now create scatter plots of log-MPG against weight for the training set and the test set.
plt.figure(figsize=[12,4])

plt.subplot(1,2,1)
plt.scatter(x_train, y_train, c='skyblue', edgecolor='k')
plt.xlabel('Weight (in Pounds)')
plt.ylabel('Natural Logarithm of MPG')
plt.title('Training Set')

plt.subplot(1,2,2)
plt.scatter(x_test, y_test, c='skyblue', edgecolor='k')
plt.xlabel('Weight (in Pounds)')
plt.ylabel('Natural Logarithm of MPG')
plt.title('Test Set')

plt.show()
We will start by calculating the mean of the 𝑋 values (which represent weight), and the mean of the 𝑌 values (which represent log-MPG).
mean_x = sum(x_train)/len(x_train)
mean_y = sum(y_train)/len(y_train)

print('Mean of X = {:.2f}'.format(mean_x))
print('Mean of Y = {:.4f}'.format(mean_y))
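We will now calculate 𝑆𝑥𝑥 and 𝑆𝑦𝑦, the sums of squared deviations of the training values of 𝑋 and 𝑌 from their respective means.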
Sxx = sum([((x - mean_x) ** 2) for x in x_train])
Syy = sum([((y - mean_y) ** 2) for y in y_train])

print('Sxx =' + ' ' + str(round(Sxx,2)))
print('Syy =' + ' ' + str(round(Syy,4)))
We will now calculate the variance of the training values of 𝑋 and 𝑌.
var_x = sum([((x - mean_x) ** 2) for x in x_train])/len(x_train)
var_y = sum([((y - mean_y) ** 2) for y in y_train])/len(y_train)

print('Variance of X =' + ' ' + str(round(var_x,2)))
print('Variance of Y =' + ' ' + str(round(var_y,4)))
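The variance computed here divides by n, which makes it the population variance (the sum of squared deviations divided by n). As a cross-check, the standard library's `statistics.pvariance` uses the same definition, while `statistics.variance` divides by n - 1. A small sketch on made-up numbers:

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up data
mean_v = sum(values) / len(values)
s_vv = sum((v - mean_v) ** 2 for v in values)        # sum of squared deviations

var_by_hand = s_vv / len(values)                     # divide by n
print(var_by_hand, statistics.pvariance(values))     # population variance, two ways
print(statistics.variance(values))                   # sample variance divides by n - 1
```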
We will calculate 𝑆𝑋𝑌, which we will then use to find the coefficients for our linear regression model.
Sxy = sum([((x-mean_x)*(y-mean_y)) for x, y in zip(x_train, y_train)])

print("Sxy =", round(Sxy,2))
We will now calculate the coefficients of our model.
beta_1 = Sxy / Sxx
beta_0 = mean_y - beta_1 * mean_x

print("beta_0 =", round(beta_0, 4))
print("beta_1 =", round(beta_1, 8))
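The coefficient formulas, beta_1 = Sxy / Sxx and beta_0 = mean_y - beta_1 * mean_x, can be checked on data where the answer is known in advance. The sketch below wraps them in a hypothetical helper, `fit_line`, and applies it to made-up points that lie exactly on the line y = 1 + 2x.

```python
def fit_line(x, y):
    """Return (beta_0, beta_1) for the least-squares line through paired lists x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta_1 = sxy / sxx                  # slope
    beta_0 = mean_y - beta_1 * mean_x   # intercept
    return beta_0, beta_1

# Made-up points on y = 1 + 2x.
demo_x = [1.0, 2.0, 3.0, 4.0]
demo_y = [3.0, 5.0, 7.0, 9.0]

b0, b1 = fit_line(demo_x, demo_y)
print(b0, b1)
```

Since the points are perfectly collinear, the recovered intercept and slope should match 1 and 2 up to floating-point error.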
# Plot the fitted line over the training set and the test set.
y_vals = [beta_0 + beta_1 * 1500, beta_0 + beta_1 * 5500]

plt.figure(figsize=[12,4])

plt.subplot(1,2,1)
plt.scatter(x_train, y_train, c='skyblue', edgecolor='k')
plt.plot([1500,5500], y_vals, c='crimson', lw=3)
plt.xlabel('Weight (in Pounds)')
plt.ylabel('Natural Logarithm of MPG')
plt.title('Training Set')

plt.subplot(1,2,2)
plt.scatter(x_test, y_test, c='skyblue', edgecolor='k')
plt.plot([1500,5500], y_vals, c='crimson', lw=3)
plt.xlabel('Weight (in Pounds)')
plt.ylabel('Natural Logarithm of MPG')
plt.title('Test Set')

plt.show()
We will now work toward the training r-squared score, starting with the estimated response values for the training set.
pred_y_train = [beta_0 + beta_1 * x for x in x_train]
We will now calculate the residuals for the training set.
error_y_train = [y_train[i] - pred_y_train[i] for i in range(len(y_train))]
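One quick sanity check on the residuals: for a least-squares line that includes an intercept, the training residuals sum to zero up to floating-point error. The sketch below shows this on made-up data; the variable names are illustrative and separate from those used in this walkthrough.

```python
# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)

# Least-squares slope and intercept.
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Residual = true value minus predicted value.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
print(sum(residuals))   # very close to zero
```

This property holds only on the data used to fit the line, so residuals on a separate test set need not sum to zero.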
We will now display the true values, predicted values, and residuals for the first 10 observations in the training set.
print(f"{'True y':>6} {'Pred y':>10} {'Error':>10}")
print("-" * 30)
for i in range(10):
    print(f"{y_train[i]:>6.4f} {pred_y_train[i]:>10.4f} {error_y_train[i]:>10.4f}")
We will now calculate the sum of squared errors score for the training set.
sse_train = 0
for i in range(len(y_train)):
    sse_train += error_y_train[i]**2

# Round only for display so that later calculations use the full-precision value.
print("Training SSE =", round(sse_train, 4))
We will now calculate the r-squared score for the training set.
y_train_mean = sum(y_train)/len(y_train)
syy_train = sum([(y_train[i] - y_train_mean)**2 for i in range(len(y_train))])
r2_train = 1 - sse_train/syy_train

print("Training r-squared =", round(r2_train, 4))
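Because the same r-squared calculation is needed for more than one data set, it can be convenient to wrap it in a function. The following is a sketch using a hypothetical helper name, `r_squared`, demonstrated on made-up values. It follows the formula 1 - SSE / Syy used here.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SSE/Syy, where Syy is measured around the mean of y_true."""
    mean_y = sum(y_true) / len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    syy = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - sse / syy

y_true_demo = [1.0, 2.0, 3.0, 4.0]   # made-up values

r2_perfect = r_squared(y_true_demo, y_true_demo)            # predictions equal the truth
r2_mean = r_squared(y_true_demo, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_partial = r_squared(y_true_demo, [1.5, 1.5, 3.5, 3.5])   # somewhere in between

print(r2_perfect, r2_mean, r2_partial)
```

A score of 1 means the predictions match exactly, and a score of 0 means the model does no better than always predicting the mean of the true values.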
We will now work toward the test r-squared score, starting with the estimated response values for the test set.
pred_y_test = [beta_0 + beta_1 * x for x in x_test]
We will now calculate the residuals for the test set.
error_y_test = [y_test[i] - pred_y_test[i] for i in range(len(y_test))]
We will now display the true values, predicted values, and residuals for the first 10 observations in the test set.
print(f"{'True y':>6} {'Pred y':>10} {'Error':>10}")
print("-" * 30)
for i in range(10):
    print(f"{y_test[i]:>6.4f} {pred_y_test[i]:>10.4f} {error_y_test[i]:>10.4f}")
We will now calculate the sum of squared errors score for the test set.
sse_test = 0
for i in range(len(y_test)):
    sse_test += error_y_test[i]**2

# Round only for display so that later calculations use the full-precision value.
print("Test SSE =", round(sse_test, 4))
We will now calculate the value of 𝑆𝑌𝑌 on the test set, and will then use that and the test sum of squared errors to calculate the test r-squared score.
y_test_mean = sum(y_test)/len(y_test)
syy_test = sum([(y_test[i] - y_test_mean)**2 for i in range(len(y_test))])
r2_test = 1 - sse_test/syy_test

print("Test r-squared =", round(r2_test, 4))
We will now create a plot to visualize the errors for the observations in the test set.
plt.figure(figsize=[8,4])
plt.scatter(x_test, y_test, c='skyblue', edgecolor='k')
plt.plot([1500,5250], [beta_0 + beta_1 * 1500, beta_0 + beta_1 * 5250], c='crimson', lw=3)
for i in range(n_test):
    plt.plot([x_test[i], x_test[i]], [pred_y_test[i], y_test[i]], c='coral', zorder=0)
plt.show()
We will now calculate estimates of the average MPG for the observations in our test set by exponentiating the predicted log-MPG values.
import math

# Exponentiate each predicted log-MPG value to get a prediction on the MPG scale.
pred_mpg_test = [math.exp(y) for y in pred_y_test]

print(pred_mpg_test)
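Exponentiating undoes the natural logarithm, which is why raising e to a predicted log-MPG value gives a prediction on the MPG scale. A short sketch with made-up MPG values, using `math.log` and `math.exp` from the standard library:

```python
import math

mpg_demo = [18.0, 25.5, 32.0]                    # made-up MPG values
ln_demo = [math.log(m) for m in mpg_demo]        # move to the log scale
restored = [math.exp(v) for v in ln_demo]        # move back to the MPG scale

for original, back in zip(mpg_demo, restored):
    print(original, round(back, 6))
```

Using `math.exp` avoids typing a truncated value of e by hand, which would introduce a small error into every estimate.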
We will now calculate the error in each estimate for the average MPG.
error_mpg_test = [mpg_test[i] - pred_mpg_test[i] for i in range(len(mpg_test))]
We will now display the true MPG, the estimated MPG, and the estimation error for each of the first 10 observations in the test set.
# Column widths in the header match the widths used for the rows below.
print("{:>8}{:>12}{:>9}".format("True MPG", "Pred MPG", "Error"))
print("-" * 29)
for i in range(min(10, len(mpg_test))):
    print("{:8.1f}{:12.1f}{:9.1f}".format(mpg_test[i], pred_mpg_test[i], error_mpg_test[i]))