Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Project: My First Project
Views: 50
Image: ubuntu2204
Kernel: Python 3 (system-wide)
In [1]:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
/tmp/ipykernel_1763/2721995799.py in <cell line: 1>()
----> 1 from preamble import *
2 get_ipython().run_line_magic('matplotlib', 'inline')
~/introduction_to_ml_with_python/preamble.py in <module>
3 import numpy as np
4 import matplotlib.pyplot as plt
----> 5 import mglearn
6 from cycler import cycler
7
~/introduction_to_ml_with_python/mglearn/__init__.py in <module>
----> 1 from . import plots
2 from . import tools
3 from .plots import cm3, cm2
4 from .tools import discrete_scatter
5 from .plot_helpers import ReBl
~/introduction_to_ml_with_python/mglearn/plots.py in <module>
3 from .plot_animal_tree import plot_animal_tree
4 from .plot_rbf_svm_parameters import plot_svm
----> 5 from .plot_knn_regression import plot_knn_regression
6 from .plot_knn_classification import plot_knn_classification
7 from .plot_2d_separator import plot_2d_classification, plot_2d_separator
~/introduction_to_ml_with_python/mglearn/plot_knn_regression.py in <module>
5 from sklearn.metrics import euclidean_distances
6
----> 7 from .datasets import make_wave
8 from .plot_helpers import cm3
9
~/introduction_to_ml_with_python/mglearn/datasets.py in <module>
3 import os
4 from scipy import signal
----> 5 from sklearn.datasets import load_boston
6 from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
7 from sklearn.datasets import make_blobs
/usr/local/lib/python3.10/dist-packages/sklearn/datasets/__init__.py in __getattr__(name)
155 <https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
156 """)
--> 157 raise ImportError(msg)
158 try:
159 return globals()[name]
ImportError:
`load_boston` has been removed from scikit-learn since version 1.2.
The Boston housing prices dataset has an ethical problem: as
investigated in [1], the authors of this dataset engineered a
non-invertible variable "B" assuming that racial self-segregation had a
positive impact on house prices [2]. Furthermore the goal of the
research that led to the creation of this dataset was to study the
impact of air quality but it did not give adequate demonstration of the
validity of this assumption.
The scikit-learn maintainers therefore strongly discourage the use of
this dataset unless the purpose of the code is to study and educate
about ethical issues in data science and machine learning.
In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np
    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset and the
Ames housing dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.
[1] M Carlisle.
"Racist data destruction?"
<https://medium.com/@docintangible/racist-data-destruction-113e3eff54a8>
[2] Harrison Jr, David, and Daniel L. Rubinfeld.
"Hedonic housing prices and the demand for clean air."
Journal of environmental economics and management 5.1 (1978): 81-102.
<https://www.researchgate.net/publication/4974606_Hedonic_housing_prices_and_the_demand_for_clean_air>
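The traceback ends in scikit-learn's own message: `load_boston` was removed in version 1.2, so `mglearn/datasets.py` fails at import time and `from preamble import *` fails with it. A minimal sketch of a guard follows. It assumes you only want to detect whether the loader can still be imported before choosing a fallback dataset. The helper name `boston_loader_available` is hypothetical and not part of mglearn or scikit-learn.

```python
def boston_loader_available():
    """Return True if sklearn.datasets still exposes load_boston."""
    try:
        # Raises ImportError on scikit-learn 1.2 and later,
        # and also when scikit-learn itself is not installed.
        from sklearn.datasets import load_boston  # noqa: F401
    except ImportError:
        return False
    return True


print("load_boston importable:", boston_loader_available())
```

If this prints `False`, code that depends on `load_boston` has to be changed to use one of the alternatives named in the error message, or run against an older scikit-learn release.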
Introduction
Why Machine Learning?
Problems Machine Learning Can Solve
Knowing Your Task and Knowing Your Data
Why Python?
scikit-learn
Installing scikit-learn
Essential Libraries and Tools
Jupyter Notebook
NumPy
In [2]:
x:
[[1 2 3]
[4 5 6]]
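The input cell is not shown on this page. A minimal sketch that produces an array with the printed contents, assuming it was built from nested Python lists:

```python
import numpy as np

# A 2x3 integer array built from nested lists (one inner list per row).
x = np.array([[1, 2, 3], [4, 5, 6]])
print("x:\n{}".format(x))
```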
SciPy
In [3]:
NumPy array:
[[1. 0. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]]
In [4]:
SciPy sparse CSR matrix:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0
In [5]:
COO representation:
(0, 0) 1.0
(1, 1) 1.0
(2, 2) 1.0
(3, 3) 1.0
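The three outputs above show the same 4x4 identity matrix in dense, CSR, and COO form. The cells that produced them are not shown, so the following is a sketch under that assumption. The variable names are illustrative, and the exact printed format of sparse matrices differs between SciPy versions.

```python
import numpy as np
from scipy import sparse

# Dense 4x4 identity matrix: ones on the diagonal, zeros elsewhere.
eye = np.eye(4)
print("NumPy array:\n{}".format(eye))

# CSR (compressed sparse row) stores only the non-zero entries.
sparse_matrix = sparse.csr_matrix(eye)
print("SciPy sparse CSR matrix:\n{}".format(sparse_matrix))

# COO (coordinate) format is built from values plus row/column indices.
data = np.ones(4)
row_indices = np.arange(4)
col_indices = np.arange(4)
eye_coo = sparse.coo_matrix((data, (row_indices, col_indices)))
print("COO representation:\n{}".format(eye_coo))
```

Building the COO matrix directly from indices avoids ever allocating the dense array, which matters when the matrix is too large to fit in memory.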
matplotlib
In [6]:
[<matplotlib.lines.Line2D at 0x1be867b9748>]
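The output above is the return value of a plotting call: `plt.plot` returns a list with one `Line2D` object per plotted line. The plotted data is not recoverable from this page, so the sine curve below is an assumption chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# 100 evenly spaced points between -10 and 10 (assumed input data).
x = np.linspace(-10, 10, 100)
y = np.sin(x)

# plt.plot draws y against x and returns a list of Line2D objects.
lines = plt.plot(x, y, marker="x")
print(lines)
```

In a notebook, `%matplotlib inline` displays the figure directly below the cell.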
pandas
In [7]:
In [8]:
mglearn
Python 2 versus Python 3
Versions Used in this Book
In [9]:
Python version: 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)]
pandas version: 1.0.3
matplotlib version: 3.1.3
NumPy version: 1.18.1
SciPy version: 1.4.1
IPython version: 7.13.0
scikit-learn version: 0.24.dev0
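Note that this recorded output reports scikit-learn 0.24.dev0 on Python 3.7.6, whereas the traceback in cell 1 comes from a Python 3.10 environment in which `load_boston` has already been removed (scikit-learn 1.2 or later). To check which versions your own kernel uses, a sketch along these lines may help. The helper `library_versions` is hypothetical. It reports `None` for a library that is not installed instead of raising.

```python
import sys
from importlib import import_module


def library_versions(names):
    """Map each module name to its __version__, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = import_module(name)
        except ImportError:
            versions[name] = None  # library not installed in this environment
        else:
            versions[name] = getattr(module, "__version__", "unknown")
    return versions


print("Python version:", sys.version)
names = ["pandas", "matplotlib", "numpy", "scipy", "IPython", "sklearn"]
for name, version in library_versions(names).items():
    print("{} version: {}".format(name, version))
```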
A First Application: Classifying Iris Species
Meet the Data
In [10]:
In [11]:
Keys of iris_dataset:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
In [12]:
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, pre
...
In [13]:
Target names: ['setosa' 'versicolor' 'virginica']
In [14]:
Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [15]:
Type of data: <class 'numpy.ndarray'>
In [16]:
Shape of data: (150, 4)
In [17]:
First five rows of data:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
In [18]:
Type of target: <class 'numpy.ndarray'>
In [19]:
Shape of target: (150,)
In [20]:
Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
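The outputs in this section describe the object returned by scikit-learn's `load_iris`. The input cells are missing from this page, so here is a minimal sketch that inspects the same fields. The set of keys can differ slightly between scikit-learn versions.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch, which behaves like a dictionary.
iris_dataset = load_iris()

print("Keys of iris_dataset:\n{}".format(iris_dataset.keys()))
print("Target names: {}".format(iris_dataset["target_names"]))
print("Feature names:\n{}".format(iris_dataset["feature_names"]))
print("Type of data: {}".format(type(iris_dataset["data"])))
print("Shape of data: {}".format(iris_dataset["data"].shape))
print("First five rows of data:\n{}".format(iris_dataset["data"][:5]))
print("Shape of target: {}".format(iris_dataset["target"].shape))
```

Each row of `data` is one flower and each column is one measurement. `target` holds the species as an integer from 0 to 2, indexing into `target_names`.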
Measuring Success: Training and Testing Data
In [21]:
In [22]:
X_train shape: (112, 4)
y_train shape: (112,)
In [23]:
X_test shape: (38, 4)
y_test shape: (38,)
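The shapes above (112 training rows, 38 test rows) match the default 75%/25% split of scikit-learn's `train_test_split` on 150 samples. A sketch of such a split follows. The `random_state=0` seed is an assumption, since the original cell is not shown.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# Shuffle the data and hold out 25% (the default) as a test set.
# random_state fixes the shuffle so the split is reproducible (assumed seed).
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))
```

Shuffling matters here because the targets are sorted by class. Taking the last 25% without shuffling would give a test set containing only label 2.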
First Things First: Look at Your Data
In [24]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE868F9C88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE869714C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE869A5D48>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE869DFE08>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86A18E48>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86A4FEC8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86A88F88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86AC8088>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86ACEC48>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86B06D88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86B71188>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86BAA208>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86BE22C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86C1A388>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86C54408>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001BE86C8C3C8>]],
dtype=object)
Building Your First Model: k-Nearest Neighbors
In [25]:
In [26]:
KNeighborsClassifier(n_neighbors=1)
Making Predictions
In [27]:
X_new.shape: (1, 4)
In [28]:
Prediction: [0]
Predicted target name: ['setosa']
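A sketch of the fit-then-predict steps behind the outputs above. The measurements in `X_new` and the `random_state=0` seed are assumptions for illustration, since the page does not show them. Note that scikit-learn expects a two-dimensional array even for a single sample, hence the shape `(1, 4)`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)

# With n_neighbors=1 the prediction is the label of the closest training point.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# One hypothetical flower: sepal length, sepal width, petal length, petal width (cm).
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset["target_names"][prediction]))
```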
Evaluating the Model
In [29]:
Test set predictions:
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
2]
In [30]:
Test set score: 0.97
In [31]:
Test set score: 0.97
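The same score appears twice above, which is consistent with computing accuracy in two equivalent ways: by comparing predictions to the true labels by hand, and by calling the classifier's `score` method. A sketch, again assuming `random_state=0` for the split:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0  # assumed seed
)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Accuracy by hand: fraction of test samples whose prediction matches the label.
y_pred = knn.predict(X_test)
manual_score = np.mean(y_pred == y_test)
print("Test set predictions:\n{}".format(y_pred))
print("Test set score: {:.2f}".format(manual_score))

# The score method of a classifier computes the same accuracy.
method_score = knn.score(X_test, y_test)
print("Test set score: {:.2f}".format(method_score))
```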
Summary and Outlook
In [32]:
Test set score: 0.97