Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.
Math 152: Intro to Mathematical Software
Lecture 16: Pandas (part 2)
Announcements:
Peer reviews due Thursday 8pm. Request from the graders: please place all feedback in a file called "grades.txt", to make it easier to find. (If one of your assigned folders is empty, please still make "grades.txt" and report that you found an empty folder.)
Next Monday is a University holiday, so no lecture, sections, or office hours. Zonglin's office hours are rescheduled to Tuesday 10-12 (and also replace my usual Tuesday office hour).
Today: more pandas...
getting data into SMC
Series
DataFrames
1. Getting data into SMC
WARNING: SMC has absolutely no direct access to the files on your laptop. It is a remote computer account running in the cloud. That said, if you use pandas/python, etc, on your own computer at some point, it will of course have direct access. (Maybe do a live demo of
python get-pip.py
thenpip install --user pandas
on my Linux laptop.)+New, paste the link into the box, click "Download from Internet".
+New, drag and drop, upload a file.
Use
wget
orcurl
from the terminal or in a %sh cell.Use
scp
orrsync
(see project settings... SSH button) from your computer to the project.Note: the last two methods require your project to have an Internet upgrade. For this course, that is provided by UCSD.
The %sh
command simulates a Unix (or Mac) terminal from within a worksheet. As long as a command isn't interactive, it should work for you.
--2017-02-15 22:14:38-- http://spatialkeydocs.s3.amazonaws.com/FL_insurance_sample.csv.zip
Resolving spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)... 54.231.40.83
Connecting to spatialkeydocs.s3.amazonaws.com (spatialkeydocs.s3.amazonaws.com)|54.231.40.83|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 866400 (846K) [application/zip]
Saving to: ‘FL_insurance_sample.csv.zip’
FL_insurance_sample 100%[=====================>] 846.09K 2.07MB/s in 0.4s
2017-02-15 22:14:39 (2.07 MB/s) - ‘FL_insurance_sample.csv.zip’ saved [866400/866400]
2017-02-15-140805 FL_insurance_sample.csv.zip index.html nyt.png 2017-02-15.sagews __MACOSX index.shtml FL_insurance_sample.csv animals.csv nikolai.jpg
Archive: FL_insurance_sample.csv.zip
inflating: FL_insurance_sample.csv
creating: __MACOSX/
inflating: __MACOSX/._FL_insurance_sample.csv
policyID | statecode | county | eq_site_limit | hu_site_limit | fl_site_limit | fr_site_limit | tiv_2011 | tiv_2012 | eq_site_deductible | hu_site_deductible | fl_site_deductible | fr_site_deductible | point_latitude | point_longitude | line | construction | point_granularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 119736 | FL | CLAY COUNTY | 498960.0 | 498960.00 | 498960.0 | 498960.0 | 498960.00 | 792148.90 | 0.0 | 9979.2 | 0.0 | 0 | 30.102261 | -81.711777 | Residential | Masonry | 1 |
1 | 448094 | FL | CLAY COUNTY | 1322376.3 | 1322376.30 | 1322376.3 | 1322376.3 | 1322376.30 | 1438163.57 | 0.0 | 0.0 | 0.0 | 0 | 30.063936 | -81.707664 | Residential | Masonry | 3 |
2 | 206893 | FL | CLAY COUNTY | 190724.4 | 190724.40 | 190724.4 | 190724.4 | 190724.40 | 192476.78 | 0.0 | 0.0 | 0.0 | 0 | 30.089579 | -81.700455 | Residential | Wood | 1 |
3 | 333743 | FL | CLAY COUNTY | 0.0 | 79520.76 | 0.0 | 0.0 | 79520.76 | 86854.48 | 0.0 | 0.0 | 0.0 | 0 | 30.063236 | -81.707703 | Residential | Wood | 3 |
4 | 172534 | FL | CLAY COUNTY | 0.0 | 254281.50 | 0.0 | 254281.5 | 254281.50 | 246144.49 | 0.0 | 0.0 | 0.0 | 0 | 30.060614 | -81.702675 | Residential | Wood | 1 |
5 | 785275 | FL | CLAY COUNTY | 0.0 | 515035.62 | 0.0 | 0.0 | 515035.62 | 884419.17 | 0.0 | 0.0 | 0.0 | 0 | 30.063236 | -81.707703 | Residential | Masonry | 3 |
6 | 995932 | FL | CLAY COUNTY | 0.0 | 19260000.00 | 0.0 | 0.0 | 19260000.00 | 20610000.00 | 0.0 | 0.0 | 0.0 | 0 | 30.102226 | -81.713882 | Commercial | Reinforced Concrete | 1 |
7 | 223488 | FL | CLAY COUNTY | 328500.0 | 328500.00 | 328500.0 | 328500.0 | 328500.00 | 348374.25 | 0.0 | 16425.0 | 0.0 | 0 | 30.102217 | -81.707146 | Residential | Wood | 1 |
8 | 433512 | FL | CLAY COUNTY | 315000.0 | 315000.00 | 315000.0 | 315000.0 | 315000.00 | 265821.57 | 0.0 | 15750.0 | 0.0 | 0 | 30.118774 | -81.704613 | Residential | Wood | 1 |
9 | 142071 | FL | CLAY COUNTY | 705600.0 | 705600.00 | 705600.0 | 705600.0 | 705600.00 | 1010842.56 | 14112.0 | 35280.0 | 0.0 | 0 | 30.100628 | -81.703751 | Residential | Masonry | 1 |
policyID | eq_site_limit | hu_site_limit | fl_site_limit | fr_site_limit | tiv_2011 | tiv_2012 | eq_site_deductible | hu_site_deductible | fl_site_deductible | fr_site_deductible | point_latitude | point_longitude | point_granularity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 36634.000000 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 3.663400e+04 | 36634.000000 | 36634.000000 | 36634.000000 | 36634.000000 | 36634.000000 |
mean | 548661.937189 | 7.314780e+05 | 2.074348e+06 | 6.646009e+05 | 9.911724e+05 | 2.172875e+06 | 2.571004e+06 | 7.787908e+02 | 7.037985e+03 | 192.453210 | 26.483594 | 28.087477 | -81.903565 | 1.640907 |
std | 259912.926494 | 1.752300e+07 | 1.964150e+07 | 1.746114e+07 | 1.767178e+07 | 2.015015e+07 | 2.036777e+07 | 4.787979e+04 | 9.582593e+04 | 8037.713716 | 4702.740932 | 1.647734 | 1.809066 | 1.070561 |
min | 100074.000000 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.000000e+01 | 7.337000e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 24.547514 | -87.447290 | 1.000000 |
25% | 323647.750000 | 0.000000e+00 | 4.248106e+04 | 0.000000e+00 | 0.000000e+00 | 4.562415e+04 | 5.401351e+04 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 26.471996 | -82.439331 | 1.000000 |
50% | 548525.000000 | 0.000000e+00 | 1.926911e+05 | 0.000000e+00 | 0.000000e+00 | 2.021051e+05 | 2.416310e+05 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 28.057077 | -81.585700 | 1.000000 |
75% | 774252.000000 | 0.000000e+00 | 9.163381e+05 | 0.000000e+00 | 5.741010e+04 | 9.465821e+05 | 1.128871e+06 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 0.000000 | 29.623005 | -80.372910 | 3.000000 |
max | 999971.000000 | 2.160000e+09 | 2.160000e+09 | 2.160000e+09 | 2.160000e+09 | 2.160000e+09 | 1.701000e+09 | 6.273765e+06 | 7.380000e+06 | 450000.000000 | 900000.000000 | 30.989820 | -80.033257 | 7.000000 |
pd.Series
http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#series
Datasets are built out of
pd.Series
objects, which are like a List of values, but with a specific type (like floating point number, integer, string, ...) and an index. More precisely, apd.Series
is a "One-dimensional ndarray with axis labels"Larger
pd.DataFrame
objects are built out of several such series. Each of them is a column, must have the same length, and each column has a name.
Examples:
Indexing...
... so a pd.Series is much like a Python dict except you can do arithmetic, there is an order, you can have the same key multiple times, indexing is amazing. And everything is super fast.
Exercise right now!:
Make a Python dict
d
of your choosing, then try doingpd.Series(d)
. It should work.Make up 3 different pd.Series objects with various choices of values and index.
Try adding them together and see what happens.
"50 to 80 percent of time is spent wading through the tedium of the first two steps – acquiring and wrangling data – before even getting to the real work of analysis and insight." -- New York Times
pd.DataFrame
Documentation: http://pandas.pydata.org/pandas-docs/version/0.18.1/dsintro.html#dataframe
DataFrame = "a 2-dimensional labeled data structure with columns of potentially different types."
Think of it as:
a spreadsheet, or
a SQL table, or
a dict of
pd.Series
objects.
Now, we build a pd.DataFrame
:
age | name | |
---|---|---|
0 | 42 | William |
1 | 68 | Marlene |
2 | 19 | Jon |
3 | 22 | Jane |
name | age | location | |
---|---|---|---|
0 | William | 42 | NaN |
1 | Marlene | 68 | NaN |
2 | Jon | 19 | NaN |
3 | Jane | 22 | NaN |
The bold numbers or values on the left are called the "index".
The index is used to uniquely identify a row.
Use people.ix[i] to get the row with given index!
This isn't obvious. You just have to memorize it. Learn it now.
Of course, the rows don't have to be indexed by numbers -- anything hashable is allowed.
col1 | col2 | |
---|---|---|
name | william | jon |
rank | private | general |
ser | NaN | 5077 |
** Exercise right now!** Make a data frame for which there are two rows with the same index i, showing that the index need not be unique! [Hint: we made such a series above]
What happens when you type d.ix[i]
?
col1 | col2 | |
---|---|---|
i | 2 | 3 |
name | william | jon |
rank | private | general |
ser | NaN | 5077 |
NOTE: Besides .ix
, there are also other properties for various purposes. Read more about them here:
a | b | |
---|---|---|
name | john | john |
rank | private | private |
number | 389 | 389 |
name | nathan | nathan |
a | b | |
---|---|---|
name | john | john |
name | nathan | nathan |
col1 | col2 | |
---|---|---|
name | william | jon |
rank | private | general |
ser | NaN | 5077 |
col1 | col2 | |
---|---|---|
name | william | jon |
rank | private | general |
Question: How to get all rows starting from index 1 up to the end?
Answer: slicing.
age | name | |
---|---|---|
0 | 42 | William |
1 | 68 | Marlene |
2 | 19 | Jon |
3 | 22 | Jane |
age | name | |
---|---|---|
0 | 42 | William |
1 | 68 | Marlene |
age | name | |
---|---|---|
1 | 68 | Marlene |
2 | 19 | Jon |
age | name | |
---|---|---|
0 | 42 | William |
2 | 19 | Jon |
CSV
CSV is a very simple column-oriented data format, where a text file contains rows and a comma is a delimiter for the columns.
The first row contains the header, describing the columns.
Pandas then imports this file via pd.read_csv()
(the argument of this function could not only be a filename, but also a URL link 'http://...' pointint to a csv file)
name,type,tail,legs,age,color
joe,elephant,Y,4,16,gray
tim,dog,Y,4,7,golden
susan,cat,Y,4,4,black
frank,fish,N,0,1,green
olivia,spider,N,6,.5,black
minki,cat,Y,4,6,white
name | type | tail | legs | age | color | |
---|---|---|---|---|---|---|
0 | joe | elephant | Y | 4 | 16.0 | gray |
1 | tim | dog | Y | 4 | 7.0 | golden |
2 | susan | cat | Y | 4 | 4.0 | black |
3 | frank | fish | N | 0 | 1.0 | green |
4 | olivia | spider | N | 6 | 0.5 | black |
5 | minki | cat | Y | 4 | 6.0 | white |
The power of Pandas is, that you can start asking questions about this data, filter it, compare it with other datasets, or even picture it in a plot very easily.
legs | age | |
---|---|---|
count | 6.000000 | 6.000000 |
mean | 3.666667 | 5.750000 |
std | 1.966384 | 5.654644 |
min | 0.000000 | 0.500000 |
25% | 4.000000 | 1.750000 |
50% | 4.000000 | 5.000000 |
75% | 4.000000 | 6.750000 |
max | 6.000000 | 16.000000 |
name | type | tail | legs | age | color | |
---|---|---|---|---|---|---|
2 | susan | cat | Y | 4 | 4.0 | black |
3 | frank | fish | N | 0 | 1.0 | green |
4 | olivia | spider | N | 6 | 0.5 | black |
name | type | tail | legs | age | color | |
---|---|---|---|---|---|---|
0 | joe | elephant | Y | 4 | 16.0 | gray |
1 | tim | dog | Y | 4 | 7.0 | golden |
2 | susan | cat | Y | 4 | 4.0 | black |
5 | minki | cat | Y | 4 | 6.0 | white |
Question: which animal(s) has/have the largest number of legs?
Plotting
In a related library "statsmodels", there are also various datasets included. Try plotting one of them (where it makes sense)
violent | murder | hs_grad | poverty | single | white | urban | |
---|---|---|---|---|---|---|---|
state | |||||||
Alabama | 459.9 | 7.1 | 82.1 | 17.5 | 29.0 | 70.0 | 48.65 |
Alaska | 632.6 | 3.2 | 91.4 | 9.0 | 25.5 | 68.3 | 44.46 |
Arizona | 423.2 | 5.5 | 84.2 | 16.5 | 25.7 | 80.0 | 80.07 |
Arkansas | 530.3 | 6.3 | 82.4 | 18.8 | 26.3 | 78.4 | 39.54 |
California | 473.4 | 5.4 | 80.6 | 14.2 | 27.8 | 62.7 | 89.73 |