CoCalc -- Data Wrangling.ipynb

Real-time collaboration for Jupyter Notebooks, Linux Terminals, LaTeX, VS Code, R IDE, and more,
all in one place. Commercial Alternative to JupyterHub.

Image: ubuntu2204

Kernel: Python 3 (system-wide)

Data Wrangling

In [26]:

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Data Analysis

Loading in our SpaceX dataset

In [27]:

df=pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)

First, I calculated the percentage of missing values in each attribute.

In [28]:

df.isnull().sum()/len(df)*100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
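Because `isnull()` returns a boolean mask, the mean of that mask is the fraction of missing values, so the same percentages can be computed without dividing by `len(df)`. The sketch below uses a small made-up frame (the `toy` name and the `pad-a`/`pad-b` values are illustrative assumptions, not rows from the SpaceX data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "LaunchSite": ["CCAFS SLC 40", "KSC LC 39A", "VAFB SLC 4E", "CCAFS SLC 40"],
    "LandingPad": ["pad-a", np.nan, np.nan, "pad-b"],
})

# Mean of the boolean mask = fraction missing; scale to a percentage.
missing_pct = toy.isnull().mean() * 100

# The sum/len form used in the notebook gives the same numbers.
same_as_notebook = toy.isnull().sum() / len(toy) * 100

print(missing_pct)
```

In the real dataset only `LandingPad` has missing values, so it is the only column that would need special handling later.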

In [29]:

# Identifying which columns are numerical and which are categorical:

df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object
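Rather than reading the split off the `dtypes` listing by eye, the column names can be grouped programmatically with `select_dtypes`. This is a minimal sketch on a made-up frame that mirrors a few of the column types above; the `toy` frame and its values are assumptions for illustration:

```python
import pandas as pd

# Hypothetical frame with one column of each kind seen in the dtypes output.
toy = pd.DataFrame({
    "FlightNumber": [1, 2],          # integer
    "PayloadMass": [500.0, 750.0],   # float
    "Orbit": ["LEO", "ISS"],         # text
    "GridFins": [False, True],       # boolean
})

# "number" covers integer and float columns but not booleans.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
boolean_cols = toy.select_dtypes(include="bool").columns.tolist()

# Everything that is neither numeric nor boolean is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude=["number", "bool"]).columns.tolist()

print(numerical_cols, boolean_cols, categorical_cols)
```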

Our dataset has information on several launch facilities, so I first needed to count the number of launches from each site.

In [30]:

LaunchSiteCount = df["LaunchSite"].value_counts()
LaunchSiteCount

LaunchSite
CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64
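Raw counts are easier to compare as shares of the total. Passing `normalize=True` to `value_counts` returns proportions instead of counts. The sketch below rebuilds a stand-in `LaunchSite` column from the counts printed above (55, 22, 13) so it runs without downloading the CSV; the `sites` and `shares` names are illustrative:

```python
import pandas as pd

# Stand-in column rebuilt from the counts reported above.
sites = pd.Series(
    ["CCAFS SLC 40"] * 55 + ["KSC LC 39A"] * 22 + ["VAFB SLC 4E"] * 13,
    name="LaunchSite",
)

# Proportion of launches per site, scaled to a percentage.
shares = sites.value_counts(normalize=True) * 100

print(shares.round(1))
```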

Every launch targets a specific orbit, so next I counted the occurrences of each orbit type.

In [31]:

OrbitCount=df["Orbit"].value_counts()
OrbitCount

Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64

Next, I looked at how many different landing outcomes there were and how frequently each occurred.

In [32]:

landing_outcomes = df["Outcome"].value_counts()
landing_outcomes

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64

To take a closer look at the landing outcomes, I needed to create a landing outcome label.

The first step was identifying the keys for each respective outcome.

The next step was identifying all the outcomes where a landing wasn't achieved.

Lastly, I encoded the landing outcome as follows:

0 = Failed Landing

1 = Successful Landing

In [33]:

# Finding the landing outcome keys
for i,outcome in enumerate(landing_outcomes.keys()):
    print(i,outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS

In [34]:

# Identifying the outcomes that resulted in landing failure
bad_outcomes=set(landing_outcomes.keys()[[1,3,5,6,7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
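The positional indices `[1,3,5,6,7]` depend on the order in which `value_counts` happens to return the outcomes, which could shift if the data changes. One alternative, sketched here as an assumption rather than the notebook's method, is to select failures by label: any outcome that does not start with `True`. The counts are copied from the output above so the snippet runs on its own:

```python
import pandas as pd

# Outcome counts as reported above.
landing_outcomes = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Treat every outcome that does not begin with "True" as a failed landing.
bad_outcomes = {o for o in landing_outcomes.index if not o.startswith("True")}

print(sorted(bad_outcomes))
```

This yields the same five outcomes as the index-based selection.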

In [35]:

# Assigning our landing outcome labels
landing_class = df['Outcome'].apply(lambda x:0 if x in bad_outcomes else 1)

# Applying our new label to our dataframe
df['Class']=landing_class
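The same labels can also be produced without a Python-level `lambda` by using `isin`, which returns a boolean mask that can be negated and cast to integers. A minimal sketch on a four-row stand-in frame (the `toy` frame is hypothetical; `bad_outcomes` matches the set shown above):

```python
import pandas as pd

bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# Hypothetical stand-in for the Outcome column.
toy = pd.DataFrame({"Outcome": ["True ASDS", "None None", "False RTLS", "True Ocean"]})

# isin() marks failed landings; negate and cast so 1 = success, 0 = failure.
toy["Class"] = (~toy["Outcome"].isin(bad_outcomes)).astype(int)

print(toy)
```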

In [36]:

# Checking our updated dataframe 
df.head()

Because the outcomes are encoded as 0 = failure and 1 = success, the mean of the Class column gives the overall success rate.

In [37]:

df["Class"].mean()

0.6666666666666666
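This figure can be cross-checked against the outcome counts printed earlier: the mean of a 0/1 column is the number of ones divided by the number of rows. The sketch below redoes that arithmetic from the counts alone (the `counts` dictionary is copied from the `value_counts` output above):

```python
# Outcome counts as reported by value_counts above.
counts = {
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
}

# Successful landings are the outcomes that start with "True".
successes = sum(n for outcome, n in counts.items() if outcome.startswith("True"))
total = sum(counts.values())

success_rate = successes / total
print(successes, total, success_rate)
```

With 60 successful landings out of 90 launches, the ratio is two thirds, matching the mean above.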

In [38]:

# Lastly, I saved the data to a CSV file

df.to_csv("dataset_part_2.csv", index=False)
