Data Wrangling

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

Data Analysis

Loading in our SpaceX dataset

df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DS0321EN-SkillsNetwork/datasets/dataset_part_1.csv")
df.head(10)

First, I calculated the percentage of missing values in each attribute.

df.isnull().sum() / len(df) * 100

FlightNumber       0.000000
Date               0.000000
BoosterVersion     0.000000
PayloadMass        0.000000
Orbit              0.000000
LaunchSite         0.000000
Outcome            0.000000
Flights            0.000000
GridFins           0.000000
Reused             0.000000
Legs               0.000000
LandingPad        28.888889
Block              0.000000
ReusedCount        0.000000
Serial             0.000000
Longitude          0.000000
Latitude           0.000000
dtype: float64
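
As a sanity check on the calculation above, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are hypothetical, not SpaceX data. It shows that `isnull().mean() * 100` should give the same percentages as `isnull().sum() / len(df) * 100`, because the mean of a boolean column is the fraction of `True` values.

```python
import pandas as pd

# Hypothetical toy frame (not the SpaceX data): four rows, one column with gaps.
toy = pd.DataFrame({
    "PayloadMass": [500.0, 1200.0, 3100.0, 4500.0],
    "LandingPad": ["pad_a", None, "pad_b", None],
})

# Form used in the notebook: count the missing cells, divide by the row count.
pct_by_sum = toy.isnull().sum() / len(toy) * 100

# Equivalent shorthand: the mean of a boolean column is the fraction of True values.
pct_by_mean = toy.isnull().mean() * 100

print(pct_by_sum)
print(pct_by_mean)
```

Read this way, the 28.888889% reported for `LandingPad` would correspond to 26 of the 90 rows that the value counts later in this notebook add up to.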
# Figuring out which columns are numerical and which are categorical:
df.dtypes

FlightNumber        int64
Date               object
BoosterVersion     object
PayloadMass       float64
Orbit              object
LaunchSite         object
Outcome            object
Flights             int64
GridFins             bool
Reused               bool
Legs                 bool
LandingPad         object
Block             float64
ReusedCount         int64
Serial             object
Longitude         float64
Latitude          float64
dtype: object

Our dataset has information on several launch facilities, so I first counted the number of launches from each site.

LaunchSiteCount = df["LaunchSite"].value_counts()
LaunchSiteCount

LaunchSite
CCAFS SLC 40    55
KSC LC 39A      22
VAFB SLC 4E     13
Name: count, dtype: int64
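
The raw counts can also be read as shares of all launches. The sketch below rebuilds the counts shown above as a standalone Series (the names `site_counts` and `site_share` are illustrative, not from the notebook) and converts them to percentages. On the full dataframe, `df["LaunchSite"].value_counts(normalize=True)` should give the same proportions as fractions.

```python
import pandas as pd

# Launch counts copied from the value_counts() output above.
site_counts = pd.Series({
    "CCAFS SLC 40": 55,
    "KSC LC 39A": 22,
    "VAFB SLC 4E": 13,
})

# Share of launches per site, in percent of the total.
site_share = site_counts / site_counts.sum() * 100
print(site_share.round(2))
```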

Each launch targets a dedicated orbit, so next I counted the number of occurrences of each orbit type.

OrbitCount = df["Orbit"].value_counts()
OrbitCount

Orbit
GTO      27
ISS      21
VLEO     14
PO        9
LEO       7
SSO       5
MEO       3
ES-L1     1
HEO       1
SO        1
GEO       1
Name: count, dtype: int64

Next, I looked at how many different landing outcomes there were and how frequently each occurred.

landing_outcomes = df["Outcome"].value_counts()
landing_outcomes

Outcome
True ASDS      41
None None      19
True RTLS      14
False ASDS      6
True Ocean      5
False Ocean     2
None ASDS       2
False RTLS      1
Name: count, dtype: int64

Taking a closer look at the landing outcomes, I needed to create a landing outcome label.

The first step was identifying the keys for each respective outcome.

The second step was identifying all the outcomes where a landing wasn't achieved.

Lastly, I assigned each landing outcome one of the following labels:

0=Failed Landing

1=Successful Landing

# Finding the landing outcome keys
for i, outcome in enumerate(landing_outcomes.keys()):
    print(i, outcome)

0 True ASDS
1 None None
2 True RTLS
3 False ASDS
4 True Ocean
5 False Ocean
6 None ASDS
7 False RTLS
# Identifying the outcomes that resulted in landing failure
bad_outcomes = set(landing_outcomes.keys()[[1, 3, 5, 6, 7]])
bad_outcomes

{'False ASDS', 'False Ocean', 'False RTLS', 'None ASDS', 'None None'}
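
Selecting by position (`[1, 3, 5, 6, 7]`) depends on the order that `value_counts()` happens to return, which could shift if the data changes. As an alternative sketch, and not the method used in this notebook, the same set can be derived from the outcome strings themselves, assuming every successful landing starts with `True`:

```python
# Outcome keys as printed by the loop above.
outcomes = [
    "True ASDS", "None None", "True RTLS", "False ASDS",
    "True Ocean", "False Ocean", "None ASDS", "False RTLS",
]

# Assumption: an outcome counts as a success only if its first word is "True".
bad_outcomes = {o for o in outcomes if not o.startswith("True")}
print(sorted(bad_outcomes))
```

For the eight keys listed above, this yields the same five outcomes as the positional selection.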
# Assigning our landing outcome labels
landing_class = df['Outcome'].apply(lambda x: 0 if x in bad_outcomes else 1)

# Applying our new label to our dataframe
df['Class'] = landing_class

# Checking our updated dataframe
df.head()
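
The labelling step can also be written without a Python-level lambda. The sketch below uses `Series.isin` on a small hypothetical frame (`toy` is made up for illustration) and should produce the same 0/1 labels as the `apply` call above:

```python
import pandas as pd

# Hypothetical four-row frame reusing outcome strings from the dataset.
toy = pd.DataFrame({
    "Outcome": ["True ASDS", "None None", "False RTLS", "True RTLS"],
})

bad_outcomes = {"False ASDS", "False Ocean", "False RTLS", "None ASDS", "None None"}

# isin() marks the failed landings; negating and casting to int gives 1 = success, 0 = failure.
toy["Class"] = (~toy["Outcome"].isin(bad_outcomes)).astype(int)
print(toy)
```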

Because the outcomes are coded as 0 = failure and 1 = success, the mean of the Class column gives the overall success rate.

df["Class"].mean()
0.6666666666666666
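
This figure can be cross-checked against the outcome counts reported earlier. A minimal sketch, with the counts retyped by hand into a standalone Series (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the landing outcome value_counts() output earlier in the notebook.
outcome_counts = pd.Series({
    "True ASDS": 41, "None None": 19, "True RTLS": 14, "False ASDS": 6,
    "True Ocean": 5, "False Ocean": 2, "None ASDS": 2, "False RTLS": 1,
})

# Outcomes labelled 1 are the ones outside bad_outcomes, i.e. the three "True ..." entries.
successes = outcome_counts[["True ASDS", "True RTLS", "True Ocean"]].sum()
total = outcome_counts.sum()

success_rate = successes / total
print(successes, total, success_rate)
```

The counts give 60 successful landings out of 90 launches, which is the two-thirds rate reported by `df["Class"].mean()`.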
# Lastly, I saved the data to a CSV file
df.to_csv("dataset_part_2.csv", index=False)
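
Passing `index=False` keeps the row index out of the file, so reading it back should not produce an extra unnamed column. The sketch below illustrates this with a hypothetical two-row frame and an in-memory buffer standing in for the file on disk:

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the labelled dataset.
toy = pd.DataFrame({"FlightNumber": [1, 2], "Class": [0, 1]})

# Write to an in-memory text buffer instead of a file, without the row index.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back in.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```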