GitHub Repository: better-data-science/TensorFlow
Path: blob/main/008_CNN_001_Working_With_Image_Data.ipynb
Kernel: Python 3.9.7 (tf_apple)
!ls PetImages/
Cat Dog
!ls PetImages/Cat | wc -l
12501
!ls PetImages/Dog | wc -l
12501
  • Two folders, one per class; the data isn't split into subsets yet
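The shell commands above count every entry in a folder, whatever its type. As a minimal sketch, the same check can be done in Python and grouped by file extension, which shows how many entries are actually JPG images. The helper name `count_by_suffix` is hypothetical, and the code assumes the `PetImages` folder sits in the current working directory.

```python
import pathlib
from collections import Counter


def count_by_suffix(folder: pathlib.Path) -> Counter:
    """Count the files directly inside `folder`, grouped by lowercase extension."""
    return Counter(p.suffix.lower() for p in folder.iterdir() if p.is_file())


# Assumption: PetImages/ is located in the current working directory
for cls in ("Cat", "Dog"):
    cls_dir = pathlib.Path.cwd() / "PetImages" / cls
    if cls_dir.exists():
        print(cls, dict(count_by_suffix(cls_dir)))
```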

import os
import random
import pathlib
import shutil
import matplotlib.pyplot as plt
  • Set these variables however you want

  • We'll later split the dataset into training/testing/validation sets with an 80:10:10 ratio

# Distinct image classes
img_classes = ['cat', 'dog']

# Folders for training, testing, and validation subsets
dir_data = pathlib.Path.cwd().joinpath('data')
dir_train = dir_data.joinpath('train')
dir_valid = dir_data.joinpath('validation')
dir_test = dir_data.joinpath('test')

# Train/Test/Validation split config
pct_train = 0.8
pct_valid = 0.1
pct_test = 0.1
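Since these ratios can be set freely, a quick sanity check that they add up to 1 can catch a misconfigured split early. This is only a sketch; the values are repeated here so the snippet runs on its own.

```python
import math

# Same values as in the configuration cell
pct_train, pct_valid, pct_test = 0.8, 0.1, 0.1

total = pct_train + pct_valid + pct_test
# Compare with a tolerance, since floating-point sums are not always exact
assert math.isclose(total, 1.0), f"Split ratios sum to {total}, expected 1.0"
```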

Directory structure

  • We want to have a folder that contains dedicated folders for training, testing, and validation images

  • Each of these subfolders will have two folders inside - one for cats, and the other for dogs

  • We'll declare a function which creates the directory structure:

def setup_folder_structure() -> None:
    # Create base folders if they don't exist
    if not dir_data.exists():
        dir_data.mkdir()
    if not dir_train.exists():
        dir_train.mkdir()
    if not dir_valid.exists():
        dir_valid.mkdir()
    if not dir_test.exists():
        dir_test.mkdir()

    # Create subfolders for each class
    for cls in img_classes:
        if not dir_train.joinpath(cls).exists():
            dir_train.joinpath(cls).mkdir()
        if not dir_valid.joinpath(cls).exists():
            dir_valid.joinpath(cls).mkdir()
        if not dir_test.joinpath(cls).exists():
            dir_test.joinpath(cls).mkdir()

    # Print the directory structure
    # Credits - https://stackoverflow.com/questions/3455625/linux-command-to-print-directory-structure-in-the-form-of-a-tree
    # Note: the shell command prints the tree itself; os.system() only returns
    # a status code (0 on success), so the final print() shows that number
    dir_str = os.system(r'''ls -R data | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/ /' -e 's/-/|/' ''')
    print(dir_str)

    return
setup_folder_structure()
|-test
|---cat
|---dog
|-train
|---cat
|---dog
|-validation
|---cat
|---dog
0
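As a possible alternative to the repeated existence checks and the shell pipeline, `pathlib` can create nested folders in one call with `mkdir(parents=True, exist_ok=True)`, and the tree can be printed in pure Python. The sketch below is not the notebook's own method; `make_split_dirs` and `print_tree` are hypothetical helper names, and a temporary directory is used so nothing is written to the project folder.

```python
import pathlib
import tempfile


def make_split_dirs(base: pathlib.Path, subsets: list[str], classes: list[str]) -> list[pathlib.Path]:
    """Create base/<subset>/<class> for every combination and return the paths."""
    created = []
    for subset in subsets:
        for cls in classes:
            path = base / subset / cls
            # parents=True creates missing parent folders, exist_ok=True ignores existing ones
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


def print_tree(base: pathlib.Path) -> None:
    """Print every subfolder of `base`, prefixed according to its depth."""
    for path in sorted(p for p in base.rglob("*") if p.is_dir()):
        depth = len(path.relative_to(base).parts)
        print("|" + "-" * (2 * depth - 1) + path.name)


with tempfile.TemporaryDirectory() as tmp:
    base = pathlib.Path(tmp) / "data"
    dirs = make_split_dirs(base, ["train", "validation", "test"], ["cat", "dog"])
    print_tree(base)
```

Because nothing is passed to the shell, this version should also work on systems without `ls`, `grep`, and `sed`.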

Train/Test/Validation split

  • It's recommended to have three subsets when training image models

    • Training set - The largest subset on which the model trains

    • Validation set - A separate subset used for evaluation during training

    • Test set - Used for a final evaluation on images the model has never seen

  • We'll go over every image in the cat or dog folder and draw a random number between 0 and 1 for each

    • If the random number is 0.80 or below, the image will belong to the training set

    • If the random number is between 0.80 and 0.90, the image will belong to the validation set

    • Else, the image will belong to the test set

    • You can use the shutil module to copy the images
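The thresholding rule described above can be isolated as a small pure function, which makes it easy to test without touching any files. This is a sketch only; `assign_subset` is a hypothetical name, and its defaults mirror the 0.8 and 0.1 ratios from the configuration.

```python
import random


def assign_subset(x: float, pct_train: float = 0.8, pct_valid: float = 0.1) -> str:
    """Map a random number in [0, 1) to a subset name."""
    if x <= pct_train:
        return "train"
    if x <= pct_train + pct_valid:
        return "validation"
    return "test"


# Simulate the assignment for 10,000 hypothetical images
random.seed(42)
counts = {"train": 0, "validation": 0, "test": 0}
for _ in range(10_000):
    counts[assign_subset(random.random())] += 1
print(counts)
```

The counts land close to, but not exactly on, the configured ratios, because each image is assigned independently.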

def train_test_validation_split(src_folder: pathlib.Path, class_name: str) -> dict:
    # For tracking
    n_train, n_valid, n_test = 0, 0, 0

    # Random seed for reproducibility
    random.seed(42)

    # Iterate over every image
    for file in src_folder.iterdir():
        # File name without the parent folders
        img_name = file.name

        # Make sure it's JPG
        if file.suffix == '.jpg':
            # Generate a random number
            x = random.random()

            # Where should the image go?
            tgt_dir = ''

            # .80 or below
            if x <= pct_train:
                tgt_dir = 'train'
                n_train += 1

            # Between .80 and .90
            elif pct_train < x <= (pct_train + pct_valid):
                tgt_dir = 'validation'
                n_valid += 1

            # Above .90
            else:
                tgt_dir = 'test'
                n_test += 1

            # Copy the image
            shutil.copy(
                src=file,
                # data/<train|validation|test>/<cat|dog>/<something>.jpg
                dst=f'{str(dir_data)}/{tgt_dir}/{class_name}/{img_name}'
            )

    return {
        'source': str(src_folder),
        'target': str(dir_data),
        'n_train': n_train,
        'n_validaiton': n_valid,
        'n_test': n_test
    }
%%time
train_test_validation_split(
    src_folder=pathlib.Path.cwd().joinpath('PetImages/Cat'),
    class_name='cat'
)
CPU times: user 249 ms, sys: 1.73 s, total: 1.98 s
Wall time: 4.1 s
{'source': '/Users/dradecic/Desktop/tf_dataset/PetImages/Cat', 'target': '/Users/dradecic/Desktop/tf_dataset/data', 'n_train': 10016, 'n_validaiton': 1239, 'n_test': 1245}
%%time
train_test_validation_split(
    src_folder=pathlib.Path.cwd().joinpath('PetImages/Dog'),
    class_name='dog'
)
CPU times: user 248 ms, sys: 1.71 s, total: 1.96 s
Wall time: 4.15 s
{'source': '/Users/dradecic/Desktop/tf_dataset/PetImages/Dog', 'target': '/Users/dradecic/Desktop/tf_dataset/data', 'n_train': 10016, 'n_validaiton': 1239, 'n_test': 1245}
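  • It's not a perfect 80:10:10 split due to randomization, but it's close enough

  • Both classes end up with identical counts because the seed is reset to 42 on every call and each folder yields the same number of JPG files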
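To see how close the split is, the reported counts can be converted to shares: roughly 80.1% training, 9.9% validation, and 10.0% test. The sketch below does that arithmetic and also shows one way to get exact subset sizes by shuffling and slicing instead of drawing a number per image. `exact_split` is a hypothetical helper, not part of the notebook.

```python
import random

# Counts reported by the split function for each class
counts = {"train": 10016, "validation": 1239, "test": 1245}
total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.2%}")


def exact_split(items: list, pct_train: float = 0.8, pct_valid: float = 0.1, seed: int = 42) -> dict:
    """Shuffle a copy of `items` and slice it into fixed-size subsets."""
    shuffled = list(items)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * pct_train)
    n_valid = round(len(shuffled) * pct_valid)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Stand-in for a list of 12,500 image paths
parts = exact_split(list(range(12500)))
print({name: len(part) for name, part in parts.items()})
```

The trade-off is that the shuffle-and-slice approach needs the full file list up front, while the per-image draw can process files one at a time.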


Visualizing images

  • Always visualize the dataset when working with images

  • This function plots a random subset of 10 images from a given directory

  • The images are displayed in a grid of 2 rows and 5 columns:

def plot_random_sample(img_dir: pathlib.Path):
    # How many images we're showing
    n = 10
    # Get absolute paths to these N images
    imgs = random.sample(list(img_dir.iterdir()), n)

    # Make sure num_row * num_col = n
    num_row = 2
    num_col = 5

    # Create a figure
    fig, axes = plt.subplots(num_row, num_col, figsize=(3.5 * num_col, 3 * num_row))
    # For every image
    for i in range(num_row * num_col):
        # Read the image
        img = plt.imread(str(imgs[i]))
        # Display the image
        ax = axes[i // num_col, i % num_col]
        ax.imshow(img)
        # Set title as <train|test|validation>/<cat|dog>/<img_name>.jpg
        ax.set_title('/'.join(str(imgs[i]).split('/')[-3:]))

    plt.tight_layout()
    plt.show()
plot_random_sample(img_dir=pathlib.Path().cwd().joinpath('data/train/cat'))
Image in a Jupyter notebook
plot_random_sample(img_dir=pathlib.Path().cwd().joinpath('data/validation/dog'))
Image in a Jupyter notebook
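Two details of the plotting function can be tried out without matplotlib: `random.sample` raises a `ValueError` when a folder holds fewer than `n` entries, and the expression `i // num_col, i % num_col` is the same as `divmod(i, num_col)`. The sketch below illustrates both under those assumptions; `pick_images` and `grid_position` are hypothetical helpers.

```python
import pathlib
import random


def pick_images(img_dir: pathlib.Path, n: int = 10) -> list[pathlib.Path]:
    """Return up to `n` randomly chosen .jpg files from `img_dir`."""
    candidates = sorted(p for p in img_dir.iterdir() if p.suffix.lower() == ".jpg")
    # Cap the sample size so small folders don't raise a ValueError
    return random.sample(candidates, min(n, len(candidates)))


def grid_position(i: int, num_col: int = 5) -> tuple[int, int]:
    """Map a flat image index to (row, column) in a grid with `num_col` columns."""
    return divmod(i, num_col)


# Positions of 10 images in a grid with 5 columns
print([grid_position(i) for i in range(10)])
```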

What's next?

  • We'll explore what image data actually is and how to work with it