CNN 1 - Working with Image data

Dataset:
- https://www.kaggle.com/shaunthesheep/microsoft-catsvsdogs-dataset

In [1]:

!ls PetImages/

Cat Dog

In [2]:

!ls PetImages/Cat | wc -l

12501

In [3]:

!ls PetImages/Dog | wc -l

12501

Two folders (one per class), the data isn't split

In [4]:

import os
import random
import pathlib
import shutil
import matplotlib.pyplot as plt

Set these variables however you want
We'll later split the dataset into training/testing/validation sets with a 80:10:10 ratio

In [5]:

# Distinct image classes
img_classes = ['cat', 'dog']

# Folders for training, testing, and validation subsets
dir_data  = pathlib.Path.cwd().joinpath('data')
dir_train = dir_data.joinpath('train')
dir_valid = dir_data.joinpath('validation')
dir_test  = dir_data.joinpath('test')

# Train/Test/Validation split config
pct_train = 0.8
pct_valid = 0.1
pct_test = 0.1

Directory structure

We want to have a folder that contains dedicated folders for training, testing, and validation images
Each of these subfolders will have two folders inside - one for cats, and the other for dogs
We'll declare a function which creates the directory structure:

In [6]:

def setup_folder_structure() -> None:
    # Create base folders if they don't exist
    if not dir_data.exists():  dir_data.mkdir()
    if not dir_train.exists(): dir_train.mkdir()
    if not dir_valid.exists(): dir_valid.mkdir()
    if not dir_test.exists():  dir_test.mkdir()
    
    # Create subfolders for each class
    for cls in img_classes:
        if not dir_train.joinpath(cls).exists(): dir_train.joinpath(cls).mkdir()
        if not dir_valid.joinpath(cls).exists(): dir_valid.joinpath(cls).mkdir()
        if not dir_test.joinpath(cls).exists():  dir_test.joinpath(cls).mkdir()
        
    # Print the directory structure
    # Credits - https://stackoverflow.com/questions/3455625/linux-command-to-print-directory-structure-in-the-form-of-a-tree
    dir_str = os.system('''ls -R data | grep ":$" | sed -e 's/:$//' -e 's/[^-][^\/]*\//--/g' -e 's/^/   /' -e 's/-/|/' ''')
    print(dir_str)
    return

In [7]:

setup_folder_structure()

   |-test
   |---cat
   |---dog
   |-train
   |---cat
   |---dog
   |-validation
   |---cat
   |---dog
0

Train/Test/Validation split

It's recommended to have three subsets when training image models
- Training set - The largest subset on which the model trains
- Validation sets - A separate set used for evaluation during training
- Test set - Used to perform a final test, images were never seen by the model
We'll go over every image in cat or dog folder, and then create a random number between 0 and 1
- If the random number is 0.80 or below, the image will belong to the training set
- If the random number is between 0.80 and 0.90, the image will belong to the validation set
- Else, the image will belong to the test set
- You can use the shutil module to copy the images

In [8]:

def train_test_validation_split(src_folder: pathlib.PosixPath, class_name: str) -> dict:
    # For tracking
    n_train, n_valid, n_test = 0, 0, 0
    
    # Random seed for reproducibility
    random.seed(42)
    
    # Iterate over every image
    for file in src_folder.iterdir():
        img_name = str(file).split('/')[-1]
        
        # Make sure it's JPG
        if file.suffix == '.jpg':
            # Generate a random number
            x = random.random()
            
            # Where should the image go?
            tgt_dir = ''
            
            # .80 or below
            if x <= pct_train:  
                tgt_dir = 'train'
                n_train += 1
                
            # Between .80 and .90
            elif pct_train < x <= (pct_train + pct_valid):  
                tgt_dir = 'validation'
                n_valid += 1
                
            # Above .90
            else:  
                tgt_dir = 'test'
                n_test += 1
                
            # Copy the image
            shutil.copy(
                src=file,
                # data/<train|valid|test>/<cat\dog>/<something>.jpg
                dst=f'{str(dir_data)}/{tgt_dir}/{class_name}/{img_name}'
            )
            
    return {
        'source': str(src_folder),
        'target': str(dir_data),
        'n_train': n_train,
        'n_validaiton': n_valid,
        'n_test': n_test
    }

In [9]:

%%time

train_test_validation_split(
    src_folder=pathlib.Path.cwd().joinpath('PetImages/Cat'),
    class_name='cat'
)

CPU times: user 249 ms, sys: 1.73 s, total: 1.98 s
Wall time: 4.1 s

{'source': '/Users/dradecic/Desktop/tf_dataset/PetImages/Cat',
 'target': '/Users/dradecic/Desktop/tf_dataset/data',
 'n_train': 10016,
 'n_validaiton': 1239,
 'n_test': 1245}

In [10]:

%%time

train_test_validation_split(
    src_folder=pathlib.Path.cwd().joinpath('PetImages/Dog'),
    class_name='dog'
)

CPU times: user 248 ms, sys: 1.71 s, total: 1.96 s
Wall time: 4.15 s

{'source': '/Users/dradecic/Desktop/tf_dataset/PetImages/Dog',
 'target': '/Users/dradecic/Desktop/tf_dataset/data',
 'n_train': 10016,
 'n_validaiton': 1239,
 'n_test': 1245}

It's not a perfect 80:10:10 split due to randomization, but it will do

Visualizng images

Always visualize the dataset when working with images
This function plots a random subset of 10 images from a given directory
The images are displayed in a grid of 2 rows and 5 columns:

In [11]:

def plot_random_sample(img_dir: pathlib.PosixPath):
    # How many images we're showing
    n = 10
    # Get absolute paths to these N images
    imgs = random.sample(list(img_dir.iterdir()), n)
    
    # Make sure num_row * num_col = n
    num_row = 2
    num_col = 5 

    # Create a figure
    fig, axes = plt.subplots(num_row, num_col, figsize=(3.5 * num_col, 3 * num_row))
    # For every image
    for i in range(num_row * num_col):
        # Read the image
        img = plt.imread(str(imgs[i]))
        # Display the image
        ax = axes[i // num_col, i % num_col]
        ax.imshow(img)
        # Set title as <train|test|validation>/<cat\dog>/<img_name>.jpg
        ax.set_title('/'.join(str(imgs[i]).split('/')[-3:]))

    plt.tight_layout()
    plt.show()

In [12]:

plot_random_sample(img_dir=pathlib.Path().cwd().joinpath('data/train/cat'))

In [13]:

plot_random_sample(img_dir=pathlib.Path().cwd().joinpath('data/validation/dog'))

What's next?

We'll explore what image data actually is and how to work with it

CoCalc provides the best real-time collaborative environment for Jupyter Notebooks, LaTeX documents, and SageMath, scalable from individual users to large groups and classes!

CNN 1 - Working with Image data

Directory structure

Train/Test/Validation split

Visualizng images

What's next?

Product

Resources

Company