Intro¶

CNN(Encoder): Convolutional Neural Networks (CNN) are a type of deep learning model used to solve machine learning problems related to images and videos, such as image classification, object detection, and segmentation. The assumption that the patterns learned from images, like edges and contours, are independent of the pixel positions in the image allows for the sharing of weights and parameters.
LSTM(Decoder): Just as CNNs are adapted for images, Long Short-Term Memory (LSTM) networks are effective for solving machine learning problems related to sequential data.
In this project, we will build a hybrid model that connects CNNs and LSTMs to receive images or videos as input and produce text as output. Let's take a detailed look at the architecture of the hybrid model and how to implement it in PyTorch.
Datset: COCO(Common Objects in Context)

Download Data¶

The training dataset and the validation dataset occupy 13GB and 6GB respectively.
Let's download, extract, and process the dataset files.

In [ ]:

              # linux
!apt-get install wget
# mac
!brew install wget
 
# create a data directory
!mkdir data_dir

# train: 
!wget http://images.cocodataset.org/zips/train2014.zip -P ./data_dir/
!unzip ./data_dir/train_2014.zip -d ./data_dir/

# val: 
!wget http://images.cocodataset.org/zips/val2014.zip -P ./data_dir/
!unzip ./data_dir/val2014.zip -d ./data_dir/ 

# annotations:
!sudo wget http://msvocds.blob.core.windows.net/annotations-1-0-3/captions_train-val2014.zip -P ./data_dir/ 
!wget -c https://pjreddie.com/media/files/captions_train-val2014.zip

!unzip ./data_dir/captions_train-val2014.zip -d ./data_dir/
!rm ./data_dir/captions_train-val2014.zip

            

Import Dependencies¶

In [2]:

              import os
import nltk # for NLP
import pickle
import numpy as np
from PIL import Image
from collections import Counter
from pycocotools.coco import COCO # to handle COCO dataset easier
import matplotlib.pyplot as plt
 
import torch
import torch.nn as nn
import torch.utils.data as data
from torchvision import transforms
import torchvision.models as models
import torchvision.transforms as transforms
from torch.nn.utils.rnn import pack_padded_sequence # for padding

            

In [2]:

              nltk.download('punkt') # for tokenizing

            

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/ohchanghyun/nltk_data...
[nltk_data]   Package punkt is already up-to-date!

Out[2]:

True

Build Vocab¶

Define a vocabulary object, Vocab(), to which you'll add tokens. This object will also establish a mapping between text tokens and their corresponding numerical representations.
Additionally, by saving the vocabulary object locally, you can avoid the need to rebuild the vocabulary from scratch every time you want to retrain your model, saving time and computational resources.

In [3]:

              class Vocab(object):
    """Simple vocabulary wrapper."""
    def __init__(self):
        self.w2i = {}   # word to index
        self.i2w = {}   # index to word
        self.index = 0
 
    def __call__(self, token):
        if not token in self.w2i:
            return self.w2i['<unk>']
        return self.w2i[token]
 
    def __len__(self):
        return len(self.w2i)
    def add_token(self, token):
        if not token in self.w2i:
            self.w2i[token] = self.index
            self.i2w[self.index] = token
            self.index += 1

def build_vocabulary(json, threshold):
    """Build a simple vocabulary wrapper."""
    coco = COCO(json)
    counter = Counter()
    ids = coco.anns.keys()
    for i, id in enumerate(ids):
        caption = str(coco.anns[id]['caption'])
        tokens = nltk.tokenize.word_tokenize(caption.lower())
        counter.update(tokens) 
 
        if (i+1) % 1000 == 0:
            print("[{}/{}] Tokenized the captions.".format(i+1, len(ids)))
 
    # If the word frequency is less than 'threshold', then the word is discarded.
    tokens = [token for token, cnt in counter.items() if cnt >= threshold]
 
    # Create a vocab wrapper and add some special tokens.
    vocab = Vocab()
    vocab.add_token('<pad>')
    vocab.add_token('<start>')
    vocab.add_token('<end>')
    vocab.add_token('<unk>')
 
    # Add the words to the vocabulary.
    for i, token in enumerate(tokens):
        vocab.add_token(token)
    return vocab
 
vocab = build_vocabulary(json='data_dir/annotations/captions_val2014.json', threshold=4) # you can change threhold
vocab_path = './data_dir/vocabulary.pkl'
with open(vocab_path, 'wb') as f:
    pickle.dump(vocab, f)
print("Total vocabulary size: {}".format(len(vocab)))
print("Saved the vocabulary wrapper to '{}'".format(vocab_path))

            

loading annotations into memory...
Done (t=0.13s)
creating index...
index created!
[1000/202654] Tokenized the captions.
[2000/202654] Tokenized the captions.
[3000/202654] Tokenized the captions.
[4000/202654] Tokenized the captions.
[5000/202654] Tokenized the captions.
...
[199000/202654] Tokenized the captions.
[200000/202654] Tokenized the captions.
[201000/202654] Tokenized the captions.
[202000/202654] Tokenized the captions.
Total vocabulary size: 7289
Saved the vocabulary wrapper to './data_dir/vocabulary.pkl'

Resize Images¶

Let's preprocess the images. Since the size and shape of images in the dataset can vary, we need to transform all images to a fixed shape so that they can be inputted into the first layer of the CNN model.

In [4]:

              def reshape_image(image, shape):
    """Resize an image to the given shape."""
    return image.resize(shape, Image.LANCZOS)
 
def reshape_images(image_path, output_path, shape):
    """Reshape the images in 'image_path' and save into 'output_path'."""
    if not os.path.exists(output_path):
        os.makedirs(output_path)
 
    images = os.listdir(image_path)
    num_im = len(images)
    for i, im in enumerate(images):
        try:
            with open(os.path.join(image_path, im), 'r+b') as f:
                with Image.open(f) as image:
                    image = reshape_image(image, shape)
                    image.save(os.path.join(output_path, im), image.format)
            if (i+1) % 100 == 0:
                print ("[{}/{}] Resized the images and saved into '{}'."
                    .format(i+1, num_im, output_path))
        except:
            print(f'\n failed to download {output_path} {im} {image.format}')

#origin: image_path = './data_dir/train2014/'
image_path = './data_dir/val2014/'
output_path = './data_dir/resized_images/'
image_shape = [256, 256]
reshape_images(image_path, output_path, image_shape)

            

[100/40504] Resized the images and saved into './data_dir/resized_images/'.
[200/40504] Resized the images and saved into './data_dir/resized_images/'.
[300/40504] Resized the images and saved into './data_dir/resized_images/'.
[400/40504] Resized the images and saved into './data_dir/resized_images/'.
[500/40504] Resized the images and saved into './data_dir/resized_images/'.
[600/40504] Resized the images and saved into './data_dir/resized_images/'.
[700/40504] Resized the images and saved into './data_dir/resized_images/'.
...
[40300/40504] Resized the images and saved into './data_dir/resized_images/'.
[40400/40504] Resized the images and saved into './data_dir/resized_images/'.
[40500/40504] Resized the images and saved into './data_dir/resized_images/'.

Instantiate Data Loader¶

Let's cast the preprocessed data into a PyTorch dataset object. This object can then be used to define a PyTorch data loader, which will be used to fetch batches of data during the training loop.
Typically, there is no need to write a separate collate() function, but since we need to handle sentences of varying lengths, we will write one. By doing so, when a sentence length k is shorter than a fixed length n, the pack_padded_sequence function can be used to add n−k padding tokens.

In [5]:

              class CustomCocoDataset(data.Dataset):
    """COCO Custom Dataset compatible with torch.utils.data.DataLoader."""
    def __init__(self, data_path, coco_json_path, vocabulary, transform=None):
        """Set the path for images, captions and vocabulary wrapper.
        
        Args:
            root: image directory.
            json: coco annotation file path.
            vocab: vocabulary wrapper.
            transform: image transformer.
        """
        self.root = data_path
        self.coco_data = COCO(coco_json_path)
        self.indices = list(self.coco_data.anns.keys())
        self.vocabulary = vocabulary
        self.transform = transform
 
    def __getitem__(self, idx):
        """Returns one data pair (image and caption)."""
        coco_data = self.coco_data
        vocabulary = self.vocabulary
        annotation_id = self.indices[idx]
        caption = coco_data.anns[annotation_id]['caption']
        image_id = coco_data.anns[annotation_id]['image_id']
        image_path = coco_data.loadImgs(image_id)[0]['file_name']
 
        image = Image.open(os.path.join(self.root, image_path)).convert('RGB')
        if self.transform is not None:
            image = self.transform(image)
 
        # Convert caption (string) to word ids.
        word_tokens = nltk.tokenize.word_tokenize(str(caption).lower())
        caption = []
        caption.append(vocabulary('<start>'))
        caption.extend([vocabulary(token) for token in word_tokens])
        caption.append(vocabulary('<end>'))
        ground_truth = torch.Tensor(caption)
        return image, ground_truth
 
    def __len__(self):
        return len(self.indices)
 
 
def collate_function(data_batch):
    """Creates mini-batch tensors from the list of tuples (image, caption).
    
    We should build custom collate_fn rather than using default collate_fn, 
    because merging caption (including padding) is not supported in default.
    Args:
        data: list of tuple (image, caption). 
            - image: torch tensor of shape (3, 256, 256).
            - caption: torch tensor of shape (?); variable length.
    Returns:
        images: torch tensor of shape (batch_size, 3, 256, 256).
        targets: torch tensor of shape (batch_size, padded_length).
        lengths: list; valid length for each padded caption.
    """
    # Sort a data list by caption length (descending order).
    data_batch.sort(key=lambda d: len(d[1]), reverse=True)
    imgs, caps = zip(*data_batch)
 
    # Merge images (from list of 3D tensors to 4D tensor).
    # Originally, imgs is a list of <batch_size> number of RGB images with dimensions (3, 256, 256)
    # This line of code turns it into a single tensor of dimensions (<batch_size>, 3, 256, 256)
    imgs = torch.stack(imgs, 0) # stack(): Concatenates a sequence of tensors along a new dimension (all tensors should be exactly same size)
 
    # Merge captions (from list of 1D tensors to 2D tensor), similar to merging of images done above.
    cap_lens = [len(cap) for cap in caps]
    tgts = torch.zeros(len(caps), max(cap_lens)).long() # long(): Type casting to longTensor
    for i, cap in enumerate(caps):
        end = cap_lens[i]
        tgts[i, :end] = cap[:end] # fill tgts with caps
    return imgs, tgts, cap_lens
 
def get_loader(data_path, coco_json_path, vocabulary, transform, batch_size, shuffle, num_workers):
    """Returns torch.utils.data.DataLoader for custom coco dataset."""
    # COCO caption dataset
    coco_dataser = CustomCocoDataset(data_path=data_path,
                       coco_json_path=coco_json_path,
                       vocabulary=vocabulary,
                       transform=transform)
    
    # Data loader for COCO dataset
    # This will return (images, captions, lengths) for each iteration.
    # images: a tensor of shape (batch_size, 3, 224, 224).
    # captions: a tensor of shape (batch_size, padded_length).
    # lengths: a list indicating valid length for each caption. length is (batch_size).
    custom_data_loader = torch.utils.data.DataLoader(dataset=coco_dataser, 
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              num_workers=num_workers,
                                              collate_fn=collate_function)
    return custom_data_loader

            

Model Definition¶

Let's define the model.

CNN¶

For the CNN model, we'll use the ResNet 152 architecture, which can be obtained from the PyTorch model repository and is pretrained on the ImageNet dataset.

Q❓ Why do we remove the last layer of the pretrained ResNet model and replace it with a fully connected layer followed by a batch normalization layer?
- fc layer: By replacing the last layer, we substitute the final weight matrix (of dimensions Kx1000) with a new one of arbitrary dimensions Kx256. This allows the model to output features that are better suited for our specific task, which may not have 1000 classes like ImageNet.
- batch normalization: Batch normalization normalizes the outputs of the fully connected layer across the entire batch to have a mean of 0 and a standard deviation of 1. This is similar to standard input data normalization using torch.transforms. Performing batch normalization can 'limit the variability of hidden layer outputs'. It often has the effect of accelerating the training process as well.

In [17]:

              class CNNModel(nn.Module):
    def __init__(self, embedding_size):
        """Load the pretrained ResNet-152 and replace top fc layer."""
        super(CNNModel, self).__init__()
        resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT) 
        module_list = list(resnet.children())[:-1]      # delete the last fc layer.
        self.resnet_module = nn.Sequential(*module_list)
        self.linear_layer = nn.Linear(resnet.fc.in_features, embedding_size) # resnet.fc.in_features: the number of input channel in resnet
        self.batch_norm = nn.BatchNorm1d(embedding_size, momentum=0.01) # Not to affect LSTM learning
        
    def forward(self, input_images):
        """Extract feature vectors from input images."""
        with torch.no_grad():
            resnet_features = self.resnet_module(input_images)
        resnet_features = resnet_features.reshape(resnet_features.size(0), -1)
        final_features = self.batch_norm(self.linear_layer(resnet_features))
        return final_features

            

LSTM(Long Short-Term Memory)¶

The LSTM layer takes an embedding vector as input and outputs a series of words that perfectly describe the image from which the embedding was generated.
More details about LSTM will be covered on another page.

In [18]:

              class LSTMModel(nn.Module):
    def __init__(self, embedding_size, hidden_layer_size, vocabulary_size, num_layers, max_seq_len=20):
        """Set the hyper-parameters and build the layers."""
        super(LSTMModel, self).__init__()
        self.embedding_layer = nn.Embedding(vocabulary_size, embedding_size)
        self.lstm_layer = nn.LSTM(embedding_size, hidden_layer_size, num_layers, batch_first=True)
        self.linear_layer = nn.Linear(hidden_layer_size, vocabulary_size)
        self.max_seq_len = max_seq_len
        
    def forward(self, input_features, caps, lens):
        """Decode image feature vectors and generates captions."""
        embeddings = self.embedding_layer(caps)
        embeddings = torch.cat((input_features.unsqueeze(1), embeddings), 1) 
        lstm_input = pack_padded_sequence(embeddings, lens, batch_first=True) 
        hidden_variables, _ = self.lstm_layer(lstm_input)
        model_outputs = self.linear_layer(hidden_variables[0])
        return model_outputs
    
    def sample(self, input_features, lstm_states=None):
        """Generate captions for given image features using greedy search."""
        sampled_indices = []
        lstm_inputs = input_features.unsqueeze(1)
        for i in range(self.max_seq_len): 
            hidden_variables, lstm_states = self.lstm_layer(lstm_inputs, lstm_states)          # hiddens: (batch_size, 1, hidden_size)
            model_outputs = self.linear_layer(hidden_variables.squeeze(1))            # outputs:  (batch_size, vocab_size)
            _, predicted_outputs = model_outputs.max(1)                        # predicted: (batch_size)
            sampled_indices.append(predicted_outputs)
            lstm_inputs = self.embedding_layer(predicted_outputs)                       # inputs: (batch_size, embed_size)
            lstm_inputs = lstm_inputs.unsqueeze(1)                         # inputs: (batch_size, 1, embed_size)
        sampled_indices = torch.stack(sampled_indices, 1)                # sampled_ids: (batch_size, max_seq_length)
        return sampled_indices

            

Training Loop¶

Changing the image to a fixed size (256x256) is not enough; the process of normalizing the data remains.
Normalization is important because distributions vary depending on the data dimension, which can distort the entire optimization space and lead to inefficient gradient descent.

In [7]:

              # Device configuration
device = torch.device('mps' if torch.cuda.is_available() else 'cpu')


# Create model directory
if not os.path.exists('models_dir/'):
    os.makedirs('models_dir/')

    
# Image preprocessing, normalization for the pretrained resnet
transform = transforms.Compose([ 
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(), 
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406),     # mean of Imagenet
                         (0.229, 0.224, 0.225))])   # std  of Imagenet


# Load vocabulary wrapper
with open('data_dir/vocabulary.pkl', 'rb') as f:
    vocabulary = pickle.load(f)

            

In [8]:

              device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
device

Out[8]:

device(type='mps')

In [9]:

              # Build data loader
custom_data_loader = get_loader('data_dir/resized_images', 'data_dir/annotations/captions_val2014.json', vocabulary, 
                         transform, 128,
                         shuffle=True, num_workers=0) #jake changed: num_workers was 1 


# Build the models
encoder_model = CNNModel(256).to(device)
decoder_model = LSTMModel(256, 512, len(vocabulary), 1).to(device)
 

# Loss and optimizer
loss_criterion = nn.CrossEntropyLoss()
parameters = list(decoder_model.parameters()) + list(encoder_model.linear_layer.parameters()) + list(encoder_model.batch_norm.parameters())
optimizer = torch.optim.Adam(parameters, lr=0.001)

            

loading annotations into memory...
Done (t=0.12s)
creating index...
index created!

In [ ]:

              # Train the models
total_num_steps = len(custom_data_loader)
for epoch in range(5):
    for i, (imgs, caps, lens) in enumerate(custom_data_loader):
        # Set mini-batch dataset
        imgs = imgs.to(device)
        caps = caps.to(device)
        tgts = pack_padded_sequence(caps, lens, batch_first=True)[0]
 

        # Forward, backward and optimize
        feats = encoder_model(imgs)
        outputs = decoder_model(feats, caps, lens)
        loss = loss_criterion(outputs, tgts)
        decoder_model.zero_grad()
        encoder_model.zero_grad()
        loss.backward()
        optimizer.step()
 
        # # Print log info
        if i % 10 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}, Perplexity: {:5.4f}' # perplexity(PPL): Evaluation metric for NLP
                  .format(epoch, 5, i, total_num_steps, loss.item(), np.exp(loss.item()))) 
 
        # Save the model checkpoints
        if (i+1) % 1000 == 0:
            torch.save(decoder_model.state_dict(), os.path.join(
                'models_dir/', 'decoder-{}-{}.ckpt'.format(epoch+1, i+1)))
            torch.save(encoder_model.state_dict(), os.path.join(
                'models_dir/', 'encoder-{}-{}.ckpt'.format(epoch+1, i+1)))

            

Predict Caption¶

The training is complete, and let's try image captioning using the trained model

In [11]:

              image_file_path = 'sample.jpg'
 

# Device configuration
device = torch.device('mps' if torch.cuda.is_available() else 'cpu') # jake changed: gpu -> mps
 

def load_image(image_file_path, transform=None):
    img = Image.open(image_file_path).convert('RGB')
    img = img.resize([224, 224], Image.LANCZOS)
    
    if transform is not None:
        img = transform(img).unsqueeze(0)
    
    return img
 

# Image preprocessing
transform = transforms.Compose([
    transforms.ToTensor(), 
    transforms.Normalize((0.485, 0.456, 0.406), 
                         (0.229, 0.224, 0.225))])


# Load vocabulary wrapper
with open('data_dir/vocabulary.pkl', 'rb') as f:
    vocabulary = pickle.load(f)


# Build models
encoder_model = CNNModel(256).eval()  # eval mode (batchnorm uses moving mean/variance)
decoder_model = LSTMModel(embedding_size= 256, hidden_layer_size= 512, vocabulary_size= len(vocabulary), num_layers= 1)
encoder_model = encoder_model.to(device)
decoder_model = decoder_model.to(device)


# Load the trained model parameters
encoder_model.load_state_dict(torch.load('models_dir/encoder-3-1000.ckpt'))
decoder_model.load_state_dict(torch.load('models_dir/decoder-3-1000.ckpt'))


# Prepare an image
img = load_image(image_file_path, transform)
img_tensor = img.to(device)


# Generate an caption from the image
feat = encoder_model(img_tensor)
# print(f'feat: {feat.shape}')
sampled_indices = decoder_model.sample(feat)
sampled_indices = sampled_indices[0].cpu().numpy()          # (1, max_seq_length) -> (max_seq_length)
# print(f'sampled_indices: {sampled_indices}')



# Convert word_ids to words
predicted_caption = []
for token_index in sampled_indices:
    word = vocabulary.i2w[token_index]
    predicted_caption.append(word)
    if word == '<end>':
        break
predicted_sentence = ' '.join(predicted_caption)


# Print out the image and the generated caption
print(predicted_sentence)
img = Image.open(image_file_path)
plt.imshow(np.asarray(img))

            

<start> a dog is sitting on a bench in front of a house . <end>

Out[11]:

<matplotlib.image.AxesImage at 0xa34ebbbd0>

[PyTorch] Train PyTorch NN (0)	2023.10.29
[PyTorch] Intro (0)	2023.10.29

Jake's blog

[PyTorch] Image Captioning(CNN-LSTM)

Intro¶

Download Data¶

Import Dependencies¶

Build Vocab¶

Resize Images¶

Instantiate Data Loader¶

Model Definition¶

CNN¶

LSTM(Long Short-Term Memory)¶

Training Loop¶

Predict Caption¶

'Computer Science > PyTorch' 카테고리의 다른 글

티스토리툴바

[PyTorch] Image Captioning(CNN-LSTM)

Intro¶

Download Data¶

Import Dependencies¶

Build Vocab¶

Resize Images¶

Instantiate Data Loader¶

Model Definition¶

CNN¶

LSTM(Long Short-Term Memory)¶

Training Loop¶

Predict Caption¶

'Computer Science > PyTorch' 카테고리의 다른 글

'Computer Science/PyTorch' Related Articles

티스토리툴바