Introduction

Recently, while running some experiments with BERT and trying to reproduce results across runs with the same hyperparameters, I encountered something strange. The only difference between my two experiments was that in one of them I evaluated the model on the validation set after every epoch. Using the same dataset, without shuffling, and making sure to set the random seed, I was getting the same training losses for the first epoch but different training losses for the second epoch. My experiments were not reproducible.

[Image: Accurate depiction of me as I was trying to reproduce experiments]

Reproducibility is a larger issue in the machine learning community in general, as different weight initializations and the inherent randomness in large networks make it hard to reproduce results and reach the same conclusions. As stated in the ICLR 2019 workshop on reproducibility in machine learning, “A result which is reproducible is more likely to be robust and meaningful and rules out many types of experimenter error (either fraud or accidental).” Thus, making sure our experiments are reproducible is vital to drawing robust conclusions about the design of our models.

After spending the last couple of days investigating the discrepancy in my results, I have learned some important things that I want to share, and which may not be obvious even to experienced PyTorch users. Specifically, I will focus on the nuances of reproducibility and randomness in PyTorch. To illustrate this, I have included the following toy example based on one of PyTorch’s official image classification examples. To interact with the code used in this blog post, see the Jupyter notebook.

PyTorch Random Number Generators

Before we dive into the example, let us first understand a bit more about PyTorch’s internal random number generators (RNGs) for the CPU and CUDA. Any time we call a PyTorch method, model, or function that involves randomness, a random number is consumed and the RNG state changes. PyTorch provides several methods for controlling the RNG, such as setting the seed with torch.manual_seed(). Thus, to make two experiments reproducible, PyTorch recommends seeding the RNG, e.g. with torch.manual_seed(0). Nevertheless, as we will see, this is not enough.
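As a quick illustration (a minimal sketch, not part of the experiments below), seeding makes random draws repeatable, while every draw advances the RNG state:

import torch

torch.manual_seed(0)
a = torch.rand(2)   # consumes random numbers, advancing the RNG state
b = torch.rand(2)   # different values: the state has moved on

torch.manual_seed(0)  # reset the RNG to the same seed
c = torch.rand(2)

print(torch.equal(a, c))  # True: same state produces the same numbers
print(torch.equal(a, b))  # False: the first call advanced the state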

Toy Example

Loading the Data

We use the CIFAR-10 dataset as an example. We download the training and validation splits and wrap each in a DataLoader.

In [1]:
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader


# note: named 'transform' to avoid shadowing the torchvision.transforms module
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

train_set = torchvision.datasets.CIFAR10('./data', train=True,
                                         download=True,
                                         transform=transform)
train_loader = DataLoader(train_set, batch_size=4, shuffle=False)

# treat the test dataset as a validation set for this example
validation_set = torchvision.datasets.CIFAR10('./data', train=False,
                                              download=True,
                                              transform=transform)
validation_loader = DataLoader(validation_set, batch_size=4, shuffle=False)

Defining our PyTorch Model

Our PyTorch model is a simple convolutional neural network. Note that we include a dropout layer after the first fully connected layer; this will be important later.

In [2]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):
    def __init__(self, dropout=True):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        
        self.dropout1 = nn.Dropout(0.2) if dropout else None
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        if self.dropout1 is not None:
            x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Running the Experiments

For each experiment we train for two epochs and observe the training loss. We set the same initial seed to hopefully ensure reproducibility. We also accept a pluggable validation function so that we can explore how different changes affect reproducibility.

In [3]:
import torch

# use a validation func to allow us to easily define different ways of validation for illustrative purposes 
def train(net, train_loader, validation_loader, optimizer, criterion, validation_func, num_epochs=2):
    for epoch in range(num_epochs):  # loop over the dataset multiple times
        net.train()
        running_loss = 0.0
        for i, data in enumerate(train_loader, 0):
            # get the inputs; data is a list of [inputs, labels]
            inputs, labels = data

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward + backward + optimize
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()


            # print statistics
            running_loss += loss.item()
            if i % 2000 == 1999:    # print every 2000 mini-batches
                print('[Epoch %d, Iter %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / 2000))
                running_loss = 0.0
            
        if validation_func is not None:
            validation_func(net, validation_loader, epoch)
        print('')
    
    print('Finished Training')



def validation(net, validation_dataloader, epoch):
    net.eval()
    running_loss = 0.0
    with torch.no_grad():
        for i, data in enumerate(validation_dataloader, 0):
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
    print('Val [Epoch %d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / len(validation_dataloader) ))

SEED = 2147483647
In [4]:
print('======== Training With Validation ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=validation)

print('======== Training Without Validation ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=None)
======== Training With Validation ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532
Val [Epoch 1,  2500] loss: 1.453

[Epoch 2, Iter  2000] loss: 1.465
[Epoch 2, Iter  4000] loss: 1.457
[Epoch 2, Iter  6000] loss: 1.411
[Epoch 2, Iter  8000] loss: 1.365
[Epoch 2, Iter 10000] loss: 1.388
[Epoch 2, Iter 12000] loss: 1.359
Val [Epoch 2,  2500] loss: 1.277

Finished Training
======== Training Without Validation ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532

[Epoch 2, Iter  2000] loss: 1.463
[Epoch 2, Iter  4000] loss: 1.458
[Epoch 2, Iter  6000] loss: 1.410
[Epoch 2, Iter  8000] loss: 1.361
[Epoch 2, Iter 10000] loss: 1.386
[Epoch 2, Iter 12000] loss: 1.359

Finished Training

We can see from the output that the training losses match across the first epoch but diverge in the second. With the same experiment settings and seemingly the same hyperparameters, we are getting different results.

Investigation

As discussed before, PyTorch’s RNG state advances whenever a random function is called. My first instinct was that the model itself was causing the differing results, since we run the forward pass of the model during validation. Specifically, the dropout layer might be the culprit.

In [6]:
print('======== Training With Validation without Dropout ========')
torch.manual_seed(SEED)
net = Net(dropout=False)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=validation)

print('======== Training Without Validation without Dropout ========')
torch.manual_seed(SEED)
net = Net(dropout=False)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=None)
======== Training With Validation without Dropout ========
[Epoch 1, Iter  2000] loss: 2.111
[Epoch 1, Iter  4000] loss: 1.805
[Epoch 1, Iter  6000] loss: 1.639
[Epoch 1, Iter  8000] loss: 1.546
[Epoch 1, Iter 10000] loss: 1.529
[Epoch 1, Iter 12000] loss: 1.478
Val [Epoch 1,  2500] loss: 1.418

[Epoch 2, Iter  2000] loss: 1.413
[Epoch 2, Iter  4000] loss: 1.406
[Epoch 2, Iter  6000] loss: 1.346
[Epoch 2, Iter  8000] loss: 1.312
[Epoch 2, Iter 10000] loss: 1.325
[Epoch 2, Iter 12000] loss: 1.290
Val [Epoch 2,  2500] loss: 1.298

Finished Training
======== Training Without Validation without Dropout ========
[Epoch 1, Iter  2000] loss: 2.111
[Epoch 1, Iter  4000] loss: 1.805
[Epoch 1, Iter  6000] loss: 1.639
[Epoch 1, Iter  8000] loss: 1.546
[Epoch 1, Iter 10000] loss: 1.529
[Epoch 1, Iter 12000] loss: 1.478

[Epoch 2, Iter  2000] loss: 1.413
[Epoch 2, Iter  4000] loss: 1.406
[Epoch 2, Iter  6000] loss: 1.346
[Epoch 2, Iter  8000] loss: 1.312
[Epoch 2, Iter 10000] loss: 1.325
[Epoch 2, Iter 12000] loss: 1.290

Finished Training

Amazingly, the problem appears to be fixed, as the results are now identical! However, does this mean we can never use randomness in our models?

Of course, we should be able to use dropout layers in our models and still have reproducible results. The issue is not with the model; rather, it is with the PyTorch DataLoader itself. During validation, once we call net.eval(), the dropout layer is disabled, so the forward pass during validation is not the issue. The matching results in this case are simply because the model no longer uses any randomness at all! To show that the problem lies with the DataLoader, let’s remove the forward pass from the validation function altogether.
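To convince ourselves that dropout really is inert during evaluation, here is a quick sketch (reusing the Net class from above) showing that an eval-mode forward pass is deterministic and consumes no random numbers:

torch.manual_seed(SEED)
net = Net(dropout=True)
net.eval()  # dropout becomes the identity in eval mode
x = torch.rand(1, 3, 32, 32)

state_before = torch.get_rng_state()
with torch.no_grad():
    out1 = net(x)
    out2 = net(x)

print(torch.equal(out1, out2))                           # True: no randomness in the forward pass
print(torch.equal(state_before, torch.get_rng_state()))  # True: the RNG state is untouched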

In [7]:
def validation_without_model(net, validation_dataloader, epoch):
    net.eval()
    running_loss = 0.0
    with torch.no_grad():
        for i, data in enumerate(validation_dataloader, 0):
            break  # bail out immediately: we create the iterator but never run the model
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
    print('Val [Epoch %d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / len(validation_dataloader) ))
print('======== Training With Validation Skip Forward Pass ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=validation_without_model)

print('======== Training Without Validation ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=None)
======== Training With Validation Skip Forward Pass ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532
Val [Epoch 1,     1] loss: 0.000

[Epoch 2, Iter  2000] loss: 1.465
[Epoch 2, Iter  4000] loss: 1.457
[Epoch 2, Iter  6000] loss: 1.411
[Epoch 2, Iter  8000] loss: 1.365
[Epoch 2, Iter 10000] loss: 1.388
[Epoch 2, Iter 12000] loss: 1.359
Val [Epoch 2,     1] loss: 0.000

Finished Training
======== Training Without Validation ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532

[Epoch 2, Iter  2000] loss: 1.463
[Epoch 2, Iter  4000] loss: 1.458
[Epoch 2, Iter  6000] loss: 1.410
[Epoch 2, Iter  8000] loss: 1.361
[Epoch 2, Iter 10000] loss: 1.386
[Epoch 2, Iter 12000] loss: 1.359

Finished Training

As we can see, the results still differ after the first epoch. The only difference between the two runs is the for loop over the validation DataLoader. It is here that the reproducibility issue occurs. Let’s dig further into the DataLoader. If we look at the DataLoader’s __iter__() method, which is called when the for loop’s iterator is created, we see that it constructs a _SingleProcessDataLoaderIter or _MultiProcessingDataLoaderIter object. Both of these classes derive from the base class _BaseDataLoaderIter. If we look at the __init__() method of this class, we see the following line of code.
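(The line below is quoted from torch/utils/data/dataloader.py as of the PyTorch 1.x releases; its exact form varies slightly across versions.)

self._base_seed = torch.empty((), dtype=torch.int64).random_().item()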

Alas, we have found the source of our problem! This call consumes a random number from PyTorch’s RNG, leaving the RNG in a different state when we train in the next epoch. Because our model’s forward pass involves dropout, and therefore draws additional random numbers, the different RNG state means different elements of the input tensor get zeroed by dropout, leading to different outputs, different losses, and overall, different results!
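We can verify this directly with a small sketch using the validation_loader defined earlier: merely constructing the DataLoader’s iterator changes the RNG state, even though no data has been touched.

torch.manual_seed(SEED)
state_before = torch.get_rng_state()
_ = iter(validation_loader)  # __iter__() draws the base seed, consuming the RNG
state_after = torch.get_rng_state()
print(torch.equal(state_before, state_after))  # False: the state has changed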

Fix

With the issue identified, all we have to do to fix the problem is ensure that the RNG state after we finish validation is the same as it was before we started the for loop during validation. We can use the torch.get_rng_state() and torch.set_rng_state() functions for this.

In [8]:
def validation_ensure_rng_state(net, validation_dataloader, epoch):
    net.eval()
    running_loss = 0.0
    state = torch.get_rng_state()
    with torch.no_grad():
        for i, data in enumerate(validation_dataloader, 0):
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            running_loss += loss.item()
    print('Val [Epoch %d, %5d] loss: %.3f' %
                      (epoch + 1, i + 1, running_loss / len(validation_dataloader) ))
    torch.set_rng_state(state)
    
print('======== Training With Validation Ensure RNG State ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=validation_ensure_rng_state)

print('======== Training Without Validation ========')
torch.manual_seed(SEED)
net = Net(dropout=True)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
train(net, train_loader, validation_loader, optimizer, criterion, validation_func=None)
======== Training With Validation Ensure RNG State ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532
Val [Epoch 1,  2500] loss: 1.453

[Epoch 2, Iter  2000] loss: 1.463
[Epoch 2, Iter  4000] loss: 1.458
[Epoch 2, Iter  6000] loss: 1.410
[Epoch 2, Iter  8000] loss: 1.361
[Epoch 2, Iter 10000] loss: 1.386
[Epoch 2, Iter 12000] loss: 1.359
Val [Epoch 2,  2500] loss: 1.273

Finished Training
======== Training Without Validation ========
[Epoch 1, Iter  2000] loss: 2.131
[Epoch 1, Iter  4000] loss: 1.837
[Epoch 1, Iter  6000] loss: 1.676
[Epoch 1, Iter  8000] loss: 1.588
[Epoch 1, Iter 10000] loss: 1.567
[Epoch 1, Iter 12000] loss: 1.532

[Epoch 2, Iter  2000] loss: 1.463
[Epoch 2, Iter  4000] loss: 1.458
[Epoch 2, Iter  6000] loss: 1.410
[Epoch 2, Iter  8000] loss: 1.361
[Epoch 2, Iter 10000] loss: 1.386
[Epoch 2, Iter 12000] loss: 1.359

Finished Training

With our new understanding of PyTorch RNGs and DataLoaders, we can now more confidently run reproducible experiments and derive robust conclusions!
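If you need this in several places, one convenient pattern (a sketch of my own, not part of the PyTorch API) is to wrap the save/restore logic in a context manager; for GPU runs you would also want to preserve the CUDA state via torch.cuda.get_rng_state() and torch.cuda.set_rng_state():

from contextlib import contextmanager

@contextmanager
def preserve_rng_state():
    """Restore the CPU RNG state on exit, so the enclosed code
    cannot perturb any randomness that comes after it."""
    state = torch.get_rng_state()
    try:
        yield
    finally:
        torch.set_rng_state(state)

# usage: validation no longer disturbs the training RNG stream
with preserve_rng_state():
    validation(net, validation_loader, epoch=0)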
