2022-10-12 11:15:57

Problem Description

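While training a deep learning network, I found that running the same training code under different PyTorch versions produced very different results. Specifically, these were the results I obtained: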

PyTorch version	Torchvision version	Test result
1.2	0.4.0	82.58017
1.5	0.6.0	83.11847
1.6	0.7.0	74.97795
1.10	0.11.1	68.33818

The network parameters and training settings were exactly the same, yet the results differed enormously.

Root Cause Analysis

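Training went badly wrong whenever Torchvision was newer than 0.6.0. With the last two versions in particular, the loss never came down, so I concluded that the bug was most likely in the preprocessing code.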

On inspection, I found that Torchvision 0.7.0 introduced the following change:

[Transforms] Use torch.rand instead of random.random() for random transforms (#2520)

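In my code, however, the preprocessing step only called random.seed(seed) and np.random.seed(seed). Neither call seeds torch.rand, so the random transforms applied to an image and to its label were no longer consistent with each other. That is why the loss refused to decrease on the newer versions.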

In Torchvision 0.7.0 and later, the source of torchvision.transforms.RandomHorizontalFlip() is:

class RandomHorizontalFlip(torch.nn.Module):
    """Horizontally flip the given image randomly with a given probability.
    If the image is torch Tensor, it is expected
    to have [..., H, W] shape, where ... means an arbitrary number of leading
    dimensions

    Args:
        p (float): probability of the image being flipped. Default value is 0.5
    """

    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, img):
        """
        Args:
            img (PIL Image or Tensor): Image to be flipped.

        Returns:
            PIL Image or Tensor: Randomly flipped image.
        """
        if torch.rand(1) < self.p:
            return F.hflip(img)
        return img

    def __repr__(self):
        return self.__class__.__name__ + '(p={})'.format(self.p)

In Torchvision 0.6.0 and earlier, the source of torchvision.transforms.RandomHorizontalFlip() is:

class RandomHorizontalFlip(object):
    """Horizontally flip the given PIL Image randomly with a given probability.

    Args:
        p (float): probability of the image being flipped. Default value is 0.5
    """

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, img):
        """
        Args:
            img (PIL Image): Image to be flipped.

        Returns:
            PIL Image: Randomly flipped image.
        """
        if random.random() < self.p:
            return F.hflip(img)
        return img

    def __repr__(self):
        return self.__class__.__name__ + '(p={})'.format(self.p)

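Clearly, the two versions draw their random numbers from different functions: torch.rand in the newer one, random.random() in the older one.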
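The effect on paired data can be illustrated with a minimal sketch that uses only the Python standard library. It assumes a pipeline (not shown in this post) that re-seeds `random` before transforming the image and again before transforming the label, so that both receive the same flip decisions. `other_rng` is a hypothetical stand-in for torch's generator, and the two `flip_decisions_*` helpers are illustrative names, not Torchvision functions.

```python
import random

# Hypothetical stand-in for torch's global generator: an independent
# stream that random.seed() does not reset.
other_rng = random.Random(12345)


def flip_decisions_old(n, p=0.5):
    """Old style (Torchvision <= 0.6.0): decisions come from random.random()."""
    return [random.random() < p for _ in range(n)]


def flip_decisions_new(n, p=0.5):
    """New style (Torchvision >= 0.7.0), simulated: decisions come from another RNG."""
    return [other_rng.random() < p for _ in range(n)]


def paired_decisions(decide, seed, n=64):
    """Re-seed `random` before the image pass and again before the label pass."""
    random.seed(seed)
    image_flips = decide(n)
    random.seed(seed)
    label_flips = decide(n)
    return image_flips, label_flips


old_img, old_lbl = paired_decisions(flip_decisions_old, seed=0)
new_img, new_lbl = paired_decisions(flip_decisions_new, seed=0)

print(old_img == old_lbl)  # True: re-seeding replays the same decisions
print(new_img == new_lbl)  # False: the other stream simply keeps advancing
```

In the simulated new style, the image and the label end up flipped differently for roughly half of the samples, which would be enough to corrupt the supervision signal.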

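With Torchvision 0.6.0 and earlier, seeding only Python's random module is enough for the functions under torchvision.transforms. That is fragile, though: the same program misbehaves on later versions.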

Solution

Because the random transforms in Torchvision 0.7.0 and later draw from torch.rand, PyTorch's own random seed has to be set as well:

torch.manual_seed(seed)  # seed the CPU generator
torch.cuda.manual_seed(seed)  # seed the current GPU
torch.cuda.manual_seed_all(seed)  # seed all GPUs
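A quick way to confirm which seed actually controls torch.rand is sketched below. It assumes PyTorch is installed; `draw_after_seeding` is a hypothetical helper introduced only for this check.

```python
import random

import torch


def draw_after_seeding(seed_fn, seed, n=8):
    """Call the given seeding function, then draw n values from torch.rand."""
    seed_fn(seed)
    return torch.rand(n)


# Seeding Python's `random` module leaves torch's generator untouched,
# so the two draws are just consecutive chunks of the same stream.
a = draw_after_seeding(random.seed, 0)
b = draw_after_seeding(random.seed, 0)

# Seeding torch resets its generator, so the two draws are identical.
c = draw_after_seeding(torch.manual_seed, 0)
d = draw_after_seeding(torch.manual_seed, 0)

print(torch.equal(a, b))  # False
print(torch.equal(c, d))  # True
```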

Putting it all together, the seeds can be set like this:

import random

import numpy as np
import torch


def set_seed(seed):
    random.seed(seed)  # Python's built-in RNG
    np.random.seed(seed)  # NumPy
    torch.manual_seed(seed)  # cpu
    torch.cuda.manual_seed(seed)  # gpu
    torch.cuda.manual_seed_all(seed)  # all gpus

With these seeds set, I trained again under all four versions and got the following results:

PyTorch version	Torchvision version	Test result
1.2	0.4.0	83.32672
1.5	0.6.0	83.16423
1.6	0.7.0	82.62892
1.10	0.11.1	82.66235

Takeaways

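When writing code, first make sure that the same code gives repeatable results in the same environment, that is, make PyTorch deterministic. The following article can serve as a reference: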

Getting this groundwork right also avoids bugs later on.
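As a rough sketch of what "deterministic" can mean in practice, the helper below seeds every RNG mentioned in this post and also sets the cuDNN flags that PyTorch exposes for reproducibility. The name `make_deterministic` and its `strict` parameter are assumptions made for illustration. The strict branch relies on torch.use_deterministic_algorithms, which exists only in newer PyTorch releases and can raise errors for operations that have no deterministic implementation, so it is off by default here.

```python
import random

import numpy as np
import torch


def make_deterministic(seed, strict=False):
    """Seed all RNGs in play and request deterministic cuDNN behaviour."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable

    # Ask cuDNN for deterministic kernels and disable its auto-tuner,
    # which may otherwise pick different algorithms from run to run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

    # Optional and stricter; guarded because older PyTorch versions lack it.
    if strict and hasattr(torch, "use_deterministic_algorithms"):
        torch.use_deterministic_algorithms(True)


# Two "runs" with the same seed should draw identical random tensors.
make_deterministic(0)
first = torch.rand(3)
make_deterministic(0)
second = torch.rand(3)
print(torch.equal(first, second))  # True
```

This trades some speed for repeatability, and it does not cover everything (for example, data-loading worker processes have their own RNG state), so it is best treated as a starting point rather than a guarantee.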