2022-10-01 11:06:44

A Few Notes on Multi-GPU Training in PyTorch

Background

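In the era of big data, single-machine, single-GPU training can no longer keep up with growing model sizes and data volumes, so training on multiple GPUs has become mainstream. PyTorch has shipped multi-GPU interfaces since its early 0.x releases, and this post briefly introduces the two approaches it offers for multi-GPU training.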

About multi-GPU

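Multi-GPU literally means that a machine has two or more GPUs. On a machine with CUDA installed, the nvidia-smi command shows the number of GPUs along with other information, as in the figure below:
[Figure: nvidia-smi output]
This is the single-machine, multi-GPU case. In the other case, the GPUs are spread across several machines. PyTorch provides a different training interface for each case, and we cover them in turn below.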
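
The GPU count can also be read from a script. The sketch below is only an illustration and makes a few assumptions: it shells out to `nvidia-smi -L`, which prints one line per GPU, and it returns an empty list when the tool is missing or fails. The helper name `list_gpus` is hypothetical. Inside PyTorch itself, `torch.cuda.device_count()` reports the number of visible GPUs.

```python
import shutil
import subprocess


def list_gpus():
    """Return one description string per GPU reported by `nvidia-smi -L`.

    Hypothetical helper for illustration; returns [] when nvidia-smi is
    not installed or exits with an error.
    """
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []
    # Keep only the top-level "GPU <index>: ..." lines.
    return [
        line.strip()
        for line in result.stdout.splitlines()
        if line.strip().startswith("GPU ")
    ]


gpus = list_gpus()
print(f"Found {len(gpus)} GPU(s)")
for gpu in gpus:
    print(gpu)
```

On a machine without an NVIDIA driver the script simply reports zero GPUs instead of raising.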

torch.nn.DataParallel()

This method is very simple to use: you only need to add a wrapper around your model.

import torch.nn as nn
model = nn.DataParallel(model)

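According to the official PyTorch documentation, this method keeps a replica of the model on each GPU during training and splits the input batch into n roughly equal parts, where n is the number of GPUs in use. The outputs are gathered on a single GPU (GPU 0 by default), and in the backward pass the gradients from every replica are summed into the original module on that GPU. The batch size you set is therefore the total across all GPUs, so scale it by the number of GPUs. In some cases, for example when the dataset is small, multi-GPU training can even be slower than single-GPU training. A runnable example follows.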

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

input_size = 5
output_size = 2
batch_size = 30
data_size = 30

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("  In Model: input size", input.size(), "output size", output.size())
        return output

model = Model(input_size, output_size)
if torch.cuda.is_available():
    model.cuda()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # This one line is all it takes
    model = nn.DataParallel(model)

for data in rand_loader:
    # Tensors can be passed directly; the deprecated Variable wrapper is not needed
    input_var = data.cuda() if torch.cuda.is_available() else data
    output = model(input_var)
    print("Outside: input size", input_var.size(), "output_size", output.size())
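
The "In Model" sizes printed by each replica can be predicted with a little arithmetic. The sketch below assumes the batch is split the way `torch.chunk` splits a tensor: each chunk holds ceil(batch / n) samples and the last chunk takes whatever is left. The function name `chunk_sizes` is made up for this illustration and is not a PyTorch API.

```python
import math


def chunk_sizes(batch, num_gpus):
    """Per-GPU sample counts, assuming torch.chunk-style splitting (illustrative)."""
    size = math.ceil(batch / num_gpus)
    sizes = []
    remaining = batch
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


print(chunk_sizes(30, 2))  # [15, 15]
print(chunk_sizes(30, 4))  # [8, 8, 8, 6]
print(chunk_sizes(5, 4))   # [2, 2, 1] -> fewer chunks than GPUs
```

With batch_size=30 as in the example, two GPUs each see 15 samples per step, while the "Outside" line still reports the full batch of 30. A batch much smaller than a multiple of the GPU count leaves some GPUs underused, which is one reason small workloads may not speed up.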

torch.nn.parallel.DistributedDataParallel

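This is the approach the official documentation recommends. It is a framework designed for distributed training, but it also works on a single machine, where it outperforms the previous method. The documentation describes its advantages as follows: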

Each process maintains its own optimizer and performs a complete optimization step with each iteration. While this may appear redundant, since the gradients have already been gathered together and averaged across processes and are thus the same for every process, this means that no parameter broadcast step is needed, reducing time spent transferring tensors between nodes.
Each process contains an independent Python interpreter, eliminating the extra interpreter overhead and “GIL-thrashing” that comes from driving several execution threads, model replicas, or GPUs from a single Python process. This is especially important for models that make heavy use of the Python runtime, including models with recurrent layers or many small components.
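
The first point can be illustrated without any GPU. The toy simulation below is plain Python, not the real NCCL all-reduce: two "processes" start from the same weight, compute gradients on different data shards, average them, and each applies its own optimizer step. All names (`local_gradient`, `all_reduce_mean`, `ddp_step`) are hypothetical, and the model is a single scalar weight fitted to y = 3x.

```python
def local_gradient(weight, shard):
    """Gradient of the mean squared error of y = weight * x on one shard."""
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same average."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def ddp_step(weights, shards, lr=0.01):
    """One step: local backward, gradient averaging, independent SGD updates."""
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    averaged = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, averaged)]


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
shards = [data[0::2], data[1::2]]                  # one shard per worker
weights = [0.0, 0.0]                               # identical initial replicas

for _ in range(200):
    weights = ddp_step(weights, shards)

print(weights)
```

Because every worker applies the same averaged gradient to the same starting weight, the replicas stay identical after each step, so no parameter broadcast is needed.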

Here, too, is an example that can be run directly.

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.distributed import DistributedSampler

# 1) Initialize the process group
torch.distributed.init_process_group(backend="nccl")

input_size = 5
output_size = 2
batch_size = 30
data_size = 90

# 2) Bind each process to its own GPU
# (on a single machine the global rank equals the local rank)
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        # 'cuda' resolves to the device selected by set_device above
        self.data = torch.randn(length, size).to('cuda')

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

dataset = RandomDataset(input_size, data_size)
# 3) Use DistributedSampler so each process reads a different slice of the data
rand_loader = DataLoader(dataset=dataset,
                         batch_size=batch_size,
                         sampler=DistributedSampler(dataset))

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("  In Model: input size", input.size(), "output size", output.size())
        return output

model = Model(input_size, output_size)
# 4) Move the model to its GPU before wrapping it
model.to(device)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # 5) Wrap the model
    model = torch.nn.parallel.DistributedDataParallel(model,
                                                      device_ids=[local_rank],
                                                      output_device=local_rank)

for data in rand_loader:
    input_var = data.to(device)
    output = model(input_var)
    print("Outside: input size", input_var.size(), "output_size", output.size())
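
Step 3 is what keeps the processes from training on the same samples. The following sketch mirrors how DistributedSampler partitions indices when shuffling is off and drop_last is False: the index list is padded (by wrapping around) to a multiple of the number of processes, then each rank takes every world_size-th index. The function names `shard_indices` and `batch_sizes` are hypothetical. The example above keeps the default shuffle=True, so the actual indices are a permutation, but the per-process counts are the same.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Indices one rank would read (unshuffled, padded by wrap-around)."""
    num_samples = math.ceil(dataset_len / world_size)
    total_size = num_samples * world_size
    padded = [i % dataset_len for i in range(total_size)]
    return padded[rank:total_size:world_size]


def batch_sizes(num_samples, batch_size):
    """Sizes of the batches a DataLoader yields for num_samples items."""
    return [min(batch_size, num_samples - start)
            for start in range(0, num_samples, batch_size)]


rank0 = shard_indices(90, 2, 0)
rank1 = shard_indices(90, 2, 1)
print(len(rank0), len(rank1))           # 45 45
print(batch_sizes(len(rank0), 30))      # [30, 15]
print(shard_indices(10, 3, 1))          # [1, 4, 7, 0] (index 0 reused as padding)
```

With data_size=90, batch_size=30 and two processes, each process therefore handles 45 samples per epoch, in one batch of 30 and one of 15. Unlike DataParallel, batch_size here is per process, not the total across GPUs.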

Running the program from the command line

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 torch_ddp.py
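
The launcher starts one process per GPU and tells each process who it is. Newer PyTorch releases recommend torchrun over torch.distributed.launch; torchrun exports the RANK, LOCAL_RANK and WORLD_SIZE environment variables to every process. The sketch below shows one way a script might read them, falling back to single-process values when they are absent. The helper name `read_dist_env` is an assumption made for this illustration and is not a PyTorch function.

```python
import os


def read_dist_env(environ=os.environ):
    """Read the launcher-provided rank info, defaulting to a single process."""
    return {
        "rank": int(environ.get("RANK", 0)),              # global process index
        "local_rank": int(environ.get("LOCAL_RANK", 0)),  # index on this machine
        "world_size": int(environ.get("WORLD_SIZE", 1)),  # total process count
    }


info = read_dist_env()
print(info)
```

On multiple machines, the local rank is the value to pass to torch.cuda.set_device, since the global rank can exceed the number of GPUs on one machine.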

