A Summary of Selecting Single-GPU and Multi-GPU Training in PyTorch and Saving/Loading Models (Including CPU)

2022-10-05 11:19:55

| Updated: 2020.10.25 | fjy2035@foxmail.com

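Preface: This post covers the basics of single-GPU and multi-GPU usage, together with saving and loading trained models. More complex setups can be derived by adapting the code here.
To check GPU utilization and which users are running on the GPUs, use (watch -n [time] nvidia-smi) and (gpustat -cpu):
https://github.com/wookayin/gpustat
https://pypi.org/project/gpustat/
To stop a process that is occupying a GPU on the server: kill -9 PID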
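
The same information can be collected from Python. The sketch below is an illustration only: it assumes nvidia-smi is on the PATH and that every memory field is a plain integer (some devices report "[N/A]", which this parser does not handle). The names parse_gpu_csv and query_gpus are hypothetical helpers, and the sample string is made up for the demonstration.

```python
import subprocess

# Query flags supported by nvidia-smi: one CSV row per GPU, no header, no units.
QUERY = ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_csv(text):
    """Turn the CSV rows printed by the query above into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus

def query_gpus():
    """Run nvidia-smi and parse its output; return [] when it is unavailable."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_csv(result.stdout)

# Made-up sample output, used so the sketch also runs on a machine without GPUs.
sample = "0, Tesla V100, 1024, 16160\n1, Tesla V100, 0, 16160\n"
for gpu in parse_gpu_csv(sample):
    print(gpu)
```

Calling query_gpus() on a GPU server returns the live values; picking the card with the smallest used_mib is one simple way to choose a value for CUDA_VISIBLE_DEVICES.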

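Note: During training and testing, the inputs, the labels, and the model being trained must all be loaded onto the GPU. For small models, multi-GPU parallelism can actually be slower; the GPU advantage only shows for large models whose batch_size is much larger than the number of GPUs (or whose hidden layers have been made wider or deeper). Increasing batch_size may lower prediction accuracy, in which case training for more epochs can help.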

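Because PyTorch initializes on GPU 0 by default and occupies some memory there, careless use can lead to out-of-memory errors. The discussion below covers three steps for both single-GPU and multi-GPU setups: selecting GPUs before training, saving and loading trained models, and loading models between GPU and CPU.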

1. Selecting a GPU in PyTorch - Single GPU

Calling model.cuda() directly places the model on a single GPU, which PyTorch takes to be GPU 0 by default:

import torch

model = Model()
if torch.cuda.is_available():
    model = model.cuda()

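There are two ways to select a single GPU directly:

In the shell: CUDA_VISIBLE_DEVICES=1 python main.py makes only GPU 1 visible; the other GPUs are unavailable. Inside the program, GPU 1 is renumbered as GPU 0, so still using cuda:1 raises "invalid device ordinal". The same applies to the options below.
In Python code (choose one of the two):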

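import os
import torch

# Option 1: restrict visibility; set this before the first CUDA call
os.environ["CUDA_VISIBLE_DEVICES"] = "1"  # the officially recommended way is "CUDA_VISIBLE_DEVICES"
model = Model()
if torch.cuda.is_available():
    model = model.cuda()  # uses the first visible GPU
    images = images.cuda()
    labels = labels.cuda()

# Option 2: define the device directly and name the starting GPU, e.g. "cuda:0".
# With CUDA_VISIBLE_DEVICES="1", physical GPU 1 is addressed as "cuda:0".
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # single-GPU run; with several GPUs this sets the starting index
net = self.model.to(device)  # equivalent to self.model.cuda()
images = self.images.to(device)
labels = self.labels.to(device)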
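
The renumbering caused by CUDA_VISIBLE_DEVICES can be sketched in a few lines. This is a minimal illustration, assuming the variable holds a comma-separated list of integer indices (it can also hold GPU UUIDs, which this helper does not handle); logical_to_physical is a hypothetical helper name, not a PyTorch function.

```python
def logical_to_physical(cuda_visible_devices):
    """Map the index seen inside the program to the physical GPU index."""
    physical_ids = [int(part) for part in cuda_visible_devices.split(",") if part.strip()]
    return {logical: physical for logical, physical in enumerate(physical_ids)}

print(logical_to_physical("1"))      # {0: 1}
print(logical_to_physical("0,1,3"))  # {0: 0, 1: 1, 2: 3}
```

With "1", the only valid device string is "cuda:0", which is why "cuda:1" fails with "invalid device ordinal". With "0,1,3", physical GPU 3 is reached as "cuda:2".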

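Note: "cuda:0" and "cuda" both mean that the starting device_id is 0, because the system counts from 0 by default. The starting position can be changed as needed, for example to "cuda:1". When CUDA_VISIBLE_DEVICES restricts visibility to physical GPU 1, that card is addressed as "cuda:0" or "cuda".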

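# Pick one; behavior can differ between torch versions
with torch.cuda.device(gpu_id):  # gpu_id is the GPU index; this is a context manager, so it only applies inside the with block
    model = model.cuda()
# or
torch.cuda.set_device(gpu_id)  # makes GPU gpu_id the current device for later .cuda() calls
# or
device = torch.device('cuda')  # a device object that refers to the current CUDA device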
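
As a hedged sketch of choosing a device index safely, the helper below checks the requested index against torch.cuda.device_count() and falls back to the CPU, which avoids the "invalid device ordinal" error when fewer cards are visible than expected. pick_device is a hypothetical name, not part of PyTorch.

```python
import torch

def pick_device(index=0):
    """Return cuda:<index> when that card is visible, otherwise the CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device("cuda:%d" % index)
    return torch.device("cpu")

device = pick_device(0)
x = torch.ones(2, 3).to(device)  # tensors follow the chosen device
print(device, x.device)
```

Falling back silently is convenient for scripts that must also run on a laptop; for long training jobs it may be preferable to raise an error instead, so that a misconfigured run does not quietly train on the CPU.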

Saving a model trained on a single GPU (choose one of the two)

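# Option 1: save the weights together with the epoch counter
state = {'model': self.model.state_dict(), 'epoch': ite}
torch.save(state, self.model.name())
# or
# Option 2: save the weights directly
torch.save(self.model.state_dict(), 'Mymodel.pth')  # written to the current directory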
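
When training is meant to be resumed, the dictionary form is often extended with the optimizer state. The sketch below is an assumption layered on top of the two-key dictionary above: the "optimizer" entry, the helper names save_checkpoint and load_checkpoint, and the toy nn.Linear model are all illustrative, not part of the original code.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # hypothetical stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def save_checkpoint(path, model, optimizer, epoch):
    """Save weights, optimizer state and the epoch counter in one file."""
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, path)

def load_checkpoint(path, model, optimizer):
    """Restore weights and optimizer state; return the saved epoch."""
    checkpoint = torch.load(path, map_location="cpu")
    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
    return checkpoint["epoch"]

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pth")
save_checkpoint(path, model, optimizer, epoch=3)
resumed_epoch = load_checkpoint(path, model, optimizer)
print("resumed from epoch", resumed_epoch)
```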

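Testing: loading a single-GPU trained model on a single GPU or on the CPU (choose one of the three)
For details, see Section 3: [Loading model parameters between GPU and CPU] (3. Selecting GPUs for Training in PyTorch - Further Details (Including CPU))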

# Option 1: load the checkpoint dictionary
checkpoint = torch.load(self.model.name())
self.model.load_state_dict(checkpoint['model'])
# or
# Option 2: load the weights directly
self.model.load_state_dict(torch.load('Mymodel.pth'))
# or
# Option 3: load on GPU or CPU
if torch.cuda.is_available():  # GPU
    self.model.load_state_dict(torch.load('Mymodel.pth'))
else:  # CPU; the officially recommended way to load on the CPU
    checkpoint = torch.load(self.model.name(), map_location=lambda storage, loc: storage)
    self.model.load_state_dict(checkpoint['model'])
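
The three loading options can be folded into one function. The following is a minimal sketch, assuming a checkpoint is either a raw state_dict or a dictionary with the weights under the key 'model' as in the saving code above; load_weights and the toy nn.Linear model are hypothetical. It would misfire in the unlikely case that a raw state_dict has a parameter whose full key is exactly "model".

```python
import os
import tempfile

import torch
import torch.nn as nn

def load_weights(model, path, device):
    """Load either checkpoint layout onto `device` and return the model."""
    checkpoint = torch.load(path, map_location=device)
    # Wrapped layout: {'model': state_dict, 'epoch': ...}; raw layout: the state_dict itself.
    state_dict = checkpoint["model"] if "model" in checkpoint else checkpoint
    model.load_state_dict(state_dict)
    return model.to(device)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
source = nn.Linear(4, 2)  # hypothetical stand-in for the trained model
folder = tempfile.mkdtemp()
raw_path = os.path.join(folder, "raw.pth")
wrapped_path = os.path.join(folder, "wrapped.pth")
torch.save(source.state_dict(), raw_path)
torch.save({"model": source.state_dict(), "epoch": 5}, wrapped_path)

from_raw = load_weights(nn.Linear(4, 2), raw_path, device)
from_wrapped = load_weights(nn.Linear(4, 2), wrapped_path, device)
print(from_raw.weight.device, from_wrapped.weight.device)
```

Passing the target device as map_location works the same way on GPU and CPU machines, so the if/else on torch.cuda.is_available() is only needed once, when the device is chosen.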

2. Selecting GPUs in PyTorch - Multiple GPUs (DataParallel)

Again there are two ways to select multiple GPUs directly:

In the shell: CUDA_VISIBLE_DEVICES=0,1,3 python main.py
In Python code:

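import os
import torch

# gpu_ids = [0, 1, 3]
# or os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"
# or os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, [0, 1, 3]))
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3"  # CUDA_VISIBLE_DEVICES lists the cards the Python program can detect
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # with several GPUs this sets the starting index
# Without the if below nothing fails, but training may end up on a single GPU
if torch.cuda.device_count() > 1:  # number of GPUs available, or: if len(gpu_ids) > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # self.model = torch.nn.DataParallel(self.model, device_ids=gpu_ids)
    # (once CUDA_VISIBLE_DEVICES is set, device_ids must use the renumbered indices, here [0, 1, 2])
    self.model = torch.nn.DataParallel(self.model)  # declare all visible devices
net = self.model.to(device)  # starting from the chosen index, put the model on GPU or CPU
images = self.images.to(device)  # model and training data both go on the primary device
labels = self.labels.to(device)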

Note: For multi-GPU training, model = torch.nn.DataParallel(model) on its own uses every visible card by default.

Saving a model trained on multiple GPUs (choose one of the three)

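# Option 1: unwrap first, then save a checkpoint without the "module." prefix
if isinstance(self.model, torch.nn.DataParallel):  # check whether the model is parallelized
    self.model = self.model.module  # note: self.model stays unwrapped afterwards
state = {'model': self.model.state_dict(), 'epoch': ite}
torch.save(state, self.model.name())  # no "module." prefix
# or
# Option 2: save the inner module's weights
if isinstance(self.model, torch.nn.DataParallel):
    torch.save(self.model.module.state_dict(), 'Mymodel')  # no "module." prefix
else:
    torch.save(self.model.state_dict(), 'Mymodel')  # no "module." prefix
# or
# Option 3: save directly
torch.save(self.model.state_dict(), 'Mymodel.pth')  # keys carry the "module." prefix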
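
The difference between the "module." and no-prefix variants is easy to inspect. The sketch below uses a toy nn.Linear as a stand-in for the real network, and unwrap is a hypothetical helper name; it runs on a CPU-only machine as well, because constructing DataParallel does not require a GPU.

```python
import torch
import torch.nn as nn

def unwrap(model):
    """Return the inner network if the model is wrapped in DataParallel."""
    return model.module if isinstance(model, nn.DataParallel) else model

net = nn.Linear(4, 2)            # hypothetical stand-in for the real network
wrapped = nn.DataParallel(net)

print(list(wrapped.state_dict().keys()))          # keys start with "module."
print(list(unwrap(wrapped).state_dict().keys()))  # plain keys
```

Saving unwrap(model).state_dict() always produces a file without the prefix, so the loading side does not have to care whether training used one GPU or several.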

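Testing: loading a multi-GPU trained model on a single GPU, multiple GPUs, or the CPU (choose one of the three)
For details, see Section 3: [Loading model parameters between GPU and CPU] (3. Selecting GPUs for Training in PyTorch - Further Details (Including CPU))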

from collections import OrderedDict

# ################## Method 1: add
net = torch.nn.DataParallel(net)  # wrap the network so that its keys expect "module."
net.load_state_dict(torch.load("model/cnn_train.pth"))  # then load the model

# ################## Method 2: remove (choose one of the two)
net.load_state_dict({k.replace('module.', ''): v for k, v in torch.load("model/cnn_train.pth").items()})
# or
state_dict = torch.load("model/cnn_train.pth")  # the model folder under the current path
new_state_dict = OrderedDict()  # new OrderedDict whose keys do not contain `module.`
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`; or name = k.replace('module.', '')
    new_state_dict[name] = v
net.load_state_dict(new_state_dict)
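
Both removal variants above have an edge case: k[7:] cuts seven characters even from keys that have no prefix, and k.replace('module.', '') also removes the text when it occurs in the middle of a key. A slightly more defensive sketch, with strip_module_prefix as a hypothetical helper name and integers standing in for tensors, only strips a leading prefix:

```python
from collections import OrderedDict

def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading `prefix` removed from each key."""
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned

# Integers stand in for the weight tensors of a real checkpoint.
saved = OrderedDict([("module.conv1.weight", 1),
                     ("module.fc1.module.bias", 2),
                     ("fc3.weight", 3)])
print(list(strip_module_prefix(saved)))
# ['conv1.weight', 'fc1.module.bias', 'fc3.weight']
```

The result can be passed to net.load_state_dict in the same way as new_state_dict above.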

3. Selecting GPUs for Training in PyTorch - Further Details (Including CPU)

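DataParallel: torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)
(1) DataParallel implements data parallelism at the module level and returns a new model; that is, a copy of the model is kept on each GPU.
(2) DataParallel automatically splits the input tensor and distributes the pieces to the model replicas on the GPUs, so each GPU computes part of the tensor. The input batch_size should therefore be larger than the number of GPUs.
(3) After each replica finishes its computation, DataParallel gathers and merges the results, which can then be returned to one GPU for further processing.
Note: Multi-GPU training wraps the network in DataParallel, which adds a module layer on top of the original network structure.
module: the model to run in parallel across GPUs
device_ids: GPU indices (default: all GPUs)
output_device: where the output is placed (default: device_ids[0], which is cuda:0 when all GPUs are used)
dim: the dimension along which tensors are scattered (default: 0)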
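
Point (2) can be made concrete with a little arithmetic. The sketch below assumes the batch is split along dim 0 the way torch.chunk splits a tensor (equal chunks of size ceil(batch_size / num_gpus), with a smaller final chunk); scatter_sizes is a hypothetical helper that only computes the sizes and does not touch any GPU.

```python
import math

def scatter_sizes(batch_size, num_gpus):
    """Per-GPU sample counts when a batch is chunked along dim 0."""
    chunk = math.ceil(batch_size / num_gpus)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

print(scatter_sizes(64, 4))  # [16, 16, 16, 16]
print(scatter_sizes(10, 4))  # [3, 3, 3, 1]
print(scatter_sizes(2, 4))   # [1, 1]
```

In the last case only two of the four GPUs receive data, which is why batch_size should comfortably exceed the number of GPUs.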

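gpu_ids = [3, 4, 6, 7]  # or os.environ["CUDA_VISIBLE_DEVICES"] = "3,4,6,7" (then use device_ids=[0, 1, 2, 3] and "cuda:0")
# DataParallel expects the model's parameters on device_ids[0], so that card is the primary device
device = torch.device("cuda:%d" % gpu_ids[0] if torch.cuda.is_available() else "cpu")
# Without the if below nothing fails, but training may end up on a single GPU
if torch.cuda.device_count() > 1:  # number of GPUs available, or: if len(gpu_ids) > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    self.model = torch.nn.DataParallel(self.model, device_ids=gpu_ids)  # declare the devices to use

net = self.model.to(device)  # the model goes on the primary device
images = self.images.to(device)  # the training data goes on the primary device
labels = self.labels.to(device)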

During training, to call a submodule of the model:

model = Net()  # on a single GPU
out = model.fc(input)

model = Net()  # under DataParallel, call a layer defined in the wrapped network
model = torch.nn.DataParallel(model)
out = model.module.fc(input)
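
A small check makes the extra module layer visible. The Net class below is a hypothetical toy network with a single layer named fc, used only for illustration; the input is created on the same device as the layer because DataParallel may move the model to the GPU when exactly one card is present.

```python
import torch
import torch.nn as nn

class Net(nn.Module):  # hypothetical toy network with one layer named fc
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x):
        return self.fc(x)

plain = Net()
wrapped = nn.DataParallel(plain)

has_fc = hasattr(wrapped, "fc")               # the wrapper does not expose fc directly
same_layer = wrapped.module.fc is plain.fc    # the layer lives on the inner module
x = torch.zeros(1, 4, device=plain.fc.weight.device)
out = wrapped.module.fc(x)
print(has_fc, same_layer, out.shape)
```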

During testing, loading model parameters between GPU and CPU:
Reference post: [Loading model parameters between GPU and CPU] (https://blog.csdn.net/bc521bc/article/details/85623515)

# Assume only the model parameters (model.state_dict()) were saved, to a file named modelparameters.pth, and model = Net()
# cpu -> cpu or gpu -> gpu:
checkpoint = torch.load('modelparameters.pth')
model.load_state_dict(checkpoint)
# cpu -> gpu 1
torch.load('modelparameters.pth', map_location=lambda storage, loc: storage.cuda(1))
# gpu 1 -> gpu 0
torch.load('modelparameters.pth', map_location={'cuda:1': 'cuda:0'})
# gpu -> cpu
torch.load('modelparameters.pth', map_location=lambda storage, loc: storage)
# special case
torch.load(opt.model, map_location='cpu')

4. Complete Code Example

# coding: utf-8
import torch
import torchvision
import torchvision.transforms as transforms
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import matplotlib.pyplot as plt
from torch.autograd import Variable
from torch.backends import cudnn
# For multi-card training on a server
import os
from collections import OrderedDict

# Select the GPU indices visible to the program.
# Only GPUs 0, 1 and 3 are visible, the other GPUs are unavailable, and the first visible GPU is numbered 0.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,3'
# torch.cuda.current_device()
# torch.cuda.initialized = True

# Define the data transform
transform = transforms.Compose([
    transforms.ToTensor(),  # (H,W,C) -> (C,H,W), values scaled to [0, 1.]
    # transforms.Resize((32, 32)),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # normalize
])

# Load the data
train_set = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=10, shuffle=True, num_workers=0)

test_set = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=10, shuffle=False, num_workers=0)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

# ############################################################ Define the network: a simple CNN
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        h1 = self.pool(F.relu(self.conv1(x)))
        h2 = self.pool(F.relu(self.conv2(h1)))
        h2 = h2.view(-1, 16 * 5 * 5)
        h3 = self.fc1(h2)
        h4 = self.fc2(h3)
        h5 = self.fc3(h4)
        return h5

# Instantiate the model
net = CNN()

# Train on (multiple) GPUs
# Define the device; "cuda:0" only means that the starting device_id is 0
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # DataParallel expects the model on device_ids[0]
print("GPU or CPU is available: ", device)
if torch.cuda.device_count() > 1:  # multi-gpu
    print('Lets use', torch.cuda.device_count(), 'GPUs!')
    net = nn.DataParallel(net)
net.to(device)

# Define the loss function and the optimizer
criterion = nn.CrossEntropyLoss()  # classification criterion
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

# ############################################################ Training
net.train()  # training mode: BatchNorm and Dropout layers are active and update the model
for epoch in range(1):  # number of passes over the dataset
    running_loss = 0.
    # enumerate pairs each item of an iterable (list, tuple, string, ...) with its index; commonly used in for loops
    for i, data in enumerate(train_loader, 0):
        images, labels = data  # get the inputs
        # the data fed to the network must be copied to the GPU
        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()  # clear the gradients left from the previous step
        # the classic four steps
        outs = net(images)  # forward pass
        loss = criterion(outs, labels)  # compute the loss
        loss.backward()  # backward pass, compute the current gradients
        optimizer.step()  # optimize, update the network parameters from the gradients
        # print the loss
        running_loss += loss.item()
        if i % 2000 == 1999:  # print every 2000 mini-batches
            print('[epoch %d, iter %d] loss : %.3f' % (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.
print('Finish Training!')
os.makedirs('model', exist_ok=True)  # torch.save does not create missing directories
torch.save(net.state_dict(), 'model/cnn_train.pth')  # multi-gpu keys have module; single-gpu or cpu keys have no module
print('Finish save the model!')

# ############################################################ Testing
# os.environ['CUDA_VISIBLE_DEVICES'] = '0'
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')  # train: device can be cuda:0-n; test: device is cuda:0
print('Test is running:', device)

# ################## load 'model.pth'
# def load_gpu_cpu_pth(self, net, path):
# single-gpu, multi-gpu and cpu: load 'model.pth' and convert the key layout automatically
if torch.cuda.is_available():
    state_dict = torch.load("model/cnn_train.pth")
else:
    state_dict = torch.load("model/cnn_train.pth", map_location=lambda storage, loc: storage)
new_state_dict = OrderedDict()  # new OrderedDict whose keys do not contain `module.`
if isinstance(net, torch.nn.DataParallel):  # check whether net is parallelized
    print('\nThe source model is isinstance in test')
    if list(state_dict.keys())[0][:6] == 'module':
        print("The loaded model  always contains 'module'")
        net.load_state_dict(state_dict)  # load directly: the saved keys have module
    else:
        print("The loaded model is adding 'module'...")
        # Method 1: add. The saved keys have no module and net is already wrapped,
        # so load into the inner network instead of wrapping a second time.
        net.module.load_state_dict(state_dict)
    print("Finish loading 'model.pth'\n")
else:
    print('\nThe source model is not isinstance in test')
    if list(state_dict.keys())[0][:6] == 'module':
        print("The loaded model is removing 'module'")
        # Method 2: remove (choose one of the two)
        # net.load_state_dict({k.replace('module.', ''): v for k, v in torch.load("model/cnn_train.pth").items()})
        for k, v in state_dict.items():
            name = k[7:]  # remove `module.`; or name = k.replace('module.', '')
            new_state_dict[name] = v
        net.load_state_dict(new_state_dict)
    else:
        print("The loaded model contains no 'module'")
        net.load_state_dict(state_dict)  # load directly: the saved keys have no module
    print("Finish loading 'model.pth'\n")

# ################## test
net.to(device)
net.eval()  # evaluation mode: BatchNorm and Dropout are frozen and no longer alter the model

correct_test = 0
total_test = 0
for epoch in range(1):  # range(start, stop[, step]) starts at 0 by default; range(0) is empty
    for data in test_loader:
        images_test, labels_test = data
        # the data fed to the network under test must be copied to the GPU
        images_test = images_test.to(device)
        labels_test = labels_test.to(device)
        # Evaluate the predictions.
        # Even with net.eval(), validation sometimes ran out of memory, because net.eval() does not
        # stop autograd from recording the graph. No loss is computed here, and torch.no_grad() turns recording off.
        with torch.no_grad():
            outs_test = net(images_test)
        _, predict = torch.max(outs_test.data, 1)
        total_test += labels_test.size(0)
        correct_test += (predict == labels_test).sum().item()
print('Accuracy of the network on the 10000 test images: %d %%' % (100 * correct_test / total_test))
print('Finish Testing!')
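
The test loop can also be packaged as a reusable function. This is a hedged sketch rather than the post's own code: evaluate is a hypothetical helper, and the fake one-batch loader with nn.Identity as the "model" exists only so that the example runs without downloading CIFAR10.

```python
import torch
import torch.nn as nn

def evaluate(model, loader, device):
    """Return the fraction of correctly classified samples in `loader`."""
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():  # no graph is recorded, which keeps memory use low
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    return correct / max(total, 1)

# Fake data: the "logits" are the rows of an identity matrix, so row i predicts class i.
fake_loader = [(torch.eye(3), torch.tensor([0, 1, 0]))]
accuracy = evaluate(nn.Identity(), fake_loader, torch.device("cpu"))
print("accuracy: %.3f" % accuracy)
```

With the objects from the complete example, the call would be evaluate(net, test_loader, device).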

5. Further Reading: Other Blog Posts

[1] Loading a GPU-trained model on the CPU and a CPU-trained model on the GPU:
https://www.ptorch.com/news/74.html
[2] Single-machine multi-card parallel training, multi-machine multi-GPU training, and using DistributedDataParallel to fix unbalanced GPU memory usage:
https://blog.csdn.net/weixin_47196664/a