Configuring data shuffling with DDP
When using DDP, you need to pass a sampler to the DataLoader (torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0, drop_last=False)). shuffle defaults to True, but according to PyTorch's DistributedSampler implementation:
```python
def __iter__(self) -> Iterator[T_co]:
    if self.shuffle:
        # deterministically shuffle based on epoch and seed
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)
        indices = torch.randperm(len(self.dataset), generator=g).tolist()  # type: ignore
    else:
        indices = list(range(len(self.dataset)))  # type: ignore
```
the seed used to generate the random indices depends on the current epoch, so set_epoch must be called manually during training to get a genuinely different shuffle each epoch:
```python
for epoch in range(start_epoch, n_epochs):
    if is_distributed:
        sampler.set_epoch(epoch)
    train(loader)
```
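The effect of set_epoch can be checked directly. A minimal sketch: passing num_replicas and rank explicitly means no process group needs to be initialized, so this runs on a single machine (the toy dataset of 8 samples is illustrative):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset of 8 samples; explicit num_replicas/rank avoids needing
# torch.distributed.init_process_group for this demonstration.
dataset = TensorDataset(torch.arange(8))
sampler = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=True, seed=0)

sampler.set_epoch(0)
order_epoch0 = list(sampler)
sampler.set_epoch(0)
order_epoch0_again = list(sampler)  # same epoch -> same shuffle
sampler.set_epoch(1)
order_epoch1 = list(sampler)        # new epoch -> new shuffle

print(order_epoch0 == order_epoch0_again)  # shuffling is deterministic per epoch
print(order_epoch0 == order_epoch1)        # but differs across epochs
```

Without the set_epoch calls, every epoch would reuse `seed + 0` and replay the same order.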
Degraded results when increasing the batch size with DDP
large batch size:
Theoretical advantages:
- the influence of noise in the data may shrink, which may make it easier to approach the optimum;
Drawbacks and problems:
- it lowers the variance of the gradient; (in theory, for convex optimization problems a lower gradient variance gives better optimization; in practice, however, Keskar et al. showed that increasing the batch size leads to worse generalization);
- for non-convex optimization problems the loss function contains many local optima; with a small batch size the gradient noise can help jump out of a local optimum, whereas with a large batch size training may get stuck in one and never escape.
Remedies:
- Increase the learning_rate; but this can cause problems: using a very large learning_rate right from the start of training may prevent the model from converging (https://arxiv.org/abs/1609.04836)
- Use warmup (https://arxiv.org/abs/1706.02677)
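The second paper above (Goyal et al.) pairs warmup with the linear scaling rule: when the global minibatch size is multiplied by k, multiply the learning rate by k. A minimal sketch; the base values (0.1 at batch size 256) are illustrative, not from the text:

```python
def scaled_lr(base_lr: float, base_batch_size: int, global_batch_size: int) -> float:
    """Linear scaling rule: scale the learning rate by the same factor
    as the global batch size (Goyal et al., https://arxiv.org/abs/1706.02677)."""
    return base_lr * global_batch_size / base_batch_size

# e.g. going from 1 GPU at batch 256 to 8 GPUs at batch 256 each:
print(scaled_lr(0.1, 256, 8 * 256))  # 0.8
```

With DDP the relevant quantity is the global batch size (per-GPU batch times world size), which is why the scaled rate can get large enough to need warmup.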
warmup
Using a very large learning_rate right at the start of training may cause it not to converge. The idea of warmup is to start training with a small learning rate and gradually increase it until it reaches the base learning_rate, then continue training with some decay schedule (e.g. CosineAnnealingLR).
```python
# copy from https://github.com/ildoonet/pytorch-gradual-warmup-lr/blob/master/warmup_scheduler/scheduler.py
from torch.optim.lr_scheduler import _LRScheduler
from torch.optim.lr_scheduler import ReduceLROnPlateau


class GradualWarmupScheduler(_LRScheduler):
    """ Gradually warm-up(increasing) learning rate in optimizer.
    Proposed in 'Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour'.

    Args:
        optimizer (Optimizer): Wrapped optimizer.
        multiplier: target learning rate = base lr * multiplier if multiplier > 1.0. if multiplier = 1.0, lr starts from 0 and ends up with the base_lr.
        total_epoch: target learning rate is reached at total_epoch, gradually
        after_scheduler: after target_epoch, use this scheduler(eg. ReduceLROnPlateau)
    """

    def __init__(self, optimizer, multiplier, total_epoch, after_scheduler=None):
        self.multiplier = multiplier
        if self.multiplier < 1.:
            raise ValueError('multiplier should be greater than or equal to 1.')
        self.total_epoch = total_epoch
        self.after_scheduler = after_scheduler
        self.finished = False
        super(GradualWarmupScheduler, self).__init__(optimizer)

    def get_lr(self):
        if self.last_epoch > self.total_epoch:
            if self.after_scheduler:
                if not self.finished:
                    self.after_scheduler.base_lrs = [base_lr * self.multiplier for base_lr in self.base_lrs]
                    self.finished = True
                return self.after_scheduler.get_last_lr()
            return [base_lr * self.multiplier for base_lr in self.base_lrs]

        if self.multiplier == 1.0:
            return [base_lr * (float(self.last_epoch) / self.total_epoch) for base_lr in self.base_lrs]
        else:
            return [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]

    def step_ReduceLROnPlateau(self, metrics, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
        self.last_epoch = epoch if epoch != 0 else 1  # ReduceLROnPlateau is called at the end of epoch, whereas others are called at beginning
        if self.last_epoch <= self.total_epoch:
            warmup_lr = [base_lr * ((self.multiplier - 1.) * self.last_epoch / self.total_epoch + 1.) for base_lr in self.base_lrs]
            for param_group, lr in zip(self.optimizer.param_groups, warmup_lr):
                param_group['lr'] = lr
        else:
            if epoch is None:
                self.after_scheduler.step(metrics, None)
            else:
                self.after_scheduler.step(metrics, epoch - self.total_epoch)

    def step(self, epoch=None, metrics=None):
        if type(self.after_scheduler) != ReduceLROnPlateau:
            if self.finished and self.after_scheduler:
                if epoch is None:
                    self.after_scheduler.step(None)
                else:
                    self.after_scheduler.step(epoch - self.total_epoch)
                self._last_lr = self.after_scheduler.get_last_lr()
            else:
                return super(GradualWarmupScheduler, self).step(epoch)
        else:
            self.step_ReduceLROnPlateau(metrics, epoch)
```
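As an alternative to the class above, the same warmup-then-decay shape can be sketched with PyTorch's built-in SequentialLR (available since 1.10), chaining LinearLR warmup into CosineAnnealingLR. The model, epoch counts, and learning rate here are illustrative values:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # base learning rate

warmup_epochs, total_epochs = 5, 20
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # ramp up from 1% of the base lr to the base lr over warmup_epochs
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_epochs),
        # then decay the base lr with a cosine schedule
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs),
    ],
    milestones=[warmup_epochs],
)

lrs = []
for _ in range(total_epochs):
    optimizer.step()   # training step would go here
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)  # rises during warmup, then decays toward 0
```

This covers the multiplier = 1.0 case of the class above; reproducing multiplier > 1.0 (warming up past the base lr) would still need a custom scheduler like GradualWarmupScheduler.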
References:
https://github.com/ildoonet/pytorch-gradual-warmup-lr
https://aws.amazon.com/blogs/machine-learning/the-importance-of-hyperparameter-tuning-for-scaling-deep-learning-training-to-multiple-gpus/