Introduction to Deep Learning with PyTorch (Part 7): Comprehensive Hands-On Practice with the PyTorch Essentials


Preface
PyTorch provides two main features:
(1) An n-dimensional Tensor, similar to numpy arrays but able to run on GPUs.
(2) Automatic differentiation for building and training neural networks.
Our running example is a fully-connected ReLU network. The network has a single hidden layer and is trained with gradient descent to minimize the Euclidean distance between the network output and the true output.

Contents

Tensors


Warm-up: numpy


Before introducing PyTorch, we first implement the network using numpy.
Numpy provides an n-dimensional array object and many functions for manipulating these arrays. Numpy is a generic framework for scientific computing; it knows nothing about computation graphs, deep learning, or gradients. However, we can still fit a two-layer network to random data with numpy, as long as we manually implement the forward and backward passes through the network using numpy operations.
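As a quick reference (this derivation is my own addition, written to match the variable names in the code below), the two-layer network, its loss, and the hand-written backward pass are, in LaTeX notation:

\hat{y} = \max(0,\, x w_1)\, w_2, \qquad L = \sum_{i,j} (\hat{y}_{ij} - y_{ij})^2

\frac{\partial L}{\partial \hat{y}} = 2(\hat{y} - y), \quad
\frac{\partial L}{\partial w_2} = h_{\mathrm{relu}}^{\top}\,\frac{\partial L}{\partial \hat{y}}, \quad
\frac{\partial L}{\partial h_{\mathrm{relu}}} = \frac{\partial L}{\partial \hat{y}}\, w_2^{\top}, \quad
\frac{\partial L}{\partial w_1} = x^{\top}\Big(\frac{\partial L}{\partial h_{\mathrm{relu}}} \odot \mathbf{1}[h > 0]\Big)

These are exactly the quantities grad_y_pred, grad_w2, grad_h_relu, grad_h, and grad_w1 computed inside the loop.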

# -*- coding: utf-8 -*-
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

PyTorch: Tensors


Numpy is a great framework, but it cannot utilize GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or greater, so numpy alone is not enough for today's deep learning workloads.
Here we introduce the most fundamental PyTorch concept: the Tensor. A PyTorch Tensor is conceptually identical to a numpy array: a Tensor is an n-dimensional array, and PyTorch provides many functions for operating on these Tensors. Like numpy arrays, PyTorch Tensors know nothing about deep learning, computation graphs, or gradients; they are a generic tool for scientific computing.
Unlike numpy, however, PyTorch Tensors can utilize GPUs to accelerate their numeric computations. To run a Tensor on the GPU, we simply cast it to the CUDA data type.
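As a small aside (this snippet is my own addition, not part of the original example), you can pick the data type at run time depending on whether a GPU is actually available:

import torch

# Choose the CUDA tensor type only when a GPU is present (pre-0.4 style, as used throughout this article).
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
x = torch.randn(3, 3).type(dtype)  # lives on the GPU if one was found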
Here we use PyTorch Tensors to fit the same two-layer network to random data. Like the numpy example above, we need to manually implement the forward and backward passes through the network.

# -*- coding: utf-8 -*-
import torch

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in).type(dtype)
y = torch.randn(N, D_out).type(dtype)

# Randomly initialize weights
w1 = torch.randn(D_in, H).type(dtype)
w2 = torch.randn(H, D_out).type(dtype)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss)

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

Autograd


PyTorch: Variables and autograd


In the examples above, we had to manually implement both the forward and backward passes of our network. Manually implementing the backward pass is not a big deal for a small two-layer network, but it quickly gets very hairy for large, complex networks.
Fortunately, we can use automatic differentiation to automate the computation of the backward pass. The autograd package in PyTorch provides exactly this functionality. When using autograd, the forward pass of your network defines a computational graph; nodes in the graph are Tensors, and edges are functions that produce output Tensors from input Tensors. Backpropagating through this graph then lets you compute gradients with very little effort.
This sounds complicated, but it is quite simple to use in practice. We wrap our PyTorch Tensors in Variable objects; a Variable represents a node in the computational graph. If x is a Variable, then x.data is a Tensor, and x.grad is another Variable holding the gradient of x with respect to some scalar value.
PyTorch Variables have the same API as PyTorch Tensors: (almost) any operation you can perform on a Tensor also works on Variables. The difference is that using Variables defines a computational graph, which allows gradients to be computed automatically.
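Before the full network, here is a minimal standalone sketch (my own addition, using the same pre-0.4 Variable API as the rest of this article) of what .data, .grad, and backward() do:

import torch
from torch.autograd import Variable

# Wrap a Tensor in a Variable and ask autograd to track gradients for it.
x = Variable(torch.ones(2, 2), requires_grad=True)

# Build a tiny graph: y is a scalar that depends on x.
y = (3 * x * x).sum()

# Backpropagate: computes dy/dx and accumulates it into x.grad.
y.backward()

print(x.data)   # the underlying Tensor holding the values of x
print(x.grad)   # a Variable holding dy/dx, here 6 * x (all entries equal to 6)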
Below we use PyTorch Variables and autograd to implement our two-layer network; now we no longer need to manually implement the backward pass through the network.

# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable

dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs, and wrap them in Variables.
# Setting requires_grad=False indicates that we do not need to compute gradients
# with respect to these Variables during the backward pass.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
# Setting requires_grad=True indicates that we want to compute gradients with
# respect to these Variables during the backward pass.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass: compute predicted y using operations on Variables; these
    # are exactly the same operations we used to compute the forward pass using
    # Tensors, but we do not need to keep references to intermediate values since
    # we are not implementing the backward pass by hand.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss using operations on Variables.
    # Now loss is a Variable of shape (1,) and loss.data is a Tensor of shape
    # (1,); loss.data[0] is a scalar value holding the loss.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Variables with requires_grad=True.
    # After this call w1.grad and w2.grad will be Variables holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent; w1.data and w2.data are Tensors,
    # w1.grad and w2.grad are Variables and w1.grad.data and w2.grad.data are
    # Tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()

PyTorch: Defining new autograd functions


Under the hood, each primitive autograd operator is really two functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar value, and computes the gradient of the input Tensors with respect to that same scalar value.
In PyTorch we can easily define our own autograd operator by defining a subclass of torch.autograd.Function and implementing the forward and backward functions. We can then use our new autograd operator by constructing an instance and calling it like a function, passing Variables containing input data.
In this example we define our own custom autograd function for performing the ReLU nonlinearity, and use it to implement our two-layer network.

# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable


class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """

    def forward(self, input):
        """
        In the forward pass we receive a Tensor containing the input and return a
        Tensor containing the output. You can cache arbitrary Tensors for use in the
        backward pass using the save_for_backward method.
        """
        self.save_for_backward(input)
        return input.clamp(min=0)

    def backward(self, grad_output):
        """
        In the backward pass we receive a Tensor containing the gradient of the loss
        with respect to the output, and we need to compute the gradient of the loss
        with respect to the input.
        """
        input, = self.saved_tensors
        grad_input = grad_output.clone()
        grad_input[input < 0] = 0
        return grad_input


dtype = torch.FloatTensor
# dtype = torch.cuda.FloatTensor # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in).type(dtype), requires_grad=False)
y = Variable(torch.randn(N, D_out).type(dtype), requires_grad=False)

# Create random Tensors for weights, and wrap them in Variables.
w1 = Variable(torch.randn(D_in, H).type(dtype), requires_grad=True)
w2 = Variable(torch.randn(H, D_out).type(dtype), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Construct an instance of our MyReLU class to use in our network
    relu = MyReLU()
    # Forward pass: compute predicted y using operations on Variables; we compute
    # ReLU using our custom autograd operation.
    y_pred = relu(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.data[0])

    # Use autograd to compute the backward pass.
    loss.backward()

    # Update weights using gradient descent
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data

    # Manually zero the gradients after updating weights
    w1.grad.data.zero_()
    w2.grad.data.zero_()

TensorFlow: Static Graphs


PyTorch autograd looks a lot like TensorFlow: in both frameworks we define a computational graph and use automatic differentiation to compute gradients. The biggest difference between the two is that TensorFlow's computational graphs are static while PyTorch uses dynamic computational graphs.
In TensorFlow, we define the computational graph once and then execute that same graph over and over again, possibly feeding different input data into it. In PyTorch, each forward pass defines a new computational graph.
Static graphs are nice because you can optimize the graph up front. For example, a framework might decide to fuse some graph operations for efficiency, or come up with a strategy for distributing the graph across many GPUs or machines. If you are reusing the same graph over and over, this potentially costly up-front optimization is amortized as the same graph is rerun many times.
One aspect where static and dynamic graphs differ is control flow. For some models we may wish to perform different computations for each data point; for example, a recurrent network might be unrolled for a different number of time steps for each data point, and this unrolling can be implemented as a loop. With a static graph, the loop construct has to be part of the graph, so TensorFlow provides operators such as tf.scan to embed loops into the graph. With dynamic graphs the situation is simpler: since we build the graph on the fly for each example, we can use ordinary imperative flow control to perform computation that differs for each input.
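To make this concrete, here is a minimal sketch (my own addition, not part of the original tutorial) of data-dependent control flow in PyTorch: an ordinary Python loop whose iteration count is chosen at run time, so each call builds a different graph that autograd can still differentiate through.

import random
import torch
from torch.autograd import Variable

x = Variable(torch.randn(4, 10), requires_grad=True)
w = Variable(torch.randn(10, 10), requires_grad=True)

# How many times we apply w is decided with plain Python at run time;
# every forward pass therefore builds a (possibly different) graph.
h = x
for _ in range(random.randint(1, 4)):
    h = h.mm(w).clamp(min=0)

loss = h.sum()
loss.backward()   # autograd follows whatever graph this particular pass built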
To contrast with the PyTorch autograd example above, here we use TensorFlow to fit the same simple two-layer network.

# -*- coding: utf-8 -*-
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(500):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)

nn module


PyTorch: nn


Computational graphs and autograd are a very powerful paradigm for defining complex operators and automatically taking derivatives; however, for large neural networks raw autograd can be a bit too low-level.
When building neural networks, we frequently arrange the computation into layers, some of which have learnable parameters that will be optimized during learning.
In TensorFlow, packages like Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.
In PyTorch, the nn package serves the same purpose. The nn package defines a set of Modules, which are roughly equivalent to neural network layers. A Module receives input Variables and computes output Variables, but may also hold internal state, such as Variables containing learnable parameters. The nn package also defines a set of loss functions that are commonly used when training neural networks.
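For instance, a single nn.Linear layer already illustrates this "internal state" idea (a small sketch of my own, not part of the original example):

import torch

# A Linear layer is a Module that holds learnable weight and bias Variables internally.
layer = torch.nn.Linear(3, 2)
print(layer.weight.size())   # torch.Size([2, 3])
print(layer.bias.size())     # torch.Size([2])

# model.parameters() iterates over exactly these internal Variables,
# which is what optimizers and loss.backward() operate on.
for p in layer.parameters():
    print(p.size())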
In the example below we use the nn package to implement our two-layer network.

# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Variables for its weight and bias.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function.
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Variable of input data to the Module and it produces
    # a Variable of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Variables containing the predicted and true
    # values of y, and the loss function returns a Variable containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Variables with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Variable, so
    # we can access its data and gradients like we did before.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data

PyTorch: optim


Up to this point we have updated the weights of our models by manually mutating the .data member of the Variables holding the learnable parameters. This is not a huge burden for simple optimization algorithms like stochastic gradient descent, but in practice we often train neural networks with more sophisticated optimizers such as AdaGrad, RMSProp, or Adam.
The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of the commonly used ones.
In the following example we will use the nn package to define our model as before, but we will optimize the model with the Adam algorithm provided by the optim package.

# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables.
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(size_average=False)

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Variables it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(500):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.data[0])

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the variables it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model
    # parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its
    # parameters
    optimizer.step()

PyTorch: Custom nn Modules


Sometimes you will want to specify models that are more complex than a sequence of existing Modules. For these cases you can define your own Module by subclassing nn.Module and defining a forward which receives input Variables and produces output Variables using other Modules or other autograd operations on Variables.
In this example we implement our two-layer network as a custom Module subclass.

# -*- coding: utf-8 -*-
import torch
from torch.autograd import Variable


class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Variable of input data and we must return
        a Variable of output data. We can use Modules defined in the constructor as
        well as arbitrary operators on Variables.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct our model by instantiating the class defined above
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.data[0])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

PyTorch: Control Flow + Weight Sharing


As an example of dynamic graphs and weight sharing, we implement a rather strange model: a fully-connected ReLU network that, on each forward pass, chooses a random number between 1 and 4 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.
For this model we use normal Python flow control to implement the loop, and we implement weight sharing among the innermost layers by simply reusing the same Module multiple times when defining the forward pass.
We implement this model as a Module subclass.

# -*- coding: utf-8 -*-
import random
import torch
from torch.autograd import Variable


class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs, and wrap them in Variables
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.data[0])

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Summary
This article covered the key building blocks of PyTorch and how to use them, which is essential preparation for the hands-on practice to come. It is worth working through every part of this article carefully, ideally by typing in and running the code yourself.

  • Author: 雁回晴空
  • Original link: https://zzlzz.blog.csdn.net/article/details/78798459
    Last updated: 2022-10-23 09:26:14