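Most of this article is translated from "Illustrated: Self-Attention, Step-by-step guide to self-attention with illustrations and code". It is intended for study purposes only; please excuse any inaccuracies in the translation.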






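  1. Prepare the inputs;
  2. Initialize the weights;
  3. Derive the key, query, and value;
  4. Calculate the attention scores for Input 1;
  5. Calculate the softmax;
  6. Multiply the scores with the values;
  7. Sum the weighted values to get Output 1;
  8. Repeat steps 4-7 for the remaining inputs.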




Input 1: [1, 0, 1, 0]
Input 2: [0, 2, 0, 2]
Input 3: [1, 1, 1, 1]






Weights for key:

[[0, 0, 1],
 [1, 1, 0],
 [0, 1, 0],
 [1, 1, 0]]

Weights for query:

[[1, 0, 1],
 [1, 0, 0],
 [0, 0, 1],
 [0, 1, 1]]

Weights for value:

[[0, 2, 0],
 [0, 3, 0],
 [1, 0, 3],
 [1, 1, 0]]

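Note: In a neural network setting, these weights are usually small numbers, initialized randomly from a suitable distribution such as Gaussian, Xavier, or Kaiming. The initialization is done once, before training starts.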
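As a rough illustration of the note above, the sketch below draws a 4x3 weight matrix using three common schemes with only the standard library. The helper names (`gaussian_init`, `xavier_uniform_init`, `kaiming_normal_init`), the 0.01 standard deviation, and the fixed seed are assumptions made for this example and are not part of the original article; deep learning frameworks ship their own initializers.

```python
import math
import random


def gaussian_init(fan_in, fan_out, std=0.01, rng=random):
    """Small zero-mean Gaussian values (std=0.01 is an illustrative choice)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


def xavier_uniform_init(fan_in, fan_out, rng=random):
    """Xavier/Glorot uniform: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_out)] for _ in range(fan_in)]


def kaiming_normal_init(fan_in, fan_out, rng=random):
    """Kaiming/He normal: N(0, std^2) with std = sqrt(2 / fan_in)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)  # fixed seed (assumption) so the example is repeatable
w_key_random = xavier_uniform_init(4, 3, rng)  # same 4x3 shape as the weights above
for row in w_key_random:
    print([round(v, 3) for v in row])
```

The walkthrough itself keeps the small hand-picked integer weights so that every product can be verified by hand.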




               [0, 0, 1]
[1, 0, 1, 0] x [1, 1, 0] = [0, 1, 1]
               [0, 1, 0]
               [1, 1, 0]


               [0, 0, 1]
[0, 2, 0, 2] x [1, 1, 0] = [4, 4, 0]
               [0, 1, 0]
               [1, 1, 0]


               [0, 0, 1]
[1, 1, 1, 1] x [1, 1, 0] = [2, 3, 1]
               [0, 1, 0]
               [1, 1, 0]


               [0, 0, 1]
[1, 0, 1, 0]   [1, 1, 0]   [0, 1, 1]
[0, 2, 0, 2] x [0, 1, 0] = [4, 4, 0]
[1, 1, 1, 1]   [1, 1, 0]   [2, 3, 1]



               [1, 0, 1]
[1, 0, 1, 0]   [1, 0, 0]   [1, 0, 2]
[0, 2, 0, 2] x [0, 0, 1] = [2, 2, 2]
[1, 1, 1, 1]   [0, 1, 1]   [2, 1, 3]
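The projections above can be checked with a few lines of plain Python. The `matmul` helper is a hypothetical name introduced for this sketch. The value representation is obtained in the same way from the value weights, and it yields the vectors that Step 6 multiplies by the scores.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
            for row in a]


x = [[1, 0, 1, 0],   # Input 1
     [0, 2, 0, 2],   # Input 2
     [1, 1, 1, 1]]   # Input 3

w_key = [[0, 0, 1], [1, 1, 0], [0, 1, 0], [1, 1, 0]]
w_query = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
w_value = [[0, 2, 0], [0, 3, 0], [1, 0, 3], [1, 1, 0]]

keys = matmul(x, w_key)
queries = matmul(x, w_query)
values = matmul(x, w_value)

print(keys)     # [[0, 1, 1], [4, 4, 0], [2, 3, 1]]
print(queries)  # [[1, 0, 2], [2, 2, 2], [2, 1, 3]]
print(values)   # [[1, 2, 3], [2, 8, 0], [2, 6, 3]]
```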




            [0, 4, 2]
[1, 0, 2] x [1, 4, 3] = [2, 4, 4]
            [1, 0, 1]
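Read column by column, the matrix on the right of the product is just the three keys, so each score is the dot product of Query 1 with one key. A minimal sketch (the variable names are chosen for this example):

```python
query_1 = [1, 0, 2]
keys = [[0, 1, 1],   # Key 1
        [4, 4, 0],   # Key 2
        [2, 3, 1]]   # Key 3

# One score per key: the dot product of the query with that key.
attn_scores_1 = [sum(q * k for q, k in zip(query_1, key)) for key in keys]
print(attn_scores_1)  # [2, 4, 4]
```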




softmax([2, 4, 4]) ≈ [0.0, 0.5, 0.5]   (coarsely rounded for readability; the exact values are about [0.063, 0.468, 0.468])
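For reference, here is a small sketch of the softmax computation itself. Subtracting the maximum score before exponentiating is a common numerical-stability trick and is an addition of this example, not something the original article does; it leaves the result unchanged.

```python
import math


def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


weights = softmax([2, 4, 4])
print([round(w, 4) for w in weights])  # [0.0634, 0.4683, 0.4683]
```

The walkthrough continues with the coarser [0.0, 0.5, 0.5] so that the remaining arithmetic stays easy to follow.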

Step 6: Multiply the scores with the values


1: 0.0 * [1, 2, 3] = [0.0, 0.0, 0.0]
2: 0.5 * [2, 8, 0] = [1.0, 4.0, 0.0]
3: 0.5 * [2, 6, 3] = [1.0, 3.0, 1.5]



  [0.0, 0.0, 0.0]
+ [1.0, 4.0, 0.0]
+ [1.0, 3.0, 1.5]
= [2.0, 7.0, 1.5]

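  The resulting vector [2.0, 7.0, 1.5] (dark green in the illustration) is Output 1. It is the result of the query representation of Input 1 interacting with all the keys, including its own.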
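Steps 6 and 7 together amount to a weighted sum of the value vectors. A minimal plain-Python sketch, using the approximate weights from the walkthrough (variable names are chosen for this example):

```python
weights = [0.0, 0.5, 0.5]   # approximate softmax scores for Input 1
values = [[1, 2, 3],        # Value 1
          [2, 8, 0],        # Value 2
          [2, 6, 3]]        # Value 3

# Scale each value vector by its weight, then add the results element-wise.
weighted = [[w * v for v in value] for w, value in zip(weights, values)]
output_1 = [sum(column) for column in zip(*weighted)]
print(output_1)  # [2.0, 7.0, 1.5]
```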




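  Here is the implementation in PyTorch, a popular deep learning framework for Python. To use the @ operator, .T, and None indexing in the snippets below, make sure you are running Python ≥ 3.6 and PyTorch ≥ 1.3.1.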

1. Prepare the inputs

import torch

x = [
    [1, 0, 1, 0],  # Input 1
    [0, 2, 0, 2],  # Input 2
    [1, 1, 1, 1],  # Input 3
]
x = torch.tensor(x, dtype=torch.float32)

2. Initialize the weights

w_key = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 0],
]
w_query = [
    [1, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
]
w_value = [
    [0, 2, 0],
    [0, 3, 0],
    [1, 0, 3],
    [1, 1, 0],
]
w_key = torch.tensor(w_key, dtype=torch.float32)
w_query = torch.tensor(w_query, dtype=torch.float32)
w_value = torch.tensor(w_value, dtype=torch.float32)

3. Derive the key, query, and value

keys = x @ w_key
querys = x @ w_query
values = x @ w_value

print(keys)
# tensor([[0., 1., 1.],
#         [4., 4., 0.],
#         [2., 3., 1.]])

print(querys)
# tensor([[1., 0., 2.],
#         [2., 2., 2.],
#         [2., 1., 3.]])

print(values)
# tensor([[1., 2., 3.],
#         [2., 8., 0.],
#         [2., 6., 3.]])

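4. Calculate the attention scores for Input 1

# The matrix product scores every query against every key at once;
# the first row holds the scores for Input 1.
attn_scores = querys @ keys.T

# tensor([[ 2.,  4.,  4.],   # attention scores from Query 1
#         [ 4., 16., 12.],   # attention scores from Query 2
#         [ 4., 12., 10.]])  # attention scores from Query 3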

5. Calculate the softmax

from torch.nn.functional import softmax

attn_scores_softmax = softmax(attn_scores, dim=-1)
# tensor([[6.3379e-02, 4.6831e-01, 4.6831e-01],
#         [6.0337e-06, 9.8201e-01, 1.7986e-02],
#         [2.9539e-04, 8.8054e-01, 1.1917e-01]])

# For readability, approximate the above as follows
attn_scores_softmax = [
    [0.0, 0.5, 0.5],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.1],
]
attn_scores_softmax = torch.tensor(attn_scores_softmax)

6. Multiply the scores with the values

# weighted_values[i, j] is value i scaled by the weight that query j gives to key i
weighted_values = values[:,None] * attn_scores_softmax.T[:,:,None]

# tensor([[[0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000],
#          [0.0000, 0.0000, 0.0000]],
#         [[1.0000, 4.0000, 0.0000],
#          [2.0000, 8.0000, 0.0000],
#          [1.8000, 7.2000, 0.0000]],
#         [[1.0000, 3.0000, 1.5000],
#          [0.0000, 0.0000, 0.0000],
#          [0.2000, 0.6000, 0.3000]]])

7. Sum the weighted values to get the outputs

outputs = weighted_values.sum(dim=0)

# tensor([[2.0000, 7.0000, 1.5000],   # Output 1
#         [2.0000, 8.0000, 0.0000],   # Output 2
#         [2.0000, 7.8000, 0.3000]])  # Output 3
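Steps 6 and 7 can also be read as a single matrix product: each output row is the attention-weight row multiplied by the value matrix. The plain-Python sketch below, which uses the approximate weights from step 5 and variable names chosen for this example, reproduces the same three outputs.

```python
attn_weights = [[0.0, 0.5, 0.5],   # approximate weights for Query 1
                [0.0, 1.0, 0.0],   # approximate weights for Query 2
                [0.0, 0.9, 0.1]]   # approximate weights for Query 3
values = [[1.0, 2.0, 3.0],
          [2.0, 8.0, 0.0],
          [2.0, 6.0, 3.0]]

# Row i of the result is sum_j attn_weights[i][j] * values[j].
outputs = [[sum(w * v for w, v in zip(weight_row, value_col))
            for value_col in zip(*values)]
           for weight_row in attn_weights]

for row in outputs:
    print([round(v, 4) for v in row])
# [2.0, 7.0, 1.5]
# [2.0, 8.0, 0.0]
# [2.0, 7.8, 0.3]
```

In PyTorch the same product can be written as `attn_scores_softmax @ values`.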






# -*- coding: utf-8 -*-

from typing import List, Union
import math
from pprint import pprint

Number = Union[int, float]

x = [[1, 0, 1, 0],  # Input 1
     [0, 2, 0, 2],  # Input 2
     [1, 1, 1, 1]   # Input 3
     ]

w_key = [[0, 0, 1],
         [1, 1, 0],
         [0, 1, 0],
         [1, 1, 0]
         ]

w_query = [[1, 0, 1],
           [1, 0, 0],
           [0, 0, 1],
           [0, 1, 1]
           ]

w_value = [[0, 2, 0],
           [0, 3, 0],
           [1, 0, 3],
           [1, 1, 0]
           ]


# dot product of two vectors
def vector_dot(list1: List[Number], list2: List[Number]) -> Number:
    dot_sum = 0
    for element_i, element_j in zip(list1, list2):
        dot_sum += element_i * element_j
    return dot_sum


# multiply x by a weight matrix (plain matrix multiplication)
def get_weights_matrix_by_x(x, weight_matrix):
    x_matrix = []
    for i in range(len(x)):
        x_row = []
        for j in range(len(weight_matrix[0])):
            # dot product of row i of x with column j of the weight matrix
            x_row.append(vector_dot(x[i], [_[j] for _ in weight_matrix]))
        x_matrix.append(x_row)
    return x_matrix


# softmax function
def softmax(x: List[Number]) -> List[float]:
    x_sum = sum([math.exp(_) for _ in x])
    return [math.exp(_)/x_sum for _ in x]


x_key = get_weights_matrix_by_x(x, w_key)
x_value = get_weights_matrix_by_x(x, w_value)
x_query = get_weights_matrix_by_x(x, w_query)
# print(x_key)
# print(x_value)
# print(x_query)

outputs = []
for query in x_query:
    # attention scores of this query against every key, then softmax
    score_list = [vector_dot(query, key) for key in x_key]
    softmax_score_list = softmax(score_list)

    # scale each value vector by its softmax score
    weights_list = []
    for i in range(len(softmax_score_list)):
        weights = [softmax_score_list[i] * _ for _ in x_value[i]]
        weights_list.append(weights)

    # sum the weighted values element-wise
    output = []
    for j in range(len(weights_list[0])):
        output.append(sum([_[j] for _ in weights_list]))

    outputs.append(output)

pprint(outputs)


[[1.9366210616669624, 6.683105308334811, 1.5950684074995565],
[1.9999939663351456, 7.9639915951322156, 0.0539764053125496],
[1.9997046127769653, 7.759892254657784, 0.3583892946751152]]
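These numbers differ slightly from the PyTorch outputs because the PyTorch walkthrough replaces the softmax results with rounded values, whereas this implementation keeps the exact ones. The sketch below recomputes Output 1 both ways to make the gap visible; `weighted_sum` is a hypothetical helper introduced for this example.

```python
import math

scores_1 = [2, 4, 4]                         # attention scores for Input 1
values = [[1, 2, 3], [2, 8, 0], [2, 6, 3]]   # value representations


def weighted_sum(weights, vectors):
    """Element-wise sum of the vectors, each scaled by its weight."""
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(len(vectors[0]))]


total = sum(math.exp(s) for s in scores_1)
exact_weights = [math.exp(s) / total for s in scores_1]
rounded_weights = [0.0, 0.5, 0.5]            # approximation used in the walkthrough

exact_output = weighted_sum(exact_weights, values)
rounded_output = weighted_sum(rounded_weights, values)

print([round(v, 4) for v in exact_output])  # [1.9366, 6.6831, 1.5951]
print(rounded_output)                       # [2.0, 7.0, 1.5]
```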





