A tokenizer typically exposes an encode and a decode method: encode converts text into a sequence of token IDs, and decode converts token IDs back into text.

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."
integers = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
strings = tokenizer.decode(integers)
print(strings)

Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace.
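As context for the `allowed_special` argument above: by default, tiktoken refuses to encode text that contains a special token such as `<|endoftext|>`, so it has to be allowed explicitly. The following is a minimal sketch of the difference; 50256 is the reserved GPT-2 ID for `<|endoftext|>`.

```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")
text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces of someunknownPlace."

# Without allowed_special, the special token is disallowed and encode raises.
try:
    tokenizer.encode(text)
except ValueError as err:
    print("refused:", err)

# Explicitly allowing it maps <|endoftext|> to its reserved ID (50256 for GPT-2).
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)
print(tokenizer.decode(ids))  # round-trips back to the original text
```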
Create an embedding layer with a vocabulary size of 50,257 (the GPT-2 BPE vocabulary) and an embedding dimension of 3; the vocabulary size fixes the set of token IDs the model can represent.

vocab_size = 50257
output_dim = 3

Create the token embeddings (seeded for reproducibility):

import torch

torch.manual_seed(123)
pos_embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

tensor([[[ 0.3793,  1.0554, -0.4246],
         [ 1.4180,  0.1776, -0.2737],
         [ 0.6189, -3.0485, -1.0450],
         [-1.1296, -0.5921, -0.0588],
         [ 1.6772, -0.8353,  0.7531],
         [-0.1515,  0.2832,  0.1554],
         [-0.7367,  2.1855,  0.2716],
         [ 0.0744, -0.8683, -0.5622],
         [ 0.7998,  1.8777,  1.0335],
         [-0.4080, -0.0293,  0.2531],
         [-2.1542,  1.3953,  1.1845],
         [ 0.5945, -0.4951, -0.5756],
         [-1.4126,  0.5412, -1.2169],
         [-0.0322, -0.4761, -0.8343],
         [ 0.9031, -0.7218, -0.5951]]], grad_fn=<EmbeddingBackward0>)
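A minimal sketch of how such an embedding layer is used, assuming a made-up batch of token IDs: each ID indexes one row of the 50257 x 3 weight matrix, so an input of shape (1, 4) maps to embeddings of shape (1, 4, 3).

```python
import torch

vocab_size = 50257   # GPT-2 BPE vocabulary size
output_dim = 3       # toy embedding dimension used in this note

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(vocab_size, output_dim)

# Hypothetical token IDs for one sequence of 4 tokens (values made up).
token_ids = torch.tensor([[15496, 11, 466, 345]])
token_embeddings = embedding_layer(token_ids)   # lookup: one row per token ID
print(token_embeddings.shape)                   # torch.Size([1, 4, 3])
```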
- Compute the attention scores as dot products between the query and each input vector.
query = inputs[1]                               # the second input token acts as the query
attn_scores_2 = torch.empty(inputs.shape[0])    # one score per input token
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)    # dot product of the query with each input
print(attn_scores_2)
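The same scores can be computed without an explicit loop as a matrix-vector product. Note that `inputs` is not defined in this section; the 6 x 3 tensor below is an illustrative stand-in so the sketch runs on its own.

```python
import torch

# Illustrative stand-in for the `inputs` tensor assumed above:
# 6 token embeddings with 3 dimensions each (values made up).
inputs = torch.tensor([
    [0.43, 0.15, 0.89],
    [0.55, 0.87, 0.66],
    [0.57, 0.85, 0.64],
    [0.22, 0.58, 0.33],
    [0.77, 0.25, 0.10],
    [0.05, 0.80, 0.55],
])

query = inputs[1]                 # the second token acts as the query
attn_scores_2 = inputs @ query    # one dot product per row, shape (6,)
print(attn_scores_2)
```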
Normalize the scores so they sum to 1 to obtain attention weights:

attn_weights_2_tmp = attn_scores_2 / attn_scores_2.sum()
print("Attention weights:", attn_weights_2_tmp)
print("Sum:", attn_weights_2_tmp.sum())