Praised by Karpathy: a from-scratch LLaMa3 implementation goes viral, 1.5k stars in half a day

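The project contains a great deal of thorough code and is worth a close read.

A month ago, Meta released the open-source llama3 family of large models, which outperformed industry SOTA models on several key benchmarks and led across the board on code generation tasks.

Since then, developers have been deploying it locally and reimplementing it, for example a Chinese implementation of llama3 and a pure NumPy implementation of llama3.

A dozen or so hours ago, a developer named Nishant Aklecha released a repository that implements llama3 from scratch, with very detailed explanations of everything from the attention matrix multiplications across multiple heads to the positional encoding and each individual layer.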

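The project was praised by Karpathy, who said it looks good: with everything fully unrolled, it is easier to see what is actually happening than when modules are nested and call one another.

Within half a day of being uploaded, the project had collected 1.5k stars on GitHub, which says a lot about its value.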

Implementing llama3 from scratch

Next, the project author walks you step by step through implementing llama3 from scratch.

Project link: https://github.com/naklecha/llama3-from-scratch

First, load the tensors from the llama3 model files provided by Meta.

Download link: https://llama.meta.com/llama-downloads/

Next comes the tokenizer. The author says he does not intend to implement a tokenizer himself and points to Andrej Karpathy's implementation instead; the code below loads the vocabulary with tiktoken:

Tokenizer implementation link: https://github.com/karpathy/minbpe

from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe
import torch
import json
import matplotlib.pyplot as plt

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    # special tokens take the IDs right after the regular BPE vocabulary
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

tokenizer.decode(tokenizer.encode("hello world!"))

'hello world!'
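The special-token IDs in the snippet above are simply appended after the regular BPE vocabulary. The following is a minimal sketch of that numbering that runs without any model files. It assumes the base vocabulary holds 128,000 entries, which is consistent with the `vocab_size` of 128256 and the `128000` begin-of-text ID that appear later in this walkthrough, but the number is not read from the tokenizer file here.

```python
# Assumption: the BPE file holds 128,000 mergeable ranks (not loaded here).
BASE_VOCAB_SIZE = 128_000

special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]

# Each special token gets the next free ID after the regular vocabulary.
special_token_ids = {tok: BASE_VOCAB_SIZE + i for i, tok in enumerate(special_tokens)}

print(len(special_tokens))                     # 256
print(special_token_ids["<|begin_of_text|>"])  # 128000
print(special_token_ids["<|eot_id|>"])         # 128009
```

Under that assumption, the regular vocabulary plus the 256 special tokens adds up to the 128256 entries reported in the model config.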

With that done, the next step is reading the model file. Because the project implements llama3 from scratch, the code reads the file one tensor at a time.

model = torch.load("Meta-Llama-3-8B/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))

[
    "tok_embeddings.weight",
    "layers.0.attention.wq.weight",
    "layers.0.attention.wk.weight",
    "layers.0.attention.wv.weight",
    "layers.0.attention.wo.weight",
    "layers.0.feed_forward.w1.weight",
    "layers.0.feed_forward.w3.weight",
    "layers.0.feed_forward.w2.weight",
    "layers.0.attention_norm.weight",
    "layers.0.ffn_norm.weight",
    "layers.1.attention.wq.weight",
    "layers.1.attention.wk.weight",
    "layers.1.attention.wv.weight",
    "layers.1.attention.wo.weight",
    "layers.1.feed_forward.w1.weight",
    "layers.1.feed_forward.w3.weight",
    "layers.1.feed_forward.w2.weight",
    "layers.1.attention_norm.weight",
    "layers.1.ffn_norm.weight",
    "layers.2.attention.wq.weight"
]

with open("Meta-Llama-3-8B/params.json", "r") as f:
    config = json.load(f)
config

{'dim': 4096,
 'n_layers': 32,
 'n_heads': 32,
 'n_kv_heads': 8,
 'vocab_size': 128256,
 'multiple_of': 1024,
 'ffn_dim_multiplier': 1.3,
 'norm_eps': 1e-05,
 'rope_theta': 500000.0}

The project author uses this config to infer details about the model:

the model has 32 transformer layers;

each multi-head attention block has 32 heads.

dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
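Several sizes used later follow directly from this config. The sketch below derives them with plain arithmetic. The feed-forward width is not stated in the article; the helper `ffn_hidden_dim` is a hypothetical name, and its rounding rule is an assumption based on how Meta's reference code is commonly described, so treat that value as illustrative.

```python
config = {
    "dim": 4096, "n_layers": 32, "n_heads": 32, "n_kv_heads": 8,
    "vocab_size": 128256, "multiple_of": 1024, "ffn_dim_multiplier": 1.3,
}

head_dim = config["dim"] // config["n_heads"]                  # width of one attention head
heads_per_kv_head = config["n_heads"] // config["n_kv_heads"]  # query heads sharing one key/value head
kv_rows = config["n_kv_heads"] * head_dim                      # number of rows in wk and wv

def ffn_hidden_dim(dim, multiple_of, ffn_dim_multiplier):
    """Assumed rounding rule for the feed-forward width (not taken from the article)."""
    hidden = int(2 * (4 * dim) / 3)
    hidden = int(ffn_dim_multiplier * hidden)
    # round up to the next multiple of `multiple_of`
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

print(head_dim)           # 128
print(heads_per_kv_head)  # 4
print(kv_rows)            # 1024
print(ffn_hidden_dim(config["dim"], config["multiple_of"], config["ffn_dim_multiplier"]))  # 14336
```

The first three values match the shapes printed later for the query, key and value weights.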

The next step is converting text into tokens. Here the author uses the tiktoken library (a BPE tokeniser used with OpenAI models).

prompt = "the answer to the ultimate question of life, the universe, and everything is"
tokens = [128000] + tokenizer.encode(prompt)  # 128000 is the <|begin_of_text|> token
print(tokens)
tokens = torch.tensor(tokens)
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)

[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']

Then convert the tokens into embeddings.

embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape

torch.Size([17, 4096])

Normalize the embeddings. The project uses root mean square (RMS) normalization. The tensor shape does not change after this step; only the values are normalized.

# def rms_norm(tensor, norm_weights):
#     rms = (tensor.pow(2).mean(-1, keepdim=True) + norm_eps)**0.5
#     return tensor * (norm_weights / rms)
def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
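To make the formula concrete, here is a minimal pure-Python sketch of the same RMS normalization applied to a single vector. It is a standalone illustration, not the project's code; the input vector and the all-ones weights are made up for the example.

```python
import math

def rms_norm(vector, norm_weights, eps=1e-5):
    """Scale `vector` so its root mean square is about 1, then apply per-dimension weights."""
    mean_square = sum(x * x for x in vector) / len(vector)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [x * inv_rms * w for x, w in zip(vector, norm_weights)]

vec = [3.0, 4.0, 0.0, -5.0]
normed = rms_norm(vec, [1.0] * len(vec))

# The length is unchanged; only the scale of the values changes.
rms_after = math.sqrt(sum(x * x for x in normed) / len(normed))
print(len(normed), round(rms_after, 4))
```

With unit weights, the root mean square of the output is approximately 1 regardless of the scale of the input.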

Building the first transformer layer. With the preparation above done, the next step is to build the first layer of the transformer: access layers.0 (the first layer) from the model file. After normalization, the embedding shape is still [17×4096].

token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape

torch.Size([17, 4096])

Implementing attention from scratch. Load the attention weights of the first transformer layer:

print(
    model["layers.0.attention.wq.weight"].shape,
    model["layers.0.attention.wk.weight"].shape,
    model["layers.0.attention.wv.weight"].shape,
    model["layers.0.attention.wo.weight"].shape
)
torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])

Unwrapping the queries. Unwrapping the query weights of the multiple attention heads gives shape [32x128x4096], where 32 is the number of attention heads in llama3, 128 is the size of the query vector, and 4096 is the size of the token embedding.

q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape

torch.Size([32, 128, 4096])
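Reshaping the weight to [heads, head size, embedding size] amounts to cutting its rows into consecutive blocks, one block per head. The sketch below shows this on toy sizes with plain Python lists; the sizes and the helper `project` are made up for illustration.

```python
n_heads, head_dim, dim = 2, 3, 4  # toy sizes; llama3-8B uses 32, 128, 4096

# A toy wq with n_heads * head_dim rows and dim columns; every entry equals its row index.
wq = [[float(row)] * dim for row in range(n_heads * head_dim)]

# Equivalent of wq.view(n_heads, head_dim, dim): consecutive row blocks, one per head.
wq_heads = [wq[h * head_dim:(h + 1) * head_dim] for h in range(n_heads)]

def project(token_embedding, head_weight):
    """Compute head_weight @ token_embedding, i.e. one token's query for one head."""
    return [sum(w * x for w, x in zip(row, token_embedding)) for row in head_weight]

token = [1.0, 0.0, 0.0, 0.0]
q_head1 = project(token, wq_heads[1])
print(len(wq_heads), len(wq_heads[0]), len(wq_heads[0][0]))  # 2 3 4
print(q_head1)  # [3.0, 4.0, 5.0]
```

Head 1 owns rows 3 to 5 of the original matrix, which is why its projection picks out exactly those row values.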

Implementing the first head of the first layer. Access the query weight matrix of the first head of the first layer; its size is [128×4096].

q_layer0_head0 = q_layer0[0]
q_layer0_head0.shape

torch.Size([128, 4096])

Multiply the query weights with the token embeddings to obtain a query for each token. The result has size [17×128].

q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
q_per_token.shape

torch.Size([17, 128])

Positional encoding. At this stage there is a query vector for every token in the prompt, but an individual query vector carries no information about its position in the prompt. The author uses RoPE (rotary positional embedding) to solve this.

q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
q_per_token_split_into_pairs.shape

torch.Size([17, 64, 2])

In the step above, the query vectors are split into pairs, and a rotational angle shift is applied to each pair.

The vectors are rotated using complex-number multiplication.

zero_to_one_split_into_64_parts = torch.tensor(range(64)) / 64
zero_to_one_split_into_64_parts

tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,
        0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,
        0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,
        0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,
        0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,
        0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,
        0.8438, 0.8594, 0.8750, 0.8906, 0.9062, 0.9219, 0.9375, 0.9531, 0.9688,
        0.9844])

freqs = 1.0 / (rope_theta ** zero_to_one_split_into_64_parts)
freqs

tensor([1.0000e+00, 8.1462e-01, 6.6360e-01, 5.4058e-01, 4.4037e-01, 3.5873e-01,
        2.9223e-01, 2.3805e-01, 1.9392e-01, 1.5797e-01, 1.2869e-01, 1.0483e-01,
        8.5397e-02, 6.9566e-02, 5.6670e-02, 4.6164e-02, 3.7606e-02, 3.0635e-02,
        2.4955e-02, 2.0329e-02, 1.6560e-02, 1.3490e-02, 1.0990e-02, 8.9523e-03,
        7.2927e-03, 5.9407e-03, 4.8394e-03, 3.9423e-03, 3.2114e-03, 2.6161e-03,
        2.1311e-03, 1.7360e-03, 1.4142e-03, 1.1520e-03, 9.3847e-04, 7.6450e-04,
        6.2277e-04, 5.0732e-04, 4.1327e-04, 3.3666e-04, 2.7425e-04, 2.2341e-04,
        1.8199e-04, 1.4825e-04, 1.2077e-04, 9.8381e-05, 8.0143e-05, 6.5286e-05,
        5.3183e-05, 4.3324e-05, 3.5292e-05, 2.8750e-05, 2.3420e-05, 1.9078e-05,
        1.5542e-05, 1.2660e-05, 1.0313e-05, 8.4015e-06, 6.8440e-06, 5.5752e-06,
        4.5417e-06, 3.6997e-06, 3.0139e-06, 2.4551e-06])

freqs_for_each_token = torch.outer(torch.arange(17), freqs)
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)
freqs_cis.shape

# view the row of freqs_cis at index 3 (the fourth token position)
value = freqs_cis[3]
plt.figure()
for i, element in enumerate(value[:17]):
    plt.plot([0, element.real], [0, element.imag], color='blue', linewidth=1, label=f"Index: {i}")
    plt.annotate(f"{i}", xy=(element.real, element.imag), color='red')
# label and show the figure once, after all vectors have been drawn
plt.xlabel('Real')
plt.ylabel('Imaginary')
plt.title('Plot of one row of freqs_cis')
plt.show()

Now every pair in each token's query can be viewed as a complex number and rotated by multiplying it with the matching entry of freqs_cis.

q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
q_per_token_as_complex_numbers.shape

torch.Size([17, 64])

q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis
q_per_token_as_complex_numbers_rotated.shape

torch.Size([17, 64])

After rotation, the complex numbers are converted back into pairs of real values.

q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)
q_per_token_split_into_pairs_rotated.shape

torch.Size([17, 64, 2])

Now there is a new query vector (the rotated query vector) of shape [17×128], where 17 is the number of tokens and 128 is the dimension of the query vector.

q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)
q_per_token_rotated.shape

torch.Size([17, 128])
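The rotation steps above can be sketched in pure Python with the standard `cmath` module. This is a toy illustration with an 8-dimensional head and made-up vectors, not the project's code. It also checks the property that makes RoPE useful: the dot product of a rotated query and a rotated key depends only on how far apart the two positions are.

```python
import cmath

def rope_frequencies(head_dim, theta=500000.0):
    """One rotation frequency per pair of dimensions: theta ** (-i / (head_dim / 2))."""
    half = head_dim // 2
    return [1.0 / (theta ** (i / half)) for i in range(half)]

def apply_rope(vector, position, freqs):
    """Treat consecutive pairs as complex numbers and rotate pair i by position * freqs[i]."""
    rotated = []
    for i, freq in enumerate(freqs):
        pair = complex(vector[2 * i], vector[2 * i + 1])
        pair *= cmath.exp(1j * position * freq)
        rotated.extend([pair.real, pair.imag])
    return rotated

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

freqs = rope_frequencies(head_dim=8)
q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, 1.0, -2.0]
k = [1.0, 0.5, -0.5, 2.0, 0.25, -1.0, 1.5, 0.75]

# Same offset (2 positions apart) at two different absolute positions.
score_near = dot(apply_rope(q, 5, freqs), apply_rope(k, 3, freqs))
score_far = dot(apply_rope(q, 12, freqs), apply_rope(k, 10, freqs))
print(abs(score_near - score_far) < 1e-9)  # True
```

Calling `rope_frequencies(128)` reproduces the 64 frequencies printed earlier, for example about 0.8146 for the second entry.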

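Keys (almost the same as queries). Keys also produce key vectors of dimension 128. The key weights are only 1/4 the size of the query weights, because each set of key weights is shared across 4 heads to reduce the amount of computation required. Like the queries, the keys are rotated to add positional information.

k_layer0 = model["layers.0.attention.wk.weight"]
k_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] // n_kv_heads, dim)
k_layer0.shape

torch.Size([8, 128, 4096])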
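The sharing scheme can be written down as a simple index mapping. The sketch below uses the head counts from the config and shows which key/value head each query head reads from; the helper name `kv_head_for` is made up for illustration.

```python
n_heads, n_kv_heads, head_dim, dim = 32, 8, 128, 4096
group_size = n_heads // n_kv_heads  # 4 query heads per key/value head

def kv_head_for(query_head):
    """Index of the key/value head used by a given query head."""
    return query_head // group_size

mapping = [kv_head_for(h) for h in range(n_heads)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(mapping[-1])  # 7

# Parameter counts of the query and key projections for one layer.
wq_params = n_heads * head_dim * dim
wk_params = n_kv_heads * head_dim * dim
print(wk_params / wq_params)  # 0.25
```

This is the same `head // 4` indexing that the per-head loop later in the walkthrough uses.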

k_layer0_head0 = k_layer0[0]
k_layer0_head0.shape

torch.Size([128, 4096])

k_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)
k_per_token.shape

torch.Size([17, 128])

k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
k_per_token_split_into_pairs.shape

torch.Size([17, 64, 2])

k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
k_per_token_as_complex_numbers.shape

torch.Size([17, 64])

k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)
k_per_token_split_into_pairs_rotated.shape

torch.Size([17, 64, 2])

k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)
k_per_token_rotated.shape

torch.Size([17, 128])

At this point there are rotated queries and rotated keys for every token, and each of them has shape [17×128].

The next step is to multiply the query and key matrices. The attention score matrix (qk_per_token) has shape [17×17], where 17 is the number of tokens in the prompt.

qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T) / (head_dim) ** 0.5
qk_per_token.shape

torch.Size([17, 17])

Now the query-key scores have to be masked.

During llama3 training, the qk scores of future tokens are masked, because the model only learns to predict tokens from past tokens. As a result, the future tokens are masked during inference as well: their scores are set to -inf, so their weights become zero after the softmax.

def display_qk_heatmap(qk_per_token):
    _, ax = plt.subplots()
    im = ax.imshow(qk_per_token.to(float).detach(), cmap='viridis')
    ax.set_xticks(range(len(prompt_split_as_tokens)))
    ax.set_yticks(range(len(prompt_split_as_tokens)))
    ax.set_xticklabels(prompt_split_as_tokens)
    ax.set_yticklabels(prompt_split_as_tokens)
    ax.figure.colorbar(im, ax=ax)

display_qk_heatmap(qk_per_token)

mask = torch.full((len(tokens), len(tokens)), float("-inf"), device=tokens.device)
mask = torch.triu(mask, diagonal=1)
mask

tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf], 
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf], 
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

qk_per_token_after_masking = qk_per_token + mask
display_qk_heatmap(qk_per_token_after_masking)

qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
display_qk_heatmap(qk_per_token_after_masking_after_softmax)
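The scoring, masking and softmax steps can be illustrated together on a three-token toy example in pure Python. The vectors below are made up; the point is only to show that masked (future) positions end up with a weight of exactly zero and that every row still sums to one.

```python
import math

def causal_attention_weights(queries, keys):
    """Scaled dot-product scores with a causal mask, followed by a row-wise softmax."""
    head_dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # future token: masked out
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim))
        top = max(scores)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(queries, keys)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is `[1.0, 0.0, 0.0]`.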

Values (almost the end of attention)

These scores (between 0 and 1) determine how much of the value matrix is used for each token.

Just like the keys, the value weights are shared across every 4 attention heads (to save computation).

As a result, the value weight matrix below has shape [8x128x4096].

v_layer0 = model["layers.0.attention.wv.weight"]
v_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] // n_kv_heads, dim)
v_layer0.shape

torch.Size([8, 128, 4096])

The value weight matrix for the first head of the first layer is shown below.

v_layer0_head0 = v_layer0[0]
v_layer0_head0.shape

torch.Size([128, 4096])

Value vectors.

Now use the value weights to get the attention values for each token. The result has size [17×128], where 17 is the number of tokens in the prompt and 128 is the dimension of each token's value vector.

v_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)
v_per_token.shape

torch.Size([17, 128])

Attention.

After multiplying by each token's values, the resulting attention vector has shape [17×128].

qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
qkv_attention.shape

torch.Size([17, 128])
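This last multiplication is a weighted average of value vectors. A minimal pure-Python sketch with made-up weights and values, assuming causal weights like the ones produced by the masked softmax:

```python
def attend(weights, values):
    """Row i of the result is sum_j weights[i][j] * values[j]."""
    n_dims = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(n_dims)]
        for row in weights
    ]

# Causal weights for 3 tokens (each row sums to 1; future positions are 0).
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
values = [[2.0, 0.0], [0.0, 4.0], [8.0, 8.0]]
output = attend(weights, values)
print(output)  # [[2.0, 0.0], [1.0, 2.0], [4.5, 5.0]]
```

The first token's output equals its own value vector, because it has nothing earlier to attend to.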

Multi-head attention.

Now we have the attention values for the first head of the first layer.

Next, run a loop that performs exactly the same math as the cells above, but for every head in the first layer.

qkv_attention_store = []
for head in range(n_heads):
    q_layer0_head = q_layer0[head]
    k_layer0_head = k_layer0[head // 4]  # key weights are shared across 4 heads
    v_layer0_head = v_layer0[head // 4]  # value weights are shared across 4 heads
    q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)
    k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)
    v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)

    q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
    q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
    q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis[:len(tokens)])
    q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)

    k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
    k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
    k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis[:len(tokens)])
    k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)

    qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T) / (128) ** 0.5
    mask = torch.full((len(tokens), len(tokens)), float("-inf"), device=tokens.device)
    mask = torch.triu(mask, diagonal=1)
    qk_per_token_after_masking = qk_per_token + mask
    qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
    qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
    qkv_attention_store.append(qkv_attention)

len(qkv_attention_store)

Now there is a qkv_attention matrix for all 32 heads of the first layer. Next, all of the per-head attention outputs are merged into one large matrix of size [17×4096]. We are almost at the end.

stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
stacked_qkv_attention.shape

torch.Size([17, 4096])

The output weight matrix is one of the final steps.

The last thing to do for the layer 0 attention is to multiply by the following weight matrix.

w_layer0 = model["layers.0.attention.wo.weight"]
w_layer0.shape

torch.Size([4096, 4096])

This is a simple linear layer, so it is just a matrix multiplication (matmul).

embedding_delta = torch.matmul(stacked_qkv_attention, w_layer0.T)
embedding_delta.shape

torch.Size([17, 4096])

Now we have the change in the embedding values after attention, which should be added to the original token embeddings.

embedding_after_edit = token_embeddings_unnormalized + embedding_delta
embedding_after_edit.shape

torch.Size([17, 4096])

Normalize the updated embedding, then run it through a feed-forward neural network.

embedding_after_edit_normalized = rms_norm(embedding_after_edit, model["layers.0.ffn_norm.weight"])
embedding_after_edit_normalized.shape

torch.Size([17, 4096])

Load the feed-forward weights and implement the feed-forward network.

llama3 uses a SwiGLU feed-forward network, an architecture that is very good at adding non-linearity when the model needs it. Using this kind of feed-forward network is now fairly standard practice in LLMs.

w1 = model["layers.0.feed_forward.w1.weight"]
w2 = model["layers.0.feed_forward.w2.weight"]
w3 = model["layers.0.feed_forward.w3.weight"]
output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
output_after_feedforward.shape

torch.Size([17, 4096])
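The SwiGLU computation is easier to follow on a tiny example. The sketch below applies the same order of operations, `w2 @ (silu(w1 @ x) * (w3 @ x))`, to a single 2-dimensional vector in pure Python; the weights are made up for illustration.

```python
import math

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def swiglu_ffn(x, w1, w2, w3):
    """Gate path silu(w1 @ x) multiplied elementwise by w3 @ x, then projected by w2."""
    gate = [silu(g) for g in matvec(w1, x)]
    up = matvec(w3, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w2, hidden)

x = [1.0, 2.0]
w1 = [[1.0, 0.0], [0.0, 1.0]]
w3 = [[1.0, 1.0], [1.0, -1.0]]
w2 = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn(x, w1, w2, w3)
print([round(v, 4) for v in out])
```

With these identity-like weights the output is simply `[3 * silu(1), -1 * silu(2)]`, which makes the gating effect easy to check by hand.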

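Now we finally have the new edited embedding for every token after the first layer, with only 31 more layers to go before we are done (one for loop away).

You can think of this edited embedding as holding information about all the queries asked in the first layer. Each subsequent layer will encode more and more complex queries about the question being asked, until the resulting embedding knows everything it needs about the next token.

layer_0_embedding = embedding_after_edit + output_after_feedforward
layer_0_embedding.shape

torch.Size([17, 4096])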
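Stripped of the tensor details, one layer follows a fixed data flow: each sub-block reads a normalized copy of the embedding, and its output is added back onto the unnormalized embedding. The sketch below expresses that pattern for a single toy vector. The `toy_attention`, `toy_feed_forward` and `identity_norm` callables are hypothetical stand-ins, not the real computations.

```python
def transformer_layer(x, attention, feed_forward, attn_norm, ffn_norm):
    """One pre-norm layer: normalize, transform, then add the result to the residual."""
    attn_delta = attention(attn_norm(x))
    after_attention = [a + b for a, b in zip(x, attn_delta)]  # residual uses the unnormalized x
    ffn_delta = feed_forward(ffn_norm(after_attention))
    return [a + b for a, b in zip(after_attention, ffn_delta)]

# Hypothetical stand-ins so the data flow can be run on a single toy vector.
identity_norm = lambda v: list(v)
toy_attention = lambda v: [0.1 * e for e in v]
toy_feed_forward = lambda v: [1.0 for _ in v]

x = [1.0, 2.0]
print(transformer_layer(x, toy_attention, toy_feed_forward, identity_norm, identity_norm))
```

If both sub-blocks returned zeros, the layer would pass its input through unchanged, which is what makes stacking 32 of these layers a sequence of incremental edits.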

Everything done above for a single layer can now be done for all layers at once.

final_embedding = token_embeddings_unnormalized
for layer in range(n_layers):
    qkv_attention_store = []
    layer_embedding_norm = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"])
    q_layer = model[f"layers.{layer}.attention.wq.weight"]
    q_layer = q_layer.view(n_heads, q_layer.shape[0] // n_heads, dim)
    k_layer = model[f"layers.{layer}.attention.wk.weight"]
    k_layer = k_layer.view(n_kv_heads, k_layer.shape[0] // n_kv_heads, dim)
    v_layer = model[f"layers.{layer}.attention.wv.weight"]
    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] // n_kv_heads, dim)
    w_layer = model[f"layers.{layer}.attention.wo.weight"]
    for head in range(n_heads):
        q_layer_head = q_layer[head]
        k_layer_head = k_layer[head // 4]  # key weights are shared across 4 heads
        v_layer_head = v_layer[head // 4]  # value weights are shared across 4 heads
        q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)
        k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)
        v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)
        q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
        q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
        q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis)
        q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)
        k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
        k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
        k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)
        k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)
        qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T) / (128) ** 0.5
        mask = torch.full((len(token_embeddings_unnormalized), len(token_embeddings_unnormalized)), float("-inf"))
        mask = torch.triu(mask, diagonal=1)
        qk_per_token_after_masking = qk_per_token + mask
        qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
        qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
        qkv_attention_store.append(qkv_attention)

    # merge the heads, apply the output projection, and add the residual
    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
    embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T)
    embedding_after_edit = final_embedding + embedding_delta
    # SwiGLU feed-forward block with its own normalization and residual
    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f"layers.{layer}.ffn_norm.weight"])
    w1 = model[f"layers.{layer}.feed_forward.w1.weight"]
    w2 = model[f"layers.{layer}.feed_forward.w2.weight"]
    w3 = model[f"layers.{layer}.feed_forward.w3.weight"]
    output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
    final_embedding = embedding_after_edit + output_after_feedforward

Now we have the final embedding, the model's best guess about the next token. Its shape is the same as that of the regular token embeddings, [17×4096], where 17 is the number of tokens and 4096 is the embedding dimension.

final_embedding = rms_norm(final_embedding, model["norm.weight"])
final_embedding.shape

torch.Size([17, 4096])

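Decode the embedding into a token value.

Use the output decoder to convert the final embedding into a token.

model["output.weight"].shape

torch.Size([128256, 4096])

Use the embedding of the last token to predict the next value. In this example the hoped-for answer is 42: according to the book The Hitchhiker's Guide to the Galaxy, 42 is the answer to "the ultimate question of life, the universe, and everything". Most modern LLMs would answer 42 here, which should validate the entire code.

logits = torch.matmul(final_embedding[-1], model["output.weight"].T)
logits.shape

torch.Size([128256])

The model predicts token number 2983 as the next token. Is this the token number for 42? Here is the final code cell.

next_token = torch.argmax(logits, dim=-1)
next_token

tensor(2983)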
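The decoding step is a dot product of the last embedding with every row of the output matrix, followed by picking the largest score. A minimal pure-Python sketch with a made-up vocabulary of 4 tokens and 3-dimensional embeddings:

```python
def next_token_id(last_embedding, output_weight):
    """Return the row index of output_weight with the largest dot product (greedy choice)."""
    logits = [sum(w * e for w, e in zip(row, last_embedding)) for row in output_weight]
    best = max(range(len(logits)), key=lambda i: logits[i])
    return best, logits

# Toy vocabulary of 4 tokens with 3-dimensional embeddings.
output_weight = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.5],
    [0.0, 0.0, 2.0],
]
last_embedding = [0.2, 0.1, 0.9]
token_id, logits = next_token_id(last_embedding, output_weight)
print(token_id)  # 3
```

Always taking the highest-scoring token is greedy decoding; it is the simplest choice and matches the `argmax` call in the walkthrough.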

Finally, decode the token.

tokenizer.decode([next_token.item()])

'42'

That's a wrap!
