TensorFlow版BERT源码详解之self-attention

TensorFlow版BERT源码详解之self-attention

self-attetion是BERT中的最为核心的内容之一,虽然TensorFlow版的BERT中的self-attention的原理和论文中是一致的,但是实现代码却有所出入。为了帮助新手快速理解这部分内容,所以通过该篇博客逐行解释具体代码。

文章目录

1. 函数参数

def attention_layer(from_tensor, to_tensor, attention_mask=None, num_attention_heads=1, size_per_head=512, query_act=None, key_act=None, value_act=None, attention_probs_dropout_prob=0.0, initializer_range=0.02, do_return_2d_tensor=False, batch_size=None, from_seq_length=None, to_seq_length=None) 

参数from_tensor指的是输入张量,如果不考虑对TPU的优化的话,它的输入维度为[batch_size, seq_length, hidden_size]。

参数from_tensor指的是输出张量,如果是self-attetion,则它的维度和输入是一致的,即[batch_size, seq_length, hidden_size]。当然,谷歌设计了一个更为通用的attention layer,所以也可以设置成和输入维度不一致。

参数attention_mask指的是部分位置的元素进行mask,具体来说该张量中每个元素的取值为1或者0。其中当取值为0时,则进行mask。它的维度为[batch_size, seq_length, seq_length]。那它是什么时候进行使用呢?比如为了防止信息泄露,将后面位置的元素进行mask。

参数num_attention_heads,在Transformer中包括的是多头的注意力,该参数就是头的个数。

参数size_per_head,指的是每个注意力头对应的维度。

参数query_act、key_act和value_act指的是Q、K、V对应的激活函数。一般是不使用的。

参数initializer_range指的是权重初始化的范围,在BERT中用的是0.02。

参数do_return_2d_tensor的类型为布尔型。当为True时,输出维度为[batch_size * from_seq_length, num_attention_heads * size_per_head]。当为False时,输出维度为[batch_size, from_seq_length, num_attention_heads* size_per_head]。

参数batch_size、from_seq_length和to_seq_length是当input为矩阵(维度为2)时,则需要进行对应输入。

"""Performs multi-headed attention from `from_tensor` to `to_tensor`. This is an implementation of multi-headed attention based on "Attention is all you Need". If `from_tensor` and `to_tensor` are the same, then this is self-attention. Each timestep in `from_tensor` attends to the corresponding sequence in `to_tensor`, and returns a fixed-with vector. This function first projects `from_tensor` into a "query" tensor and `to_tensor` into "key" and "value" tensors. These are (effectively) a list of tensors of length `num_attention_heads`, where each tensor is of shape [batch_size, seq_length, size_per_head]. Then, the query and key tensors are dot-producted and scaled. These are softmaxed to obtain attention probabilities. The value tensors are then interpolated by these probabilities, then concatenated back to a single tensor and returned. In practice, the multi-headed attention are done with transposes and reshapes rather than actual separate tensors. Args: from_tensor: float Tensor of shape [batch_size, from_seq_length, from_width]. to_tensor: float Tensor of shape [batch_size, to_seq_length, to_width]. attention_mask: (optional) int32 Tensor of shape [batch_size, from_seq_length, to_seq_length]. The values should be 1 or 0. The attention scores will effectively be set to -infinity for any positions in the mask that are 0, and will be unchanged for positions that are 1. num_attention_heads: int. Number of attention heads. size_per_head: int. Size of each attention head. query_act: (optional) Activation function for the query transform. key_act: (optional) Activation function for the key transform. value_act: (optional) Activation function for the value transform. attention_probs_dropout_prob: (optional) float. Dropout probability of the attention probabilities. initializer_range: float. Range of the weight initializer. do_return_2d_tensor: bool. If True, the output will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]. If False, the output will be of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. batch_size: (Optional) int. If the input is 2D, this might be the batch size of the 3D version of the `from_tensor` and `to_tensor`. from_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `from_tensor`. to_seq_length: (Optional) If the input is 2D, this might be the seq length of the 3D version of the `to_tensor`. Returns: float Tensor of shape [batch_size, from_seq_length, num_attention_heads * size_per_head]. (If `do_return_2d_tensor` is true, this will be of shape [batch_size * from_seq_length, num_attention_heads * size_per_head]). Raises: ValueError: Any of the arguments or tensor shapes are invalid. """ 

2. 维度变换过程

在深度学习中,把握每个张量或者矩阵在变换前后的维度是至关重要的。但是不少初学者却忽略了这一点,所以特此加入该部分内容。

在开始之前,我们先把self-attention的公式列一下:
 A t t e n t i o n ( Q , K , V ) = s o f t m a x ( Q K T d k ) V Attention(Q, K, V)=softmax(\frac{QK^T}{\sqrt{d_k}}) V Attention(Q,K,V)=softmax(dk​   ​QKT​)V

2.1 单个注意力头

为了简单起见,先假设注意力头是单个,然后搞清楚后再进行扩展就更容易理解多个注意力头的维度变换过程。令B=batch_size,S=seq_length,H=hidden_size。

首先,Q、K、V的维度均为[B, S, H]。那么 Q K T QK^T QKT的维度为[B, S,S],则 s o f t m a x ( Q K T d k ) V softmax(\frac{QK^T}{\sqrt{d_k}}) V softmax(dk​   ​QKT​)V的维度为[B, S, H]。

2.2 多个注意力头

多个注意力头相比于单个注意力头而言,Q、K、V的维度均多了一维,即[B, S, N, H]。在矩阵相乘之前,会将维度转换为[B, N, S, H]。K转置后维度为[B, N, H, S]。所以矩阵相乘后的维度为[B, N, S, S]。则 s o f t m a x ( Q K T d k ) V softmax(\frac{QK^T}{\sqrt{d_k}}) V softmax(dk​   ​QKT​)V的维度为[B, N, S, H]。最后再还原成[B, S, N, H]。

3. 代码解析

3.1 transpose_for_scores函数

def transpose_for_scores(input_tensor, batch_size, num_attention_heads, seq_length, width): output_tensor = tf.reshape( input_tensor, [batch_size, seq_length, num_attention_heads, width]) output_tensor = tf.transpose(output_tensor, [0, 2, 1, 3]) return output_tensor 

transpose_for_scores函数的作用就是把维度从[B, S, N, H]维度转换为[B, N, S, H]。

3.2 reshape_to_matrix函数

def reshape_to_matrix(input_tensor): """Reshapes a >= rank 2 tensor to a rank 2 tensor (i.e., a matrix).""" ndims = input_tensor.shape.ndims if ndims < 2: raise ValueError("Input tensor must have at least rank 2. Shape = %s" % (input_tensor.shape)) if ndims == 2: return input_tensor width = input_tensor.shape[-1] output_tensor = tf.reshape(input_tensor, [-1, width]) return output_tensor 

reshape_to_matrix,就是把三维张量转换为二维矩阵,变换规则为前两维压缩成一维,最后一维不变。之所以要进行上述操作,主要是在TPU上进行矩阵和张量的来回切换,会很影响整个运算效率。

举个例子来说,假设input_tensor的维度是[8, 128, 768], output_tensor是[8*128, 768]=[1024, 768] 。

3.3 attention_layer函数

def attention_layer(from_tensor, to_tensor, attention_mask=None, num_attention_heads=1, size_per_head=512, query_act=None, key_act=None, value_act=None, attention_probs_dropout_prob=0.0, initializer_range=0.02, do_return_2d_tensor=False, batch_size=None, from_seq_length=None, to_seq_length=None): from_shape = get_shape_list(from_tensor, expected_rank=[2, 3]) to_shape = get_shape_list(to_tensor, expected_rank=[2, 3]) if len(from_shape) != len(to_shape): raise ValueError( "The rank of `from_tensor` must match the rank of `to_tensor`.") if len(from_shape) == 3: batch_size = from_shape[0] from_seq_length = from_shape[1] to_seq_length = to_shape[1] elif len(from_shape) == 2: if (batch_size is None or from_seq_length is None or to_seq_length is None): raise ValueError( "When passing in rank 2 tensors to attention_layer, the values " "for `batch_size`, `from_seq_length`, and `to_seq_length` " "must all be specified.") # Scalar dimensions referenced here: # B = batch size (number of sequences) # F = `from_tensor` sequence length # T = `to_tensor` sequence length # N = `num_attention_heads` # H = `size_per_head` from_tensor_2d = reshape_to_matrix(from_tensor) to_tensor_2d = reshape_to_matrix(to_tensor) # `query_layer` = [B*F, N*H] query_layer = tf.layers.dense( from_tensor_2d, num_attention_heads * size_per_head, activation=query_act, name="query", kernel_initializer=create_initializer(initializer_range)) # `key_layer` = [B*T, N*H] key_layer = tf.layers.dense( to_tensor_2d, num_attention_heads * size_per_head, activation=key_act, name="key", kernel_initializer=create_initializer(initializer_range)) # `value_layer` = [B*T, N*H] value_layer = tf.layers.dense( to_tensor_2d, num_attention_heads * size_per_head, activation=value_act, name="value", kernel_initializer=create_initializer(initializer_range)) # `query_layer` = [B, N, F, H] query_layer = transpose_for_scores(query_layer, batch_size, num_attention_heads, from_seq_length, size_per_head) # `key_layer` = [B, N, T, H] key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, to_seq_length, size_per_head) # Take the dot product between "query" and "key" to get the raw # attention scores. # `attention_scores` = [B, N, F, T] attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head))) if attention_mask is not None: # `attention_mask` = [B, 1, F, T] attention_mask = tf.expand_dims(attention_mask, axis=[1]) # Since attention_mask is 1.0 for positions we want to attend and 0.0 for # masked positions, this operation will create a tensor which is 0.0 for # positions we want to attend and -10000.0 for masked positions. adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 # Since we are adding it to the raw scores before the softmax, this is # effectively the same as removing these entirely. attention_scores += adder # Normalize the attention scores to probabilities. # `attention_probs` = [B, N, F, T] attention_probs = tf.nn.softmax(attention_scores) # This is actually dropping out entire tokens to attend to, which might # seem a bit unusual, but is taken from the original Transformer paper. attention_probs = dropout(attention_probs, attention_probs_dropout_prob) # `value_layer` = [B, T, N, H] value_layer = tf.reshape( value_layer, [batch_size, to_seq_length, num_attention_heads, size_per_head]) # `value_layer` = [B, N, T, H] value_layer = tf.transpose(value_layer, [0, 2, 1, 3]) # `context_layer` = [B, N, F, H] context_layer = tf.matmul(attention_probs, value_layer) # `context_layer` = [B, F, N, H] context_layer = tf.transpose(context_layer, [0, 2, 1, 3]) if do_return_2d_tensor: # `context_layer` = [B*F, N*H] context_layer = tf.reshape( context_layer, [batch_size * from_seq_length, num_attention_heads * size_per_head]) else: # `context_layer` = [B, F, N*H] context_layer = tf.reshape( context_layer, [batch_size, from_seq_length, num_attention_heads * size_per_head]) return context_layer 

从最开始的from_shape = get_shape_list(from_tensor, expected_rank=[2, 3])到raise ValueError是进行维度检查(值得学习和借鉴)。

下面两行代码是将from_tensor和to_tensor的维度均转换为[B*S, H],具体可参考3.2的内容。

from_tensor_2d = reshape_to_matrix(from_tensor) to_tensor_2d = reshape_to_matrix(to_tensor) 

query_layer是将维度从[BS, H]转换成了[BS, N*H]。key_layer和value_layer与之相似。

query_layer = tf.layers.dense( from_tensor_2d, num_attention_heads * size_per_head, activation=query_act, name="query", kernel_initializer=create_initializer(initializer_range)) 

下列代码即为Attention(Q, K, V)中的 Q K T d k \frac{QK^T}{\sqrt{d_k}} dk​   ​QKT​的对应代码,这部分内容可参考上文中的维度转换过程:

query_layer = transpose_for_scores(query_layer, batch_size,num_attention_heads, from_seq_length, size_per_head) # `key_layer` = [B, N, T, H] key_layer = transpose_for_scores(key_layer, batch_size, num_attention_heads, to_seq_length, size_per_head) # Take the dot product between "query" and "key" to get the raw # attention scores. # `attention_scores` = [B, N, F, T] attention_scores = tf.matmul(query_layer, key_layer, transpose_b=True) attention_scores = tf.multiply(attention_scores, 1.0 / math.sqrt(float(size_per_head))) 

需要注意的是,在部分场景中需要对某些位置的元素进行mask。即为下列代码:

if attention_mask is not None: # `attention_mask` = [B, 1, F, T] attention_mask = tf.expand_dims(attention_mask, axis=[1]) # Since attention_mask is 1.0 for positions we want to attend and 0.0 for # masked positions, this operation will create a tensor which is 0.0 for # positions we want to attend and -10000.0 for masked positions. adder = (1.0 - tf.cast(attention_mask, tf.float32)) * -10000.0 # Since we are adding it to the raw scores before the softmax, this is # effectively the same as removing these entirely. attention_scores += adder 

dropout函数可见3.4。其他操作依然可以参照上文中的维度变换。返回结果有两种,一种是返回矩阵,一种是返回张量。

# Normalize the attention scores to probabilities. # `attention_probs` = [B, N, F, T] attention_probs = tf.nn.softmax(attention_scores) # This is actually dropping out entire tokens to attend to, which might # seem a bit unusual, but is taken from the original Transformer paper. attention_probs = dropout(attention_probs, attention_probs_dropout_prob) # `value_layer` = [B, T, N, H] value_layer = tf.reshape( value_layer, [batch_size, to_seq_length, num_attention_heads, size_per_head]) # `value_layer` = [B, N, T, H] value_layer = tf.transpose(value_layer, [0, 2, 1, 3]) # `context_layer` = [B, N, F, H] context_layer = tf.matmul(attention_probs, value_layer) # `context_layer` = [B, F, N, H] context_layer = tf.transpose(context_layer, [0, 2, 1, 3]) if do_return_2d_tensor: # `context_layer` = [B*F, N*H] context_layer = tf.reshape( context_layer, [batch_size * from_seq_length, num_attention_heads * size_per_head]) else: # `context_layer` = [B, F, N*H] context_layer = tf.reshape( context_layer, [batch_size, from_seq_length, num_attention_heads * size_per_head]) return context_layer 

3.4 dropout函数

def dropout(input_tensor, dropout_prob): """Perform dropout. Args: input_tensor: float Tensor. dropout_prob: Python float. The probability of dropping out a value (NOT of *keeping* a dimension as in `tf.nn.dropout`). Returns: A version of `input_tensor` with dropout applied. """ if dropout_prob is None or dropout_prob == 0.0: return input_tensor output = tf.nn.dropout(input_tensor, 1.0 - dropout_prob) return output 

4. BERT家族参数量

模型参数量
bert-base110M(1.1亿)
bert-large340M
DistilBERT66M
MobileBERT25M

Read more

前端防范 XSS(跨站脚本攻击)

目录 一、防范措施 1.layui util  核心转义的特殊字符 示例 2.js-xss.js库 安装 1. Node.js 环境(npm/yarn) 2. 浏览器环境 核心 API 基础使用 1. 基础过滤(默认规则) 2. 自定义过滤规则 (1)允许特定标签 (2)允许特定属性 (3)自定义标签处理 (4)自定义属性处理 (5)转义特定字符 常见场景示例 1. 过滤用户输入的评论内容 2. 允许特定富文本标签(如富文本编辑器内容) 注意事项 更多配置 XSS(跨站脚本攻击)是一种常见的网络攻击手段,它允许攻击者将恶意脚本注入到其他用户的浏览器中。

详细教程:如何从前端查看调用接口、传参及返回结果(附带图片案例)

详细教程:如何从前端查看调用接口、传参及返回结果(附带图片案例)

目录 1. 打开浏览器开发者工具 2. 使用 Network 面板 3. 查看具体的API请求 a. Headers b. Payload c. Response d. Preview e. Timing 4. 实际操作步骤 5. 常见问题及解决方法 a. 无法看到API请求 b. 请求失败 c. 跨域问题(CORS) 作为一名后端工程师,理解前端如何调用接口、传递参数以及接收返回值是非常重要的。下面将详细介绍如何通过浏览器开发者工具(F12)查看和分析这些信息,并附带图片案例帮助你更好地理解。 1. 打开浏览器开发者工具 按下 F12 或右键点击页面选择“检查”可以打开浏览器的开发者工具。常用的浏览器如Chrome、Firefox等都内置了开发者工具。下面是我选择我的一篇文章,打开开发者工具进行演示。 2. 使用

Cursor+Codex隐藏技巧:用截图秒修前端Bug的保姆级教程(React/Chakra UI案例)

Cursor+Codex隐藏技巧:用截图秒修前端Bug的保姆级教程(React/Chakra UI案例) 前端开发中最令人头疼的莫过于那些难以定位的UI问题——元素错位、样式冲突、响应式失效...传统调试方式往往需要反复修改代码、刷新页面、检查元素。现在,通过Cursor编辑器集成的Codex功能,你可以直接用截图交互快速定位和修复这些问题。本文将带你从零开始,掌握这套革命性的调试工作流。 1. 环境准备与基础配置 在开始之前,确保你已经具备以下环境: * Cursor编辑器最新版(v2.5+) * Node.js 18.x及以上版本 * React 18项目(本文以Chakra UI 2.x为例) 首先在Cursor中安装Codex插件: 1. 点击左侧扩展图标 2. 搜索"Codex"并安装 3. 登录你的OpenAI账户(需要ChatGPT Plus订阅) 关键配置项: // 在项目根目录创建.