How to use the pre-trained BERT model for next sentence labeling?

Time: 2019-01-18 22:37:01

Tags: tensorflow artificial-intelligence natural-language-processing

I am new to AI and NLP. I want to see how BERT works. I am using the pre-trained BERT model: https://github.com/google-research/bert

I ran the extract_features.py example described in the "extract features" section of the README.md and got vectors as output.

How can I convert the results I get from extract_features.py into a next / not-next label?

I want to run BERT to check whether two sentences are related and see the result.

Thanks!

2 Answers:

Answer 0 (score: 1)

The answer is to use the weights that were trained for the next-sentence task and take the logits from there. So, to use BERT for nextSentence prediction, feed in the two sentences in the same format that was used during pre-training.

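As a rough sketch, assuming the InputExample class from run_classifier.py in the BERT repo (the guid, sentences and label value below are made up for illustration):

# Hypothetical example: wrap the two sentences in an InputExample so they can
# be passed through convert_single_example() shown below.
from run_classifier import InputExample  # google-research/bert

example = InputExample(
    guid="pair-0",
    text_a="The man went to the store.",
    text_b="He bought a gallon of milk.",
    label="0")          # dummy label; only needed because the converter logs it
label_list = ["0", "1"]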

Then extend the BERT model with the following code (this is the convert_single_example function from run_classifier.py in the BERT repo):

def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
    """Converts a single `InputExample` into a single `InputFeatures`."""
    label_map = {}
    for (i, label) in enumerate(label_list):
        label_map[label] = i

    tokens_a = tokenizer.tokenize(example.text_a)
    tokens_b = None
    if example.text_b:
        tokens_b = tokenizer.tokenize(example.text_b)

    if tokens_b:
        # Modifies `tokens_a` and `tokens_b` in place so that the total
        # length is less than the specified length.
        # Account for [CLS], [SEP], [SEP] with "- 3"
        _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
    else:
        # Account for [CLS] and [SEP] with "- 2"
        if len(tokens_a) > max_seq_length - 2:
            tokens_a = tokens_a[0:(max_seq_length - 2)]

    # The convention in BERT is:
    # (a) For sequence pairs:
    #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
    #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
    # (b) For single sequences:
    #  tokens:   [CLS] the dog is hairy . [SEP]
    #  type_ids: 0     0   0   0  0     0 0
    #
    # Where "type_ids" are used to indicate whether this is the first
    # sequence or the second sequence. The embedding vectors for `type=0` and
    # `type=1` were learned during pre-training and are added to the wordpiece
    # embedding vector (and position vector). This is not *strictly* necessary
    # since the [SEP] token unambiguously separates the sequences, but it makes
    # it easier for the model to learn the concept of sequences.
    #
    # For classification tasks, the first vector (corresponding to [CLS]) is
    # used as the "sentence vector". Note that this only makes sense because
    # the entire model is fine-tuned.
    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    if tokens_b:
        for token in tokens_b:
            tokens.append(token)
            segment_ids.append(1)
        tokens.append("[SEP]")
        segment_ids.append(1)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    label_id = label_map[example.label]
    if ex_index < 5:
        tf.logging.info("*** Example ***")
        tf.logging.info("guid: %s" % (example.guid))
        tf.logging.info("tokens: %s" % " ".join(
            [tokenization.printable_text(x) for x in tokens]))
        tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
        tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
        tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
        tf.logging.info("label: %s (id = %d)" % (example.label, label_id))

    feature = InputFeatures(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids,
        label_id=label_id)
    return feature
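
Something along these lines could then expose the next-sentence logits on top of the pooled [CLS] output. This is only a sketch: bert_config, input_ids, input_mask and segment_ids are assumed to be tensors built from the feature above, and the weights have to be restored from the released checkpoint (which stores the next-sentence head under the cls/seq_relationship scope):

import tensorflow as tf
import modeling  # from the google-research/bert repo

model = modeling.BertModel(
    config=bert_config,        # assumed: loaded with modeling.BertConfig.from_json_file
    is_training=False,
    input_ids=input_ids,       # assumed: int32 tensors built from `feature`
    input_mask=input_mask,
    token_type_ids=segment_ids)

pooled_output = model.get_pooled_output()  # [batch_size, hidden_size], the [CLS] vector

# Recreate the pre-trained next-sentence head so its weights can be restored
# from the checkpoint (run_pretraining.py keeps them in "cls/seq_relationship").
with tf.variable_scope("cls/seq_relationship"):
    output_weights = tf.get_variable(
        "output_weights", shape=[2, bert_config.hidden_size])
    output_bias = tf.get_variable("output_bias", shape=[2])

logits = tf.matmul(pooled_output, output_weights, transpose_b=True)
logits = tf.nn.bias_add(logits, output_bias)
probabilities = tf.nn.softmax(logits, axis=-1)
# After restoring the checkpoint (e.g. tf.train.init_from_checkpoint):
# probabilities[:, 0] = "sentence B follows A", probabilities[:, 1] = "it does not".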

probabilities is what you need: it is the nextSentence prediction.

Answer 1 (score: 0)

I am not sure how to do it in TensorFlow, but in the PyTorch implementation from Hugging Face (https://github.com/huggingface/pytorch-pretrained-BERT/blob/master/pytorch_pretrained_bert/modeling.py#L854) there is a model BertForNextSentencePrediction.
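
A minimal sketch of how that model could be called (the "bert-base-uncased" checkpoint and the example sentences are illustrative, not part of the answer):

import torch
from pytorch_pretrained_bert import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

text_a = "The man went to the store."
text_b = "He bought a gallon of milk."

tokens_a = tokenizer.tokenize(text_a)
tokens_b = tokenizer.tokenize(text_b)
tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
token_type_ids = torch.tensor([segment_ids])

with torch.no_grad():
    # Without a next_sentence_label argument the model returns the
    # seq_relationship logits directly.
    logits = model(input_ids, token_type_ids=token_type_ids)

probs = torch.nn.functional.softmax(logits, dim=-1)
print(probs)  # probs[0, 0] = "B is the next sentence", probs[0, 1] = "B is not"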