Using BERT

Date: 2020-07-27 08:48:01

Tags: tensorflow nlp information-retrieval transformer question-answering

I am currently using BERT to develop a question answering system (in Indonesian). Both the dataset and the questions that will be asked are in Indonesian.

The problem is that I am still unclear about the step-by-step process of building a question answering system with BERT.

From what I have gathered after reading a number of research journals and papers, the process seems to be roughly the following:

  1. Prepare the main dataset
  2. Load the pre-trained data
  3. Train on the main dataset starting from the pre-trained data (to produce a "fine-tuned" model; see the rough sketch after this list)
  4. Cluster the fine-tuned model
  5. Test (ask the system questions)
  6. Evaluate
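
For context, this is roughly how I picture step 3 with the Huggingface transformers library. It is only a sketch: the multilingual checkpoint, the toy question/context pair, and the hard-coded answer span are placeholders rather than my actual Indonesian data.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Placeholder checkpoint: a multilingual BERT as a stand-in pre-trained model
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Siapa presiden pertama Indonesia?"            # toy question
context = "Soekarno adalah presiden pertama Indonesia."   # toy context
inputs = tokenizer(question, context, return_tensors="pt")

# In a real pipeline these token indices would be derived from the answer's
# character span (e.g. via return_offsets_mapping=True); here they are
# hard-coded only to show the labels the QA head is trained on.
start_positions = torch.tensor([1])
end_positions = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
outputs.loss.backward()  # cross-entropy loss over the start/end positions
optimizer.step()
optimizer.zero_grad()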

What I want to ask is:

  • Are these steps correct, or is anything missing?
  • Also, if the default pre-trained data that BERT provides is in English while my main dataset is in Indonesian, how do I create my own Indonesian pre-trained data?
  • Is it really necessary to perform data/model clustering with BERT?

I would appreciate any helpful answer. Many thanks in advance.

1 answer:

Answer 0 (score: 0)

I would take a look at Huggingface's question answering example. That is at least a good starting point.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # With recent versions of transformers the forward pass returns a
    # QuestionAnsweringModelOutput rather than a plain tuple of scores.
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the most likely beginning and end of the answer with the argmax of the scores
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")