Using BERT

Date: 2020-07-27 08:48:01

Tags: tensorflow nlp information-retrieval transformer question-answering

I am currently using BERT to develop a question answering system (in Indonesian). Both the dataset and the questions that will be asked are in Indonesian.

The problem is that I am still unclear about the step-by-step process of building a question answering system with BERT.

From what I have gathered after reading a number of research journals and papers, the process seems to be roughly the following:

  1. Prepare the main dataset
  2. Load the pre-trained data
  3. Train on the main dataset starting from the pre-trained data (to produce a "fine-tuned" model; see the rough sketch after this list)
  4. Cluster the fine-tuned model
  5. Test (ask the system questions)
  6. Evaluate
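
For context, this is roughly how I picture step 3 with the Huggingface transformers library. It is only a sketch: the multilingual checkpoint, the toy question/context pair, and the hard-coded answer span are placeholders rather than my actual Indonesian data.

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Placeholder checkpoint: a multilingual BERT as a stand-in pre-trained model
model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Siapa presiden pertama Indonesia?"            # toy question
context = "Soekarno adalah presiden pertama Indonesia."   # toy context
inputs = tokenizer(question, context, return_tensors="pt")

# In a real pipeline these token indices would be derived from the answer's
# character span (e.g. via return_offsets_mapping=True); here they are
# hard-coded only to show the labels the QA head is trained on.
start_positions = torch.tensor([1])
end_positions = torch.tensor([1])

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

outputs = model(**inputs, start_positions=start_positions, end_positions=end_positions)
outputs.loss.backward()  # cross-entropy loss over the start/end positions
optimizer.step()
optimizer.zero_grad()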

What I want to ask is:

  • Are these steps correct, or is anything missing?
  • Also, if the default pre-trained data that BERT provides is in English while my main dataset is in Indonesian, how do I create my own Indonesian pre-trained data?
  • Is it really necessary to perform data/model clustering with BERT?

I would appreciate any helpful answer. Many thanks in advance.

1 answer:

Answer 0 (score: 0)

I would take a look at Huggingface's question answering example. That is at least a good starting point.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    # With recent versions of transformers the forward pass returns a
    # QuestionAnsweringModelOutput rather than a plain tuple of scores.
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the most likely beginning and end of the answer with the argmax of the scores
    answer_start = torch.argmax(outputs.start_logits)
    answer_end = torch.argmax(outputs.end_logits) + 1

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")