Question

I am using BERT for a text classification task. When I tokenize a single data sample with this code:

encoded_sent = tokenizer.encode(
                        sentences[7],                       
                        add_special_tokens = True)

everything works fine, but whenever I try to tokenize the whole dataset with this code:

# For every sentence...
for sent in sentences:
    
    encoded_sent = tokenizer.encode(
                        sent,                       
                        add_special_tokens = True)

it gives me this error:

"ValueError: Input nan is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."

I tried it on English data that someone else had tokenized successfully, but I get the same error. This is how I load my data:

import pandas as pd

df=pd.read_csv("/content/DATA.csv",header=0,dtype=str)
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
df.columns = [DATA_COLUMN, LABEL_COLUMN]

df["sentence"].head()

This is how I load the tokenizer:

from transformers import AutoTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = AutoTokenizer.from_pretrained('aubmindlab/bert-base-arabert')

A sample of my data:

Original: مساعد نائب رئيس المنزل: لم نر حتى رسالة كومي حتى غردها جيسون تشافيتز

Tokenized: ['مساعد', 'نائب', 'رئيس', 'ال', '##منزل', ':', 'لم', 'نر', 'حتى', 'رسال', '##ة', 'كومي', 'حتى', 'غرد', '##ها', 'جيسون', 'تشافي', '##ت', '##ز']

Any suggestions?

Answer 1

Your data appears to contain NaN values. To fix this, either remove the NaN values or cast every entry to a string (a local workaround).

Try:
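As for why this happens even though the file is read with dtype=str: pandas still represents an empty cell as NaN, and that value is what reaches the tokenizer. Below is a minimal sketch of detecting and removing such rows, assuming a two-column CSV like the one in the question. The inline sample data is made up for illustration and stands in for DATA.csv.

```python
import io
import pandas as pd

# Made-up sample standing in for DATA.csv; the second row has an empty sentence.
csv_text = "sentence,label\nfirst sentence,1\n,0\nsecond sentence,1\n"
df = pd.read_csv(io.StringIO(csv_text), header=0, dtype=str)

# dtype=str does not convert missing cells; they are still NaN.
missing_mask = df["sentence"].isna()
print(df[missing_mask])  # rows that would trigger the ValueError

# Remove the NaN rows (sentence and label together) before tokenizing.
df_clean = df.dropna(subset=["sentence"]).reset_index(drop=True)
sentences = df_clean["sentence"].tolist()
```

Dropping whole rows keeps sentences and labels aligned, which matters for the classification step that follows.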

encoded_sent = tokenizer.encode(
        str(sent),                       
        add_special_tokens = True)

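You can use that workaround if you are sure your dataset does not contain NaN values. To check whether it does, print each sentence before encoding it; the last value printed before the error is the offending entry: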

for sent in sentences: 
    print(sent) 
    encoded_sent = tokenizer.encode( sent, add_special_tokens = True)
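Printing every sentence gets unwieldy on a large dataset. As an alternative sketch, the loop can collect only the positions of entries that are not strings. It also illustrates a caveat of the str() cast: a missing value becomes the literal text "nan", which tokenizes without error but adds a bogus sample. The helper name find_invalid is hypothetical, not part of any library.

```python
def find_invalid(sentences):
    """Return (index, value) pairs for entries that are not strings."""
    return [(i, s) for i, s in enumerate(sentences) if not isinstance(s, str)]

# Made-up data: the middle entry mimics an empty CSV cell loaded by pandas.
sentences = ["first sentence", float("nan"), "second sentence"]

bad = find_invalid(sentences)
bad_indices = [i for i, _ in bad]
print("Invalid entries at positions:", bad_indices)

# Casting hides the problem rather than fixing it: NaN becomes the word "nan".
casted = str(sentences[1])

# Skipping invalid entries leaves only real text to pass to tokenizer.encode.
valid = [s for s in sentences if isinstance(s, str)]
```

If labels are kept in a separate list, the same indices should be removed there too so the two lists stay aligned.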
