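import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
    # Load the CSV and hold out 10% of the rows as the test set
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.1, random_state=100)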
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.1, random_state=100)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
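When I run the BERT tokenizer on the texts split from the dataframe, I get an error like this.
Answer 0 (score: 1)
I had the same error. The problem was that my list contained None entries, for example: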
import pandas as pd
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-german-cased')
# create test dataframe
texts = ['Vero Moda Damen Übergangsmantel Kurzmantel Chic Business Coatigan SALE',
'Neu Herren Damen Sportschuhe Sneaker Turnschuhe Freizeit 1975 Schuhe Gr. 36-46',
'KOMBI-ANGEBOT Zuckerpaste STRONG / SOFT / ZUBEHÖR -Sugaring Wachs Haarentfernung',
None]
labels = [1, 2, 3, 1]
d = {'texts': texts, 'labels': labels}
test_df = pd.DataFrame(d)
So I dropped all rows containing None before converting the DataFrame column to a list.
test_df = test_df.dropna()
texts = test_df["texts"].tolist()
texts_encodings = tokenizer(texts, truncation=True, padding=True)
This worked for me.
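The same check can be done on plain lists when the texts and labels have already been pulled out of the DataFrame. Below is a minimal sketch in pure Python; `find_invalid_texts` and `keep_valid_pairs` are hypothetical helper names, not part of transformers or pandas. It treats anything that is not a `str` as invalid, which also catches the float NaN values that `pd.read_csv` produces for empty cells.

```python
def find_invalid_texts(texts):
    """Return (index, value) pairs for entries that are not plain strings."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]


def keep_valid_pairs(texts, labels):
    """Drop text/label pairs whose text is not a string, keeping both lists aligned."""
    pairs = [(t, l) for t, l in zip(texts, labels) if isinstance(t, str)]
    clean_texts = [t for t, _ in pairs]
    clean_labels = [l for _, l in pairs]
    return clean_texts, clean_labels


# Illustrative data: one None and one NaN hidden among valid strings
texts = [
    "Vero Moda Damen Übergangsmantel",
    None,
    float("nan"),
    "Neu Herren Damen Sportschuhe",
]
labels = [1, 2, 3, 1]

bad = find_invalid_texts(texts)
clean_texts, clean_labels = keep_valid_pairs(texts, labels)
print("invalid positions:", [i for i, _ in bad])
print("rows kept:", len(clean_texts))
```

Filtering texts and labels together matters: dropping only the bad texts would leave the labels list longer than the texts list.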
Answer 1 (score: 0)
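import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(path):
    # Same as the question, but with a 20% hold-out instead of 10%
    df = pd.read_csv(path)
    return train_test_split(df, test_size=0.2, random_state=100)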
train, test = split_data(DATA_DIR)
train_texts, train_labels = train['text'].to_list(), train['sentiment'].to_list()
test_texts, test_labels = test['text'].to_list(), test['sentiment'].to_list()
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2, random_state=100)
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
valid_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
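Try changing the size of the split (here test_size=0.2 instead of 0.1). That fixed it for me, which suggests the split did not leave enough data for the tokenizer to tokenize.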
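To check how many rows each split actually receives for a given test_size, the arithmetic can be sketched as follows. This assumes that, for a float test_size, the test share is rounded up and the remaining rows go to training, which is my understanding of how scikit-learn's train_test_split behaves. `split_sizes` and `nested_split_sizes` are hypothetical helpers, and the row count of 1000 is made up for illustration.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for one split, rounding the test share up (assumed)."""
    n_test = math.ceil(test_size * n_rows)
    return n_rows - n_test, n_test


def nested_split_sizes(n_rows, test_size):
    """Sizes after splitting off a test set, then a validation set from the remainder."""
    n_train_full, n_test = split_sizes(n_rows, test_size)
    n_train, n_val = split_sizes(n_train_full, test_size)
    return {"train": n_train, "val": n_val, "test": n_test}


# Compare the two settings used in this thread on a hypothetical 1000-row file
for size in (0.1, 0.2):
    print(size, nested_split_sizes(1000, size))
```

If any of these counts comes out as zero for your file, the corresponding list passed to the tokenizer would be empty.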
Answer 2 (score: 0)
In my case, I had to set is_split_into_words=True.
https://huggingface.co/transformers/main_classes/tokenizer.html
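From the documentation: "The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set is_split_into_words=True (to lift the ambiguity with a batch of sequences)."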
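The ambiguity the documentation mentions can be illustrated without loading a tokenizer. The sketch below is a simplified imitation of the interpretation described in the quote, not the library's own logic; `input_kind` is a hypothetical helper that only reports how a given input would be read with or without the flag.

```python
def input_kind(text, is_split_into_words=False):
    """Describe how an input would be read, following the quoted documentation."""

    def is_str_list(x):
        return isinstance(x, (list, tuple)) and all(isinstance(s, str) for s in x)

    if isinstance(text, str):
        return "single sequence"
    if is_str_list(text):
        # The same list of strings means two different things depending on the flag
        if is_split_into_words:
            return "single pre-tokenized sequence"
        return "batch of sequences"
    if isinstance(text, (list, tuple)) and all(is_str_list(s) for s in text):
        if is_split_into_words:
            return "batch of pre-tokenized sequences"
        raise ValueError("lists of words need is_split_into_words=True")
    # Anything else (None, NaN, numbers, mixed lists) is not valid text input
    raise ValueError("input must be str, list of str, or list of list of str")


words = ["Neu", "Herren", "Damen", "Sportschuhe"]
print(input_kind(words))                            # read as four separate texts
print(input_kind(words, is_split_into_words=True))  # read as one text of four words
```

So if your texts are already split into words, passing them without the flag either changes their meaning or, for nested lists, fails outright.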