Question

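I have a column of cleaned sentences in Excel. I just want to load that specific column into a DataFrame and then wrap each sentence in the BERT tokenizer's [CLS] and [SEP] markers.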

import pandas as pd
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()
marked_text = "[CLS] " + str(text) + " [SEP]"
marked_text[:10211]

I am not getting CLS and SEP around every sentence. The output is:

'[CLS] [\'I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits\', \'Caught up on Dynasties and now need a large gin and some ther...

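No SEP is found at all. Just as a reminder, the first sentence in the output above is the first row, the second sentence is the second row, and so on.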

Answer 1

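The [SEP] is there, but only once, at the very end of the stringified list: str(text) converts the whole list into a single string, so the markers wrap the list as a whole rather than each sentence. You can check this by printing the last 10 characters of the string: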

print(marked_text[-10:])
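To make the cause easier to see, here is a minimal sketch with a two-item list standing in for the Excel column (the sentences are made up for illustration, not taken from the asker's file). It shows that the markers appear exactly once each, around the list's string representation:

```python
# Hypothetical stand-in for df["text_clean"].astype(str).tolist()
text = ["first sentence", "second sentence"]

# str(text) turns the whole list into ONE string, brackets and quotes included,
# so [CLS] and [SEP] are attached to the list as a whole.
marked_text = "[CLS] " + str(text) + " [SEP]"

print(marked_text)          # the full string
print(marked_text[-10:])    # the tail, where the single [SEP] sits
```

With a long column, slicing only the start of the string (as in marked_text[:10211]) can cut off before the single [SEP] at the end.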

That said, I suppose your expected result is:

[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP]
[CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]
...

To get that, apply the string concatenation to each text entry individually:

import pandas as pd
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()
marked_text = []
for e in text:
    marked_text.append("[CLS] " + str(e) + " [SEP]")
print(*marked_text)

Output:

[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP] [CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]...
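The same per-entry concatenation can also be written as a list comprehension. The sketch below is only an alternative spelling of the loop, using a made-up two-item list in place of the Excel column; passing sep="\n" to print is an optional choice that puts each marked sentence on its own line:

```python
# Hypothetical stand-in for df["text_clean"].astype(str).tolist()
text = ["first sentence", "second sentence"]

# Wrap every entry separately, so each sentence gets its own [CLS] and [SEP].
marked_text = ["[CLS] " + e + " [SEP]" for e in text]

# Print one marked sentence per line instead of space-separated on one line.
print(*marked_text, sep="\n")
```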
