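I have a column of clean sentences in an Excel file. I just want to load that specific column into a DataFrame and then put each sentence between the BERT tokenizer markers ([CLS] and [SEP]).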
import pandas as pd
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()
marked_text = "[CLS] " + str(text) + " [SEP]"
marked_text[:10211]
I do not get CLS and SEP around every sentence. The output is:
'[CLS] [\'I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits\', \'Caught up on Dynasties and now need a large gin and some ther...
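SEP does not appear at all. Note that the first sentence in the output above is the first row, the second sentence is the second row, and so on.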
Answer 0 (score: 0)
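[SEP] is there, but only once, at the end of the stringified list. str(text) converts the whole list into a single string, so [CLS] and [SEP] are added around the entire list rather than around each sentence. You can check this by printing the last 10 characters of the string: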
print(marked_text[-10:])
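To see why the markers show up only once, here is a minimal sketch with a two-item list standing in for the Excel column. The sample sentences are made up for illustration and are not from the question's data.

```python
# Hypothetical stand-in for df["text_clean"].astype(str).tolist()
text = ["first sentence", "second sentence"]

# str() on a list returns the list's repr as ONE string,
# brackets and quotes included, so the markers wrap the whole list.
marked_text = "[CLS] " + str(text) + " [SEP]"

print(marked_text)
print(marked_text.count("[CLS]"), marked_text.count("[SEP]"))
```

Each marker is counted exactly once, no matter how many sentences the list holds.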
That said, I guess your expected result is
[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP]
[CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]
...
To get that, apply the string concatenation to each entry of text individually:
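import pandas as pd
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()
# Wrap every sentence on its own instead of the stringified list
marked_text = []
for e in text:
    marked_text.append("[CLS] " + str(e) + " [SEP]")
# Print all marked sentences, separated by spaces
print(*marked_text)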
Output:
[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP] [CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]...
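The same per-sentence wrapping can also be written as a list comprehension. This is only a sketch: the helper name add_bert_markers and the sample sentences are hypothetical, and the list stands in for the column read from the Excel file.

```python
def add_bert_markers(sentences):
    """Return a new list with each sentence wrapped in its own [CLS] ... [SEP] pair."""
    return ["[CLS] " + str(s) + " [SEP]" for s in sentences]

# Hypothetical stand-in for df["text_clean"].astype(str).tolist()
text = ["first sentence", "second sentence"]
marked_text = add_bert_markers(text)

# Joining with newlines prints one marked sentence per line
print("\n".join(marked_text))
```

Joining with "\n" instead of using print(*marked_text) puts each marked sentence on its own line, which matches the expected result shown earlier.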