How to sent_tokenize a column

Time: 2019-10-16 23:28:42

Tags: python nltk tokenize

I processed my data like this. Below is the code I used for the cleaning.

cnn2['text'] = cnn2['text'].str.lower()
cnn2.to_csv('2013_10557_cnn_cleaned.csv')

puncts = '!”#$%&’()*+,-/:;<=>?@[]^_`{|}~'
def remove_punctuation(txt):
    # keep only characters that are not in the punctuation set
    txt_nopunct = ''.join([c for c in txt if c not in puncts])
    return txt_nopunct
# straight quotes are not in puncts above, so strip them separately
cnn2['text'] = cnn2['text'].str.replace('"', '')
cnn2['text'] = cnn2['text'].str.replace("'", '')
cnn2['text'] = cnn2['text'].apply(remove_punctuation)
cnn2.to_csv('2013_10557_cnn_cleaned.csv')

# strip the source tag and the dateline locations
cnn2['text'] = cnn2['text'].str.replace('cnn', '')
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
cnn2['text'] = cnn2['text'].str.replace('washington', '')
cnn2['text'] = cnn2['text'].str.replace('new york', '')
cnn2['text'] = cnn2['text'].str.replace('seoul south korea', '')
cnn2['text'] = cnn2['text'].str.replace('pyongyang north korea', '')
cnn2.to_csv('cnn_cleaned.csv')
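The repeated replace passes above could also be folded into one cleaning function. This is a sketch under the assumption that the straight quotes can join the same punctuation set; `clean_text` and the tiny frame `df` are illustrative names, not part of my data:

```python
import pandas as pd

# same punctuation set as above, with the straight quotes folded in
puncts = '!”#$%&’()*+,-/:;<=>?@[]^_`{|}~' + '"\''

def clean_text(txt):
    # lowercase, strip punctuation, then drop the source/location phrases
    txt = txt.lower()
    txt = ''.join(c for c in txt if c not in puncts)
    for phrase in ('cnn', 'washington', 'new york',
                   'seoul south korea', 'pyongyang north korea'):
        txt = txt.replace(phrase, '')
    return txt

# tiny illustrative frame standing in for cnn2
df = pd.DataFrame({'text': ['(CNN) Hello, world!']})
df['text'] = df['text'].apply(clean_text)
```

This way the column is rewritten in a single `apply` pass instead of many chained `str.replace` calls.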

Now I need to sentence-tokenize cnn2['text'] (the column named "text") and overwrite the column with the result, but I don't know how to do it. Below is the code I tried; it didn't work, and I got TypeError: expected string or bytes-like object from sent = sent_tokenize(cnn2['text'][i]). What should I do?

import nltk

from nltk.tokenize import sent_tokenize, word_tokenize 
nltk.download('punkt')

for i in range(518):
    sent = sent_tokenize(cnn2['text'][i])
    cnn2['text'][i] = sent

0 Answers:

No answers yet.