I processed my data like this. Below is the code I used for the cleaning.
cnn2['text'] = cnn2['text'].str.lower()
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
puncts = '!"#$%&\'()*+,-/:;<=>?@[]^_`{|}~'
def remove_punctuation(txt):
    txt_nopunct = ''.join([c for c in txt if c not in puncts])
    return txt_nopunct
cnn2['text'] = cnn2['text'].str.replace('"', '')
cnn2['text'] = cnn2['text'].str.replace("'", '')
cnn2['text'] = cnn2['text'].apply(lambda x: remove_punctuation(x))
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
cnn2['text'] = cnn2['text'].str.replace('cnn', '')
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
cnn2['text'] = cnn2['text'].str.replace('washington', '')
cnn2['text'] = cnn2['text'].str.replace('new york', '')
cnn2['text'] = cnn2['text'].str.replace('seoul south korea', '')
cnn2['text'] = cnn2['text'].str.replace('pyongyang north korea', '')
cnn2.to_csv('cnn_cleaned.csv')
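As an aside, the repeated literal `str.replace` calls and the per-character loop can each be collapsed into a single pass. A minimal sketch of the same cleaning on a toy frame (the `cnn2` data itself is not shown in the question, so the sample row here is made up):

```python
import re
import pandas as pd

# Toy stand-in for cnn2; the real CSV is not part of the question.
cnn2 = pd.DataFrame({'text': ['(CNN) -- New York: "Hello, world!"']})

puncts = '!"#$%&\'()*+,-/:;<=>?@[]^_`{|}~'
drop_phrases = ['cnn', 'washington', 'new york',
                'seoul south korea', 'pyongyang north korea']

cnn2['text'] = cnn2['text'].str.lower()

# One regex alternation removes every phrase; re.escape makes them match literally.
pattern = '|'.join(re.escape(p) for p in drop_phrases)
cnn2['text'] = cnn2['text'].str.replace(pattern, '', regex=True)

# str.translate with a deletion table strips all punctuation in one pass.
cnn2['text'] = cnn2['text'].str.translate(str.maketrans('', '', puncts))
print(cnn2['text'][0])
```

Passing `regex=True` explicitly also avoids surprises, since recent pandas versions default `Series.str.replace` to literal matching.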
Now I need to tokenize cnn2['text'] (the column named "text") into sentences and overwrite the column with the result, but I don't know how to do it. Below is the code I tried; it did not work: the line sent = sent_tokenize(cnn2['text'][i]) raises TypeError: expected string or bytes-like object. What should I do?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
for i in range(518):
    sent = sent_tokenize(cnn2['text'][i])
    cnn2['text'][i] = sent