I processed my data like this. Below is the code I used for the cleaning.
cnn2['text'] = cnn2['text'].str.lower()
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
puncts = '!"#$%&\'()*+,-/:;<=>?@[]^_`{|}~'
def remove_punctuation(txt):
    txt_nopunct = ''.join([c for c in txt if c not in puncts])
    return txt_nopunct
cnn2['text'] = cnn2['text'].str.replace('"', '')
cnn2['text'] = cnn2['text'].str.replace("'", '')
cnn2['text'] = cnn2['text'].apply(lambda x: remove_punctuation(x))
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
cnn2['text'] = cnn2['text'].str.replace('cnn', '')
cnn2.to_csv('2013_10557_cnn_cleaned.csv')
cnn2['text'] = cnn2['text'].str.replace('washington', '')
cnn2['text'] = cnn2['text'].str.replace('new york', '')
cnn2['text'] = cnn2['text'].str.replace('seoul south korea', '')
cnn2['text'] = cnn2['text'].str.replace('pyongyang north korea', '')
cnn2.to_csv('cnn_cleaned.csv')
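As an aside, the repeated literal `str.replace` calls and the per-character loop can each be collapsed into a single pass. A minimal sketch of the same cleaning on a toy frame (the `cnn2` data itself is not shown in the question, so the sample row here is made up):

```python
import re
import pandas as pd

# Toy stand-in for cnn2; the real CSV is not part of the question.
cnn2 = pd.DataFrame({'text': ['(CNN) -- New York: "Hello, world!"']})

puncts = '!"#$%&\'()*+,-/:;<=>?@[]^_`{|}~'
drop_phrases = ['cnn', 'washington', 'new york',
                'seoul south korea', 'pyongyang north korea']

cnn2['text'] = cnn2['text'].str.lower()

# One regex alternation removes every phrase; re.escape makes them match literally.
pattern = '|'.join(re.escape(p) for p in drop_phrases)
cnn2['text'] = cnn2['text'].str.replace(pattern, '', regex=True)

# str.translate with a deletion table strips all punctuation in one pass.
cnn2['text'] = cnn2['text'].str.translate(str.maketrans('', '', puncts))
print(cnn2['text'][0])
```

Passing `regex=True` explicitly also avoids surprises, since recent pandas versions default `Series.str.replace` to literal matching.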
Now I need to tokenize cnn2['text'] (the column named "text") into sentences and overwrite the column with the result, but I don't know how to do it. Below is the code I tried; it did not work: the line sent = sent_tokenize(cnn2['text'][i]) raises TypeError: expected string or bytes-like object. What should I do?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
for i in range(518):
    sent = sent_tokenize(cnn2['text'][i])
    cnn2['text'][i] = sent