I have emails in a pandas dataframe. Before applying sent_tokenize, I can remove punctuation like this:
def removePunctuation(fullCorpus):
    # strip every character that is not a word character or whitespace
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '', regex=True)
    return punctuationRemoved
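For context, a minimal sketch of how this behaves (the sample text is illustrative; note that recent pandas versions require regex=True for pattern replacement in str.replace):

import pandas as pd

df = pd.DataFrame({'text': ["I HAVE A DATE ON SUNDAY WITH WILL!!"]})
print(removePunctuation(df))
# 0    I HAVE A DATE ON SUNDAY WITH WILL
# Name: text, dtype: object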
After applying sent_tokenize, the dataframe looks like the sample below. How can I remove punctuation while tokenizing the sentences into a list?
from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized
Sample of the dataframe after tokenizing into sentences:
[Nah I don't think he goes to usf, he lives around here though]
[Even my brother is not like to speak with me., They treat me like aids patent.]
[I HAVE A DATE ON SUNDAY WITH WILL!, !]
[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]
[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]
Answer (score: 1)
You can try the following function, where apply iterates over each word and character in the sentences, keeps only the characters that are not punctuation, and rebuilds each sentence with .join. You also need map, because the function has to be applied to every sentence in each row's list:
import string
from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # drop punctuation characters from a sentence, then rebuild the string
    f = lambda sent: ''.join(ch for w in sent for ch in w
                             if ch not in string.punctuation)
    # map applies f to every sentence in the row's list
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized
Note that import string is needed for string.punctuation.
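A quick check on a toy dataframe (the column name body_text comes from the question; the sample text is illustrative):

import pandas as pd
import nltk

nltk.download('punkt')  # sent_tokenize needs the punkt models downloaded once

df = pd.DataFrame({'body_text': ["WINNER!! You have been selected. Claim code KL341."]})
print(tokenizeSentences(df)[0])
# ['WINNER', 'You have been selected', 'Claim code KL341']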