Removing punctuation from a list of sentences in a pandas DataFrame

Posted: 2018-08-04 16:37:03

Tags: python pandas nlp

I have emails in a pandas DataFrame. Before applying sent_tokenize, I can remove the punctuation like this:

def removePunctuation(fullCorpus):
    punctuationRemoved = fullCorpus['text'].str.replace(r'[^\w\s]+', '')
    return punctuationRemoved
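As a standalone illustration of what that pattern does (a minimal sketch using only re rather than the pandas `.str` API): `[^\w\s]+` matches a run of one or more characters that are neither word characters nor whitespace, so every such run of punctuation is deleted.

```python
import re

# Same pattern the question passes to .str.replace:
# [^\w\s]+ matches runs of characters that are neither
# word characters nor whitespace, i.e. punctuation.
def remove_punct(text):
    return re.sub(r'[^\w\s]+', '', text)

print(remove_punct("I HAVE A DATE ON SUNDAY WITH WILL!!"))
# I HAVE A DATE ON SUNDAY WITH WILL
```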

After applying sent_tokenize, the DataFrame looks like the example below. How can I remove the punctuation while the sentences are tokenized into lists?

Here is the function that applies sent_tokenize:

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    return sent_tokenized
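For readers without NLTK installed, the shape of sent_tokenize's output can be approximated with a naive regex splitter. This is a hypothetical stand-in for illustration only, not NLTK's actual algorithm: each text becomes a list of sentence strings.

```python
import re

# Hypothetical stand-in for nltk.sent_tokenize (illustration only):
# split after sentence-ending punctuation that is followed by whitespace.
def naive_sent_tokenize(text):
    return re.split(r'(?<=[.!?])\s+', text.strip())

print(naive_sent_tokenize("WINNER!! To claim call 09061701461. Valid 12 hours only."))
# ['WINNER!!', 'To claim call 09061701461.', 'Valid 12 hours only.']
```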

Sample of the DataFrame after tokenizing into sentences:

[Nah I don't think he goes to usf, he lives around here though]                                                                                                                                                                                                                          

[Even my brother is not like to speak with me., They treat me like aids patent.]                                                                                                                                                                                                         

[I HAVE A DATE ON SUNDAY WITH WILL!, !]                                                                                                                                                                                                                                                  

[As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers., Press *9 to copy your friends Callertune]                                                                                                                      

[WINNER!!, As a valued network customer you have been selected to receivea £900 prize reward!, To claim call 09061701461., Claim code KL341., Valid 12 hours only.]

1 Answer:

Answer 0 (score: 1):

You can try the following function, where you loop over each word and each character in a sentence with apply, keep only the characters that are not punctuation, and rebuild the sentence with ''.join. You also need map, because the function has to be applied to each sentence in the row's list:

import string
from nltk.tokenize import sent_tokenize

def tokenizeSentences(fullCorpus):
    sent_tokenized = fullCorpus['body_text'].apply(sent_tokenize)
    # Keep only the characters of each sentence that are not punctuation.
    f = lambda sent: ''.join(ch for w in sent for ch in w
                             if ch not in string.punctuation)
    # map applies f to every sentence in the row's list of sentences.
    sent_tokenized = sent_tokenized.apply(lambda row: list(map(f, row)))
    return sent_tokenized

Note that import string is required for string.punctuation.
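The per-sentence filter from the answer can be exercised on its own. Since iterating over a string yields its characters, the nested for w in sent for ch in w is equivalent to iterating over characters directly; a minimal sketch with a plain list in place of the Series:

```python
import string

# Same filter as the answer: keep only characters that are not punctuation.
f = lambda sent: ''.join(ch for w in sent for ch in w
                         if ch not in string.punctuation)

sentences = ["WINNER!!", "I HAVE A DATE ON SUNDAY WITH WILL!", "Claim code KL341."]
print(list(map(f, sentences)))
# ['WINNER', 'I HAVE A DATE ON SUNDAY WITH WILL', 'Claim code KL341']
```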