How to elegantly remove n-length ellipses from a string (NLP with spaCy)?

Asked: 2019-07-08 09:20:31

Tags: python nlp spacy

I'm currently doing data cleaning on this spam text message dataset. The messages contain a lot of ellipses, for example:

mystr = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

As you can see, the ellipses have either 2 periods (..) or 3 periods (...).

My initial solution was to write a function spacy_tokenizer that tokenizes my string and removes stop words and punctuation:

import spacy
nlp = spacy.load('en_core_web_sm')
from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))
print(stopWords)

import string
punctuations = string.punctuation
def spacy_tokenizer(sentence):
    # Create token object
    mytokens = nlp(sentence)
    # Case normalization and Lemmatization
    mytokens = [ word.lemma_.lower() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuations
    mytokens = [ word.strip(".") for word in mytokens if word not in stopWords and word not in punctuations ]
    # return preprocessed list of tokens
    return mytokens

However, this function does not get rid of the ellipses:

IN: print(spacy_tokenizer(mystr))
OUT: ['go', 'jurong', 'point', 'crazy', '', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '', 'cine', 'get', 'amore', 'wat', '']

As you can see, there are tokens with len(token) = 0 that show up as ''.
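The empty strings come from the ellipsis tokens themselves: a multi-character token like '..' is not an element of string.punctuation (which only matches single characters under the `in` check), so it survives the punctuation filter, and strip(".") then reduces it to ''. A minimal illustration:

```python
import string

punctuations = string.punctuation

for tok in ("..", "..."):
    # The multi-character token is not a substring of string.punctuation,
    # so `tok not in punctuations` is True and the filter keeps it
    assert tok not in punctuations
    # Stripping periods from an ellipsis token leaves an empty string
    assert tok.strip(".") == ""

print("ellipsis tokens pass the filter, then become '' after strip('.')")
```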

My workaround is to add another list comprehension to spacy_tokenizer, like this: [ word for word in mytokens if len(word) > 0]

def spacy_tokenizer(sentence):
    # Create token object
    mytokens = nlp(sentence)
    # Case normalization and Lemmatization
    mytokens = [ word.lemma_.lower() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuations
    mytokens = [ word.strip(".") for word in mytokens if word not in stopWords and word not in punctuations ]
    # remove empty strings
    mytokens = [ word for word in mytokens if len(word) > 0]
    return mytokens

IN: print(spacy_tokenizer(mystr))
OUT: ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat']


So the new function gives the expected result, but it doesn't strike me as the most elegant solution. Does anyone have other ideas?

2 Answers:

Answer 0 (score: 0)

This will remove ellipses of 2 or 3 periods:

import re

regex = r"[.]{2,3}"
test_str = "Go until jurong point, crazy.. Available only. in bugis n great world la e buffet... Cine there got amore wat..."
subst = ""

result = re.sub(regex, subst, test_str)

if result:
    print (result)

You can also try it here if you want.
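One way to fold this into the question's pipeline is to strip the ellipses before handing the text to nlp(). This is a sketch with a hypothetical helper name (strip_ellipses); substituting a space rather than the empty string avoids gluing adjacent words together when an ellipsis has no surrounding whitespace:

```python
import re

# Matches a run of 2 or 3 periods, same as the answer's r"[.]{2,3}"
ELLIPSIS = re.compile(r"\.{2,3}")

def strip_ellipses(text):
    """Replace each 2-3 period ellipsis with a single space (hypothetical helper)."""
    return ELLIPSIS.sub(" ", text)

mystr = ('Go until jurong point, crazy.. Available only in bugis n great '
         'world la e buffet... Cine there got amore wat...')
# Preprocess, then pass the cleaned string on to spacy_tokenizer / nlp()
print(strip_ellipses(mystr))
```

Substituting a space means "crazy..Available" would become "crazy Available" instead of "crazyAvailable"; spaCy ignores the extra whitespace when tokenizing.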

Answer 1 (score: -1)

If you don't care about punctuation at all (and it looks like you don't, since you also removed the commas in your example sentence), you should consider removing all punctuation.

import re

sent = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
multipunc = re.compile(r"[.,]+")  # raw string; "[\.,]+" triggers an invalid-escape warning
sent = multipunc.sub(" ", sent).lower().split()

The snippet currently does not consider punctuation other than . and ,. If you want to remove anything that isn't an alphanumeric character, consider using the \w character class.
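A sketch of that \w-based variant, which drops every run of non-word, non-whitespace characters instead of just periods and commas:

```python
import re

sent = ('Go until jurong point, crazy.. Available only in bugis n great '
        'world la e buffet... Cine there got amore wat...')

# [^\w\s]+ matches one or more characters that are neither word
# characters (letters, digits, underscore) nor whitespace, so it
# removes commas, ellipses, and any other punctuation in one pass.
nonword = re.compile(r"[^\w\s]+")
tokens = nonword.sub(" ", sent).lower().split()
print(tokens)
```

Note that \w also keeps digits and underscores, so tokens like "2nite" or "foo_bar" survive; whether that is desirable depends on the downstream model.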