I'm currently doing data cleaning on this spam text message dataset. Many of the messages contain ellipses, for example:
mystr = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
As you can see, the ellipses have either 2 periods (..) or 3 periods (...)
My initial solution was to write a function spacy_tokenizer that tokenizes my string and removes stop words as well as punctuation:
import spacy
nlp = spacy.load('en_core_web_sm')

from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))
print(stopWords)

import string
punctuations = string.punctuation

def spacy_tokenizer(sentence):
    # Create token object
    mytokens = nlp(sentence)
    # Case normalization and Lemmatization
    mytokens = [ word.lemma_.lower() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuations
    mytokens = [ word.strip(".") for word in mytokens if word not in stopWords and word not in punctuations ]
    # return preprocessed list of tokens
    return mytokens
However, this function does not remove the ellipses:
IN: print(spacy_tokenizer(mystr))
OUT: ['go', 'jurong', 'point', 'crazy', '', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', '', 'cine', 'get', 'amore', 'wat', '']
As you can see, there are tokens with len(token) == 0 that show up as ''.
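The empty strings appear because string.punctuation contains only single characters, so a multi-character token like '..' passes the membership test and is then stripped down to ''. A minimal sketch of that effect (plain Python, no spaCy needed):

```python
import string

punctuations = string.punctuation

# string.punctuation is a string of single characters, so the
# membership test is a substring check that '..' does not pass:
print(".." in punctuations)   # False -> the '..' token survives the filter
print("." in punctuations)    # True  -> a lone period is removed

# The surviving '..' token is then stripped down to an empty string:
print("..".strip("."))        # ''
```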
My workaround is to add another list comprehension to spacy_tokenizer, like so: [ word for word in mytokens if len(word) > 0 ]
def spacy_tokenizer(sentence):
    # Create token object
    mytokens = nlp(sentence)
    # Case normalization and Lemmatization
    mytokens = [ word.lemma_.lower() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Remove stop words and punctuations
    mytokens = [ word.strip(".") for word in mytokens if word not in stopWords and word not in punctuations ]
    # Remove empty strings
    mytokens = [ word for word in mytokens if len(word) > 0 ]
    return mytokens
IN: print(spacy_tokenizer(mystr))
OUT: ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'get', 'amore', 'wat']
So the new function gives the expected result, but it isn't what I'd consider the most elegant solution. Does anyone have other ideas?
Answer 0 (score: 0)
This will remove ellipses of 2 or 3 periods:
import re

regex = r"[.]{2,3}"
test_str = "Go until jurong point, crazy.. Available only. in bugis n great world la e buffet... Cine there got amore wat..."
subst = ""

result = re.sub(regex, subst, test_str)
if result:
    print(result)
You can also try it out here, if you like.
Answer 1 (score: -1)
If you don't care about punctuation at all (which appears to be the case, since you also removed the commas from your example sentence), you should consider removing all punctuation.
import re

sent = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'
multipunc = re.compile(r"[.,]+")
sent = multipunc.sub(" ", sent).lower().split()
The snippet currently only handles . and , and ignores other punctuation. If you want to remove anything other than alphanumeric characters, consider building on the \w character class.
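For instance, a sketch of that broader approach (assuming the goal is to keep only word characters) might look like:

```python
import re

sent = 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...'

# Collapse any run of characters that are neither word characters
# nor whitespace into a single space, then lower-case and split.
tokens = re.sub(r"[^\w\s]+", " ", sent).lower().split()
print(tokens[:5])  # ['go', 'until', 'jurong', 'point', 'crazy']
```

Unlike stripping periods after the fact, this never produces empty tokens, so no second filtering pass is needed.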