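I wrote the following code to handle decimal numbers and remove punctuation, but it does not return the output I expect. Could you tell me what is wrong with my regular expression?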
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x) # set flag to allow verbose regexps
        ([A-Z]\.)+ # abbreviations, e.g. U.S.A.
        | \$?\d+(,|.\d+)*
        | \$?\d+%?
        | \w+([-'/]\w+)* # words w/ optional internal hyphens/apostrophe
        |/\m+([-'/]\w+)*
        '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    # drop English stop words such as "at", "on" and "very"
    stopWords = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stopWords]
    return filtered_words

sentence = "At eight o'clock@yahoo.com /m/098yh6 baby??? sun'problems 67% on Thu_rsday 76,564 morning: Arthur; didn't 34.56 feel very good. French-Fries"
print(preprocess(sentence))
Output:
['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67', 'thu_rsday', '76,564', 'morning', 'arthur', "didn't", '34.56', 'feel', 'good', 'french-fries']
My problem is that it strips the % from 67. When I swap the order of these two lines:
| \$?\d+%?
| \$?\d+(,|.\d+)*
the output changes to the following:
['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67%', 'thu_rsday', '76', '564', 'morning', 'arthur', "didn't", '34', '56', 'feel', 'good', 'french-fries']
How can I fix this?
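
To narrow the problem down, here is a minimal sketch that reproduces the same behaviour with only the standard `re` module, without NLTK or stop words. It relies on the documented rule that regex alternation is tried left to right and stops at the first alternative that matches, even if a later one would match more text. The names `decimals_first`, `percent_first` and `combined` and the shortened `text` are made up for this illustration. The number patterns are simplified: the dot is escaped inside `[.,]` and the groups are non-capturing. The `combined` pattern is one possible way to merge the two alternatives, not necessarily the right fix for the full tokenizer.

```python
import re

# Shortened sample containing only the number-like tokens from the sentence.
text = "67% on 76,564 morning 34.56"

# Decimal/comma alternative listed first: "67" matches it, so the
# percent alternative is never tried and "%" is left behind.
decimals_first = re.compile(r"\$?\d+(?:[.,]\d+)*|\$?\d+%?")

# Percent alternative listed first: it matches "76" and "34" on its own,
# so the comma and decimal parts are split off.
percent_first = re.compile(r"\$?\d+%?|\$?\d+(?:[.,]\d+)*")

# One alternative covering both cases: digits, optional ",ddd" or ".dd"
# groups, then an optional trailing "%".
combined = re.compile(r"\$?\d+(?:[.,]\d+)*%?")

print(decimals_first.findall(text))  # ['67', '76,564', '34.56']
print(percent_first.findall(text))   # ['67%', '76', '564', '34', '56']
print(combined.findall(text))        # ['67%', '76,564', '34.56']
```

The first two results match the two outputs above for the numeric tokens, which suggests that the order of the alternatives, rather than the tokenizer itself, is what changes the result.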