Question

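I wrote the following code to handle decimal numbers and strip punctuation, but it does not return the output I expect. Can you tell me what is wrong with the regular expression?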

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)               # set flag to allow verbose regexps
          ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
          | \$?\d+(,|.\d+)*
          | \$?\d+%?
          | \w+([-'/]\w+)*    # words w/ optional internal hyphens/apostrophe
          |/\m+([-'/]\w+)*
        '''

    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    # drop English stop words from the token list
    stopWords = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stopWords]
    print(filtered_words)


sentence = "At eight o'clock@yahoo.com /m/098yh6 baby??? sun'problems 67% on Thu_rsday 76,564 morning: Arthur; didn't 34.56 feel very good. French-Fries"
print(preprocess(sentence))

Output:

['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67', 'thu_rsday', '76,564', 'morning', 'arthur', "didn't", '34.56', 'feel', 'good', 'french-fries']

My problem is that it strips the % from 67%. When I swap the order of these two lines

| \$?\d+%?                
| \$?\d+(,|.\d+)*

the output changes to the following:

['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67%', 'thu_rsday', '76', '564', 'morning', 'arthur', "didn't", '34', '56', 'feel', 'good', 'french-fries']

How can I fix this?

How do I specify a pattern in the NLTK regexp tokenizer that handles numbers?
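
For context, the order sensitivity appears to come from how regex alternation works: branches are tried left to right, and the first one that matches at the current position wins. With `\$?\d+(,|.\d+)*` first, `67` is consumed before the `%` branch is ever tried; with `\$?\d+%?` first, `76` is consumed and the match stops at the comma. The unescaped `.` in `(,|.\d+)*` also matches any character, not only a decimal point. Below is a minimal sketch of one possible fix, written against the standard `re` module rather than `RegexpTokenizer` so that it runs without NLTK. The single merged number branch and the names `NUMBER_AWARE_PATTERN` and `tokenize` are illustrative assumptions, not part of the original code, and stop-word filtering is left out.

```python
import re

# Hypothetical pattern (an assumption, not the original): a single number
# branch covers the optional $, thousands separators, decimals and a
# trailing % together, so branch order no longer matters for numbers.
NUMBER_AWARE_PATTERN = re.compile(r'''(?x)   # verbose mode
      \$?\d+(?:,\d+)*(?:\.\d+)?%?   # numbers such as 76,564 or 34.56 or 67%
    | /m+(?:[-'/]\w+)*              # identifiers such as /m/098yh6
    | \w+(?:[-'/]\w+)*              # words with optional internal - ' /
''')


def tokenize(sentence):
    """Lowercase the sentence and return every non-overlapping match."""
    return NUMBER_AWARE_PATTERN.findall(sentence.lower())


sentence = ("At eight o'clock@yahoo.com /m/098yh6 baby??? sun'problems 67% "
            "on Thu_rsday 76,564 morning: Arthur; didn't 34.56 feel very good. "
            "French-Fries")
print(tokenize(sentence))
```

On the sample sentence, the numeric tokens come out as `'67%'`, `'76,564'` and `'34.56'`. The groups are written as non-capturing `(?:...)` because `findall` returns group contents instead of whole matches when a pattern contains capturing groups. As far as I can tell `RegexpTokenizer` relies on `findall` as well, so the same pattern string should be usable there, but that part is untested here.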
