How do I specify a pattern in the NLTK regexp tokenizer to handle numbers?

Asked: 2016-12-02 18:04:43

Tags: python nltk

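I wrote the following code to handle decimal numbers and remove punctuation, but it does not return the output I expect. Can you tell me what is wrong with the regular expression?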

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

def preprocess(sentence):
    sentence = sentence.lower()
    pattern = r'''(?x)               # set flag to allow verbose regexps
          ([A-Z]\.)+         # abbreviations, e.g. U.S.A.
          | \$?\d+(,|.\d+)*
          | \$?\d+%?
          | \w+([-'/]\w+)*    # words w/ optional internal hyphens/apostrophe
          |/\m+([-'/]\w+)*
        '''
    tokenizer = RegexpTokenizer(pattern)
    tokens = tokenizer.tokenize(sentence)
    # drop English stop words from the token list
    stopWords = set(stopwords.words('english'))
    filtered_words = [w for w in tokens if w not in stopWords]
    return filtered_words


sentence = "At eight o'clock@yahoo.com /m/098yh6 baby??? sun'problems 67% on Thu_rsday 76,564 morning: Arthur; didn't 34.56 feel very good. French-Fries"
print(preprocess(sentence))

Output:

  

['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67', 'thu_rsday', '76,564', 'morning', 'arthur', "didn't", '34.56', 'feel', 'good', 'french-fries']

My problem is that it strips the % from 67. When I swap the order of these two lines

| \$?\d+%?                
| \$?\d+(,|.\d+)*

the output changes to the following:

  

['eight', "o'clock", 'yahoo', 'com', '/m/098yh6', 'baby', "sun'problems", '67%', 'thu_rsday', '76', '564', 'morning', 'arthur', "didn't", '34', '56', 'feel', 'good', 'french-fries']
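
The difference between the two outputs comes from how regex alternation works: at each position the engine takes the first alternative that matches, so whichever numeric line comes first decides the token. The sketch below reproduces this with the standard `re` module only, on a shortened sample string. The names `number_tokens`, `ORIGINAL_ORDER`, `SWAPPED_ORDER`, and `MERGED` are hypothetical, the groups are written as non-capturing so that `re.findall` returns whole matches, and `MERGED` is only one possible way to combine the two alternatives, not a confirmed fix.

```python
import re

# The two numeric alternatives from the question, in the original order.
# The "." is unescaped here, exactly as in the question, so it matches any character.
ORIGINAL_ORDER = r"\$?\d+(?:,|.\d+)*|\$?\d+%?"

# The same two alternatives with their order swapped.
SWAPPED_ORDER = r"\$?\d+%?|\$?\d+(?:,|.\d+)*"

# Hypothetical merged alternative (an assumption, not from the question):
# digits, then optional ",ddd" or ".ddd" groups, then an optional "%".
MERGED = r"\$?\d+(?:[.,]\d+)*%?"


def number_tokens(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)


sample = "67% on 76,564 morning 34.56 feel"

print(number_tokens(ORIGINAL_ORDER, sample))  # ['67', '76,564', '34.56']
print(number_tokens(SWAPPED_ORDER, sample))   # ['67%', '76', '564', '34', '56']
print(number_tokens(MERGED, sample))          # ['67%', '76,564', '34.56']
```

If this behaves the same inside `RegexpTokenizer`, the merged alternative could stand in for the two numeric lines of the verbose pattern. Depending on the NLTK version, the capturing groups `( ... )` in the rest of the pattern may also need to be written as `(?: ... )`.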

How can I fix this?

0 answers:

No answers