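Removing common words from a string?

Asked: 2014-01-20 06:36:22

Tags: python string python-2.7

I'm trying to filter out common words so that I end up with just the city name.

This is what I have: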

import re
ask = "What's the weather like in Lexington, SC?"
REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]
remove = '|'.join(REMOVE_LIST)
regex = re.compile(r'\b('+remove+r')\b', flags=re.IGNORECASE)
out = regex.sub("", ask)

And the output is:

nothing to repeat

3 Answers:

Answer 0: (score: 1)

[x for x in ask.split() if x.lower() not in REMOVE_LIST]
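
To make the one-liner above easier to try, here is a minimal, self-contained sketch. The helper name `filter_words` is hypothetical, and converting the list to a set is only a small lookup optimization, not part of the original answer. Note that `split()` leaves punctuation attached to its word, so the "?" entry in REMOVE_LIST never matches here and the trailing "?" has to be stripped separately.

```python
ask = "What's the weather like in Lexington, SC?"
REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]


def filter_words(text, remove_list):
    """Drop whitespace-separated tokens whose lowercase form is in remove_list."""
    remove_set = set(remove_list)  # set membership tests are faster than list scans
    return [token for token in text.split() if token.lower() not in remove_set]


kept = filter_words(ask, REMOVE_LIST)
# kept == ['Lexington,', 'SC?']  (punctuation stays attached to the tokens)

# Assumption: only a trailing "?" needs to be removed from the joined result.
city = " ".join(kept).rstrip("?")
# city == 'Lexington, SC'
```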

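Answer 1: (score: 1)

You should escape the strings so that they are matched literally, because some characters have a special meaning in regular expressions (for example, the ? in REMOVE_LIST):

Use re.escape to escape such characters: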

>>> import re
>>> re.escape('?')
'\\?'

>>> re.search('?', 'Lexington?')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\re.py", line 142, in search
    return _compile(pattern, flags).search(string)
  File "C:\Python27\lib\re.py", line 242, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat
>>> re.search(r'\?', 'Lexington?')
<_sre.SRE_Match object at 0x0000000002C68100>
>>>

>>> import re
>>> ask = "What's the weather like in Lexington, SC?"
>>> REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]
>>> remove = '|'.join(map(re.escape, REMOVE_LIST))
>>> regex = re.compile(r'\b(' + remove + r')\b', flags=re.IGNORECASE)
>>> out = regex.sub("", ask)
>>> print out
     Lexington, SC?
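
The output above still contains the leading spaces and the trailing "?". The "?" survives because the `\b` after it requires a word character on one side, and at the end of the string there is none. One possible refinement, offered as a sketch rather than as part of the original answer, is to add `\b` only around entries that actually start or end with a word character, and then collapse the leftover whitespace. The helper names `build_pattern` and `remove_words` are hypothetical.

```python
import re

REMOVE_LIST = ["like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"]


def build_pattern(remove_list):
    """Compile an alternation of escaped entries, anchoring only word-like ends."""
    parts = []
    for item in remove_list:
        escaped = re.escape(item)  # match characters such as "?" literally
        # Use \b only where the entry begins/ends with a word character;
        # a bare "?" gets no boundary and can match right after "SC".
        prefix = r"\b" if re.match(r"\w", item) else ""
        suffix = r"\b" if re.search(r"\w$", item) else ""
        parts.append(prefix + escaped + suffix)
    return re.compile("|".join(parts), flags=re.IGNORECASE)


def remove_words(text, remove_list):
    """Remove the listed entries, then collapse runs of whitespace."""
    cleaned = build_pattern(remove_list).sub("", text)
    return " ".join(cleaned.split())


result = remove_words("What's the weather like in Lexington, SC?", REMOVE_LIST)
# result == 'Lexington, SC'
```

Because the "?" entry is matched without boundaries in this sketch, it is removed wherever it appears in the text, not only at the end.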

Answer 2: (score: 0)

Use a regular expression to find the words:

import re

sentence = "What's the weather like in Lexington, SC?"
words = re.findall(r"[\w']+", sentence.lower())
remove = {"like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"}

print set(words) - remove

Sets are unordered, so if order matters, you can use a list comprehension to filter the list of words instead:

[word for word in words if word not in remove]
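
Since the code above lowercases the whole sentence, the city name comes back as "lexington". A small variation, sketched here under the assumption that the original capitalization should be kept, lowercases each word only for the membership test. The helper name `keep_words` is hypothetical.

```python
import re

sentence = "What's the weather like in Lexington, SC?"
remove = {"like", "in", "how's", "hows", "weather", "the", "whats", "what's", "?"}


def keep_words(text, stop_words):
    """Return the words of text, in order, that are not in stop_words."""
    # Same pattern as above: runs of word characters and apostrophes.
    tokens = re.findall(r"[\w']+", text)
    # Compare case-insensitively, but keep each token as it was written.
    return [token for token in tokens if token.lower() not in stop_words]


city_words = keep_words(sentence, remove)
# city_words == ['Lexington', 'SC']
city = " ".join(city_words)
# city == 'Lexington SC'
```

The pattern never captures punctuation, so the comma between the city and the state is dropped and the "?" entry in the set is never actually needed.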