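Python regex: string to a list of words (including hyphenated words)

Time: 2010-08-04 14:56:11

Tags: python regex

I want to parse a string to get a list of all the words in it, keeping hyphenated words intact. My current code is: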

import re
s = '-this is. A - sentence;one-word'
re.compile(r"\W+", re.UNICODE).split(s)

This returns:

['', 'this', 'is', 'A', 'sentence', 'one', 'word']

I would like it to return:

['', 'this', 'is', 'A', 'sentence', 'one-word']

5 answers:

Answer 0 (score: 4)

If you don't need the leading empty string, you can match words with the pattern \w(?:[-\w]*\w)? instead of splitting:

>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']

Note that it does not keep words with apostrophes intact: won't, for example, comes back as won and t.
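If apostrophes matter, one possible variation (not part of the original answer) is to allow an apostrophe in the same positions as a hyphen. This is a minimal sketch; the name WORD_RX and the extra word won't in the sample string are only there for illustration:

```python
import re

# Assumption: an apostrophe is treated like a hyphen, i.e. it may appear
# inside a word but never as its first or last character.
WORD_RX = re.compile(r"\w(?:[-'\w]*\w)?")

s = "-this is. A - sentence;one-word won't"
words = WORD_RX.findall(s)
print(words)  # ['this', 'is', 'A', 'sentence', 'one-word', "won't"]
```

Because the pattern must start and end on a word character, stray punctuation such as the leading hyphen or the lone " - " is still skipped.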

Answer 1 (score: 2)

Here is my traditional "why use a regular expression language when you can use Python" alternative:

import string
s = "-this is. A - sentence;one-word what's"
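# Put a space after each ';' so it separates words, split on whitespace,
# strip punctuation from both ends of each piece, then drop empty pieces.
s = list(filter(None, [word.strip(string.punctuation)
                       for word in s.replace(';', '; ').split()
                       ]))
print(s)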
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""

Answer 2 (score: 1)

You can split on "[^\w-]+" instead, so that hyphens are no longer treated as separators.
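To make the effect of this pattern concrete, here is a small sketch on the question's string. The plain split keeps hyphens that are not inside a word, so a clean-up step is needed. The second pattern, built with lookarounds, is an assumption of mine rather than part of the answer: it treats a hyphen as a separator unless it sits between two word characters.

```python
import re

s = '-this is. A - sentence;one-word'

# Plain split: hyphens are never separators, so stray ones survive.
parts = re.split(r'[^\w-]+', s)
print(parts)  # ['-this', 'is', 'A', '-', 'sentence', 'one-word']

# Clean-up: strip hyphens from the edges and drop empty pieces.
words = [w for w in (p.strip('-') for p in parts) if w]
print(words)  # ['this', 'is', 'A', 'sentence', 'one-word']

# Hypothetical variant: a hyphen separates unless it is between word chars.
SEP_RX = re.compile(r'(?:(?<!\w)-|-(?!\w)|[^\w-])+')
exact = SEP_RX.split(s)
print(exact)  # ['', 'this', 'is', 'A', 'sentence', 'one-word']
```

The last result keeps the leading empty string, which matches the output requested in the question.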

Answer 3 (score: 1)

import re
s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)

Result: ['this', 'is', 'A', 'sentence', 'one-word', "what's"]

Note that the order of the alternatives matters: the hyphenated-word pattern has to come first!
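The ordering point can be checked directly. Regex alternation tries the branches left to right and takes the first one that matches, so this sketch compares the two orders on a shortened sample string (the variable names are only illustrative):

```python
import re

s = "one-word what's"

# Hyphenated branch first: 'one-word' is matched as a single token.
hyphen_first = re.findall(r"\w+-\w+|[\w']+", s)
print(hyphen_first)  # ['one-word', "what's"]

# Plain-word branch first: it wins at 'one', so the hyphen splits the word.
hyphen_last = re.findall(r"[\w']+|\w+-\w+", s)
print(hyphen_last)  # ['one', 'word', "what's"]

# The hyphenated branch allows exactly one hyphen per token.
multi = re.findall(r"\w+-\w+|[\w']+", "state-of-the-art")
print(multi)  # ['state-of', 'the-art']
```

As the last line shows, words with more than one hyphen are still broken up by this pattern.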

Answer 4 (score: 0)

You can try the NLTK library:

>>> import nltk
>>> s = '-this is a - sentence;one-word'
>>> # non-capturing groups, so each token comes back as a plain string
>>> hyphen = r'(?:\w+\-\s?\w+)'
>>> wordr = r'(?:\w+)'
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)
['this', 'is', 'a', 'sentence', 'one-word']

I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html
Hope it helps.