I want to parse a string to get a list of all its words, keeping hyphenated words intact. My current code is:
import re
s = '-this is. A - sentence;one-word'
re.compile(r"\W+", re.UNICODE).split(s)
This returns:
['', 'this', 'is', 'A', 'sentence', 'one', 'word']
I would like it to return:
['', 'this', 'is', 'A', 'sentence', 'one-word']
Answer 0 (score: 4)
If you don't need the leading empty string, you can match the words directly with the pattern \w(?:[-\w]*\w)? instead of splitting:
>>> import re
>>> s = '-this is. A - sentence;one-word'
>>> rx = re.compile(r'\w(?:[-\w]*\w)?')
>>> rx.findall(s)
['this', 'is', 'A', 'sentence', 'one-word']
Note that it will not match words containing apostrophes, such as won't, as a single word; won't comes out as won and t.
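If apostrophes matter, one possible variation is to allow ' inside a word while still requiring a word character at both ends. This is an assumption of mine and not part of the answer above; the name rx_apos is hypothetical. A minimal sketch:

```python
import re

# Hypothetical variant of the pattern above: apostrophes are also allowed
# inside a word, but a word must still start and end with a word character.
rx_apos = re.compile(r"\w(?:[-'\w]*\w)?")

s = "-this is. A - sentence;one-word won't"
words = rx_apos.findall(s)
print(words)  # ['this', 'is', 'A', 'sentence', 'one-word', "won't"]
```

Because the first and last characters must match \w, a stray leading or trailing hyphen or quote is never part of a word.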
Answer 1 (score: 2)
Here is my traditional "why use a regular expression language when you can use Python" alternative:
import string
s = "-this is. A - sentence;one-word what's"
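# Add a space after ';' so split() separates the words, then strip
# surrounding punctuation and drop tokens that were pure punctuation.
words = [word.strip(string.punctuation)
         for word in s.replace(';', '; ').split()]
words = [word for word in words if word]
print(words)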
""" Output:
['this', 'is', 'A', 'sentence', 'one-word', "what's"]
"""
Answer 2 (score: 1)
You could use "[^\w-]+" as the split pattern instead.
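To see what that pattern does with the string from the question, here is a small sketch. The clean-up step that strips stray hyphens is my own addition, not part of the answer, and it also drops the leading empty string the question asked to keep.

```python
import re

s = '-this is. A - sentence;one-word'

# Split on runs of characters that are neither word characters nor hyphens.
parts = re.split(r"[^\w-]+", s)
print(parts)  # ['-this', 'is', 'A', '-', 'sentence', 'one-word']

# Hyphens at the edges of a token survive the split, so strip them
# and drop tokens that end up empty.
words = [p.strip('-') for p in parts if p.strip('-')]
print(words)  # ['this', 'is', 'A', 'sentence', 'one-word']
```

The hyphenated word stays intact, but free-standing or leading hyphens remain in the raw split result, which is why the extra stripping step may be needed.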
Answer 3 (score: 1)
import re
s = "-this is. A - sentence;one-word what's"
re.findall(r"\w+-\w+|[\w']+", s)
Result: ['this', 'is', 'A', 'sentence', 'one-word', "what's"]
Note that the order of the alternatives matters: the pattern for hyphenated words has to come first!
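The effect of the ordering can be checked by swapping the two alternatives. The variable names below are only illustrative:

```python
import re

s = "-this is. A - sentence;one-word what's"

# Hyphenated alternative first: 'one-word' is matched as a whole.
hyphen_first = re.findall(r"\w+-\w+|[\w']+", s)
print(hyphen_first)  # ['this', 'is', 'A', 'sentence', 'one-word', "what's"]

# Alternatives swapped: [\w']+ already succeeds at 'one', so the
# hyphenated branch is never tried there and the word is split.
hyphen_last = re.findall(r"[\w']+|\w+-\w+", s)
print(hyphen_last)  # ['this', 'is', 'A', 'sentence', 'one', 'word', "what's"]
```

Python's re module tries alternatives from left to right and takes the first one that matches, not the longest, so the more specific pattern has to be listed first.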
Answer 4 (score: 0)
You can try the NLTK library:
>>> import nltk
>>> s = '-this is a - sentence;one-word'
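>>> # Non-capturing groups: with capturing groups, recent NLTK versions
>>> # return the group contents instead of the whole match.
>>> hyphen = r'(?:\w+-\s?\w+)'
>>> wordr = r'(?:\w+)'
>>> r = "|".join([hyphen, wordr])
>>> tokens = nltk.tokenize.regexp_tokenize(s, r)
>>> print(tokens)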
['this', 'is', 'a', 'sentence', 'one-word']
I found it here: http://www.cs.oberlin.edu/~jdonalds/333/lecture03.html Hope it helps.