Question

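I'm trying to parse a text file of name:value elements into a list of "name:value" strings... Here's the twist: the values will sometimes be multiple words or even multiple lines, and the delimiters are not a fixed set of words. Here's an example of what I'm working with...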

listing="price:44.55 name:John Doe title:Super Widget description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"

What I want returned is...

["price:44.55", "name:John Doe", "title:Super Widget", "description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

Here's what I've tried so far...

details = re.findall(r'[\w]+:.*', post, re.DOTALL)
["price:44.55 name:John Doe title:Super Widget description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

Not what I want. Or...

details = re.findall(r'[\w]+:.*?', post, re.DOTALL)
["price:", "name:", "title:", "description:"]

Not what I want. Or...

details = re.split(r'([\w]+:)', post)
["", "price:", "44.55", "name:", "John Doe", "title:", "Super Widget", "description:", "This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!"]

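Closer, but still no dice. Also, I can deal with an empty list item. So, basically, my question is: how do I keep the delimiter with its value when using re.split(), or how do I keep re.findall() from being either too greedy or too stingy?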

Thanks in advance for reading!

Answer 1

Use a lookahead assertion:

>>> re.split(r'\s(?=\w+:)', post)
['price:44.55',
 'name:John Doe',
 'title:Super Widget',
 'description:This widget slices, dices, and drives your kids to soccer practice\r\nIt even comes with Super Widget Mini!']

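Of course, it will still fail if a value contains a word immediately followed by a colon, because that word will be mistaken for a new key.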
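For anyone who wants to run this end to end, here is a minimal sketch. It assumes the sample text is bound to `post`, as in the question's attempts. The `re.findall()` variant is not part of the answer above; it is an assumed equivalent that pairs a lazy match with the same lookahead, and the `tricky` string is a made-up input that reproduces the failure case.

```python
import re

# Sample text from the question, bound to `post` as in the question's attempts.
post = ("price:44.55 name:John Doe title:Super Widget "
        "description:This widget slices, dices, and drives your kids "
        "to soccer practice\r\nIt even comes with Super Widget Mini!")

# Split on a whitespace character only when a "word:" key follows it.
# The lookahead consumes nothing, so the key stays attached to its value.
split_result = re.split(r'\s(?=\w+:)', post)

# Assumed findall() equivalent: a lazy value that stops right before the
# next " key:" or at the end of the string. DOTALL lets "." cross \r\n.
findall_result = re.findall(r'\w+:.*?(?=\s\w+:|\Z)', post, re.DOTALL)

# Failure case (made-up input): "note:" inside a value looks like a key.
tricky = "title:Super Widget description:Please note: batteries not included"
tricky_result = re.split(r'\s(?=\w+:)', tricky)

print(split_result)
print(findall_result)
print(tricky_result)
```

On the sample text both approaches give the same four items. On the `tricky` input the description is cut at "note:", which is the limitation mentioned above.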

Answer 2

@Pavel's answer is nicer, but you could also just merge together the results of your last attempt:

# kill the first empty bit
if not details[0]:
    details.pop(0)

return [a + b for a, b in zip(details[::2], details[1::2])]
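The snippet above is a fragment, since `return` only works inside a function. A self-contained sketch of the same idea could look like the following. The function name `merge_pairs` is hypothetical, and the `.strip()` call is an addition not in the original snippet: `re.split()` leaves the whitespace that preceded the next key at the end of each value. The sketch also assumes the text starts with a key; otherwise the key/value pairing would be misaligned.

```python
import re

def merge_pairs(text):
    """Split on 'word:' keys, then glue each key back onto its value.

    Hypothetical helper; assumes `text` begins with a key.
    """
    # The capturing group makes re.split() keep the keys in the result.
    details = re.split(r'(\w+:)', text)
    # Kill the first empty bit (the text before the first key).
    if not details[0]:
        details.pop(0)
    # Even slots now hold keys and odd slots hold values. strip() is an
    # addition: it drops the whitespace left over before the next key.
    return [key + value.strip()
            for key, value in zip(details[::2], details[1::2])]

post = "price:44.55 name:John Doe title:Super Widget"
merged = merge_pairs(post)
print(merged)
```

Note that `.strip()` also removes any leading or trailing whitespace that belongs to a value, so drop it if that whitespace matters.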

Python regex: split a string while keeping the delimiter with the value
