Question

I have a text file containing lines in the following format:

c="etc etc etc" 124:1 124:1||r="TrNAP etc"||c="etc etc" 124:10 124:10

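The text inside the quotes changes from line to line, and so do the numbers; otherwise the format is fixed. The numbers give the line number and word number (line#:word#) of the quoted text in some other document.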

Could someone provide some sample regex code to extract the line#:word# numbers? Thanks!

Answer 1

>>> import re
>>> c = '"etc etc etc" 124:1 124:1||r="TrNAP etc"||c="etc etc" 124:10 124:10'
>>> print(re.findall(r"(\d+):(\d+)", c))
[('124', '1'), ('124', '1'), ('124', '10'), ('124', '10')]
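
The pairs above come back as strings. If integers are more convenient, a minimal sketch along these lines may help; the names PAIR and extract_pairs are made up for illustration and are not part of the original answer.

```python
import re

# Hypothetical helper names; the pattern is the one from the answer above.
PAIR = re.compile(r"(\d+):(\d+)")

def extract_pairs(line):
    """Return every line#:word# reference in `line` as (line, word) integers."""
    return [(int(ln), int(wd)) for ln, wd in PAIR.findall(line)]

sample = 'c="etc etc etc" 124:1 124:1||r="TrNAP etc"||c="etc etc" 124:10 124:10'
print(extract_pairs(sample))
# [(124, 1), (124, 1), (124, 10), (124, 10)]
```

To process a whole file, the same function can be called on each line while iterating over the open file object. Note that this pattern would also match any digits:digits sequence that happens to appear inside the quoted text.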

Answer 2

You can use the following:

(\d+:\d+)

See the DEMO
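
With a single capturing group, Python's re.findall returns each match as one string rather than a tuple. A small sketch of how this pattern might be used, with variable names chosen only for illustration:

```python
import re

line = 'c="etc etc etc" 124:1 124:1||r="TrNAP etc"||c="etc etc" 124:10 124:10'

# One capturing group, so findall yields plain strings such as '124:1'.
refs = re.findall(r"(\d+:\d+)", line)
print(refs)
# ['124:1', '124:1', '124:10', '124:10']

# Split each reference on the colon when the two numbers are needed separately.
pairs = [tuple(int(part) for part in ref.split(":")) for ref in refs]
print(pairs)
# [(124, 1), (124, 1), (124, 10), (124, 10)]
```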

Answer 3

To match the complete line and capture all of its variable parts, use:

c="([^"]+)" (\d+):(\d+) (\d+):(\d+)\|\|r="([^"]+)"\|\|c="([^"]+)" (\d+):(\d+) (\d+):(\d+)

https://regex101.com/r/qY9kG2/1
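
As a rough sketch of how the full-line pattern could be applied in Python (the answer gives only the regex; FULL_LINE and parse_line are hypothetical names), re.fullmatch can be used so that a line is accepted only if it matches the whole format:

```python
import re

# The pattern from the answer, compiled once. It has 11 capturing groups.
FULL_LINE = re.compile(
    r'c="([^"]+)" (\d+):(\d+) (\d+):(\d+)'
    r'\|\|r="([^"]+)"'
    r'\|\|c="([^"]+)" (\d+):(\d+) (\d+):(\d+)'
)

def parse_line(line):
    """Return the 11 captured fields as a tuple, or None if the line does not fit."""
    m = FULL_LINE.fullmatch(line.strip())
    return m.groups() if m else None

sample = 'c="etc etc etc" 124:1 124:1||r="TrNAP etc"||c="etc etc" 124:10 124:10'
print(parse_line(sample))
# ('etc etc etc', '124', '1', '124', '1', 'TrNAP etc', 'etc etc', '124', '10', '124', '10')
```

Unlike the shorter patterns, this one fails on any line that deviates from the expected layout, which can be useful for spotting malformed input.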

Python regex to extract data from a string
