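Using Regex to extract elements from text and append them to a dictionary

Asked: 2017-04-13 03:11:21

Tags: python regex

I am trying to build a dictionary from text retrieved from a website, using some kind of loop and a regular expression. I would like the dictionary to look like this: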

{36:30281, 36 2/3:30282, 37:30283, 37 1/3: 30283, 38:30284 etc..}

Here is the text I retrieved from the website:

[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]

I am not very comfortable with regular expressions. Can anyone give me a solution that helps me do this?

Thanks

2 answers:

Answer 0 (score: 0)

Here is a regex that works:

import re

re.findall(r'\t(\d{2}\s+\d/\d)\r\n', '[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]')

Output:

['36 2/3', '37 1/3', '38 2/3', '39 1/3', '40 2/3', '41 1/3']

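It is largely based on your pattern, with the following changes. I removed everything except the first part of the regex. I changed 13 to '\d', which matches any digit rather than only 1 and 3. I then added a match for the trailing \r\n at the end; it is not required, so you can leave it out if you prefer, but it gives a little extra safety.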
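Note that, as the output above shows, this pattern only returns the fractional sizes. A minimal sketch of one way to pick up the whole sizes too is to make the fraction part optional. The `sample` string below is a shortened, hypothetical stand-in for the scraped text (assuming real tab and CRLF characters, as in the question), not the full data:

```python
import re

# Shortened stand-in for the scraped text (assumption: real tabs and CRLFs)
sample = (
    'option value="30281">\r\n\t\t36\r\n\t\t/option>, '
    'option value="30282">\r\n\t\t36 2/3\r\n\t\t/option'
)

# Same idea as the pattern above, but the " d/d" fraction is optional
sizes = re.findall(r'\t(\d{2}(?:\s+\d/\d)?)\r\n', sample)
print(sizes)
# ['36', '36 2/3']
```

This still yields only the size labels; pairing them with the option values needs a second capture group.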

Answer 1 (score: 0)

You could use:

value=\"(\d+)\"\D*(\d+(?:\ [\d/]+)?)

In Python, this would be (using a dict comprehension):

import re 

junk_string = """
[option value="-1">Choose size</option>, option value="30281">\r\n\t\t\t\t\t\t\t\t\t36\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option>, option value="30282">\r\n\t\t\t\t\t\t\t\t\t36 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30283"\r\n\t\t\t\t\t\t\t\t\t37 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t/option, option value="30284">\r\n\t\t\t\t\t\t\t\t\t38\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30285">\r\n\t\t\t\t\t\t\t\t\t38 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30286">\r\n\t\t\t\t\t\t\t\t\t39 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30287">\r\n\t\t\t\t\t\t\t\t\t40\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30288">\r\n\t\t\t\t\t\t\t\t\t40 2/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>, option value="30289">\r\n\t\t\t\t\t\t\t\t\t41 1/3\r\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t</option>]
"""

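# group(1): the numeric option value; group(2): the size, optionally with a fraction such as " 2/3"
rx = re.compile(r'value=\"(\d+)\"\D*(\d+(?:\ [\d/]+)?)')
# Map each size label to its option value
result = {m.group(2): m.group(1) 
            for m in rx.finditer(junk_string)}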

print(result)
# {'36': '30281', '36 2/3': '30282', '37 1/3': '30283', '38': '30284', '38 2/3': '30285', '39 1/3': '30286', '40': '30287', '40 2/3': '30288', '41 1/3': '30289'}
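
The values in the result above are strings, whereas the dictionary in the question has numeric values. If integers are wanted, one small variation is to convert the captured value with int(). The names `sample` and `result_int` are made up for this sketch, and `sample` is a shortened stand-in for junk_string:

```python
import re

# Shortened stand-in for junk_string (assumption: real tabs and CRLFs)
sample = (
    'option value="-1">Choose size</option>, '
    'option value="30281">\r\n\t\t36\r\n\t\t</option>, '
    'option value="30282">\r\n\t\t36 2/3\r\n\t\t</option>'
)

rx = re.compile(r'value="(\d+)"\D*(\d+(?: [\d/]+)?)')

# value="-1" is skipped because "-" is not matched by \d+
result_int = {m.group(2): int(m.group(1)) for m in rx.finditer(sample)}
print(result_int)
# {'36': 30281, '36 2/3': 30282}
```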

But as already said in the comments, this is not really text but part of the DOM, so at least consider using a parser.
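
As a rough illustration of the parser route, here is a sketch using the standard library's html.parser. It assumes the original, well-formed markup is available; the `<select>` snippet, the class name `OptionCollector` and the variable names are all made up for this example. A third-party parser could be used in a similar way.

```python
from html.parser import HTMLParser


class OptionCollector(HTMLParser):
    """Collects {option text: value attribute} for every <option> element."""

    def __init__(self):
        super().__init__()
        self.sizes = {}
        self._value = None  # value attribute of the <option> currently open
        self._text = []     # text chunks seen inside that <option>

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value")
            self._text = []

    def handle_data(self, data):
        if self._value is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "option" and self._value is not None:
            # Collapse the surrounding tabs and newlines into single spaces
            label = " ".join("".join(self._text).split())
            if self._value != "-1":  # skip the "Choose size" placeholder
                self.sizes[label] = self._value
            self._value = None


# Hypothetical, shortened markup standing in for the real page
html = (
    '<select>'
    '<option value="-1">Choose size</option>'
    '<option value="30281">\r\n\t\t36\r\n\t\t</option>'
    '<option value="30282">\r\n\t\t36 2/3\r\n\t\t</option>'
    '</select>'
)

collector = OptionCollector()
collector.feed(html)
collector.close()
print(collector.sizes)
# {'36': '30281', '36 2/3': '30282'}
```

The parser reads the value attribute and the element text directly, so it does not depend on the exact whitespace between them the way the regex does.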