Question

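I'm currently trying to scrape some data from a web page. The data I need is inside a <meta> tag in the HTML source. Scraping the data and saving it to a string with BeautifulSoup is no problem.

The string contains 2 numbers that I want to extract. Each of these numbers (review scores from 1 to 100) should be assigned to its own variable for further processing.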

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

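The first value is 79/100 and the second is 86/100, but I only need 79 and 86. So far I have written a regex search to find these values, followed by .replace("/100", "") to clean them up.

With my code, however, I only get the value from the first regex match, i.e. 79. I tried m.group(1) to get the second value, but that doesn't work.

What am I missing?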

import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

m = re.search("../100", test_str)
if m:
    found = m.group(0).replace("/100", "")
    print(found)

    # output -> 79

Thanks for your help.

Best regards!

Answer 1

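import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"
# \d+ matches the digits; the lookahead (?=/100) requires "/100" after them without consuming it
m = re.findall(r'\d+(?=/100)', test_str)
# m = ['79', '86']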

I replaced .. with \d+, so the pattern matches a number of one or more digits instead of exactly two arbitrary characters.
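To illustrate the difference, here is a small sketch that runs both patterns on a made-up string containing a one-digit and a three-digit score. The sample string and the variable names are assumptions for illustration, not part of the question.

```python
import re

# Hypothetical sample with a one-digit and a three-digit score.
sample = "Overall Rating: 5/100 ... Score: 100/100"

# '..' matches any two characters, so it picks up a space or drops a digit.
loose = re.findall(r'../100', sample)

# '\d+' matches the whole run of digits that is followed by '/100'.
strict = re.findall(r'\d+(?=/100)', sample)

print(loose)   # [' 5/100', '00/100']
print(strict)  # ['5', '100']
```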

I also use a positive lookahead, (?=...), so the .replace call becomes unnecessary.
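As a minimal sketch of how the two scores could end up in separate variables: re.search only ever returns the first match, and a pattern without parentheses has no group 1, which is why m.group(1) fails in the question. re.findall returns all matches instead. The variable names rating and score are assumptions, and the unpacking assumes the string contains exactly two scores.

```python
import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# re.search stops at the first match; with no capture group in the
# pattern, m.group(1) would raise IndexError.
m = re.search(r'\d+(?=/100)', test_str)
first_only = m.group(0)

# re.findall returns every non-overlapping match as a list of strings.
scores = [int(s) for s in re.findall(r'\d+(?=/100)', test_str)]

# Unpacking assumes exactly two scores were found.
rating, score = scores
print(rating, score)  # 79 86
```

If the number of scores can vary, it is probably safer to keep the list and check its length before unpacking.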

Example on Regex101

Answer 2

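I don't know why more people don't suggest named groups, which let you refer back to each captured value by name.

You can do something like the following; the syntax may not be perfect.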

import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# The named groups capture only the digits; "/100" stays outside the groups
pattern = r'^<meta content="Overall Rating: (?P<rating>\d+)/100 .*? Score: (?P<score>\d+)/100'

match = re.match(pattern, test_str)
if match:
    rating = match.group('rating')  # '79'
    score = match.group('score')    # '86'
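For further processing, the named groups can be turned into integers in one step. The following is only a sketch: the pattern is an illustration that assumes the labels "Overall Rating" and "Score" always appear in this order, and the name values is hypothetical.

```python
import re

test_str = "<meta content=\"Overall Rating: 79/100 ... Some Info ... Score: 86/100 \"/>"

# Assumed pattern: digits are captured by name, '/100' is left outside the groups.
pattern = re.compile(
    r'Overall Rating: (?P<rating>\d+)/100.*?Score: (?P<score>\d+)/100'
)

match = pattern.search(test_str)
if match:
    # groupdict() maps each group name to the text it captured.
    values = {name: int(text) for name, text in match.groupdict().items()}
else:
    values = {}

print(values)  # {'rating': 79, 'score': 86}
```

Compared with re.findall, this ties each number to its label, at the cost of a pattern that depends on the surrounding text staying the same.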

How do I extract multiple values from the same string with a regex in Python?
