Question

I have the complete HTML of a page, and I need to find its GA (Google Analytics) ID. For example:

<script>ga('create', 'UA-4444444444-1', 'auto');</script>

From the string above, I need to extract UA-4444444444-1, which starts with "UA-" and ends with "-1". I have tried this:

re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)

But it did not work. Please let me know what mistake I am making.

Thanks

Answer 1

It looks like you are overthinking this. You can search for the UA token directly:

re.findall(r"UA-\d+-\d+", raw_html)
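As a minimal sketch of why the pattern in the question returns nothing, assuming raw_html holds only the snippet shown there: that pattern requires a JSON-style "trackingId" key, which the ga('create', ...) call does not contain, whereas matching the ID by its own shape does not depend on the surrounding syntax.

```python
import re

# Sample input taken from the question; a real page will contain far more markup.
raw_html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"

# The pattern from the question needs a "trackingId": "UA-..." pair, absent here.
original = re.findall(r"\"trackingId\"\s?:\s?\"(UA-\d+-\d+)\"", raw_html)

# Matching the ID by its own shape works whatever syntax surrounds it.
direct = re.findall(r"UA-\d+-\d+", raw_html)

print(original)  # []
print(direct)    # ['UA-4444444444-1']
```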

Answer 2

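Do not use regular expressions to parse the HTML itself. Use BeautifulSoup to extract the text from tags. Here we pull the script tags out of the HTML and then apply the regex only to the text inside them.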

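import re
from bs4 import BeautifulSoup


html = "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"

# 'lxml' requires the lxml package; the built-in 'html.parser' also works here.
soup = BeautifulSoup(html, 'lxml')

pattern = re.compile(r"UA-[0-9]+-[0-9]+")
ids = []
for script in soup.find_all("script"):
    # extend() avoids the IndexError that [0] raises for scripts with no match;
    # script.string is None for empty tags such as <script src="..."></script>.
    ids.extend(pattern.findall(script.string or ""))
print(ids)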
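If BeautifulSoup is not installed, the same idea can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the answer above uses, and the names ScriptTextCollector and find_ga_ids are hypothetical helpers introduced for illustration. It shows the benefit of restricting the search to script tags: an ID-like string in ordinary page text is not picked up.

```python
import re
from html.parser import HTMLParser

GA_ID = re.compile(r"UA-\d+-\d+")


class ScriptTextCollector(HTMLParser):
    """Hypothetical helper: collects the text found inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Keep only text that appears while a <script> tag is open.
        if self.in_script:
            self.chunks.append(data)


def find_ga_ids(html):
    """Return every UA-style ID found in the script text of the given HTML."""
    parser = ScriptTextCollector()
    parser.feed(html)
    parser.close()
    # Join with newlines so text from separate scripts cannot fuse into one match.
    return GA_ID.findall("\n".join(parser.chunks))


sample = (
    "<p>UA-9-9 mentioned in body text</p>"
    "<script src='x.js'></script>"
    "<script>ga('create', 'UA-4444444444-1', 'auto');</script>"
)
print(find_ga_ids(sample))  # ['UA-4444444444-1']
```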

Find a string within a string that starts and ends with different strings in Python
