I'm trying to extract URLs from a website whose pages contain a nested structure like this:
<script>
var.model = {
    data: {
        "alist": [{
            'url': 'http://www.here.org'
        }]
    }
}
</script>
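The problem is that I can't seem to find a way to walk through this tree. I have experience with BeautifulSoup4 but none with slimit, which I'm trying to learn now.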
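import requests
from bs4 import BeautifulSoup, NavigableString
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

def grabLinks(base, limit):
    base_url = requests.get(base)
    soup = BeautifulSoup(base_url.content, "html.parser")
    # Only inline scripts (no src attribute) can contain the model
    for script in soup.find_all("script", {'src': False}):
        if isinstance(script, NavigableString): continue
        tree = Parser().parse(script.text)
        for node in nodevisitor.visit(tree):
            # Look for the object-literal property whose key is "data"
            if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == "data":
                print(getattr(node.right, 'properties'))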
I can get as far as the name "alist" (with a lot of getattr calls), but I can't access the dictionary of values inside it. Let me know if you need more information.
Edit: Fixed! Updated code:
def grabLinks(base, limit):
    base_url = requests.get(base)
    soup = BeautifulSoup(base_url.content, "html.parser")
    for script in soup.find_all("script", {'src': False}):
        if isinstance(script, NavigableString): continue
        tree = Parser().parse(script.text)
        for node in nodevisitor.visit(tree):
            if isinstance(node, ast.Assign) and getattr(node.left, 'value', '') == "data":
                # First property of the "data" object, i.e. the "alist" entry
                sibnode = getattr(node.right, 'properties')[0]
                # "alist" array -> first item -> first property ('url') -> its value
                print(sibnode.right.items[0].properties[0].right.value)
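For comparison, here is a rough sketch that uses only the standard library and no slimit. It is a different technique from the one above: it collects the text of inline scripts with html.parser and then pulls out quoted 'url' properties with a regular expression, so it does not understand JavaScript in general and will miss unquoted keys or escaped quotes. The names InlineScriptCollector, URL_PROPERTY, extract_urls and SAMPLE_HTML are made up for illustration, and the sample assumes the page declares the object as `var model = {...}`.

```python
import re
from html.parser import HTMLParser


class InlineScriptCollector(HTMLParser):
    """Collect the text of <script> tags that have no src attribute."""

    def __init__(self):
        super().__init__()
        self.scripts = []        # bodies of inline scripts, in document order
        self._in_inline = False  # True while inside an inline <script>

    def handle_starttag(self, tag, attrs):
        if tag == "script" and "src" not in dict(attrs):
            self._in_inline = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so append rather than assign
        if self._in_inline:
            self.scripts[-1] += data


# Matches 'url': '...' or "url": "..." and captures the quoted value (group 2).
URL_PROPERTY = re.compile(r"""['"]url['"]\s*:\s*(['"])(.*?)\1""")


def extract_urls(html):
    """Return every quoted 'url' property value found in inline scripts."""
    collector = InlineScriptCollector()
    collector.feed(html)
    collector.close()
    return [match.group(2)
            for script in collector.scripts
            for match in URL_PROPERTY.finditer(script)]


# Hypothetical page fragment modelled on the structure in the question
SAMPLE_HTML = """
<script>
var model = {
    data: {
        "alist": [{
            'url': 'http://www.here.org'
        }]
    }
}
</script>
"""

print(extract_urls(SAMPLE_HTML))  # ['http://www.here.org']
```

If the real pages nest the URLs differently or build them dynamically, walking the parsed tree with slimit as in the updated code is the more robust route; the regex version is only worth considering when the property name is known and the markup is stable.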