Question

I'm getting some strange behavior that I don't quite understand. I'm hoping someone can explain what is happening.

Consider this metadata:

<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">

This line successfully finds all the "og" properties and returns a list.

opengraphs = doc.html.head.findAll(property=re.compile(r'^og'))

However, this line fails to do the same for the Twitter cards.

twitterCards = doc.html.head.findAll(name=re.compile(r'^twitter'))

Why does the first line successfully find all the "og" (Open Graph) cards, but fail to find the Twitter cards?

Answer 1

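This is because name is the name of the tag-name argument, which basically means that in this case BeautifulSoup would look for elements whose tag name starts with twitter.

To specify that you actually mean an attribute, use:

doc.html.head.find_all(attrs={'name': re.compile(r'^twitter')})

Or, via a CSS selector:

doc.html.head.select("[name^=twitter]")

where ^= means "starts with".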
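The difference between the three calls can be checked with a small self-contained sketch. It assumes bs4 (with its bundled CSS selector support) is installed and uses the built-in html.parser; the variable names are only illustrative, and the markup is the snippet from the question wrapped in a minimal head.

```python
import re
from bs4 import BeautifulSoup

html = '''
<html><head>
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
</head></html>
'''

doc = BeautifulSoup(html, "html.parser")

# name= filters on the tag name, so this looks for tags such as <twitter...>
by_tag_name = doc.html.head.find_all(name=re.compile(r"^twitter"))

# attrs= filters on the HTML attribute that happens to be called "name"
by_attribute = doc.html.head.find_all(attrs={"name": re.compile(r"^twitter")})

# CSS attribute selector; ^= means "starts with"
by_selector = doc.html.head.select('[name^="twitter"]')

print(len(by_tag_name), len(by_attribute), len(by_selector))
```

Only the last two calls should match the twitter:title element, since no tag in the document is itself named twitter.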

Answer 2

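The problem is that name= has a special meaning: it is used to match the tag name, which in your code is meta.

You have to add "meta" as the tag name and use a dictionary with "name" for the attribute.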

Examples for different cases:

from bs4 import BeautifulSoup
import re

data='''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta property="twitter:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

head = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning

print(head.findAll(property=re.compile(r'^og'))) # OK
print(head.findAll(property=re.compile(r'^tw'))) # OK

print(head.findAll(name=re.compile(r'^meta'))) # OK
print(head.findAll(name=re.compile(r'^tw')))   # empty

print(head.findAll('meta', {'name': re.compile(r'^tw')})) # OK
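Building on the last call above, one possible way to collect the matched tags into a dictionary is sketched below. The helper collect_meta is hypothetical (it is not part of BeautifulSoup), and the sketch assumes each meta tag carries its value in a content attribute.

```python
import re
from bs4 import BeautifulSoup

data = '''
<meta property="og:title" content="This is the Tesla Semi truck">
<meta name="twitter:title" content="This is the Tesla Semi truck">
'''

def collect_meta(soup, attr, prefix):
    """Hypothetical helper: map each matching attribute value to its content."""
    pattern = re.compile('^' + re.escape(prefix))
    # 'meta' is the tag name; the attrs dict filters on the attribute itself
    tags = soup.find_all('meta', attrs={attr: pattern})
    return {tag[attr]: tag.get('content') for tag in tags}

soup = BeautifulSoup(data, 'html.parser')
og_cards = collect_meta(soup, 'property', 'og')
twitter_cards = collect_meta(soup, 'name', 'twitter')
print(og_cards)
print(twitter_cards)
```

Passing the attribute through the attrs dictionary keeps the helper safe for any attribute, including name, which would otherwise be read as the tag-name filter.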

Python Beautiful Soup: extracting HTML metadata
