Question

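OK, here is what I'm trying to do. I'm new to Python and have only just started with it. Anyway, I'm trying to use this little tool to pull data out of a page. In this case, I want the user to enter a URL and have it return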

<meta content=" % Likes, % Comments - @% on Instagram: “post description []”" name="description" />

but with each % replaced by the number of likes, comments, and so on that the post has.

Here is my full code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
import re

url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')
result = soup2.findAll('content', attrs={'content': 'description'})
print (result)

But whenever I run it, I get []. What am I doing wrong?

Answer 1

The correct way to match these tags is:

result = soup2.findAll('meta', content=True, attrs={"name": "description"})

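However, html.parser does not parse the <meta> tags correctly. It doesn't realize they are self-closing, so it includes the rest of the <head> in the result. I changed the parser to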

soup2 = BeautifulSoup(page2.content, 'html5lib')

and then the search above returns:

[<meta content="46.3m Likes, 2.6m Comments - EGG GANG  (@world_record_egg) on Instagram: “Let’s set a world record together and get the most liked post on Instagram. Beating the current…”" name="description"/>]

Answer 2

This seems to work:

for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))

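Basically, you iterate over all the <meta> tags in the page and look for the one whose property attribute is "og:description", which appears to be the Open Graph property you want.

Does that help?

Full version: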

from bs4 import BeautifulSoup
import requests

url = "https://www.instagram.com/p/BsOGulcndj-/"
page2 = requests.get(url)
soup2 = BeautifulSoup(page2.content, 'html.parser')

# Print the content of the Open Graph description tag, if the page has one.
for tag in soup2.findAll("meta"):
    if tag.get("property", None) == "og:description":
        print(tag.get("content", None))
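The check in the loop above can also be tried offline. The following is only a rough sketch: it swaps BeautifulSoup for the standard library's html.parser.HTMLParser so that it runs with no network request and no third-party packages. The class name MetaContentFinder and the SAMPLE_HTML document are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class MetaContentFinder(HTMLParser):
    """Collect the content of <meta> tags whose attribute `key` equals `value`."""

    def __init__(self, key, value):
        super().__init__()
        self.key = key
        self.value = value
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict makes lookups easier.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get(self.key) == self.value:
            self.found.append(attributes.get("content"))


# Made-up document standing in for the downloaded page (an assumption).
SAMPLE_HTML = """
<html><head>
<meta content="12 Likes, 3 Comments - @example on Instagram" name="description" />
<meta property="og:description" content="12 Likes, 3 Comments - @example on Instagram" />
</head><body></body></html>
"""

finder = MetaContentFinder("property", "og:description")
finder.feed(SAMPLE_HTML)
print(finder.found)
```

Passing "name" and "description" instead selects the tag that the question was originally after.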

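Update: as for your question about pretty-printing, there are several ways to do it. One of them involves regular expressions and string interpolation. For example: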

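import re

# `string` is the content attribute printed above, e.g. tag.get("content").
likes = re.search(r'(.*)Likes', string).group(1).strip()
comments = re.search(r',(.*)Comments', string).group(1).strip()
description = re.search(r'-(.*)', string).group(1).strip()

print(f"{likes} Likes | {comments} Comments | {description}")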
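As a variation on the same idea, the three searches could be folded into one pattern with named groups. This is a minimal sketch, assuming the text always follows the "<likes> Likes, <comments> Comments - <rest>" layout shown in Answer 1. The helper name parse_description is hypothetical, and the sample string is a shortened copy of that output.

```python
import re

# Assumed layout: "<likes> Likes, <comments> Comments - <rest>".
DESCRIPTION_RE = re.compile(
    r"^\s*(?P<likes>\S+)\s+Likes,\s*(?P<comments>\S+)\s+Comments\s*-\s*(?P<rest>.*)$",
    re.DOTALL,
)


def parse_description(text):
    """Return a dict with likes, comments and the remaining text, or None."""
    match = DESCRIPTION_RE.match(text)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


# Shortened copy of the content attribute shown in Answer 1.
sample = "46.3m Likes, 2.6m Comments - EGG GANG  (@world_record_egg) on Instagram"
parsed = parse_description(sample)
if parsed is not None:
    print(f"{parsed['likes']} Likes | {parsed['comments']} Comments | {parsed['rest']}")
```

Returning None when the layout does not match avoids the AttributeError that calling .group(1) on a failed re.search would raise.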

However, if you have further questions about this, you should probably ask them in a new post.

Unexpected results when trying to get metadata with BeautifulSoup
