Question

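I want to print out all the IDs from a page whose elements have a unique class.

The page I want to scrape with Beautiful Soup looks like this: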

<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate"> 
<div class="contentArea"> 
<meta itemprop="name" content="Name - 12345 " /> 
<meta itemprop="url" content="https://url12345.hu" />   
<meta itemprop="category" content="category1" />   
</div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate"> 
<div class="contentArea"> 
<meta itemprop="name" content="Name - 12346 " /> 
<meta itemprop="url" content="https://url12346.hu" />   
<meta itemprop="category" content="category1" />   
</div>
</div>

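The "id" is a unique identifier on each itemscope div, so I would like to extract these unique IDs and print them all out. (The reason is that I want to attach all the other ad information, e.g. name, URL, etc., to this ID.)

I tried the following Python code, but it does not work.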

import requests
from bs4 import BeautifulSoup

page = requests.get('searchResultPage.url')
soup = BeautifulSoup(page.text, 'html.parser')
id = soup.find_all('id')
print(id)

It returns an empty list.

What I expect, and what I want, is to get back a list of the IDs from the divs, like this: 12345 12346

Thanks for your help!

Answer 1

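BeautifulSoup's find_all() function finds all HTML tags of a certain kind. id is not a tag; it is an attribute of a tag. You have to search for the tags that carry the IDs you want, which in this case are the div tags.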

div_tags = soup.find_all('div')
ids = []
for div in div_tags:
    div_id = div.get('id')  # None if this div has no id attribute
    if div_id is not None:
        ids.append(div_id)

BeautifulSoup also provides a way to find only the tags that have a specific attribute.
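As a rough sketch of that attribute-based lookup: passing id=True to find_all() keeps only tags that have an id attribute, and class_ filters on the class. The html string below is just the sample markup from the question pasted inline, so nothing is fetched over the network; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (inline so the sketch runs offline).
html = """
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12345 " />
</div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12346 " />
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# id=True matches any div that has an id attribute, whatever its value.
ids = [div["id"] for div in soup.find_all("div", id=True)]

# class_ filters on the class attribute ("class" is a reserved word in Python).
realestate_ids = [div.get("id") for div in soup.find_all("div", class_="realestate")]

print(ids)
print(realestate_ids)
```

Both lists skip the inner contentArea divs: the first because they have no id, the second because their class is different.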

Answer 2

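There is a difference between a tag and an attribute: in your case, div is the tag and id is an attribute of that tag. So you first have to use find_all(name='tag') to find all the tags, and only then can you use get('attribute') to read the attribute. If you are scraping a long page, you can use a list comprehension to make the code more compact: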

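# page is the requests response from the question; pass its text to the parser
soup = BeautifulSoup(page.text, 'html.parser')
# keep only the divs that actually carry an id attribute
test = [r['id'] for r in soup.find_all(name="div") if r.get('id') is not None]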

Output:

['12345', '12346']

Additionally, you can have find_all() return only the tags that have an id attribute (thanks to Jon Clements), for example:

test = [r['id'] for r in soup.find_all(name="div", attrs={"id":True})]

Answer 3

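HS-nebula is right: find_all looks for tags of a certain type, and the id in your soup is an attribute, not a tag type. To get a list of all the IDs in the soup, you can use the following one-liner: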

ids = [tag['id'] for tag in soup.select('div[id]')]

This uses a CSS selector instead of bs4's find_all, since I find the documentation for bs4's built-in functions lacking.

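So what soup.select does is return a list of all div elements that have an attribute named "id". We then loop over that list of div tags and add the value of each "id" attribute to the ids list.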
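To make the selector behaviour concrete, here is a small sketch run against a trimmed copy of the question's markup. The extra div with id="footer" is not in the original page; it is a made-up element added only to show how a class selector can narrow the match.

```python
from bs4 import BeautifulSoup

# Trimmed sample from the question, plus one hypothetical extra div ("footer").
html = """
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate">
<div class="contentArea"></div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate">
<div class="contentArea"></div>
</div>
<div id="footer" class="other"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# 'div[id]' matches every div that has an id attribute, in document order.
ids = [tag["id"] for tag in soup.select("div[id]")]

# 'div.realestate[id]' additionally requires the class "realestate".
listing_ids = [tag["id"] for tag in soup.select("div.realestate[id]")]

print(ids)
print(listing_ids)
```

If the real page has other divs with IDs, the narrower selector is probably the safer choice for picking out only the listings.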

Answer 4

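If you want to see every ID on the whole page, this will work, but the output will also include a lot of surrounding HTML tags and markup, because it prints the matching tags themselves.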

id = soup.find_all(id=True)
print(id)

If you want to see the actual IDs without all the HTML, one ID per line, here is an option:

for ID in soup.find_all('div', id=True):  
    print(ID.get('id'))

In the for loop above, you give the tag name in quotes, i.e. 'div', and then specify the attribute the tag must have, i.e. id=True.
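Since the question mentions attaching the other ad information (name, URL, category) to each ID, the loop above could be extended roughly as follows. This is only a sketch under the assumption that every listing keeps its details in meta tags with itemprop and content attributes, as in the sample markup; the ads dictionary and its layout are illustrative choices, not something from the original page.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (inline so the sketch runs offline).
html = """
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12345" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12345 " />
<meta itemprop="url" content="https://url12345.hu" />
<meta itemprop="category" content="category1" />
</div>
</div>
<div itemscope itemprop="item" itemtype="http://schema.org/Product" id="12346" class="realestate">
<div class="contentArea">
<meta itemprop="name" content="Name - 12346 " />
<meta itemprop="url" content="https://url12346.hu" />
<meta itemprop="category" content="category1" />
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

ads = {}  # maps each div id to a dict of its itemprop -> content pairs
for div in soup.find_all("div", id=True):
    # find_all searches nested tags too, so metas inside contentArea are found.
    info = {
        meta["itemprop"]: meta.get("content", "").strip()
        for meta in div.find_all("meta", itemprop=True)
    }
    ads[div["id"]] = info

for ad_id, info in ads.items():
    print(ad_id, info)
```

Keying the dictionary by ID keeps each listing's details together, which makes later lookups by ID straightforward.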

Answer 5

Here are a few solutions. If you only care about tags that have an id:

tags = page_soup.find_all(id=True)
for tag in tags:
    print(tag.name,tag['id'],sep='->')

If you need to loop over all tags:

tags = page_soup.find_all()
for tag in tags:
    if 'id' in tag.attrs:
        print(tag.name,tag['id'],sep='->')

To get only the IDs:

ids = [tag['id'] for tag in page_soup.find_all(id=True)]

How to find_all(id) from a div using Beautiful Soup in Python
