Question

I have the following HTML fragment, which repeats several times with different href links:

<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">

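I now want to get all the href links in this document that come directly after a div tag with the class "product-list-item". I'm pretty new to BeautifulSoup, and nothing I've come up with has worked.

Thanks for your ideas.

EDIT: It doesn't really have to be BeautifulSoup; if it can be done with regex and a Python HTML parser, that's fine too.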

EDIT2: What I tried (I'm pretty new to Python, so what I did may be completely silly from an advanced point of view):

import bs4

# htmlsource is assumed to hold the page's HTML as a string
soup = bs4.BeautifulSoup(htmlsource, "html.parser")
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))

That gives me a list of all the "product-list-item" entries, but then I tried something like:

print(x[i].get("class").next_element)

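because I thought next_element or next_sibling should give me the next tag, but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried it with only the first list element: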

print(x[i][0].get("class").next_element)

which leads to this error: return self.attrs[key] KeyError: 0. I also tried .find_all("href") and .get("href"), but both lead to the same errors.
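Both errors come from calling things on the wrong object: Tag.get("class") returns a plain Python list of class names, and indexing a tag (as in x[i][0]) looks up an HTML attribute named 0. A minimal sketch of this, assuming bs4 is installed and using a made-up document modelled on the fragment in the question (the closing tags are added for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample modelled on the fragment in the question.
html = '''<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866"></a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

classes = div.get("class")  # a plain list of class names, not a Tag
print(type(classes).__name__, classes)

# Navigation attributes live on Tag / NavigableString objects, not on lists.
print(hasattr(classes, "next_element"))
print(hasattr(div, "next_element"))

# Indexing a Tag looks up an HTML attribute, so div[0] raises KeyError.
got_key_error = False
try:
    div[0]
except KeyError:
    got_key_error = True
print("KeyError raised:", got_key_error)
```

So next_element has to be called on the tag itself (x[i]), not on the value returned by x[i].get("class").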

EDIT3: OK, it seems I found out how to solve it. Now I do:

x = soup.find_all("div")

for i in range(len(x)):    
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))

This can also be shortened by using another argument of the find_all function:

x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
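The two next_element hops are needed because the first node after the opening div tag is the newline text between the tags, not the link. A small sketch that makes this visible, assuming bs4 is installed and using a made-up document modelled on the fragment in the question:

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical sample modelled on the fragment in the question.
html = '''<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866"></a>
</div>'''

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", "product-list-item")

first = div.next_element     # the "\n" text node right after the opening <div ...>
second = first.next_element  # the <a> tag

print(repr(first), isinstance(first, NavigableString))
print(second.name, second.get("href"))
```

This also shows why counting hops is fragile: if the markup had no whitespace between the tags, or had another node in between, the number of hops would change.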

Regards

Answer 1

I now want to get all the href links in this document that come directly after a div tag with the class "product-list-item"

To find the first <a href> element in each <div>:

links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])

This assumes that the link is inside the <div>. Any elements in the <div> before the first <a href> are ignored.
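As a self-contained illustration of that loop, here is a sketch run against a made-up document (the second and third divs, the span, and the example_2 and ignored URLs are assumptions added for the example). It shows that elements before the first <a href> are skipped and that divs of other classes are not visited:

```python
from bs4 import BeautifulSoup

# Hypothetical document: only the first <div> comes from the question.
html = '''
<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="product-list-item">
<span>new</span><a name="anchor-only">no href here</a>
<a href="http://www.urlexample.com/example_2">two</a>
</div>
<div class="other"><a href="http://www.urlexample.com/ignored">x</a></div>
'''

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True)  # first <a> with an href anywhere inside the <div>
    if a is not None:
        links.append(a['href'])

print(links)
```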

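If you like, you can be stricter about it, e.g., accept the link only if it is the first child of the <div>:

a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])

Or, if the <a> is not inside the <div>, look for the first <a href> that follows the <div> in the document instead.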
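One way to handle a link that is not nested inside the <div> is bs4's find_next(), which walks forward through the document from a given tag. A minimal sketch, assuming bs4 is installed; the markup below is made up so that the links are siblings following each div:

```python
from bs4 import BeautifulSoup

# Hypothetical markup where each link follows its <div> instead of sitting inside it.
html = '''
<div class="product-list-item"></div>
<a href="http://www.urlexample.com/example_1">one</a>
<div class="product-list-item"></div>
<p>some text</p>
<a href="http://www.urlexample.com/example_2">two</a>
'''

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all("div", "product-list-item"):
    # First <a> with an href that appears after the <div> in document order.
    a = div.find_next("a", href=True)
    if a is not None:
        links.append(a["href"])

print(links)
```

Note that find_next() also looks at the div's own descendants, since they come after the opening tag in document order, so it covers both the nested and the non-nested case.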

There are many ways to search and navigate in BeautifulSoup.

If you search with lxml.html, you can also use XPath and CSS expressions, if you are familiar with them.
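For comparison, a rough XPath sketch with lxml.html, assuming lxml is installed. The document and the XPath expression are illustrative assumptions; the concat/normalize-space idiom is a common way to match one class name among several:

```python
import lxml.html

# Hypothetical document modelled on the fragment in the question.
html = '''<html><body>
<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="other"><a href="http://www.urlexample.com/ignored">x</a></div>
</body></html>'''

doc = lxml.html.fromstring(html)

# For every div whose class list contains "product-list-item",
# take the href of its first <a href> child.
xpath_links = doc.xpath(
    '//div[contains(concat(" ", normalize-space(@class), " "), '
    '" product-list-item ")]/a[@href][1]/@href'
)

print(list(xpath_links))
```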

Python 3, Beautiful Soup, get next tag
