Question

I have been scraping a shopping website, and I want to extract the brand name and the product name from its HTML, which looks like this:

<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>

I tried: results = soup.findAll("h1", {"class" : "product-name elim-suites"})[0].text

and got: u'ChantecailleLimited Edition Protect the Lion Eye Palette'

As you can see, Chantecaille is the brand name and the rest is the product name, but the two are now stuck together. Any suggestions? Thanks!

Answer 1

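You can use previous_sibling, which returns the previous node that shares the same parent (that is, the node at the same level of the parse tree).

Also, when you are searching for a single element, use find instead of findAll.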

item_span = soup.find("h1", {"class" : "product-name elim-suites"}).find("span")

# the text node right before the <span> is the brand; the <span> holds the product name
brand_name = item_span.previous_sibling
product_name = item_span.text

print(brand_name)
print(product_name)

Output:

Chantecaille
Limited Edition Protect the Lion Eye Palette
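
For reference, the previous_sibling approach can be sketched as a self-contained Python 3 script. The helper name split_brand_and_product is hypothetical, and the sketch assumes the brand is always the text node that sits directly before the span; it returns None for the brand when that is not the case.

```python
from bs4 import BeautifulSoup, NavigableString

html = (
    '<h1 class="product-name elim-suites">Chantecaille'
    '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>'
)


def split_brand_and_product(markup):
    """Hypothetical helper: return (brand, product) from the product heading."""
    soup = BeautifulSoup(markup, "html.parser")
    h1 = soup.find("h1", class_="product-name")
    if h1 is None or h1.span is None:
        return None, None
    span = h1.span
    # Assumption: the brand is the bare text node immediately before the <span>.
    sibling = span.previous_sibling
    brand = sibling.strip() if isinstance(sibling, NavigableString) else None
    product = span.get_text(strip=True)
    return brand, product


brand, product = split_brand_and_product(html)
print(brand)
print(product)
```

Checking that the sibling is a NavigableString guards against pages where the span is preceded by another tag or by nothing at all.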

Answer 2

You can use get_text and pass a character to separate the text, or use h1.find(text=True, recursive=False) to pull the h1's own text and then take the text from the span directly:

In [1]: h ="""<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
   ...: </span></h1>"""

In [2]: from bs4 import BeautifulSoup

In [3]: soup = BeautifulSoup(h, "html.parser")

In [4]: h1 = soup.select_one("h1.product-name.elim-suites")

In [5]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette

In [6]: prod, desc = h1.find(text=True, recursive=False), h1.span.text

In [7]: print(prod, desc)
(u'Chantecaille', u'Limited Edition Protect the Lion Eye Palette\n')
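
The same two options can be written as a short Python 3 sketch. It uses the string= keyword, which newer Beautiful Soup releases prefer over text=, and it assumes that the "|" separator never occurs in the page text; any character that cannot appear in a product name would do.

```python
from bs4 import BeautifulSoup

html = """<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
</span></h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# Option 1: join every text node with a separator, stripping each one first,
# then split on that separator (assumed not to appear in the text itself).
parts = h1.get_text("|", strip=True).split("|")

# Option 2: take only the h1's own first text node, then the span's text.
brand = h1.find(string=True, recursive=False).strip()
product = h1.span.get_text(strip=True)

print(parts)
print(brand, "/", product)
```

Passing strip=True removes the trailing newline that the plain .text call leaves on the span's text.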

Or, if text can also appear after the span, use find_all:

In [8]: h ="""<h1 class="product-name elim-suites">Chantecaille <span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""

In [9]: from bs4 import BeautifulSoup

In [10]: soup = BeautifulSoup(h, "html.parser")

In [11]: h1 = soup.select_one("h1.product-name.elim-suites")

In [12]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette
other text

In [13]: prod, desc = " ".join(h1.find_all(text=True, recursive=False)), h1.span.text

In [14]: print(prod, desc)
(u'Chantecaille other text', u'Limited Edition Protect the Lion Eye Palette')
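
A Python 3 sketch of the find_all variant follows. The direct text nodes keep whatever whitespace surrounds them in the markup, so this version strips each piece before joining; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html = (
    '<h1 class="product-name elim-suites">Chantecaille '
    '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>'
    'other text</h1>'
)

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# Collect only the text nodes that are direct children of the h1,
# which skips the text inside the <span>.
direct_pieces = [s.strip() for s in h1.find_all(string=True, recursive=False)]
own_text = " ".join(piece for piece in direct_pieces if piece)

span_text = h1.span.get_text(strip=True)

print(own_text)
print(span_text)
```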

How to get part of the text from a long tag using BeautifulSoup

2 answers: