I have been working with a shopping site and want to extract the brand name and the product name from its HTML, which looks like this:
<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>
I tried: results = soup.findAll("h1", {"class" : "product-name elim-suites"})[0].text
and got: u'ChantecailleLimited Edition Protect the Lion Eye Palette'
As you can see, Chantecaille is the brand name and the rest is the product name, but the two are now stuck together. Any suggestions? Thanks!
Answer 0 (score: 0)
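You can use previous_sibling, which returns the previous node that shares the same parent (that is, the node just before it at the same level of the parse tree).
Also, when you are searching for a single element, use find instead of findAll.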
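item_span = soup.find("h1", {"class" : "product-name elim-suites"}).find("span")
# The text node just before the <span> holds the brand name
brand_name = item_span.previous_sibling
# The text inside the <span> is the product name
product_name = item_span.text
print(brand_name)
print(product_name)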
Output:
Chantecaille
Limited Edition Protect the Lion Eye Palette
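For reference, the previous_sibling approach can be sketched as a self-contained script. This is a minimal sketch that assumes Python 3 and the bs4 package; the `html` string is copied from the question, and the `None` guard and the whitespace stripping are additions that are not part of the original answer.

```python
from bs4 import BeautifulSoup

# Markup copied from the question.
html = ('<h1 class="product-name elim-suites">Chantecaille'
        '<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span></h1>')

soup = BeautifulSoup(html, "html.parser")

h1 = soup.find("h1", {"class": "product-name elim-suites"})
item_span = h1.find("span")

# previous_sibling is the node right before the <span> under the same parent.
# It is None when the <span> is the first child, so guard against that case.
sibling = item_span.previous_sibling
brand_name = str(sibling).strip() if sibling is not None else ""

# get_text(strip=True) returns the <span> text without surrounding whitespace.
product_name = item_span.get_text(strip=True)

print(brand_name)
print(product_name)
```

Note that previous_sibling only looks one node back, so this works when the brand text sits immediately before the span, as it does in the markup from the question.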
Answer 1 (score: 0)
You can use get_text and pass a character to separate the text, or call h1.find(text=True, recursive=False) on the h1 to extract its own text and then take the text from the span directly:
In [1]: h ="""<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
...: </span></h1>"""
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(h, "html.parser")
In [4]: h1 = soup.select_one("h1.product-name.elim-suites")
In [5]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette
In [6]: prod, desc = h1.find(text=True, recursive=False), h1.span.text
In [7]: print(prod, desc)
(u'Chantecaille', u'Limited Edition Protect the Lion Eye Palette\n')
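The session above leaves a trailing newline on the span text. A minimal sketch of the same idea with the whitespace trimmed, assuming Python 3 and Beautiful Soup 4.4.0 or later (where the `string` argument is the newer name for `text`); the variable names `brand` and `product` are illustrative only:

```python
from bs4 import BeautifulSoup

# Same markup as in the session above, including the newline inside <span>.
html = """<h1 class="product-name elim-suites">Chantecaille<span itemprop="name" >Limited Edition Protect the Lion Eye Palette
</span></h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# string=True matches text nodes; recursive=False restricts the search to
# direct children of <h1>, so the text inside <span> is skipped.
brand = h1.find(string=True, recursive=False).strip()

# strip=True drops the trailing newline inside the <span>.
product = h1.span.get_text(strip=True)

print(brand)
print(product)
```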
Or, if there is also text after the span, you can use find_all:
In [8]: h ="""<h1 class="product-name elim-suites">Chantecaille
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""
In [9]: from bs4 import BeautifulSoup
In [10]: soup = BeautifulSoup(h, "html.parser")
In [11]: h1 = soup.select_one("h1.product-name.elim-suites")
In [12]: print(h1.get_text("\n"))
Chantecaille
Limited Edition Protect the Lion Eye Palette
other text
In [13]: prod, desc = " ".join(h1.find_all(text=True, recursive=False)), h1.span.text
In [14]: print(prod, desc)
(u'Chantecaille other text', u'Limited Edition Protect the Lion Eye Palette')
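When the direct text nodes carry stray newlines, as the one after Chantecaille does here, it may help to strip each piece before joining. A minimal sketch under the same assumptions (Python 3, Beautiful Soup 4.4.0 or later); `outer`, `inner`, and `parts` are illustrative names, and `stripped_strings` is shown as an alternative that walks every text node in document order:

```python
from bs4 import BeautifulSoup

# Same markup as in the session above: text before and after the <span>.
html = """<h1 class="product-name elim-suites">Chantecaille
<span itemprop="name" >Limited Edition Protect the Lion Eye Palette</span>other text</h1>"""

soup = BeautifulSoup(html, "html.parser")
h1 = soup.select_one("h1.product-name.elim-suites")

# Collect only the text nodes that are direct children of <h1>,
# strip each one, and drop any that were whitespace only.
direct_text = [s.strip() for s in h1.find_all(string=True, recursive=False)]
outer = " ".join(t for t in direct_text if t)

# Text that lives inside the <span>.
inner = h1.span.get_text(strip=True)

# Alternative: every text node under <h1>, stripped, in document order.
parts = list(h1.stripped_strings)

print(outer)
print(inner)
print(parts)
```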