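I am currently trying to scrape information from a particular e-commerce site. I only want product information such as name, price, color and size, and only for products whose price has been reduced.
I am currently using XPath.
This is my Python scraping code: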
from lxml import html
import requests


class CategoryCrawler(object):
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.items = set()

    def __str__(self):
        # __str__ must return a string, not a tuple
        return 'All Items: {}'.format(self.items)

    def crawl(self):
        self.get_item_from_link(self.starting_url)

    def get_item_from_link(self, link):
        # Download the page and parse it into an element tree
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)
        # Currently this only collects the product names
        names = tree.xpath('//span[@class="name"][@dir="ltr"]/text()')
        print(names)


crawler = CategoryCrawler('https://www.myfavoriteecommercesite.com/')
crawler.crawl()
<div class="products-info">
    <h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
    <div class="price-container clearfix">
        <span class="sale-flag-percent">-22%</span>
        <span class="price-box ri">
            <span class="price ">
                <span data-currency-iso="NGN">₦</span>
                <span dir="ltr" data-price="388990">388,990</span>
            </span>
            <span class="price -old ">
                <span data-currency-iso="NGN">₦</span>
                <span dir="ltr" data-price="500000">500,000</span>
            </span>
        </span>
    </div>
</div>
<div class="products-info">
    <h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>
    <div class="price-container clearfix">
        <span class="price-box ri">
            <span class="price ">
                <span data-currency-iso="NGN">₦</span>
                <span dir="ltr" data-price="388990">388,990</span>
            </span>
        </span>
    </div>
</div>
I would like to know how to select only the parent div, i.e.
<div class="price-container clearfix">, when it also contains either of these child spans:
<span class="price -old "> or
<span class="sale-flag-percent">
Thanks, everyone.
Answer 0 (score: 0)
One solution is to fetch every <div class="price-container clearfix">
and iterate over the results, checking each element's serialized markup for the keywords.
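The iterate-and-check approach could look roughly like the sketch below. It is only an illustration: the `SAMPLE_HTML` string is a trimmed copy of the markup from the question, and the names `KEYWORDS` and `discounted_by_string_search` are made up for this example.

```python
from lxml import html

# Trimmed sample modelled on the question: the first container is
# discounted, the second is not. (Assumed test data, not a real page.)
SAMPLE_HTML = """
<div>
  <div class="price-container clearfix">
    <span class="sale-flag-percent">-22%</span>
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
      <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
    </span>
  </div>
  <div class="price-container clearfix">
    <span class="price-box ri">
      <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
    </span>
  </div>
</div>
"""

# Substrings that only appear in the markup of a discounted product.
KEYWORDS = ('price -old', 'sale-flag-percent')


def discounted_by_string_search(tree):
    """Return the price containers whose serialized markup mentions a keyword."""
    matches = []
    for div in tree.xpath('//div[@class="price-container clearfix"]'):
        # Serialize the element (including its children) back to a string
        markup = html.tostring(div, encoding='unicode')
        if any(keyword in markup for keyword in KEYWORDS):
            matches.append(div)
    return matches


tree = html.fromstring(SAMPLE_HTML)
discounted = discounted_by_string_search(tree)
print(len(discounted))
```

This works, but it searches raw text, so a keyword that happens to appear in visible text or in another attribute would also count as a match.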
But a better solution is to put the condition into the XPath expression itself:
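from lxml import html

htmlst = 'your html'  # replace with the page source, e.g. requests.get(url).text
tree = html.fromstring(htmlst)
# Keep only the price containers that have a discounted-price or sale-flag span inside
divs = tree.xpath('//div[@class="price-container clearfix" and .//span[@class = "price -old " or @class = "sale-flag-percent"] ]')
print(divs)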
This selects every div with class="price-container clearfix"
and keeps it only if it contains a span with one of the searched classes.
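To show the expression at work, here is a minimal sketch that runs it against a cut-down version of the question's markup and then reads the name and prices of each match. The product names in `SAMPLE_HTML`, the constant `DISCOUNTED_XPATH` and the helper `discounted_products` are assumptions made for this example. Note that `@class="..."` is an exact string comparison, so the trailing spaces in `"price "` and `"price -old "` must match the page source.

```python
from lxml import html

# Cut-down sample based on the question; the product names are placeholders.
SAMPLE_HTML = """
<div>
  <div class="products-info">
    <h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">Discounted phone</span></h2>
    <div class="price-container clearfix">
      <span class="sale-flag-percent">-22%</span>
      <span class="price-box ri">
        <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
        <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
      </span>
    </div>
  </div>
  <div class="products-info">
    <h2 class="title"><span class="brand ">Apple </span> <span class="name" dir="ltr">Full-price phone</span></h2>
    <div class="price-container clearfix">
      <span class="price-box ri">
        <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
      </span>
    </div>
  </div>
</div>
"""

# Same condition as in the answer, split over two lines for readability.
DISCOUNTED_XPATH = (
    '//div[@class="price-container clearfix" and '
    './/span[@class="price -old " or @class="sale-flag-percent"]]'
)


def discounted_products(tree):
    """Return name, current price and old price for each discounted product."""
    results = []
    for container in tree.xpath(DISCOUNTED_XPATH):
        # In the sample markup the name lives in the parent products-info div
        info = container.getparent()
        name = str(info.xpath('string(.//span[@class="name"])'))
        current = container.xpath('string(.//span[@class="price "]/span/@data-price)')
        old = container.xpath('string(.//span[@class="price -old "]/span/@data-price)')
        results.append({
            'name': name,
            'price': int(current),
            # A product may carry only the sale flag and no old price
            'old_price': int(old) if old else None,
        })
    return results


products = discounted_products(html.fromstring(SAMPLE_HTML))
print(products)
```

If the site ever changes the whitespace inside its class attributes, the exact comparisons stop matching. A `contains(@class, "sale-flag-percent")` test is more forgiving, at the cost of also matching longer class names that contain the same text.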