How to select only divs that have specific children with XPath in Python

Time: 2018-02-16 14:14:20

Tags: python-3.x xpath web-scraping lxml

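I am trying to scrape a particular e-commerce site. I only want product information such as name, price, color, and size, and only for products whose price has been cut.

I am currently using XPath.

Here is my Python scraping code: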

from lxml import html
import requests


class CategoryCrawler(object):

    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.items = set()

    def __str__(self):
        # __str__ must return a string, not a tuple
        return 'All Items: {}'.format(self.items)

    def crawl(self):
        self.get_item_from_link(self.starting_url)

    def get_item_from_link(self, link):
        # Download the page and parse it into an element tree
        start_page = requests.get(link)
        tree = html.fromstring(start_page.text)

        # Product names live in <span class="name" dir="ltr">
        names = tree.xpath('//span[@class="name"][@dir="ltr"]/text()')
        print(names)

Note that this is not the real URL:

crawler = CategoryCrawler('https://www.myfavoriteecommercesite.com/')

crawler.crawl()

When the program runs, this is the HTML content fetched from the e-commerce site.

Section for a product with a price cut:

<div class="products-info">

<h2 class="title"><span class="brand ">Apple&nbsp;</span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>

 <div class="price-container clearfix">

    <span class="sale-flag-percent">-22%</span> 

        <span class="price-box ri">

                 <span class="price ">

                        <span data-currency-iso="NGN">₦</span> 

                        <span dir="ltr" data-price="388990">388,990</span>  

                  </span>  

                  <span class="price -old ">

                        <span data-currency-iso="NGN">₦</span> 

                        <span dir="ltr" data-price="500000">500,000</span>  

                  </span> 

        </span>

  </div>

</div>

Section for a product without a price cut:

<div class="products-info">

<h2 class="title"><span class="brand ">Apple&nbsp;</span> <span class="name" dir="ltr">IPhone X 5.8-Inch HD (3GB,64GB ROM) IOS 11, 12MP + 7MP 4G Smartphone - Silver</span></h2>

 <div class="price-container clearfix">


        <span class="price-box ri">

                 <span class="price ">

                        <span data-currency-iso="NGN">₦</span> 

                        <span dir="ltr" data-price="388990">388,990</span>  

                  </span>  

        </span>

  </div>

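</div>

Now here is my exact question.

I want to know how to select only the parent div, that is,

<div class="price-container clearfix">, when it also contains either of these child spans:

<span class="price -old "> or

<span class="sale-flag-percent">

Thanks, everyone.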

1 Answer:

Answer 0: (score: 0)

One solution is to get all <div class="price-container clearfix"> elements and iterate over them, checking each element's full markup string for the keywords.
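A minimal sketch of that first approach, assuming markup shaped like the sample in the question. The `sample` string and the `has_price_cut` helper are made up for illustration and are not part of the original code:

```python
from lxml import html

# Hypothetical sample, trimmed from the markup shown in the question
sample = '''
<div>
  <div class="price-container clearfix">
    <span class="sale-flag-percent">-22%</span>
    <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
    <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
  </div>
  <div class="price-container clearfix">
    <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
  </div>
</div>
'''


def has_price_cut(div):
    # Serialize the element back to markup and look for the marker class names
    markup = html.tostring(div, encoding='unicode')
    return 'sale-flag-percent' in markup or 'price -old' in markup


tree = html.fromstring(sample)
containers = tree.xpath('//div[@class="price-container clearfix"]')
discounted = [div for div in containers if has_price_cut(div)]
print(len(containers), len(discounted))
```

Plain substring checks like this would also match if the keywords appeared in visible text rather than in a class attribute, which is one reason to prefer a structural test.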

But a better solution is to put the condition in the XPath itself:

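from lxml import html

htmlst = 'your html'  # placeholder: the page source as a string
tree = html.fromstring(htmlst)
# Keep only price containers that have a descendant span for the old price or the sale flag.
# @class is compared as an exact string, so the trailing space in "price -old " matters.
divs = tree.xpath('//div[@class="price-container clearfix" and .//span[@class = "price -old " or @class = "sale-flag-percent"]]')
print(divs)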

This selects every div with class="price-container clearfix" and keeps only those that contain a span with one of the searched-for classes.
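To see the predicate at work, here is a small self-contained sketch. The `sample` markup is a trimmed copy of the HTML in the question with made-up product names, and the variable names are illustrative. It compares `normalize-space(@class)` instead of the raw attribute, which is my own assumption that trailing whitespace in class values may vary; the answer above uses exact equality.

```python
from lxml import html

# Hypothetical sample: one discounted product and one full-price product
sample = '''
<div>
  <div class="products-info">
    <h2 class="title"><span class="name" dir="ltr">Discounted phone</span></h2>
    <div class="price-container clearfix">
      <span class="sale-flag-percent">-22%</span>
      <span class="price-box ri">
        <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
        <span class="price -old "><span dir="ltr" data-price="500000">500,000</span></span>
      </span>
    </div>
  </div>
  <div class="products-info">
    <h2 class="title"><span class="name" dir="ltr">Full-price phone</span></h2>
    <div class="price-container clearfix">
      <span class="price-box ri">
        <span class="price "><span dir="ltr" data-price="388990">388,990</span></span>
      </span>
    </div>
  </div>
</div>
'''

tree = html.fromstring(sample)

# Price containers with an old-price span or a sale-flag span somewhere inside
containers = tree.xpath(
    '//div[@class="price-container clearfix" and '
    './/span[normalize-space(@class)="price -old" or @class="sale-flag-percent"]]'
)

results = []
for container in containers:
    # Walk up to the nearest enclosing product block to read the name
    product = container.xpath('ancestor::div[@class="products-info"][1]')[0]
    name = product.xpath('.//span[@class="name"]/text()')[0]
    current = container.xpath('.//span[normalize-space(@class)="price"]/span/@data-price')[0]
    old = container.xpath('.//span[normalize-space(@class)="price -old"]/span/@data-price')
    results.append((name, int(current), int(old[0]) if old else None))

print(results)
```

Selecting the container first and then using the `ancestor` axis keeps the filter from the answer unchanged while still giving access to the product name, which sits outside the price container.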