Conditional crawling with Scrapy

Date: 2017-01-17 11:51:01

Tags: python beautifulsoup scrapy

My HTML contains a large number of divs with a similar structure. Below is an excerpt containing two such divs:

<!-- 1st Div start --> 

<div class="outer-container">
<div class="inner-container">
<a href="www.xxxxxx.com"></a>
<div class="abc xyz" title="verified"></div>
<div class="mody">
        <div class="row">
            <div class="col-md-5 col-xs-12">
                <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
                <div class="mvsdfm casmhrn" itemprop="address">
                    <span itemprop="Address">1223 Industrial Blvd</span><br>
                    <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
                </div>
                <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
                    (800) 845-0000
                </div>
            </div>
        </div>
    </div>
</div>
</div>

<!-- 2nd Div start -->

<div class="outer-container">
<div class="inner-container">
<a href="www.yyyyyy.com"></a>
<div class="mody">
        <div class="row">
            <div class="col-md-5 col-xs-12">
                <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
                <div class="mvsdfm casmhrn" itemprop="address">
                    <span itemprop="Address">7890 Business St</span><br>
                    <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
                </div>
                <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
                    (800) 845-0000
                </div>
            </div>
        </div>
    </div>
</div>
</div>

Here is what I want Scrapy to do:

If a div with class="outer-container" contains another div with title="verified", as in the first div above, the spider should follow the URL above it (i.e. www.xxxxxx.com) and scrape some additional fields from that page.

If no div with title="verified" is present, as in the second div above, it should scrape all the data under div class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyyy.com).

So how do I apply this condition/logic in a Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure...
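For reference, the branching you describe can be sketched with BeautifulSoup (which you mention considering). This is a minimal, standalone sketch against a stripped-down version of your HTML, using the stdlib "html.parser" backend; the simplified markup is only for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="outer-container">
  <a href="www.xxxxxx.com"></a>
  <div class="abc xyz" title="verified"></div>
</div>
<div class="outer-container">
  <a href="www.yyyyyy.com"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for container in soup.find_all("div", class_="outer-container"):
    href = container.find("a")["href"]
    # Keyword arguments to find() filter on attributes, so title="verified"
    # matches the marker div when it is present inside this container.
    if container.find("div", title="verified"):
        results.append(("follow", href))   # verified: follow the URL
    else:
        results.append(("scrape", href))   # not verified: scrape in place

print(results)
# [('follow', 'www.xxxxxx.com'), ('scrape', 'www.yyyyyy.com')]
```

In a real spider the "follow" branch would yield a Request and the "scrape" branch would yield an item, as the accepted answer shows with Scrapy's own selectors.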

Here is what I have tried so far...

from scrapy import Request, Selector
from scrapy.spiders import CrawlSpider
from bs4 import BeautifulSoup

from myproject.items import NewsFields

class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop="Address")
        loclity = soup.find_all("span", itemprop="Locality")
        region = soup.find_all("span", itemprop="Region")
        post = soup.find_all("span", itemprop="postalCode")

        nf['companyName'] = cName[0].get_text()
        nf['address'] = addrs[0].get_text()
        nf['locality'] = loclity[0].get_text()
        nf['state'] = region[0].get_text()
        nf['zipcode'] = post[0].get_text()
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)

Of course, the code above follows and scrapes all the URLs under div class="inner-container", because it contains no conditional crawling logic; I don't know where or how to add it.

If anyone has done something similar before, please share. Thanks.

1 Answer:

Answer 0 (score: 1)

There is no need for BeautifulSoup: Scrapy comes with its own selector capability (also released separately as parsel). Let's use your HTML as an example:

html = u"""
<!-- 1st Div start --> 
<div class="outer-container">
<div class="inner-container">
<a href="www.xxxxxx.com"></a>
<div class="abc xyz" title="verified"></div>
<div class="mody">
        <div class="row">
            <div class="col-md-5 col-xs-12">
                <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
                <div class="mvsdfm casmhrn" itemprop="address">
                    <span itemprop="Address">1223 Industrial Blvd</span><br>
                    <span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
                </div>
                <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
                    (800) 845-0000
                </div>
            </div>
        </div>
    </div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<a href="www.yyyyyy.com"></a>
<div class="mody">
        <div class="row">
            <div class="col-md-5 col-xs-12">
                <h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
                <div class="mvsdfm casmhrn" itemprop="address">
                    <span itemprop="Address">7890 Business St</span><br>
                    <span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
                </div>
                <div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
                    (800) 845-0000
                </div>
            </div>
        </div>
    </div>
</div>
</div>
"""

from parsel import Selector
sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)

The previous snippet produces:

verified, follow this URL: www.xxxxxx.com
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}

But in Scrapy you don't even need to instantiate the Selector class: there is a shortcut on the response object passed to the callback. Also, you don't need to subclass CrawlSpider; a regular Spider class is enough. Putting it all together:

from scrapy import Spider, Request
from myproject.items import NewsFields

class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf

I suggest you get familiar with parsel's API: https://parsel.readthedocs.io/en/latest/usage.html

Happy scraping!