Scrapy spider not following links to another page

Posted: 2018-03-08 15:49:10

Tags: python scrapy

I am following this tutorial, and it works for getting data on the first page as well as following the links after it.

However, in my case, before clicking a listing's link I am trying to check that the listing contains three things:

  1. The listing must contain a business name
  2. The listing must have a phone number
  3. The listing must have a website
  4. If all three are present, I want Scrapy to click the business link, go to the business profile, and retrieve the email there.

    After that, I want Scrapy to return to the main page and repeat the process for the remaining 19 listings on that page.

    However, it outputs duplicate listings, as in this screenshot:

    [screenshot of duplicated listing output]

    service_name = input("Input Industry: ")
    city = input("Input The City: ")
    
    
    class Item(scrapy.Item):    
        business_name = scrapy.Field()
        phonenumber = scrapy.Field()
        email = scrapy.Field()
        website = scrapy.Field()
    
    class Bbb_spider(scrapy.Spider):
        name = "bbb"
    
        start_urls = [
            "http://www.yellowbook.com/s/"+ service_name + "/" + city
        ]
    
        def __init__(self):
            self.seen_business_names = []
            self.seen_websites = []
            self.seen_emails = []
    
    
        def parse(self, response):
            for business in response.css('div.listing-info'):
                item = Item()
                item['business_name'] = business.css('div.info.l h2 a::text').extract()
                item['website'] = business.css('a.s_website::attr(href)').extract()
                for x in item['business_name'] and item['website']:
                    if x not in self.seen_business_names and item['website']:
                        if item['business_name']:
                            if item['website']:
                                item['phonenumber'] = business.css('div.phone-number::text').extract_first()
                                for href in response.css('div.info.l h2 a::attr(href)'):
                                    yield response.follow(href, self.businessprofile)
    
                for href in response.css('ul.page-nav.r li a::attr(href)'):
                    yield response.follow(href, self.parse)
    
        def businessprofile(self, response):
            for profile in response.css('div.profile-info.l'):
                item = Item()
                item['email'] = profile.css('a.email::text').extract()
                for x in item['email']:
                    if x not in self.seen_emails:
                        self.seen_business_names.append(x)
                        yield item
    

    Any suggestions on how to improve the code?

1 Answer:

Answer 0 (score: 1):

Read the guides before writing spiders. To populate items you should use an Item Loader; its pre- and post-processors can serve your purpose. For the duplicates, you can use a custom pipeline.