Scrapy spider not following links to another page

Posted: 2018-03-08 15:49:10

Tags: python scrapy

I am following this tutorial, and it works for getting data on the first page as well as following the links after it.

However, in my case, before clicking a listing's link I am trying to check that the listing contains three things:

  1. The listing must contain a business name
  2. The listing must have a phone number
  3. The listing must have a website
  4. If all three are present, I want Scrapy to click the business link, go to the business profile, and retrieve the email there.

    After that, I want Scrapy to return to the main page and repeat the process for the remaining 19 listings on that page.

    However, it outputs duplicate listings, as in this screenshot:

    [screenshot of duplicated listing output]

    service_name = input("Input Industry: ")
    city = input("Input The City: ")
    
    
    class Item(scrapy.Item):    
        business_name = scrapy.Field()
        phonenumber = scrapy.Field()
        email = scrapy.Field()
        website = scrapy.Field()
    
    class Bbb_spider(scrapy.Spider):
        name = "bbb"
    
        start_urls = [
            "http://www.yellowbook.com/s/"+ service_name + "/" + city
        ]
    
        def __init__(self):
            self.seen_business_names = []
            self.seen_websites = []
            self.seen_emails = []
    
    
        def parse(self, response):
            for business in response.css('div.listing-info'):
                item = Item()
                item['business_name'] = business.css('div.info.l h2 a::text').extract()
                item['website'] = business.css('a.s_website::attr(href)').extract()
                for x in item['business_name'] and item['website']:
                    if x not in self.seen_business_names and item['website']:
                        if item['business_name']:
                            if item['website']:
                                item['phonenumber'] = business.css('div.phone-number::text').extract_first()
                                for href in response.css('div.info.l h2 a::attr(href)'):
                                    yield response.follow(href, self.businessprofile)
    
                for href in response.css('ul.page-nav.r li a::attr(href)'):
                    yield response.follow(href, self.parse)
    
        def businessprofile(self, response):
            for profile in response.css('div.profile-info.l'):
                item = Item()
                item['email'] = profile.css('a.email::text').extract()
                for x in item['email']:
                    if x not in self.seen_emails:
                        self.seen_business_names.append(x)
                        yield item
    

    Any suggestions on how to improve the code?

1 Answer:

Answer 0 (score: 1):

Read the guides before writing spiders. To populate items you should use an Item Loader; its pre- and post-processors can serve your purpose. For the duplicates, you can use a custom pipeline.