I am following this tutorial, which works for scraping data on the first page and for following links. In my case, however, before following a listing's link I want to check whether the listing contains all 3 pieces of information. If it does, I want scrapy to follow the business link to the business profile, where I can retrieve the email. After that, I want scrapy to go back to the main page and repeat the process for the remaining 19 listings on that page. Instead, it outputs duplicated listings, as shown below:
import scrapy

service_name = input("Input Industry: ")
city = input("Input The City: ")

class Item(scrapy.Item):
    business_name = scrapy.Field()
    phonenumber = scrapy.Field()
    email = scrapy.Field()
    website = scrapy.Field()

class Bbb_spider(scrapy.Spider):
    name = "bbb"
    start_urls = [
        "http://www.yellowbook.com/s/" + service_name + "/" + city
    ]

    def __init__(self):
        self.seen_business_names = []
        self.seen_websites = []
        self.seen_emails = []

    def parse(self, response):
        for business in response.css('div.listing-info'):
            item = Item()
            item['business_name'] = business.css('div.info.l h2 a::text').extract()
            item['website'] = business.css('a.s_website::attr(href)').extract()
            for x in item['business_name'] and item['website']:
                if x not in self.seen_business_names and item['website']:
                    if item['business_name']:
                        if item['website']:
                            item['phonenumber'] = business.css('div.phone-number::text').extract_first()
        for href in response.css('div.info.l h2 a::attr(href)'):
            yield response.follow(href, self.businessprofile)
        for href in response.css('ul.page-nav.r li a::attr(href)'):
            yield response.follow(href, self.parse)

    def businessprofile(self, response):
        for profile in response.css('div.profile-info.l'):
            item = Item()
            item['email'] = profile.css('a.email::text').extract()
            for x in item['email']:
                if x not in self.seen_emails:
                    self.seen_business_names.append(x)
                    yield item
Any suggestions on how to improve this code?
Answer (score 1):
Read the guides before writing a spider. To populate items you should use an Item Loader; its post- and pre-processors can serve your purpose. For the duplicates, you can use a custom pipeline.
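As a rough illustration of the pipeline idea above, here is a minimal sketch of duplicate filtering keyed on business_name. In a real Scrapy project this class would go in pipelines.py, be registered under ITEM_PIPELINES in settings, and raise scrapy.exceptions.DropItem; a stand-in exception is defined here only to keep the sketch dependency-free, and the sample items at the bottom are hypothetical.

```python
# Stand-in for scrapy.exceptions.DropItem so the sketch runs without Scrapy.
class DropItem(Exception):
    pass

class DuplicatesPipeline:
    """Drops any item whose business_name has already been seen."""

    def __init__(self):
        self.seen_names = set()

    def process_item(self, item, spider):
        name = item.get('business_name')
        if name in self.seen_names:
            # Scrapy catches DropItem and silently discards the item.
            raise DropItem(f"duplicate listing: {name!r}")
        self.seen_names.add(name)
        return item

# Hypothetical items, as the spider above might yield them.
pipeline = DuplicatesPipeline()
items = [
    {'business_name': 'Acme Plumbing'},
    {'business_name': 'Acme Plumbing'},   # duplicate, should be dropped
    {'business_name': 'Bolt Electric'},
]
kept = []
for it in items:
    try:
        kept.append(pipeline.process_item(it, spider=None))
    except DropItem:
        pass
```

With this in place the spider no longer needs its own seen_business_names bookkeeping; it can yield every item and let the pipeline discard repeats.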