I want my spider to grab everyone's "Followers" and "Following" info. At the moment it returns only 6 results out of thousands. How can I get the complete results?
"items.py" includes:
import scrapy

class HouzzItem(scrapy.Item):
    Following = scrapy.Field()
    Follower = scrapy.Field()
The spider, named "houzzsp.py", includes:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class HouzzspSpider(CrawlSpider):
    name = "houzzsp"
    allowed_domains = ['www.houzz.com']
    start_urls = ['http://www.houzz.com/professionals']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//li[@class="sidebar-item"]')),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="navigation-button next"]')),
        Rule(LinkExtractor(restrict_xpaths='//div[@class="name-info"]'),
             callback='parse_items')
    ]

    def parse_items(self, response):
        page = response.xpath('//div[@class="follow-section profile-l-sidebar "]')
        for titles in page:
            Score = titles.xpath('.//a[@class="following follow-box"]/span[@class="follow-count"]/text()').extract()
            Score1 = titles.xpath('.//a[@class="followers follow-box"]/span[@class="follow-count"]/text()').extract()
            yield {'Following': Score, 'Follower': Score1}
EDIT: The rules have been changed and it now works as expected.
Answer 0: (score: 1)
When using scrapy's LinkExtractor with the restrict_xpaths argument, you don't need to specify the exact xpath of the urls you want to follow. From scrapy's documentation:

restrict_xpaths (str or list) – is an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from.

So the idea is to specify sections, so the LinkExtractor only digs into those tags to find the links to follow.
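For illustration, here is a minimal standalone sketch of that behavior (it assumes a modern Scrapy install, where the import paths are scrapy.linkextractors and scrapy.http rather than the older scrapy.contrib used in the question; the markup and class names are invented):

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# a tiny fake response with one link inside a "pagination" region
# and one link outside of it
response = HtmlResponse(
    url='http://www.example.com',
    body=b'<div class="pagination"><a href="/page/2">2</a></div>'
         b'<p><a href="/unrelated">x</a></p>',
    encoding='utf-8',
)

# restrict_xpaths names the region; LinkExtractor finds the <a> tags inside it by itself
links = LinkExtractor(restrict_xpaths='//div[@class="pagination"]').extract_links(response)
print([link.url for link in links])  # only http://www.example.com/page/2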
To summarize, don't put the a tag inside restrict_xpaths (and @href would be even worse), because the LinkExtractor will find the a tags inside the xpath you specify on its own.
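Applied to the question's spider, the profile rule would look like the sketch below (the div class comes from the question; whether it still matches houzz.com's markup is an assumption):

# good: name the container region; the <a> tags inside it are found automatically
Rule(LinkExtractor(restrict_xpaths='//div[@class="name-info"]'),
     callback='parse_items')

# unnecessary: drilling down to the <a> itself; pointing at its @href would be
# even worse, since restrict_xpaths is meant to select element regions
# Rule(LinkExtractor(restrict_xpaths='//div[@class="name-info"]/a'),
#      callback='parse_items')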