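I have a spider that crawls several start_urls, but the problem is that I only get a limited amount of output. When I crawl a single start_url, however, it returns all of the results up to the point where the page's infinite scroll takes over. Here is my spider code: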
from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem


class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = [
        "http://www.pinterest.com/jetsetterphoto/pins/",
        "http://www.pinterest.com/llbean/pins/",
        "http://www.pinterest.com/nordstrom/pins/"
    ]

    def parse(self, response):
        hxs = Selector(response)
        pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
        repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
        like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
        comment_counts = hxs.xpath("//em[@class='socialMetaCount commentCountSmall']/text()").extract()
        board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()
        pin_descriptions = hxs.xpath("//p[@class='pinDescription']/text()").extract()
        items = []
        for pin_link, repin_count, like_count, comment_count, board_name, pin_description in zip(pin_links, repin_counts, like_counts, comment_counts, board_names, pin_descriptions):
            item = PinterestItem()
            item["pin_link"] = pin_link.strip()
            item["repin_count"] = repin_count.strip()
            item["like_count"] = like_count.strip()
            item["comment_count"] = comment_count.strip()
            item["board_name"] = board_name.strip()
            item["pin_description"] = pin_description.strip()
            items.append(item)
        return items
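I can see that the crawler fetches each of the start_urls, but the JSON file ends up with only 16 lines of output. When I use a single start_url, I get more lines (all of them, up to the infinite-scroll point). Is there a limit on the number of requests that I can change in the settings? I looked for similar questions but couldn't find one like mine. Any ideas?
EDIT: Could this be related to the CONCURRENT_REQUESTS_PER_DOMAIN setting? See http://doc.scrapy.org/en/latest/topics/settings.html. The maximum specified there is 8, and I get only 8 lines of output per domain. The total number of concurrent requests defaults to 16, which would explain why I only get results for 2 start_urls. I'll test whether changing the defaults helps (I don't know whether this makes sense to anyone else).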
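For reference, the two settings this edit refers to would look roughly like this in a project's settings.py. The values shown are the defaults quoted above. As far as the Scrapy documentation describes them, both cap how many requests are in flight at once rather than how many items a callback may return, so changing them alone would not be expected to change the number of output rows.

```python
# settings.py (sketch): values are the defaults quoted in the question.
# Both settings limit how many requests run at the same time;
# neither one limits how many items a spider callback may return.
CONCURRENT_REQUESTS = 16            # total concurrent requests, all domains
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # concurrent requests to a single domain
```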
EDIT: I would like to add this to my spider to extract basic profile information:
for BasicInfo in selector.css('div.userProfilePage'):
    item["company_pins"] = get(pin.css('div.PinCount::text'))
    item["company_likes"] = get(pin.css('ul.userStats li~ li+ li a::text'))
    item["company_name"] = get(pin.css('h1.userProfileHeaderName::text'))
    item["company_followers"] = get(pin.css('a.FollowerCount .buttonText::text'))
The code would then look like this:
def parse(self, response):
    selector = Selector(response)
    items = []
    for pin in selector.css('div.pinWrapper'):
        item = PinterestItem()
        item["pin_link"] = get(pin.css('div.pinHolder a::attr(href)'))
        item["repin_count"] = get(pin.css('em.repinCountSmall::text'))
        item["like_count"] = get(pin.css('em.likeCountSmall::text'))
        item["comment_count"] = get(pin.css('em.commentCountSmall::text'))
        item["board_name"] = get(pin.css('div.creditTitle::text'))
        item["pin_description"] = get(pin.css('p.pinDescription::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items

def parse(self, response):
    selector = Selector(response)
    items = []
    for BasicInfo in selector.css('div.userProfilePage'):
        item["company_pins"] = get(pin.css('div.PinCount::text'))
        item["company_likes"] = get(pin.css('ul.userStats li~ li+ li a::text'))
        item["company_name"] = get(pin.css('h1.userProfileHeaderName::text'))
        item["company_followers"] = get(pin.css('a.FollowerCount .buttonText::text'))
        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items
I know this is wrong, but I don't know where or how to put it. Should I use a Request with a callback?
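The snippets above call a helper named get that is not defined anywhere in this post. A minimal sketch of what such a helper might look like follows, assuming it receives a selector list exposing .extract() (as Scrapy's selector lists do) and should return the first match, stripped, or a default when nothing matched. The FakeSelectorList class is purely hypothetical and exists only so the sketch runs without a crawl.

```python
def get(selector_list, default=None):
    """Return the first extracted value, stripped, or `default` if nothing matched."""
    values = selector_list.extract()
    return values[0].strip() if values else default


class FakeSelectorList:
    """Hypothetical stand-in for a Scrapy selector list (illustration only)."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        # Mirrors the real method's contract: a list of matched strings.
        return list(self._values)


print(get(FakeSelectorList(["  42 \n"])))  # first match, whitespace stripped
print(get(FakeSelectorList([])))           # nothing matched: the default
```

Returning a default instead of indexing blindly means a pin with a missing field yields an empty value rather than raising IndexError. Newer Scrapy releases also offer extract_first() on selector lists, which covers much of the same ground apart from the stripping.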
Answer 0 (score: 2)
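I rewrote your spider so that it loops over each <div class="pinWrapper"> element (named pin) and uses CSS selectors. You could of course use relative XPath expressions instead, i.e. ones starting with ".//" rather than "//": //div[@class='creditTitle']/text() should be .//div[@class='creditTitle']/text().
Note that ::text and ::attr(attribute_name) are extensions that Scrapy adds to the CSS selector syntax.
With these changes the spider scrapes 25 items per page.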
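One possible reason a per-pin loop returns more rows than zipping page-wide lists is that zip() stops at the shortest list, so a single pin without a description shortens the output and misaligns the rest. The sketch below illustrates that difference. It is not the spider itself: it uses the standard library's xml.etree.ElementTree in place of Scrapy selectors so it runs without a crawl, and the markup is a hypothetical, heavily simplified page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the second pin has no description.
PAGE = """
<div>
  <div class="pinWrapper">
    <div class="creditTitle">Travel</div>
    <p class="pinDescription">Beach at sunset</p>
  </div>
  <div class="pinWrapper">
    <div class="creditTitle">Hotels</div>
  </div>
  <div class="pinWrapper">
    <div class="creditTitle">Food</div>
    <p class="pinDescription">Street market</p>
  </div>
</div>
"""

root = ET.fromstring(PAGE.strip())

# Page-wide lists zipped together: zip() stops at the shortest list, and
# values are paired by position, not by the pin they came from.
boards = [e.text for e in root.findall(".//div[@class='creditTitle']")]
descriptions = [e.text for e in root.findall(".//p[@class='pinDescription']")]
zipped = list(zip(boards, descriptions))


def first_text(element, path):
    """Text of the first match below `element`, or None if there is none."""
    found = element.find(path)
    return found.text.strip() if found is not None and found.text else None


# Per-pin loop: every lookup is relative to one pinWrapper element, so a
# missing field becomes None instead of dropping or shifting a row.
per_pin = [
    (first_text(pin, ".//div[@class='creditTitle']"),
     first_text(pin, ".//p[@class='pinDescription']"))
    for pin in root.findall(".//div[@class='pinWrapper']")
]

print(len(zipped), zipped)
print(len(per_pin), per_pin)
```

Here the zipped version yields two rows and pairs "Hotels" with the wrong description, while the per-pin version yields three rows with None for the missing field.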