Scrapy only returns a limited amount of output

Time: 2014-05-06 10:19:30

Tags: json output scrapy

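I have a spider that crawls several start_urls, but I only get a limited amount of output. When I crawl a single start_url, however, it returns every result up to the point where the page's infinite scroll takes over. Here is my spider code: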

from scrapy.spider import Spider
from scrapy.selector import Selector
from Pinterest.items import PinterestItem

class PinterestSpider(Spider):
    name = "pinterest"
    allowed_domains = ["pinterest.com"]
    start_urls = [
        "http://www.pinterest.com/jetsetterphoto/pins/",
        "http://www.pinterest.com/llbean/pins/",        
        "http://www.pinterest.com/nordstrom/pins/"
    ]

    def parse(self, response):
        hxs = Selector(response)
        pin_links = hxs.xpath("//div[@class='pinHolder']/a/@href").extract()
        repin_counts = hxs.xpath("//em[@class='socialMetaCount repinCountSmall']/text()").extract()
        like_counts = hxs.xpath("//em[@class='socialMetaCount likeCountSmall']/text()").extract()
        comment_counts = hxs.xpath("//em[@class='socialMetaCount commentCountSmall']/text()").extract()
        board_names = hxs.xpath("//div[@class='creditTitle']/text()").extract()
        pin_descriptions = hxs.xpath("//p[@class='pinDescription']/text()").extract()

        items = []
        for pin_link, repin_count, like_count, comment_count, board_name, pin_description in zip(pin_links, repin_counts, like_counts, comment_counts, board_names, pin_descriptions):
            item = PinterestItem()
            item["pin_link"] = pin_link.strip()
            item["repin_count"] = repin_count.strip()
            item["like_count"] = like_count.strip()
            item["comment_count"] = comment_count.strip()
            item["board_name"] = board_name.strip()
            item["pin_description"] = pin_description.strip()
            items.append(item)
        return items

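I can see the crawler fetching the start_urls, but the JSON file only contains 16 lines of output. When I use a single start_url, I get more lines (all of them, up to the infinite-scroll point). Is there a setting that limits the number of requests I can make? I looked for similar questions but could not find one like mine. Any ideas?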

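Edit: Could this be related to the CONCURRENT_REQUESTS_PER_DOMAIN setting? See http://doc.scrapy.org/en/latest/topics/settings.html. The maximum specified there is 8, and I only get 8 lines of output per domain. The total number of concurrent requests defaults to 16, which would explain why I only get results for 2 of the start_urls. I will change the defaults and test whether that helps (I don't know whether this makes sense to anyone else).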
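For reference, a sketch of how those two settings could be raised in the project's settings.py to test this theory. The setting names and their defaults (16 and 8) come from the Scrapy settings documentation linked above; the new values are arbitrary and only meant for the experiment. Note that both settings cap how many requests are in flight at the same time, not how many items end up in the output.

```python
# settings.py (fragment): the values below are arbitrary, chosen only for the test

# Maximum number of requests the downloader performs in parallel overall.
# Scrapy's documented default is 16.
CONCURRENT_REQUESTS = 32

# Maximum number of parallel requests sent to any single domain.
# Scrapy's documented default is 8.
CONCURRENT_REQUESTS_PER_DOMAIN = 16
```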

Edit: I would like to add this to my spider to extract some basic profile information:

for BasicInfo in selector.css('div.userProfilePage'):
    item["company_pins"] = get(pin.css('div.PinCount::text'))
    item["company_likes"] = get(pin.css('ul.userStats li~ li+ li a::text'))
    item["company_name"] = get(pin.css('h1.userProfileHeaderName::text'))
    item["company_followers"] = get(pin.css('a.FollowerCount .buttonText::text'))

The code would then look like this:

def parse(self, response):
    selector = Selector(response)
    items = []
    for pin in selector.css('div.pinWrapper'):
        item = PinterestItem()            
        item["pin_link"] = get(pin.css('div.pinHolder a::attr(href)'))
        item["repin_count"] = get(pin.css('em.repinCountSmall::text'))
        item["like_count"] = get(pin.css('em.likeCountSmall::text'))
        item["comment_count"] = get(pin.css('em.commentCountSmall::text'))
        item["board_name"] = get(pin.css('div.creditTitle::text'))
        item["pin_description"] = get(pin.css('p.pinDescription::text'))

        items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items

def parse(self, response):
    selector = Selector(response)
    items = []
    for BasicInfo in selector.css('div.userProfilePage'):
        item["company_pins"] = get(pin.css('div.PinCount::text'))
        item["company_likes"] = get(pin.css('ul.userStats li~ li+ li a::text'))
        item["company_name"] = get(pin.css('h1.userProfileHeaderName::text'))
        item["company_followers"] = get(pin.css('a.FollowerCount .buttonText::text'))

items.append(item)
    self.log("extracted %d item(s) from %s" % (len(items), response.url))
    return items

I know this is wrong, but I don't know where or how to put it. Should I use a Request with a callback?
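One thing that is certain about the draft above: a Python class that defines parse twice keeps only the second definition, so the pin-extraction method would be silently replaced. The sketch below shows that with plain Python (no Scrapy), and then one possible layout in which a single parse reads the page-level fields once and copies them into every pin item. That layout only applies if the profile fields are on the same page as the pins; a Request with a callback is what you would use when the data is on a different URL. The class names and the fake_response dict are hypothetical stand-ins.

```python
class DraftSpider:
    """Plain-Python stand-in for the draft above (no Scrapy needed)."""

    def parse(self, response):
        return "pin items"

    # Same name again: this rebinds `parse`, so the method above is discarded.
    def parse(self, response):
        return "profile info"


class MergedSpider:
    """One possible layout: a single parse() that handles both kinds of data."""

    def parse(self, response):
        # Page-level fields are read once...
        profile = {"company_name": response["company_name"]}
        items = []
        for pin in response["pins"]:
            item = {"pin_link": pin}
            # ...and copied into every per-pin item.
            item.update(profile)
            items.append(item)
        return items


# `fake_response` is a hypothetical dict standing in for a parsed page.
fake_response = {"company_name": "Example Co", "pins": ["/pin/1/", "/pin/2/"]}

print(DraftSpider().parse(fake_response))   # only the second parse() survives
print(MergedSpider().parse(fake_response))
```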

1 answer:

Answer 0 (score: 2)

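I rewrote your spider to:

  • loop over the <div class="pinWrapper"> elements,
  • and use CSS selectors relative to each pin (you could of course use relative XPath expressions instead, i.e. ones starting with ".//" rather than "//": //div[@class='creditTitle']/text() should be .//div[@class='creditTitle']/text())

Note that ::text and ::attr(attribute_name) are extensions that Scrapy adds to the CSS selector syntax.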
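A likely reason the per-pin loop matters (this is my reading of the question's code, not something checked against the live pages): the original parse builds six page-wide lists and combines them with zip(), and zip() stops at the shortest one. If even a few pins lack, say, a comment count, the output is cut down to that length and the remaining values can end up attached to the wrong pin. A small sketch with made-up data, no Scrapy required:

```python
# Hypothetical data: three pins on a page, but only two of them show a comment count.
pin_links = ["/pin/1/", "/pin/2/", "/pin/3/"]
comment_counts = ["4", "1"]

# zip() stops at the shortest input, so the third pin is silently lost,
# and the count "1" gets paired with the second pin instead of the third.
flat = list(zip(pin_links, comment_counts))
print(len(flat))

# Per-pin extraction: each pin carries its own (possibly missing) fields.
pins = [
    {"link": "/pin/1/", "comments": "4"},
    {"link": "/pin/2/"},
    {"link": "/pin/3/", "comments": "1"},
]
per_pin = [(pin["link"], pin.get("comments")) for pin in pins]
print(len(per_pin))
```

Looping over each pin element and querying inside it keeps every pin, with missing fields showing up as empty values instead of shifting the rest.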

The following code scrapes 25 items per page:
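The parse method quoted in the question's edit, which appears to follow this answer, calls a get helper whose definition is not shown. A minimal sketch of what such a helper might look like, assuming it returns the first match with whitespace stripped, or None when the selector matched nothing. The only Scrapy behaviour relied on is that a selector list offers extract() returning a list of strings; FakeSelection is a hypothetical stand-in so the snippet runs without Scrapy installed.

```python
def get(selection, default=None):
    """Return the first extracted value, stripped, or `default` if nothing matched.

    `selection` is expected to behave like a Scrapy SelectorList, i.e. to
    offer an extract() method that returns a list of strings.
    """
    values = selection.extract()
    return values[0].strip() if values else default


class FakeSelection:
    """Hypothetical stand-in for a SelectorList so this runs without Scrapy."""

    def __init__(self, values):
        self._values = values

    def extract(self):
        return list(self._values)


print(get(FakeSelection(["  12 \n"])))  # first match, whitespace stripped
print(get(FakeSelection([])))           # nothing matched, so the default is returned
```

Returning a default instead of raising on an empty match is what lets a pin with a missing field still produce an item.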