Scrapy crawling problem

Date: 2015-08-21 07:17:07

Tags: python scrapy

I'm a Python and Scrapy noob and am facing a problem I can't seem to find my way out of. I'm trying to scrape recipe ingredients from www.tarladalal.com with my Scrapy spider code:

import scrapy
import os
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector


from tarla.items import TarlaDalal, Ingredient

class TarlaDalalSpider(CrawlSpider):
    name = "tarla"
    start_url = ["http://www.tarladalal.com/recipes-for-breakfast-151",
                 "http://www.tarladalal.com/recipes-for-accompaniments-244",
                 ]

    rules = (
        # Follow pagination
        Rule(LinkExtractor(allow=(r'\?pageindex=\d+',)), follow=True),

        # Extract recipes (recipe URLs end in digits followed by 'r')
        Rule(LinkExtractor(allow=(r'[0-9]r',), deny=r'/display-comments*'),
             callback='parse_recipe', follow=True),
    )

    def parse_recipe(self, response):
        hxs = HtmlXPathSelector(response)
        recipe = TarlaDalal()
        ing = []

        # name
        try:
            recipe['name'] = hxs.select("//h1[@itemprop = 'name']/text()")[0].extract().strip()
        except:
            pass

        # ingredients
        ingredient_nodes = hxs.select("//div[@id = 'rcpinglist']/span[@itemprop = 'ingredient']")
        for ingredient_node in ingredient_nodes:
            try:
                quantity = ingredient_node.select("span[@itemprop = 'amount']/text()").extract()[0]
                name = ingredient_node.select("div[@itemprop = 'name']/text()").extract()[0]
            except:
                continue

            ingred = Ingredient()
            ingred['name'] = name
            ingred['quantity'] = quantity
            ing.append(ingred)

        recipe['ingredients'] = ing

        # timings
        try:
            recipe['prep_time'] = hxs.select("//time[@itemprop = 'prepTime']/text()").extract()[0]
        except:
            pass
        try:
            recipe['cook_time'] = hxs.select("//time[@itemprop = 'cookTime']/text()").extract()[0]
        except:
            pass
        return recipe
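
For reference, the spider assumes an items module along these lines. tarla/items.py was not posted, so this is only a minimal sketch reconstructed from the fields used above:

import scrapy

# Hypothetical tarla/items.py, inferred from the fields the spider
# reads and writes; the real definitions were not posted.
class Ingredient(scrapy.Item):
    name = scrapy.Field()      # ingredient name
    quantity = scrapy.Field()  # amount string, e.g. "1 cup"

class TarlaDalal(scrapy.Item):
    name = scrapy.Field()         # recipe title
    ingredients = scrapy.Field()  # list of Ingredient items
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()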

Given this script, instead of crawling, Scrapy closes the spider immediately after starting the telnet console, without even issuing the first requests for the start URLs.

If I debug Scrapy with the command-line parse for the specific spider:

    scrapy parse --spider=tarla "http://www.tarladalal.com/recipes-for-breakfast-151"

it can basically extract all the required available links from the URL's page.
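
For what it's worth, the same command can also exercise the CrawlSpider rules, or a specific callback, directly; a sketch (the second URL is a placeholder for an actual recipe page):

    scrapy parse --spider=tarla --rules "http://www.tarladalal.com/recipes-for-breakfast-151"
    scrapy parse --spider=tarla -c parse_recipe "<a recipe URL>"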

I'm using Scrapy v1.0.3 and Python 2.7.

Any pointers as to why this problem occurs would be very helpful. First-time poster here, so I apologise for any information inadvertently left out.

0 Answers:

No answers