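I am a Python and Scrapy noob, and I am facing a problem I can't seem to find a way out of. I am trying to scrape the ingredients of recipes from www.tarladalal.com with the following Scrapy spider code: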
import scrapy
import os
import re
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from tarla.items import Ingredient, TarlaDalal


class TarlaDalalSpider(CrawlSpider):
    name = "tarla"
    start_url = ["http://www.tarladalal.com/recipes-for-breakfast-151",
                 "http://www.tarladalal.com/recipes-for-accompaniments-244",
                 ]
    rules = (
        # Follow pagination
        Rule(LinkExtractor(allow=(r'\?pageindex=\d+',)), follow=True),
        # Extract recipes
        Rule(LinkExtractor(allow=(r'[0-9]r',), deny=r'/display-comments*'),
             callback='parse_recipe', follow=True)
    )

    def parse_recipe(self, response):
        hxs = HtmlXPathSelector(response)
        recipe = TarlaDalal()
        ing = []
        # name
        try:
            recipe['name'] = hxs.select("//h1[@itemprop = 'name']/text()")[0].extract().strip()
        except:
            pass
        # ingredients: one Ingredient item per ingredient node
        ingredient_nodes = hxs.select("//div[@id = 'rcpinglist']/span[@itemprop = 'ingredient']")
        for ingredient_node in ingredient_nodes:
            try:
                quantity = ingredient_node.select("span[@itemprop = 'amount']/text()").extract()[0]
                name = ingredient_node.select("div[@itemprop = 'name']/text()").extract()[0]
            except:
                # skip nodes that lack an amount or a name
                continue
            ingred = Ingredient()
            ingred['name'] = name
            ingred['quantity'] = quantity
            ing.append(ingred)
        recipe['ingredients'] = ing
        # preparation time
        try:
            recipe['prep_time'] = hxs.select("//time[@itemprop = 'prepTime']/text()").extract()[0]
        except:
            pass
        # cooking time
        try:
            recipe['cook_time'] = hxs.select("//time[@itemprop = 'cookTime']/text()").extract()[0]
        except:
            pass
        return recipe

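Given this spider, instead of crawling, Scrapy closes the spider immediately after starting the telnet console, without even making the first request for the start URLs.
If I debug it with Scrapy's command-line parse command for this specific spider,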
scrapy parse --spider=tarla "http://www.tarladalal.com/recipes-for-breakfast-151"
it does extract basically all of the required links available on that page.
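For what it's worth, the two link patterns also look fine when I check them in isolation. Below is a minimal standalone sketch using only `re`, not Scrapy itself. It assumes the link extractor applies `allow` and `deny` with regex search semantics and that the rules are tried in order; the `classify` helper and the sample URLs are made up for illustration and are not real pages I have verified.

```python
import re

# Patterns copied from the spider's rules.
PAGINATION = re.compile(r'\?pageindex=\d+')
RECIPE = re.compile(r'[0-9]r')
DENY = re.compile(r'/display-comments*')


def classify(url):
    """Guess which rule a URL falls under (hypothetical helper).

    Assumes search (not full-match) semantics and that the
    pagination rule is tried before the recipe rule.
    """
    if PAGINATION.search(url):
        return "pagination"
    if RECIPE.search(url) and not DENY.search(url):
        return "recipe"
    return "ignored"


# Hypothetical URLs shaped like the ones on the site.
samples = [
    "http://www.tarladalal.com/recipes-for-breakfast-151?pageindex=2",
    "http://www.tarladalal.com/Some-Recipe-1234r",
    "http://www.tarladalal.com/display-comments-1234r",
    "http://www.tarladalal.com/recipes-for-breakfast-151",
]

for url in samples:
    print(classify(url), url)
```

So the patterns do not seem to be what stops the crawl from starting, at least under these assumptions.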
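I am using Scrapy v1.0.3 and Python 2.7.
Any pointers on why this is happening would be very helpful. I am also a first-time poster here, so I apologize for any unintentional lack of information.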