Extracting links from a web page with Python Scrapy

Time: 2015-03-18 09:36:36

Tags: python scrapy

I am a Python beginner and am using Scrapy to extract links from the following page: http://www.basketball-reference.com/leagues/NBA_2015_games.html

The code I wrote is:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from basketball.items import BasketballItem

class BasketballSpider(CrawlSpider):

   name = 'basketball'
   allowed_domains = ['basketball-reference.com/']
   start_urls = ['http://www.basketball-reference.com/leagues/NBA_2015_games.html']
   rules = [Rule(LinkExtractor(allow=['http://www.basketball-reference.com/boxscores/^\w+$']), 'parse_item')]

   def parse_item(self, response):
       item = BasketballItem()
       item['url'] = response.url
       return item

I run this code from the command prompt, but the output file it creates contains no links. Can anyone help?

2 answers:

Answer 0: (score: 1)

No links are found because the regular expression in the rule never matches. Fix it:

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'))
]

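Also, Rule takes the callback as its second argument, so passing 'parse_item' positionally, as the question does, is already valid; the regular expression is what kept links from being found. Be aware that Rule has no default callback: if you leave it out, as in the snippet above, CrawlSpider follows the matched links but never calls parse_item.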

allow can also be given as a single string instead of a list.
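
To see why the original pattern finds nothing, here is a minimal sketch using only the standard re module. It assumes LinkExtractor applies each allow pattern as an unanchored regex search over the absolute URL, which is my understanding of its behavior. The box-score URL below is a hypothetical example, not one taken from the question.

```python
import re

# Hypothetical box-score URL, used only for illustration.
url = "http://www.basketball-reference.com/boxscores/201410280LAL.html"

# Pattern from the question: "^" placed after "boxscores/" asserts
# start-of-string in the middle of the URL, so it can never match.
# Even without "^", the trailing "\w+$" would fail on ".html",
# because "." is not a word character.
original = re.compile(r"http://www.basketball-reference.com/boxscores/^\w+$")

# Pattern from the answer: an unanchored substring match.
fixed = re.compile(r"boxscores/\w+")

original_matches = original.search(url) is not None
fixed_matches = fixed.search(url) is not None

print(original_matches)  # False
print(fixed_matches)     # True
```

Writing the pattern as a raw string (r'...') also keeps Python from treating \w as a string escape.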

Answer 1: (score: 0)

rules = [
    Rule(LinkExtractor(allow='boxscores/\w+'), callback='parse_item')
]