Recursive web scraper using Scrapy not printing text from pages to screen

Asked: 2014-07-25 18:40:01

Tags: python web-scraping scrapy scrapy-spider

I'm using Python 2.7 (64-bit, from Python.org) on Windows Vista 64-bit. I'm building a recursive web scraper that works when extracting text from a single page, but not when crawling across multiple pages. The code is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1
    rules = [Rule(SgmlLinkExtractor(allow=()), 
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

A sample of the output I get from it is as follows:

2014-07-25 19:31:32+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Players/133260/Show/Michael-Ngoo> (referer: http://www.whoscored.com/Players/14170/Show/Ishmael-Miller)
2014-07-25 19:31:33+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/160/Show/England-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour)
2014-07-25 19:31:33+0100 [goal3] DEBUG: Filtered offsite request to 'www.cafc.co.uk': <GET http://www.cafc.co.uk/page/Home>
2014-07-25 19:31:34+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Matches/721465/Live/England-Championship-2013-2014-Nottingham-Forest-Charlton> (referer: http://www.whoscored.com/Players/10794/Show/Rafik-Djebbour)
2014-07-25 19:31:36+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/126/News> (referer: http://www.whoscored.com/Teams/1426/News)
2014-07-25 19:31:36+0100 [goal3] DEBUG: Filtered offsite request to 'www.fcsochaux.fr': <GET http://www.fcsochaux.fr/fr/index.php?lng=fr>
2014-07-25 19:31:37+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/976/News> (referer: http://www.whoscored.com/Teams/1426/News)
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.grenoblefoot38.fr': <GET http://www.grenoblefoot38.fr/>
2014-07-25 19:31:37+0100 [goal3] DEBUG: Filtered offsite request to 'www.as.com': <GET http://www.as.com/futbol/articulo/leones-ponen-manos-obra-grenoble/20120713dasdaiftb_52/Tes>
2014-07-25 19:31:38+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/56/News> (referer: http://www.whoscored.com/Teams/53/News)
2014-07-25 19:31:38+0100 [goal3] DEBUG: Filtered offsite request to 'www.realracingclub.es': <GET http://www.realracingclub.es/default.aspx>
2014-07-25 19:31:39+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/125/News> (referer: http://www.whoscored.com/Teams/146/News)
2014-07-25 19:31:39+0100 [goal3] DEBUG: Filtered offsite request to 'www.asnl.net': <GET http://www.asnl.net/pages/club/entraineurs.html>
2014-07-25 19:31:40+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/425/News> (referer: http://www.whoscored.com/Teams/24/News)
2014-07-25 19:31:40+0100 [goal3] DEBUG: Filtered offsite request to 'www.dbu.dk': <GET http://www.dbu.dk/>
2014-07-25 19:31:42+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/282/News> (referer: http://www.whoscored.com/Teams/50/News)
2014-07-25 19:31:42+0100 [goal3] DEBUG: Filtered offsite request to 'www.fc-koeln.de': <GET http://www.fc-koeln.de/index.php?id=10>
2014-07-25 19:31:43+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/58/News> (referer: http://www.whoscored.com/Teams/131/News)
2014-07-25 19:31:43+0100 [goal3] DEBUG: Filtered offsite request to 'www.realvalladolid.es': <GET http://www.realvalladolid.es/>
2014-07-25 19:31:44+0100 [goal3] DEBUG: Crawled (200) <GET http://www.whoscored.com/Teams/973/News> (referer: http://www.whoscored.com/Teams/145/News)
2014-07-25 19:31:44+0100 [goal3] DEBUG: Filtered offsite request to 'www.fifci.org': <GET http://www.fifci.org/>

I can understand that the filtered external links are outside the crawler's scope, but what I can't understand is why the only output is these 'DEBUG:' messages and page links, especially when a successful HTTP 200 response code is printed for all of these results — the text that parse_item should print never appears.

Can anyone see what the problem is here?

Thanks

1 Answer:

Answer 0 (score: 1):

You only need a single rule with follow=True. When several rules match the same link, CrawlSpider uses only the first one in the list; your first rule matches every link and has no callback, so parse_item is never invoked. Combine following and the callback into one rule:

rules = [Rule(SgmlLinkExtractor(), follow=True, callback='parse_item')]
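To see why the original two-rule setup silently drops the callback, here is a minimal stdlib sketch of the first-match dispatch behaviour described above. The `dispatch` function and rule tuples are illustrative toys, not Scrapy's actual internals:

```python
# Toy model of CrawlSpider rule dispatch: the FIRST matching rule wins.
def dispatch(rules, link):
    """Return the callback name chosen for a link, or None."""
    for matches, callback in rules:
        if matches(link):
            return callback  # later rules are never consulted
    return None

def match_all(link):
    return True  # stands in for SgmlLinkExtractor(allow=())

# The question's setup: rule 1 matches everything but has no callback,
# so rule 2's 'parse_item' is unreachable.
broken_rules = [(match_all, None), (match_all, 'parse_item')]

# The answer's fix: one rule that both follows links and names a callback.
fixed_rules = [(match_all, 'parse_item')]

print(dispatch(broken_rules, 'http://www.whoscored.com/Teams/56/News'))  # None
print(dispatch(fixed_rules, 'http://www.whoscored.com/Teams/56/News'))   # parse_item
```

With the single combined rule, every crawled page is both followed for further links and passed to parse_item, so the printed text should start appearing alongside the DEBUG lines.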