Scrapy does not return the text body of a web page with my current syntax

Time: 2014-07-26 00:20:34

Tags: python web-scraping scrapy scrapy-spider

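I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. I have successfully used a recursive web scraper built with Scrapy to parse all of the text from Wikipedia articles. However, when I apply the same code to the website referenced in the code below, it does not return any of the body text: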

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]
    #rules = [Rule(SgmlLinkExtractor(allow=()), 
                  #follow=True),
             #Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    #]
    #rules = [
        #Rule(
            #SgmlLinkExtractor(allow=('Regions/252/Tournaments/2',)), 
            #callback='parse_item',
            #follow=True,
        #)
    #]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')  


execute(['scrapy','crawl','goal3'])

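An example of a page I might want to look at is:

http://www.whoscored.com/Articles/pn4gahfw90kjwje-yx7ztq/Show/Player-Focus-Potential-Change-in-System-may-Convince-Vidal-to-Leave-Juventus

As I understand it, the code above should extract every text string found on the page and join them together. The HTML of the example page wraps its text in <p> tags, so I am not sure why this does not work. Can anyone see an obvious reason why all I get back with this code is the footer?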

1 answer:

Answer 0 (score: 2)

The body of parse_item() is a little muddled. Here is a fixed version that gets the text from all paragraphs (p tags) and joins it:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.markup import remove_tags


class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True, callback='parse_item')]

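    def parse_item(self, response):
        # Extract every <p> element on the page as an HTML string.
        paragraphs = response.selector.xpath("//p").extract()
        # Strip the tags from each paragraph, encode it as UTF-8, and join the pieces.
        text = "".join(remove_tags(paragraph).encode('utf-8') for paragraph in paragraphs)
        print text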

For this page, it prints:

"There is no budget, there is money. We are in a very strong financial position. We can make big signings." Music to the ears of Manchester United fans as vice-chairman Ed Woodward confirmed the club can make big-money acquisitions in this very transfer window. In a bid to return to the summit of England’s top tier, Woodward has effectively given the green light to a spending spree that has supporters rubbing their hands with glee. Ander Herrara and Luke Shaw have arrived for a combined £59m already this summer and the carousel through the Old Trafford entrance door shows no sign of slowing down. Ángel Di María, Mats Hummels and Daley Blind, amongst others, have all been linked with a move to United, while reports suggesting midfield pitbull Arturo Vidal is set to join Louis van Gaal’s side refuse to die down.  "I’m still on holiday at the moment. Can I say I’m staying at Juve? I don’t know. On Monday I’ll talk to (Juventus manager, Massimili
...
 Contact Us | About Us | Glossary | Privacy Policy | WhoScored Ratings
            Copyright © 2014 WhoScored.com
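
The strip-and-join step in parse_item() can also be tried in isolation, without starting a crawl. The following is only a minimal sketch under some assumptions: it is written for Python 3 rather than the Python 2.7 used above, it uses the standard library's html.parser in place of Scrapy's XPath selector and remove_tags, and the names ParagraphText, paragraph_text and sample are hypothetical, not part of Scrapy or of the original code.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0    # number of <p> elements currently open
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside a paragraph. Nested tags such as
        # <b> or <a> contribute their text but not their markup.
        # Assumption: every <p> has an explicit closing </p>.
        if self.depth:
            self.chunks.append(data)


def paragraph_text(html):
    """Return the concatenated text of every <p> in an HTML string."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Made-up markup for illustration; not taken from whoscored.com.
sample = (
    "<html><head><title>Demo</title></head><body>"
    "<div id='nav'>Menu</div>"
    "<p>First <b>bold</b> paragraph. </p>"
    "<p>Second paragraph.</p>"
    "<div><p>Footer text</p></div>"
    "</body></html>"
)
print(paragraph_text(sample))
```

In this sketch the footer text is collected along with the article text because it also sits inside a p tag. That is presumably why the output above ends with the site's footer links and copyright line.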