Question

I am working in the scrapy shell. The URL I am trying to scrape is: http://allegro.pl/sportowe-uzywane-251188?a_enum[127779][15]=15&a_text_i[1][0]=2004&a_text_i[1][1]=2009&a_text_i[5][0]=950&id=251188&offerTypeBuyNow=1&order=p&string=gsxr&bmatch=base-relevance-aut-1-5-0913

However, when I run view(response), I get a blank page. It looks like the page was not loaded:

>>> response.css("title")
[]

Now the interesting part: sometimes the same set of commands loads the page correctly.

Answer 1

This works for me. I suggest you start from a very basic tutorial:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://allegro.pl/sportowe-uzywane-251188?a_enum%5B127779%5D%5B15%5D=15&a_text_i%5B1%5D%5B0%5D=2004&a_text_i%5B1%5D%5B1%5D=2009&a_text_i%5B5%5D%5B0%5D=950&id=251188&offerTypeBuyNow=1&order=p&string=gsxr&bmatch=base-relevance-aut-1-1-0913']

    def parse(self, response):
        # print() with parentheses works on both Python 2 and Python 3
        print("----------------------------------------------------------------")
        print(response.body)
        print("----------------------------------------------------------------")

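I am able to see the body of the page. Calling view(response) inside the spider is wrong: it is an undefined function there.

Save this code as myspider.py and run it with scrapy runspider myspider.py. You will see one big string printed to your terminal, which is the page body between the two lines of dashes.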

For Scrapy Shell:

Start in shell mode: scrapy shell

Run:

>>> fetch("http://allegro.pl/sportowe-uzywane-251188?a_enum%5B127779%5D%5B15%5D=15&a_text_i%5B1%5D%5B0%5D=2004&a_text_i%5B1%5D%5B1%5D=2009&a_text_i%5B5%5D%5B0%5D=950&id=251188&offerTypeBuyNow=1&order=p&string=gsxr&bmatch=base-relevance-aut-1-1-0913")
>>> view(response)

It will open the scraped page in your default browser. Your URL works for me.

The title tag shows:

>>> response.css("title")
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Gsxr w Sportowe U\u017cywane - Motocyk'>]

The scraped page is saved under the /tmp directory, in my case as /tmp/tmpn8wziQ.html
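Note that the URLs used in this answer have the square brackets percent-encoded (%5B and %5D), while the URL in the question has them raw. The thread does not establish whether this matters for the blank page, so treat the following only as a minimal sketch of how one form can be converted into the other. It uses Python 3's standard urllib.parse; the variable names raw_url and encoded_url are illustrative and not from the original posts.

```python
from urllib.parse import quote, unquote

# URL as written in the question, with raw square brackets in the query string.
raw_url = (
    "http://allegro.pl/sportowe-uzywane-251188"
    "?a_enum[127779][15]=15&a_text_i[1][0]=2004&a_text_i[1][1]=2009"
    "&a_text_i[5][0]=950&id=251188&offerTypeBuyNow=1&order=p&string=gsxr"
    "&bmatch=base-relevance-aut-1-5-0913"
)

# Leave the URL delimiters (: / ? & =) untouched; of the remaining
# characters in this URL, only "[" and "]" need percent-encoding.
encoded_url = quote(raw_url, safe=":/?&=")

print(encoded_url)

# Decoding restores the original string, so both forms name the same page.
print(unquote(encoded_url) == raw_url)
```

The encoded string can then be passed to fetch() in the shell or placed in start_urls, as in the examples above.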

Answer 2

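Many thanks to mertyildiran for the help.

scrapy shell did not work for me. Sometimes it scraped the page, but most of the time it did not. I don't know why.

Anyway, I ended up with code that works every time.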

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "allegro"
    start_urls = ['http://allegro.pl/sportowe-uzywane-251188?a_enum%5B127779%5D%5B15%5D=15&a_text_i%5B1%5D%5B0%5D=2004&a_text_i%5B1%5D%5B1%5D=2009&a_text_i%5B5%5D%5B0%5D=950&id=251188&offerTypeBuyNow=1&order=p&string=gsxr&bmatch=base-relevance-aut-1-1-0913']

    def parse(self, response):
        # Each offer on the results page is an <article class="offer"> element
        for lista in response.css("article.offer"):
            yield {
                # extract() returns a list of all matching href values
                'link': lista.css('a.offer-title::attr(href)').extract(),
            }

scrapy shell does not open a long link
