Question

I'm using the Python.org 2.7 64-bit shell on Windows Vista. I have Scrapy installed and it seems to be stable and working. However, I copied the following simple code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
        name = "craig"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            titles = hxs.select("//p")
            for titles in titles:
                title = titles.select("a/text()").extract()
                link = titles.select("a/@href").extract()
                print title, link

It comes from this YouTube video:

http://www.youtube.com/watch?v=1EFnX1UkXVU

When I run this code, I get these warnings:

    hxs = HtmlXPathSelector(response)
C:\Python27\mrscrap\mrscrap\spiders\test.py:11: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  titles = hxs.select("//p")
c:\Python27\lib\site-packages\scrapy\selector\unified.py:106: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, ins
.Selector instead.
  for x in result]
C:\Python27\mrscrap\mrscrap\spiders\test.py:13: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  title = titles.select("a/text()").extract()
C:\Python27\mrscrap\mrscrap\spiders\test.py:14: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
  link = titles.select("a/@href").extract()

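Has some of Scrapy's syntax changed recently, so that .extract() is no longer valid? I tried using .xpath() instead, but that raises an error saying .xpath() takes two arguments, and I'm not sure what to pass there.

Any ideas?

Thanks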

Answer 1

Following the other answer, it should be:

title = titles.xpath("a/text()").extract()

Answer 2

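The problem isn't extract (extract still works), it's select. The selector API changed recently, as 1478963 mentioned in the comments (and since time flies, "recently" was already about a year ago...).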

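We no longer use HtmlXPathSelector; instead we use the general Selector, which provides both the xpath() and css() methods. With Selector you can choose between the two, or mix them, simply by calling one method or the other.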

With the new API, your example should look like this (untested):

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

    def parse(self, response):
        titles = response.selector.xpath("//p")
        for titles in titles:
            title = titles.xpath("a/text()").extract()
            link = titles.xpath("a/@href").extract()
            print title, link

Answer 3

The code should look like this (tested). Aufziehvogel's code got me 95% of the way there.

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    from craigslist_sample.items import CraigslistSampleItem

    class MySpider(BaseSpider):
        name = "craig"
        allowed_domains = ["craigslist.org"]
        start_urls = ["http://sfbay.craigslist.org/search/npo"]

        def parse(self, response):
            titles = response.selector.xpath("//p")
            items = []
            for titles in titles:
                item = CraigslistSampleItem()
                item["title"] = titles.xpath("a/text()").extract()
                item["link"] = titles.xpath("a/@href").extract()
                items.append(item)
            return items

Scrapy Spider not scraping correctly

3 answers: