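I'm using the Python.org 2.7 64-bit shell on Windows Vista. I have Scrapy installed, and it seems to be stable and working. However, I copied the following simple piece of code: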
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for titles in titles:
            title = titles.select("a/text()").xpath()
            link = titles.select("a/@href").xpath()
            print title, link
which appears in this YouTube video:
http://www.youtube.com/watch?v=1EFnX1UkXVU
When I run this code, I get the following warnings:
hxs = HtmlXPathSelector(response)
C:\Python27\mrscrap\mrscrap\spiders\test.py:11: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
titles = hxs.select("//p")
c:\Python27\lib\site-packages\scrapy\selector\unified.py:106: ScrapyDeprecationWarning: scrapy.selector.HtmlXPathSelector is deprecated, ins
.Selector instead.
for x in result]
C:\Python27\mrscrap\mrscrap\spiders\test.py:13: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
title = titles.select("a/text()").extract()
C:\Python27\mrscrap\mrscrap\spiders\test.py:14: ScrapyDeprecationWarning: Call to deprecated function select. Use .xpath() instead.
link = titles.select("a/@href").extract()
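Has some of Scrapy's syntax changed recently, so that .extract() is no longer valid? I tried using .xpath() instead, but that raises an error saying .xpath() takes two arguments, and I'm not sure what to pass there.
Any ideas?
Thanks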
Answer 0 (score: 2)
Following the other answer, it should be
title = titles.xpath("a/text()").extract()
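Answer 1 (score: 1)
It is not extract that is the problem (extract is still valid), it is select. The selector API changed recently, as 1478963 mentioned in the comments (time runs so fast that "recently" is already a year or so ago...).
We no longer use HtmlXPathSelector, but a general Selector that has both an xpath() and a css() method. With Selector you can choose between the two, or even mix them, simply by calling one method or the other.
An example in the new style should look like this (untested):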
from scrapy.spider import BaseSpider
from scrapy.selector import Selector

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/sfc/npo/"]

    def parse(self, response):
        titles = response.selector.xpath("//p")
        for titles in titles:
            title = titles.xpath("a/text()").extract()
            link = titles.xpath("a/@href").extract()
            print title, link
Answer 2 (score: 0)
The code should look like this (tested). Aufziehvogel's code got me 95% of the way there.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistSampleItem

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/search/npo"]

    def parse(self, response):
        titles = response.selector.xpath("//p")
        items = []
        for titles in titles:
            item = CraigslistSampleItem()
            item["title"] = titles.xpath("a/text()").extract()
            item["link"] = titles.xpath("a/@href").extract()
            items.append(item)
        return items