I'm just getting started with web scraping and I'm having trouble getting the right titles from a website. I'm currently following this guide: https://www.youtube.com/watch?v=1EFnX1UkXVU It tells you to create this class:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://ksu.craigslist.org/search/foa"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for title_sel in titles:
            title = title_sel.xpath("a/text()").extract()
            link = title_sel.xpath("a/@href").extract()
            print title, link
The problem is that the link only prints a bunch of /for/###numbers###.html entries, and the title prints nothing at all. I'm not sure why this is happening. I've read earlier threads and changed a few things, but I still run into the same problem.
Answer 0 (score: 3):
Scrapy XPath selectors will extract what is in the HTML of that page, based on the XPath you give them, nothing more.

Let's use scrapy shell to look at your example URL and its response in scrapy, and test your XPaths (note that I'm using Scrapy 1.0 and have stripped some result lines):
(scrapy10)paul@paul$ scrapy shell -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.125 Safari/537.36" http://ksu.craigslist.org/search/foa
2015-06-22 19:23:26 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
...
2015-06-22 19:23:27 [scrapy] INFO: Spider opened
2015-06-22 19:23:27 [scrapy] DEBUG: Crawled (200) <GET http://ksu.craigslist.org/search/foa> (referer: None)
...
>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     print paragraph.xpath('a').extract()
...     print paragraph.xpath('a/text()').extract()
...
----------
[u'<a href="/for/5083735342.html" class="i" data-ids="0:00G0G_gb8cQBnOWca"><span class="price">$50</span></a>']
[]
----------
[u'<a href="/for/5042795578.html" class="i" data-ids="0:00t0t_g8XHo4mq1Wb,0:01111_7GErRQg9kc9"><span class="price">$10</span></a>']
[]
----------
[u'<a href="/for/5042796585.html" class="i" data-ids="0:00W0W_kAKZ780MVd4"><span class="price">$10</span></a>']
[]
----------
[u'<a href="/for/5070157083.html" class="i" data-ids="0:00H0H_l93i9PS7WEC,0:00V0V_4HHMk6zAcvp,0:01010_586rQh4KX7Y,0:00l0l_5t6DbernooP"><span class="price">$1100</span></a>']
[]
----------
[u'<a href="/for/5083629657.html" class="i" data-ids="0:01111_7ccUivz24cL"><span class="price">$2</span></a>']
[]
----------
[u'<a href="/for/5083317838.html" class="i"><span class="price">$275</span></a>']
[]
----------
[u'<a href="/for/5056913265.html" class="i" data-ids="0:00J0J_jAZGd05f59U"><span class="price">$25</span></a>']
[]
----------
[u'<a href="/for/5083138728.html" class="i" data-ids="0:00q0q_80N4SDtfsmz"><span class="price">$40</span></a>']
[]
----------
[]
[]
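The empty lists above can be reproduced outside of Scrapy: its selectors are built on lxml, so a minimal sketch (assuming lxml is installed, and using a made-up single-listing HTML fragment mimicking the structure above) shows the same behavior:

```python
# Minimal reproduction of the empty a/text() result, using lxml,
# the library underlying Scrapy's selectors. The HTML fragment is
# a hypothetical single listing shaped like the output above.
from lxml import html

doc = html.fromstring(
    '<p><a href="/for/5083735342.html" class="i">'
    '<span class="price">$50</span></a></p>'
)

# a/text() selects only *direct* child text nodes of <a>;
# there are none, because "$50" lives inside the <span> child.
print(doc.xpath('//p/a/text()'))   # []

# a//text() selects *descendant* text nodes, reaching into <span>.
print(doc.xpath('//p/a//text()'))  # ['$50']

# string(...) returns the string value of the whole element.
print(doc.xpath('string(//p/a)'))  # $50
```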
You can see the HTML of each link when selecting the a elements of the p paragraphs and calling .extract() on them (as you do), but calling a/text() on each of them gives empty lists. You'll notice that these a tags have no (direct) child text element, which is what you are selecting with a/text(); the "text" part (the price, I believe) sits inside a span child element.

You have different options:

- select all descendant text elements of the a elements, not only child text: a//text()
- tell the XPath engine to give you the text representation of the first a element: string(a)

For the second option, you get this (some lines stripped):

>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     for a in paragraph.xpath('a'):
...         print(a.xpath('@href').extract_first(), a.xpath('string(.)').extract_first())
...
...
----------
(u'/for/5042796585.html', u'$10')
----------
(u'/for/5070157083.html', u'$1100')
----------
(u'/for/5083629657.html', u'$2')
----------
(u'/for/5083317838.html', u'$275')
----------
(u'/for/5056913265.html', u'$25')
----------
(u'/for/5083138728.html', u'$40')
----------
>>>

Note that I'm using Scrapy 1.0's convenient .extract_first() method on selectors here.

If you need absolute URLs, you can use Scrapy 1.0's .urljoin() method on Response objects:

>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     for a in paragraph.xpath('a'):
...         print(response.urljoin(a.xpath('@href').extract_first()), a.xpath('string(.)').extract_first())
...
----------
(u'http://ksu.craigslist.org/for/5070157083.html', u'$1100')
----------
(u'http://ksu.craigslist.org/for/5083629657.html', u'$2')
----------
(u'http://ksu.craigslist.org/for/5083317838.html', u'$275')
----------
(u'http://ksu.craigslist.org/for/5056913265.html', u'$25')
----------
(u'http://ksu.craigslist.org/for/5083138728.html', u'$40')
----------
>>>