Scrapy not getting titles from the website

Asked: 2015-06-22 17:11:56

Tags: python web-crawler scrapy

I just started learning web scraping, and I've run into a problem getting the correct titles from a website. I'm currently following this guide: https://www.youtube.com/watch?v=1EFnX1UkXVU It tells you to create this class:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class MySpider(BaseSpider):
    name = "craig"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://ksu.craigslist.org/search/foa"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        for titles in titles:
            title = titles.xpath("a/text()").extract()
            link = titles.xpath("a/@href").extract()
            print title, link

The problem is that the link only prints a bunch of /for/###numbers###.html, and the title doesn't print anything at all. I'm not sure why this is happening. I've read previous threads and changed a few things, but I'm still having the same problem.

1 Answer:

Answer 0 (score: 3)

Scrapy XPath selectors will extract content from the page's HTML based on the XPath you provide, nothing more.

Let's use scrapy shell to look at your example URL and its response in scrapy, and test your XPath (note that I'm using scrapy 1.0, and some result lines have been removed):

(scrapy10)paul@paul$ scrapy shell -s USER_AGENT="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.125 Safari/537.36" http://ksu.craigslist.org/search/foa
2015-06-22 19:23:26 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
...
2015-06-22 19:23:27 [scrapy] INFO: Spider opened
2015-06-22 19:23:27 [scrapy] DEBUG: Crawled (200) <GET http://ksu.craigslist.org/search/foa> (referer: None)
...
>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     print paragraph.xpath('a').extract()
...     print paragraph.xpath('a/text()').extract()
... 
----------
[u'<a href="/for/5083735342.html" class="i" data-ids="0:00G0G_gb8cQBnOWca"><span class="price">$50</span></a>']
[]
----------
[u'<a href="/for/5042795578.html" class="i" data-ids="0:00t0t_g8XHo4mq1Wb,0:01111_7GErRQg9kc9"><span class="price">$10</span></a>']
[]
----------
[u'<a href="/for/5042796585.html" class="i" data-ids="0:00W0W_kAKZ780MVd4"><span class="price">$10</span></a>']
[]
----------
[u'<a href="/for/5070157083.html" class="i" data-ids="0:00H0H_l93i9PS7WEC,0:00V0V_4HHMk6zAcvp,0:01010_586rQh4KX7Y,0:00l0l_5t6DbernooP"><span class="price">$1100</span></a>']
[]
----------
[u'<a href="/for/5083629657.html" class="i" data-ids="0:01111_7ccUivz24cL"><span class="price">$2</span></a>']
[]
----------
[u'<a href="/for/5083317838.html" class="i"><span class="price">$275</span></a>']
[]
----------
[u'<a href="/for/5056913265.html" class="i" data-ids="0:00J0J_jAZGd05f59U"><span class="price">$25</span></a>']
[]
----------
[u'<a href="/for/5083138728.html" class="i" data-ids="0:00q0q_80N4SDtfsmz"><span class="price">$40</span></a>']
[]
----------
[]
[]

When selecting the a elements of the p paragraphs (as you do) and calling .extract() on each one, you can see the HTML for each link. You'll notice that these a tags have no (direct) child text element, which is what you are selecting with a/text().

The "text" part (I believe) is inside the span child element.

You have different options:

  • use a//text() to select all descendant text elements of the a elements, not only child texts
  • use string(a) to tell the XPath engine to give you the text representation of the first a element
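The difference between direct child text and descendant text can be illustrated with the Python standard library alone. This is a minimal sketch on hypothetical markup (not the real Craigslist page, and not Scrapy itself): .text holds only an element's direct child text, while itertext() walks every descendant text node, analogous to a/text() versus a//text() or string(a).

```python
# .text ~ a/text(): direct child text only.
# itertext() ~ a//text() / string(a): all descendant text.
import xml.etree.ElementTree as ET

html = '<p><a href="/for/5083735342.html"><span class="price">$50</span></a></p>'
a = ET.fromstring(html).find('a')

print(repr(a.text))           # -> None (the <a> has no direct text child)
print(''.join(a.itertext()))  # -> $50 (text lives inside the <span>)
```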

For the second option, you would get this (some lines stripped):

>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     for a in paragraph.xpath('a'):
...         print(a.xpath('@href').extract_first(), a.xpath('string(.)').extract_first())
... 
----------
(u'/for/5042796585.html', u'$10')
----------
(u'/for/5070157083.html', u'$1100')
----------
(u'/for/5083629657.html', u'$2')
----------
(u'/for/5083317838.html', u'$275')
----------
(u'/for/5056913265.html', u'$25')
----------
(u'/for/5083138728.html', u'$40')
----------
>>>

Note that I'm using Scrapy 1.0's convenient .extract_first() method on selectors here.

If you need absolute URLs, you can use Scrapy 1.0's .urljoin() method on the Response object:

>>> for paragraph in response.xpath('//p'):
...     print "----------"
...     for a in paragraph.xpath('a'):
...         print(response.urljoin(a.xpath('@href').extract_first()), a.xpath('string(.)').extract_first())
... 
----------
(u'http://ksu.craigslist.org/for/5070157083.html', u'$1100')
----------
(u'http://ksu.craigslist.org/for/5083629657.html', u'$2')
----------
(u'http://ksu.craigslist.org/for/5083317838.html', u'$275')
----------
(u'http://ksu.craigslist.org/for/5056913265.html', u'$25')
----------
(u'http://ksu.craigslist.org/for/5083138728.html', u'$40')
----------
>>>
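Resolving a relative href against the page URL works like the standard library's urljoin; a minimal sketch of the same idea (shown with Python 3's urllib.parse for illustration, not Scrapy's own implementation):

```python
# Resolving a relative href against a base URL, the same effect as
# response.urljoin(href) in Scrapy 1.0.
from urllib.parse import urljoin  # urlparse.urljoin on Python 2

base = 'http://ksu.craigslist.org/search/foa'
print(urljoin(base, '/for/5083138728.html'))
# -> http://ksu.craigslist.org/for/5083138728.html
```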
