How do I use Scrapy to get the headline titles from a Google News page?

Date: 2019-04-18 09:19:53

Tags: scrapy google-news

I saved an offline copy of https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen

I'm having trouble working out how to get the titles of the listed articles.

import scrapy

class newsSpider(scrapy.Spider):
    name = "news"
    start_urls = ['file:///127.0.0.1/home/toni/Desktop/crawldeez/googlenewsoffline.html/'
                  ]

    def parse(self, response):
        for xrnccd in response.css('a.MQsxIb.xTewfe.R7GTQ.keNKEd.j7vNaf.Cc0Z5d.EjqUne'):
            yield {
                'ipQwMb.ekueJc.RD0gLb': xrnccd.css('h3.ipQwMb.ekueJc.RD0gLb::ipQwMb.ekueJc.RD0gLb').get(),
            }

1 answer:

Answer 0 (score: 1)

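The problem seems to be that the page content is rendered dynamically with JavaScript, so it can't be extracted from the HTML with the css/xpath methods. It is present in the response body, though, so you can extract it with a regular expression. Here is a Scrapy shell session showing how: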

$ scrapy shell "https://news.google.com/search?q=amazon&hl=en-US&gl=US&ceid=US%3Aen"
...
>>> import re
>>> from pprint import pprint
>>>
>>> titles = re.findall(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>', response.text)
>>> pprint(titles)
['Amazon will no longer sell Chinese goods in China',
 'YouTube is finally coming back to Amazon’s Fire TV devices',
 'Amazon Plans to Use Digital Media to Expand Its Advertising Business',
 'Amazon flooded with fake reviews; Learn how to spot them',
 'How To Win in Today&#39;s Amazon World',
 'Amazon Day: How to schedule Amazon deliveries',
 'Bezos Disputes Amazon’s Market Power. But His Merchants Feel the Pinch',
 '20 Best Action Movies to Stream on Amazon Prime',
 ...]
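
The same extraction can be wrapped in a small helper, which might be easier to reuse from a spider's parse method. This is only a sketch: extract_titles and SAMPLE are hypothetical names, and the markup in SAMPLE is a made-up fragment shaped to fit the regex, not a copy of the real page. It also decodes HTML entities with html.unescape, since the raw matches can contain sequences such as &#39; (as in the output above).

```python
import html
import re

# Same pattern as in the shell session above.
TITLE_RE = re.compile(r'<h3 class="[^"]+?"><a[^>]+?>(.+?)</a>')


def extract_titles(body):
    """Return the headline strings found in a raw response body.

    Hypothetical helper: applies the regex to the text and decodes
    HTML entities (e.g. &#39;) in each captured title.
    """
    return [html.unescape(raw) for raw in TITLE_RE.findall(body)]


# Made-up fragment for illustration only; the real page markup may differ.
SAMPLE = (
    '<h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/abc">'
    'How To Win in Today&#39;s Amazon World</a></h3>'
    '<h3 class="ipQwMb ekueJc RD0gLb"><a href="./articles/def">'
    'Amazon Day: How to schedule Amazon deliveries</a></h3>'
)

if __name__ == "__main__":
    for title in extract_titles(SAMPLE):
        print(title)
```

Inside a spider, parse could then loop over extract_titles(response.text) and yield one item per title, for example {'title': title}. Keep in mind that a regex tied to the page's markup is fragile and will likely need adjusting whenever Google changes the HTML.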