Question

I'm new to Scrapy and I'm trying to scrape posts from reddit. To get started, I opened the Scrapy shell and tried to pull out the posts. The page I'm using is https://www.reddit.com/r/news/comments/6a4ie8/philippines_senator_tells_un_reports_of_drug_war/

I looked at the page source and found the following data that I want to access:

class="usertext-body may-blank-within md-container"><div class="md"><p>It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit etc.

Why do I get an empty array when I enter response.xpath('//div[@class="md"]').extract()? I also get empty arrays when I try to access a lot of the other data on this page through the shell.

Thanks very much in advance.

Answer 1

If you want to access the text of each post, you can use this XPath:

response.xpath('//form[contains(@id, "form-t1")]//div//div//p/text()').extract()

You can learn more about XPath here: Scrapy Selectors

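Finally, if you want to test XPath expressions, this is a very useful tool: Videlibri. Paste the HTML you want to parse into the left textarea and the XPath into the right one. You can then easily test your expressions.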
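If you would rather check a simple expression locally, here is a minimal sketch that uses only Python's standard library instead of Videlibri or Scrapy. The HTML fragment and the form ID `form-t1_example` are hypothetical, modeled on the markup quoted in the question, and the fragment has to be well-formed XML. Note that ElementTree supports only a limited XPath subset (no `contains()` or `text()`), so this is only suitable for simple paths, not as a replacement for Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the markup quoted in the question.
# ElementTree needs well-formed XML, so every tag is closed explicitly.
HTML = """
<div class="entry">
  <form id="form-t1_example">
    <div class="usertext-body may-blank-within md-container">
      <div class="md">
        <p>First paragraph of the comment.</p>
        <p>Second paragraph.</p>
      </div>
    </div>
  </form>
</div>
"""

root = ET.fromstring(HTML)

# [@class='md'] compares the whole attribute value, so it matches here
# because the inner div's class attribute is exactly "md".
paragraphs = [p.text for p in root.findall(".//div[@class='md']/p")]
print(paragraphs)

# The same exact-match rule means a single class name does not match
# an element whose class attribute lists several classes.
multi_class = root.findall(".//div[@class='usertext-body']")
print(len(multi_class))
```

The second lookup finds nothing because the attribute value is the full string "usertext-body may-blank-within md-container". In Scrapy itself you can match one class out of several with `contains(@class, ...)`.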

Hope this helps.

Answer 2

Try the following, shown with both response.css and response.xpath. It avoids the form ID, since that appears to change:

>>> response.css('div.entry form div.usertext-body div.md p ::text').extract_first()
'It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit the interpretation of said "facts" as lies, insisting that many of the homicides being counted as extra-judicial killings were just regular homicides.'
>>> 
>>> response.xpath("//div[contains(@class, 'entry')]/form/div/div/p[1]/text()").extract_first()
'It seems to me that the senator was using the term "alternative facts" the opposite way Conway used them. He used them to discredit the interpretation of said "facts" as lies, insisting that many of the homicides being counted as extra-judicial killings were just regular homicides.'

Unable to access some reddit data through the Scrapy shell
