Python Scrapy: getting article body, extract_first() returns nothing

Asked: 2018-10-26 14:30:29

Tags: python scrapy scrapy-spider

I am trying to use Scrapy to get the article body from a news site.

import scrapy
import sys 
import json

class ReutersPage(scrapy.Spider):
    name = "reutersPage"
    start_urls = [
        'https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C'
    ]


    def parse(self, response):
        articleBody = response.css('div.StandardArticleBody_body::text').extract_first()
        print('######## Article body ##########')
        print(articleBody)
        yield {
            'body': articleBody
        }  

I am trying to get the text inside the div StandardArticleBody_body, but I always get None.

The output is:

2018-10-26 14:23:44 [scrapy.core.engine] INFO: Spider opened
2018-10-26 14:23:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-26 14:23:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/robots.txt> (referer: None)
2018-10-26 14:23:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C> (referer: None)
######## Parse article ##########
######## Article body ##########
None
2018-10-26 14:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.reuters.com/article/chile-sqm-stocks/lithium-miner-sqm-shares-up-2-7-pct-chile-court-clears-way-for-tianqi-stake-purchase-idUSC0N1OX01C>
{'body': None}
2018-10-26 14:23:45 [scrapy.core.engine] INFO: Closing spider (finished)

Answers:

Answer 0 (score: 0)

None of the text belongs directly to the div you selected; it all belongs to its descendants. Putting a space between the selector path and ::text selects the text of all descendants, not just the text of the node you selected.

Try this:

articleBody = response.css('div.StandardArticleBody_body ::text').extract_first()

This way you select the text of all of the div's descendants. Note that extract_first() still returns only the first matching text node; to collect the whole body, use extract() and join the results.
