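I am scraping some news sites with the Scrapy framework, and it seems that only the last scraped item is stored, repeated once per loop iteration.
I want to store the title, date, and link that I scrape from the first page, and also store the whole news article. The article comes back as a list, so I want to merge it into a single string.
Item code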
import scrapy


class ScrapedItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    source = scrapy.Field()
    date = scrapy.Field()
    paragraph = scrapy.Field()
Spider code
import scrapy
from ..items import ScrapedItem


class CBNCSpider(scrapy.Spider):
    name = 'kontan'
    start_urls = [
        'https://investasi.kontan.co.id/rubrik/28/Emiten'
    ]

    def parse(self, response):
        box_text = response.xpath("//ul/li/div[@class='ket']")
        items = ScrapedItem()
        for crawl in box_text:
            title = crawl.css("h1 a::text").extract()
            source = "https://investasi.kontan.co.id" + (crawl.css("h1 a::attr(href)").extract()[0])
            date = crawl.css("span.font-gray::text").extract()[0].replace("|", "")
            items['title'] = title
            items['source'] = source
            items['date'] = date
            yield scrapy.Request(url=source,
                                 callback=self.parseparagraph,
                                 meta={'item': items})

    def parseparagraph(self, response):
        items_old = response.meta['item']  # only last item stored
        paragraph = response.xpath("//p/text()").extract()
        items_old['paragraph'] = paragraph  # merge into single string
        yield items_old
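I expect the date, title, and source in the output to be updated on each loop iteration, and the article to be merged into a single string so that it can be stored in MySQL.
Answer 0: (score: 0)
I defined an empty dictionary and put those variables into it. I also made some small changes to your XPath and CSS selectors to make them less error-prone. With those changes the script runs as required.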
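A minimal sketch of the idea described above, not the answerer's exact script. It runs without Scrapy and only imitates the scheduling: the helper names (`schedule_shared`, `schedule_fresh`, `merge_paragraphs`) and the sample rows are hypothetical. Scrapy runs the callbacks after the loop has finished, so when every request carries the same item object, each callback sees the values from the last iteration. Building a new dictionary per iteration avoids that, and `" ".join(...)` turns the paragraph list into one string.

```python
def schedule_shared(rows):
    """Mimic the original loop: one item object reused for every request."""
    item = {}  # created once, outside the loop
    pending = []
    for title, source, date in rows:
        item["title"] = title
        item["source"] = source
        item["date"] = date
        pending.append(item)  # every queued request points at the same dict
    return pending


def schedule_fresh(rows):
    """Fixed loop: a new dict per row, so each request keeps its own values."""
    pending = []
    for title, source, date in rows:
        item = {"title": title, "source": source, "date": date}
        pending.append(item)
    return pending


def merge_paragraphs(parts):
    """Join extracted text nodes into one string, skipping blank fragments."""
    return " ".join(p.strip() for p in parts if p.strip())


# Hypothetical sample data standing in for the scraped listing page.
rows = [
    ("First headline", "https://example.com/a", "01 Jan"),
    ("Second headline", "https://example.com/b", "02 Jan"),
]

shared = schedule_shared(rows)
fresh = schedule_fresh(rows)

print([item["title"] for item in shared])  # both entries show the last title
print([item["title"] for item in fresh])   # each entry keeps its own title
print(merge_paragraphs(["  Para one. ", "\n", "Para two."]))
```

In the spider from the question, this corresponds to creating the item inside the `for crawl in box_text:` loop instead of before it, and storing `" ".join(paragraph)` in `items_old['paragraph']` inside `parseparagraph`.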