Scrapy Splash can't get data from a React site

Asked: 2019-04-29 03:54:00

Tags: python reactjs scrapy scrapy-splash

I need to scrape this site. It looks like it is built with React, so I tried to extract the data with scrapy-splash. For example, I need the `a` elements with the class `shelf-product-name`, but the response is an empty array. I also used the `wait` argument with around 5 seconds, yet I still only get an empty array.
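For reference, the Splash attempt described above boils down to asking Splash's `render.html` endpoint to wait before returning the HTML. A minimal sketch of the equivalent raw request, assuming a Splash instance running on the default local port 8050:

```python
from urllib.parse import urlencode

# Default Splash HTTP endpoint (assumption: a local Splash container).
SPLASH = "http://localhost:8050/render.html"

params = {
    "url": "https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6",
    "wait": 5,  # seconds to let React render before snapshotting the DOM
}
render_url = SPLASH + "?" + urlencode(params)
print(render_url)
```

This is what scrapy-splash builds under the hood; the question reports that even with the wait, the selector still matches nothing, which motivates the answer below.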

1 answer:

Answer 0 (score: 1)

Actually, there is no need to use Scrapy Splash, because all the required data is stored as JSON inside a `<script>` tag of the raw HTML response:

import scrapy
from scrapy.crawler import CrawlerProcess
import json

class JumboCLSpider(scrapy.Spider):
    name = "JumboCl"
    start_urls = ["https://www.jumbo.cl/lacteos-y-bebidas-vegetales/leches-blancas?page=6"]

    def parse(self, response):
        scripts = [s for s in response.css("script::text") if "window.__renderData" in s.extract()]
        if not scripts:
            return  # data blob not found in this response
        data = scripts[0].extract().split("window.__renderData = ")[-1]
        json_data = json.loads(data[:-1])  # drop the trailing ";" after the JSON object
        for plp in json_data["plp"]["plp_products"]:
            for product in plp["data"]:
                #yield {"productName":product["productName"]} # data from css:  a.shelf-product-name
                yield product

if __name__ == "__main__":
    c = CrawlerProcess({'USER_AGENT':'Mozilla/5.0'})
    c.crawl(JumboCLSpider)
    c.start()
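The core trick can be demonstrated without Scrapy at all. A self-contained sketch using only the standard library, with hypothetical product data shaped like the structure the spider above walks:

```python
import json

# Mock of the relevant part of the raw HTML response (hypothetical data,
# not taken from the real site).
html = """
<html><body>
<script>window.__renderData = {"plp": {"plp_products": [{"data": [
  {"productName": "Leche entera 1L"},
  {"productName": "Leche descremada 1L"}]}]}};</script>
</body></html>
"""

# Same idea as the spider: split on the assignment, then strip the
# trailing ";" so the remainder is valid JSON.
raw = html.split("window.__renderData = ")[-1]
raw = raw.split("</script>")[0].strip().rstrip(";")
data = json.loads(raw)

names = [product["productName"]
         for plp in data["plp"]["plp_products"]
         for product in plp["data"]]
print(names)
```

Because the server embeds the full data set in the initial HTML, no JavaScript rendering (and hence no Splash) is needed.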