Unable to retrieve data from a script tag using XPath

Time: 2018-10-02 12:12:17

Tags: python xpath scrapy

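I am trying to retrieve data from the script tag shown below. From that script tag I need the following values: digitalData.product.pvi_type_name, digitalData.product.pvi_subtype_name, digitalData.product.model_name, and digitalData.product.displayName. I have written my own Python program to retrieve them, but it does not work at the moment...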

Script tag structure:

<script>
var COUNTRY_SHOP_STATUS = "buy";
var COUNTRY_SHOP_URL = "./buy";
var COUNTRY_WHERE_URL = "";
try {digitalData.page.pathIndicator.depth_2 = "mobile";} catch(e) {}
try {digitalData.page.pathIndicator.depth_3 = "mobile";} catch(e) {}
try {digitalData.page.pathIndicator.depth_4 = "smartphones";} catch(e) {}
try {digitalData.page.pathIndicator.depth_5 = "galaxy-note9";} catch(e) {}
try {digitalData.product.pvi_type_name      = "Mobile";} catch(e) {}
try {digitalData.product.pvi_subtype_name   = "Smartphone";} catch(e) {}
try {digitalData.product.model_name         = "SM-N960";} catch(e) {}
try {digitalData.product.displayName        = "galaxy note9";} catch(e) {}
try {digitalData.product.category           = digitalData.page.pathIndicator.depth_3;} catch(e) {}
</script>

Python script:

import scrapy
import csv
import re

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        with open('input.csv', 'r') as csvf:
            urlreader = csv.reader(csvf, delimiter=',', quotechar='"')
            for url in urlreader:
                if url[0] == "y":
                    yield scrapy.Request(url[1])

    def parse(self, response):
        def get_values(parameter, script):
            return re.findall('%s = "(.*)"' % parameter, script)[0]

        source_arr = response.xpath("//script[contains(., 'COUNTRY_SHOP_STATUS')]/text()").extract()
        if source_arr:
            source = source_arr[0]
            with open('output.csv', 'a', newline='') as csvfile:
                fieldnames = ['Category', 'Type', 'Model', 'SK']
                writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
                writer.writerow({
                    'Category': get_values("pvi_type_name", source),
                    'Type': get_values("pvi_subtype_name", source),
                    'Model': get_values("pathIndicator.depth_5", source),
                    'SK': get_values("model_name", source),
                })

1 answer:

Answer 0 (score: 1)

If you have already obtained the content of the script, try the following to get the required values:

import re

result = re.findall('product.*"(.*)"', source_arr[0])
print(result)
# ['Mobile', 'Smartphone', 'SM-N960', 'galaxy note9']
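
A likely reason the question's get_values helper fails: its pattern expects exactly one space before the equals sign, while the script pads the product names with several spaces, so re.findall returns an empty list and indexing [0] raises IndexError. If you want to look values up by name rather than rely on their order, a whitespace-tolerant version might look like the sketch below. The names SCRIPT, get_value and get_product_values are hypothetical and only for illustration; SCRIPT stands in for source_arr[0] from the spider.

```python
import re

# Stand-in for source_arr[0]: the relevant lines of the script from the question.
SCRIPT = '''
try {digitalData.page.pathIndicator.depth_5 = "galaxy-note9";} catch(e) {}
try {digitalData.product.pvi_type_name      = "Mobile";} catch(e) {}
try {digitalData.product.pvi_subtype_name   = "Smartphone";} catch(e) {}
try {digitalData.product.model_name         = "SM-N960";} catch(e) {}
try {digitalData.product.displayName        = "galaxy note9";} catch(e) {}
try {digitalData.product.category           = digitalData.page.pathIndicator.depth_3;} catch(e) {}
'''


def get_value(parameter, script):
    """Hypothetical helper: return the quoted value assigned to `parameter`, or None."""
    # re.escape keeps the dot in names such as "pathIndicator.depth_5" literal;
    # \s* tolerates any amount of padding around the equals sign.
    pattern = re.escape(parameter) + r'\s*=\s*"([^"]*)"'
    match = re.search(pattern, script)
    return match.group(1) if match else None


def get_product_values(script):
    """Hypothetical helper: map each digitalData.product name to its quoted value."""
    pairs = re.findall(r'digitalData\.product\.(\w+)\s*=\s*"([^"]*)"', script)
    return dict(pairs)


if __name__ == "__main__":
    print(get_value("pvi_type_name", SCRIPT))
    print(get_product_values(SCRIPT))
```

Returning None instead of indexing [0] means a page without the expected assignment does not crash the spider. Assignments whose right-hand side is not a quoted string, such as digitalData.product.category here, are skipped by both helpers.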