Using Scrapy to get data from a <script> tag in HTML

Time: 2015-11-03 16:01:21

Tags: javascript python python-2.7 web-scraping scrapy

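I have been trying to use Scrapy (XPath) to extract data from a script tag in the HTML of a Kbb page. My main problem is identifying the correct div and script tags. I am new to XPath and would appreciate any help!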


HTML (http://www.kbb.com/nissan/altima/2014/25-s-sedan-4d/?vehicleid=392396&intent=buy-used&mileage=10000&condition=fair&pricetype=retail):

<script type="text/javascript" src="http://s1.kbb.com/combine/IncentivesPilotJs/949332058"></script>
<input type="hidden" id="ResaleValueUrl" value="/ymmt/resalevalue/?vehicleid=392396" />
<input type="hidden" id="Intent" value="buy-used" />
<!--[if lt IE 9]>
<script>
    window.FlashCanvasOptions = {
        swfPath: "/js/canvas/FlashCanvas/UCMarketMeter/"
    };
</script>
<script type="text/javascript" src="http://s1.kbb.com/combine/YmmtMarketMeterFlashCanvasJs/795892638"></script>
<![endif]-->
<script type="text/javascript" src="http://s1.kbb.com/combine/YMMTOverview/1527402533"></script>
<script type="text/javascript" src="http://s1.kbb.com/combine/YmmtPricingOverviewBuyUsedJs/-1416499456"></script>

<script language="javascript" type="text/javascript">
    $(document).ready(function () {
        KBB.Vehicle.Pages.PricingOverview.Buyers.setup({
            //Workaround until we get cross domain working for Flash
            imageDir: window.FlashCanvasOptions ? "/Content/images" : "http://file.kelleybluebookimages.com/kbb/images/marketmeter",
            vehicleId: "392396",
            zipCode: "78701",
            mileage: "10000",
            intent: "buy-used",
            priceType: "retail",
            condition: "good",
            options: "392396|53635|78701|100|10|",
            price: "17074",
            manufacturer: "Nissan",
            model: "Altima",
            year: "2014",
            style: "2.5 S Sedan 4D",
            category: "",
            hasCpo: true,
            meetsCpoReq: true,
            showOthersPaid: false,
            data: {
                "values": {
                    "cpo": { "priceMin": 17335.0, "price": 18275.0, "priceMax": 19214.0 },
                    "fpp": { "priceMin": 15286.0, "price": 17074.0, "priceMax": 18861.0 },
                    "privatepartyexcellent": { "priceMin": 0.0, "price": 16064.0, "priceMax": 0.0 },
                    "privatepartyfair": { "priceMin": 0.0, "price": 14081.0, "priceMax": 0.0 },
                    "privatepartygood": { "priceMin": 0.0, "price": 15454.0, "priceMax": 0.0 },
                    "privatepartyverygood": { "priceMin": 0.0, "price": 15715.0, "priceMax": 0.0 },
                    "retail": { "priceMin": 0.0, "price": 17875.0, "priceMax": 0.0 }
                },
                "timAmount": 0.0,
                "monthlyPayments": {
                    "cpo": { "vehiclePrice": 18275.0, "rate": 2.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 348.0 },
                    "fpp": { "vehiclePrice": 17074.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 342.0 },
                    "privatepartyexcellent": { "vehiclePrice": 16064.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 322.0 },
                    "privatepartyfair": { "vehiclePrice": 14081.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 282.0 },
                    "privatepartygood": { "vehiclePrice": 15454.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 309.0 },
                    "privatepartyverygood": { "vehiclePrice": 15715.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 315.0 },
                    "retail": { "vehiclePrice": 17875.0, "rate": 4.9, "terms": 60.0, "taxAndTitle": 6.5, "downPay": 0.0, "amount": 358.0 }
                },
                "scale": { "scaleLow": 14081.0, "scaleHigh": 19214.0 },
                "deals": { "below": 7, "between": 17, "above": 3 }
            },
            adPriceRanges: {"AdPriceRange":[{"PriceMin":0,"PriceMax":8499,"AdPRValue":1},{"PriceMin":8500,"PriceMax":18499,"AdPRValue":2},{"PriceMin":18500,"PriceMax":23499,"AdPRValue":3},{"PriceMin":23500,"PriceMax":28499,"AdPRValue":4},{"PriceMin":28500,"PriceMax":33499,"AdPRValue":5},{"PriceMin":33500,"PriceMax":38499,"AdPRValue":6},{"PriceMin":38500,"PriceMax":43499,"AdPRValue":7},{"PriceMin":43500,"PriceMax":48499,"AdPRValue":8},{"PriceMin":48500,"PriceMax":53499,"AdPRValue":9},{"PriceMin":53500,"PriceMax":63499,"AdPRValue":10},{"PriceMin":63500,"PriceMax":73499,"AdPRValue":11},{"PriceMin":73500,"PriceMax":1000000,"AdPRValue":12}]}
        });
    });
    $('.foot-note').hide();
    $(window).on('popstate', function () {
        KBB.Vehicle.Pages.PricingOverview.Buyers.stateChangeHandler();
    });
</script>

Scrapy code:

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import scrapy

from kbb.items import kbbItem

class kbbSpider(scrapy.Spider):
    name = "kbb"
    allowed_domains = ["kbb.com"]
    start_urls = [
        "http://www.kbb.com/nissan/altima/2014/25-s-sedan-4d/?vehicleid=392396&intent=buy-used&10000&good&pricetype=retail"
    ]

    def parse(self, response):
        sel = Selector(response)
        #sites = sel.xpath('//div')
        items = []
        #for site in sites:
        item = kbbItem()
        item['priceMin'] = site.xpath('//div/script').extract()[35][915:922]
        return item

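Ultimately I want to populate my item with priceMin, price, and priceMax from fpp, and with price from the retail field. At the moment I am using indexes to get these values, but I would like to know whether there is an easier way.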


1 answer:

Answer 0: (score: 7)

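The problem is that the desired data is inside the JavaScript code. Also, your current approach of slicing the extracted script text by fixed indexes is fragile and unreliable: any change to the page's markup or to the script's contents shifts the offsets.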

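The idea is to locate the script tag containing the desired data, use regular expressions to pull out the object/dictionary holding the prices, load that object into a Python dictionary with the help of the json module, and then read the desired values from it.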
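The steps above can be tried without Scrapy or network access. The following is a minimal sketch, assuming the script text has the same shape as in the question; `SAMPLE_SCRIPT` is a hypothetical, shortened stand-in for the real tag contents, and `extract_pricing_data` is an illustrative helper name, not part of any library.

```python
import json
import re

# Hypothetical, shortened stand-in for the text of the inline <script> tag.
SAMPLE_SCRIPT = """
$(document).ready(function () {
    KBB.Vehicle.Pages.PricingOverview.Buyers.setup({
        vehicleId: "392396",
        data: {
            "values": {
                "fpp": {"priceMin": 15286.0, "price": 17074.0, "priceMax": 18861.0},
                "retail": {"priceMin": 0.0, "price": 17875.0, "priceMax": 0.0}
            },
            "timAmount": 0.0
        },
        adPriceRanges: {"AdPriceRange": []}
    });
});
"""

# Capture the object literal that sits between "data:" and the "adPriceRanges" key.
# re.DOTALL lets "." span line breaks; the lazy quantifiers stop at the first
# closing brace that is followed by ", adPriceRanges".
DATA_PATTERN = re.compile(
    r"Buyers\.setup\(.*?data:\s*(\{.*?\}),\W+adPriceRanges",
    re.DOTALL,
)


def extract_pricing_data(script_text):
    """Return the embedded pricing object as a dict, or None if it is absent."""
    match = DATA_PATTERN.search(script_text)
    if match is None:
        return None
    # json.loads works here only because this particular object is valid JSON
    # (double-quoted keys, no comments, no trailing commas).
    return json.loads(match.group(1))


pricing = extract_pricing_data(SAMPLE_SCRIPT)
print(pricing["values"]["fpp"]["price"])
print(pricing["values"]["retail"]["price"])
```

Anchoring the pattern on the `adPriceRanges` key that follows the object is what makes the lazy match stop at the right closing brace; if the site reorders its keys, the pattern would need to be adjusted.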

Demo from the Scrapy shell:
In [1]: import re
In [2]: import json

In [3]: pattern = re.compile(r"KBB\.Vehicle\.Pages\.PricingOverview\.Buyers\.setup\(.*?data: ({.*?}),\W+adPriceRanges", re.MULTILINE | re.DOTALL)
In [4]: data = response.xpath("//script[contains(., 'KBB.Vehicle.Pages.PricingOverview.Buyers.setup')]/text()").re(pattern)[0]

In [5]: data = data.replace("//Workaround until we get cross domain working for Flash", "")

In [6]: data_obj = json.loads(data)

In [7]: data_obj['values']['fpp']
Out[7]: {u'price': 15569.0, u'priceMax': 17356.0, u'priceMin': 13781.0}

In [8]: data_obj['values']['retail']
Out[8]: {u'price': 16370.0, u'priceMax': 0.0, u'priceMin': 0.0}
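To finish what the question asks for, the parsed dictionary still has to be mapped onto item fields. Below is a small sketch of that mapping, assuming the structure shown in the output above; `build_price_fields` and the `retailPrice` field name are hypothetical and would need to match whatever fields the item class actually declares.

```python
def build_price_fields(data_obj):
    """Flatten the parsed pricing object into the fields the question asks for.

    Uses .get() throughout so that a missing section yields None values
    rather than raising KeyError if the page layout changes.
    """
    values = data_obj.get("values", {})
    fpp = values.get("fpp", {})
    retail = values.get("retail", {})
    return {
        "priceMin": fpp.get("priceMin"),
        "price": fpp.get("price"),
        "priceMax": fpp.get("priceMax"),
        "retailPrice": retail.get("price"),  # hypothetical field name
    }


# Sample input mirroring the values printed in the shell session above.
sample = {
    "values": {
        "fpp": {"price": 15569.0, "priceMax": 17356.0, "priceMin": 13781.0},
        "retail": {"price": 16370.0, "priceMax": 0.0, "priceMin": 0.0},
    }
}
fields = build_price_fields(sample)
print(fields)
```

In a spider, the returned dictionary could be used to fill the item inside the `parse` callback, for example by assigning each key to the corresponding item field before returning it.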