Scrapy: No JSON object could be decoded

Date: 2017-10-11 09:40:06

Tags: javascript python json scrapy

I am trying to use Scrapy to extract data from a web page, and all of the data is inside JavaScript:

<script type="text/javascript">
// Globals
var ANUNTURI = [ { "ID": "2750801",   "Data": "Azi 11:16",   "Zile_piata": "146",   "Zona": "Andronache",   "Nr_Camere": "2",   "suprafu": "65",   "Pret": "62.000 EUR",   
    "Citit": "0",   "Tip_teren": "-",   "Etaj": "3 / 3",   "supraft": "-",
       "frontStradal": "-",   "Etichete": "",   "ArePoze": "7",   "Tip_spatiu": "-" },           
and so on... ]

;\r\n    var ID_CAUTARE = 0;\r\n    var CATEG = 3;\r\n    
var TRANZ = 2;\r\n    
var SORTARE = "";\r\n    
var ID_AGENT = "3012";\r\n    
var ID_LOCALITATE = \'13822\';\r\n    
var ID_JUDET = \'10\';\r\n    
var CRITERIU_FILTRU = \'\';\r\n        // judet_schimbat = "";\r\n\r\n    $(\'form[name="anunturi"] input[name="sort"]\').val(SORTARE);\r\n\r\n', u"\r\n\r\n    $(function(){\r\n\r\n        
var setTagValue = ' 0 ';\r\n        
var comboTitle = [];\r\n\r\n        $('#combo_etichete').mpCombo({\r\n            cls: 'mpCombo etichete',\r\n            header_default_text: 'Indiferent',\r\n            interval_from_text: ' Peste ', \r\n            
interval_to_text: ' Pana la ', \r\n            interval_between_text: ' si ', \r\n            combo_width: '162px', \r\n            menu_width: '160px',\r\n            onSelect: function() { // trigger click daca e inchisa cautarea avansata\r\n                if( $('#cautare_avansata').is(':hidden') ) {\r\n                    $('a#filtreaza').trigger('click');\r\n                }\r\n            }\r\n\r\n        });\r\n        \r\n        $('#combo_etichete').mpCombo({'setval': setTagValue});\r\n        comboTitle.push( $('#combo_etichete').mpCombo('gettitle') ); \r\n\r\n        if (comboTitle.length > 0) {\r\n            $('#combo_etichete dt a').text( comboTitle.join(', ') );      \r\n        }\r\n\r\n    });\r\n\r\n\r\n", u'\r\nvar gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");\r\ndocument.write(unescape("%3Cscript src=\'" + gaJsHost + "google-analytics.com/ga.js\' type=\'text/javascript\'%3E%3C/script%3E"));\r\n']
</script>

When I use

json.loads(response.xpath("//script[2]/text").extract())

it gives me this error:

No JSON object could be decoded

I only need to get the first variable, var ANUNTURI, with everything inside it, and put it into MySQL.

Update

I also tried this:

var = re.compile(r"var ANUNTURI= ({.*?});", re.MULTILINE | re.DOTALL)
json.loads(response.xpath("//script[2][contains(., 'var ANUNTURI')]/text()").re(var))

The error I get is:

TypeError: expected string or buffer

Then I tried this:

json.loads("".join(response.xpath("//script[2][contains(., 'var ANUNTURI')]/text()").re(var)))

And I got:

No JSON object could be decoded

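1 Answer:

Answer 0: (score: 1)

This is one possible way to extract the data, although from the code currently shown it is hard to tell whether the variable holds JSON or JavaScript. The subtlety is that JavaScript literal syntax is roughly a superset of JSON, so text that is valid JavaScript (a trailing comma before the closing bracket, for example) may still be rejected by a JSON parser.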

data = """// Globals
var ANUNTURI = [ { "ID": "2750801",   "Data": "Azi 11:16",   "Zile_piata": "146",   "Zona": "Andronache",   "Nr_Camere": "2",   "suprafu": "65",   "Pret": "62.000 EUR",   
    "Citit": "0",   "Tip_teren": "-",   "Etaj": "3 / 3",   "supraft": "-",
       "frontStradal": "-",   "Etichete": "",   "ArePoze": "7",   "Tip_spatiu": "-" },]

;\r\n    var ID_CAUTARE = 0;\r\n    var CATEG = 3;\r\n    
var TRANZ = 2;\r\n    
var SORTARE = "";\r\n    
var ID_AGENT = "3012";\r\n    
var ID_LOCALITATE = \'13822\';\r\n    
var ID_JUDET = \'10\';\r\n    
var CRITERIU_FILTRU = \'\';\r\n        // judet_schimbat = "";\r\n\r\n    $(\'form[name="anunturi"] input[name="sort"]\').val(SORTARE);\r\n\r\n', u"\r\n\r\n    $(function(){\r\n\r\n        
var setTagValue = ' 0 ';\r\n        
var comboTitle = [];\r\n\r\n        $('#combo_etichete').mpCombo({\r\n            cls: 'mpCombo etichete',\r\n            header_default_text: 'Indiferent',\r\n            interval_from_text: ' Peste ', \r\n            
interval_to_text: ' Pana la ', \r\n            interval_between_text: ' si ', \r\n            combo_width: '162px', \r\n            menu_width: '160px',\r\n            onSelect: function() { // trigger click daca e inchisa cautarea avansata\r\n                if( $('#cautare_avansata').is(':hidden') ) {\r\n                    $('a#filtreaza').trigger('click');\r\n                }\r\n            }\r\n\r\n        });\r\n        \r\n        $('#combo_etichete').mpCombo({'setval': setTagValue});\r\n        comboTitle.push( $('#combo_etichete').mpCombo('gettitle') ); \r\n\r\n        if (comboTitle.length > 0) {\r\n            $('#combo_etichete dt a').text( comboTitle.join(', ') );      \r\n        }\r\n\r\n    });\r\n\r\n\r\n", u'\r\nvar gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");\r\ndocument.write(unescape("%3Cscript src=\'" + gaJsHost + "google-analytics.com/ga.js\' type=\'text/javascript\'%3E%3C/script%3E"));\r\n'
"""
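import json
import re
from pprint import pprint

# Capture the array literal assigned to ANUNTURI, up to the "]" that precedes ";".
match = re.search(r"var ANUNTURI\s*=\s*(\[.*?\])\s*;", data, re.DOTALL)
anunturi_json = match.group(1)
# JSON forbids the trailing comma that JavaScript tolerates, so drop it.
anunturi_json = re.sub(r",\s*\]", "]", anunturi_json)
print(anunturi_json)
val = json.loads(anunturi_json)
pprint(val, indent=4)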
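
As for the errors in the question: `.extract()` and `.re()` both return a list, while `json.loads` expects a single string, which explains the TypeError. The pattern `var ANUNTURI= ({.*?});` also cannot match the page as shown, because the source has a space before `=` and the value starts with `[` rather than `{`. Below is a minimal sketch of a reusable helper that works on the script text as one string. The name `extract_js_array` is hypothetical (it is not part of Scrapy), and it assumes the literal is JSON apart from optional trailing commas and that no string value contains `]` followed by `;`.

```python
import json
import re


def extract_js_array(script_text, var_name):
    """Parse the array literal assigned to `var_name` in a block of JavaScript.

    Hypothetical helper for illustration. Assumes the literal is valid JSON
    except for optional trailing commas before "]" or "}".
    """
    pattern = r"var\s+%s\s*=\s*(\[.*?\])\s*;" % re.escape(var_name)
    match = re.search(pattern, script_text, re.DOTALL)
    if match is None:
        raise ValueError("variable %r not found" % var_name)
    # Remove trailing commas, which JavaScript allows and JSON rejects.
    # Caveat: this would also touch a ", ]" sequence inside a string value.
    literal = re.sub(r",\s*([\]}])", r"\1", match.group(1))
    return json.loads(literal)


# Shortened stand-in for the page's script text (not the real page content).
sample = 'var ANUNTURI = [ { "ID": "2750801", "Zona": "Andronache" }, ]\n;\r\n var CATEG = 3;'
rows = extract_js_array(sample, "ANUNTURI")
print(rows[0]["Zona"])
```

In a spider, the script text could probably be obtained as a single string with something like `response.xpath("//script[contains(., 'var ANUNTURI')]/text()").extract_first()` and passed to the helper. Each dict in the returned list would then correspond to one row to insert into MySQL.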