Extracting the content between specific <script> </script> tags

Time: 2015-05-07 16:23:02

Tags: python lxml

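I can extract all the tags with the code below. However, I don't know how to get at the content between the <script></script> tags. In particular, I only want this part (there is more content around it, but I'm not interested in that):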

<script>
            var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]
</script>

I'm not sure what code I need to add. I need to get the comma-separated items inside the [{}] into a Python dictionary.

Edit, following the suggestion in the accepted answer:

# -*- coding: utf-8 -*-
"""
Created on Thu May  7 10:31:02 2015

@author: idf
"""

import re
import json
import urllib2
from lxml import etree 


url='http://data.cnbc.com/quotes/CLCV1'

def wgetUrl(target):
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except:
        return ''

    return outtxt

def extract_text(elem):        
    if elem is None:
        print None
    else:
        return ''.join(i for i in elem.itertext())

content = wgetUrl(url)
node = etree.HTML(content)
parser = etree.HTMLParser()


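nodes = node.findall(r'.//script')
for x in nodes:
    # x is an lxml element, so search its text (None for an empty <script> tag)
    matches = re.findall(r'quoteDataObj\s\=\s(\[.+\])', x.text or '')
    if len(matches) > 0:
        # the captured group keeps the brackets, so it parses to a list holding one dict
        python_dict = json.loads(matches[0])[0]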

2 answers:

Answer 0: (score: 1)

You can run a regular expression over the page source to find the quoteDataObj variable, then load its contents as JSON. For example:

import re
import json

#...your code...

content = wgetUrl(url)
matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', content)
if len(matches) > 0:
    python_dict = json.loads(matches[0])

Output:

{u'altSymbol': u'CL/M5',
 u'assetType': u'DERIVATIVE',
 u'change': u'-1.39',
 u'code': 0,
 u'curmktstatus': u'REG_MKT',
 u'currencyCode': u'USD',
 u'encodedSymbol': u'CLCV1',
 u'exchange': u'New York Mercantile Exchange',
 u'high': u'61.31',
 u'last': u'59.54',
 u'low': u'59.14',
 u'name': u"WTI Crude Oil (Jun'15)",
 u'noStreaming': u'false',
 u'open': u'60.69',
 u'provider': u'CNBC Quote Cache',
 u'realTime': u'false',
 u'shortName': u'OIL',
 u'source': u'',
 u'symbol': u'CLCV1',
 u'symbolType': u'symbol',
 u'timeZone': u'EDT',
 u'volume': u'189607'}
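
The greedy `.+` in the pattern above can capture too much if another `}]` appears later on the same line. As a hedged alternative, here is a minimal sketch that uses the regex only to locate the assignment and then lets `json.JSONDecoder().raw_decode` read exactly one JSON value from that position. It is written for Python 3; the helper name `extract_quote_data` and the shortened sample string are assumptions for illustration, not part of the original page.

```python
import json
import re


def extract_quote_data(html, var_name="quoteDataObj"):
    """Return the JSON value assigned to var_name in html, or None."""
    # Locate "var_name = " and remember where the assigned value starts.
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value starting at the given index and
        # ignores whatever follows it (for example a trailing semicolon).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except ValueError:
        return None
    return value


# Shortened, made-up sample (assumption: the real page has the same shape).
sample = (
    '<script>\n'
    '  var quoteDataObj = [{"symbol":"CLCV1","code":0,"last":"59.54"}];\n'
    '  var other = [{"x":1}]\n'
    '</script>'
)

quotes = extract_quote_data(sample)
first = quotes[0] if quotes else {}
print(first.get("symbol"), first.get("last"))
```

The function returns the whole list, so the dictionary the question asks for is its first element.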

Parsing with LXML

The OP expressed interest in how to solve the problem by parsing with LXML. Here it is:

import re
import json

#...your code...

for x in nodes:
    matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', str(x.text))
    if len(matches) > 0:
        python_dict = json.loads(matches[0])
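
For readers without lxml installed, the same idea (search only inside script elements, not the whole page) can be sketched with the standard library's `html.parser`. This is a swapped-in technique, not the lxml approach shown above; it targets Python 3, and the names `ScriptCollector`, `find_quote`, and the sample markup are hypothetical.

```python
import json
import re
from html.parser import HTMLParser  # Python 3 standard library


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._in_script:
            self.scripts[-1] += data


def find_quote(html):
    """Return the first quoteDataObj dict found inside a script, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    for body in collector.scripts:
        matches = re.findall(r'quoteDataObj\s*=\s*\[(\{.+\})\]', body)
        if matches:
            return json.loads(matches[0])
    return None


# Made-up sample markup; the paragraph text is ignored because it is not in a script.
sample = (
    '<html><body><p>quoteDataObj = not here</p>'
    '<script>var unrelated = 1;</script>'
    '<script>var quoteDataObj = [{"symbol":"CLCV1","last":"59.54"}]</script>'
    '</body></html>'
)
quote = find_quote(sample)
print(quote["symbol"], quote["last"])
```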

Answer 1: (score: 0)

Let JSON do the heavy lifting

I'll assume the content you want is formatted just like your example:

<script>
            var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]
</script>

In that case, we can actually treat the quoteDataObj value as JSON.

So the simplest way to solve this is:

>>> text
'            var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]'
>>> data = text[ text.index('[')+1 : text.rindex(']') ]
>>> data
'{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}'
>>> import json
>>> values = json.loads(data)
>>> values
{u'code': 0, u'realTime': u'false', u'symbolType': u'symbol', u'high': u'61.31', u'open': u'60.69', u'assetType': u'DERIVATIVE', u'currencyCode': u'USD', u'source': u'', u'low': u'59.14', u'provider': u'CNBC Quote Cache', u'exchange': u'New York Mercantile Exchange', u'symbol': u'CLCV1', u'volume': u'189607', u'curmktstatus': u'REG_MKT', u'encodedSymbol': u'CLCV1', u'shortName': u'OIL', u'change': u'-1.39', u'altSymbol': u'CL/M5', u'last': u'59.54', u'name': u"WTI Crude Oil (Jun'15)", u'noStreaming': u'false', u'timeZone': u'EDT'}
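
The slice above drops the surrounding brackets, which works because the list holds a single object; with two or more objects the stripped string would no longer be valid JSON. A minimal sketch that keeps the brackets and parses the whole list is shown here. The function name `parse_assigned_list` and the second record (`HYPO1`) are made up for illustration.

```python
import json


def parse_assigned_list(text):
    """Parse the outermost [...] in text as a JSON list."""
    start = text.index('[')       # first opening bracket
    end = text.rindex(']') + 1    # one past the last closing bracket
    return json.loads(text[start:end])


# Hypothetical input with two objects; "HYPO1" is not a real symbol.
text = ('var quoteDataObj = [{"symbol":"CLCV1","last":"59.54"},'
        '{"symbol":"HYPO1","last":"1.00"}]')

rows = parse_assigned_list(text)
# Index the parsed rows by symbol for direct lookup.
by_symbol = {row["symbol"]: row for row in rows}
print(sorted(by_symbol))
```

Like the original slice, this assumes the text contains no other square brackets outside the list being extracted.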