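I can extract all the tags with the code below. However, I don't know how to get at the content between the <script> and </script> tags. In particular, I only want this part (there is more in between, but I'm not interested in it):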
<script>
var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]
</script>
I'm not sure what code I need to add. I need to put the comma-separated items between [{ and }] into a Python dictionary.
Edit, following the suggestion in the accepted answer:
# -*- coding: utf-8 -*-
"""
Created on Thu May 7 10:31:02 2015
@author: idf
"""
import re
import json
import urllib2
from lxml import etree

url = 'http://data.cnbc.com/quotes/CLCV1'

def wgetUrl(target):
    # Fetch the page; return an empty string if the request fails.
    try:
        req = urllib2.Request(target)
        req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.0.3 Gecko/2008092417 Firefox/3.0.3')
        response = urllib2.urlopen(req)
        outtxt = response.read()
        response.close()
    except Exception:
        return ''
    return outtxt

def extract_text(elem):
    # Concatenate all text inside an element; None if there is no element.
    if elem is None:
        return None
    return ''.join(elem.itertext())

content = wgetUrl(url)
node = etree.HTML(content)
nodes = node.findall('.//script')
for x in nodes:
    # x is an lxml element, not a string, so search its text instead.
    # x.text is None for an empty <script> tag, hence the fallback to ''.
    matches = re.findall(r'quoteDataObj\s\=\s(\[.+\])', x.text or '')
    if len(matches) > 0:
        # The captured group is the whole JSON array; take its first object.
        python_dict = json.loads(matches[0])[0]
Answer 0 (score: 1)
You can run a regular expression over the script to find the quoteDataObj variable and load its contents as JSON. For example:
import re
import json
#...your code...
content = wgetUrl(url)
matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', content)
if len(matches) > 0:
    python_dict = json.loads(matches[0])
Output:
{u'altSymbol': u'CL/M5',
u'assetType': u'DERIVATIVE',
u'change': u'-1.39',
u'code': 0,
u'curmktstatus': u'REG_MKT',
u'currencyCode': u'USD',
u'encodedSymbol': u'CLCV1',
u'exchange': u'New York Mercantile Exchange',
u'high': u'61.31',
u'last': u'59.54',
u'low': u'59.14',
u'name': u"WTI Crude Oil (Jun'15)",
u'noStreaming': u'false',
u'open': u'60.69',
u'provider': u'CNBC Quote Cache',
u'realTime': u'false',
u'shortName': u'OIL',
u'source': u'',
u'symbol': u'CLCV1',
u'symbolType': u'symbol',
u'timeZone': u'EDT',
u'volume': u'189607'}
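As a self-contained illustration of the regex-plus-JSON approach above, here is a minimal sketch that runs without a network connection. It is written for Python 3 (the thread's code is Python 2), and `SAMPLE_HTML`, `QUOTE_RE`, and `extract_quotes` are hypothetical names introduced only for this example. Unlike the snippet above, it captures the whole array and returns a list, which may be more convenient if the page ever carries more than one quote.

```python
import json
import re

# Hypothetical stand-in for the downloaded page; the real page is much longer.
SAMPLE_HTML = """<html><head><script>
var quoteDataObj = [{"symbol":"CLCV1","name":"WTI Crude Oil (Jun\\u002715)","last":"59.54","code":0}]
</script></head></html>"""

# Capture everything from the opening '[' to the last ']' on the same line.
QUOTE_RE = re.compile(r'quoteDataObj\s*=\s*(\[.+\])')


def extract_quotes(page_text):
    """Return the list of dicts assigned to quoteDataObj, or [] if absent."""
    match = QUOTE_RE.search(page_text)
    if match is None:
        return []
    return json.loads(match.group(1))


quotes = extract_quotes(SAMPLE_HTML)
first = quotes[0]
print(first["symbol"], first["last"])
```

Note that the values stay strings (for example `"59.54"`), exactly as they appear in the page, so convert them with `float()` if you need numbers.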
The OP expressed interest in how to solve this by parsing with lxml. Here it is:
import re
import json
#...your code...
for x in nodes:
    matches = re.findall(r'quoteDataObj\s\=\s\[(\{.+\})\]', str(x.text))
    if len(matches) > 0:
        python_dict = json.loads(matches[0])
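If lxml is not available, the same idea (collect the text of each script element, then apply the regex) can be sketched with the standard library's `html.parser` instead. This is a swapped-in technique, not the lxml code above; it assumes Python 3, and `ScriptCollector` and `SAMPLE_HTML` are hypothetical names made up for this example.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the text content of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Data may arrive in several chunks, so append to the current script.
        if self._in_script:
            self.scripts[-1] += data


# Hypothetical stand-in for the downloaded page.
SAMPLE_HTML = """<html><head>
<script>var unrelated = 1;</script>
<script>
var quoteDataObj = [{"symbol":"CLCV1","last":"59.54"}]
</script>
</head><body><p>page body</p></body></html>"""

collector = ScriptCollector()
collector.feed(SAMPLE_HTML)
collector.close()

quote = None
for script_text in collector.scripts:
    matches = re.findall(r'quoteDataObj\s*=\s*\[(\{.+\})\]', script_text)
    if matches:
        quote = json.loads(matches[0])
        break

print(quote)
```

Searching script by script, rather than over the whole page, keeps the regex from matching text that merely looks similar elsewhere in the markup.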
Answer 1 (score: 0)
I'll assume the content you want is formatted like your example:
<script>
var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]
</script>
In that case, we can actually treat the quoteDataObj value as JSON.
So the simplest way to solve this is:
>>> text
' var quoteDataObj = [{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}]'
>>> data = text[ text.index('[')+1 : text.rindex(']') ]
>>> data
'{"symbol":"CLCV1","symbolType":"symbol","code":0,"name":"WTI Crude Oil (Jun\\u002715)","shortName":"OIL","last":"59.54","exchange":"New York Mercantile Exchange","source":"","open":"60.69","high":"61.31","low":"59.14","change":"-1.39","currencyCode":"USD","timeZone":"EDT","volume":"189607","provider":"CNBC Quote Cache","altSymbol":"CL/M5","curmktstatus":"REG_MKT","realTime":"false","assetType":"DERIVATIVE","noStreaming":"false","encodedSymbol":"CLCV1"}'
>>> import json
>>> values = json.loads(data)
>>> values
{u'code': 0, u'realTime': u'false', u'symbolType': u'symbol', u'high': u'61.31', u'open': u'60.69', u'assetType': u'DERIVATIVE', u'currencyCode': u'USD', u'source': u'', u'low': u'59.14', u'provider': u'CNBC Quote Cache', u'exchange': u'New York Mercantile Exchange', u'symbol': u'CLCV1', u'volume': u'189607', u'curmktstatus': u'REG_MKT', u'encodedSymbol': u'CLCV1', u'shortName': u'OIL', u'change': u'-1.39', u'altSymbol': u'CL/M5', u'last': u'59.54', u'name': u"WTI Crude Oil (Jun'15)", u'noStreaming': u'false', u'timeZone': u'EDT'}
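The slicing above relies on the first `[` and the last `]` in the text delimiting a single object. If the script might contain several objects in the array, or more JavaScript after it, one possible variation is to let the JSON decoder find the end of the array itself with `json.JSONDecoder.raw_decode`. The sketch below assumes Python 3; the second entry (`XYZ`) and the trailing `var other` statement are invented purely to show the behaviour, and `parse_first_array` is a hypothetical helper name.

```python
import json

# Sample text; the second quote and the trailing statement are made up.
text = (' var quoteDataObj = [{"symbol":"CLCV1","last":"59.54"},'
        '{"symbol":"XYZ","last":"1.00"}]; var other = [1, 2];')


def parse_first_array(source):
    """Decode the first JSON array in source, ignoring whatever follows it."""
    start = source.index('[')  # raises ValueError if there is no '['
    values, _end = json.JSONDecoder().raw_decode(source, start)
    return values


values = parse_first_array(text)
print(len(values), values[0]["symbol"])
```

Here `values` is a list of dictionaries, so `values[0]` corresponds to the single dictionary produced by the slicing approach.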