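Scraping a variable inside CDATA with BeautifulSoup

Asked: 2017-10-13 03:13:47

Tags: python beautifulsoup cdata

I have a web page that contains the data below, and I want to search inside the CDATA section of that page.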

<script type="text/javascript">//<![CDATA[ 

car.app =


{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}] 

... 
... 
//]]></script>

I want to get the car.app variable from the CDATA section, but I'm not sure how to parse it in Python.

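import bs4 as bs
import urllib.request

# url: address of the page to scrape (not shown in the question)
# FancyURLopener is deprecated since Python 3.3, so the User-Agent
# header is sent through a Request object instead
request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urllib.request.urlopen(request)

c = response.read()
soup = bs.BeautifulSoup(c, "html.parser")
print(soup)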

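1 Answer:

Answer 0 (score: 0)

I think the only way to solve this is to parse that specific tag with BeautifulSoup and then do some string manipulation to get what you want.

Code: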

import bs4 as bs

# Raw string so the "\/" sequences in the sample data are kept literally
c = r'''
<script type="text/javascript">//<![CDATA[ 

car.app =


{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}] 

... 
... 
//]]></script>
'''
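soup = bs.BeautifulSoup(c, "html.parser")
# Take the first <script> tag; its content is the CDATA-wrapped JavaScript
script = soup.find('script')
# Keep the text between "car.app =" and the first "...", then drop newlines
print(script.text.split('car.app =')[1].split('...')[0].replace('\n', '').strip())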

Output:

{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","ID":"538070521","UID":"0","CARNAME":"MAZDA","TYPE_COLOR":"0","LAT":"26.13437","LON":"-80.11906","COURSE":"100","SPEED":"0","LENGTH":"12","STATE":"OH"}]
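The result above is still a plain string. If the goal is to read individual values, a possible follow-up is to decode the object with the standard json module instead of splitting on "...". The sketch below is only an illustration under some assumptions: script_text is a hypothetical, shortened and complete stand-in for the text of the script tag (the real data is elided in the question), extract_js_object is a helper name made up here, and the object assigned to car.app is assumed to be valid JSON, which is not guaranteed for JavaScript object literals in general.

```python
import json
import re

# Hypothetical stand-in for the text of the <script> tag. The real page's
# data is elided in the question, so this sample is shortened and complete.
script_text = r'''//<![CDATA[

car.app =

{"lat":26.175625,"lon":-80.13808,"zoom":"13","yellow":"\/img\/icons\/yellow.png","cars":[{"CAR_ID":"715383","CARNAME":"MAZDA","STATE":"OH"}]}

//]]>'''


def extract_js_object(text, name):
    """Return the JSON value assigned to the JavaScript variable `name`."""
    # Find the assignment and skip any whitespace after the "=".
    match = re.search(re.escape(name) + r"\s*=\s*", text)
    if match is None:
        raise ValueError(f"{name} not found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (comments, more script code).
    obj, _end = json.JSONDecoder().raw_decode(text, match.end())
    return obj


data = extract_js_object(script_text, "car.app")
print(data["cars"][0]["CARNAME"])
```

With BeautifulSoup, the same helper could be given the text of the script tag found in the answer's code. Because raw_decode stops at the end of the first JSON value, the trailing //]]> marker does not need to be stripped first.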