I have HTML source that contains a CDATA section, and it holds some information I want to extract.
When I try the following:
switch_url = switch_soup.find_all(text=re.compile(('Switches')))
I get the following output:
['//<![CDATA[\n "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list","name":"Switches","admin_only":false},{"is_current":false,"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports","name":"Switch ports","admin_only":false},{"is_current":false,"url":"https://xxxx.meraki.com/Dormitory/n/xxxxxxx/manage/configure/dhcp_servers"//]]>\n ']
How can I get the "Switches" URL, i.e. "https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list", from the CDATA output?
Thanks!
Answer 0 (score: 0)
This is what you need:
import re
from bs4 import BeautifulSoup

# source.html contains your HTML above
with open('source.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
# find the first text node that contains the CDATA marker
cdata = soup.find(string=re.compile("CDATA"))
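Once the CDATA text is available as a string, the URL itself still has to be pulled out of it. A minimal sketch using a regular expression follows. It assumes each "url" field is immediately followed by its "name" field, as in the output shown in the question. The helper name find_url_by_name is hypothetical, and the sample string is a shortened copy of that output; it is truncated, so it cannot be parsed as JSON.

```python
import re

# Sample text shortened from the question's output (an assumption for
# illustration; in practice this would be the string found in the page).
cdata = (
    '//<![CDATA[\n "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list",'
    '"name":"Switches","admin_only":false},'
    '{"is_current":false,'
    '"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports",'
    '"name":"Switch ports","admin_only":false}//]]>\n '
)


def find_url_by_name(text, name):
    """Return the "url" value directly preceding "name":"<name>", or None.

    Hypothetical helper; it relies on the "url" field coming right before
    the "name" field, as in the question's output.
    """
    pattern = r'"url":"([^"]+)","name":"' + re.escape(name) + r'"'
    match = re.search(pattern, text)
    return match.group(1) if match else None


switch_url = find_url_by_name(cdata, "Switches")
print(switch_url)
```

Because the pattern includes the closing quote after the name, "Switches" does not accidentally match the "Switch ports" entry. If the field order differs in the real page, the pattern would need adjusting.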
Alternatively, you can strip the script and style elements and work with the page's remaining text:
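# remove <script> and <style> elements so only the visible text remains
for script in soup(['script', 'style']):
    script.decompose()
text = soup.get_text()
# strip leading and trailing whitespace from each line
lines = (line.strip() for line in text.splitlines())
# split runs separated by double spaces into separate chunks
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop empty chunks and rejoin, one chunk per line
text = '\n'.join(chunk for chunk in chunks if chunk)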