Question

I have HTML source that contains a CDATA section, and it holds some information I want to extract.

When I try the following:

switch_url = switch_soup.find_all(text=re.compile(('Switches')))

I get the following output:

['//<![CDATA[\n    "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list","name":"Switches","admin_only":false},{"is_current":false,"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports","name":"Switch ports","admin_only":false},{"is_current":false,"url":"https://xxxx.meraki.com/Dormitory/n/xxxxxxx/manage/configure/dhcp_servers"//]]>\n  ']

How can I get the "Switches" URL, i.e. "https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list", from the CDATA output?

Thanks!

Answer 1

This is what you need:

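from bs4 import BeautifulSoup
import re

# source.html contains your HTML above
with open('source.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

# Find the first text node containing the CDATA marker
# (string= is the current name for the older text= argument)
cdata = soup.find(string=re.compile("CDATA"))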

Alternatively, you can try:

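# Remove <script> and <style> elements from the tree
for script in soup(['script', 'style']):
    script.decompose()

text = soup.get_text()
# Strip leading and trailing whitespace from each line
lines = (line.strip() for line in text.splitlines())
# Split each line on double spaces and strip the resulting pieces
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# Drop empty pieces and rejoin, one piece per line
text = '\n'.join(chunk for chunk in chunks if chunk)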
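
The snippets above locate the CDATA text but stop short of pulling out the URL itself. Once you have that text as a string, a regular expression is one way to do it. The following is only a sketch: the sample string is a trimmed copy of the output in the question, the helper name urls_by_name is made up for illustration, and the pattern assumes every "url" key is immediately followed by its "name" key, as it is in that output.

```python
import re

# Trimmed copy of the string returned by find_all() in the question
# (the hosts are the placeholders from the question, not real URLs).
cdata_text = (
    '//<![CDATA[\n    "url":"https://xxxx.meraki.com/xxxxxxx/n/xxxxx/manage/nodes/list",'
    '"name":"Switches","admin_only":false},'
    '{"is_current":false,"url":"https://nxx.meraki.com/xxxxx/n/xxxxx/manage/configure/switchports",'
    '"name":"Switch ports","admin_only":false}//]]>\n  '
)

# Assumption: "url" is directly followed by "name", as in the sample above.
PAIR_RE = re.compile(r'"url":"([^"]+)","name":"([^"]+)"')


def urls_by_name(text):
    """Hypothetical helper: map each "name" value to the "url" before it."""
    return {name: url for url, name in PAIR_RE.findall(text)}


links = urls_by_name(cdata_text)
switches_url = links.get("Switches")  # None if no entry is named "Switches"
print(switches_url)
```

Matching on the exact name "Switches" avoids picking up "Switch ports" by accident. If the real page orders the keys differently, the pattern would need adjusting.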
