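I'm trying to use BeautifulSoup to parse HTML, but whenever I hit a page with an inline script tag, BeautifulSoup encodes the script's content and never decodes it back.
This is the code I'm using: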
from bs4 import BeautifulSoup

if __name__ == '__main__':
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    soup = BeautifulSoup(htmlData)
    #... using BeautifulSoup ...
    print(soup.prettify())
I want this output:
<html>
<head>
<script type="text/javascript">
console.log("< < not able to write these & also these >> ");
</script>
</head>
<body>
<div>
start of div
</div>
</body>
</html>
But I get this output:
<html>
<head>
<script type="text/javascript">
console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; ");
</script>
</head>
<body>
<div>
start of div
</div>
</body>
</html>
Answer 0 (score: 1)
You could try lxml:
import lxml.html as LH

if __name__ == '__main__':
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    doc = LH.fromstring(htmlData)
    # encoding='unicode' returns a str rather than bytes, so print() shows the markup itself
    print(LH.tostring(doc, pretty_print=True, encoding='unicode'))
which yields
<html>
<head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head>
<body> <div> start of div </div> </body>
</html>
Answer 1 (score: -1)
You can do it like this:
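# Pairs of (character, entity). '&' comes last so that '&amp;lt;' is not decoded twice.
htmlCodes = (
    ('<', '&lt;'),
    ('>', '&gt;'),
    ('"', '&quot;'),
    ("'", '&#39;'),
    ('&', '&amp;'),
)
# str.replace returns a new string, so the result has to be kept
output = soup.prettify()
for char, code in htmlCodes:
    output = output.replace(code, char)
print(output)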
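Note that replacing entities across the whole string also decodes text outside the script, such as the contents of the div. A rough sketch of a narrower variant is below: it uses the standard library's html.unescape and only touches the text between <script> and </script>. The helper name unescape_script_bodies and the sample markup are made up for illustration, and a regular expression is a fragile way to find script elements, so treat this as a sketch rather than a general solution.

```python
import html
import re

# Captures the opening <script ...> tag, the script body, and the closing tag.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)", re.IGNORECASE | re.DOTALL)


def unescape_script_bodies(markup: str) -> str:
    """Decode HTML entities inside <script> bodies only (hypothetical helper)."""
    return SCRIPT_RE.sub(
        lambda m: m.group(1) + html.unescape(m.group(2)) + m.group(3),
        markup,
    )


# Sample serialized markup with entities both inside and outside the script.
encoded = (
    '<script type="text/javascript"> console.log("&lt; &amp; &gt;"); </script>'
    "<div> a &lt; b </div>"
)
decoded = unescape_script_bodies(encoded)
print(decoded)
```

In the question's code you would pass soup.prettify() to the helper instead of the sample string.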