I am trying to parse an HTML file (demo.html) to make all relative links absolute. Here is how I tried to do this in a Python script:
from bs4 import BeautifulSoup

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup(html_text)
for a in soup.findAll('a'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('link'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('script'):
    for x in a.attrs:
        if x == 'src':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
f = open("demo_result.html", "w")
f.write(soup.prettify().encode("utf-8"))
However, the output file demo_result.html contains many unexpected changes. For example,
<script type="text/javascript" src="/scripts/ddtabmenu.js" />
/***********************************************
* DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com)
* + Drop Down/ Overlapping Content-
* This notice MUST stay intact for legal use
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
***********************************************/
</script>
is changed to
<script src="http://www.esplanade.com.sg/scripts/ddtabmenu.js" type="text/javascript">
</script>
</head>
<body>
<p>
/***********************************************
* DD Tab Menu script- (c) Dynamic Drive DHTML code library (www.dynamicdrive.com)
* + Drop Down/ Overlapping Content-
* This notice MUST stay intact for legal use
* Visit Dynamic Drive at http://www.dynamicdrive.com/ for full source code
***********************************************/
Can someone tell me where I went wrong?
Thanks and best regards.
Answer 0 (score: 1)
import BeautifulSoup  # This is version 3, not version 4

f = open('demo.html', 'r')
html_text = f.read()
f.close()
soup = BeautifulSoup.BeautifulSoup(html_text)
print soup.contents
for a in soup.findAll('a'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('link'):
    for x in a.attrs:
        if x == 'href':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
for a in soup.findAll('script'):
    for x in a.attrs:
        if x == 'src':
            temp = a[x]
            a[x] = "http://www.esplanade.com.sg" + temp
f = open("demo_result.html", "w")
f.write(soup.prettify().encode("utf-8"))
Answer 1 (score: 0)
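Your HTML is a bit messy. You have already closed the script tag with the trailing /, and then you close it again with </script>:
<script type="text/javascript" src="/scripts/ddtabmenu.js" /></script>
This breaks the DOM. Just remove the / from the end of
<script type="text/javascript" src="/scripts/ddtabmenu.js" />
and it should parse as expected.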
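If editing the HTML by hand is not practical, the same fix could be applied to the text before it is handed to the parser. The following is only a rough sketch under assumptions: `open_script_tags` and `SELF_CLOSING_SCRIPT` are hypothetical names, and a regular expression like this only handles the simple case shown in the question (no `>` inside attribute values).

```python
import re

# Hypothetical helper pattern: a <script ...> opening tag that ends in "/>".
# Group 1 captures the attributes so they can be written back unchanged.
SELF_CLOSING_SCRIPT = re.compile(r"<script\b([^>]*?)\s*/>", re.IGNORECASE)


def open_script_tags(html_text):
    """Rewrite self-closing <script ... /> tags as plain opening tags."""
    return SELF_CLOSING_SCRIPT.sub(r"<script\1>", html_text)


if __name__ == "__main__":
    sample = '<script type="text/javascript" src="/scripts/ddtabmenu.js" />'
    print(open_script_tags(sample))
```

Ordinary `<script ...></script>` pairs do not match the pattern and are left untouched, so the existing closing `</script>` then pairs with the rewritten opening tag.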
Answer 2 (score: 0)
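As mentioned, falling back to BeautifulSoup 3 solves the problem. Also, prepending the URL like this causes trouble with HTML anchors and javascript references, so I changed the code: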
import re
import BeautifulSoup

with open("demo.html", "r") as file_h:
    soup = BeautifulSoup.BeautifulSoup(file_h.read())

url = "http://www.esplanade.com.sg/"
health_check = lambda x: bool(re.search("^(?!javascript:|http://)[/\w]", x))
replacer = lambda x: re.sub("^(%s)?/?" % url, url, x)
for soup_tag in soup.findAll(lambda x: x.name in ["a", "img", "link", "script"]):
    if(soup_tag.has_key("href") and health_check(soup_tag["href"])):
        soup_tag["href"] = replacer(soup_tag["href"])
    if(soup_tag.has_key("src") and health_check(soup_tag["src"])):
        soup_tag["src"] = replacer(soup_tag["src"])
with open("demo_result.html", "w") as file_h:
    file_h.write(soup.prettify().encode("utf-8"))
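As an alternative to the regular expressions above, the link rewriting step on its own could probably be handled by the standard library. This is a minimal sketch in Python 3, not the code from this answer: `make_absolute` and `BASE_URL` are hypothetical names, and it assumes the site root used above is the correct base for every link.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: the same site root used in the code above.
BASE_URL = "http://www.esplanade.com.sg/"


def make_absolute(link, base=BASE_URL):
    """Return an absolute URL for a relative link; leave other values alone."""
    # Fragment-only anchors ("#top") and links that already carry a scheme
    # ("http:", "javascript:", "mailto:") are returned unchanged.
    if not link or link.startswith("#") or urlsplit(link).scheme:
        return link
    # urljoin resolves the link against the base, so no manual handling
    # of leading slashes is needed.
    return urljoin(base, link)


if __name__ == "__main__":
    for href in ["/scripts/ddtabmenu.js", "#top", "javascript:void(0)"]:
        print(make_absolute(href))
```

Each `href` or `src` value found by the parser would be passed through `make_absolute` before being written back to the tag.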