Question

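I'm working on a web scraper that fetches a website, does some processing on the body text, and writes the result to a new HTML file. One feature would take every hyperlink in the HTML file and make it run a script instead, with the link passed as the script's input.

I want to go from this: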

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>

<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
 mercury poisoning
</a>

</body>

</html>

...to this:

<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>

<body>
<a onclick ='pythonScript(/wiki/Mercury_poisoning)' href="#" title="Mercury poisoning">
 mercury poisoning
</a>

</body>

</html>

I've done a lot of googling and read about jQuery and Ajax, but I don't know those tools, so I'd prefer to do this in Python. Can this be done with file I/O in Python?

Answer 1

You can do something like this with BeautifulSoup:

PS: You need to install BeautifulSoup first: pip install bs4

from bs4 import BeautifulSoup as bs


html = '''<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Scraper</title>
</head>

<body>
<a href="/wiki/Mercury_poisoning" title="Mercury poisoning">
 mercury poisoning
</a>

</body>

</html>
'''

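# Parse the document with Python's built-in HTML parser
soup = bs(html, 'html.parser')
# href=True skips <a> tags that have no href attribute
links = soup.find_all('a', href=True)
for link in links:
    actual_link = link['href']  # remember the original target
    link['href'] = '#'
    link['onclick'] = 'pythonScript({})'.format(actual_link)
print(soup)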

Output:

<html>
<head>
<meta charset="utf-8"/>
<title>Scraper</title>
</head>
<body>
<a href="#" onclick="pythonScript(/wiki/Mercury_poisoning)" title="Mercury poisoning">
 mercury poisoning
</a>
</body>
</html>
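
Note that in the output above the path is inserted without quotes, so a browser would not read it as a string argument. Below is a minimal sketch of one way to quote it, assuming the handler takes a single string. The helper name `onclick_call` is hypothetical (it is not part of BeautifulSoup), and `json.dumps` is used here only because it produces a quoted, escaped string literal:

```python
import json


def onclick_call(handler, target):
    """Build an onclick value such as pythonScript("/some/path").

    json.dumps wraps the target in double quotes and escapes any
    quotes or backslashes inside it.
    """
    return "{}({})".format(handler, json.dumps(target))


value = onclick_call("pythonScript", "/wiki/Mercury_poisoning")
print(value)
```

Inside the loop, this would replace the format call: link['onclick'] = onclick_call('pythonScript', actual_link).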

Bonus:

You can also write the result to a new HTML file like this:

# The page declares UTF-8, so write the file with the same encoding
with open('new_html_file.html', 'w', encoding='utf-8') as out:
    out.write(str(soup))
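
Since the question asks about file I/O, here is a rough end-to-end sketch that reads the HTML from a file instead of a string and writes the converted copy. The function name `convert_file` and the file name `scraped.html` are assumptions for illustration, and the demo uses a temporary directory so it can run anywhere. It keeps the same onclick format as the answer:

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def convert_file(src, dst):
    """Read src, rewrite its links, write the result to dst.

    Returns the number of links that were rewritten.
    """
    html = Path(src).read_text(encoding="utf-8")
    soup = BeautifulSoup(html, "html.parser")
    count = 0
    # href=True skips anchors that have no href attribute
    for link in soup.find_all("a", href=True):
        target = link["href"]
        link["href"] = "#"
        link["onclick"] = "pythonScript({})".format(target)
        count += 1
    Path(dst).write_text(str(soup), encoding="utf-8")
    return count


# Demo with a throwaway input file (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "scraped.html"
    dst = Path(tmp) / "new_html_file.html"
    src.write_text(
        '<a href="/wiki/Mercury_poisoning">mercury poisoning</a>',
        encoding="utf-8",
    )
    count = convert_file(src, dst)
    output = dst.read_text(encoding="utf-8")

print(count)
print(output)
```

For a real run, you would call convert_file with the path of your scraped page and the path of the file you want to create.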

Replace every "href=(link)" with "onclick=(PythonScript(link))"
