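Help! I'm new to programming and am learning it for a class. I'm trying to build a basic web crawler, but for some reason Beautiful Soup isn't recognizing links to .htm files. Can anyone help?
Here is the code: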
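import re, urllib.request
from bs4 import BeautifulSoup

# Ask the user for the starting page
print("Enter the URL you wish to crawl (include the 'http://'):")
myurl = input("@> ")
# Strip the trailing "/page.htm" to get the root used for relative links
root = re.sub(r"/\w+\.htm", "", myurl)
# Download the page and turn the response into a string
html = urllib.request.urlopen(myurl)
html = html.read()
html = str(html)
# Meant to collect every href that points at a .htm file
links = re.findall(r'href=\.+.htm">', html)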
Edited to add the commented code.
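for link in links:
    link = str(link)
    # Attempt to clean up the end of the matched string
    link = re.sub(".htm>'", ".htm$", link)
    link = re.sub("'$", "", link)
    # Treat anything that does not start with http or www as a relative link
    if not re.match("^(http|www)", link):
        link = root + "/" + str(link)
    # Fetch the linked page and extract its text with Beautiful Soup
    current_html = urllib.request.urlopen(link)
    current_html = current_html.read()
    current_soup = BeautifulSoup(current_html, "html.parser")
    current_clean_text = current_soup.getText()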
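
For comparison, here is a rough sketch of one direction I could try, not a confirmed fix: let Beautiful Soup find the anchors itself instead of matching href with a regex on the stringified bytes. The function name extract_htm_links, the sample HTML, and the example.com URLs are made up for illustration and are not part of my crawler. It assumes the links live in ordinary <a href="..."> tags.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_htm_links(html_text, base_url):
    """Return absolute URLs for every <a href> whose path ends in .htm."""
    soup = BeautifulSoup(html_text, "html.parser")
    found = []
    # href=True keeps only anchors that actually carry an href attribute
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"].strip()
        # Ignore any fragment or query string when checking the extension
        path = href.split("#", 1)[0].split("?", 1)[0]
        if path.lower().endswith(".htm"):
            # urljoin resolves relative links against the page's own URL
            found.append(urljoin(base_url, href))
    return found


# Hypothetical page content, used only to exercise the function offline
sample = (
    '<a href="page1.htm">One</a> '
    '<a href="http://example.com/b.htm">Two</a> '
    '<a href="img.png">Pic</a>'
)
print(extract_htm_links(sample, "http://example.com/dir/index.htm"))
```

In the crawler, html_text would be the result of urlopen(myurl).read() passed in directly (Beautiful Soup accepts bytes), and urljoin would stand in for the manual root + "/" + link step.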