Question

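Help! I'm new to programming and am learning it in a class. I'm trying to build a basic web crawler, but for some reason Beautiful Soup isn't picking up the links to .htm files. Can you help?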

Here is the code:

import re, urllib.request
from bs4 import BeautifulSoup

print("Enter the URL you wish to crawl (include the 'http://'):")
myurl = input("@> ")


root = re.sub("/\w+\.htm", "", myurl)


html = urllib.request.urlopen(myurl)    
html = html.read()  
html = str(html)
links = re.findall('href=\.+.htm">', html)
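A note on the pattern in the last line above: `\.+` asks for one or more literal dots immediately after `href=`, so it is unlikely to match ordinary markup such as `href="page1.htm"`. The sketch below compares that pattern with one possible alternative on a made-up HTML fragment. The sample string and the variable names `sample_html`, `original_matches`, and `fixed_matches` are assumptions for illustration, not part of the original code.

```python
import re

# Hypothetical page fragment, used only for illustration.
sample_html = '<a href="page1.htm">One</a> <a href="sub/page2.htm">Two</a>'

# Pattern from the question (written as a raw string): "\.+" requires
# literal dots right after "href=", but the next character is a quote.
original_matches = re.findall(r'href=\.+.htm">', sample_html)

# One possible alternative: capture everything between the quotes,
# requiring the value to end in a literal ".htm".
fixed_matches = re.findall(r'href="([^"]+\.htm)"', sample_html)

print(original_matches)  # []
print(fixed_matches)     # ['page1.htm', 'sub/page2.htm']
```

Separately, `str(html)` on the bytes returned by `read()` yields the `b'...'` representation rather than the page text; decoding the bytes, for example with `.decode("utf-8")` if the page is UTF-8, is probably closer to what was intended.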

Edited to add the code from the comments:

link = str(link) 
link = re.sub(".htm>'", ".htm$", link) 
link = re.sub("'$", "", link) 
if not re.match("^(http|www)", link): 
    link = root + "/" + str(link) 
    current_html = urllib.request.urlopen(link) 
    current_html = current_html.read() 
    current_soup = BeautifulSoup(current_html, "html.parser") 
    current_clean_text = current_soup.getText()
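Since the goal is to have Beautiful Soup find the .htm links, here is a minimal sketch of one way that might look without a regex: let Beautiful Soup collect the `<a>` tags and let `urllib.parse.urljoin` resolve relative links against the page URL. The helper name `htm_links`, the sample markup, and the `example.com` URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def htm_links(html, base_url):
    """Return absolute URLs for every <a href> that ends in .htm."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().endswith(".htm"):
            # urljoin leaves absolute URLs alone and resolves relative ones.
            links.append(urljoin(base_url, href))
    return links


# Hypothetical markup and base URL, for illustration only.
sample = (
    '<a href="page1.htm">One</a>'
    '<a href="http://example.com/other.htm">Two</a>'
    '<a href="style.css">CSS</a>'
)
result = htm_links(sample, "http://example.com/docs/index.htm")
print(result)
# ['http://example.com/docs/page1.htm', 'http://example.com/other.htm']
```

In the crawler, the `html` argument would be the bytes or text read from `urllib.request.urlopen(myurl)`, and `base_url` would be `myurl` itself, which may make the separate `root` computation unnecessary.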

Getting Beautiful Soup to read .htm files

0 answers: