Getting Beautiful Soup to read .htm files

Time: 2015-12-11 21:47:01

Tags: python html beautifulsoup html-parsing

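Help! I'm new to programming and am trying to learn it in a class. I'm trying to make a basic web crawler, but for some reason Beautiful Soup isn't recognizing the links to .htm files. Can you help?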

Here is the code:

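import re
import urllib.request
from bs4 import BeautifulSoup

print("Enter the URL you wish to crawl (include the 'http://'):")
myurl = input("@> ")

# Strip the "/page.htm" part to get the base used for relative links
root = re.sub(r"/\w+\.htm", "", myurl)

# Download the page and turn the response into a string
html = urllib.request.urlopen(myurl)
html = html.read()
html = str(html)
# Try to collect the links that point at .htm pages
links = re.findall(r'href=\.+.htm">', html)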
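A note on the step above: in the pattern passed to `re.findall`, `\.+` matches one or more literal dots, so it only matches when dots follow `href=` directly, which ordinary links do not have. Separately, calling `str()` on the bytes from `read()` produces the `b'...'` representation rather than decoded text. A minimal sketch of letting Beautiful Soup find the links instead, assuming the goal is every `<a href>` ending in `.htm`; the function name `extract_htm_links`, the sample HTML, and the `example.com` URLs are made up for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_htm_links(html, base_url):
    """Return absolute URLs for every <a href> that points at a .htm page.

    `html` may be the bytes returned by urlopen(...).read() or a str;
    BeautifulSoup accepts either, so no str() conversion is needed.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.lower().endswith(".htm"):
            # urljoin resolves relative links and leaves absolute ones alone
            links.append(urljoin(base_url, href))
    return links


# Hypothetical page content, standing in for a downloaded page
sample_html = """
<html><body>
<a href="page1.htm">One</a>
<a href="http://example.com/other/page2.htm">Two</a>
<a href="style.css">Not a page</a>
</body></html>
"""

found = extract_htm_links(sample_html, "http://example.com/site/index.htm")
print(found)
```

Using `urljoin` here stands in for building `root` with `re.sub`; it is a swapped-in technique, not what the original code does.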

Edited to add the commented code:

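# Presumably runs once per entry in `links`; the loop header is not shown in the post
link = str(link)
link = re.sub(".htm>'", ".htm$", link)
link = re.sub("'$", "", link)
# Treat anything that does not start with http or www as a relative link
if not re.match("^(http|www)", link):
    link = root + "/" + str(link)
    # Fetch the linked page and extract its text
    current_html = urllib.request.urlopen(link)
    current_html = current_html.read()
    current_soup = BeautifulSoup(current_html, "html.parser")
    current_clean_text = current_soup.getText()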
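The text-extraction step at the end of the snippet above can be tried without any network access. This is a rough sketch, assuming the page content is already in hand as bytes; `page_text` and the sample markup are hypothetical names used only for this example:

```python
from bs4 import BeautifulSoup


def page_text(raw_html):
    """Return the visible text of a page, with whitespace tidied up."""
    soup = BeautifulSoup(raw_html, "html.parser")
    # separator/strip join the text fragments with single spaces
    return soup.get_text(separator=" ", strip=True)


# Hypothetical bytes, standing in for urllib.request.urlopen(link).read()
sample_page = b"<html><body><p>Hello</p><p>World</p></body></html>"
text = page_text(sample_page)
print(text)
```

`get_text` is the current spelling of the `getText` method used in the question; both are available in Beautiful Soup 4.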

0 answers:

No answers