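I have tried everything... I scattered print statements through the code, debugged it ten times, and triple-checked the built-in methods, yet the .crawl() method still doesn't remove any object from final_list.
The goal of my assignment is to build two classes:
Web_page: holds the data of a web page. (The pages are saved as HTML files in a folder on the desktop.)
Crawler: compares the pages and saves a list of the unique pages --> final_list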
import re
import os

def remove_html_tags(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c
    return out

def lev(s1, s2):
    return lev_iter(s1, s2, dict())

def lev_iter(s1, s2, mem):
    (i,j) = (len(s1), len(s2))
    if (i,j) in mem:
        return mem[(i,j)]
    s1_low = s1.lower()
    s2_low = s2.lower()
    if len(s1_low) == 0 or len(s2_low) == 0:
        return max(len(s1_low), len(s2_low))
    d1 = lev_iter(s1_low[:-1], s2_low, mem) + 1
    d2 = lev_iter(s1_low, s2_low[:-1], mem) + 1
    last = 0 if s1_low[-1] == s2_low[-1] else 1
    d3 = lev_iter(s1_low[:-1], s2_low[:-1], mem) + last
    result = min(d1, d2, d3)
    mem[(i,j)] = result
    return result

def merge_spaces(content):
    return re.sub('\s+', ' ', content).strip()
""" A Class that holds data on a Web page """
class WebPage:
    def __init__(self, filename):
        self.filename = filename

    def process(self):
        f = open(self.filename,'r')
        LINE_lst = f.readlines()
        self.info = {}
        for i in range(len(LINE_lst)):
            LINE_lst[i] = LINE_lst[i].strip(' \n\t')
            LINE_lst[i] = remove_html_tags(LINE_lst[i])
        lines = LINE_lst[:]
        for line in lines:
            if len(line) == 0:
                LINE_lst.remove(line)
        self.body = ' '.join(LINE_lst[1:])
        self.title = LINE_lst[0]
        f.close()

    def __str__(self):
        return self.title + '\n' + self.body

    def __repr__(self):
        return self.title

    def __eq__(self,other):
        n = lev(self.body,other.body)
        k = len(self.body)
        m = len(other.body)
        return float(n)/max(k,m) <= 0.15

    def __lt__(self,other):
        return self.title < other.title
""" A Class that crawls the web """
class Crawler:
    def __init__(self, directory):
        self.folder = directory

    def crawl(self):
        pages = [f for f in os.listdir(self.folder) if f.endswith('.html')]
        final_list = []
        for i in range(len(pages)):
            pages[i] = WebPage(self.folder + '\\' + pages[i])
            pages[i].process()
            for k in range(len(final_list)+1):
                if k == len(final_list):
                    final_list.append(pages[i])
                elif pages[i] == final_list[k]:
                    if pages[i] < final_list[k]:
                        final_list.append(pages[i])
                        final_list.remove(final_list[k])
                    break
        print final_list
        self.pages = final_list
Everything works except for this one odd line, final_list.remove(final_list[k]). Please help: what is wrong?
Answer 0 (score: 0)
I'm not sure why your code doesn't work, and it is hard for me to test it because I don't know what kind of input should end up calling remove().
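I suggest following these steps:
1. Make sure remove() is actually being called.
2. remove() relies on your __eq__() method to find the item to remove, so make sure __eq__() is not the culprit.
As a side note, you may want to replace this: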
self.folder + '\\' + pages[i]
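with:
import os.path
# ...
os.path.join(self.folder, pages[i])
This simple change should make your script work on every operating system rather than only on Windows. (GNU/Linux, Mac OS and other Unix-like operating systems use "/" as the path separator.)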
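Coming back to the point about remove() and __eq__(): list.remove(x) deletes the first element that compares equal to x, which is not necessarily the object you passed in. With a fuzzy __eq__() that can be a different page. Here is a minimal sketch of the effect. Page is a hypothetical stand-in for your WebPage, and the length-based comparison is only an assumption to keep the example short (your real __eq__() uses the edit distance):

```python
class Page:
    """Hypothetical stand-in for WebPage with a fuzzy equality test."""

    def __init__(self, title, body):
        self.title = title
        self.body = body

    def __eq__(self, other):
        # Assumption for illustration: "equal" when body lengths differ by at most 1.
        return abs(len(self.body) - len(other.body)) <= 1

    def __repr__(self):
        return self.title


a = Page("a", "xx")      # length 2
b = Page("b", "xxx")     # length 3, compares equal to a and to c
c = Page("c", "xxxx")    # length 4, compares equal to b but not to a

# remove() scans from the front and deletes the first item that == c.
by_value = [a, b, c]
by_value.remove(c)

# del with an index deletes exactly that position and never calls __eq__().
by_index = [a, b, c]
del by_index[2]

print(by_value)
print(by_index)
```

So if you already know the index k, del final_list[k] (or final_list.pop(k)) removes exactly that element. I can't say for sure that this is what bites you, but it is worth ruling out.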
Also consider replacing loops of this form:
for i in range(len(sequence)):
    # Do something with sequence[i]
with:
for item in sequence:
    # Do something with item
If you need the item's index, use enumerate():
for i, item in enumerate(sequence):