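What you see here is a simple sitemap generator. The problem is that whenever a link is a duplicate, instead of doing nothing, the script seems to glue the duplicate onto the next URL. For example, it prints http://www.apple.comhttp://www.apple.com/sitemap. Any tips, references, or the like would be much appreciated.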
from time import sleep
from urllib.request import urlopen
allurl=["http://www.apple.com"]
url="http://www.apple.com"
toturl=[]
prinst=True
print("Urllib loaded")
for df in allurl:
    toturl=[]
    try:
        r = str(urlopen(url).read())
    except:
        pass
    for zr in range(0,len(r)-1):
        if r[zr]=="h" and r[zr+1]=="r" and r[zr+2]=="e" and r[zr+3]=="f" and r[zr+4]=="=" and r[zr+6]=="h":
            for y in range(6,100000):
                if r[zr+y]=='"':
                    break
                else:
                    toturl.append(r[zr+y])
            if "".join(toturl) not in allurl: #Conditional Being Ignored, so to speak.
                print("".join(toturl))
                allurl.append("".join(toturl))
                toturl=[]
    url=df
    print("\n")
Answer 0 (score: 0)
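You do not reset toturl when the link is a duplicate, so the characters left over from the duplicate end up in front of the next URL. Always initialize variables right before they are first used, not after you are done with them.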
from time import sleep
from urllib.request import urlopen
allurl = ["http://www.apple.com"]
url = "http://www.apple.com"
prinst = True
print("Urllib loaded")
for df in allurl:  # allurl grows while we iterate, so newly found links are visited too
    try:
        r = str(urlopen(url).read())
    except Exception:
        print("some exception occurred!")
    else:
        for zr in range(0,len(r)-1):
            if r[zr:zr+5]=="href=" and r[zr+6]=="h":
                toturl = []  # start a fresh buffer for every link, duplicate or not
                for y in range(6,100000):
                    if r[zr+y] == '"':
                        break
                    else:
                        toturl.append(r[zr+y])
                toturl = "".join(toturl)
                if toturl not in allurl:
                    print(toturl)
                    allurl.append(toturl)
    url = df
    print("\n")
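If scanning the page character by character becomes hard to maintain, one alternative is to let the standard library's html.parser pull out the href values and to keep a set for the duplicate check. The following is only a minimal sketch and not part of the original answer: the names LinkCollector, extract_links and sample are hypothetical, and it assumes that any tag carrying an href starting with "http" counts as a link, which roughly mirrors the filter used above. It works on an HTML string, so it can be tried without any network access.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values that start with "http", skipping duplicates."""

    def __init__(self):
        super().__init__()
        self.links = []     # unique links in first-seen order
        self._seen = set()  # fast membership test for duplicates

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        for name, value in attrs:
            if name == "href" and value and value.startswith("http"):
                if value not in self._seen:
                    self._seen.add(value)
                    self.links.append(value)


def extract_links(html):
    """Return the unique absolute links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Hypothetical sample page with one duplicated link and one relative link.
sample = (
    '<a href="http://www.apple.com">Home</a>'
    '<a href="http://www.apple.com">Home again</a>'
    '<a href="/relative">Relative link, ignored</a>'
    '<a href="http://www.apple.com/sitemap">Sitemap</a>'
)
print(extract_links(sample))
```

Because each link is built by the parser as a separate string, there is no shared buffer that could carry characters from a duplicate over to the next URL. In the crawler, the text returned by urlopen(...).read() would need to be decoded to str before being passed to extract_links.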