I have a log file with the following contents:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
http://www.downloadray.com/windows/Photos_and_Images/Graphic_Capture/
http://www.downloadray.com/windows/Photos_and_Images/Digital_Photo_Tools/
And I have this code:
from bs4 import BeautifulSoup
import urllib
import urlparse

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    i = 1
    while 1:
        url = line+"?page=%d" % i
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        has_more = 1
        for a in soup.select("div.n_head2 a[href]"):
            try:
                print (a["href"])
                g.write(a["href"]+"\n")
            except:
                print "no link"
        if has_more:
            i += 1
        else:
            break
This code raises no error, but it doesn't work either. I have tried modifying it but couldn't fix it. However, when I try this code, it works fine:
from bs4 import BeautifulSoup
import urllib
import urlparse

g = open("downloadray3.txt", "w")
url = "http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/"
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
i = 1
while 1:
    url1 = url+"?page=%d" % i
    pageHtml = urllib.urlopen(url1)
    soup = BeautifulSoup(pageHtml)
    has_more = 2
    for a in soup.select("div.n_head2 a[href]"):
        try:
            print (a["href"])
            g.write(a["href"]+"\n")
        except:
            print "no link"
    if has_more:
        i += 1
    else:
        break
So how can I make it read the links from the log text file? Fetching the links one by one is tedious.
Answer 0 (score: 1)
Have you stripped the newline from the end of each line?
for line in f.readlines():
    line = line.strip()
readlines() produces a list of the lines read from the file, including the trailing newline (\n) character.
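As a minimal sketch of the difference (reading the same downloadray2.txt as above; repr makes the otherwise invisible newline explicit):

f = open("downloadray2.txt")
for line in f.readlines():
    print repr(line)          # e.g. 'http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/\n'
    print repr(line.strip())  # the same string with the trailing '\n' removed
f.close()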
As evidence, here is the output of printing the url variable (right after the url = line+"?page=%d" % i line):
With your original code:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=1
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=2
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/
?page=3
With the fix I suggested:
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=1
http://www.downloadray.com/TIFF-to-JPG_download/
http://www.downloadray.com/Moo0-Image-Thumbnailer_download/
http://www.downloadray.com/Moo0-Image-Sizer_download/
http://www.downloadray.com/Advanced-Image-Viewer-and-Converter_download/
http://www.downloadray.com/GandMIC_download/
http://www.downloadray.com/SendTo-Convert_download/
http://www.downloadray.com/PNG-To-JPG-Converter-Software_download/
http://www.downloadray.com/Graphics-Converter-Pro_download/
http://www.downloadray.com/PICtoC_download/
http://www.downloadray.com/Free-Images-Converter_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=2
http://www.downloadray.com/VarieDrop_download/
http://www.downloadray.com/Tinuous_download/
http://www.downloadray.com/Acme-CAD-Converter_download/
http://www.downloadray.com/AAOImageConverterandFTP_download/
http://www.downloadray.com/ImageCool-Converter_download/
http://www.downloadray.com/GeoJpeg_download/
http://www.downloadray.com/Android-Resizer-Tool_download/
http://www.downloadray.com/Scarab-Darkroom_download/
http://www.downloadray.com/Jpeg-Resizer_download/
http://www.downloadray.com/TIFF2PDF_download/
http://www.downloadray.com/windows/Photos_and_Images/Image_Convertors/?page=3
http://www.downloadray.com/JGraphite_download/
http://www.downloadray.com/Easy-PNG-to-Icon-Converter_download/
http://www.downloadray.com/JBatch-It!_download/
http://www.downloadray.com/Batch-It!-Pro_download/
http://www.downloadray.com/Batch-It!-Ultra_download/
http://www.downloadray.com/Image-to-Ico-Converter_download/
http://www.downloadray.com/PSD-To-PNG-Converter-Software_download/
http://www.downloadray.com/VectorNow_download/
http://www.downloadray.com/KeitiklImages_download/
http://www.downloadray.com/STOIK-Smart-Resizer_download/
Update:
Even then, this code will not run as expected, because the while loop never terminates: the has_more variable is never changed.
I first suggested replacing the loop body with the following (struck out in the original answer, for the reason explained below):

list_of_links = soup.select("div.n_head2 a[href]")
if len(list_of_links)==0:
    break
else:
    for a in soup.select("div.n_head2 a[href]"):
        print (a["href"])
        g.write(a["href"]+"\n")
    i += 1
It turns out that if you query past the last page, the site simply serves the last available page again. So if the highest page number is 82 and you request page 83, page 82 is shown. To detect this, you can save the previous page's list of URLs and compare it with the current page's list.
Here is the full code (tested):
from bs4 import BeautifulSoup
import urllib
import urlparse

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    line = line.strip()  # drop the trailing newline
    i = 1
    prev_urls = []
    while 1:
        url = line+"?page=%d" % i
        print 'Examining %s' % url
        pageHtml = urllib.urlopen(url)
        soup = BeautifulSoup(pageHtml)
        list_of_urls = soup.select("div.n_head2 a[href]")
        if set(prev_urls)==set(list_of_urls):
            # same links as the previous page: we have run past the last page
            break
        else:
            for a in soup.select("div.n_head2 a[href]"):
                print (a["href"])
                g.write(a["href"]+"\n")
            i += 1
            prev_urls = list_of_urls
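One caveat about the code above: set(prev_urls)==set(list_of_urls) compares BeautifulSoup Tag objects, and depending on the BeautifulSoup version, hashing Tags may fall back to object identity, in which case two separate parses of the same page would never compare equal. Comparing the extracted href strings sidesteps that ambiguity. A minimal variant under that assumption (same page structure, same Python 2 idioms as above):

from bs4 import BeautifulSoup
import urllib

f = open("downloadray2.txt")
g = open("downloadray3.txt", "w")
for line in f.readlines():
    line = line.strip()
    i = 1
    prev_hrefs = []
    while 1:
        url = line + "?page=%d" % i
        soup = BeautifulSoup(urllib.urlopen(url))
        # compare plain href strings instead of Tag objects
        hrefs = [a["href"] for a in soup.select("div.n_head2 a[href]")]
        if hrefs == prev_hrefs:
            break  # the site served the same page again: past the last page
        for href in hrefs:
            print href
            g.write(href + "\n")
        i += 1
        prev_hrefs = hrefs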