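I am trying to extract content from a list of URLs and store it in text files. The problem is that my Python code only picks up the last URL read from the text file and stores only that page's content. I am using the Goose extractor to pull the text out of each URL.
Can anyone help me solve this? (Is there a problem with the for loop here?)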
    class FetchUrl(Thread):
        def __init__(self, url, name):
            Thread.__init__(self)
            self.name = name
            self.url = url

        def run(self):
            config = Configuration()
            config.browser_user_agent = 'Mozilla 5.0'
            config.http_timeout = 20
            g = Goose(config)
            fname = os.path.basename(self.name)
            with open(fname + ".txt", "w+") as f_handler:
                for tmp in url:
                    article = g.extract(url=tmp)
                    contents = article.cleaned_text
                f_handler.write(contents)
            msg = "%s was finished downloaded with this link %s!" % (self.name,
                                                                       self.url)
            print(msg)

    def main(url):
        for item, url in enumerate(url):
            name = "Thread %s" % (item + 1)
            fetch = FetchUrl(url, name)
            fetch.start()

    if __name__ == "__main__":
        u_path = 'url_list/url.txt'
        url = []
        for line in open(u_path):
            line = line.strip()
            url.append(line)
            print(line)
        main(url)
Answer 0 (score: 0)
Your contents variable is overwritten on every iteration, so by the time the for tmp in url: loop exits, contents holds only the text of the last URL.
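To make the overwriting visible without Goose or any network access, here is a small stand-alone sketch. The function names last_only and all_items are made up for this illustration, and str.upper() merely stands in for the extraction step.

```python
def last_only(items):
    """Mimics the bug: the result is used only after the loop has ended."""
    for item in items:
        contents = item.upper()  # rebound on every pass
    return [contents]  # only the value from the final pass survives


def all_items(items):
    """Uses each value inside the loop, before it is rebound."""
    collected = []
    for item in items:
        contents = item.upper()
        collected.append(contents)  # consumed while still current
    return collected


print(last_only(["a", "b", "c"]))  # ['C']
print(all_items(["a", "b", "c"]))  # ['A', 'B', 'C']
```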
Try something like this:
    # open file in write mode
        # loop over urls
            # extract url contents
            # clean it
            # write to file
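A minimal sketch of that outline follows. It assumes the goose3 package, since the question does not say which Goose release is installed; with the older python-goose package the import paths would differ. The names save_articles, make_goose_extractor, and extract_text are hypothetical, introduced only for this example. The extractor is passed in as a callable so the writing loop can be tried without network access, and the sketch runs sequentially rather than starting one thread per URL.

```python
def save_articles(urls, out_path, extract_text):
    """Write the text of every URL in `urls` to the file at `out_path`.

    `extract_text` is any callable that maps a URL to a string.
    """
    # open file in write mode
    with open(out_path, "w", encoding="utf-8") as f_handler:
        # loop over urls
        for link in urls:
            # extract url contents (already cleaned by the extractor)
            contents = extract_text(link)
            # write to file inside the loop, before `contents` is rebound
            f_handler.write(contents + "\n")


def make_goose_extractor():
    """Build an extractor backed by Goose (assumes goose3 is installed)."""
    # Imported lazily so save_articles works even without goose3.
    from goose3 import Goose
    from goose3.configuration import Configuration

    config = Configuration()
    config.browser_user_agent = "Mozilla 5.0"
    config.http_timeout = 20
    g = Goose(config)
    return lambda link: g.extract(url=link).cleaned_text
```

With the url list from the question, the call would be save_articles(url, "articles.txt", make_goose_extractor()), where "articles.txt" is just a placeholder file name. Also note that inside run() in the question, the bare name url resolves to the module-level list rather than self.url, so every thread walks the whole list; if each thread is meant to fetch a single page, it should extract self.url and write that result to its own file.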