Extracting text from multiple URLs

Time: 2019-08-03 08:58:49

Tags: python url text

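I can extract content from a list of URLs and store it in a text file. The problem is that my Python code only seems to read the last URL in the text file, and only that URL's content gets stored. I am using the Goose extractor to pull the text from each URL.

Can someone help me solve this? Is there a problem with my for loop?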

import os
from threading import Thread

# Assumes the goose3 package; with the older python-goose, import from `goose` instead.
from goose3 import Goose
from goose3.configuration import Configuration


class FetchUrl(Thread):
    def __init__(self, url, name):
        Thread.__init__(self)
        self.name = name
        self.url = url

    def run(self):
        config = Configuration()
        config.browser_user_agent = 'Mozilla 5.0'
        config.http_timeout = 20
        g = Goose(config)
        fname = os.path.basename(self.name)
        with open(fname + ".txt", "w+") as f_handler:
            # `url` here is the module-level list, not self.url
            for tmp in url:
                article = g.extract(url=tmp)
                contents = article.cleaned_text
                f_handler.write(contents)
        msg = "%s was finished downloaded with this link %s!" % (self.name, self.url)
        print(msg)


def main(url):
    # start one thread per URL in the list
    for item, url in enumerate(url):
        name = "Thread %s" % (item + 1)
        fetch = FetchUrl(url, name)
        fetch.start()


if __name__ == "__main__":
    u_path = 'url_list/url.txt'
    url = []
    for line in open(u_path):
        line = line.strip()
        url.append(line)
        print(line)
    main(url)

1 answer:

Answer 0 (score: 0)

Your contents variable is overwritten on every pass, so by the time the for tmp in url: loop exits, it holds only the content of the last URL. Try something like this:

# open file in write mode
    # loop over urls
        # extract url contents
        # clean it
        # write to file
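
The outline above could be filled in roughly as follows. This is a minimal sketch, not a drop-in fix: it leaves out the threads from the question and writes everything into a single file that is opened once, before the loop. The helper names (read_urls, save_articles, make_goose_extractor) are hypothetical, and the Goose import assumes the goose3 package; only g.extract(url=...) and article.cleaned_text are taken from the question.

```python
from typing import Callable, Iterable, List


def read_urls(path: str) -> List[str]:
    """Return the stripped, non-empty lines of a URL list file."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def save_articles(urls: Iterable[str],
                  extract_text: Callable[[str], str],
                  out_path: str) -> int:
    """Write the text of every URL to out_path and return the URL count."""
    written = 0
    # Open the file once, in write mode, before looping over the URLs.
    with open(out_path, "w", encoding="utf-8") as out:
        for url in urls:
            text = extract_text(url)   # extract and clean one URL
            out.write(text + "\n\n")   # write inside the loop, once per URL
            written += 1
    return written


def make_goose_extractor() -> Callable[[str], str]:
    """Build an extractor backed by Goose (assumption: goose3 is installed)."""
    # Imported here so the helpers above work without goose3.
    from goose3 import Goose

    g = Goose({"browser_user_agent": "Mozilla 5.0", "http_timeout": 20})

    def extract_text(url: str) -> str:
        # Same calls as in the question: extract, then take cleaned_text.
        return g.extract(url=url).cleaned_text

    return extract_text
```

With the question's file layout, the call would look like save_articles(read_urls('url_list/url.txt'), make_goose_extractor(), 'articles.txt'), where articles.txt is an arbitrary output name. Passing the extractor in as a function keeps the file-writing loop testable without network access.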