Scrape HTML in multiple threads and safely save to one file

Time: 2017-07-19 13:56:34

Tags: python multithreading web-scraping

I want to scrape the title from each of a list of given URLs in multiple threads (for example, 5 threads) and save them to one text file. How do I do that, and how do I make sure the output is written to the file safely?

Here is my code:

import csv
import requests
requests.packages.urllib3.disable_warnings()

urls = []

with open('Input.csv') as csvDataFile:
    csvReader = csv.reader(csvDataFile)
    for row in csvReader:
        urls.append(row[1])

def find_between( s, first, last ):
    try:
        start = s.index( first ) + len( first )
        end = s.index( last, start )
        return s[start:end]
    except ValueError:
        return ""

def get_title(url):
    try:
        r = requests.get(url)
        # r.text is already a decoded string; calling .encode() on it
        # would produce bytes and break the string search below
        title = find_between(r.text, "<title>", "</title>")
        return title
    except requests.RequestException:
        return ""

# open the file once instead of reopening it for every URL
with open('myfile.txt', 'a') as f:
    for url in urls:
        f.write(get_title(url) + '\n')

1 answer:

Answer 0 (score: 1)

Try using futures:
1. Create a pool
2. Submit the function and its arguments
3. Get the results from the futures

import csv
from concurrent import futures

# 1. Create a pool of 5 worker threads
pool = futures.ThreadPoolExecutor(5)
# 2. Submit the function and its argument for each URL
workers = [pool.submit(get_title, url) for url in urls]
# 3. Wait for all futures to finish, then collect the results
futures.wait(workers)
with open('output.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerows([[worker.result()] for worker in workers])