Cannot read/write a file with multithreading in Python

Date: 2019-03-01 10:18:26

Tags: python python-3.x multithreading python-multiprocessing python-multithreading

I have an input file that contains a long list of URLs. Let's assume it is mylines.txt:

https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com

What I need to do is:

1) Read a line from the input file mylines.txt

2) Execute the myFunc function. It performs some task and produces an output consisting of one line. In my real code it is more complex, but conceptually that is the idea.

3) Write the output to the results.txt file

Since my input is huge, I need to take advantage of Python multithreading. I looked at this nice post here, but unfortunately it assumes the input is in a simple list and does not cover writing the function's output to a file.

I need to make sure that the output for each input is written on a single line (i.e., if multiple threads were writing to the same line I would get corrupted data, which would be dangerous).
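For illustration only, a minimal sketch of one common way to get such line-atomic writes is to guard the output file with a threading.Lock; the names here (write_lock, safe_write, results.txt) are assumptions for the example, not from my real code:

import threading

write_lock = threading.Lock()        # shared by all worker threads
results = open("results.txt", "a")   # one shared file handle, append mode

def safe_write(line):
    # only one thread can be inside this block at a time, so every
    # result is written as one complete line with no interleaving
    with write_lock:
        results.write(line + "\n")
        results.flush()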

I tried to tinker around, but without success. I have not used Python multithreading before, but now is the time to learn, since it is unavoidable for me: my list is very long and cannot be finished in a reasonable time without multithreading. My real function does not perform this simple task but more operations that are not needed for the concept.

Here is my attempt. Please correct me (in the code itself):

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import Queue

def myFunc(url):
        response = requests.get(url, verify=False ,timeout=(2, 5))
        results = open("myresults","a") # "a" to append results
        results.write("url is:",url, ", response is:", response.url)
        results.close()

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = Queue.Queue()

for url in worker_data:
    q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

# close the pool and wait for the work to finish
pool.close()
pool.join()

Q: How can I fix the code above (please be concise and help me within the code itself) so that it reads a line from the input file, executes the function, and writes the result associated with that input on a single line, performing the requests concurrently with Python multithreading so that I can finish my list in a reasonable time?

Update:

Based on the answer, the code became:

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False ,timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')

mylines.txt contains:

https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com

Note that at first I used:

import Queue

and: q = Queue.Queue(4)

but I got this error:

Traceback (most recent call last):
  File "test3.py", line 4, in <module>
    import Queue
ModuleNotFoundError: No module named 'Queue'

Based on some searching, I changed it to:

import queue

and for the relevant line: q = queue.Queue(4)

I also added:

from multiprocessing import Queue

But nothing worked. Can a Python multithreading expert help?
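(Side note, not from the original post: the module was renamed between Python versions, which is what the ModuleNotFoundError above points at.)

# Python 2: import Queue;  q = Queue.Queue(4)
# Python 3: import queue;  q = queue.Queue(4)
import queue
q = queue.Queue(4)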

2 Answers:

Answer 0 (score: 2)

You should change your function to return a string:

def myFunc(url):
    response = requests.get(url, verify=False ,timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

and write these strings to the file later:

results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')

This lets the multithreading work for requests.get, but serializes writing the results to the output file.
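If you want results written as each request finishes instead of after pool.map has collected everything, a variant is Pool.imap_unordered. The sketch below is mine, not part of the original answer, and it reads the URLs into a plain list first because map/imap need an iterable:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

with open("mylines.txt", "r") as f:
    urls = [line.strip() for line in f]   # plain list, which pool.map/imap can iterate

pool = ThreadPool(4)
with open("myresults", "w") as out:
    # imap_unordered yields each result as soon as its worker finishes,
    # so lines are written in completion order rather than input order
    for line in pool.imap_unordered(myFunc, urls):
        out.write(line + "\n")
pool.close()
pool.join()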

Update

You should also use with to read your input file:

#load up a queue with your data, this will handle locking
q = Queue.Queue()

with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)
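Putting the snippets of this answer together, a complete sketch could look like the following. One caveat that is my addition rather than part of the original answer: pool.map expects an iterable, and a queue.Queue is not iterable, so this sketch collects the URLs into a plain list instead of the Queue shown above:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

with open("mylines.txt", "r") as f:       # open my input file
    urls = [line.strip() for line in f]

pool = ThreadPool(4)                      # 4 concurrent worker threads
results = pool.map(myFunc, urls)          # runs the requests concurrently
pool.close()
pool.join()

with open("myresults", "w") as f:         # write all results from the main thread
    for line in results:
        f.write(line + "\n")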

Answer 1 (score: 1)

Rather than having the worker pool threads print the results out (which does not guarantee the output is buffered correctly), create one more thread that reads results from a second Queue and prints them.

I have modified your solution so that it builds its own pool of worker threads. There is little point in giving the queue unlimited length, since the main thread blocks when the queue reaches its maximum size: you only need it long enough to ensure there is always work available for the worker threads; the main thread will block and unblock as the queue size grows and shrinks.

It also identifies the thread responsible for each item on the output queue, which should give you confidence that the multithreading approach is working, and it prints the response code from the server. I found that I had to strip the newlines from the URLs.

Since only one thread is now writing to the file, the writes are always perfectly in sequence and there is no chance of them interfering with one another.

import threading
import requests
import queue
POOL_SIZE = 4

def myFunc(inq, outq):  # worker thread deals only with queues
    while True:
        url = inq.get()  # Blocks until something available
        if url is None:
            break
        response = requests.get(url.strip(), timeout=(2, 5))
        outq.put((url, response, threading.currentThread().name))


class Writer(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.results = open("myresults","a") # "a" to append results
        self.queue = q
    def run(self):
        while True:
            url, response, threadname = self.queue.get()
            if response is None:
                self.results.close()
                break
            print("****url is:",url, ", response is:", response.status_code, response.url, "thread", threadname, file=self.results)

#load up a queue with your data, this will handle locking
inq = queue.Queue()  # could usefully limit queue size here
outq = queue.Queue()

# start the Writer
writer = Writer(outq)
writer.start()

# make the Pool of workers
threads = []
for i in range(POOL_SIZE):
    thread = threading.Thread(target=myFunc, name=f"worker{i}", args=(inq, outq))
    thread.start()
    threads.append(thread)

# push the work onto the queues
with open("mylines.txt","r") as worker_data: # open my input file.
    for url in worker_data:
        inq.put(url.strip())
for thread in threads:
    inq.put(None)

# close the pool and wait for the workers to finish
for thread in threads:
    thread.join()

# Terminate the writer
outq.put((None, None, None))
writer.join()

With the data given in mylines.txt, I see the following output:

****url is: https://www.google.com , response is: 200 https://www.google.com/ thread worker1
****url is: https://twitter.com , response is: 200 https://twitter.com/ thread worker2
****url is: https://facebook.com , response is: 200 https://www.facebook.com/ thread worker0
****url is: https://www.censys.io , response is: 200 https://censys.io/ thread worker1
****url is: https://yahoo.com , response is: 200 https://uk.yahoo.com/?p=us thread worker3
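One usage note on the sketch above (my addition, following its "could usefully limit queue size here" comment): with a very long input file the input queue can be bounded, so the main thread never holds more than a handful of URLs in memory and simply blocks on inq.put() until a worker frees a slot. The factor of two below is an arbitrary choice:

# bounded input queue: at most POOL_SIZE * 2 URLs pending at any time
inq = queue.Queue(maxsize=POOL_SIZE * 2)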