Cannot read/write a file with multithreading in Python

Date: 2019-03-01 10:18:26

Tags: python python-3.x multithreading python-multiprocessing python-multithreading

I have an input file that contains a long list of URLs. Let's assume it is mylines.txt:

https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com

What I need to do is:

1) Read a line from the input file mylines.txt

2) Execute the myFunc function. It performs some task and produces an output consisting of one line. In my real code it is more complex, but conceptually that is the idea.

3) Write the output to the results.txt file

Since my input is huge, I need to take advantage of Python multithreading. I looked at this nice post here, but unfortunately it assumes the input is in a simple list and does not cover writing the function's output to a file.

I need to make sure that the output for each input is written on a single line (i.e., if multiple threads were writing to the same line I would get corrupted data, which would be dangerous).
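For illustration only, a minimal sketch of one common way to get such line-atomic writes is to guard the output file with a threading.Lock; the names here (write_lock, safe_write, results.txt) are assumptions for the example, not from my real code:

import threading

write_lock = threading.Lock()        # shared by all worker threads
results = open("results.txt", "a")   # one shared file handle, append mode

def safe_write(line):
    # only one thread can be inside this block at a time, so every
    # result is written as one complete line with no interleaving
    with write_lock:
        results.write(line + "\n")
        results.flush()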

I tried to tinker around, but without success. I have not used Python multithreading before, but now is the time to learn, since it is unavoidable for me: my list is very long and cannot be finished in a reasonable time without multithreading. My real function does not perform this simple task but more operations that are not needed for the concept.

Here is my attempt. Please correct me (in the code itself):

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import Queue

def myFunc(url):
        response = requests.get(url, verify=False ,timeout=(2, 5))
        results = open("myresults","a") # "a" to append results
        results.write("url is:",url, ", response is:", response.url)
        results.close()

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = Queue.Queue()

for url in worker_data:
    q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

# close the pool and wait for the work to finish
pool.close()
pool.join()

Q: How can I fix the code above (please be concise and help me within the code itself) so that it reads a line from the input file, executes the function, and writes the result associated with that input on a single line, performing the requests concurrently with Python multithreading so that I can finish my list in a reasonable time?

Update:

Based on the answer, the code became:

import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False ,timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')

mylines.txt contains:

https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com

Note that at first I used:

import Queue

and: q = Queue.Queue(4)

but I got this error:

Traceback (most recent call last):
  File "test3.py", line 4, in <module>
    import Queue
ModuleNotFoundError: No module named 'Queue'

Based on some searching, I changed it to:

import queue

and for the relevant line: q = queue.Queue(4)

I also added:

from multiprocessing import Queue

But nothing worked. Can a Python multithreading expert help?
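(Side note, not from the original post: the module was renamed between Python versions, which is what the ModuleNotFoundError above points at.)

# Python 2: import Queue;  q = Queue.Queue(4)
# Python 3: import queue;  q = queue.Queue(4)
import queue
q = queue.Queue(4)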

2 Answers:

Answer 0 (score: 2)

You should change your function to return a string:

def myFunc(url):
    response = requests.get(url, verify=False ,timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

and write these strings to the file later:

results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')

This lets the multithreading work for requests.get, but serializes writing the results to the output file.
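If you want results written as each request finishes instead of after pool.map has collected everything, a variant is Pool.imap_unordered. The sketch below is mine, not part of the original answer, and it reads the URLs into a plain list first because map/imap need an iterable:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

with open("mylines.txt", "r") as f:
    urls = [line.strip() for line in f]   # plain list, which pool.map/imap can iterate

pool = ThreadPool(4)
with open("myresults", "w") as out:
    # imap_unordered yields each result as soon as its worker finishes,
    # so lines are written in completion order rather than input order
    for line in pool.imap_unordered(myFunc, urls):
        out.write(line + "\n")
pool.close()
pool.join()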

Update

You should also use with to read your input file:

#load up a queue with your data, this will handle locking
q = Queue.Queue()

with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)
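Putting the snippets of this answer together, a complete sketch could look like the following. One caveat that is my addition rather than part of the original answer: pool.map expects an iterable, and a queue.Queue is not iterable, so this sketch collects the URLs into a plain list instead of the Queue shown above:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

with open("mylines.txt", "r") as f:       # open my input file
    urls = [line.strip() for line in f]

pool = ThreadPool(4)                      # 4 concurrent worker threads
results = pool.map(myFunc, urls)          # runs the requests concurrently
pool.close()
pool.join()

with open("myresults", "w") as f:         # write all results from the main thread
    for line in results:
        f.write(line + "\n")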

Answer 1 (score: 1)

Rather than having the worker pool threads print the results out (which does not guarantee the output is buffered correctly), create one more thread that reads results from a second Queue and prints them.

I have modified your solution so that it builds its own pool of worker threads. There is little point in giving the queue unlimited length, since the main thread blocks when the queue reaches its maximum size: you only need it long enough to ensure there is always work available for the worker threads; the main thread will block and unblock as the queue size grows and shrinks.

It also identifies the thread responsible for each item on the output queue, which should give you confidence that the multithreading approach is working, and it prints the response code from the server. I found that I had to strip the newlines from the URLs.

Since only one thread is now writing to the file, the writes are always perfectly in sequence and there is no chance of them interfering with one another.

import threading
import requests
import queue
POOL_SIZE = 4

def myFunc(inq, outq):  # worker thread deals only with queues
    while True:
        url = inq.get()  # Blocks until something available
        if url is None:
            break
        response = requests.get(url.strip(), timeout=(2, 5))
        outq.put((url, response, threading.currentThread().name))


class Writer(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.results = open("myresults","a") # "a" to append results
        self.queue = q
    def run(self):
        while True:
            url, response, threadname = self.queue.get()
            if response is None:
                self.results.close()
                break
            print("****url is:",url, ", response is:", response.status_code, response.url, "thread", threadname, file=self.results)

#load up a queue with your data, this will handle locking
inq = queue.Queue()  # could usefully limit queue size here
outq = queue.Queue()

# start the Writer
writer = Writer(outq)
writer.start()

# make the Pool of workers
threads = []
for i in range(POOL_SIZE):
    thread = threading.Thread(target=myFunc, name=f"worker{i}", args=(inq, outq))
    thread.start()
    threads.append(thread)

# push the work onto the queues
with open("mylines.txt","r") as worker_data: # open my input file.
    for url in worker_data:
        inq.put(url.strip())
for thread in threads:
    inq.put(None)

# close the pool and wait for the workers to finish
for thread in threads:
    thread.join()

# Terminate the writer
outq.put((None, None, None))
writer.join()

With the data given in mylines.txt, I see the following output:

****url is: https://www.google.com , response is: 200 https://www.google.com/ thread worker1
****url is: https://twitter.com , response is: 200 https://twitter.com/ thread worker2
****url is: https://facebook.com , response is: 200 https://www.facebook.com/ thread worker0
****url is: https://www.censys.io , response is: 200 https://censys.io/ thread worker1
****url is: https://yahoo.com , response is: 200 https://uk.yahoo.com/?p=us thread worker3
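One usage note on the sketch above (my addition, following its "could usefully limit queue size here" comment): with a very long input file the input queue can be bounded, so the main thread never holds more than a handful of URLs in memory and simply blocks on inq.put() until a worker frees a slot. The factor of two below is an arbitrary choice:

# bounded input queue: at most POOL_SIZE * 2 URLs pending at any time
inq = queue.Queue(maxsize=POOL_SIZE * 2)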