I have an input file containing a long list of URLs. Let's say mylines.txt contains:
https://yahoo.com
https://google.com
https://facebook.com
https://twitter.com
What I need to do is:
1) Read a line from the input file mylines.txt
2) Execute the myFunc function. It performs some task and produces an output consisting of a single line. In my real code it is more complex, but conceptually that's what it does.
3) Write the output to the results.txt file
Since my input is very large, I need to use Python multithreading. I looked at this nice post here, but unfortunately it assumes the input is in a simple list and does not cover writing the function's output to a file.
I need to make sure that the output for each input is written on a single line (i.e., if multiple threads write to the same line, I get corrupted data, which is dangerous).
I tried to work my way around this, but without success. I have never used Python multithreading before, but now is the time to learn, since it is unavoidable for me: my list is very long and cannot be finished in a reasonable time without multithreading. My real function does much more than this simple task, but that is not needed for the concept.
Here is my attempt. Please correct it for me (in the code itself):
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import Queue

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    results = open("myresults","a") # "a" to append results
    results.write("url is:",url, ", response is:", response.url)
    results.close()

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = Queue.Queue()
for url in worker_data:
    q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

# close the pool and wait for the work to finish
pool.close()
pool.join()
Q: How can I fix the code above (please be concise and help me in the code itself) so that it reads a line from the input file, executes the function, and writes the result associated with that input on a single line, performing the requests concurrently with Python multithreading so that I can finish my list in a reasonable time?
Update:
Based on the answer, the code became:
import threading
import requests
from multiprocessing.dummy import Pool as ThreadPool
import queue
from multiprocessing import Queue

def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url

worker_data = open("mylines.txt","r") # open my input file.

#load up a queue with your data, this will handle locking
q = queue.Queue(4)
with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)

# make the Pool of workers
pool = ThreadPool(4)
results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')
mylines.txt contains:
https://yahoo.com
https://www.google.com
https://facebook.com
https://twitter.com
Note that I first used:
import Queue
and: q = Queue.Queue(4)
but got the error:
Traceback (most recent call last):
File "test3.py", line 4, in <module>
import Queue
ModuleNotFoundError: No module named 'Queue'
Based on some searching, I changed it to:
import queue
and, for the relevant line: q = queue.Queue(4)
I also added:
from multiprocessing import Queue
but nothing works. Can a Python multithreading expert help?
Answer 0 (score: 2)
You should change your function to return a string:
def myFunc(url):
    response = requests.get(url, verify=False, timeout=(2, 5))
    return "url is:" + url + ", response is:" + response.url
and write those strings to the file afterwards:
results = pool.map(myFunc, q)

with open("myresults","w") as f:
    for line in results:
        f.write(line + '\n')
This keeps the requests.get calls multithreaded, but serializes the writing of the results into the output file.
Update:
You should also use with to read the input file:
#load up a queue with your data, this will handle locking
q = Queue.Queue()
with open("mylines.txt","r") as f: # open my input file.
    for url in f:
        q.put(url)
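Putting the pieces together: note that pool.map expects an iterable such as a list, so no Queue is needed for this approach at all. Below is a minimal end-to-end sketch; requests.get is replaced by a placeholder so it runs offline (swap the real call back in for your use):

```python
from multiprocessing.dummy import Pool as ThreadPool

def myFunc(url):
    # placeholder for: response = requests.get(url, verify=False, timeout=(2, 5))
    # each call returns exactly one line of output
    return "url is:" + url + ", response is:" + url

urls = ["https://yahoo.com", "https://google.com",
        "https://facebook.com", "https://twitter.com"]

pool = ThreadPool(4)              # 4 worker threads
results = pool.map(myFunc, urls)  # results come back in input order
pool.close()
pool.join()

# single-threaded write, one line per input: lines can never interleave
with open("myresults", "w") as f:
    for line in results:
        f.write(line + "\n")
```

Because pool.map preserves input order and only the main thread touches the file, each input's result lands on exactly one line.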
Answer 1 (score: 1)
Rather than having the pool's worker threads print out the results (which is not guaranteed to buffer the output correctly), create another thread that reads results from a second Queue and prints them out.
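(A side note on the import error in the question's update: in Python 3 the thread-safe queue module is spelled queue in lowercase; Queue was its Python 2 name, and multiprocessing.Queue is a different class intended for passing data between processes, not threads. A quick check:)

```python
import queue  # Python 3 name; this module was called "Queue" in Python 2

q = queue.Queue()           # thread-safe FIFO for threads in one process
q.put("https://yahoo.com")
item = q.get()              # blocks until an item is available
print(item)  # → https://yahoo.com
```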
I have modified your solution so that it builds its own pool of worker threads. There is little point in giving the queue unlimited length, since the main thread blocks when the queue reaches its maximum size: it only needs to be long enough to ensure there is always work available to the workers - the main thread will block and unblock as the queue size grows and shrinks.
It also identifies the thread responsible for each item on the output queue, which should reassure you that the multithreaded approach is working, and it prints the response code from the server. I found I had to strip the newlines from the URLs.
Since only one thread now writes to the file, the writes are always perfectly serialized and there is no chance of them interfering with each other.
import threading
import requests
import queue

POOL_SIZE = 4

def myFunc(inq, outq):  # worker thread deals only with queues
    while True:
        url = inq.get()  # Blocks until something available
        if url is None:
            break
        response = requests.get(url.strip(), timeout=(2, 5))
        outq.put((url, response, threading.currentThread().name))

class Writer(threading.Thread):
    def __init__(self, q):
        super().__init__()
        self.results = open("myresults","a") # "a" to append results
        self.queue = q
    def run(self):
        while True:
            url, response, threadname = self.queue.get()
            if response is None:
                self.results.close()
                break
            print("****url is:",url, ", response is:", response.status_code, response.url, "thread", threadname, file=self.results)

#load up a queue with your data, this will handle locking
inq = queue.Queue()  # could usefully limit queue size here
outq = queue.Queue()

# start the Writer
writer = Writer(outq)
writer.start()

# make the Pool of workers
threads = []
for i in range(POOL_SIZE):
    thread = threading.Thread(target=myFunc, name=f"worker{i}", args=(inq, outq))
    thread.start()
    threads.append(thread)

# push the work onto the queues
with open("mylines.txt","r") as worker_data: # open my input file.
    for url in worker_data:
        inq.put(url.strip())
for thread in threads:
    inq.put(None)

# close the pool and wait for the workers to finish
for thread in threads:
    thread.join()

# Terminate the writer
outq.put((None, None, None))
writer.join()
Using the data given in mylines.txt, I see the following output:
****url is: https://www.google.com , response is: 200 https://www.google.com/ thread worker1
****url is: https://twitter.com , response is: 200 https://twitter.com/ thread worker2
****url is: https://facebook.com , response is: 200 https://www.facebook.com/ thread worker0
****url is: https://www.censys.io , response is: 200 https://censys.io/ thread worker1
****url is: https://yahoo.com , response is: 200 https://uk.yahoo.com/?p=us thread worker3
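As a final aside (an alternative not used in the answers above): the standard library's concurrent.futures module provides ThreadPoolExecutor, which expresses the same fan-out/single-writer pattern compactly. A sketch with the network call stubbed out so it runs offline; the filename myresults2 is illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # stand-in for: requests.get(url, timeout=(2, 5)); returns one result line
    return "url is:" + url + ", response is:" + url

urls = ["https://yahoo.com", "https://google.com",
        "https://facebook.com", "https://twitter.com"]

# executor.map runs fetch concurrently and yields results in input order
with ThreadPoolExecutor(max_workers=4) as executor:
    lines = list(executor.map(fetch, urls))

# one writer, one line per result: no interleaving
with open("myresults2", "w") as f:
    for line in lines:
        f.write(line + "\n")
```

The with block joins the worker threads automatically, so no explicit close/join bookkeeping is needed.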