I'm in a pickle (literally) trying to parallelize the following Python code and could really use some help.
The input is a CSV file containing the list of website links that I need to scrape with the function scrape_code().
The original code is below and works perfectly:
import csv
import re

with open('C:\\links.csv', 'r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

m = []
for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))  # remove the quotes/brackets around the link strings, otherwise it results in URLError
print(m)
I then tried to parallelize this code using joblib, as follows:
from joblib import Parallel, delayed
import multiprocessing

with open('C:\\links.csv', 'r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))) for i in inputlist)
However, this results in a strange error:
File "C:\Users\...\joblib\pool.py", line 371, in send
CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'
Any idea what I am doing wrong here? If I move the append into a separate function, the error goes away, but then execution freezes and hangs indefinitely:
def process(k):
    a = []
    a.append(scrape_code(re.sub("\'|\[|\]", '', str(k))))
    return a

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)
The input list has 10,000 pages, so parallel processing would be a huge benefit.
Answer 0 (score: 0):
If you really need this to run in separate processes, the easiest way is to create a process pool and let it handle distributing the links to your function, e.g.:
import csv
from multiprocessing import Pool

if __name__ == "__main__":  # multiprocessing guard: workers re-import this module on Windows, so the pool must only be created here
    with open("c:\\links.csv", "r", newline="") as f:  # open the CSV
        reader = csv.reader(f)  # create a reader
        links = [r[0] for r in reader]  # collect only the first column
    with Pool() as pool:  # create a pool; by default it uses all your CPU cores
        results = pool.map(scrape_code, links)  # distribute your links to scrape_code
    print(results)

NOTE: I'm assuming your links.csv actually holds the links in its first column, given how you pre-process the links in your code.
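With roughly 10,000 links of uneven scrape times, a variation worth trying is pool.imap_unordered, which streams results back as each worker finishes instead of waiting for map to collect everything. A minimal sketch, assuming the same links and scrape_code as above:

with Pool() as pool:
    # imap_unordered yields each result as soon as a worker finishes it,
    # and a chunksize > 1 reduces inter-process messaging overhead
    results = list(pool.imap_unordered(scrape_code, links, chunksize=50))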
However, as I stated in the comments, this is not necessarily faster than plain threading, so I would try threads first. Fortunately, the multiprocessing module includes a thread-based interface, multiprocessing.dummy, so you only need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see in which case your code runs faster.
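For reference, a minimal sketch of the thread-based variant, under the same assumptions as above (links.csv holds the links in its first column and scrape_code is available). Only the import changes, since multiprocessing.dummy.Pool exposes the same API backed by threads:

import csv
from multiprocessing.dummy import Pool  # same Pool API, backed by threads

with open("c:\\links.csv", "r", newline="") as f:
    reader = csv.reader(f)
    links = [r[0] for r in reader]

with Pool() as pool:  # defaults to one thread per CPU core; no __main__ guard needed, threads do not re-import the module
    results = pool.map(scrape_code, links)  # threads share memory, so nothing gets pickled
print(results)

Because scraping is I/O-bound, the threads spend most of their time waiting on the network anyway, and avoiding pickling also sidesteps the Can't pickle local object error entirely.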