Need help parallelizing this code

Date: 2018-05-21 14:02:49

Tags: python python-3.x parallel-processing joblib

I've gotten myself into a pickle (literally) trying to parallelize the following Python code, and I could really use some help.

First, the input is a CSV file containing a list of website links that I need to scrape with the function scrape_code(). The original code is below and works perfectly:

import csv
import re

with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

m = []

for i in inputlist:
    m.append(scrape_code(re.sub("\'|\[|\]", '', str(i))))  # remove the quotes/brackets around the link strings, otherwise it results in URLError

print(m)

I then tried to parallelize this code with joblib, as follows:

from joblib import Parallel, delayed
import multiprocessing

with open('C:\\links.csv','r') as source:
    reader = csv.reader(source)
    inputlist = list(reader)

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(m.append(scrape_code(re.sub("\'|\[|\]",'',str(i))))) for i in inputlist)

However, this results in a strange error:

  File "C:\Users\...\joblib\pool.py", line 371, in send
    CustomizablePickler(buffer, self._reducers).dump(obj)
AttributeError: Can't pickle local object 'delayed.<locals>.delayed_function'

Any idea what I'm doing wrong here? If I try to move the append into a separate function, the error goes away, but then execution freezes and hangs indefinitely:

def process(k):
    a = []
    a.append(scrape_code(re.sub("\'|\[|\]", '', str(k))))
    return a

cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=cores)(delayed(process)(i) for i in inputlist)

The input list has 10,000 pages, so parallel processing would be a huge benefit.

1 Answer:

Answer 0 (score: 0)

If you really need it to run in separate processes, the easiest way is to create a process pool and let it handle distributing the links to your function, e.g.:

#include "comm.h"

HANDLE handler;
uint8_t connected;

COMSTAT status;
DWORD errors;

int connectSerial(char* portName) {
    connected = 0;
    handler = CreateFileA(portName,
                                GENERIC_READ | GENERIC_WRITE,
                                0,
                                0,
                                OPEN_EXISTING,
                                FILE_ATTRIBUTE_NORMAL,
                                0);
    if (handler == INVALID_HANDLE_VALUE) {
        if (GetLastError() == ERROR_FILE_NOT_FOUND) {
            printf("ERROR: Handle was not attached. Reason: %s not available\n", portName);
        } else {
            printf("ERROR!!!");
        }
    } else {
        DCB dcbSerialParameters = { 0 };

        if (!GetCommState(handler, &dcbSerialParameters)) {
            printf("failed to get current serial parameters");
        } else {
            connected = 1;
            PurgeComm(handler, PURGE_RXCLEAR | PURGE_TXCLEAR);
            //  Sleep(2000);
        }
    }
}

void disconnectSerial() {
    if (connected) {
        connected = 0;
        CloseHandle(handler);
    }
}

uint8_t readSerialPort(char* buffer, unsigned int buf_size) {
    DWORD bytesRead;
    unsigned int toRead;

    ClearCommError(handler, &errors, &status);

    if (status.cbInQue > 0) {
        if (status.cbInQue > buf_size) 
            toRead = buf_size;
        else
            toRead = status.cbInQue;
    }

    if (ReadFile(handler, buffer, toRead, &bytesRead, NULL))
        return bytesRead;
    return 0;
}

uint8_t writeSerialPort(char* buffer, unsigned int buf_size) {
    DWORD bytesSend;

    if (!WriteFile(handler, (void*)buffer, buf_size, &bytesSend, 0)) {
        ClearCommError(handler, &errors, &status);
        return 0;
    }
    else return 1;
}

import csv
from multiprocessing import Pool

# scrape_code() is your existing scraping function
if __name__ == "__main__":  # multiprocessing guard
    with open("c:\\links.csv", "r", newline="") as f:  # open the CSV
        reader = csv.reader(f)  # create a reader
        links = [r[0] for r in reader]  # collect only the first column

    with Pool() as pool:  # create a pool, it will make a pool with all your CPU cores...
        results = pool.map(scrape_code, links)  # distribute your links to scrape_code

    print(results)

NOTE: I'm assuming your links.csv actually holds the links in its first column; adjust this depending on how you pre-processed the links in your code.
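Since you have 10,000 pages, you may also prefer to see results trickle in rather than wait for the whole batch. A minimal sketch using pool.imap_unordered, assuming the same links.csv layout and scrape_code function as above (results arrive in completion order, not input order):

import csv
from multiprocessing import Pool

if __name__ == "__main__":
    with open("c:\\links.csv", "r", newline="") as f:
        links = [r[0] for r in csv.reader(f)]

    with Pool() as pool:
        # imap_unordered hands back each result as soon as a worker finishes it
        for result in pool.imap_unordered(scrape_code, links):
            print(result)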

However, as I stated in the comment, this won't necessarily be faster than plain threading, so I'd try threads first. Fortunately, the multiprocessing module includes a threading interface, dummy, so you just need to replace from multiprocessing import Pool with from multiprocessing.dummy import Pool and see in which case your code runs faster.
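Concretely, the swap is just the import line; a minimal sketch of the thread-based variant, under the same assumptions as above:

import csv
from multiprocessing.dummy import Pool  # same Pool API, backed by threads instead of processes

if __name__ == "__main__":
    with open("c:\\links.csv", "r", newline="") as f:
        links = [r[0] for r in csv.reader(f)]

    with Pool() as pool:  # defaults to one thread per CPU core
        results = pool.map(scrape_code, links)
    print(results)

Since scraping is I/O-bound, a thread count well above the core count (for example Pool(32)) is often worth trying, as the threads spend most of their time waiting on the network.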