Is this the most I can get out of Python multiprocessing?

Date: 2020-12-25 11:35:26

Tags: python python-3.x multithreading

I have data in a text file. Each line is one computation to do. The file has about 100 000 000 lines.

First I load everything into RAM, then I have a method that performs the computation and returns the result:

def process(data_line):
    #do computation
    return result

Then I call it with packets of 2000 lines like this, and save the results to disk:

from multiprocessing import Pool

POOL_SIZE = 15 #nbcore - 1
PACKET_SIZE = 2000
pool = Pool(processes=POOL_SIZE)

data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets = int(number_of_lines / PACKET_SIZE)
for i in range(number_of_packets):
    lines_packet = data_lines[:PACKET_SIZE]
    data_lines = data_lines[PACKET_SIZE:]
    results = pool.map(process, lines_packet)
    save_computed_data_to_disk(to_be_computed_filename, results)

# process the last packet, which is smaller
results.extend(pool.map(process, data_lines))
save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")

The problem is that while I am writing to disk, my CPUs are not computing anything, and I have 8 cores. Looking at the task manager, it seems quite a lot of CPU time is lost.

I have to write to disk once the computation is done because the results are 1000 times bigger than the input. Either way, I will have to write to disk at some point; the time not spent here would just be spent later.

[screenshot: CPU time in Task Manager]

What can I do to let one core write to disk while the others keep computing? Switch to C?

At this rate I can process 100 million lines in 75 hours, but I have 12 billion lines to process, so any improvement is welcome.

Timing example:

Processing packet 2/15 953 of C:/processing/drop_zone\to_be_processed_txt_files\t_to_compute_303620.txt
Launching task and waiting for it to finish...
Task completed, Continuing
Packet was processed in 11.534576654434204 seconds
We are currently going at a rate of 0.002306915330886841 sec/words
Which is 433.47928145051293 words per seconds
Saving in temporary file
Printing writing 5000 computed line to disk took 0.04400920867919922 seconds
saving word to resume from : 06 20 25 00 00
Estimated time for processing the remaining packets is : 51:19:25

4 answers:

Answer 0 (score: 2)

Note: SharedMemory is only available on Python >= 3.8, since that is when it was first added.

Start 3 kinds of processes: Reader, Processor(s), Writer.

Have the Reader process read the file incrementally, sharing the result via shared_memory plus a Queue.

Have the Processor(s) consume the queue, consume the shared_memory, and return the result via another queue. Again, as shared_memory.

Have the Writer process consume the second queue and write to the destination file.

Have them all communicate with the MainProcess, which acts as the orchestrator, through some Events and a DictProxy.


Example:

import time
import random
import hashlib
import multiprocessing as MP

from queue import Queue, Empty

# noinspection PyCompatibility
from multiprocessing.shared_memory import SharedMemory

from typing import Dict, List


def readerfunc(
        shm_arr: List[SharedMemory], q_out: Queue, procr_ready: Dict[str, bool]
):
    numshm = len(shm_arr)
    for batch in range(1, 6):
        print(f"Reading batch #{batch}")
        for shm in shm_arr:
            #### Simulated Reading ####
            for j in range(0, shm.size):
                shm.buf[j] = random.randint(0, 255)
            #### ####
            q_out.put((batch, shm))
        # Need to sync here because we're reusing the same SharedMemory,
        # so gotta wait until all processors are done before sending the
        # next batch
        while not q_out.empty() or not all(procr_ready.values()):
            time.sleep(1.0)


def processorfunc(
        q_in: Queue, q_out: Queue, suicide: type(MP.Event()), procr_ready: Dict[str, bool]
):
    pname = MP.current_process().name
    procr_ready[pname] = False
    while True:
        time.sleep(1.0)
        procr_ready[pname] = True
        if q_in.empty() and suicide.is_set():
            break
        try:
            batch, shm = q_in.get_nowait()
        except Empty:
            continue
        print(pname, "got batch", batch)
        procr_ready[pname] = False
        #### Simulated Processing ####
        h = hashlib.blake2b(shm.buf, digest_size=4, person=b"processor")
        time.sleep(random.uniform(5.0, 7.0))
        #### ####
        q_out.put((pname, h.hexdigest()))


def writerfunc(q_in: Queue, suicide: type(MP.Event())):
    while True:
        time.sleep(1.0)
        if q_in.empty() and suicide.is_set():
            break
        try:
            pname, digest = q_in.get_nowait()
        except Empty:
            continue
        print("Writing", pname, digest)
        #### Simulated Writing ####
        time.sleep(random.uniform(3.0, 6.0))
        #### ####
        print("Writing", pname, digest, "done")


def main():
    shm_arr = [
        SharedMemory(create=True, size=1024)
        for _ in range(0, 5)
    ]
    q_read = MP.Queue()
    q_write = MP.Queue()
    procr_ready = MP.Manager().dict()
    poison = MP.Event()
    poison.clear()

    reader = MP.Process(target=readerfunc, args=(shm_arr, q_read, procr_ready))

    procrs = []
    for n in range(0, 3):
        p = MP.Process(
            target=processorfunc, name=f"Proc{n}", args=(q_read, q_write, poison, procr_ready)
        )
        procrs.append(p)

    writer = MP.Process(target=writerfunc, args=(q_write, poison))

    reader.start()
    [p.start() for p in procrs]
    writer.start()

    reader.join()
    print("Reader has ended")

    while not all(procr_ready.values()):
        time.sleep(5.0)
    poison.set()
    [p.join() for p in procrs]
    print("Processors have ended")

    writer.join()
    print("Writer has ended")

    [shm.close() for shm in shm_arr]
    [shm.unlink() for shm in shm_arr]


if __name__ == '__main__':
    main()

Answer 1 (score: 0)

Looking at the code, the first thing that comes to mind is to run the saving function in a thread. That way we take the bottleneck of waiting for the disk write out of the picture. Something like this:

import concurrent.futures
from concurrent.futures import ThreadPoolExecutor, ALL_COMPLETED

executor = ThreadPoolExecutor(max_workers=2)
saving_futures = []

future = executor.submit(save_computed_data_to_disk, to_be_computed_filename, results)
saving_futures.append(future)
...
concurrent.futures.wait(saving_futures, return_when=ALL_COMPLETED)  # wait until everything is saved to disk after processing
print("Done")

Answer 2 (score: 0)

You say you have 8 cores, yet you have:

POOL_SIZE = 15 #nbcore - 1

Assuming you want to leave one processor free (presumably for the main process?), why isn't this number 7? But why do you want to keep a processor free at all? You are making consecutive calls to map, and while the main process is waiting for those calls to return it needs essentially no CPU. That is why, if you do not specify a pool size when you instantiate the pool, it defaults to the number of CPUs you have rather than that number minus one. I will have more to say about this below.
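
Just to illustrate that default behaviour (this snippet is not from the question; Pool() with no argument sizes itself to os.cpu_count()):

import os
from multiprocessing import Pool

pool = Pool()                                   # defaults to os.cpu_count() workers, e.g. 8 on an 8-core machine
pool_explicit = Pool(processes=os.cpu_count())  # equivalent, spelled out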

Since you have a very large in-memory list, it is possible that you are wasting cycles in your loop rewriting this list on every iteration. Instead, you could just take a slice of the list and pass that as the iterable argument to map:

from multiprocessing import Pool

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_lines = len(data_lines)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)
with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        save_computed_data_to_disk(to_be_computed_filename, results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        save_computed_data_to_disk(to_be_computed_filename, results)
print("Done")

Between each call to map, the main process is writing the results out to to_be_computed_filename, and meanwhile every process in the pool is sitting idle. That work should be handed off to another process (actually, a thread running under the main process):

from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000
data_lines = util.load_data_lines(to_be_computed_filename)
number_of_packets, remainder = divmod(number_of_lines, PACKET_SIZE)

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    offset = 0
    for i in range(number_of_packets):
        results = pool.map(process, data_lines[offset:offset+PACKET_SIZE])
        offset += PACKET_SIZE
        q.put(results)
    if remainder:
        results = pool.map(process, data_lines[offset:offset+remainder])
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")

I chose to run save_data in a thread of the main process. It could also be another process, in which case you would need to use a multiprocessing.Queue instance. But I figured that the main-process thread mostly just waits for map to complete, so there would not be much competition for the GIL. Now, if you do not leave a processor free for the threading job, save_data, it may end up doing most of the saving only after all the results have been created. You would need to experiment a bit with this.
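
For completeness, a minimal sketch of that process-based variant with a multiprocessing.Queue (it assumes save_computed_data_to_disk and to_be_computed_filename are defined at module level so the child process can see them; the Pool/map loop itself stays the same as above):

import multiprocessing

def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return  # sentinel: no more results, terminate
        save_computed_data_to_disk(to_be_computed_filename, results)

if __name__ == '__main__':
    q = multiprocessing.Queue()
    writer = multiprocessing.Process(target=save_data, args=(q,))
    writer.start()

    # ... same Pool/map loop as above, calling q.put(results) after each packet ...

    q.put(None)      # tell the writer process to finish
    writer.join()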

Ideally, I would also modify the reading of the input file so that it does not all have to be read into memory first, but is instead read line by line, yielding 2000-line chunks and submitting them as jobs for map to process:

from multiprocessing import Pool
import queue
import threading

POOL_SIZE = 15 # ????
PACKET_SIZE = 2000


def save_data(q):
    while True:
        results = q.get()
        if results is None:
            return # signal to terminate
        save_computed_data_to_disk(to_be_computed_filename, results)


def read_data():
    """
    yield lists of PACKET_SIZE
    """
    lines = []
    with open(some_file, 'r') as f:
        for line in iter(f.readline, ''):
            lines.append(line)
            if len(lines) == PACKET_SIZE:
                yield lines
                lines = []
        if lines:
            yield lines

q = queue.Queue()
t = threading.Thread(target=save_data, args=(q,))
t.start()

with Pool(processes=POOL_SIZE) as pool:
    for l in read_data():
        results = pool.map(process, l)
        q.put(results)
q.put(None)
t.join() # wait for thread to terminate
print("Done")

Answer 3 (score: 0)

I am making two assumptions: the writing is I/O-bound, not CPU-bound, meaning that throwing more cores at the writing would not improve performance; and the process function contains some heavy computation.

I would approach it differently:

  1. Split the big list into a list of lists
  2. Feed it to the processes
  3. Store the total result

Example code:

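A minimal sketch of those three steps, assuming process takes a single line and returns a single result as in the question, and that the combined results fit in memory (process_packet is a hypothetical helper added here to run process over a whole slice):

import multiprocessing as mp

PACKET_SIZE = 2000

def process_packet(lines):
    # run the per-line computation over one slice in a single task
    return [process(line) for line in lines]

if __name__ == '__main__':
    data_lines = util.load_data_lines(to_be_computed_filename)

    # 1. split the big list into a list of lists
    slices = [data_lines[i:i + PACKET_SIZE]
              for i in range(0, len(data_lines), PACKET_SIZE)]

    # 2. feed it to the processes
    with mp.Pool() as pool:
        packet_results = pool.map(process_packet, slices)

    # 3. store the total result
    results = [r for packet in packet_results for r in packet]
    save_computed_data_to_disk(to_be_computed_filename, results)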

On a meta note: you may also want to look at numpy and pandas (depending on the data), because it sounds like you want to go in that direction.
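
For instance, a hedged sketch of chunked reading with pandas (assuming each input line can be read as a single column; the column name "line" and output_filename are placeholders, and how much this helps depends entirely on the data and on process itself):

import pandas as pd

# read the input in 2000-line chunks instead of loading 100 000 000 lines into RAM
for chunk in pd.read_csv(to_be_computed_filename, header=None, names=["line"], chunksize=2000):
    results = chunk["line"].map(process)   # still per line, but without the manual slicing
    results.to_csv(output_filename, mode="a", header=False, index=False)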