Python 3: How do I write to the same file from multiple processes without messing it up?

Asked: 2017-06-07 19:53:55

Tags: python multithreading parallel-processing multiprocessing

I have a program that can be started or stopped at any time. The program is used to download data from web pages. First, the user defines a set of web pages in a .csv file, saves that .csv file, and then starts the program. The program reads the .csv file and turns it into a list of jobs. Next, the jobs are split across 5 separate downloader functions that work in parallel but may take different amounts of time to finish.

After a downloader (there are 5 of them) finishes downloading a web page, I need it to open the .csv file and remove the link. That way, over time, the .csv file gets smaller and smaller. The problem is that sometimes two download functions try to update the .csv file at the same time and crash the program. How do I handle this?

3 Answers:

Answer 0 (score: 3):

If this is a continuation of your project from yesterday, you already have the download list in memory - just remove entries from the loaded list as their processes finish downloading, and only write the whole list back over the input file once you're exiting the 'downloader'. There is no reason to constantly write down the changes.

If you want to know (say, from an external process) when a URL has been downloaded even while your 'downloader' is running, write a new line to downloaded.dat each time a process reports a successful download.

Of course, in both cases, do the writing from within your main process/thread so you don't have to worry about mutexes.

UPDATE - Here's how you can do it with an additional file, using the same code base as yesterday:

from itertools import cycle  # for round-robin distribution of downloader params
from multiprocessing import Pool

# `Downloader` is assumed to be the class from yesterday's code base,
# defined or imported in this module

def init_downloader(params):  # our downloader initializer
    downloader = Downloader(**params[0])  # instantiate our downloader
    downloader.run(params[1])  # run our downloader
    return params  # job finished, return the same params for identification

if __name__ == "__main__":  # important protection for cross-platform use

    downloader_params = [  # Downloaders will be initialized using these params
        {"port_number": 7751},
        {"port_number": 7851},
        {"port_number": 7951}
    ]
    downloader_cycle = cycle(downloader_params)  # use a cycle for round-robin distribution

    with open("downloaded_links.dat", "a+") as diff_file:  # open your diff file
        diff_file.seek(0)  # rewind the diff file to the beginning to capture all lines
        diff_links = {row.strip() for row in diff_file}  # load downloaded links into a set
        with open("input_links.dat", "r+") as input_file:  # open your input file
            available_links = []
            download_jobs = []  # store our downloader parameters + a link here
            # read our file line by line and filter out downloaded links
            for row in input_file:  # loop through our file
                link = row.strip()  # remove the extra whitespace to get the link
                if link not in diff_links:  # make sure link is not already downloaded
                    available_links.append(row)
                    download_jobs.append([next(downloader_cycle), link])
            input_file.seek(0)  # rewind our input file
            input_file.truncate()  # clear out the input file
            input_file.writelines(available_links)  # store back the available links
            diff_file.seek(0)  # rewind the diff file
            diff_file.truncate()  # blank out the diff file now that the input is updated
        # and now let's get to business...
        if download_jobs:
            download_pool = Pool(processes=5)  # make our pool use 5 processes
            # run asynchronously so we can capture results as soon as they are available
            for response in download_pool.imap_unordered(init_downloader, download_jobs):
                # since it returns the same parameters, the second item is a link
                # add the link to our `diff` file so it doesn't get downloaded again
                diff_file.write(response[1] + "\n")
        else:
            print("Nothing left to download...")

As I wrote in the comments, the whole idea is to use a file to store the downloaded links as they get downloaded, then on the next run to filter out the already-downloaded links and update the input file. That way, even if you forcibly kill the program, it will always resume from where it left off (except for partial downloads).
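One caveat worth noting: the truncate-and-rewrite of the input file above is not itself crash-safe - if the process is killed between truncate() and writelines(), the remaining links are lost. A common hardening, sketched here with a hypothetical rewrite_atomically helper, is to write the new contents to a temporary file and swap it into place with a single rename:

import os

def rewrite_atomically(path, lines):
    tmp_path = path + ".tmp"  # hypothetical temporary-file name
    with open(tmp_path, "w") as tmp:
        tmp.writelines(lines)  # write the full new contents first
    os.replace(tmp_path, path)  # swap in the new file in a single rename operation

This way the input file always contains either the old list or the new one, never a half-written state.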

Answer 1 (score: 0):

Look into locking files in Python. Locking a file makes the next process wait until the file is unlocked before it can modify it. File locking is platform-specific, so you will have to use whichever method works for the operating system you are on. If you need to figure out the OS, use a switch like this:

import os

def my_lock(f):
    if os.name == "posix":
        # Unix or OS X specific locking here
        pass
    elif os.name == "nt":
        # Windows specific locking here
        pass
    else:
        print("Unknown operating system, lock unavailable")

Then I would look at this article to figure out exactly how you want to implement your locking.
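To make that skeleton concrete, here is a minimal sketch of what the platform-specific bodies might look like, using fcntl.flock on POSIX and msvcrt.locking on Windows (both in the standard library); the my_lock/my_unlock helper names are just for illustration:

import os

def my_lock(f):
    if os.name == "posix":
        import fcntl
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # block until we hold an exclusive lock
    elif os.name == "nt":
        import msvcrt
        f.seek(0)  # msvcrt locks a byte range starting at the current position
        msvcrt.locking(f.fileno(), msvcrt.LK_LOCK, 1)  # retries ~10s, then raises OSError

def my_unlock(f):
    if os.name == "posix":
        import fcntl
        fcntl.flock(f.fileno(), fcntl.LOCK_UN)  # release the lock
    elif os.name == "nt":
        import msvcrt
        f.seek(0)  # unlock the same byte range we locked
        msvcrt.locking(f.fileno(), msvcrt.LK_UNLCK, 1)

Note that msvcrt.LK_LOCK only retries for about 10 seconds before giving up, so treat this as a starting point rather than production-grade locking.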

Answer 2 (score: 0):

Use a 'Lock' from the multiprocessing library to serialize operations on the file.

You need to pass the lock to each process. Each process should 'acquire' the lock before opening the file and 'release' the lock after closing the file.

https://docs.python.org/2/library/multiprocessing.html
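A minimal sketch of that pattern; the downloader worker, the placeholder links, and jobs.csv are illustrative, not part of the original program:

from multiprocessing import Lock, Process

def downloader(lock, link, csv_path):
    # ... download `link` here ...
    lock.acquire()  # acquire the lock before touching the file
    try:
        with open(csv_path) as f:
            remaining = [row for row in f if row.strip() != link]
        with open(csv_path, "w") as f:
            f.writelines(remaining)  # write the file back without the finished link
    finally:
        lock.release()  # release the lock once the file is closed

if __name__ == "__main__":
    lock = Lock()
    links = ["http://example.com/page1", "http://example.com/page2"]  # placeholders
    processes = [Process(target=downloader, args=(lock, link, "jobs.csv"))
                 for link in links]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

Because every process funnels its read-modify-write through the same Lock, two downloaders can never rewrite the .csv file at the same time.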