Asynchronously reading multiple DataFrames with pandas `read_csv` - why isn't it faster?

Time: 2019-09-10 13:04:14

Tags: python pandas async-await

I want to write code that reads several pandas DataFrames asynchronously, for example from CSV files (or from a database).

I wrote the code below assuming it would import the two DataFrames faster, but it actually seems to be slower:

import timeit

import pandas as pd
import asyncio

train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3],'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12],'period': [2, 2, 2]})

train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')


async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_async():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df

start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_async())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async

start = timeit.default_timer()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start

print(time_to_run_async < time_to_run_without_async)

Why is reading the two DataFrames faster in the non-async version?

To be clear, my real use case is reading data from BigQuery, so I would really like to use code like the above to speed up the two requests (train and test).

Thanks!

1 answer:

Answer 0 (score: 3):

pd.read_csv is not an async method, so I don't believe you're actually getting any parallelism out of this. You would need to use a separate async file library like aiofiles to read the files into buffers asynchronously and then hand those buffers to pd.read_csv(...).

Note that most filesystems aren't truly asynchronous, so aiofiles is functionally a thread pool. It can still be faster than reading the files sequentially, though.


Here is an example that uses aiohttp to fetch CSVs from URLs:

import io
import asyncio

import aiohttp
import pandas as pd

async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read the entire response text and wrap it as file-like using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all futures at once.
        futures = [ get_csv_async(client, url) for url in urls ]
        # Then wait for all the futures to complete.
        return await asyncio.gather(*futures)

urls = [
    # Some random CSV urls from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]

if '__main__' == __name__:
    # Run event loop
    # can just do `csvs = asyncio.run(get_all_csvs_async(urls))` in python 3.7+
    csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))

    for csv in csvs:
        print(csv)
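If you'd rather avoid an extra dependency, the standard library can achieve the same thread-pool effect directly: asyncio.to_thread (Python 3.9+) runs a blocking call, such as pd.read_csv or a blocking database client query, in a worker thread so several calls can overlap. A minimal sketch, reusing the question's toy data (the file creation is only there to make the sketch runnable):

```python
import asyncio

import pandas as pd

def make_sample_files():
    # Same toy data as the question, just so this sketch runs standalone.
    pd.DataFrame({'feature1': [1, 2, 3], 'period': [1, 1, 1]}).to_csv('train.csv', index=False)
    pd.DataFrame({'feature1': [1, 4, 12], 'period': [2, 2, 2]}).to_csv('test.csv', index=False)

async def read_csv_threaded(path):
    # asyncio.to_thread runs the blocking pd.read_csv in the default
    # thread pool, so the two reads can proceed concurrently.
    return await asyncio.to_thread(pd.read_csv, path)

async def main():
    return await asyncio.gather(
        read_csv_threaded('train.csv'),
        read_csv_threaded('test.csv'),
    )

if __name__ == '__main__':
    make_sample_files()
    train, test = asyncio.run(main())
    print(len(train), len(test))  # 3 3
```

The same pattern applies to the asker's BigQuery case: wrap each blocking query call in asyncio.to_thread and gather the results, keeping in mind that parsing still runs one thread at a time under the GIL, so the win comes from overlapping I/O waits.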