I want to write code that reads several pandas DataFrames asynchronously, e.g. from CSV files (or from a database).
I wrote the following code assuming it would import the two DataFrames faster, but it actually seems to run slower:
import timeit
import pandas as pd
import asyncio
train_to_save = pd.DataFrame(data={'feature1': [1, 2, 3],'period': [1, 1, 1]})
test_to_save = pd.DataFrame(data={'feature1': [1, 4, 12],'period': [2, 2, 2]})
train_to_save.to_csv('train.csv')
test_to_save.to_csv('test.csv')
async def run_async_train():
    return pd.read_csv('train.csv')

async def run_async_test():
    return pd.read_csv('test.csv')

async def run_train_test_asinc():
    df = await asyncio.gather(run_async_train(), run_async_test())
    return df
start_async = timeit.default_timer()
async_train, async_test = asyncio.run(run_train_test_asinc())
finish_async = timeit.default_timer()
time_to_run_async = finish_async - start_async
start = timeit.default_timer()
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
finish = timeit.default_timer()
time_to_run_without_async = finish - start
print(time_to_run_async < time_to_run_without_async)
Why is reading the two DataFrames faster in the non-async version?
To be clear, I will actually be reading the data from BigQuery, so I'd really like to use code like the above to speed up the two requests (train and test).
Thanks!
Answer 0: (score: 3)
`pd.read_csv` is not an async method, so I don't believe you're actually getting any parallelism out of it. You would need to use an async file library such as `aiofiles` to read the file into a buffer asynchronously, and then pass that buffer to `pd.read_csv(...)`.
Note that most filesystems aren't truly asynchronous, so `aiofiles` is functionally a thread pool. Still, it may well be faster than reading the files sequentially.
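Since the blocking call is `pd.read_csv` itself, another option along the same lines is to offload each read to a worker thread. Here is a minimal sketch (not from the original answer), assuming Python 3.9+ for `asyncio.to_thread`; the file names are just placeholders matching the question's example:

```python
import asyncio
import pandas as pd

# Create two small CSV files so the example is self-contained.
pd.DataFrame({'feature1': [1, 2, 3], 'period': [1, 1, 1]}).to_csv('train.csv', index=False)
pd.DataFrame({'feature1': [1, 4, 12], 'period': [2, 2, 2]}).to_csv('test.csv', index=False)

async def read_csvs_threaded(paths):
    # Each pd.read_csv call runs in the default thread pool, so the
    # (mostly I/O-bound) reads can overlap instead of running sequentially.
    tasks = [asyncio.to_thread(pd.read_csv, p) for p in paths]
    return await asyncio.gather(*tasks)

train, test = asyncio.run(read_csvs_threaded(['train.csv', 'test.csv']))
print(len(train), len(test))  # prints: 3 3
```

The same pattern applies to any blocking client library (e.g. a database driver): wrap each blocking call in `asyncio.to_thread` and `gather` the results.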
Here's an example that uses `aiohttp` to fetch CSVs from URLs:
import io
import asyncio
import aiohttp
import pandas as pd
async def get_csv_async(client, url):
    # Send a request.
    async with client.get(url) as response:
        # Read the entire response text and convert it to a file-like object using StringIO().
        with io.StringIO(await response.text()) as text_io:
            return pd.read_csv(text_io)

async def get_all_csvs_async(urls):
    async with aiohttp.ClientSession() as client:
        # First create all the futures at once.
        futures = [get_csv_async(client, url) for url in urls]
        # Then wait for all the futures to complete.
        return await asyncio.gather(*futures)
urls = [
    # Some random CSV urls from the internet
    'https://people.sc.fsu.edu/~jburkardt/data/csv/hw_25000.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv',
    'https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv',
]
if '__main__' == __name__:
    # Run the event loop.
    # In Python 3.7+ you can just do `csvs = asyncio.run(get_all_csvs_async(urls))`.
    csvs = asyncio.get_event_loop().run_until_complete(get_all_csvs_async(urls))
    for csv in csvs:
        print(csv)