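I am trying to use Dask to build a distance matrix, which involves making a few million requests to an API. The API accepts one origin and multiple destinations. To get the parameters for each individual request, I build a Dask dataframe with the following code: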
from dask import bag as db
origins = ['a','b','c','d','e']
dd = db.from_sequence(origins)
#all origin destination pairs (destinations are from the same set as origins)
dp = dd.product(dd)
# possible extra step to filter elements
# dp = dp.filter(lambda x: ...)
df = dp.to_dataframe()
df.columns = ['from','to']
# helper function to get chunks
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]
chunk_size = 3
destinationsd = df.groupby('from').apply(lambda x:[el.tolist() for el in list(chunks(x['to'], chunk_size))],meta=(0, list))
This gives me the following series:
destinations = destinationsd.compute()
print(destinations)
>
from
b [[a, b, c], [d, e]]
a [[a, b, c], [d, e]]
e [[a, b, c], [d, e]]
d [[a, b, d], [c, e]]
c [[b, c, a], [e, d]]
Name: 0, dtype: object
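This series holds all the request parameters for a single origin in one row. Instead of having all of an origin's destination lists together in one row, I would like a separate row for each origin and destination list. I can do that in pandas: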
import pandas as pd
# destinations is the computed series shown above
final = destinations.apply(lambda x: pd.Series(x)).stack()
final.index = final.index.droplevel(1)
print(final)
>
from
b [a, b, c]
b [d, e]
a [a, b, c]
a [d, e]
e [a, b, c]
e [d, e]
d [a, b, d]
d [c, e]
c [b, c, a]
c [e, d]
dtype: object
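For reference, the target shape (one row per request, i.e. one origin with one chunk of destinations) can be sketched in plain Python without pandas or Dask. This is only an illustration of the reshaping I am after, not a Dask solution; `request_rows` is a hypothetical helper name, and the sketch assumes the full set of origins fits in memory.

```python
from itertools import product


def chunks(seq, n):
    """Yield successive n-sized chunks from seq."""
    for i in range(0, len(seq), n):
        yield seq[i:i + n]


def request_rows(origins, chunk_size):
    """Return one (origin, destination_chunk) tuple per API request.

    Hypothetical helper, used here only to illustrate the desired output.
    """
    # Group every origin/destination pair by its origin.
    by_origin = {}
    for origin, dest in product(origins, origins):
        by_origin.setdefault(origin, []).append(dest)
    # Flatten: one row per chunk instead of one row per origin.
    return [
        (origin, chunk)
        for origin, dests in by_origin.items()
        for chunk in chunks(dests, chunk_size)
    ]


rows = request_rows(['a', 'b', 'c', 'd', 'e'], chunk_size=3)
for origin, chunk in rows:
    print(origin, chunk)
```

Each tuple in `rows` corresponds to one row of `final`: the origin plays the role of the index and the chunk is the list of destinations for a single request.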
How can I get a result similar to final using Dask?