Looking for an elegant solution to avoid merging two dataframes

Date: 2019-03-08 17:40:36

Tags: python python-3.x dask

I have a dask dataframe df that looks like this:

Main_Author  PaperID
A            X
B            Y
C            Z

I also have another dask dataframe pa that looks like this:

PaperID  Co_Author
X        D
X        E
X        F
Y        A
Z        B
Z        D

I want a resulting dataframe that looks like this:

Main_Author  Co_Authors   Num_Co_Authors
A            (D,E,F)      3
B            (A)          1
C            (B,D)        2

This is what I did:

df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))

This works on small dataframes. However, since I am working with very large ones, the process keeps getting killed. I believe this is because of the merge. Is there a more elegant way to get the desired result?
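For reference, the merge-then-groupby approach described in the question can be reproduced on the small example data with plain pandas (plain pandas stands in for dask here; the toy frames below are built from the tables above):

```python
import pandas as pd

# Toy versions of the two dataframes from the question
df = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                   "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

# Merge on PaperID, then collect the co-authors of each main author
merged = df.merge(pa, on="PaperID")
result = merged.groupby("Main_Author")["Co_Author"].apply(tuple).reset_index()
result["Num_Co_Authors"] = result["Co_Author"].apply(len)

print(result)  # A -> (D, E, F), B -> (A,), C -> (B, D)
```

On large inputs the intermediate merged frame is what blows up memory, which is why the question is looking for an alternative.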

1 Answer:

Answer 0 (score: 1)

If you are working with two large DataFrames, you can try wrapping the merge operation in dask.delayed.

Imports

from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random

fake = Faker()

Generate dummy data, so that we get a large number of rows in each DataFrame

  • Specify the number of rows of fake data to generate in each DataFrame

number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000

  • Generate some large datasets using the faker library (per this SO post)

def create_rows(auth_colname, num=1):
    output = [{auth_colname:fake.name(),
               "PaperID":random.randint(1000,2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))

Print the first 5 rows of each dataframe

print(df.head())
       Main_Author  PaperID
0   Kyle Morton MD     1522
1    April Edwards     1992
2  Rachel Sullivan     1874
3    Kevin Johnson     1909
4     Julie Morton     1635

print(pa.head())
        Co_Author  PaperID
0  Deborah Cuevas     1911
1     Melissa Fox     1095
2    Sean Mcguire     1620
3     Cory Clarke     1424
4     David White     1569

Wrap the merge operations in a helper function

def merge_operations(df1, df2):
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df

Dask approach - use dask.delayed to generate the final DataFrame

ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    df_dask = dask.compute(ddf)

Output of the Dask approach

[                                        ] | 0% Completed |  0.0s
[########################################] | 100% Completed |  0.6s

print(df_dask[0].head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Pandas approach - generate the final DataFrame using Pandas

df_pandas = merge_operations(df, pa)

print(df_pandas.head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Compare the DataFrames obtained using the Pandas and Dask approaches

from pandas.util.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'
print(message)
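One caveat about the comparison step: pandas.util.testing was deprecated in pandas 1.0, and the public home of assert_frame_equal is now the pandas.testing module. A small sketch of the modern import; the frames a and b here are illustrative stand-ins for df_dask[0] and df_pandas above:

```python
import pandas as pd
from pandas.testing import assert_frame_equal  # public location since pandas 1.0

# Illustrative stand-in frames for the two results being compared
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 2, 3]})

try:
    assert_frame_equal(a, b, check_dtype=True)
    message = "DataFrames are equivalent."
except AssertionError as e:
    message = "\n" + str(e)
print(message)
```

Unlike DataFrame.equals, assert_frame_equal reports *where* two frames differ (column, dtype, or values), which makes it the more useful tool for this kind of sanity check.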