Looking for an elegant solution to avoid merging two dataframes

Date: 2019-03-08 17:40:36

Tags: python python-3.x dask

I have a dask dataframe df that looks like this:

Main_Author  PaperID
A            X
B            Y
C            Z

I also have another dask dataframe pa that looks like this:

PaperID  Co_Author
X        D
X        E
X        F
Y        A
Z        B
Z        D

I want a resulting dataframe that looks like this:

Main_Author  Co_Authors   Num_Co_Authors
A            (D,E,F)      3
B            (A)          1
C            (B,D)        2

This is what I did:

df = df.merge(pa, on="PaperID")
df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))

This works on small dataframes. However, since I am working with very large ones, the process keeps getting killed. I believe this is because of the merge. Is there a more elegant way to get the desired result?
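For reference, the merge-then-groupby approach described in the question can be reproduced on the small example data with plain pandas (plain pandas stands in for dask here; the toy frames below are built from the tables above):

```python
import pandas as pd

# Toy versions of the two dataframes from the question
df = pd.DataFrame({"Main_Author": ["A", "B", "C"],
                   "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

# Merge on PaperID, then collect the co-authors of each main author
merged = df.merge(pa, on="PaperID")
result = merged.groupby("Main_Author")["Co_Author"].apply(tuple).reset_index()
result["Num_Co_Authors"] = result["Co_Author"].apply(len)

print(result)  # A -> (D, E, F), B -> (A,), C -> (B, D)
```

On large inputs the intermediate merged frame is what blows up memory, which is why the question is looking for an alternative.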

1 Answer:

Answer 0 (score: 1)

If you are working with two large DataFrames, you can try wrapping the merge operation in dask.delayed.

Imports

from faker import Faker
import pandas as pd
import dask
from dask.diagnostics import ProgressBar
import random

fake = Faker()

Generate dummy data, so that we get a large number of rows in each DataFrame

  • Specify the number of rows of fake data to generate in each DataFrame

number_of_rows_in_df = 3000
number_of_rows_in_pa = 8000

  • Generate some large datasets using the faker library (per this SO post)

def create_rows(auth_colname, num=1):
    output = [{auth_colname:fake.name(),
               "PaperID":random.randint(1000,2000)} for x in range(num)]
    return output
df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))

Print the first 5 rows of each dataframe

print(df.head())
       Main_Author  PaperID
0   Kyle Morton MD     1522
1    April Edwards     1992
2  Rachel Sullivan     1874
3    Kevin Johnson     1909
4     Julie Morton     1635

print(pa.head())
        Co_Author  PaperID
0  Deborah Cuevas     1911
1     Melissa Fox     1095
2    Sean Mcguire     1620
3     Cory Clarke     1424
4     David White     1569

Wrap the merge operations in a helper function

def merge_operations(df1, df2):
    df = df1.merge(df2, on="PaperID")
    df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
    df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
    return df

Dask approach - use dask.delayed to generate the final DataFrame

ddf = dask.delayed(merge_operations)(df, pa)
with ProgressBar():
    df_dask = dask.compute(ddf)

Output of the Dask approach

[                                        ] | 0% Completed |  0.0s
[########################################] | 100% Completed |  0.6s

print(df_dask[0].head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Pandas approach - generate the final DataFrame using Pandas

df_pandas = merge_operations(df, pa)

print(df_pandas.head())
      Main_Author                                          Co_Author  Num_Co_Authors
0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6

Compare the DataFrames obtained using the Pandas and Dask approaches

from pandas.util.testing import assert_frame_equal
try:
    assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
except AssertionError as e:
    message = "\n"+str(e)
else:
    message = 'DataFrames created using Dask and Pandas are equivalent.'
print(message)
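One caveat about the comparison step: pandas.util.testing was deprecated in pandas 1.0, and the public home of assert_frame_equal is now the pandas.testing module. A small sketch of the modern import; the frames a and b here are illustrative stand-ins for df_dask[0] and df_pandas above:

```python
import pandas as pd
from pandas.testing import assert_frame_equal  # public location since pandas 1.0

# Illustrative stand-in frames for the two results being compared
a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 2, 3]})

try:
    assert_frame_equal(a, b, check_dtype=True)
    message = "DataFrames are equivalent."
except AssertionError as e:
    message = "\n" + str(e)
print(message)
```

Unlike DataFrame.equals, assert_frame_equal reports *where* two frames differ (column, dtype, or values), which makes it the more useful tool for this kind of sanity check.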