I have a dask dataframe `df` that looks like this:

    Main_Author PaperID
    A           X
    B           Y
    C           Z

I also have another dask dataframe `pa` that looks like this:

    PaperID Co_Author
    X       D
    X       E
    X       F
    Y       A
    Z       B
    Z       D

I want a resulting dataframe that looks like this:

    Main_Author Co_Authors Num_Co_Authors
    A           (D,E,F)    3
    B           (A)        1
    C           (B,D)      2

This is what I did: I merged the two dataframes on `PaperID` and then grouped the co-authors by `Main_Author`. This works for small dataframes. However, since I am working with very large ones, the process keeps getting killed. I believe this is because of the merge. Is there a more elegant way to get the intended result?
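For reference, the aggregation being asked for amounts to a hash join of `pa` on `PaperID` followed by a per-author tuple and count. A minimal plain-Python sketch over the small example tables above (an illustration of the logic only, not the dask code):

```python
# Small example tables from the question, as (column, column) pairs.
df = [("A", "X"), ("B", "Y"), ("C", "Z")]                # (Main_Author, PaperID)
pa = [("X", "D"), ("X", "E"), ("X", "F"),
      ("Y", "A"), ("Z", "B"), ("Z", "D")]                # (PaperID, Co_Author)

# Build the join index: PaperID -> list of co-authors.
co_by_paper = {}
for paper, co in pa:
    co_by_paper.setdefault(paper, []).append(co)

# Join each main author to their co-authors and count them.
result = [(author, tuple(co_by_paper.get(paper, ())), len(co_by_paper.get(paper, ())))
          for author, paper in df]
# result: [("A", ("D","E","F"), 3), ("B", ("A",), 1), ("C", ("B","D"), 2)]
```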
Answer 0 (score: 1)
If you are working with two large DataFrames, you can try wrapping the `merge` in `dask.delayed`. There is a great example of `dask.delayed` here in the Dask docs or here on SO. See Dask use cases here.
Imports:

    from faker import Faker
    import pandas as pd
    import dask
    from dask.diagnostics import ProgressBar
    import random

    fake = Faker()

Generate dummy data with the faker library (per this SO post), so that each DataFrame contains a large number of rows. Specify the number of rows of fake data in each DataFrame:

    number_of_rows_in_df = 3000
    number_of_rows_in_pa = 8000
Generate the two dataframes:

    def create_rows(auth_colname, num=1):
        output = [{auth_colname: fake.name(),
                   "PaperID": random.randint(1000, 2000)} for x in range(num)]
        return output

    df = pd.DataFrame(create_rows("Main_Author", number_of_rows_in_df))
    pa = pd.DataFrame(create_rows("Co_Author", number_of_rows_in_pa))

Print the first 5 rows of each dataframe:
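If faker is not installed, comparable dummy rows can be produced with the standard library alone (the `fake_name` helper below is a hypothetical stand-in, not part of faker):

```python
import random
import string

random.seed(0)  # reproducible dummy data

def fake_name():
    # Crude stand-in for faker's Faker().name(): a random 5-letter string.
    return "".join(random.choices(string.ascii_uppercase, k=5))

def create_rows(auth_colname, num=1):
    # Same row shape as the faker-based version: one name column plus PaperID.
    return [{auth_colname: fake_name(),
             "PaperID": random.randint(1000, 2000)} for _ in range(num)]

rows = create_rows("Main_Author", 3)
```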
    print(df.head())
           Main_Author  PaperID
    0   Kyle Morton MD     1522
    1    April Edwards     1992
    2  Rachel Sullivan     1874
    3    Kevin Johnson     1909
    4     Julie Morton     1635

    print(pa.head())
            Co_Author  PaperID
    0  Deborah Cuevas     1911
    1     Melissa Fox     1095
    2    Sean Mcguire     1620
    3     Cory Clarke     1424
    4     David White     1569
Wrap the merge operations in a helper function:

    def merge_operations(df1, df2):
        df = df1.merge(df2, on="PaperID")
        df = df.groupby('Main_Author')['Co_Author'].apply(lambda x: tuple(x)).reset_index()
        df['Num_Co_Authors'] = df['Co_Author'].apply(lambda x: len(x))
        return df
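To see what `merge_operations` computes, here is the same merge-then-groupby applied to the small example tables from the question (plain pandas, using the column names shown there):

```python
import pandas as pd

# The question's small example tables.
df = pd.DataFrame({"Main_Author": ["A", "B", "C"], "PaperID": ["X", "Y", "Z"]})
pa = pd.DataFrame({"PaperID": ["X", "X", "X", "Y", "Z", "Z"],
                   "Co_Author": ["D", "E", "F", "A", "B", "D"]})

merged = df.merge(pa, on="PaperID")            # join the two tables on the shared key
out = (merged.groupby("Main_Author")["Co_Author"]
             .apply(tuple)                     # collect each author's co-authors
             .reset_index())
out["Num_Co_Authors"] = out["Co_Author"].apply(len)
```

The result matches the desired dataframe in the question: A gets (D, E, F) with count 3, B gets (A,) with count 1, C gets (B, D) with count 2.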
Dask approach - generate the final DataFrame using `dask.delayed`:

    ddf = dask.delayed(merge_operations)(df, pa)
    with ProgressBar():
        df_dask = dask.compute(ddf)
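`dask.delayed` makes the call lazy: `dask.delayed(merge_operations)(df, pa)` only records the task, and nothing runs until `dask.compute`, which lets Dask schedule the work (and any sibling tasks) itself. The idea can be sketched with a toy stand-in (the `Deferred` class below is hypothetical, not a Dask API):

```python
class Deferred:
    """Toy stand-in for dask.delayed: record a call now, run it later."""
    def __init__(self, fn, *args):
        self.fn, self.args = fn, args   # nothing is executed here

    def compute(self):
        return self.fn(*self.args)      # execution happens only on demand

def concat(a, b):
    return a + b

task = Deferred(concat, [1, 2], [3])    # builds the task, does not run it
result = task.compute()                 # -> [1, 2, 3]
```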
Output of the Dask approach:

    [                                        ] | 0% Completed |  0.0s
    [                                        ] | 0% Completed |  0.1s
    [                                        ] | 0% Completed |  0.2s
    [                                        ] | 0% Completed |  0.3s
    [                                        ] | 0% Completed |  0.4s
    [                                        ] | 0% Completed |  0.5s
    [########################################] | 100% Completed |  0.6s

    print(df_dask[0].head())
          Main_Author                                          Co_Author  Num_Co_Authors
    0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
    1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
    2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
    3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
    4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6
Pandas approach - generate the final DataFrame using pandas:

    df_pandas = merge_operations(df, pa)
    print(df_pandas.head())
          Main_Author                                          Co_Author  Num_Co_Authors
    0  Aaron Anderson  (Elizabeth Peterson, Harry Gregory, Catherine ...              15
    1    Aaron Barron  (Martha Neal, James Walton, Amanda Wright, Sus...              11
    2      Aaron Bond  (Theresa Lawson, John Riley, Daniel Moore, Mrs...               6
    3  Aaron Campbell  (Jim Martin, Nicholas Stanley, Douglas Berry, ...              11
    4  Aaron Castillo  (Kevin Young, Patricia Gallegos, Tricia May, M...               6
Compare the DataFrames obtained with the Pandas and Dask approaches:

    from pandas.util.testing import assert_frame_equal
    try:
        assert_frame_equal(df_dask[0], df_pandas, check_dtype=True)
    except AssertionError as e:
        message = "\n" + str(e)
    else:
        message = 'DataFrames created using Dask and Pandas are equivalent.'
    print(message)
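Note that `pandas.util.testing` is deprecated in newer pandas releases; the same check is available from `pandas.testing`. A minimal self-contained version of the comparison (using two small dummy frames rather than the ones built above):

```python
import pandas as pd
from pandas.testing import assert_frame_equal  # modern import path

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [1, 2]})

try:
    assert_frame_equal(a, b, check_dtype=True)  # raises AssertionError on mismatch
    message = "DataFrames are equivalent."
except AssertionError as e:
    message = "\n" + str(e)
```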