I'm trying to learn beginner-level data science, and I have two datasets. The first is:
+----+-------+--------+------+------+-------+
| ID | bool | num1 | A | B | event |
+----+-------+--------+------+------+-------+
| a1 | TRUE | 123456 | 1001 | 1003 | 0 |
| a2 | FALSE | 123456 | 1006 | 1009 | 1 |
| a3 | TRUE | 144444 | 1020 | 1022 | 2 |
+----+-------+--------+------+------+-------+
and the second:
+----+--------+-------+------+----------+------+-------+------+
| ID | num1 | event | C | category | num2 | num3 | num4 |
+----+--------+-------+------+----------+------+-------+------+
| a1 | 123456 | 0 | 1002 | aa | 1.11 | -1.01 | 1.23 |
| a1 | 123456 | 0 | 1003 | bb | 3.21 | 2.92 | 4.03 |
| a2 | 144444 | 1 | 1008 | aa | 6.34 | 5.56 | 7.02 |
| a2 | 144444 | 1 | 1009 | aa | 5.65 | 3.99 | 6.32 |
+----+--------+-------+------+----------+------+-------+------+
From these I want to build a third one, where the data is grouped by the event column:
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
| event | ID | bool | num1 | C values | count cat1 | count cat2 | count cat3 | min num2 | avg num2 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
| 0 | a1 | TRUE | 123456 | 1002:1003 | 1 | 1 | 0 | 1.11 | 2.16 |
| 1 | a2 | FALSE | 123456 | 1008:1009 | 2 | 0 | 0 | 5.65 | 5.995 |
| 2 | a3 | TRUE | 144444 | 1020 | 0 | 0 | 1 | 4.02 | 4.02 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
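This is a simplified example. I have read about stack, groupby, counting based on another column, numpy.where, reshaping, and so on, but I haven't managed to combine them into anything close to what I want. Starting with simple suggestions, what solutions are there? Different solutions are welcome, so that I can try them and understand them fully. I'm using Python with Pandas.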
Answer 0 (score: 0)
In a simple case, you can concatenate the data frames side by side, along the column axis:
pd.concat([df1, df2], axis=1)
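A note on what this does with tables like the ones in the question: pd.concat with axis=1 lines rows up by index label, not by ID or event. The sketch below uses cut-down versions of the two tables (only a few of the original columns, chosen here for brevity) to show that the result is a plain side-by-side paste, so on its own it does not produce the per-event summary.

```python
import pandas as pd

# Cut-down versions of the question's tables (subset of columns, for illustration only)
df1 = pd.DataFrame({
    "ID": ["a1", "a2", "a3"],
    "event": [0, 1, 2],
})
df2 = pd.DataFrame({
    "ID": ["a1", "a1", "a2", "a2"],
    "event": [0, 0, 1, 1],
    "C": [1002, 1003, 1008, 1009],
})

# axis=1 aligns rows by index label (0, 1, 2, 3), not by ID or event
side_by_side = pd.concat([df1, df2], axis=1)
print(side_by_side)
```

Row 1 ends up pairing a2 from the first table with a1 from the second, and the column names ID and event appear twice, so a key-based merge is usually the better starting point for this kind of problem.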
Answer 1 (score: 0)
You can try the following:
import pandas as pd

# Use pd.get_dummies to create category indicator columns on the merged tables
df_out = pd.get_dummies(df1.merge(df2,
                                  on=['ID'],
                                  how='left',
                                  suffixes=('', '_y')),
                        columns=['category'],
                        prefix='cat',
                        prefix_sep='_')
# Compile a list of the category columns newly created by pd.get_dummies
catcols = df_out.filter(like='cat_').columns.values.tolist()
# Create a dictionary for the agg function: sum each indicator column to get counts
aggdict = dict(zip(catcols, ['sum'] * len(catcols)))
# Add custom aggregations for the other columns to the dictionary
aggdict['C'] = ['min', 'max']
aggdict['num2'] = 'min'
# Add the other columns to the column list
catcols.append('C')
catcols.append('num2')
# Group by and flatten the MultiIndex column headers
df_out = df_out.groupby(['event', 'ID', 'bool', 'num1'])[catcols].agg(aggdict)
df_out.columns = df_out.columns.map('_'.join)
print(df_out.reset_index())
Output:
event ID bool num1 cat_aa_sum cat_bb_sum C_min C_max num2_min
0 0.0 a1 TRUE 123456.0 1 1 1002.0 1003.0 1.11
1 1.0 a2 FALSE 123456.0 2 0 1008.0 1009.0 5.65
2 2.0 a3 TRUE 144444.0 0 0 NaN NaN NaN
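The output above gives C as separate min and max columns and leaves out the average of num2. One possible way to get closer to the requested "C values" and "avg num2" columns is named aggregation in groupby.agg. This is only a sketch: the frames below are cut-down copies of the question's tables, and join_c and the output column names are made up for illustration.

```python
import pandas as pd

# Cut-down copies of the question's tables (only the columns used here)
df1 = pd.DataFrame({
    "ID": ["a1", "a2", "a3"],
    "bool": [True, False, True],
    "num1": [123456, 123456, 144444],
    "event": [0, 1, 2],
})
df2 = pd.DataFrame({
    "ID": ["a1", "a1", "a2", "a2"],
    "C": [1002, 1003, 1008, 1009],
    "num2": [1.11, 3.21, 6.34, 5.65],
})

# A left merge keeps a3 even though it has no rows in df2
merged = df1.merge(df2, on="ID", how="left")

def join_c(values):
    # Hypothetical helper: join the non-missing C values as "1002:1003"
    return ":".join(str(int(v)) for v in values.dropna())

# Named aggregation: output_column=(input_column, aggregation)
summary = (
    merged.groupby(["event", "ID", "bool", "num1"], as_index=False)
    .agg(
        C_values=("C", join_c),
        min_num2=("num2", "min"),
        avg_num2=("num2", "mean"),
    )
)
print(summary)
```

For a3 there is nothing to aggregate, so C_values comes out as an empty string and the num2 columns as NaN.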
Answer 2 (score: 0)
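A crude approach, but I guess this should work.
import pandas as pd
from statistics import mean

# Outer merge on the shared columns keeps events that have no rows in df2 (such as a3)
df_in = pd.merge(df1, df2, how="outer")

def func(x):
    # x holds all merged rows for one event; missing values show up as NaN
    x["C_values"] = ":".join(str(int(i)) for i in list(x.C) if str(i) != "nan")
    x["count_cat1"] = list(x.category).count("aa")
    x["count_cat2"] = list(x.category).count("bb")
    x["count_cat3"] = list(x.category).count("cc")
    num2_list = [i for i in list(x.num2) if str(i) != "nan"]
    if num2_list != []:
        x["min_num2"] = min(num2_list)
        x["avg_num2"] = mean(num2_list)
    else:
        x["min_num2"] = None
        x["avg_num2"] = None
    # Every row of the group now carries the same summary, so return the first one
    return x[["event", "ID", "bool", "num1", "C_values", "count_cat1", "count_cat2",
              "count_cat3", "min_num2", "avg_num2"]].iloc[0]

df_out = df_in.groupby("event", as_index=False).apply(func)
Output: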
event ID bool num1 C_values count_cat1 count_cat2 count_cat3 min_num2 avg_num2
0 a1 True 123456 1002:1003 1 1 0 1.11 2.16
1 a2 False 123456 1008:1009 2 0 0 5.65 5.995
2 a3 True 144444 0 0 0 nan nan
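If the hand-written counting inside func feels heavy, the category counts can probably also be computed without a custom function. The sketch below is one way to do it with pd.crosstab, assuming the three categories are "aa", "bb" and "cc" as in the code above; the frames are cut-down copies of the question's tables and the count_ prefix is just a naming choice made here.

```python
import pandas as pd

# Cut-down copies of the question's tables (only the columns used here)
df1 = pd.DataFrame({
    "ID": ["a1", "a2", "a3"],
    "event": [0, 1, 2],
})
df2 = pd.DataFrame({
    "ID": ["a1", "a1", "a2", "a2"],
    "event": [0, 0, 1, 1],
    "category": ["aa", "bb", "aa", "aa"],
})

# Count rows per (event, category), then force a fixed set of events and
# categories so that missing combinations show up as 0 instead of vanishing
counts = (
    pd.crosstab(df2["event"], df2["category"])
    .reindex(index=df1["event"].tolist(), columns=["aa", "bb", "cc"], fill_value=0)
    .add_prefix("count_")
    .reset_index()
)

# Attach the counts to the per-event rows of the first table
result = df1.merge(counts, on="event", how="left")
print(result)
```

The reindex step is what keeps event 2 and category "cc" in the result even though neither appears in the second table.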