Merge datasets with conditional rows into columns in Python

Asked: 2018-07-19 16:49:02

Tags: pandas data-science

I am trying to learn beginner data science, and I have 2 datasets. The first is:

+----+-------+--------+------+------+-------+
| ID | bool  |  num1  |  A   |  B   | event |
+----+-------+--------+------+------+-------+
| a1 | TRUE  | 123456 | 1001 | 1003 |     0 |
| a2 | FALSE | 123456 | 1006 | 1009 |     1 |
| a3 | TRUE  | 144444 | 1020 | 1022 |     2 |
+----+-------+--------+------+------+-------+

and the second:

+----+--------+-------+------+----------+------+-------+------+
| ID |  num1  | event |  C   | category | num2 | num3  | num4 |
+----+--------+-------+------+----------+------+-------+------+
| a1 | 123456 |     0 | 1002 | aa       | 1.11 | -1.01 | 1.23 |
| a1 | 123456 |     0 | 1003 | bb       | 3.21 |  2.92 | 4.03 |
| a2 | 144444 |     1 | 1008 | aa       | 6.34 |  5.56 | 7.02 |
| a2 | 144444 |     1 | 1009 | aa       | 5.65 |  3.99 | 6.32 |
+----+--------+-------+------+----------+------+-------+------+

From them I want to build a third, where the data is grouped by the event column:

+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
| event | ID | bool  |  num1  | C values  | count cat1 | count cat2 | count cat3 | min num2 | avg num2 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+
|     0 | a1 | TRUE  | 123456 | 1002:1003 |          1 |          1 |          0 |     1.11 |     2.16 |
|     1 | a2 | FALSE | 123456 | 1008:1009 |          2 |          0 |          0 |     5.65 |    5.995 |
|     2 | a3 | TRUE  | 144444 | 1020      |          0 |          0 |          1 |     4.02 |     4.02 |
+-------+----+-------+--------+-----------+------------+------------+------------+----------+----------+

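This is a simplified example. I have read about stack, groupby, counting based on another column, numpy.where, reshaping and so on, but I have not managed to combine them into anything close to what I want. Starting from simple suggestions, what would a solution look like? Different solutions are welcome, so I can try them and understand them fully. Using Python and pandas.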

3 answers:

Answer 0 (score: 0)

In a simple case, you can concatenate the data frames along the column axis:

pd.concat([df1, df2], axis=1)
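
To make clear what this does, here is a minimal sketch on two toy frames (hypothetical data, not the tables from the question). With axis=1, pd.concat places the frames side by side and aligns rows by index label, not by ID or event, so for tables with different row counts the unmatched cells become NaN and a merge or groupby is probably still needed.

```python
import pandas as pd

# Hypothetical toy frames (not the asker's data) with default integer indexes.
left = pd.DataFrame({"ID": ["a1", "a2"], "A": [1001, 1006]})
right = pd.DataFrame({"C": [1002, 1008, 1009]})

# axis=1 aligns rows by index label; labels missing on one side are filled with NaN.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Row 2 exists only in the second frame, so its ID and A cells are NaN.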

Answer 1 (score: 0)

You can try the following:

import pandas as pd

# Join the tables with merge, then use pd.get_dummies to turn each category into a 0/1 column
df_out = pd.get_dummies(df1.merge(df2,
                                  on=['ID'],
                                  how='left',
                                  suffixes=('', '_y')),
                        columns=['category'],
                        prefix='cat',
                        prefix_sep='_')

# Collect the category columns newly created by pd.get_dummies
catcols = df_out.filter(like='cat_').columns.values.tolist()
# Build the dictionary for agg: summing each 0/1 column gives the category count
aggdict = dict(zip(catcols, ['sum'] * len(catcols)))

# Add custom aggregations for the other columns
aggdict['C'] = ['min', 'max']
aggdict['num2'] = 'min'

# Add the other columns to the column list
catcols.append('C')
catcols.append('num2')

# Group by the key columns and flatten the MultiIndex column headers
df_out = df_out.groupby(['event', 'ID', 'bool', 'num1'])[catcols].agg(aggdict)
df_out.columns = df_out.columns.map('_'.join)
print(df_out.reset_index())

Output:

   event    ID     bool      num1  cat_aa_sum  cat_bb_sum   C_min   C_max  num2_min
0    0.0   a1    TRUE    123456.0           1           1  1002.0  1003.0      1.11
1    1.0   a2    FALSE   123456.0           2           0  1008.0  1009.0      5.65
2    2.0   a3    TRUE    144444.0           0           0     NaN     NaN       NaN
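
The output above does not include the joined C values or the average of num2 that the question asks for. As a rough sketch of one way to get them, the detail table can be aggregated per event first and then attached to the first table. This is a variation using named aggregation and pd.crosstab, not the code from this answer. It assumes that event alone identifies a row of df1 (num1 differs between the two tables for a2, so it is not used as a key), and the names per_event, cat_counts, and summary are made up for illustration.

```python
import pandas as pd

# The two tables from the question, rebuilt by hand.
df1 = pd.DataFrame({
    "ID": ["a1", "a2", "a3"],
    "bool": [True, False, True],
    "num1": [123456, 123456, 144444],
    "A": [1001, 1006, 1020],
    "B": [1003, 1009, 1022],
    "event": [0, 1, 2],
})
df2 = pd.DataFrame({
    "ID": ["a1", "a1", "a2", "a2"],
    "num1": [123456, 123456, 144444, 144444],
    "event": [0, 0, 1, 1],
    "C": [1002, 1003, 1008, 1009],
    "category": ["aa", "bb", "aa", "aa"],
    "num2": [1.11, 3.21, 6.34, 5.65],
    "num3": [-1.01, 2.92, 5.56, 3.99],
    "num4": [1.23, 4.03, 7.02, 6.32],
})

# Aggregate the detail table once per event (assumption: event is the join key).
per_event = df2.groupby("event").agg(
    C_values=("C", lambda s: ":".join(str(v) for v in s)),
    min_num2=("num2", "min"),
    avg_num2=("num2", "mean"),
)
# One count column per category value that actually occurs in df2.
cat_counts = pd.crosstab(df2["event"], df2["category"]).add_prefix("count_")

# Attach both aggregates to the first table; events without detail rows get NaN.
summary = (
    df1[["event", "ID", "bool", "num1"]]
    .merge(per_event, left_on="event", right_index=True, how="left")
    .merge(cat_counts, left_on="event", right_index=True, how="left")
    .reset_index(drop=True)
)
# Replace the NaN counts of unmatched events with 0.
count_cols = list(cat_counts.columns)
summary[count_cols] = summary[count_cols].fillna(0).astype(int)
print(summary)
```

Only categories present in df2 get a column here, so a count_cc column would have to be added separately if it is needed.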

Answer 2 (score: 0)

A crude approach, but I think this should work.

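import pandas as pd
from statistics import mean  # mean() is used below but was never imported

# Outer merge on all shared columns, keeping unmatched rows from both tables
df_in = pd.merge(df1, df2, how="outer")

def func(x):
    # Join the non-missing C values of the group with ":"
    x["C_values"] = ":".join(str(int(i)) for i in list(x.C) if str(i) != "nan")
    # Count how often each category occurs in the group
    x["count_cat1"] = list(x.category).count("aa")
    x["count_cat2"] = list(x.category).count("bb")
    x["count_cat3"] = list(x.category).count("cc")
    # Min and mean of num2 over the non-missing values, if there are any
    num2_list = [i for i in list(x.num2) if str(i) != "nan"]
    if num2_list:
        x["min_num2"] = min(num2_list)
        x["avg_num2"] = mean(num2_list)
    else:
        x["min_num2"] = None
        x["avg_num2"] = None

    # Every row of the group now carries the same summary, so return the first one
    return x[["event", "ID", "bool", "num1", "C_values", "count_cat1", "count_cat2",
              "count_cat3", "min_num2", "avg_num2"]].iloc[0]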

df_out = df_in.groupby("event", as_index=False).apply(func)

Output:

    event   ID  bool    num1    C_values    count_cat1  count_cat2  count_cat3  min_num2    avg_num2
    0       a1  True    123456  1002:1003   1           1            0          1.11            2.16
    1       a2  False   123456  1008:1009   2           0            0          5.65            5.995
    2       a3  True    144444              0           0            0         nan             nan
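
The str(i) != "nan" checks in func can probably be replaced with pandas' own missing-value handling. The sketch below uses a hypothetical group (the Series num2 and c stand in for x.num2 and x.C inside func) and relies on Series.min and Series.mean skipping NaN by default.

```python
import pandas as pd

# Hypothetical group with one missing value, standing in for x.num2 and x.C.
num2 = pd.Series([None, 6.34, 5.65])
c = pd.Series([None, 1008.0, 1009.0])

# min() and mean() skip NaN by default and return NaN for an all-missing group.
min_num2 = num2.min()
avg_num2 = num2.mean()

# dropna() removes the missing entries before the values are joined.
c_values = ":".join(str(int(v)) for v in c.dropna())
print(min_num2, avg_num2, c_values)
```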