Suppose I have two data frames, t1h and t2h. I want to merge them so that, for a specific list of columns, rows that match on those columns are combined by adding the contents of their remaining columns.

t1h:
      timestamp           ip       domain  http_status  \
0  1475740500.0  192.168.1.1  example.com          200
1  1475740500.0  192.168.1.1  example.com          200
2  1475740500.0  192.168.1.1  example.com          201
3  1475740500.0  192.168.1.1  example.com          201
4  1475740500.0  192.168.1.1  example.com          202

    test  b_count           b_sum  test_count  test_sum  data1  \
0  False       46  24742949931480          46     9.250      0
1   True       48  28151237474796          48     9.040      0
2  False       36  21702308613722          36     7.896      0
3   True       24  13112423049120          24     5.602      0
4  False       62  29948023487954          62    12.648      0

   data2
0      0
1      0
2      0
3      0
4      0

t2h:
      timestamp           ip       domain  http_status  \
0  1475740500.0  192.168.1.1  example.com          200
1  1475740500.0  192.168.1.1  example.com          200
2  1475740500.0  192.168.1.1  example.com          201
3  1475740500.0  192.168.1.1  example.com          201
4  1475740500.0  192.168.1.1  example.com          202

    test  b_count           b_sum  test_count  test_sum  data1  \
0  False       44  22349502626302          44     9.410      0
1   True       32  16859760597754          32     5.988      0
2  False       46  23478212117794          46     8.972      0
3   True       36  20956236750016          36     7.124      0
4  False       54  35255787384306          54     9.898      0

   data2
0      0
1      0
2      0
3      0
4      0
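
I need the output to be based on the following list of columns:

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

Merging on those columns keeps the two sets of value columns side by side, suffixed _x and _y, instead of adding them:

pd.merge(t1h, t2h, on=groupby_fields)
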
      timestamp           ip       domain  http_status  \
0  1475740500.0  192.168.1.1  example.com          200
1  1475740500.0  192.168.1.1  example.com          200
2  1475740500.0  192.168.1.1  example.com          201
3  1475740500.0  192.168.1.1  example.com          201
4  1475740500.0  192.168.1.1  example.com          202

    test  b_count_x         b_sum_x  test_count_x  test_sum_x  \
0  False         46  24742949931480            46       9.250
1   True         48  28151237474796            48       9.040
2  False         36  21702308613722            36       7.896
3   True         24  13112423049120            24       5.602
4  False         62  29948023487954            62      12.648

   data1_x  data2_x  b_count_y         b_sum_y  \
0        0        0         44  22349502626302
1        0        0         32  16859760597754
2        0        0         46  23478212117794
3        0        0         36  20956236750016
4        0        0         54  35255787384306

   test_count_y  test_sum_y  data1_y  data2_y
0            44       9.410        0        0
1            32       5.988        0        0
2            46       8.972        0        0
3            36       7.124        0        0
4            54       9.898        0        0

Note: every column other than those in groupby_fields is either int or float.

Please let me know how to achieve this in an optimized way.

Answer 0 (score: 1)

Stack the two frames and use the groupby.agg() function. This assumes t1h and t2h already exist and have the same column names:

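import pandas as pd

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']
# Stack t1h on top of t2h. DataFrame.append was removed in pandas 2.0, so use pd.concat.
df = pd.concat([t1h, t2h], ignore_index=True)
# The question asks for addition, so every value column is summed, including
# the *_count columns; 'count' would only count the rows in each group.
agg_dict = {'b_count': 'sum',
            'b_sum': 'sum',
            'test_count': 'sum',
            'test_sum': 'sum',
            'data1': 'sum',
            'data2': 'sum'}
# Rows that share the same key columns collapse into one row of totals.
df.groupby(groupby_fields).agg(agg_dict).reset_index()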
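
To make the approach easy to try, here is a minimal self-contained sketch. It assumes a two-row excerpt of the question's data, with b_sum, data1 and data2 left out for brevity, and the names combined and result are only illustrative. Because every non-key column is numeric, a plain sum() over the groups should do the same job as an explicit aggregation dictionary.

```python
import pandas as pd

groupby_fields = ['timestamp', 'ip', 'domain', 'http_status', 'test']

# Two-row excerpts of t1h and t2h from the question
# (b_sum, data1 and data2 are omitted here for brevity).
t1h = pd.DataFrame({
    'timestamp': [1475740500.0, 1475740500.0],
    'ip': ['192.168.1.1', '192.168.1.1'],
    'domain': ['example.com', 'example.com'],
    'http_status': [200, 200],
    'test': [False, True],
    'b_count': [46, 48],
    'test_count': [46, 48],
    'test_sum': [9.250, 9.040],
})
# Same key columns, different values in the numeric columns.
t2h = t1h.assign(b_count=[44, 32], test_count=[44, 32], test_sum=[9.410, 5.988])

# Stack both frames, then add up the value columns within each key group.
combined = pd.concat([t1h, t2h], ignore_index=True)
result = combined.groupby(groupby_fields, as_index=False).sum()
print(result)
```

A row whose key combination appears in only one of the two frames forms a group of its own, so it passes through with its values unchanged.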