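I have a dataframe like the sample built in the code below.
The dataframe has 11 columns, and each column holds a grade. For each record, I need to count the number of A, B, and C grades.
My expected output is the original dataframe with three extra columns holding those counts.
I tried doing this with the apply function. This is what I have so far: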
import pandas as pd
# sample data
df_dict = {'level_1': {0: 'C', 1: 'A', 2: 'C', 3: 'B', 4: 'A', 5: 'C', 6: 'A', 7: 'B', 8: 'B'},
'level_2': {0: 'B', 1: 'B', 2: 'C', 3: 'A', 4: 'A', 5: 'C', 6: 'B', 7: 'C', 8: 'A'},
'level_3': {0: 'B', 1: 'A', 2: 'B', 3: 'A', 4: 'B', 5: 'B', 6: 'C', 7: 'B', 8: 'C'},
'level_4': {0: 'A', 1: 'C', 2: 'B', 3: 'C', 4: 'B', 5: 'C', 6: 'A', 7: 'B', 8: 'C'},
'level_5': {0: 'B', 1: 'B', 2: 'B', 3: 'A', 4: 'A', 5: 'A', 6: 'B', 7: 'B', 8: 'A'},
'level_6': {0: 'C', 1: 'C', 2: 'C', 3: 'B', 4: 'B', 5: 'B', 6: 'C', 7: 'A', 8: 'C'},
'level_7': {0: 'C', 1: 'A', 2: 'C', 3: 'C', 4: 'C', 5: 'C', 6: 'C', 7: 'A', 8: 'A'},
'level_8': {0: 'B', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'A', 6: 'A', 7: 'A', 8: 'C'},
'level_9': {0: 'A', 1: 'B', 2: 'A', 3: 'C', 4: 'C', 5: 'B', 6: 'A', 7: 'C', 8: 'B'},
'level_10': {0: 'B', 1: 'C', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'A', 7: 'A', 8: 'C'},
'level_11': {0: 'C', 1: 'B', 2: 'C', 3: 'B', 4: 'C', 5: 'B', 6: 'B', 7: 'C', 8: 'B'}
}
sample_df = pd.DataFrame(df_dict)
# function to count the values of A, B, C
def count_in_df(series):
    _ = series.value_counts()
    _ = _[['A', 'B', 'C']]
    return _.tolist()

count_df = pd.DataFrame(sample_df.apply(count_in_df, axis=1).tolist(),
                        columns=['counts_of_A', 'counts_of_B', 'counts_of_C'])
# add count information back
sample_df = sample_df.join(count_df)
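This gives me the information I need, but the code is too slow. My real data has about 700,000 records and 66 columns (instead of 11), and it took about 30 minutes to get the result.
Is there a way to speed up the code? Are there any optimizations I could try?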
Answer 0 (score: 2)
Use stack + groupby + value_counts. Rename the columns, then concatenate the result with the original dataframe.
df = (sample_df
      .stack()
      .groupby(level=0)
      .value_counts()
      .unstack(1)
      .add_prefix('counts_of_')
     )
df = pd.concat([sample_df, df], axis=1)
df
   counts_of_A  counts_of_B  counts_of_C
0            2            5            4
1            4            4            3
2            2            4            5
3            4            4            3
4            4            4            3
5            3            4            4
6            5            3            3
7            4            4            3
8            3            3            5
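One thing to watch with this approach: if a grade never appears in a given row, unstack leaves a NaN there. Below is a minimal sketch of the same stack + groupby + value_counts chain that fills those gaps with 0 and fixes the column order. The frame `small_df` and the name `counts` are made up for illustration, and the grade list is assumed to be known in advance.

```python
import pandas as pd

# Hypothetical miniature frame; row 1 deliberately contains no 'C'
# and row 0 contains no 'B'.
small_df = pd.DataFrame({
    'level_1': ['A', 'B', 'C'],
    'level_2': ['A', 'B', 'B'],
    'level_3': ['C', 'A', 'C'],
})

counts = (small_df
          .stack()                   # long Series indexed by (row, column)
          .groupby(level=0)          # group the values of each original row
          .value_counts()            # count every grade within the row
          .unstack(fill_value=0)     # grades become columns, absent grades get 0
          .reindex(columns=list('ABC'), fill_value=0)  # assumed grade list, fixed order
          .add_prefix('counts_of_'))

# Attach the counts to the original columns.
result = pd.concat([small_df, counts], axis=1)
print(result)
```

The reindex step also guarantees that a grade missing from the entire frame still gets its own all-zero column.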
Answer 1 (score: 2)
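I would use str.get_dummies. (The level argument of sum was removed in pandas 2.0, so the sum is written here as a groupby on the index level.)
sample_df.stack().str.get_dummies().groupby(level=0).sum()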
Out[142]:
   A  B  C
0  2  5  4
1  4  4  3
2  2  4  5
3  4  4  3
4  4  4  3
5  3  4  4
6  5  3  3
7  4  4  3
8  3  3  5
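If the set of grades is known up front, another option is to skip the reshaping and compare the whole frame against each grade directly. This is a different technique from the str.get_dummies one above, offered only as a sketch; `small_df`, `grades`, and `counts` are hypothetical names, and whether it is faster on your data is something you would need to measure.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the real data.
small_df = pd.DataFrame({
    'level_1': ['A', 'B', 'C'],
    'level_2': ['A', 'B', 'B'],
    'level_3': ['C', 'A', 'C'],
})

grades = ['A', 'B', 'C']  # assumption: the possible grades are known in advance

# small_df.eq(g) is a boolean frame; summing across columns counts the
# matches in each row.
counts = pd.DataFrame(
    {f'counts_of_{g}': small_df.eq(g).sum(axis=1) for g in grades}
)

result = small_df.join(counts)
print(result)
```

This does one vectorized pass over the frame per grade, so it stays simple when there are only a handful of distinct grades.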
Answer 2 (score: 1)
@ALollz's answer is good, but here is my approach.
>>> dummy_df = pd.get_dummies(sample_df)
>>> sample_df['count_of_A'] = dummy_df.filter(regex=r'level_(\d+)_A').sum(axis=1)
>>> sample_df['count_of_A']
0    2
1    4
2    2
3    4
4    4
5    3
6    5
7    4
8    3
Similarly, if you have multiple grades:
>>> grades = list('ABC')
>>> for grade in grades:
...     sample_df[f'count_of_{grade}'] = dummy_df.filter(regex=rf'level_(\d+)_{grade}').sum(axis=1)
...
>>> sample_df.filter(regex='count_of_')
   count_of_A  count_of_B  count_of_C
0           2           5           4
1           4           4           3
2           2           4           5
3           4           4           3
4           4           4           3
5           3           4           4
6           5           3           3
7           4           4           3
8           3           3           5
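Since the question is about speed, it may be worth checking on your own machine that a vectorized version gives the same counts as the row-wise apply, and timing both. The harness below is only a sketch: the random frame `big_df`, its size, and the helper names `count_with_apply` and `count_with_get_dummies` are all made up for illustration, and the timings it prints will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

grades = ['A', 'B', 'C']

# Hypothetical stand-in for the real data: 2,000 random rows x 66 grade columns.
rng = np.random.default_rng(0)
big_df = pd.DataFrame(rng.choice(grades, size=(2_000, 66)),
                      columns=[f'level_{i}' for i in range(1, 67)])


def count_with_apply(frame):
    # Row-by-row baseline, similar in spirit to the code in the question.
    return frame.apply(
        lambda row: row.value_counts().reindex(grades, fill_value=0), axis=1)


def count_with_get_dummies(frame):
    # One-hot encode once, then sum the indicator columns of each grade.
    dummies = pd.get_dummies(frame)
    return pd.DataFrame(
        {g: dummies.filter(regex=rf'^level_\d+_{g}$').sum(axis=1) for g in grades})


timings = {}
results = {}
for name, func in [('apply', count_with_apply),
                   ('get_dummies', count_with_get_dummies)]:
    start = time.perf_counter()
    results[name] = func(big_df)
    timings[name] = time.perf_counter() - start
    print(f'{name}: {timings[name]:.3f} s')

# Both approaches should agree cell for cell.
same = (results['apply'].to_numpy() == results['get_dummies'].to_numpy()).all()
print('results match:', same)
```

Raising the row count toward the real 700,000 should give a more realistic picture, at the cost of a longer wait for the apply baseline.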