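Count occurrences of certain strings across multiple columns and return the total counts in new columns
I know I can use value_counts to count the total occurrences of each value in a given column: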
data['col'].value_counts(dropna=False)
Result:
[["win" TKO technical knockout] 336
[["win" UD unanimous decision] 307
[["win" KO knockout] 225
[["loss" UD unanimous decision] 97
[["loss" TKO technical knockout] 64
[["win" nan null] 53
[["draw" MD majority decision] 43
[["loss" KO knockout] 41
[["loss" MD majority decision] 35
[["loss" nan null] 32
[["loss" SD split decision] 29
[["unknown" nan null] 29
[["win" SD split decision] 27
[["draw" PTS null] 18
[["win" RTD corner retirement] 17
[["draw" SD split decision] 12
[["loss" RTD corner retirement] 11
[["win" MD majority decision] 9
[["loss" DQ disqualification] 6
[["win" PTS null] 6
[["unknown" NC null] 3
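The problem is that I would like to count the occurrences of, for instance, [["win" KO knockout] across all of the relevant columns (the relevant columns are col1 to col20).
Here is a sample of my data: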
{'col1': {0: ['["win" UD unanimous decision'],
1: ['["win" UD unanimous decision'],
2: ['["win" TKO technical knockout'],
3: ['["win" UD unanimous decision'],
4: ['["win" UD unanimous decision']},
'col2': {0: ['["win" TKO technical knockout'],
1: ['["win" TKO technical knockout'],
2: ['["win" TKO technical knockout'],
3: ['["win" UD unanimous decision'],
4: ['["win" UD unanimous decision']},
'col3': {0: ['["win" TKO technical knockout'],
1: ['["win" KO knockout'],
2: ['["win" TKO technical knockout'],
3: ['["win" TKO technical knockout'],
4: ['["win" UD unanimous decision']},
'col4': {0: ['["win" UD unanimous decision'],
1: ['["win" UD unanimous decision'],
2: ['["win" KO knockout'],
3: ['["win" TKO technical knockout'],
4: ['["win" UD unanimous decision']}}
In this case, the desired output would be:
   win UD  win TKO  win KO
0       2        2       0
1       2        1       1
2       0        3       1
3       2        2       0
4       4        0       0
Update:
I also tried using size and groupby:
#list of column names
col_outcome = ['col'+str(i) for i in range(1,11)]
data.groupby(col_outcome).size()
But this returns the following error message:
TypeError: unhashable type: 'list'
Answer 0 (score: 1)
IIUC, let's reshape the "wide" dataframe to "long" with stack, then do some string cleaning using regex replace and extract, next groupby and apply value_counts, and lastly reshape the result with unstack:
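import pandas as pd
# regex=True is required on pandas >= 2.0, where str.replace matches literally by default
df.stack().str[0].str.replace(r'\[|"', '', regex=True)\
  .str.extract(r'(\w+\s\w+)')\
  .groupby(level=0)[0].apply(pd.Series.value_counts).unstack(fill_value=0)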
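For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same stack / clean / count / unstack idea on the four sample columns from the question. The helper name count_outcomes and the short variables ud, tko and ko are my own, for illustration only, and I used groupby(...).value_counts() in place of apply(pd.Series.value_counts), which should give the same counts. It assumes every cell is a one-element list, as in the sample.

```python
import pandas as pd

# Sample data copied from the question: every cell is a one-element list.
ud = ['["win" UD unanimous decision']
tko = ['["win" TKO technical knockout']
ko = ['["win" KO knockout']

df = pd.DataFrame({
    'col1': [ud, ud, tko, ud, ud],
    'col2': [tko, tko, tko, ud, ud],
    'col3': [tko, ko, tko, tko, ud],
    'col4': [ud, ud, ko, tko, ud],
})


def count_outcomes(frame):
    """Count 'result method' pairs (e.g. 'win UD') per row (hypothetical helper)."""
    cleaned = (
        frame.stack()                           # wide -> long: one entry per (row, column)
        .str[0]                                 # unwrap the one-element list
        .str.replace(r'\[|"', '', regex=True)   # drop '[' and '"'
        .str.extract(r'(\w+\s\w+)')[0]          # keep the first two words
    )
    # Count each label within each original row, then pivot labels into columns.
    return cleaned.groupby(level=0).value_counts().unstack(fill_value=0)


counts = count_outcomes(df)
print(counts)
```

On the sample data the counts agree with the desired output in the question, e.g. row 2 has 0 "win UD", 3 "win TKO" and 1 "win KO". Stacking first also sidesteps the unhashable-list error, because the lists are unwrapped into plain strings before anything is grouped or counted.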