Counting string occurrences for each unique row across multiple columns

Time: 2019-11-11 16:36:59

Tags: python pandas

I want to count the occurrences of certain strings across multiple columns and return the totals in new columns.

I know I can use value_counts to count how often each value occurs in a given column:

data['col'].value_counts(dropna=False)

Result:

[["win" TKO technical knockout]     336
[["win" UD unanimous decision]      307
[["win" KO knockout]                225
[["loss" UD unanimous decision]      97
[["loss" TKO technical knockout]     64
[["win" nan null]                    53
[["draw" MD majority decision]       43
[["loss" KO knockout]                41
[["loss" MD majority decision]       35
[["loss" nan null]                   32
[["loss" SD split decision]          29
[["unknown" nan null]                29
[["win" SD split decision]           27
[["draw" PTS null]                   18
[["win" RTD corner retirement]       17
[["draw" SD split decision]          12
[["loss" RTD corner retirement]      11
[["win" MD majority decision]         9
[["loss" DQ disqualification]         6
[["win" PTS null]                     6
[["unknown" NC null]                  3

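The problem is that I want to count, for example, the occurrences of [["win" KO knockout] for each row across the relevant columns (the relevant columns are col1 to col20).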

Here is a sample of my data:

{'col1': {0: ['["win" UD unanimous decision'],
  1: ['["win" UD unanimous decision'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" UD unanimous decision'],
  4: ['["win" UD unanimous decision']},
 'col2': {0: ['["win" TKO technical knockout'],
  1: ['["win" TKO technical knockout'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" UD unanimous decision'],
  4: ['["win" UD unanimous decision']},
 'col3': {0: ['["win" TKO technical knockout'],
  1: ['["win" KO knockout'],
  2: ['["win" TKO technical knockout'],
  3: ['["win" TKO technical knockout'],
  4: ['["win" UD unanimous decision']},
 'col4': {0: ['["win" UD unanimous decision'],
  1: ['["win" UD unanimous decision'],
  2: ['["win" KO knockout'],
  3: ['["win" TKO technical knockout'],
  4: ['["win" UD unanimous decision']}}

In this case, the desired output would be:

      win UD   win TKO   win KO 
0       2         2         0
1       2         1         1
2       0         3         1
3       2         2         0
4       4         0         0

Update:

I also tried using size with groupby:

#list of column names
col_outcome = ['col'+str(i) for i in range(1,11)]
data.groupby(col_outcome).size()

But this returns the following error message:

  

TypeError: unhashable type: 'list'

1 Answer:

Answer 0: (score: 1)

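IIUC, let's use stack to reshape the "wide" dataframe to "long", then clean up the strings with replace, pull out the outcome label with a regex via extract, next use groupby and apply with value_counts, and finally reshape the result with unstack:

import pandas as pd
# df is the question's dataframe; each cell holds a one-element list
df.stack().str[0].str.replace(r'\[|"', '', regex=True)\
  .str.extract(r'(\w+\s\w+)')\
  .groupby(level=0)[0].apply(pd.Series.value_counts).unstack(fill_value=0)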
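
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same stack / extract / value_counts pipeline, run on the sample data from the question. The helper name count_outcomes and the short constants UD, TKO and KO are my own, introduced only for illustration. regex=True is passed explicitly on the assumption that you may be on a newer pandas release, where str.replace no longer treats the pattern as a regex by default.

```python
import pandas as pd

# Sample strings copied from the question (names UD/TKO/KO are illustrative).
UD = '["win" UD unanimous decision'
TKO = '["win" TKO technical knockout'
KO = '["win" KO knockout'

# Rebuild the question's sample: every cell is a one-element list.
df = pd.DataFrame({
    'col1': [[UD], [UD], [TKO], [UD], [UD]],
    'col2': [[TKO], [TKO], [TKO], [UD], [UD]],
    'col3': [[TKO], [KO], [TKO], [TKO], [UD]],
    'col4': [[UD], [UD], [KO], [TKO], [UD]],
})


def count_outcomes(frame):
    """Hypothetical helper: count labels such as 'win UD' per row."""
    # Wide -> long (index becomes (row, column)), then unwrap each list.
    long = frame.stack().str[0]
    # Strip the leading '[' and the double quotes.
    cleaned = long.str.replace(r'\[|"', '', regex=True)
    # Keep the first two words, e.g. 'win UD'.
    labels = cleaned.str.extract(r'(\w+\s\w+)')[0]
    # Count labels within each original row, then pivot labels to columns.
    counts = labels.groupby(level=0).apply(pd.Series.value_counts)
    return counts.unstack(fill_value=0)


result = count_outcomes(df)
print(result)
```

The per-row counts should match the desired table in the question (for example, row 1 has two "win UD", one "win TKO" and one "win KO"). The column order may differ, so select result[['win UD', 'win TKO', 'win KO']] if the order matters.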