Calculating the percentage of each list element for each value of a column in a pandas DataFrame

Time: 2020-06-02 06:39:42

Tags: python python-3.x pandas dataframe pandas-groupby

The df I am working with looks like this:

col1  col2
A     ['1','2','er']
A     []
B     ['1','3','4','abc']
B     ['5']
C     [] 

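For each value in col1, I want to calculate the percentage of each element in the lists in col2. For example: the percentage of 1 for A, the percentage of 2 for A, and the percentage of abc for B. I am looking for a solution that does this iteratively. Thanks.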

Link to the input data (before explode): https://drive.google.com/file/d/1fuOBo8PK1heAtfufBlplXXfh4FiLpBCD/view?usp=sharing

Link to the output after explode: https://drive.google.com/file/d/1mcArrsu3TWJC6hYZ2kIHAkAzCaHd1DLH/view?usp=sharing

1 answer:

Answer 0: (score: 2)

I believe you need DataFrame.explode with DataFrame.dropna:

#changed data for better sample     
print (df)
  col1           col2
0    A      [1, 2, 1]
1    A             []
2    B  [3, abc, abc]
3    B          [abc]
4    C             []

df2 = df.explode('col2').dropna(subset=['col2'])
print (df2)
  col1 col2
0    A    1
0    A    2
0    A    1
2    B    3
2    B  abc
2    B  abc
3    B  abc

Then use SeriesGroupBy.value_counts with normalize=True:

df2 = df2.groupby('col1')['col2'].value_counts(normalize=True).reset_index(name='%')
print (df2)
  col1 col2         %
0    A    1  0.666667
1    A    2  0.333333
2    B  abc  0.750000
3    B    3  0.250000
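
The two steps above can be combined into one self-contained script. This is a minimal sketch, assuming the list elements are all strings (mixed types can make sorting and comparison awkward). The names `exploded`, `shares`, and `percent` are chosen here for illustration. Note that `normalize=True` returns fractions between 0 and 1, so the sketch multiplies by 100 to get actual percentages.

```python
import pandas as pd

# Sample data modelled on the answer's example; all elements are strings (an assumption).
df = pd.DataFrame({
    "col1": ["A", "A", "B", "B", "C"],
    "col2": [["1", "2", "1"], [], ["3", "abc", "abc"], ["abc"], []],
})

# One row per list element; empty lists become NaN and are dropped.
exploded = df.explode("col2").dropna(subset=["col2"])

# Share of each element within its col1 group, expressed as a percentage.
shares = (
    exploded.groupby("col1")["col2"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
    .reset_index(name="percent")
)
print(shares)
```

Group C has only empty lists, so it does not appear in the result. If it should be reported, it would have to be added back separately, for example by reindexing on the unique values of col1.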

Edit:

import ast
import pandas as pd

df = pd.read_csv('beforeexplode.csv')

df['col2'] = df['col2'].apply(ast.literal_eval)
df2 = df.explode('col2').dropna(subset=['col2'])
print (df2)
     col1     col2
0    dev1  android
1    dev1  android
2    dev3     oscp
2    dev3     gpen
2    dev3      ceh
..    ...      ...
206  dev2     wcag
207  dev2    linux
207  dev2     unix
208  dev2    linux
208  dev2     unix

[460 rows x 2 columns]
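
The reason for `ast.literal_eval` in the edit is that lists read from a CSV file arrive as plain strings such as "['linux', 'unix']", and explode leaves strings untouched. The sketch below illustrates this without the original file: the inline CSV text is made-up data that only imitates the shape of the output above, and `csv_text` and `shares` are hypothetical names.

```python
import ast
import io

import pandas as pd

# Hypothetical stand-in for beforeexplode.csv; the real file's contents are not reproduced here.
csv_text = """col1,col2
dev1,"['android']"
dev3,"['oscp', 'gpen', 'ceh']"
dev2,"['linux', 'unix']"
dev2,"['linux']"
dev2,[]
"""

df = pd.read_csv(io.StringIO(csv_text))

# After read_csv every cell in col2 is a string, not a list.
print(type(df.loc[0, "col2"]))

# Safely parse each string into a real Python list.
df["col2"] = df["col2"].apply(ast.literal_eval)

# Same pipeline as before: explode, drop empties, then per-group shares.
df2 = df.explode("col2").dropna(subset=["col2"])
shares = (
    df2.groupby("col1")["col2"]
    .value_counts(normalize=True)
    .reset_index(name="share")
)
print(shares)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for data that comes from a file.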