Python Pandas: manipulating a DataFrame

Time: 2017-06-14 11:24:26

Tags: python pandas dataframe

My df looks like this:

names    col1   col2   col3   total     total_col1      total_col2
 bbb      1      1      0      2         DF1, DF2           DF1           
 ccc      1      0      0      1         DF1                        
 zzz      0      1      1      2                            DF2     
 qqq      0      1      0      1                           DF1, Df2
 rrr      0      0      1      1

I want to count the entries in each total_col# column and add another column, total_full, so that the output is:

names    col1   col2   col3   total  total_full     total_col1      total_col2
 bbb      1      1      0      2          5              2             1   
 ccc      1      0      0      1          2              1                      
 zzz      0      1      1      2          3              1    
 qqq      0      1      0      1          3              2
 rrr      0      0      1      1

So each total_col# column holds the number of DFs listed in it, and total_full adds those columns to the total column.

Can pandas do this?

2 answers:

Answer 0: (score: 0)

You can use:

# select the columns whose names start with 'total_'
cols = df.columns[df.columns.str.startswith('total_')]
# split each string on commas and store the length of the resulting list
df[cols] = df[cols].apply(lambda x: x.str.split(',').str.len())
# insert the new column right after the 'total' column
df.insert(df.columns.get_loc('total') + 1, 'total_full', df.filter(like='total').sum(axis=1))
print(df)
  names  col1  col2  col3  total  total_full  total_col1  total_col2
0   bbb     1     1     0      2         5.0         2.0         1.0
1   ccc     1     0     0      1         2.0         1.0         NaN
2   zzz     0     1     1      2         3.0         NaN         1.0
3   qqq     0     1     0      1         3.0         NaN         2.0
4   rrr     0     0     1      1         1.0         NaN         NaN
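For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The DataFrame is rebuilt from the table in the question. The last two lines are an optional addition that is not part of the answer: they assume a pandas version with the nullable `Int64` dtype, and they only change how the counts are displayed (integers with `<NA>` instead of floats with `NaN`).

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question's table
df = pd.DataFrame({
    'names': ['bbb', 'ccc', 'zzz', 'qqq', 'rrr'],
    'col1': [1, 1, 0, 0, 0],
    'col2': [1, 0, 1, 1, 0],
    'col3': [0, 0, 1, 0, 1],
    'total': [2, 1, 2, 1, 1],
    'total_col1': ['DF1, DF2', 'DF1', np.nan, np.nan, np.nan],
    'total_col2': ['DF1', np.nan, 'DF2', 'DF1, Df2', np.nan],
})

# Columns to convert from comma-separated strings to counts
cols = df.columns[df.columns.str.startswith('total_')]
df[cols] = df[cols].apply(lambda x: x.str.split(',').str.len())

# Row-wise sum of 'total' and the count columns; NaN is skipped by default
df.insert(df.columns.get_loc('total') + 1, 'total_full',
          df.filter(like='total').sum(axis=1))

# Optional (assumption, not in the answer): integer display for the counts
df[cols] = df[cols].astype('Int64')
df['total_full'] = df['total_full'].astype(int)
```

Note that `cols` is computed before `total_full` is inserted, so the new column is not picked up by the `startswith('total_')` filter.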

Answer 1: (score: 0)

You can use

totals = df.filter(regex=r'^total_col')
counts = (totals.stack().str.count(',')+1).unstack()
#    total_col1  total_col2
# 0         2.0         1.0
# 1         1.0         NaN
# 2         NaN         1.0
# 3         NaN         2.0

to count the number of strings in the total columns.
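As a small illustration of why this works: a string with n commas contains n + 1 comma-separated items, so `str.count(',') + 1` gives the same result as splitting and taking the length. The `stack()` call is needed because the `.str` accessor exists on a Series, not on a DataFrame. The Series below is made-up sample data, not taken from the question.

```python
import pandas as pd

# Hypothetical sample values, including one missing entry
s = pd.Series(['DF1, DF2', 'DF1', None])

# Number of commas plus one equals the number of comma-separated items
by_commas = s.str.count(',') + 1

# Same counts obtained by splitting and measuring the list length
by_split = s.str.split(',').str.len()

print(by_commas.tolist())
print(by_split.tolist())
```

Both approaches leave missing values as missing, which is why the counts come out as floats with NaN rather than integers.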

To sort the NaN values to the end of each row, you can use

counts_array = np.sort(counts.values, axis=1)
counts = pd.DataFrame(counts_array, columns=counts.columns, index=counts.index)

For example:

import numpy as np
import pandas as pd
nan = np.nan

df = pd.DataFrame({'col1': [1, 1, 0, 0, 0],
 'col2': [1, 0, 1, 1, 0],
 'col3': [0, 0, 1, 0, 1],
 'names': ['bbb', 'ccc', 'zzz', 'qqq', 'rrr'],
 'total': [2, 1, 2, 1, 1],
 'total_col1': ['DF1, DF2', 'DF1', nan, nan, nan],
 'total_col2': ['DF1', nan, 'DF2', 'DF1, Df2', nan]})

totals = df.filter(regex=r'^total_col')
counts = (totals.stack().str.count(',')+1).unstack()
counts_array = np.sort(counts.values, axis=1)
counts = pd.DataFrame(counts_array, columns=counts.columns, index=counts.index)
df[totals.columns] = counts
df['total_full'] = df.filter(regex=r'^total').sum(axis=1)
print(df)

yields

   col1  col2  col3 names  total  total_col1  total_col2  total_full
0     1     1     0   bbb      2         1.0         2.0         5.0
1     1     0     0   ccc      1         1.0         NaN         2.0
2     0     1     1   zzz      2         1.0         NaN         3.0
3     0     1     0   qqq      1         2.0         NaN         3.0
4     0     0     1   rrr      1         NaN         NaN         1.0
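One thing to be aware of: `np.sort` also reorders the non-NaN values, so row bbb ends up as (1, 2) rather than the (2, 1) shown in the question. If the original left-to-right order matters, a possible alternative (a sketch, not part of the answer) is to use a stable argsort on the NaN mask, which pushes NaN to the end of each row without reordering the rest. The `counts` frame below is typed in by hand from the counts shown above.

```python
import numpy as np
import pandas as pd

# Counts as produced by the answer, with the all-NaN row included
counts = pd.DataFrame({'total_col1': [2, 1, np.nan, np.nan, np.nan],
                       'total_col2': [1, np.nan, 1, 2, np.nan]})

values = counts.to_numpy(dtype=float)

# False (not NaN) sorts before True (NaN); a stable sort keeps the
# original order among the non-NaN values in each row
order = np.argsort(np.isnan(values), axis=1, kind='stable')
shifted_values = np.take_along_axis(values, order, axis=1)

shifted = pd.DataFrame(shifted_values, columns=counts.columns, index=counts.index)
print(shifted)
```

The total_full column is unaffected by either choice, since a row sum does not depend on the order of the values.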