Adding three columns based on filtering and user input with pandas

Time: 2018-02-12 13:55:39

Tags: python pandas filtering addition

I have a data frame read from an intermediate csv file. It contains the following data.

 sv1   val1    sv2    val2    sv3   val3
   2     0.2     4      0.6      8     0.3
   2     0.1     6      0.1      8     0.11
   2     0.12    6      -0.3     8     0.2
   5     0       4      1.6      8     0.7
   2     0.34    6      2.3      8     0.12
   ...   ....   ...     ....    ...   .....

Goal: if none of sv1, sv2, sv3 contains 5, compute val1 + val2 + val3. If any of the sv columns (say sv1) contains 5, the sum should be val2 + val3.

# Attempt
import pandas as pd
names=['sv1','val1','sv2','val2','sv3','val3']  # same order as the columns shown above
df=pd.read_csv('Myfile.csv',names=names)
discard_id=int(raw_input('enter the number to discard'))

add_result=df.loc[['sv1','sv2','sv3']!=discard_id]
           .....
          perform  addition

2 answers:

Answer 0 (score: 2)

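First compare all values in the sv columns with discard_id, then use any to get True for each row that has at least one match. Then sum the two subsets of val columns and use numpy.where to pick between them when assigning the new column: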

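import numpy as np

discard_id = 5

# True for rows where any sv column equals discard_id
m = (df[['sv1','sv2','sv3']] == discard_id).any(axis=1)
# row sums with and without val1
sum1 = df[['val1','val2','val3']].sum(axis=1)
sum2 = df[['val2','val3']].sum(axis=1)

# take sum2 where the mask is True, otherwise sum1
df['new'] = np.where(m, sum2, sum1)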

print (df)
   sv1  val1  sv2  val2  sv3  val3   new
0    2  0.20    4   0.6    8  0.30  1.10
1    2  0.10    6   0.1    8  0.11  0.31
2    2  0.12    6  -0.3    8  0.20  0.02
3    5  0.00    4   1.6    8  0.70  2.30
4    2  0.34    6   2.3    8  0.12  2.76

Details:

print (m)
0    False
1    False
2    False
3     True
4    False
dtype: bool

print (sum1)
0    1.10
1    0.31
2    0.02
3    2.30
4    2.76
dtype: float64

print (sum2)
0    0.90
1    0.21
2   -0.10
3    2.30
4    2.42
dtype: float64
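
For anyone who wants to run the steps above without the csv file, here is a minimal self-contained sketch. It assumes the five sample rows from the question are the whole data set and builds the frame by hand; everything else follows the code in this answer.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question, in the column order shown there
# (assumption: this stands in for the contents of Myfile.csv).
df = pd.DataFrame({
    'sv1':  [2, 2, 2, 5, 2],
    'val1': [0.2, 0.1, 0.12, 0.0, 0.34],
    'sv2':  [4, 6, 6, 4, 6],
    'val2': [0.6, 0.1, -0.3, 1.6, 2.3],
    'sv3':  [8, 8, 8, 8, 8],
    'val3': [0.3, 0.11, 0.2, 0.7, 0.12],
})

discard_id = 5

# Row-wise mask: True where at least one sv column equals discard_id.
m = (df[['sv1', 'sv2', 'sv3']] == discard_id).any(axis=1)
sum1 = df[['val1', 'val2', 'val3']].sum(axis=1)
sum2 = df[['val2', 'val3']].sum(axis=1)

# np.where(condition, a, b) picks from a where condition is True, else from b.
df['new'] = np.where(m, sum2, sum1)
print(df)
```

Note that this always drops val1 for a matching row, whichever sv column matched. That agrees with the example in the question, where 5 only appears in sv1.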

Timings:

df = pd.concat([df] * 1000, ignore_index=True)

In [312]: %%timeit
     ...: m = (df[['sv1','sv2','sv3']] == discard_id).any(axis=1)
     ...: sum1 = df[['val1','val2','val3']].sum(axis=1)
     ...: sum2 = df[['val2','val3']].sum(axis=1)
     ...: df['new'] = np.where(m, sum2, sum1)
     ...: 
100 loops, best of 3: 2.77 ms per loop

#jp_data_analysis's solution
In [313]: %%timeit
     ...: df['sum'] = df.apply(summer, axis=1, num=5)
     ...: 
1 loop, best of 3: 287 ms per loop
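
The comparison can also be reproduced outside IPython with the standard-library timeit module. This is only a sketch: the wrapper names vectorized and row_wise are made up for illustration, the frame is rebuilt from the sample rows in the question, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Sample rows from the question, repeated 1000 times as in the timing above.
df = pd.DataFrame({
    'sv1':  [2, 2, 2, 5, 2],
    'val1': [0.2, 0.1, 0.12, 0.0, 0.34],
    'sv2':  [4, 6, 6, 4, 6],
    'val2': [0.6, 0.1, -0.3, 1.6, 2.3],
    'sv3':  [8, 8, 8, 8, 8],
    'val3': [0.3, 0.11, 0.2, 0.7, 0.12],
})
big = pd.concat([df] * 1000, ignore_index=True)


def vectorized(frame, discard_id=5):
    # Hypothetical wrapper around the np.where approach from this answer.
    m = (frame[['sv1', 'sv2', 'sv3']] == discard_id).any(axis=1)
    sum1 = frame[['val1', 'val2', 'val3']].sum(axis=1)
    sum2 = frame[['val2', 'val3']].sum(axis=1)
    return np.where(m, sum2, sum1)


def summer(row, num):
    # Row-wise function from the other answer.
    return sum(i for i, j in zip([row['val1'], row['val2'], row['val3']],
                                 [row['sv1'], row['sv2'], row['sv3']]) if j != num)


def row_wise(frame, num=5):
    # Hypothetical wrapper around the apply-based approach.
    return frame.apply(summer, axis=1, num=num)


runs = 3
t_vec = timeit.timeit(lambda: vectorized(big), number=runs) / runs
t_row = timeit.timeit(lambda: row_wise(big), number=runs) / runs
print('vectorized: %.4f s per run' % t_vec)
print('row-wise:   %.4f s per run' % t_row)
```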

Answer 1 (score: 0)

Here is one way:

def summer(row, num):
    return sum(i for i, j in zip([row['val1'], row['val2'], row['val3']],
                                 [row['sv1'], row['sv2'], row['sv3']]) if j!=num)

df['sum'] = df.apply(summer, axis=1, num=5)
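
The same per-column rule (drop each val whose matching sv equals the number) can probably also be written without apply. The sketch below is not from either answer: masked_sum is a made-up helper name, and it assumes the sv and val columns pair up by position. It relies on DataFrame.where turning masked entries into NaN and on sum skipping NaN by default.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (assumption: stands in for the csv contents).
df = pd.DataFrame({
    'sv1':  [2, 2, 2, 5, 2],
    'val1': [0.2, 0.1, 0.12, 0.0, 0.34],
    'sv2':  [4, 6, 6, 4, 6],
    'val2': [0.6, 0.1, -0.3, 1.6, 2.3],
    'sv3':  [8, 8, 8, 8, 8],
    'val3': [0.3, 0.11, 0.2, 0.7, 0.12],
})

sv_cols = ['sv1', 'sv2', 'sv3']
val_cols = ['val1', 'val2', 'val3']


def masked_sum(frame, sv_cols, val_cols, discard_id):
    # Hypothetical helper: sv_cols[k] is assumed to pair with val_cols[k].
    # keep[i, k] is True when the k-th sv value in row i differs from discard_id.
    keep = frame[sv_cols].to_numpy() != discard_id
    # where() keeps values where keep is True and puts NaN elsewhere;
    # sum(axis=1) skips NaN by default, so masked values do not contribute.
    return frame[val_cols].where(keep).sum(axis=1)


df['sum'] = masked_sum(df, sv_cols, val_cols, discard_id=5)
print(df)
```

Unlike the np.where version, this drops val2 rather than val1 when the 5 appears in sv2, which matches what summer does row by row.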