New column in a DataFrame based on the position of a value in another column

Time: 2017-07-16 18:08:13

Tags: python dataframe

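I am trying to create a new column 'ratioA' in the DataFrame df whose values are derived from column A:

For a given row, df['ratioA'] should equal the ratio between df['A'] in that row and df['A'] in the next row.

I iterate over the index column as a reference, but I am not sure why every value shows up as NaN. Technically, only the last row should be NaN.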

import numpy as np
import pandas as pd

series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})

df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
df = df.reset_index()

for i in df['index']:
    df['ratioA'] = df['A'][df['index']==i]/df['A'][df['index']==i+1]

print (df)

The output is:

   index  A  B  ratioA
0      0  1  2     NaN
1      1  3  4     NaN
2      2  5  6     NaN
3      3  7  8     NaN

The desired output is:

   index  A  B  ratioA
0      0  1  2     0.33
1      1  3  4     0.60
2      2  5  6     0.71
3      3  7  8     NaN

1 answer:

Answer 0: (score: 1)

You can use a vectorized solution: divide column A (with div) by column A shifted up one row (with shift):

print (df['A'].shift(-1))
0    3.0
1    5.0
2    7.0
3    NaN
Name: A, dtype: float64

df['ratioA'] = df['A'].div(df['A'].shift(-1))
print (df)
   index  A  B    ratioA
0      0  1  2  0.333333
1      1  3  4  0.600000
2      2  5  6  0.714286
3      3  7  8       NaN
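
As for why the loop in the question produces NaN everywhere, the likely cause is index alignment. The sketch below is an illustration rather than part of the original answer; the names num, den, mismatched and ratio are made up for the example. Each boolean mask returns a one-row Series that keeps its original label, and dividing two Series with different labels aligns on those labels, so nothing matches. Each pass of the loop also overwrites the whole 'ratioA' column.

```python
import pandas as pd

# Same data as in the question, built directly from a dict.
df = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]}).reset_index()

# Each boolean mask selects a one-row Series that keeps its original label.
num = df['A'][df['index'] == 0]   # one value, labelled 0
den = df['A'][df['index'] == 1]   # one value, labelled 1

# Series division aligns on labels; 0 and 1 never match,
# so the result holds NaN for both labels.
mismatched = num / den
print(mismatched)

# Taking the values out of the Series (plain scalars) avoids alignment.
ratio = num.iloc[0] / den.iloc[0]
print(ratio)
```

shift(-1) sidesteps the problem because the shifted column keeps the same labels as the original, so the division lines up row by row.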

Loops in pandas are very slow, so it is best to avoid them (Jeff, a pandas developer, explains it better). For comparison, a loop-based version looks like this:

for i, row in df.iterrows():
    if i != df.index[-1]:
        df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A'] 
print (df)
   index  A  B    ratioA
0      0  1  2  0.333333
1      1  3  4  0.600000
2      2  5  6  0.714286
3      3  7  8       NaN

Timings:

series1 = pd.Series({'A': 1, 'B': 2})
series2 = pd.Series({'A': 3, 'B': 4})
series3 = pd.Series({'A': 5, 'B': 6})
series4 = pd.Series({'A': 7, 'B': 8})

df = pd.DataFrame([series1, series2, series3, series4], index=[0,1,2,3])
#[4000 rows x 3 columns]
df = pd.concat([df]*1000).reset_index(drop=True)
df = df.reset_index()


In [49]: %timeit df['ratioA1'] = df['A'].div(df['A'].shift(-1))
1000 loops, best of 3: 431 µs per loop

In [50]: %%timeit 
    ...: for i, row in df.iterrows():
    ...:     if i != df.index[-1]:
    ...:         df.loc[i, 'ratioA'] = df.loc[i,'A'] / df.loc[i+1, 'A']
    ...: 
1 loop, best of 3: 2.15 s per loop
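
The timings above rely on IPython's %timeit magic. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The helper names vectorized and looped are hypothetical, and the frame is cut to 1000 rows (an arbitrary choice) so the loop finishes quickly; actual numbers will vary by machine and pandas version.

```python
import timeit
import pandas as pd

# Smaller frame than in the answer (1000 rows), chosen only to keep the loop short.
base = pd.DataFrame({'A': [1, 3, 5, 7], 'B': [2, 4, 6, 8]})
df = pd.concat([base] * 250).reset_index(drop=True)

def vectorized(frame):
    # Divide each value of A by the value in the next row.
    return frame['A'].div(frame['A'].shift(-1))

def looped(frame):
    # Row-by-row version from the answer, run on a copy so df stays untouched.
    out = frame.copy()
    last = out.index[-1]
    for i, row in out.iterrows():
        if i != last:
            out.loc[i, 'ratioA'] = out.loc[i, 'A'] / out.loc[i + 1, 'A']
    return out['ratioA']

# Average seconds per call for each approach.
t_vec = timeit.timeit(lambda: vectorized(df), number=10) / 10
t_loop = timeit.timeit(lambda: looped(df), number=1)
print('vectorized: %.6f s per call' % t_vec)
print('loop:       %.6f s per call' % t_loop)
```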