Question

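I want to sum the values of two columns into a new column, working separately on each subset of a DataFrame that satisfies a condition. I am not sure how to do this, because I get an error when I try to access the new column created in the process: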

import pandas as pd 

d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3]
    }
df=pd.DataFrame(d1)

for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Selects the rows matching RUNno
    df[df1]['NewColumn']=df[df1]['X']+df[df1]['Y'] #For the selected dataset, calculates the sum of two columns and creates a new column
    print(df[df1].NewColumn) #Print the contents of the new column

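I cannot get the contents of df[df1].NewColumn, because the key NewColumn is not recognized. I am fairly sure this way of creating a new column works on the plain DataFrame df, but I do not understand why it fails on df[df1]. For example,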

df['NewColumn']=df['X']+df['Y'] 
df.NewColumn

works seamlessly.

Update: in my actual problem, the values that go into the new column come from two different DataFrames.

import pandas as pd 
from scipy.interpolate import interp1d 
interpolating_functions=dict() 
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } 
d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000], 
    'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3], 
    'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] } 
df=pd.DataFrame(d1) 
df2=pd.DataFrame(d2)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno 
    df3=df.RUN==RUNno 
    interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y) 
    df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X']) 
    print(df[df1].NewColumn)

Answer 1

Use a custom function with GroupBy.apply: create the new column inside the function and return each group, here x:

def func(x):
    #check groups
    print (x)
    #working with groups DataFrame x
    x['NewColumn']=x['X']+x['Y']
    return x

df = df.groupby('RUN', group_keys=False).apply(func) #group_keys=False keeps the original row index

print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
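
The GroupBy.apply approach above can be sketched as a self-contained script. This is only an illustration under some assumptions: the function name add_sum is hypothetical, the group is copied before it is modified, and the columns X and Y are selected explicitly so that the result does not depend on how a given pandas version treats the grouping column inside apply.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [1, 10, 100, 1000, 1, 10, 100, 1000, 1, 10, 100, 1000],
    'Y': [0.2, 0.5, 0.4, 1.2, 0.1, 0.25, 0.2, 0.6, 0.05, 0.125, 0.1, 0.3],
    'RUN': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
})

def add_sum(group):
    # Hypothetical helper: work on a copy so the object handed in by pandas
    # is not modified in place.
    group = group.copy()
    group['NewColumn'] = group['X'] + group['Y']
    return group

# group_keys=False keeps the original row index instead of adding RUN as an index level.
result = df.groupby('RUN', group_keys=False)[['X', 'Y']].apply(add_sum)

# Assignment aligns on the index, so each row receives the value computed for it.
df['NewColumn'] = result['NewColumn']
print(df)
```

For a plain sum of two columns the grouping is not strictly needed, since df['X'] + df['Y'] gives the same values; the pattern becomes useful when the function does something that really depends on the group.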

It seems you need loc to select the column through the mask. The selected rows must have the same length in both DataFrames:

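for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Rows of df that belong to this run
    df3=df2.RUN==RUNno #Rows of df2 that belong to this run
    interpolating_functions[RUNno]=interp1d(df2.loc[df3, 'X'], df2.loc[df3,'Y'])
    #Assign through loc so that the values are written into df itself
    df.loc[df1, 'NewColumn'] = df.loc[df1, 'X'] + interpolating_functions[RUNno](df2.loc[df3, 'X'])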

print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
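
To see why the loc form behaves differently from df[df1]['NewColumn'] = ..., here is a minimal sketch on a smaller, made-up DataFrame (the names mask, subset and reached_df are only for illustration). Boolean indexing returns a separate object, so a column added to it never reaches df, while a single loc call addresses df directly.

```python
import pandas as pd

df = pd.DataFrame({
    'X': [1, 10, 1, 10],
    'Y': [0.2, 0.5, 0.1, 0.25],
    'RUN': [1, 1, 2, 2],
})
mask = df['RUN'] == 1

# Boolean indexing builds a separate object; .copy() makes that explicit.
subset = df[mask].copy()
subset['NewColumn'] = subset['X'] + subset['Y']
reached_df = 'NewColumn' in df.columns  # False: only the copy got the column
print(reached_df)

# A single .loc call selects rows and column of df itself, so the write sticks.
df.loc[mask, 'NewColumn'] = df.loc[mask, 'X'] + df.loc[mask, 'Y']
print(df)  # rows where RUN != 1 hold NaN in NewColumn
```

Rows outside the mask are filled with NaN until a later iteration of the loop assigns them, which is why the full loop ends up with a complete column.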

Computing a new column on a subset of a pandas DataFrame returns a column-not-found error
