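I want to add two columns and store the result in a new column, but only on the subset of a DataFrame that satisfies a condition. I'm not sure how to do this, because I get an error when I try to access the new column created in the process: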
import pandas as pd
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3]
}
df=pd.DataFrame(d1)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno #Selects the rows matching RUNno
    df[df1]['NewColumn']=df[df1]['X']+df[df1]['Y'] #For the selected dataset, calculates the sum of two columns and creates a new column
    print(df[df1].NewColumn) #Print the contents of the new column
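I can't get the contents of df[df1].NewColumn because the key NewColumn is not recognized. I'm fairly sure this way of creating a new column works on a regular DataFrame df, but I don't understand why it doesn't work on df[df1]. For example,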
df['NewColumn']=df['X']+df['Y']
df.NewColumn
works seamlessly.
To update the question: the values that go into the new column come from two different DataFrames.
import pandas as pd
from scipy.interpolate import interp1d
interpolating_functions=dict()
d1={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
d2={'X':[1,10,100,1000,1,10,100,1000,1,10,100,1000],
'Y':[0.2,0.5,0.4,1.2,0.1,0.25,0.2,0.6,0.05,0.125,0.1,0.3],
'RUN':[1,1,1,1,2,2,2,2,3,3,3,3] }
df=pd.DataFrame(d1)
df2=pd.DataFrame(d2)
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2[df3].X,df2[df3].Y)
    df[df1]['NewColumn']=df[df1]['X']+interpolating_functions[RUNno](df2[df3]['X'])
    print(df[df1].NewColumn)
Answer 0 (score: 1)
Use a custom function with GroupBy.apply: create the new column inside the function and return each group (here x):
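def func(x):
    # inspect each group
    print(x)
    # x is the sub-DataFrame for one group
    x['NewColumn'] = x['X'] + x['Y']
    return x

# group_keys=False keeps the original flat index on newer pandas versions
df = df.groupby('RUN', group_keys=False).apply(func)
print(df)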
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
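The same per-group idea can be sketched for the updated question, where the added values come from a second DataFrame. This is only an illustration under assumptions: it iterates over the groupby object directly instead of calling apply, it uses np.interp as a stand-in for interp1d (both interpolate linearly by default), and df2 is assumed to have the same layout as df. The names pieces, ref and result are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "X": [1, 10, 100, 1000, 1, 10, 100, 1000],
    "Y": [0.2, 0.5, 0.4, 1.2, 0.1, 0.25, 0.2, 0.6],
    "RUN": [1, 1, 1, 1, 2, 2, 2, 2],
})
df2 = df.copy()  # assumption: the second frame mirrors the first, as in the question

pieces = []
for run, group in df.groupby("RUN"):
    # rows of the second frame that belong to the same run
    ref = df2[df2["RUN"] == run]
    group = group.copy()  # work on an explicit copy of the group
    # np.interp needs increasing x-coordinates; ref["X"] is already sorted here
    interpolated = np.interp(group["X"].to_numpy(),
                             ref["X"].to_numpy(),
                             ref["Y"].to_numpy())
    group["NewColumn"] = group["X"] + interpolated
    pieces.append(group)

# stitch the groups back together in their original row order
result = pd.concat(pieces).sort_index()
print(result)
```

Building each group on an explicit copy avoids writing into a temporary object, which is what made the original loop lose the new column.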
It seems you need loc to select the rows by mask and the column by name; the index must have the same length in both DataFrames:
for RUNno in (df.RUN.unique()):
    df1=df.RUN==RUNno
    df3=df.RUN==RUNno
    interpolating_functions[RUNno]=interp1d(df2.loc[df3, 'X'], df2.loc[df3,'Y'])
    df.loc[df1, 'NewColumn'] = df.loc[df1, 'X'] + interpolating_functions[RUNno](df2.loc[df3, 'X'])
print (df)
       X      Y  RUN  NewColumn
0      1  0.200    1      1.200
1     10  0.500    1     10.500
2    100  0.400    1    100.400
3   1000  1.200    1   1001.200
4      1  0.100    2      1.100
5     10  0.250    2     10.250
6    100  0.200    2    100.200
7   1000  0.600    2   1000.600
8      1  0.050    3      1.050
9     10  0.125    3     10.125
10   100  0.100    3    100.100
11  1000  0.300    3   1000.300
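As a rough illustration of why loc makes the difference: df[mask] produces a new DataFrame, so a column added to it never reaches df, while df.loc[mask, 'NewColumn'] writes into df itself. The small frame and the names mask, subset, added_via_subset and added_via_loc below are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    "X": [1, 10, 1, 10],
    "Y": [0.2, 0.5, 0.1, 0.25],
    "RUN": [1, 1, 2, 2],
})
mask = df["RUN"] == 1

# Boolean selection returns a separate object; the explicit copy makes that visible.
subset = df[mask].copy()
subset["NewColumn"] = subset["X"] + subset["Y"]
added_via_subset = "NewColumn" in df.columns  # the original frame is unchanged

# .loc assigns into df directly; rows outside the mask are filled with NaN.
df.loc[mask, "NewColumn"] = df.loc[mask, "X"] + df.loc[mask, "Y"]
added_via_loc = "NewColumn" in df.columns

print(added_via_subset, added_via_loc)
print(df)
```

Inside a loop over all RUN values, each pass fills in the rows for its own mask, so the NaN entries disappear once every run has been processed.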