在下文中,func
表示使用多个列的功能(在整个组中具有耦合),并且无法直接在pandas.Series
上运行。 0*d['x']
语法是我能想到的最轻的强制转换,但我认为这很尴尬。
此外,生成的pandas.Series
(s
)仍然包含组索引,在将列添加到pandas.DataFrame
之前必须将其删除。 s.reset_index(...)
索引操作看起来很脆弱且容易出错,所以我很好奇是否可以避免。这样做有成语吗?
import pandas
import numpy
df = pandas.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
print('# df\n', df)
def func(d):
x = numpy.array(d['x'])
y = numpy.array(d['y'])
# I want to do math with x,y that cannot be applied to
# pandas.Series, so explicitly convert to numpy arrays.
#
# We have to return an appropriately-indexed pandas.Series
# in order for it to be admissible as a column in the
# pandas.DataFrame. Instead of simply "return x + y", we
# have to make the conversion.
return 0*d['x'] + x + y
s = df.groupby(df.index).apply(func)
# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)
df['z'] = s
print('# df\n', df)
答案 0 :(得分:3)
而不是
0*d['x'] + x + y
你可以使用
pd.Series(x+y, index=d.index)
使用groupy-apply
时,请使用以下方法删除组密钥索引:
s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
您可以告诉groupby
使用the keyword parameter group_keys=False
删除密钥:
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
def func(d):
x = np.array(d['x'])
y = np.array(d['y'])
return pd.Series(x+y, index=d.index)
df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)
产量
x y z
i j
1 1 0 1000.000000 1000.000000
1 1 1000.841471 1001.841471
1 2 1000.909297 1002.909297
1 3 1000.141120 1003.141120
2 0 2000.000000 2000.000000
2 1 2000.841471 2001.841471
2 2 2000.909297 2002.909297
2 3 2000.141120 2003.141120