Clean pandas apply with a function that cannot use pandas.Series and a non-unique index

Time: 2014-09-06 06:16:12

Tags: numpy pandas

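In the following, func represents a function that uses multiple columns (with coupling across the whole group) and cannot operate directly on a pandas.Series. The 0*d['x'] syntax is the lightest way I could think of to force the conversion, but I find it awkward.

Additionally, the resulting pandas.Series (s) still includes the group index, which must be removed before it can be added as a column to the pandas.DataFrame. The s.reset_index(...) index manipulation seems fragile and error-prone, so I am curious whether it can be avoided. Is there an idiom for doing this?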

import pandas
import numpy

df = pandas.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = numpy.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])
print('# df\n', df)

def func(d):
    x = numpy.array(d['x'])
    y = numpy.array(d['y'])
    # I want to do math with x,y that cannot be applied to
    # pandas.Series, so explicitly convert to numpy arrays.
    #
    # We have to return an appropriately-indexed pandas.Series
    # in order for it to be admissible as a column in the
    # pandas.DataFrame.  Instead of simply "return x + y", we
    # have to make the conversion.
    return 0*d['x'] + x + y

s = df.groupby(df.index).apply(func)

# The Series is still adorned with the (unnamed) group index,
# which will prevent adding as a column of df due to
# Exception: cannot handle a non-unique multi-index!
s = s.reset_index(level=0, drop=True)
print('# s\n', s)

df['z'] = s
print('# df\n', df)

1 answer:

Answer 0 (score: 3)

Instead of

0*d['x'] + x + y

you can use

pd.Series(x+y, index=d.index)
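To make the effect of that replacement concrete, here is a minimal sketch on a single stand-in group. The frame d, its column values, and the labels "a" and "b" are made up for illustration; the only point is that the NumPy result is wrapped in a Series that reuses the group's own index, duplicates included.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one group, with a deliberately non-unique index.
d = pd.DataFrame({"x": [0, 1, 2], "y": [10.0, 20.0, 30.0]}, index=["a", "a", "b"])

# Do the math on plain NumPy arrays, as func does in the question.
x = np.array(d["x"])
y = np.array(d["y"])

# Wrap the array result in a Series carrying the group's index,
# so no 0*d['x'] trick is needed to get the labels back.
result = pd.Series(x + y, index=d.index)
print(result)
```

The index is passed through untouched, so repeated labels cause no trouble at this step.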

With groupby-apply, the group key index can be removed after the fact like this:

s = df.groupby(df.index).apply(func)
s = s.reset_index(level=0, drop=True)
df['z'] = s
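As a rough illustration of what the reset_index call does, the sketch below builds by hand a Series shaped like a groupby-apply result with an extra outer level for the group key. The labels "g1" and "g2" and the values are hypothetical; no groupby is involved, so this only shows the index manipulation itself.

```python
import pandas as pd

# Hypothetical stand-in for a groupby-apply result: the outermost (unnamed)
# level holds the group key, the inner levels are the original index (i, j).
index = pd.MultiIndex.from_tuples(
    [("g1", 1, 1), ("g1", 1, 1), ("g2", 1, 2), ("g2", 1, 2)],
    names=[None, "i", "j"],
)
s = pd.Series([10.0, 11.0, 20.0, 21.0], index=index)

# Remove level 0 (the group key); drop=True discards its values
# instead of turning them into a column.
stripped = s.reset_index(level=0, drop=True)
print(stripped.index.names)
```

After the call only the i and j levels remain, which is the shape needed to line up with df's index.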

Alternatively, you can tell groupby not to add the keys at all by passing the keyword parameter group_keys=False:

df['z'] = df.groupby(df.index, group_keys=False).apply(func)

import pandas as pd
import numpy as np

df = pd.DataFrame(dict(i=[1]*8,j=[1]*4+[2]*4,x=list(range(4))*2))
df['y'] = np.sin(df['x']) + 1000*df['j']
df = df.set_index(['i','j'])

def func(d):
    x = np.array(d['x'])
    y = np.array(d['y'])
    return pd.Series(x+y, index=d.index)

df['z'] = df.groupby(df.index, group_keys=False).apply(func)
print(df)

yields

     x            y            z
i j                             
1 1  0  1000.000000  1000.000000
  1  1  1000.841471  1001.841471
  1  2  1000.909297  1002.909297
  1  3  1000.141120  1003.141120
  2  0  2000.000000  2000.000000
  2  1  2000.841471  2001.841471
  2  2  2000.909297  2002.909297
  2  3  2000.141120  2003.141120