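I'm trying to apply simple functions to mostly numeric data in pandas. The data is a set of matrices indexed by time. I want to represent it with a hierarchical (multi-level) index, then use split-apply-combine-like operations to group the data, apply an operation, and summarize the results as a dataframe. I'd like the results of these operations to be dataframes, not Series objects.
Below is a simple example in which two matrices (two time points) are represented as a multi-level dataframe. I want to subtract a matrix from each time point, then collapse the data by taking the mean, and get back a dataframe that preserves the original column names of the data.
Everything I try either fails or gives odd results. I tried to follow http://pandas.pydata.org/pandas-docs/stable/groupby.html, since I think this is basically a split-apply-combine operation, but the documentation is hard to understand and the examples are dense.
How can this be achieved in pandas? I annotated where my code fails on the relevant lines: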
import pandas
import numpy as np

t1 = pandas.DataFrame([[0, 0, 0],
                       [0, 1, 1],
                       [5, 5, 5]], columns=[1, 2, 3], index=["A", "B", "C"])
t2 = pandas.DataFrame([[10, 10, 30],
                       [5, 1, 1],
                       [2, 2, 2]], columns=[1, 2, 3], index=["A", "B", "C"])
m = np.ones([3, 3])
c = pandas.concat([t1, t2], keys=["t1", "t2"], names=["time", "name"])
# print("c: ", c)

# How to view just the 'time' column values?
# print(c.ix["time"])  # fails
# print(c["time"])     # fails

# How to group matrix by time, subtract value from each matrix, and then
# take the mean across the columns and get a dataframe back?
result = c.groupby(level="time").apply(lambda x: np.mean(x - m, axis=1))
# Why does 'result' appear to have TWO "time" columns?!
print(result)
# Why is 'result' a series and not a dataframe?
print(type(result))

# Attempt to get a dataframe back
df = pandas.DataFrame(result)
# Why does 'df' have a weird '0' outer (hierarchical) column??
print(df)
#                          0
# time time name
# t1   t1   A      -1.000000
#           B      -0.333333
#           C       4.000000
# t2   t2   A      15.666667
#           B       1.333333
#           C       1.000000
In short, what I want to do is:
for each time point:
    subtract m from time point matrix
    collapse the result matrix across the columns by taking the mean (preserving the row labels "A", "B", "C")
    return result as dataframe
Answer 0 (score: 1)
How to view just the 'time' column values?
In [11]: c.index.levels[0].values
Out[11]: array(['t1', 't2'], dtype=object)
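Note that `levels[0]` returns only the unique labels of the level. If what you want is the 'time' label of every row, or 'time' as an ordinary column, something like the sketch below may be closer. It rebuilds `c` from the question so that it runs on its own; the variable names `unique_times`, `per_row_times` and `as_columns` are just illustrative.

```python
import pandas as pd

t1 = pd.DataFrame([[0, 0, 0], [0, 1, 1], [5, 5, 5]],
                  columns=[1, 2, 3], index=["A", "B", "C"])
t2 = pd.DataFrame([[10, 10, 30], [5, 1, 1], [2, 2, 2]],
                  columns=[1, 2, 3], index=["A", "B", "C"])
c = pd.concat([t1, t2], keys=["t1", "t2"], names=["time", "name"])

# Unique labels of the 'time' level (same information as c.index.levels[0])
unique_times = c.index.levels[0]

# One 'time' label per row, looked up by level name
per_row_times = c.index.get_level_values("time")

# Move both index levels into regular columns, so that
# as_columns["time"] works like ordinary column access
as_columns = c.reset_index()
```

'time' is a level of the row index rather than a column, which is why `c["time"]` fails in the question.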
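How to group matrix by time, subtract value from each matrix, and then take the mean across the columns and get a dataframe back?
Your attempt was very close: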
In [46]: c.groupby(level='time').apply(lambda x: x - m).mean(axis=1)
Out[46]:
time  name
t1    A       -1.000000
      B       -0.333333
      C        4.000000
t2    A       15.666667
      B        1.333333
      C        1.000000
dtype: float64
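The result above is still a Series, because taking the mean along `axis=1` collapses the three columns into a single value per row. Here is a minimal sketch of two ways to get a dataframe back, using the same data as the question. Passing `group_keys=False` is my assumption for keeping the original (time, name) index, since on recent pandas versions the default may prepend an extra 'time' level. The column name "mean" is just an illustrative label.

```python
import numpy as np
import pandas as pd

t1 = pd.DataFrame([[0, 0, 0], [0, 1, 1], [5, 5, 5]],
                  columns=[1, 2, 3], index=["A", "B", "C"])
t2 = pd.DataFrame([[10, 10, 30], [5, 1, 1], [2, 2, 2]],
                  columns=[1, 2, 3], index=["A", "B", "C"])
m = np.ones([3, 3])
c = pd.concat([t1, t2], keys=["t1", "t2"], names=["time", "name"])

# Subtract m from each 3x3 time-point block; group_keys=False avoids
# adding the group key as an extra index level
shifted = c.groupby(level="time", group_keys=False).apply(lambda x: x - m)

# Mean across the columns: one value per (time, name) row, as a Series
row_means = shifted.mean(axis=1)

# Option 1: a one-column dataframe that keeps the (time, name) index
long_df = row_means.to_frame(name="mean")

# Option 2: rows "A", "B", "C" with one column per time point
wide_df = row_means.unstack(level="time")
```

`long_df` keeps one row per (time, name) pair, while `wide_df` is usually easier to read if you want to compare time points side by side.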