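I have some pandas (Python) DataFrames that were created by collecting data roughly every 8 ms. The data is broken into blocks, and the sequence restarts with each block. Every block has a label, and there is a timestamp column giving the time each sample was collected (measured from the start of the file). To give an idea, the frame looks like this: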
|        | EXPINDEX | EXPTIMESTAMP | DATA1 | DATA2 |
----------------------------------------------------
| BLOCK  | 0        |              |       |       |
| Block1 | 1        | 0            | .423  | .926  |
|        | 2        | 8.215        | .462  | .919  |
|        | 3        | 17.003       | .472  | .904  |
| Block2 | 4        | 55.821       | .243  | .720  |
|        | 5        | 63.521       | .237  | .794  |
| ...    | ...      | ...          | ...   | ...   |
----------------------------------------------------
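The EXPTIMESTAMP column is a DatetimeIndex. What I would like to do is keep that column around for later use, but create a different sub-index holding a block-relative DatetimeIndex, for example: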
|        |                | EXPTIMESTAMP | DATA1 | DATA2 |
----------------------------------------------------------
| BLOCK  | BLOCKTIMESTAMP |              |       |       |
| Block1 | 0              | 0            | .423  | .926  |
|        | 8.215          | 8.215        | .462  | .919  |
|        | 17.003         | 17.003       | .472  | .904  |
| Block2 | 0              | 55.821       | .243  | .720  |
|        | 7.700          | 63.521       | .237  | .794  |
| ...    | ...            | ...          | ...   | ...   |
----------------------------------------------------------
I have accomplished this with:
blockreltimestamp = []
blocks = list(df.index.levels[0])
for block in blocks:
    # Rows belonging to this block only
    dfblock = df.xs(block, level='BLOCK').copy()
    # The first timestamp in the block is the block's zero point
    dfblock["InitialVal"] = dfblock.iloc[0]["EXPTIMESTAMP"]
    reltime = dfblock["EXPTIMESTAMP"] - dfblock["InitialVal"]
    blockreltimestamp.extend(list(reltime))
df["BLOCKTIMESTAMP"] = blockreltimestamp
df.set_index(["BLOCK","BLOCKTIMESTAMP"], drop=False, inplace=True)
But I am wondering whether there is a cleaner, more efficient, or more pandas-like way to do this transformation.
Thanks!
Answer 0 (score: 0)
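The cleaner solution ended up working on a DataFrame without a MultiIndex, where BLOCK is still a column holding the block ID and EXPTIMESTAMP is a regular column, which is what I ultimately wanted anyway. From there, I used pandas' groupby functionality: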
initialvalmatrix = df.groupby("BLOCK")[["EXPTIMESTAMP"]].min()
This creates a DataFrame indexed by "BLOCK" with a single "EXPTIMESTAMP" column containing the minimum "EXPTIMESTAMP" value for each block.
For clarity, I renamed the "EXPTIMESTAMP" column to "INITIALBYBLOCK":
initialvalmatrix.columns = ["INITIALBYBLOCK"]
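As a small, self-contained sketch of the two steps above: the toy frame below is an assumption made for illustration, built from the sample rows shown in the question, and it is not the real data.

```python
import pandas as pd

# Toy data mirroring the sample rows in the question (assumed for illustration).
df = pd.DataFrame({
    "BLOCK": ["Block1", "Block1", "Block1", "Block2", "Block2"],
    "EXPTIMESTAMP": [0.0, 8.215, 17.003, 55.821, 63.521],
    "DATA1": [0.423, 0.462, 0.472, 0.243, 0.237],
    "DATA2": [0.926, 0.919, 0.904, 0.720, 0.794],
})

# Smallest EXPTIMESTAMP per block, kept as a one-column DataFrame indexed by BLOCK.
initialvalmatrix = df.groupby("BLOCK")[["EXPTIMESTAMP"]].min()

# Rename the single column so the lookup table reads clearly.
initialvalmatrix.columns = ["INITIALBYBLOCK"]

print(initialvalmatrix)
```

The result has one row per block, so looking up a block's starting time is a plain label lookup on the "INITIALBYBLOCK" column.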
Then I used pandas' apply with axis=1 to run a function on each row (that is, across the columns) and compute the "BLOCKTIMESTAMP" column:
df["BLOCKTIMESTAMP"] = df.apply(apply_zero_timestamp, axis=1, tslookup=initialvalmatrix)
# Keyword arguments that apply itself does not consume are passed on to the function being applied.
...where the "apply_zero_timestamp" function is defined as:
def apply_zero_timestamp(series, tslookup):
    # series is one row of df; tslookup maps each BLOCK to its first timestamp
    zeroval = series["EXPTIMESTAMP"] - tslookup["INITIALBYBLOCK"][series["BLOCK"]]
    return zeroval
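To see the row-wise lookup end to end, here is a minimal sketch on toy data taken from the question's sample rows (the frame is an assumption for illustration). It also computes the same values with `groupby(...).transform("min")`. That is not what this answer used, but it is a common vectorized alternative that avoids one Python function call per row.

```python
import pandas as pd

# Toy data mirroring the sample rows in the question (assumed for illustration).
df = pd.DataFrame({
    "BLOCK": ["Block1", "Block1", "Block1", "Block2", "Block2"],
    "EXPTIMESTAMP": [0.0, 8.215, 17.003, 55.821, 63.521],
})

# Lookup table: first (minimum) timestamp of each block.
initialvalmatrix = df.groupby("BLOCK")[["EXPTIMESTAMP"]].min()
initialvalmatrix.columns = ["INITIALBYBLOCK"]

def apply_zero_timestamp(series, tslookup):
    # series is one row of df; subtract that row's block start time.
    return series["EXPTIMESTAMP"] - tslookup["INITIALBYBLOCK"][series["BLOCK"]]

# Row-wise version, as in the answer: tslookup is forwarded to the function.
df["BLOCKTIMESTAMP"] = df.apply(apply_zero_timestamp, axis=1, tslookup=initialvalmatrix)

# Vectorized alternative (not from the answer): broadcast each block's minimum
# back onto its rows, then subtract whole columns at once.
block_start = df.groupby("BLOCK")["EXPTIMESTAMP"].transform("min")
vectorized = df["EXPTIMESTAMP"] - block_start

print(df)
```

Both routes give a timestamp that restarts at zero for each block. On large frames the `transform` route is usually the faster of the two, since the subtraction runs on whole columns.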
Finally, I just set the index the way I wanted it:
df.set_index(["BLOCK","BLOCKTIMESTAMP"], drop=False, inplace=True)
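A short sketch of what the resulting index allows, again on toy data assumed from the question's sample rows. For brevity the block-relative time is computed here with `groupby(...).transform("min")`, which is an assumption of this sketch and not the answer's method.

```python
import pandas as pd

# Toy data mirroring the sample rows in the question (assumed for illustration).
df = pd.DataFrame({
    "BLOCK": ["Block1", "Block1", "Block1", "Block2", "Block2"],
    "EXPTIMESTAMP": [0.0, 8.215, 17.003, 55.821, 63.521],
})

# Block-relative time, computed with groupby/transform for brevity
# (an assumption of this sketch; the answer computes it with apply).
df["BLOCKTIMESTAMP"] = df["EXPTIMESTAMP"] - df.groupby("BLOCK")["EXPTIMESTAMP"].transform("min")

# drop=False keeps BLOCK and BLOCKTIMESTAMP as ordinary columns
# in addition to using them as the two index levels.
df.set_index(["BLOCK", "BLOCKTIMESTAMP"], drop=False, inplace=True)

# Selecting on the first index level returns one block,
# indexed by its block-relative timestamp.
block2 = df.loc["Block2"]
print(block2[["EXPTIMESTAMP"]])
```

Because of `drop=False`, EXPTIMESTAMP and the two key columns remain available as regular columns after the index is set.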
Hope it helps!