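I have found a behaviour in pandas that I cannot explain to myself.
I am working with a database of audio features that has N + 2 columns: an ID, a time t, and N audio features associated with time t. For various reasons, I want each row to also contain the features of the next T time steps. (Yes, the same data will be repeated up to T times.) So I wrote a function that creates additional feature columns containing the data of the successive time steps. I have implemented it in three ways, as you can see in the attached code, and one of them does not work. This surprises me, because it would work if the underlying data structure were a numpy array. Can anyone explain why?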
def create_datapoints_for_dnn(df, T):
    """
    Here we take the data frame with chroma features at time t and create all features at times t+1, t+2, ..., t+T-1.
    :param df: initial data frame of chroma features
    :param T: number of time steps to keep
    :return: expanded data frame of chroma features
    """
    res = df.copy()
    original_labels = df.columns.values
    n_steps = df.shape[0]  # the number of time steps in this song
    nans = np.full(n_steps, np.nan)  # a column of nans of the correct length
    for n in range(1, T):
        new_labels = [ol + '+' + str(n) for ol in original_labels[2:]]
        for nl, ol in zip(new_labels, original_labels[2:]):
            # df.assign would use the name "nl" instead of what nl contains, so we build and unpack a dictionary
            res = res.assign(**{nl: nans})  # create a new column
            # CORRECT BUT EXTREMELY SLOW
            # for i in range(n_steps - (T - 1)):
            #     res.iloc[i, res.columns.get_loc(nl)] = df.iloc[n+i, df.columns.get_loc(ol)]
            # CORRECT AND FAST
            res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[:, df.columns.get_loc(ol)].shift(-n)
            # NOT WORKING
            # res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)]
    return res[: - (T - 1)]  # drop the last T-1 rows because time t+T-1 is not defined for them
Sample data (put it in a csv):
songID,time,A_t,A#_t
CrossEra-0850,0.0,0.0,0.0
CrossEra-0850,0.1,0.0,0.0
CrossEra-0850,0.2,0.0,0.0
CrossEra-0850,0.3,0.31621,0.760299
CrossEra-0850,0.4,0.0,0.00107539
CrossEra-0850,0.5,0.0,0.142832
CrossEra-0850,0.6,0.8506459999999999,0.12481600000000001
CrossEra-0850,0.7,0.0,0.21206399999999997
CrossEra-0850,0.8,0.0796207,0.28227399999999997
CrossEra-0850,0.9,2.55144,0.169434
CrossEra-0850,1.0,3.4581699999999995,0.08014550000000001
CrossEra-0850,1.1,3.1061400000000003,0.030419599999999998
Code to run it:
import pandas as pd
import numpy as np
T = 4 # how many successive steps we want to put in a single row
df = pd.read_csv('path_to_csv')
res = create_datapoints_for_dnn(df, T)
res.to_csv('path_to_output', index=False)
Answer 0 (score: 0)
Use pd.DataFrame.shift and pd.concat.
f-strings require Python 3.6 or later. Otherwise use '+{}'.format(i).
cols = ['songID', 'time']
d = df.drop(columns=cols)  # keep only the feature columns
df[cols].join(
    pd.concat(
        [d.shift(-i).add_suffix(f'+{i}') for i in range(4)],
        axis=1
    )
)
           songID  time     A_t+0    A#_t+0     A_t+1    A#_t+1     A_t+2    A#_t+2     A_t+3    A#_t+3
 0  CrossEra-0850   0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299
 1  CrossEra-0850   0.1  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075
 2  CrossEra-0850   0.2  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832
 3  CrossEra-0850   0.3  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816
 4  CrossEra-0850   0.4  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064
 5  CrossEra-0850   0.5  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274
 6  CrossEra-0850   0.6  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434
 7  CrossEra-0850   0.7  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146
 8  CrossEra-0850   0.8  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420
 9  CrossEra-0850   0.9  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420       NaN       NaN
10  CrossEra-0850   1.0  3.458170  0.080146  3.106140  0.030420       NaN       NaN       NaN       NaN
11  CrossEra-0850   1.1  3.106140  0.030420       NaN       NaN       NaN       NaN       NaN       NaN
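As for why the variant marked "NOT WORKING" in the question misbehaves, a likely cause (an assumption on my part, not something verified against the asker's pandas version) is index alignment. The slice df.iloc[n:, ...] is a Series that keeps its labels n, n+1, ..., and pandas has historically aligned a Series by label when it is assigned through an indexer, whereas a numpy array carries no labels and is assigned purely by position. Newer pandas releases may treat .iloc assignment differently, so the sketch below does not rely on that behaviour: it uses reindex to show what label alignment does to the right-hand side, and to_numpy() to show the positional assignment. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one feature column; the values are made up for illustration.
df = pd.DataFrame({"A_t": [10.0, 11.0, 12.0, 13.0, 14.0]})
n = 2

rhs = df["A_t"].iloc[n:]        # Series that keeps its labels 2, 3, 4
target_index = df.index[:-n]    # labels 0, 1, 2 of the rows being assigned

# What label alignment does to the right-hand side: only label 2 is shared,
# so the other target rows receive NaN instead of the shifted values.
aligned = rhs.reindex(target_index)

# Positional assignment: drop the index first, then assign.
res = df.copy()
res["A_t+2"] = np.nan
res.iloc[:-n, res.columns.get_loc("A_t+2")] = rhs.to_numpy()

print(aligned.tolist())         # [nan, nan, 12.0]
print(res["A_t+2"].tolist())    # [12.0, 13.0, 14.0, nan, nan]
```

If this is indeed the cause, appending .to_numpy() (or .values on older pandas) to the right-hand side of the failing line should make it behave like the numpy version. The shift-based variants avoid the problem because shift moves the values while leaving the labels in place.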