Question

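I've run into a behavior in pandas that I can't explain to myself.

I'm working with a database of audio features that has N + 2 columns: an ID, a time t, and N audio features associated with time t. For various reasons, I want each row to also contain the features of the next T time steps. (Yes, the same data will be repeated up to T times.) So I wrote a function that creates additional feature columns holding the data of the successive time steps. I implemented it in three ways, as you can see in the code below, and one of them does not work. That surprises me, because it would work if the underlying data structure were a numpy array. Can anyone explain why?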

def create_datapoints_for_dnn(df, T):
    """
    Here we take the data frame with chroma features at time t and create all features at times t+1, t+2, ..., t+T-1.

    :param df: initial data frame of chroma features
    :param T: number of time steps to keep
    :return: expanded data frame of chroma features
    """
    res = df.copy()
    original_labels = df.columns.values
    n_steps = df.shape[0]  # the number of time steps in this song
    nans = np.full(n_steps, np.nan)  # a column of NaNs of the correct length (np.NaN was removed in NumPy 2.0)
    for n in range(1, T):
        new_labels = [ol + '+' + str(n) for ol in original_labels[2:]]
        for nl, ol in zip(new_labels, original_labels[2:]):
            # df.assign would use the name "nl" instead of what nl contains, so we build and unpack a dictionary
            res = res.assign(**{nl: nans})  # create a new column

            # CORRECT BUT EXTREMELY SLOW
            # for i in range(n_steps - (T - 1)):
            #     res.iloc[i, res.columns.get_loc(nl)] = df.iloc[n+i, df.columns.get_loc(ol)]

            # CORRECT AND FAST
            res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[:, df.columns.get_loc(ol)].shift(-n)

            # NOT WORKING
            # res.iloc[:-n, res.columns.get_loc(nl)] = df.iloc[n:, df.columns.get_loc(ol)]

    return res[: - (T - 1)]  # drop the last T-1 rows because time t+T-1 is not defined for them

Example data (put it in a CSV file):

songID,time,A_t,A#_t
CrossEra-0850,0.0,0.0,0.0
CrossEra-0850,0.1,0.0,0.0
CrossEra-0850,0.2,0.0,0.0
CrossEra-0850,0.3,0.31621,0.760299
CrossEra-0850,0.4,0.0,0.00107539
CrossEra-0850,0.5,0.0,0.142832
CrossEra-0850,0.6,0.8506459999999999,0.12481600000000001
CrossEra-0850,0.7,0.0,0.21206399999999997
CrossEra-0850,0.8,0.0796207,0.28227399999999997
CrossEra-0850,0.9,2.55144,0.169434
CrossEra-0850,1.0,3.4581699999999995,0.08014550000000001
CrossEra-0850,1.1,3.1061400000000003,0.030419599999999998

Code to run it:

import pandas as pd
import numpy as np

T = 4  # how many successive steps we want to put in a single row
df = pd.read_csv('path_to_csv')
res = create_datapoints_for_dnn(df, T)
res.to_csv('path_to_output', index=False)

Result:

Answer 1

Use pd.DataFrame.shift and pd.concat.
The f-string requires Python 3.6 or later. On older versions, use '+{}'.format(i) instead.

cols = ['songID', 'time']
d = df.drop(columns=cols)  # keyword form; the positional axis argument was removed in pandas 2.0
df[cols].join(
    pd.concat(
        [d.shift(-i).add_suffix(f'+{i}') for i in range(4)],
        axis=1
    )
)

           songID  time     A_t+0    A#_t+0     A_t+1    A#_t+1     A_t+2    A#_t+2     A_t+3    A#_t+3
0   CrossEra-0850   0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299
1   CrossEra-0850   0.1  0.000000  0.000000  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075
2   CrossEra-0850   0.2  0.000000  0.000000  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832
3   CrossEra-0850   0.3  0.316210  0.760299  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816
4   CrossEra-0850   0.4  0.000000  0.001075  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064
5   CrossEra-0850   0.5  0.000000  0.142832  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274
6   CrossEra-0850   0.6  0.850646  0.124816  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434
7   CrossEra-0850   0.7  0.000000  0.212064  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146
8   CrossEra-0850   0.8  0.079621  0.282274  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420
9   CrossEra-0850   0.9  2.551440  0.169434  3.458170  0.080146  3.106140  0.030420       NaN       NaN
10  CrossEra-0850   1.0  3.458170  0.080146  3.106140  0.030420       NaN       NaN       NaN       NaN
11  CrossEra-0850   1.1  3.106140  0.030420       NaN       NaN       NaN       NaN       NaN       NaN
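
As for why the "NOT WORKING" line in the question fails: the most likely cause is index alignment. When a pandas Series is assigned into a DataFrame, pandas matches values by index label rather than by position, so the value labelled n lands back in row n instead of row 0. A numpy array carries no labels and is written purely by position. The sketch below illustrates this with plain column assignment, where alignment is the documented behavior. The frame and column names are made up for illustration, and whether `.iloc` assignment itself aligns may depend on the pandas version, so treat this as an explanation of the mechanism rather than a statement about every release.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for one feature column of the question.
df = pd.DataFrame({"A_t": [10.0, 11.0, 12.0, 13.0]})
n = 1

# The slice keeps its original index labels: 1, 2, 3.
tail = df["A_t"].iloc[n:]

# Assigning a Series aligns on those labels, so nothing is shifted:
# row 0 has no matching label and becomes NaN.
aligned = df.copy()
aligned["A_t+1"] = tail

# Assigning the raw numpy values is positional, which is the intended shift.
positional = df.copy()
positional["A_t+1"] = np.nan
col = positional.columns.get_loc("A_t+1")
positional.iloc[:-n, col] = tail.to_numpy()

print(aligned)     # A_t+1 is NaN, 11, 12, 13
print(positional)  # A_t+1 is 11, 12, 13, NaN
```

Under that reading, the question's failing line should behave as intended if the right-hand side is converted with `.to_numpy()` (or `.values`) first, while `shift(-n)` works because it moves the values and leaves the labels in place.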

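The shift-and-concat idea can also be wrapped in a function that takes T as a parameter, keeps the original column names, and drops the trailing rows that lack a full look-ahead window, as the function in the question does. This is only a sketch: the name `add_future_steps` and the `id_cols` parameter are hypothetical, and it assumes the frame holds a single song sorted by time, since a plain shift would otherwise pull values across song boundaries.

```python
import pandas as pd


def add_future_steps(df, T, id_cols=("songID", "time")):
    """Append feature columns for t+1 ... t+T-1 and drop incomplete rows.

    Hypothetical helper: assumes df covers one song, sorted by time.
    """
    id_cols = list(id_cols)
    features = df.drop(columns=id_cols)

    # One shifted copy of the feature columns per future step, suffixed "+i".
    # str.format is used so this also runs on Python versions before 3.6.
    shifted = [
        features.shift(-i).add_suffix("+{}".format(i)) for i in range(1, T)
    ]
    out = pd.concat([df] + shifted, axis=1)

    # The last T-1 rows have no complete window. Computing the stop index
    # explicitly avoids the empty result that res[:-(T - 1)] gives for T == 1.
    n_keep = max(len(df) - (T - 1), 0)
    return out.iloc[:n_keep]


# Usage on a tiny made-up frame:
demo = pd.DataFrame({
    "songID": ["s"] * 5,
    "time": [0.0, 0.1, 0.2, 0.3, 0.4],
    "A_t": [1.0, 2.0, 3.0, 4.0, 5.0],
})
out = add_future_steps(demo, T=3)
print(out)
```

For data with several songs, the same function could be applied per song (for example via a groupby on the ID column), though that variant is not shown here.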