Question

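I am creating new columns based on some data from another dataframe and some data from the dataframe I am extending.

I have a working solution, but I would like to know whether a vectorized approach exists, because the current one uses the pandas apply() method (which iterates over the rows) and takes a long time.

The function that performs the transformation: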

import pandas as pd


def add_new_columns(row, **kwds):
    participant = row['participant']
    time = row['time']

    ### NOTE ###
    # There is some other code here which handles cases where other
    # dataframe does not contain information, in that case we impute

    return pd.Series(kwds['other_df'].loc[participant, time])

And the statement that calls the function above:

main_df = pd.merge(
    main_df,
    main_df.apply(
        add_new_columns,
        axis=1,
        other_df=other_df
    ),
    left_index=True,
    right_index=True
)

A basic example of datasets that the code above works with:

main_df = pd.DataFrame(
    [
        ['001', 'P1', 3, 'jumped'],
        ['002', 'P3', 8, 'yawned'],
        ['004', 'P2', 7, 'made something up']
    ],
    columns=['id', 'participant', 'time', 'action']
).set_index('id')

other_df = pd.DataFrame(
    [
        ['P1', 3, 2, 9, 8],
        ['P3', 8, 5, 6, 3],
        ['P2', 7, 9, 8, 5]
    ],
    columns=['participant', 'time', 'sugar-levels', 'some-other-measure', 'some-other-measure2']
).set_index(['participant', 'time'])

My data is 800,000 rows long, so I would like to avoid iteration if possible. Is there another pandas method that could help?

Answer 1

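You are essentially trying to join other_df and main_df on participant and time. Without knowing the imputation logic it is hard to give a complete answer, but you can start by merging in other_df like this: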

merged_df = pd.merge(main_df, other_df, how='left', on=['participant','time'])

Then fill in the missing values in the sugar-levels column of merged_df using the imputation method of your choice.
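
As a rough sketch of the join-then-impute idea on the sample data from the question: the extra row '005' is hypothetical, added only so that one row has no match in other_df, and the mean-based fill is a placeholder assumption, since the asker's actual imputation logic is not shown. DataFrame.join is used here as one option that keeps main_df's id index.

```python
import pandas as pd

main_df = pd.DataFrame(
    [
        ['001', 'P1', 3, 'jumped'],
        ['002', 'P3', 8, 'yawned'],
        ['004', 'P2', 7, 'made something up'],
        ['005', 'P4', 1, 'no match'],  # hypothetical row with no entry in other_df
    ],
    columns=['id', 'participant', 'time', 'action']
).set_index('id')

other_df = pd.DataFrame(
    [
        ['P1', 3, 2, 9, 8],
        ['P3', 8, 5, 6, 3],
        ['P2', 7, 9, 8, 5]
    ],
    columns=['participant', 'time', 'sugar-levels', 'some-other-measure', 'some-other-measure2']
).set_index(['participant', 'time'])

# Left join: main_df's participant/time columns are matched against
# other_df's (participant, time) MultiIndex; main_df's 'id' index is kept.
merged_df = main_df.join(other_df, on=['participant', 'time'], how='left')

# Placeholder imputation (an assumption, not the asker's logic):
# fill each missing measure with the mean of that column.
measure_cols = list(other_df.columns)
merged_df[measure_cols] = merged_df[measure_cols].fillna(merged_df[measure_cols].mean())

print(merged_df)
```

Both steps operate on whole columns, so no Python-level function is called per row. Any other column-wise rule (a constant, a per-participant mean via groupby, and so on) could replace the fillna line.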
