Question

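I am struggling with a data-cleaning operation. I have a large DataFrame consisting of an id, portfolio months (port_months), and a portfolio number (port), for example: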

                id          port      port_months backtest_month
49025        USA0EBZ0         0            1             1
80689        USA0EBZ0         0            2             2
224952       USA0EBZ0         0            3             4
  ...           ...          ...          ... 
227370       USA03BE0         1            1             12
229804       USA03BE0         1            2             13
232262       USA03BE0         1            3             14
  ...           ...          ...          ...

Unfortunately, I often run into the situation where a new id enters the system partway through a portfolio, for example:

                id          port      port_months backtest_month
63682        USA06W90         5            7           66
236452       USA06W90         5            8           67
238905       USA06W90         5            9           68
241358       USA06W90         5           10           69
243808       USA06W90         5           11           70
246229       USA06W90         5           12           71

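The problem here is that the data for this id enters the DataFrame at port_months = 7 rather than at port_months = 1. I need to remove all such incomplete data, because another function needs to operate on a dataset that contains only complete data. So, in this example, I need to remove the data for this id, USA06W90, for port = 5 (although you cannot see it here, there is complete data for port = 6 and so on).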
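As a concrete illustration of the rule described above, the following minimal sketch flags every (id, port) pair that has no row with port_months == 1. The sample rows are made up for illustration, and the names has_first_month and incomplete are hypothetical; only the column names come from the tables above.

```python
import pandas as pd

# Made-up sample rows; column names follow the tables above.
df = pd.DataFrame({
    "id": ["USA0EBZ0"] * 3 + ["USA06W90"] * 3,
    "port": [0, 0, 0, 5, 5, 5],
    "port_months": [1, 2, 3, 7, 8, 9],
    "backtest_month": [1, 2, 3, 66, 67, 68],
})

# For each (id, port) pair, record whether any row has port_months == 1.
has_first_month = (
    df["port_months"].eq(1).groupby([df["id"], df["port"]]).any()
)

# Pairs without a first month are the incomplete ones (hypothetical name).
incomplete = has_first_month[~has_first_month].index.tolist()
print(incomplete)
```

Here only the pair (USA06W90, 5) is flagged, since its data starts at port_months = 7.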

I have written a simple loop that does what I want, but it is very slow, and I am sure I could do something more sophisticated using vectorization:

for id in df.id:
    for port in df.port.unique(): #so loop over ports where the current stock has some data, not those for which it is absent from the system
        first_df = df[(df.id == id) & (df.port == port) & (df.port_months == 1)] #get the 1st row from the current port's dataframe
        if first_df.empty:
            df.drop(df[(df.id == id) & (df.port == port)].index, inplace = True) # drop all the rows associated with current id and port (i.e. all port_months for current port and id)

It currently takes more than 30 minutes to run!

I have been trying to find a clever way to use something like

groupby('id', port).apply(lambda x: x.port = x[x.port_months == 1].port)

or something along those lines, or trying to use some trick to build a new portfolio column and then ffill it:

port_new = df[df.port_months == 1].groupby('id', as_index = False).apply(lambda x: x.backtest_month / 12 )

resetting the index and then recombining with df by merging on the index, which gives:

                id          port      port_months backtest_month
49025        USA0EBZ0         0            1             1
80689        USA0EBZ0         NaN          2             2
224952       USA0EBZ0         NaN          3             4
  ...           ...          ...          ... 
227370       USA03BE0         1            1             12
229804       USA03BE0         NaN          2             13
232262       USA03BE0         NaN          3             14
  ...           ...          ...          ...

The NaNs can then be filled using

df['port_new'] = df['port_new'].ffill()

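This almost works, and it is lightning fast, but the problem is that there are cases where an id enters the dataset and then leaves it again, so ffill fills all of those rows as well instead of the rows being removed; e.g. the NaNs below would be filled with 5s: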

                id          port      port_months backtest_month
63682        USA06W90         5            11           70
236452       USA06W90         5            12           71
238905       USA06W90       NaN             1           121
241358       USA06W90       NaN             2           122
243808       USA06W90       NaN             3           123
246229       USA06W90       NaN             4           124

Answer 1

To identify unique portfolios, it looks like you need to create a key made up of id and port. You can then use .loc to filter efficiently, as follows:

import pandas as pd

df = pd.DataFrame({'id': ['USA06W90', 'USA06W90', 'USA06W90', 'USA06W90', 'USA06W90'],
                   'port': [5, 5, 1, 1, 1],
                   'port_months': [11, 12, 1, 2, 3],
                   'backtest_month': [70, 71, 121, 122, 123]},
                  index=[63682, 236452, 238905, 241358, 243808])

# Create a unique portfolio identifier.
df['key'] = df['id'] + '_' + df.port.astype(str)

>>> df
              id  port  port_months  backtest_month         key
63682   USA06W90     5           11              70  USA06W90_5
236452  USA06W90     5           12              71  USA06W90_5
238905  USA06W90     1            1             121  USA06W90_1
241358  USA06W90     1            2             122  USA06W90_1
243808  USA06W90     1            3             123  USA06W90_1

# Use .loc to locate all unique portfolios that had a `port_months` value of one.
portfolios_first_month = df.loc[df.port_months == 1, 'key'].unique()
>>> portfolios_first_month
array(['USA06W90_1'], dtype=object)

# Use .loc again to locate all portfolio keys that were previously identified above.  
# The colon indicates that all columns should be returned.
df_filtered = df.loc[df.key.isin(portfolios_first_month), :]

>>> df_filtered
              id  port  port_months  backtest_month         key
238905  USA06W90     1            1             121  USA06W90_1
241358  USA06W90     1            2             122  USA06W90_1
243808  USA06W90     1            3             123  USA06W90_1

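The first call, df.loc[df.port_months == 1, 'key'].unique(), produces an array of all the unique keys for which port_months takes the value 1 (i.e. the portfolios with no missing data).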

df.loc[df.key.isin(portfolios_first_month), :] then finds all rows with those key values and returns all of the DataFrame's columns.
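The same filter can also be written without building a string key, by grouping on id and port directly. The following is a minimal sketch of that alternative, not the method used above. It assumes port_months never goes below 1, so a group contains month 1 exactly when its minimum is 1. The names first_month and df_complete are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": ["USA06W90"] * 5,
        "port": [5, 5, 1, 1, 1],
        "port_months": [11, 12, 1, 2, 3],
        "backtest_month": [70, 71, 121, 122, 123],
    },
    index=[63682, 236452, 238905, 241358, 243808],
)

# Smallest port_months within each (id, port) group, broadcast back to every row.
# Assumption: port_months >= 1, so min == 1 means month 1 is present.
first_month = df.groupby(["id", "port"])["port_months"].transform("min")

# Keep only the rows whose (id, port) group starts at month 1.
df_complete = df[first_month == 1]
print(df_complete)
```

Because transform returns a Series aligned with the original index, the boolean mask can be applied to df directly and the original row labels are preserved.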

Replacing a Python for loop with a vectorized approach to drop missing data
