Question

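I have a df with many columns, each holding the market capitalization of a company that is part of an index. The index of the df is dates.

Every 63 days/rows, I want to eliminate all values except the 500 largest for the following 63 days/rows.

In other words: for the first 63 days/rows, the only values shown should be those of the companies whose market cap was among the 500 largest in the first row.

For example: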

[in]: pd.DataFrame(np.array([[1, 1, 0.5], [5 ,2, 10], [1,3, 10],[4,2, 10]]), 
                       columns=['a', 'b','c'])

[out]:     a   b   c    
      0   1.0 1.0 0.5
      1   5.0 2.0 10.0    
      2   1.0 3.0 10.0    
      3   4.0 2.0 10.0

Suppose that in this example I use blocks of 2 days/rows (and keep the 2 largest values per block). The desired output is:

    a   b   c
 0 1.0 1.0 NaN
 1 5.0 2.0 NaN
 2 NaN 3.0 10.0
 3 NaN 2.0 10.0

This is the code I am using now. It works, but it takes forever.

for x in range(0,len(dfcap)/63 - 1):
    lst = list()
    for value in dfcap.iloc[x*63].nlargest(500):
         lst.append((dfcap == value).idxmax(axis=1)[x*63])
    for column in dfcap.columns:
         for n in range(x*63,x*63 + 63):
             if column not in lst: dfcap[column][n] = 0

Answer 1

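If I understand your question correctly, this should be much faster for you. Here is my %%timeit output for 630k rows x 1000 columns, run in a VM on an Intel i5.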

%%timeit -n 2 -r 2
19.3 s ± 549 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

import pandas as pd
import numpy as np
import random, string

def randomticker(length):
    """Generate a random uppercase string of the given length."""
    letters = string.ascii_uppercase
    return ''.join(random.choice(letters) for i in range(length))

# generate random data, 630k rows (dates) by 1000 columns (companies)
data = np.random.rand(63 * 10000,1000)
# generate 1000 random uppercase strings (stock tickers)
companies = [randomticker(4) for x in range(1000)]

df = pd.DataFrame(data, columns=companies)

# Number of columns to make NA, in your case (width of DF - 500)
mask_na_count = len(df.columns) - 500 
# If your index is not sorted 0-n integers use this line
# df = df.reset_index(drop=True)  

for start in range(0, len(df), 63):
    # Names of the (width - 500) smallest-valued columns in the block's first row
    na_cols = df.iloc[start].nsmallest(mask_na_count).index
    # Set those columns to NaN for the block's 63 rows; .loc label slices
    # include both endpoints, so the block ends at start + 62
    df.loc[start:start + 62, na_cols] = np.nan
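
If the remaining Python loop over blocks is still a bottleneck, a loop-free variant might be worth trying. The sketch below is a different technique from the loop above: it ranks only the first row of each block, repeats that boolean mask over the block's rows, and applies it with `DataFrame.where`. The function name `mask_outside_top_n` and its parameters are made up for illustration, and I have not timed it against the loop.

```python
import numpy as np
import pandas as pd


def mask_outside_top_n(df, block_size, n_keep):
    """Hypothetical helper: NaN out every column that is not among the
    n_keep largest in the first row of its block of block_size rows."""
    first_rows = df.iloc[::block_size]
    # Rank 1 = largest value in the row; method="first" breaks ties by column order
    ranks = first_rows.rank(axis=1, ascending=False, method="first")
    keep_first = (ranks <= n_keep).to_numpy()
    # Repeat each block's mask for every row of the block, trimming a short last block
    keep = np.repeat(keep_first, block_size, axis=0)[: len(df)]
    return df.where(keep)


# The small example from the question, with blocks of 2 rows and 2 kept columns
example = pd.DataFrame(
    np.array([[1, 1, 0.5], [5, 2, 10], [1, 3, 10], [4, 2, 10]]),
    columns=["a", "b", "c"],
)
masked = mask_outside_top_n(example, block_size=2, n_keep=2)
print(masked)
```

On the question's example this keeps `a` and `b` for the first two rows and `b` and `c` for the last two. For the real data you would presumably call it with `block_size=63` and `n_keep=500`. Note that it builds a boolean array as large as the whole frame, so memory use is higher than with the loop.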

If you need the dates as the index again, save the index before resetting it and reassign it afterwards: save_index = df.index before the reset, then df.index = save_index.
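
Another way to keep the date index, if that is easier for you, is to index by position with `.iloc` instead of by label, so the index never has to be reset. Here is a minimal sketch on the small example from the question; the helper name `keep_top_n_per_block` and the dates are made up for illustration.

```python
import numpy as np
import pandas as pd


def keep_top_n_per_block(df, block_size, n_keep):
    """Hypothetical helper: within each block of block_size rows, keep only
    the n_keep columns that are largest in the block's first row."""
    out = df.astype(float)  # works on a copy; float dtype so NaN can be stored
    for start in range(0, len(out), block_size):
        # Column names of the n_keep largest values in the block's first row
        keep = out.iloc[start].nlargest(n_keep).index
        # Integer positions of all other columns
        drop_pos = np.flatnonzero(~out.columns.isin(keep))
        # Positional slice: end is exclusive, and the index labels are irrelevant
        out.iloc[start:start + block_size, drop_pos] = np.nan
    return out


toy = pd.DataFrame(
    np.array([[1, 1, 0.5], [5, 2, 10], [1, 3, 10], [4, 2, 10]]),
    columns=["a", "b", "c"],
    index=pd.date_range("2020-01-01", periods=4),  # made-up dates
)
result = keep_top_n_per_block(toy, block_size=2, n_keep=2)
print(result)
```

With `block_size=2` and `n_keep=2` this gives the NaN pattern from the question's desired output, and `result` still carries the date index.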
