Question

数据框子集功能正在跨数据框行的for循环中使用。结果似乎是准确的，但是完成2000个奇数行上的循环所花费的时间超过4分钟。有关代码质量的任何建议或指导？

Datasets:

DF1 input   customer_id 31-12-2019 00:00    31-12-2018 00:00    31-12-2017 00:00    31-12-2016 00:00    31-12-2015 00:00    31-12-2014 00:00    31-12-2013 00:00    31-12-2012 00:00    31-12-2011 00:00    31-12-2010 00:00
    70464016                                        
    70453975                                        
    79983381                                        
    76615995                                        
    73543785                                        
    78226476                                        
    70117143                                        
    76448285                                        
    73980212                                        
    74540790    

File input
upload_date customer_id date    rating  rating_agency
05-03-2019  70464016    31-Dec-18   3   INTERNAL
05-03-2019  70453975    31-Dec-18   4+  INTERNAL
05-03-2019  79983381    31-Dec-18   3   INTERNAL
05-03-2019  76615995    31-Dec-18   4   INTERNAL
05-03-2019  73543785    31-Dec-18   4   INTERNAL
05-03-2019  78226476    31-Dec-18   4   INTERNAL
05-03-2019  70117143    31-Dec-18   4-  INTERNAL
05-03-2019  76448285    31-Dec-18   4-  INTERNAL
05-03-2019  73980212    31-Dec-18   5   INTERNAL
05-03-2019  74540790    31-Dec-18   5   INTERNAL
05-03-2019  76241783    31-Dec-18   5   INTERNAL
05-03-2019  76323368    31-Dec-18   5+  INTERNAL
05-03-2019  70732832    31-Dec-18   5   INTERNAL
05-03-2019  70453263    31-Dec-18   4-  INTERNAL
05-03-2019  73807515    31-Dec-18   5   INTERNAL
05-03-2019  71584306    31-Dec-18   5+  INTERNAL
05-03-2019  71017190    31-Dec-18   5   INTERNAL
05-03-2019  79142410    31-Dec-18   5   INTERNAL
05-03-2019  70455229    31-Dec-18   5   INTERNAL

代码如下：

for j in df1.itertuples(index=True, name='Pandas'):
    for i in range(1,len(df1.columns)):
        #for j in range(len(df1)):
            flag = file[(file['customer_id'] == j.customer_id) & (file['year'] == df1.columns[i].year)]
            flag = flag[(flag['date']== flag['date'].max())]

            if len(flag) != 0:
                df1.iat[j.Index,i] = flag.rating.iloc[0]
            else:
                pass

Answer 1

我知道您有一些代码可以从其他地方获取标志，并且您想查看数据框中每个值的标志。我建议编写一个从DataFrame的值返回flag的函数，然后使用df.applymap将函数应用于DataFrame的每个值。

df.applymap返回一个DataFrame，它应该快得多。通常，循环DataFrame效率不是很高，通常是可以避免的。

def get_flags(val):
    flag = # Your code for the value of the flag here
    return flag

flags = df.applymap(get_flags)

如果每行或每列只有一个标志，请改用df.apply。 More details in the docs。

在数据框上循环需要很多时间

1 个答案: