Question

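I have a large DataFrame with 25-30 columns and several thousand rows. I need to analyze the trends of some columns, as well as the trends of ratios between some columns.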

Right now I see 3 options:

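1) Iterate row by row with a simple for i in range(len(df)) and build a series of if/else conditions that compare each value with the previous and next values, for different columns: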

for i in range(len(df)):
    row = df.iloc[i]
    if row['col1'] < row['col2']:
        print('error type 1')
    if row['col2'] < row['col3']:
        print('error type 2')
    # Neighbour comparisons need a guard: i+1 overflows on the last row
    # and i-1 silently wraps around to the last row when i == 0
    if i + 1 < len(df):
        nxt = df.iloc[i + 1]
        if nxt['col1'] > 2 * row['col1']:
            print('error trend 1')
        if nxt['col2'] > 2 * row['col2']:
            print('error trend 2')
    if i > 0 and df.iloc[i - 1]['col2'] > 2 * row['col2']:
        print('error trend 2')
    # and so on, with around 40-50 if statements per row

2) Iterate with iterrows or itertuples, but I am not sure I can easily access the previous and next rows.
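For what it is worth, here is a minimal sketch of how option 2 could reach the neighbouring rows: pad the list of tuples with None on both ends and zip three offsets of it together. The helper name iter_with_neighbours and the tiny two-column frame are only illustrative, not part of my real code.

```python
import pandas as pd

def iter_with_neighbours(df):
    """Yield (prev, cur, nxt) namedtuples; prev/nxt are None at the edges."""
    rows = list(df.itertuples(index=True))
    padded = [None] + rows + [None]
    return zip(padded, padded[1:], padded[2:])

# Illustrative data: the first four rows of col1/col2 from the sample table
df = pd.DataFrame({"col1": [732, 754, 3964, 4286],
                   "col2": [58, 60, 365, 417]})

flagged = []
for prev, cur, nxt in iter_with_neighbours(df):
    # Same test as 'error trend 1': the next value is more than double
    if nxt is not None and nxt.col1 > 2 * cur.col1:
        flagged.append(cur.Index)

print(flagged)
```

This still runs a Python-level loop over every row, so it mainly removes the awkward iloc indexing rather than making things fast.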

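3) Create shifted columns and use vectorized operations, but that means creating around 100 extra columns in the DataFrame (shifts of +1, +2, -1, -2 x 20 columns):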

df['ratio12'] = df['col1'] / df['col2']
df['ratio12up1'] = df['ratio12'].shift(-1)
df['ratio12up2'] = df['ratio12'].shift(-2)
df['ratio12dn1'] = df['ratio12'].shift(1)
df['ratio12dn2'] = df['ratio12'].shift(2)
df['ratio23'] = df['col2'] / df['col3']
df['ratio23up1'] = df['ratio23'].shift(-1)
df['ratio23up2'] = df['ratio23'].shift(-2)
df['ratio23dn1'] = df['ratio23'].shift(1)
df['ratio23dn2'] = df['ratio23'].shift(2)
df['ratio34'] = df['col3'] / df['col4']
#..... for other 10 ratios
# and then do the checks on all these new columns, like:
peak_ratio12 = ((df['ratio12'] / df['ratio12up1']) > 1.5) & ((df['ratio12'] / df['ratio12dn1']) > 1.5)
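A possible variant of option 3, sketched under the assumption that the shifted values are only needed temporarily: keep the ratios in a separate small frame and call shift() on that whole frame, so nothing is added to df itself. The names ratios, vs_next, vs_prev and peaks are made up for this sketch; the data is rows 4-8 of the sample table in the edit to this question.

```python
import pandas as pd

# Rows 4-8 of the sample table, first three columns only
df = pd.DataFrame(
    {"col1": [5807, 1681, 542, 319, 286],
     "col2": [545, 132, 620, 38, 22],
     "col3": [155, 46, 13, 30, 17]},
    index=[4, 5, 6, 7, 8],
)

# One small frame holding only the ratios
ratios = pd.DataFrame({
    "ratio12": df["col1"] / df["col2"],
    "ratio23": df["col2"] / df["col3"],
})

# shift() on a DataFrame shifts every column at once
vs_next = ratios / ratios.shift(-1)
vs_prev = ratios / ratios.shift(1)

# Edge rows compare against NaN, which evaluates to False
peaks = (vs_next > 1.5) & (vs_prev > 1.5)
print(peaks)
```

On these five rows the only flagged entry should be ratio23 at row 6, where col2 spikes to 620.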

Edit: an example. I have this table:

   Index  col1  col2  col3  col4  col5  col6  col7
   0       732    58    18    10     6     3     3
   1       754    60    18    10     6     3     3
   2      3964   365    98    34    34    17    13
   3      4286   417   110    36    35    19    15
   4      5807   545   155    54    53    27    21
   5      1681   132    46    16    13     9     8
   6       542   620    13    11     4     3     2
   7       319    38    30    20     4     2     2
   8       286    22    17    10     3     2     2
   9       324    25    18    10     3     2     2
   10      370    29    10     0     4     2     2
   11      299    28    19    10     3     2     2
   12      350    36    14    11     6     3     4
   13      309    34    14    11     7     3     4

In this tiny slice of the data, I would like to find the errors:

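when a value in a column jumps to 2x or half of the previous and next values (e.g. in rows 2, 5, 6)
when a later column is higher than the column before it (e.g. col2 > col1 in row 6, col7 > col6 in rows 12 and 13, and so on)
plus a large number of other checks on these columns (e.g. col1 / col2 must be very stable, col2 / col3 likewise, col6 + col7 must be less than col3, etc.)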
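To make the first two checks concrete, here is a rough sketch on the sample table. It assumes that a "jump" means a factor of 2 (up or down) relative to the previous row only, and it uses DataFrame.diff(axis=1) to compare each column with the one to its left. The names step, jump and out_of_order are just for illustration.

```python
import pandas as pd

cols = ["col1", "col2", "col3", "col4", "col5", "col6", "col7"]
data = [
    [732, 58, 18, 10, 6, 3, 3],
    [754, 60, 18, 10, 6, 3, 3],
    [3964, 365, 98, 34, 34, 17, 13],
    [4286, 417, 110, 36, 35, 19, 15],
    [5807, 545, 155, 54, 53, 27, 21],
    [1681, 132, 46, 16, 13, 9, 8],
    [542, 620, 13, 11, 4, 3, 2],
    [319, 38, 30, 20, 4, 2, 2],
    [286, 22, 17, 10, 3, 2, 2],
    [324, 25, 18, 10, 3, 2, 2],
    [370, 29, 10, 0, 4, 2, 2],
    [299, 28, 19, 10, 3, 2, 2],
    [350, 36, 14, 11, 6, 3, 4],
    [309, 34, 14, 11, 7, 3, 4],
]
df = pd.DataFrame(data, columns=cols)

# Check 1 (assumed definition): value at least doubled or halved
# compared with the previous row; the first row compares against NaN
step = df / df.shift(1)
jump = (step >= 2) | (step <= 0.5)

# Check 2: a column strictly higher than the column to its left
out_of_order = df.diff(axis=1) > 0

jump_rows_col1 = jump.index[jump["col1"]].tolist()
order_rows = out_of_order.index[out_of_order.any(axis=1)].tolist()
print(jump_rows_col1)
print(order_rows)
```

With this definition col1 is flagged in rows 2, 5 and 6. The column-order check flags rows 6, 12 and 13 and also row 10, where col4 is 0 and col5 is 4.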

My main problem is checking the column ratios against the previous and next values. All the other checks are easy to vectorize.

Any suggestions on how to proceed?

Thanks!

What is the fastest way to check multiple trends in a pandas DataFrame?
