Question

I have this pandas DataFrame df:

station a_d direction
   a     0      0
   a     0      0
   a     1      0
   a     0      0
   a     1      0
   b     0      0
   b     1      0
   c     0      0
   c     1      0
   c     0      1
   c     1      1
   b     0      1
   b     1      1
   b     0      1
   b     1      1
   a     0      1
   a     1      1
   a     0      0
   a     1      0

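I want to assign a value_id that increments whenever the direction value changes, and that is set only on the last pair of rows for each station before it changes, where the pair has different a_d values [0, 1]. The final rows of the DataFrame (the last two in this example) can be ignored. In other words: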

station a_d direction value_id
   a     0      0
   a     0      0
   a     1      0
   a     0      0        0
   a     1      0        0
   b     0      0        0
   b     1      0        0
   c     0      0        0
   c     1      0        0
   c     0      1        1
   c     1      1        1
   b     0      1         
   b     1      1        
   b     0      1        1
   b     1      1        1
   a     0      1        1
   a     1      1        1
   a     0      0
   a     1      0

Using df.iterrows() I wrote this script:

df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif df.loc[i-1, 'direction'] != df.loc[i, 'direction']:
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station'] and (df.loc[i+2, 'direction'] == df.loc[i, 'direction'])):
            break
        else:
            df.loc[i, 'value_id'] = value_id

It works, but it is very slow. With a DataFrame of 10*10^6 rows I need a faster way. Any ideas?

EDIT: @user5402's code works well, but I noticed that adding a break after the final else also reduces the computation time:

df['value_id'] = ""
value_id = 0
row_iterator = df.iterrows()
for i, row in row_iterator:
    if i == 0:
        continue
    elif df.loc[i-1, 'direction'] != df.loc[i, 'direction']:
        value_id += 1
    for z in range(1, 11):
        if i+z >= len(df)-1:
            break
        elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
            break
        elif (df.loc[i+1, 'a_d'] != df.loc[i, 'a_d']) and (df.loc[i+2, 'station'] == df.loc[i, 'station'] and (df.loc[i+2, 'direction'] == df.loc[i, 'direction'])):
            break
        else:
            df.loc[i, 'value_id'] = value_id
            break

Answer 1

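You are not making effective use of z in the inner for loop. You never access row i+z: you access rows i, i+1 and i+2, but never row i+z.

You can replace the inner for loop with the following: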

  # Rows i+1 and i+2 must both exist; this matches the question's
  # check i+z >= len(df)-1 with z == 1.
  if i+2 > len(df)-1:
    pass
  elif df.loc[i+1, 'a_d'] == df.loc[i, 'a_d']:
    pass
  elif df.loc[i+2, 'station'] == df.loc[i, 'station'] and df.loc[i+2, 'direction'] == df.loc[i, 'direction']:
    pass
  else:
    df.loc[i, 'value_id'] = value_id

Note that I have also slightly optimized the second elif, because at that point you already know that df.loc[i+1,'a_d'] is not equal to df.loc[i,'a_d'].
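To show how the replacement fits into the whole loop, here is a minimal sketch that applies it to the sample data from the question. The function name assign_value_id_loop and the explicit object-dtype column are my own choices for illustration, and the sketch assumes the default 0..n-1 integer index, as the question's code does.

```python
import pandas as pd


def assign_value_id_loop(df):
    """Row loop from the question, with the inner for-loop flattened."""
    out = df.copy()
    n = len(out)
    # Object dtype so the column can hold both "" and integers.
    out["value_id"] = pd.Series([""] * n, index=out.index, dtype=object)
    value_id = 0
    for i in range(1, n):  # row 0 is skipped, as in the question
        if out.loc[i - 1, "direction"] != out.loc[i, "direction"]:
            value_id += 1
        if i + 2 > n - 1:
            continue  # rows i+1 and i+2 must both exist
        if out.loc[i + 1, "a_d"] == out.loc[i, "a_d"]:
            continue  # not the first half of a differing a_d pair
        if (out.loc[i + 2, "station"] == out.loc[i, "station"]
                and out.loc[i + 2, "direction"] == out.loc[i, "direction"]):
            continue  # same station and direction continue two rows ahead
        out.loc[i, "value_id"] = value_id
    return out


# Sample data from the question.
df = pd.DataFrame({
    "station": list("aaaaa" + "bb" + "cccc" + "bbbb" + "aaaa"),
    "a_d": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "direction": [0] * 9 + [1] * 8 + [0] * 2,
})
result = assign_value_id_loop(df)
print(result)
```

This still touches every row through df.loc, so it mainly removes the redundant passes over z rather than the per-row overhead.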

Not having to loop over z will save a lot of time.
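Since each row's result depends only on rows i-1, i+1 and i+2, the row loop itself can probably be dropped as well. The following is a sketch of one way to do that with NumPy slices, not something taken from the question or tested on the 10*10^6-row data. The function name assign_value_id_vectorized is hypothetical, and I am assuming that a nullable Int64 column with missing values is an acceptable substitute for the empty strings used in the question.

```python
import numpy as np
import pandas as pd


def assign_value_id_vectorized(df):
    """Vectorized sketch of the same rules as the row loop."""
    out = df.copy()
    n = len(out)
    st = out["station"].to_numpy()
    ad = out["a_d"].to_numpy()
    dr = out["direction"].to_numpy()

    # Running counter: +1 each time direction differs from the previous row.
    # The first row always counts as a change, hence the "- 1".
    counter = out["direction"].ne(out["direction"].shift()).cumsum() - 1

    # Rows 1 .. n-3 are candidates: row 0 is skipped and rows i+1, i+2
    # must exist. Each slice below has length max(n - 3, 0).
    keep = np.zeros(n, dtype=bool)
    next_ad_differs = ad[1:n - 2] != ad[2:n - 1]
    same_two_ahead = (st[1:n - 2] == st[3:n]) & (dr[1:n - 2] == dr[3:n])
    keep[1:n - 2] = next_ad_differs & ~same_two_ahead

    # Rows that are not kept become missing values (<NA>).
    out["value_id"] = counter.where(keep).astype("Int64")
    return out


# Sample data from the question.
df = pd.DataFrame({
    "station": list("aaaaa" + "bb" + "cccc" + "bbbb" + "aaaa"),
    "a_d": [0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "direction": [0] * 9 + [1] * 8 + [0] * 2,
})
result = assign_value_id_vectorized(df)
print(result)
```

Because it works on positions rather than index labels, this version does not rely on the index being 0..n-1. It is worth checking it against the loop on a slice of the real data before relying on it.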
