Realign data in one column to a different row based on the last time some condition was true

Time: 2017-07-28 20:23:46

Tags: python python-2.7 pandas

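I am trying to parse log files with millions of lines that suffer from an unfortunate flaw. Data related to a single event can be split across log entries, but there is no direct link that would let me realign the data spread over multiple rows into a single row; instead, I have to infer the relationship.

Brief background:

  1. There are 4 objects I care about, and each will be modified many times.
  2. A thread pool of 8 threads will randomly pick one of them up and start processing it. This event is identified by thing_n, A, B and C all having non-null values, and I can also get the thread number from this logged event.
  3. Somewhere later in the log, there will be a log entry stating how many iterations the thread performed. This event contains no other information (i.e., it does not report which thing_n it operated on).
  4. The thread_num / thing_n pairing changes constantly.
  5. Any number of threads can log any number of events between points 2 and 3, so you cannot simply .shift() the Iterations column to realign the data into one row.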
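  6. Somehow, I need to realign the Iterations column onto the previous row (and only the previous row) where thing_I_care_about, A, B and C are all non-null and thread_num matches. There are timestamps (not in my MCVE), if that helps, and all events are sorted in ascending order.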

Example input:

       thing_I_care_about  thread_num    A    B    C      Iterations
    0  thing_1             2             X    X    X      NaN
    1  NaN                 2             X    X    NaN    NaN
    2  thing_2             3             NaN  X    X      NaN
    3  NaN                 2             NaN  NaN  NaN    110.0
    4  thing_3             7             X    X    X      NaN
    5  thing_4             5             X    X    NaN    NaN
    6  NaN                 7             NaN  NaN  NaN    150.0
    

Example output:

       thing_I_care_about  thread_num    A    B    C      Realigned Iterations
    0  thing_1             2             X    X    X      110.0
    1  NaN                 2             X    X    NaN    NaN
    2  thing_2             3             NaN  X    X      NaN
    3  NaN                 2             NaN  NaN  NaN    NaN
    4  thing_3             7             X    X    X      150.0
    5  thing_4             5             X    X    NaN    NaN
    6  NaN                 7             NaN  NaN  NaN    NaN
    

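I can manage a pure Python approach (at the bottom), but this analysis will be repeated on demand and has to handle hundreds of millions of such events. Conceptually, the only way I can think of to do this in pandas is:

  1. groupby() thread_num and sort each group by timestamp
  2. Try to somehow get, for each thread, a DF with alternating rows of notnull([thing_n, A, B, C, thread_num]) and notnull([thread_num, Iterations]), so that a shift(-1) can realign the data
  3. Somehow join that back to the original DataFrame

However, I can't seem to make any progress with this approach. Is there a clever way to do this, or am I stuck handling this part in pure Python?

Pure Python approach: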

      import numpy as np
      import pandas as pd
      
      raw_data = [['thing_I_care_about', 'thread_num', 'A', 'B', 'C', 'Iterations'], ['thing_1', 2, 'X', 'X', 'X', np.nan], [np.nan, 2, 'X', 'X', np.nan, np.nan], ['thing_2', 3, np.nan, 'X', 'X', np.nan], [np.nan, 2, np.nan, np.nan, np.nan, 110], ['thing_3', 7, 'X', 'X', 'X', np.nan], ['thing_4', 5, 'X', 'X', np.nan, np.nan], [np.nan, 7, np.nan, np.nan, np.nan, 150]]
      
      data = pd.DataFrame(raw_data[1:], columns=raw_data[0])
      print "Input format"
      print data
      
      header_dict = {item: x for x, item in enumerate(data.columns)}
      
      # Take data out of DF to become nested list
      data_list = data.as_matrix()
      
      # Track the row in which a thread starts its process
      active_threads = {} 
      
      # Create a list that becomes the re-aligned iterations column in the DF at the end
      realigned_data = [np.nan for x in xrange(len(data_list))]
      
      for x, entry in enumerate(data_list):
          thread_num = int(entry[header_dict['thread_num']])
      
          if all([pd.notnull(entry[header_dict['thing_I_care_about']]),
                 pd.notnull(entry[header_dict['A']]),
                 pd.notnull(entry[header_dict['B']]),
                 pd.notnull(entry[header_dict['C']])]):
              active_threads[thread_num] = x
      
          elif pd.notnull(entry[header_dict['Iterations']]) and thread_num in active_threads:
              realigned_data[active_threads[thread_num]] = entry[header_dict['Iterations']]
      
      data['realigned_iterations'] = realigned_data
      print "Output format"
      print data
      

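1 Answer:

Answer 0 (score: 1)

IIUC, I think you can do it like this. Create two masks. The first marks the rows that hold the current iteration value. The second puts True on the first record, the one you want the iteration value moved to. Then group on the first mask with cumsum and put the current value on all records in the group, then use the second mask with where.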

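# df is the question's DataFrame (named data there).
# mask: rows where thing_I_care_about, A, B and C are all null
# (the iteration-count rows in the sample).
mask = (df['thing_I_care_about'].isnull() &
        df['A'].isnull() &
        df['B'].isnull() &
        df['C'].isnull())

# fmask: rows where all four columns are populated (a thread picks up a thing).
fmask = (df['thing_I_care_about'].notnull() &
         df['A'].notnull() &
         df['B'].notnull() &
         df['C'].notnull())

# The reversed cumsum labels each run of rows that ends at a mask row;
# broadcast that last row's value over the run, then keep it only on fmask rows.
df.assign(Iterations=df.groupby(mask[::-1].cumsum())['Iterations']
                       .transform(lambda x: x.iloc[-1])
                       .where(fmask))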

Output:

  thing_I_care_about  thread_num    A    B    C  Iterations
0            thing_1           2    X    X    X       110.0
1                NaN           2    X    X  NaN         NaN
2            thing_2           3  NaN    X    X         NaN
3                NaN           2  NaN  NaN  NaN         NaN
4            thing_3           7    X    X    X       150.0
5            thing_4           5    X    X  NaN         NaN
6                NaN           7  NaN  NaN  NaN         NaN
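
One caveat worth flagging: the grouping in this answer is built from the null pattern alone and does not appear to consult thread_num, so any start row sitting between two iteration-count rows receives the later count, whichever thread logged it. That happens to give the right result on the sample data. Below is a minimal sketch of a thread-aware variant, assuming each start row is followed by at most one iteration-count row on the same thread. The function name realign_iterations, the result name realigned_iterations and the demo frame are made up for illustration, and it has not been benchmarked on large inputs.

```python
import numpy as np
import pandas as pd


def realign_iterations(df):
    """Move each Iterations value onto the latest start row of the same thread.

    Hypothetical helper: returns a float Series aligned with df.index.
    """
    # A "start" row has thing_I_care_about, A, B and C all populated.
    is_start = df[['thing_I_care_about', 'A', 'B', 'C']].notnull().all(axis=1)
    has_iter = df['Iterations'].notnull()

    # Positional row number on start rows, NaN everywhere else.
    row_pos = pd.Series(np.arange(len(df), dtype=float), index=df.index)
    start_pos = row_pos.where(is_start)

    # Within each thread, carry the most recent start position forward.
    last_start = start_pos.groupby(df['thread_num']).ffill()

    # Iteration rows whose thread has already logged a start row.
    hits = has_iter & ~is_start & last_start.notnull()

    # Write each count at the position of its thread's latest start row.
    out = np.full(len(df), np.nan)
    out[last_start[hits].astype(int).values] = df.loc[hits, 'Iterations'].values
    return pd.Series(out, index=df.index, name='realigned_iterations')


# Two threads whose start rows and iteration-count rows interleave.
demo = pd.DataFrame({
    'thing_I_care_about': ['thing_1', 'thing_2', np.nan, np.nan],
    'thread_num': [2, 7, 2, 7],
    'A': ['X', 'X', np.nan, np.nan],
    'B': ['X', 'X', np.nan, np.nan],
    'C': ['X', 'X', np.nan, np.nan],
    'Iterations': [np.nan, np.nan, 110.0, 150.0],
})
demo_result = realign_iterations(demo)
# demo_result holds 110.0 on row 0 and 150.0 on row 1; rows 2 and 3 are NaN.
```

On the question's sample frame this sketch should put 110.0 on row 0 and 150.0 on row 4, the same as the output above. The per-thread forward fill replaces the active_threads dictionary from the pure Python loop, so the only Python-level work left is building the masks.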