I am attempting to parse multi-million-line log files that suffer from an unfortunate flaw: the data related to a single event can be split across log entries, and there is no direct link I can use to re-align the data from multiple rows into one row; instead, I have to infer the relationship.
Brief background: a thing (thing_n) is identified by A, B, and C all having non-null values, and I can also get the thread number (thread_num) from the record of that event. The thread_num / thing_n pairing changes constantly, so I cannot simply shift() the Iterations column to re-align the data into one row. Somehow, I need to move each Iterations value back into the earlier row where thing_I_care_about, A, B, and C are non-null and thread_num matches. There are timestamps (not in my MCVE), if that helps, and all events are sorted in ascending order.
Sample input:
thing_I_care_about thread_num A B C Iterations
0 thing_1 2 X X X NaN
1 NaN 2 X X NaN NaN
2 thing_2 3 NaN X X NaN
3 NaN 2 NaN NaN NaN 110.0
4 thing_3 7 X X X NaN
5 thing_4 5 X X NaN NaN
6 NaN 7 NaN NaN NaN 150.0
Sample output:
thing_I_care_about thread_num A B C Realigned Iterations
0 thing_1 2 X X X 110.0
1 NaN 2 X X NaN NaN
2 thing_2 3 NaN X X NaN
3 NaN 2 NaN NaN NaN NaN
4 thing_3 7 X X X 150.0
5 thing_4 5 X X NaN NaN
6 NaN 7 NaN NaN NaN NaN
I can manage a pure-Python approach (bottom), but this analysis will be repeated as needed and has to process hundreds of millions of such events. Conceptually, the only way I can think of to do this in pandas is to groupby() thread_num, sort each group by timestamp, and pair up the notnull([thing_n, A, B, C, thread_num]) rows with the notnull([thread_num, Iterations]) rows so that a shift(-1) can re-align the data. However, I can't seem to make progress with that approach. Is there a clever way to do this, or am I stuck handling this part in plain Python?
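One vectorized variant of that idea, sketched below, avoids the shift(-1) pairing entirely: mark the "start" rows (thing plus A/B/C all non-null), forward-fill each start row's index within its thread, and then write every Iterations value back onto the most recent start row of the same thread. This is a sketch against the sample data only, not a tested solution for the full logs:

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame(
    {'thing_I_care_about': ['thing_1', np.nan, 'thing_2', np.nan,
                            'thing_3', 'thing_4', np.nan],
     'thread_num': [2, 2, 3, 2, 7, 5, 7],
     'A': ['X', 'X', np.nan, np.nan, 'X', 'X', np.nan],
     'B': ['X', 'X', 'X', np.nan, 'X', 'X', np.nan],
     'C': ['X', np.nan, 'X', np.nan, 'X', np.nan, np.nan],
     'Iterations': [np.nan, np.nan, np.nan, 110.0, np.nan, np.nan, 150.0]})

# Rows that open an event: thing_I_care_about and A/B/C all present
start = df[['thing_I_care_about', 'A', 'B', 'C']].notnull().all(axis=1)

# For every row, the index of the most recent start row on the same thread
start_idx = df.index.to_series().where(start).groupby(df['thread_num']).ffill()

# Write each Iterations value back onto its thread's start row
realigned = pd.Series(np.nan, index=df.index)
has_iter = df['Iterations'].notnull() & start_idx.notnull()
realigned.loc[start_idx[has_iter].astype(int)] = df.loc[has_iter, 'Iterations'].to_numpy()
df['Realigned Iterations'] = realigned
print(df)
```

Unlike the row-iteration version, every step here is a column-level pandas operation, so it should scale much better; how it behaves on hundreds of millions of rows is untested.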
Pure-Python approach:
import numpy as np
import pandas as pd

raw_data = [['thing_I_care_about', 'thread_num', 'A', 'B', 'C', 'Iterations'],
            ['thing_1', 2, 'X', 'X', 'X', np.nan],
            [np.nan, 2, 'X', 'X', np.nan, np.nan],
            ['thing_2', 3, np.nan, 'X', 'X', np.nan],
            [np.nan, 2, np.nan, np.nan, np.nan, 110],
            ['thing_3', 7, 'X', 'X', 'X', np.nan],
            ['thing_4', 5, 'X', 'X', np.nan, np.nan],
            [np.nan, 7, np.nan, np.nan, np.nan, 150]]
data = pd.DataFrame(raw_data[1:], columns=raw_data[0])
print("Input format")
print(data)

header_dict = {item: x for x, item in enumerate(data.columns)}
# Take data out of the DF to become a nested list
data_list = data.to_numpy()
# Track the row in which a thread starts its process
active_threads = {}
# Column of re-aligned iteration counts, appended to the DF at the end
realigned_data = [np.nan for x in range(len(data_list))]

for x, entry in enumerate(data_list):
    thread_num = int(entry[header_dict['thread_num']])
    if all([pd.notnull(entry[header_dict['thing_I_care_about']]),
            pd.notnull(entry[header_dict['A']]),
            pd.notnull(entry[header_dict['B']]),
            pd.notnull(entry[header_dict['C']])]):
        active_threads[thread_num] = x
    elif pd.notnull(entry[header_dict['Iterations']]) and thread_num in active_threads:
        realigned_data[active_threads[thread_num]] = entry[header_dict['Iterations']]

data['realigned_iterations'] = realigned_data
print("Output format")
print(data)
Answer 0 (score: 1)
IIUC, I think you can do it like this. Create two masks: one marks the rows where the Iterations value currently sits, and the second puts True on the first record of each group, where you want the Iterations value moved to. Then group on the cumsum of the (reversed) first mask to broadcast the last value of each group onto all of its records, and keep only the target rows using where with the second mask.
mask = (df['thing_I_care_about'].isnull() &
        df['A'].isnull() &
        df['B'].isnull() &
        df['C'].isnull())

fmask = (df['thing_I_care_about'].notnull() &
         df['A'].notnull() &
         df['B'].notnull() &
         df['C'].notnull())

df.assign(Iterations=df.groupby(mask[::-1].cumsum())['Iterations']
            .transform(lambda x: x.iloc[-1])
            .where(fmask))
Output:
thing_I_care_about thread_num A B C Iterations
0 thing_1 2 X X X 110.0
1 NaN 2 X X NaN NaN
2 thing_2 3 NaN X X NaN
3 NaN 2 NaN NaN NaN NaN
4 thing_3 7 X X X 150.0
5 thing_4 5 X X NaN NaN
6 NaN 7 NaN NaN NaN NaN
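For reference, the snippet above can be made self-contained by building df from the question's sample data (the all(axis=1) form of the masks below is an equivalent shorthand for the chained & expressions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'thing_I_care_about': ['thing_1', np.nan, 'thing_2', np.nan,
                            'thing_3', 'thing_4', np.nan],
     'thread_num': [2, 2, 3, 2, 7, 5, 7],
     'A': ['X', 'X', np.nan, np.nan, 'X', 'X', np.nan],
     'B': ['X', 'X', 'X', np.nan, 'X', 'X', np.nan],
     'C': ['X', np.nan, 'X', np.nan, 'X', np.nan, np.nan],
     'Iterations': [np.nan, np.nan, np.nan, 110.0, np.nan, np.nan, 150.0]})

# True on fully-empty rows (where an Iterations value currently lives)
mask = df[['thing_I_care_about', 'A', 'B', 'C']].isnull().all(axis=1)
# True on fully-populated rows (where the value should end up)
fmask = df[['thing_I_care_about', 'A', 'B', 'C']].notnull().all(axis=1)

# Reversed cumsum makes each block ending in an Iterations row one group;
# broadcast that group's last Iterations value, then keep it only on start rows
result = df.assign(Iterations=df.groupby(mask[::-1].cumsum())['Iterations']
                   .transform(lambda x: x.iloc[-1])
                   .where(fmask))
print(result)
```

One caveat worth noting: this grouping relies purely on row position between empty rows and never consults thread_num, so it produces the expected output on this sample but may not on interleavings where an unrelated start row falls between a thing's start and its Iterations row.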