我正在尝试在每行的DataFrame 中找到第一个有效值和最后一个有效值之间的差异。
我有一个带有for循环的工作代码,并且可以更快地查找内容。 这是我目前正在做的一个例子:
import pandas as pd
import numpy as np
df = pd.DataFrame(
np.arange(16).astype(np.float).reshape(4, 4),
columns=['a', 'b', 'c', 'd'])
# Fill some NaN
df.loc[0, ['a', 'd']] = np.nan
df.loc[1, ['c', 'd']] = np.nan
df.loc[2, 'b'] = np.nan
df.loc[3, :] = np.nan
print(df)
# a b c d
# 0 NaN 1.0 2.0 NaN
# 1 4.0 5.0 NaN NaN
# 2 8.0 NaN 10.0 11.0
# 3 NaN NaN NaN NaN
diffs = pd.Series(index=df.index)
for i in df.index:
row = df.loc[i]
min_i = row.first_valid_index()
max_i = row.last_valid_index()
if min_i is None or min_i == max_i: # 0 or 1 valid values
continue
diffs[i] = df.loc[i, max_i] - df.loc[i, min_i]
df['diff'] = diffs
print(df)
# a b c d diff
# 0 NaN 1.0 2.0 NaN 1.0
# 1 4.0 5.0 NaN NaN 1.0
# 2 8.0 NaN 10.0 11.0 3.0
# 3 NaN NaN NaN NaN NaN
答案 0 :(得分:3)
一种方法是back and forward fill缺少的值,然后只比较第一行和最后一行。
df2 = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1)
df['diff'] = df2.ix[:, -1] - df2.ix[:, 0]
如果您想在一行中完成,而不创建新的数据框:
df['diff'] = df.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1).apply(lambda r: r.d - r.a, axis=1)
答案 1 :(得分:1)
Pandas让您的生活轻松,一次一个方法(first_valid_values())。请注意,您必须删除任何具有所有 NaN值的行(无论如何都没有任何意义):
对于第一个有效值:
a= [df.ix[x,i] for x,i in enumerate(df.apply(lambda row: row.first_valid_index(), axis=1))]
有关上一个有效值:
b = [df.ix[x,i] for x,i in enumerate(df.apply(lambda row: row[::-1].first_valid_index(), axis=1))]
减去最终结果:
a-b