I have the following DataFrame:
df =
VD_1 VD_2 VD_3 VD_4 VD_5 TYPE VAL
NaN XX VV DD NaN ABC 5
NaN XX MM VV NaN ABC 6
XX MM NaN NaN NaN ABC 6
TT XX MM NaN NaN ABC 5
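For reproducibility, the frame above could be built roughly as follows. This is a sketch that assumes the empty cells are real NaN values (not the string "NaN") and that VAL holds integers; neither is stated explicitly in the table.

```python
import numpy as np
import pandas as pd

# Sample data copied from the table above; np.nan marks the missing cells
# (assumption: they are genuine NaN values, not the string "NaN").
df = pd.DataFrame({
    "VD_1": [np.nan, np.nan, "XX", "TT"],
    "VD_2": ["XX", "XX", "MM", "XX"],
    "VD_3": ["VV", "MM", np.nan, "MM"],
    "VD_4": ["DD", "VV", np.nan, np.nan],
    "VD_5": [np.nan, np.nan, np.nan, np.nan],
    "TYPE": ["ABC"] * 4,
    "VAL": [5, 6, 6, 5],
})
print(df)
```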
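I want to keep only the rows in which the first non-NaN value equals XX and that XX is followed by at least two non-NaN values.
The problem is that return x returns None, None, None, ... It only works when I use return row, but then the result does not contain the same number of columns as df. The code also does not exclude the columns TYPE and VAL from the analysis.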
def customFilter(x):
    row = x.dropna()
    if (row[0] == 'XX') & (('XX' not in row[1:]) & (len(row[1:]) >= 2)):
        return row
    return np.nan
df = df.apply(customFilter, axis=1).dropna(how='all', axis=0)
Is there any trick to solve the mentioned issues?
Update:
# Delete rows that do not start from AG
def calculate_correct_rows(df):
    # Create drop rows
    drop_rows = []
    i = 0
    for index, x in df.iterrows():
        row = x.dropna()
        if (row[0] == 'XX') & (('XX' not in row[1:]) & (len(row[1:]) >= 2)):
            drop_rows.append(i)
        i = i + 1
    return drop_rows
# Drop the rows in list
subset2 = df.filter(like='VD_')
correct_rows = calculate_correct_rows(subset2)
final2 = df.loc[correct_rows,:]
Answer 0 (score: 1)
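There may be a prettier way, but you can simply do the filtering in two steps instead of one. First, build a list of all rows that do not meet the criteria above. Second, use df.drop(rows) to delete the rows in the list created in step 1.
See the pandas documentation for DataFrame.drop.
For example: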
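def calculate_drop_rows(df, should_drop):
    # should_drop stands in for [condition]: a function you supply that
    # takes one row and returns True when the row fails the criteria.
    drop_rows = []
    # Iterate over rows (a plain "for row in df" would yield column names)
    for index, row in df.iterrows():
        if should_drop(row):
            # Store the index label, which is what df.drop expects
            drop_rows.append(index)
    return drop_rows
# Drop the rows in list
drop_rows = calculate_drop_rows(df, should_drop)
df = df.drop(drop_rows)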
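The two-step approach above could be made concrete for the frame in the question roughly as follows. This is only a sketch under stated assumptions: the criteria are read as "the first non-NaN value is XX and at least two non-NaN values follow it", the helper name fails_criteria is made up for illustration, and the extra 'XX' not in row[1:] test from the question is left out (on a Series, in checks the index labels rather than the values, so that test does not do what it appears to). Restricting the check to the VD_ columns with df.filter keeps TYPE and VAL out of the analysis, and dropping by index label keeps every original column in the result.

```python
import numpy as np
import pandas as pd

# Sample frame from the question (assumption: empty cells are real NaN).
df = pd.DataFrame({
    "VD_1": [np.nan, np.nan, "XX", "TT"],
    "VD_2": ["XX", "XX", "MM", "XX"],
    "VD_3": ["VV", "MM", np.nan, "MM"],
    "VD_4": ["DD", "VV", np.nan, np.nan],
    "VD_5": [np.nan, np.nan, np.nan, np.nan],
    "TYPE": ["ABC"] * 4,
    "VAL": [5, 6, 6, 5],
})

def fails_criteria(row):
    """Hypothetical helper: return True when the row should be dropped."""
    values = row.dropna().tolist()      # non-NaN values, left to right
    if not values or values[0] != "XX":
        return True                     # first non-NaN value is not XX
    return len(values[1:]) < 2          # fewer than two values follow the XX

def calculate_drop_rows(frame):
    # Step 1: collect the index labels of the rows that fail the criteria.
    return [index for index, row in frame.iterrows() if fails_criteria(row)]

# Only the VD_ columns take part in the check; TYPE and VAL are ignored.
vd_cols = df.filter(like="VD_")

# Step 2: drop those rows from the full frame, so all columns survive.
drop_rows = calculate_drop_rows(vd_cols)
result = df.drop(drop_rows)

# Equivalent variant without an explicit loop: build a boolean mask.
mask = vd_cols.apply(fails_criteria, axis=1)
result_from_mask = df[~mask]

print(result)
```

Either variant returns a frame with the same columns as df, because the helper yields one True/False per row instead of returning a shortened row from apply.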