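How can I check whether the values in a cell of column "B" (which may contain several lines) occur in column "A" and, if they do, insert the whole row (for example, the row holding the value m32\nm83\nm18) below each row in column "A" where a match is found (for example, m32)?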
This is the DataFrame:
df
A    B              C
m55  m32\nm83\nm18  123
m56  m12            546
m68
m32
m83
m65
m73  m77\nm78       558
m23
m98
m77
m18
m4
m12
m78
This is what I want to get:
df
A    B              C
m55  m32\nm83\nm18  123
m56  m12            546
m68
m32
m55  m32\nm83\nm18  123
m83
m55  m32\nm83\nm18  123
m65
m73  m77\nm78       558
m23
m98
m77
m73  m77\nm78       558
m18
m55  m32\nm83\nm18  123
m4
m12
m56  m12            546
m78
m73  m77\nm78       558
I tried this:
def insert_row(idx, df, df_insert):
    return df.iloc[:idx, ].append(df_insert).append(df.iloc[idx:, ]).reset_index(drop = True)

dfB = dfB[dfB.apply(lambda x: isinstance(x, str))]
dfBidx = dfB.index
j=0
for b in dfBidx:
    try:
        idx = df.index[df["A"].apply(lambda x: isinstance(x, str)).str.contains("|".join(dfB[b].split("\n")))]
        for i in idx:
            i+=j
            df_new = df.loc[i]
            df = insert_row(i+j+1, df, df_new)
            j+= int(df_new.size/len(df_new.columns.values))
    except:
        pass
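Is there another way to do this? The NaN values in column "A" cause problems, and in general I run into mismatches when using these functions:
str(), contains(), apply()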
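Edit:
I have a second DataFrame (df2) from which I extract rows and insert them into df. I extract the rows from one "test" to the next "test" in the "Keyword" column.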
df2
Keyword    B              C
test       m32\nm83\nm18  123
something
something
something
test
something
something
test       m12            546
something
test       m77\nm78       558
test
something
So in the end I need this:
df
A    Keyword    B              C
m55             m32\nm83\nm18  123
m56             m12            546
m68
m32
     test       m32\nm83\nm18  123
     something
     something
     something
m83
     test       m32\nm83\nm18  123
     something
     something
     something
m65
m73             m77\nm78       558
m23
m98
m77
     test       m77\nm78       558
m18
     test       m32\nm83\nm18  123
     something
     something
     something
m4
m12
     test       m12            546
     something
m78
     test       m77\nm78       558
Answer 0 (score: 1)
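This solution assumes a default RangeIndex.
Build a dictionary (d1) that maps the index of each source row to the indices of the rows after which it should be inserted. Then, in a list comprehension, repeat each source row once per match and give the copies the matched indices plus 0.5 so that they sort into the right place. Finally, concat everything together, call sort_index, and create a default index with reset_index: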
d = df['B'].dropna().to_dict()
print (d)
{0: 'm32\\nm83\\nm18', 1: 'm12', 6: 'm77\\nm78'}
d1 = {k: df.index[df['A'].str.contains("|".join(v.split("\\n")))] for k, v in d.items()}
print (d1)
{0: Int64Index([3, 4, 10], dtype='int64'),
1: Int64Index([12], dtype='int64'),
6: Int64Index([9, 13], dtype='int64')}
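A side note on the matching step above, since the question mentions NaN values in column A: str.contains returns NaN for missing values, and a mask containing NaN cannot be used for boolean indexing, so passing na=False is usually needed. It also matches substrings, so a value such as m1 would also hit m12. A minimal sketch on a small made-up Series (the names a, tokens, loose and exact are only illustrative, and the data is assumed to hold a literal backslash followed by n, as the printed dictionary suggests):

```python
import numpy as np
import pandas as pd

# Toy column with a missing value and a near-duplicate ("m12" contains "m1").
a = pd.Series(["m1", np.nan, "m12", "m2"], name="A")

# Split on the literal two characters backslash + n.
tokens = "m1\\nm2".split("\\n")

# Regex/substring matching: na=False turns NaN into False, but "m12" still matches "m1".
loose = a.index[a.str.contains("|".join(tokens), na=False)]

# Exact matching: NaN never matches and "m12" is not confused with "m1".
exact = a.index[a.isin(tokens)]

print(list(loose))  # [0, 2, 3]
print(list(exact))  # [0, 3]
```

If the values in A are always complete identifiers, isin may be the safer choice; if partial matches are intended, str.contains with na=False keeps the original behaviour.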
L = [pd.concat([df.loc[[k]]] * len(v)).set_index([v + .5]) for k, v in d1.items()]
df = pd.concat([df] + L).sort_index().reset_index(drop=True)
print (df)
A B C
0 m55 m32\nm83\nm18 123.0
1 m56 m12 546.0
2 m68 NaN NaN
3 m32 NaN NaN
4 m55 m32\nm83\nm18 123.0
5 m83 NaN NaN
6 m55 m32\nm83\nm18 123.0
7 m65 NaN NaN
8 m73 m77\nm78 558.0
9 m23 NaN NaN
10 m98 NaN NaN
11 m77 NaN NaN
12 m73 m77\nm78 558.0
13 m18 NaN NaN
14 m55 m32\nm83\nm18 123.0
15 m4 NaN NaN
16 m12 NaN NaN
17 m56 m12 546.0
18 m78 NaN NaN
19 m73 m77\nm78 558.0
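The edit in the question (inserting whole blocks from df2 instead of copies of the source row) is not covered by the code above. The following is only a sketch of one way to extend the same fractional-index idea, under several assumptions: df has a default RangeIndex, blank cells are NaN, B holds a literal backslash followed by n, values in A are matched exactly, and a block in df2 runs from a "test" row up to the row before the next "test". The names block_id, blocks, value_to_block and out are made up for this example.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "A": ["m55", "m56", "m68", "m32", "m83", "m65", "m73",
          "m23", "m98", "m77", "m18", "m4", "m12", "m78"],
    "B": ["m32\\nm83\\nm18", "m12", nan, nan, nan, nan, "m77\\nm78",
          nan, nan, nan, nan, nan, nan, nan],
    "C": [123, 546, nan, nan, nan, nan, 558,
          nan, nan, nan, nan, nan, nan, nan],
})
df2 = pd.DataFrame({
    "Keyword": ["test", "something", "something", "something", "test",
                "something", "something", "test", "something", "test",
                "test", "something"],
    "B": ["m32\\nm83\\nm18", nan, nan, nan, nan, nan, nan,
          "m12", nan, "m77\\nm78", nan, nan],
    "C": [123, nan, nan, nan, nan, nan, nan, 546, nan, 558, nan, nan],
})

# Every "test" row starts a new block; cumsum gives each block its own id.
block_id = (df2["Keyword"] == "test").cumsum()

# Keep only blocks whose first row carries a B value, keyed by that value.
blocks = {g["B"].iloc[0]: g
          for _, g in df2.groupby(block_id)
          if pd.notna(g["B"].iloc[0])}

# Map each single value (e.g. "m32") to the block it should pull in.
value_to_block = {tok: blocks[b]
                  for b in df["B"].dropna() if b in blocks
                  for tok in b.split("\\n")}

pieces = [df]
for pos, a in enumerate(df["A"]):
    block = value_to_block.get(a)
    if block is not None:
        n = len(block)
        labeled = block.copy()
        # Labels pos + 1/(n+1) ... pos + n/(n+1) keep the block in order
        # and sort it between rows pos and pos + 1 of df.
        labeled.index = pos + np.arange(1, n + 1) / (n + 1)
        pieces.append(labeled)

out = (pd.concat(pieces)
         .sort_index()
         .reset_index(drop=True)[["A", "Keyword", "B", "C"]])
print(out)
```

On the sample data this should yield the 30 rows listed in the question's edit, with NaN wherever a cell is blank. Spreading the fractional labels over the open interval between two positions plays the same role as the + .5 above, but keeps multi-row blocks in their original order.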