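I have data like this and it is driving me crazy. The source is a PDF file that I read with tabula to extract tables. The problem is that some table rows span several lines in the document, and this is how they show up in the output.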
> sub_df.iloc[85:95]
1   Acronym  Meaning
86  ABC      Aaaaa Bbbbb Ccccc
87  CDE      Ccccc Ddddd Eeeee
88  NaN      Fffff Ggggg
89  FGH      NaN
90  NaN      Hhhhh
91  IJK      Iiiii Jjjjj Kkkkk
92  LMN      Lllll Mmmmm Nnnnn
93  OPQ      Ooooo Ppppp Qqqqq
94  RST      Rrrrr Sssss Ttttt
95  UVZ      Uuuuu Vvvvv Zzzzz
What I want to get is something like this.
> sub_df.iloc[85:95]
1 Acronym Meaning
86 ABC Aaaaa Bbbbb Ccccc
87 CDE Ccccc Ddddd Eeeee
88 FGH Fffff Ggggg Hhhhh
91 IJK Iiiii Jjjjj Kkkkk
92 LMN Lllll Mmmmm Nnnnn
93 OPQ Ooooo Ppppp Qqqqq
94 RST Rrrrr Sssss Ttttt
95 UVZ Uuuuu Vvvvv Zzzzz
I have been struggling with combine_first for this:
sub_df.iloc[[88]].combine_first(sub_df.iloc[[87]])
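But the result is not what I expected.
Solutions that use groupby are welcome too.
Note: the index does not matter and can be reset. I just want to join consecutive rows that have NaN in one of the columns and then dump the result to CSV, so I do not need the index.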
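For anyone who wants to reproduce this, here is a minimal sketch of the frame, assuming '1' is an ordinary column (the printout does not say whether it is a column or the index). It also hints at why the combine_first attempt changes nothing: combine_first aligns on index labels, so two rows with different labels are never merged. The names top, bottom, unaligned and aligned are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Rows copied from the printout above; treating '1' as an ordinary column
# is an assumption about how the frame is laid out.
sub_df = pd.DataFrame({
    '1': [86, 87, 88, 89, 90, 91],
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee', 'Fffff Ggggg',
                np.nan, 'Hhhhh', 'Iiiii Jjjjj Kkkkk'],
})

top = sub_df.iloc[[2]]      # index label 2: NaN / 'Fffff Ggggg'
bottom = sub_df.iloc[[3]]   # index label 3: 'FGH' / NaN

# Different index labels: nothing lines up, so both rows come back unmerged.
unaligned = top.combine_first(bottom)

# Same index label on both sides: the NaN cells of `top` are filled from `bottom`.
aligned = top.reset_index(drop=True).combine_first(bottom.reset_index(drop=True))
```

Even with aligned labels this only merges two rows; the trailing 'Hhhhh' fragment would still need its own step, which is why a grouping approach is probably the better fit.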
Answer 0 (score: 2)
Let's try this:
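# Forward-fill Meaning so that an acronym-only row inherits the fragment above it
df = df.assign(Meaning=df['Meaning'].ffill())
# Drop rows without an acronym whose fragment was just copied onto the next row
mask = ~(df['Meaning'].duplicated(keep='last') & df['Acronym'].isnull())
df = df[mask]
# The remaining rows without an acronym are trailing fragments; give them the acronym above
df = df.assign(Acronym=df['Acronym'].ffill())
# Join the fragments of each acronym (the resulting column is named 0)
df_out = df.groupby('Acronym').apply(lambda x: ' '.join(x['Meaning'].str.split(r'\s').sum())).reset_index()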
Output:
Acronym 0
0 ABC Aaaaa Bbbbb Ccccc
1 CDE Ccccc Ddddd Eeeee
2 FGH Fffff Ggggg Hhhhh
3 IJK Iiiii Jjjjj Kkkkk
4 LMN Lllll Mmmmm Nnnnn
5 OPQ Ooooo Ppppp Qqqqq
6 RST Rrrrr Sssss Ttttt
7 UVZ Uuuuu Vvvvv Zzzzz
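As a variation on the same idea, here is a sketch that keeps the Meaning column name and the original row order by aggregating the column directly instead of going through apply. The small frame is a cut-down copy of the question's data, and the sort=False choice is mine, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Cut-down sample taken from the question's printout
df = pd.DataFrame({
    'Acronym': ['CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Ccccc Ddddd Eeeee', 'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk'],
})

# Same three preparation steps as in the answer
df['Meaning'] = df['Meaning'].ffill()
mask = ~(df['Meaning'].duplicated(keep='last') & df['Acronym'].isna())
df = df[mask].copy()
df['Acronym'] = df['Acronym'].ffill()

# Aggregate the column itself: the result keeps the name 'Meaning',
# and sort=False preserves the order in which acronyms first appear
df_out = df.groupby('Acronym', sort=False)['Meaning'].agg(' '.join).reset_index()
```

One thing to watch: duplicated looks at the whole column, so this relies on a fragment's text not reappearing elsewhere in a row that has no acronym.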
Answer 1 (score: 2)
This is a very tricky problem; neither ffill nor bfill is a good fit for it.
import pandas as pd

s1 = ~(df.Acronym.isnull() | df.Meaning.isnull())  # True for complete rows
s = s1.astype(int).diff().ne(0).cumsum()  # each consecutive run of incomplete rows gets a single id
bad = df[~s1]  # only the incomplete rows get changed
good = df[s1]  # complete rows are kept as they are
# 'first' skips NaN, so it picks the one acronym in the run; fragments are joined with a space
bad = bad.groupby(s.loc[bad.index]).agg({'1': 'first', 'Acronym': 'first', 'Meaning': lambda x: ' '.join(x[x.notnull()])})
pd.concat([good, bad]).sort_index()
Out[107]:
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
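To make the diff/cumsum step less opaque, here is a small sketch that shows the run ids it produces on a cut-down copy of the question's data. The names complete, group_id and merged are mine, chosen for readability.

```python
import numpy as np
import pandas as pd

# Cut-down sample taken from the question's printout
df = pd.DataFrame({
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee', 'Fffff Ggggg',
                np.nan, 'Hhhhh', 'Iiiii Jjjjj Kkkkk'],
})

# True where both cells are present
complete = df['Acronym'].notna() & df['Meaning'].notna()

# The id increases every time `complete` flips, so each run of consecutive
# rows with the same status shares one id
group_id = complete.astype(int).diff().ne(0).cumsum()

# Collapse each run of incomplete rows into a single row
merged = df[~complete].groupby(group_id[~complete]).agg(
    {'Acronym': 'first', 'Meaning': lambda x: ' '.join(x.dropna())}
)
```

Here group_id comes out as 1, 1, 2, 2, 2, 3. Complete rows that sit next to each other also share an id, which is why only the incomplete rows are grouped. It also means two broken entries with no complete row between them would land in the same run and be merged together, so this works best when broken entries are isolated.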
Answer 2 (score: 2)
Here is an approach that uses numpy.where for conditional filling:
import numpy as np

# Use ffill where both Acronym and the previous row's Meaning are missing, otherwise bfill
df['Acronym'] = np.where(df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(axis=1),
                         df.Acronym.ffill(),
                         df.Acronym.bfill())
# Join the Meaning fragments that now share an acronym
clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(lambda x: ' '.join(x)).to_frame()
df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning,
                 left_on='Acronym',
                 right_index=True))
[out]
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
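The condition passed to numpy.where is easier to follow when it is split into named pieces. This is a sketch on a cut-down copy of the question's data; prev_meaning_missing, use_ffill and filled are names I made up for the illustration.

```python
import numpy as np
import pandas as pd

# Cut-down sample taken from the question's printout
df = pd.DataFrame({
    'Acronym': ['CDE', np.nan, 'FGH', np.nan, 'IJK'],
    'Meaning': ['Ccccc Ddddd Eeeee', 'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk'],
})

# True where the row above has no Meaning, i.e. the row above is an acronym-only row
prev_meaning_missing = df['Meaning'].shift().isna()

# A fragment that follows an acronym-only row belongs to the acronym above it
use_ffill = df['Acronym'].isna() & prev_meaning_missing

# Every other row takes the backward fill, which leaves existing acronyms
# untouched and gives a leading fragment the acronym below it
filled = np.where(use_ffill, df['Acronym'].ffill(), df['Acronym'].bfill())
```

On this sample only the 'Hhhhh' row takes the forward fill, and all three fragments of the broken entry end up labelled FGH.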