I have a weird one today. I'm scraping thousands of PDFs with tabula-py, and for whatever reason the same table (from different PDFs) with wrapped text is sometimes merged automatically, depending on how the table was actually split, while in other cases the pandas DataFrame contains many NaN rows accounting for the wrapped text. The ratio is usually about 50:1 in favor of merged, so I'd like to automate the merging. Here's an example:
Desired DataFrame:
Column1 | Column2 | Column3
A Many Many ... Lots and ... This keeps..
B lots of text.. Many Texts.. Johns and jo..
C ...
D
DataFrame returned by the scrape:
Column1 | Column2 | Column3
A Many Many Lots This keeps Just
Nan Many Many and lots Keeps Going!
Nan Texts Nan Nan
B lots of Many Texts John and
Nan text here Johnson inc.
C ...
In this case, the text should be merged so that "Many Many Many Many Texts" all ends up in the Column1 cell for row A, and so on.
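For anyone who wants to experiment, here is a minimal reproduction of the scraped frame above (a sketch; the exact cell values are my reconstruction from the example):

```python
import numpy as np
import pandas as pd

# Rough stand-in for what tabula-py returns: the wrapped lines of each
# logical row come back as extra rows whose index cell is NaN.
scraped = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
     "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
     "Column3": ["This keeps Just", "Keeps Going!", np.nan, "John and", "Johnson inc."]},
    index=["A", np.nan, np.nan, "B", np.nan])
print(scraped)
```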
I've solved this with the solution below, but it feels dirty. There's a ton of index setting to avoid having to manage the columns and to avoid dropping values I need. Does anyone know a better solution?
df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()  # fill the index gaps left by wrapped rows
df = df.fillna('')
df = df.set_index('Unnamed: 0')
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')
Cheers
Answer 0 (score: 1)
A similar idea to Ben's:
# fill the missing index
df.index = df.index.to_series().ffill()
(df.stack() # stack to kill the other NaN values
.groupby(level=(0,1)) # groupby (index, column)
.apply(' '.join) # join those strings
.unstack(level=1) # unstack to get columns back
)
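Spelled out end to end on a toy frame (the sample data is my reconstruction of the question's example; the explicit dropna guards against newer pandas versions where stack no longer drops NaN by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts"],
     "Column2": ["Lots", "and lots", np.nan],
     "Column3": ["This keeps Just", "Keeps Going!", np.nan]},
    index=["A", np.nan, np.nan])

# fill the missing index labels, then stack/join/unstack as above
df.index = df.index.to_series().ffill()
out = (df.stack()             # long format: MultiIndex (index, column)
         .dropna()            # discard the NaN cells before joining
         .groupby(level=(0, 1))
         .apply(' '.join)     # concatenate the fragments of each cell
         .unstack(level=1))   # back to one column per original column
print(out.loc["A", "Column1"])   # -> Many Many Many Many Texts
```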
Output:
Column1 Column2 Column3
A Many Many Many Many Texts Lots and lots This keeps Just Keeps Going!
B lots of text Many Texts here John and Johnson inc.
Answer 1 (score: 1)
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
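As a self-contained check, the same one-liner applied to a small reconstruction of the question's data (sample values are my assumption):

```python
import numpy as np
import pandas as pd

# Row B of the question's example: one logical row split over two
# physical rows, with NaN in the index cell of the continuation row.
df = pd.DataFrame(
    {"Column1": ["lots of", "text"],
     "Column2": ["Many Texts", "here"],
     "Column3": ["John and", "Johnson inc."]},
    index=["B", np.nan])

# group by the forward-filled index; NaN cells become '' before joining
out = df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
print(out.loc["B", "Column1"])   # -> lots of text
```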
Out[1390]:
Column1 Column2 \
Unnamed: 0
A Many Many Many Many Texts Lots and lots
B lots of text Many Texts here
Column3
Unnamed: 0
A This keeps Just Keeps Going!
B John and Johnson inc.
Answer 2 (score: 0)
I think you can ffill the index directly and use it in groupby. Then use agg instead of transform.
import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print (df)
a b
A a 1
NaN b 2
NaN c 3
B d 4
C e 5
NaN f 6
# then groupby on the filled index and agg
new_df = (df.fillna('')
.groupby(pd.Series(df.index).ffill().values)[df.columns]
.agg(lambda x: ' '.join(x)))
print (new_df)
a b
A a b c 1 2 3
B d 4
C e f 5 6