I have a weird one today. I'm scraping thousands of PDFs with tabula-py, and for whatever reason the same table (from different PDFs) with wrapped text is sometimes merged automatically, depending on how the table was actually split, while in other cases the pandas DataFrame contains many NaN rows accounting for the wrapped text. The ratio is usually about 50:1 in favor of merged, so I'd like to automate the merging. Here's an example:
Desired DataFrame:
Column1 | Column2 | Column3
A Many Many ... Lots and ... This keeps..
B lots of text.. Many Texts.. Johns and jo..
C ...
D
DataFrame returned by the scrape:
Column1 | Column2 | Column3
A Many Many Lots This keeps Just
Nan Many Many and lots Keeps Going!
Nan Texts Nan Nan
B lots of Many Texts John and
Nan text here Johnson inc.
C ...
In this case, the text should be merged so that "Many Many Many Many Texts" all ends up in the Column1 cell for row A, and so on.
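For anyone who wants to experiment, here is a minimal reproduction of the scraped frame above (a sketch; the exact cell values are my reconstruction from the example):

```python
import numpy as np
import pandas as pd

# Rough stand-in for what tabula-py returns: the wrapped lines of each
# logical row come back as extra rows whose index cell is NaN.
scraped = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
     "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
     "Column3": ["This keeps Just", "Keeps Going!", np.nan, "John and", "Johnson inc."]},
    index=["A", np.nan, np.nan, "B", np.nan])
print(scraped)
```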
I've solved this with the solution below, but it feels dirty. There's a ton of index setting to avoid having to manage the columns and to avoid dropping values I need. Does anyone know a better solution?
df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()  # fill the index gaps left by wrapped rows
df = df.fillna('')
df = df.set_index('Unnamed: 0')
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')
Cheers
Answer 0 (score: 1)
A similar idea to Ben's:
# fill the missing index
df.index = df.index.to_series().ffill()
(df.stack() # stack to kill the other NaN values
.groupby(level=(0,1)) # groupby (index, column)
.apply(' '.join) # join those strings
.unstack(level=1) # unstack to get columns back
)
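Spelled out end to end on a toy frame (the sample data is my reconstruction of the question's example; the explicit dropna guards against newer pandas versions where stack no longer drops NaN by default):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts"],
     "Column2": ["Lots", "and lots", np.nan],
     "Column3": ["This keeps Just", "Keeps Going!", np.nan]},
    index=["A", np.nan, np.nan])

# fill the missing index labels, then stack/join/unstack as above
df.index = df.index.to_series().ffill()
out = (df.stack()             # long format: MultiIndex (index, column)
         .dropna()            # discard the NaN cells before joining
         .groupby(level=(0, 1))
         .apply(' '.join)     # concatenate the fragments of each cell
         .unstack(level=1))   # back to one column per original column
print(out.loc["A", "Column1"])   # -> Many Many Many Many Texts
```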
Output:
Column1 Column2 Column3
A Many Many Many Many Texts Lots and lots This keeps Just Keeps Going!
B lots of text Many Texts here John and Johnson inc.
Answer 1 (score: 1)
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
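As a self-contained check, the same one-liner applied to a small reconstruction of the question's data (sample values are my assumption):

```python
import numpy as np
import pandas as pd

# Row B of the question's example: one logical row split over two
# physical rows, with NaN in the index cell of the continuation row.
df = pd.DataFrame(
    {"Column1": ["lots of", "text"],
     "Column2": ["Many Texts", "here"],
     "Column3": ["John and", "Johnson inc."]},
    index=["B", np.nan])

# group by the forward-filled index; NaN cells become '' before joining
out = df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
print(out.loc["B", "Column1"])   # -> lots of text
```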
Out[1390]:
Column1 Column2 \
Unnamed: 0
A Many Many Many Many Texts Lots and lots
B lots of text Many Texts here
Column3
Unnamed: 0
A This keeps Just Keeps Going!
B John and Johnson inc.
Answer 2 (score: 0)
I think you can ffill the index directly and use it in groupby. Then use agg instead of transform.
import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print (df)
a b
A a 1
NaN b 2
NaN c 3
B d 4
C e 5
NaN f 6
# then groupby on the filled index and agg
new_df = (df.fillna('')
.groupby(pd.Series(df.index).ffill().values)[df.columns]
.agg(lambda x: ' '.join(x)))
print (new_df)
a b
A a b c 1 2 3
B d 4
C e f 5 6