How to merge rows based on NaN index values

Time: 2019-06-12 19:01:04

Tags: python pandas dataframe

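I have a weird one today. I am scraping thousands of PDFs with Tabula-py. For whatever reason, the same table (from different PDFs) containing wrapped text is sometimes merged automatically based on the table's actual row splits, but in other cases the pandas DataFrame comes back with many NaN-indexed rows that hold the wrapped text. Generally the ratio is about 50:1 merged to unmerged, so it makes sense to automate the merging process. Here is an example: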

Desired DataFrame:

    Column1      | Column2     | Column3
A  Many Many ...  Lots and ...  This keeps..
B  lots of text.. Many Texts..  Johns and jo..
C   ...
D

DataFrame returned by the scrape:

        Column1      | Column2     | Column3
    A  Many Many       Lots         This keeps Just
   Nan Many Many       and lots     Keeps Going!
   Nan Texts           Nan          Nan
    B  lots of        Many Texts    John and
   Nan text           here          Johnson inc.
    C  ...

In this case the text should be merged so that "Many Many Many Many Texts" all ends up in cell A of Column1, and so on.

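I have solved this with the solution below, but it feels very dirty. There is a lot of index setting and resetting, just to avoid having to manage the columns and to avoid dropping values that are needed. Does anyone know a better solution?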

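df = df.reset_index()
# carry each row label down over the NaN-labelled wrapped rows
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()
df = df.fillna('')
df = df.set_index('Unnamed: 0')
# join the text of every row that shares a label (one result per input row)
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
# transform repeats the joined text on every row, so keep only the first copy
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')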

Cheers

3 Answers:

Answer 0 (score: 1)

Similar to Ben's idea:

# fill the missing index
df.index = df.index.to_series().ffill()


(df.stack()               # stack to kill the other NaN values
    .groupby(level=(0,1)) # group by (index, column)
    .apply(' '.join)      # join those strings
    .unstack(level=1)     # unstack to get columns back
)

Output:

                     Column1          Column2                       Column3
A  Many Many Many Many Texts    Lots and lots  This keeps Just Keeps Going!
B               lots of text  Many Texts here         John and Johnson inc.
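The stack-based approach depends on `stack()` discarding the NaN cells before the strings are joined. As far as I know, the newer `stack` implementation (opt-in via `future_stack=True` from pandas 2.1) no longer drops them, so the following is a minimal sketch that adds an explicit `dropna()` to stay version-independent. The sample frame is a hypothetical reconstruction of the scraped table in the question, not the asker's actual data.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the scraped frame from the question
df = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan,
                    "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Forward-fill the row labels so wrapped rows inherit the label above them
df.index = df.index.to_series().ffill()

merged = (
    df.stack()                  # long format: one value per (row label, column)
      .dropna()                 # explicit, in case stack() kept the NaN cells
      .groupby(level=[0, 1])    # group by (row label, column)
      .agg(" ".join)            # join the text fragments in their original order
      .unstack(level=1)         # back to one column per original column
)
print(merged)
```

On pandas versions where `stack()` already drops NaN, the extra `dropna()` is a no-op, so the result should match the output shown above.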

Answer 1 (score: 1)

Try this:

df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)


Out[1390]:
                              Column1          Column2  \
Unnamed: 0
A           Many Many Many Many Texts   Lots and lots
B                        lots of text  Many Texts here

                                  Column3
Unnamed: 0
A           This keeps Just Keeps Going!
B                   John and Johnson inc.
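One detail the printed output hides: filling NaN with `''` before joining leaves a trailing space wherever a wrapped row had an empty cell (for example `'Lots and lots '` in Column2 of row A). A possible variant, sketched here on a hypothetical reconstruction of the question's table, drops the missing cells inside the aggregation instead. The names `labels`, `padded`, and `merged` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the scraped frame from the question
df = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan,
                    "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Forward-filled row labels as a plain array, so no index alignment is involved
labels = df.index.to_series().ffill().to_numpy()

# Approach from the answer: empty strings take part in the join
padded = df.fillna("").groupby(labels).agg(" ".join)

# Variant: skip missing cells, so no stray separators are produced
merged = df.groupby(labels).agg(lambda col: " ".join(col.dropna()))

print(repr(padded.loc["A", "Column2"]))
print(repr(merged.loc["A", "Column2"]))
```

If the trailing whitespace is harmless for downstream use, the one-liner in the answer is simpler; alternatively, stripping each column afterwards with `.str.strip()` should have a similar effect.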

Answer 2 (score: 0)

I think you can group directly on the forward-filled (ffill) index, and then use agg instead of transform.

import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print(df)
     a  b
A    a  1
NaN  b  2
NaN  c  3
B    d  4
C    e  5
NaN  f  6
# then group by the forward-filled index and aggregate
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print(new_df)
       a      b
A  a b c  1 2 3
B      d      4
C    e f    5 6
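To illustrate why agg is the better fit here than transform: transform returns one row per input row (so the joined text is repeated and has to be de-duplicated afterwards, as in the question), while agg returns one row per group. A small sketch using the same dummy data; the variable names `key`, `via_transform`, and `via_agg` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": list("abcdef"), "b": list("123456")},
                  index=["A", np.nan, np.nan, "B", "C", np.nan])

# Forward-filled row labels used as the grouping key
key = pd.Series(df.index).ffill().to_numpy()

# transform broadcasts the joined string back onto every row of the group
via_transform = df.groupby(key).transform(lambda x: " ".join(x))

# agg collapses each group to a single row
via_agg = df.groupby(key).agg(lambda x: " ".join(x))

print(via_transform.shape)  # same number of rows as df
print(via_agg.shape)        # one row per distinct label
```

With agg there is nothing to de-duplicate, which removes the `reset_index` / `drop_duplicates` / `set_index` steps from the original solution.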