Question

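Currently, I have a dataframe built from a PDF file that was converted to CSV format. The PDF consists of 4 pages, and all of them are in a single dataframe.

So my goal is to split the dataframe according to page_num.

For example: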

page_num  word_num    left    top  width  text
1          1           322     14   14     My
1          2           304     4    41     Name
1          3           322     5    9      is
1          4           316     14   20     Raghav
2          1           420     129  34     Problem 
2          2           420     31   27     just
2          3           420     159  27     got
2          4           431     2    38     complicated
3          1           322     14   14     #40
3          2           304     4    41     @gmail.com   
3          1           420     129  34     2019 
3          2           420     31   27     January

So I want to use the pandas library to split the dataframe (df) into 3 dataframes (df1, df2, df3).

Thanks!

Answer 1

You can combine groupby with operator.itemgetter:

from operator import itemgetter
df1, df2, df3 = map(itemgetter(1), df.groupby('page_num'))

Note that groupby uses sort=True by default, so you can rely on the groups being returned in the order 1, 2, 3.
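To make the ordering point concrete, here is a minimal self-contained sketch. It assumes a small hand-built frame standing in for the CSV-derived df (only three of the question's columns, with the rows deliberately out of order), so that the unpacking order can be seen to follow the sorted page_num values rather than the row order.

```python
from operator import itemgetter

import pandas as pd

# Assumption: a hand-built stand-in for the question's df, rows shuffled on purpose
df = pd.DataFrame({
    "page_num": [3, 1, 2, 1, 3, 2],
    "word_num": [1, 1, 1, 2, 2, 2],
    "text": ["#40", "My", "Problem", "Name", "@gmail.com", "just"],
})

# groupby yields (key, sub-frame) pairs; itemgetter(1) keeps only the sub-frame.
# With the default sort=True the pairs arrive in ascending key order.
df1, df2, df3 = map(itemgetter(1), df.groupby("page_num"))

print(df1["text"].tolist())  # ['My', 'Name']
print(df3["text"].tolist())  # ['#40', '@gmail.com']
```

Unpacking into exactly three names only works when page_num has exactly three distinct values; with any other count Python raises a ValueError.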

For an arbitrary number of dataframes, see Splitting dataframe into multiple dataframes: in that case a list or dict is more appropriate.
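As a rough illustration of the dict variant, the sketch below keys each sub-frame by its page number. The sample frame and the name pages are assumptions made for this example, not part of the original question.

```python
import pandas as pd

# Assumption: a hand-built stand-in for the question's df
df = pd.DataFrame({
    "page_num": [1, 1, 2, 2, 3, 3],
    "word_num": [1, 2, 1, 2, 1, 2],
    "text": ["My", "Name", "Problem", "just", "#40", "@gmail.com"],
})

# One entry per distinct page_num, however many pages there are
pages = {page: frame for page, frame in df.groupby("page_num")}

# Look up a single page by its number
print(pages[2])
```

This avoids hard-coding df1, df2, df3, so the same code keeps working if the PDF has a different number of pages.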

Answer 2

You can use loc to access specific rows and/or columns:

df1 = df.loc[df['page_num'] == 1]
df2 = df.loc[df['page_num'] == 2]
df3 = df.loc[df['page_num'] == 3]

Output:

   page_num  word_num  left  top  width    text
0         1         1   322   14     14      My
1         1         2   304    4     41    Name
2         1         3   322    5      9      is
3         1         4   316   14     20  Raghav
   page_num  word_num  left  top  width         text
4         2         1   420  129     34      Problem
5         2         2   420   31     27         just
6         2         3   420  159     27          got
7         2         4   431    2     38  complicated
    page_num  word_num  left  top  width         text
8          3         1   322   14     14          #40
9          3         2   304    4     41   @gmail.com
10         3         1   420  129     34         2019
11         3         2   420   31     27      January

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
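The output above shows that each piece keeps the row labels of the original frame (4 to 7 for page 2, 8 to 11 for page 3). If a fresh 0-based index is preferred, reset_index(drop=True) can be chained on. A minimal sketch, assuming a small hand-built frame in place of the real df:

```python
import pandas as pd

# Assumption: a hand-built stand-in for the question's df
df = pd.DataFrame({
    "page_num": [1, 1, 2, 2, 3, 3],
    "word_num": [1, 2, 1, 2, 1, 2],
    "text": ["My", "Name", "Problem", "just", "#40", "@gmail.com"],
})

# Boolean mask: True for the rows that belong to page 2
mask = df["page_num"] == 2

# loc with a mask keeps the original row labels
df2 = df.loc[mask]
print(df2.index.tolist())  # [2, 3]

# drop=True discards the old labels instead of adding them as a column
df2_fresh = df2.reset_index(drop=True)
print(df2_fresh.index.tolist())  # [0, 1]
```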

How do I split my dataframe into separate dataframes?