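Pandas data cleaning

Time: 2018-07-31 15:53:36

Tags: python-3.x pandas dataframe pdf

So I am reading tables from a PDF into a pandas DataFrame, but I am fairly new to pandas, and finding my way through the documentation has been quite daunting. I'm sure there is a fairly simple way to do what I need, but I just don't know how.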

          0                    1           2        3                4                5       6       7                      8              9        10               11          12   13
0        NaN                 col0        col1     col2             col3             col4    col5    col6                   col7           col8     col9            col10       col11  NaN
1        NaN             Location        Date      NaN              NaN              NaN     NaN     NaN                    NaN            NaN      NaN              NaN         NaN  NaN
2        NaN             measure1         1**     40**             30**             20**      20  0.02**                    3**           10**      5**            100**        15**  NaN
3        NaN             measure2         100      400              300              200     200       2                    300            100       50            1,000         150  NaN
4        NaN            location1   1/15/1994     5900            28000             7600   25000     150                    ---            ---      ---              ---         ---  ---
5        NaN                  NaN   3/16/1994     4900            12000             4400   11000      60                    ---            ---      ---              ---         ---  ---
6        NaN                  NaN    1/4/1995        1                1                1       1       8                    ---            ---      ---              ---         ---  ---
7        NaN                  NaN   4/12/2004     8400            34000             4600   17000   <1000                    ---            ---      ---              ---         ---  ---
8        NaN                  NaN   7/28/2008     3200            15400             4430   17100  172  I                    ---            ---      ---              ---         ---  ---
9        NaN                  NaN   5/19/2011     2000            11000             2500    9200  0.2  1                    ---            ---      ---              ---         ---  ---
10       NaN                  NaN    8/6/2013     2700            20000             5300   20000    2  6                    ---            ---      ---              ---         ---  ---
11       NaN                  NaN  11/13/2013     2600            14000             5400   20000  0.1  3                    ---            ---      ---              ---         ---  ---
12       NaN                  NaN    2/5/2014     3200            19000             6400   25000   18  0                    ---            ---      ---              ---         ---  ---
13       NaN                  NaN    5/7/2014     2000            15000             4100   16000   22  0                    ---            ---      ---              ---         ---  ---
14       NaN                  NaN  12/18/2014     2500            32000             5200   20000    8  8                    ---            ---      ---              ---         ---  ---
15       NaN                  NaN    6/4/2015     1700            15000             5200   21000   44  0                    ---            ---      ---              ---         ---  ---
16       NaN                  NaN   1/20/2017     1400           15,000            6,300  21,000    1  2                    ---            ---      ---              ---         ---  ---
17       NaN            location2   1/15/1994      210              290               39     180      69                    ---            ---      ---              ---         ---  ---
18       NaN                  NaN   3/24/1994     1500            12000             4100   18000  400  0                    ---            ---      ---              ---         ---  ---
19       NaN                  NaN    1/4/1995        1                1                1       1       8                    ---            ---      ---              ---         ---  ---
20       NaN                  NaN    2/1/2000    <1000             8900             5200   58000  <10000                    ---            ---      ---              ---         ---  ---
21       NaN                  NaN   4/12/2004     <5.0               42               78     540     150                    ---            ---      ---              ---         ---  ---
22       NaN                  NaN   7/28/2008     23.3             27.9               28     409    9.34                    ---            ---      ---              ---         ---  ---
23       NaN                  NaN   5/19/2011      1.8               12               22     170  0.2  1                    ---            ---      ---              ---         ---  ---
24       NaN                  NaN    8/6/2013      4.3               23               71     590  0.1  3                    ---            ---      ---              ---         ---  ---
25       NaN                  NaN   1/19/2017   0.21 I           0.26 I              7.7      42  0.2  4                    ---            ---      ---              ---         ---  ---
26       NaN            location3   3/21/1994       <1               <1               <1      <1      <8                    ---            ---      ---              ---         ---  ---
27  2/1/2000                   <1          <1       <1               <2              <10     ---     ---                    ---            ---      ---              ---         NaN  NaN

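So there are three main issues I need to deal with.

First: the last row does not line up with the others. I need to shift all the values in that misaligned row two columns to the right so that the date lines up. This also means the first column should not exist.

Second: because of the awkward way these tables are laid out in the PDF, a few other things got messed up. The date column should contain only dates. I need to somehow shift every row that does not show a date in the "Date" column over by one column, so that the date column holds nothing but dates.

Last: the location NaNs. All the NaN values under each location actually belong to that same location, so I need to fill those in somehow.

So my desired output looks something like this...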

          0                 1           2        3                4                5       6       7                      8              9        10               11          12      13
0       
1                     Location        Date     col1             col2             col3    col4    col5                   col6           col7     col8             col9       col10    col11
2                     measure1         NaN      1**             40**             30**    20**      20                 0.02**            3**     10**              5**       100**     15**
3                     measure2         NaN      100              400              300     200     200                      2            300      100               50       1,000     150
4                    location1   1/15/1994     5900            28000             7600   25000     150                    ---            ---      ---              ---         ---     ---
5                    location1   3/16/1994     4900            12000             4400   11000      60                    ---            ---      ---              ---         ---     ---
6                    location1    1/4/1995        1                1                1       1       8                    ---            ---      ---              ---         ---     ---
7                    location1   4/12/2004     8400            34000             4600   17000   <1000                    ---            ---      ---              ---         ---     ---
8                    location1   7/28/2008     3200            15400             4430   17100  172  I                    ---            ---      ---              ---         ---     ---
9                    location1   5/19/2011     2000            11000             2500    9200  0.2  1                    ---            ---      ---              ---         ---     ---
10                   location1    8/6/2013     2700            20000             5300   20000    2  6                    ---            ---      ---              ---         ---     ---
11                   location1  11/13/2013     2600            14000             5400   20000  0.1  3                    ---            ---      ---              ---         ---     ---
12                   location1    2/5/2014     3200            19000             6400   25000   18  0                    ---            ---      ---              ---         ---     ---
13                   location1    5/7/2014     2000            15000             4100   16000   22  0                    ---            ---      ---              ---         ---     ---
14                   location1  12/18/2014     2500            32000             5200   20000    8  8                    ---            ---      ---              ---         ---     ---
15                   location1    6/4/2015     1700            15000             5200   21000   44  0                    ---            ---      ---              ---         ---     ---
16                   location1   1/20/2017     1400           15,000            6,300  21,000    1  2                    ---            ---      ---              ---         ---     ---
17                   location2   1/15/1994      210              290               39     180      69                    ---            ---      ---              ---         ---     ---
18                   location2   3/24/1994     1500            12000             4100   18000  400  0                    ---            ---      ---              ---         ---     ---
19                   location2    1/4/1995        1                1                1       1       8                    ---            ---      ---              ---         ---     ---
20                   location2    2/1/2000    <1000             8900             5200   58000  <10000                    ---            ---      ---              ---         ---     ---
21                   location2   4/12/2004     <5.0               42               78     540     150                    ---            ---      ---              ---         ---     ---
22                   location2   7/28/2008     23.3             27.9               28     409    9.34                    ---            ---      ---              ---         ---     ---
23                   location2   5/19/2011      1.8               12               22     170  0.2  1                    ---            ---      ---              ---         ---     ---
24                   location2    8/6/2013      4.3               23               71     590  0.1  3                    ---            ---      ---              ---         ---     ---
25                   location2   1/19/2017   0.21 I           0.26 I              7.7      42  0.2  4                    ---            ---      ---              ---         ---     ---
26                   location3   3/21/1994       <1               <1               <1      <1      <8                    ---            ---      ---              ---         ---     ---
27                   location3    2/1/2000       <1               <1               <1      <2     <10                    ---            ---      ---              ---         ---     ---

1 answer:

Answer 0 (score: 0)

For the first issue, you can try the following:

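# Work on the transpose so the misaligned last row becomes the last column.
df = df.T
# Shift that row two positions to the right, as the question asks.
df.iloc[:, -1] = df.iloc[:, -1].shift(2)
df = df.T
# Column 0 is now empty in every row, so drop it.
df = df.drop(df.columns[0], axis=1)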
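If more than one row might be misaligned, the same idea can be applied by mask instead of by position. The following is only a minimal sketch on a made-up toy frame: it assumes the column labels are integers (as in the printed frame) and that any row with a value in column 0 is misaligned by exactly two columns, which is an assumption drawn from the sample data and not something the PDF reader guarantees.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout in the question: column 0 is empty except
# in the misaligned row, whose values start two columns too far left.
df = pd.DataFrame(
    [
        [np.nan, "location3", "3/21/1994", "<1", "<8"],
        ["2/1/2000", "<1", "<10", np.nan, np.nan],
    ],
    dtype=object,
)

# Assumption: a non-null value in column 0 marks a misaligned row.
misaligned = df[0].notna()

# Shift only those rows two positions to the right along the column axis.
df.loc[misaligned] = df.loc[misaligned].shift(2, axis=1)

# Column 0 is now empty everywhere and can be dropped.
df = df.drop(columns=0)
print(df)
```

Note that values shifted past the last column are discarded, so this only works when the misaligned rows have at least two empty cells at their right-hand end, as the last row in the question does.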

For the last point:

# Forward-fill the location column; use df['1'] instead if the labels are strings.
df[1] = df[1].ffill()
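The second issue (rows whose date column does not hold a date) is not covered above. One possible approach, shown here on a made-up toy frame, is to test which cells parse as dates and shift the remaining rows one column to the right. This is a sketch under several assumptions: integer column labels, column 1 holding the location and column 2 the date, dates written as month/day/year, and the header rows ("col0 ...", "Location Date") already handled separately, since they would fail the date test too.

```python
import numpy as np
import pandas as pd

# Toy frame after the first fix: column 1 = location, column 2 = date.
df = pd.DataFrame(
    [
        ["measure1", "1**", "40**", np.nan],
        ["location1", "1/15/1994", "5900", "28000"],
        [np.nan, "3/16/1994", "4900", "12000"],
    ],
    columns=[1, 2, 3, 4],
    dtype=object,
)

# Fill the blank location cells from the last location seen above them.
df[1] = df[1].ffill()

# Treat a cell as a date only if it parses as month/day/year.
is_date = pd.to_datetime(df[2], format="%m/%d/%Y", errors="coerce").notna()

# For rows without a date, move everything from the date column onward one
# column to the right, leaving the date cell empty (NaN).
value_cols = df.columns[1:]
df.loc[~is_date, value_cols] = df.loc[~is_date, value_cols].shift(1, axis=1)
print(df)
```

As with the row shift, the rightmost value of a shifted row is dropped, so this relies on the last column being empty in those rows, which is the case for the measure1 and measure2 rows in the question.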