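So I'm reading tables from a PDF into a pandas DataFrame, but I'm still new to pandas and finding my way through the documentation has been daunting. I'm sure there's a fairly simple way to do what I need, but I just can't figure out how.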
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0 NaN col0 col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 NaN
1 NaN Location Date NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN measure1 1** 40** 30** 20** 20 0.02** 3** 10** 5** 100** 15** NaN
3 NaN measure2 100 400 300 200 200 2 300 100 50 1,000 150 NaN
4 NaN location1 1/15/1994 5900 28000 7600 25000 150 --- --- --- --- --- ---
5 NaN NaN 3/16/1994 4900 12000 4400 11000 60 --- --- --- --- --- ---
6 NaN NaN 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
7 NaN NaN 4/12/2004 8400 34000 4600 17000 <1000 --- --- --- --- --- ---
8 NaN NaN 7/28/2008 3200 15400 4430 17100 172 I --- --- --- --- --- ---
9 NaN NaN 5/19/2011 2000 11000 2500 9200 0.2 1 --- --- --- --- --- ---
10 NaN NaN 8/6/2013 2700 20000 5300 20000 2 6 --- --- --- --- --- ---
11 NaN NaN 11/13/2013 2600 14000 5400 20000 0.1 3 --- --- --- --- --- ---
12 NaN NaN 2/5/2014 3200 19000 6400 25000 18 0 --- --- --- --- --- ---
13 NaN NaN 5/7/2014 2000 15000 4100 16000 22 0 --- --- --- --- --- ---
14 NaN NaN 12/18/2014 2500 32000 5200 20000 8 8 --- --- --- --- --- ---
15 NaN NaN 6/4/2015 1700 15000 5200 21000 44 0 --- --- --- --- --- ---
16 NaN NaN 1/20/2017 1400 15,000 6,300 21,000 1 2 --- --- --- --- --- ---
17 NaN location2 1/15/1994 210 290 39 180 69 --- --- --- --- --- ---
18 NaN NaN 3/24/1994 1500 12000 4100 18000 400 0 --- --- --- --- --- ---
19 NaN NaN 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
20 NaN NaN 2/1/2000 <1000 8900 5200 58000 <10000 --- --- --- --- --- ---
21 NaN NaN 4/12/2004 <5.0 42 78 540 150 --- --- --- --- --- ---
22 NaN NaN 7/28/2008 23.3 27.9 28 409 9.34 --- --- --- --- --- ---
23 NaN NaN 5/19/2011 1.8 12 22 170 0.2 1 --- --- --- --- --- ---
24 NaN NaN 8/6/2013 4.3 23 71 590 0.1 3 --- --- --- --- --- ---
25 NaN NaN 1/19/2017 0.21 I 0.26 I 7.7 42 0.2 4 --- --- --- --- --- ---
26 NaN location3 3/21/1994 <1 <1 <1 <1 <8 --- --- --- --- --- ---
27 2/1/2000 <1 <1 <1 <2 <10 --- --- --- --- --- --- NaN NaN
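So there are three main problems I need to deal with.
First: the last row doesn't line up with the others. I need to shift all the values in that misaligned row two columns to the right so that the dates line up. This also means the first column shouldn't exist.
Second: because of the awkward way these tables are laid out in the PDF, a few other things got jumbled. The Date column should contain only dates. I need to somehow shift every row whose Date column contains neither the word "Date" nor an actual date over by one column.
Last: the location NaNs. All the NaN values under each location actually belong to that location, so I need to fill them in somehow.
So my desired output would look something like this...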
0 1 2 3 4 5 6 7 8 9 10 11 12 13
0
1 Location Date col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11
2 measure1 NaN 1** 40** 30** 20** 20 0.02** 3** 10** 5** 100** 15**
3 measure2 NaN 100 400 300 200 200 2 300 100 50 1,000 150
4 location1 1/15/1994 5900 28000 7600 25000 150 --- --- --- --- --- ---
5 location1 3/16/1994 4900 12000 4400 11000 60 --- --- --- --- --- ---
6 location1 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
7 location1 4/12/2004 8400 34000 4600 17000 <1000 --- --- --- --- --- ---
8 location1 7/28/2008 3200 15400 4430 17100 172 I --- --- --- --- --- ---
9 location1 5/19/2011 2000 11000 2500 9200 0.2 1 --- --- --- --- --- ---
10 location1 8/6/2013 2700 20000 5300 20000 2 6 --- --- --- --- --- ---
11 location1 11/13/2013 2600 14000 5400 20000 0.1 3 --- --- --- --- --- ---
12 location1 2/5/2014 3200 19000 6400 25000 18 0 --- --- --- --- --- ---
13 location1 5/7/2014 2000 15000 4100 16000 22 0 --- --- --- --- --- ---
14 location1 12/18/2014 2500 32000 5200 20000 8 8 --- --- --- --- --- ---
15 location1 6/4/2015 1700 15000 5200 21000 44 0 --- --- --- --- --- ---
16 location1 1/20/2017 1400 15,000 6,300 21,000 1 2 --- --- --- --- --- ---
17 location2 1/15/1994 210 290 39 180 69 --- --- --- --- --- ---
18 location2 3/24/1994 1500 12000 4100 18000 400 0 --- --- --- --- --- ---
19 location2 1/4/1995 1 1 1 1 8 --- --- --- --- --- ---
20 location2 2/1/2000 <1000 8900 5200 58000 <10000 --- --- --- --- --- ---
21 location2 4/12/2004 <5.0 42 78 540 150 --- --- --- --- --- ---
22 location2 7/28/2008 23.3 27.9 28 409 9.34 --- --- --- --- --- ---
23 location2 5/19/2011 1.8 12 22 170 0.2 1 --- --- --- --- --- ---
24 location2 8/6/2013 4.3 23 71 590 0.1 3 --- --- --- --- --- ---
25 location2 1/19/2017 0.21 I 0.26 I 7.7 42 0.2 4 --- --- --- --- --- ---
26 location3 3/21/1994 <1 <1 <1 <1 <8 --- --- --- --- --- ---
27 location3 2/1/2000 <1 <1 <1 <2 <10 --- --- --- --- --- ---
Answer 0 (score: 0)
For the first problem, you can try this:
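# Shift the last row two columns to the right so its date lands in the Date column
df.iloc[-1] = df.iloc[-1].shift(2)
# The first column is now entirely NaN, so drop it
df = df.drop(df.columns[0], axis=1)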
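To see the row shift in isolation, here is a minimal sketch on a made-up two-row frame. The frame and its values are a hypothetical miniature of the table in the question, and it assumes the columns carry integer labels and hold strings (object dtype), as a PDF extraction typically produces.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table: the last row starts two columns too early
df = pd.DataFrame(
    [
        [np.nan, "location3", "3/21/1994", "<1", "<8"],
        ["2/1/2000", "<1", "<10", np.nan, np.nan],
    ],
    dtype=object,
)

# Shift only the last row two positions to the right; vacated cells become NaN
df.iloc[-1] = df.iloc[-1].shift(2)

# The first column now holds nothing but NaN, so drop it
df = df.drop(columns=df.columns[0])

print(df)
```

The shift only works cleanly because the misaligned row ends in two NaN cells, so nothing falls off the right edge.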
And for the last one:
# After the drop, the location column is the first one; forward-fill its NaNs
df[df.columns[0]] = df[df.columns[0]].ffill()
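The second problem (measurement rows sitting in the Date column) is not covered above. One possible approach, sketched here on a made-up frame, is to flag rows whose Date cell does not parse as a month/day/year date and shift those rows one column to the right before forward-filling the locations. The column names `Location`, `Date`, `col1`, `col2` and the sample values are assumptions for illustration, as is the premise that every column is object dtype with a spare NaN column on the right.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table after the first column has been dropped
df = pd.DataFrame(
    [
        ["measure1", "1**", "40**", np.nan],
        ["location1", "1/15/1994", "5900", np.nan],
        [np.nan, "3/16/1994", "4900", np.nan],
        ["location2", "1/15/1994", "210", np.nan],
        [np.nan, "3/24/1994", "1500", np.nan],
    ],
    columns=["Location", "Date", "col1", "col2"],
    dtype=object,
)

# A Date cell that is present but does not parse as m/d/Y holds a measurement instead
parsed = pd.to_datetime(df["Date"], format="%m/%d/%Y", errors="coerce")
misplaced = parsed.isna() & df["Date"].notna()

# On those rows, shift everything from the Date column onward one position right
value_cols = df.columns[1:]
df.loc[misplaced, value_cols] = df.loc[misplaced, value_cols].shift(1, axis=1)

# Forward-fill each location name into the blank cells beneath it
df["Location"] = df["Location"].ffill()

print(df)
```

Passing an explicit `format` keeps bare numbers such as `100` from being interpreted as dates. The header row that literally says "Date" would also be flagged by this mask, so it needs to be excluded or handled separately in the real table.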