使用pandas从评论字段中提取元数据

时间:2013-11-12 03:48:07

标签: python pandas

我需要下载并处理澳大利亚气象局气象文件。到目前为止,下面的Python运行良好,它正是我想要的提取和清理数据

import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#', skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1, names=['stn', 'per', 'evap', 'amax',   'amin',   'gmin',   'suns',   'rain',   'prob'], header=0, converters={'stn': str})

问题是文件每天被覆盖,并且指示生成预测的日期和时间的元数据位于前两行的注释字段中,即文件包含以下数据

# date=20131111
# time=06
[fcst_DB]
stn[7]  , per,   evap,   amax,   amin,   gmin,   suns,   rain,   prob
"001006",   0,-9999.0,   39.9,-9999.0,-9999.0,-9999.0,    4.0,  100.0
"001006",   1,-9999.0,   39.4,   26.5,-9999.0,-9999.0,    6.0,  100.0
"001006",   2,-9999.0,   35.5,   26.2,-9999.0,-9999.0,    7.0,  100.0

是否可以使用pandas在结果中包含前两行。理想情况下,通过向结果添加日期和时间列,并为输出中的每一行使用值20131111和06.

此致 戴夫

1 个答案:

答案 0 :(得分:1)

前两行总是一个日期和时间吗?在这种情况下,我建议单独解析它们并将其余的流交给read_csv。

import urllib2
r = urllib2.urlopen(url)

In [29]: r = urllib2.urlopen(url)

In [30]: date = next(r).strip('# date=').rstrip()

In [31]: time = next(r).strip('# time=').rstrip()

In [32]: stamp = pd.to_datetime(x + ' ' + time)

In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)

然后使用您的代码阅读(我将skiprows更改为1)

In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
             skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
             names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
                    'rain',   'prob'], header=0, converters={'stn': str})

In [43]: df['timestamp'] = stamp

In [44]: df.head()
Out[44]: 
      stn  per  evap  amax  amin  gmin  suns  rain   prob           timestamp
0  001006    0   NaN  39.9   NaN   NaN   NaN   2.9  100.0 2013-11-12 00:00:00
1  001006    1   NaN  35.8  25.8   NaN   NaN   7.0  100.0 2013-11-12 00:00:00
2  001006    2   NaN  37.0  25.5   NaN   NaN   4.0   71.4 2013-11-12 00:00:00
3  001006    3   NaN  39.0  26.0   NaN   NaN   1.0   60.0 2013-11-12 00:00:00
4  001006    4   NaN  41.2  26.1   NaN   NaN   0.0   40.0 2013-11-12 00:00:00