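I need to download and process a weather file from the Australian Bureau of Meteorology. So far the following Python works well; it extracts and cleans the data exactly the way I want: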
import pandas as pd
df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#', skiprows=3, na_values=-9999.0, quotechar='"', skipfooter=1, names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns', 'rain', 'prob'], header=0, converters={'stn': str})
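The problem is that the file is overwritten every day, and the metadata giving the date and time the forecast was generated sits in comment fields on the first two lines, i.e. the file contains data like this: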
# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
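Is it possible to use pandas to include the first two lines in the result? Ideally by adding date and time columns to the result, with the values 20131111 and 06 in every row of the output.
Regards, Dave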
Answer 0 (score: 1)
Are the first two lines always a date and a time? If so, I'd suggest parsing them separately and handing the rest of the stream to read_csv.
import urllib2  # Python 2 only; on Python 3 use urllib.request instead
import pandas as pd
url = "ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat"
In [29]: r = urllib2.urlopen(url)
In [30]: date = next(r).strip('# date=').rstrip()
In [31]: time = next(r).strip('# time=').rstrip()
In [32]: stamp = pd.to_datetime(date + ' ' + time)
In [33]: stamp
Out[33]: Timestamp('2013-11-12 00:00:00', tz=None)
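One caveat about the two `strip` calls: `str.strip('# date=')` treats its argument as a set of characters to remove from both ends, not as a prefix, so it only works here because the values consist of digits. A slightly more defensive sketch, written for Python 3, splits on `=` instead and gives `pd.to_datetime` an explicit format. The helper name `parse_header_value` is made up for illustration, and the sample lines are copied from the question.

```python
import pandas as pd


def parse_header_value(line, key):
    """Return the value from a '# key=value' comment line (hypothetical helper)."""
    name, sep, value = line.lstrip('#').partition('=')
    if not sep or name.strip() != key:
        raise ValueError('expected a "# %s=..." line, got %r' % (key, line))
    return value.strip()


# Sample header lines copied from the question.
date = parse_header_value('# date=20131111\n', 'date')
time = parse_header_value('# time=06\n', 'time')

# An explicit format avoids relying on how the date parser guesses.
stamp = pd.to_datetime(date + time, format='%Y%m%d%H')
print(stamp)
```

Raising on an unexpected line means a change in the file layout shows up as an error rather than as a silently wrong timestamp.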
Then read the file using your code (I changed skiprows to 1):
In [34]: df = pd.read_csv("ftp://ftp.bom.gov.au/anon/gen/fwo/IDY02122.dat", comment='#',
                          skiprows=1, na_values=-9999.0, quotechar='"', skipfooter=1,
                          names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns',
                                 'rain', 'prob'], header=0, converters={'stn': str})
In [43]: df['timestamp'] = stamp
In [44]: df.head()
Out[44]:
stn per evap amax amin gmin suns rain prob timestamp
0 001006 0 NaN 39.9 NaN NaN NaN 2.9 100.0 2013-11-12 00:00:00
1 001006 1 NaN 35.8 25.8 NaN NaN 7.0 100.0 2013-11-12 00:00:00
2 001006 2 NaN 37.0 25.5 NaN NaN 4.0 71.4 2013-11-12 00:00:00
3 001006 3 NaN 39.0 26.0 NaN NaN 1.0 60.0 2013-11-12 00:00:00
4 001006 4 NaN 41.2 26.1 NaN NaN 0.0 40.0 2013-11-12 00:00:00
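The session above opens the URL twice, once for the metadata and once for the table. As a rough Python 3 sketch of doing both from a single stream, the example below consumes the two comment lines and then passes the same file-like object to `read_csv`. It makes a few assumptions: the sample text is copied from the question and held in an `io.StringIO` in place of the downloaded file, the sample has no trailer line (the real file apparently has one, which the question drops with `skipfooter=1`), and the extra `date` and `time` columns are kept as strings so that `06` retains its leading zero.

```python
import io

import pandas as pd

# Sample content copied from the question; in practice this would be the
# decoded text of the downloaded file.
SAMPLE = '''# date=20131111
# time=06
[fcst_DB]
stn[7] , per, evap, amax, amin, gmin, suns, rain, prob
"001006", 0,-9999.0, 39.9,-9999.0,-9999.0,-9999.0, 4.0, 100.0
"001006", 1,-9999.0, 39.4, 26.5,-9999.0,-9999.0, 6.0, 100.0
"001006", 2,-9999.0, 35.5, 26.2,-9999.0,-9999.0, 7.0, 100.0
'''

stream = io.StringIO(SAMPLE)

# Consume the two metadata lines; the stream is then positioned at "[fcst_DB]".
date = next(stream).split('=', 1)[1].strip()
time = next(stream).split('=', 1)[1].strip()
stamp = pd.to_datetime(date + time, format='%Y%m%d%H')

# skiprows=2 drops "[fcst_DB]" and the file's own header row,
# because names= supplies the column labels.
df = pd.read_csv(
    stream,
    skiprows=2,
    header=None,
    names=['stn', 'per', 'evap', 'amax', 'amin', 'gmin', 'suns', 'rain', 'prob'],
    na_values=[-9999.0],
    skipinitialspace=True,
    converters={'stn': str},
)

# Scalars are broadcast to every row.
df['date'] = date
df['time'] = time
df['timestamp'] = stamp
print(df)
```

For the real file, the `StringIO` would be replaced by whatever text stream the download produces, and the trailer line would still need to be dropped, for example with `skipfooter=1` and `engine='python'`.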