I am trying to load time series data from a web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm) into a pandas DataFrame using Python 2.7. Could someone help me write the code? Thanks!
This is what I have tried so far:
html =urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm");
text= html.read();
df=pd.DataFrame(index=datum, columns=['m_ta','m_tax','m_taxd', 'm_tan','m_tand'])
But it does not give me anything. I would like the table to be displayed as it appears on the page.
Answer 0 (score: 1)
You can use BeautifulSoup to parse all the font tags, then split column a into separate columns, call set_index on column idx, and use rename_axis with None to remove the index name:
import urllib

import pandas as pd
from bs4 import BeautifulSoup

html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
soup = BeautifulSoup(html)
# find every <font> tag on the page
fontTags = soup.findAll('font')
# get the text of each font tag
li = [x.text for x in fontTags]
# skip the first 13 tags, which do not contain the data rows
df = pd.DataFrame(li[13:], columns=['a'])
# split each row on runs of whitespace into separate columns
df = df.a.str.split(r'\s+', expand=True)
# set column names
df.columns = ['idx', 'm_ta', 'm_tax', 'm_taxd', 'm_tan', 'm_tand']
# convert column idx to a monthly period
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')
# convert the two date columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
# use column idx as the index and remove the index name
df = df.set_index('idx').rename_axis(None)
print df
m_ta m_tax m_taxd m_tan m_tand
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
1902-07 20.2 25.2 1902-07-01 15.5 1902-07-03
1902-08 21.1 25.4 1902-08-07 14.7 1902-08-13
1902-09 16.1 23.8 1902-09-05 9.5 1902-09-24
1902-10 10.8 15.4 1902-10-12 4.9 1902-10-25
1902-11 2.4 9.1 1902-11-01 -4.2 1902-11-18
1902-12 -3.1 7.2 1902-12-27 -17.6 1902-12-15
1903-01 -0.5 8.3 1903-01-11 -11.5 1903-01-23
1903-02 4.6 13.4 1903-02-23 -2.7 1903-02-17
1903-03 9.0 16.1 1903-03-28 4.9 1903-03-09
1903-04 9.0 16.5 1903-04-29 2.6 1903-04-19
1903-05 16.4 21.2 1903-05-03 11.3 1903-05-19
1903-06 19.0 23.1 1903-06-03 15.6 1903-06-07
... ... ... ... ... ...
1998-07 22.5 30.7 1998-07-23 15.0 1998-07-09
1998-08 22.3 30.5 1998-08-03 14.8 1998-08-29
1998-09 16.0 21.0 1998-09-12 10.4 1998-09-14
1998-10 11.9 17.2 1998-10-07 8.2 1998-10-27
1998-11 3.8 8.4 1998-11-05 -1.6 1998-11-21
1998-12 -1.6 6.2 1998-12-14 -8.2 1998-12-26
1999-01 0.6 4.7 1999-01-15 -4.8 1999-01-31
1999-02 1.5 6.9 1999-02-05 -4.8 1999-02-01
1999-03 8.2 15.5 1999-03-31 3.0 1999-03-16
1999-04 13.1 17.1 1999-04-16 6.1 1999-04-18
1999-05 17.2 25.2 1999-05-31 11.1 1999-05-06
1999-06 19.8 24.4 1999-06-07 12.2 1999-06-22
1999-07 22.3 28.0 1999-07-06 16.3 1999-07-23
1999-08 20.6 26.7 1999-08-09 17.3 1999-08-23
1999-09 19.3 22.9 1999-09-26 15.0 1999-09-02
1999-10 11.5 19.0 1999-10-03 5.7 1999-10-18
1999-11 3.9 12.6 1999-11-04 -2.2 1999-11-21
1999-12 1.3 6.4 1999-12-13 -8.1 1999-12-25
2000-01 -0.7 8.7 2000-01-31 -6.6 2000-01-25
2000-02 4.5 10.2 2000-02-01 -0.1 2000-02-23
2000-03 6.7 11.6 2000-03-09 0.6 2000-03-17
2000-04 14.8 22.1 2000-04-21 5.8 2000-04-09
2000-05 18.7 23.9 2000-05-27 12.3 2000-05-22
2000-06 21.9 29.3 2000-06-14 15.4 2000-06-17
2000-07 20.3 26.6 2000-07-03 14.0 2000-07-16
2000-08 23.8 29.7 2000-08-20 18.5 2000-08-31
2000-09 16.1 21.5 2000-09-14 12.7 2000-09-24
2000-10 14.1 18.7 2000-10-04 8.0 2000-10-23
2000-11 9.0 14.9 2000-11-15 3.7 2000-11-30
2000-12 3.0 9.4 2000-12-14 -6.8 2000-12-24
[1200 rows x 5 columns]
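If you want to try the reshaping steps without fetching the page, here is a minimal offline sketch. It is not part of the original answer. It is written for Python 3 (the answer above targets Python 2.7), and the `raw_rows` list is a hypothetical stand-in for the font tag text, with values copied from the first rows of the output above. The exact text format on the page is an assumption here. The `pd.to_numeric` step is also my addition: after `str.split` every column holds strings, so the temperature columns need converting before you can do arithmetic on them.

```python
import pandas as pd

# Hypothetical sample standing in for the text of the <font> tags.
# Values are copied from the output above; the raw format is assumed.
raw_rows = [
    "1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10",
    "1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15",
    "1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01",
]

# one string per row in a single column, as in the answer
df = pd.DataFrame(raw_rows, columns=['a'])

# split each string on runs of whitespace into separate columns
df = df['a'].str.split(r'\s+', expand=True)
df.columns = ['idx', 'm_ta', 'm_tax', 'm_taxd', 'm_tan', 'm_tand']

# month labels become monthly periods
df['idx'] = pd.to_datetime(df['idx'], format='%Y-%m').dt.to_period('M')

# the two date columns become datetimes
for col in ['m_taxd', 'm_tand']:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d')

# addition to the answer: the temperature columns are still strings
# after the split, so convert them to floats
for col in ['m_ta', 'm_tax', 'm_tan']:
    df[col] = pd.to_numeric(df[col])

# use idx as the index and drop the index name
df = df.set_index('idx').rename_axis(None)
print(df)
```

With the numeric conversion in place, calls such as `df['m_ta'].mean()` work directly on the result.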