Getting data from a web page into a DataFrame

Time: 2016-03-29 09:21:51

Tags: python-2.7 pandas dataframe

I am trying to load time series data from a web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm) into a pandas DataFrame in Python 2.7. Can someone help me write the code? Thanks!

This is the code I tried:

html =urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm");
text= html.read();
df=pd.DataFrame(index=datum, columns=['m_ta','m_tax','m_taxd', 'm_tan','m_tand'])

But it does not return anything. I would like the table to be displayed as it appears on the page.
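As a side note on why the attempt above shows no values: calling pd.DataFrame with only index and columns builds an empty frame, and the downloaded text is never passed in. A minimal sketch, assuming a hypothetical three-month `datum` index, since the question does not define `datum`:

```python
import pandas as pd

# Hypothetical stand-in for the question's undefined `datum` index.
datum = pd.period_range('1901-01', periods=3, freq='M')

df = pd.DataFrame(index=datum,
                  columns=['m_ta', 'm_tax', 'm_taxd', 'm_tan', 'm_tand'])

# No data argument was supplied, so every cell is NaN.
print(df)
print(df.isna().all().all())
```

The frame has the right shape and labels, but the page content still has to be parsed and supplied as data.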

1 answer:

Answer 0 (score: 1)

You can use BeautifulSoup to parse all the font tags, then split column a, call set_index on column idx, and use rename_axis(None) to remove the index name.

import pandas as pd
import urllib
from bs4 import BeautifulSoup

html = urllib.urlopen("http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm")
soup = BeautifulSoup(html)
#print soup

#collect every font tag once
fontTags = soup.findAll('font')
#print fontTags

#get the text from the font tags
li = [x.text for x in fontTags]

#drop the first 13 tags, which do not contain the data rows
df = pd.DataFrame(li[13:], columns=['a'])

#split each row on arbitrary whitespace
df = df.a.str.split(r'\s+', expand=True)

#set column names
df.columns = ['idx','m_ta','m_tax','m_taxd', 'm_tan','m_tand']

#convert column idx to period
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')

#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])

#set column idx to index, remove index name
df = df.set_index('idx').rename_axis(None)
print df

         m_ta m_tax     m_taxd  m_tan     m_tand
1901-01  -4.7   5.0 1901-01-23  -12.2 1901-01-10
1901-02  -2.1   3.5 1901-02-06   -7.9 1901-02-15
1901-03   5.8  13.5 1901-03-20    0.6 1901-03-01
1901-04  11.6  18.2 1901-04-10    7.4 1901-04-23
1901-05  16.8  22.5 1901-05-31   12.2 1901-05-05
1901-06  21.0  24.8 1901-06-03   14.6 1901-06-17
1901-07  22.4  27.4 1901-07-30   16.9 1901-07-04
1901-08  20.7  25.9 1901-08-01   14.7 1901-08-29
1901-09  15.9  19.9 1901-09-01   11.8 1901-09-09
1901-10  12.6  17.9 1901-10-04    8.3 1901-10-31
1901-11   4.7  11.1 1901-11-14   -0.2 1901-11-26
1901-12   4.2   8.4 1901-12-22   -1.4 1901-12-07
1902-01   3.4   7.5 1902-01-25   -2.2 1902-01-15
1902-02   2.8   6.6 1902-02-09   -2.8 1902-02-06
1902-03   5.3  13.3 1902-03-22   -3.5 1902-03-13
1902-04  10.5  15.8 1902-04-21    6.1 1902-04-08
1902-05  12.5  20.6 1902-05-31    8.5 1902-05-10
1902-06  18.5  23.8 1902-06-30   14.4 1902-06-19
1902-07  20.2  25.2 1902-07-01   15.5 1902-07-03
1902-08  21.1  25.4 1902-08-07   14.7 1902-08-13
1902-09  16.1  23.8 1902-09-05    9.5 1902-09-24
1902-10  10.8  15.4 1902-10-12    4.9 1902-10-25
1902-11   2.4   9.1 1902-11-01   -4.2 1902-11-18
1902-12  -3.1   7.2 1902-12-27  -17.6 1902-12-15
1903-01  -0.5   8.3 1903-01-11  -11.5 1903-01-23
1903-02   4.6  13.4 1903-02-23   -2.7 1903-02-17
1903-03   9.0  16.1 1903-03-28    4.9 1903-03-09
1903-04   9.0  16.5 1903-04-29    2.6 1903-04-19
1903-05  16.4  21.2 1903-05-03   11.3 1903-05-19
1903-06  19.0  23.1 1903-06-03   15.6 1903-06-07
...       ...   ...        ...    ...        ...
1998-07  22.5  30.7 1998-07-23   15.0 1998-07-09
1998-08  22.3  30.5 1998-08-03   14.8 1998-08-29
1998-09  16.0  21.0 1998-09-12   10.4 1998-09-14
1998-10  11.9  17.2 1998-10-07    8.2 1998-10-27
1998-11   3.8   8.4 1998-11-05   -1.6 1998-11-21
1998-12  -1.6   6.2 1998-12-14   -8.2 1998-12-26
1999-01   0.6   4.7 1999-01-15   -4.8 1999-01-31
1999-02   1.5   6.9 1999-02-05   -4.8 1999-02-01
1999-03   8.2  15.5 1999-03-31    3.0 1999-03-16
1999-04  13.1  17.1 1999-04-16    6.1 1999-04-18
1999-05  17.2  25.2 1999-05-31   11.1 1999-05-06
1999-06  19.8  24.4 1999-06-07   12.2 1999-06-22
1999-07  22.3  28.0 1999-07-06   16.3 1999-07-23
1999-08  20.6  26.7 1999-08-09   17.3 1999-08-23
1999-09  19.3  22.9 1999-09-26   15.0 1999-09-02
1999-10  11.5  19.0 1999-10-03    5.7 1999-10-18
1999-11   3.9  12.6 1999-11-04   -2.2 1999-11-21
1999-12   1.3   6.4 1999-12-13   -8.1 1999-12-25
2000-01  -0.7   8.7 2000-01-31   -6.6 2000-01-25
2000-02   4.5  10.2 2000-02-01   -0.1 2000-02-23
2000-03   6.7  11.6 2000-03-09    0.6 2000-03-17
2000-04  14.8  22.1 2000-04-21    5.8 2000-04-09
2000-05  18.7  23.9 2000-05-27   12.3 2000-05-22
2000-06  21.9  29.3 2000-06-14   15.4 2000-06-17
2000-07  20.3  26.6 2000-07-03   14.0 2000-07-16
2000-08  23.8  29.7 2000-08-20   18.5 2000-08-31
2000-09  16.1  21.5 2000-09-14   12.7 2000-09-24
2000-10  14.1  18.7 2000-10-04    8.0 2000-10-23
2000-11   9.0  14.9 2000-11-15    3.7 2000-11-30
2000-12   3.0   9.4 2000-12-14   -6.8 2000-12-24

[1200 rows x 5 columns]
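
The split, set_index and rename_axis steps can also be tried without network access. The following is only a sketch: the sample strings are hypothetical, modeled on the printed output above rather than on the raw page text, and the pd.to_numeric step is an addition that is not part of the answer's code (after the split, the temperature columns hold strings).

```python
import pandas as pd

# Hypothetical sample rows modeled on the printed output above;
# the raw text of the page may be formatted differently.
rows = [
    "1901-01  -4.7   5.0 1901-01-23  -12.2 1901-01-10",
    "1901-02  -2.1   3.5 1901-02-06   -7.9 1901-02-15",
    "1901-03   5.8  13.5 1901-03-20    0.6 1901-03-01",
]

df = pd.DataFrame(rows, columns=['a'])

# split each row on arbitrary whitespace into separate columns
df = df['a'].str.split(r'\s+', expand=True)
df.columns = ['idx', 'm_ta', 'm_tax', 'm_taxd', 'm_tan', 'm_tand']

# month label -> monthly period, date columns -> datetime
df['idx'] = pd.to_datetime(df['idx']).dt.to_period('M')
for col in ['m_taxd', 'm_tand']:
    df[col] = pd.to_datetime(df[col])

# added step (not in the answer): temperatures from strings to floats
for col in ['m_ta', 'm_tax', 'm_tan']:
    df[col] = pd.to_numeric(df[col])

# use idx as the index and drop the index name
df = df.set_index('idx').rename_axis(None)
print(df.dtypes)
```

With numeric columns, operations such as df['m_ta'].mean() work directly instead of failing on strings.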