Trying to extract a table under a div element using BeautifulSoup

Time: 2017-06-20 06:57:21

Tags: web-scraping beautifulsoup html5lib

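I'm new to bs4 and I'm trying to extract a price table.

The main problem I'm facing is that in the HTML page the table is not marked up as a table element but as a div. I tried looking it up by class and by id, but I can't get the prices.

This is what I tried: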

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.valoreazioni.com/indici/ftse-mib_ftsemib_mi"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html5lib")

These are the filters I applied to get the price table, without success:

# Attempt 1: look up the container by tag and id
table = soup.find('div', {'id': 'maidMoneyTable'})
# Attempt 2: look it up by id only
# table = soup.find(id='maidMoneyTable')

route = pd.read_html(str(table), flavor='html5lib')

print(route)

In both cases, the result is that no tables were found.

Can anyone tell me how to get the table I need?

1 Answer:

Answer 0: (score: 0)

Use BeautifulSoup to scrape the data from the page, store it temporarily in a sqlite3 table, then use pandas' SQL support to read the data from sqlite3 into a pandas DataFrame.
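A likely reason pd.read_html reports no tables is that it only parses <table> elements, while this container appears to be built from div and li tags. The following is a minimal offline check, not a definitive diagnosis. The markup is hypothetical, modeled on the selectors used in this answer, and the real page may differ; html.parser is used only to avoid extra parser dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled on the selectors used in this answer;
# the structure of the real page may differ.
SAMPLE_HTML = """
<div id="maidMoneyTable">
  <ul>
    <li class="order">
      <a href="http://www.valoreazioni.com/titoli/a2a-a2a-mi">
        <ul><li>A2A.MI</li><li>A2A SpA</li><li>1.50</li></ul>
      </a>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
container = soup.find(id="maidMoneyTable")

# No <table> descendants, so a table-based reader has nothing to pick up.
tables_found = container.find_all("table")
# The data rows are list items carrying the "order" class instead.
rows_found = container.find_all("li", attrs={"class": "order"})

print(len(tables_found), len(rows_found))  # prints "0 1"
```

If the live page behaves the same way, the rows have to be walked by hand rather than handed to pd.read_html.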

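>>> import sqlite3
>>> import bs4
>>> import pandas as pd
>>> import requests
>>> # Temporary in-memory database; the schema is inferred from the output below
>>> conn = sqlite3.connect(':memory:')
>>> c = conn.cursor()
>>> result = c.execute('CREATE TABLE market (url TEXT, symbol TEXT, name TEXT, item_1 REAL, item_2 REAL, item_3 REAL, item_4 TEXT)')
>>> page = requests.get('http://www.valoreazioni.com/indici/ftse-mib_ftsemib_mi').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> # find() returns a single element; find_all() would return a result list that cannot be searched further
>>> maidMoneyTable = soup.find(id='maidMoneyTable')
>>> table_rows = maidMoneyTable.find_all('li', attrs={'class': 'order'})
>>> for row in table_rows:
...     link = row.find('a')
...     # One record per row: the link target followed by the text of each cell
...     data = [link.attrs['href']] + [_.text for _ in link.find_all('li')]
...     result = c.execute('''INSERT INTO market VALUES (?,?,?,?,?,?,?)''', data)
... 
>>> conn.commit()
>>> df = pd.read_sql_query('SELECT * FROM market', conn)
>>> df.head()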
                                                 url   symbol  \
0      http://www.valoreazioni.com/titoli/a2a-a2a-mi   A2A.MI   
1  http://www.valoreazioni.com/titoli/anima-holdi...  ANIM.MI   
2  http://www.valoreazioni.com/titoli/atlantia-at...   ATL.MI   
3  http://www.valoreazioni.com/titoli/azimut-hold...   AZM.MI   
4  http://www.valoreazioni.com/titoli/banca-medio...  BMED.MI   

                name  item_1  item_2  item_3   item_4  
0            A2A SpA    1.50   1.503   0.003  +0.200%  
1  ANIMA HOLDING SPA    6.26   6.210  -0.040   -0.64%  
2           ATLANTIA   25.96  25.640  -0.240   -0.93%  
3     AZIMUT HOLDING   17.94  17.930   0.060   +0.34%  
4   BANCA MEDIOLANUM    7.43   7.290  -0.150   -2.02%
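
If the temporary sqlite3 table is not needed, the same rows can be loaded straight into a DataFrame. This is a variant sketch, not the method used above. The markup is hypothetical (modeled on the selectors in this answer), the first sample row reuses values from the output above, the second row is a made-up placeholder, and extract_rows and COLUMNS are illustrative names.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the structure of the real page may differ.
# Row 1 reuses values from the output above, row 2 is a made-up placeholder.
SAMPLE_HTML = """
<div id="maidMoneyTable">
  <ul>
    <li class="order">
      <a href="http://www.valoreazioni.com/titoli/a2a-a2a-mi">
        <ul><li>A2A.MI</li><li>A2A SpA</li><li>1.50</li>
            <li>1.503</li><li>0.003</li><li>+0.200%</li></ul>
      </a>
    </li>
    <li class="order">
      <a href="http://example.com/titoli/placeholder">
        <ul><li>XMPL.MI</li><li>Example SpA</li><li>2.00</li>
            <li>2.010</li><li>0.010</li><li>+0.50%</li></ul>
      </a>
    </li>
  </ul>
</div>
"""

# Column names taken from the DataFrame printed above.
COLUMNS = ["url", "symbol", "name", "item_1", "item_2", "item_3", "item_4"]


def extract_rows(html):
    """Return one list per row: the link target followed by the cell texts."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find(id="maidMoneyTable")
    if container is None:
        return []  # container missing, e.g. the page layout changed
    rows = []
    for row in container.find_all("li", attrs={"class": "order"}):
        link = row.find("a")
        if link is None:
            continue  # skip rows without a link
        cells = [li.get_text(strip=True) for li in link.find_all("li")]
        rows.append([link.get("href")] + cells)
    return rows


df = pd.DataFrame(extract_rows(SAMPLE_HTML), columns=COLUMNS)
# The scraped cells are strings; convert the plain numeric columns.
for col in ["item_1", "item_2", "item_3"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df[["symbol", "item_1", "item_2", "item_4"]])
```

The percentage column is left as text here; stripping the sign and the % character before converting it is a separate step.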