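Extracting data with web scraping

Time: 2019-05-22 10:23:31

Tags: web-scraping

Can someone guide me on how to extract the data from this particular table? I have tried several times but have not managed to extract the data I need.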

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://etfdb.com/etf/ICLN/#fact-sheet', proxies = proxy_support).text
soup = bs(r, 'html.parser')
da = soup.find_all('ul', {'class': 'list-unstyled'})[0]
n_rows = 0
n_columns = 0
column_names = []

for row in da.find_all('li'):
    td_tags = row.find('span')
    if len(td_tags) > 0:
        n_rows += 1
        if n_columns == 0:
            n_columns = len(td_tags)

    th_tags = row.find_all('a href')
    if len(th_tags) > 0 and len(column_names) == 0:
        for th in th_tags:
            column_names.append(th.get_text())

if len(column_names) > 0 and len(column_names) != n_columns:
    raise Exception("Column titles do not match the number of columns")

columns = column_names if len(column_names) > 0 else range(0, n_columns)

df = pd.DataFrame(columns = columns, index = range(0, n_rows))

row_marker = 0

for row in da.find_all('li'):
    column_marker = 0
    columns = row.find_all('span')
    for column in columns:
        df.iat[row_marker, column_marker] = columns.get_text()
        column_marker += 1
    if len(columns) > 0:
        row_marker += 1

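For the code above, I get the following error:

AttributeError: ResultSet object has no attribute 'get_text'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?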

Can someone tell me what I am doing wrong?

1 answer:

Answer 0 (score: 1)

Using bs4 4.7.1, this gets the first table:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://etfdb.com/etf/ICLN/#fact-sheet')
soup = bs(r.content, 'lxml')
items = soup.select('h3:contains(Vitals) + ul li')

for item in items:
    print([i.text for i in item.select('span')])
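
If a DataFrame is the goal, as in the question, the per-item lists can be collected into rows instead of printed. The following is only a sketch: `SAMPLE_HTML` is a hypothetical fragment standing in for the real page (its markup and values are assumptions), the helper name `list_to_frame` and the column names `field` and `value` are made up for illustration, and it assumes each useful `li` holds exactly two `span` elements.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the real page; markup and values are assumed.
SAMPLE_HTML = """
<h3>Vitals</h3>
<ul class="list-unstyled">
  <li><span>Issuer</span><span>Example Issuer</span></li>
  <li><span>Structure</span><span>ETF</span></li>
</ul>
"""

def list_to_frame(html):
    """Turn the <li> items of the first <ul> after an <h3> into a two-column frame."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select("h3 + ul li")
    # One list of span texts per <li>.
    rows = [[span.get_text(strip=True) for span in item.select("span")]
            for item in items]
    # Assumption: only label/value pairs are wanted, so other rows are dropped.
    rows = [row for row in rows if len(row) == 2]
    return pd.DataFrame(rows, columns=["field", "value"])

df = list_to_frame(SAMPLE_HTML)
print(df)
```

On the live page the same function could be fed `r.content`, although whether every item there really has two spans would need checking.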

With earlier versions of bs4:

items = soup.select_one('h3 + ul').select('li')

for item in items:
    print([i.text for i in item.select('span')])
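
As for the error itself: in the question's last loop, `row.find_all('span')` returns a ResultSet (a list of tags), and `get_text()` is then called on that list (`columns`) rather than on the loop variable (`column`). Note also that `find('span')` returns a single tag (or `None`), so `len()` on it does not count columns, and `'a href'` is not a tag name, so that `find_all` call matches nothing. A minimal sketch of the difference, using a hypothetical one-row fragment (the markup is an assumption, not the real page):

```python
from bs4 import BeautifulSoup

# Hypothetical one-row fragment; the real page's markup may differ.
SAMPLE_HTML = "<ul><li><span>Issuer</span><span>Example Issuer</span></li></ul>"

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = soup.find("li")
columns = row.find_all("span")  # ResultSet: a list of Tag objects

try:
    columns.get_text()          # same mistake as in the question
    failed = False
except AttributeError:
    failed = True               # the list itself has no get_text()

# Calling get_text() on each Tag in the list works.
texts = [column.get_text() for column in columns]
print(failed, texts)
```

In the original loop, changing `columns.get_text()` to `column.get_text()` should therefore remove this particular AttributeError, though the row and column counting above it would likely still need the fixes mentioned.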