Extracting text from a web page with BeautifulSoup

Time: 2018-11-20 17:24:10

Tags: python http beautifulsoup

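I am trying to use Python to extract some data from https://markets.cboe.com/europe/equities/market_share/index/all/

Specifically, I want the "Total Market Non-Displayed Volume" figure. I have tried several approaches with BeautifulSoup, but none of them gets me there.

Any ideas?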

2 answers:

Answer 0: (score: 0)

I would suggest giving the pandas HTML reader a chance:

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Grab the second table found
df = results[1]
# Set the first column as the index
df = df.set_index(0)
# Switch columns and indexes
df = df.T
# Drop any columns that have no data in them
df = df.dropna(how='all', axis=1)
# Set the column under "Displayed Price Venues" as the index
df = df.set_index('Displayed Price Venues')
# Switch columns and indexes again
df = df.T

# Aesthetic. Don't like having an index name myself!
# Assigning None is used here because `del df.index.name` can fail on newer pandas releases
df.index.name = None

# Separate the three subtables from each other!  
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]

You can also do this more compactly (same code, but without breaking out the individual steps):

import pandas as pd

# Read in all tables at this address as pandas dataframes
results = pd.read_html('https://markets.cboe.com/europe/equities/market_share/index/all')

# Do all the stuff above in one go
df = results[1].set_index(0).T.dropna(how='all',axis=1).set_index('Displayed Price Venues').T

# Aesthetic. Don't like having an index name myself!
# Assigning None is used here because `del df.index.name` can fail on newer pandas releases
df.index.name = None

# Separate the three subtables from each other!  
displayed = df.iloc[0:18]
non_displayed = df.iloc[18:-1]
total = df.iloc[-1]
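
To see what the set_index / transpose / dropna chain does without hitting the network, here is a minimal sketch on a small synthetic table. The venue names, column headers and numbers are made up for illustration and are not taken from the Cboe page; only the sequence of pandas calls mirrors the code above. It also looks rows up by label with `.loc`, which may be less brittle than fixed positions such as `iloc[0:18]` if the number of venues changes.

```python
import pandas as pd

# Synthetic stand-in for results[1]: integer column labels, the real header
# sitting in the first row, and a section-label row that carries no data.
# All names and values here are hypothetical.
raw = pd.DataFrame([
    ["Displayed Price Venues", "Value Traded", "Market Share"],
    ["Venue A", "100", "50%"],
    ["Venue B", "60", "30%"],
    ["Non-Displayed Price Venues", None, None],
    ["Venue C", "40", "20%"],
    ["Total", "200", "100%"],
])

df = (
    raw.set_index(0)               # first column becomes the row labels
       .T                          # rows <-> columns
       .dropna(how="all", axis=1)  # drops the data-free section-label row
       .set_index("Displayed Price Venues")  # promote the header row
       .T                          # back to one row per venue
)
df.index.name = None
df.columns.name = None

# Label-based lookup instead of positional slicing
total_value = df.loc["Total", "Value Traded"]
print(df)
print(total_value)
```

The values stay as strings at this point; converting them to numbers is a separate step.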

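Answer 1: (score: 0)

The problem is that the id keeps changing; otherwise I would just use … rather than …. Assuming the value under "Output" is exactly what you are looking for, this should also work as long as the content does not change or shift.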

from bs4 import BeautifulSoup as bs
import requests

url = 'https://markets.cboe.com/europe/equities/market_share/index/all/'
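page = requests.get(url)
page.raise_for_status()  # stop early on an HTTP error
html = bs(page.text, 'lxml')
# Collect every <td class="idx_val"> cell (find_all is the current name for findAll)
total_volume = html.find_all('td', class_='idx_val')
# Position 645 depends on the page layout at the time of writing
print(total_volume[645].text)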

Output:
€4,378,517,621
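
Since the hard-coded position 645 breaks as soon as the page gains or loses a cell, one possible alternative is to find the row by its label text and read the value cell from that row. The sketch below runs on an inline HTML fragment: the row layout and the label "Total Non-Displayed" are assumptions for illustration, and only the `idx_val` class and the sample figure come from the answer above. `value_for_label` and `euros_to_int` are hypothetical helper names, and the real page's markup would need to be inspected before reusing this.

```python
from bs4 import BeautifulSoup

# Made-up fragment; the structure of the real page may differ.
SAMPLE_HTML = """
<table>
  <tr><td>Venue A</td><td class="idx_val">€100</td></tr>
  <tr><td>Total Non-Displayed</td><td class="idx_val">€4,378,517,621</td></tr>
</table>
"""


def value_for_label(html, label):
    """Return the text of the 'idx_val' cell in the row whose first cell equals label."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if cells and cells[0].get_text(strip=True) == label:
            value_cell = row.find("td", class_="idx_val")
            return value_cell.get_text(strip=True) if value_cell else None
    return None  # label not present


def euros_to_int(text):
    """Convert a string such as '€4,378,517,621' to an integer."""
    return int(text.replace("€", "").replace(",", "").strip())


raw_value = value_for_label(SAMPLE_HTML, "Total Non-Displayed")
print(euros_to_int(raw_value))
```

The built-in "html.parser" is used here so the sketch does not depend on lxml; either parser should work for a fragment this simple.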