Python - How to retrieve certain text from a website

Time: 2018-02-25 15:33:26

Tags: python python-3.x beautifulsoup

I have the following code:

import requests
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import re

market = 'INDU:IND'
quote_page = 'http://www.bloomberg.com/quote/' + market

page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs={'class': 'name'})
name = name_box.text.strip()
print('Market: ' + name)

This code works and lets me get the market name from the URL. I am trying to do something similar with this website. Here is my code:

market = 'BTC-GBP'
quote_page = 'https://uk.finance.yahoo.com/quote/' + market
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('span', attrs={'class': 'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)'})
name = name_box.text.strip()
print('Market: ' + name)

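I'm not sure how to go about this. I want to retrieve the current rate, the amount it has gone up or down (both as a number and as a percentage), and finally when the information was last updated. How can I do this? I don't mind if you use a different method from the one I used above, as long as you explain it. If my code is inefficient or unpythonic, please also tell me how to fix that. I'm very new to web scraping and to these modules. Thanks!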

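1 answer:

Answer 0 (score: 0)

You can use BeautifulSoup and, when searching for the data you need, use regular expressions to match the dynamic span class names generated by the site's backend scripts: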

from bs4 import BeautifulSoup as soup
import requests
import re

data = requests.get('https://uk.finance.yahoo.com/quote/BTC-GBP').text
s = soup(data, 'lxml')
# Raw strings keep the regex backslashes intact; the pattern matches the price span or the change span
span_class = re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(\w+\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')
d = [i.text for i in s.find_all('span', {'class': span_class})]
# The "As of ..." timestamp is searched for in the raw HTML text
date_published = re.findall(r'As of\s+\d+:\d+[AP]M GMT\.', data)
final_results = dict(zip(['current', 'change', 'published'], d + date_published))

Output:

{'current': u'6,785.02', 'change': u'-202.99 (-2.90%)', 'published': u'As of  3:55PM GMT.'}
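The class-matching step can be tried without a network request. The following is a minimal sketch, not the page's real markup: `SAMPLE_HTML` is a hypothetical fragment that only imitates the class names quoted in the question, and `extract_quote_spans` is an illustrative helper name. It uses the built-in `html.parser` so that `lxml` is not required. The pattern contains spaces, so it relies on BeautifulSoup also testing a regex against the full space-joined class string, which is my understanding of how the class filter behaves.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating the class names from the question;
# it is not a copy of the real page.
SAMPLE_HTML = """
<div>
  <span class="Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)">6,785.02</span>
  <span class="Trsdu(0.3s) Fw(500) Fz(24px) C($dataRed)">-202.99 (-2.90%)</span>
  <span class="Fz(12px) C($c-fuji-grey-j)">unrelated text</span>
</div>
"""

# First alternative: the bold price span, with any transition time and font size.
# Second alternative: the change span, with a numeric weight and a "$data..." colour class.
SPAN_CLASS = re.compile(
    r'Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(\w+\)'
    r'|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)'
)


def extract_quote_spans(html):
    """Return the text of every span whose class attribute matches SPAN_CLASS."""
    parsed = BeautifulSoup(html, 'html.parser')
    return [span.get_text(strip=True)
            for span in parsed.find_all('span', class_=SPAN_CLASS)]


print(extract_quote_spans(SAMPLE_HTML))
```

Because these class names are generated by the site, any such pattern is likely to need adjusting whenever the page layout changes.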

Edit: given the new URL, you need to change the span class names:

data = requests.get('https://uk.finance.yahoo.com/quote/AAPL?p=AAPL').text
span_class = re.compile(r'Trsdu\(0\.\d+s\) Trsdu\(0\.\d+s\) Fw\(b\) Fz\(\d+px\) Mb\(-\d+px\) D\(b\)|Trsdu\(0\.\d+s\) Fw\(\d+\) Fz\(\d+px\) C\(\$data\w+\)')
spans = [i.text for i in soup(data, 'lxml').find_all('span', {'class': span_class})]
final_results = dict(zip(['current', 'change', 'published'], spans + re.findall(r'At close:\s+\d+:\d+PM EST', data)))

Output:

{'current': u'175.50', 'change': u'+3.00 (+1.74%)', 'published': u'At close:  4:00PM EST'}
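The question asks for the change as a number and as a percentage, while the scraped values are still strings. The sketch below is one possible way to post-process a result dict of the shape shown in the outputs above, using only the standard library. The names `CHANGE_RE`, `PUBLISHED_RE`, `to_number` and `parse_quote` are illustrative, and the patterns assume the two string formats seen in those outputs; other formats would raise `ValueError`.

```python
import re

# Matches change strings such as '-202.99 (-2.90%)' or '+3.00 (+1.74%)'.
CHANGE_RE = re.compile(r'([+-]?[\d,]+\.?\d*)\s*\(([+-]?[\d,]+\.?\d*)%\)')
# Matches 'As of  3:55PM GMT.' as well as 'At close:  4:00PM EST'.
PUBLISHED_RE = re.compile(r'(As of|At close:)\s+(\d{1,2}:\d{2}[AP]M)\s+([A-Z]{2,4})')


def to_number(text):
    """Convert '6,785.02' or '-2.90' to a float, dropping thousands separators."""
    return float(text.replace(',', ''))


def parse_quote(results):
    """Turn the scraped strings into numbers plus a split-out time and timezone."""
    change = CHANGE_RE.fullmatch(results['change'].strip())
    published = PUBLISHED_RE.search(results['published'])
    if change is None or published is None:
        raise ValueError('unexpected quote format: %r' % (results,))
    return {
        'current': to_number(results['current']),
        'change': to_number(change.group(1)),
        'change_percent': to_number(change.group(2)),
        'time': published.group(2),
        'timezone': published.group(3),
    }


btc = {'current': '6,785.02', 'change': '-202.99 (-2.90%)',
       'published': 'As of  3:55PM GMT.'}
print(parse_quote(btc))
```

Keeping the parsing separate from the scraping means the fragile part (the class-name patterns) can change without touching the numeric conversion.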