How do I extract stock price information from the given HTML?

Asked: 2014-10-30 18:13:57

Tags: python web-scraping beautifulsoup findall

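I am trying to retrieve stock information from Yahoo Finance. I have figured out how to use re.findall to collect the prices into a list. If a ticker symbol/price does not exist, I have found a way to get back ['No such ticker symbol']. My problem is that I need the prices and the 'No such ticker symbol' entries in the same list, in order. Here is my code so far. Is it possible to give findall() two patterns so that it puts the results in one list?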

import urllib.request
import re

# Read the portfolio file, drop the first 26 lines and any blank lines, then sort.
li = [i.strip().split() for i in open("Portfolio.txt").readlines()]
li[0:26] = []
li = [x for x in li if x]
li.sort()


def retrieve_page(url):
    # Download the page and decode the response bytes to text.
    my_socket = urllib.request.urlopen(url)
    dta = my_socket.read().decode("utf-8", errors="replace")
    my_socket.close()
    # Two separate searches, so the results end up in two separate lists.
    price = re.findall(r'<td class="col-price cell-raw:(.*?)"><span', dta)
    noprice = re.findall(r'<span class ="no-symbol">(.*?):<strong>', dta)
    print(price)
    print(noprice)

retrieve_page("http://finance.yahoo.com/quotes/AAPL,GOOG,HWP,IBM,MSFT")

My output is as follows:

['107.120003', '552.25', '164.478699', '46.0938']
['No such ticker symbol']

1 Answer:

Answer 0 (score: 3)

If it were me, I would avoid parsing HTML with a regular expression and use BeautifulSoup instead:

import requests
from bs4 import BeautifulSoup

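def retrieve_page(url):
    dta = requests.get(url).text
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(dta, "html.parser")
    # Match either class in one pass; find_all returns matches in document order.
    price = soup.find_all(class_=["col-price", "invalid-symbol"])
    # Keep only the first text string inside each matched element.
    price = [next(x.strings) for x in price]
    # Strip the ': ' left on the "No such ticker symbol" message.
    price = [x.replace(': ','') for x in price]
    print(price)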

retrieve_page("http://finance.yahoo.com/quotes/AAPL,GOOG,HWP,IBM,MSFT")

Result:

['106.54', '547.45', 'No such ticker symbol', '163.86', '45.86']
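
As for the literal question of whether findall() can take two patterns: it takes one pattern, but that pattern can join the two alternatives with `|`. The sketch below is only an illustration of that idea, not a recommendation over BeautifulSoup. SAMPLE_HTML, COMBINED and extract_in_order are hypothetical names, and the HTML fragment is made up to fit the two regular expressions from the question; the real Yahoo Finance markup may differ.

```python
import re

# Hypothetical HTML fragment shaped to fit the two patterns in the question.
# This is an assumption for illustration, not the real page markup.
SAMPLE_HTML = (
    '<td class="col-price cell-raw:107.120003"><span>107.12</span></td>'
    '<td class="col-price cell-raw:552.25"><span>552.25</span></td>'
    '<span class ="no-symbol">No such ticker symbol:<strong>HWP</strong></span>'
    '<td class="col-price cell-raw:164.478699"><span>164.48</span></td>'
)

# One pattern with two alternatives, each with its own capture group.
COMBINED = re.compile(
    r'<td class="col-price cell-raw:(.*?)"><span'
    r'|<span class ="no-symbol">(.*?):<strong>'
)


def extract_in_order(html):
    """Return prices and 'no symbol' messages in the order they appear."""
    # With two groups, findall yields one (price, message) tuple per match.
    # The group that did not take part in the match is an empty string,
    # so `or` picks whichever alternative actually matched.
    return [price or message for price, message in COMBINED.findall(html)]


print(extract_in_order(SAMPLE_HTML))
```

Because the regex engine scans the text left to right, the matches come back in document order, which is what keeps the error message in the right position among the prices.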