Regular expression to scrape a string between HTML tags in Python

Time: 2017-10-15 12:59:27

Tags: python regex web-scraping

I am trying to extract the price from https://finance.yahoo.com/quote/GOOG?ltr=1, specifically from this element:

<title>GOOG 989.68 1.85 0.19% : Alphabet Inc. - Yahoo Finance</title>

But my output does not contain the price 989.68. Instead, I get this:

['GOOG : Summary for Alphabet Inc. - Yahoo Finance']

Here is my code:

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")

htmltext = htmlfile.read()

pattern = re.compile('<title>(.*?)</title>')

price = pattern.findall(str(htmltext))
print(price)

5 answers:

Answer 0 (score: 2)

I did not see any stock information in <title></title>, but I was able to get it working with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://finance.yahoo.com/quote/GOOG?ltr=1')
soup = BeautifulSoup(page.content, 'html.parser')
container = soup.select_one('div#quote-header-info')

print(container.find('h1').text)

for ele in container.find_all('span'):
    print(ele.text)

which outputs:

GOOG - Alphabet Inc.
NasdaqGS - NasdaqGS Delayed Price. Currency in USD
989.68
+1.85 (+0.19%)
At close: 4:00PM EDT

I strongly advise against using the React ID (data-reactid) to find your elements, because it is very likely to change after a new version of the site is released. It is an internal ID used by the React framework. Also, in some browsers, React does not even expose react-id as an attribute and uses data-reactid instead.
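As a rough sketch of why anchoring on a stable container id tends to hold up better than anchoring on data-reactid, the snippet below parses two hypothetical versions of the header in which only the React ids differ. The markup strings and the HeaderSpans helper are made up for illustration, and the standard-library html.parser is used here instead of BeautifulSoup only so that the example runs without third-party packages.

```python
from html.parser import HTMLParser

# Hypothetical markup: the same header from two site releases,
# where only the data-reactid numbers differ.
release_a = ('<div id="quote-header-info"><h1 data-reactid="7">GOOG - Alphabet Inc.</h1>'
             '<span data-reactid="14">989.68</span></div>')
release_b = ('<div id="quote-header-info"><h1 data-reactid="9">GOOG - Alphabet Inc.</h1>'
             '<span data-reactid="35">989.68</span></div>')


class HeaderSpans(HTMLParser):
    """Collect the text of every <span> inside div#quote-header-info."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while inside the target div
        self.in_span = False  # True while inside a span of interest
        self.spans = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            # Enter the target div, or track a nested div inside it.
            if self.div_depth or dict(attrs).get("id") == "quote-header-info":
                self.div_depth += 1
        elif tag == "span" and self.div_depth:
            self.in_span = True
            self.spans.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.spans[-1] += data


def header_spans(html):
    """Return the span texts found inside the quote header container."""
    parser = HeaderSpans()
    parser.feed(html)
    parser.close()
    return parser.spans


spans_a = header_spans(release_a)
spans_b = header_spans(release_b)
print(spans_a, spans_b)
```

Both releases yield the same list because the lookup never touches data-reactid. A lookup keyed on data-reactid="14" would have matched only the first string.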

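Answer 1 (score: 1)

The title does not actually contain the price. Go to the page source and see for yourself. It would be much simpler if you just used BeautifulSoup instead of re:
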
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/GOOG'

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

# Use this to look at the source code
# print(soup.prettify())

# Here is the exact tag of the span containing the price, 
# not sure if it'll be the same every time
for span in soup.find_all('span', attrs={'class': 'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'}):
    price = span.text
    break

print(price)

989.68

# Here is a more generic attribute for the span. Its value can change as well,
# but it's a simpler change. The price is in the first span matching this,
# so a break makes sure you get the correct one
for span in soup.find_all('span', attrs={'data-reactid': '14'}):
    price = span.text
    break

print(price)

989.68

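Answer 2 (score: 0)

I went through the HTML source of the page URL you mentioned. As you said, the price is loaded into the title with the help of JavaScript. If you inspect the HTML source, you can see the script before the title tag. Whenever you make a request to a website from a script, you only get the HTML code back as the response. A Python script does not execute JavaScript, so the price is never loaded into the title. I suggest using the requests library to make the request, since it has more advanced features (see the requests docs). Like the others, I would use BeautifulSoup to parse the HTML; it is easy to understand (see the BeautifulSoup docs). Use the lxml parser. If you follow these suggestions in your script, your code should be: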

import requests
from bs4 import BeautifulSoup
url = "https://finance.yahoo.com/quote/GOOG?ltr=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
# The data-reactid value is specific to the page version and may change
price = soup.find("span", {"data-reactid": "35"}).text
print(price)

This should return the price as expected.
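To illustrate the point about JavaScript, here is a minimal sketch showing that the regular expression from the question is not the problem; the input is. The two HTML strings are hypothetical stand-ins built from the titles quoted in the question (one as served to a script, one as rendered in a browser), and price_from_title is a helper name invented for this example.

```python
import re

# Title as served in the raw HTML (what urllib or requests receive),
# taken from the output shown in the question.
served_html = ("<html><head><title>GOOG : Summary for Alphabet Inc. "
               "- Yahoo Finance</title></head></html>")
# Title as seen in the browser after JavaScript has run, per the question.
rendered_html = ("<html><head><title>GOOG 989.68 1.85 0.19% : Alphabet Inc. "
                 "- Yahoo Finance</title></head></html>")

TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
# Ticker symbol, whitespace, then a decimal number followed by whitespace.
PRICE_RE = re.compile(r"^\S+\s+(\d+(?:\.\d+)?)\s")


def price_from_title(html):
    """Return the price found in the <title>, or None if it has no price."""
    title_match = TITLE_RE.search(html)
    if title_match is None:
        return None
    price_match = PRICE_RE.match(title_match.group(1))
    return float(price_match.group(1)) if price_match else None


served_price = price_from_title(served_html)
rendered_price = price_from_title(rendered_html)
print(served_price)    # None
print(rendered_price)  # 989.68
```

The same pattern finds the price only in the rendered title, which is why a plain HTTP request cannot get it from the title.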

Answer 3 (score: 0)

You can get the desired items with a regular expression. Here is the code.

import re
import urllib.request

# Python 3: urlopen lives in urllib.request, and read() returns bytes
htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG")
htmltext = htmlfile.read().decode("utf-8", errors="replace")

# for the title
pattern = re.compile(r'<title>(.*?)</title>')
title = pattern.findall(htmltext)
print('title:', title[0])

# regularMarketPrice
pattern = re.compile(r'"regularMarketPrice":\{"raw":(.*?),')
regularMarketPrice = pattern.findall(htmltext)
print('regularMarketPrice:', regularMarketPrice[0])

# regularMarketChange
pattern = re.compile(r'"regularMarketChange":\{"raw":(.*?),')
regularMarketChange = pattern.findall(htmltext)
print('regularMarketChange:', regularMarketChange[0])

# regularMarketChangePercent
pattern = re.compile(r'"regularMarketChangePercent":\{"raw":(.*?),')
regularMarketChangePercent = pattern.findall(htmltext)
print('regularMarketChangePercent:', regularMarketChangePercent[0])  # x100 to get percent

# for close time
pattern = re.compile(r'<span data-reactid="21">At close:(.*?)</span>')
at_close = pattern.findall(htmltext)
print('At close:', at_close[0])

Output:

title: GOOG : Summary for Alphabet Inc. - Yahoo Finance
regularMarketPrice: 989.68
regularMarketChange: 1.8499756
regularMarketChangePercent: 0.0018727671
At close:   4:00PM EDT
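The captured values are strings, so a small follow-up step is needed to turn them into numbers and into the display form seen on the page. The sketch below assumes a JSON fragment shaped to fit the patterns in this answer; the sample string (including its "fmt" fields) and the raw_value helper are hypothetical and only reuse the numbers from the output above.

```python
import re

# Hypothetical fragment of the embedded JSON, shaped to fit the patterns
# used in this answer. The "fmt" fields are an assumption.
sample = (
    '"regularMarketPrice":{"raw":989.68,"fmt":"989.68"},'
    '"regularMarketChange":{"raw":1.8499756,"fmt":"1.85"},'
    '"regularMarketChangePercent":{"raw":0.0018727671,"fmt":"0.19%"}'
)


def raw_value(field, text):
    """Return the "raw" number stored under field as a float, or None."""
    pattern = re.compile(r'"%s":\{"raw":(.*?),' % re.escape(field))
    match = pattern.search(text)
    return float(match.group(1)) if match else None


price = raw_value("regularMarketPrice", sample)
change = raw_value("regularMarketChange", sample)
change_pct = raw_value("regularMarketChangePercent", sample)

# The percent field is a fraction, so the % format multiplies it by 100.
summary = "{:.2f} {:+.2f} ({:+.2%})".format(price, change, change_pct)
print(summary)  # 989.68 +1.85 (+0.19%)
```

Returning None for a missing field avoids the IndexError that findall(...)[0] would raise if the page layout changes.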

Answer 4 (score: 0)

You can do it like this to get the desired output without using a regular expression:

import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://finance.yahoo.com/quote/GOOG?ltr=1').text, 'lxml')
for item in soup.select("div#quote-header-info"):
    title = item.select("h1")[0].text
    price = [elem.text for elem in item.select("span")[1:3]]
    print("Name: {}\nClosing Status: {}".format(title,' '.join(price)))

Result:

Name: GOOG - Alphabet Inc.
Closing Status: 989.68 +1.85 (+0.19%)