Question

I wrote a Python script to fetch the latest stock price from Yahoo Finance.

import urllib.request
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");

htmltext = htmlfile.read();

price = re.findall(b'<span id="yfs_l84_goog">(.+?)</span>',htmltext);
print(price);

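It runs fine, but when I print the price, the output looks like this: [b'1,217.04']

This is probably a small problem, but I'm new to Python scripting, so please help me if you can.
I want to get rid of the 'b'. If I remove the 'b' from b'<span id="yfs_l84_goog">', this error is shown: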

File "C:\Python33\lib\re.py", line 201, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object

I want the output to be just

1,217.04

Answer 1

b'' is the syntax for a bytes literal in Python. It is how you define a byte sequence in Python source code.
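To make the difference concrete, here is a minimal sketch comparing a bytes object with the equivalent str. The variable names `raw` and `text` are only illustrative.

```python
raw = b'1,217.04'    # bytes literal: a sequence of byte values
text = '1,217.04'    # str literal: a sequence of Unicode characters

print(type(raw).__name__)            # bytes
print(type(text).__name__)           # str
print(raw == text)                   # False: bytes and str never compare equal
print(raw.decode('ascii') == text)   # True once the bytes are decoded

# Printing a list shows the repr() of each element, which is where the b'' comes from.
print([raw])
```

The `b` prefix is part of how Python displays the object, not part of the data itself.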

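What you see in the output is the representation of the single bytes object inside the price list returned by re.findall(). You can decode it to a string and print it: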

>>> for item in price:
...     print(item.decode()) # assume utf-8
... 
1,217.04

You can also write the bytes directly to stdout, e.g. sys.stdout.buffer.write(price[0]).
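Another option is to decode the whole page once and then search it with an ordinary str pattern, which also avoids the TypeError from the question. The sketch below shows both routes side by side. It assumes the page is UTF-8 and uses a hard-coded byte string (`sample_page`, a hypothetical stand-in for the live response) so that it runs without network access.

```python
import re

# Hypothetical stand-in for htmlfile.read(); a real HTTP response body is also bytes.
sample_page = b'<html><span id="yfs_l84_goog">1,217.04</span></html>'

# Route 1: bytes pattern on bytes data, then decode each match.
byte_matches = re.findall(rb'<span id="yfs_l84_goog">(.+?)</span>', sample_page)
prices_from_bytes = [m.decode('utf-8') for m in byte_matches]

# Route 2: decode once (UTF-8 is an assumption), then use a str pattern.
page_text = sample_page.decode('utf-8')
prices_from_text = re.findall(r'<span id="yfs_l84_goog">(.+?)</span>', page_text)

print(prices_from_bytes[0])
print(prices_from_text[0])
```

The pattern and the data must be the same type: both bytes or both str. Mixing them is what raises "can't use a string pattern on a bytes-like object".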

You could use an HTML parser instead of a regex to parse HTML:

#!/usr/bin/env python3
from html.parser import HTMLParser
from urllib.request import urlopen

url = 'http://finance.yahoo.com/q?s=GOOG'

def is_price_tag(tag, attrs):
    return tag == 'span' and dict(attrs).get('id') == 'yfs_l84_goog'

class Parser(HTMLParser):
    """Extract tag's text content from html."""
    def __init__(self, html, starttag_callback):
        HTMLParser.__init__(self)
        self.contents = []
        self.intag = None
        self.starttag_callback = starttag_callback
        self.feed(html)

    def handle_starttag(self, tag, attrs):
        self.intag = self.starttag_callback(tag, attrs)
    def handle_endtag(self, tag):
        self.intag = False
    def handle_data(self, data):
        if self.intag:
            self.contents.append(data)

# download and convert to Unicode
response = urlopen(url)
# the cgi module was removed in Python 3.13, so read the charset from the headers directly
# (falling back to utf-8 is an assumption for responses that declare no charset)
charset = response.headers.get_content_charset() or 'utf-8'
html = response.read().decode(charset)

# parse html (extract text from the price tag)
content = Parser(html, is_price_tag).contents[0]
print(content)

Check whether Yahoo provides an API that does not require web scraping.

Answer 2

After searching for a while, I found a solution. It works fine for me.

import urllib.request 
import re

htmlfile = urllib.request.urlopen("http://finance.yahoo.com/q?s=GOOG");

htmltext = htmlfile.read();

pattern = re.compile('<span id="yfs_l84_goog">(.+?)</span>');

price = pattern.findall(str(htmltext));
print(price);
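A caveat about str(htmltext): calling str() on a bytes object without an encoding does not decode it. It returns the repr() text, including the b'' wrapper and backslash escapes. The regex above still matches because the price is plain ASCII, but non-ASCII characters and newlines come out as escape sequences. A minimal sketch of the difference, using a made-up sample value and assuming UTF-8:

```python
# Made-up sample containing a non-ASCII character (the euro sign) and a newline.
raw = 'price: 1,217.04 \u20ac\n'.encode('utf-8')

as_repr = str(raw)              # repr text: starts with b' and escapes non-ASCII bytes
as_text = raw.decode('utf-8')   # real decoding back to the original characters

print(as_repr)
print(ascii(as_text))           # ascii() keeps the output printable on any console
```

Passing the result of htmltext.decode(...) to findall() instead of str(htmltext) gives the regex the actual page text.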

How do I display the correct output when using re.findall() in Python?
