Question

I am using Python 3 to scrape a website and print a value. Here is the code:

import urllib.request
import re

url = "http://in.finance.yahoo.com/q?s=spy"  
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
regex = '<span id="yfs_l84_SPY">(.+?)</span>'
code = re.compile(regex)
price = re.findall(code,htext)
print (price)

When I run this snippet it prints an empty list, i.e. [], but I am expecting a value such as 483.33.

What am I doing wrong? Please help.

Answer 1

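I have to advise against using regular expressions to parse HTML, because HTML is not a regular language. Yes, you can get away with it here, but it is not a good habit to get into.

I think the biggest problem you have is that the actual id of the span you are looking for on that page is yfs_l84_spy. Note the case.

That said, here is a quick implementation in BeautifulSoup.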
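To make the case issue concrete, here is a minimal sketch that runs the question's pattern against a small hard-coded snippet. The `sample_html` string is an assumption standing in for the downloaded page, whose real markup may differ. It also shows `re.IGNORECASE` as one possible way to make the match independent of case.

```python
import re

# Hypothetical snippet standing in for the downloaded page; the real markup may differ.
sample_html = '<div><span id="yfs_l84_spy">176.12</span></div>'

# The pattern from the question uses an upper-case "SPY", so nothing matches.
upper_pattern = '<span id="yfs_l84_SPY">(.+?)</span>'
upper_matches = re.findall(upper_pattern, sample_html)

# Matching the id exactly as it appears in the page fixes it.
lower_pattern = '<span id="yfs_l84_spy">(.+?)</span>'
lower_matches = re.findall(lower_pattern, sample_html)

# Alternatively, re.IGNORECASE makes the match independent of case.
any_case_matches = re.findall(upper_pattern, sample_html, flags=re.IGNORECASE)

print(upper_matches)     # []
print(lower_matches)     # ['176.12']
print(any_case_matches)  # ['176.12']
```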

import urllib.request
from bs4 import BeautifulSoup

url = "http://in.finance.yahoo.com/q?s=spy"  
hfile = urllib.request.urlopen(url)
htext = hfile.read().decode('utf-8')
soup = BeautifulSoup(htext, "html.parser") #name the parser explicitly to avoid the "no parser specified" warning
soup.find('span',id="yfs_l84_spy")
Out[18]: <span id="yfs_l84_spy">176.12</span>

And to get that number:

found_tag = soup.find('span',id="yfs_l84_spy") #tag is a bs4 Tag object
found_tag.get_text() #get the text content of the tag
Out[36]: '176.12'
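If installing BeautifulSoup is not an option, the same idea (let a parser find the tag instead of a regex) can be sketched with the standard library's html.parser. This is only an illustration under assumptions: `SpanTextById` is a hypothetical helper, `sample_html` is a made-up stand-in for the page, and nested spans are not handled.

```python
from html.parser import HTMLParser


class SpanTextById(HTMLParser):
    """Collect the text inside the first <span> with a given id (hypothetical helper)."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.inside = False   # True while we are inside the target span
        self.done = False     # True once the target span has been closed
        self.chunks = []      # pieces of text found inside the target span

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if not self.done and tag == "span" and dict(attrs).get("id") == self.target_id:
            self.inside = True

    def handle_endtag(self, tag):
        # Simplification: the first </span> ends the capture (no nested spans).
        if self.inside and tag == "span":
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


# Hypothetical snippet standing in for the downloaded page.
sample_html = '<div><span id="yfs_l84_spy">176.12</span><span id="other">x</span></div>'

parser = SpanTextById("yfs_l84_spy")
parser.feed(sample_html)
parser.close()

price_text = "".join(parser.chunks)
print(price_text)  # 176.12
```

For real pages BeautifulSoup is usually the more convenient choice; the sketch only shows that the lookup is by tag name and id attribute, so the exact spelling of the id still matters.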

Answer 2

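You are not using the regular expression API in the usual way (passing a compiled pattern to re.findall does work, but it is redundant). There are two ways to do this: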

1

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)
price = code.findall(htext)

2

regex = '<span id="yfs_l84_spy">(.+?)</span>'
price = re.findall(regex, htext)

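It should be noted that Python's re library does some caching of compiled patterns internally, so compiling the pattern yourself ahead of time has limited benefit.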
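As a quick check of the two forms above, this sketch runs both (plus the form from the question, which passes a compiled pattern to re.findall) against a hard-coded snippet. The `sample_html` string is an assumption standing in for the real page.

```python
import re

# Hypothetical snippet standing in for the downloaded page.
sample_html = '<span id="yfs_l84_spy">176.12</span>'

regex = '<span id="yfs_l84_spy">(.+?)</span>'
code = re.compile(regex)

via_method = code.findall(sample_html)       # form 1: method on the compiled pattern
via_module = re.findall(regex, sample_html)  # form 2: module-level function with a string
via_mixed = re.findall(code, sample_html)    # form from the question: also accepted

print(via_method, via_module, via_mixed)

# The module-level functions compile and cache patterns behind the scenes.
# In CPython, compiling the same string twice typically returns the same cached
# object, but that is an implementation detail and not something to rely on.
print(re.compile(regex) is re.compile(regex))
```

All three calls return the same list, so the choice between them is mostly about readability. Precompiling is still reasonable when the same pattern is reused in many places.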
