Using Python 2.5, I'm reading an HTML file for three different pieces of information. The way I find the information is by locating a match with a regex*, then counting a specific number of lines down from the matching line to get the actual information I'm looking for. The problem is that I have to re-open the site three times (once for each piece of information I look up). I think this is inefficient and would like to be able to look up all three things with a single open of the site. Does anyone have a better method or a suggestion?
* I will learn a better way, such as BeautifulSoup, but for now I need a quick fix
Code:
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
Thanks,
B
I found a solution that works! I deleted the two extraneous urlopen and readlines commands, leaving just the one needed for the loops (earlier I had deleted only the urlopen commands but left the readlines). Here is my corrected code:
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        #f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        #lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
        print ticker,LastDiv,AnnualDiv,LastExDivDate
        print '@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@'
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
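The single-fetch idea can be taken a step further by merging the three scans into one pass over the cached lines. A minimal sketch, with `SAMPLE_LINES` as made-up stand-in data for `f.readlines()` (the real page layout may differ):

```python
import re

# Made-up lines mimicking the label-row / value-row structure the code above expects.
SAMPLE_LINES = [
    '<td class="descrip">Annual Dividend:</td>',
    '<td>$2.08</td>',
    '<td class="descrip">Last Dividend:</td>',
    '<td>$0.52</td>',
    '<td class="descrip">Last Ex-Dividend Date:</td>',
    '<td>2013-08-07</td>',
]

fields = {}
# One pass: enumerate keeps the index so the value on the following line
# is still reachable via lines[i + 1].
for i, line in enumerate(SAMPLE_LINES):
    if "Annual Dividend:" in line:
        fields['annual'] = re.search(r'>\$(.*)</td>', SAMPLE_LINES[i + 1]).group(1)
    elif "Last Dividend:" in line:
        fields['last'] = re.search(r'>\$(.*)</td>', SAMPLE_LINES[i + 1]).group(1)
    elif "Last Ex-Dividend Date:" in line:
        fields['exdate'] = re.search(r'>(.*)</td>', SAMPLE_LINES[i + 1]).group(1)
```

The same `for` body handles all three labels, so the file is read once and scanned once instead of three times.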
Answer 0 (score: 1)
A BeautifulSoup example for reference (Python 2 from memory: I only have Python 3 here, so some of the syntax may be a little off):
from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

yoursite = "http://...."
f = urlopen(yoursite)  # urllib2 responses are not context managers in Python 2
soup = BeautifulSoup(f)
for node in soup.findAll('td', attrs={'class': 'descrip'}):
    print node.text
    print node.nextSibling.nextSibling.text  # BeautifulSoup 3 spelling of next_sibling
Output (sample input 'GOOG'):
Last Close:
$910.68
Annual Dividend:
N/A
Pay Date:
N/A
Dividend Yield:
N/A
Ex-Dividend Date:
N/A
Years Paying:
N/A
52 Week Dividend:
$0.00
etc.
BeautifulSoup is easy to use on sites with a predictable schema.
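The same label-cell/value-cell pairing can also be walked with just the standard library, which illustrates why a predictable schema makes this easy. A minimal sketch, assuming made-up sample HTML in the shape shown above; the import is spelled for Python 3 (under Python 2 it would be `from HTMLParser import HTMLParser`):

```python
from html.parser import HTMLParser

# Made-up HTML mirroring the descrip-cell / value-cell layout.
SAMPLE = """
<table>
  <tr><td class="descrip">Annual Dividend:</td><td>$2.08</td></tr>
  <tr><td class="descrip">Last Dividend:</td><td>$0.52</td></tr>
  <tr><td class="descrip">Last Ex-Dividend Date:</td><td>2013-08-07</td></tr>
</table>
"""

class DescripPairs(HTMLParser):
    """Collect {label: value} from <td class="descrip">label</td><td>value</td> pairs."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.pairs = {}
        self._state = None   # None, 'label', or 'value'
        self._label = None

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            if ('class', 'descrip') in attrs:
                self._state = 'label'       # next text is a label
            elif self._label is not None:
                self._state = 'value'       # next text is that label's value

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._state == 'label':
            self._label = text
        elif self._state == 'value':
            self.pairs[self._label] = text
            self._label = None
        self._state = None

parser = DescripPairs()
parser.feed(SAMPLE)
```

Because every value cell directly follows its label cell, one stateful pass recovers all the fields without any line counting.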
Answer 1 (score: 0)
def scrubdividata(ticker):
    try:
        end = '</td>'
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i in range(0,len(lines)):
            line = lines[i]
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass
Answer 2 (score: 0)
Note that `lines` already contains the lines you need, so there is no need to call `f.readlines()` again. Just reuse `lines`. A small note: you can iterate over the lines directly (using `enumerate` to keep the index that the `lines[i+1]` lookup needs):
def scrubdividata(ticker):
    try:
        f = urllib2.urlopen('http://dividata.com/stock/%s'%(ticker))
        lines = f.readlines()
        for i, line in enumerate(lines):
            if "Annual Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                AnnualDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        for i, line in enumerate(lines):
            if "Last Dividend:" in line:
                s = str(lines[i+1])
                start = '>\$'
                end = '</td>'
                LastDiv = re.search('%s(.*)%s' % (start, end), s).group(1)
        for i, line in enumerate(lines):
            if "Last Ex-Dividend Date:" in line:
                s = str(lines[i+1])
                start = '>'
                end = '</td>'
                LastExDivDate = re.search('%s(.*)%s' % (start, end), s).group(1)
        divlist.append((ticker,LastDiv,AnnualDiv,LastExDivDate))
    except:
        if ticker not in errorlist:
            errorlist.append(ticker)
        else:
            pass
        pass