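I'm new to programming and this seems like a basic question, but I can't figure it out. The code below creates a .txt file that contains two instances of the last dataset.
Can anyone help or explain why this code writes the last dataset twice? Thanks,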
import urllib
import re
##NL East stats.
teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]
j=0
i=0
while (i<len(teamnamelist)) and (j<len(teamstate)):
    url = "http://espn.go.com/mlb/team/_/name/" + teamstate[j] + "/" +teamnamelist[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span class="stat">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the batting average of the",teamlist[i]," is: " ,price
    i+=1
    j+=1
text_file = open("statstest.txt", "a")
text_file.write("averages: {0}\n".format(price))
text_file.close()
Answer 0 (score: 1)
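A few things:
1. To iterate over several lists at once, use zip. It pretty much combines them into a list of tuples with their elements matched up. Since your elements are already in the right order, this works without any hassle.
2. re.findall returns a list, so if you want just the batting average (the second item in the list), you need a bit of extra handling to pull it out.
Number 2 above is largely why your code returns the following: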
the batting average of the Washington Nationals is: ['22', '.304', '.362', '.530', '3.21', '2', '0.93', '.179']
the batting average of the Philadelphia Phillies is: ['19', '.306', '.364', '.468', '5.96', '2', '1.75', '.311']
the batting average of the New York Mets is: ['10', '.179', '.243', '.337', '6.75', '2', '1.64', '.304']
the batting average of the Miami Marlins is: ['27', '.301', '.358', '.451', '3.00', '2', '1.31', '.268']
the batting average of the Atlanta Braves is: ['6', '.179', '.225', '.337', '1.38', '3', '0.85', '.184']
[Finished in 19.0s]
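The behaviour of zip and re.findall can be checked offline. This is only a minimal sketch in Python 3 syntax (the code in this thread is Python 2), and the HTML string is a made-up stand-in, not ESPN's actual markup:

```python
import re

# Hypothetical HTML fragment standing in for a downloaded team page.
sample_html = (
    '<span class="stat">22</span>'
    '<span class="stat">.304</span>'
    '<span class="stat">.362</span>'
)

# re.findall returns a list with every captured group, in document order.
stats = re.findall(r'<span class="stat">(.+?)</span>', sample_html)
print(stats)     # ['22', '.304', '.362']
print(stats[1])  # .304

# zip pairs up the elements that share an index in each list.
teamstate = ["wsh", "phi"]
teamlist = ["Washington Nationals", "Philadelphia Phillies"]
pairs = list(zip(teamstate, teamlist))
print(pairs)
```

Printing the whole list gives the bracketed output shown above; indexing with [1] gives the single value.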
Changing your approach slightly:
import urllib
import re
##NL East stats.
teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]
for x, y, z in zip(teamstate, teamnamelist, teamlist):
    url = "http://espn.go.com/mlb/team/_/name/%s/%s" % (x, y)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span class="stat">(.+?)</span>'
    pattern = re.compile(regex)
    val = re.findall(pattern,htmltext)[1]
    print "The batting average of the %s is %s." % (z, str(val))
Result:
The batting average of the Washington Nationals is .304.
The batting average of the Philadelphia Phillies is .306.
The batting average of the New York Mets is .179.
The batting average of the Miami Marlins is .301.
The batting average of the Atlanta Braves is .179.
[Finished in 22.5s]
Using lxml and requests (since this is faster in the long run):
import requests as rq
from lxml import html
teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]
for x, y, z in zip(teamstate, teamnamelist, teamlist):
    url = "http://espn.go.com/mlb/team/_/name/%s/%s" % (x, y)
    r = rq.get(url)
    tree = html.fromstring(r.text)
    val = tree.xpath("//span[@class='stat']/text()")[1]
    print "The batting average of the %s is %s." % (z, str(val))
Result:
The batting average of the Washington Nationals is .304.
The batting average of the Philadelphia Phillies is .306.
The batting average of the New York Mets is .179.
The batting average of the Miami Marlins is .301.
The batting average of the Atlanta Braves is .179.
[Finished in 10.6s]
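As for the file itself: if the write in the question's script sits after the loop, as it appears to in the posted code, then price only holds the last team's stats by the time it is written, and mode "a" appends on every run, so a second run adds the same line again. That reading is an assumption based on the code as posted. Here is a minimal sketch of writing one line per team, in Python 3 syntax, with hard-coded stand-in values instead of live requests and a temporary output path chosen just for the demo:

```python
import os
import tempfile

teamlist = ["Washington Nationals", "Philadelphia Phillies", "New York Mets"]
# Stand-in values copied from the output above; a real script would scrape these.
averages = [".304", ".306", ".179"]

# Written to a temporary directory here so the sketch leaves no stray files behind.
out_path = os.path.join(tempfile.mkdtemp(), "statstest.txt")

# Mode "w" starts from an empty file on each run; "a" would keep lines from earlier runs.
with open(out_path, "w") as text_file:
    for team, avg in zip(teamlist, averages):
        # The write happens inside the loop, so every team gets its own line.
        text_file.write("{0}: {1}\n".format(team, avg))

with open(out_path) as text_file:
    lines = text_file.read().splitlines()
print(lines)
```

The with block also closes the file automatically, so there is no separate close() call to forget.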
Let us know if this helps.