Question

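I'm new to programming, and this seems like a basic question, but I can't figure it out. The code below creates a .txt file that contains two instances of the last data set.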

Can someone help or explain why this code produces the last data set twice? Thanks,

import urllib
import re
##NL East stats.
teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]

j=0
i=0
while (i<len(teamnamelist)) and (j<len(teamstate)):
    url = "http://espn.go.com/mlb/team/_/name/" + teamstate[j] + "/" +teamnamelist[i]
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span class="stat">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print "the batting average of the",teamlist[i]," is: " ,price
    i+=1
    j+=1

text_file = open("statstest.txt", "a")
text_file.write("averages: {0}\n".format(price)) 
text_file.close()

Answer 1

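A few things:

1. Use zip on your lists. It combines them into a list of tuples whose elements are matched up by position. Since your lists are already in the same order, this works without any extra effort.
2. If you inspect the page, about 7 or 8 elements match the regex. re.findall returns a list of all of them, so to get just the batting average (the second item in the list) you need to pull that element out of the list.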
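
As a quick illustration of point 1, here is a minimal Python 3 sketch of how zip lines up the three lists. It uses only two teams to keep it short, and `build_urls` is a hypothetical helper introduced for this example, not part of the original code.

```python
# Python 3 sketch: zip pairs up elements by position across several lists.
teamstate = ["wsh", "phi"]
teamnamelist = ["washington-nationals", "philadelphia-phillies"]
teamlist = ["Washington Nationals", "Philadelphia Phillies"]


def build_urls(states, slugs, names):
    """Hypothetical helper: return (display name, url) pairs, one per team."""
    base = "http://espn.go.com/mlb/team/_/name/%s/%s"
    return [(name, base % (state, slug))
            for state, slug, name in zip(states, slugs, names)]


pairs = build_urls(teamstate, teamnamelist, teamlist)
for name, url in pairs:
    print(name, "->", url)
```

Because zip walks all lists in step, the separate `i` and `j` counters from the question are no longer needed.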

Point 2 above is largely why your code returns the following:

the batting average of the Washington Nationals  is:  ['22', '.304', '.362', '.530', '3.21', '2', '0.93', '.179']
the batting average of the Philadelphia Phillies  is:  ['19', '.306', '.364', '.468', '5.96', '2', '1.75', '.311']
the batting average of the New York Mets  is:  ['10', '.179', '.243', '.337', '6.75', '2', '1.64', '.304']
the batting average of the Miami Marlins  is:  ['27', '.301', '.358', '.451', '3.00', '2', '1.31', '.268']
the batting average of the Atlanta Braves  is:  ['6', '.179', '.225', '.337', '1.38', '3', '0.85', '.184']
[Finished in 19.0s]
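
To see why a whole list is printed, here is a small Python 3 sketch of re.findall on a hard-coded snippet. The `sample_html` string is an assumption standing in for the downloaded page, which contains far more markup.

```python
import re

# Hypothetical stand-in for the downloaded page source.
sample_html = (
    '<span class="stat">22</span>'
    '<span class="stat">.304</span>'
    '<span class="stat">.362</span>'
)

pattern = re.compile(r'<span class="stat">(.+?)</span>')
matches = pattern.findall(sample_html)  # every captured group, as a list
batting_average = matches[1]            # only the second match

print(matches)
print(batting_average)
```

Indexing with `[1]` is what turns the list into the single value you want.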

Changing your approach slightly:

import urllib
import re
##NL East stats.
teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]

for x, y, z in zip(teamstate, teamnamelist, teamlist):
    url = "http://espn.go.com/mlb/team/_/name/%s/%s" % (x, y)
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span class="stat">(.+?)</span>'
    pattern = re.compile(regex)
    val = re.findall(pattern,htmltext)[1]
    print "The batting average of the %s is %s." % (z, str(val))

Result:

The batting average of the Washington Nationals is .304.
The batting average of the Philadelphia Phillies is .306.
The batting average of the New York Mets is .179.
The batting average of the Miami Marlins is .301.
The batting average of the Atlanta Braves is .179.
[Finished in 22.5s]
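
The rewrite above only prints, so the file-writing step from the question is worth a note. In the question's code the write happens after the loop, so only the last `price` reaches the file, and mode "a" appends another copy each time the script runs, which is one plausible source of the duplicate. Below is a minimal Python 3 sketch, assuming hard-coded values in place of scraped ones; `write_averages` and the temporary path are hypothetical names for this example.

```python
import os
import tempfile

# Hypothetical, hard-coded results standing in for the scraped values.
averages = [
    ("Washington Nationals", ".304"),
    ("Philadelphia Phillies", ".306"),
    ("New York Mets", ".179"),
]


def write_averages(path, rows):
    """Write one line per team, replacing any previous file contents."""
    # "w" truncates the file; "a" would append again on every run.
    with open(path, "w") as text_file:
        for team, avg in rows:  # the write is inside the loop
            text_file.write("{0}: {1}\n".format(team, avg))


out_path = os.path.join(tempfile.mkdtemp(), "statstest.txt")
write_averages(out_path, averages)
write_averages(out_path, averages)  # a second run does not duplicate lines

with open(out_path) as f:
    lines = f.read().splitlines()
print(lines)
```

If you do want to keep results from earlier runs, keep mode "a" but still move the write inside the loop so every team is recorded.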

Using lxml and requests (because it is faster in the long run):

import requests as rq
from lxml import html

teamstate = ["wsh","phi","nym","mia","atl"]
teamnamelist = ["washington-nationals","philadelphia-phillies","new-york-mets","miami-marlins","atlanta-braves"]
teamlist = ["Washington Nationals","Philadelphia Phillies","New York Mets","Miami Marlins","Atlanta Braves"]

for x, y, z in zip(teamstate, teamnamelist, teamlist):
    url = "http://espn.go.com/mlb/team/_/name/%s/%s" % (x, y)
    r = rq.get(url)
    tree = html.fromstring(r.text)
    val = tree.xpath("//span[@class='stat']/text()")[1]
    print "The batting average of the %s is %s." % (z, str(val))

Result:

The batting average of the Washington Nationals is .304.
The batting average of the Philadelphia Phillies is .306.
The batting average of the New York Mets is .179.
The batting average of the Miami Marlins is .301.
The batting average of the Atlanta Braves is .179.
[Finished in 10.6s]

Let us know if this helps.
