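I'm trying to implement a basic Python web-scraping script using beautifulsoup4 (Python 3.4). It is meant to fetch the current league standings of the National Basketball Association (NBA regular season).
I'm trying to display the text in a more "tabular" way, but I can't get it to work. Example: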
Golden State Warriors    67     7    0.905    40-5
San Antonio Spurs        62    12    0.838    39-6
Instead, it looks like this (all over the place):
Golden State Warriors 67 7 0.905 40-5
San Antonio Spurs 62 12 0.838 39-6
I have tried using string.format(), but to no avail.
Here is the code snippet I use to extract the data from the web page:
for row in tableStats.find_all('tr')[2:]:
    print("\n")
    row_team = row.find_all("td")
    try:
        for stat in row_team:
            print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
            f.write("{0:^2} {1:^3} ".format(stat.text," "))
        if(i == 16 and flag == 0):
            i = int("0")
            flag = int('1')
            print("\n\n\n\n")
            print("Western Conference".center(10),"\n\n\n")
            f.write("Western Conference\n\n")
        i = i + 1
        f.write("\n")
    except Exception as e: #In Case a none object gets returned
        pass
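Any suggestions on how to get this working?
Answer 0 (score: 0)
Since you haven't provided a reproducible example, I'll just offer some suggestions. None of the code below has been tested, so treat it as algorithmic ideas rather than something to copy/paste directly.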
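You have two strategies to solve this: either format each row in a single pass, giving the first column a fixed width, or make two passes, first collecting the rows in a matrix and measuring the longest team name, then printing with that width.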
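For the first strategy, you can do it in a single loop (as you already do), but you need a way to treat the first cell of each row differently so that you can give it a larger width. That is: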
### within your try/except block:
# take the first cell to show the team name on 20 columns
# and truncate it if it's longer than 20 columns. I like to add
# an ellipsis to strings I'm cutting, so here it goes:
name = row_team[0].text.strip()
if len(name) > 20:
    out_l = ['{}…'.format(name[:19])]
else:
    # the ljust() method pads the right side of your string
    # with spaces
    out_l = [name.ljust(20)]
for stat in row_team[1:]:
    # for each stat, parse its text as float, and reinterpret it so
    # it's a ' 0.00' format, you might want to do 5.2f if some
    # values are in the 100s
    out_l.append("{: 4.2f}".format(float(stat.text)))
# build the line by making a string out of the list
# using the ' '.join() method, adding a single space between
# elements
out = ' '.join(out_l)
print(out)
# write the line with a trailing newline
f.write('{}\n'.format(out))
if i == 16 and flag == 0:
    # here I'm centering the string's middle at 40 columns
    # considering a full width of 80 columns. If you set 10
    # columns for a string that's 18 characters, it's going
    # to have no effect!
    out = "Western Conference".center(80)
    print() # empty line
    print(out)
    print() # empty line
    # write the string surrounded by empty lines
    f.write("\n{}\n\n".format(out))
BTW, to avoid having to maintain i by hand like this:
i = 0
for whatever:
    something
    i = i + 1
you can do this:
for i, row in enumerate(tableStats.find_all('tr')[2:]):
and i will be incremented for each value. All of which would give you an output like:
Golden State Warrio…  67.00   7.00   0.90  40-05
San Antonio Spurs     62.00  12.00   0.83  39-06
                                           ^^^^^- this is not handled with
                                                  the code above, cf the end
                                                  of my post.
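To try the single-pass idea without the scraping part, here is a minimal, self-contained sketch. It assumes the cell texts have already been extracted into plain lists of strings (the rows list and the format_row helper are hypothetical, not part of your script), and it deliberately keeps the stats as strings instead of parsing them as floats, so the NN-NN column needs no special handling:

```python
def format_row(cells, name_width=20):
    """Format one row: left-justified team name, then right-aligned stats."""
    name = cells[0]
    if len(name) > name_width:
        # truncate and mark the cut with an ellipsis
        name = name[:name_width - 1] + "…"
    parts = [name.ljust(name_width)]
    # stats stay as strings here, so '40-5' is padded like any other cell
    parts.extend(stat.rjust(6) for stat in cells[1:])
    return " ".join(parts)


# hypothetical sample rows standing in for the scraped <td> texts
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

lines = []
for i, cells in enumerate(rows):
    if i == 1:
        # hypothetical split point, only to show the centered header;
        # the question switches conference at row 16
        lines.append("Western Conference".center(80))
    lines.append(format_row(cells))

print("\n".join(lines))
```

Every row comes out with the same length, which is what makes the columns line up.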
For the second strategy, you need to build a matrix (basically a list of lists) as a first pass:
# init the matrix as an empty list
stats_matrix = []
# longest team name seen so far
max_header_size = 0
for row in tableStats.find_all('tr')[2:]:
    row_team = row.find_all("td")
    # build a list, starting with the text of the first cell:
    name = row_team[0].text.strip()
    line = [name]
    # find out what's the largest string for the first column
    max_header_size = max(max_header_size, len(name))
    for stat in row_team[1:]:
        # then all the other cells as floats
        line.append(float(stat.text))
    # add it to the matrix:
    stats_matrix.append(line)
Then once that's done, you can use max_header_size to format the first column:
for line in stats_matrix:
    # show the first cell with a padding on the right of size "max_header_size"
    out = [line[0].ljust(max_header_size)]
    for stat in line[1:]:
        # format each stat, which was stored as float, as a ' 0.00' string
        out.append("{: 4.2f}".format(stat))
    # show on standard output
    print(' '.join(out))
    # and write to file (with extra \n at the end)
    f.write('{}\n'.format(' '.join(out)))
Then you should see it nicely formatted.
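The same two-pass idea can be pushed one step further by measuring every column, not just the first one. The following is only a sketch under assumptions: the rows list is hypothetical sample data standing in for the scraped cell texts, align_row is a made-up helper name, and the cells are kept as strings rather than floats:

```python
# hypothetical sample rows standing in for the scraped <td> texts
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

# first pass: zip(*rows) transposes the matrix, so each item is one column;
# keep the length of the widest cell of every column
widths = [max(len(cell) for cell in column) for column in zip(*rows)]


def align_row(cells, widths):
    """Second pass: pad each cell to the width measured for its column."""
    # team name is left-justified, the stats are right-justified
    name = cells[0].ljust(widths[0])
    stats = [cell.rjust(width) for cell, width in zip(cells[1:], widths[1:])]
    return "  ".join([name] + stats)


table = [align_row(cells, widths) for cells in rows]
print("\n".join(table))
```

The advantage over a hard-coded width is that nothing gets truncated; the cost is that you cannot print anything until the whole table has been read.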
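N.B.: all that said, this code will not work with your dataset as is, because the last value is NOT a float but a score (NN-NN). So it's up to you to handle it so that the last element isn't treated as a float.
If I were you, I would consider this option (for the second strategy):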
…
# iterate over the stats, leaving out the first and last value
for stat in row_team[1:-1]:
    # do stuff with the floats
# split the text of the last cell ('NN-NN') into two integers
score = tuple(int(n) for n in row_team[-1].text.split('-'))
line.append(score) # store the value as a tuple
Then in the second loop:
…
for stat in line[1:-1]:
    …
# the last element is the (NN, NN) score tuple
score = line[-1]
out.append('{:02d}-{:02d}'.format(score[0], score[1]))
# show on standard output
print(' '.join(out))
…
Then you should get an output like:
Golden State Warriors  67.00   7.00   0.90  40-05
San Antonio Spurs      62.00  12.00   0.83  39-06
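Putting the score handling together, a runnable sketch could look like the one below. Again this rests on assumptions: the input is a plain list of cell texts rather than BeautifulSoup tags, the last cell always has the NN-NN shape, and parse_row and render_row are hypothetical helper names I made up for the illustration:

```python
def parse_row(cells):
    """Split a row of cell texts into (name, float stats, score tuple)."""
    name = cells[0]
    # everything between the first and last cell is parsed as a float
    stats = [float(value) for value in cells[1:-1]]
    # the last cell looks like 'NN-NN': split it and convert both halves
    wins, losses = (int(part) for part in cells[-1].split("-"))
    return name, stats, (wins, losses)


def render_row(name, stats, score, name_width):
    """Build one output line from the typed values."""
    out = [name.ljust(name_width)]
    # fixed width of 6 with two decimals keeps the stat columns aligned
    out.extend("{:6.2f}".format(stat) for stat in stats)
    # zero-pad both halves of the score to two digits
    out.append("{:02d}-{:02d}".format(*score))
    return " ".join(out)


# hypothetical sample rows standing in for the scraped <td> texts
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

parsed = [parse_row(cells) for cells in rows]
name_width = max(len(name) for name, _, _ in parsed)
for name, stats, score in parsed:
    print(render_row(name, stats, score, name_width))
```

Keeping the parsing and the rendering in separate functions also makes it easier to wrap only the parsing in a try/except for rows that have no usable cells.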
HTH