Question

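I'm trying to implement a basic Python web-scraping script using beautifulsoup4 (Python 3.4). It is meant to fetch the current league standings of the National Basketball Association (NBA regular season).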

I'm trying to display the text in a more "tabular" way, but I can't get it to work. Example:

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs      62 12  0.838  39-6

Instead, it looks like this (the messy way):

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs  62  12  0.838  39-6

I tried using str.format(), but to no avail.

Here is the code snippet I use to extract the data from the web page:

for row in tableStats.find_all('tr')[2:]:
    print("\n")
    row_team = row.find_all("td")

    try:
        for stat in row_team:
            print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
            f.write("{0:^2} {1:^3} ".format(stat.text," "))
        if(i == 16 and flag == 0):
            i = int("0")
            flag = int('1')
            print("\n\n\n\n")
            print("Western Conference".center(10),"\n\n\n")
            f.write("Western Conference\n\n")

        i = i + 1
        f.write("\n")
    except Exception as e:   #In Case a none object gets returned
        pass

Any suggestions on how to get this working?

Answer 1

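Since you did not provide a reproducible example, I'll just offer some suggestions. None of the code below has been tested, so treat it as algorithmic ideas rather than something to copy/paste directly.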

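You have two strategies to solve this:

either you fix the width of each column up front;
or you size each column according to its largest cell.

Strategy ①: single pass, but fixed header column width

For the first strategy, you can do it in a single loop (as you already do), but you need a way to treat the first cell of a row differently so that you can give it a larger width. That is: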

### within your try/except block:
# take the text of the first cell to show the team name on 20
# columns and cut it if it's longer than 20 columns. I like to add
# three dots to strings I'm cutting, so here it goes:
name = row_team[0].text.strip()
if len(name) > 20:
    out_l = ['{}…'.format(name[:19])]
else:
    # the ljust() method pads the right side of your string
    # with spaces
    out_l = [name.ljust(20)]
for stat in row_team[1:]:
    # for each stat, parse its text as a float and format it as
    # a right-aligned ' 0.00' field; you might want 6.2f if some
    # values are in the 100s
    out_l.append("{:5.2f}".format(float(stat.text)))

# printing out the line, by making a string out of the list
# using the ' '.join() method, adding a single space between
# elements
out = ' '.join(out_l)
print(out)
# write the line followed by a newline
f.write('{}\n'.format(out))

if(i == 16 and flag == 0):
    # here I'm centering the string's middle at 40 columns
    # considering a full width of 80 columns. If you set 10
    # columns for a string that's 18 characters, it's going
    # to have no effects!
    out = "Western Conference".center(80)
    print() # empty line
    print(out)
    print() # empty line
    # print the string surrounded by empty lines
    f.write("\n{}\n\n".format(out))
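To make the single-pass idea concrete, here is a minimal, self-contained sketch. It runs on plain strings rather than BeautifulSoup tags; the `rows` data, the `NAME_WIDTH` constant, and the `format_row` helper are assumptions made up for illustration, and the win-loss record column is left out on purpose.

```python
# Hypothetical sample rows: the text of each <td> cell, without the
# trailing win-loss record column (assumed data, not scraped).
rows = [
    ["Golden State Warriors", "67", "7", "0.905"],
    ["San Antonio Spurs", "62", "12", "0.838"],
]

NAME_WIDTH = 20  # fixed width chosen for the team-name column


def format_row(cells, name_width=NAME_WIDTH):
    """Format one row in a single pass using a fixed name width."""
    name = cells[0]
    if len(name) > name_width:
        # cut the name and mark the cut with an ellipsis
        name = name[:name_width - 1] + "…"
    out = [name.ljust(name_width)]
    for stat in cells[1:]:
        # right-align every number in a 6-character field
        out.append("{:6.2f}".format(float(stat)))
    return " ".join(out)


lines = [format_row(cells) for cells in rows]
for line in lines:
    print(line)
```

Because every field has a fixed width, all lines come out the same length, at the cost of truncating any team name longer than `NAME_WIDTH`.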

BTW, to avoid having to maintain i yourself like this:

i = 0
for whatever:
    something
    i = i + 1

you can do:

for i, row in enumerate(tableStats.find_all('tr')[2:]):

and i will be incremented for each row. All of this would give you output like:

Golden State Warrio… 67.00  7.00  0.91 40-05
San Antonio Spurs    62.00 12.00  0.84 39-06
                                       ^^^^^- this is not handled by
                                              the code above, cf. the end
                                              of my post.
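As a small illustration of the enumerate() idiom mentioned above, the sketch below inserts a centered conference header once the counter reaches a given row. The team labels and the `EAST_COUNT` value are assumptions invented for the example, and the header is placed before the row at that index, which may differ from where your own check sits.

```python
# Hypothetical team labels standing in for the table rows (assumed data).
rows = ["East team {}".format(n) for n in range(1, 4)] + \
       ["West team {}".format(n) for n in range(1, 4)]
EAST_COUNT = 3  # assumed number of rows before the second conference

output = []
for i, row in enumerate(rows):
    # enumerate() supplies the counter, so no manual i = i + 1 is needed
    if i == EAST_COUNT:
        output.append("")
        output.append("Western Conference".center(40))
        output.append("")
    output.append(row)

print("\n".join(output))
```

Since the counter only equals `EAST_COUNT` once, no extra flag variable is required in this version.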

Strategy ②: two passes

For the second strategy, you need to build a matrix (basically a list of lists) as a first pass:

# init the matrix as an empty list
stats_matrix = []
# longest team name seen so far
max_header_size = 0
for row in tableStats.find_all('tr')[2:]:
    row_team = row.find_all("td")
    if not row_team:
        # skip rows without data cells
        continue
    # build a list, starting with the text of the first cell:
    name = row_team[0].text.strip()
    line = [name]
    # find out what's the largest string for the first column
    max_header_size = max(max_header_size, len(name))
    for stat in row_team[1:]:
        # then all the other cells as floats
        line.append(float(stat.text))
    # add it to the matrix:
    stats_matrix.append(line)

Then, once that is done, you can use max_header_size to format the first column:

for line in stats_matrix:
    # show the first cell with a padding on the right of size "max_header_size"
    out = [line[0].ljust(max_header_size)]
    for stat in line[1:]:
        # format each stat, which was stored as float, as a ' 0.00' string
        out.append("{:5.2f}".format(stat))
    # show on standard output (join the formatted cells, not the raw line)
    print(' '.join(out))
    # and write to file (with extra \n at the end)
    f.write('{}\n'.format(' '.join(out)))

Then you should see it nicely formatted.
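The two passes can be sketched end to end on plain strings. This is only an illustration under stated assumptions: the `rows` data is made up, it stands in for the text of the table cells, and it contains numeric cells only after the team name.

```python
# Hypothetical sample rows (assumed data): team name followed by
# numeric cells only, all as strings like the text of <td> cells.
rows = [
    ["Golden State Warriors", "67", "7", "0.905"],
    ["San Antonio Spurs", "62", "12", "0.838"],
]

# First pass: convert the numbers and track the longest team name.
stats_matrix = []
max_header_size = 0
for cells in rows:
    name = cells[0]
    max_header_size = max(max_header_size, len(name))
    stats_matrix.append([name] + [float(stat) for stat in cells[1:]])

# Second pass: pad every name to the longest one, then format the stats.
lines = []
for line in stats_matrix:
    out = [line[0].ljust(max_header_size)]
    for stat in line[1:]:
        out.append("{:6.2f}".format(stat))
    lines.append(" ".join(out))

print("\n".join(lines))
```

Unlike the fixed-width approach, no team name is truncated here, because the name column is as wide as the longest name found in the first pass.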

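N.B.: all that said, this code will not work with your dataset as is, because the last value is NOT a float but a record (NN-NN). So it is up to you to handle it so that the last element is not treated as a float.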

If I were you, I would consider this option (for the second strategy):

…
# iterate over the cells, leaving out the first and last ones
for stat in row_team[1:-1]:
    # do stuff with the floats
    line.append(float(stat.text))
# split the last cell, e.g. "40-5", into two integers
score = tuple(int(n) for n in row_team[-1].text.split('-'))
line.append(score)  # store the value as a tuple

Then in the second loop:

…
for stat in line[1:-1]:
    …
# the record was stored as a tuple of two integers in the first pass
score = line[-1]
out.append('{:02d}-{:02d}'.format(score[0], score[1]))
# show on standard output
print(' '.join(out))
…

Then you should get output like:

Golden State Warriors 67.00  7.00  0.91 40-05
San Antonio Spurs     62.00 12.00  0.84 39-06
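For reference, here is a runnable sketch of the second strategy with the win-loss record handled separately from the floats. The `rows` data is assumed sample text rather than scraped output, and the record is assumed to always have the form "NN-NN".

```python
# Hypothetical sample rows (assumed data): the last cell is a win-loss
# record such as "40-5", not a number.
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

# First pass: floats for the middle cells, a tuple of ints for the record.
stats_matrix = []
max_header_size = 0
for cells in rows:
    name = cells[0]
    max_header_size = max(max_header_size, len(name))
    stats = [float(stat) for stat in cells[1:-1]]
    wins, losses = cells[-1].split("-")
    stats_matrix.append([name] + stats + [(int(wins), int(losses))])

# Second pass: format names, floats, and the zero-padded record.
lines = []
for line in stats_matrix:
    out = [line[0].ljust(max_header_size)]
    out.extend("{:5.2f}".format(stat) for stat in line[1:-1])
    out.append("{:02d}-{:02d}".format(*line[-1]))
    lines.append(" ".join(out))

print("\n".join(lines))
```

Note that two decimals round the winning percentage (0.905 and 0.838 in the sample); a wider precision such as `{:5.3f}` for that column would keep the original digits if you need them.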

HTH
