Python text spacing and alignment while fetching a web page with BeautifulSoup

Date: 2016-03-30 18:31:41

Tags: python beautifulsoup web-crawler urllib

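I'm trying to implement a basic Python web-crawling script using beautifulsoup4 (Python 3.4). It is used to get the current league standings of the National Basketball Association (NBA, regular season).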

I'm trying to display the text in a more "tabular" way, but I can't get it to work. Example:

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs      62 12  0.838  39-6

Instead, it looks like this (all over the place):

Golden State Warriors  67  7  0.905  40-5
San Antonio Spurs  62  12  0.838  39-6

I have tried using str.format(), but to no avail.

This is the snippet I use to extract the data from the web page:

for row in tableStats.find_all('tr')[2:]:
    print("\n")
    row_team = row.find_all("td")

    try:
        for stat in row_team:
            print("{0:>5} {1:>5} ".format(stat.text," "), end=" ")
            f.write("{0:^2} {1:^3} ".format(stat.text," "))
        if(i == 16 and flag == 0):
            i = int("0")
            flag = int('1')
            print("\n\n\n\n")
            print("Western Conference".center(10),"\n\n\n")
            f.write("Western Conference\n\n")

        i = i + 1
        f.write("\n")
    except Exception as e:   #In Case a none object gets returned
        pass

Any suggestions on how to get this working?

1 Answer:

Answer 0 (score: 0)

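Since you didn't provide a reproducible example, I'll just offer some suggestions. None of the code below has been tested, so treat it as algorithmic ideas rather than something to copy/paste directly.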

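You have two strategies to solve this:

  1. you fix the width of each column up front;
  2. you size each column based on its largest cell.

    Strategy ①: single pass, with a fixed-width header column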

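    For the first strategy, you can do it in a single loop (as you already do), but you need a way to treat the first cell of a row differently, so that you can give it a larger width. That is: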

    ### within your try/except block:
    # take the text of the first cell to show the team name on 20
    # columns, and cut it if it's longer than 20 columns. I like to
    # add an ellipsis to strings I'm cutting, so here it goes:
    team = row_team[0].text.strip()
    if len(team) > 20:
        out_l = ['{}…'.format(team[:19])]
    else:
        # the ljust() method pads the right side of your string
        # with spaces
        out_l = [team.ljust(20)]
    for stat in row_team[1:]:
        # for each stat, parse its text as a float and format it
        # right-aligned on 5 columns ('67.00', ' 7.00'); you might
        # want 6.2f if some values are in the 100s
        out_l.append("{:5.2f}".format(float(stat.text)))
    
    # printing out the line, by making a string out of the list
    # using the ' '.join() method, adding a single space between
    # elements
    out = ' '.join(out_l)
    print(out)
    # write the line followed by a newline
    f.write('{}\n'.format(out))
    
    if(i == 16 and flag == 0):
        # here I'm centering the string's middle at 40 columns
        # considering a full width of 80 columns. If you set 10
        # columns for a string that's 18 characters, it's going
        # to have no effect!
        out = "Western Conference".center(80)
        print() # empty line
        print(out)
        print() # empty line
        # write the string surrounded by empty lines
        f.write("\n{}\n\n".format(out))
    

    BTW, to avoid having to manage i by hand like this:

    i = 0
    for whatever:
        something
        i = i + 1
    

    you can do this instead:

    for i, row in enumerate(tableStats.find_all('tr')[2:]):
    

    i is then incremented on every iteration. All of this should give you output like:

    Golden State Warrio… 67.00  7.00 0.90 40-05
    San Antonio Spurs    62.00 12.00 0.83 39-06
                                          ^^^^^- this is not handled with
                                                 the code above, cf the end
                                                 of my post.
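
The first strategy can also be sketched as a self-contained function that runs without the web page. This is only an illustration under stated assumptions: `format_row_fixed` and the sample `rows` are hypothetical names, and each row is assumed to be a list of strings such as you would get from `[td.text.strip() for td in row.find_all("td")]`. The last cell is kept as text so the "NN-NN" record does not break the float parsing.

```python
def format_row_fixed(cells, name_width=20):
    """Format one row: team name in a fixed-width column, then the stats."""
    name = cells[0]
    if len(name) > name_width:
        # cut the name and mark the cut with a single ellipsis character
        name = name[:name_width - 1] + "…"
    out = [name.ljust(name_width)]
    for stat in cells[1:-1]:
        # numeric cells: right-aligned, two decimals, 6 columns wide
        out.append("{:6.2f}".format(float(stat)))
    # the last cell is an 'NN-NN' record, so it stays a string
    out.append(cells[-1].rjust(5))
    return " ".join(out)


# hypothetical sample data standing in for the scraped table cells
rows = [
    ["Golden State Warriors", "67", "7", "0.905", "40-5"],
    ["San Antonio Spurs", "62", "12", "0.838", "39-6"],
]

# enumerate() provides the row counter without a manual i = i + 1
for i, cells in enumerate(rows):
    print(format_row_fixed(cells))
```

Because every cell is padded to a fixed width, all lines come out the same length, which is what makes the columns line up.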
    

    Strategy ②: two passes

    For the second strategy, you need to build a matrix (basically a list of lists) as a first pass:

    # init the matrix as an empty list
    stats_matrix = []
    # width of the largest string seen so far in the first column
    max_header_size = 0
    for row in tableStats.find_all('tr')[2:]:
        row_team = row.find_all("td")
        # build a list, starting with the text of the first cell:
        team = row_team[0].text.strip()
        line = [team]
        # find out what's the largest string for the first column
        max_header_size = max(max_header_size, len(team))
        for stat in row_team[1:]:
            # then all the other cells as floats
            line.append(float(stat.text))
        # add it to the matrix:
        stats_matrix.append(line)
    

    Then, once that is done, you can format the first column using max_header_size:

    for line in stats_matrix:
        # show the first cell with a padding on the right of size "max_header_size"
        out = [line[0].ljust(max_header_size)]
        for stat in line[1:]:
            # format each stat, which was stored as a float, right-aligned on 5 columns
            out.append("{:5.2f}".format(stat))
        # show on standard output (join the formatted strings, not the raw line)
        print(' '.join(out))
        # and write to file (with extra \n at the end)
        f.write('{}\n'.format(' '.join(out)))
    

    And then you should see it nicely formatted.
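
As a rough, self-contained sketch of the two-pass idea (the function name `format_table` and the sample matrix are made up for illustration, and the score column is left out here), the first pass measures the widest team name and the second pass formats every row with that width:

```python
def format_table(rows):
    """Two-pass formatting: measure the first column, then format the rows."""
    # first pass: the widest team name sets the width of the first column
    name_width = max(len(cells[0]) for cells in rows)
    lines = []
    # second pass: pad every name to that width and format the floats
    for cells in rows:
        out = [cells[0].ljust(name_width)]
        for stat in cells[1:]:
            out.append("{:6.2f}".format(stat))
        lines.append(" ".join(out))
    return lines


# hypothetical matrix, shaped like stats_matrix: a name followed by floats
stats_matrix = [
    ["Golden State Warriors", 67.0, 7.0],
    ["San Antonio Spurs", 62.0, 12.0],
]

for line in format_table(stats_matrix):
    print(line)
```

Unlike the fixed-width approach, no team name gets cut, at the cost of holding the whole table in memory before anything is printed.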

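    N.B.: all that said, this code will not work as-is on your dataset, because the last value is NOT a float but a score (NN-NN). So it's up to you to make sure the last element is not treated as a float.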

    If I were you, I'd consider this option (for the second strategy):

    …
    # iterate over the stats, leaving out the first and last cells
    for stat in row_team[1:-1]:
        line.append(float(stat.text))
    # split the 'NN-NN' string into two integers
    score = tuple(int(n) for n in row_team[-1].text.split('-'))
    line.append(score) # store the value as a tuple
    

    And then in the second loop:

    …
    for stat in line[1:-1]:
        …
    # the score was stored as a tuple in the last position
    score = line[-1]
    out.append('{:02d}-{:02d}'.format(score[0], score[1]))
    # show on standard output
    print(' '.join(out))
    …
    

    And then you should get output like:

    Golden State Warriors 67.00  7.00 0.90 40-05
    San Antonio Spurs     62.00 12.00 0.83 39-06
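
One possible way to handle the score column in isolation is a pair of small helpers. This is a sketch, not a drop-in fix: `parse_cell` and `format_cell` are hypothetical names, and it assumes the numeric stats are never negative, so a "-" in a cell can only mean an "NN-NN" score.

```python
def parse_cell(text):
    """Return an (int, int) tuple for 'NN-NN' cells, a float otherwise."""
    if "-" in text:
        # assumption: stats are non-negative, so '-' marks a score
        wins, losses = text.split("-")
        return int(wins), int(losses)
    return float(text)


def format_cell(value):
    """Format a parsed cell back into an aligned string."""
    if isinstance(value, tuple):
        # zero-pad both halves so every score is 5 characters wide
        return "{:02d}-{:02d}".format(*value)
    return "{:5.2f}".format(value)


print(format_cell(parse_cell("40-5")))  # prints 40-05
print(format_cell(parse_cell("67")))    # prints 67.00
```

The integers have to be converted with int() before formatting, because the `02d` format code raises a ValueError when given a string.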
    

    HTH