Unable to prettify HTML code with Sublime Text or Beautiful Soup

Time: 2019-05-06 16:10:42

Tags: html web-scraping beautifulsoup

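I am trying to get information from a website by web scraping. I saved the page I want to scrape as an .html file and opened it in Sublime Text, but some parts of it cannot be displayed in prettified form. I ran into the same problem when I tried Beautiful Soup. See the image below (I can't share the full code because it exposes private information).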

[image not available]

1 Answer:

Answer 0 (score: 0)

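Just pass the HTML to a BeautifulSoup object as a multi-line string and then call soup.prettify(). That should work. Note, however, that Beautiful Soup's prettify() indents by 1 space per nesting level by default. So if you want a custom indent, you can write a small wrapper like the following: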

def indentPrettify(soup, indent=4):
    # indent is the desired number of spaces per nesting level, as an int
    pretty_lines = []
    previous_indent = 0
    # iterate over each line of the prettified soup
    for line in soup.prettify().splitlines():
        # index of the first '<' on the line; for a tag line this is also the
        # number of leading spaces, i.e. the nesting depth (1 space per level)
        current_indent = line.find("<")
        # str.find() returns -1 when no '<' is found. This means the line is some kind
        # of text or script instead of an HTML element and should be treated as a child
        # of the previous line. Also, current_indent should never be more than previous + 1.
        if current_indent == -1 or current_indent > previous_indent + 1:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_lines.append(writeOut(line, current_indent, indent))
    return "".join(pretty_lines)

def writeOut(line, current_indent, desired_indent):
    # the line already carries its original leading spaces, so only the
    # difference up to current_indent * desired_indent has to be prepended
    spaces_to_add = (current_indent * desired_indent) - current_indent
    # multiplying a string by a non-positive number yields an empty string
    return " " * spaces_to_add + line + "\n"
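
As a shorter alternative to the wrapper above (not part of the original answer), the leading spaces of each line can simply be multiplied with a regular expression, since prettify() emits one space per nesting level. This is a minimal sketch: the function name reindent and the sample string are hypothetical, and the sample stands in for the output of soup.prettify() so the snippet runs without Beautiful Soup installed.

```python
import re

def reindent(pretty_html, indent=4):
    """Scale one-space-per-level indentation to `indent` spaces per level."""
    # Match the run of leading spaces on every line and repeat it `indent` times.
    return re.sub(
        r"^( +)",
        lambda m: m.group(1) * indent,
        pretty_html,
        flags=re.MULTILINE,
    )

# Hypothetical stand-in for soup.prettify(); in practice pass that string instead.
sample = "<html>\n <body>\n  <p>\n   Hello\n  </p>\n </body>\n</html>"
print(reindent(sample))
```

Unlike the wrapper, this version also multiplies the leading whitespace of multi-line script or preformatted content, so it is best suited to ordinary markup. Recent Beautiful Soup releases also appear to accept an indent argument on the formatter classes in bs4.formatter; check the documentation for the installed version before relying on it.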