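I'm trying to scrape information from a website. I saved the page I want to scrape as an .html file and opened it in Sublime Text, but some parts are not displayed in a prettified (indented) form. I run into the same problem when I use BeautifulSoup. See the image below (I can't really share the full code because it exposes private information).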
Answer 0 (score: 0)
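Just pass the HTML as a multi-line string to a BeautifulSoup object and then call soup.prettify(). That should work. Note that prettify() indents by one space per nesting level by default, so if you want a custom indentation width you can write a small wrapper like the following: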
def indentPrettify(soup, indent=4):
    # indent is the desired number of spaces per nesting level, as an int
    pretty_soup = ""
    previous_indent = 0
    # iterate over each line of the prettified soup
    for line in soup.prettify().split("\n"):
        # index of the opening '<' of an HTML tag; because prettify() indents by
        # one space per level, this is also the nesting depth of the line
        current_indent = line.find("<")
        # str.find() returns -1 when no '<' is found. This means the line is some kind
        # of text or script instead of an HTML element, so it is treated as a child
        # of the previous line. A jump of more than two levels past the previous
        # line is clamped the same way.
        if current_indent == -1 or current_indent > previous_indent + 2:
            current_indent = previous_indent + 1
        previous_indent = current_indent
        pretty_soup += writeOut(line, current_indent, indent)
    return pretty_soup


def writeOut(line, current_indent, desired_indent):
    # a tag line already carries current_indent leading spaces from prettify(),
    # so only the difference up to the desired width is added in front
    spaces_to_add = (current_indent * desired_indent) - current_indent
    return " " * max(spaces_to_add, 0) + line + "\n"
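If the guessing of text-line depth in the wrapper above is not needed, a simpler variant is to scale the leading spaces of the string that soup.prettify() returns. The following is only a sketch under that assumption (one space per nesting level in the input). The names `reindent`, `width`, and `sample` are made up for illustration, and `sample` is a hand-written string in the shape of prettify() output so that the snippet runs without BeautifulSoup installed.

```python
import re


def reindent(pretty_html: str, width: int = 4) -> str:
    """Multiply each line's leading spaces by `width`.

    Assumes the input uses one space per nesting level,
    as in the string returned by soup.prettify().
    """
    return re.sub(
        r"^( +)",                      # run of spaces at the start of a line
        lambda m: m.group(1) * width,  # repeat that run `width` times
        pretty_html,
        flags=re.MULTILINE,
    )


# Hand-written stand-in for prettify() output (assumption, not real scraped data)
sample = "<html>\n <body>\n  <p>\n   Hello\n  </p>\n </body>\n</html>\n"
print(reindent(sample))
```

With a real document this would be called as `reindent(soup.prettify())`. One caveat: lines of script or preformatted text that carry their own leading spaces are scaled too, so the result is meant for reading, not for writing back to disk.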