Question

I scraped some raw HTML from random websites. It can be messy, with scripts, self-closing tags, and so on. For example:

ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"

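I want to return the HTML DOM without any strings, attributes, or the like: only the tag structure, as a string that shows the relationships between parents, children, and siblings. This would be my expected output (the square brackets are just a personal choice):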

'[html[head[meta, title], body[h1, p[span]]]]'

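So far I have tried BeautifulSoup (this answer was helpful). I think the work should be split into two steps:
- Extract the tag "skeleton" of the HTML DOM: drop everything before <html> and empty out strings, attributes, and other content.
- Return the bare HTML DOM with tree-like delimiters, such as square brackets, that mark each child and sibling.
I am posting my code as a self-answer.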

Answer 1

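You can use recursion. The name attribute gives the name of a tag. To confirm that an element is a tag rather than text, check whether its type is bs4.element.Tag.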

import bs4

ex="<!DOCTYPE html PUBLIC \\\n><html lang=\\'en-US\\'><head><meta http-equiv=\\'Content-Type\\'/><title>Some text</title></head><body><h1>Some other text</h1><p><span style='color:red'>My</span> first paragraph.</p></body></html>"
soup = bs4.BeautifulSoup(ex, 'html.parser')

def recursive_child_search(tag):
    # Return the tag name, followed by its child tags in square brackets.
    # (Returning the string avoids a global variable named str, which would
    # shadow the built-in str.)
    # Keep only real tags; text nodes, comments and the doctype are skipped.
    child_tag_list = [x for x in tag.children if isinstance(x, bs4.element.Tag)]
    if not child_tag_list:
        return tag.name
    return tag.name + '[' + ', '.join(recursive_child_search(c) for c in child_tag_list) + ']'

# soup.find() returns the first tag in the document, here <html>
result = recursive_child_search(soup.find())
print(result)
#html[head[meta, title], body[h1, p[span]]]
print('[' + result + ']')
#[html[head[meta, title], body[h1, p[span]]]]
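
For comparison, here is a minimal sketch of the same idea without BeautifulSoup, using only html.parser from the standard library. This is not the approach from the answer above: SkeletonParser, render, skeleton, VOID_TAGS, and the sample string are hypothetical names introduced for illustration, and the recovery from unclosed tags is much cruder than what BeautifulSoup does.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SkeletonParser(HTMLParser):
    """Collects tag names into a nested [name, children] tree (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.root = ["", []]      # synthetic root that holds the top-level tags
        self.stack = [self.root]  # elements that are currently open

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:  # void tags cannot contain children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name;
        # end tags that match nothing open are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def render(node):
    """Format one node as name[child, child, ...]."""
    name, children = node
    if not children:
        return name
    return name + "[" + ", ".join(render(c) for c in children) + "]"


def skeleton(html):
    """Parse html and return the bracketed tag structure as a string."""
    parser = SkeletonParser()
    parser.feed(html)
    parser.close()
    return "[" + ", ".join(render(c) for c in parser.root[1]) + "]"


# Assumed sample input, simplified from the question's example.
sample = ("<!DOCTYPE html><html lang='en-US'><head><meta charset='utf-8'/>"
          "<title>Some text</title></head><body><h1>Some other text</h1>"
          "<p><span style='color:red'>My</span> first paragraph.</p></body></html>")
print(skeleton(sample))
# [html[head[meta, title], body[h1, p[span]]]]
```

In this sketch an end tag implicitly closes any elements still open inside it, so "<div><p>a<br>b</div>" renders as [div[p[br]]].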

Answer 2

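I am posting my first solution here. It is still somewhat messy and relies heavily on regular expressions. The first function produces the emptied DOM structure as a raw string; the second function modifies that string to add the delimiters.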

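import re
# Assumed import: the original snippet calls soup(...) without defining it.
from bs4 import BeautifulSoup as soup

def clear_tags(htmlstring, remove_scripts=False):
    # Drop everything before the opening <html> tag (doctype and so on)
    htmlstring = re.sub(r"^.*?(<html)", r"\1", htmlstring, flags=re.DOTALL)
    finishyoursoup = soup(htmlstring, 'html.parser')
    for tag in finishyoursoup.find_all():
        tag.attrs = {}  # strip all attributes
        for sub in tag.contents:
            if sub.string:
                sub.string.replace_with('')  # blank out text content
    if remove_scripts:
        for tag in finishyoursoup.find_all(['script', 'noscript']):
            tag.extract()
    return str(finishyoursoup)

clear_tags(ex)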
# '<html><head><meta/><title></title></head><body><h1></h1><p><span></span></p></b
def flattened_html(htmlstring):
    skeleton = clear_tags(htmlstring)
    step1 = re.sub(r"<([^/]*?)>", r"[\1", skeleton)  # replace the beginning of each tag
    step2 = re.sub(r"</(.*?)>", r"]", step1)         # replace the end of each tag
    step3 = re.sub(r"<(.*?)/>", r"[\1]", step2)      # deal with self-closing tags
    step4 = re.sub(r"\]\[", ", ", step3)             # join sibling tags with a comma
    return step4

flattened_html(ex)
# '[html[head[meta, title], body[h1, p[span]]]]'
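
To make the four substitutions easier to follow, the sketch below applies them to a hard-coded skeleton string and prints every intermediate result. The skeleton value is an assumption: it is what the first function is expected to produce for the example, written out by hand so the trace needs only the re module.

```python
import re

# Assumed output of the first step: tags only, no attributes or text.
skeleton = ("<html><head><meta/><title></title></head>"
            "<body><h1></h1><p><span></span></p></body></html>")

step1 = re.sub(r"<([^/]*?)>", r"[\1", skeleton)  # opening tags: <p>     -> [p
step2 = re.sub(r"</(.*?)>", "]", step1)          # closing tags: </p>    -> ]
step3 = re.sub(r"<(.*?)/>", r"[\1]", step2)      # self-closing: <meta/> -> [meta]
step4 = re.sub(r"\]\[", ", ", step3)             # siblings:     ][      -> ", "

for label, value in [("step1", step1), ("step2", step2),
                     ("step3", step3), ("step4", step4)]:
    print(label, value)
# step1 [html[head<meta/>[title</title></head>[body[h1</h1>[p[span</span></p></body></html>
# step2 [html[head<meta/>[title]][body[h1][p[span]]]]
# step3 [html[head[meta][title]][body[h1][p[span]]]]
# step4 [html[head[meta, title], body[h1, p[span]]]]
```

The first pattern refuses to match any tag that contains a slash, which is how it leaves closing and self-closing tags for the later steps. It also means an attribute value containing a slash (a URL, for instance) would break the match, so the regexes only behave as intended once attributes and text have been cleared.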

Flatten HTML code, with tree-structure delimiters
