Question

Say I have a text file like this:

<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>

How can I get the tags into a list like this:

<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>

Answer 1

I think this is what you want:

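import re

# input_file is assumed to be an already-open file object
html_string = input_file.read()
# Non-greedy match: everything from '<' up to the nearest '>'
matches = re.findall(r'<.*?>', html_string)
for m in matches:
    print(m)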

Hope this helps.
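
For reference, here is a minimal self-contained sketch of the same regex approach that collects the tags into a list instead of printing them. The sample string is the first two lines from the question, and the variable name `tags` is only illustrative. One caveat: `.` does not match a newline by default, so a tag split across two lines would be missed.

```python
import re

# Sample text taken from the question (first two lines only)
html_string = (
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)

# '<.*?>' matches lazily from each '<' to the nearest following '>'
tags = re.findall(r"<.*?>", html_string)

print(tags)
```

This is fine for simple input like the one in the question; for real-world HTML with attributes containing `>` or with comments, a proper parser is likely the safer choice.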

Answer 2

Python has an HTMLParser class (module HTMLParser in Python 2, html.parser in Python 3).

Here is some code that does what you need:

from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print("<%s>" % tag)

    # Called for every closing tag, e.g. </script>
    def handle_endtag(self, tag):
        print("</%s>" % tag)

parser = MyHTMLParser()
parser.feed("""<html><head>Headline<html><head>more words
        </script>even more words</script>
        <html><head>Headline<html><head>more words
        </script>even more words</script>
        """)

Pass your own string to parser.feed.

Output:

$ python htmlparser.py 
<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>
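
Since the question asks for a list rather than printed output, the parser above could be adapted roughly as follows. This is only a sketch: the class name `TagCollector` and its `tags` attribute are hypothetical names introduced here, not part of the standard library. Note that the parser lowercases tag names and that attributes are dropped in this version.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: stores each tag it sees instead of printing it."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is ignored here; only the tag name is kept
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


collector = TagCollector()
collector.feed(
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)
collector.close()

print(collector.tags)
```

After `feed` and `close`, `collector.tags` holds the tags in document order.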

This discussion on SO should also help: Using HTMLParser in Python efficiently
