Say I have a text file like this:
<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>
How can I get the tags into a list like this:
<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>
Answer 0 (score: 6)
I think this is what you want:
import re

# input_file is assumed to be an already-open file object
html_string = input_file.read()
matches = re.findall(r'<.*?>', html_string)  # non-greedy: one match per tag
for m in matches:
    print(m)
Hope this helps.
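As a self-contained sketch of the regex approach above, the snippet below runs the same pattern on the first two lines of the sample from the question and keeps the result as a list. The names `TAG_PATTERN` and `extract_tags` are made up for illustration and are not part of the original answer.

```python
import re

# Same non-greedy pattern as in the answer above. It matches anything
# that looks like <...> and does not check that the markup is valid.
TAG_PATTERN = re.compile(r'<.*?>')


def extract_tags(text):
    """Return every tag-like substring of text, in order of appearance."""
    return TAG_PATTERN.findall(text)


# Sample input copied from the question (first two lines only).
sample = (
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)

tags = extract_tags(sample)
print(tags)
```

Note that this pattern treats any `<...>` span as a tag, including comments, and will not match a tag that is split across lines, because `.` does not match a newline by default.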
Answer 1 (score: 4)
Python has an HTMLParser module (html.parser in Python 3).
Here is some code that does what you want:
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Called for every opening tag, e.g. <html>
    def handle_starttag(self, tag, attrs):
        print("<%s>" % tag)

    # Called for every closing tag, e.g. </script>
    def handle_endtag(self, tag):
        print("</%s>" % tag)

parser = MyHTMLParser()
parser.feed("""<html><head>Headline<html><head>more words
</script>even more words</script>
<html><head>Headline<html><head>more words
</script>even more words</script>
""")
Output:
$ python htmlparser.py
<html>
<head>
<html>
<head>
</script>
</script>
<html>
<head>
<html>
<head>
</script>
</script>
This discussion on SO should also help: Using HTMLParser in Python efficiently
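The parser in this answer prints each tag, while the question asks for a list. A minimal sketch of one way to collect the tags instead is shown below. It is written for Python 3, where the module is `html.parser`, and the names `TagCollector` and `collect_tags` are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: stores tags in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # tag is the lowercased tag name; attributes are ignored here
        self.tags.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.tags.append("</%s>" % tag)


def collect_tags(text):
    """Feed text to a fresh parser and return the tags it saw, in order."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()  # flush any buffered input
    return parser.tags


# Sample input copied from the question (first two lines only).
sample = (
    "<html><head>Headline<html><head>more words\n"
    "</script>even more words</script>\n"
)

tags = collect_tags(sample)
print(tags)
```

Unlike the regex approach, the parser reports tag names only, so attributes and the original letter case are not preserved in the collected strings.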