Question

What is a good way to extract the contents of the body tag from an HTML page without depending on lxml or BeautifulSoup?

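I'm writing an add-on package for Django, and for such a small task I don't want to add another dependency to my plugin. Using one of the libraries I mentioned would be really easy, but apart from that and regular expressions, I can't think of another way.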

Answer 1

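This is very hacky and, I'm sure, thoroughly fragile (it won't cope with a <body> appearing inside the actual <body> tag, and so on), but if you absolutely can't use the libraries mentioned above, maybe something like this?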

In [7]: s = '<html><head>More stuff</head><body>Text inside of the body</body>Random text</html>'

In [8]: s.split('<body>')[1].split('</body>')[0]
Out[8]: 'Text inside of the body'
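One more way the split above can break: a real page often has attributes on the opening tag (<body class="...">), and then the literal '<body>' never matches. A minimal sketch that tolerates attributes, still standard library only. The name extract_body is made up for illustration, and it assumes lowercase tag names and no '>' inside an attribute value:

```python
def extract_body(html):
    """Return the text between the first <body ...> tag and the next </body>.

    Hypothetical helper. Assumes lowercase tag names and no '>' inside
    an attribute value. Returns None when either tag is missing.
    """
    start = html.find('<body')
    if start == -1:
        return None
    # Skip to the end of the opening tag, which may carry attributes.
    open_end = html.find('>', start)
    if open_end == -1:
        return None
    end = html.find('</body>', open_end)
    if end == -1:
        return None
    return html[open_end + 1:end]
```

Like the split version, it stops at the first </body>, so it is no better with nested tags.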

If a <body> tag inside the actual body is a problem, this abomination seems to work:

In [1]: s = '<html><head>More stuff</head><body>Text inside of the body<body>more sample text</body>and then more text and then another<body> and then another </body> and then end</body>Random text</html>'

In [2]: '</body>'.join('<body>'.join(s.split('<body>')[1:]).split('</body>')[:-1])
Out[2]: 'Text inside of the body<body>more sample text</body>and then more text and then another<body> and then another </body> and then end'
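The same "first <body> to last </body>" idea can probably be written a bit more readably with str.partition and str.rpartition. This is only a sketch. The name extract_outer_body is made up for illustration, and it still assumes a bare, lowercase '<body>' with no attributes:

```python
def extract_outer_body(html):
    """Return everything between the first '<body>' and the last '</body>'.

    Hypothetical helper. Assumes a bare '<body>' opening tag.
    Returns None when either tag is missing.
    """
    # Everything after the first opening tag.
    _, opening, rest = html.partition('<body>')
    # Everything before the last closing tag.
    inner, closing, _ = rest.rpartition('</body>')
    if not opening or not closing:
        return None
    return inner
```

Unlike the join/split one-liner, this returns None instead of raising or returning an empty string when a tag is missing.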
