Question

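I have a large text file containing many lines of HTML, generated by a web crawler. Its lines look like the code below. How can I produce a new text file that contains only the "Desired Text" instead of the whole line of HTML code?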

b'<b><a href="example.html" target="_blank">Desired Text 1</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 2</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 3</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 4</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 5</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 6</a></b>'

Answer 1

Take a look at BeautifulSoup; its quick introduction includes a demo of this kind of problem:

Beautiful Soup Quick Intro

[Edit] A detailed solution:

from bs4 import BeautifulSoup

text = """
b'<b><a href="example.html" target="_blank">Desired Text 1</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 2</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 3</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 4</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 5</a></b>'
b'<b><a href="example.html" target="_blank">Desired Text 6</a></b>'
"""

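# Parse the sample with Python's built-in HTML parser backend.
soup = BeautifulSoup(text, 'html.parser')
# get_text() drops the tags and keeps the text nodes. The b'...' wrappers
# in the sample are plain text to the parser, so they remain in the output.
print(soup.get_text())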
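
To get the new file the question asks for, each line of the crawler output can be processed on its own. What follows is only a minimal sketch, not the answerer's method: it swaps in the standard library's `html.parser` in place of BeautifulSoup so that nothing needs to be installed, and it assumes every line is a Python bytes literal (`b'...'`) as in the question. The names `TextExtractor`, `extract_text`, and `convert_file` are hypothetical.

```python
import ast
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of one HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags, e.g. "Desired Text 1".
        self.parts.append(data)


def extract_text(line):
    """Return the visible text of one crawler line such as b'<b>...</b>'."""
    line = line.strip()
    # Assumption: the line is a Python bytes literal, as in the question.
    if line.startswith(("b'", 'b"')):
        try:
            line = ast.literal_eval(line).decode("utf-8", errors="replace")
        except (ValueError, SyntaxError):
            pass  # not a valid literal; fall back to parsing the raw line
    parser = TextExtractor()
    parser.feed(line)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts).strip()


def convert_file(src_path, dst_path):
    """Write the text of every non-empty line in src_path to dst_path."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            text = extract_text(line)
            if text:
                dst.write(text + "\n")
```

A call such as `convert_file("crawl.txt", "desired_text.txt")` (both file names are placeholders) would then write one extracted text per line. Reading the file line by line keeps memory use low even for a very large input.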
