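I'm rewriting a poorly written website that was originally written in PHP.
I'm trying to isolate the text inside a p tag and would like to know how to extract only the text portion. Any ideas?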
<p>
<span lang="EN-IE" xml:lang="EN-IE">
<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2
<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,
<span lang="EN-IE" xml:lang="EN-IE"> TEXT SAMPLE 4
</span> TEXT SAMPLE 5
<span lang="EN-IE" xml:lang="EN-IE">. </span>
</span><span lang="EN-IE" xml:lang="EN-IE">
<br>
<br>
TEXT SAMPLE 6
</span>
<span lang="EN-IE" xml:lang="EN-IE"> </span>
TEXT SAMPLE 7
Answer 0 (score: 0)
BeautifulSoup is a good place to start, in particular its get_text function.
This prints all of the text in the snippet above:
from bs4 import BeautifulSoup
CONTENT = """
<p>
<span lang="EN-IE" xml:lang="EN-IE">
<br>
TEXT SAMPLE 1
<br>
<br>
TEXT SAMPLE 2
<span lang="EN-IE" xml:lang="EN-IE">TEXT SAMPLE 3
</span>,
<span lang="EN-IE" xml:lang="EN-IE"> TEXT SAMPLE 4
</span> TEXT SAMPLE 5
<span lang="EN-IE" xml:lang="EN-IE">. </span>
</span><span lang="EN-IE" xml:lang="EN-IE">
<br>
<br>
TEXT SAMPLE 6
</span>
<span lang="EN-IE" xml:lang="EN-IE"> </span>
TEXT SAMPLE 7
"""
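if __name__ == '__main__':
    # Name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(CONTENT, 'html.parser')
    # get_text() returns the text nodes concatenated, with all tags removed
    print(soup.get_text())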
The output may need some string manipulation, since it contains many newlines, but this will strip the HTML.
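As a rough sketch of that string cleanup, one option is to normalize whitespace line by line and drop the blank lines. The helper name clean_text and the raw sample string below are assumptions for illustration, not part of the original answer; raw only stands in for the kind of string soup.get_text() returns.

```python
def clean_text(raw):
    """Collapse runs of whitespace within each line and drop blank lines."""
    # line.split() with no arguments splits on any whitespace and discards empties
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return [line for line in lines if line]


# Hypothetical stand-in for the text returned by soup.get_text()
raw = "\n\n\nTEXT SAMPLE 1\n\n\nTEXT SAMPLE 2\nTEXT SAMPLE 3\n,\n TEXT SAMPLE 4\n"
print(clean_text(raw))
```

In the script above this would be called as clean_text(soup.get_text()). BeautifulSoup can also do part of this itself: get_text accepts separator and strip arguments, for example soup.get_text(separator='\n', strip=True).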